Expressions with the word search

2.6. Match Whole Words

Problem

Create a regex that matches cat in My cat is brown, but not in category or bobcat. Create another
regex that matches cat in staccato, but not in any of the three
previous subject strings.

Solution

Word boundaries

\bcat\b
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Nonboundaries

\Bcat\B
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Discussion

Word boundaries

The regular expression token \b is called a word boundary. It matches at the start or the end of a word. By itself, it results in a zero-length match. \b is an anchor, just like the tokens introduced in the previous section.

Strictly speaking, \b matches in these three positions:

  • Before the first character in the subject, if the first
    character is a word character

  • After the last character in the subject, if the last
    character is a word character

  • Between two characters in the subject, where one is a word
    character and the other is not a word character
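The three positions can be made visible by listing every index where \b matches. This is only a small sketch using Python's re module; the helper name boundary_positions and the sample subject are assumptions made for illustration.

```python
import re

def boundary_positions(subject):
    # Collect the start index of every zero-length match of \b.
    return [m.start() for m in re.finditer(r"\b", subject)]

# "cat, dog": before c, after t, before d, and after g.
positions = boundary_positions("cat, dog")
print(positions)  # [0, 3, 5, 8]

# A subject without word characters has no word boundaries at all.
print(boundary_positions(", "))  # []
```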

To run a “whole words only” search using a regular expression, simply place the word between two word boundaries, as we did with \bcat\b. The first \b requires the c to occur at the very start of the string, or after a nonword character. The second \b requires the t to occur at the very end of the string, or before a nonword character.
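As a quick check, both regexes from the Solution can be run against the four subject strings from the Problem. This is a minimal sketch in Python; the variable names are chosen here for illustration only.

```python
import re

subjects = ["My cat is brown", "category", "bobcat", "staccato"]

whole_word = re.compile(r"\bcat\b")   # cat as a word of its own
inside_word = re.compile(r"\Bcat\B")  # cat with word characters on both sides

whole_hits = [s for s in subjects if whole_word.search(s)]
inside_hits = [s for s in subjects if inside_word.search(s)]

print(whole_hits)   # ['My cat is brown']
print(inside_hits)  # ['staccato']
```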

Line break characters are nonword characters. \b will match after a line break if the line break is immediately followed by a word character. …
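The line break case can be checked in the same way. In this small Python sketch the two-line subject string is an assumption chosen for illustration.

```python
import re

# The word cat starts right after a line break.
subject = "dog\ncat"

match = re.search(r"\bcat\b", subject)
print(match.span())  # (4, 7)
```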

Symbols | Hits | Examples
sa | all words containing the string sa | sa, vasaku, sahata, tisa
\bsa | all words starting with sa | sa, sahata, sana; NOT vasaku, tisa
\bsa\b | the word sa by itself | sa
\bsa..\b | all words consisting of sa + two letters that follow sa | saka, saku, sana
\bsa\w+ | all words beginning with sa, but not the word sa by itself | sahata, sana
\b.*ana\b | all words ending in ana | sinana, tamuana, sana, bana, maana
(....)\1 | all words with four reduplicated letters | pakupaku, vapakupaku, mahumahun, vamahumahun
\b(....)\1 | all words beginning with four reduplicated letters | pakupaku; NOT vapakupaku
\b(....)\1ana\b | all words beginning with four reduplicated letters and ending in ana | vasuvasuana, hunuhunuana
\bva(....)\1 | all words consisting of the prefix va- + four reduplicated letters | vapakupaku, vagunagunaha
\bvahaa?\b | all tokens of vahaa and vaha | vahaa and vaha
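A few rows of the table can be tried out with Python's re module. This is only a sketch: the word list is assembled from the examples in the table, the helper name hits is hypothetical, and each word is tested on its own, which may differ from how the original search tool applies the patterns to running text.

```python
import re

# Hypothetical word list, assembled from the examples in the table.
words = ["sa", "vasaku", "sahata", "tisa", "saka", "saku", "sana",
         "pakupaku", "vapakupaku"]

def hits(pattern):
    # Return the words in which the pattern matches somewhere.
    regex = re.compile(pattern)
    return [w for w in words if regex.search(w)]

whole_word_sa = hits(r"\bsa\b")       # the word sa by itself
sa_plus_two = hits(r"\bsa..\b")       # sa followed by exactly two characters
reduplicated = hits(r"(....)\1")      # four characters repeated anywhere
redup_at_start = hits(r"\b(....)\1")  # four characters repeated at the start

print(whole_word_sa)   # ['sa']
print(sa_plus_two)     # ['saka', 'saku', 'sana']
print(reduplicated)    # ['pakupaku', 'vapakupaku']
print(redup_at_start)  # ['pakupaku']
```

The backreference \1 only matches a second copy of whatever the group (....) captured, which is why it finds reduplication rather than any eight letters.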

I think the answers given do not fully achieve the behavior the OP wanted. Specifically, none of them returns a boolean. The answers do help illustrate the concept, and I think they are excellent. To show what I mean, here is how I read the OP’s examples.

The string given was,

a = "this is a sample"

The OP then stated,

I want to match whole word — for example match "hi" should return False since "hi" is not a word …

As I understand it, this refers to the search token "hi" as it is found inside the word "this". If someone searches the string a for the word "hi", they should receive False as the response.

The OP continues,

… and "is" should return True since there is no alpha character on the left and on the right side.

In this case, the reference is to the search token "is" as it is found in the word "is". I hope this helps clarify why we use word boundaries. The other answers have the behavior of "don’t return a word unless that word is found by itself, not inside of other words." The word boundary anchor, \b, does this job nicely.

Only the word "is" has been used in examples up to this point. I think that these answers are correct, but there is more of the question’s fundamental meaning that needs to be addressed. The behavior of other search strings should be noted to understand the concept. In other words, we need to generalize the (excellent) answer by @georg using re.search(r"\bis\b", your_string). The same r"\bis\b" concept is also used in the answer by @OmPrakash, who started the generalizing discussion by showing

>>> y="this isis a sample."
>>> regex=re.compile(r"\bis\b")  # For ignore case: re.compile(r"\bis\b", re.IGNORECASE)
>>> regex.findall(y)
[]

Let’s say the method which should exhibit the behavior I’ve discussed is named

find_only_whole_word(search_string, input_string)

The following behavior should then be expected.

>>> a = "this is a sample"
>>> find_only_whole_word("hi", a)
False
>>> find_only_whole_word("is", a)
True

Once again, this is how I understand the OP’s question. We have a step towards that behavior with the answer from @georg, but it’s a little hard to interpret/implement. To wit:

>>> import re
>>> a = "this is a sample"
>>> re.search(r"\bis\b", a)
<_sre.SRE_Match object; span=(5, 7), match='is'>
>>> re.search(r"\bhi\b", a)
>>>

There is no output from the second command, because re.search returns None when nothing matches. The useful answer from @OmPrakash shows output, but not True or False.

Here’s a more complete sampling of the behavior to be expected.

>>> find_only_whole_word("this", a)
True
>>> find_only_whole_word("is", a)
True
>>> find_only_whole_word("a", a)
True
>>> find_only_whole_word("sample", a)
True
# Use "ample", part of the word, "sample": (s)ample
>>> find_only_whole_word("ample", a)
False
# (t)his
>>> find_only_whole_word("his", a)
False
# (sa)mpl(e)
>>> find_only_whole_word("mpl", a)
False
# Any random word
>>> find_only_whole_word("applesauce", a)
False
>>>

This can be accomplished by the following code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
#@file find_only_whole_word.py

import re

def find_only_whole_word(search_string, input_string):
  # Wrap the user's search_string in word boundaries (\b)
  raw_search_string = r"\b" + search_string + r"\b"

  match_output = re.search(raw_search_string, input_string)
  ##As noted by @OmPrakash, if you want to ignore case, uncomment
  ##the next two lines
  #match_output = re.search(raw_search_string, input_string,
  #                         flags=re.IGNORECASE)

  # re.search returns None when no match was found
  return match_output is not None

##endof:  find_only_whole_word(search_string, input_string)

A simple demonstration follows. Run the Python interpreter from the same directory where you saved the file, find_only_whole_word.py.

>>> from find_only_whole_word import find_only_whole_word
>>> a = "this is a sample"
>>> find_only_whole_word("hi", a)
False
>>> find_only_whole_word("is", a)
True
>>> find_only_whole_word("cucumber", a)
False
# The excellent example from @OmPrakash
>>> find_only_whole_word("is", "this isis a sample")
False
>>>
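One thing to keep in mind is that a search string is interpreted as a regular expression, so a character such as the dot acts as a wildcard. The following is a minimal sketch of a variant that treats the search string literally by passing it through re.escape. The function name contains_whole_word and the ignore_case parameter are hypothetical and not part of the answers discussed here.

```python
import re

def contains_whole_word(search_string, input_string, ignore_case=False):
    # re.escape makes regex metacharacters in search_string match literally.
    pattern = r"\b" + re.escape(search_string) + r"\b"
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(pattern, input_string, flags) is not None

a = "this is a sample"
print(contains_whole_word("hi", a))                     # False
print(contains_whole_word("is", a))                     # True
print(contains_whole_word("IS", a, ignore_case=True))   # True
print(contains_whole_word("a.b", "axb"))                # False
print(contains_whole_word("a.b", "see a.b here"))       # True
```

Because \b depends on word characters, this sketch still assumes that the search string begins and ends with a letter, digit, or underscore.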


Some search tools that use boolean operators also have a special operator called “near”. Searching for “term1 near term2” finds all occurrences of term1 and term2 that occur within a certain “distance” from each other. The distance is a number of words. The actual number depends on the search tool, and is often configurable.

You can easily perform the same task with the proper regular expression.

Emulating “near” with a Regular Expression

With regular expressions you can describe almost any text pattern, including a pattern that matches two words near each other. This pattern is relatively simple, consisting of three parts: the first word, a certain number of unspecified words, and the second word. An unspecified word can be matched with the shorthand character class \w+. The spaces and other characters between the words can be matched with \W+ (uppercase W this time).

The complete regular expression becomes \bword1\W+(?:\w+\W+){1,6}?word2\b. The quantifier {1,6}? makes the regex require at least one word between “word1” and “word2”, and allow at most six words.
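The effect of the quantifier can be checked with a short Python sketch. The words quick and fox stand in for word1 and word2 and, like the sample sentences, are assumptions chosen for illustration.

```python
import re

# quick and fox stand in for word1 and word2.
near = re.compile(r"\bquick\W+(?:\w+\W+){1,6}?fox\b")

one_between = near.search("the quick brown fox jumps")
none_between = near.search("the quick fox jumps")
seven_between = near.search("quick a b c d e f g fox")

print(one_between.group())  # quick brown fox
print(none_between)         # None: at least one word must lie in between
print(seven_between)        # None: more than six words lie in between
```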

If the words may also occur in reverse order, we need to specify the opposite pattern as well:

\b(?:word1\W+(?:\w+\W+){1,6}?word2|word2\W+(?:\w+\W+){1,6}?word1)\b

If you want to find any pair of two words out of a list of words, you can use:

\b(word1|word2|word3)(?:\W+\w+){1,6}?\W+(word1|word2|word3)\b

The final regex also finds a word near itself. It will match word2 near word2, for example.
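Both variants can be assembled from ordinary strings. The sketch below is one possible way to do this in Python; the helper names near_either_order and near_any_pair, the max_gap parameter, and the sample sentences are all assumptions made for illustration.

```python
import re

def near_either_order(word1, word2, max_gap=6):
    # Match word1 ... word2 or word2 ... word1 with 1 to max_gap words between.
    w1, w2 = re.escape(word1), re.escape(word2)
    gap = r"\W+(?:\w+\W+){1,%d}?" % max_gap
    return re.compile(r"\b(?:" + w1 + gap + w2 + "|" + w2 + gap + w1 + r")\b")

def near_any_pair(words, max_gap=6):
    # Match any two words from the list with 1 to max_gap words between.
    alternatives = "|".join(re.escape(w) for w in words)
    return re.compile(
        r"\b(%s)(?:\W+\w+){1,%d}?\W+(%s)\b" % (alternatives, max_gap, alternatives)
    )

reverse = near_either_order("quick", "fox").search("a fox is quick")
pair = near_any_pair(["red", "green"]).search("red apples and green pears")
itself = near_any_pair(["green"]).search("green tea and green apples")

print(reverse.group())  # fox is quick
print(pair.groups())    # ('red', 'green')
print(itself.group())   # green tea and green
```

The capturing groups in the last pattern report which two words from the list were actually found.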

