2.6. Match Whole Words
Problem
Create a regex that matches cat in My cat is brown, but not in category or bobcat. Create another regex that matches cat in staccato, but not in any of the three previous subject strings.
Solution
Word boundaries
\bcat\b
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Nonboundaries
\Bcat\B
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
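Both solutions can be checked against the four subject strings from the problem statement. The following is a minimal Python sketch; the helper name matching_subjects is an illustrative assumption, not part of the recipe.

```python
import re

whole_word = re.compile(r"\bcat\b")   # cat as a standalone word
inside_word = re.compile(r"\Bcat\B")  # cat with a word character on both sides

subjects = ["My cat is brown", "category", "bobcat", "staccato"]

def matching_subjects(regex, candidates):
    # Hypothetical helper: keep the candidates in which the regex matches somewhere
    return [s for s in candidates if regex.search(s)]

print(matching_subjects(whole_word, subjects))   # ['My cat is brown']
print(matching_subjects(inside_word, subjects))  # ['staccato']
```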
Discussion
Word boundaries
The regular expression token ‹\b› is called a word boundary. It matches at the start or the end of a word. By itself, it results in a zero-length match. ‹\b› is an anchor, just like the tokens introduced in the previous section.
Strictly speaking, ‹\b› matches in these three positions:
- Before the first character in the subject, if the first character is a word character
- After the last character in the subject, if the last character is a word character
- Between two characters in the subject, where one is a word character and the other is not a word character
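The three positions can be made visible by listing every index at which the zero-length token matches. This is a small Python sketch; the function name boundary_positions is an assumption made for illustration.

```python
import re

def boundary_positions(subject):
    # Each match of \b is zero-length, so its start index is the boundary position
    return [m.start() for m in re.finditer(r"\b", subject)]

print(boundary_positions("My cat"))  # [0, 2, 3, 6]
print(boundary_positions("cat!"))    # [0, 3]
print(boundary_positions("!!"))      # []
```

In "My cat", positions 0 and 6 are the start and end of the subject, while 2 and 3 sit on either side of the space. In "cat!" there is no boundary at the end, because the last character is not a word character.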
To run a “whole words only” search using a regular expression, simply place the word between two word boundaries, as we did with ‹\bcat\b›. The first ‹\b› requires the ‹c› to occur at the very start of the string, or after a nonword character. The second ‹\b› requires the ‹t› to occur at the very end of the string, or before a nonword character.
Line break characters are nonword characters. ‹\b› will match after a line break if the line break is immediately followed by a word character. …
Symbols | Hits | Examples |
---|---|---|
`sa` | all words containing the string sa | sa, vasaku, sahata, tisa |
`\bsa` | all words starting with sa | sa, sahata, sana; NOT vasaku, tisa |
`\bsa\b` | all occurrences of the word sa | sa |
`\bsa..\b` | all words consisting of sa + two letters that follow sa | saka, saku, sana |
`\bsa\w+` | all words beginning with sa, but not the word sa by itself | sahata, sana |
`\b.*ana\b` | all words ending in ana | sinana, tamuana, sana, bana, maana |
`(....)\1` | all words with four reduplicated letters | pakupaku, vapakupaku, mahumahun, vamahumahun |
`\b(....)\1` | all words beginning with four reduplicated letters | pakupaku; NOT vapakupaku |
`\b(....)\1ana\b` | all words beginning with four reduplicated letters and ending in ana | vasuvasuana, hunuhunuana |
`\bva(....)\1` | all words consisting of the prefix va- + four reduplicated letters | vapakupaku, vagunagunaha |
`\bvahaa?\b` | all tokens of vahaa and vaha | vahaa and vaha |
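A few of these patterns can be tried against a short word list in Python. This is only a sketch: the word list reuses examples from the table, and the helper name hits is an assumption.

```python
import re

words = ["sa", "saka", "sahata", "vasaku", "pakupaku", "vapakupaku", "vasuvasuana"]

def hits(pattern, candidates):
    # Hypothetical helper: keep the words in which the pattern matches somewhere
    return [w for w in candidates if re.search(pattern, w)]

print(hits(r"\bsa\b", words))      # ['sa']
print(hits(r"\bsa\w+", words))     # ['saka', 'sahata']
print(hits(r"(....)\1", words))    # ['pakupaku', 'vapakupaku', 'vasuvasuana']
print(hits(r"\b(....)\1", words))  # ['pakupaku', 'vasuvasuana']
```

The backreference \1 repeats whatever the four dots captured, which is what makes the pattern find reduplication. Adding \b in front anchors the reduplicated part to the start of the word, so vapakupaku drops out.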
I think that the behavior desired by the OP was not completely achieved using the answers given. Specifically, the desired output of a boolean was not accomplished. The answers given do help illustrate the concept, and I think they are excellent. Perhaps I can illustrate what I mean by stating that I think that the OP used the examples used because of the following.
The string given was,
a = "this is a sample"
The OP then stated,
I want to match whole word - for example match "hi" should return False since "hi" is not a word …
As I understand, the reference is to the search token "hi" as it is found in the word "this". If someone were to search the string a for the word "hi", they should receive False as the response.
The OP continues,
… and "is" should return True since there is no alpha character on the left and on the right side.
In this case, the reference is to the search token "is" as it is found in the word "is". I hope this helps clarify things as to why we use word boundaries. The other answers have the behavior of "don’t return a word unless that word is found by itself, not inside of other words." The word boundary token, \b, does this job nicely.
Only the word "is" has been used in examples up to this point. I think that these answers are correct, but I think that there is more of the question’s fundamental meaning that needs to be addressed. The behavior of other search strings should be noted to understand the concept. In other words, we need to generalize the (excellent) answer by @georg using re.search(r"\bis\b", your_string)
The same r"\bis\b" concept is also used in the answer by @OmPrakash, who started the generalizing discussion by showing
>>> y="this isis a sample."
>>> regex=re.compile(r"\bis\b") # For ignore case: re.compile(r"\bis\b", re.IGNORECASE)
>>> regex.findall(y)
[]
Let’s say the method which should exhibit the behavior I’ve discussed is named
find_only_whole_word(search_string, input_string)
The following behavior should then be expected.
>>> a = "this is a sample"
>>> find_only_whole_word("hi", a)
False
>>> find_only_whole_word("is", a)
True
Once again, this is how I understand the OP’s question. We have a step towards that behavior with the answer from @georg, but it’s a little hard to interpret/implement. To wit:
>>> import re
>>> a = "this is a sample"
>>> re.search(r"\bis\b", a)
<_sre.SRE_Match object; span=(5, 7), match='is'>
>>> re.search(r"\bhi\b", a)
>>>
There is no output from the second command, because re.search returns None when nothing matches. The useful answer from @OmPrakash shows output, but not True or False.
Here’s a more complete sampling of the behavior to be expected.
>>> find_only_whole_word("this", a)
True
>>> find_only_whole_word("is", a)
True
>>> find_only_whole_word("a", a)
True
>>> find_only_whole_word("sample", a)
True
# Use "ample", part of the word, "sample": (s)ample
>>> find_only_whole_word("ample", a)
False
# (t)his
>>> find_only_whole_word("his", a)
False
# (sa)mpl(e)
>>> find_only_whole_word("mpl", a)
False
# Any random word
>>> find_only_whole_word("applesauce", a)
False
>>>
This can be accomplished by the following code:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
#@file find_only_whole_word.py
import re

def find_only_whole_word(search_string, input_string):
    # Create a raw string with word boundaries from the user's search_string
    raw_search_string = r"\b" + search_string + r"\b"
    match_output = re.search(raw_search_string, input_string)
    ##As noted by @OmPrakash, if you want to ignore case, uncomment
    ##the next two lines
    #match_output = re.search(raw_search_string, input_string,
    #                         flags=re.IGNORECASE)
    no_match_was_found = ( match_output is None )
    if no_match_was_found:
        return False
    else:
        return True
##endof:  find_only_whole_word(search_string, input_string)
A simple demonstration follows. Run the Python interpreter from the same directory where you saved the file, find_only_whole_word.py.
>>> from find_only_whole_word import find_only_whole_word
>>> a = "this is a sample"
>>> find_only_whole_word("hi", a)
False
>>> find_only_whole_word("is", a)
True
>>> find_only_whole_word("cucumber", a)
False
# The excellent example from @OmPrakash
>>> find_only_whole_word("is", "this isis a sample")
False
>>>
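One caveat: the function above builds its pattern by plain concatenation, so regex metacharacters in search_string are interpreted as regex syntax. If the search string is meant to be taken literally, a variant along the following lines may help. This is a sketch under that assumption; the name find_only_whole_word_literal is hypothetical and does not come from the answers discussed here.

```python
import re

def find_only_whole_word_literal(search_string, input_string):
    # re.escape makes characters such as "." match themselves
    # instead of acting as regex metacharacters
    pattern = r"\b" + re.escape(search_string) + r"\b"
    return re.search(pattern, input_string) is not None

print(find_only_whole_word_literal("is", "this is a sample"))  # True
print(find_only_whole_word_literal("hi", "this is a sample"))  # False
print(find_only_whole_word_literal("a.c", "abc"))              # False
print(find_only_whole_word_literal("a.c", "see a.c here"))     # True
```

Without the escaping, the search string "a.c" would also match "abc", because the dot matches any character. Note that \b still depends on word characters, so a search string that begins or ends with a nonword character (for example "c++") needs a different boundary test.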
Some search tools that use boolean operators also have a special operator called “near”. Searching for “term1 near term2” finds all occurrences of term1 and term2 that occur within a certain “distance” from each other. The distance is a number of words. The actual number depends on the search tool, and is often configurable.
You can easily perform the same task with the proper regular expression.
Emulating “near” with a Regular Expression
With regular expressions you can describe almost any text pattern, including a pattern that matches two words near each other. This pattern is relatively simple, consisting of three parts: the first word, a certain number of unspecified words, and the second word. An unspecified word can be matched with the shorthand character class \w+. The spaces and other characters between the words can be matched with \W+ (uppercase W this time).
The complete regular expression becomes \bword1\W+(?:\w+\W+){1,6}?word2\b. The quantifier {1,6}? makes the regex require at least one word between “word1” and “word2”, and allow at most six words.
If the words may also occur in reverse order, we need to specify the opposite pattern as well:
\b(?:word1\W+(?:\w+\W+){1,6}?word2|word2\W+(?:\w+\W+){1,6}?word1)\b
If you want to find any pair of two words out of a list of words, you can use:
\b(word1|word2|word3)(?:\W+\w+){1,6}?\W+(word1|word2|word3)\b
The final regex also finds a word near itself. It will match word2 near word2, for example.
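The either-order pattern can be assembled programmatically. The sketch below is in Python; the function name near_pattern and its parameters min_words and max_words are assumptions chosen for illustration, with defaults that mirror the {1,6} quantifier above.

```python
import re

def near_pattern(word1, word2, min_words=1, max_words=6):
    # Gap between the two words: separators, then min..max intervening words
    gap = r"\W+(?:\w+\W+){" + str(min_words) + "," + str(max_words) + "}?"
    w1, w2 = re.escape(word1), re.escape(word2)
    # Accept the two words in either order
    return re.compile(r"\b(?:" + w1 + gap + w2 + "|" + w2 + gap + w1 + r")\b")

text = "the quick brown fox jumps over the lazy dog"

print(near_pattern("quick", "fox").search(text).group())  # quick brown fox
print(near_pattern("fox", "quick").search(text).group())  # quick brown fox
print(near_pattern("quick", "brown").search(text))        # None
```

The last call finds nothing because the two words are adjacent and the quantifier requires at least one word between them. Passing min_words=0 would allow adjacent words as well.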