End of the word regex

I cannot find a special symbol for the end of a word in Python’s re module…

There is \b (beginning of a word) and \B (the opposite of \b, which matches at any character of the word except the first)… But why is there no symbol for just the end of a word?

Am I missing something?

asked Nov 4, 2012 at 20:03

RaSergiy

Actually, \b is not just for the beginning of a word, but also for the end of a word.

In regex, \b means word boundary. So \b\w+\b is a pattern for a single word.
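
A minimal sketch of that pattern in Python's re module, assuming a made-up sample sentence. The raw string (r"...") keeps the backslashes intact:

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Hello, regex world!"

# \b\w+\b: a word boundary, one or more word characters, a word boundary.
words = re.findall(r"\b\w+\b", text)
print(words)
```

The same \b token anchors both ends of each word, so the punctuation and spaces are left out of the results.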

answered Nov 4, 2012 at 20:09

Ωmega

Sorry, I misread your question. To my knowledge there is nothing that matches the end of a word directly. However, you should be able to use a pattern like (?<=\w)\b, which matches any boundary with part of a word in front of it. You could further extend this with something like (?<=\w{3})\b to only match after words of 3 or more letters.

Note that this does not consume whatever is delimiting the word.

When I am trying to figure out regexes, I find it easiest to go and have a play with a tool like these:
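
A small sketch of the two lookbehind patterns in Python, assuming an invented sample string. Both patterns are zero-length, so the code collects match positions rather than matched text:

```python
import re

# End of any word: a boundary with a word character right before it.
END_OF_WORD = re.compile(r"(?<=\w)\b")
# End of a word that has at least three characters before the boundary.
END_OF_LONG_WORD = re.compile(r"(?<=\w{3})\b")

text = "an owl sat"  # hypothetical input

ends = [m.start() for m in END_OF_WORD.finditer(text)]
long_ends = [m.start() for m in END_OF_LONG_WORD.finditer(text)]

# Because nothing is consumed, sub() only inserts a marker at each position.
marked = END_OF_WORD.sub("|", text)
print(ends, long_ends, marked)
```

Here the two-letter word "an" is reported by the first pattern but skipped by the second.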

  • http://re-try.appspot.com/
  • http://regexpal.com/

answered Nov 4, 2012 at 20:14

Hugoagogo


The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a “word boundary”. This match is zero-length.

There are three different positions that qualify as word boundaries:

  • Before the first character in the string, if the first character is a word character.
  • After the last character in the string, if the last character is a word character.
  • Between two characters in the string, where one is a word character and the other is not a word character.

Simply put: \b allows you to perform a “whole words only” search using a regular expression in the form of \bword\b. A “word character” is a character that can be used to form words. All characters that are not “word characters” are “non-word characters”.

Exactly which characters are word characters depends on the regex flavor you’re working with. In most flavors, characters that are matched by the short-hand character class \w are the characters that are treated as word characters by word boundaries. Java is an exception. Java supports Unicode for \b but not for \w.

Most flavors, except the ones discussed below, have only one metacharacter that matches both before a word and after a word. This is because any position between characters can never be both at the start and at the end of a word. Using only one operator makes things easier for you.

Since digits are considered to be word characters, \b4\b can be used to match a 4 that is not part of a larger number. This regex does not match 44 sheets of a4. So saying “\b matches before and after an alphanumeric sequence” is more exact than saying “before and after a word”.
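
The digit case can be checked with a short Python sketch. The second input string is an invented example, not taken from the text above:

```python
import re

lone_four = re.compile(r"\b4\b")

# Neither "44" nor "a4" contains a 4 with boundaries on both sides.
no_match = lone_four.search("44 sheets of a4")

# Hypothetical input where the 4 stands alone between a space and a comma.
match = lone_four.search("chapter 4, page 12")
print(no_match, match)
```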

\B is the negated version of \b. \B matches at every position where \b does not. Effectively, \B matches at any position between two word characters as well as at any position between two non-word characters.

Looking Inside The Regex Engine

Let’s see what happens when we apply the regex \bis\b to the string This island is beautiful. The engine starts with the first token \b at the first character T. Since this token is zero-length, the position before the character is inspected. \b matches here, because the T is a word character and the character before it is the void before the start of the string. The engine continues with the next token: the literal i. The engine does not advance to the next character in the string, because the previous regex token was zero-length. i does not match T, so the engine retries the first token at the next character position.

\b cannot match at the position between the T and the h. It cannot match between the h and the i either, and neither between the i and the s.

The next character in the string is a space. \b matches here because the space is not a word character, and the preceding character is. Again, the engine continues with the i which does not match with the space.

Advancing a character and restarting with the first regex token, \b matches between the space and the second i in the string. Continuing, the regex engine finds that i matches i and s matches s. Now, the engine tries to match the second \b at the position before the l. This fails because this position is between two word characters. The engine reverts to the start of the regex and advances one character to the s in island. Again, the \b fails to match and continues to do so until the second space is reached. It matches there, but matching the i fails.

But \b matches at the position before the third i in the string. The engine continues, and finds that i matches i and s matches s. The last token in the regex, \b, also matches at the position before the third space in the string because the space is not a word character, and the character before it is.

The engine has successfully matched the word is in our string, skipping the two earlier occurrences of the characters i and s. If we had used the regular expression is, it would have matched the is in This.
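
The outcome of this walkthrough can be reproduced in Python. This sketch only reports where each pattern matches; it does not trace the engine's individual steps:

```python
import re

text = "This island is beautiful"

# With boundaries, only the standalone word "is" qualifies.
whole = re.search(r"\bis\b", text)
# Without boundaries, the first "is" found is the one inside "This".
plain = re.search(r"is", text)

print(whole.span(), plain.span())
```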

Tcl Word Boundaries

Word boundaries, as described above, are supported by most regular expression flavors. Notable exceptions are the POSIX and XML Schema flavors, which don’t support word boundaries at all. Tcl uses a different syntax.

In Tcl, \b matches a backspace character, just like \x08 in most regex flavors (including Tcl’s). \B matches a single backslash character in Tcl, just like \\ in all other regex flavors (and Tcl too).

Tcl uses the letter “y” instead of the letter “b” to match word boundaries. \y matches at any word boundary position, while \Y matches at any position that is not a word boundary. These Tcl regex tokens match exactly the same as \b and \B in Perl-style regex flavors. They don’t discriminate between the start and the end of a word.

Tcl has two more word boundary tokens that do discriminate between the start and end of a word. \m matches only at the start of a word. That is, it matches at any position that has a non-word character to the left of it, and a word character to the right of it. It also matches at the start of the string if the first character in the string is a word character. \M matches only at the end of a word. It matches at any position that has a word character to the left of it, and a non-word character to the right of it. It also matches at the end of the string if the last character in the string is a word character.

The only regex engine that supports Tcl-style word boundaries (besides Tcl itself) is the JGsoft engine. In PowerGREP and EditPad Pro, \b and \B are Perl-style word boundaries, while \y, \Y, \m and \M are Tcl-style word boundaries.

In most situations, the lack of \m and \M tokens is not a problem. \yword\y finds “whole words only” occurrences of “word” just like \mword\M would. \Mword\m could never match anywhere, since \M never matches at a position followed by a word character, and \m never at a position preceded by one. If your regular expression needs to match characters before or after \y, you can easily specify in the regex whether these characters should be word characters or non-word characters. If you want to match any word, \y\w+\y gives the same result as \m.+\M. Using \w instead of the dot automatically restricts the first \y to the start of a word, and the second \y to the end of a word. Note that \y.+\y would not work. This regex matches each word, and also each sequence of non-word characters between the words in your subject string. That said, if your flavor supports \m and \M, the regex engine could apply \m\w+\M slightly faster than \y\w+\y, depending on its internal optimizations.

If your regex flavor supports lookahead and lookbehind, you can use (?<!\w)(?=\w) to emulate Tcl’s \m and (?<=\w)(?!\w) to emulate \M. Though quite a bit more verbose, these lookaround constructs match exactly the same as Tcl’s word boundaries.

If your flavor has lookahead but not lookbehind, and also has Perl-style word boundaries, you can use \b(?=\w) to emulate Tcl’s \m and \b(?!\w) to emulate \M. \b matches at the start or end of a word, and the lookahead checks if the next character is part of a word or not. If it is, we’re at the start of a word. Otherwise, we’re at the end of a word.
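
Python's re module has no start-of-word or end-of-word token, so both emulations can be tried there. A sketch, assuming an invented sample string and illustrative constant names:

```python
import re

# Lookaround-only emulation of start-of-word and end-of-word.
START = re.compile(r"(?<!\w)(?=\w)")
END = re.compile(r"(?<=\w)(?!\w)")

# Emulation built on \b plus a lookahead.
START_B = re.compile(r"\b(?=\w)")
END_B = re.compile(r"\b(?!\w)")

text = "red, green blue"  # hypothetical input


def positions(pattern):
    """Return the offsets of all zero-length matches of pattern in text."""
    return [m.start() for m in pattern.finditer(text)]


starts, ends = positions(START), positions(END)
starts_b, ends_b = positions(START_B), positions(END_B)
print(starts, ends)
```

Both variants report the same offsets on this input, which is what the equivalence described above predicts.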

GNU Word Boundaries

The GNU extensions to POSIX regular expressions add support for the \b and \B word boundaries, as described above. GNU also uses its own syntax for start-of-word and end-of-word boundaries. \< matches at the start of a word, like Tcl’s \m. \> matches at the end of a word, like Tcl’s \M.

Boost also treats \< and \> as word boundaries when using the ECMAScript, extended, egrep, or awk grammar.

POSIX Word Boundaries

The POSIX standard defines [[:<:]] as a start-of-word boundary, and [[:>:]] as an end-of-word boundary. Though the syntax is borrowed from POSIX bracket expressions, these tokens are word boundaries that have nothing to do with and cannot be used inside character classes. Tcl and GNU also support POSIX word boundaries. PCRE supports POSIX word boundaries starting with version 8.34. Boost supports them in all its grammars.

Example

The \b metacharacter

To make it easier to find whole words, we can use the metacharacter \b. It marks the beginning and the end of an alphanumeric sequence*. Also, since it only serves to mark these locations, it actually matches no character on its own.

*: It is common to call an alphanumeric sequence a word, since we can catch its characters with a \w (the word characters class). This can be misleading, though, since \w also includes numbers and, in most flavors, the underscore.

Examples:

Regex | Input | Matches?
\bstack\b | stackoverflow | No, since there’s no occurrence of the whole word stack
\bstack\b | foo stack bar | Yes, since there’s nothing before nor after stack
\bstack\b | stack!overflow | Yes: there’s nothing before stack and ! is not a word character
\bstack | stackoverflow | Yes, since there’s nothing before stack
overflow\b | stackoverflow | Yes, since there’s nothing after overflow
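
The rows of this table can be checked mechanically. A sketch in Python, where each tuple holds the regex, the input, and the expected outcome copied from the table:

```python
import re

rows = [
    (r"\bstack\b", "stackoverflow", False),
    (r"\bstack\b", "foo stack bar", True),
    (r"\bstack\b", "stack!overflow", True),
    (r"\bstack", "stackoverflow", True),
    (r"overflow\b", "stackoverflow", True),
]

# re.search looks for the pattern anywhere in the input.
results = [bool(re.search(pattern, text)) for pattern, text, _ in rows]
expected = [outcome for _, _, outcome in rows]
print(results == expected)
```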

The \B metacharacter

This is the opposite of \b, matching at every position that is not a word boundary. Like \b, since it matches locations, it matches no character on its own. It is useful for finding non-whole words.

Examples:

Regex | Input | Matches?
\Bb\B | abc | Yes, since b is not surrounded by word boundaries.
\Ba\B | abc | No, a has a word boundary on its left side.
a\B | abc | Yes, a does not have a word boundary on its right side.
\B,\B | a,,,b | Yes, it matches the second comma because \B will also match the space between two non-word characters (it should be noted that there is a word boundary to the left of the first comma and to the right of the third).
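
A quick Python check of these rows, including which comma the last pattern picks. The tuples restate the table; the variable names are illustrative only:

```python
import re

rows = [
    (r"\Bb\B", "abc", True),
    (r"\Ba\B", "abc", False),
    (r"a\B", "abc", True),
    (r"\B,\B", "a,,,b", True),
]

results = [bool(re.search(pattern, text)) for pattern, text, _ in rows]

# Index 1 is the first comma, index 2 the second, index 3 the third.
comma = re.search(r"\B,\B", "a,,,b")
print(results, comma.start())
```
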

Symbols | Hits | Examples
sa | all words containing the string sa | sa, vasaku, sahata, tisa
\bsa | all words starting with sa | sa, sahata, sana; NOT vasaku, tisa
\bsa\b | all occurrences of the word sa | sa
\bsa..\b | all words consisting of sa + two letters that follow sa | saka, saku, sana
\bsa\w+ | all words beginning with sa, but not the word sa by itself | sahata, sana
\b.*ana\b | all words ending in ana | sinana, tamuana, sana, bana, maana
(....)\1 | all words with four reduplicated letters | pakupaku, vapakupaku, mahumahun, vamahumahun
\b(....)\1 | all words beginning with four reduplicated letters | pakupaku; NOT vapakupaku
\b(....)\1ana\b | all words beginning with four reduplicated letters and ending in ana | vasuvasuana, hunuhunuana
\bva(....)\1 | all words consisting of the prefix va- + four reduplicated letters | vapakupaku, vagunagunaha
\bvahaa?\b | all tokens of vahaa and vaha | vahaa and vaha
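
The reduplication rows rely on a back-reference: \1 must repeat exactly what the group (....) captured. A sketch in Python, using a few of the example words from the table as a hypothetical word list:

```python
import re

words = ["pakupaku", "vapakupaku", "sana", "vasuvasuana"]

# Four characters repeated anywhere in the word.
anywhere = [w for w in words if re.search(r"(....)\1", w)]
# Four characters repeated right at the start of the word.
initial = [w for w in words if re.search(r"\b(....)\1", w)]
# Reduplication at the start and "ana" at the end.
with_ana = [w for w in words if re.search(r"\b(....)\1ana\b", w)]

print(anywhere, initial, with_ana)
```

The word vapakupaku drops out of the second list because the repeated part does not start at a word boundary.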

This table summarizes the meaning of various strings in different
regexp syntaxes. It is intended as a quick reference, rather than a
tutorial or specification. Please report any errors.

String GNU grep BRE (grep) ERE (egrep) GNU Emacs Perl Python Tcl
. Any character Any character except Any character except \n Any character
[…] Bracket Expression Character Set Character Class Bracket Expression
\(re\) Subexpression Grouping
re\{…\} Match re multiple times Match re multiple times
(re) Subexpression Grouping
re{…} Match re multiple times Match re multiple times
re{…}? Nongreedy {}
\digit Back-reference
^ Start of line
$ End of line
re? re 0 or 1 times
re* re 0 or more times
re+ re one or more times
l|r l or r l or r
*? Non-greedy *
+? Non-greedy +
?? Non-greedy ?
\A Start of string
\b Either end of word Either end of word
\B Not either end of word Not either end of word Synonym for \
\cC Any in category C
\CC Any not in category C
\C Any octet
\d Digit
\D Non-digit
\G At pos()
\m Start of word
\M End of word
\pproperty, \p{property} Unicode property
\Pproperty, \P{property} Not unicode property
\sC Any with syntax C
\SC Any with syntax not C
\s Whitespace
\S Non-whitespace
\w Same as [[:alnum:]] Same as \sw Alphanumeric and _
\W Same as [^[:alnum:]] Same as \Sw Not alphanumeric or _
\X Combining sequence
\y Either end of word
\Y Not either end of word
\Z End of string/last line End of string
\z End of string
\` Start of buffer/string
\' End of buffer/string
\< Start of word Start of word
\> End of word End of word
re\? re 0 or 1
re\+ re 1 or more
l\|r l or r l or r
(?#text) Comment, ignored
(?modifiers) Embedded modifiers
(?modifiers:re) Shy grouping + modifiers
(?:re) Shy grouping
(?:…) Shy grouping
(?=re) Lookahead
(?!re) Negative lookahead
(?<=p) Lookbehind
(?<!o) Negative lookbehind
(?{code}), (??{code}) Embedded Perl
(?>re) Independent expression
(?(cond)re), (?(cond)re|re) Condition expression
(?P<name>re) Symbolic grouping
(?P=name) Symbolic backref

Who Uses What?

BRE refers to POSIX "basic regular expressions" and ERE is POSIX
"extended regular expressions".

APIs

regcomp
uses BREs by default but can
also use EREs. It has a variety of other options which modify the
syntax slightly.

Boost’s regex++
supports a variety of syntaxes.

PCRE is almost the same as
Perl, though it doesn’t support the embedded Perl feature and the
man page lists a number of other differences.

Languages

awk
is supposed to use EREs, plus the
extra C-style escapes \\, \a, \b,
\f, \n, \r, \t, \v with
their usual meanings. sed is supposed to
use BREs, plus \n with its usual meaning.

lex is
also supposed to use EREs with
some extensions: "…" quotes everything inside it
(backslash escapes are recognized); an initial
<state> matches a start condition;
r/x matches r only when followed by
x; and {name} matches the value of a
substitution symbol. A variety of escape sequences, including the
usual C ones, are recognized. Possibly this deserves a new
column.

Tools

grep is supposed
to use BREs, except
that grep -E uses EREs. (GNU grep fits
some extensions in where POSIX leaves the behaviour unspecified).
egrep uses EREs. grep -F doesn’t use regexps at all, of
course.

ed uses
BREs. ex and vi
use BREs but additionally support \<
and \> as described above, and use ~ to match
the replacement part of the previous substitution.

expr
uses BREs with all patterns
implicitly anchored at the start.

The regexp syntax accepted by less depends on
how it is built but PCRE and POSIX EREs are likely outcomes on
modern systems.

Vim has enough differences and extensions
that it perhaps deserves a column (or two) to itself.

Subexpressions, Grouping and Back-References

Subexpressions or groups are surrounded by ( and
), or sometimes \( and \). They serve
two purposes: firstly they override the precedence rules of other
operators, and secondly they "capture" part of the text matched by a
regexp. This can then be used later on in the regexp via the
\digit syntax (this is called a back-reference) or
outside the regexp to extract the appropriate part of a string.

"Shy grouping" has the precedence-overriding feature but not the
capturing feature.

"Symbolic grouping" allows groups to be identified by name rather
than number.
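
The three kinds of group can be compared side by side in Python, whose syntax matches the Python column of the table. A sketch with invented inputs; the group name word is arbitrary:

```python
import re

text = "it was was fine"  # hypothetical input with a doubled word

# Numbered group with a \1 back-reference.
numbered = re.search(r"\b(\w+) \1\b", text)
# Symbolic (named) group with a (?P=name) back-reference.
named = re.search(r"\b(?P<word>\w+) (?P=word)\b", text)
# Shy group: groups "ab" for the + operator but captures nothing.
shy = re.search(r"(?:ab)+(c)", "ababc")

print(numbered.group(1), named.group("word"), shy.groups())
```

Only the (c) group shows up in shy.groups(), since the shy group is not counted as a capture.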

Match Multiple Times

The syntax of this varies a bit; sometimes you use \{ and \},
and sometimes you use { and }. However the idea is the same:

  • RE{N} will match RE exactly N times.
  • RE{N,} will match RE N or more times.
  • RE{N,M} will match RE between N and M times (inclusive).
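
A short Python sketch of the three forms, using the unescaped { and } that Python expects. re.fullmatch is used so that the whole input must fit the count:

```python
import re


def fits(pattern, text):
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


exactly_two = [fits(r"a{2}", "aa"), fits(r"a{2}", "aaa")]
two_or_more = [fits(r"a{2,}", "a"), fits(r"a{2,}", "aaaa")]
two_to_four = [fits(r"a{2,4}", "aaaa"), fits(r"a{2,4}", "aaaaa")]

print(exactly_two, two_or_more, two_to_four)
```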

It is worth noting that the GNU Grep manual says:

   Traditional `egrep' did not support the `{' metacharacter, and some
`egrep' implementations support `\{' instead, so portable scripts
should avoid `{' in `egrep' patterns and should use `[{]' to match a
literal `{'.

Bracket Expressions

This refers to expressions in [square brackets], for which POSIX
defines a complicated syntax all of their own.

Firstly, if the first character after the [ is a
^ (caret) then the sense of the match is reversed.

The rest of the bracket expression consists of a sequence of
elements selected from the following list. The bracket expression
as a whole matches any character (or character sequence) that is
matched by at least one of them (or is matched by none of them, if
an initial ^ was used).

1. Collating symbols. These look like
[.element.], where element is a collating
element (i.e. a symbolic name for a multi-character string), and
match the value of the collating element in the current locale.
This doesn’t seem to work in GNU grep.

2. Equivalence classes. These look like
[=element=], where element is a collating
element. They match any collating element (single or multiple
characters) which has the same primary weight as element,
i.e. if they appear in the same place in the current locale’s
collation sequence. This doesn’t seem to work in GNU grep.

3. Character classes. These look like [:class:],
where class is the name of the character class to match. The
following character classes exist in all locales:

[:alnum:] [:alpha:] [:blank:] [:cntrl:] [:digit:]
[:graph:] [:lower:] [:print:] [:space:] [:upper:]

4. Range expressions. These look like
start-end where start and end
are either single characters or collating symbols. The behaviour is
only specified in the POSIX locale, where they match all the
characters between start and end inclusive.

5. Single characters. These match themselves.

To include a ], put it immediately after the opening
[ or [^; if it occurs later it will close the
bracket expression. The hyphen (-) is not treated as a
range separator if it appears first or last, or as the endpoint of a
range.

Emacs "character sets" are similar to bracket expressions, except
that collating symbols, equivalence classes and character classes
aren’t supported.

Perl "character classes" are also similar. They support POSIX
character class syntax (argh, confusing names!) and recognize, but
don’t support, collating symbols or equivalence classes.

GNU Grep and .

GNU Grep has slightly strange handling of . and
newlines.

Firstly, the manual says that . matches "any single
character". Superficially it appears not to match the newline
character:

$ echo | grep .
$ 

The outcome is actually in keeping with standard and
traditional behaviour for grep, where the newline is not included in
the text to be matched. But that doesn’t appear to be quite what’s
going on with the GNU version, as explicitly searching for a newline
does produce a match:

$ echo | perl -e 'exec("/usr/bin/grep","\n");'

$ 

So is there a newline to match against or not?

The other case to consider is when the -z or
--null-data option is used. In that case, .
definitely does match a newline, exactly as the manual says:

$ perl -e 'print "\n";' | grep -z . | od -tx1
0000000 0a 00
0000002
$ 

Perl Variations

. and newlines

The /s modifier changes the meaning of . to
match any character including \n.
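
Python's re module has a comparable switch, re.DOTALL. The sketch below uses it as a stand-in for Perl's /s, on the assumption that the two behave alike for this simple case:

```python
import re

text = "a\nb"  # a newline sits between the two letters

# By default the dot stops at a newline.
default = re.search(r"a.b", text)
# With re.DOTALL the dot matches the newline as well.
dotall = re.search(r"a.b", text, re.DOTALL)

print(default, dotall)
```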

Anchors

The /m modifier causes ^ and $ to
match at the start and end of any line within the subject string rather than
just the start and end of the subject string.
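
The same idea can be tried in Python with re.MULTILINE, used here as an assumed counterpart to Perl's /m:

```python
import re

text = "one\ntwo"  # two lines, no trailing newline

# Without the flag, ^ and $ only anchor at the ends of the whole string.
whole_string = re.findall(r"^\w+$", text)
# With re.MULTILINE they also anchor at each line break.
per_line = re.findall(r"^\w+$", text, re.MULTILINE)

print(whole_string, per_line)
```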

"Lookbehind" Matching

Perl’s lookbehind matches,
i.e. (?<=p) and
(?<!p) only work for fixed-width
patterns, not arbitrary regular expressions.
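
Python's re module has the same fixed-width restriction, which makes it easy to demonstrate. A sketch; the helper name compiles is made up for this example:

```python
import re


def compiles(pattern):
    """Return True if re can compile the pattern, False if it rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# A lookbehind over a fixed two-character string is accepted.
fixed_width = compiles(r"(?<=ab)c")
# A lookbehind containing a+ has no fixed width and is rejected.
variable_width = compiles(r"(?<=a+)c")

print(fixed_width, variable_width)
```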

Sources

The POSIX regular expression specification can be found at http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html.
For the regexp languages used by particular programs, I looked at
the documentation for GNU Grep
2.4.2; GNU
Emacs 21.2.1; Perl 5.6.1;
Python 2.2.1; Tcl 8.3.3; and less 458.

All errors are my own!
