The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a “word boundary”. This match is zero-length.
There are three different positions that qualify as word boundaries:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
Simply put: \b allows you to perform a “whole words only” search using a regular expression in the form of \bword\b. A “word character” is a character that can be used to form words. All characters that are not “word characters” are “non-word characters”.
Exactly which characters are word characters depends on the regex flavor you’re working with. In most flavors, characters that are matched by the short-hand character class \w are the characters that are treated as word characters by word boundaries. Java is an exception. Java supports Unicode for \b but not for \w.
Most flavors, except the ones discussed below, have only one metacharacter that matches both before a word and after a word. This is because any position between characters can never be both at the start and at the end of a word. Using only one operator makes things easier for you.
Since digits are considered to be word characters, \b4\b can be used to match a 4 that is not part of a larger number. This regex does not match 44 sheets of a4. So saying “\b matches before and after an alphanumeric sequence” is more exact than saying “before and after a word”.
\B is the negated version of \b. \B matches at every position where \b does not. Effectively, \B matches at any position between two word characters as well as at any position between two non-word characters.
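As a quick check of the claims above, here is a minimal sketch in Python, whose standard re module implements Perl-style \b and \B. The string "chapter 4, page 44" is a made-up sample; 44 sheets of a4 comes from the text.

```python
import re

# \b4\b only matches a 4 that stands alone as its own "word".
lone_four = re.compile(r"\b4\b")

no_hits = lone_four.findall("44 sheets of a4")     # every 4 touches a word character
one_hit = lone_four.findall("chapter 4, page 44")  # only the standalone 4 qualifies

# \B matches where \b does not, for example between the a and the 4 of "a4".
inside = re.search(r"a\B4", "44 sheets of a4")

print(no_hits, one_hit, inside.group())
```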
Looking Inside The Regex Engine
Let’s see what happens when we apply the regex \bis\b to the string This island is beautiful. The engine starts with the first token \b at the first character T. Since this token is zero-length, the position before the character is inspected. \b matches here, because the T is a word character and the character before it is the void before the start of the string. The engine continues with the next token: the literal i. The engine does not advance to the next character in the string, because the previous regex token was zero-length. i does not match T, so the engine retries the first token at the next character position.
\b cannot match at the position between the T and the h. It cannot match between the h and the i either, and neither between the i and the s.
The next character in the string is a space. \b matches here because the space is not a word character, and the preceding character is. Again, the engine continues with the i which does not match with the space.
Advancing a character and restarting with the first regex token, \b matches between the space and the second i in the string. Continuing, the regex engine finds that i matches i and s matches s. Now, the engine tries to match the second \b at the position before the l. This fails because this position is between two word characters. The engine reverts to the start of the regex and advances one character to the s in island. Again, the \b fails to match and continues to do so until the second space is reached. It matches there, but matching the i fails.
But \b matches at the position before the third i in the string. The engine continues, and finds that i matches i and s matches s. The last token in the regex, \b, also matches at the position before the third space in the string because the space is not a word character, and the character before it is.
The engine has successfully matched the word is in our string, skipping the two earlier occurrences of the characters i and s. If we had used the regular expression is, it would have matched the is in This.
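The outcome of this walkthrough can be reproduced with a short Python sketch. It only shows where the match ends up, not the engine's internal steps, and it assumes Python's re module behaves like the Perl-style engines described here.

```python
import re

text = "This island is beautiful"

whole_word = re.search(r"\bis\b", text)  # skips the "is" in "This" and in "island"
plain = re.search(r"is", text)           # stops at the first "is" it sees

print(whole_word.span())  # (12, 14): the standalone word "is"
print(plain.span())       # (2, 4): the "is" inside "This"
```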
Tcl Word Boundaries
Word boundaries, as described above, are supported by most regular expression flavors. Notable exceptions are the POSIX and XML Schema flavors, which don’t support word boundaries at all. Tcl uses a different syntax.
In Tcl, \b matches a backspace character, just like \x08 in most regex flavors (including Tcl’s). \B matches a single backslash character in Tcl, just like \\ in all other regex flavors (and Tcl too).
Tcl uses the letter “y” instead of the letter “b” to match word boundaries. \y matches at any word boundary position, while \Y matches at any position that is not a word boundary. These Tcl regex tokens match exactly the same as \b and \B in Perl-style regex flavors. They don’t discriminate between the start and the end of a word.
Tcl has two more word boundary tokens that do discriminate between the start and end of a word. \m matches only at the start of a word. That is, it matches at any position that has a non-word character to the left of it, and a word character to the right of it. It also matches at the start of the string if the first character in the string is a word character. \M matches only at the end of a word. It matches at any position that has a word character to the left of it, and a non-word character to the right of it. It also matches at the end of the string if the last character in the string is a word character.
The only regex engine that supports Tcl-style word boundaries (besides Tcl itself) is the JGsoft engine. In PowerGREP and EditPad Pro, \b and \B are Perl-style word boundaries, while \y, \Y, \m and \M are Tcl-style word boundaries.
In most situations, the lack of \m and \M tokens is not a problem. \yword\y finds “whole words only” occurrences of “word” just like \mword\M would. \Mword\m could never match anywhere, since \M never matches at a position followed by a word character, and \m never at a position preceded by one. If your regular expression needs to match characters before or after \y, you can easily specify in the regex whether these characters should be word characters or non-word characters. If you want to match any word, \y\w+\y gives the same result as \m.+?\M (the quantifier has to be lazy, because a greedy .+ would run on to the end of the last word). Using \w instead of the dot automatically restricts the first \y to the start of a word, and the second \y to the end of a word. Note that \y.+\y would not work. This regex matches each word, and also each sequence of non-word characters between the words in your subject string. That said, if your flavor supports \m and \M, the regex engine could apply \m\w+\M slightly faster than \y\w+\y, depending on its internal optimizations.
If your regex flavor supports lookahead and lookbehind, you can use (?<!\w)(?=\w) to emulate Tcl’s \m and (?<=\w)(?!\w) to emulate \M. Though quite a bit more verbose, these lookaround constructs match exactly the same as Tcl’s word boundaries.
If your flavor has lookahead but not lookbehind, and also has Perl-style word boundaries, you can use \b(?=\w) to emulate Tcl’s \m and \b(?!\w) to emulate \M. \b matches at the start or end of a word, and the lookahead checks if the next character is part of a word or not. If it is, we’re at the start of a word. Otherwise, we’re at the end of a word.
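Both emulations can be tried in Python, which has lookaround and Perl-style \b but no \m or \M. This is a minimal sketch; the helper name positions and the sample string are made up for illustration.

```python
import re

START_OF_WORD = r"(?<!\w)(?=\w)"  # stands in for Tcl's \m
END_OF_WORD = r"(?<=\w)(?!\w)"    # stands in for Tcl's \M

def positions(pattern, text):
    """Return the offsets at which a zero-length pattern matches."""
    return [m.start() for m in re.finditer(pattern, text)]

text = "hi, you"
starts = positions(START_OF_WORD, text)
ends = positions(END_OF_WORD, text)

# The shorter \b-based forms should agree with the lookaround forms.
starts_b = positions(r"\b(?=\w)", text)
ends_b = positions(r"\b(?!\w)", text)

print(starts, ends)
```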
GNU Word Boundaries
The GNU extensions to POSIX regular expressions add support for the \b and \B word boundaries, as described above. GNU also uses its own syntax for start-of-word and end-of-word boundaries. \< matches at the start of a word, like Tcl’s \m. \> matches at the end of a word, like Tcl’s \M.
Boost also treats \< and \> as word boundaries when using the ECMAScript, extended, egrep, or awk grammar.
POSIX Word Boundaries
The POSIX standard defines [[:<:]] as a start-of-word boundary, and [[:>:]] as an end-of-word boundary. Though the syntax is borrowed from POSIX bracket expressions, these tokens are word boundaries that have nothing to do with and cannot be used inside character classes. Tcl and GNU also support POSIX word boundaries. PCRE supports POSIX word boundaries starting with version 8.34. Boost supports them in all its grammars.
Regex Boundaries and Delimiters—Standard and Advanced
Although this page starts with the regex word boundary \b, it aims to go far beyond: it will also introduce less-known boundaries, as well as explain how to make your own (DIY Boundaries).
Jumping Points
For easy navigation, here are some jumping points to various sections of the page:
✽ Boundaries vs. Anchors
✽ Word Boundary: \b
✽ Not-a-word-boundary: \B
✽ Left- and Right-of-Word Boundaries
✽ Making Your Own Boundaries
✽ DIY Boundary Workshop: «real word boundary»
✽ DIY Boundary: between a letter and a digit
✽ Double Negative Delimiter: Character, or Edge of String
Boundaries vs. Anchors
Why are ^ and $ called anchors while \b is called a boundary?
These tokens have one thing in common: they are assertions about the engine’s current position in the string. Therefore, none of them consume characters.
Anchors assert that the current position in the string matches a certain position: the beginning, the end, or in the case of \G the position immediately following the last match.
In contrast, boundaries make assertions about what can be matched to the left and right of the current position.
The distinction is blurry. Typically, you would translate ^ as something like «assert that the current position is the beginning of the string». But if you were in a mood to play with logic, you could say:
Imagine that a string is a space between two walls—one to the left and one to the right. All the positions in the string are within that space. Then we could translate the ^ anchor as:
Assert that immediately to the left of the current position, we can find the left wall, while to the right of the current position we cannot find the left wall.
Yep, in that light, our anchor is a boundary—we look left and right. We’ll keep anchors and boundaries on separate pages because there’s a lot of ground to cover, but just keep that in mind.
Word Boundary: \b
The word boundary \b matches positions where one side is a word character (usually a letter, digit or underscore, but see below for variations across engines) and the other side is not a word character (for instance, it may be the beginning of the string or a space character).
The regex \bcat\b would therefore match cat in a black cat, but it wouldn’t match it in catatonic, tomcat or certificate. Removing one of the boundaries, \bcat would match cat in catfish, and cat\b would match cat in tomcat, but not vice-versa. Both, of course, would match cat on its own.
Word boundaries are useful when you want to match a sequence of letters (or digits) on their own, or to ensure that they occur at the beginning or the end of a sequence of characters.
Be aware, though, that \bcat\b will not match cat in _cat or in cat25 because there is no boundary between an underscore and a letter, nor between a letter and a digit: these all belong to what regex defines as word characters. If you want to create a «real word boundary» (where a word is only allowed to have letters), see the recipe below in the section on DIY boundaries.
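The cases discussed in this section can be checked with a small Python sketch. The helper name hits is made up; the sample strings are the ones used in the text.

```python
import re

samples = ["a black cat", "catatonic", "tomcat", "certificate", "_cat", "cat25"]

def hits(pattern):
    """List the sample strings in which the pattern finds a match."""
    return [s for s in samples if re.search(pattern, s)]

both_sides = hits(r"\bcat\b")  # whole word only
left_only = hits(r"\bcat")     # cat at the start of a word
right_only = hits(r"cat\b")    # cat at the end of a word

print(both_sides, left_only, right_only)
```

Note that _cat and cat25 each pass only one of the single-boundary tests, because the underscore and the digits count as word characters.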
Difference between Engines
As you can see on the regex cheat sheet, \b behaves differently depending on the engine:
✽ In PCRE (PHP, R…) with the Unicode mode turned off, JavaScript and Python 2.7, it matches where only one side is an ASCII letter, digit or underscore.
✽ In PCRE (PHP, R…) with the Unicode mode turned on, .NET, Java, Perl, Python 3 and Ruby, it matches a position where only one side is a Unicode letter, digit or underscore.
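Python 3 makes it easy to see both behaviours side by side, since its re.ASCII flag switches \w and \b back to the ASCII-only definition. A minimal sketch with a made-up sample string:

```python
import re

text = "caf\u00e9 au lait"  # "café" written with a precomposed é (U+00E9)

# Default for str patterns in Python 3: Unicode-aware \w and \b.
unicode_words = re.findall(r"\b\w+\b", text)

# With re.ASCII, é is no longer a word character, so a boundary appears after "caf".
ascii_words = re.findall(r"\b\w+\b", text, re.ASCII)

print(unicode_words, ascii_words)
```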
Not-a-word-boundary: \B
\B matches all positions where \b doesn’t match. Therefore, it matches:
✽ When neither side is a word character, for instance at any position in the string $=(@-%++) (including the beginning and end of the string)
✽ When both sides are a word character, for instance between the H and the i in Hi!
This may not seem very useful, but sometimes \B is just what you want. For instance,
✽ \Bcat\B will find cat fully surrounded by word characters, as in certificate, but neither on its own nor at the beginning or end of words.
✽ cat\B will find cat both in certificate and catfish, but neither in tomcat nor on its own.
✽ \Bcat will find cat both in certificate and tomcat, but neither in catfish nor on its own.
✽ \Bcat|cat\B will find cat in embedded situations, e.g. in certificate, catfish or tomcat, but not on its own.
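The four patterns in this list can be tried against the same words in Python. This is only a sketch; the helper name hits is made up for illustration.

```python
import re

samples = ["cat", "catfish", "tomcat", "certificate"]

def hits(pattern):
    """List the sample words in which the pattern finds a match."""
    return [s for s in samples if re.search(pattern, s)]

embedded = hits(r"\Bcat\B")          # word characters on both sides
not_at_end = hits(r"cat\B")          # a word character must follow
not_at_start = hits(r"\Bcat")        # a word character must precede
any_embedded = hits(r"\Bcat|cat\B")  # at least one side is embedded

print(embedded, not_at_end, not_at_start, any_embedded)
```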
Difference between Engines
In all engines that support it, \B matches positions that are not matched by \b. Since \b behaves differently in various engines, see the \b engine variations a few paragraphs above.
Left- and Right-of-Word Boundaries
The PCRE (PHP, R, …) version 8.34+ and MySQL engines support the POSIX-style bracket syntax for the beginning-of-word boundary [[:<:]] and the end-of-word boundary [[:>:]]. Despite their appearance, these tokens are boundaries, not character classes.
✽ [[:<:]]cat matches cat in the word on its own as well as in catfish, but neither in tomcat nor in certificate.
✽ cat[[:<:]] never matches as a word cannot start in the middle of a word.
✽ cat[[:>:]] matches cat in the word on its own as well as in tomcat, but neither in catfish nor in certificate.
✽ [[:>:]]cat never matches as a word cannot end in the middle of a word.
For MySQL, the definition of a word character is an ASCII letter, digit or underscore—and this set of characters drives the interpretation of these «start of word» and «end of word» boundaries.
PCRE offers these boundaries as a convenience for occasions when someone might want to paste POSIX regex into a PCRE-powered language (or, more likely, switch the regex library used by an old C program), but the engine makes the following substitutions before starting the match:
✽ The start of word boundary [[:<:]] is converted to \b(?=\w)
✽ The end of word boundary [[:>:]] is converted to \b(?<=\w)
Therefore, the «start of word» and «end of word» boundaries derive their meaning from the \b boundary. In non-Unicode mode, it matches a position where only one side is an ASCII letter, digit or underscore. In Unicode mode, it matches a position where only one side is a Unicode letter, digit or underscore.
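Python's re module does not offer the [[:<:]] and [[:>:]] tokens, but the two substitutions listed above can be used directly. A minimal sketch, with constant names chosen for illustration:

```python
import re

START = r"\b(?=\w)"   # the form PCRE substitutes for [[:<:]]
END = r"\b(?<=\w)"    # the form PCRE substitutes for [[:>:]]

samples = ["cat", "catfish", "tomcat", "certificate"]

starts_with_cat = [s for s in samples if re.search(START + "cat", s)]
ends_with_cat = [s for s in samples if re.search("cat" + END, s)]

# A start-of-word boundary placed after cat can never match.
never = [s for s in samples if re.search("cat" + START, s)]

print(starts_with_cat, ends_with_cat, never)
```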
Other Engines
I’ve never yet encountered a situation where I wished I had one of these boundaries. Most likely, if it ever arises, I automatically solve it by using lookarounds. If you ever want to use these specific boundaries in a language that doesn’t support them, one solution among several is to copy the patterns (from two paragraphs above) that PCRE uses to convert the boundaries to regular syntax.
Making Your Own Boundaries
Finding a boundary between a word character and a non-word character is convenient, and we can thank \b for that. But there are many other cases where we could use a boundary for which regex does not provide explicit syntax. For instance, how do you match the position between a letter and a digit? We’ll make this exact boundary further down, but let’s get there at a comfortable pace.
Delimiters
As a first example, let’s look at a line in an email reply:
> and then she told him she wouldn’t settle for less than a Hawaiian pizza, and
Let’s say we want a boundary that finds the position between the > and an ASCII letter.
As a first approach, we could use a lookbehind. Assuming we’re in multi-line mode, where the anchor ^ matches at the beginning of any line, the lookbehind (?<=^> ) asserts that what precedes the current position is the beginning of a line, then a «greater-than» symbol > and a space.
Therefore, something like (?<=^> )\w+ would find the first word of the line. This works, but I would not call (?<=^> ) a boundary. Whereas a boundary asserts that there is a difference between what lies to the left and what lies to the right, our lookbehind only looks in one direction. If we used it on its own, it would match after the space character in > >>>: it doesn’t care about what follows. It is what I would call a delimiter, rather than a boundary.
Delimiters are very useful, and they are a major source of business for regex lookarounds. For instance, .*?(?=END) would match an entire line up to—but not including—the word END: the lookahead (?=END) serves as an ending delimiter. Likewise, (?<=START) serves as a beginning delimiter in (?<=START).*, which matches an entire line after—but not including—the word START.
Further down, we will look at a useful technique: double-negative delimiters.
Boundaries: Look Left and Right
To finish our boundary for the position following the start of an email reply line and preceding a letter, we also need to look to the right. We do that by adding a lookahead after the lookbehind:
(?<=^> )(?=[a-zA-Z])
After asserting that what precedes the current position is a «greater than» and a space, we assert that what follows is a letter. Note that the order of the lookahead and the lookbehind does not matter, as they do not consume any characters: they look to the left and to the right with our feet firmly planted in the same spot in the string. Therefore, the reverse-order boundary
(?=[a-zA-Z])(?<=^> )
works equally well.
After either of these patterns, we can confidently use any regex meta-character—such as the dot—and be sure that it will match a letter: they are true boundaries.
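To see the difference between the delimiter and the full boundary, here is a sketch in Python with re.MULTILINE standing in for multi-line mode. The three-line sample text is made up for illustration.

```python
import re

text = "> and then she told him\n> >>>\nplain line"

DELIMITER = re.compile(r"(?<=^> )", re.MULTILINE)
BOUNDARY = re.compile(r"(?<=^> )(?=[a-zA-Z])", re.MULTILINE)
REVERSED = re.compile(r"(?=[a-zA-Z])(?<=^> )", re.MULTILINE)

# Offsets at which each zero-length pattern matches.
delimiter_hits = [m.start() for m in DELIMITER.finditer(text)]
boundary_hits = [m.start() for m in BOUNDARY.finditer(text)]
reversed_hits = [m.start() for m in REVERSED.finditer(text)]

print(delimiter_hits, boundary_hits, reversed_hits)
```

The delimiter also fires on the second line, where no letter follows, while both orderings of the boundary fire only on the first line.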
Generalizing the idea: home-made word boundary
We can use this technique to construct any boundary we like. The coming sections will show some examples in detail, but to whet our appetite, how would you build a word boundary if your regex engine didn’t support \b?
When it matches on the left of word characters, a word boundary is able to check that what follows is a word character but what precedes is not. In lookaround terms, this is (?=\w)(?<!\w).
When it matches on the right of word characters, a word boundary is able to check that what precedes is a word character but what follows is not. In lookaround terms, this is (?<=\w)(?!\w).
A word boundary must match either of these positions. Grouping them together inside an alternation, our homemade word boundary becomes:
(?:(?=\w)(?<!\w)|(?<=\w)(?!\w))
Yes, \b is a bit shorter.
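A quick way to gain confidence in the homemade version is to compare the positions it matches with those of the built-in \b. The sketch below does this in Python on a made-up sample string; the helper name positions is hypothetical.

```python
import re

HOMEMADE = r"(?:(?=\w)(?<!\w)|(?<=\w)(?!\w))"

def positions(pattern, text):
    """Return the offsets at which a zero-length pattern matches."""
    return [m.start() for m in re.finditer(pattern, text)]

text = "It's a cat_25, ok?"
builtin = positions(r"\b", text)
homemade = positions(HOMEMADE, text)

# The homemade boundary can be dropped in wherever \b would go.
whole_words = re.findall(HOMEMADE + "cat" + HOMEMADE, "cat catfish tomcat")

print(builtin == homemade, whole_words)
```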
DIY Boundary Workshop: «real word boundary»
With some variations depending on the engine, regex usually defines a word character as a letter, digit or underscore. A word boundary \b detects a position where one side is such a character, and the other is not.
In the everyday world, most people would probably say that in the English language, a word character is a letter. Others might allow for hyphens. In some situations, it might therefore be useful to have a «real word boundary» that detects the edge between an ASCII letter and a non-letter. How do we do that?
As a start, with lookarounds you can make a left-side and a right-side boundary:
(?i)(?<=^|[^a-z])cat(?=$|[^a-z])
The left side asserts that what precedes is either the beginning of the string or a character that is a non-letter. The right side asserts that what follows is either the end of the string or a non-letter.
Your next step could be to combine the two to form a boundary that can be popped on either side:
(?i)(?<=^|[^a-z])(?=[a-z])|(?<=[a-z])(?=$|[^a-z])
On the left side of the alternation, we have our earlier left boundary, and we add a lookahead to check that what follows is a letter. On the right side of the alternation, we have our earlier right boundary, and we add a lookbehind to check that what precedes us is a letter.
Needless to say, if you need to paste this wherever you want a «real word boundary», this is a bit heavy. With engines that support pre-defined subroutines—Perl, PCRE (PHP, R, …)—you can define the boundary once and for all, then use it wherever you like by referring to its name:
(?x) # free-spacing mode
(?(DEFINE) # Define some subroutines
  (?<alphaB> # Define "alphaB" boundary
    # This boundary matches when
    # only one side is a letter
    (?i)(?<=^|[^a-z])(?=[a-z])|(?<=[a-z])(?=$|[^a-z])
  ) # End alphaB definition
) # End DEFINE
# The actual regex matching starts here
# We can use our "alphaB" boundary wherever we like
(?&alphaB)cat(?&alphaB)
This would work really well as a component of a large parsing regex.
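Python's re module has neither subroutines nor variable-width lookbehind, so the pattern above cannot be pasted in as is. The sketch below is an adaptation, not the recipe itself: it swaps (?<=^|[^a-z]) for the negative lookbehind (?<![a-z]) and (?=$|[^a-z]) for (?![a-z]), which assert the same thing, and it reuses the boundary through ordinary string concatenation. The name ALPHA_B is made up.

```python
import re

# "Real word boundary": exactly one side of the position is an ASCII letter.
ALPHA_B = r"(?:(?<![a-z])(?=[a-z])|(?<=[a-z])(?![a-z]))"

real_word_cat = re.compile(ALPHA_B + "cat" + ALPHA_B, re.IGNORECASE)

samples = ["_cat", "cat25", "a black Cat", "catfish", "tomcat"]
matched = [s for s in samples if real_word_cat.search(s)]

print(matched)
```

Unlike \bcat\b, this finds cat in _cat and in cat25, because underscores and digits are not letters.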
DIY Boundary: between a letter and a digit
Once we have this recipe, producing boundaries is simple. For instance, with minor tweaks, we can produce a boundary that matches between ASCII letters and digits. I called this pre-defined boundary by the descriptive name A1.
(?x) # free-spacing mode
(?(DEFINE) # Define some subroutines
  (?<A1> # Define "A1" boundary
    # This boundary matches when
    # one side is a letter and
    # the other is a number
    (?i)(?<=^|\d)(?=[a-z])|(?<=[a-z])(?=$|\d)
  ) # End A1 definition
) # End DEFINE
# The actual regex matching starts here
# We can use our "A1" boundary wherever we like
(?&A1)cat(?&A1)
If your engine doesn’t support pre-defined subroutines, you would have to paste this monster in your regex:
(?:(?i)(?<=^|\d)(?=[a-z])|(?<=[a-z])(?=$|\d))
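One practical use of a letter/digit boundary is splitting a string where letters meet digits. The Python sketch below uses a simplified variant of the boundary that leaves out the string edges, so that it stays within Python's fixed-width lookbehind rule. It assumes Python 3.7 or later, where re.split accepts zero-length matches.

```python
import re

# Positions where a letter meets a digit, in either order (string edges left out).
LETTER_DIGIT = re.compile(r"(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])", re.IGNORECASE)

parts = LETTER_DIGIT.split("abc123def45")

print(parts)
```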
Double Negative Delimiter: Character, or Edge of String
In this section I would like to introduce you to a useful family of delimiters that use a fiendish technique: double negative delimiters.
Consider the string 0# 1 #2 #3# 4# #5. In this string, we want to match 0, 3 and 5, i.e. digits where each side is either a hash or one of the edges of the string.
One first thought might be to use a capture group: (?:^|#)(\d)(?:$|#). This exactly performs the task specified in the previous paragraph: first matching either the beginning of the string or a hash, then a digit, then either the end of the string or a hash. The desired digits are captured to Group 1.
To get rid of the capture group, you will probably think of using lookarounds: (?<=^|#)\d(?=$|#). This is nearly exactly the same as the first regex, except that the sides are no longer matched, but just checked with a lookbehind and a lookahead. This works in .NET, PCRE (C, PHP, R, …), Java and Ruby (or Python with the regex module), but not in other engines as traditional lookbehind must have a fixed width (see Lookbehind: Fixed-Width / Constrained Width / Infinite Width).
In Perl, you can get around this problem with (?:^|#\K)\d(?=$|#), where we match the left-side hash (if any) then drop it with the \K. This would also work in PCRE and Ruby.
But here is the solution I would like to introduce you to:
(?<![^#])\d(?![^#])
This is a bit of a brain twister. On the left side, the negative lookbehind (?<![^#]) asserts that what precedes the current position is not one character that is not a hash. Flipping the double negative back to a positive assertion, this says that if there is a character behind us, it must be a hash. What is allowed behind us is therefore either a hash character or «not a character» (the beginning of the string).
Why the double negative? Isn’t that the same as the positive lookbehind (?<=#)? Well, no: this positive lookbehind requires a hash character—whereas we also want to allow the absence of any character on the left.
The negative lookahead at the end of the pattern follows the same principle: (?![^#]) asserts that what follows is not a character that is not a hash, i.e., if it is a character, it must be a hash.
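The double-negative pattern works unchanged in Python's re module, since each lookbehind is one character wide. A minimal sketch on the string from this section, with the positive lookarounds added for comparison:

```python
import re

text = "0# 1 #2 #3# 4# #5"

HASH_OR_EDGE = re.compile(r"(?<![^#])\d(?![^#])")
digits = HASH_OR_EDGE.findall(text)

# For comparison: positive lookarounds demand a hash and miss the string edges.
strict = re.findall(r"(?<=#)\d(?=#)", text)

print(digits, strict)
```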
Limitation
This technique works for single-line strings. As soon as you move to multiple lines, 0# no longer matches at the beginning of lines 2 and beyond. That is because there is a character before the 0: the \n, and it is not a hash. Likewise, #5 no longer matches at the end of any line but the last, because there is now a line break character (not a hash) after the 5.
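The limitation is easy to reproduce. The Python sketch below also tries one possible workaround, which is an assumption of this example rather than part of the technique as described: adding \n to the negated class so that a line break is tolerated on either side.

```python
import re

text = "0# #5\n0# #5"  # two identical lines

single_line = re.findall(r"(?<![^#])\d(?![^#])", text)

# Assumed workaround: treat a newline like a string edge.
line_aware = re.findall(r"(?<![^#\n])\d(?![^#\n])", text)

print(single_line, line_aware)
```

With the original pattern, only the first 0 and the last 5 are found; the digits next to the line break are missed.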
Extension
To get your eyes accustomed to the technique, let’s apply it to other tasks.
To match A, B or E in A0 1B1 2C D3 4E, i.e. capital letters that have either a digit or a string-end on each side, you can use this pattern:
(?<!\D)[A-Z](?!\D)
To match A, C or F in A -B- C -D -E F, i.e. capital letters that have either a space or a string-end on each side, you can use this pattern:
(?<!\S)[A-Z](?!\S)
Finally, an unlikely example: to match the tilde, hash or colon in ~A ? 2! _#4 @5 6:, i.e. special characters that have either a word character or a string-end on each side, you can use this pattern:
(?<!\W)[~#:@?!](?!\W)
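All three patterns can be run as they stand in Python. A short sketch using the strings from the examples above:

```python
import re

# Capital letters with a digit or a string edge on each side.
digit_or_edge = re.findall(r"(?<!\D)[A-Z](?!\D)", "A0 1B1 2C D3 4E")

# Capital letters with whitespace or a string edge on each side.
space_or_edge = re.findall(r"(?<!\S)[A-Z](?!\S)", "A -B- C -D -E F")

# Special characters with a word character or a string edge on each side.
word_or_edge = re.findall(r"(?<!\W)[~#:@?!](?!\W)", "~A ? 2! _#4 @5 6:")

print(digit_or_edge, space_or_edge, word_or_edge)
```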
This table summarizes the meaning of various strings in different
regexp syntaxes. It is intended as a quick reference, rather than a
tutorial or specification. Please report any errors.
String | GNU grep | BRE (grep) | ERE (egrep) | GNU Emacs | Perl | Python | Tcl |
---|---|---|---|---|---|---|---|
. | Any character | Any character except | Any character except \n | Any character | |||
[…] | Bracket Expression | Character Set | Character Class | Bracket Expression | |||
\(re\) | Subexpression | Grouping | |||||
re\{…\} | Match re multiple times | Match re multiple times | |||||
(re) | Subexpression | Grouping | |||||
re{…} | Match re multiple times | Match re multiple times | |||||
re{…}? | Nongreedy {} | ||||||
\digit | Back-reference | ||||||
^ | Start of line | ||||||
$ | End of line | ||||||
re? | re 0 or 1 times | ||||||
re* | re 0 or more times | ||||||
re+ | re one or more times | ||||||
l|r | l or r | l or r | |||||
*? | Non-greedy * | ||||||
+? | Non-greedy + | ||||||
?? | Non-greedy ? | ||||||
\A | Start of string | ||||||
\b | Either end of word | Either end of word | |||||
\B | Not either end of word | Not either end of word | Synonym for \ | ||||
\cC | Any in category C | ||||||
\CC | Any not in category C | ||||||
\C | Any octet | ||||||
\d | Digit | ||||||
\D | Non-digit | ||||||
\G | At pos() | ||||||
\m | Start of word | ||||||
\M | End of word | ||||||
\pproperty \p{property} | Unicode property | ||||||
\Pproperty \P{property} | Not unicode property | ||||||
\sC | Any with syntax C | ||||||
\SC | Any with syntax not C | ||||||
\s | Whitespace | ||||||
\S | Non-whitespace | ||||||
\w | Same as [[:alnum:]] | Same as \sw | Alphanumeric and _ | ||||
\W | Same as [^[:alnum:]] | Same as \Sw | Not alphanumeric or _ | ||||
\X | Combining sequence | ||||||
\y | Either end of word | ||||||
\Y | Not either end of word | ||||||
\Z | End of string/last line | End of string | |||||
\z | End of string | ||||||
\` | Start of buffer/string | ||||||
\' | End of buffer/string | ||||||
\< | Start of word | Start of word | |||||
\> | End of word | End of word | |||||
re\? | re 0 or 1 | ||||||
re\+ | re 1 or more | ||||||
l\|r | l or r | l or r | |||||
(?#text) | Comment, ignored | ||||||
(?modifiers) | Embedded modifiers | ||||||
(?modifiers:re) | Shy grouping + modifiers | ||||||
(?:re) | Shy grouping | ||||||
\(?:…\) | Shy grouping | ||||||
(?=re) | Lookahead | ||||||
(?!re) | Negative lookahead | ||||||
(?<=p) | Lookbehind | ||||||
(?<!p) | Negative lookbehind | ||||||
(?{code}) (??{code}) | Embedded Perl | ||||||
(?>re) | Independent expression | ||||||
(?(cond)re) (?(cond)re|re) | Condition expression | ||||||
(?P<name>re) | Symbolic grouping | ||||||
(?P=name) | Symbolic backref | ||||||
Who Uses What?
BRE refers to POSIX «basic regular expressions» and ERE is POSIX
«extended regular expressions».
APIs
regcomp
uses BREs by default but can
also use EREs. It has a variety of other options which modify the
syntax slightly.
Boost’s regex++
supports a variety of syntaxes.
PCRE is almost the same as
Perl, though it doesn’t support the embedded Perl feature and the
man page lists a number of other differences.
Languages
awk
is supposed to use EREs, plus the
extra C-style escapes \\, \a, \b,
\f, \n, \r, \t, \v with
their usual meanings. sed is supposed to
use BREs, plus \n with its usual meaning.
lex is
also supposed to use EREs with
some extensions: «…» quotes everything inside it
(backslash escapes are recognized); an initial
<state> matches a start condition;
r/x matches r only when followed by
x; and {name} matches the value of a
substitution symbol. A variety of escape sequences, including the
usual C ones, are recognized. Possibly this deserves a new
column.
Tools
grep is supposed
to use BREs, except
that grep -E uses EREs. (GNU grep fits
some extensions in where POSIX leaves the behaviour unspecified).
egrep uses EREs. grep -F doesn’t use regexps at all, of
course.
ed uses
BREs. ex and vi
use BREs but additionally support \<
and \> as described above, and use ~ to match
the replacement part of the previous substitution.
expr
uses BREs with all patterns
implicitly anchored at the start.
The regexp syntax accepted by less depends on
how it is built but PCRE and POSIX EREs are likely outcomes on
modern systems.
Vim has enough differences and extensions
that it perhaps deserves a column (or two) to itself.
Subexpressions, Grouping and Back-References
Subexpressions or groups are surrounded by ( and
), or sometimes \( and \). They serve
two purposes: firstly they override the precedence rules of other
operators, and secondly they «capture» part of the text matched by a
regexp. This can then be used later on in the regexp via the
\digit syntax (this is called a back-reference) or
outside the regexp to extract the appropriate part of a string.
«Shy grouping» has the precedence-overriding feature but not the
capturing feature.
«Symbolic grouping» allows groups to be identified by name rather
than number.
Match Multiple Times
The syntax of this varies a bit; sometimes you use { and },
and sometimes you use \{ and \}. However the idea is the same:
- RE{N} will match RE exactly N times.
- RE{N,} will match RE N or more times.
- RE{N,M} will match RE between N and M times (inclusive).
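The three forms can be tried in Python, which uses the unescaped { and } syntax. A minimal sketch; the helper name full is made up for illustration.

```python
import re

exactly_two = re.compile(r"a{2}")     # RE{N}
two_or_more = re.compile(r"a{2,}")    # RE{N,}
two_to_three = re.compile(r"a{2,3}")  # RE{N,M}

def full(pattern, text):
    """True if the whole of text is matched by pattern."""
    return pattern.fullmatch(text) is not None

print(full(exactly_two, "aa"), full(exactly_two, "aaa"))
print(full(two_or_more, "a"), full(two_or_more, "aaaa"))
print(full(two_to_three, "aaa"), full(two_to_three, "aaaa"))
```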
It is worth noting that the GNU Grep manual says:
Traditional `egrep' did not support the `{' metacharacter, and some `egrep' implementations support `\{' instead, so portable scripts should avoid `{' in `egrep' patterns and should use `[{]' to match a literal `{'.
Bracket Expressions
This refers to expressions in [square brackets], for which POSIX
defines a complicated syntax all of their own.
Firstly, if the first character after the [ is a
^ (caret) then the sense of the match is reversed.
The rest of the bracket expression consists of a sequence of
elements selected from the following list. The bracket expression
as a whole matches any character (or character sequence) that is
matched by at least one of them (or is matched by none of them, if
an initial ^ was used).
1. Collating symbols. These look like
[.element.], where element is a collating
element (i.e. a symbolic name for a multi-character string), and
match the value of the collating element in the current locale.
This doesn’t seem to work in GNU grep.
2. Equivalence classes. These look like
[=element=], where element is a collating
element. They match any collating element (single or multiple
characters) which has the same primary weight as element,
i.e. if they appear in the same place in the current locale’s
collation sequence. This doesn’t seem to work in GNU grep.
3. Character classes. These look like [:class:],
where class is the name of the character class to match. The
following character classes exist in all locales:
[:alnum:] [:alpha:] [:blank:] [:cntrl:] [:digit:] [:graph:] [:lower:] [:print:] [:space:] [:upper:]
4. Range expressions. These look like
start-end where start and end
are either single characters or collating symbols. The behaviour is
only specified in the POSIX locale, where they match all the
characters between start and end inclusive.
5. Single characters. These match themselves.
To include a ], put it immediately after the opening
[ or [^; if it occurs later it will close the
bracket expression. The hyphen (-) is not treated as a
range separator if it appears first or last, or as the endpoint of a
range.
Emacs «character sets» are similar to bracket expressions, except
that collating symbols, equivalence classes and character classes
aren’t supported.
Perl «character classes» are also similar. They support POSIX
character class syntax (argh, confusing names!) and recognize, but
don’t support, collating symbols or equivalence classes.
GNU Grep and .
GNU Grep has slightly strange handling of . and
newlines.
Firstly, the manual says that . matches «any single
character». Superficially it appears not to match the newline
character:
$ echo | grep .
$
The outcome is actually in keeping with standard and
traditional behaviour for grep, where the newline is not included in
the text to be matched. But that doesn’t appear to be quite what’s
going on with the GNU version, as explicitly searching for a newline
does produce a match:
$ echo | perl -e 'exec("/usr/bin/grep","\n");'

$
So is there a newline to match against or not?
The other case to consider is when the -z or
--null-data option is used. In that case, .
definitely does match a newline, exactly as the manual says:
$ perl -e 'print "\n";' | grep -z . | od -tx1
0000000 0a 00
0000002
$
Perl Variations
. and newlines
The /s modifier changes the meaning of . to
match any character including \n.
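The same switch exists in other Perl-style engines. As a minimal sketch, assuming Java's java.util.regex as a stand-in for Perl, the Pattern.DOTALL flag (or the inline form (?s)) plays the role of /s. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class DotAllSketch {
    // Tests whether "a.b" finds a match in "a\nb", with or without DOTALL.
    static boolean dotMatchesNewline(boolean dotAll) {
        int flags = dotAll ? Pattern.DOTALL : 0;
        return Pattern.compile("a.b", flags).matcher("a\nb").find();
    }

    public static void main(String[] args) {
        System.out.println(dotMatchesNewline(false)); // . skips the newline by default
        System.out.println(dotMatchesNewline(true));  // DOTALL lets . match it
    }
}
```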
Anchors
The /m modifier causes ^ and $ to
match at the start and end of any line within the subject string rather than
just the start and end of the subject string.
«Lookbehind» Matching
Perl’s lookbehind matches,
i.e. (?<=p) and
(?<!p) only work for fixed-width
patterns, not arbitrary regular expressions.
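The lookbehind syntax is shared by other Perl-style engines, so a small Java sketch can illustrate it (Java is an assumption here, and the class and method names are hypothetical). Both lookbehinds below are fixed-width, which keeps them within the restriction described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindSketch {
    // Positive lookbehind: digits that are immediately preceded by "USD ".
    static String amountAfterUsd(String s) {
        Matcher m = Pattern.compile("(?<=USD )\\d+").matcher(s);
        return m.find() ? m.group() : null;
    }

    // Negative lookbehind: a run of digits not preceded by '#' or by another digit.
    static String firstNumberNotAfterHash(String s) {
        Matcher m = Pattern.compile("(?<![#\\d])\\d+").matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(amountAfterUsd("Total: USD 250"));
        System.out.println(firstNumberNotAfterHash("#12 34"));
    }
}
```

The lookbehind itself consumes no characters, so only the digits end up in the match.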
Sources
The POSIX regular expression specification can be found at http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html.
For the regexp languages used by particular programs, I looked at
the documentation for GNU Grep
2.4.2; GNU
Emacs 21.2.1; Perl 5.6.1;
Python 2.2.1; Tcl 8.3.3; and less 458.
All errors are my own!
RJK
In regex, the anchors have zero width. They are not used for matching characters. Rather they match a position i.e. before, after, or between characters.
To match the start or the end of a line, we use the following anchors:
- Caret (^) matches the position before the first character in the string.
- Dollar ($) matches the position right after the last character in the string.
Regex | String | Matches |
---|---|---|
^a | abc | Matches a |
c$ | abc | Matches c |
^[a-zA-Z]+$ | abc | Matches abc |
^[abc]$ | abc | Does not match. The pattern matches only a one-character string: a, b, or c. |
[^abc] | abc | Does not match. The string must contain at least one character other than a, b, or c. |
^[mts][aeiou] | mother | Matches. The string must start with m, t, or s, immediately followed by a vowel. |
[^n]g$ | king, ng | Does not match. The string must end with g preceded by a character other than n. |
[^k]g$ | kong | Matches. |
^g.+g$ | gang | Matches. The string must start and end with g, with one or more characters in between. |
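A few of these rows can be checked with a short Java sketch. The helper name found is hypothetical; it simply wraps Matcher.find(), which reports whether the pattern matches anywhere in the input.

```java
import java.util.regex.Pattern;

class AnchorTableSketch {
    // True if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("^a", "abc"));
        System.out.println(found("c$", "abc"));
        System.out.println(found("^[mts][aeiou]", "mother"));
        System.out.println(found("[^n]g$", "king")); // the final g follows an n
        System.out.println(found("[^k]g$", "kong"));
        System.out.println(found("^g.+g$", "gang"));
    }
}
```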
2. Regex to Match Start of Line
"^<insertPatternHere>"
- The caret ^ matches the position before the first character in the string.
- Applying ^h to howtodoinjava matches h.
- Applying ^t to howtodoinjava does not match anything because it expects the string to start with t.
- If we have a multi-line string, by default the caret matches the position before the very first character in the whole string. To match the position before the first character of any line, we must enable the multi-line mode in the regular expression. In this case, the caret changes from matching only at the start of the entire string to matching at the start of any line within the string.
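The effect of multi-line mode on the caret can be sketched as follows. The class name, method name, and sample text are made up for illustration; Pattern.MULTILINE is the standard Java flag for this mode.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineCaretSketch {
    // Counts how many times the regex matches in the text under the given flags.
    static int countMatches(String regex, int flags, String text) {
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "1st line\n2nd line\nthird line";
        // Default mode: ^ matches only at the start of the whole string.
        System.out.println(countMatches("^\\d", 0, text));
        // Multi-line mode: ^ also matches right after each line break.
        System.out.println(countMatches("^\\d", Pattern.MULTILINE, text));
    }
}
```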
Description | Matching Pattern |
---|---|
The line starts with a number | “^\d” or “^[0-9]” |
The line starts with a character | “^[a-z]” or “^[A-Z]” |
The line starts with a character (case-insensitive) | “^[a-zA-Z]” |
The line starts with a word | “^word” |
The line starts with a special character | “^[!@#\$%\^\&*\)\(+=._-]” |
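import java.util.regex.Pattern;

// Backslashes are doubled because each regex is written as a Java string literal.
System.out.println(Pattern.compile("^[0-9]").matcher("1stKnight").find());
System.out.println(Pattern.compile("^[a-zA-Z]").matcher("FirstKnight").find());
System.out.println(Pattern.compile("^First").matcher("FirstKnight").find());
System.out.println(Pattern.compile("^[!@#\\$%\\^\\&*\\)\\(+=._-]").matcher("*1stKnight").find());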
Program output.
true
true
true
true
3. Regex to Match End of Line
"<insertPatternHere>$"
- The dollar $ matches the position after the last character in the string.
- Applying a$ to howtodoinjava matches a.
- Applying v$ to howtodoinjava does not match anything because it expects the string to end with v.
- If we have a multi-line string, by default the dollar matches the position after the very last character in the whole string. To match the position after the last character of any line, we must enable the multi-line mode in the regular expression. In this case, the dollar changes from matching only at the end of the entire string to matching at the end of any line within the string.
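As a small sketch of the dollar sign in both modes, the hypothetical helper below collects the last word of the whole string, or of every line when Pattern.MULTILINE is enabled. The sample text is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineDollarSketch {
    // Collects every match of \w+$ in the text, optionally in multi-line mode.
    static List<String> lastWords(String text, boolean multiline) {
        int flags = multiline ? Pattern.MULTILINE : 0;
        Matcher m = Pattern.compile("\\w+$", flags).matcher(text);
        List<String> words = new ArrayList<>();
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "red fox\nblue sky\ngreen tea";
        System.out.println(lastWords(text, false)); // only the end of the string
        System.out.println(lastWords(text, true));  // the end of every line
    }
}
```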
Description | Matching Pattern |
---|---|
The line ends with a number | “\d$” or “[0-9]$” |
The line ends with a character | “[a-z]$” or “[A-Z]$” |
The line ends with a character (case-insensitive) | “[a-zA-Z]$” |
The line ends with a word | “word$” |
The line ends with a special character | “[!@#\$%\^\&*\)\(+=._-]$” |
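import java.util.regex.Pattern;

// Backslashes are doubled because each regex is written as a Java string literal.
System.out.println(Pattern.compile("[0-9]$").matcher("FirstKnight123").find());
System.out.println(Pattern.compile("[a-zA-Z]$").matcher("FirstKnight").find());
System.out.println(Pattern.compile("Knight$").matcher("FirstKnight").find());
System.out.println(Pattern.compile("[!@#\\$%\\^\\&*\\)\\(+=._-]$")
    .matcher("FirstKnight&").find());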
Program output.
true
true
true
true
2.6. Match Whole Words
Problem
Create a regex that matches cat in My cat is brown, but not in category or bobcat. Create another regex that matches cat in staccato, but not in any of the three previous subject strings.
Solution
Word boundaries
\bcat\b
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Nonboundaries
\Bcat\B
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Discussion
Word boundaries
The regular expression token ‹\b› is called a word boundary. It matches at the start or the end of a word. By itself, it results in a zero-length match. ‹\b› is an anchor, just like the tokens introduced in the previous section.
Strictly speaking, ‹\b› matches in these three positions:
- Before the first character in the subject, if the first character is a word character
- After the last character in the subject, if the last character is a word character
- Between two characters in the subject, where one is a word character and the other is not a word character
To run a “whole words only” search using a regular expression, simply place the word between two word boundaries, as we did with ‹\bcat\b›. The first ‹\b› requires the ‹c› to occur at the very start of the string, or after a nonword character. The second ‹\b› requires the ‹t› to occur at the very end of the string, or before a nonword character.
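The two solutions can be tried against the four subject strings from the problem statement. This is a minimal Java sketch; the class and method names are hypothetical, and the backslashes are doubled only because the regexes are written as Java string literals.

```java
import java.util.regex.Pattern;

class WholeWordSketch {
    // \bcat\b: "cat" with a word boundary on both sides.
    static boolean hasWholeWordCat(String s) {
        return Pattern.compile("\\bcat\\b").matcher(s).find();
    }

    // \Bcat\B: "cat" with a word character on both sides.
    static boolean hasInnerCat(String s) {
        return Pattern.compile("\\Bcat\\B").matcher(s).find();
    }

    public static void main(String[] args) {
        String[] subjects = {"My cat is brown", "category", "bobcat", "staccato"};
        for (String s : subjects) {
            System.out.println(s + ": whole word = " + hasWholeWordCat(s)
                    + ", inner = " + hasInnerCat(s));
        }
    }
}
```

Only My cat is brown satisfies the first regex, and only staccato satisfies the second.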
Line break characters are nonword characters. ‹\b› will match after a line break if the line break is immediately followed by a word character. …