The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a “word boundary”. This match is zero-length.
There are three different positions that qualify as word boundaries:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
Simply put: \b allows you to perform a “whole words only” search using a regular expression in the form of \bword\b. A “word character” is a character that can be used to form words. All characters that are not “word characters” are “non-word characters”.
Exactly which characters are word characters depends on the regex flavor you’re working with. In most flavors, characters that are matched by the short-hand character class \w are the characters that are treated as word characters by word boundaries. Java is an exception. Java supports Unicode for \b but not for \w.
Most flavors, except the ones discussed below, have only one metacharacter that matches both before a word and after a word. This is because any position between characters can never be both at the start and at the end of a word. Using only one operator makes things easier for you.
Since digits are considered to be word characters, \b4\b can be used to match a 4 that is not part of a larger number. This regex does not match 44 sheets of a4. So saying “\b matches before and after an alphanumeric sequence” is more exact than saying “before and after a word”.
\B is the negated version of \b. \B matches at every position where \b does not. Effectively, \B matches at any position between two word characters as well as at any position between two non-word characters.
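As a quick check of the digit example, here is a minimal sketch using Python's re module (the choice of Python and the second sample sentence are assumptions made for illustration). The word boundary token is written \b, and a raw string keeps the backslashes intact.

```python
import re

# A 4 with a word boundary on each side.
lone_four = re.compile(r"\b4\b")

# Every 4 here touches another word character, so there is no match.
no_match = lone_four.search("44 sheets of a4")

# Hypothetical sample sentence with a standalone 4.
match = lone_four.search("print 4 copies")

print(no_match)
print(match.group(), match.start())
```

The first search returns None, while the second finds the standalone 4.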
Looking Inside The Regex Engine
Let’s see what happens when we apply the regex \bis\b to the string This island is beautiful. The engine starts with the first token \b at the first character T. Since this token is zero-length, the position before the character is inspected. \b matches here, because the T is a word character and the character before it is the void before the start of the string. The engine continues with the next token: the literal i. The engine does not advance to the next character in the string, because the previous regex token was zero-length. i does not match T, so the engine retries the first token at the next character position.
\b cannot match at the position between the T and the h. It cannot match between the h and the i either, and neither between the i and the s.
The next character in the string is a space. \b matches here because the space is not a word character, and the preceding character is. Again, the engine continues with the i which does not match with the space.
Advancing a character and restarting with the first regex token, \b matches between the space and the second i in the string. Continuing, the regex engine finds that i matches i and s matches s. Now, the engine tries to match the second \b at the position before the l. This fails because this position is between two word characters. The engine reverts to the start of the regex and advances one character to the s in island. Again, the \b fails to match and continues to do so until the second space is reached. It matches there, but matching the i fails.
But \b matches at the position before the third i in the string. The engine continues, and finds that i matches i and s matches s. The last token in the regex, \b, also matches at the position before the third space in the string because the space is not a word character, and the character before it is.
The engine has successfully matched the word is in our string, skipping the two earlier occurrences of the characters i and s. If we had used the regular expression is, it would have matched the is in This.
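The outcome of this walkthrough can be reproduced with a short sketch in Python's re module (Python is an assumption here; any flavor with Perl-style word boundaries should behave the same way on this string).

```python
import re

subject = "This island is beautiful"

# With word boundaries: only the standalone word qualifies.
whole_word = re.search(r"\bis\b", subject)

# Without boundaries: the first "is" found is the one inside "This".
substring = re.search(r"is", subject)

print(whole_word.span())
print(substring.span())
```

The spans show that the bounded regex skips the occurrences inside This and island and lands on the third one.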
Tcl Word Boundaries
Word boundaries, as described above, are supported by most regular expression flavors. Notable exceptions are the POSIX and XML Schema flavors, which don’t support word boundaries at all. Tcl uses a different syntax.
In Tcl, \b matches a backspace character, just like \x08 in most regex flavors (including Tcl’s). \B matches a single backslash character in Tcl, just like \\ in all other regex flavors (and Tcl too).
Tcl uses the letter “y” instead of the letter “b” to match word boundaries. \y matches at any word boundary position, while \Y matches at any position that is not a word boundary. These Tcl regex tokens match exactly the same as \b and \B in Perl-style regex flavors. They don’t discriminate between the start and the end of a word.
Tcl has two more word boundary tokens that do discriminate between the start and end of a word. \m matches only at the start of a word. That is, it matches at any position that has a non-word character to the left of it, and a word character to the right of it. It also matches at the start of the string if the first character in the string is a word character. \M matches only at the end of a word. It matches at any position that has a word character to the left of it, and a non-word character to the right of it. It also matches at the end of the string if the last character in the string is a word character.
The only regex engine that supports Tcl-style word boundaries (besides Tcl itself) is the JGsoft engine. In PowerGREP and EditPad Pro, \b and \B are Perl-style word boundaries, while \y, \Y, \m and \M are Tcl-style word boundaries.
In most situations, the lack of \m and \M tokens is not a problem. \yword\y finds “whole words only” occurrences of “word” just like \mword\M would. \Mword\m could never match anywhere, since \M never matches at a position followed by a word character, and \m never at a position preceded by one. If your regular expression needs to match characters before or after \y, you can easily specify in the regex whether these characters should be word characters or non-word characters. If you want to match any word, \y\w+\y gives the same result as \m.+\M. Using \w instead of the dot automatically restricts the first \y to the start of a word, and the second \y to the end of a word. Note that \y.+\y would not work. This regex matches each word, and also each sequence of non-word characters between the words in your subject string. That said, if your flavor supports \m and \M, the regex engine could apply \m\w+\M slightly faster than \y\w+\y, depending on its internal optimizations.
If your regex flavor supports lookahead and lookbehind, you can use (?<!\w)(?=\w) to emulate Tcl’s \m and (?<=\w)(?!\w) to emulate \M. Though quite a bit more verbose, these lookaround constructs match exactly the same as Tcl’s word boundaries.
If your flavor has lookahead but not lookbehind, and also has Perl-style word boundaries, you can use \b(?=\w) to emulate Tcl’s \m and \b(?!\w) to emulate \M. \b matches at the start or end of a word, and the lookahead checks if the next character is part of a word or not. If it is, we’re at the start of a word. Otherwise, we’re at the end of a word.
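To make the two emulations concrete, the following sketch lists the positions where each one matches, using Python's re module (an assumption; Python has no \m or \M of its own, so only the emulations are shown). The helper name positions and the sample text are made up for illustration.

```python
import re

# Lookaround emulation of start-of-word and end-of-word.
START_OF_WORD = re.compile(r"(?<!\w)(?=\w)")
END_OF_WORD = re.compile(r"(?<=\w)(?!\w)")

# Emulation that needs only \b plus lookahead.
START_VIA_B = re.compile(r"\b(?=\w)")
END_VIA_B = re.compile(r"\b(?!\w)")

def positions(pattern, text):
    """Return the offsets of all (zero-length) matches of pattern in text."""
    return [m.start() for m in pattern.finditer(text)]

text = "hi there, world"
starts = positions(START_OF_WORD, text)
ends = positions(END_OF_WORD, text)

print(starts, ends)
print(positions(START_VIA_B, text), positions(END_VIA_B, text))
```

Both pairs of patterns report the same offsets on this sample, which is what the equivalence described above predicts.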
GNU Word Boundaries
The GNU extensions to POSIX regular expressions add support for the \b and \B word boundaries, as described above. GNU also uses its own syntax for start-of-word and end-of-word boundaries. \< matches at the start of a word, like Tcl’s \m. \> matches at the end of a word, like Tcl’s \M.
Boost also treats \< and \> as word boundaries when using the ECMAScript, extended, egrep, or awk grammar.
POSIX Word Boundaries
The POSIX standard defines [[:<:]] as a start-of-word boundary, and [[:>:]] as an end-of-word boundary. Though the syntax is borrowed from POSIX bracket expressions, these tokens are word boundaries that have nothing to do with and cannot be used inside character classes. Tcl and GNU also support POSIX word boundaries. PCRE supports POSIX word boundaries starting with version 8.34. Boost supports them in all its grammars.
Regex Boundaries and Delimiters—Standard and Advanced
Although this page starts with the regex word boundary \b, it aims to go far beyond: it will also introduce less-known boundaries, as well as explain how to make your own: DIY boundaries.
Boundaries vs. Anchors
Why are ^ and $ called anchors while \b is called a boundary?
These tokens have one thing in common: they are assertions about the engine’s current position in the string. Therefore, none of them consume characters.
Anchors assert that the current position in the string matches a certain position: the beginning, the end, or in the case of \G the position immediately following the last match.
In contrast, boundaries make assertions about what can be matched to the left and right of the current position.
The distinction is blurry. Typically, you would translate ^ as something like «assert that the current position is the beginning of the string». But if you were in a mood to play with logic, you could say:
Imagine that a string is a space between two walls—one to the left and one to the right. All the positions in the string are within that space. Then we could translate the ^ anchor as:
Assert that immediately to the left of the current position, we can find the left wall, while to the right of the current position we cannot find the left wall.
Yep, in that light, our anchor is a boundary—we look left and right. We’ll keep anchors and boundaries on separate pages because there’s a lot of ground to cover, but just keep that in mind.
Word Boundary: \b
The word boundary \b matches positions where one side is a word character (usually a letter, digit or underscore, but see below for variations across engines) and the other side is not a word character (for instance, it may be the beginning of the string or a space character).
The regex \bcat\b would therefore match cat in a black cat, but it wouldn’t match it in catatonic, tomcat or certificate. Removing one of the boundaries, \bcat would match cat in catfish, and cat\b would match cat in tomcat, but not vice-versa. Both, of course, would match cat on its own.
Word boundaries are useful when you want to match a sequence of letters (or digits) on their own, or to ensure that they occur at the beginning or the end of a sequence of characters.
Be aware, though, that \bcat\b will not match cat in _cat or in cat25 because there is no boundary between an underscore and a letter, nor between a letter and a digit: these all belong to what regex defines as word characters. If you want to create a «real word boundary» (where a word is only allowed to have letters), see the recipe below in the section on DIY boundaries.
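The cat examples can be verified with a small sketch in Python's re module (Python is an assumption; the helper name hits is made up for illustration). It runs three variants of the pattern over the sample words mentioned in this section.

```python
import re

words = ["a black cat", "catatonic", "tomcat", "certificate",
         "catfish", "_cat", "cat25"]

def hits(pattern):
    """Return the sample words in which pattern matches somewhere."""
    return [w for w in words if re.search(pattern, w)]

whole = hits(r"\bcat\b")   # boundary on both sides
prefix = hits(r"\bcat")    # boundary on the left only
suffix = hits(r"cat\b")    # boundary on the right only

print(whole)
print(prefix)
print(suffix)
```

Note how _cat and cat25 are rejected by the two-sided pattern, because the underscore and the digits count as word characters.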
Difference between Engines
As you can see on the regex cheat sheet, \b behaves differently depending on the engine:
✽ In PCRE (PHP, R…) with the Unicode mode turned off, JavaScript and Python 2.7, it matches where only one side is an ASCII letter, digit or underscore.
✽ In PCRE (PHP, R…) with the Unicode mode turned on, .NET, Java, Perl, Python 3 and Ruby, it matches a position where only one side is a Unicode letter, digit or underscore.
Not-a-word-boundary: \B
\B matches all positions where \b doesn’t match. Therefore, it matches:
✽ When neither side is a word character, for instance at any position in the string $=(@-%++) (including the beginning and end of the string)
✽ When both sides are a word character, for instance between the H and the i in Hi!
This may not seem very useful, but sometimes \B is just what you want. For instance,
✽ \Bcat\B will find cat fully surrounded by word characters, as in certificate, but neither on its own nor at the beginning or end of words.
✽ cat\B will find cat both in certificate and catfish, but neither in tomcat nor on its own.
✽ \Bcat will find cat both in certificate and tomcat, but neither in catfish nor on its own.
✽ \Bcat|cat\B will find cat in embedded situations, e.g. in certificate, catfish or tomcat, but not on its own.
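The four not-a-word-boundary patterns in the list above can be checked against the same sample words with a short sketch in Python's re module (Python is an assumption; the helper name hits is hypothetical).

```python
import re

words = ["cat", "certificate", "catfish", "tomcat"]

def hits(pattern):
    """Return the sample words in which pattern matches somewhere."""
    return [w for w in words if re.search(pattern, w)]

inner = hits(r"\Bcat\B")          # word characters on both sides
not_at_end = hits(r"cat\B")       # a word character follows
not_at_start = hits(r"\Bcat")     # a word character precedes
embedded = hits(r"\Bcat|cat\B")   # either of the two

print(inner, not_at_end, not_at_start, embedded)
```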
Difference between Engines
In all engines that support it, \B matches positions that are not matched by \b. Since \b behaves differently in various engines, see \b engine variations a few paragraphs above.
Left- and Right-of-Word Boundaries
The PCRE (PHP, R, …) version 8.34+ and MySQL engines support the POSIX-style tokens for the beginning-of-word boundary [[:<:]] and the end-of-word boundary [[:>:]]:
✽ [[:<:]]cat matches cat in the word on its own as well as in catfish, but neither in tomcat nor in certificate.
✽ cat[[:<:]] never matches as a word cannot start in the middle of a word.
✽ cat[[:>:]] matches cat in the word on its own as well as in tomcat, but neither in catfish nor in certificate.
✽ [[:>:]]cat never matches as a word cannot end in the middle of a word.
For MySQL, the definition of a word character is an ASCII letter, digit or underscore—and this set of characters drives the interpretation of these «start of word» and «end of word» boundaries.
PCRE offers these boundaries as a convenience for occasions when someone might want to paste POSIX regex into a PCRE-powered language (or, more likely, switch the regex library used by an old C program), but the engine makes the following substitutions before starting the match:
✽ The start of word boundary [[:<:]] is converted to \b(?=\w)
✽ The end of word boundary [[:>:]] is converted to \b(?<=\w)
Therefore, the «start of word» and «end of word» boundaries derive their meaning from the \b boundary. In non-Unicode mode, it matches a position where only one side is an ASCII letter, digit or underscore. In Unicode mode, it matches a position where only one side is a Unicode letter, digit or underscore.
Other Engines
I’ve never yet encountered a situation where I wished I had one of these boundaries. Most likely, if it ever arises, I automatically solve it by using lookarounds. If you ever want to use these specific boundaries in a language that doesn’t support them, one solution among several is to copy the patterns (from two paragraphs above) that PCRE uses to convert the boundaries to regular syntax.
Making Your Own Boundaries
Finding a boundary between a word character and a non-word character is convenient, and we can thank \b for that. But there are many other cases where we could use a boundary for which regex does not provide explicit syntax. For instance, how do you match the position between a letter and a digit? We’ll make this exact boundary further down, but let’s get there at a comfortable pace.
Delimiters
As a first example, let’s look at a line in an email reply:
> and then she told him she wouldn’t settle for less than a Hawaiian pizza, and
Let’s say we want a boundary that finds the position between the > and an ASCII letter.
As a first approach, we could use a lookbehind. Assuming we’re in multi-line mode, where the anchor ^ matches at the beginning of any line, the lookbehind (?<=^> ) asserts that what precedes the current position is the beginning of a line, then a «greater-than» symbol > and a space.
Therefore, something like (?<=^> )\w+ would find the first word of the line. This works, but I would not call (?<=^> ) a boundary. Whereas a boundary asserts that there is a difference between what lies to the left and what lies to the right, our lookbehind only looks in one direction. If we used it on its own, it would match after the space character in > >>>: it doesn’t care about what follows. It is what I would call a delimiter, rather than a boundary.
Delimiters are very useful, and they are a major source of business for regex lookarounds. For instance, .*?(?=END) would match an entire line up to—but not including—the word END: the lookahead (?=END) serves as an ending delimiter. Likewise, (?<=START) serves as a beginning delimiter in (?<=START).*, which matches an entire line after—but not including—the word START.
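Here is a minimal sketch of the two delimiters in action, written with Python's re module (Python and the sample line are assumptions made for illustration).

```python
import re

# Hypothetical sample line containing both marker words.
line = "START middle part END trailing"

# Ending delimiter: everything up to, but not including, END.
before_end = re.match(r".*?(?=END)", line).group()

# Beginning delimiter: everything after, but not including, START.
after_start = re.search(r"(?<=START).*", line).group()

print(repr(before_end))
print(repr(after_start))
```

In both cases the marker word itself is only checked, never consumed, so it stays out of the match.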
Further down, we will look at a useful technique: double-negative delimiters.
Boundaries: Look Left and Right
To finish our boundary for the position following the start of an email reply line and preceding a letter, we also need to look to the right. We do that by adding a lookahead after the lookbehind:
(?<=^> )(?=[a-zA-Z])
After asserting that what precedes the current position is a «greater than» and a space, we assert that what follows is a letter. Note that the order of the lookahead and the lookbehind does not matter, as they do not consume any characters: they look to the left and to the right with our feet firmly planted in the same spot in the string. Therefore, the reverse-order boundary
(?=[a-zA-Z])(?<=^> )
works equally well.
After either of these patterns, we can confidently use any regex meta-character—such as the dot—and be sure that it will match a letter: they are true boundaries.
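The difference between the one-directional delimiter and the two-directional boundary can be seen in a short sketch using Python's re module (an assumption, as is the three-line sample reply). Each pattern is followed by .+ so that the rest of the line is captured wherever the assertion succeeds.

```python
import re

# Hypothetical reply text: a plain quote, a nested quote, and a line starting with digits.
reply = "> and then she told him\n> > nested quote\n> 42 pizzas"

# Delimiter only: looks left, accepts anything to the right.
delimiter_hits = re.findall(r"(?<=^> ).+", reply, re.MULTILINE)

# Boundary: looks left and right; the lookarounds may come in either order.
boundary_hits = re.findall(r"(?<=^> )(?=[a-zA-Z]).+", reply, re.MULTILINE)
reversed_hits = re.findall(r"(?=[a-zA-Z])(?<=^> ).+", reply, re.MULTILINE)

print(delimiter_hits)
print(boundary_hits)
print(reversed_hits)
```

The delimiter accepts all three lines, while both orderings of the boundary keep only the line whose first character after the marker is a letter.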
Generalizing the idea: home-made word boundary
We can use this technique to construct any boundary we like. The coming sections will show some examples in detail, but to whet our appetite, how would you build a word boundary if your regex engine didn’t support \b?
When it matches on the left of word characters, a word boundary is able to check that what follows is a word character but what precedes is not. In lookaround terms, this is (?=\w)(?<!\w).
When it matches on the right of word characters, a word boundary is able to check that what precedes is a word character but what follows is not. In lookaround terms, this is (?<=\w)(?!\w).
A word boundary must match either of these positions. Grouping them together inside an alternation, our homemade word boundary becomes:
(?:(?=\w)(?<!\w)|(?<=\w)(?!\w))
Yes, \b is a bit shorter.
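As a sanity check, this sketch compares the homemade boundary with the built-in \b by listing every position where each one matches, using Python's re module (an assumption; the helper name positions and the sample strings are chosen for illustration).

```python
import re

BUILTIN = re.compile(r"\b")
HOMEMADE = re.compile(r"(?:(?=\w)(?<!\w)|(?<=\w)(?!\w))")

def positions(pattern, text):
    """Return the offsets of all zero-length matches of pattern in text."""
    return [m.start() for m in pattern.finditer(text)]

samples = ["a black cat", "_cat25!", "$=(@-%++)", "Hi!"]

# True if both patterns match at exactly the same offsets in every sample.
agree = all(positions(BUILTIN, s) == positions(HOMEMADE, s) for s in samples)

print(agree)
print(positions(HOMEMADE, "Hi!"))
```

Agreement on a handful of samples is of course not a proof, only a quick way to catch a mistake in the lookarounds.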
DIY Boundary Workshop: «real word boundary»
With some variations depending on the engine, regex usually defines a word character as a letter, digit or underscore. A word boundary \b detects a position where one side is such a character, and the other is not.
In the everyday world, most people would probably say that in the English language, a word character is a letter. Others might allow for hyphens. In some situations, it might therefore be useful to have a «real word boundary» that detects the edge between an ASCII letter and a non-letter. How do we do that?
As a start, with lookarounds you can make a left-side and a right-side boundary:
(?i)(?<=^|[^a-z])cat(?=$|[^a-z])
The left side asserts that what precedes is either the beginning of the string or a character that is a non-letter. The right side asserts that what follows is either the end of the string or a non-letter.
Your next step could be to combine the two to form a boundary that can be popped on either side:
(?i)(?<=^|[^a-z])(?=[a-z])|(?<=[a-z])(?=$|[^a-z])
On the left side of the alternation, we have our earlier left boundary, and we add a lookahead to check that what follows is a letter. On the right side of the alternation, we have our earlier right boundary, and we add a lookbehind to check that what precedes us is a letter.
Needless to say, if you need to paste this wherever you want a «real word boundary», this is a bit heavy. With engines that support pre-defined subroutines—Perl, PCRE (PHP, R, …)—you can define the boundary once and for all, then use it wherever you like by referring to its name:
(?x)                       # free-spacing mode
(?(DEFINE)                 # Define some subroutines
   (?<alphaB>              # Define "alphaB" boundary
                           # This boundary matches when
                           # only one side is a letter
      (?i)(?<=^|[^a-z])(?=[a-z])|(?<=[a-z])(?=$|[^a-z])
   )                       # End alphaB definition
)                          # End DEFINE
# The actual regex matching starts here
# We can use our "alphaB" boundary wherever we like
(?&alphaB)cat(?&alphaB)
This would work really well as a component of a large parsing regex.
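For engines without subroutines, one option is to build the pattern from a string constant. The sketch below does this with Python's re module, under two assumptions that are not part of the recipe above: the name ALPHA_B is hypothetical, and the alternations (?<=^|[^a-z]) and (?=$|[^a-z]) are swapped for the negative lookarounds (?<![a-z]) and (?![a-z]), because Python's re module, as far as I know, only accepts fixed-width lookbehind. On single-line input the two forms should assert the same thing.

```python
import re

# "Real word boundary": exactly one side is an ASCII letter.
# Adapted with negative lookarounds so the lookbehind stays fixed-width.
ALPHA_B = r"(?:(?<![a-z])(?=[a-z])|(?<=[a-z])(?![a-z]))"

real_word_cat = re.compile(ALPHA_B + "cat" + ALPHA_B, re.IGNORECASE)

samples = ["_cat", "cat25", "Cat!", "catfish", "tomcat"]
matched = [s for s in samples if real_word_cat.search(s)]

print(matched)
```

Unlike \bcat\b, this pattern accepts cat next to an underscore or a digit, since only letters count as word characters here.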
DIY Boundary: between a letter and a digit
Once we have this recipe, producing boundaries is simple. For instance, with minor tweaks, we can produce a boundary that matches between ASCII letters and digits. I called this pre-defined boundary by the descriptive name A1.
(?x)                       # free-spacing mode
(?(DEFINE)                 # Define some subroutines
   (?<A1>                  # Define "A1" boundary
                           # This boundary matches when
                           # one side is a letter and
                           # the other is a number
      (?i)(?<=^|\d)(?=[a-z])|(?<=[a-z])(?=$|\d)
   )                       # End A1 definition
)                          # End DEFINE
# The actual regex matching starts here
# We can use our "A1" boundary wherever we like
(?&A1)cat(?&A1)
If your engine doesn’t support pre-defined subroutines, you would have to paste this monster in your regex:
(?:(?i)(?<=^|\d)(?=[a-z])|(?<=[a-z])(?=$|\d))
Double Negative Delimiter: Character, or Edge of String
In this section I would like to introduce you to a useful family of delimiters that use a fiendish technique: double negative delimiters.
Consider the string 0# 1 #2 #3# 4# #5. In this string, we want to match 0, 3 and 5, i.e. digits where each side is either a hash or one of the edges of the string.
One first thought might be to use a capture group: (?:^|#)(\d)(?:$|#). This exactly performs the task specified in the previous paragraph: first matching either the beginning of the string or a hash, then a digit, then either the end of the string or a hash. The desired digits are captured to Group 1.
To get rid of the capture group, you will probably think of using lookarounds: (?<=^|#)\d(?=$|#). This is nearly exactly the same as the first regex, except that the sides are no longer matched, but just checked with a lookbehind and a lookahead. This works in .NET, PCRE (C, PHP, R, …), Java and Ruby (or Python with the regex module), but not in other engines as traditional lookbehind must have a fixed width (see Lookbehind: Fixed-Width / Constrained Width / Infinite Width).
In Perl, you can get around this problem with (?:^|#\K)\d(?=$|#), where we match the left-side hash (if any) then drop it with the \K. This would also work in PCRE and Ruby.
But here is the solution I would like to introduce you to:
(?<![^#])\d(?![^#])
This is a bit of a brain twister. On the left side, the negative lookbehind (?<![^#]) asserts that what precedes the current position is not one character that is not a hash. Flipping the double negative back to a positive assertion, this says that if there is a character behind us, it must be a hash. What is allowed behind us is therefore either a hash character or «not a character» (the beginning of the string).
Why the double negative? Isn’t that the same as the positive lookbehind (?<=#)? Well, no: this positive lookbehind requires a hash character—whereas we also want to allow the absence of any character on the left.
The negative lookahead at the end of the pattern follows the same principle: (?![^#]) asserts that what follows is not a character that is not a hash, i.e., if it is a character, it must be a hash.
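The double negative delimiter can be tried out on the sample string with a brief sketch in Python's re module (Python is an assumption). The capture-group version from earlier in this section is included for comparison.

```python
import re

subject = "0# 1 #2 #3# 4# #5"

# Capture-group approach: the surrounding hashes are consumed.
captured = re.findall(r"(?:^|#)(\d)(?:$|#)", subject)

# Double negative delimiters: each neighbor is either a hash or nothing at all.
double_negative = re.findall(r"(?<![^#])\d(?![^#])", subject)

print(captured)
print(double_negative)
```

Both approaches pick out the same digits here, but the double negative version needs no capture group and uses only fixed-width lookbehind.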
Limitation
This technique works for single-line strings. As soon as you move to multiple lines, 0# no longer matches at the beginning of lines 2 and beyond. That is because there is a character before the 0: the \n, and it is not a hash. Likewise, #5 no longer matches at the end of any line but the last, because there is now a line break character (not a hash) after the 5.
Extension
To get your eyes accustomed to the technique, let’s apply it to other tasks.
To match A, B or E in A0 1B1 2C D3 4E, i.e. capital letters that have either a digit or a string-end on each side, you can use this pattern:
(?<!\D)[A-Z](?!\D)
To match A, C or F in A -B- C -D -E F, i.e. capital letters that have either a space or a string-end on each side, you can use this pattern:
(?<!\S)[A-Z](?!\S)
Finally, an unlikely example: to match the tilde, hash or colon in ~A ? 2! _#4 @5 6:, i.e. special characters that have either a word character or a string-end on each side, you can use this pattern:
(?<!\W)[~#:@?!](?!\W)
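The three extension patterns can be run against their sample strings with a short sketch in Python's re module (Python is an assumption; the variable names are made up for illustration).

```python
import re

# Capital letters with a digit or a string edge on each side.
digit_sided = re.findall(r"(?<!\D)[A-Z](?!\D)", "A0 1B1 2C D3 4E")

# Capital letters with whitespace or a string edge on each side.
space_sided = re.findall(r"(?<!\S)[A-Z](?!\S)", "A -B- C -D -E F")

# Special characters with a word character or a string edge on each side.
word_sided = re.findall(r"(?<!\W)[~#:@?!](?!\W)", "~A ? 2! _#4 @5 6:")

print(digit_sided)
print(space_sided)
print(word_sided)
```

In each case the negated shorthand class inside a negative lookaround reads as "if there is a neighbor, it must belong to the class".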
CHARACTER CLASSES OR CHARACTER SETS
With a “character class”, also called “character set”, you can tell the regex engine to match only one out of several characters. Simply place the characters you want to match between square brackets. If you want to match an a or an e, use [ae]. You could use this in gr[ae]y to match either gray or grey. Very useful if you do not know whether the document you are searching through is written in American or British English.
A character class matches only a single character. gr[ae]y will not match graay, graey or any such thing. The order of the characters inside a character class does not matter. The results are identical.
You can use a hyphen inside a character class to specify a range of characters. [0-9] matches a single digit between 0 and 9. You can use more than one range. [0-9a-fA-F] matches a single hexadecimal digit, case insensitively. You can combine ranges and single characters. [0-9a-fxA-FX] matches a hexadecimal digit or the letter X. Again, the order of the characters and the ranges does not matter.
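A few lines of Python's re module (an assumption, as are the sample inputs) are enough to confirm the character class behavior described in this section.

```python
import re

# [ae] matches exactly one character, so "graay" and "graey" are rejected.
spellings = re.findall(r"gr[ae]y", "gray grey graay graey")

# The order inside the class does not matter.
reordered = re.findall(r"gr[ea]y", "gray grey graay graey")

# Ranges combine freely: one hexadecimal digit, either case.
hex_digits = re.findall(r"[0-9a-fA-F]", "0x1fG")

print(spellings, reordered, hex_digits)
```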
THE DOT MATCHES (ALMOST) ANY CHARACTER
In regular expressions, the dot or period is one of the most commonly used metacharacters. Unfortunately, it is also the most commonly misused metacharacter.
The dot matches a single character, without caring what that character is. The only exceptions are newline characters. In all regex flavors discussed in this tutorial, the dot will not match a newline character by default. So by default, the dot is short for the negated character class [^\n] (UNIX regex flavors) or [^\r\n] (Windows regex flavors).
This exception exists mostly because of historic reasons. The first tools that used regular expressions were line-based. They would read a file line by line, and apply the regular expression separately to each line. The effect is that with these tools, the string could never contain newlines, so the dot could never match them.
Modern tools and languages can apply regular expressions to very large strings or even entire files. All regex flavors discussed here have an option to make the dot match all characters, including newlines. In RegexBuddy, EditPad Pro or PowerGREP, you simply tick the checkbox labeled “dot matches newline”.
In Perl, the mode where the dot also matches newlines is called “single-line mode”. This is a bit unfortunate, because it is easy to mix up this term with “multi-line mode”. Multi-line mode only affects anchors, and single-line mode only affects the dot. You can activate single-line mode by adding an s after the regex code, like this: m/^regex$/s;.
Other languages and regex libraries have adopted Perl’s terminology. When using the regex classes of the .NET framework, you activate this mode by specifying RegexOptions.Singleline, such as in Regex.Match("string", "regex", RegexOptions.Singleline).
In all programming languages and regex libraries I know, activating single-line mode has no effect other than making the dot match newlines. So if you expose this option to your users, please give it a clearer label like was done in RegexBuddy, EditPad Pro and PowerGREP.
JavaScript and VBScript do not have an option to make the dot match line break characters. In those languages, you can use a character class such as [\s\S] to match any character. This character class matches a character that is either a whitespace character (including line break characters), or a character that is not a whitespace character. Since all characters are either whitespace or non-whitespace, this character class matches any character.
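The following sketch shows the dot's default behavior, the "dot matches newline" mode, and the [\s\S] workaround side by side. It uses Python's re module, where that mode is spelled re.DOTALL (the use of Python is an assumption made for illustration).

```python
import re

text = "first line\nsecond line"

# By default the dot stops at the line break.
default = re.match(r".*", text).group()

# With DOTALL ("single-line mode" in Perl terms) the dot crosses it.
dotall = re.match(r".*", text, re.DOTALL).group()

# The character class workaround needs no mode at all.
any_char = re.match(r"[\s\S]*", text).group()

print(repr(default))
print(repr(dotall))
print(repr(any_char))
```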
Use The Dot Sparingly
The dot is a very powerful regex metacharacter. It allows you to be lazy. Put in a dot, and everything will match just fine when you test the regex on valid data. The problem is that the regex will also match in cases where it should not match. If you are new to regular expressions, some of these cases may not be so obvious at first.
I will illustrate this with a simple example. Let’s say we want to match a date in mm/dd/yy format, but we want to leave the user the choice of date separators. The quick solution is \d\d.\d\d.\d\d. Seems fine at first. It will match a date like 02/12/03 just fine. Trouble is: 02512703 is also considered a valid date by this regular expression. In this match, the first dot matched 5, and the second matched 7. Obviously not what we intended.
\d\d[- /.]\d\d[- /.]\d\d is a better solution. This regex allows a dash, space, dot and forward slash as date separators. Remember that the dot is not a metacharacter inside a character class, so we do not need to escape it with a backslash.
This regex is still far from perfect. It matches 99/99/99 as a valid date. [0-1]\d[- /.][0-3]\d[- /.]\d\d is a step ahead, though it will still match 19/39/99. How perfect you want your regex to be depends on what you want to do with it. If you are validating user input, it has to be perfect. If you are parsing data files from a known source that generates its files in the same way every time, our last attempt is probably more than sufficient to parse the data without errors. You can find a better regex to match dates in the example section.
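The three stages of the date regex can be compared with a small sketch in Python's re module (Python and the variable names are assumptions). fullmatch is used so that each test string must be matched in its entirety.

```python
import re

loose = re.compile(r"\d\d.\d\d.\d\d")                      # dot as separator
better = re.compile(r"\d\d[- /.]\d\d[- /.]\d\d")           # explicit separators
stricter = re.compile(r"[0-1]\d[- /.][0-3]\d[- /.]\d\d")   # limited leading digits

def accepts(pattern, text):
    """True if pattern matches the whole of text."""
    return pattern.fullmatch(text) is not None

print(accepts(loose, "02512703"), accepts(better, "02512703"))
print(accepts(better, "99/99/99"), accepts(stricter, "99/99/99"))
print(accepts(stricter, "19/39/99"))
```

Each refinement rejects one class of bad input, and the last line shows the kind of invalid date that still slips through.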
Use Negated Character Sets Instead of the Dot
I will explain this in depth when I present you the repeat operators star and plus, but the warning is important enough to mention it here as well. I will illustrate with an example.
Suppose you want to match a double-quoted string. Sounds easy. We can have any number of any character between the double quotes, so ".*" seems to do the trick just fine. The dot matches any character, and the star allows the dot to be repeated any number of times, including zero. If you test this regex on Put a "string" between double quotes, it will match "string" just fine. Now go ahead and test it on Houston, we have a problem with "string one" and "string two". Please respond.
Ouch. The regex matches "string one" and "string two". Definitely not what we intended. The reason for this is that the star is greedy.
In the date-matching example, we improved our regex by replacing the dot with a character class. Here, we will do the same. Our original definition of a double-quoted string was faulty. We do not want any number of any character between the quotes. We want any number of characters that are not double quotes or newlines between the quotes. So the proper regex is "[^"\r\n]*".
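The greedy dot and the negated character class can be compared directly on the Houston sentence with a short sketch in Python's re module (the use of Python is an assumption).

```python
import re

text = 'Houston, we have a problem with "string one" and "string two". Please respond.'

# Greedy dot: runs from the first quote to the last one.
greedy = re.findall(r'".*"', text)

# Negated class: cannot step over a closing quote.
negated = re.findall(r'"[^"\r\n]*"', text)

print(greedy)
print(negated)
```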
Start of String and End of String Anchors
Thus far, I have explained literal characters and character classes. In both cases, putting one in a regex will cause the regex engine to try to match a single character.
Anchors are a different breed. They do not match any character at all. Instead, they match a position before, after or between characters. They can be used to “anchor” the regex match at a certain position. The caret ^ matches the position before the first character in the string. Applying ^a to abc matches a. ^b will not match abc at all, because the b cannot be matched right after the start of the string, matched by ^. See below for the inside view of the regex engine.
Similarly, $ matches right after the last character in the string. c$ matches c in abc, while a$ does not match at all.
Useful Applications
When using regular expressions in a programming language to validate user input, using anchors is very important. If you use the code if ($input =~ m/\d+/) in a Perl script to see if the user entered an integer number, it will accept the input even if the user entered qsdf4ghjk, because \d+ matches the 4. The correct regex to use is ^\d+$. Because “start of string” must be matched before the match of \d+, and “end of string” must be matched right after it, the entire string must consist of digits for ^\d+$ to be able to match.
It is easy for the user to accidentally type in a space. When Perl reads a line from a text file, the line break will also be stored in the variable. So before validating input, it is good practice to trim leading and trailing whitespace. ^\s+ matches leading whitespace and \s+$ matches trailing whitespace. In Perl, you could use $input =~ s/^\s+|\s+$//g. Handy use of alternation and /g allows us to do this in a single line of code.
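The same validate-after-trimming idea can be sketched in Python's re module (Python is an assumption, and the function names trim and is_integer are hypothetical helpers, not part of any library).

```python
import re

def trim(text):
    """Strip leading and trailing whitespace, mirroring s/^\\s+|\\s+$//g."""
    return re.sub(r"^\s+|\s+$", "", text)

def is_integer(text):
    """True if text consists of digits only, using the anchored regex."""
    return re.search(r"^\d+$", text) is not None

# Without anchors, a single digit anywhere is enough to "validate".
unanchored = re.search(r"\d+", "qsdf4ghjk") is not None
anchored = is_integer("qsdf4ghjk")

# Trim first, then validate.
cleaned = trim("  42\n")

print(unanchored, anchored, repr(cleaned), is_integer(cleaned))
```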
Using ^ and $ as Start of Line and End of Line Anchors
If you have a string consisting of multiple lines, like first line\nsecond line (where \n indicates a line break), it is often desirable to work with lines, rather than the entire string. Therefore, all the regex engines discussed in this tutorial have the option to expand the meaning of both anchors. ^ can then match at the start of the string (before the f in the above string), as well as after each line break (between \n and s). Likewise, $ will still match at the end of the string (after the last e), and also before every line break (between e and \n).
In text editors like EditPad Pro or GNU Emacs, and regex tools like PowerGREP, the caret and dollar always match at the start and end of each line. This makes sense because those applications are designed to work with entire files, rather than short strings.
In all programming languages and libraries discussed on this website, except Ruby, you have to explicitly activate this extended functionality. It is traditionally called “multi-line mode”. In Perl, you do this by adding an m after the regex code, like this: m/^regex$/m;. In .NET, the anchors match before and after newlines when you specify RegexOptions.Multiline, such as in Regex.Match("string", "regex", RegexOptions.Multiline).
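To make the difference concrete, here is a small Python sketch (the variable names are made up). Python, like Perl and .NET, needs the multi-line flag to be switched on explicitly, here via re.MULTILINE.

```python
import re

subject = "first line\nsecond line"

# Default mode: ^ only matches at the very start of the string.
starts_default = re.findall(r"^\w+", subject)

# Multi-line mode: ^ also matches after each line break.
starts_multiline = re.findall(r"^\w+", subject, re.MULTILINE)

# Multi-line mode: $ also matches before each line break.
ends_multiline = re.findall(r"\w+$", subject, re.MULTILINE)
```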
Permanent Start of String and End of String Anchors
\A only ever matches at the start of the string. Likewise, \Z only ever matches at the end of the string. These two tokens never match at line breaks. This is true in all regex flavors discussed in this tutorial, even when you turn on “multiline mode”. In EditPad Pro and PowerGREP, where the caret and dollar always match at the start and end of lines, \A and \Z only match at the start and the end of the entire file.
JavaScript, POSIX and XML do not support \A and \Z. You’re stuck with using the caret and dollar for this purpose.
The GNU extensions to POSIX regular expressions use \` (backtick) to match the start of the string, and \' (single quote) to match the end of the string.
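A short Python sketch shows that the permanent anchors ignore multi-line mode (variable names are my own). One caveat: Python's \Z only matches at the very end of the string, which is slightly stricter than the Perl behavior.

```python
import re

subject = "first line\nsecond line"

# Even with re.MULTILINE, \A matches only at the start of the string...
start_words = re.findall(r"\A\w+", subject, re.MULTILINE)

# ...and \Z matches only at the end of the string.
end_words = re.findall(r"\w+\Z", subject, re.MULTILINE)
```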
Zero-Length Matches
We saw that the anchors match at a position, rather than matching a character. This means that when a regex only consists of one or more anchors, it can result in a zero-length match. Depending on the situation, this can be very useful or undesirable. Using ^\d*$ to test if the user entered a number (notice the use of the star instead of the plus), would cause the script to accept an empty string as a valid input. See below.
However, matching only a position can be very useful. In email, for example, it is common to prepend a “greater than” symbol and a space to each line of the quoted message. In VB.NET, we can easily do this with Dim Quoted as String = Regex.Replace(Original, "^", "> ", RegexOptions.Multiline). We are using multi-line mode, so the regex ^ matches at the start of the quoted message, and after each newline. The Regex.Replace method will remove the regex match from the string, and insert the replacement string (greater than symbol and a space). Since the match does not include any characters, nothing is deleted. However, the match does include a starting position, and the replacement string is inserted there, just like we want it.
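The same quoting trick can be sketched in Python (the sample message is invented for illustration). The sample deliberately has no trailing line break, because Python's multi-line ^ would also match after a final newline.

```python
import re

original = "Hello,\nhow are you?"

# ^ matches a position only, so nothing is deleted;
# "> " is simply inserted at the start of every line.
quoted = re.sub(r"^", "> ", original, flags=re.MULTILINE)
```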
Strings Ending with a Line Break
Even though \Z and $ only match at the end of the string (when the option for the caret and dollar to match at embedded line breaks is off), there is one exception. If the string ends with a line break, then \Z and $ will match at the position before that line break, rather than at the very end of the string. This “enhancement” was introduced by Perl, and is copied by many regex flavors, including Java, .NET and PCRE. In Perl, when reading a line from a file, the resulting string will end with a line break. Reading a line from a file with the text “joe” results in the string joe\n. When applied to this string, both ^[a-z]+$ and \A[a-z]+\Z will match joe.
If you only want a match at the absolute very end of the string, use \z (lower case z instead of upper case Z). \A[a-z]+\z does not match joe\n. \z matches after the line break, which is not matched by the character class.
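Flavors differ on this point, so it is worth testing in your own language. As an illustration, Python (not one of the flavors named above) follows Perl for $ but not for \Z: its upper case \Z behaves like the strict \z described here. A minimal sketch:

```python
import re

line = "joe\n"

# $ matches before the final line break, as in Perl.
dollar_match = re.search(r"^[a-z]+$", line)

# In Python, \Z matches only at the very end of the string,
# so the trailing line break makes this fail.
strict_match = re.search(r"\A[a-z]+\Z", line)
```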
Looking Inside the Regex Engine
Let’s see what happens when we try to match ^4$ to 749\n486\n4 (where \n represents a newline character) in multi-line mode. As usual, the regex engine starts at the first character: 7. The first token in the regular expression is ^. Since this token is a zero-width token, the engine does not try to match it with the character, but rather with the position before the character that the regex engine has reached so far. ^ indeed matches the position before 7. The engine then advances to the next regex token: 4. Since the previous token was zero-width, the regex engine does not advance to the next character in the string. It remains at 7. 4 is a literal character, which does not match 7. There are no other permutations of the regex, so the engine starts again with the first regex token, at the next character: 4. This time, ^ cannot match at the position before the 4. This position is preceded by a character, and that character is not a newline. The engine continues at 9, and fails again. The next attempt, at \n, also fails. Again, the position before \n is preceded by a character, 9, and that character is not a newline.
Then, the regex engine arrives at the second 4 in the string. The ^ can match at the position before the 4, because it is preceded by a newline character. Again, the regex engine advances to the next regex token, 4, but does not advance the character position in the string. 4 matches 4, and the engine advances both the regex token and the string character. Now the engine attempts to match $ at the position before (indeed: before) the 8. The dollar cannot match here, because this position is followed by a character, and that character is not a newline.
Once more, the engine must retry the first token. Previously, it was successfully matched at the second 4, so the engine continues at the next character, 8, where the caret does not match. The same happens at the 6 and the newline.
Finally, the regex engine tries to match the first token at the third 4 in the string. With success. After that, the engine successfully matches 4 with 4. The current regex token is advanced to $, and the current character is advanced to the very last position in the string: the void after the string. No regex token that needs a character to match can match here. Not even a negated character class. However, we are trying to match a dollar sign, and the mighty dollar is a strange beast. It is zero-width, so it will try to match the position before the current character. It does not matter that this “character” is the void after the string. In fact, the dollar will check the current character. It must be either a newline, or the void after the string, for $ to match the position before the current character. Since that is the case in our example, the dollar matches successfully.
Since $ was the last token in the regex, the engine has found a successful match: the last 4 in the string.
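The walkthrough above can be checked with a few lines of Python (a sketch; the variable names are mine). Only the third 4, at zero-based index 8, sits alone on its line.

```python
import re

subject = "749\n486\n4"

# Collect (start position, matched text) for every match in multi-line mode.
matches = [(m.start(), m.group())
           for m in re.finditer(r"^4$", subject, re.MULTILINE)]
```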
Another Inside Look
Earlier I mentioned that ^\d*$ would successfully match an empty string. Let’s see why.
There is only one “character” position in an empty string: the void after the string. The first token in the regex is ^. It matches the position before the void after the string, because it is preceded by the void before the string. The next token is \d*. As we will see later, one of the star’s effects is that it makes the \d, in this case, optional. The engine will try to match \d with the void after the string. That fails, but the star turns the failure of the \d into a zero-width success. The engine will proceed with the next regex token, without advancing the position in the string. So the engine arrives at $, and the void after the string. We already saw that those match. At this point, the entire regex has matched the empty string, and the engine reports success.
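A quick Python check of this behavior (a sketch with made-up variable names) shows the star accepting the empty string while the plus rejects it:

```python
import re

# The star allows zero digits, so the empty string matches with zero length.
star_match = re.search(r"^\d*$", "")
star_span = star_match.span() if star_match else None

# The plus demands at least one digit, so the empty string is rejected.
plus_match = re.search(r"^\d+$", "")
```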
Caution for Programmers
A regular expression such as $ all by itself can indeed match after the string. If you query the engine for the character position, it returns the length of the string if string indices are zero-based, or the length+1 if string indices are one-based in your programming language. If you query the engine for the length of the match, it returns zero.
What you have to watch out for is that String[Regex.MatchPosition] may cause an access violation or segmentation fault, because MatchPosition can point to the void after the string. This can also happen with ^ and ^$ if the last character in the string is a newline.
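In a memory-safe language the symptom is milder but the bug is the same. The following Python sketch (variable names are my own) shows the match position pointing one past the last valid index, where indexing raises IndexError instead of crashing:

```python
import re

text = "abc"
match = re.search(r"$", text)

position = match.start()              # equals len(text) with zero-based indices
length = match.end() - match.start()  # zero-length match

try:
    text[position]                    # one past the last character
    out_of_range = False
except IndexError:
    out_of_range = True
```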
Word Boundaries
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a “word boundary”. This match is zero-length.
There are three different positions that qualify as word boundaries:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
Simply put: \b allows you to perform a “whole words only” search using a regular expression in the form of \bword\b. A “word character” is a character that can be used to form words. All characters that are not “word characters” are “non-word characters”.
In all flavors, the characters [a-zA-Z0-9_] are word characters. These are also matched by the short-hand character class \w. Flavors showing “ascii” for word boundaries in the flavor comparison recognize only these as word characters. Flavors showing “YES” also recognize letters and digits from other languages or all of Unicode as word characters. Notice that Java supports Unicode for \b but not for \w. Python offers flags to control which characters are word characters (affecting both \b and \w).
In Perl and the other regex flavors discussed in this tutorial, there is only one metacharacter that matches both before a word and after a word. This is because any position between characters can never be both at the start and at the end of a word. Using only one operator makes things easier for you.
Since digits are considered to be word characters, \b4\b can be used to match a 4 that is not part of a larger number. This regex will not match 44 sheets of a4. So saying “\b matches before and after an alphanumeric sequence” is more exact than saying “before and after a word”.
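This is easy to verify with a short Python sketch (the second sample string is invented for contrast):

```python
import re

# Every 4 here touches another word character, so \b4\b finds nothing.
no_matches = re.findall(r"\b4\b", "44 sheets of a4")

# A 4 surrounded by non-word characters is found; the 44 is skipped.
one_match = re.findall(r"\b4\b", "page 4 of 44")
```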
Negated Word Boundary
\B is the negated version of \b. \B matches at every position where \b does not. Effectively, \B matches at any position between two word characters as well as at any position between two non-word characters.
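As a small illustration in Python (the sample sentence is borrowed from the engine walkthrough in this section), \Bis and \bis find complementary sets of positions:

```python
import re

subject = "This island is beautiful"

# \Bis: "is" preceded by a word character, i.e. only the one inside "This".
inside_word = [m.start() for m in re.finditer(r"\Bis", subject)]

# \bis: "is" at the start of a word, i.e. in "island" and in "is".
at_word_start = [m.start() for m in re.finditer(r"\bis", subject)]
```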
Looking Inside the Regex Engine
Let’s see what happens when we apply the regex \bis\b to the string This island is beautiful. The engine starts with the first token \b at the first character T. Since this token is zero-length, the position before the character is inspected. \b matches here, because the T is a word character and the character before it is the void before the start of the string. The engine continues with the next token: the literal i. The engine does not advance to the next character in the string, because the previous regex token was zero-width. i does not match T, so the engine retries the first token at the next character position.
\b cannot match at the position between the T and the h. It cannot match between the h and the i either, and neither between the i and the s.
The next character in the string is a space. \b matches here because the space is not a word character, and the preceding character is. Again, the engine continues with the i which does not match with the space.
Advancing a character and restarting with the first regex token, \b matches between the space and the second i in the string. Continuing, the regex engine finds that i matches i and s matches s. Now, the engine tries to match the second \b at the position before the l. This fails because this position is between two word characters. The engine reverts to the start of the regex and advances one character to the s in island. Again, the \b fails to match and continues to do so until the second space is reached. It matches there, but matching the i fails.
But \b matches at the position before the third i in the string. The engine continues, and finds that i matches i and s matches s. The last token in the regex, \b, also matches at the position before the third space in the string because the space is not a word character, and the character before it is.
The engine has successfully matched the word is in our string, skipping the two earlier occurrences of the characters i and s. If we had used the regular expression is, it would have matched the is in This.
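The outcome of this walkthrough can be confirmed in Python (a sketch; positions are zero-based):

```python
import re

subject = "This island is beautiful"

# With word boundaries, only the standalone word "is" matches.
whole_word = re.search(r"\bis\b", subject)
whole_word_span = whole_word.span()

# Without them, the first "is" found is the one inside "This".
plain = re.search(r"is", subject)
plain_start = plain.start()
```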
Tcl Word Boundaries
Word boundaries, as described above, are supported by most regular expression flavors. Notable exceptions are the POSIX and XML Schema flavors, which don’t support word boundaries at all. Tcl uses a different syntax.
In Tcl, \b matches a backspace character, just like \x08 in most regex flavors (including Tcl’s). \B matches a single backslash character in Tcl, just like \\ in all other regex flavors (and Tcl too).
Tcl uses the letter “y” instead of the letter “b” to match word boundaries. \y matches at any word boundary position, while \Y matches at any position that is not a word boundary. These Tcl regex tokens match exactly the same as \b and \B in Perl-style regex flavors. They don’t discriminate between the start and the end of a word.
Tcl has two more word boundary tokens that do discriminate between the start and end of a word. \m matches only at the start of a word. That is, it matches at any position that has a non-word character to the left of it, and a word character to the right of it. It also matches at the start of the string if the first character in the string is a word character. \M matches only at the end of a word. It matches at any position that has a word character to the left of it, and a non-word character to the right of it. It also matches at the end of the string if the last character in the string is a word character.
The only regex engine that supports Tcl-style word boundaries (besides Tcl itself) is the JGsoft engine. In PowerGREP and EditPad Pro, \b and \B are Perl-style word boundaries, and \y, \Y, \m and \M are Tcl-style word boundaries.
In most situations, the lack of \m and \M tokens is not a problem. \yword\y finds “whole words only” occurrences of “word” just like \mword\M would. \Mword\m could never match anywhere, since \M never matches at a position followed by a word character, and \m never at a position preceded by one. If your regular expression needs to match characters before or after \y, you can easily specify in the regex whether these characters should be word characters or non-word characters. E.g. if you want to match any word, \y\w+\y will give the same result as \m\w+\M. Using \w instead of the dot automatically restricts the first \y to the start of a word, and the second \y to the end of a word. Note that \y.+\y would not work. This regex matches each word, and also each sequence of non-word characters between the words in your subject string. That said, if your flavor supports \m and \M, the regex engine could apply \m\w+\M slightly faster than \y\w+\y, depending on its internal optimizations.
If your regex flavor supports lookahead and lookbehind, you can use (?<!\w)(?=\w) to emulate Tcl’s \m and (?<=\w)(?!\w) to emulate \M. Though quite a bit more verbose, these lookaround constructs match exactly the same as Tcl’s word boundaries.
If your flavor has lookahead but not lookbehind, and also has Perl-style word boundaries, you can use \b(?=\w) to emulate Tcl’s \m and \b(?!\w) to emulate \M. \b matches at the start or end of a word, and the lookahead checks if the next character is part of a word or not. If it is, we’re at the start of a word. Otherwise, we’re at the end of a word.
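Python has no \m or \M, which makes it a convenient place to try both emulations. This sketch (the sample string and helper name are mine) collects the positions of the zero-length matches:

```python
import re

subject = "one, two"

def positions(pattern):
    # Start offsets of every (zero-length) match of the pattern.
    return [m.start() for m in re.finditer(pattern, subject)]

# Start of word: lookbehind/lookahead form, and the \b plus lookahead form.
word_starts = positions(r"(?<!\w)(?=\w)")
word_starts_b = positions(r"\b(?=\w)")

# End of word: the same two styles.
word_ends = positions(r"(?<=\w)(?!\w)")
word_ends_b = positions(r"\b(?!\w)")
```

Both styles report the same positions: the starts of "one" and "two", and the positions just after them.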
GNU Word Boundaries
The GNU extensions to POSIX regular expressions add support for the \b and \B word boundaries, as described above. GNU also uses its own syntax for start-of-word and end-of-word boundaries. \< matches at the start of a word, like Tcl’s \m. \> matches at the end of a word, like Tcl’s \M.
Alternation with The Vertical Bar or Pipe Symbol
I already explained how you can use character classes to match a single character out of several possible characters. Alternation is similar. You can use alternation to match a single regular expression out of several possible regular expressions.
If you want to search for the literal text cat or dog, separate both options with a vertical bar or pipe symbol: cat|dog. If you want more options, simply expand the list: cat|dog|mouse|fish.
The alternation operator has the lowest precedence of all regex operators. That is, it tells the regex engine to match either everything to the left of the vertical bar, or everything to the right of the vertical bar. If you want to limit the reach of the alternation, you will need to use round brackets for grouping. If we want to improve the first example to match whole words only, we would need to use \b(cat|dog)\b. This tells the regex engine to find a word boundary, then either “cat” or “dog”, and then another word boundary. If we had omitted the round brackets, the regex engine would have searched for “a word boundary followed by cat”, or, “dog” followed by a word boundary.
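The effect of the missing brackets shows up clearly on a string like catalog bulldog (a sample I made up for this Python sketch):

```python
import re

subject = "catalog bulldog"

def all_matches(pattern):
    # Full text of each match, regardless of capturing groups.
    return [m.group(0) for m in re.finditer(pattern, subject)]

# Without brackets: "\bcat" OR "dog\b", so parts of longer words match.
ungrouped = all_matches(r"\bcat|dog\b")

# With brackets: both boundaries apply to both alternatives.
grouped = all_matches(r"\b(cat|dog)\b")
```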
Remember That The Regex Engine Is Eager
I already explained that the regex engine is eager. It will stop searching as soon as it finds a valid match. The consequence is that in certain situations, the order of the alternatives matters. Suppose you want to use a regex to match a list of function names in a programming language: Get, GetValue, Set or SetValue. The obvious solution is Get|GetValue|Set|SetValue. Let’s see how this works out when the string is SetValue.
The regex engine starts at the first token in the regex, G, and at the first character in the string, S. The match fails. However, the regex engine studied the entire regular expression before starting. So it knows that this regular expression uses alternation, and that the entire regex has not failed yet. So it continues with the second option, being the second G in the regex. The match fails again. The next token is the first S in the regex. The match succeeds, and the engine continues with the next character in the string, as well as the next token in the regex. The next token in the regex is the e after the S that just successfully matched. e matches e. The next token, t matches t.
At this point, the third option in the alternation has been successfully matched. Because the regex engine is eager, it considers the entire alternation to have been successfully matched as soon as one of the options has. In this example, there are no other tokens in the regex outside the alternation, so the entire regex has successfully matched Set in SetValue.
Contrary to what we intended, the regex did not match the entire string. There are several solutions. One option is to take into account that the regex engine is eager, and change the order of the options. If we use GetValue|Get|SetValue|Set, SetValue will be attempted before Set, and the engine will match the entire string. We could also combine the four options into two and use the question mark to make part of them optional: Get(Value)?|Set(Value)?. Because the question mark is greedy, SetValue will be attempted before Set.
The best option is probably to express the fact that we only want to match complete words. We do not want to match Set or SetValue if the string is SetValueFunction. So the solution is \b(Get|GetValue|Set|SetValue)\b or \b(Get(Value)?|Set(Value)?)\b. Since all options have the same end, we can optimize this further to \b(Get|Set)(Value)?\b.
All regex flavors discussed on this website work this way, except one: the POSIX standard mandates that the longest match be returned, regardless of whether the regex engine is implemented using an NFA or DFA algorithm.
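Python's re module is one of the eager, non-POSIX engines, so it reproduces each step of this discussion. A minimal sketch (the helper name is hypothetical):

```python
import re

def first_match(pattern, subject):
    # Text of the first match, or None if the regex does not match at all.
    m = re.search(pattern, subject)
    return m.group(0) if m else None

naive = first_match(r"Get|GetValue|Set|SetValue", "SetValue")      # stops at Set
reordered = first_match(r"GetValue|Get|SetValue|Set", "SetValue")  # longer option first
optional = first_match(r"Get(Value)?|Set(Value)?", "SetValue")     # greedy question mark
whole_word = first_match(r"\b(Get|Set)(Value)?\b", "SetValueFunction")
```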
Optional Items
The question mark makes the preceding token in the regular expression optional. E.g.: colou?r matches both colour and color.
You can make several tokens optional by grouping them together using round brackets, and placing the question mark after the closing bracket. E.g.: Nov(ember)? will match Nov and November.
You can write a regular expression that matches many alternatives by including more than one question mark. Feb(ruary)? 23(rd)? matches February 23rd, February 23, Feb 23rd and Feb 23.
Important Regex Concept: Greediness
With the question mark, I have introduced the first metacharacter that is greedy. The question mark gives the regex engine two choices: try to match the part the question mark applies to, or do not try to match it. The engine will always try to match that part. Only if this causes the entire regular expression to fail, will the engine try ignoring the part the question mark applies to.
The effect is that if you apply the regex Feb 23(rd)? to the string Today is Feb 23rd, 2003, the match will always be Feb 23rd and not Feb 23. You can make the question mark lazy (i.e. turn off the greediness) by putting a second question mark after the first.
I will say a lot more about greediness when discussing the other repetition operators.
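The difference between the greedy and the lazy question mark can be seen in a two-line Python experiment (a sketch; variable names are my own):

```python
import re

subject = "Today is Feb 23rd, 2003"

# Greedy: the optional group is tried first, so "rd" is included.
greedy = re.search(r"Feb 23(rd)?", subject).group(0)

# Lazy: skipping the group is tried first, and the regex already succeeds.
lazy = re.search(r"Feb 23(rd)??", subject).group(0)
```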
Looking Inside The Regex Engine
Let’s apply the regular expression colou?r to the string The colonel likes the color green.
The first token in the regex is the literal c. The first position where it matches successfully is the c in colonel. The engine continues, and finds that o matches o, l matches l and another o matches o. Then the engine checks whether u matches n. This fails. However, the question mark tells the regex engine that failing to match u is acceptable. Therefore, the engine will skip ahead to the next regex token: r. But this fails to match n as well. Now, the engine can only conclude that the entire regular expression cannot be matched starting at the c in colonel. Therefore, the engine starts again trying to match c to the first o in colonel.
After a series of failures, c will match with the c in color, and o, l and o match the following characters. Now the engine checks whether u matches r. This fails. Again: no problem. The question mark allows the engine to continue with r. This matches r and the engine reports that the regex successfully matched color in our string.
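In Python, the result of this walkthrough looks as follows (a sketch; the offset is zero-based):

```python
import re

subject = "The colonel likes the color green"

# "colonel" is tried and rejected; the first successful match is "color".
match = re.search(r"colou?r", subject)
matched_text = match.group(0)
matched_at = match.start()
```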
Repetition with Star and Plus
I already introduced one repetition operator or quantifier: the question mark. It tells the engine to attempt to match the preceding token zero times or once, in effect making it optional.
The asterisk or star tells the engine to attempt to match the preceding token zero or more times. The plus tells the engine to attempt to match the preceding token once or more. <[A-Za-z][A-Za-z0-9]*> matches an HTML tag without any attributes. The sharp brackets are literals. The first character class matches a letter. The second character class matches a letter or digit. The star repeats the second character class. Because we used the star, it’s OK if the second character class matches nothing. So our regex will match a tag like <B>. When matching <HTML>, the first character class will match H. The star will cause the second character class to be repeated three times, matching T, M and L with each step.
I could also have used <[A-Za-z0-9]+>. I did not, because this regex would match <1>, which is not a valid HTML tag. But this regex may be sufficient if you know the string you are searching through does not contain any such invalid tags.
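A brief Python sketch (using re.fullmatch so that the whole test string must be a tag; the helper names are made up) contrasts the two patterns:

```python
import re

strict_tag = re.compile(r"<[A-Za-z][A-Za-z0-9]*>")
loose_tag = re.compile(r"<[A-Za-z0-9]+>")

def is_strict_tag(text):
    return strict_tag.fullmatch(text) is not None

def is_loose_tag(text):
    return loose_tag.fullmatch(text) is not None

# The star lets the second class match nothing (<B>) or repeat (<HTML>).
results_strict = [is_strict_tag(t) for t in ("<B>", "<HTML>", "<1>")]

# The simpler pattern also accepts the invalid tag <1>.
loose_accepts_digit_tag = is_loose_tag("<1>")
```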
Limiting Repetition
Modern regex flavors, like those discussed in this tutorial, have an additional repetition operator that allows you to specify how many times a token can be repeated. The syntax is {min,max}, where min is zero or a positive integer number indicating the minimum number of matches, and max is an integer equal to or greater than min indicating the maximum number of matches. If the comma is present but max is omitted, the maximum number of matches is infinite. So {0,} is the same as *, and {1,} is the same as +. Omitting both the comma and max tells the engine to repeat the token exactly min times.
You could use \b[1-9][0-9]{3}\b to match a number between 1000 and 9999. \b[1-9][0-9]{2,4}\b matches a number between 100 and 99999. Notice the use of the word boundaries.
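Both ranges are easy to test in Python (a sketch; the sample number lists are invented to sit just inside and just outside each range):

```python
import re

# Exactly four digits, the first one non-zero: 1000 through 9999.
four_digits = re.findall(r"\b[1-9][0-9]{3}\b", "999 1000 9999 10000")

# Three to five digits, the first one non-zero: 100 through 99999.
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", "99 100 99999 100000")

# {1,} behaves like +: at least one digit is required.
empty_rejected = re.fullmatch(r"[0-9]{1,}", "") is None
```

The word boundaries are what keep 10000 and 100000 from matching partially.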
Watch Out for The Greediness!
Suppose you want to use a regex to match an HTML tag. You know that the input will be a valid HTML file, so the regular expression does not need to exclude any invalid use of sharp brackets. If it sits between sharp brackets, it is an HTML tag.
Most people new to regular expressions will attempt to use <.+>. They will be surprised when they test it on a string like This is a <EM>first</EM> test. You might expect the regex to match <EM> and when continuing after that match, </EM>.
But it does not. The regex will match <EM>first</EM>. Obviously not what we wanted. The reason is that the plus is greedy. That is, the plus causes the regex engine to repeat the preceding token as often as possible. Only if that causes the entire regex to fail, will the regex engine backtrack. That is, it will go back to the plus, make it give up the last iteration, and proceed with the remainder of the regex. Let’s take a look inside the regex engine to see in detail how this works and why this causes our regex to fail. After that, I will present you with two possible solutions.
Like the plus, the star and the repetition using curly braces are greedy.
Looking Inside The Regex Engine
The first token in the regex is <. This is a literal. As we already know, the first place where it will match is the first < in the string. The next token is the dot, which matches any character except newlines. The dot is repeated by the plus. The plus is greedy. Therefore, the engine will repeat the dot as many times as it can. The dot matches E, so the regex continues to try to match the dot with the next character. M is matched, and the dot is repeated once more. The next character is the >. You should see the problem by now. The dot matches the >, and the engine continues repeating the dot. The dot will match all remaining characters in the string. The dot fails when the engine has reached the void after the end of the string. Only at this point does the regex engine continue with the next token: >.
So far, <.+ has matched <EM>first</EM> test and the engine has arrived at the end of the string. > cannot match here. The engine remembers that the plus has repeated the dot more often than is required. (Remember that the plus requires the dot to match only once.) Rather than admitting failure, the engine will backtrack. It will reduce the repetition of the plus by one, and then continue trying the remainder of the regex.
So the match of .+ is reduced to EM>first</EM> tes. The next token in the regex is still >. But now the next character in the string is the last t. Again, these cannot match, causing the engine to backtrack further. The total match so far is reduced to <EM>first</EM> te. But > still cannot match. So the engine continues backtracking until the match of .+ is reduced to EM>first</EM. Now, > can match the next character in the string. The last token in the regex has been matched. The engine reports that <EM>first</EM> has been successfully matched.
Remember that the regex engine is eager to return a match. It will not continue backtracking further to see if there is another possible match. It will report the first valid match it finds. Because of greediness, this is the leftmost longest match.
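Running the greedy regex in Python (a sketch) gives exactly the match derived above:

```python
import re

subject = "This is a <EM>first</EM> test"

# The greedy plus grabs everything, then backtracks only as far as
# the last > in the string.
greedy_match = re.search(r"<.+>", subject).group(0)
```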
Laziness Instead of Greediness
The quick fix to this problem is to make the plus lazy instead of greedy. Lazy quantifiers are sometimes also called “ungreedy” or “reluctant”. You can do that by putting a question mark behind the plus in the regex. You can do the same with the star, the curly braces and the question mark itself. So our example becomes <.+?>. Let’s have another look inside the regex engine.
Again, < matches the first < in the string. The next token is the dot, this time repeated by a lazy plus. This tells the regex engine to repeat the dot as few times as possible. The minimum is one. So the engine matches the dot with E. The requirement has been met, and the engine continues with > and M. This fails. Again, the engine will backtrack. But this time, the backtracking will force the lazy plus to expand rather than reduce its reach. So the match of .+ is expanded to EM, and the engine tries again to continue with >. Now, > is matched successfully. The last token in the regex has been matched. The engine reports that <EM> has been successfully matched. That’s more like it.
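The lazy version behaves the same way in Python (a sketch), and continuing the search after the first match finds the closing tag as well:

```python
import re

subject = "This is a <EM>first</EM> test"

# The lazy plus expands one character at a time until > can match.
lazy_matches = re.findall(r"<.+?>", subject)
```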
An Alternative to Laziness
In this case, there is a better option than making the plus lazy. We can use a greedy plus and a negated character class: <[^>]+>. The reason why this is better is because of the backtracking. When using the lazy plus, the engine has to backtrack for each character in the HTML tag that it is trying to match. When using the negated character class, no backtracking occurs at all when the string contains valid HTML code. Backtracking slows down the regex engine. You will not notice the difference when doing a single search in a text editor. But you will save plenty of CPU cycles when using such a regex repeatedly in a tight loop in a script that you are writing, or perhaps in a custom syntax coloring scheme for EditPad Pro.
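In Python, the negated character class yields the same tags on our test string. This sketch only checks the result; it does not measure the speed difference, which will depend on the engine and the input.

```python
import re

subject = "This is a <EM>first</EM> test"

# [^>]+ can never run past the closing >, so no backtracking is needed
# on well-formed input.
negated_matches = re.findall(r"<[^>]+>", subject)
```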
Finally, remember that this tutorial only talks about regex-directed engines. Text-directed engines do not backtrack. They do not get the speed penalty, but they also do not support lazy repetition operators.
Repeating \Q…\E Escape Sequences
The \Q…\E sequence escapes a string of characters, matching them as literal characters. The escaped characters are treated as individual characters. If you place a quantifier after the \E, it will only be applied to the last character. E.g. if you apply \Q*\d+*\E+ to *\d+**\d+*, the match will be *\d+**. Only the asterisk is repeated. Java 4 and 5 have a bug that causes the whole \Q..\E sequence to be repeated, yielding the whole subject string as the match. This was fixed in Java 6.
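Python's re module does not support \Q…\E, but its re.escape function plays a similar role, so it can serve as a stand-in to illustrate the same rule. In this sketch (an analogy, not the \Q…\E syntax itself), a plus appended after the escaped text also applies only to the last character:

```python
import re

literal = r"*\d+*"

# re.escape turns the text into a regex matching it literally;
# the appended + then repeats only the final escaped asterisk.
pattern = re.escape(literal) + "+"

subject = r"*\d+**\d+*"
match_text = re.search(pattern, subject).group(0)
```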
Use Round Brackets for Grouping
By placing part of a regular expression inside round brackets or parentheses, you can group that part of the regular expression together. This allows you to apply a regex operator, e.g. a repetition operator, to the entire group. I have already used round brackets for this purpose in previous topics throughout this tutorial.
Note that only round brackets can be used for grouping. Square brackets define a character class, and curly braces are used by a special repetition operator.
Round Brackets Create a Backreference
Besides grouping part of a regular expression together, round brackets also create a “backreference”. A backreference stores the part of the string matched by the part of the regular expression inside the parentheses.
That is, unless you use non-capturing parentheses. Remembering part of the regex match in a backreference slows down the regex engine because it has more work to do. If you do not use the backreference, you can speed things up by using non-capturing parentheses, at the expense of making your regular expression slightly harder to read.
The regex Set(Value)? matches Set or SetValue. In the first case, the first backreference will be empty, because it did not match anything. In the second case, the first backreference will contain Value.
If you do not use the backreference, you can optimize this regular expression into Set(?:Value)?. The question mark and the colon after the opening round bracket are the special syntax that you can use to tell the regex engine that this pair of brackets should not create a backreference. Note the question mark after the opening bracket is unrelated to the question mark at the end of the regex. That question mark is the regex operator that makes the previous token optional. This operator cannot appear after an opening round bracket, because an opening bracket by itself is not a valid regex token. Therefore, there is no confusion between the question mark as an operator to make a token optional, and the question mark as a character to change the properties of a pair of round brackets. The colon indicates that the change we want to make is to turn off capturing the backreference.
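Here is how the two variants look in Python (a sketch; variable names are mine). One flavor-specific detail: Python reports a group that did not participate in the match as None rather than as an empty string.

```python
import re

capturing = re.compile(r"Set(Value)?")
non_capturing = re.compile(r"Set(?:Value)?")

# The first backreference after matching "Set" and "SetValue".
group_for_set = capturing.fullmatch("Set").group(1)
group_for_setvalue = capturing.fullmatch("SetValue").group(1)

# Number of capturing groups each regex defines.
capturing_count = capturing.groups
non_capturing_count = non_capturing.groups

# Both variants still match the same text.
same_text = non_capturing.fullmatch("SetValue") is not None
```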
How to Use Backreferences
Backreferences allow you to reuse part of the regex match. You can reuse it inside the regular expression (see below), or afterwards. What you can do with it afterwards, depends on the tool or programming language you are using. The most common usage is in search-and-replace operations. The replacement text will use a special syntax to allow text matched by capturing groups to be reinserted. This syntax differs greatly between various tools and languages, far more than the regex syntax does. Please check the replacement text reference for details.
Using Backreferences in The Regular Expression
Backreferences can not only be used after a match has been found, but also during the match. Suppose you want to match a pair of opening and closing HTML tags, and the text in between. By putting the opening tag into a backreference, we can reuse the name of the tag for the closing tag. Here’s how: <([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1> . This regex contains only one pair of parentheses, which capture the string matched by [A-Z][A-Z0-9]* into the first backreference. This backreference is reused with \1 (backslash one). The / before it is simply the forward slash in the closing HTML tag that we are trying to match.
To figure out the number of a particular backreference, scan the regular expression from left to right and count the opening round brackets. The first bracket starts backreference number one, the second number two, etc. Non-capturing parentheses are not counted. This fact means that non-capturing parentheses have another benefit: you can insert them into a regular expression without changing the numbers assigned to the backreferences. This can be very useful when modifying a complex regular expression.
You can reuse the same backreference more than once. ([a-c])x\1x\1 will match axaxa, bxbxb and cxcxc.
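The two regexes above can be tried out with a short Python sketch. Python's re module is assumed here purely for illustration; the variable names are made up for this example.

```python
import re

# \1 must repeat exactly what the first group captured.
triple = re.compile(r"([a-c])x\1x\1")
print(bool(triple.fullmatch("axaxa")))  # True
print(bool(triple.fullmatch("axbxa")))  # False: b is not what group 1 holds

# Paired HTML tags: the closing tag reuses the captured tag name.
tag_pair = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>")
m = tag_pair.search("Testing <B><I>bold italic</I></B> text")
print(m.group(0))  # <B><I>bold italic</I></B>
print(m.group(1))  # B
```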
Looking Inside The Regex Engine
Let’s see how the regex engine applies the above regex to the string Testing <B><I>bold italic</I></B> text. The first token in the regex is the literal <. The regex engine will traverse the string until it can match at the first < in the string. The next token is [A-Z]. The regex engine also takes note that it is now inside the first pair of capturing parentheses. [A-Z] matches B. The engine advances to [A-Z0-9] and >. This match fails. However, because of the star, that’s perfectly fine. The position in the string remains at >. The position in the regex is advanced to [^>].
This step crosses the closing bracket of the first pair of capturing parentheses. This prompts the regex engine to store what was matched inside them into the first backreference. In this case, B is stored.
After storing the backreference, the engine proceeds with the match attempt. [^>] does not match >. Again, because of another star, this is not a problem. The position in the string remains at >, and position in the regex is advanced to >. These obviously match. The next token is a dot, repeated by a lazy star. Because of the laziness, the regex engine will initially skip this token, taking note that it should backtrack in case the remainder of the regex fails.
The engine has now arrived at the second < in the regex, and the second < in the string. These match. The next token is /. This does not match I, and the engine is forced to backtrack to the dot. The dot matches the second < in the string. The star is still lazy, so the engine again takes note of the available backtracking position and advances to < and I. These do not match, so the engine again backtracks.
The backtracking continues until the dot has consumed <I>bold italic. At this point, < matches the third < in the string, and the next token is / which matches /. The next token is \1. Note that the token is the backreference, and not B. The engine does not substitute the backreference in the regular expression. Every time the engine arrives at the backreference, it will read the value that was stored. This means that if the engine had backtracked beyond the first pair of capturing parentheses before arriving the second time at \1, the new value stored in the first backreference would be used. But this did not happen here, so B it is. This fails to match at I, so the engine backtracks again, and the dot consumes the third < in the string.
Backtracking continues again until the dot has consumed <I>bold italic</I>. At this point, < matches < and / matches /. The engine arrives again at \1. The backreference still holds B. B matches B. The last token in the regex, > matches >. A complete match has been found: <B><I>bold italic</I></B>.
Backtracking Into Capturing Groups
You may have wondered about the word boundary \b in the <([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1> mentioned above. This is to make sure the regex won’t match incorrectly paired tags such as <boo>bold</b>. You may think that cannot happen because the capturing group matches boo which causes \1 to try to match the same, and fail. That is indeed what happens. But then the regex engine backtracks.
Let’s take the regex <([A-Z][A-Z0-9]*)[^>]*>.*?</\1> without the word boundary and look inside the regex engine at the point where \1 fails the first time. First, .*? continues to expand until it has reached the end of the string, and </\1> has failed to match each time .*? matched one more character.
Then the regex engine backtracks into the capturing group. [A-Z0-9]* has matched oo, but would just as happily match o or nothing at all. When backtracking, [A-Z0-9]* is forced to give up one character. The regex engine continues, exiting the capturing group a second time. Since [A-Z][A-Z0-9]* has now matched bo, that is what is stored into the capturing group, overwriting boo that was stored before. [^>]* matches the second o in the opening tag. >.*?</ matches >bold<. \1 fails again.
The regex engine does all the same backtracking once more, until [A-Z0-9]* is forced to give up another character, causing it to match nothing, which the star allows. The capturing group now stores just b. [^>]* now matches oo. >.*?</ once again matches >bold<. \1 now succeeds, as does > and an overall match is found. But not the one we wanted.
There are several solutions to this. One is to use the word boundary. When [A-Z0-9]* backtracks the first time, reducing the capturing group to bo, \b fails to match between o and o. This forces [A-Z0-9]* to backtrack again immediately. The capturing group is reduced to b and the word boundary fails between b and o. There are no further backtracking positions, so the whole match attempt fails.
The reason we need the word boundary is that we’re using [^>]* to skip over any attributes in the tag. If your paired tags never have any attributes, you can leave that out, and use <([A-Z][A-Z0-9]*)>.*?</\1>. Each time [A-Z0-9]* backtracks, the > that follows it will fail to match, quickly ending the match attempt.
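The effect of backtracking into the capturing group can be observed directly. The sketch below assumes Python's re module and turns on case insensitivity so that the lowercase sample tags match; the variable names are only illustrative.

```python
import re

text = "<boo>bold</b>"

without_boundary = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>", re.IGNORECASE)
with_boundary = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>", re.IGNORECASE)
no_attributes = re.compile(r"<([A-Z][A-Z0-9]*)>.*?</\1>", re.IGNORECASE)

# Without \b the group shrinks to "b" and the mismatched pair is accepted.
m = without_boundary.search(text)
print(m.group(0), m.group(1))      # <boo>bold</b> b

# With \b, or with the > directly after the group, the attempt fails.
print(with_boundary.search(text))  # None
print(no_attributes.search(text))  # None
```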
If you didn’t expect the regex engine to backtrack into capturing groups, you can use an atomic group. The regex engine always backtracks into capturing groups, and never captures atomic groups. You can put the capturing group inside an atomic group to get an atomic capturing group: (?>(atomic capture)). In this case, we can put the whole opening tag into the atomic group: (?><([A-Z][A-Z0-9]*)[^>]*>).*?</\1>. The tutorial section on atomic grouping has all the details.
Backreferences to Failed Groups
The previous section applies to all regex flavors, except those few that don’t support capturing groups at all. Flavors behave differently when you start doing things that don’t fit the “match the text matched by a previous capturing group” job description.
There is a difference between a backreference to a capturing group that matched nothing, and one to a capturing group that did not participate in the match at all. The regex (q?)b\1 will match b. q? is optional and matches nothing, causing (q?) to successfully match and capture nothing. b matches b and \1 successfully matches the nothing captured by the group.
The regex (q)?b\1 however will fail to match b. (q) fails to match at all, so the group never gets to capture anything at all. Because the whole group is optional, the engine does proceed to match b. However, the engine now arrives at \1 which references a group that did not participate in the match attempt at all. This causes the backreference to fail to match at all, mimicking the result of the group. Since there’s no ? making \1 optional, the overall match attempt fails.
The only exception is JavaScript. According to the official ECMA standard, a backreference to a non-participating capturing group must successfully match nothing just like a backreference to a participating group that captured nothing does. In other words, in JavaScript, (q?)b\1 and (q)?b\1 both match b.
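For comparison, here is how the two regexes behave in Python's re module, which is used here only as one example of a flavor that follows the majority behavior. Python is not one of the flavors named in the text above, so treat this as an illustration rather than a statement about every engine.

```python
import re

empty_capture = re.compile(r"(q?)b\1")      # group participates, captures ""
non_participating = re.compile(r"(q)?b\1")  # group may not participate at all

m = empty_capture.fullmatch("b")
print(repr(m.group(1)))                  # ''

# The backreference to the non-participating group fails, so no match.
print(non_participating.fullmatch("b"))  # None

# When the group does participate, the backreference works as usual.
print(non_participating.fullmatch("qbq").group(1))  # q
```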
Forward References and Invalid References
Modern flavors, notably JGsoft, .NET, Java, Perl, PCRE and Ruby allow forward references. That is: you can use a backreference to a group that appears later in the regex. Forward references are obviously only useful if they’re inside a repeated group. Then there can be situations in which the regex engine evaluates the backreference after the group has already matched. Before the group is attempted, the backreference will fail like a backreference to a failed group does.
If forward references are supported, the regex (\2two|(one))+ will match oneonetwo. At the start of the string, \2 fails. Trying the other alternative, one is matched by the second capturing group, and subsequently by the first group. The first group is then repeated. This time, \2 matches one as captured by the second group. two then matches two. With two repetitions of the first group, the regex has matched the whole subject string.
A nested reference is a backreference inside the capturing group that it references, e.g. (\1two|(one))+. This regex will give exactly the same behavior with flavors that support forward references. Some flavors that don’t support forward references do support nested references. This includes JavaScript.
With all other flavors, using a backreference before its group in the regular expression is the same as using a backreference to a group that doesn’t exist at all. All flavors discussed in this tutorial, except JavaScript and Ruby, treat backreferences to undefined groups as an error. In JavaScript and Ruby, they always result in a zero-width match. For Ruby this is a potential pitfall. In Ruby, (a)(b)?\2 will fail to match a, because \2 references a non-participating group. But (a)(b)?\7 will match a. For JavaScript this is logical, as backreferences to non-participating groups do the same. Both regexes will match a.
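Python's re module is a convenient way to see the "error" behavior. The sketch below is only an illustration under that assumption; compiles() is a hypothetical helper written for this example, and Python is not among the flavors listed above as supporting forward references.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(r"(\2two|(one))+"))  # False: forward reference
print(compiles(r"(\1two|(one))+"))  # False: nested reference
print(compiles(r"(a)(b)?\7"))       # False: group 7 does not exist
print(compiles(r"(a)(b)?\2"))       # True: ordinary backreference
```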
Repetition and Backreferences
As I mentioned in the above inside look, the regex engine does not permanently substitute backreferences in the regular expression. It will use the last match saved into the backreference each time it needs to be used. If a new match is found by capturing parentheses, the previously saved match is overwritten. There is a clear difference between ([abc]+) and ([abc])+. Though both successfully match cab, the first regex will put cab into the first backreference, while the second regex will only store b. That is because in the second regex, the plus caused the pair of parentheses to repeat three times. The first time, c was stored. The second time a and the third time b. Each time, the previous value was overwritten, so b remains.
This also means that ([abc]+)=\1 will match cab=cab, and that ([abc])+=\1 will not. The reason is that when the engine arrives at \1, it holds b which fails to match c. Obvious when you look at a simple example like this one, but a common cause of difficulty with regular expressions nonetheless. When using backreferences, always double check that you are really capturing what you want.
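A quick way to double check what a group really captures is to print it. Here is a small sketch, assuming Python's re module:

```python
import re

# Capturing a repeated token versus repeating a capturing group.
whole = re.fullmatch(r"([abc]+)", "cab").group(1)
last = re.fullmatch(r"([abc])+", "cab").group(1)
print(whole)  # cab
print(last)   # b  (only the last iteration is kept)

# The backreference sees whatever the group holds at that moment.
print(bool(re.fullmatch(r"([abc]+)=\1", "cab=cab")))  # True
print(bool(re.fullmatch(r"([abc])+=\1", "cab=cab")))  # False
print(bool(re.fullmatch(r"([abc])+=\1", "cab=b")))    # True
```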
Useful Example: Checking for Doubled Words
When editing text, doubled words such as “the the” easily creep in. Using the regex \b(\w+)\s+\1\b in your text editor, you can easily find them. To delete the second word, simply type in \1 as the replacement text and click the Replace button.
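Outside a text editor, the same search-and-replace can be scripted. The following is a minimal sketch in Python; remove_doubled_words is a name invented for this example. Since the regex is case sensitive as written, "The the" would not be treated as a doubled word.

```python
import re

doubled_word = re.compile(r"\b(\w+)\s+\1\b")

def remove_doubled_words(text: str) -> str:
    """Replace each doubled word with a single copy of it."""
    return doubled_word.sub(r"\1", text)

print(remove_doubled_words("Paris in the the spring"))  # Paris in the spring
print(remove_doubled_words("this is fine"))             # this is fine
```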
Parentheses and Backreferences Cannot Be Used Inside Character Classes
Round brackets cannot be used inside character classes, at least not as metacharacters. When you put a round bracket in a character class, it is treated as a literal character. So the regex [(a)b] matches a, b, ( and ).
Backreferences also cannot be used inside a character class. The \1 in a regex like (a)[\1b] will be interpreted as an octal escape in most regex flavors. So this regex will match an a followed by either \x01 or a b.
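Both points can be checked with a few lines of Python. The re module is assumed here as one example of a flavor that treats \1 inside a character class as an octal escape.

```python
import re

# Parentheses inside a character class are plain literal characters.
literals = re.findall(r"[(a)b]", "(a)b c")
print(literals)  # ['(', 'a', ')', 'b']

# Inside the class, \1 is the character with code 1, not a backreference.
class_with_escape = re.compile(r"(a)[\1b]")
print(bool(class_with_escape.fullmatch("ab")))     # True
print(bool(class_with_escape.fullmatch("a\x01")))  # True
print(bool(class_with_escape.fullmatch("aa")))     # False
```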
Named Capturing Groups
All modern regular expression engines support capturing groups, which are numbered from left to right, starting with one. The numbers can then be used in backreferences to match the same text again in the regular expression, or to use part of the regex match for further processing. In a complex regular expression with many capturing groups, the numbering can get a little confusing.
Named Capture with Python, PCRE and PHP
Python’s re module was the first to offer a solution: named capture. By assigning a name to a capturing group, you can easily reference it by name. (?P<name>group) captures the match of group into the backreference “name”. You can reference the contents of the group with the numbered backreference \1 or the named backreference (?P=name).
The open source PCRE library has followed Python’s example, and offers named capture using the same syntax. The PHP preg functions offer the same functionality, since they are based on PCRE.
Python’s sub() function allows you to reference a named group as \1 or \g<name>. This does not work in PHP. In PHP, you can use double-quoted string interpolation with the $regs parameter you passed to preg_match(): $regs['name'].
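Here is a short Python sketch of the syntax just described. The group name word and the sample sentence are made up for the example.

```python
import re

# (?P<word>...) names the group; (?P=word) refers back to it in the regex.
doubled = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

m = doubled.search("it was the the best")
print(m.group("word"))  # the
print(m.group(1))       # the  (the named group is also group number 1)

# In the replacement text, \g<word> and \1 refer to the same group.
by_name = doubled.sub(r"\g<word>", "it was the the best")
by_number = doubled.sub(r"\1", "it was the the best")
print(by_name)    # it was the best
print(by_number)  # it was the best
```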
Named Capture with .NET’s System.Text.RegularExpressions
The regular expression classes of the .NET framework also support named capture. Unfortunately, the Microsoft developers decided to invent their own syntax, rather than follow the one pioneered by Python. Currently, no other regex flavor supports Microsoft’s version of named capture.
Here is an example with two capturing groups in .NET style: (?<first>group)(?'second'group). As you can see, .NET offers two syntaxes to create a capturing group: one using sharp brackets, and the other using single quotes. The first syntax is preferable in strings, where single quotes may need to be escaped. The second syntax is preferable in ASP code, where the sharp brackets are used for HTML tags. You can use the pointy bracket flavor and the quoted flavors interchangeably.
To reference a capturing group inside the regex, use \k<name> or \k'name'. Again, you can use the two syntactic variations interchangeably.
When doing a search-and-replace, you can reference the named group with the familiar dollar sign syntax: ${name}. Simply use a name instead of a number between the curly braces.
Multiple Groups with The Same Name
The .NET framework allows multiple groups in the regular expression to have the same name. If you do so, both groups will store their matches in the same Group object. You won’t be able to distinguish which group captured the text. This can be useful in regular expressions with multiple alternatives to match the same thing. E.g. if you want to match “a” followed by a digit 0..5, or “b” followed by a digit 4..7, and you only care about the digit, you could use the regex a(?'digit'[0-5])|b(?'digit'[4-7]). The group named “digit” will then give you the digit 0..7 that was matched, regardless of the letter.
Python and PCRE do not allow multiple groups to use the same name. Doing so will give a regex compilation error.
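The compilation error is easy to reproduce in Python. The sketch below also shows one possible workaround, which is my own assumption rather than anything the tutorial recommends: give the groups different names (d1 and d2 here are hypothetical) and pick whichever one took part in the match.

```python
import re

# The same name used twice is rejected when the regex is compiled.
try:
    re.compile(r"a(?P<digit>[0-5])|b(?P<digit>[4-7])")
    duplicate_allowed = True
except re.error as exc:
    duplicate_allowed = False
    print("compile error:", exc)

# Possible workaround: distinct names, then take the group that participated.
workaround = re.compile(r"a(?P<d1>[0-5])|b(?P<d2>[4-7])")

def digit_of(text: str):
    """Return the matched digit, or None if the text does not match."""
    m = workaround.fullmatch(text)
    if m is None:
        return None
    return m.group("d1") or m.group("d2")

print(digit_of("a3"), digit_of("b7"), digit_of("b2"))  # 3 7 None
```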
Names and Numbers for Capturing Groups
Here is where things get a bit ugly. Python and PCRE treat named capturing groups just like unnamed capturing groups, and number both kinds from left to right, starting with one. The regex (a)(?P<x>b)(c)(?P<y>d) matches abcd as expected. If you do a search-and-replace with this regex and the replacement \1\2\3\4, you will get abcd. All four groups were numbered from left to right, from one till four. Easy and logical.
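The Python numbering can be verified with a short sketch. The reversed replacement is added here only to make the numbering visible; it is not part of the original example.

```python
import re

mixed = re.compile(r"(a)(?P<x>b)(c)(?P<y>d)")

print(mixed.sub(r"\1\2\3\4", "abcd"))  # abcd
print(mixed.sub(r"\4\3\2\1", "abcd"))  # dcba

# groupindex maps each group name to its number.
print(mixed.groupindex["x"], mixed.groupindex["y"])  # 2 4
```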
Things are quite a bit more complicated with the .NET framework. The regex (a)(?<x>b)(c)(?<y>d) again matches abcd. However, if you do a search-and-replace with $1$2$3$4 as the replacement, you will get acbd. Probably not what you expected.
The .NET framework does number named capturing groups from left to right, but numbers them after all the unnamed groups have been numbered. So the unnamed groups (a) and (c) get numbered first, from left to right, starting at one. Then the named groups (?<x>b) and (?<y>d) get their numbers, continuing from the unnamed groups, in this case: three.
To make things simple, when using .NET’s regex support, just assume that named groups do not get numbered at all, and reference them by name exclusively. To keep things compatible across regex flavors, I strongly recommend that you do not mix named and unnamed capturing groups at all. Either give a group a name, or make it non-capturing as in (?:nocapture). Non-capturing groups are more efficient, since the regex engine does not need to keep track of their matches.
Best of Both Worlds
The JGsoft regex engine supports both .NET-style and Python-style named capture. Python-style named groups are numbered along unnamed ones, like Python does. .NET-style named groups are numbered afterwards, like .NET does. You can mix both styles in the same regex. The JGsoft engine allows multiple groups to use the same name, regardless of the syntax used.
In PowerGREP, named capturing groups play a special role. Groups with the same name are shared between all regular expressions and replacement texts in the same PowerGREP action. This allows text captured by a named capturing group in one part of the action to be referenced in a later part of the action. Because of this, PowerGREP does not allow numbered references to named capturing groups at all. When mixing named and numbered groups in a regex, the numbered groups are still numbered following the Python and .NET rules, like the JGsoft flavor always does.
Regular Expression Advanced Syntax Reference
Grouping and Backreferences
Syntax | Description | Example |
(regex) | Round brackets group the regex between them. They capture the text matched by the regex inside them that can be reused in a backreference, and they allow you to apply regex operators to the entire grouped regex. | (abc){3} matches abcabcabc. First group matches abc. |
(?:regex) | Non-capturing parentheses group the regex so you can apply regex operators, but do not capture anything and do not create backreferences. | (?:abc){3} matches abcabcabc. No groups. |
\1 through \9 | Substituted with the text matched between the 1st through 9th pair of capturing parentheses. Some regex flavors allow more than 9 backreferences. | (abc|def)=\1 matches abc=abc or def=def, but not abc=def or def=abc. |
Modifiers
Syntax | Description | Example |
(?i) | Turn on case insensitivity for the remainder of the regular expression. (Older regex flavors may turn it on for the entire regex.) | te(?i)st matches teST but not TEST. |
(?-i) | Turn off case insensitivity for the remainder of the regular expression. | (?i)te(?-i)st matches TEst but not TEST. |
(?s) | Turn on “dot matches newline” for the remainder of the regular expression. (Older regex flavors may turn it on for the entire regex.) | |
(?-s) | Turn off “dot matches newline” for the remainder of the regular expression. | |
(?m) | Caret and dollar match after and before newlines for the remainder of the regular expression. (Older regex flavors may apply this to the entire regex.) | |
(?-m) | Caret and dollar only match at the start and end of the string for the remainder of the regular expression. | |
(?x) | Turn on free-spacing mode to ignore whitespace between regex tokens, and allow # comments. | |
(?-x) | Turn off free-spacing mode. | |
(?i-sm) | Turns on the option “i” and turns off “s” and “m” for the remainder of the regular expression. (Older regex flavors may apply this to the entire regex.) | |
(?i-sm:regex) | Matches the regex inside the span with the option “i” turned on and “m” and “s” turned off. | (?i:te)st matches TEst but not TEST. |
Atomic Grouping and Possessive Quantifiers
Syntax | Description | Example |
(?>regex) | Atomic groups prevent the regex engine from backtracking back into the group (forcing the group to discard part of its match) after a match has been found for the group. Backtracking can occur inside the group before it has matched completely, and the engine can backtrack past the entire group, discarding its match entirely. Eliminating needless backtracking provides a speed increase. Atomic grouping is often indispensable when nesting quantifiers to prevent a catastrophic amount of backtracking as the engine needlessly tries pointless permutations of the nested quantifiers. | x(?>\w+)x is more efficient than x\w+x if the second x cannot be matched. |
?+, *+, ++ and {m,n}+ | Possessive quantifiers are a limited yet syntactically cleaner alternative to atomic grouping. Only available in a few regex flavors. They behave as normal greedy quantifiers, except that they will not give up part of their match for backtracking. | x++ is identical to (?>x+) |
Lookaround
Syntax | Description | Example |
(?=regex) | Zero-width positive lookahead. Matches at a position where the pattern inside the lookahead can be matched. Matches only the position. It does not consume any characters or expand the match. In a pattern like one(?=two)three, both two and three have to match at the position where the match of one ends. | t(?=s) matches the second t in streets. |
(?!regex) | Zero-width negative lookahead. Identical to positive lookahead, except that the overall match will only succeed if the regex inside the lookahead fails to match. | t(?!s) matches the first t in streets. |
(?<=regex) | Zero-width positive lookbehind. Matches at a position if the pattern inside the lookbehind can be matched ending at that position (i.e. to the left of that position). Depending on the regex flavor you’re using, you may not be able to use quantifiers and/or alternation inside lookbehind. | (?<=s)t matches the first t in streets. |
(?<!regex) | Zero-width negative lookbehind. Matches at a position if the pattern inside the lookbehind cannot be matched ending at that position. | (?<!s)t matches the second t in streets. |
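The four lookaround examples in the table above can be checked by printing where each regex matches in streets. This is a minimal sketch assuming Python's re module; starts() is a helper invented for this example. Index 1 is the first t and index 5 is the second.

```python
import re

word = "streets"

def starts(pattern: str) -> list:
    """Return the start index of every match of pattern in word."""
    return [m.start() for m in re.finditer(pattern, word)]

print(starts(r"t(?=s)"))   # [5]  second t, followed by s
print(starts(r"t(?!s)"))   # [1]  first t, not followed by s
print(starts(r"(?<=s)t"))  # [1]  first t, preceded by s
print(starts(r"(?<!s)t"))  # [5]  second t, not preceded by s
```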
Continuing from The Previous Match
Syntax | Description | Example |
\G | Matches at the position where the previous match ended, or the position where the current match attempt started (depending on the tool or regex flavor). Matches at the start of the string during the first match attempt. | \G[a-z] first matches a, then matches b and then fails to match in ab_cd. |
Conditionals
Syntax | Description | Example |
(?(?=regex)then|else) | If the lookahead succeeds, the “then” part must match for the overall regex to match. If the lookahead fails, the “else” part must match for the overall regex to match. Not just positive lookahead, but all four lookarounds can be used. Note that the lookahead is zero-width, so the “then” and “else” parts need to match and consume the part of the text matched by the lookahead as well. | (?(?<=a)b|c) matches the second b and the first c in babxcac |
(?(1)then|else) | If the first capturing group took part in the match attempt thus far, the “then” part must match for the overall regex to match. If the first capturing group did not take part in the match, the “else” part must match for the overall regex to match. | (a)?(?(1)b|c) matches ab, the first c and the second c in babxcac |
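The group-based conditional in the last row of the table above can be tried in Python, whose re module supports the (?(1)then|else) form. As far as I know it does not support the lookaround form of the condition, so only the second table example is sketched here.

```python
import re

conditional = re.compile(r"(a)?(?(1)b|c)")

# If group 1 matched an a, a b must follow; otherwise a c must match.
matches = [m.group(0) for m in conditional.finditer("babxcac")]
print(matches)  # ['ab', 'c', 'c']
```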
Comments
Syntax | Description | Example |
(?#comment) | Everything between (?# and ) is ignored by the regex engine. | a(?#foobar)b matches ab |
Sample Regular Expressions
Below, you will find many example patterns that you can use for and adapt to your own purposes. Key techniques used in crafting each regex are explained, with links to the corresponding pages in the tutorial where these concepts and techniques are explained in great detail.
If you are new to regular expressions, you can take a look at these examples to see what is possible. Regular expressions are very powerful. They do take some time to learn. But you will earn back that time quickly when using regular expressions to automate searching or editing tasks in EditPad Pro or PowerGREP, or when writing scripts or applications in a variety of languages.
RegexBuddy offers the fastest way to get up to speed with regular expressions. RegexBuddy will analyze any regular expression and present it to you in an easy to understand, detailed outline. The outline links to RegexBuddy’s regex tutorial (the same one you find on this website), where you can always get in-depth information with a single click.
Oh, and you definitely do not need to be a programmer to take advantage of regular expressions!
Grabbing HTML Tags
<TAG\b[^>]*>(.*?)</TAG> matches the opening and closing pair of a specific HTML tag. Anything between the tags is captured into the first backreference. The question mark in the regex makes the star lazy, to make sure it stops before the first closing tag rather than before the last, like a greedy star would do. This regex will not properly match tags nested inside themselves, like in <TAG>one<TAG>two</TAG>one</TAG>.
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1> will match the opening and closing pair of any HTML tag. Be sure to turn off case sensitivity. The key in this solution is the use of the backreference \1 in the regex. Anything between the tags is captured into the second backreference. This solution will also not match tags nested in themselves.
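To see the difference the lazy star makes, here is a small Python sketch. The tag name P and the sample HTML are assumptions chosen for the example, and re.IGNORECASE stands in for "turn off case sensitivity".

```python
import re

html = "<p>one</p><p>two</p>"

lazy = re.compile(r"<P\b[^>]*>(.*?)</P>", re.IGNORECASE)
greedy = re.compile(r"<P\b[^>]*>(.*)</P>", re.IGNORECASE)

print(lazy.findall(html))    # ['one', 'two']
print(greedy.findall(html))  # ['one</p><p>two']

# Any tag: group 1 is the tag name, group 2 the text between the tags.
any_tag = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>", re.IGNORECASE)
name_and_text = any_tag.search('<a href="x">link</a>').group(1, 2)
print(name_and_text)  # ('a', 'link')
```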
Trimming Whitespace
You can easily trim unnecessary whitespace from the start and the end of a string or the lines in a text file by doing a regex search-and-replace. Search for ^[ \t]+ and replace with nothing to delete leading whitespace (spaces and tabs). Search for [ \t]+$ to trim trailing whitespace. Do both by combining the regular expressions into ^[ \t]+|[ \t]+$ . Instead of [ \t] which matches a space or a tab, you can expand the character class into [ \t\r\n] if you also want to strip line breaks. Or you can use the shorthand \s instead.
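In a script, the combined regex might be applied like this. The sketch assumes Python, where the re.MULTILINE flag makes the caret and dollar match at every line rather than only at the start and end of the whole string.

```python
import re

trim = re.compile(r"^[ \t]+|[ \t]+$", re.MULTILINE)

text = "  first line \t\n\tsecond line  "
trimmed = trim.sub("", text)
print(repr(trimmed))  # 'first line\nsecond line'
```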
IP Addresses
Matching an IP address is another good example of a trade-off between regex complexity and exactness. \b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b will match any IP address just fine, but will also match 999.999.999.999 as if it were a valid IP address. Whether this is a problem depends on the files or data you intend to apply the regex to. To restrict all 4 numbers in the IP address to 0..255, you can use this complex beast: \b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b (everything on a single line). The long regex stores each of the 4 numbers of the IP address into a capturing group. You can use these groups to further process the IP number.
If you don’t need access to the individual numbers, you can shorten the regex with a quantifier to: \b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b . Similarly, you can shorten the quick regex to \b(?:\d{1,3}\.){3}\d{1,3}\b
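The trade-off between the quick and the exact regex is easy to demonstrate. In this Python sketch the exact regex is assembled from a helper string, octet, which is just a convenience of the example; the resulting pattern is the same as the shortened one above.

```python
import re

# One number in the range 0..255, as used in the exact regex.
octet = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

quick = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
strict = re.compile(r"\b(?:" + octet + r"\.){3}" + octet + r"\b")

for candidate in ("192.168.0.1", "999.999.999.999"):
    print(candidate,
          bool(quick.fullmatch(candidate)),
          bool(strict.fullmatch(candidate)))
# 192.168.0.1 True True
# 999.999.999.999 True False
```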
More Detailed Examples
Numeric Ranges. Since regular expressions work with text rather than numbers, matching specific numeric ranges requires a bit of extra care.
Matching a Floating Point Number. Also illustrates the common mistake of making everything in a regular expression optional.
Matching an Email Address. There’s a lot of controversy about what is a proper regex to match email addresses. It’s a perfect example showing that you need to know exactly what you’re trying to match (and what not), and that there’s always a trade-off between regex complexity and accuracy.
Matching Valid Dates. A regular expression that matches 31-12-1999 but not 31-13-1999.
Finding or Verifying Credit Card Numbers. Validate credit card numbers entered on your order form. Find credit card numbers in documents for a security audit.
Matching Complete Lines. Shows how to match complete lines in a text file rather than just the part of the line that satisfies a certain requirement. Also shows how to match lines in which a particular regex does not match.
Removing Duplicate Lines or Items. Illustrates simple yet clever use of capturing parentheses or backreferences.
Regex Examples for Processing Source Code. How to match common programming language syntax such as comments, strings, numbers, etc.
Two Words Near Each Other. Shows how to use a regular expression to emulate the “near” operator that some tools have.
Common Pitfalls
Catastrophic Backtracking. If your regular expression seems to take forever, or simply crashes your application, it has likely contracted a case of catastrophic backtracking. The solution is usually to be more specific about what you want to match, so the number of matches the engine has to try doesn’t rise exponentially.
Making Everything Optional. If all the parts in your regex are optional, it will match a zero-width string anywhere. Your regex will need to express the facts that different parts are optional depending on which parts are present.
Repeating a Capturing Group vs. Capturing a Repeated Group. Repeating a capturing group will capture only the last iteration of the group. Capture a repeated group if you want to capture all iterations.
Mixing Unicode and 8-bit Character Codes. Using 8-bit character codes like \x80 with a Unicode engine and subject string may give unexpected results.
Matching Numeric Ranges with a Regular Expression
Since regular expressions deal with text rather than with numbers, matching a number in a given range takes a little extra care. You can’t just write [0-255] to match a number between 0 and 255. Though a valid regex, it matches something entirely different. [0-255] is a character class with three elements: the character range 0-2, the character 5 and the character 5 (again). This character class matches a single digit 0, 1, 2 or 5, just like [0125].
Since regular expressions work with text, a regular expression engine treats 0 as a single character, and 255 as three characters. To match all numbers from 0 to 255, we’ll need a regex that matches between one and three characters.
The regex [0-9] matches single-digit numbers 0 to 9. [1-9][0-9] matches double-digit numbers 10 to 99. That’s the easy part.
Matching the three-digit numbers is a little more complicated, since we need to exclude numbers 256 through 999. 1[0-9][0-9] takes care of 100 to 199. 2[0-4][0-9] matches 200 through 249. Finally, 25[0-5] adds 250 till 255.
As you can see, you need to split up the numeric range in ranges with the same number of digits, and each of those ranges that allow the same variation for each digit. In the 3-digit range in our example, numbers starting with 1 allow all 10 digits for the following two digits, while numbers starting with 2 restrict the digits that are allowed to follow.
Putting this all together using alternation we get: [0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]. This matches the numbers we want, with one caveat: regular expression searches usually allow partial matches, so our regex would match 123 in 12345. There are two solutions to this.
If you’re searching for these numbers in a larger document or input string, use word boundaries to require a non-word character (or no character at all) to precede and to follow any valid match. The regex then becomes \b([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\b. Since the alternation operator has the lowest precedence of all, the round brackets are required to group the alternatives together. This way the regex engine will try to match the first word boundary, then try all the alternatives, and then try to match the second word boundary after the numbers it matched. Regular expression engines consider all alphanumeric characters, as well as the underscore, as word characters.
If you’re using the regular expression to validate input, you’ll probably want to check that the entire input consists of a valid number. To do this, use anchors instead of word boundaries: ^([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$.
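The two variants can be tried side by side. The following is a minimal sketch in Python, assuming its built-in re module; the helper names is_octet and find_octets are made up for this illustration.

```python
import re

# The alternation from the text: 0..255 without leading zeros.
OCTET = r"[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]"

# Validation: anchors force the whole input to be one number.
VALIDATE = re.compile(r"^(?:" + OCTET + r")$")

# Searching: word boundaries keep 123 from matching inside 12345.
SEARCH = re.compile(r"\b(?:" + OCTET + r")\b")


def is_octet(text):
    """Return True if text is a number from 0 to 255 (hypothetical helper)."""
    return VALIDATE.match(text) is not None


def find_octets(text):
    """Return all stand-alone numbers from 0 to 255 found in text."""
    return SEARCH.findall(text)


print(is_octet("255"), is_octet("256"))       # True False
print(find_octets("12345 and 200, 256, 7"))   # ['200', '7']
```

The group is non-capturing here only so that findall returns the whole match; the grouping itself is still required for the reason given above.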
Here are a few more common ranges that you may want to match:
- 000..255: ^([01][0-9][0-9]|2[0-4][0-9]|25[0-5])$
- 0 or 000..255: ^([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])$
- 0 or 000..127: ^(0?[0-9]?[0-9]|1[0-1][0-9]|12[0-7])$
- 0..999: ^([0-9]|[1-9][0-9]|[1-9][0-9][0-9])$
- 000..999: ^[0-9]{3}$
- 0 or 000..999: ^[0-9]{1,3}$
- 1..999: ^([1-9]|[1-9][0-9]|[1-9][0-9][0-9])$
- 001..999: ^(00[1-9]|0[1-9][0-9]|[1-9][0-9][0-9])$
- 1 or 001..999: ^(0{0,2}[1-9]|0?[1-9][0-9]|[1-9][0-9][0-9])$
- 0 or 00..59: ^[0-5]?[0-9]$
- 0 or 000..366: ^(0?[0-9]?[0-9]|[1-2][0-9][0-9]|3[0-5][0-9]|36[0-6])$
Matching Floating Point Numbers with a Regular Expression
In this example, I will show you how you can avoid a common mistake often made by people inexperienced with regular expressions. As an example, we will try to build a regular expression that can match any floating point number. Our regex should also match integers, and floating point numbers where the integer part is not given (i.e. zero). We will not try to match numbers with an exponent, such as 1.5e8 (150 million in scientific notation).
At first thought, the following regex seems to do the trick: [-+]?[0-9]*\.?[0-9]*. This defines a floating point number as an optional sign, followed by an optional series of digits (integer part), followed by an optional dot, followed by another optional series of digits (fraction part).
Spelling out the regex in words makes it obvious: everything in this regular expression is optional. This regular expression will consider a sign by itself or a dot by itself as a valid floating point number. In fact, it will even consider an empty string as a valid floating point number. This regular expression can cause serious trouble if it is used in a scripting language like Perl or PHP to verify user input.
Not escaping the dot is also a common mistake. A dot that is not escaped will match any character, including a dot. If we had not escaped the dot, 4.4 would be considered a floating point number, and 4X4 too.
When creating a regular expression, it is more important to consider what it should not match, than what it should. The above regex will indeed match a proper floating point number, because the regex engine is greedy. But it will also match many things we do not want, which we have to exclude.
Here is a better attempt: [-+]?([0-9]*\.[0-9]+|[0-9]+). This regular expression will match an optional sign, that is either followed by zero or more digits followed by a dot and one or more digits (a floating point number with optional integer part), or followed by one or more digits (an integer).
This is a far better definition. Any match will include at least one digit, because there is no way around the [0-9]+ part. We have successfully excluded the matches we do not want: those without digits.
We can optimize this regular expression as: [-+]?[0-9]*\.?[0-9]+.
If you also want to match numbers with exponents, you can use: [-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?. Notice how I made the entire exponent part optional by grouping it together, rather than making each element in the exponent optional.
Finally, if you want to validate if a particular string holds a floating point number, rather than finding a floating point number within longer text, you’ll have to anchor your regex: ^[-+]?[0-9]*\.?[0-9]+$ or ^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$. You can find additional variations of these regexes in RegexBuddy’s library.
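To see the difference between the first attempt and the improved regex, it helps to run both against a few edge cases. This is a small sketch in Python using the re module; the constant names and the sample strings are chosen for illustration only.

```python
import re

# First attempt: every element is optional, so it accepts far too much.
NAIVE = re.compile(r"^[-+]?[0-9]*\.?[0-9]*$")
# Improved version: there is no way around the final [0-9]+.
BETTER = re.compile(r"^[-+]?[0-9]*\.?[0-9]+$")
# Same, with the whole exponent as one optional group.
WITH_EXP = re.compile(r"^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$")

for sample in ["", "+", ".", "3", "-.5", "3.14", "1.5e8", "4X4"]:
    print(repr(sample),
          bool(NAIVE.match(sample)),
          bool(BETTER.match(sample)),
          bool(WITH_EXP.match(sample)))
```

The empty string, a lone sign and a lone dot pass the first regex but are rejected by the other two. Only the last regex accepts 1.5e8.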
How to Find or Validate an Email Address
The regular expression I receive the most feedback, not to mention “bug” reports on, is the one you’ll find right on this site’s home page: \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b. This regular expression, I claim, matches any email address. Most of the feedback I get refutes that claim by showing one email address that this regex doesn’t match. Usually, the “bug” report also includes a suggestion to make the regex “perfect”.
As I explain below, my claim only holds true when one accepts my definition of what a valid email address really is, and what it’s not. If you want to use a different definition, you’ll have to adapt the regex. Matching a valid email address is a perfect example showing that (1) before writing a regex, you have to know exactly what you’re trying to match, and what not; and (2) there’s often a trade-off between what’s exact, and what’s practical.
The virtue of my regular expression above is that it matches 99% of the email addresses in use today. All the email addresses it matches can be handled by 99% of all email software out there. If you’re looking for a quick solution, you only need to read the next paragraph. If you want to know all the trade-offs and get plenty of alternatives to choose from, read on.
If you want to use the regular expression above, there are two things you need to understand. First, long regexes make it difficult to nicely format paragraphs. So I didn’t include a-z in any of the three character classes. This regex is intended to be used with your regex engine’s “case insensitive” option turned on. (You’d be surprised how many “bug” reports I get about that.) Second, the above regex is delimited with word boundaries, which makes it suitable for extracting email addresses from files or larger blocks of text. If you want to check whether the user typed in a valid email address, replace the word boundaries with start-of-string and end-of-string anchors, like this: ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$.
The previous paragraph also applies to all following examples. You may need to change word boundaries into start/end-of-string anchors, or vice versa. And you will need to turn on the case insensitive matching option.
Trade-Offs in Validating Email Addresses
Yes, there are a whole bunch of email addresses that my pet regex doesn’t match. The most frequently quoted examples are addresses on the .museum top level domain, which is longer than the 4 letters my regex allows for the top level domain. I accept this trade-off because the number of people using .museum email addresses is extremely low. I’ve never had a complaint that the order forms or newsletter subscription forms on the JGsoft websites refused a .museum address (which they would, since they use the above regex to validate the email address).
To include .museum, you could use ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$. However, then there’s another trade-off. This regex will match john@mail.office. It’s far more likely that John forgot to type in the .com top level domain rather than having just created a new .office top level domain without ICANN’s permission.
This shows another trade-off: do you want the regex to check if the top level domain exists? My regex doesn’t. Any combination of two to four letters will do, which covers all existing and planned top level domains except .museum. But it will match addresses with invalid top-level domains like asdf@asdf.asdf. By not being overly strict about the top-level domain, I don’t have to update the regex each time a new top-level domain is created, whether it’s a country code or generic domain.
^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)$ could be used to allow any two-letter country code top level domain, and only specific generic top level domains. By the time you read this, the list might already be out of date. If you use this regular expression, I recommend you store it in a global constant in your application, so you only have to update it in one place. You could list all country codes in the same manner, even though there are almost 200 of them.
Email addresses can be on servers on a subdomain, e.g. john@server.department.company.com. All of the above regexes will match this email address, because I included a dot in the character class after the @ symbol. However, the above regexes will also match john@aol...com which is not valid due to the consecutive dots. You can exclude such matches by replacing [A-Z0-9.-]+\. with (?:[A-Z0-9-]+\.)+ in any of the above regexes. I removed the dot from the character class and instead repeated the character class and the following literal dot. E.g. \b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}\b will match john@server.department.company.com but not john@aol...com.
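The effect of moving the dot out of the character class is easy to check. Below is a hedged sketch in Python with the re module, using the anchored validation variants and the case insensitive option; the constant names and the .museum sample address are invented for the example.

```python
import re

# Anchored variant of the basic regex; IGNORECASE stands in for a-z.
SIMPLE = re.compile(r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$", re.IGNORECASE)

# Variant that repeats "label plus dot", so consecutive dots cannot match.
NO_DOUBLE_DOTS = re.compile(
    r"^[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$", re.IGNORECASE
)

addresses = [
    "john@server.department.company.com",
    "john@aol...com",
    "curator@example.museum",  # made-up address with a 6-letter TLD
]
for address in addresses:
    print(address,
          bool(SIMPLE.match(address)),
          bool(NO_DOUBLE_DOTS.match(address)))
```

Both regexes accept the subdomain address and reject the .museum address. Only the second one rejects the address with consecutive dots.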
Another trade-off is that my regex only allows English letters, digits and a few special symbols. The main reason is that I don’t trust all my email software to be able to handle much else. Even though John.O'Hara@theoharas.com is a syntactically valid email address, there’s a risk that some software will misinterpret the apostrophe as a delimiting quote. E.g. blindly inserting this email address into a SQL statement will cause it to fail if strings are delimited with single quotes. And of course, it’s been many years already that domain names can include non-English characters. Most software and even domain name registrars, however, still stick to the 37 characters they’re used to.
The conclusion is that to decide which regular expression to use, whether you’re trying to match an email address or something else that’s vaguely defined, you need to start with considering all the trade-offs. How bad is it to match something that’s not valid? How bad is it not to match something that is valid? How complex can your regular expression be? How expensive would it be if you had to change the regular expression later? Different answers to these questions will require a different regular expression as the solution. My email regex does what I want, but it may not do what you want.
Regexes Don’t Send Email
Don’t go overboard in trying to eliminate invalid email addresses with your regular expression. If you have to accept .museum domains, allowing any 6-letter top level domain is often better than spelling out a list of all current domains. The reason is that you don’t really know whether an address is valid until you try to send an email to it. And even that might not be enough. Even if the email arrives in a mailbox, that doesn’t mean somebody still reads that mailbox.
The same principle applies in many situations. When trying to match a valid date, it’s often easier to use a bit of arithmetic to check for leap years, rather than trying to do it in a regex. Use a regular expression to find potential matches or check if the input uses the proper syntax, and do the actual validation on the potential matches returned by the regular expression. Regular expressions are a powerful tool, but they’re far from a panacea.
The Official Standard: RFC 2822
Maybe you’re wondering why there’s no “official” fool-proof regex to match email addresses. Well, there is an official definition, but it’s hardly fool-proof.
The official standard is known as RFC 2822. It describes the syntax that valid email addresses must adhere to. You can (but you shouldn’t–read on) implement it with this regular expression:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
This regex has two parts: the part before the @, and the part after the @. There are two alternatives for the part before the @: it can either consist of a series of letters, digits and certain symbols, including one or more dots. However, dots may not appear consecutively or at the start or end of the email address. The other alternative requires the part before the @ to be enclosed in double quotes, allowing any string of ASCII characters between the quotes. Whitespace characters, double quotes and backslashes must be escaped with backslashes.
The part after the @ also has two alternatives. It can either be a fully qualified domain name (e.g. regular-expressions.info), or it can be a literal Internet address between square brackets. The literal Internet address can either be an IP address, or a domain-specific routing address.
The reason you shouldn’t use this regex is that it only checks the basic syntax of email addresses. john@aol.com.nospam would be considered a valid email address according to RFC 2822. Obviously, this email address won’t work, since there’s no “nospam” top-level domain. It also doesn’t guarantee your email software will be able to handle it. Not all applications support the syntax using double quotes or square brackets. In fact, RFC 2822 itself marks the notation using square brackets as obsolete.
We get a more practical implementation of RFC 2822 if we omit the syntax using double quotes and square brackets. It will still match 99.99% of all email addresses in actual use today.
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
A further change you could make is to allow any two-letter country code top level domain, and only specific generic top level domains. This regex filters dummy email addresses like asdf@adsf.adsf. You will need to update it as new top-level domains are added.
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)\b
So even when following official standards, there are still trade-offs to be made. Don’t blindly copy regular expressions from online libraries or discussion forums. Always test them on your own data and with your own applications.
Regular Expression Matching a Valid Date
^(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$ matches a date in yyyy-mm-dd format from between 1900-01-01 and 2099-12-31, with a choice of four separators. The anchors make sure the entire variable is a date, and not a piece of text containing a date. The year is matched by (19|20)\d\d. I used alternation to allow the first two digits to be 19 or 20. The round brackets are mandatory. Had I omitted them, the regex engine would go looking for 19 or the remainder of the regular expression, which matches a date between 2000-01-01 and 2099-12-31. Round brackets are the only way to stop the vertical bar from splitting up the entire regular expression into two options.
The month is matched by 0[1-9]|1[012], again enclosed by round brackets to keep the two options together. By using character classes, the first option matches a number between 01 and 09, and the second matches 10, 11 or 12.
The last part of the regex consists of three options. The first matches the numbers 01 through 09, the second 10 through 29, and the third matches 30 or 31.
Smart use of alternation allows us to exclude invalid dates such as 2000-00-00 that could not have been excluded without using alternation. To be really perfectionist, you would have to split up the month into various options to take into account the length of the month. The above regex still matches 2003-02-31, which is not a valid date. Making leading zeros optional could be another enhancement.
If you want to require the delimiters to be consistent, you could use a backreference. ^(19|20)\d\d([- /.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])$ will match 1999-01-01 but not 1999/01-01.
Again, how complex you want to make your regular expression depends on the data you are using it on, and how big a problem it is if an unwanted match slips through. If you are validating the user’s input of a date in a script, it is probably easier to do certain checks outside of the regex. For example, excluding February 29th when the year is not a leap year is far easier to do in a scripting language. It is far easier to check if a year is divisible by 4 (and not divisible by 100 unless divisible by 400) using simple arithmetic than using regular expressions.
Here is how you could check a valid date in Perl. I also added round brackets to capture the year into a backreference.
sub isvaliddate {
  my $input = shift;
  if ($input =~ m!^((?:19|20)\d\d)[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$!) {
    # At this point, $1 holds the year, $2 the month and $3 the day of the date entered
    if ($3 == 31 and ($2 == 4 or $2 == 6 or $2 == 9 or $2 == 11)) {
      return 0; # 31st of a month with 30 days
    } elsif ($3 >= 30 and $2 == 2) {
      return 0; # February 30th or 31st
    } elsif ($2 == 2 and $3 == 29 and not ($1 % 4 == 0 and ($1 % 100 != 0 or $1 % 400 == 0))) {
      return 0; # February 29th outside a leap year
    } else {
      return 1; # Valid date
    }
  } else {
    return 0; # Not a date
  }
}
To match a date in mm/dd/yyyy format, rearrange the regular expression to ^(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d$. For dd-mm-yyyy format, use ^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.](19|20)\d\d$. You can find additional variations of these regexes in RegexBuddy’s library.
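The same split between syntax checking and real validation can be sketched in Python. This is only an illustration, not a port of the Perl routine: it assumes the yyyy-mm-dd layout with the consistent-separator backreference, and it hands the calendar arithmetic to the standard datetime module instead of spelling it out. The function name is_valid_date is hypothetical.

```python
import re
from datetime import date

# Syntax check only: year 1900..2099, month 01..12, day 01..31,
# and \2 forces the second separator to equal the first one.
DATE_RE = re.compile(
    r"^((?:19|20)\d\d)([- /.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])$",
    re.ASCII,
)


def is_valid_date(text):
    """Return True if text is a real calendar date in yyyy-mm-dd form."""
    match = DATE_RE.match(text)
    if match is None:
        return False  # Not a date at all
    year, month, day = (int(match.group(i)) for i in (1, 3, 4))
    try:
        date(year, month, day)  # Raises ValueError for e.g. February 31st
    except ValueError:
        return False
    return True


for sample in ["1999-01-01", "1999/01-01", "2003-02-31", "1900-02-29"]:
    print(sample, is_valid_date(sample))
```

Only the first sample is accepted. The second fails the regex because of the mixed separators. The last two pass the regex but are not real dates.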
Finding or Verifying Credit Card Numbers
With a few simple regular expressions, you can easily verify whether your customer entered a valid credit card number on your order form. You can even determine the type of credit card being used. Each card issuer has its own range of card numbers, identified by the first 4 digits.
You can use a slightly different regular expression to find credit card numbers, or number sequences that might be credit card numbers, within larger documents. This can be very useful to prove in a security audit that you’re not improperly exposing your clients’ financial details.
We’ll start with the order form.
Stripping Spaces and Dashes
The first step is to remove all non-digits from the card number entered by the customer. Physical credit cards have spaces within the card number to group the digits, making it easier for humans to read or type in. So your order form should accept card numbers with spaces or dashes in them.
To remove all non-digits from the card number, simply use the “replace all” function in your scripting language to search for the regex [^0-9]+ and replace it with nothing. If you only want to replace spaces and dashes, you could use [ -]+. If this regex looks odd, remember that in a character class, the hyphen is a literal when it occurs right before the closing bracket (or right after the opening bracket or negating caret).
If you’re wondering what the plus is for: that’s for performance. If the input has consecutive non-digits, e.g. 1===2, then the regex will match the three equals signs at once, and delete them in one replacement. Without the plus, three replacements would be required. In this case, the savings are only a few microseconds. But it’s a good habit to keep regex efficiency in the back of your mind. Though the savings are minimal here, so is the effort of typing the extra plus.
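In Python, the “replace all” step could look like the sketch below. It assumes the re module; the function names are made up for the example, and the card number is the widely published Visa test number, not a real account.

```python
import re


def strip_non_digits(card_input):
    """Remove every run of non-digits in one replacement per run."""
    return re.sub(r"[^0-9]+", "", card_input)


def strip_spaces_and_dashes(card_input):
    """Remove only spaces and dashes, leaving other characters in place."""
    return re.sub(r"[ -]+", "", card_input)


print(strip_non_digits("4111 1111-1111 1111"))  # 4111111111111111
print(strip_non_digits("1===2"))                # 12
print(strip_spaces_and_dashes("1===2"))         # 1===2
```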
Validating Credit Card Numbers on Your Order Form
Validating credit card numbers is the ideal job for regular expressions. They’re just a sequence of 13 to 16 digits, with a few specific digits at the start that identify the card issuer. You can use the specific regular expressions below to alert customers when they try to use a kind of card you don’t accept, or to route orders using different cards to different processors. All these regexes were taken from RegexBuddy’s library.
- Visa: ^4[0-9]{12}(?:[0-9]{3})?$ All Visa card numbers start with a 4. New cards have 16 digits. Old cards have 13.
- MasterCard: ^5[1-5][0-9]{14}$ All MasterCard numbers start with the numbers 51 through 55. All have 16 digits.
- American Express: ^3[47][0-9]{13}$ American Express card numbers start with 34 or 37 and have 15 digits.
- Diners Club: ^3(?:0[0-5]|[68][0-9])[0-9]{11}$ Diners Club card numbers begin with 300 through 305, 36 or 38. All have 14 digits. There are Diners Club cards that begin with 5 and have 16 digits. These are a joint venture between Diners Club and MasterCard, and should be processed like a MasterCard.
- Discover: ^6(?:011|5[0-9]{2})[0-9]{12}$ Discover card numbers begin with 6011 or 65. All have 16 digits.
- JCB: ^(?:2131|1800|35\d{3})\d{11}$ JCB cards beginning with 2131 or 1800 have 15 digits. JCB cards beginning with 35 have 16 digits.
If you just want to check whether the card number looks valid, without determining the brand, you can combine the above six regexes into ^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|6(?:011|5[0-9][0-9])[0-9]{12}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|(?:2131|1800|35\d{3})\d{11})$. You’ll see I’ve simply alternated all the regexes, and used a non-capturing group to put the anchors outside the alternation. You can easily delete the card types you don’t accept from the list.
These regular expressions will easily catch numbers that are invalid because the customer entered too many or too few digits. They won’t catch numbers with incorrect digits. For that, you need to follow the Luhn algorithm, which cannot be done with a regex. And of course, even if the number is mathematically valid, that doesn’t mean a card with this number was issued or that there’s money in the account. The benefit of the regular expression is that you can put it in a bit of JavaScript to instantly check for obvious errors, instead of making the customer wait 30 seconds for your credit card processor to fail the order. And if your card processor charges for failed transactions, you’ll really want to implement both the regex and the Luhn validation.
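Combining the regex with the Luhn check takes only a few lines. The sketch below is written in Python rather than JavaScript and is an illustration under assumptions: the function names are invented, and the sample numbers are publicly documented test numbers, not real cards.

```python
import re

# The combined regex from the text, split over several lines for readability.
CARD_RE = re.compile(
    r"^(?:4[0-9]{12}(?:[0-9]{3})?"           # Visa
    r"|5[1-5][0-9]{14}"                      # MasterCard
    r"|6(?:011|5[0-9][0-9])[0-9]{12}"        # Discover
    r"|3[47][0-9]{13}"                       # American Express
    r"|3(?:0[0-5]|[68][0-9])[0-9]{11}"       # Diners Club
    r"|(?:2131|1800|35\d{3})\d{11})$"        # JCB
)


def luhn_ok(digits):
    """Luhn checksum: double every second digit, counting from the right."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def looks_valid(card_input):
    """Strip separators, then require both the regex and the checksum."""
    digits = re.sub(r"[^0-9]+", "", card_input)
    return CARD_RE.match(digits) is not None and luhn_ok(digits)


print(looks_valid("4111 1111 1111 1111"))  # True
print(looks_valid("4111 1111 1111 1112"))  # False: fails the checksum
print(looks_valid("4111 1111 1111 111"))   # False: wrong number of digits
```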
Finding Credit Card Numbers in Documents
With two simple modifications, you could use any of the above regexes to find card numbers in larger documents. Simply replace the caret and dollar with a word boundary, e.g.: \b4[0-9]{12}(?:[0-9]{3})?\b.
If you’re planning to search a large document server, a simpler regular expression will speed up the search. Unless your company uses 16-digit numbers for other purposes, you’ll have few false positives. The regex \b\d{13,16}\b will find any sequence of 13 to 16 digits.
When searching a hard disk full of files, you can’t strip out spaces and dashes first like you can when validating a single card number. To find card numbers with spaces or dashes in them, use \b(?:\d[ -]*?){13,16}\b. This regex allows any amount of spaces and dashes anywhere in the number. This is really the only way. Visa and MasterCard put digits in sets of 4, while Amex and Discover use groups of 4, 5 and 6 digits. People typing in the numbers may have different ideas yet.
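A search of this kind could be sketched as follows in Python, assuming the re module. The function name and the sample sentence are invented. Each hit is only a candidate, which you would still normalize and check further.

```python
import re

# 13 to 16 digits, each optionally followed by spaces or dashes.
CANDIDATE_RE = re.compile(r"\b(?:\d[ -]*?){13,16}\b")


def find_card_candidates(text):
    """Return candidate card numbers with spaces and dashes stripped."""
    return [
        re.sub(r"[ -]+", "", match.group())
        for match in CANDIDATE_RE.finditer(text)
    ]


sample = "Order 12345, card 4111 1111-1111 1111, ref 987."
print(find_card_candidates(sample))  # ['4111111111111111']
```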
Deleting Duplicate Lines From a File
If you have a file in which all lines are sorted (alphabetically or otherwise), you can easily delete (consecutive) duplicate lines. Simply open the file in your favorite text editor, and do a search-and-replace searching for ^(.*)(\r?\n\1)+$ and replacing with \1. For this to work, the anchors need to match before and after line breaks (and not just at the start and the end of the file or string), and the dot must not match newlines.
Here is how this works. The caret will match only at the start of a line. So the regex engine will only attempt to match the remainder of the regex there. The dot and star combination simply matches an entire line, whatever its contents, if any. The round brackets store the matched line into the first backreference.
Next we will match the line separator. I put the question mark into \r?\n to make this regex work with both Windows (\r\n) and UNIX (\n) text files. So up to this point we matched a line and the following line break.
Now we need to check if this combination is followed by a duplicate of that same line. We do this simply with \1. This is the first backreference which holds the line we matched. The backreference will match that very same text.
If the backreference fails to match, the regex match and the backreference are discarded, and the regex engine tries again at the start of the next line. If the backreference succeeds, the plus symbol in the regular expression will try to match additional copies of the line. Finally, the dollar symbol forces the regex engine to check if the text matched by the backreference is a complete line. We already know the text matched by the backreference is preceded by a line break (matched by \r?\n). Therefore, we now check if it is also followed by a line break or if it is at the end of the file using the dollar sign.
The entire match becomes line\nline (or line\nline\nline etc.). Because we are doing a search and replace, the line, its duplicates, and the line breaks in between them, are all deleted from the file. Since we want to keep the original line, but not the duplicates, we use \1 as the replacement text to put the original line back in.
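Outside a text editor, the same search-and-replace can be scripted. Here is a minimal sketch in Python, assuming the re module. The MULTILINE flag makes the anchors match at line breaks, and the dot does not match newlines by default. The function name is made up.

```python
import re

# ^ and $ match at line breaks thanks to MULTILINE.
DUP_LINES = re.compile(r"^(.*)(\r?\n\1)+$", re.MULTILINE)


def remove_duplicate_lines(text):
    """Collapse each run of identical consecutive lines into one line."""
    return DUP_LINES.sub(r"\1", text)


print(repr(remove_duplicate_lines("a\na\nb\nb\nb\nc\n")))  # 'a\nb\nc\n'
```

Lines that merely start with the same text, such as ab followed by abc, are left alone because the dollar requires the duplicate to be a complete line.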
Removing Duplicate Items From a String
We can generalize the above example to afterseparator(item)(separator\1)+beforeseparator, where afterseparator and beforeseparator are zero-width. So if you want to remove consecutive duplicates from a comma-delimited list, you could use (?<=,|^)([^,]*)(,\1)+(?=,|$).
The positive lookbehind (?<=,|^) forces the regex engine to start matching at the start of the string or after a comma. ([^,]*) captures the item. (,\1)+ matches consecutive duplicate items. Finally, the positive lookahead (?=,|$) checks if the duplicate items are complete items by checking for a comma or the end of the string.
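Not every flavor accepts alternatives of different length inside lookbehind. Python’s re module, for example, requires a fixed-width lookbehind, so this sketch moves the caret out of the lookbehind and into a non-capturing group. That rewrite is an adaptation for this example, not part of the regex above, and the function name is invented.

```python
import re

# (?:(?<=,)|^) stands in for (?<=,|^), which Python's re rejects
# because the two alternatives have different widths.
DUP_ITEMS = re.compile(r"(?:(?<=,)|^)([^,]*)(,\1)+(?=,|$)")


def remove_duplicate_items(csv_line):
    """Collapse consecutive duplicate items in a comma-delimited list."""
    return DUP_ITEMS.sub(r"\1", csv_line)


print(remove_duplicate_items("a,a,b,b,b,c"))  # a,b,c
print(remove_duplicate_items("a,ab"))         # a,ab
```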
Example Regexes to Match Common Programming Language Constructs
Regular expressions are very useful to manipulate source code in a text editor or in a regex-based text processing tool. Most programming languages use similar constructs like keywords, comments and strings. But often there are subtle differences that make it tricky to use the correct regex. When picking a regex from the list of examples below, be sure to read the description with each regex to make sure you are picking the correct one.
Unless otherwise indicated, all examples below assume that the dot does not match newlines and that the caret and dollar do match at embedded line breaks. In many programming languages, this means that single line mode must be off, and multi line mode must be on.
When used by themselves, these regular expressions may not have the intended result. If a comment appears inside a string, the comment regex will consider the text inside the string as a comment. The string regex will also match strings inside comments. The solution is to use more than one regular expression, like in this pseudo-code:
GlobalStartPosition := 0;
while GlobalStartPosition < LengthOfText do
  GlobalMatchPosition := LengthOfText;
  MatchedRegEx := NULL;
  foreach RegEx in RegExList do
    RegEx.StartPosition := GlobalStartPosition;
    if RegEx.Match and RegEx.MatchPosition < GlobalMatchPosition then
      MatchedRegEx := RegEx;
      GlobalMatchPosition := RegEx.MatchPosition;
    endif
  endforeach
  if MatchedRegEx <> NULL then
    // At this point, MatchedRegEx indicates which regex matched
    // and you can do whatever processing you want depending on
    // which regex actually matched.
    // Resume the search after the end of this match.
    GlobalStartPosition := GlobalMatchPosition + MatchedRegEx.MatchLength;
  else
    // No regex matched in the remainder of the text.
    GlobalStartPosition := LengthOfText;
  endif
endwhile
If you put a regex matching a comment and a regex matching a string in RegExList, then you can be sure that the comment regex will not match comments inside strings, and vice versa.
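A runnable version of this loop might look like the following Python sketch. It assumes the re module, a #-style comment regex and a simple double-quoted string regex. The names TOKEN_REGEXES and scan are invented, and the search resumes after the end of each match.

```python
import re

# Hypothetical regex list: one single-line comment and one simple string.
TOKEN_REGEXES = {
    "comment": re.compile(r"#.*$", re.MULTILINE),
    "string": re.compile(r'"[^"\r\n]*"'),
}


def scan(text):
    """Return (kind, matched text) pairs, leftmost match first."""
    results = []
    position = 0
    while position < len(text):
        best_kind, best_match = None, None
        for kind, regex in TOKEN_REGEXES.items():
            match = regex.search(text, position)
            if match and (best_match is None
                          or match.start() < best_match.start()):
                best_kind, best_match = kind, match
        if best_match is None:
            break  # Nothing left to find
        results.append((best_kind, best_match.group()))
        # Continue after this match so its contents are never rescanned.
        position = max(best_match.end(), position + 1)
    return results


print(scan('x = "not # a comment"  # real comment\ny = "s"'))
```

Because the string starts before the # it contains, the string regex wins at that position and the # inside it is never seen as a comment.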
An alternative solution is to combine regexes: (comment)|(string). The alternation has the same effect as the code snippet above. Using backreferences, you can figure out which part of the regex actually matched. The drawback of this solution is that the combined regular expression quickly becomes difficult to read or maintain.
Comments
#.*$ matches a single-line comment starting with a # and continuing until the end of the line. Similarly, //.*$ matches a single-line comment starting with //.
If the comment must appear at the start of the line, use ^#.*$. If only whitespace is allowed between the start of the line and the comment, use ^\s*#.*$. Compiler directives or pragmas in C can be matched this way. Note that in this last example, any leading whitespace will be part of the regex match. Use capturing parentheses to separate the whitespace and the comment.
/\*.*?\*/ matches a C-style multi-line comment if you turn on the option for the dot to match newlines. The general syntax is begin.*?end. C-style comments do not allow nesting. If the “begin” part appears inside the comment, it is ignored. As soon as the “end” part is found, the comment is closed.
If your programming language allows nested comments, there is no straightforward way to match them using a regular expression, since regular expressions cannot count. Additional logic is required.
Strings
"[^"\r\n]*" matches a single-line string that does not allow the quote character to appear inside the string. Using the negated character class is more efficient than using a lazy dot. "[^"]*" allows the string to span across multiple lines.
"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*" matches a single-line string in which the quote character can appear if it is escaped by a backslash. Though this regular expression may seem more complicated than it needs to be, it is much faster than simpler solutions which can cause a whole lot of backtracking in case a double quote appears somewhere all by itself rather than part of a string. "[^"\\]*(?:\\.[^"\\]*)*" allows the string to span multiple lines.
You can adapt the above regexes to match any sequence delimited by two (possibly different) characters. If we use b for the starting character, e as the end character, and x as the escape character, the version without escape becomes b[^e\r\n]*e, and the version with escape becomes b[^ex\r\n]*(?:x.[^ex\r\n]*)*e.
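The escape-aware string regex is easiest to appreciate on an input with escaped quotes. The following is a small sketch in Python using the re module; the constant name and the sample text are invented for the illustration.

```python
import re

# Single-line string in which \" (or any escaped character) may appear.
STRING_WITH_ESCAPES = re.compile(r'"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"')

source = r'say("He said \"hi\" to me", "plain")'
for literal in STRING_WITH_ESCAPES.findall(source):
    print(literal)
```

This prints the two complete string literals, including the escaped quotes inside the first one.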
Numbers
\b\d+\b matches a positive integer number. Do not forget the word boundaries! [-+]?\b\d+\b allows for a sign.
\b0[xX][0-9a-fA-F]+\b matches a C-style hexadecimal number.
((\b[0-9]+)?\.)?[0-9]+\b matches an integer number as well as a floating point number with optional integer part. (\b[0-9]+\.([0-9]+\b)?|\.[0-9]+\b) matches a floating point number with optional integer as well as optional fractional part, but does not match an integer number.
((\b[0-9]+)?\.)?\b[0-9]+([eE][-+]?[0-9]+)?\b matches a number in scientific notation. The mantissa can be an integer or floating point number with optional integer part. The exponent is optional.
\b[0-9]+(\.[0-9]+)?(e[+-]?[0-9]+)?\b also matches a number in scientific notation. The difference with the previous example is that if the mantissa is a floating point number, the integer part is mandatory.
If you read through the floating point number example, you will notice that the above regexes are different from what is used there. The above regexes are more stringent. They use word boundaries to exclude numbers that are part of other things like identifiers. You can prepend [-+]? to all of the above regexes to include an optional sign in the regex. I did not do so above because in programming languages, the + and - are usually considered operators rather than signs.
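The role of the word boundaries shows up as soon as digits appear inside identifiers. Below is a brief sketch in Python with the re module; the constant names and the sample source lines are assumptions made for the example.

```python
import re

INTEGER = re.compile(r"\b\d+\b")
HEX = re.compile(r"\b0[xX][0-9a-fA-F]+\b")
SCIENTIFIC = re.compile(r"\b[0-9]+(\.[0-9]+)?(e[+-]?[0-9]+)?\b")

line = "x1 = 42 + 0x1F"
print(INTEGER.findall(line))  # ['42']: the 1 in x1 is skipped
print(HEX.findall(line))      # ['0x1F']

# finditer is used because findall would return the capturing groups.
print([m.group() for m in SCIENTIFIC.finditer("m = 6.02e23; n = 7")])
```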
Reserved Words or Keywords
Matching reserved words is easy. Simply use alternation to string them together: \b(first|second|third|etc)\b. Again, do not forget the word boundaries.
Find Two Words Near Each Other
Some search tools that use boolean operators also have a special operator called “near”. Searching for “term1 near term2” finds all occurrences of term1 and term2 that occur within a certain “distance” from each other. The distance is a number of words. The actual number depends on the search tool, and is often configurable.
You can easily perform the same task with the proper regular expression.
Emulating “near” with a Regular Expression
With regular expressions you can describe almost any text pattern, including a pattern that matches two words near each other. This pattern is relatively simple, consisting of three parts: the first word, a certain number of unspecified words, and the second word. An unspecified word can be matched with the shorthand character class \w+. The spaces and other characters between the words can be matched with \W+ (uppercase W this time).
The complete regular expression becomes \bword1\W+(?:\w+\W+){1,6}?word2\b. The quantifier {1,6}? makes the regex require at least one word between “word1” and “word2”, and allow at most six words.
If the words may also occur in reverse order, we need to specify the opposite pattern as well: \b(?:word1\W+(?:\w+\W+){1,6}?word2|word2\W+(?:\w+\W+){1,6}?word1)\b
If you want to find any pair of two words out of a list of words, you can use: \b(word1|word2|word3)(?:\W+\w+){1,6}?\W+(word1|word2|word3)\b. This regex will also find a word near itself, e.g. it will match word2 near word2.
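The either-order pattern can be wrapped in a small Python helper. This is a sketch under assumptions: the function name near and its max_between parameter are hypothetical, and the sample sentence is made up.

```python
import re


def near(word1, word2, max_between=6):
    """Compile a regex that finds word1 and word2, in either order,
    separated by 1 to max_between other words (hypothetical helper)."""
    w1, w2 = re.escape(word1), re.escape(word2)
    # \W+(?:\w+\W+){1,N}? : separator, then 1..N intervening words, lazily.
    gap = rf"\W+(?:\w+\W+){{1,{max_between}}}?"
    return re.compile(rf"\b(?:{w1}{gap}{w2}|{w2}{gap}{w1})\b")


text = "the quick brown fox jumps over the lazy dog"

forward = near("quick", "jumps").search(text)
reverse = near("jumps", "quick").search(text)
adjacent = near("quick", "brown").search(text)  # no word in between
print(forward.group(0), reverse.group(0), adjacent)
```

Because the quantifier starts at 1, two directly adjacent words are not reported; changing {1,6} to {0,6} would allow them.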
Matching Whole Lines of Text
Often, you want to match complete lines in a text file rather than just the part of the line that satisfies a certain requirement. This is useful if you want to delete entire lines in a search-and-replace in a text editor, or collect entire lines in an information retrieval tool.
To keep this example simple, let’s say we want to match lines containing the word “John”. The regex John makes it easy enough to locate those lines. But the software will only indicate John as the match, not the entire line containing the word.
The solution is fairly simple. To specify that we need an entire line, we will use the caret and dollar sign and turn on the option to make them match at embedded newlines. In software aimed at working with text files like EditPad Pro and PowerGREP, the anchors always match at embedded newlines. To match the parts of the line before and after the match of our original regular expression John, we simply use the dot and the star. Be sure to turn off the option for the dot to match newlines.
The resulting regex is: ^.*John.*$. You can use the same method to expand the match of any regular expression to an entire line, or a block of complete lines. In some cases, such as when using alternation, you will need to group the original regex together using round brackets.
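In Python, for instance, the option that makes the caret and dollar match at embedded newlines is re.MULTILINE, and the dot does not match newlines unless re.DOTALL is set. The sample text below is invented for the sketch.

```python
import re

text = "Mary called.\nI saw John today.\nNo one else.\nJohn left."

# re.MULTILINE lets ^ and $ match at every line break; the dot stays
# confined to a single line because re.DOTALL is not set.
line_re = re.compile(r"^.*John.*$", re.MULTILINE)
john_lines = line_re.findall(text)
print(john_lines)
```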
Finding Lines Containing or Not Containing Certain Words
If a line can meet any one of a series of requirements, simply use alternation in the regular expression. ^.*\b(one|two|three)\b.*$ matches a complete line of text that contains any of the words “one”, “two” or “three”. The first backreference will contain the word the line actually contains. If it contains more than one of the words, then the last (rightmost) word will be captured into the first backreference. This is because the star is greedy. If we make the first star lazy, like in ^.*?\b(one|two|three)\b.*$, then the backreference will contain the first (leftmost) word.
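The effect of the greedy versus lazy star on the captured word is easy to check. A small Python sketch, with a made-up sample line:

```python
import re

line = "one two three"

# Greedy .* runs to the end of the line and backtracks, so the group
# ends up holding the rightmost word.
greedy_word = re.search(r"^.*\b(one|two|three)\b.*$", line).group(1)

# Lazy .*? expands only as far as needed, so the group holds the
# leftmost word.
lazy_word = re.search(r"^.*?\b(one|two|three)\b.*$", line).group(1)

print(greedy_word, lazy_word)
```

Both patterns match the same complete line; only the content of the first group differs.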
If a line must satisfy all of multiple requirements, we need to use lookahead. ^(?=.*?\bone\b)(?=.*?\btwo\b)(?=.*?\bthree\b).*$ matches a complete line of text that contains all of the words “one”, “two” and “three”. Again, the anchors must match at the start and end of a line and the dot must not match line breaks. Because of the caret, and the fact that lookahead is zero-width, all three lookaheads are attempted at the start of each line. Each lookahead will match any piece of text on a single line (.*?) followed by one of the words. All three must match successfully for the entire regex to match. Note that instead of words like \bword\b, you can put any regular expression, no matter how complex, inside the lookahead. Finally, .*$ causes the regex to actually match the line, after the lookaheads have determined it meets the requirements.
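A short Python sketch of the three-lookahead pattern, run over an invented four-line sample. Note that "stone" does not satisfy the requirement for "one", because of the word boundaries.

```python
import re

all_three = re.compile(
    r"^(?=.*?\bone\b)(?=.*?\btwo\b)(?=.*?\bthree\b).*$",
    re.MULTILINE,  # ^ and $ match at each line; the dot stays on one line
)

text = (
    "three two one\n"
    "one and two only\n"
    "stone two three\n"
    "one, three; two!"
)

complete_lines = all_three.findall(text)
print(complete_lines)
```

The order of the words on the line does not matter, since every lookahead starts scanning from the beginning of the line.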
If your condition is that a line should not contain something, use negative lookahead. ^((?!regexp).)*$ matches a complete line that does not match regexp. Notice that unlike before, when using positive lookahead, I repeated both the negative lookahead and the dot together. For the positive lookahead, we only need to find one location where it can match. But the negative lookahead must be tested at each and every character position in the line. We must test that regexp fails everywhere, not just somewhere.
Finally, you can combine multiple positive and negative requirements as follows: ^(?=.*?\bmust-have\b)(?=.*?\bmandatory\b)((?!avoid|illegal).)*$. When checking multiple positive requirements, the .* at the end of the regular expression full of zero-width assertions made sure that we actually matched something. Since the negative requirement must match the entire line, it is easy to replace the .* with the negative test.
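Both the purely negative pattern and the combined pattern can be sketched in Python. The word "error" stands in for regexp here, and the sample lines are invented; finditer with group(0) is used because findall would return only the contents of the capturing group.

```python
import re

# Lines that do NOT contain "error": the lookahead is retested before
# every single character the dot consumes.
no_error = re.compile(r"^((?!error).)*$", re.MULTILINE)
log = "all good\nfatal error here\nstill fine"
clean_lines = [m.group(0) for m in no_error.finditer(log)]

# Two positive requirements plus one negative requirement.
combined = re.compile(
    r"^(?=.*?\bmust-have\b)(?=.*?\bmandatory\b)((?!avoid|illegal).)*$",
    re.MULTILINE,
)
text = "must-have and mandatory\nmust-have, mandatory, illegal\nmandatory only"
accepted = [m.group(0) for m in combined.finditer(text)]

print(clean_lines, accepted)
```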
1
Introduction to Regular Expressions
Here’s the scenario: you’re given the job of checking the pages on a web server for doubled words (such as “this this”), a common problem with documents subject to heavy editing. Your job is to create a solution that will:
- Accept any number of files to check, report each line of each file that has doubled words, highlight (using standard ANSI escape sequences) each doubled word, and ensure that the source filename appears with each line in the report.
- Work across lines, even finding situations where a word at the end of one line is repeated at the beginning of the next.
- Find doubled words despite capitalization differences, such as with ‘The the...’, as well as allow differing amounts of whitespace (spaces, tabs, new-lines, and the like) to lie between the words.
- Find doubled words even when separated by HTML tags. HTML tags are for marking up text on World Wide Web pages, for example, to make a word bold: ‘...it is <B>very</B> very important...’.
That’s certainly a tall order! But, it’s a real problem that needs to be solved. At one point while working on the manuscript for this book, I ran such a tool on what I’d written so far and was surprised at the way numerous doubled words had crept in. There are many programming languages one could use to solve the problem, but one with regular expression support can make the job substantially easier.
Regular expressions are the key to powerful, flexible, and efficient text processing. Regular expressions themselves, with a general pattern notation almost like a mini programming language, allow you to describe and parse text. With additional support provided by the particular tool being used, regular expressions can add, remove, isolate, and generally fold, spindle, and mutilate all kinds of text and data.
It might be as simple as a text editor’s search command or as powerful as a full text processing language. This book shows you the many ways regular expressions can increase your productivity. It teaches you how to think regular expressions so that you can master them, taking advantage of the full magnitude of their power.
A full program that solves the doubled-word problem can be implemented in just a few lines of many of today’s popular languages. With a single regular-expression search-and-replace command, you can find and highlight doubled words in the document. With another, you can remove all lines without doubled words (leaving only the lines of interest left to report). Finally, with a third, you can ensure that each line to be displayed begins with the name of the file the line came from. We’ll see examples in Perl and Java in the next chapter.
The host language (Perl, Java, VB.NET, or whatever) provides the peripheral processing support, but the real power comes from regular expressions. In harnessing this power for your own needs, you learn how to write regular expressions to identify text you want, while bypassing text you don’t. You can then combine your expressions with the language’s support constructs to actually do something with the text (add appropriate highlighting codes, remove the text, change the text, and so on).
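To give a feel for how little code the core of the problem needs, here is a deliberately reduced Python sketch. It is not the Perl or Java program the book develops: it handles only case differences and whitespace (including line breaks) between the two words, and ignores HTML tags, highlighting, and filenames. The sample text is invented.

```python
import re

# (\w+) captures a word, \s+ allows any whitespace including newlines,
# and \1 requires the same word again; IGNORECASE makes "This this" count.
doubled = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)

text = "This this is\nis a test test."
repeats = [m.group(1) for m in doubled.finditer(text)]
print(repeats)
```

Here the host language supplies the looping and list building, while the pattern itself carries the definition of what a doubled word is.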
Solving Real Problems
Knowing how to wield regular expressions unleashes processing powers you might not even know were available. Numerous times in any given day, regular expressions help me solve problems both large and small (and quite often, ones that are small but would be large if not for regular expressions).
Showing an example that provides the key to solving a large and important problem illustrates the benefit of regular expressions clearly, but perhaps not so obvious is the way regular expressions can be used throughout the day to solve rather “uninteresting” problems. I use “uninteresting” in the sense that such problems are not often the subject of bar-room war stories, but quite interesting in that until they’re solved, you can’t get on with your real work.
As a simple example, I needed to check a lot of files (the 70 or so files comprising the source for this book, actually) to confirm that each file contained ‘SetSize’ exactly as often (or as rarely) as it contained ‘ResetSize’. To complicate matters, I needed to disregard capitalization (such that, for example, ‘setSIZE’ would be counted just the same as ‘SetSize’). Inspecting the 32,000 lines of text by hand certainly wasn’t practical.
Even using the normal “find this word” search in an editor would have been arduous, especially with all the files and all the possible capitalization differences.
Regular expressions to the rescue! Typing just a single, short command, I was able to check all files and confirm what I needed to know. Total elapsed time: perhaps 15 seconds to type the command, and another 2 seconds for the actual check of all the data. Wow! (If you’re interested to see what I actually used, peek ahead to page 36.)
As another example, I was once helping a friend with some email problems on a remote machine, and he wanted me to send a listing of messages in his mailbox file. I could have loaded a copy of the whole file into a text editor and manually removed all but the few header lines from each message, leaving a sort of table of contents. Even if the file wasn’t as huge as it was, and even if I wasn’t connected via a slow dial-up line, the task would have been slow and monotonous. Also, I would have been placed in the uncomfortable position of actually seeing the text of his personal mail.
Regular expressions to the rescue again! I gave a simple command (using the common search tool egrep described later in this chapter) to display the From: and Subject: line from each message. To tell egrep exactly which kinds of lines I wanted to see, I used the regular expression ⌈^(From|Subject):⌋.
Once he got his list, he asked me to send a particular (5,000-line!) message. Again, using a text editor or the mail system itself to extract just the one message would have taken a long time. Rather, I used another tool (one called sed) and again used regular expressions to describe exactly the text in the file I wanted. This way, I could extract and send the desired message quickly and easily.
Saving both of us a lot of time and aggravation by using the regular expression was not “exciting,” but surely much more exciting than wasting an hour in the text editor. Had I not known regular expressions, I would have never considered that there was an alternative. So, to a fair extent, this story is representative of how regular expressions and associated tools can empower you to do things you might have never thought you wanted to do.
Once you learn regular expressions, you’ll realize that they’re an invaluable part of your toolkit, and you’ll wonder how you could ever have gotten by without them.
A full command of regular expressions is an invaluable skill. This book provides the information needed to acquire that skill, and it is my hope that it provides the motivation to do so, as well.
Regular Expressions as a Language
Unless you’ve had some experience with regular expressions, you won’t understand the regular expression ⌈^(From|Subject):⌋ from the last example, but there’s nothing magic about it. For that matter, there is nothing magic about magic. The magician merely understands something simple which doesn’t appear to be simple or natural to the untrained audience. Once you learn how to hold a card while making your hand look empty, you only need practice before you, too, can “do magic.” It is like a foreign language: once you learn it, it stops sounding like gibberish.
The Filename Analogy
Since you have decided to use this book, you probably have at least some idea of just what a “regular expression” is. Even if you don’t, you are almost certainly already familiar with the basic concept.
You know that report.txt is a specific filename, but if you have had any experience with Unix or DOS/Windows, you also know that the pattern “*.txt” can be used to select multiple files. With filename patterns like this (called file globs or wildcards), a few characters have special meaning. The star means “match anything,” and a question mark means “match any one character.” So, with the file glob “*.txt”, we start with a match-anything ⌈*⌋ and end with the literal ⌈.txt⌋, so we end up with a pattern that means “select the files whose names start with anything and end with .txt”.
Most systems provide a few additional special characters, but, in general, these filename patterns are limited in expressive power. This is not much of a shortcoming because the scope of the problem (to provide convenient ways to specify groups of files) is limited, well, simply to filenames.
On the other hand, dealing with general text is a much larger problem. Prose and poetry, program listings, reports, HTML, code tables, word lists… you name it, if a particular need is specific enough, such as “selecting files,” you can develop some kind of specialized scheme or tool to help you accomplish it. However, over the years, a generalized pattern language has developed, which is powerful and expressive for a wide variety of uses. Each program implements and uses them differently, but in general, this powerful pattern language and the patterns themselves are called regular expressions.
The Language Analogy
Full regular expressions are composed of two types of characters. The special characters (like the * from the filename analogy) are called metacharacters, while the rest are called literal, or normal text characters. What sets regular expressions apart from filename patterns are the advanced expressive powers that their metacharacters provide. Filename patterns provide limited metacharacters for limited needs, but a regular expression “language” provides rich and expressive metacharacters for advanced uses.
It might help to consider regular expressions as their own language, with literal text acting as the words and metacharacters as the grammar. The words are combined with grammar according to a set of rules to create an expression that communicates an idea. In the email example, the expression I used to find lines beginning with ‘From:’ or ‘Subject:’ was ⌈^(From|Subject):⌋. The metacharacters in it are the ⌈^⌋, the parentheses, and the ⌈|⌋; we’ll get to their interpretation soon.
As with learning any other language, regular expressions might seem intimidating at first. This is why it seems like magic to those with only a superficial understanding, and perhaps completely unapproachable to those who have never seen it at all. But, just as † would soon become clear to a student of Japanese, the regular expression in
s!<emphasis>([0-9]+(\.[0-9]+){3})</emphasis>!<inet>$1</inet>!
will soon become crystal clear to you, too.
This example is from a Perl language script that my editor used to modify a manuscript. The author had mistakenly used the typesetting tag <emphasis> to mark Internet IP addresses (which are sets of periods and numbers that look like 209.204.146.22). The incantation uses Perl’s text-substitution command with the regular expression
⌈<emphasis>([0-9]+(\.[0-9]+){3})</emphasis>⌋
to replace such tags with the appropriate <inet> tag, while leaving other uses of <emphasis> alone. In later chapters, you’ll learn all the details of exactly how this type of incantation is constructed, so you’ll be able to apply the techniques to your own needs, with your own application or programming language.
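For readers who do not use Perl, the same substitution can be approximated in Python with re.sub. This is a rough equivalent written for illustration, not the script from the story; the sample sentence is made up, and \1 plays the role of Perl’s $1.

```python
import re

# <emphasis> wrapped around four dot-separated groups of digits.
ip_in_emphasis = re.compile(r"<emphasis>([0-9]+(\.[0-9]+){3})</emphasis>")

text = (
    "See <emphasis>209.204.146.22</emphasis>, "
    "which is <emphasis>important</emphasis>."
)

# \1 re-inserts the captured address between the new tags.
fixed = ip_in_emphasis.sub(r"<inet>\1</inet>", text)
print(fixed)
```

The second <emphasis> element is left alone because its content does not look like an IP address.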
The goal of this book
The chance that you will ever want to replace <emphasis> tags with <inet> tags is small, but it is very likely that you will run into similar “replace this with that” problems. The goal of this book is not to teach solutions to specific problems, but rather to teach you how to think regular expressions so that you will be able to conquer whatever problem you may face.
The Regular-Expression Frame of Mind
As we’ll soon see, complete regular expressions are built up from small building-block units. Each individual building block is quite simple, but since they can be combined in an infinite number of ways, knowing how to combine them to achieve a particular goal takes some experience. So, this chapter provides a quick overview of some regular-expression concepts. It doesn’t go into much depth, but provides a basis for the rest of this book to build on, and sets the stage for important side issues that are best discussed before we delve too deeply into the regular expressions themselves.
While some examples may seem silly (because some are silly), they represent the kind of tasks that you will want to do — you just might not realize it yet. If each point doesn’t seem to make sense, don’t worry too much. Just let the gist of the lessons sink in. That’s the goal of this chapter.
If You Have Some Regular-Expression Experience
If you’re already familiar with regular expressions, much of this overview will not be new, but please be sure to at least glance over it anyway. Although you may be aware of the basic meaning of certain metacharacters, perhaps some of the ways of thinking about and looking at regular expressions will be new.
Just as there is a difference between playing a musical piece well and making music, there is a difference between knowing about regular expressions and really understanding them. Some of the lessons present the same information that you are already familiar with, but in ways that may be new and which are the first steps to really understanding.
Searching Text Files: Egrep
Finding text is one of the simplest uses of regular expressions — many text editors and word processors allow you to search a document using a regular-expression pattern. Even simpler is the utility egrep. Give egrep a regular expression and some files to search, and it attempts to match the regular expression to each line of each file, displaying only those lines in which a match is found. egrep is freely available for many systems, including DOS, MacOS, Windows, Unix, and so on. See this book’s web site, http://regex.info, for links on how to obtain a copy of egrep for your system.
Returning to the email example from page 3, the command I actually used to generate a makeshift table of contents from the email file is shown in Figure 1-1. egrep interprets the first command-line argument as a regular expression, and any remaining arguments as the file(s) to search. Note, however, that the single quotes shown in Figure 1-1 are not part of the regular expression, but are needed by my command shell. When using egrep, I usually wrap the regular expression with single quotes. Exactly which characters are special, in what contexts, to whom (to the regular expression, or to the tool), and in what order they are interpreted are all issues that grow in importance when you move to regular-expression use in full-fledged programming languages, something we’ll see starting in the next chapter.
Figure 1-1: Invoking egrep from the command line
We’ll start to analyze just what the various parts of the regex mean in a moment, but you can probably already guess just by looking that some of the characters have special meanings. In this case, the parentheses, the ⌈^⌋, and the ⌈|⌋ characters are regular-expression metacharacters, and combine with the other characters to generate the result I want.
On the other hand, if your regular expression doesn’t use any of the dozen or so metacharacters that egrep understands, it effectively becomes a simple “plain text” search. For example, searching for ⌈cat⌋ in a file finds and displays all lines with the three letters c·a·t in a row. This includes, for example, any line containing vacation.
Even though the line might not have the word cat, the c·a·t sequence in vacation is still enough to be matched. Since it’s there, egrep goes ahead and displays the whole line. The key point is that regular-expression searching is not done on a “word” basis: egrep can understand the concept of bytes and lines in a file, but it generally has no idea of English’s (or any other language’s) words, sentences, paragraphs, or other high-level concepts.
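The line-by-line behavior can be imitated in a few lines of Python. The function grep below is a hypothetical stand-in written for this sketch, not how egrep is implemented, and the sample lines are invented.

```python
import re


def grep(pattern, lines):
    """Return every line in which the pattern matches somewhere
    (a hypothetical, simplified stand-in for egrep)."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


lines = ["my cat sleeps", "on vacation", "dogs only"]
hits = grep("cat", lines)
print(hits)
```

The second line is reported even though it contains no word “cat”, because the search only looks for the three characters in a row.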
Egrep Metacharacters
Let’s start to explore some of the egrep metacharacters that supply its regular-expression power. I’ll go over them quickly with a few examples, leaving the detailed examples and descriptions for later chapters.
Typographical Conventions
Before we begin, please make sure to review the typographical conventions explained in the preface, on page xxi. This book forges a bit of new ground in the area of typesetting, so some of my notations may be unfamiliar at first.
Start and End of the Line
Probably the easiest metacharacters to understand are ⌈^⌋ (caret) and ⌈$⌋ (dollar), which represent the start and end, respectively, of the line of text as it is being checked. As we’ve seen, the regular expression ⌈cat⌋ finds c·a·t anywhere on the line, but ⌈^cat⌋ matches only if the c·a·t is at the beginning of the line; the ⌈^⌋ is used to effectively anchor the match (of the rest of the regular expression) to the start of the line. Similarly, ⌈cat$⌋ finds c·a·t only at the end of the line, such as a line ending with scat.
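A quick way to see the two anchors at work is to test a handful of lines in Python. The helper matching_lines and the sample lines are assumptions made for this sketch; each string plays the role of one line handed to egrep.

```python
import re


def matching_lines(pattern, lines):
    """Return the lines in which the pattern matches (hypothetical helper)."""
    return [line for line in lines if re.search(pattern, line)]


lines = ["catalog", "scat", "the cat", "cats and dogs"]

anywhere = matching_lines(r"cat", lines)    # c, a, t anywhere on the line
at_start = matching_lines(r"^cat", lines)   # start of line, then c, a, t
at_end = matching_lines(r"cat$", lines)     # c, a, t, then end of line
print(anywhere, at_start, at_end)
```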
It’s best to get into the habit of interpreting regular expressions in a rather literal way. For example, don’t think
⌈^cat⌋ matches a line with cat at the beginning
but rather:
⌈^cat⌋ matches if you have the beginning of a line, followed immediately by c, followed immediately by a, followed immediately by t.
They both end up meaning the same thing, but reading it the more literal way allows you to intrinsically understand a new expression when you see it. How would egrep interpret ⌈^cat$⌋, ⌈^$⌋, or even simply ⌈^⌋ alone? Turn the page to check your interpretations.
The caret and dollar are special in that they match a position in the line rather than any actual text characters themselves. Of course, there are various ways to actually match real text. Besides providing literal characters like ⌈cat⌋ in your regular expression, you can also use some of the items discussed in the next few sections.
Character Classes
Matching any one of several characters
Let’s say you want to search for “grey,” but also want to find it if it were spelled “gray.” The regular-expression construct ⌈[···]⌋, usually called a character class, lets you list the characters you want to allow at that point in the match. While ⌈e⌋ matches just an e, and ⌈a⌋ matches just an a, the regular expression ⌈[ea]⌋ matches either. So, then, consider ⌈gr[ea]y⌋: this means to find “g, followed by r, followed by either an e or an a, all followed by y.” Because I’m a really poor speller, I’m always using regular expressions like this against a huge list of English words to figure out proper spellings. One I use often is ⌈sep[ea]r[ea]te⌋, because I can never remember whether the word is spelled “seperate,” “separate,” “separete,” or what. The one that pops up in the list is the proper spelling; regular expressions to the rescue.
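The spelling trick is easy to reproduce in Python against a word list. The tiny list below is a made-up substitute for the “huge list of English words” mentioned above.

```python
import re

# A tiny stand-in for a real word list (assumed for this sketch).
words = ["gray", "great", "grey", "separate", "desperate"]

grey_or_gray = [w for w in words if re.search(r"gr[ea]y", w)]
spelling = [w for w in words if re.search(r"sep[ea]r[ea]te", w)]
print(grey_or_gray, spelling)
```

Only the correctly spelled word is in the list, so it is the only one the second pattern can turn up.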
Notice how outside of a class, literal characters (like the ⌈g⌋ and ⌈r⌋ of ⌈gr[ae]y⌋) have an implied “and then” between them: “match ⌈g⌋ and then match ⌈r⌋ …” It’s completely opposite inside a character class. The contents of a class is a list of characters that can match at that point, so the implication is “or.”
As another example, maybe you want to allow capitalization of a word’s first letter, such as with ⌈[Ss]mith⌋. Remember that this still matches lines that contain smith (or Smith) embedded within another word, such as with blacksmith. I don’t want to harp on this throughout the overview, but this issue does seem to be the source of problems among some new users. I’ll touch on some ways to handle this embedded-word problem after we examine a few more metacharacters.
You can list in the class as many characters as you like. For example, ⌈[123456]⌋ matches any of the listed digits. This particular class might be useful as part of ⌈<H[123456]>⌋, which matches <H1>, <H2>, <H3>, etc. This can be useful when searching for HTML headers.
Within a character class, the character-class metacharacter ‘-’ (dash) indicates a range of characters: ⌈<H[1-6]>⌋ is identical to the previous example. ⌈[0-9]⌋ and ⌈[a-z]⌋ are common shorthands for classes to match digits and English lowercase letters, respectively. Multiple ranges are fine, so ⌈[0123456789abcdefABCDEF]⌋ can be written as ⌈[0-9a-fA-F]⌋ (or, perhaps, ⌈[A-Fa-f0-9]⌋, since the order in which ranges are given doesn’t matter). These last three examples can be useful when processing hexadecimal numbers. You can freely combine ranges with literal characters: ⌈[0-9A-Z_!.?]⌋ matches a digit, uppercase letter, underscore, exclamation point, period, or a question mark.
Note that a dash is a metacharacter only within a character class; otherwise it matches the normal dash character. In fact, it is not even always a metacharacter within a character class. If it is the first character listed in the class, it can’t possibly indicate a range, so it is not considered a metacharacter. Along the same lines, the question mark and period at the end of the class are usually regular-expression metacharacters, but only when not within a class (so, to be clear, the only special characters within the class in ⌈[0-9A-Z_!.?]⌋ are the two dashes).
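The range examples can be checked in Python, whose character-class syntax agrees with egrep on these points. The sample strings are invented for the sketch.

```python
import re

html = "<H1>Intro</H1> <H7>bad</H7> <H3>More</H3>"
headers = re.findall(r"<H[1-6]>", html)  # <H7> is outside the range 1-6

# Three ranges in one class: digits plus hex letters in either case.
is_hex = re.fullmatch(r"[0-9a-fA-F]+", "DEADbeef42") is not None
not_hex = re.fullmatch(r"[0-9a-fA-F]+", "xyz") is not None

# Inside a class the period and question mark are ordinary characters.
literal_hits = re.findall(r"[0-9A-Z_!.?]", "a.B?c!")
print(headers, is_hex, not_hex, literal_hits)
```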
Consider character classes as their own mini language. The rules regarding which metacharacters are supported (and what they do) are completely different inside and outside of character classes.
We’ll see more examples of this shortly.
Negated character classes
If you use ⌈[^···]⌋ instead of ⌈[···]⌋, the class matches any character that isn’t listed. For example, ⌈[^1-6]⌋ matches a character that’s not 1 through 6. The leading ^ in the class “negates” the list, so rather than listing the characters you want to include in the class, you list the characters you don’t want to be included.
You might have noticed that the ^ used here is the same as the start-of-line caret introduced on page 8. The character is the same, but the meaning is completely different. Just as the English word “wind” can mean different things depending on the context (sometimes a strong breeze, sometimes what you do to a clock), so can a metacharacter. We’ve already seen one example, the range-building dash. It is valid only inside a character class (and at that, only when not first inside the class). ^ is a line anchor outside a class, but a class metacharacter inside a class (but, only when it is immediately after the class’s opening bracket; otherwise, it’s not special inside a class). Don’t fear: these are the most complex special cases; others we’ll see later aren’t so bad.
As another example, let’s search that list of English words for odd words that have q followed by something other than u. Translating that into a regular expression, it becomes ⌈q[^u]⌋. I tried it on the list I have, and there certainly weren’t many. I did find a few, including a number of words that I didn’t even know were English.
Here’s what happened. (What I typed is in bold.)
% egrep 'q[^u]' word.list
Iraqi
Iraqian
miqra
qasida
qintar
qoph
zaqqum
%
Two notable words not listed are “Qantas”, the Australian airline, and “Iraq”. Although both words are in the word.list file, neither was displayed by my egrep command. Why? Think about it for a bit, and then turn the page to check your reasoning.
Remember, a negated character class means “match a character that’s not listed” and not “don’t match what is listed.” These might seem the same, but the Iraq example shows the subtle difference. A convenient way to view a negated class is that it is simply a shorthand for a normal class that includes all possible characters except those that are listed.
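The difference can be observed directly in Python. The short word list is an assumption for the sketch; each entry is tested without a trailing newline, which is how egrep sees a line.

```python
import re

# A few entries standing in for word.list (assumed for this sketch).
words = ["Iraqi", "Iraq", "Qantas", "quit", "qoph"]

# q followed by one character that is not u. The class must consume
# a character, so a q at the very end of a word cannot match.
odd_q = [w for w in words if re.search(r"q[^u]", w)]
print(odd_q)
```

“Iraq” drops out because nothing follows its q for ⌈[^u]⌋ to match, and “Qantas” drops out because the expression asks for a lowercase q.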
Matching Any Character with Dot
The metacharacter ⌈.⌋ (usually called dot or point) is a shorthand for a character class that matches any character. It can be convenient when you want to have an “any character here” placeholder in your expression. For example, if you want to search for a date such as 03/19/76, 03-19-76, or even 03.19.76, you could go to the trouble to construct a regular expression that uses character classes to explicitly allow ‘/’, ‘-’, or ‘.’ between each number, such as ⌈03[-./]19[-./]76⌋. However, you might also try simply using ⌈03.19.76⌋.
Quite a few things are going on with this example that might be unclear at first. In ⌈03[-./]19[-./]76⌋, the dots are not metacharacters because they are within a character class. (Remember, the list of metacharacters and their meanings are different inside and outside of character classes.) The dashes are also not class metacharacters in this case because each is the first thing after [ or [^. Had they not been first, as with ⌈[.-/]⌋, they would be the class range metacharacter, which would be a mistake in this situation.
With ⌈03.19.76⌋, the dots are metacharacters, ones that match any character (including the dash, period, and slash that we are expecting). However, it is important to know that each dot can match any character at all, so the expression can also match text in which some other character sits between the numbers.
So, ⌈03[-./]19[-./]76⌋ is more precise, but it’s more difficult to read and write. ⌈03.19.76⌋ is easy to understand, but vague. Which should we use? It all depends upon what you know about the data being searched, and just how specific you feel you need to be. One important, recurring issue has to do with balancing your knowledge of the text being searched against the need to always be exact when writing an expression. For example, if you know that with your data it would be highly unlikely for ⌈03.19.76⌋ to match in an unwanted place, it would certainly be reasonable to use it. Knowing the target text well is an important part of wielding regular expressions effectively.
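The trade-off between the two expressions shows up as soon as the data contains something unexpected. In the Python sketch below, the candidate strings, including the deliberately odd 03x19y76, are invented for illustration.

```python
import re

precise = re.compile(r"03[-./]19[-./]76")  # only dash, period, or slash
vague = re.compile(r"03.19.76")            # any character between the numbers

candidates = ["03/19/76", "03-19-76", "03.19.76", "03x19y76"]

precise_hits = [c for c in candidates if precise.fullmatch(c)]
vague_hits = [c for c in candidates if vague.fullmatch(c)]
print(precise_hits, vague_hits)
```

Whether the extra match matters depends, as the text says, on how likely such strings are in the data being searched.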
Alternation
Matching any one of several subexpressions
A very convenient metacharacter is ⌈|⌋, which means “or.” It allows you to combine multiple expressions into a single expression that matches any of the individual ones. For example, ⌈Bob⌋ and ⌈Robert⌋ are separate expressions, but ⌈Bob|Robert⌋ is one expression that matches either. When combined this way, the subexpressions are called alternatives.
Looking back to our ⌈gr[ea]y⌋ example, it is interesting to realize that it can be written as ⌈grey|gray⌋, and even ⌈gr(a|e)y⌋. The latter case uses parentheses to constrain the alternation. (For the record, parentheses are metacharacters too.) Note that something like ⌈gr[a|e]y⌋ is not what we want: within a class, the ‘|’ character is just a normal character, like ⌈a⌋ and ⌈e⌋.
With ⌈gr(a|e)y⌋, the parentheses are required because without them, ⌈gra|ey⌋ means “⌈gra⌋ or ⌈ey⌋,” which is not what we want here. Alternation reaches far, but not beyond parentheses. Another example is ⌈(First|1st)•[Ss]treet⌋. Actually, since both ⌈First⌋ and ⌈1st⌋ end with ⌈st⌋, the combination can be shortened to ⌈(Fir|1)st•[Ss]treet⌋. That’s not necessarily quite as easy to read, but be sure to understand that ⌈(first|1st)⌋ and ⌈(fir|1)st⌋ effectively mean the same thing.
Here’s an example involving an alternate spelling of my name. Compare and contrast the following three expressions, which are all effectively the same:
⌈Jeffrey|Jeffery⌋
⌈Jeff(rey|ery)⌋
⌈Jeff(re|er)y⌋
To have them match the British spellings as well, they could be:
⌈(Geoff|Jeff)(rey|ery)⌋
⌈(Geo|Je)ff(rey|ery)⌋
⌈(Geo|Je)ff(re|er)y⌋
Finally, note that these three match effectively the same as the longer (but simpler) ⌈Jeffrey|Geoffery|Jeffery|Geoffrey⌋. They’re all different ways to specify the same desired matches.
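That the four spellings of the pattern accept the same names can be confirmed with a short Python check. The list of test names, including the two that should be rejected, is made up for the sketch.

```python
import re

patterns = [
    r"(Geoff|Jeff)(rey|ery)",
    r"(Geo|Je)ff(rey|ery)",
    r"(Geo|Je)ff(re|er)y",
    r"Jeffrey|Geoffery|Jeffery|Geoffrey",
]

names = ["Jeffrey", "Jeffery", "Geoffrey", "Geoffery", "Jeffry", "Geoff"]

# For each pattern, collect the names it matches in full.
accepted = [[n for n in names if re.fullmatch(p, n)] for p in patterns]
print(accepted[0])
```

All four lists come out identical, which is what “effectively the same” means here.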
Although the ⌈gr[ea]y⌋ versus ⌈gr(a|e)y⌋ examples might blur the distinction, be careful not to confuse the concept of alternation with that of a character class. A character class can match just a single character in the target text. With alternation, since each alternative can be a full-fledged regular expression in and of itself, each alternative can match an arbitrary amount of text. Character classes are almost like their own special mini-language (with their own ideas about metacharacters, for example), while alternation is part of the “main” regular expression language. You’ll find both to be extremely useful.
Also, take care when using caret or dollar in an expression that has alternation. Compare ⌈^From|Subject|Date:
•⌋ with ⌈^(From|Subject|Date):•
⌋. Both appear similar to our earlier email example, but what each matches (and therefore how useful it is) differs greatly. The first is composed of three alternatives, so it matches “⌈^From
⌋ or ⌈Subject
⌋ or ⌈Date:•
⌋,” which is not particularly useful. We want the leading caret and trailing ⌈:•
⌋ to apply to each alternative. We can accomplish this by using parentheses to “constrain” the alternation:
⌈^(From|Subject|Date):•⌋
The alternation is constrained by the parentheses, so literally, this regex means “match the start of the line, then one of ⌈From
⌋, ⌈Subject
⌋, or ⌈Date
⌋, and then match ⌈:•
⌋.” Effectively, it matches:
1) start-of-line, followed by F·r·o·m, followed by ‘:•’
or 2) start-of-line, followed by S·u·b·j·e·c·t, followed by ‘:•’
or 3) start-of-line, followed by D·a·t·e, followed by ‘:•’
Putting it less literally, it matches lines beginning with ‘From:•
’, ‘Subject:•
’, or ‘Date:•
’, which is quite useful for listing the messages in an email file.
Here’s an example:
% egrep '^(From|Subject|Date): ' mailbox
From: elvis@tabloid.org (The King)
Subject: be seein’ ya around
Date: Mon, 23 Oct 2006 11:04:13
From: The Prez <president@whitehouse.gov>
Date: Wed, 25 Oct 2006 8:36:24
Subject: now, about your vote
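The difference between the constrained and unconstrained alternation can be seen in a small sketch. It assumes Python's `re` module in place of egrep; the last two sample lines are hypothetical additions that are not part of the mailbox shown above.

```python
import re

# Sample lines modeled on the mailbox listing, plus two extra lines.
mailbox = [
    "From: elvis@tabloid.org (The King)",
    "Subject: be seein' ya around",
    "Date: Mon, 23 Oct 2006 11:04:13",
    "Received: from example by example",     # hypothetical extra line
    "The Date: field is set by the sender",  # hypothetical extra line
]

# Parentheses make the caret and the ': ' apply to every alternative.
constrained = re.compile(r"^(From|Subject|Date): ")
# Without them there are three alternatives: '^From', 'Subject', 'Date: '.
unconstrained = re.compile(r"^From|Subject|Date: ")

wanted = [line for line in mailbox if constrained.search(line)]
loose = [line for line in mailbox if unconstrained.search(line)]

print(wanted)
print(loose)
```

The unconstrained version also picks up the line that merely mentions `Date: ` in the middle, because only its first alternative is tied to the start of the line.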
Ignoring Differences in Capitalization
This email header example provides a good opportunity to introduce the concept of a case-insensitive match. The field types in an email header usually appear with leading capitalization, such as “Subject” and “From,” but the email standard actually allows mixed capitalization, so things like “DATE” and “from” are also allowed. Unfortunately, the regular expression in the previous section doesn’t match those.
One approach is to replace ⌈From
⌋ with ⌈[Ff][Rr][Oo][Mm]
⌋ to match any form of “from,” but this is quite cumbersome, to say the least. Fortunately, there is a way to tell egrep to ignore case when doing comparisons, i.e., to perform the match in a case insensitive manner in which capitalization differences are simply ignored. It is not a part of the regular-expression language, but is a related useful feature many tools provide. egrep’s command-line option “-i
” tells it to do a case-insensitive match. Place -i
on the command line before the regular expression:
% egrep -i '^(From|Subject|Date): ' mailbox
This brings up all the lines we matched before, but also includes lines such as:
SUBJECT: MAKE MONEY FAST
I find myself using the -i
option quite frequently (perhaps related to the footnote on page 12!) so I recommend keeping it in mind. We’ll see other convenient support features like this in later chapters.
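Many other tools offer the same kind of switch. As an illustration only, this sketch assumes Python, where the `re.IGNORECASE` flag plays roughly the role of egrep's -i; the sample lines are made up.

```python
import re

lines = ["SUBJECT: MAKE MONEY FAST", "from: someone", "Date: today", "dated: no"]
pattern = r"^(From|Subject|Date): "

# Case-sensitive by default: only the conventionally capitalized header hits.
sensitive = [l for l in lines if re.search(pattern, l)]

# With the flag, capitalization differences are ignored.
insensitive = [l for l in lines if re.search(pattern, l, re.IGNORECASE)]

print(sensitive)
print(insensitive)
```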
Word Boundaries
A common problem is that a regular expression that matches the word you want can often also match where the “word” is embedded within a larger word. I mentioned this briefly in the cat, gray, and Smith examples. It turns out, though, that some versions of egrep offer limited support for word recognition: namely, the ability to match the boundary of a word (where a word begins or ends).
You can use the (perhaps odd looking) metasequences ⌈\<⌋ and ⌈\>⌋ if your version happens to support them (not all versions of egrep do). You can think of them as word-based versions of ⌈^⌋ and ⌈$⌋ that match the position at the start and end of a word, respectively. Like the line anchors caret and dollar, they anchor other parts of the regular expression but don’t actually consume any characters during a match. The expression ⌈\<cat\>⌋ literally means “match if we can find a start-of-word position, followed immediately by c·a·t, followed immediately by an end-of-word position.” More naturally, it means “find the word cat.” If you wanted, you could use ⌈\<cat⌋ or ⌈cat\>⌋ to find words starting with cat or ending with cat, respectively.
Note that ⌈<⌋ and ⌈>⌋ alone are not metacharacters; it is only when combined with a backslash that the sequences become special. This is why I called them “metasequences.” It’s their special interpretation that’s important, not the number of characters, so for the most part I use the words “metacharacter” and “metasequence” interchangeably.
Remember, not all versions of egrep support these word-boundary metacharacters, and those that do don’t magically understand the English language. The “start of a word” is simply the position where a sequence of alphanumeric characters begins; “end of word” is where such a sequence ends. Figure 1-2 on the next page shows a sample line with these positions marked.
The word-starts (as egrep recognizes them) are marked with up arrows, the word-ends with down arrows. As you can see, “start and end of word” is better phrased as “start and end of an alphanumeric sequence,” but perhaps that’s too much of a mouthful.
Figure 1-2: Start and end of “word” positions
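The start-of-word and end-of-word positions can be explored with a short sketch. It assumes Python, whose `re` module has no separate start-of-word and end-of-word anchors; the helper names `START` and `END` are hypothetical and emulate them with the catch-all `\b` plus a lookaround. Note also that Python counts the underscore (and non-ASCII letters) as word characters, so its idea of a "word" is slightly broader than an alphanumeric sequence.

```python
import re

text = "cat concatenate bobcat catalog"

START = r"\b(?=\w)"   # a boundary with a word character to its right
END = r"\b(?<=\w)"    # a boundary with a word character to its left

# Whole word only: a start-of-word, then c-a-t, then an end-of-word.
whole = re.findall(START + "cat" + END, text)

# Words that start with "cat", and words that end with "cat".
starts = re.findall(START + r"cat\w*", text)
ends = re.findall(r"\w*cat" + END, text)

print(whole, starts, ends)
```

Like the anchors described above, `START` and `END` consume no characters; they only test a position.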
In a Nutshell
Table 1-1 summarizes the metacharacters we have seen so far.
Table 1-1: Summary of Metacharacters Seen So Far
Metacharacter | Name | Matches
⌈.⌋ | dot | any one character
⌈[···]⌋ | character class | any character listed
⌈[^···]⌋ | negated character class | any character not listed
⌈^⌋ | caret | the position at the start of the line
⌈$⌋ | dollar | the position at the end of the line
⌈\<⌋ | backslash less-than | †the position at the start of a word
⌈\>⌋ | backslash greater-than | †the position at the end of a word
⌈|⌋ | or; bar | matches either expression it separates
⌈(···)⌋ | parentheses | used to limit scope of ⌈|⌋
†not supported by all versions of egrep
In addition to the table, important points to remember include:
- The rules about which characters are and aren’t metacharacters (and exactly what they mean) are different inside a character class. For example, dot is a metacharacter outside of a class, but not within one. Conversely, a dash is a metacharacter within a class (usually), but not outside. Moreover, a caret has one meaning outside, another if specified inside a class immediately after the opening [, and a third if given elsewhere in the class.
- Don’t confuse alternation with a character class. The class ⌈[abc]⌋ and the alternation ⌈(a|b|c)⌋ effectively mean the same thing, but the similarity in this example does not extend to the general case. A character class can match exactly one character, and that’s true no matter how long or short the specified list of acceptable characters might be. Alternation, on the other hand, can have arbitrarily long alternatives, each textually unrelated to the other: ⌈\<(1,000,000|million|thousand•thou)\>⌋. However, alternation can’t be negated like a character class.
- A negated character class is simply a notational convenience for a normal character class that matches everything not listed. Thus, ⌈[^x]⌋ doesn’t mean “match unless there is an x,” but rather “match if there is something that is not x.” The difference is subtle, but important. The first concept matches a blank line, for example, while ⌈[^x]⌋ does not.
- The useful -i option discounts capitalization during a match (page 15).†
What we have seen so far can be quite useful, but the real power comes from optional and counting elements, which we’ll look at next.
Optional Items
Let’s look at matching color
or colour.
Since they are the same except that one has a u
and the other doesn’t, we can use ⌈colou?r
⌋ to match either. The metacharacter ⌈?
⌋ (question mark) means optional. It is placed after the character that is allowed to appear at that point in the expression, but whose existence isn’t actually required to still be considered a successful match.
Unlike other metacharacters we have seen so far, the question mark attaches only to the immediately-preceding item. Thus, ⌈colou?r
⌋ is interpreted as “⌈c
⌋ then ⌈o
⌋ then ⌈l
⌋ then ⌈o
⌋ then ⌈u?
⌋ then ⌈r
⌋.”
The ⌈u?
⌋ part is always successful: sometimes it matches a u
in the text, while other times it doesn’t. The whole point of the ?
-optional part is that it’s successful either way. This isn’t to say that any regular expression that contains ?
is always successful. For example, against ‘semicolon
’, both ⌈colo
⌋ and ⌈u?
⌋ are successful (matching colo
and nothing, respectively). However, the final ⌈r
⌋ fails, and that’s what disallows semicolon
, in the end, from being matched by ⌈colou?r
⌋.
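A minimal sketch of the same reasoning, assuming Python's `re` module in place of egrep (the question mark behaves the same way here); the word list is just test data.

```python
import re

words = ["color", "colour", "semicolon", "colouur"]

# 'u?' succeeds whether or not a u is present, but the final 'r' must
# still match, which is what rules out 'semicolon'.
matched = [w for w in words if re.search(r"colou?r", w)]

print(matched)
```

Only one u is optional, so the misspelled `colouur` is rejected as well.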
As another example, consider matching a date that represents July fourth, with the “July” part being either July
or Jul
, and the “fourth” part being fourth, 4th
, or simply 4.
Of course, we could just use ⌈(July|Jul)•(fourth|4th|4)
⌋, but let’s explore other ways to express the same thing.
First, we can shorten the ⌈(July|Jul)
⌋ to ⌈(July?)
⌋. Do you see how they are effectively the same? The removal of the ⌈|⌋ means that the parentheses are no longer really needed. Leaving the parentheses doesn’t hurt, but with them removed, ⌈July
?⌋ is a bit less cluttered. This leaves us with ⌈July?•(fourth|4th|4)⌋.
Moving now to the second half, we can simplify the ⌈4th|4
⌋ to ⌈4(th)?
⌋. As you can see, ⌈?
⌋ can attach to a parenthesized expression. Inside the parentheses can be as complex a subexpression as you like, but “from the outside” it is considered a single unit. Grouping for ⌈?
⌋ (and other similar metacharacters which I’ll introduce momentarily) is one of the main uses of parentheses.
Our expression now looks like ⌈July?•(fourth|4(th)?)
⌋. Although there are a fair number of metacharacters, and even nested parentheses, it is not that difficult to decipher and understand. This discussion of two essentially simple examples has been rather long, but in the meantime we have covered tangential topics that add a lot, if perhaps only subconsciously, to our understanding of regular expressions. Also, it’s given us some experience in taking different approaches toward the same goal. As we advance through this book (and through to a better understanding), you’ll find many opportunities for creative juices to flow while trying to find the optimal way to solve a complex problem. Far from being some stuffy science, writing regular expressions is closer to an art.
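To confirm that the shortened expression accepts the same dates as the long one, here is a small sketch. It assumes Python's `re` module and writes the visible space ‘•’ as an ordinary space; the candidate strings are made up for the comparison.

```python
import re

long_form = re.compile(r"(July|Jul) (fourth|4th|4)")
short_form = re.compile(r"July? (fourth|4(th)?)")

candidates = ["July fourth", "Jul 4th", "July 4", "Jul fourth", "June 4", "Jul  4"]

# fullmatch() requires the entire string to match, so the two
# expressions are compared on exactly the same footing.
by_long = [c for c in candidates if long_form.fullmatch(c)]
by_short = [c for c in candidates if short_form.fullmatch(c)]

print(by_long)
print(by_short)
```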
Other Quantifiers: Repetition
Similar to the question mark are ⌈+
⌋ (plus) and ⌈*
⌋ (an asterisk, but as a regular-expression metacharacter, I prefer the term star). The metacharacter ⌈+
⌋ means “one or more of the immediately-preceding item,” and ⌈*
⌋ means “any number, including none, of the item.” Phrased differently, ⌈···*
⌋ means “try to match it as many times as possible, but it’s OK to settle for nothing if need be.” The construct with plus, ⌈···+
⌋, is similar in that it also tries to match as many times as possible, but different in that it fails if it can’t match at least once. These three metacharacters, question mark, plus, and star, are called quantifiers because they influence the quantity of what they govern.
Like ⌈···?
⌋, the ⌈···*
⌋ part of a regular expression always succeeds, with the only issue being what text (if any) is matched. Contrast this to ⌈···+
⌋, which fails unless the item matches at least once.
For example, ⌈•?
⌋ allows a single optional space, but ⌈•*
⌋ allows any number of optional spaces. We can use this to make page 9’s <H[1-6]>
example flexible. The HTML specification† says that spaces are allowed immediately before the closing >
, such as with <H3•>
and <H4•••>
. Inserting ⌈•*
⌋ into our regular expression where we want to allow (but not require) spaces, we get ⌈<H[1-6]•*>
⌋. This still matches <H1>
, as no spaces are required, but it also flexibly picks up the other versions.
Exploring further, let’s search for an HTML tag such as <HR•SIZE=14>
, which indicates that a line (a Horizontal Rule) 14 pixels thick should be drawn across the screen. Like the <H3>
example, optional spaces are allowed before the closing angle bracket. Additionally, they are allowed on either side of the equal sign. Finally, one space is required between the HR
and SIZE
, although more are allowed. To allow more, we could just add ⌈•*
⌋ to the ⌈•
⌋ already there, but instead let’s change it to ⌈•+
⌋. The plus allows extra spaces while still requiring at least one, so it’s effectively the same as ⌈••*
⌋, but more concise. All these changes leave us with ⌈<HR•+SIZE•*=•*14•*>
⌋.
Although flexible with respect to spaces, our expression is still inflexible with respect to the size given in the tag. Rather than find tags with only one particular size such as 14
, we want to find them all. To accomplish this, we replace the ⌈14
⌋ with an expression to find a general number. Well, in this case, a “number” is one or more digits. A digit is ⌈[0-9]
⌋, and “one or more” adds a plus, so we end up replacing ⌈14
⌋ by ⌈[0-9]+
⌋. (A character class is one “unit,” so can be subject directly to plus, question mark, and so on, without the need for parentheses.)
This leaves us with ⌈<HR•+ SIZE •* = •* [0-9]+ •*>
⌋, which is certainly a mouthful even though I’ve presented it with the metacharacters bold, added a bit of spacing to make the groupings more apparent, and am using the “visible space” symbol ‘•’ for clarity. (Luckily, egrep has the -i
case-insensitive option (page 15), which means I don’t have to use ⌈[Hh][Rr]
⌋ instead of ⌈HR
⌋.) The unadorned regular expression ⌈<HR +SIZE *= *[0-9]+ *>
⌋ likely appears even more confusing. This example looks particularly odd because the subjects of most of the stars and pluses are space characters, and our eye has always been trained to treat spaces specially. That’s a habit you will have to break when reading regular expressions, because the space character is a normal character, no different from, say, j
or 4.
(In later chapters, we’ll see that some other tools support a special mode in which white-space is ignored, but egrep has no such mode.)
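Here is how the unadorned expression behaves on a few tags. This is a sketch only: it assumes Python's `re` module with `re.IGNORECASE` standing in for egrep's -i, and the sample tags are invented for the test.

```python
import re

# ' +' requires at least one space; ' *' allows any number, including none.
hr_size = re.compile(r"<HR +SIZE *= *[0-9]+ *>", re.IGNORECASE)

tags = [
    "<HR SIZE=14>",
    "<hr   size = 7 >",
    "<HRSIZE=14>",   # no space between HR and SIZE
    "<HR SIZE=>",    # no digits
    "<HR>",          # no size attribute at all
]

matched = [t for t in tags if hr_size.search(t)]

print(matched)
```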
Continuing to exploit a good example, let’s consider that the size attribute is optional, so you can simply use <HR>
if the default size is wanted. (Extra spaces are allowed before the >
, as always.) How can we modify our regular expression so that it matches either type? The key is realizing that the size part is optional (that’s a hint). Turn the page to check your answer.
Take a good look at our latest expression (in the answer box) to appreciate the differences among the question mark, star, and plus, and what they really mean in practice. Table 1-2 on the next page summarizes their meanings.
Note that each quantifier has some minimum number of matches required to succeed, and a maximum number of matches that it will ever attempt. With some, the minimum number is zero; with some, the maximum number is unlimited.
Table 1-2: Summary of Quantifier “Repetition Metacharacters”
Quantifier | Minimum Required | Maximum to Try | Meaning
⌈?⌋ | none | 1 | one allowed; none required (“one optional”)
⌈*⌋ | none | no limit | unlimited allowed; none required (“any amount OK”)
⌈+⌋ | 1 | no limit | unlimited allowed; one required (“at least one”)
Defined range of matches: intervals
Some versions of egrep support a metasequence for providing your own minimum and maximum: ⌈···{
min,max}
⌋. This is called the interval quantifier. For example, ⌈···{3,12}
⌋ matches up to 12 times if possible, but settles for three. One might use ⌈[a-zA-Z]{1,5}
⌋ to match a US stock ticker (from one to five letters). Using this notation, ⌈{0,1}⌋ is the same as a question mark.
Not many versions of egrep support this notation yet, but many other tools do, so it’s covered in Chapter 3 when we look in detail at the broad spectrum of metacharacters in common use today.
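Python's `re` module is one of the tools that supports this notation, so it can serve for a quick sketch of the interval quantifier. The ticker symbols below are sample strings, not a claim about real listings.

```python
import re

ticker = re.compile(r"[a-zA-Z]{1,5}")
symbols = ["T", "IBM", "GOOGL", "TOOLONG", ""]

# fullmatch() makes the upper bound visible: six or more letters fail.
valid = [s for s in symbols if ticker.fullmatch(s)]

# x{3,12} accepts runs of 3 through 12 and nothing outside that range.
lengths = [n for n in range(15) if re.fullmatch(r"x{3,12}", "x" * n)]

# {0,1} behaves like the question mark.
same = all(
    bool(re.fullmatch(r"colou{0,1}r", w)) == bool(re.fullmatch(r"colou?r", w))
    for w in ["color", "colour", "colouur"]
)

print(valid, lengths, same)
```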
Parentheses and Backreferences
So far, we have seen two uses for parentheses: to limit the scope of alternation, ⌈|⌋, and to group multiple characters into larger units to which you can apply quantifiers like question mark and star. I’d like to discuss another specialized use that’s not common in egrep (although GNU’s popular version does support it), but which is commonly found in many other tools.
In many regular-expression flavors, parentheses can “remember” text matched by the subexpression they enclose. We’ll use this in a partial solution to the doubled-word problem at the beginning of this chapter. If you knew the the specific doubled word to find (such as “the” earlier in this sentence; did you catch it?), you could search for it explicitly, such as with ⌈the•the⌋. In this case, you would also find items such as the•theory, but you could easily get around that problem if your egrep supports the word-boundary metasequences ⌈\<···\>⌋ mentioned on page 15: ⌈\<the•the\>⌋. We could use ⌈•+⌋ for the space for even more flexibility.
However, having to check for every possible pair of words would be an impossible task. Wouldn’t it be nice if we could match one generic word, and then say “now match the same thing again”? If your egrep supports backreferencing, you can. Backreferencing is a regular-expression feature that allows you to match new text that is the same as some text matched earlier in the expression.
We start with ⌈\<the•+the\>⌋ and replace the initial ⌈the⌋ with a regular expression to match a general word, say ⌈[A-Za-z]+⌋. Then, for reasons that will become clear in the next paragraph, let’s put parentheses around it. Finally, we replace the second ‘the’ by the special metasequence ⌈\1⌋. This yields ⌈\<([A-Za-z]+)•+\1\>⌋.
With tools that support backreferencing, parentheses “remember” the text that the subexpression inside them matches, and the special metasequence ⌈\1⌋ represents that text later in the regular expression, whatever it happens to be at the time.
Of course, you can have more than one set of parentheses in a regular expression. Use ⌈\1⌋, ⌈\2⌋, ⌈\3⌋, etc., to refer to the first, second, third, etc. sets. Pairs of parentheses are numbered by counting opening parentheses from the left, so with ⌈([a-z])([0-9])\1\2⌋, the ⌈\1⌋ refers to the text matched by ⌈[a-z]⌋, and ⌈\2⌋ refers to the text matched by ⌈[0-9]⌋.
With our ‘the•the’ example, ⌈[A-Za-z]+⌋ matches the first ‘the’. It is within the first set of parentheses, so the ‘the’ matched becomes available via ⌈\1⌋. If the following ⌈•+⌋ matches, the subsequent ⌈\1⌋ will require another ‘the’. If ⌈\1⌋ is successful, then ⌈\>⌋ makes sure that we are now at an end-of-word boundary (which we wouldn’t be were the text ‘the•theft’). If successful, we’ve found a repeated word. It’s not always the case that that is an error (such as with “that” in this sentence), but that’s for you to decide once the suspect lines are shown.
When I decided to include this example, I actually tried it on what I had written so far. (I used a version of egrep that supports both ⌈\<···\>⌋ and backreferencing.) To make it more useful, so that ‘The•the’ would also be found, I used the case-insensitive -i option mentioned on page 15.†
Here’s the command I ran:
% egrep -i '\<([a-z]+) +\1\>' files···
I was surprised to find fourteen sets of mistakenly ‘doubled•doubled
’ words! I corrected them, and since then have built this type of regular-expression check into the tools that I use to produce the final output of this book, to ensure none creep back in.
As useful as this regular expression is, it is important to understand its limitations. Since egrep considers each line in isolation, it isn’t able to find when the ending word of one line is repeated at the beginning of the next. For this, a more flexible tool is needed, and we will see some examples in the next chapter.
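The doubled-word check can be sketched line by line in the same spirit. This assumes Python's `re` module rather than egrep: the catch-all `\b` stands in for the start-of-word and end-of-word anchors, and `re.IGNORECASE` for -i. The sample lines are made up.

```python
import re

# Group 1 remembers a word; \1 then demands the same text again.
doubled = re.compile(r"\b([a-z]+) +\1\b", re.IGNORECASE)

lines = [
    "If you knew the the specific word",
    "The the start of a sentence",
    "the theft was reported",       # rejected by the trailing boundary
    "nothing repeated here",
]

# Like egrep, this looks at each line in isolation, so a word repeated
# across a line break would go unnoticed.
suspects = [line for line in lines if doubled.search(line)]

print(suspects)
```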
The Great Escape
One important thing I haven’t mentioned yet is how to actually match a character that a regular expression would normally interpret as a metacharacter. For example, if I searched for the Internet hostname ega.att.com using ⌈ega.att.com⌋, it could end up matching something like megawatt•computing. Remember, ⌈.⌋ is a metacharacter that matches any character, including a space.
The metasequence to match an actual period is a period preceded by a backslash: ⌈ega\.att\.com⌋. The sequence ⌈\.⌋ is described as an escaped period or escaped dot, and you can do this with all the normal metacharacters, except in a character-class.†
A backslash used in this way is called an “escape” — when a metacharacter is escaped, it loses its special meaning and becomes a literal character. If you like, you can consider the sequence to be a special metasequence to match the literal character. It’s all the same.
As another example, you could use ⌈\([a-zA-Z]+\)⌋ to match a word within parentheses, such as ‘(very)’. The backslashes in the ⌈\(⌋ and ⌈\)⌋ sequences remove the special interpretation of the parentheses, leaving them as literals to match parentheses in the text.
When used before a non-metacharacter, a backslash can have different meanings depending upon the version of the program. For example, we have already seen how some versions treat ⌈\<⌋, ⌈\>⌋, ⌈\1⌋, etc. as metasequences. We will see many more examples in later chapters.
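The effect of escaping is easy to demonstrate. The sketch below assumes Python's `re` module; `re.escape` is a convenience of that module (not something egrep provides) that adds the backslashes for you, and the sample strings are illustrative.

```python
import re

host = "ega.att.com"
sample = "megawatt computing"

# Unescaped, each dot matches any character, even the space.
unescaped_hit = re.search(r"ega.att.com", sample)

# Escaped, each dot matches only a literal period.
escaped_hit = re.search(r"ega\.att\.com", sample)

# re.escape() builds a pattern that matches the hostname literally.
auto = re.escape(host)

# Escaped parentheses match literal parentheses in the text.
paren = re.search(r"\([a-zA-Z]+\)", "a (very) short note")

print(unescaped_hit.group(0), escaped_hit, paren.group(0))
```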
Expanding the Foundation
I hope the examples and explanations so far have helped to establish the basis for a solid understanding of regular expressions, but please realize that what I’ve provided so far lacks depth. There’s so much more out there.
Linguistic Diversification
I mentioned a number of regular expression features that most versions of egrep support. There are other features, some of which are not supported by all versions, which I’ll leave for later chapters.
Unfortunately, the regular expression language is no different from any other in that it has various dialects and accents. It seems each new program employing regular expressions devises its own “improvements.” The state of the art continually moves forward, but changes over the years have resulted in a wide variety of regular expression “flavors.” We’ll see many examples in the following chapters.
The Goal of a Regular Expression
From the broadest top-down view, a regular expression either matches within a lump of text (with egrep, each line) or it doesn’t. When crafting a regular expression, you must consider the ongoing tug-of-war between having your expression match the lines you want, yet still not matching lines you don’t want.
Also, while egrep doesn’t care where in the line the match occurs, this concern is important for many other regular-expression uses. If your text is something such as
…zip is 44272. If you write, send $4.95 to cover postage and…
and you merely want to find lines matching ⌈[0-9]+
⌋, you don’t care which numbers are matched. However, if your intent is to do something with the number (such as save to a file, add, replace, and such—we will see examples of this kind of processing in the next chapter), you’ll care very much exactly which numbers are matched.
A Few More Examples
As with any language, experience is a very good thing, so I’m including a few more examples of regular expressions to match some common constructs.
Half the battle when writing regular expressions is getting successful matches when and where you want them. The other half is to not match when and where you don’t want. In practice, both are important, but for the moment, I would like to concentrate on the “getting successful matches” aspect. Even though I don’t take these examples to their fullest depths, they still provide useful insight.
Variable names
Many programming languages have identifiers (variable names and such) that are allowed to contain only alphanumeric characters and underscores, but which may not begin with a digit. They are matched by ⌈[a-zA-Z_][a-zA-Z_0-9]*
⌋. The first character class matches what the first character can be, the second (with its accompanying star) allows the rest of the identifier. If there is a limit on the length of an identifier, say 32 characters, you might replace the star with ⌈{0,31}
⌋ if the ⌈{
min,max}
⌋ notation is supported. (This construct, the interval quantifier, was briefly mentioned on page 20.)
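A short sketch of both forms, assuming Python's `re` module (which supports the interval notation); the sample identifiers are invented.

```python
import re

identifier = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")
# One leading character plus up to 31 more gives a 32-character limit.
limited = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]{0,31}")

names = ["count", "_tmp1", "x", "2fast", "total-sum"]
valid = [n for n in names if identifier.fullmatch(n)]

fits = limited.fullmatch("a" * 32) is not None
too_long = limited.fullmatch("a" * 33) is not None

print(valid, fits, too_long)
```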
A string within double quotes
A simple solution to matching a string within double quotes might be: ⌈"[^"]*"
⌋
The double quotes at either end are to match the opening and closing double quotes of the string. Between them, we can have anything… except another double quote! So, we use ⌈[^"]
⌋ to match all characters except a double quote, and apply using a star to indicate we can have any number of such non double-quote characters.
A more useful (but more complex) definition of a double-quoted string allows double quotes within the string if they are escaped with a backslash, such as in "nail•the•2\"x4\"•plank". We’ll see this example several times in future chapters while covering the many details of how a match is actually carried out.
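The simple expression, and the way it falls short on escaped quotes, can be seen in a brief sketch. It assumes Python's `re` module, with ordinary spaces in place of ‘•’.

```python
import re

simple = re.compile(r'"[^"]*"')

# Two ordinary quoted strings are picked out cleanly.
found = simple.findall('say "hello" and "bye"')

# With escaped quotes inside, the match stops at the first inner quote,
# because [^"] cannot step over any double quote, escaped or not.
tricky = r'"nail the 2\"x4\" plank"'
first = simple.search(tricky).group(0)

print(found)
print(first)
```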
Dollar amount (with optional cents)
One approach to matching a dollar amount is: ⌈\$[0-9]+(\.[0-9][0-9])?⌋
From a top-level perspective, this is a simple regular expression with three parts: ⌈\$⌋ and ⌈···+⌋ and ⌈(···)?⌋, which might be loosely paraphrased as “a literal dollar sign, a bunch of one thing, and finally perhaps another thing.” In this case, the “one thing” is a digit (with a bunch of them being a number), and “another thing” is the combination of a decimal point followed by two digits.
This example is a bit naive for several reasons. For example, it considers dollar amounts like $1000
, but not $1,000
. It does allow for optional cents, but frankly, that’s not really very useful when applied with egrep. egrep never cares exactly how much is matched, but merely whether there is a match. Allowing something optional at the end never changes whether there’s an overall match to begin with.
But, if you need to find lines that contain just a price, and nothing else, you can wrap the expression with ⌈^···$
⌋. In this case, the optional cents part becomes important since it might or might not come between the dollar amount and the end of the line, and allowing or disallowing it makes the difference in achieving an overall match.
One type of value our expression doesn’t match is ‘$.49
’. To solve this, you might be tempted to change the plus to a star, but that doesn’t work. As to why, I’ll leave it as a teaser until we look at a similar example in Chapter 5 (page 194).
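The role of the optional cents, with and without the line anchors, can be sketched as follows. This assumes Python's `re` module; the candidate strings are test data.

```python
import re

price = re.compile(r"\$[0-9]+(\.[0-9][0-9])?")
# Wrapped in ^...$ so that the line must contain a price and nothing else.
only_price = re.compile(r"^\$[0-9]+(\.[0-9][0-9])?$")

text = "zip is 44272. If you write, send $4.95 to cover postage"
found = price.search(text).group(0)

candidates = ["$1000", "$4.95", "$4.9", "$1,000", "$.49", "about $5"]
whole_line = [c for c in candidates if only_price.search(c)]

print(found)
print(whole_line)
```

Unanchored, the optional part affects only how much text is matched; anchored, it decides whether `$4.95` matches at all.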
An HTTP/HTML URL
The format of web URLs can be complex, so constructing a regular expression to match any possible URL can be equally complex. However, relaxing your standards slightly can allow you to match most common URLs with a fairly simple expression. One common reason I might do this, for example, would be to search my email archive for a URL that I vaguely remember having received, but which I think I might recognize when I see it.
The general form of a common HTTP/HTML URL is along the lines of
http://hostname/path.html
although ending with .htm
is common as well.
The rules about what can and can’t be a hostname (computer name, such as www.yahoo.com) are complex, but for our needs we can realize that if it follows ‘http://
’, it’s probably a hostname, so we can make do with something simple, such as ⌈[-a-z0-9_.]+
⌋. The path part can be even more varied, so we’ll use ⌈[-a-z0-9_:@&?=+,.!/~*%$]*
⌋ for that. Notice that these classes have the dash first, to ensure that it’s taken as a literal character and included in the list, as opposed to part of a range (page 9).
Putting these all together, we might use as our first attempt something like:
% egrep -i '\<http://[-a-z0-9_.:]+/[-a-z0-9_:@&?=+,.!/~*%$]*\.html?\>' files
Again, since we’ve taken liberties and relaxed what we’ll match, we could well match something such as ‘http:// . . . . /foo.html
’, which is certainly not a valid URL. Do we care about this? It all depends on what you’re trying to do. For my scan of my email archive, it doesn’t really matter if I get a few false matches. Heck, I could probably get away with even something as simple as:
% egrep -i '\<http://[^ ]*\.html?\>' files…
As we’ll learn when getting deeper into how to craft an expression, knowing the data you’ll be searching is an important aspect of finding the balance between complexity and completeness. We’ll visit this example again, in more detail, in the next chapter.
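Both URL expressions can be tried out in a sketch like the one below. It assumes Python's `re` module, with `\b` standing in for the word-boundary metasequences and `re.IGNORECASE` for -i; the hostnames and paths are placeholders.

```python
import re

careful = re.compile(
    r"\bhttp://[-a-z0-9_.:]+/[-a-z0-9_:@&?=+,.!/~*%$]*\.html?\b",
    re.IGNORECASE,
)
# The much looser version: anything without a space, ending in .htm or .html.
loose = re.compile(r"\bhttp://[^ ]*\.html?\b", re.IGNORECASE)

text = "see http://www.example.com/docs/index.html for details"
hit = careful.search(text).group(0)

samples = ["HTTP://Host/Page.HTM", "http://host/page.php", "xhttp://host/a.html"]
careful_hits = [s for s in samples if careful.search(s)]
loose_hits = [s for s in samples if loose.search(s)]

print(hit)
print(careful_hits, loose_hits)
```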
An HTML tag
With a tool like egrep, it doesn’t seem particularly common or useful to simply match lines with HTML tags. But, exploring a regular expression that matches HTML tags exactly can be quite fruitful, especially when we delve into more advanced tools in the next chapter.
Looking at simple cases like ‘<TITLE>
’ and ‘<HR>
’, we might think to try ⌈<.*>
⌋. This simplistic approach is a frequent first thought, but it’s certainly incorrect. Converting ⌈<.*>
⌋ into English reads “match a ‘<
’, followed by as much of anything as can be matched, followed by ‘>
’.” Well, when phrased that way, it shouldn’t be surprising that it can match more than just one tag, such as the ‘<I>short</I>’ portion of ‘this <I>short</I> example’.
This might have been a bit surprising, but we’re still in the first chapter, and our understanding at this point is only superficial. I have this example here to highlight that regular expressions are not a difficult subject, but they can be tricky if you don’t truly understand them. Over the next few chapters, we’ll look at all the details required to understand and solve this problem.
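The overreach is easy to reproduce. A minimal sketch, assuming Python's `re` module and a short sample string with two tags:

```python
import re

text = "this <I>short</I> example"

# '.*' takes as much as it can, so the match runs from the first '<'
# all the way to the last '>' on the line.
matched = re.search(r"<.*>", text).group(0)

print(matched)
```

The match spans both tags and the text between them, not a single tag.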
Time of day, such as “9:17 am” or “12:30 pm”
Matching a time can be taken to varying levels of strictness. Something such as
⌈[0-9]?[0-9]:[0-9][0-9]•(am|pm)⌋
picks up both 9:17•am
and 12:30•pm
, but also allows something nonsensical like 99:99•pm
.
Looking at the hour, we realize that if it is a two-digit number, the first digit must be a one. But, ⌈1?[0-9]
⌋ still allows an hour of 19
(and also an hour of 0
), so maybe it is better to break the hour part into two possibilities: ⌈1[012]
⌋ for two-digit hours and ⌈[1-9]
⌋ for single-digit hours. The result is ⌈(1[012]|[1-9])
⌋.
The minute part is easier. The first digit should be ⌈[0-5]
⌋. For the second, we can stick with the current ⌈[0-9]
⌋. This gives ⌈(1[012]|[1-9]):[0-5][0-9]•(am|pm)
⌋ when we put it all together.
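The two 12-hour expressions can be compared in a sketch. It assumes Python's `re` module, with ordinary spaces in place of ‘•’; the sample times are test data.

```python
import re

loose = re.compile(r"[0-9]?[0-9]:[0-9][0-9] (am|pm)")
strict = re.compile(r"(1[012]|[1-9]):[0-5][0-9] (am|pm)")

times = ["9:17 am", "12:30 pm", "99:99 pm", "0:30 am", "13:00 pm", "7:60 am"]

# fullmatch() matters here: an unanchored search with the strict
# expression would still find '3:00 pm' inside '13:00 pm'.
by_loose = [t for t in times if loose.fullmatch(t)]
by_strict = [t for t in times if strict.fullmatch(t)]

print(by_loose)
print(by_strict)
```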
Using the same logic, can you extend this to handle 24-hour time with hours from 0
through 23
? As a challenge, allow for a leading zero, at least through to 09:59
. Try building your solution, and then turn the page to check mine.
Regular Expression Nomenclature
Regex
As you might guess, using the full phrase “regular expression” can get a bit tiring, particularly in writing. Instead, I normally use “regex.” It just rolls right off the tongue (it rhymes with “FedEx,” with a hard g sound like “regular” and not a soft one like in “Regina”) and it is amenable to a variety of uses like “when you regex…,” “budding regexers,” and even “regexification.”† I use the phrase “regex engine” to refer to the part of a program that actually does the work of carrying out a match attempt.
Matching
When I say a regex “matches” a string, I really mean that it matches in a string. Technically, the regex ⌈a⌋ doesn’t match cat, but matches the a in cat. It’s not something that people tend to confuse, but it’s still worthy of mention.
Metacharacter
Whether a character is a metacharacter (or “metasequence”; I use the words interchangeably) depends on exactly where in the regex it’s used. For example, ⌈*⌋ is a metacharacter, but only when it’s not within a character class and when not escaped. “Escaped” means that it has a backslash in front of it, usually. The star is escaped in ⌈\*⌋, but not in ⌈\\*⌋ (where the first backslash escapes the second), although the star “has a backslash in front of it” in both examples.
Depending upon the regex flavor, there are various situations when certain characters are and aren’t metacharacters. Chapter 3 discusses this in more detail.
Flavor
As I’ve hinted, different tools use regular expressions for many different things, and the set of metacharacters and other features that each support can differ. Let’s look at word boundaries again as an example. Some versions of egrep support the ⌈\<···\>⌋ notation we’ve seen. However, some do not support the separate word-start and word-end, but one catch-all ⌈\b⌋ metacharacter (which we haven’t seen yet; we’ll see it in the next chapter). Still others support both, and many others support neither.
I use the term “flavor” to describe the sum total of all these little implementation decisions. In the language analogy, it’s the same as a dialect of an individual speaker. Superficially, this concept refers to which metacharacters are and aren’t supported, but there’s much more to it. Even if two programs both support ⌈\<···\>⌋, they might disagree on exactly what they do and don’t consider to be a word. This concern is important when you use the tool.
Don’t confuse “flavor” with “tool.” Just as two people can speak the same dialect, two completely different programs can support exactly the same regex flavor. Also, two programs with the same name (and built to do the same task) often have slightly (and sometimes not-so-slightly) different flavors. Among the various programs called egrep, there is a wide variety of regex flavors supported.
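To make the idea of flavor concrete, here is a small sketch using Java's `java.util.regex` as one example flavor (my choice for illustration, not a tool discussed above). It supports the catch-all ⌈\b⌋ but has no separate word-start and word-end anchors: a backslash before a non-alphabetic character such as `<` simply makes that character literal. The class name `FlavorDemo` is hypothetical.

```java
import java.util.regex.Pattern;

class FlavorDemo {
    // True if the regex matches somewhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // \b is the catch-all word boundary in this flavor.
        System.out.println(found("\\bcat\\b", "a cat sat"));   // true
        System.out.println(found("\\bcat\\b", "concatenate")); // false

        // \< and \> are not word anchors here; they match literal < and >.
        System.out.println(found("\\<cat\\>", "a cat sat"));   // false
        System.out.println(found("\\<cat\\>", "a <cat> sat")); // true
    }
}
```

The same regex text, ⌈\<cat\>⌋, therefore means “the word cat” in some egrep versions and “the literal text <cat>” in this flavor.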
In the late 1990s, the particularly expressive flavor offered by the Perl programming language was widely recognized for its power, and soon other languages were offering Perl-inspired regular expressions (many even acknowledging the inspirational source by labeling themselves “Perl-compatible”). The adopters include PHP, Python, many Java regex packages, Microsoft’s .NET Framework, Tcl, and a variety of C libraries, to name a few. Yet, all are different in important respects. On top of this, Perl’s regular expressions themselves are evolving and growing (sometimes, now, in response to advances seen with other tools). As always, the overall landscape continues to become more varied and confusing.
Subexpression
The term “subexpression” simply refers to part of a larger expression, although it often refers to some part of an expression within parentheses, or to an alternative of ⌈|⌋. For example, with ⌈^(Subject|Date):•⌋, the ⌈Subject|Date⌋ is usually referred to as a subexpression. Within that, the alternatives ⌈Subject⌋ and ⌈Date⌋ are each referred to as subexpressions as well. But technically, ⌈S⌋ is a subexpression, as is ⌈u⌋, and ⌈b⌋, and ⌈j⌋, …
Something such as 1-6 isn’t considered a subexpression of ⌈H[1-6]•*⌋, since the ‘1-6’ is part of an unbreakable “unit,” the character class. But, ⌈H⌋, ⌈[1-6]⌋, and ⌈•*⌋ are all subexpressions of ⌈H[1-6]•*⌋.
Unlike alternation, quantifiers (star, plus, and question mark) always work with the smallest immediately-preceding subexpression. This is why with ⌈mis+pell⌋, the + governs the ⌈s⌋, not the ⌈mis⌋ or ⌈is⌋. Of course, when what immediately precedes a quantifier is a parenthesized subexpression, the entire subexpression (no matter how complex) is taken as one unit.
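The scope of a quantifier can be checked with a short Java sketch. The class name `QuantifierScope` and the test strings are my own; the two regexes are the unparenthesized form from the text and a parenthesized variant for contrast.

```java
import java.util.regex.Pattern;

class QuantifierScope {
    // True if the regex matches the entire text.
    static boolean fullMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // Without parentheses, + governs only the s.
        System.out.println(fullMatch("mis+pell", "misspell"));     // true
        System.out.println(fullMatch("mis+pell", "mismispell"));   // false

        // With parentheses, + governs the whole group mis.
        System.out.println(fullMatch("(mis)+pell", "mismispell")); // true
        System.out.println(fullMatch("(mis)+pell", "misspell"));   // false
    }
}
```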
Character
The word “character” can be a loaded term in computing. The character that a byte represents is merely a matter of interpretation. A byte with such-and-such a value has that same value in any context in which you might wish to consider it, but which character that value represents depends on the encoding in which it’s viewed. As a concrete example, two bytes with decimal values 64 and 53 represent the characters “@” and “5” respectively, if considered in the ASCII encoding, yet on the other hand are completely different if considered in the EBCDIC encoding (they are a space and some kind of a control character).
On the third hand, if those two bytes are considered in one of the popular encodings for Japanese characters, together they represent the single character . Yet, to represent this same character in another of the Japanese encodings requires two completely different bytes. Those two different bytes, by the way, yield the two characters “Àμ” in the popular Latin-1 encoding, but yield the one Korean character in one of the Unicode encodings.† The point is this: how bytes are to be interpreted is a matter of perspective (called an encoding), and to be successful, you’ve got to make sure that your perspective agrees with the perspective taken by the tool you’re using.
Until recently, text-processing tools generally treated their data as a bunch of ASCII bytes, without regard to the encoding you might be intending. Recently, however, more and more systems are using some form of Unicode to process data internally (Chapter 3 includes an introduction to Unicode). On such systems, if the regular-expression subsystem has been implemented properly, the user doesn’t normally have to pay much attention to these issues. That’s a big “if,” which is why Chapter 3 looks at this issue in depth.
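The dependence on encoding is easy to see in Java. The sketch below decodes the same two bytes (decimal 64 and 53) twice. As an assumption for portability, it uses UTF-16BE as the second encoding rather than EBCDIC or a Japanese encoding, because the `StandardCharsets` constants are guaranteed to exist on every Java platform. The class name `ByteInterpretation` is hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteInterpretation {
    // Interprets the given bytes as text in the given encoding.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] bytes = {64, 53};

        String ascii = decode(bytes, StandardCharsets.US_ASCII);
        String utf16 = decode(bytes, StandardCharsets.UTF_16BE);

        System.out.println(ascii);          // @5
        System.out.println(ascii.length()); // 2: two characters
        System.out.println(utf16.length()); // 1: a single character
        System.out.printf("U+%04X%n", (int) utf16.charAt(0)); // U+4035
    }
}
```

The bytes never change; only the perspective does. A regex that expects “@5” will find nothing if the tool has decoded the data as a single character.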
Improving on the Status Quo
When it comes down to it, regular expressions are not difficult. But, if you talk to the average user of a program or language that supports them, you will likely find someone that understands them “a bit,” but does not feel secure enough to really use them for anything complex or with any tool but those they use most often.
Traditionally, regular expression documentation tends to be limited to a short and incomplete description of one or two metacharacters, followed by a table of the rest. Examples often use meaningless regular expressions like ⌈a*((ab)*|b*)⌋, and text like ‘a•xxx•ce•xxxxxx•ci•xxx•d’. They also tend to completely ignore subtle but important points, and often claim that their flavor is the same as some other well-known tool, almost always forgetting to mention the exceptions where they inevitably differ. The state of regex documentation needs help.
Now, I don’t mean to imply that this chapter fills the gap for all regular expressions, or even for egrep regular expressions. Rather, this chapter merely provides the foundation upon which the rest of this book is built. It may be ambitious, but I hope this book does fill the gaps for you. I received many gratifying responses to the first edition, and have worked very hard to make this one even better, both in breadth and in depth.
Perhaps because regular-expression documentation has traditionally been so lacking, I feel the need to make the extra effort to make things particularly clear. Because I want to make sure you can use regular expressions to their fullest potential, I want to make sure you really, really understand them.
This is both good and bad.
It is good because you will learn how to think regular expressions. You will learn which differences and peculiarities to watch out for when faced with a new tool with a different flavor. You will know how to express yourself even with a weak, stripped-down regular expression flavor. You will understand what makes one expression more efficient than another, and will be able to balance tradeoffs among complexity, efficiency, and match results. When faced with a particularly complex task, you will know how to work through an expression the way the program would, constructing it as you go. In short, you will be comfortable using regular expressions to their fullest.
The problem is that the learning curve of this method can be rather steep, with three separate issues to tackle:
- How regular expressions are used: Most programs use regular expressions in ways that are more complex than egrep. Before we can discuss in detail how to write a really useful expression, we need to look at the ways regular expressions can be used. We start in the next chapter.
- Regular expression features: Selecting the proper tool to use when faced with a problem seems to be half the battle, so I don’t want to limit myself to only using one utility throughout this book. Different programs, and often even different versions of the same program, provide different features and metacharacters. We must survey the field before getting into the details of using them. This is the subject of Chapter 3.
- How regular expressions really work: Before we can learn from useful (but often complex) examples, we need to “look under the hood” to understand just how a regular expression search is conducted. As we’ll see, the order in which certain metacharacters are checked can be very important. In fact, regular expression engines can be implemented in different ways, so different programs sometimes do different things with the same expression. We examine this meaty subject in Chapters 4, 5, and 6.
This last point is the most important and the most difficult to address. The discussion is unfortunately sometimes a bit dry, with the reader chomping at the bit to get to the fun part — tackling real problems. However, understanding how the regex engine really works is the key to really understanding.
You might argue that you don’t want to be taught how a car works when you simply want to know how to drive. But, learning to drive a car is a poor analogy for learning about regular expressions. My goal is to teach you how to solve problems with regular expressions, and that means constructing regular expressions. The better analogy is not how to drive a car, but how to build one. Before you can build a car, you have to know how it works.
Chapter 2 gives more experience with driving. Chapter 3 takes a short look at the history of driving, and a detailed look at the bodywork of a regex flavor. Chapter 4 looks at the all-important engine of a regex flavor. Chapter 5 shows some extended examples, Chapter 6 shows you how to tune up certain kinds of engines, and the chapters after that examine some specific makes and models. Particularly in Chapters 4, 5, and 6, we’ll spend a lot of time under the hood, so make sure to have your coveralls and shop rags handy.
Summary
Table 1-3 summarizes the egrep metacharacters we’ve looked at in this chapter.
Table 1-3: Egrep Metacharacter Summary

Items to Match a Single Character
Metacharacter | Name | Matches |
---|---|---|
. | dot | Matches any one character |
[···] | character class | Matches any one character listed |
[^···] | negated character class | Matches any one character not listed |
\char | escaped character | When char is a metacharacter, or the escaped combination is not otherwise special, matches the literal char |

Items Appended to Provide “Counting”: The Quantifiers
Metacharacter | Name | Matches |
---|---|---|
? | question | One allowed, but it is optional |
* | star | Any number allowed, but all are optional |
+ | plus | At least one required; additional are optional |
{min,max} | specified range† | Min required, max allowed |

Items That Match a Position
Metacharacter | Name | Matches |
---|---|---|
^ | caret | Matches the position at the start of the line |
$ | dollar | Matches the position at the end of the line |
\< | word boundary† | Matches the position at the start of a word |
\> | word boundary† | Matches the position at the end of a word |

Other
Metacharacter | Name | Matches |
---|---|---|
(vertical bar) | alternation | Matches either expression it separates |
(···) | parentheses | Limits scope of alternation, provides grouping for the quantifiers, and “captures” for backreferences |
\1, \2, … | backreference† | Matches text previously matched within first, second, etc., set of parentheses |

†not supported by all versions of egrep
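Several of these metacharacters can be seen working together in a doubled-word check. The sketch below is written for Java's `java.util.regex` rather than egrep, so it substitutes ⌈\b⌋ for the ⌈\<⌋ and ⌈\>⌋ word anchors; the class name `DoubledWord` and the sample sentences are my own.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoubledWord {
    // \b ... \b : word boundaries (standing in for egrep's \< and \>)
    // ([a-z]+)  : parentheses capture a run of lowercase letters
    //  +        : plus requires one or more spaces
    // \1        : backreference to the text the parentheses captured
    private static final Pattern DOUBLED = Pattern.compile("\\b([a-z]+) +\\1\\b");

    // Returns the first doubled word pair found, or null if there is none.
    static String firstDoubledWord(String text) {
        Matcher m = DOUBLED.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstDoubledWord("this is the the theory")); // the the
        System.out.println(firstDoubledWord("the theory"));             // null
    }
}
```

The trailing word boundary is what stops “the theory” from being reported: the backreference matches the first three letters of “theory”, but no word ends there.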
In addition, be sure that you understand the following points:
- Not all egrep programs are the same. The metacharacters supported, as well as their exact meanings, are often different; see your local documentation.
- Three reasons for using parentheses are constraining alternation, grouping, and capturing.
- Character classes are special, and have their own set of metacharacters totally distinct from the “main” regex language.
- Alternation and character classes are fundamentally different, providing unrelated services that appear, in only one limited situation, to overlap.
- A negated character class is still a “positive assertion”: even negated, a character class must match a character to be successful. Because the listing of characters to match is negated, the matched character must be one of those not listed in the class.
- The useful -i option discounts capitalization during a match.
- There are three types of escaped items:
  - The pairing of ⌈\⌋ and a metacharacter is a metasequence to match the literal character (for example, ⌈\*⌋ matches a literal asterisk).
  - The pairing of ⌈\⌋ and selected non-metacharacters becomes a metasequence with an implementation-defined meaning (for example, ⌈\<⌋ often means “start of word”).
  - The pairing of ⌈\⌋ and any other character defaults to simply matching the character (that is, the backslash is ignored).
  Remember, though, that a backslash within a character class is not special at all with most versions of egrep, so it provides no “escape services” in such a situation.
- Items governed by a question mark or star don’t need to actually match any characters to “match successfully.” They are always successful, even if they don’t match anything.
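Two of the points above lend themselves to a quick check: a negated character class must still match a character, and items governed by a star or question mark succeed even when they match nothing. This is a minimal Java sketch; the class name `SummaryPoints` and the sample strings are illustrative assumptions.

```java
import java.util.regex.Pattern;

class SummaryPoints {
    // True if the regex matches somewhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // [^u] must match some character, so a q at the very end cannot match.
        System.out.println(found("q[^u]", "Iraq"));  // false
        System.out.println(found("q[^u]", "Iraqi")); // true

        // x* and x? succeed by matching nothing at all.
        System.out.println(found("x*", "abc"));      // true
        System.out.println(found("x?", "abc"));      // true
    }
}
```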
Personal Glimpses
The doubled-word task at the start of this chapter might seem daunting, yet regular expressions are so powerful that we could solve much of the problem with a tool as limited as egrep, right here in the first chapter. I’d like to fill this chapter with flashy examples, but because I’ve concentrated on the solid foundation for the later chapters, I fear that someone completely new to regular expressions might read this chapter, complete with all the warnings and cautions and rules and such, and feel “why bother?”
My brothers were once teaching some friends how to play schaffkopf, a card game that’s been in my family for generations. It is much more exciting than it appears at first glance, but has a rather steep learning curve. After about half an hour, my sister-in-law Liz, normally the quintessence of patience, got frustrated with the seemingly complex rules and said “Can’t we just play rummy?” Yet, as it turned out, they all ended up playing late into the night, including Liz. Once they were able to get over the initial hump of the learning curve, a first-hand taste of the excitement was all it took to hook them. My brothers knew it would, but it took some time and work to get to the point where Liz and the others new to the game could appreciate what they were getting into.
It might take some time to become acclimated to regular expressions, so until you get a real taste of the excitement by using them to solve your problems, it might all feel just a bit too academic. If so, I hope you will resist the desire to “play rummy.” Once you understand the power that regular expressions provide, the small amount of work spent learning them will feel trivial indeed.
In regex, anchors have zero width. They do not match characters; rather, they match a position: before, after, or between characters.
To match the start or the end of a line, we use the following anchors:
- Caret (^) matches the position before the first character in the string.
- Dollar ($) matches the position right after the last character in the string.
Regex | String | Matches |
---|---|---|
^a | abc | Matches a |
c$ | abc | Matches c |
^[a-zA-Z]+$ | abc | Matches abc |
^[abc]$ | abc | Does not match. The pattern matches only a one-character string: a, b, or c. |
[^abc] | abc | Does not match. The string must contain at least one character other than a, b, or c. |
^[mts][aeiou] | mother | Matches. Searches for words that start with m, t, or s, immediately followed by a vowel. |
[^n]g$ | king, ng | Does not match. The string should end with g, but not ng. |
[^k]g$ | kong | Matches. |
^g.+g$ | gang | Matches. The word must start and end with g, with one or more characters in between. |
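A few of these rows can be verified with `Matcher.find()`. The sketch below is only a spot check; the class name `AnchorExamples` and the extra input `gg` are additions of mine.

```java
import java.util.regex.Pattern;

class AnchorExamples {
    // True if the regex matches somewhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found("^a", "abc"));      // true
        System.out.println(found("c$", "abc"));      // true
        System.out.println(found("[^n]g$", "king")); // false: ends with ng
        System.out.println(found("[^k]g$", "kong")); // true
        System.out.println(found("^g.+g$", "gang")); // true
        System.out.println(found("^g.+g$", "gg"));   // false: .+ needs at least one character
    }
}
```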
2. Regex to Match Start of Line
"^<insertPatternHere>"
- The caret ^ matches the position before the first character in the string.
- Applying ^h to howtodoinjava matches h.
- Applying ^t to howtodoinjava does not match anything because it expects the string to start with t.
- If we have a multi-line string, by default the caret matches only the position before the very first character in the whole string. To match the position before the first character of any line, we must enable the multi-line mode in the regular expression. In this case, the caret changes from matching only at the start of the entire string to matching at the start of any line within the string.
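The effect of multi-line mode on the caret can be sketched as follows, using the `Pattern.MULTILINE` flag. The class name `MultilineCaret`, the helper `countMatches`, and the sample text are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineCaret {
    // Counts how many times the regex matches in the text with the given flags.
    static int countMatches(String regex, int flags, String text) {
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "1st line\n2nd line\nthird line";

        // Default mode: ^ matches only at the start of the whole string.
        System.out.println(countMatches("^\\d", 0, text));                 // 1

        // Multi-line mode: ^ also matches at the start of each line.
        System.out.println(countMatches("^\\d", Pattern.MULTILINE, text)); // 2
    }
}
```

The same flag can also be switched on inside the pattern itself with the embedded flag `(?m)`.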
Description | Matching Pattern |
---|---|
The line starts with a number | “^\d” or “^[0-9]” |
The line starts with a lowercase or an uppercase letter | “^[a-z]” or “^[A-Z]” |
The line starts with a letter (either case) | “^[a-zA-Z]” |
The line starts with a word | “^word” |
The line starts with a special character | “^[!@#\$%\^\&*\)\(+=._-]” |
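// Requires: import java.util.regex.Pattern;
// Inside a Java string literal, each regex backslash must be doubled.
System.out.println(Pattern.compile("^[0-9]").matcher("1stKnight").find());
System.out.println(Pattern.compile("^[a-zA-Z]").matcher("FirstKnight").find());
System.out.println(Pattern.compile("^First").matcher("FirstKnight").find());
System.out.println(Pattern.compile("^[!@#\\$%\\^\\&*\\)\\(+=._-]").matcher("*1stKnight").find());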
Program output.
true
true
true
true
3. Regex to Match End of Line
"<insertPatternHere>$"
- The dollar $ matches the position after the last character in the string.
- Applying a$ to howtodoinjava matches a.
- Applying v$ to howtodoinjava does not match anything because it expects the string to end with v.
- If we have a multi-line string, by default the dollar matches only the position after the very last character in the whole string. To match the position after the last character of any line, we must enable the multi-line mode in the regular expression. In this case, the dollar changes from matching only at the end of the entire string to matching at the end of any line within the string.
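The same experiment for the dollar sign, again as a sketch with made-up names (`MultilineDollar`, `trailingDigits`) and sample text: it collects the digits that sit at the end of a line, first in default mode and then with `Pattern.MULTILINE`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineDollar {
    // Returns every run of digits that is immediately followed by a $ position.
    static List<String> trailingDigits(String text, int flags) {
        Matcher m = Pattern.compile("\\d+$", flags).matcher(text);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "room 101\nfloor 2\nexit";

        // Default mode: $ matches only at the end of the whole string.
        System.out.println(trailingDigits(text, 0));                 // []

        // Multi-line mode: $ also matches at the end of each line.
        System.out.println(trailingDigits(text, Pattern.MULTILINE)); // [101, 2]
    }
}
```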
Description | Matching Pattern |
---|---|
The line ends with a number | “\d$” or “[0-9]$” |
The line ends with a lowercase or an uppercase letter | “[a-z]$” or “[A-Z]$” |
The line ends with a letter (either case) | “[a-zA-Z]$” |
The line ends with a word | “word$” |
The line ends with a special character | “[!@#\$%\^\&*\)\(+=._-]$” |
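// Requires: import java.util.regex.Pattern;
// Inside a Java string literal, each regex backslash must be doubled.
System.out.println(Pattern.compile("[0-9]$").matcher("FirstKnight123").find());
System.out.println(Pattern.compile("[a-zA-Z]$").matcher("FirstKnight").find());
System.out.println(Pattern.compile("Knight$").matcher("FirstKnight").find());
System.out.println(Pattern.compile("[!@#\\$%\\^\\&*\\)\\(+=._-]$")
    .matcher("FirstKnight&").find());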
Program output.
true
true
true
true