5.4. Find All Except a Specific Word
Problem
You want to use a regular expression to match any complete word except cat. Catwoman, vindicate, and other words that merely contain the letters “cat” should be matched, just not cat.
Solution
A negative lookahead can help you rule out specific words, and is
key to this next regex:
\b(?!cat\b)\w+
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Discussion
Although a negated character class (written as ‹[^⋯]›) makes it easy to match anything except a specific character, you can’t just write ‹[^cat]› to match anything except the word cat. ‹[^cat]› is a valid regex, but it matches any character except c, a, or t. Hence, although ‹\b[^cat]+\b› would avoid matching the word cat, it wouldn’t match the word time either, because it contains the forbidden letter t. The regular expression ‹\b[^c][^a][^t]\w*› is no good either, because it would reject any word with c as its first letter, a as its second letter, or t as its third. Furthermore, that doesn’t restrict the first three letters to word characters, and it only matches words with at least three characters since none of the negated character classes are optional.
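The failure of the negated character class can be checked with a short Python sketch. The word list and variable names are illustrative assumptions, and the pattern is written with explicit `\b` escapes.

```python
import re

# The tempting but wrong pattern: a negated class excludes single characters,
# not the three-letter sequence "cat".
wrong = re.compile(r"\b[^cat]+\b")

# fullmatch() asks whether the entire word fits the pattern.
results = {word: bool(wrong.fullmatch(word)) for word in ["cat", "time", "dog"]}
print(results)  # {'cat': False, 'time': False, 'dog': True}
```

The word time is rejected along with cat, which is exactly the problem the discussion describes.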
With all that in mind, let’s take another look at how the regular
expression shown at the beginning of this recipe solved the
problem:
\b      # Assert position at a word boundary.
(?!     # Not followed by:
  cat   #   Match "cat".
  \b    #   Assert position at a word boundary.
)       # End the negative lookahead.
\w+ ...
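As a quick check of the recipe, here is a minimal Python sketch that applies the solution regex (with its backslashes written out, and the case-insensitive option enabled as the recipe specifies). The sample sentence is an assumption made up for illustration.

```python
import re

# Solution regex from this recipe, case insensitive.
word_except_cat = re.compile(r"\b(?!cat\b)\w+", re.IGNORECASE)

text = "The cat Catwoman will vindicate cats"
matches = word_except_cat.findall(text)
print(matches)  # ['The', 'Catwoman', 'will', 'vindicate', 'cats']
```

Only the standalone word cat is skipped; words that merely start with or contain those letters still match.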
Match string not containing string
Given a list of strings (words or other characters), only return the strings that do not match.
Cheat Sheet
Character classes | |
---|---|
. | any character except newline |
\w \d \s | word, digit, whitespace |
\W \D \S | not word, digit, whitespace |
[abc] | any of a, b, or c |
[^abc] | not a, b, or c |
[a-g] | character between a & g |
Anchors | |
^abc$ | start / end of the string |
\b | word boundary |
Escaped characters | |
\. \* \\ | escaped special characters |
\t \n \r | tab, linefeed, carriage return |
\u00A9 | unicode escaped © |
Groups & Lookaround | |
(abc) | capture group |
\1 | backreference to group #1 |
(?:abc) | non-capturing group |
(?=abc) | positive lookahead |
(?!abc) | negative lookahead |
Quantifiers & Alternation | |
a* a+ a? | 0 or more, 1 or more, 0 or 1 |
a{5} a{2,} | exactly five, two or more |
a{1,3} | between one & three |
a+? a{2,}? | match as few as possible |
ab|cd | match ab or cd |
The claim that regex doesn’t support inverse matching is not entirely true. You can mimic this behavior by using negative look-arounds:
^((?!hede).)*$
The regex above will match any string, or line without a line break, not containing the (sub)string ‘hede’. As mentioned, this is not something regex is “good” at (or should do), but still, it is possible.
Explanation
A string is just a list of n characters. Before and after each character, there’s an empty string. So a list of n characters will have n+1 empty strings. Consider the string "ABhedeCD":
    +--+---+--+---+--+---+--+---+--+---+--+---+--+---+--+---+--+
S = |e1| A |e2| B |e3| h |e4| e |e5| d |e6| e |e7| C |e8| D |e9|
    +--+---+--+---+--+---+--+---+--+---+--+---+--+---+--+---+--+
index    0      1      2      3      4      5      6      7
where the e’s are the empty strings. The regex (?!hede). looks ahead to see if there’s no substring "hede" to be seen, and if that is the case (so something else is seen), then the . (dot) will match any character except a line break. Look-arounds are also called zero-width assertions because they don’t consume any characters. They only assert/validate something.
So, in my example, every empty string is first validated to see if there’s no "hede" up ahead, before a character is consumed by the . (dot). The regex (?!hede). will do that only once, so it is wrapped in a group, and repeated zero or more times: ((?!hede).)*. Finally, the start- and end-of-input are anchored to make sure the entire input is consumed: ^((?!hede).)*$
As you can see, the input "ABhedeCD" will fail because on e3, the regex (?!hede) fails (there is "hede" up ahead!).
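The explanation above can be tried out with a small Python sketch. It assumes the anchored pattern ^((?!hede).)*$ that the explanation builds up; the test strings other than "ABhedeCD" are made up for illustration.

```python
import re

# Each character is consumed only if "hede" does not start at that position.
no_hede = re.compile(r"^((?!hede).)*$")

candidates = ["ABhedeCD", "ABheCD", "hede", ""]
accepted = [s for s in candidates if no_hede.match(s)]
print(accepted)  # ['ABheCD', '']
```

The empty string is accepted because the group may repeat zero times. When a plain substring test is available in the host language (for example `"hede" not in s` in Python), it is usually the simpler choice.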
Jan 18 at 15:49 community-wiki Thanks to Bart Kiers
A regular expression is a group of characters or symbols used to find a specific pattern in a text. What follows is a kind of cheat sheet for learning regular expressions.
A regular expression is a pattern that is matched against a subject string from left to right. The term «regular expression» is a mouthful, so you will usually find it abbreviated as «regex» or «regexp». Regular expressions are used for replacing text within a string, validating forms, extracting a substring from a string based on a pattern match, and much more.
Imagine you are writing an application and you want to set the rules for choosing a username. We want to allow the username to contain letters, numbers, underscores and hyphens. We also want to limit the number of characters in the username so it does not look ugly.
We use the following regular expression to validate a username:
The above regular expression accepts the strings john_doe, jo-hn_doe and john12_as. It does not match Jo because that string contains an uppercase letter and is also too short.
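The username pattern itself did not survive in the text here, so the sketch below uses a hypothetical one that fits the description: lowercase letters, digits, underscores and hyphens, with an assumed length limit of 3 to 15 characters. The limits and the names `USERNAME` and `candidates` are assumptions, not taken from the original.

```python
import re

# Hypothetical pattern consistent with the description above;
# the bounds 3 and 15 are assumed for illustration.
USERNAME = re.compile(r"^[a-z0-9_-]{3,15}$")

candidates = ["john_doe", "jo-hn_doe", "john12_as", "Jo"]
valid = [name for name in candidates if USERNAME.match(name)]
print(valid)  # ['john_doe', 'jo-hn_doe', 'john12_as']
```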
Table of Contents
- Basic Matchers
- Meta character
- Full stop
- Character set
- Negated character set
- Repetitions
- The Star
- The Plus
- The Question Mark
- Braces
- Character Group
- Alternation
- Escaping special character
- Anchors
- Caret
- Dollar
- Shorthand Character Sets
- Lookaround
- Positive Lookahead
- Negative Lookahead
- Positive Lookbehind
- Negative Lookbehind
- Flags
- Case Insensitive
- Global search
- Multiline
- Bonus
1. Basic Matchers
A regular expression is just a pattern of characters that we use to perform a search in a text. For example, the regular expression cat means: the letter c, followed by the letter a, followed by the letter t.
"cat" => The cat sat on the mat
The regular expression 123 matches the string «123». The regular expression is matched against an input string by comparing each character in the regular expression to each character in the input string, one after another. Regular expressions are normally case-sensitive, so the regular expression Cat would not match the string «cat».
"Cat" => The cat sat on the Cat
2. Meta Characters
Meta characters are the building blocks of regular expressions. Meta characters do not stand for themselves but instead are interpreted in some special way. Some meta characters have a special meaning and are written inside square brackets.
The meta characters are as follows:
Meta character | Description |
---|---|
. | Period matches any single character except a line break. |
[ ] | Character class. Matches any character contained between the square brackets. |
[^ ] | Negated character class. Matches any character that is not contained between the square brackets |
* | Matches 0 or more repetitions of the preceding symbol. |
+ | Matches 1 or more repetitions of the preceding symbol. |
? | Makes the preceding symbol optional. |
{n,m} | Braces. Matches at least «n» but not more than «m» repetitions of the preceding symbol. |
(xyz) | Character group. Matches the characters xyz in that exact order. |
| | Alternation. Matches either the characters before or the characters after the symbol. |
\ | Escapes the next character. This allows you to match the reserved characters [ ] ( ) { } . * + ? ^ $ \ and the vertical bar. |
^ | Matches the beginning of the input. |
$ | Matches the end of the input. |
2.1 Full stop
The full stop . is the simplest example of a meta character. The meta character . matches any single character. It will not match return or newline characters. For example, the regular expression .ar means: any character, followed by the letter a, followed by the letter r.
".ar" => The car parked in the garage.
2.2 Character set
Character sets are also called character classes. Square brackets are used to specify character sets. Use a hyphen inside a character set to specify a range of characters. The order of the characters and ranges inside the square brackets doesn’t matter. For example, the regular expression [Tt]he means: an uppercase T or lowercase t, followed by the letter h, followed by the letter e.
"[Tt]he" => The car parked in the garage.
A period inside a character set, however, means a literal period. The regular expression ar[.] means: a lowercase character a, followed by the letter r, followed by a period . character.
"ar[.]" => A garage is a good place to park a car.
2.2.1 Negated character set
In general, the caret symbol represents the start of the string, but when it is typed after the opening square bracket it negates the character set. For example, the regular expression [^c]ar means: any character except c, followed by the character a, followed by the letter r.
"[^c]ar" => The car parked in the garage.
2.3 Repetitions
The meta characters +, * and ? are used to specify how many times a subpattern can occur. These meta characters act differently in different situations.
2.3.1 The Star
The symbol * matches zero or more repetitions of the preceding matcher. The regular expression a* means: zero or more repetitions of the preceding lowercase character a. But if it appears after a character set or class, then it finds repetitions of the whole character set. For example, the regular expression [a-z]* means: any number of lowercase letters in a row.
"[a-z]*" => The car parked in the garage #21.
The * symbol can be used with the meta character . to match any string of characters: .*. The * symbol can be used with the whitespace character \s to match a string of whitespace characters. For example, the expression \s*cat\s* means: zero or more spaces, followed by a lowercase c, followed by a lowercase a, followed by a lowercase t, followed by zero or more spaces.
"\s*cat\s*" => The fat cat sat on the cat.
2.3.2 The Plus
The symbol + matches one or more repetitions of the preceding character. For example, the regular expression c.+t means: a lowercase letter c, followed by at least one character, followed by a lowercase t. Because + is greedy, the match runs up to the last t in the sentence.
"c.+t" => The fat cat sat on the mat.
2.3.3 The Question Mark
In regular expressions, the meta character ? makes the preceding character optional. This symbol matches zero or one instance of the preceding character. For example, the regular expression [T]?he means: an optional uppercase letter T, followed by a lowercase h, followed by a lowercase e.
"[T]he" => The car is parked in the garage.
"[T]?he" => The car is parked in the garage.
2.4 Braces
In regular expressions, braces (also called quantifiers) are used to specify the number of times that a character or a group of characters can be repeated. For example, the regular expression [0-9]{2,3} means: match at least 2 digits but not more than 3 (characters in the range of 0 to 9).
"[0-9]{2,3}" => The number was 9.9997 but we rounded it off to 10.0.
We can leave out the second number. For example, the regular expression [0-9]{2,} means: match 2 or more digits. If we also remove the comma, the regular expression [0-9]{2} means: match exactly 2 digits.
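The three brace forms can be compared side by side with a short Python sketch on the sample sentence used in this section. The variable names are illustrative assumptions.

```python
import re

text = "The number was 9.9997 but we rounded it off to 10.0."

two_to_three = re.findall(r"[0-9]{2,3}", text)  # at least 2, at most 3 digits
two_or_more = re.findall(r"[0-9]{2,}", text)    # 2 or more digits
exactly_two = re.findall(r"[0-9]{2}", text)     # exactly 2 digits

print(two_to_three)  # ['999', '10']
print(two_or_more)   # ['9997', '10']
print(exactly_two)   # ['99', '97', '10']
```

Single digits such as the leading 9 and the trailing 0 never match, because every form requires at least two digits in a row.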
"[0-9]{2,}" => The number was 9.9997 but we rounded it off to 10.0.
"[0-9]{2}" => The number was 9.9997 but we rounded it off to 10.0.
2.5 Character Group
A character group is a group of sub-patterns written inside parentheses (...). As discussed before, a quantifier placed after a character repeats that character. But if we put a quantifier after a character group, it repeats the whole group. For example, the regular expression (ab)* matches zero or more repetitions of the string «ab». We can also use the alternation | meta character inside a character group. For example, the regular expression (c|g|p)ar means: a lowercase c, g or p, followed by the character a, followed by the character r.
"(c|g|p)ar" => The car is parked in the garage.
2.6 Alternation
In regular expressions, the vertical bar | is used to define alternation. Alternation is like a condition between multiple expressions. Now, you may be thinking that character sets and alternation work the same way. But the big difference is that a character set works at the character level, while alternation works at the expression level. For example, the regular expression (T|t)he|car means: either an uppercase T or lowercase t, followed by a lowercase h, followed by a lowercase e; or a lowercase c, followed by a lowercase a, followed by a lowercase r.
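A minimal Python sketch of this alternation, using the sample sentence from this section. `finditer` with `group(0)` is used so that the full match is reported rather than only the contents of the capturing group; the variable names are illustrative.

```python
import re

text = "The car is parked in the garage."

# Alternation works on whole expressions: "(T|t)he" OR "car".
found = [m.group(0) for m in re.finditer(r"(T|t)he|car", text)]
print(found)  # ['The', 'car', 'the']

# A character set gives the same result for the single-character choice.
found_with_set = [m.group(0) for m in re.finditer(r"[Tt]he|car", text)]
```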
"(T|t)he|car" => The car is parked in the garage.
2.7 Escaping special character
A backslash \ is used in regular expressions to escape the next character. This allows us to specify a symbol as a matching character, including the reserved characters { } [ ] / \ + * . $ ^ | ?. To use a special character as a matching character, prepend \ before it.
For example, the regular expression . is used to match any character except a newline. To match a literal . in an input string, the regular expression (f|c|m)at\.? means: a lowercase letter f, c or m, followed by a lowercase a, followed by a lowercase t, followed by an optional . character.
"(f|c|m)at\.?" => The fat cat sat on the mat.
2.8 Anchors
In regular expressions, we use anchors to check whether the matching symbol is the first or the last symbol of the input string. Anchors are of two types: the caret ^ checks whether the matching character is the first character of the input, and the dollar $ checks whether the matching character is the last character of the input string.
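The effect of both anchors can be seen in a small Python sketch. The sentences are the samples used in this tutorial, the backslash in `at\.` is written out explicitly, and the variable names are illustrative assumptions.

```python
import re

sentence = "The car is parked in the garage."

# Without the caret the pattern matches anywhere; with it, only at the start.
anywhere = [m.group(0) for m in re.finditer(r"(T|t)he", sentence)]
at_start = [m.group(0) for m in re.finditer(r"^(T|t)he", sentence)]
print(anywhere, at_start)  # ['The', 'the'] ['The']

dotted = "The fat cat. sat. on the mat."

# Without the dollar every "at." matches; with it, only the one at the end.
unanchored_count = len(re.findall(r"at\.", dotted))
anchored_count = len(re.findall(r"at\.$", dotted))
print(unanchored_count, anchored_count)  # 3 1
```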
2.8.1 Caret
The caret ^ symbol is used to check whether the matching character is the first character of the input string. If we apply the regular expression ^a (is a the starting symbol?) to the input string abc, it matches a. But if we apply the regular expression ^b to the same input string, it does not match anything, because in the input string abc, «b» is not the starting symbol. Let’s take a look at another regular expression, ^(T|t)he, which means: an uppercase T or lowercase t is the start symbol of the input string, followed by a lowercase h, followed by a lowercase e.
"(T|t)he" => The car is parked in the garage.
"^(T|t)he" => The car is parked in the garage.
2.8.2 Dollar
The dollar $ symbol is used to check whether the matching character is the last character of the input string. For example, the regular expression (at\.)$ means: a lowercase a, followed by a lowercase t, followed by a . character, and the match must be at the end of the string.
"(at\.)" => The fat cat. sat. on the mat.
"(at\.)$" => The fat cat sat on the mat.
3. Shorthand Character Sets
Regular expressions provide shorthands for commonly used character sets. The shorthand character sets are as follows:
Shorthand | Description |
---|---|
. | Any character except new line |
\w | Matches alphanumeric characters: [a-zA-Z0-9_] |
\W | Matches non-alphanumeric characters: [^\w] |
\d | Matches digit: [0-9] |
\D | Matches non-digit: [^\d] |
\s | Matches whitespace character: [\t\n\f\r\p{Z}] |
\S | Matches non-whitespace character: [^\s] |
4. Lookaround
Lookbehind and lookahead (together called lookaround) are specific types of non-capturing groups: they are used to match a pattern, but are not included in the match. Lookarounds are used when we have the condition that a pattern is preceded or followed by another certain pattern. For example, suppose we want to get all numbers that are preceded by the $ character from the input string $4.44 and $10.88. We will use the regular expression (?<=\$)[0-9\.]*, which means: get all the numbers, possibly containing the . character, that are preceded by the $ character. The following lookarounds are used in regular expressions:
Symbol | Description |
---|---|
?= | Positive Lookahead |
?! | Negative Lookahead |
?<= | Positive Lookbehind |
?<! | Negative Lookbehind |
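The four lookarounds in the table can be tried out with a minimal Python sketch. The patterns follow the examples of this section with their backslashes written out, except that `[Tt]` replaces the group `(T|t)` so that `findall` returns whole matches; that substitution and the variable names are assumptions made for illustration.

```python
import re

# Positive lookbehind: numbers preceded by a dollar sign.
prices = re.findall(r"(?<=\$)[0-9.]*", "$4.44 and $10.88")
print(prices)  # ['4.44', '10.88']

sentence = "The fat cat sat on the mat."

# Positive lookahead: "The"/"the" only when followed by " fat".
followed = re.findall(r"[Tt]he(?=\sfat)", sentence)      # ['The']
# Negative lookahead: "The"/"the" only when NOT followed by " fat".
not_followed = re.findall(r"[Tt]he(?!\sfat)", sentence)  # ['the']
# Positive lookbehind: "fat"/"mat" only when preceded by "The " or "the ".
preceded = re.findall(r"(?<=[Tt]he\s)(?:fat|mat)", sentence)  # ['fat', 'mat']

# Negative lookbehind: "cat" only when NOT preceded by "The " or "the ".
starts = [m.start() for m in re.finditer(r"(?<![Tt]he\s)cat", "The cat sat on cat.")]
print(starts)  # [15], i.e. only the second "cat"
```

In every case the lookaround text is checked but not included in the returned match.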
4.1 Positive Lookahead
The positive lookahead asserts that the first part of the expression must be followed by the lookahead expression. The returned match only contains the text that is matched by the first part of the expression. To define a positive lookahead, parentheses are used, and within those parentheses a question mark with an equal sign is used, like this: (?=...). The lookahead expression is written after the equal sign inside the parentheses. For example, the regular expression (T|t)he(?=\sfat) means: match a lowercase letter t or uppercase letter T, followed by the letter h, followed by the letter e. In the parentheses we define a positive lookahead, which tells the regular expression engine to match The or the only when it is followed by a space and the word fat.
"(T|t)he(?=\sfat)" => The fat cat sat on the mat.
4.2 Negative Lookahead
A negative lookahead is used when we need to get all matches from the input string that are not followed by a pattern. A negative lookahead is defined the same way as a positive lookahead, the only difference being that instead of the equal = character we use the negation ! character, i.e. (?!...). Let’s take a look at the regular expression (T|t)he(?!\sfat), which means: get all The or the words from the input string that are not followed by a space character and the word fat.
"(T|t)he(?!\sfat)" => The fat cat sat on the mat.
4.3 Positive Lookbehind
A positive lookbehind is used to get all the matches that are preceded by a specific pattern. A positive lookbehind is denoted by (?<=...). For example, the regular expression (?<=(T|t)he\s)(fat|mat) means: get all fat or mat words from the input string that come after the word The or the.
"(?<=(T|t)he\s)(fat|mat)" => The fat cat sat on the mat.
4.4 Negative Lookbehind
A negative lookbehind is used to get all the matches that are not preceded by a specific pattern. A negative lookbehind is denoted by (?<!...). For example, the regular expression (?<!(T|t)he\s)(cat) means: get all cat words from the input string that do not come after the word The or the.
"(?<!(T|t)he\s)(cat)" => The cat sat on cat.
5. Flags
Flags are also called modifiers because they modify the output of a regular expression. These flags can be used in any order or
combination, and are an integral part of the RegExp.
Flag | Description |
---|---|
i | Case insensitive: Sets matching to be case-insensitive. |
g | Global Search: Search for a pattern throughout the input string. |
m | Multiline: Anchor meta character works on each line. |
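Python's `re` module has no `g` flag, so the sketch below is an approximation under that assumption: `re.search` stops at the first match, while `re.findall` and `re.finditer` behave like a global search. The `i` and `m` flags correspond to `re.IGNORECASE` and `re.MULTILINE`. The three-line sample text is made up for illustration.

```python
import re

sentence = "The fat cat sat on the mat."

# Case sensitive versus case insensitive (the i flag).
sensitive = re.findall(r"The", sentence)
insensitive = re.findall(r"The", sentence, re.IGNORECASE)
print(sensitive, insensitive)  # ['The'] ['The', 'the']

# First match only versus all matches (the g flag).
first = re.search(r".at", sentence).group(0)
all_hits = re.findall(r".at", sentence)
print(first, all_hits)  # fat ['fat', 'cat', 'sat', 'mat']

# End-of-input anchor versus end-of-line anchor (the m flag).
lines = "The fat\ncat sat\non the mat."
end_of_input = [m.group(0) for m in re.finditer(r".at(.)?$", lines)]
end_of_lines = [m.group(0) for m in re.finditer(r".at(.)?$", lines, re.MULTILINE)]
print(end_of_input, end_of_lines)  # ['mat.'] ['fat', 'sat', 'mat.']
```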
5.1 Case Insensitive
The i modifier is used to perform case-insensitive matching. For example, the regular expression /The/gi means: an uppercase letter T, followed by a lowercase h, followed by the character e. The i flag at the end of the regular expression tells the engine to ignore case. As you can see, we also provided the g flag because we want to search for the pattern in the whole input string.
"The" => The fat cat sat on the mat.
"/The/gi" => The fat cat sat on the mat.
5.2 Global search
The g modifier is used to perform a global match (find all matches rather than stopping after the first match). For example, the regular expression /.(at)/g means: any character except a newline, followed by a lowercase a, followed by a lowercase t. Because we provided the g flag at the end of the regular expression, it will now find all matches in the input string, not just the first one.
".(at)" => The fat cat sat on the mat.
"/.(at)/g" => The fat cat sat on the mat.
5.3 Multiline
The m modifier is used to perform a multi-line match. As discussed earlier, the anchors (^, $) are used to check whether a pattern is at the beginning or the end of the input string. But if we want the anchors to work on each line, we use the m flag. For example, the regular expression /.at(.)?$/gm means: any character except a newline, followed by a lowercase a, followed by a lowercase t, optionally followed by any character except a newline. And because of the m flag, the regular expression engine now matches the pattern at the end of each line in a string.
"/.at(.)?$/" => The fat cat sat on the mat.
"/.at(.)?$/gm" => The fat cat sat on the mat.
Bonus
- Positive Integers:
^\d+$
- Negative Integers:
^-\d+$
- US Phone Number:
^\+?[\d\s]{3,}$
- US Phone with code:
^\+?[\d\s]+\(?[\d\s]{10,}$
- Integers:
^-?\d+$
- Username:
^[\w\d_.]{4,16}$
- Alpha-numeric characters:
^[a-zA-Z0-9]*$
- Alpha-numeric characters with spaces:
^[a-zA-Z0-9 ]*$
- Password:
^(?=^.{6,}$)((?=.*[A-Za-z0-9])(?=.*[A-Z])(?=.*[a-z]))^.*$
- email:
^([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})*$
- IPv4 address:
^((?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))*$
- Lowercase letters only:
^([a-z])*$
- Uppercase letters only:
^([A-Z])*$
- URL:
^(((http|https|ftp):\/\/)?([[a-zA-Z0-9]\-\.])+(\.)([[a-zA-Z0-9]]){2,4}([[a-zA-Z0-9]\/+=%&_\.~?\-]*))*$
- VISA credit card numbers:
^(4[0-9]{12}(?:[0-9]{3})?)*$
- Date (MM/DD/YYYY):
^(0?[1-9]|1[012])[- /.](0?[1-9]|[12][0-9]|3[01])[- /.](19|20)?[0-9]{2}$
- Date (YYYY/MM/DD):
^(19|20)?[0-9]{2}[- /.](0?[1-9]|1[012])[- /.](0?[1-9]|[12][0-9]|3[01])$
- MasterCard credit card numbers:
^(5[1-5][0-9]{14})*$
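Two of the patterns from this list can be exercised with a short Python sketch. It assumes the separators between the IPv4 octets are meant as escaped literal dots (`\.`); the names `IPV4` and `DATE_MDY` and the test inputs are illustrative.

```python
import re

# IPv4 pattern from the list, with the octet separator written as an escaped dot.
IPV4 = re.compile(
    r"^((?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))*$"
)
# Date (MM/DD/YYYY) pattern from the list.
DATE_MDY = re.compile(
    r"^(0?[1-9]|1[012])[- /.](0?[1-9]|[12][0-9]|3[01])[- /.](19|20)?[0-9]{2}$"
)

ipv4_ok = IPV4.match("192.168.1.1") is not None    # True
ipv4_bad = IPV4.match("256.1.1.1") is not None     # False: 256 is out of range
ipv4_empty = IPV4.match("") is not None            # True: the trailing * allows zero repetitions

date_ok = DATE_MDY.match("12/31/2020") is not None   # True
date_bad = DATE_MDY.match("13/01/2020") is not None  # False: there is no month 13
```

Note that the patterns ending in `)*$` also accept the empty string, so a separate emptiness check may be needed when they are used for validation.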
Last modified: Fri Nov 27 09:44:51 2020
Table of Contents
- What is a Regular Expression?
- The Structure of a Regular Expression
- The Anchor Characters: ^ and $
- Matching a character with a character set
- Match any character with .
- Specifying a Range of Characters with […]
- Exceptions in a character set
- Repeating character sets with *
- Matching a specific number of sets with \{ and \}
- Matching words with \< and \>
- Backreferences - Remembering patterns with \(, \) and \1
- Potential Problems
- Extended Regular Expressions
- POSIX character sets
- Perl Extensions
- Thanks
Regular Expressions and Extended Pattern Matching
Bruce Barnett
Note that this was written in 1991, before Linux. In the
1980’s, it was common for different utilities to support different sets
of regular expression features. ed(1) was different from sed(1),
which was different from vi(1), etc. Note that
Sun went through every utility and forced each one to use one of two
distinct regular expression libraries: regular or extended. I wrote this tutorial for Sun
users, and some of the commands discussed are now obsolete.
On Linux and other UNIX systems, you might find out that some of these
features are not implemented. Your mileage may vary.
Copyright © 1991 Bruce Barnett & General Electric Company
Copyright © 2001, 2008, 2013 Bruce Barnett
All Rights reserved
Original version written in 1994 and published in the Sun Observer
What is a Regular Expression?
A regular expression is a set of characters that specify a pattern.
The term
«regular» has nothing to do with a high-fiber diet. It comes from a term used to
describe grammars and formal languages.
Regular expressions are used when you want to search for specific lines
of text containing a particular pattern.
Most of the UNIX utilities operate on ASCII files a line at a time.
Regular expressions search for patterns on a single line, and not for
patterns that start on one line and end on another.
It is simple to search for a specific word or string of characters.
Almost every editor on every
computer system can do this.
Regular expressions are more powerful and flexible.
You can search for words of a certain size. You can search for a word
with four or more vowels that end with an
«s». Numbers, punctuation characters, you name it, a regular expression can
find it.
What happens once the program you are using finds it is another matter.
Some just search for the pattern. Others print out the line containing
the pattern. Editors can replace the string with a new pattern.
It all depends on the utility.
Regular expressions confuse people because they look a lot like the
file matching patterns the shell uses.
They even act the same way—almost.
The square brackets are similar, and the asterisk acts similar to, but
not identical to the asterisk in a regular expression.
In particular, the Bourne shell, C shell,
find, and
cpio use file name matching patterns and not regular expressions.
Remember that shell
meta-characters are expanded before the shell passes the arguments to
the program.
To prevent this expansion, the special characters in a regular
expression must be quoted when passed as an option from the shell.
You already know how to do this because I covered this topic in last
month’s tutorial.
The Structure of a Regular Expression
There are three important parts to a regular expression.
Anchors are used to specify the position of the pattern in relation to a line of
text.
Character Sets match one or more characters in a single position.
Modifiers specify how many times the previous character set is repeated.
A simple example that demonstrates all three parts is the regular
expression
«^#*». The up arrow is an anchor that indicates the beginning of the line.
The character
«#» is a simple character set that matches the
single character
«#». The asterisk is a modifier.
In a regular expression it specifies that the previous character set
can appear any number of times, including zero.
This is a useless regular expression, as you will see shortly.
There are also two types of regular expressions: the
«Basic» regular expression, and the
«extended» regular expression.
A few utilities like
awk and
egrep use the extended expression.
Most use the
«basic» regular expression.
From now on, if I talk about a
«regular expression,» it describes a feature in both types.
Here is a table of the Solaris (around 1991) commands that allow you to specify regular
expressions:
Utility | Regular Expression Type |
vi | Basic |
sed | Basic |
grep | Basic |
csplit | Basic |
dbx | Basic |
dbxtool | Basic |
more | Basic |
ed | Basic |
expr | Basic |
lex | Basic |
pg | Basic |
nl | Basic |
rdist | Basic |
awk | Extended |
nawk | Extended |
egrep | Extended |
EMACS | EMACS Regular Expressions |
PERL | PERL Regular Expressions |
The Anchor Characters: ^ and $
Most UNIX text facilities are line oriented. Searching for patterns
that span several lines is not easy to do.
You see, the end of line character is not included in the block of
text that is searched.
It is a separator.
Regular expressions examine the text between the separators.
If you want to search for a pattern that is at one end or the other,
you use
anchors. The character
«^» is the starting anchor, and the character
«$» is the end anchor.
The regular expression
«^A» will match all lines that start with a capital A.
The expression
«A$» will match all lines that end with the capital A.
If the anchor characters are not used at the proper end of the
pattern, then they no longer act as anchors.
That is, the
«^» is only an anchor if it is the first character in a regular
expression.
The
«$» is only an anchor if it is the last character.
The expression
«$1» does not have an anchor.
Neither does
«1^». If you need to match a
«^» at the beginning of the line, or a
«$» at the end of a line, you must
escape the special characters with a backslash.
Here is a summary:
Pattern | Matches |
^A | «A» at the beginning of a line |
A$ | «A» at the end of a line |
A^ | «A^» anywhere on a line |
$A | «$A» anywhere on a line |
^^ | «^» at the beginning of a line |
$$ | «$» at the end of a line |
The use of
«^» and
«$» as indicators of the beginning or end of a line is a convention
other utilities use.
The
vi editor uses these two characters as commands to go to the beginning or
end of a line.
The C shell uses
«!^» to specify the first argument of the previous line, and
«!$» is the last argument on the previous line.
It is one of those choices that other utilities go along with to
maintain consistency.
For instance,
«$» can refer to the last line of a file when using
ed and
sed.
Cat -e marks end of lines with a
«$». You might see it in other programs as well.
Matching a character with a character set
The simplest character set is a character.
The regular expression
«the» contains three character sets:
«t,»
«h» and
«e». It will match any line with the string
«the» inside it. This would also match the word
«other». To prevent this, put spaces before and after the pattern:
« the ». You can combine the string with an anchor.
The pattern
«^From: » will match the lines of a mail message that identify the sender.
Use this pattern with grep to print every address in your incoming mail box:
- grep '^From: ' /usr/spool/mail/$USER
Some characters have a special meaning in regular expressions.
If you want to search for such a character, escape it with a backslash.
Match any character with .
The character
«.» is one of those special meta-characters.
By itself it will match any character, except the end-of-line
character.
The pattern that will match a line with a single character is
- ^.$
Specifying a Range of Characters with […]
If you want to match specific characters, you can use the square
brackets to identify the exact characters you are searching for.
The pattern that will match any line of text that contains exactly one
number is
- ^[0123456789]$
This is verbose.
You can use the hyphen between two characters to specify a range:
- ^[0-9]$
You can intermix explicit characters with character ranges.
This pattern will match a single character that is a letter, number,
or underscore:
- [A-Za-z0-9_]
Character sets can be combined by placing them next to each other.
If you wanted to search for a word that
- Started with a capital letter «T».
- Was the first word on a line
- The second letter was a lower case letter
- Was exactly three letters long, and
- The third letter was a vowel
the regular expression would be
«^T[a-z][aeiou] ».
Exceptions in a character set
You can easily search for all characters except those in square
brackets by putting a
«^» as the first character after the
«[». To match all characters except vowels use
«[^aeiou]».
Like the anchors in places that can’t be considered an anchor, the
characters
«]» and
«-» do not have a special meaning if they directly follow
«[». Here are some examples:
Regular Expression | Matches |
[] | The characters «[]» |
[0] | The character «0» |
[0-9] | Any number |
[^0-9] | Any character other than a number |
[-0-9] | Any number or a «-» |
[0-9-] | Any number or a «-» |
[^-0-9] | Any character except a number or a «-» |
[]0-9] | Any number or a «]» |
[0-9]] | Any number followed by a «]» |
[0-9-z] | Any number, or any character between «9» and «z» |
[0-9\-a\]] | Any number, or a «-», a «a», or a «]» |
Repeating character sets with *
The third part of a regular expression is the modifier.
It is used to specify how many times you expect to see the previous
character set. The special character
«*» matches
zero or more copies.
That is, the regular expression
«0*» matches
zero or more zeros, while the expression
«[0-9]*» matches zero or more numbers.
This explains why the pattern
«^#*» is useless, as it matches any number of
«#’s» at the beginning of the line, including
zero. Therefore this will match every line, because every line starts with
zero or more
«#’s».
At first glance, it might seem that starting the count at zero is
stupid.
Not so.
Looking for an unknown number of characters is very important.
Suppose you wanted to look for a number at the beginning of a line,
and there may or may not be spaces before the number.
Just use
«^ *» to match zero or more spaces at the beginning of the line.
If you need to match one or more, just repeat the character set.
That is,
«[0-9]*» matches zero or more numbers, and
«[0-9][0-9]*» matches one or more numbers.
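The difference between zero or more and one or more can be checked with a small Python sketch. Python's `re` module is used here as a stand-in for the UNIX utilities discussed in this tutorial, which is an assumption; the sample lines and names are illustrative.

```python
import re

lines = ["# a comment", "no leading hash", ""]

# "^#*" matches zero or more "#" at the start, so it succeeds on every line.
hash_star_hits = [re.search(r"^#*", line) is not None for line in lines]
print(hash_star_hits)  # [True, True, True]

zero_or_more = re.compile(r"[0-9]*")
one_or_more = re.compile(r"[0-9][0-9]*")

empty_ok_star = zero_or_more.fullmatch("") is not None     # True
empty_ok_repeat = one_or_more.fullmatch("") is not None    # False
number_ok_repeat = one_or_more.fullmatch("2024") is not None  # True

# Optional leading spaces followed by a number at the start of a line.
leading = re.search(r"^ *[0-9][0-9]*", "   42 apples").group(0)
print(repr(leading))  # '   42'
```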
Matching a specific number of sets with \{ and \}
You can continue the above technique if you want to specify a minimum
number of character sets. You cannot specify a maximum number of sets
with the
«*» modifier. There is a special pattern you can use to specify the
minimum and maximum number of repeats.
This is done by putting those two numbers between
«\{» and
«\}». The backslashes deserve a special discussion.
Normally a backslash
turns off the special meaning for a character.
A period is matched by a
«\.» and an asterisk is matched by a
«\*».
If a backslash is placed before a
«<,»
«>,»
«{,»
«},»
«(,»
«),» or before a digit, the backslash
turns on a special meaning.
This was done because these special functions were added late in the
life of regular expressions.
Changing the meaning of
«{» would have broken old expressions. This is a horrible crime punishable
by a year of hard labor writing COBOL programs.
Instead, adding a backslash added functionality without breaking old
programs. Rather than complain about the asymmetry, view it as evolution.
Having convinced you that
«\{» isn’t a plot to confuse you, an example is in order. The regular
expression to match 4, 5, 6, 7 or 8 lower case letters is
- [a-z]\{4,8\}
Any numbers between 0 and 255 can be used.
The second number may be omitted, which removes the upper limit.
If the comma and the second number are omitted, the pattern must be
duplicated the exact number of times specified by the first number.
You must remember that modifiers like
«*» and
«\{1,5\}» only act as modifiers if they follow a character set.
If they were at the beginning of a pattern, they would not be modifiers.
Here is a list of examples, and the exceptions:
Regular Expression | Matches |
* | Any line with an asterisk |
\* | Any line with an asterisk |
\\ | Any line with a backslash |
^* | Any line starting with an asterisk |
^A* | Any line |
^A\* | Any line starting with an «A*» |
^AA* | Any line starting with at least one «A» |
^AA*B | Any line starting with one or more «A»’s followed by a «B» |
^A\{4,8\}B | Any line starting with 4, 5, 6, 7 or 8 «A»’s followed by a «B» |
^A\{4,\}B | Any line starting with 4 or more «A»’s followed by a «B» |
^A\{4\}B | Any line starting with «AAAAB» |
\{4,8\} | Any line with «{4,8}» |
A{4,8} | Any line with «A{4,8}» |
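The counted rows of this table can be tried out with a short sketch. I am using Python here as an assumption for illustration, and its syntax is the mirror image of the basic expressions: plain `{4,8}` is the modifier, and the backslashed `\{4,8\}` is the literal text. The helper name `starts_with` is invented.

```python
import re

# Python writes the basic-expression \{4,8\} as {4,8}, without backslashes.
four_to_eight = re.compile(r"^A{4,8}B")
four_or_more = re.compile(r"^A{4,}B")
exactly_four = re.compile(r"^A{4}B")

# In Python the backslashed braces are the ordinary characters.
literal_braces = re.compile(r"A\{4,8\}")


def starts_with(pattern, line):
    """Return True when the pattern is found in the line."""
    return pattern.search(line) is not None
```

A line of nine «A»’s followed by a «B» is rejected by `four_to_eight` but accepted by `four_or_more`, which is the practical difference between giving and omitting the upper limit.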
Matching words with \< and \>
Searching for a word isn’t quite as simple as it at first appears.
The string
«the» will match the word
«other». You can put spaces before and after the letters and use this regular
expression:
« the ». However, this does not match words at the beginning or end of the line.
And it does not match the case where there is a punctuation mark
after the word.
There is an easy solution.
The characters
«\<» and
«\>» are similar to the
«^» and
«$» anchors,
as they don’t occupy a position of a character.
They do
«anchor» the expression between them so that it only matches on a word boundary.
The pattern to search for the word
«the» would be
«\<[tT]he\>». The character before the
«t» must be either a new line character, or anything except a letter,
number, or underscore.
The character after the
«e» must also be a character other than a number, letter, or underscore
or it could be the end of line character.
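Here is a small sketch of the same idea. It assumes Python, which does not have the `\<` and `\>` anchors; its single `\b` word-boundary anchor stands in for both. The sample strings are invented.

```python
import re

# \b is Python's word-boundary anchor; it plays the role of both \< and \>.
the_word = re.compile(r"\b[tT]he\b")

samples = ["the end", "other", "In the, middle", "breathe", "The"]
hits = [s for s in samples if the_word.search(s)]

# The naive version with spaces misses a word at the start of the line.
naive_miss = re.search(r" the ", "the end") is None
```

The words «other» and «breathe» are skipped because the letters next to «the» are word characters, while the comma after «the» in the third sample counts as a boundary.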
Backreferences: Remembering patterns with \(, \) and \1
Another pattern that requires a special mechanism is searching for
repeated words.
The expression
«[a-z][a-z]» will match any two lower case letters.
If you wanted to search for lines that had two adjoining identical
letters, the above pattern wouldn’t help.
You need a way of remembering what you found, and seeing if
the same pattern occurred again.
You can mark part of a pattern using
«\(» and
«\)». You can recall the remembered pattern with
«\» followed by a single digit.
Therefore, to search for two identical letters, use
«\([a-z]\)\1». You can have 9 different remembered patterns.
Each occurrence of
«\(» starts a new pattern.
The regular expression that would match a 5 letter palindrome,
(e.g. «radar»), would be
- \([a-z]\)\([a-z]\)[a-z]\2\1
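The two backreference patterns above can be sketched in Python, which I am assuming purely for illustration. Note that Python marks groups with plain parentheses rather than `\(` and `\)`, but recalls them with `\1` and `\2` just the same.

```python
import re

# \1 recalls whatever the first parenthesized group matched.
double_letter = re.compile(r"([a-z])\1")

# Five-letter palindrome: two remembered letters, a middle letter,
# then the remembered letters in reverse order.
palindrome5 = re.compile(r"([a-z])([a-z])[a-z]\2\1")

found_double = double_letter.search("letter").group()
is_radar = palindrome5.fullmatch("radar") is not None
is_hello = palindrome5.fullmatch("hello") is not None
```

The word «letter» yields the doubled «tt», «radar» passes the palindrome test, and «hello» does not.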
Potential Problems
That completes the discussion of the basic regular expression.
Before I discuss the extensions the extended expressions offer, I
wanted to mention two potential problem areas.
The
«\<» and
«\>» characters were introduced in the
vi editor. The other programs didn’t have this ability at that time.
Also the
«\{min,max\}» modifier is new and earlier utilities didn’t have this ability.
This made it difficult for the novice user of regular expressions,
because it seemed each utility had a different convention.
Sun has retrofitted the newest regular expression library to all of
their programs, so they all have the same ability.
If you try to use these newer features on other vendors’ machines, you
might find they don’t work the same way.
The other potential point of confusion is the extent of the pattern
matches. Regular expressions match the longest possible pattern.
That is, the regular expression
- A.*B
matches
«AAB» as well as
«AAAABBBBABCCCCBBBAAAB». This doesn’t cause many problems using
grep, because an oversight in a regular expression will just match more
lines than desired.
If you use
sed, and your patterns get carried away, you may end up deleting more than
you wanted to.
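A short sketch shows how far the match extends. It assumes Python, whose engine is greedy rather than strictly "longest match" like the POSIX tools, but for this pattern the result is the same. The second sample string is invented to imitate a sed-style substitution.

```python
import re

text = "AAAABBBBABCCCCBBBAAAB"

# The ".*" grabs as much as it can, so the match runs to the last "B".
greedy = re.search(r"A.*B", text).group()

# A substitution shows the danger: everything between the first "A"
# and the last "B" is replaced, not just the first "A ... B" stretch.
replaced = re.sub(r"A.*B", "X", "keep A one B keep A two B keep")
```

The value of `greedy` is the entire string, and the substitution swallows the «keep» in the middle along with both «A … B» stretches.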
Extended Regular Expressions
Two programs use the extended regular expressions:
egrep and
awk. With these extensions, those special characters preceded by a backslash
no longer have the special meaning:
«\{»,
«\}»,
«\<»,
«\>»,
«\(»,
«\)» as well as the
«\digit». There is a very good reason for this, which I will
delay explaining to build up suspense.
The character
«?» matches 0 or 1 instances of the character set before, and the
character
«+» matches one or more copies of the character set.
You can’t use the \{ and \} in the extended regular expressions,
but if you could, you might consider the
«?» to be the same as
«\{0,1\}» and the
«+» to be the same as
«\{1,\}».
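The equivalence is easy to confirm in a flavor that accepts both spellings. The sketch below assumes Python, where the counted form is written `{0,1}` and `{1,}` without backslashes; the sample strings are invented.

```python
import re

samples = ["ac", "abc", "abbc", "abbbc"]

# "?" and "{0,1}" both mean zero or one "b".
optional_short = [s for s in samples if re.fullmatch(r"ab?c", s)]
optional_long = [s for s in samples if re.fullmatch(r"ab{0,1}c", s)]

# "+" and "{1,}" both mean one or more "b".
plus_short = [s for s in samples if re.fullmatch(r"ab+c", s)]
plus_long = [s for s in samples if re.fullmatch(r"ab{1,}c", s)]
```

Each short form selects exactly the same strings as its counted spelling.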
By now, you are wondering why the extended regular expressions
are even worth using. Except for two abbreviations, there are no
advantages, and a lot of disadvantages.
Therefore, examples would be useful.
The three important characters in the extended regular expressions are
«(«,
«|», and
«)». Together, they let you match a
choice of patterns.
As an example, you can use
egrep to print all
From: and
Subject: lines from your incoming mail:
- egrep '^(From|Subject): ' /usr/spool/mail/$USER
All lines starting with
«From:» or
«Subject:» will be printed. There is no easy way to do this with the basic
regular expressions. You could try
«^[FS][ru][ob][mj]e*c*t*: » and hope you don’t have any lines that start with
«Sromeet:». Extended expressions don’t have
the
«\<» and
«\>» characters.
You can compensate by using the alternation mechanism.
Matching the word
«the» in the beginning, middle, end of a sentence, or end of a line can be
done with the extended regular expression:
- (^| )the([^a-z]|$)
There are two choices before the word, a space or the beginning of a
line.
After the word, there must be something besides a lower case letter or
else the end of the line.
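Both alternation examples can be tried without a mail spool. This sketch assumes Python, whose `(…|…)` syntax matches the extended expressions here; the mail lines and addresses are invented sample data.

```python
import re

mail = [
    "From: alice@example.com",
    "Subject: lunch",
    "To: bob@example.com",
    "Sromeet: not a real header",
]

# Alternation picks lines that start with either header name.
header = re.compile(r"^(From|Subject): ")
headers = [line for line in mail if header.search(line)]

# The character-set workaround also accepts the bogus "Sromeet:" line.
loose = re.compile(r"^[FS][ru][ob][mj]e*c*t*: ")
loose_hits = [line for line in mail if loose.search(line)]

# Word matching by alternation: start of line or a space before the word,
# a non-lowercase character or end of line after it.
the_word = re.compile(r"(^| )the([^a-z]|$)")
word_hits = [s for s in ["the cat", "over the", "other", "see them"]
             if the_word.search(s)]
```

The alternation keeps only the two real headers, while the character-set version lets «Sromeet:» through, which is exactly the weakness described above.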
One extra bonus with extended regular expressions is the ability to
use the
«*,»
«+,» and
«?» modifiers after a
«(…)» grouping. The following will match
«a simple problem,»
«an easy problem,» as well as
«a problem».
- egrep "a[n]? (simple |easy )?problem" data
Note the space after both «simple» and «easy».
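The same pattern works unchanged in Python, which I am assuming here so the optional group can be tested on a few invented sample strings.

```python
import re

# The "?" after the group makes the whole "(simple |easy )" part optional.
phrase = re.compile(r"a[n]? (simple |easy )?problem")

samples = ["a simple problem", "an easy problem", "a problem", "a hard problem"]
hits = [s for s in samples if phrase.search(s)]
```

The first three samples match. The last one does not, because «hard » is not one of the choices inside the group.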
I promised to explain why the backslash characters don’t work in
extended regular expressions.
Well, perhaps the
«\{…\}» and
«\<…\>» could be added to the extended expressions. These are the newest
addition to the regular expression family. They could be added, but
this might confuse people if those characters are added and the
«\(…\)» are not. And there is no way to add that functionality to the extended
expressions without changing the current usage. Do you see why?
It’s quite simple. If
«(» has a special meaning, then
«\(» must be the ordinary character.
This is the opposite of the Basic regular expressions,
where
«(» is ordinary, and
«\(» is special.
The usage of the parentheses is incompatible, and any change could
break old programs.
If the extended expression used
«(…|…)» as regular characters, and
«\(…\|…\)» for specifying alternate patterns, then it is possible to have one set
of regular expressions that has full functionality.
This is exactly
what GNU emacs does, by the way.
The rest of this is random notes.
Regular Expression | Class | Type | Meaning |
. | all | Character Set | A single character (except newline) |
^ | all | Anchor | Beginning of line |
$ | all | Anchor | End of line |
[…] | all | Character Set | Range of characters |
* | all | Modifier | Zero or more duplicates |
\< | Basic | Anchor | Beginning of word |
\> | Basic | Anchor | End of word |
\(..\) | Basic | Backreference | Remembers pattern |
\1..\9 | Basic | Reference | Recalls pattern |
+ | Extended | Modifier | One or more duplicates |
? | Extended | Modifier | Zero or one duplicate |
\{M,N\} | Extended | Modifier | M to N duplicates |
(…|…) | Extended | Anchor | Shows alternation |
\(…\|…\) | EMACS | Anchor | Shows alternation |
\w | EMACS | Character set | Matches a letter in a word |
\W | EMACS | Character set | Opposite of \w |
POSIX character sets
POSIX added newer and more portable ways to search for character sets.
Instead of using [a-zA-Z] you can replace ‘a-zA-Z’ with [:alpha:], or, to be more complete, replace [a-zA-Z] with [[:alpha:]].
The advantage is that this will match international character sets.
You can mix the old style and new POSIX styles, such as
grep '[1-9[:alpha:]]'
Here is the full list
Character Group | Meaning |
[:alnum:] | Alphanumeric |
[:cntrl:] | Control Character |
[:lower:] | Lower case character |
[:space:] | Whitespace |
[:alpha:] | Alphabetic |
[:digit:] | Digit |
[:print:] | Printable character |
[:upper:] | Upper Case Character |
[:blank:] | Space and tab |
[:graph:] | Printable and visible characters |
[:punct:] | Punctuation |
[:xdigit:] | Hexadecimal digit |
Note that some people use [[:alpha:]] as a notation, but the outer ‘[…]’
specifies a character set.
Perl Extensions
Regular Expression | Type | Meaning |
\t | Character Set | tab |
\n | Character Set | newline |
\r | Character Set | return |
\f | Character Set | form feed |
\a | Character Set | alarm (bell) |
\e | Character Set | escape |
\033 | Character Set | octal character |
\x1B | Character Set | hex character |
\c[ | Character Set | control character |
\l | Character Set | lowercase next character |
\u | Character Set | uppercase next character |
\L | Character Set | lowercase until \E |
\U | Character Set | uppercase until \E |
\E | Character Set | end case modification |
\Q | Character Set | quote (disable) pattern metacharacters until \E |
\w | Character Set | Match a «word» character |
\W | Character Set | Match a non-word character |
\s | Character Set | Match a whitespace character |
\S | Character Set | Match a non-whitespace character |
\d | Character Set | Match a digit character |
\D | Character Set | Match a non-digit character |
\b | Anchor | Match a word boundary |
\B | Anchor | Match a non-(word boundary) |
\A | Anchor | Match only at beginning of string |
\Z | Anchor | Match only at EOS, or before newline |
\z | Anchor | Match only at end of string |
\G | Anchor | Match only where previous m//g left off |
Example of PERL Extended, multi-line regular expression
- m{ \(
      (                 # Start group
        [^()]+          # anything but '(' or ')'
       |                # or
        \( [^()]* \)
      )+                # end group
    \)
  }x
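For comparison, here is a sketch of the same pattern outside Perl. It assumes Python, where the `re.VERBOSE` flag plays the role of Perl's `x` modifier: whitespace and `#` comments inside the pattern are ignored. The name `nested_parens` is invented for this example.

```python
import re

# Python counterpart of Perl's /x: layout and comments are ignored.
nested_parens = re.compile(r"""
    \(                  # opening parenthesis
    (                   # start group
        [^()]+          # anything but "(" or ")"
      |                 # or
        \( [^()]* \)    # one inner level of parentheses
    )+                  # end group, repeated
    \)                  # closing parenthesis
""", re.VERBOSE)

one_level = nested_parens.fullmatch("(a (b) c)") is not None
two_levels = nested_parens.fullmatch("(a ((b)) c)") is not None
```

The pattern accepts one level of nesting, so `one_level` is true, but it cannot follow a second level, so `two_levels` is false.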
Thanks
Thanks to the following who spotted some errors
- Charuhas Mehendale
- Rounak Jain
- Peter Renzland
- Karl Eric Wenzel
- Axel Schulze
- Dennis Deters
- Bryan Bergert
- Brad Coanwood
- Michael Siegel
This document was translated by troff2html v0.21 on June 27, 2001.