Regex for word ending with

In this post, I’m going to explain a regular expression that I built to find all the words in a text that end with a particular character (letter).
I searched online for a regex to do that, but I didn’t find one that worked for my exact use case (so I started learning how regex works). So, I experimented and finally found the correct answer.

To solve: Find all words in the text that end with 'e'.
Solution: \w+e\b
          OR
          [a-zA-Z]+e\b

Explanation: There are probably many cheatsheets on Regular Expressions that can be referred to understand various parts of this regex solution. So I will try and keep the explanation short.

Let’s say my text is a list of random words:
cater cat late gate ignore that sentence just match correct words here

To match words, ‘\w’ is used. It matches a single word character: a letter, a digit, or the underscore (_). We can also use a range of permitted characters instead: [a-zA-Z]. I am including both the lowercase and uppercase ranges so that words containing either are matched. On its own, this selects every character of every word in the text, one at a time.

So, we need the ‘+’ sign. The ‘+’ is a quantifier that repeats the preceding element one or more times, which expands the match past a single character. It is what lets us select entire words. Now the expression (\w+) will select every full word in the text.

I want the words that end with the letter ‘e’, so I put ‘e’ at the end of my expression, giving \w+e. At this point, the expression selects every word that contains ‘e’ after at least one other character, whether or not the word ends with ‘e’, which is not what I want. Another problem is that it selects each word only up to its last ‘e’ and ignores the rest. So the selections are:
cate late gate ignore sentence corre here
Okay, so this is close, as it matches all the words that have ‘e’, but that’s not it. It should only select the words that end with ‘e’. The partial matches inside ‘cater’ and ‘correct’ should not be there.

To solve this, we need ‘\b’, which matches a word boundary. This is a zero-width position between a word character and a non-word character (or the start or end of the text), so there is one at the left and right side of every word. I’ll add it at the right side of the expression to make it \w+e\b, so that the match must stop at a word boundary and the rightmost character must be ‘e’. Now the expression matches the correct words and the result is:
late gate ignore sentence here
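
The same steps can be tried out in code. The following is a minimal sketch using Python's re module (Python and the variable names are my choice for illustration; the post itself does not name a language). It runs the final pattern and the intermediate pattern without the word boundary on the sample text.

```python
import re

text = "cater cat late gate ignore that sentence just match correct words here"

# \w+ consumes the word, 'e' must be its last letter,
# and \b asserts that the word ends right there.
words_ending_in_e = re.findall(r"\w+e\b", text)
print(words_ending_in_e)

# Without \b the pattern also picks up partial words such as 'cate' and 'corre'.
partial_matches = re.findall(r"\w+e", text)
print(partial_matches)
```

Note that \w+ needs at least one character before the final 'e', so a one-letter word "e" would not be matched by this pattern.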

That’s all. Easy now! Hopefully it helps others as well!

Sometimes we need to filter lines out of logs based on whether they start or end with a certain word. In this Java regex word boundary tutorial, we will learn to create regexes that match lines which either start or end with a certain word.

Table of Contents

1. Boundary matchers
2. Match word at the start of content
3. Match word at the end of content
4. Match word at the start of line
5. Match word at the end of line

1. Boundary matchers

Boundary matchers help to find a particular word, but only if it appears at the beginning or end of a line. They do not match any characters. Instead, they match at certain positions, effectively anchoring the regular expression match at those positions.

The following table lists and explains all the boundary matchers.

Boundary token    Description
^                 The beginning of a line
$                 The end of a line
\b                A word boundary
\B                A non-word boundary
\A                The beginning of the input
\G                The end of the previous match
\Z                The end of the input but for the final terminator, if any
\z                The end of the input

2. Java regex word boundary – Match word at the start of content

The anchor "\A" always matches at the very start of the whole text, before the first character. That is the only place where it matches. Place "\A" at the start of your regular expression to test whether the content begins with the text you want to match.

The "A" must be uppercase. Alternatively, you can use "^" as well, as long as multi-line mode is off.

^wordToSearch OR \AwordToSearch

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String content = 	"begin here to start, and go there to end\n" +
					"come here to begin, and end there to finish\n" +
					"begin here to start, and go there to end";
					
String regex 	= 	"^begin";
//OR
//String regex = "\\Abegin";

Pattern pattern = 	Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher matcher = 	pattern.matcher(content);
while (matcher.find())
{
	System.out.print("Start index: " + matcher.start());
	System.out.print(" End index: " + matcher.end() + " ");
	System.out.println(matcher.group());
}

Output:

Start index: 0 End index: 5 begin

3. Java regex word boundary – Match word at the end of content

The anchors "\Z" and "\z" always match at the very end of the content, after the last character ("\Z" also matches just before a final line terminator). Place "\Z" or "\z" at the end of your regular expression to test whether the content ends with the text you want to match.

Alternatively, you can use "$" as well.

wordToSearch$ OR wordToSearch\Z

String content = 	"begin here to start, and go there to end\n" +
					"come here to begin, and end there to finish\n" +
					"begin here to start, and go there to end";
					
String regex 	= 	"end$";
//OR
//String regex = "end\\Z";

Pattern pattern = 	Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher matcher = 	pattern.matcher(content);
while (matcher.find())
{
	System.out.print("Start index: " + matcher.start());
	System.out.print(" End index: " + matcher.end() + " ");
	System.out.println(matcher.group());
}

Output:

Start index: 122 End index: 125 end

4. Java regex word boundary – Match word at the start of line

You can use "(?m)" to turn on “multi-line” mode to match a word at the start of every line.

“Multi-line” mode affects only the caret (^) and dollar ($) sign.

(?m)^wordToSearch

String content = 	"begin here to start, and go there to end\n" +
					"come here to begin, and end there to finish\n" +
					"begin here to start, and go there to end";
String regex 	= 	"(?m)^begin";
Pattern pattern = 	Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher matcher = 	pattern.matcher(content);
while (matcher.find())
{
	System.out.print("Start index: " + matcher.start());
	System.out.print(" End index: " + matcher.end() + " ");
	System.out.println(matcher.group());
}

Output:

Start index: 0 End index: 5 begin
Start index: 85 End index: 90 begin

5. Java regex word boundary – Match word at the end of line

You can use "(?m)" to turn on “multi-line” mode to match a word at the end of every line.

(?m)wordToSearch$

String content = 	"begin here to start, and go there to end\n" +
					"come here to begin, and end there to finish\n" +
					"begin here to start, and go there to end";
String regex 	= 	"(?m)end$";
Pattern pattern = 	Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher matcher = 	pattern.matcher(content);
while (matcher.find())
{
	System.out.print("Start index: " + matcher.start());
	System.out.print(" End index: " + matcher.end() + " ");
	System.out.println(matcher.group());
}

Output:

Start index: 37 End index: 40 end
Start index: 122 End index: 125 end
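
For comparison, the multi-line behaviour can also be reproduced outside Java. The following is a rough sketch in Python's re module, which accepts the same (?m) inline flag. It assumes the three content lines are separated by "\n" and reports (start, end) index pairs instead of printing each match.

```python
import re

content = (
    "begin here to start, and go there to end\n"
    "come here to begin, and end there to finish\n"
    "begin here to start, and go there to end"
)

# (?m) turns on multi-line mode, so ^ and $ anchor at every line
# instead of only at the start and end of the whole content.
line_starts = [(m.start(), m.end())
               for m in re.finditer(r"(?m)^begin", content, re.IGNORECASE)]
line_ends = [(m.start(), m.end())
             for m in re.finditer(r"(?m)end$", content, re.IGNORECASE)]

print(line_starts)
print(line_ends)
```

The index pairs line up with the start and end indices printed by the Java examples in sections 4 and 5.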


While refactoring my Python code, I thought of the following question.

Can You Use a Regular Expression with the Python endswith() Method?

The simple answer is no because if you can use a regex, you won’t even need endswith()! Instead, use the re.match(regex, string) function from the re module. For example, re.match("^.*(coffee|cafe)$", tweet) checks whether a single-line string stored in variable tweet ends with either 'coffee' or 'cafe'.

In fact, I realized that using a regex with the endswith() method doesn’t make sense. Why? If you want to use regular expressions, use functions from the re module. That’s what they were created for! Regular expressions are infinitely more powerful than the endswith() method!


How Does the Python endswith() Method Work?

Here’s an overview of the string.endswith method:

str.endswith(suffix[, start[, end]])

Argument   Required?   Description
suffix     required    String value (or tuple of strings) to be searched for at the end of string str.
start      optional    Index of the first position where suffix is to be checked. Default: start=0.
end        optional    Index at which the check stops (exclusive). Default: end=len(str).

Let’s look at some examples using the Python endswith method. In each one, I will modify the code to show different use cases. Let’s start with the most basic scenario. 

Python endswith() Most Basic Example

Suppose you have a list of strings where each string is a tweet.  

tweets = ["to thine own self be true",
          "coffee break python",
          "i like coffee"]

Let’s say you work in the coffee industry and you want to get all tweets that end with the string "coffee". You’ll use the endswith method with a single argument:

>>> for tweet in tweets:
...   if tweet.endswith("coffee"):
...       print(tweet)
i like coffee

The endswith method also has two optional arguments, start and end, which let you check whether a substring of the original string ends with your argument.

Python endswith() Optional Arguments

The endswith method has two optional arguments: start and end. You can use these to define a range of indices to check. Per default, endswith checks the entire string. Let’s look at some examples.

The start argument tells endswith() where to begin searching. The default value is 0, i.e., it begins at the start of the string. So, the following code outputs the same result as above:

>>> for tweet in tweets:
...   if tweet.endswith("coffee", 0):
...       print(tweet)
i like coffee

What happens if we set start=8?

>>> for tweet in tweets:
...   if tweet.endswith("coffee", 8):
...       print(tweet)

Why doesn’t it print anything? By calling the find() method, we see that the substring 'coffee' begins at index 7.

>>> 'i like coffee'.find('coffee')
7

But tweet.endswith("coffee", 8) starts looking from index 8, where only 'offee' remains. So the result is False and nothing is printed.

Let’s add another argument – the end index – to the last snippet:

>>> for tweet in tweets:
...   if tweet.endswith("coffee", 7, 9):
...       print(tweet)


Nothing is printed on the console. This is because we are only searching over two characters, beginning at index 7 (inclusive) and ending at index 9 (exclusive). But we are searching for 'coffee' and it is 6 characters long. As 6 > 2, endswith() cannot find a match and returns False.
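
One way to picture the start and end arguments is as a slice of the string. The sketch below relies on the fact that, for a non-empty suffix, endswith(suffix, start, end) gives the same answer as testing the slice [start:end]. The variable names are only for illustration.

```python
tweet = "i like coffee"

# Each call tests the slice tweet[start:end] for the suffix 'coffee'.
full = tweet.endswith("coffee")           # whole string
from_7 = tweet.endswith("coffee", 7)      # slice is 'coffee'
from_8 = tweet.endswith("coffee", 8)      # slice is 'offee'
window = tweet.endswith("coffee", 7, 9)   # slice is 'co'

print(full, from_7, from_8, window)
print(repr(tweet[7:]), repr(tweet[8:]), repr(tweet[7:9]))
```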

Now that you know everything about Python’s endswith method, let’s go back to our original question:

Can I Use A Regular Expression with the Python endswith() Method?

No. The endswith() method does not accept regular expressions. You can only search for a string.

A regular expression can describe an infinite set of matching strings. For example, '.*A$' matches every string ending with 'A'. Matching against such a pattern can be more computationally expensive than a plain string comparison. So, for performance reasons, it makes sense that endswith() doesn’t accept regular expressions.


But is it also true that endswith only accepts a single string as argument? Not at all. It is possible to do the following:

Python endswith() Tuple – Check For Multiple Strings 

>>> for tweet in tweets:
...   if tweet.endswith(("coffee", "python")):
...       print(tweet)
coffee break python
i like coffee

This snippet prints all strings that end with either "coffee" or "python". It is pretty efficient too. Unfortunately, you can only check a finite set of arguments. If you need to check an infinite set, you cannot use this method.

What Happens If I Pass A Regular Expression To endswith()?

Let’s check whether a tweet ends with any version of the "coffee" string. In other words, we want to apply the regex ".+coff*". This greedily matches any character one or more times, then 'cof' followed by zero or more 'f' characters. The intent is to match strings that end with "coffee", "coffees" and "coffe".

>>> tweets = ["to thine own self be true",
              "coffee break python",
              "i like coffee",
              "i love coffe",
              "what's better than one coffee? two coffees!"]

>>> for tweet in tweets:
        if tweet.endswith(".+coff*"):
          print(tweet)
# No output :(

This doesn’t work. In regular expressions, * is a quantifier that means "zero or more of the preceding element". But in the endswith() method, it just means the star character *. Since none of the tweets end with the literal string ".+coff*", Python prints nothing to the screen.

So you might ask:

What Are The Alternatives to Using Regular Expressions in endswith()?

There is one alternative that is simple and clean: use the re module. This is Python’s built-in module built to work with regular expressions.

>>> import re
>>> tweets = ["to thine own self be true",
              "coffee break python",
              "i like coffee",
              "i love coffe",
              "what's better than one coffee? two coffees!"]
# Success!
>>> for tweet in tweets:
        if re.match(".+coff*", tweet):
          print(tweet)
i like coffee
i love coffe
what's better than one coffee? two coffees!

Success! We’ve now printed all the tweets we expected. Strictly speaking, the pattern matches every tweet that contains "cof" after at least one other character. Because it has no $ anchor, it does not actually require the match to sit at the end of the tweet.

Note that this method is slower than endswith(), because evaluating a regular expression costs more than a plain string comparison. But the code stays clear and we got the result we wanted. Slow and successful is better than fast and unsuccessful.

The function re.match() takes two arguments. First, the regular expression to be matched. Second, the string you want to search. If the pattern matches at the beginning of the string, it returns a match object, which is truthy. If not, it returns None, which is falsy. In this case, it returns None for "to thine own self be true" and "coffee break python". It returns a match object for the rest.
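
To test the end of the string rather than mere containment, one option is to anchor the pattern with $ and use re.search(). The following is a minimal sketch. The pattern is my own choice for illustration (it allows 'coffe', 'coffee' or 'coffees' followed by optional punctuation) and is not the article's regex.

```python
import re

tweets = ["to thine own self be true",
          "coffee break python",
          "i like coffee",
          "i love coffe",
          "what's better than one coffee? two coffees!"]

# re.search scans the whole string; $ pins the match to the end.
# 'coffe+s?' covers 'coffe', 'coffee' and 'coffees';
# [!?.]* allows trailing punctuation.
ends_with_coffee = re.compile(r"coffe+s?[!?.]*$")

matching = [tweet for tweet in tweets if ends_with_coffee.search(tweet)]
print(matching)

# re.match only tries the start of the string, so this is None.
at_start = re.match("coffee", "i like coffee")
print(at_start)
```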

So let’s summarize the article.

Can You Use a Regular Expression with the Python endswith() Method?

No, you cannot use a regular expression with the Python endswith function. But you can use the Python regular expression module re instead. It’s as simple as calling the function re.match(s1, s2). This checks whether the regular expression s1 matches at the beginning of the string s2; use re.search(s1, s2) to look for a match anywhere in s2.

Given that we can pass a tuple to endswith(), what happens if we pass a list? 

>>> s = 'cobra'
>>> if s.endswith(['a', 'b', 'c']):
        print('yay!')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: endswith first arg must be str or a tuple of str, not list

Python raises a TypeError. We can only pass a string or a tuple of strings to endswith(). So if we have a list of suffixes we want to check, we can call tuple() before passing it to endswith.

>>> if s.endswith(tuple(['a', 'b', 'c'])):
        print('yay!')
yay!

This works well and is fine performance wise. Yet, one of Python’s key features is its flexibility. So is it possible to get the same outcome without changing our list of letters to a tuple? Of course it is! 

We have two options:

  1. any() + list comprehension
  2. any() + map()

The any() function is a way to combine logical or statements together. It takes one argument – an iterable of conditional statements. So instead of writing

if s.endswith('a') or s.endswith('b') or s.endswith('c'):
    pass  # some code

We write

# any takes 1 argument - an iterable
if any([s.endswith('a'),
        s.endswith('b'),
        s.endswith('c')]):
    pass  # some code

This is much nicer to read and is especially useful if you are combining many conditions. We can improve this by first creating a list of conditions and passing this to any():

letters = ['a', 'b', 'c']
conditions = [s.endswith(l) for l in letters]

if any(conditions):
    pass  # do something

Alternatively, we can use map instead of a list comprehension

letters = ['a', 'b', 'c']
if any(map(s.endswith, letters)):
    pass  # do something

Both have the same outcome. We personally prefer list comprehensions and think they are more readable. But choose whichever you prefer.  
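
A third variant, not covered above, is to pass a generator expression to any(). This is a small sketch under the assumption that short-circuiting matters to you: unlike the list comprehension, the generator lets any() stop calling endswith() as soon as one check succeeds.

```python
s = "cobra"
letters = ["a", "b", "c"]

# A generator expression lets any() stop at the first True result.
ends_with_letter = any(s.endswith(letter) for letter in letters)

# Equivalent check using the tuple form that endswith() accepts directly.
same_result = s.endswith(tuple(letters))

print(ends_with_letter, same_result)
```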


While working as a researcher in distributed systems, Dr. Christian Mayer found his love for teaching computer science students.

To help students reach higher levels of Python success, he founded the programming education website Finxter.com that has taught exponential skills to millions of coders worldwide. He’s the author of the best-selling programming books Python One-Liners (NoStarch 2020), The Art of Clean Code (NoStarch 2022), and The Book of Dash (NoStarch 2022). Chris also coauthored the Coffee Break Python series of self-published books. He’s a computer science enthusiast, freelancer, and owner of one of the top 10 largest Python blogs worldwide.

His passions are writing, reading, and coding. But his greatest passion is to serve aspiring coders through Finxter and help them to boost their skills.

GNU SED

BRE/ERE Regular Expressions

This chapter will cover Basic and Extended Regular Expressions as implemented in GNU sed. Though not strictly conforming to POSIX specifications, most of it is applicable to other sed implementations as well. Unless otherwise indicated, examples and descriptions will assume ASCII input.

By default, sed treats the search pattern as Basic Regular Expression (BRE). Using -E option will enable Extended Regular Expression (ERE). Older versions used -r for ERE, which can still be used, but -E is more portable. In GNU sed, BRE and ERE only differ in how metacharacters are applied, there’s no difference in features.

Line Anchors

Instead of matching anywhere in the line, restrictions can be specified. These restrictions are made possible by assigning special meaning to certain characters and escape sequences. The characters with special meaning are known as metacharacters in regular expressions parlance. In case you need to match those characters literally, you need to escape them with a \ (discussed in the Matching the metacharacters section).

There are two line anchors:

  • ^ metacharacter restricts the matching to the start of line
  • $ metacharacter restricts the matching to the end of line

$ # lines starting with 'sp'
$ printf 'spared no one\npar\nspar\n' | sed -n '/^sp/p'
spared no one
spar

$ # lines ending with 'ar'
$ printf 'spared no one\npar\nspar\n' | sed -n '/ar$/p'
par
spar

$ # change only whole line 'par'
$ printf 'spared no one\npar\nspar\n' | sed 's/^par$/PAR/'
spared no one
PAR
spar

The anchors can be used by themselves as a pattern. This helps to insert text at the start or end of an input line, emulating string concatenation operations. This might not feel like a useful capability, but combined with other features it becomes quite a handy tool.

$ printf 'spared no one\npar\nspar\n' | sed 's/^/* /'
* spared no one
* par
* spar

$ # append only if line doesn't contain space characters
$ printf 'spared no one\npar\nspar\n' | sed '/ /! s/$/./'
spared no one
par.
spar.
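
The line anchors behave much the same in other regular expression tools. As a rough sketch (Python's re module rather than sed, so this is a comparison and not a sed feature), the filtering and prefixing examples above could look like this. The input deliberately has no trailing newline.

```python
import re

text = "spared no one\npar\nspar"

# Filtering: keep lines that start with 'sp' or end with 'ar'.
starts_sp = [line for line in text.splitlines() if re.search(r"^sp", line)]
ends_ar = [line for line in text.splitlines() if re.search(r"ar$", line)]

# Substitution: with re.MULTILINE, ^ matches at the start of every line,
# so replacing the empty match inserts a prefix.
prefixed = re.sub(r"^", "* ", text, flags=re.MULTILINE)

print(starts_sp, ends_ar)
print(prefixed)
```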

Word Anchors

The second type of restriction is word anchors. A word character is any letter (irrespective of case), digit or the underscore character. You might wonder why digits and underscores are included as well, and not only letters. This comes from variable and function naming conventions, which typically allow letters, digits and underscores. So, the definition is more programming oriented than natural language.

The escape sequence \b denotes a word boundary. This works for both start of word and end of word anchoring. Start of word means either the character prior to the word is a non-word character or there is no character (start of line). Similarly, end of word means the character after the word is a non-word character or no character (end of line). This implies that you cannot have a word boundary without a word character.

info As an alternative, you can use \< to indicate the start of word anchor and \> to indicate the end of word anchor. Using \b is preferred as it is more commonly used in other regular expression implementations and has \B as its opposite.

warning \bREGEXP\b behaves a bit differently than \<REGEXP\>. See Gotchas and Tricks chapter for details.

$ cat word_anchors.txt
sub par
spar
apparent effort
two spare computers
cart part tart mart

$ # match words starting with 'par'
$ sed -n '/\bpar/p' word_anchors.txt
sub par
cart part tart mart

$ # match words ending with 'par'
$ sed -n '/par\b/p' word_anchors.txt
sub par
spar

$ # replace only whole word 'par'
$ sed -n 's/\bpar\b/***/p' word_anchors.txt
sub ***

The word boundary has an opposite anchor too. \B matches wherever \b doesn’t match. This duality will be seen with some other escape sequences too.

warning Negative logic is handy in many text processing situations. But use it with care, you might end up matching things you didn’t intend.

$ # match 'par' if it is surrounded by word characters
$ sed -n '/\Bpar\B/p' word_anchors.txt
apparent effort
two spare computers

$ # match 'par' but not as start of word
$ sed -n '/\Bpar/p' word_anchors.txt
spar
apparent effort
two spare computers

$ # match 'par' but not as end of word
$ sed -n '/par\B/p' word_anchors.txt
apparent effort
two spare computers
cart part tart mart

$ echo 'copper' | sed 's/\b/:/g'
:copper:
$ echo 'copper' | sed 's/\B/:/g'
c:o:p:p:e:r
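
The \b and \B escapes are also available in Python's re module. As a hedged cross-check (Python is my choice here, and the list below simply copies the contents of word_anchors.txt), the filters above can be mirrored like this:

```python
import re

lines = ["sub par", "spar", "apparent effort",
         "two spare computers", "cart part tart mart"]

# \b anchors 'par' to the start or end of a word; \B requires the opposite.
starts_with_par = [line for line in lines if re.search(r"\bpar", line)]
ends_with_par = [line for line in lines if re.search(r"par\b", line)]
inside_word = [line for line in lines if re.search(r"\Bpar\B", line)]

# Replacing the zero-width positions makes them visible.
word_edges = re.sub(r"\b", ":", "copper")
non_edges = re.sub(r"\B", ":", "copper")

print(starts_with_par, ends_with_par, inside_word)
print(word_edges, non_edges)
```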

Alternation

Often, you’d want to search for multiple terms. In a conditional expression, you can use the logical operators to combine multiple conditions. With regular expressions, the | metacharacter is similar to logical OR. The regular expression will match if any of the expressions separated by | is satisfied. These can have their own independent anchors as well.

Alternation is similar to using multiple -e option, but provides more flexibility with regular expression features. The | metacharacter syntax varies between BRE and ERE. Quoting from the manual:

In GNU sed, the only difference between basic and extended regular expressions is in the behavior of a few special characters: ?, +, parentheses, braces ({}), and |.

$ # BRE vs ERE
$ sed -n '/two\|sub/p' word_anchors.txt
sub par
two spare computers
$ sed -nE '/two|sub/p' word_anchors.txt
sub par
two spare computers

$ # either 'cat' or 'dog' or 'fox'
$ # note the use of 'g' flag for multiple replacements
$ echo 'cats dog bee parrot foxed' | sed -E 's/cat|dog|fox/--/g'
--s -- bee parrot --ed

$ # lines with whole word 'par' or lines ending with 's'
$ sed -nE '/\bpar\b|s$/p' word_anchors.txt
sub par
two spare computers

There are some tricky situations when using alternation. If it is used for filtering a line, there is no ambiguity. However, for use cases like substitution, it depends on a few factors. Say, you want to replace are or spared: which one should get precedence? The bigger word spared, the substring are inside it, or something else?

The alternative which matches earliest in the input gets precedence.

$ # here, the output will be same irrespective of alternation order
$ # note that 'g' flag isn't used here, so only first match gets replaced
$ echo 'cats dog bee parrot foxed' | sed -E 's/bee|parrot|at/--/'
c--s dog bee parrot foxed
$ echo 'cats dog bee parrot foxed' | sed -E 's/parrot|at|bee/--/'
c--s dog bee parrot foxed

In case of matches starting from same location, for example spar and spared, the longest matching portion gets precedence. Unlike other regular expression implementations, left-to-right priority for alternation comes into play only if length of the matches are the same. See Longest match wins and Backreferences sections for more examples. See regular-expressions: alternation for more information on this topic.

$ echo 'spared party parent' | sed -E 's/spa|spared/**/g'
** party parent
$ echo 'spared party parent' | sed -E 's/spared|spa/**/g'
** party parent

$ # other implementations like 'perl' have left-to-right priority
$ echo 'spared party parent' | perl -pe 's/spa|spared/**/'
**red party parent
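
Python's re module is another implementation with left-to-right priority, like perl. The following small sketch (Python is my addition for comparison, not part of the sed chapter) shows that the order of the alternatives changes the result there:

```python
import re

text = "spared party parent"

# The first alternative that matches at a given position wins,
# even if a later alternative would match more text.
short_first = re.sub(r"spa|spared", "**", text)
long_first = re.sub(r"spared|spa", "**", text)

print(short_first)
print(long_first)
```

Putting the longer alternative first is the usual workaround in such engines.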

Grouping

Often, there are some common things among the regular expression alternatives. It could be common characters or qualifiers like the anchors. In such cases, you can group them using a pair of parentheses metacharacters. Similar to a(b+c)d = abd+acd in maths, you get a(b|c)d = abd|acd in regular expressions.

$ # without grouping
$ printf 'red\nreform\nread\narrest\n' | sed -nE '/reform|rest/p'
reform
arrest
$ # with grouping
$ printf 'red\nreform\nread\narrest\n' | sed -nE '/re(form|st)/p'
reform
arrest

$ # without grouping
$ printf 'sub par\nspare\npart time\n' | sed -nE '/\bpar\b|\bpart\b/p'
sub par
part time
$ # taking out common anchors
$ printf 'sub par\nspare\npart time\n' | sed -nE '/\b(par|part)\b/p'
sub par
part time
$ # taking out common characters as well
$ # you'll later learn a better technique instead of using empty alternate
$ printf 'sub par\nspare\npart time\n' | sed -nE '/\bpar(|t)\b/p'
sub par
part time

Matching the metacharacters

You have seen a few metacharacters and escape sequences that help to compose a regular expression. To match the metacharacters literally, i.e. to remove their special meaning, prefix those characters with a \ character. To indicate a literal \ character, use \\. Some of the metacharacters, like the line anchors, lose their special meaning when not used in their customary positions with BRE syntax. If there are many metacharacters to be escaped, try to work out if the command can be simplified by switching between ERE and BRE.

$ # line anchors aren't special away from customary positions with BRE
$ echo 'a^2 + b^2 - C*3' | sed -n '/b^2/p'
a^2 + b^2 - C*3
$ echo '$a = $b + $c' | sed -n '/$b/p'
$a = $b + $c
$ # escape line anchors to match literally if you are using ERE
$ # or if you want to match them at customary positions with BRE
$ echo '$a = $b + $c' | sed 's/\$//g'
a = b + c

$ # BRE vs ERE
$ printf '(a/b) + c\n3 + (a/b) - c\n' | sed -n '/^(a\/b)/p'
(a/b) + c
$ printf '(a/b) + c\n3 + (a/b) - c\n' | sed -nE '/^\(a\/b\)/p'
(a/b) + c

Handling the metacharacters in replacement section will be discussed in Backreferences section.

Using different delimiters

The / character is idiomatically used as the delimiter for REGEXP. But any character other than \ and the newline character can be used instead. This helps to avoid or reduce the need for escaping delimiter characters. The syntax is simple for substitution and transliteration commands, just use a different character instead of /.

$ # instead of this
$ echo '/home/learnbyexample/reports' | sed 's/\/home\/learnbyexample\//~\//'
~/reports
$ # use a different delimiter
$ echo '/home/learnbyexample/reports' | sed 's#/home/learnbyexample/#~/#'
~/reports

$ echo 'a/b/c/d' | sed 'y/a\/d/1-4/'
1-b-c-4
$ echo 'a/b/c/d' | sed 'y,a/d,1-4,'
1-b-c-4

For address matching, syntax is a bit different, the first delimiter has to be escaped. For address ranges, start and end REGEXP can have different delimiters, as they are independent.

$ printf '/foo/bar/1\n/foo/baz/1\n'
/foo/bar/1
/foo/baz/1

$ # here ; is used as the delimiter
$ printf '/foo/bar/1\n/foo/baz/1\n' | sed -n '\;/foo/bar/;p'
/foo/bar/1


The dot metacharacter

The dot metacharacter serves as a placeholder to match any character (including the newline character). Later you’ll learn how to define your own custom placeholder for a limited set of characters.

$ # 3 character sequence starting with 'c' and ending with 't'
$ echo 'tac tin cot abc:tyz excited' | sed 's/c.t/-/g'
ta-in - ab-yz ex-ed

$ # any character followed by 3 and again any character
$ printf '42\t35\n' | sed 's/.3.//'
42

$ # N command is handy here to show that . matches \n as well
$ printf 'abc\nxyz\n' | sed 'N; s/c.x/ /'
ab yz

Quantifiers

As an analogy, alternation provides logical OR. Combining the dot metacharacter . and quantifiers (and alternation if needed) paves a way to perform logical AND. For example, to check if a string matches two patterns with any number of characters in between. Quantifiers can be applied to both characters and groupings. Apart from ability to specify exact quantity and bounded range, these can also match unbounded varying quantities.

First up, the ? metacharacter which quantifies a character or group to match 0 or 1 times. This helps to define optional patterns and build terser patterns compared to groupings for some cases.

$ # same as: sed -E 's/\b(fe.d|fed)\b/X/g'
$ # BRE version: sed 's/fe.\?d\b/X/g'
$ echo 'fed fold fe:d feeder' | sed -E 's/\bfe.?d\b/X/g'
X fold X feeder

$ # same as: sed -nE '/\bpar(|t)\b/p'
$ printf 'sub par\nspare\npart time\n' | sed -nE '/\bpart?\b/p'
sub par
part time

$ # same as: sed -E 's/part|parrot/X/g'
$ echo 'par part parrot parent' | sed -E 's/par(ro)?t/X/g'
par X X parent
$ # same as: sed -E 's/part|parrot|parent/X/g'
$ echo 'par part parrot parent' | sed -E 's/par(en|ro)?t/X/g'
par X X X

$ # both '<' and '\<' are replaced with '\<'
$ echo 'blah \< foo bar < blah baz <' | sed -E 's/\\?</\\</g'
blah \< foo bar \< blah baz \<

The * metacharacter quantifies a character or group to match 0 or more times. There is no upper bound, more details will be discussed in the next section.

$ # 'f' followed by zero or more of 'e' followed by 'd'
$ echo 'fd fed fod fe:d feeeeder' | sed 's/fe*d/X/g'
X X fod fe:d Xer

$ # zero or more of '1' followed by '2'
$ echo '3111111111125111142' | sed 's/1*2/-/g'
3-511114-

The + metacharacter quantifies a character or group to match 1 or more times. Similar to * quantifier, there is no upper bound.

$ # 'f' followed by one or more of 'e' followed by 'd'
$ # BRE version: sed 's/fe\+d/X/g'
$ echo 'fd fed fod fe:d feeeeder' | sed -E 's/fe+d/X/g'
fd X fod fe:d Xer

$ # 'f' followed by at least one of 'e' or 'o' or ':' followed by 'd'
$ echo 'fd fed fod fe:d feeeeder' | sed -E 's/f(e|o|:)+d/X/g'
fd X X X Xer

$ # one or more of '1' followed by optional '4' and then '2'
$ echo '3111111111125111142' | sed -E 's/1+4?2/-/g'
3-5-

You can specify a range of integer numbers, both bounded and unbounded, using {} metacharacters. There are four ways to use this quantifier as listed below:

Pattern    Description
{m,n}      match m to n times
{m,}       match at least m times
{,n}       match up to n times (including 0 times)
{n}        match exactly n times

$ # note that inside {} space is not allowed around ,
$ # BRE version: sed 's/ab\{1,4\}c/X/g'
$ echo 'ac abc abbc abbbc abbbbbbbbc' | sed -E 's/ab{1,4}c/X/g'
ac X X X abbbbbbbbc

$ echo 'ac abc abbc abbbc abbbbbbbbc' | sed -E 's/ab{3,}c/X/g'
ac abc abbc X X

$ echo 'ac abc abbc abbbc abbbbbbbbc' | sed -E 's/ab{,2}c/X/g'
X X X abbbc abbbbbbbbc

$ echo 'ac abc abbc abbbc abbbbbbbbc' | sed -E 's/ab{3}c/X/g'
ac abc abbc X abbbbbbbbc

info The {} metacharacters have to be escaped to match them literally. However, unlike the () metacharacters, escaping { alone is enough.
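
The {} quantifier is common to many regular expression flavours. As a quick, hedged comparison in Python's re module (my addition, not part of GNU sed), the same four forms can be tried on the sample input. Here {0,2} is written out explicitly in place of {,2}.

```python
import re

text = "ac abc abbc abbbc abbbbbbbbc"

one_to_four = re.sub(r"ab{1,4}c", "X", text)    # 1 to 4 times
three_or_more = re.sub(r"ab{3,}c", "X", text)   # at least 3 times
up_to_two = re.sub(r"ab{0,2}c", "X", text)      # 0 to 2 times
exactly_three = re.sub(r"ab{3}c", "X", text)    # exactly 3 times

for result in (one_to_four, three_or_more, up_to_two, exactly_three):
    print(result)
```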

Next up, how to construct conditional AND using dot metacharacter and quantifiers. To allow matching in any order, you’ll have to bring in alternation as well. But, for more than 3 patterns, the combinations become too many to write and maintain.

$ # match 'Error' followed by zero or more characters followed by 'valid'
$ echo 'Error: not a valid input' | sed -n '/Error.*valid/p'
Error: not a valid input

$ # 'cat' followed by 'dog' or 'dog' followed by 'cat'
$ echo 'two cats and a dog' | sed -E 's/cat.*dog|dog.*cat/pets/'
two pets
$ echo 'two dogs and a cat' | sed -E 's/cat.*dog|dog.*cat/pets/'
two pets
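
When only filtering is needed (not substitution), one possible way around the growing number of alternations is to nest addresses instead of writing a single regular expression. This is a different technique from the alternation shown above, sketched here under the assumption of a POSIX-style sed; the input lines and the variable name out are invented for the example.

```shell
# assumes a POSIX-style sed; sample lines are invented for this sketch
# print only the lines that contain 'cat', 'dog' and 'bird', in any order
out=$(printf 'cat dog\nbird dog cat\ndog bird\n' |
      sed -n '/cat/{/dog/{/bird/p;};}')
echo "$out"
```

Only the second line has all three words, so the output should be bird dog cat. Each extra condition adds one more nested address rather than multiplying the alternatives.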

Longest match wins

You’ve already seen an example with alternation, where the longest matching portion was chosen if two alternatives started from same location. For example spar|spared will result in spared being chosen over spar. The same applies whenever there are two or more matching possibilities with quantifiers starting from same location. For example, f.?o will match foo instead of fo if the input string to match is foot.

$ # longest match among 'foo' and 'fo' wins here
$ echo 'foot' | sed -E 's/f.?o/X/'
Xt
$ # everything will match here
$ echo 'car bat cod map scat dot abacus' | sed 's/.*/X/'
X

$ # longest match happens when (1|2|3)+ matches up to '1233' only
$ # so that '12baz' can match as well
$ echo 'foo123312baz' | sed -E 's/o(1|2|3)+(12baz)?/X/'
foX
$ # in other implementations like 'perl', that is not the case
$ # quantifiers match as much as possible, but precedence is left to right
$ echo 'foo123312baz' | perl -pe 's/o(1|2|3)+(12baz)?/X/'
foXbaz

While determining the longest match, overall regular expression matching is also considered. That’s how Error.*valid example worked. If .* had consumed everything after Error, there wouldn’t be any more characters to try to match valid. So, among the varying quantity of characters to match for .*, the longest portion that satisfies the overall regular expression is chosen. Something like a.*b will match from first a in the input string to the last b in the string. In other implementations, like perl, this is achieved through a process called backtracking. Both approaches have their own advantages and disadvantages and have cases where the pattern can result in exponential time consumption.

$ # from start of line to last 'm' in the line
$ echo 'car bat cod map scat dot abacus' | sed 's/.*m/-/'
-ap scat dot abacus

$ # from first 'b' to last 't' in the line
$ echo 'car bat cod map scat dot abacus' | sed 's/b.*t/-/'
car - abacus

$ # from first 'b' to last 'at' in the line
$ echo 'car bat cod map scat dot abacus' | sed 's/b.*at/-/'
car - dot abacus

$ # here 'm*' will match 'm' zero times as that gives the longest match
$ echo 'car bat cod map scat dot abacus' | sed 's/a.*m*/-/'
c-
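
To see both flavours of the longest-match rule side by side, here is a small sketch. It assumes GNU sed, as in the rest of this section; the sample strings and the variable names alt and quant are only illustrative.

```shell
# assumes GNU sed (POSIX leftmost-longest matching)
# alternation: the longer alternative 'spared' is chosen over 'spar'
alt=$(echo 'spared' | sed -E 's/spar|spared/X/')
# quantifier: '.*' stretches up to the last 'b' that still lets the regex match
quant=$(echo 'a1b2b3' | sed 's/a.*b/X/')
echo "$alt"
echo "$quant"
```

The first command should print X (nothing is left over) and the second X3, since a.*b runs from the first a to the last b.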

Character classes

To create a custom placeholder for limited set of characters, enclose them inside [] metacharacters. It is similar to using single character alternations inside a grouping, but with added flexibility and features. Character classes have their own versions of metacharacters and provide special predefined sets for common use cases. Quantifiers are also applicable to character classes.

$ # same as: sed -nE '/cot|cut/p' and sed -nE '/c(o|u)t/p'
$ printf 'cute\ncat\ncot\ncoat\ncost\nscuttle\n' | sed -n '/c[ou]t/p'
cute
cot
scuttle

$ # same as: sed -nE '/.(a|e|o)+t/p'
$ printf 'meeting\ncute\nboat\nat\nfoot\n' | sed -nE '/.[aeo]+t/p'
meeting
boat
foot

$ # same as: sed -E 's/\b(s|o|t)(o|n)\b/X/g'
$ echo 'no so in to do on' | sed -E 's/\b[sot][on]\b/X/g'
no X in X do X

$ # lines made up of letters 'o' and 'n', line length at least 2
$ # words.txt contains dictionary words, one word per line
$ sed -nE '/^[on]{2,}$/p' words.txt
no
non
noon
on

Character classes have their own metacharacters to help define the sets succinctly. Metacharacters outside of character classes like ^, $, () etc either don’t have special meaning or have completely different one inside the character classes. First up, the - metacharacter that helps to define a range of characters instead of having to specify them all individually.

$ # same as: sed -E 's/[0123456789]+/-/g'
$ echo 'Sample123string42with777numbers' | sed -E 's/[0-9]+/-/g'
Sample-string-with-numbers
$ # whole words made up of lowercase alphabets and digits only
$ echo 'coat Bin food tar12 best' | sed -E 's/\b[a-z0-9]+\b/X/g'
X Bin X X X
$ # whole words made up of lowercase alphabets, starting with 'p' to 'z'
$ echo 'road i post grip read eat pit' | sed -E 's/\b[p-z][a-z]*\b/X/g'
X i X grip X eat X

Character classes can also be used to construct numeric ranges. However, it is easy to miss corner cases and some ranges are complicated to design. See also regular-expressions: Matching Numeric Ranges with a Regular Expression.

$ # numbers between 10 to 29
$ echo '23 154 12 26 34' | sed -E 's/\b[12][0-9]\b/X/g'
X 154 X X 34
$ # numbers >= 100 with optional leading zeros
$ echo '0501 035 154 12 26 98234' | sed -E 's/\b0*[1-9][0-9]{2,}\b/X/g'
X 035 X 12 26 X
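
Ranges that do not line up with a single digit position usually need alternation as well. The sketch below, assuming GNU sed (for \b) and using arbitrary sample numbers, matches one or two digit values from 0 to 23, the kind of check you might want for an hour field.

```shell
# assumes GNU sed (for \b); the input numbers are arbitrary
# [01]?[0-9] covers 0-19 (with optional leading zero), 2[0-3] covers 20-23
out=$(echo '7 23 24 19 30 08' | sed -E 's/\b([01]?[0-9]|2[0-3])\b/X/g')
echo "$out"
```

Here 24 and 30 should be left alone, giving X X 24 X 30 X. The word boundaries matter: without them, the 2 in 24 would match on its own.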

Next metacharacter is ^ which has to be specified as the first character of the character class. It negates the set of characters, so all characters other than those specified will be matched. As highlighted earlier, handle negative logic with care, you might end up matching more than you wanted.

$ # replace all non-digits
$ echo 'Sample123string42with777numbers' | sed -E 's/[^0-9]+/-/g'
-123-42-777-

$ # delete last two columns based on a delimiter
$ echo 'foo:123:bar:baz' | sed -E 's/(:[^:]+){2}$//'
foo:123

$ # sequence of characters surrounded by unique character
$ echo 'I like "mango" and "guava"' | sed -E 's/"[^"]+"/X/g'
I like X and X

$ # sometimes it is simpler to positively define a set than negation
$ # same as: sed -nE '/^[^aeiou]*$/p'
$ printf 'tryst\nfun\nglyph\npity\nwhy\n' | sed '/[aeiou]/d'
tryst
glyph
why
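
Negated character classes are also a handy way to rein in the longest-match behavior of .* discussed earlier. The following is a minimal sketch using plain BRE syntax; the comma separated input and the variable names greedy and first are assumptions made for illustration.

```shell
# plain BRE, nothing GNU-specific is used here
# greedy '.*,' runs up to the last comma in the line
greedy=$(echo 'apple,banana,cherry' | sed 's/.*,/X,/')
# '[^,]*,' cannot cross a comma, so it stops at the first one
first=$(echo 'apple,banana,cherry' | sed 's/[^,]*,/X,/')
echo "$greedy"
echo "$first"
```

The first substitution should give X,cherry and the second X,banana,cherry.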

Some commonly used character sets have predefined escape sequences:

  • \w matches all word characters [a-zA-Z0-9_] (recall the description for word boundaries)
  • \W matches all non-word characters (recall duality seen earlier, like \b and \B)
  • \s matches all whitespace characters: tab, newline, vertical tab, form feed, carriage return and space
  • \S matches all non-whitespace characters

These escape sequences cannot be used inside character classes. Also, as mentioned earlier, these definitions assume ASCII input.

Warning: sed doesn’t support \d and \D, commonly featured in other implementations as a shortcut for all the digits and non-digits.

$ # match all non-word characters
$ echo 'load;err_msg--\nant,r2..not' | sed -E 's/\W+/-/g'
load-err_msg-nant-r2-not

$ # replace all sequences of whitespaces with single space
$ printf 'hi  \v\f  there.\thave   \ra nice\t\tday\n' | sed -E 's/\s+/ /g'
hi there. have a nice day

$ # \w would simply match \ and w inside character classes
$ echo 'w=y\x+9*3' | sed 's/[\w=]//g'
yx+9*3

A named character set is defined by a name enclosed between [: and :] and has to be used within a character class [], along with any other characters as needed.

Named set   Description
[:digit:]   [0-9]
[:lower:]   [a-z]
[:upper:]   [A-Z]
[:alpha:]   [a-zA-Z]
[:alnum:]   [0-9a-zA-Z]
[:xdigit:]  [0-9a-fA-F]
[:cntrl:]   control characters: first 32 ASCII characters and 127th (DEL)
[:punct:]   all the punctuation characters
[:graph:]   [:alnum:] and [:punct:]
[:print:]   [:alnum:], [:punct:] and space
[:blank:]   space and tab characters
[:space:]   whitespace characters, same as \s

$ echo 'err_msg xerox ant m_2 P2 load1 eel' | sed -E 's/\b[[:lower:]]+\b/X/g'
err_msg X X m_2 P2 load1 X

$ echo 'err_msg xerox ant m_2 P2 load1 eel' | sed -E 's/\b[[:lower:]_]+\b/X/g'
X X X m_2 P2 load1 X

$ echo 'err_msg xerox ant m_2 P2 load1 eel' | sed -E 's/\b[[:alnum:]]+\b/X/g'
err_msg X X m_2 X X X

$ echo ',pie tie#ink-eat_42' | sed -E 's/[^[:punct:]]+//g'
,#-_
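
Named sets can be negated or combined with each other inside the same [] just like ordinary characters. Here is a small sketch, assuming GNU sed and ASCII input; the sample strings and the variable names words and masked are hypothetical.

```shell
# assumes GNU sed and ASCII input
# collapse every run of non-alphabetic characters into a single space
words=$(echo 'tmp_2024-report.final' | sed -E 's/[^[:alpha:]]+/ /g')
# two named sets in one class: runs of uppercase letters and/or digits
masked=$(echo 'ID: ab12-XY' | sed -E 's/[[:upper:][:digit:]]+/#/g')
echo "$words"
echo "$masked"
```

The expected results are tmp report final and #: ab#-# respectively.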

Specific placement is needed to match character class metacharacters literally.

Warning: Combinations like [. or [: cannot be used together to mean two individual characters, as they have special meaning within []. See sed manual: Character Classes and Bracket Expressions for more details.

$ # - should be first or last character within []
$ echo 'ab-cd gh-c 12-423' | sed -E 's/[a-z-]{2,}/X/g'
X X 12-423

$ # ] should be first character within []
$ printf 'int a[5]\nfoo\n1+1=2\n' | sed -n '/[=]]/p'
$ printf 'int a[5]\nfoo\n1+1=2\n' | sed -n '/[]=]/p'
int a[5]
1+1=2

$ # to match [ use [ anywhere in the character set
$ # but not combinations like [. or [:
$ # [][] will match both [ and ]
$ echo 'int a[5]' | sed -n '/[x[.y]/p'
sed: -e expression #1, char 9: unterminated address regex
$ echo 'int a[5]' | sed -n '/[x[y.]/p'
int a[5]

$ # ^ should be other than first character within []
$ echo 'f*(a^b) - 3*(a+b)/(a-b)' | sed 's/a[+^]b/c/g'
f*(c) - 3*(c)/(a-b)
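
The three placement rules can be combined in a single character class. The sketch below puts ] first, ^ in the middle and - last so that all three are matched literally; the input string and the variable name out are made up for this example.

```shell
# ']' first, '^' not first, '-' last: all three are literal inside []
# '[' is not part of the set, so it is left untouched
out=$(echo 'a]b^c-d[e' | sed 's/[]^-]/_/g')
echo "$out"
```

This should print a_b_c_d[e.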

Escape sequences

Certain ASCII characters like tab \t, carriage return \r, newline \n, etc have escape sequences to represent them. Additionally, any character can be represented using their ASCII value in decimal \dNNN or octal \oNNN or hexadecimal \xNN formats. Unlike character set escape sequences like \w, these can be used inside character classes. As \ is special inside character class, use \\ to represent it literally (technically, this is only needed if the combination of \ and the character(s) that follows is a valid escape sequence).

$ # using \t to represent tab character
$ printf 'foo\tbar\tbaz\n' | sed 's/\t/ /g'
foo bar baz
$ echo 'a b c' | sed 's/ /\t/g'
a       b       c

$ # these escape sequences work inside character class too
$ printf 'a\t\r\fb\vc\n' | sed -E 's/[\t\v\f\r]+/:/g'
a:b:c

$ # representing single quotes
$ # use \d039 and \o047 for decimal and octal respectively
$ echo "universe: '42'" | sed 's/\x27/"/g'
universe: "42"
$ echo 'universe: "42"' | sed 's/"/\x27/g'
universe: '42'
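
To make the three numeric formats concrete, here is a short sketch that refers to the same tab character in decimal, octal and hexadecimal. It assumes GNU sed, since these escapes are GNU extensions; the input and the variable name out are illustrative only.

```shell
# assumes GNU sed: \dNNN, \oNNN and \xNN are GNU extensions
# decimal 009, octal 011 and hexadecimal 09 all denote the tab character
# each substitution replaces the first remaining tab in the line
out=$(printf 'a\tb\tc\td\n' | sed 's/\d009/,/; s/\o011/;/; s/\x09/:/')
echo "$out"
```

Since each command consumes one tab, the result should be a,b;c:d.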

Note: If a metacharacter is specified by ASCII value format in the search section, it will still act as the metacharacter. However, metacharacters specified by ASCII value format in the replacement section act as literal characters. Undefined escape sequences (both search and replacement section) will be treated as the character it escapes, for example, \e will match e (not \ and e).

$ # \x5e is ^ character, acts as line anchor here
$ printf 'cute\ncot\ncat\ncoat\n' | sed -n '/\x5eco/p'
cot
coat

$ # & metacharacter in replacement will be discussed in next section
$ # it represents entire matched portion
$ echo 'hello world' | sed 's/.*/"&"/'
"hello world"
$ # \x26 is & character, acts as literal character here
$ echo 'hello world' | sed 's/.*/"\x26"/'
"&"

Warning: See sed manual: Escapes for full list and details such as precedence rules. See also stackoverflow: behavior of ASCII value format inside character classes.

Backreferences

The grouping metacharacters () are also known as capture groups. They are like variables, the string captured by () can be referred later using backreference \N where N is the capture group you want. Leftmost ( in the regular expression is \1, next one is \2 and so on up to \9. Backreferences can be used in both search and replacement sections. Quantifiers can be applied to backreferences as well.

$ # whole words that have at least one consecutive repeated character
$ # word boundaries are not needed here due to longest match wins effect
$ echo 'effort flee facade oddball rat tool' | sed -E 's/\w*(\w)\1\w*/X/g'
X X facade X rat X

$ # reduce \\ to single \ and delete if it is a single \
$ echo '\[\] and \\w and \[a-zA-Z0-9\_\]' | sed -E 's/(\\?)\\/\1/g'
[] and \w and [a-zA-Z0-9_]

$ # remove two or more duplicate words separated by space
$ # \b prevents false matches like 'the theatre', 'sand and stone' etc
$ echo 'aa a a a 42 f_1 f_1 f_13.14' | sed -E 's/\b(\w+)( \1)+\b/\1/g'
aa a 42 f_1 f_13.14

$ # 8 character lines having same 3 lowercase letters at start and end
$ sed -nE '/^([a-z]{3})..\1$/p' words.txt
mesdames
respires
restores
testates

As a special case, \0 or & represents entire matched string in the replacement section.

$ # duplicate first column value as final column
$ # same as: sed -E 's/^([^,]+).*/\0,\1/'
$ echo 'one,2,3.14,42' | sed -E 's/^([^,]+).*/&,\1/'
one,2,3.14,42,one

$ # surround entire line with double quotes
$ echo 'hello world' | sed 's/.*/"&"/'
"hello world"

$ echo 'hello world' | sed 's/.*/Hi. &. Have a nice day/'
Hi. hello world. Have a nice day

If quantifier is applied on a pattern grouped inside () metacharacters, you’ll need an outer () group to capture the matching portion. Other regular expression engines like PCRE (Perl Compatible Regular Expressions) provide non-capturing group to handle such cases. In sed you’ll have to work around the extra capture group.

$ # surround only third column with double quotes
$ # note the numbers used in replacement section
$ echo 'one,2,3.14,42' | sed -E 's/^(([^,]+,){2})([^,]+)/\1"\3"/'
one,2,"3.14",42
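
Group numbers follow the position of the opening parenthesis, which is easy to lose track of once groups are nested. The sketch below, with a made-up date string and a hypothetical variable name out, uses an outer group \1 that contains the inner groups \2 and \3.

```shell
# numbering by opening '(' from the left:
#   \1 = year-month, \2 = year, \3 = month, \4 = day
out=$(echo '2024-05-17' |
      sed -E 's/^(([0-9]{4})-([0-9]{2}))-([0-9]{2})$/\4.\3.\2 (\1)/')
echo "$out"
```

The result should be 17.05.2024 (2024-05), showing that \1 still holds the full year-month text even though its parts were captured separately.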

Here’s an example where alternation order matters when matching portions have the same length. Aim is to delete all whole words unless it starts with g or p and contains y. See stackoverflow: Non greedy matching in sed for another use case.

$ s='tryst,fun,glyph,pity,why,group'

$ # all words get deleted because \b\w+\b gets priority here
$ echo "$s" | sed -E 's/\b\w+\b|(\b[gp]\w*y\w*\b)/\1/g'
,,,,,

$ # capture group gets priority here, thus words matching the group are retained
$ echo "$s" | sed -E 's/(\b[gp]\w*y\w*\b)|\b\w+\b/\1/g'
,,glyph,pity,,

As \ and & are special characters in replacement section, use \\ and \& respectively for literal representation.

$ echo 'foo and bar' | sed 's/and/[&]/'
foo [and] bar
$ echo 'foo and bar' | sed 's/and/[\&]/'
foo [&] bar

$ echo 'foo and bar' | sed 's/and/\\/'
foo \ bar
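
Escaping & matters in practice whenever the replacement text itself contains an ampersand. As an illustration (the input and the variable name out are assumptions, and only two characters are handled), here is a sketch that converts & and < to HTML entities:

```shell
# '&' in the search section is an ordinary character
# '\&' in the replacement section inserts a literal '&'
# '&' is converted first so that the '&' of '&lt;' isn't converted again
out=$(echo 'a < b & c' | sed 's/&/\&amp;/g; s/</\&lt;/g')
echo "$out"
```

This should print a &lt; b &amp; c. Without the backslash, & in the replacement would re-insert the matched text instead.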

Warning: Backreference will provide the string that was matched, not the pattern that was inside the capture group. For example, if ([0-9][a-f]) matches 3b, then backreferencing will give 3b and not any other valid match like 8f, 0a etc. This is akin to how variables behave in programming, only the result of expression stays after variable assignment, not the expression itself.

Known Bugs

Visit sed bug list for known issues.

Here’s an issue for certain usage of backreferences and quantifier that was filed by yours truly.

$ # takes some time and results in no output
$ # aim is to get words having two occurrences of repeated characters
$ # works if you use perl -ne 'print if /^(\w*(\w)\2\w*){2}$/'
$ sed -nE '/^(\w*(\w)\2\w*){2}$/p' words.txt | head -n5

$ # works when nesting is unrolled
$ sed -nE '/^\w*(\w)\1\w*(\w)\2\w*$/p' words.txt | head -n5
Abbott
Annabelle
Annette
Appaloosa
Appleseed

Warning: unix.stackexchange: Why doesn’t this sed command replace the 3rd-to-last «and»? shows another interesting bug when word boundaries and group repetition are involved. Some examples are shown below. Again, workaround is to expand the group.

$ # wrong output
$ echo 'cocoa' | sed -nE '/(\bco){2}/p'
cocoa
$ # correct behavior, no output
$ echo 'cocoa' | sed -nE '/\bco\bco/p'

$ # wrong output, there's only 1 whole word 'it' after 'with'
$ echo 'it line with it here sit too' | sed -E 's/with(.*\bit\b){2}/XYZ/'
it line XYZ too
$ # correct behavior, input isn't modified
$ echo 'it line with it here sit too' | sed -E 's/with.*\bit\b.*\bit\b/XYZ/'
it line with it here sit too

$ # changing word boundaries to \< and \> results in a different problem
$ # this correctly doesn't modify the input
$ echo 'it line with it here sit too' | sed -E 's/with(.*\<it\>){2}/XYZ/'
it line with it here sit too
$ # this correctly modifies the input
$ echo 'it line with it here it too' | sed -E 's/with(.*\<it\>){2}/XYZ/'
it line XYZ too
$ # but this one fails to modify the input
$ echo 'it line with it here it too sit' | sed -E 's/with(.*\<it\>){2}/XYZ/'
it line with it here it too sit

Cheatsheet and summary

Note                    Description
BRE                     Basic Regular Expression, enabled by default
ERE                     Extended Regular Expression, enabled using -E option
                        note: only ERE syntax is covered below
metacharacters          characters with special meaning in REGEXP
^                       restricts the match to the start of line
$                       restricts the match to the end of line
\b                      restricts the match to start/end of words
                        word characters: alphabets, digits, underscore
\B                      matches wherever \b doesn’t match
\<                      start of word anchor
\>                      end of word anchor
|                       combine multiple patterns as conditional OR
                        each alternative can have independent anchors
                        alternative which matches earliest in the input gets precedence
                        and the leftmost longest portion wins in case of a tie
()                      group pattern(s)
a(b|c)d                 same as abd|acd
\                       prefix metacharacters with \ to match them literally
\\                      to match \ literally
                        switching between ERE and BRE helps in some cases
/                       idiomatically used as the delimiter for REGEXP
                        any character except \ and newline character can also be used
.                       match any character, including the newline character
?                       match 0 or 1 times
*                       match 0 or more times
+                       match 1 or more times
{m,n}                   match m to n times
{m,}                    match at least m times
{,n}                    match up to n times (including 0 times)
{n}                     match exactly n times
pat1.*pat2              any number of characters between pat1 and pat2
pat1.*pat2|pat2.*pat1   match both pat1 and pat2 in any order
[ae;o]                  match any of these characters once
                        quantifiers are applicable to character classes too
[3-7]                   range of characters from 3 to 7
[^=b2]                  match other than = or b or 2
[a-z-]                  - should be first/last character to match literally
[+^]                    ^ shouldn’t be first character
[]=]                    ] should be first character
                        combinations like [. or [: have special meaning
\w                      similar to [a-zA-Z0-9_] for matching word characters
\s                      similar to [ \t\n\r\f\v] for matching whitespace characters
                        \W and \S for their opposites respectively
[:digit:]               named character set, same as [0-9]
\xNN                    represent ASCII character using hexadecimal value
                        use \dNNN for decimal and \oNNN for octal
\N                      backreference, gives matched portion of Nth capture group
                        applies to both search and replacement sections
                        possible values: \1, \2 up to \9
\0 or &                 represents entire matched string in the replacement section

Regular expressions is a feature that you’ll encounter in multiple command line programs and programming languages. It is a versatile tool for text processing. Although the features provided by BRE/ERE implementation are less compared to those found in programming languages, they are sufficient for most of the tasks you’ll need for command line usage. It takes a lot of time to get used to syntax and features of regular expressions, so I’ll encourage you to practice a lot and maintain notes. It’d also help to consider it as a mini-programming language in itself for its flexibility and complexity. In the next chapter, you’ll learn about flags that add more features to regular expressions usage.

Exercises

a) For the given input, print all lines that start with den or end with ly.

$ lines='lovely\n1 dentist\n2 lonely\neden\nfly away\ndent\n'
$ printf '%b' "$lines" | sed ##### add your solution here
lovely
2 lonely
dent

b) Replace all occurrences of 42 with [42] unless it is at the edge of a word. Note that a word in these exercises has the same meaning as defined in regular expressions.

$ echo 'hi42bye nice421423 bad42 cool_42a 42c' | sed ##### add your solution here
hi[42]bye nice[42]1[42]3 bad42 cool_[42]a 42c

c) Add [] around words starting with s and containing e and t in any order.

$ words='sequoia subtle exhibit asset sets tests site'
$ echo "$words" | sed ##### add your solution here
sequoia [subtle] exhibit asset [sets] tests [site]

d) Replace all whole words with X that start and end with the same word character.

$ echo 'oreo not a _a2_ roar took 22' | sed ##### add your solution here
X not X X X took X

e) Replace all occurrences of [4]|* with 2

$ echo '2.3/[4]|*6 foo 5.3-[4]|*9' | sed ##### add your solution here
2.3/26 foo 5.3-29

f) sed -nE '/\b[a-z](on|no)[a-z]\b/p' is same as sed -nE '/\b[a-z][on]{2}[a-z]\b/p'. True or False? Sample input shown below might help to understand the differences, if any.

$ printf 'known\nmood\nknow\npony\ninns\n'
known
mood
know
pony
inns

g) Print all lines that start with hand and end with no further character or s or y or le.

$ lines='handed\nhand\nhandy\nunhand\nhands\nhandle\n'
$ printf '%b' "$lines" | sed ##### add your solution here
hand
handy
hands
handle

h) Replace 42//5 or 42/5 with 8 for the given input.

$ echo 'a+42//5-c pressure*3+42/5-14256' | sed ##### add your solution here
a+8-c pressure*3+8-14256

i) For the given quantifiers, what would be the equivalent form using {m,n} representation?

  • ? is same as
  • * is same as
  • + is same as

j) True or False? In ERE, (a*|b*) is same as (a|b)*

k) For the given input, construct two different REGEXPs to get the outputs as shown below.

$ # delete from '(' till next ')'
$ echo 'a/b(division) + c%d() - (a#(b)2(' | sed ##### add your solution here
a/b + c%d - 2(

$ # delete from '(' till next ')' but not if there is '(' in between
$ echo 'a/b(division) + c%d() - (a#(b)2(' | sed ##### add your solution here
a/b + c%d - (a#2(

l) For the input file anchors.txt, convert markdown anchors to corresponding hyperlinks.

$ cat anchors.txt
# <a name="regular-expressions"></a>Regular Expressions
## <a name="subexpression-calls"></a>Subexpression calls
## <a name="the-dot-meta-character"></a>The dot meta character

$ sed ##### add your solution here
[Regular Expressions](#regular-expressions)
[Subexpression calls](#subexpression-calls)
[The dot meta character](#the-dot-meta-character)

m) Replace the space character that occurs after a word ending with a or r with a newline character.

$ echo 'area not a _a2_ roar took 22' | sed ##### add your solution here
area
not a
_a2_ roar
took 22

n) Surround all whole words with (). Additionally, if the whole word is imp or ant, delete them. Can you do it with single substitution?

$ words='tiger imp goat eagle ant important'
$ echo "$words" | sed ##### add your solution here
(tiger) () (goat) (eagle) () (important)

Symbols | Hits | Examples
sa | all words containing the string sa | sa, vasaku, sahata, tisa
\bsa | all words starting with sa | sa, sahata, sana; NOT vasaku, tisa
\bsa\b | all words sa | sa
\bsa..\b | all words consisting of sa + two letters that follow sa | saka, saku, sana
\bsa\w+ | all words beginning with sa, but not the word sa by itself | sahata, sana
\b.*ana\b | all words ending in ana | sinana, tamuana, sana, bana, maana
(....)\1 | all words with four reduplicated letters | pakupaku, vapakupaku, mahumahun, vamahumahun
\b(....)\1 | all words beginning with four reduplicated letters | pakupaku; NOT vapakupaku
\b(....)\1ana\b | all words beginning with four reduplicated letters and ending in ana | vasuvasuana, hunuhunuana
\bva(....)\1 | all words consisting of the prefix va- + four reduplicated letters | vapakupaku, vagunagunaha
\bvahaa?\b | all tokens of vahaa and vaha | vahaa and vaha
