Regular expression word characters

Regular expressions (also called regexp or regex) are a mechanism for finding and replacing text: in a string, a file, or several files. Developers use them in application code, testers use them in automated tests, and they are handy for everyday work on the command line.

Why is this better than a plain search? Because it lets you specify a pattern.

For example, a date of birth arrives in the format DD.MM.YYYY. You need to pass it on in the format YYYY-MM-DD. How would you do that with a plain search? You do not know in advance which date will come in.

A regular expression lets you describe the pattern: "find me digits in such-and-such a format".

What are regular expressions used for?

  1. Deleting all files that start with test (cleaning up test data after ourselves)

  2. Finding all logs

  3. grep-ing through logs

  4. Finding all dates

And also for replacement, for example to change the format of every date in a file. If there is one date, you can change it by hand. If there are 200, it is easier to write a regex and replace them automatically. All the more so because regular expressions are supported even by a simple text editor (Notepad++ certainly has them).

In this article I will explain how to use regular expressions for search and replace. We will go through all the main options.

Contents

  1. Where to try it out

  2. Searching for text

  3. Matching any character

  4. Matching a set of characters

  5. Listing alternatives

  6. Metacharacters

  7. Special characters

  8. Quantifiers (number of repetitions)

  9. Position within a string

  10. Using backreferences

  11. Lookahead and lookbehind

  12. Replacement

  13. Articles and books on the topic

  14. Summary

Where to try it out

You can try every regular expression from this article right away. That makes it clearer what the article is about: paste in an example, then play with it yourself, taking a step to the left and a step to the right. Where to practice:

  1. Notepad++ (set Search Mode → Regular expression)

  2. Regex101 (my favorite among the online options)

  3. Myregexp

  4. Regexr

The tools are ready, so let's begin.

Searching for text

The simplest kind of regexp. It works like a plain search: it looks for exactly the string you typed.

Text: Море, море, океан

Regex: море

Finds: Море, море, океан (the match is the second word, the lowercase "море")

Italics would not make it obvious at a glance what exactly the regex found, and I cannot highlight with color in this article: the BACKGROUND-COLOR attribute did not work. So I give each regex both as text (so that you can copy it) and as a picture that shows what exactly the regex found.

Note that it found "море" and not the first "Море". Regular expressions are case-sensitive!

There are options, of course. In JavaScript you can add the i flag to ignore case when searching. Notepad++ has a "Match case" checkbox too. But keep in mind that ignoring case is not the default, and it is always worth checking whether your search implementation is case-sensitive or not.

What happens if the word we are looking for occurs several times?

Text: Море, море, море, океан

Regex: море

Finds: Море, море, море, океан (only the first lowercase "море")

By default most regex engines return only the first occurrence. JavaScript has the g (global) flag, which lets you get an array containing all occurrences.
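Here is a minimal sketch of the same searches in Python. The choice of Python's re module is mine, for illustration only; the article itself talks about JavaScript and Notepad++. In this sketch re.IGNORECASE plays the role of the i flag and re.findall the role of the g flag.

```python
import re

text = "Море, море, море, океан"

# A plain search is case-sensitive and stops at the first match.
first = re.search("море", text)
print(first.start())  # position of the first lowercase "море"

# All matches at once, similar to the g flag in JavaScript.
all_matches = re.findall("море", text)

# Ignoring case, similar to the i flag in JavaScript.
any_case = re.findall("море", text, flags=re.IGNORECASE)
print(all_matches, any_case)
```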

What if the word we are looking for does not stand alone but is part of another word? The regular expression will still find it:

Text: Море, 55мореон, океан

Regex: море

Finds: Море, 55мореон, океан (the "море" inside "55мореон")

This is the default behavior, and for searching it is actually a good thing. Say I remember that a colleague recently told a story in the chat about an interesting bug in a game. Something to do with a ship... But what exactly? I no longer remember. How do I find it?

If the search works only by exact match, I would have to go through every grammatical case of the word "корабль" (ship). If it works by inclusion, I simply leave off the ending and still find the text I need:

Regex: корабл

Finds:

На корабле

И тут корабль

У корабля

This is static, predefined text, and it can be found without regexes too. Regular expressions are especially good when we do not know exactly what we are looking for: we know only part of a word, or a pattern.

Matching any character

. matches any character (exactly one).

Text:

Аня

Ася

Оля

Аля

Валя

Regex: А.я

Result:

Found: Аня, Ася, Аля

Not found: Оля, Валя (in "Валя" the "а" is lowercase)

The "." character stands for exactly one character of any kind.


The dot matches absolutely any character, including digits, special characters, even spaces. So besides normal names we will also find values like these:

А6я

А&я

А я

Keep this in mind when searching! The dot is a very convenient character, but a very dangerous one too. If you use it, be sure to test the resulting regular expression. Does it find what you need? Does it avoid finding anything extra?

The dot matches a dot too!

Regex: file.

Finds:

file.txt

file1.txt

file2.xls

But what if we need to find an actual dot? Say we want to find all files with the txt extension and write this pattern:

Regex: .txt

Result:

Found: file.txt, log.txt, 1txt.doc, one_txt.jpg

Not found: file.png

Yes, we found the txt files, but along with them some "junk" values where "txt" sits in the middle of the name. To cut off the extras we can use position within the string (we will get to it a little later).

But if we want to find a literal dot, we need to escape it, that is, put a backslash in front of it:

Regex: \.txt

Result:

Found: file.txt, log.txt

Not found: file.png, 1txt.doc, one_txt.jpg

We will do the same with all special characters. Want to find that exact character in the text? Put a backslash in front of it.

The search rule for the dot:

. - any character

\. - a literal dot
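The difference between the two patterns can be checked with a short Python sketch (Python's re module is an assumption of mine, used only for illustration; the file names are the ones from the example above):

```python
import re

names = ["file.txt", "log.txt", "file.png", "1txt.doc", "one_txt.jpg"]

# Unescaped dot: any single character followed by "txt".
unescaped = [n for n in names if re.search(r".txt", n)]

# Escaped dot: a literal "." followed by "txt".
escaped = [n for n in names if re.search(r"\.txt", n)]

print(unescaped)
print(escaped)
```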

Matching a set of characters

Say we want to find the names "Алла" and "Анна" in a list. We could try searching with the dot, but besides the normal names we get all sorts of garbage back:

Regex: А..а

Result (every line contains a match):

Анна

Алла

аоикА74арплт

Аркан

А^&а

Абба

If we want precisely Анна and Алла, we should use a set of allowed values instead of the dot. We put square brackets and list the characters we need inside them:

Regex: А[нл][нл]а

Result:

Found: Анна, Алла

Not found: аоикА74арплт, Аркан, А^&а, Абба

Now the result is better! Yes, we could still get "Анла" back, but we will fix errors like that a little later.

How do square brackets work? Inside them we specify the set of allowed characters. It can be a list of the letters we need or a range:

[нл] - only "н" and "л"

[а-я] - all lowercase Russian letters from "а" to "я" (except "ё")

[А-Я] - all uppercase Russian letters (except "Ё")

[А-Яа-яЁё] - all Russian letters

[a-z] - lowercase Latin letters

[a-zA-Z] - all English letters

[0-9] - any digit

[В-Ю] - letters from "В" to "Ю" (yes, a range does not have to run from А to Я)

[А-ГО-Р] - letters from "А" to "Г" and from "О" to "Р"

Note: when we list the allowed options, we do not put separators between them! No space, no comma, nothing.

[абв] - only "а", "б" or "в"

[а б в] - "а", "б", "в", or a space (which may lead to unwanted results)

[а, б, в] - "а", "б", "в", a space or a comma

The only allowed separator is the hyphen. If the engine sees a hyphen inside square brackets, it is a range:

  • The character before the hyphen is the start of the range

  • The character after it is the end

One character! Not two or ten, but one! Keep this in mind if you feel like writing something like [1-31]. No, that is not a range from 1 to 31. It reads like this:

  • The range from 1 to 3

  • And the digit 1

Here the absence of separators plays a cruel joke on our minds. It really looks as if we wrote a range from 1 to 31! But no. So if you write regular expressions, it is very important to test them. We are testers for a reason! Check what you have written! Especially if you are trying to delete something with a regular expression =)) You might delete more than you meant to...
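A quick way to convince yourself that [1-31] is not "1 to 31" is to test it. Below is a small Python sketch (my own illustration, not from the article) that checks which single digits the class accepts and whether it can match the string "31" as a whole:

```python
import re

pattern = re.compile(r"[1-31]")

# The class matches exactly one character: "1", "2" or "3".
matched_digits = [d for d in "0123456789" if pattern.fullmatch(d)]
print(matched_digits)

# A two-character string can never match a single character class as a whole.
matches_31 = pattern.fullmatch("31") is not None
print(matches_31)
```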

Specifying a range instead of the dot helps filter out obviously bad data:

Regex: А.я or А[а-я]я

Found by both:

Аня

Ася

Аля

Found only by "А.я":

А6я

А&я

А я

^ inside [] means negation:

[^0-9] - any character except digits

[^ёЁ] - any character except the letter "ё"

[^а-в8] - any character except the letters "а", "б", "в" and the digit 8

For example, we want to find all txt files except those split into pieces, whose names end in a digit:

Regex: [^0-9]\.txt

Result:

Found: file.txt, log.txt

Not found: file_1.txt, 1.txt

Since square brackets are special characters, you cannot find them in text without escaping:

Regex: fruits[0]

Finds: fruits0

Does not find: fruits[0]

This regular expression says "find me the text "fruits" followed by the digit 0". The square brackets are not escaped, so what is inside them is a set of allowed characters.

If we want to find exactly element 0 of the fruits array, we write it like this:

Regex: fruits\[0\]

Finds: fruits[0]

Does not find: fruits0

And if we want to find all elements of the fruits array, we put unescaped square brackets inside the escaped ones!

Regex: fruits\[[0-9]\]

Finds:

fruits[0] = “апельсин”;

fruits[1] = “яблоко”;

fruits[2] = “лимон”;

Does not find:

cat[0] = “чеширский кот”;

Of course, "reading" such a regular expression gets a bit hard, with so many different characters...

Don't panic! If you see a complicated regular expression, just take it apart piece by piece. Remember the foundation of effective time management? You eat an elephant one bite at a time.

Say a mountain of email piled up after your vacation. You look at it and immediately lose heart:

"Ugh, I will never finish this in a day!"

The problem is that the weight of the task gets in the way of working. We understand that it will take a long time, and nobody wants to do a big task... So we put it off and pick up smaller tasks. In the end the day is over and we have not finished.

But if you do not spend time thinking "how long will this take me" and focus on one concrete task (here, the first email in the pile, then the second...), you will have cleared everything before you know it!

Let's take the regular expression fruits\[[0-9]\] apart.

First comes plain text: "fruits".

Then a backslash. Aha, it escapes something.

What exactly? A square bracket. So this is just a square bracket in my text: "fruits["

Next comes another square bracket. It is not escaped, so it opens a set of allowed values. We look for the closing square bracket.

Found it. Our set is [0-9], that is, any digit. But just one. It cannot be 10, 11 or 325, because square brackets without a quantifier (we will talk about those a little later) stand for exactly one character.

So far we have: fruits["any single digit"

Next comes another backslash. So the special character after it is just a character in my text.

And the next character is ]

The whole expression is: fruits["any single digit"]

Our expression finds elements of the fruits array! Not just element zero, but also the first, the fifth... all the way to the ninth:

Regex: fruits\[[0-9]\]

Finds:

fruits[0] = “апельсин”;

fruits[1] = “яблоко”;

fruits[9] = “лимон”;

Does not find:

fruits[10] = “банан”;

fruits[325] = “ абрикос ”;

To find all elements of the array, see the "Quantifiers" section below.
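The escaped and unescaped variants can be compared side by side. This is a minimal Python sketch under my own assumptions (Python's re module and a made-up list of sample lines based on the examples above):

```python
import re

lines = [
    'fruits[0] = "апельсин";',
    'fruits[9] = "лимон";',
    'fruits[10] = "банан";',
    'cat[0] = "чеширский кот";',
    "fruits0",
]

# Unescaped brackets: "fruits" followed by one character from the set {0}.
as_set = [s for s in lines if re.search(r"fruits[0]", s)]

# Escaped outer brackets with an unescaped set inside: fruits[<one digit>].
as_index = [s for s in lines if re.search(r"fruits\[[0-9]\]", s)]

print(as_set)
print(as_index)
```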

For now, let's see how ranges can help us find all dates.

What is the pattern of a date? We will look at DD.MM.YYYY:

  • 2 digits for the day

  • a dot

  • 2 digits for the month

  • a dot

  • 4 digits for the year

Written as a regular expression: [0-9][0-9]\.[0-9][0-9]\.[0-9][0-9][0-9][0-9]

A reminder: we cannot write the range [1-31], because it would not mean "a range from 1 to 31" but "a range from 1 to 3, plus the digit 1". So we write a pattern for each digit separately.

In principle, this expression will find dates among other text. But what if we use the regex to validate a date entered by a user? Is this regexp good enough?

Let's test it! How about the year 8888 or month 99?

Regex: [0-9][0-9]\.[0-9][0-9]\.[0-9][0-9][0-9][0-9]

Finds:

01.01.1999

05.08.2015

Also finds:

08.08.8888

99.99.2000

Let's try to restrict it:

  • The day of the month is at most 31, so the first digit is [0-3]

  • The largest month is 12, so the first digit is [01]

  • The year is either 19.. or 20.., so the first digit is [12] and the second is [09]

There, that is better: the regex rejects the obviously bad data. Admittedly it will reject quite a lot of test data, because when people want to break something, they usually throw in year "9999" or month "99"...

However, if we look at the regular expression more closely, we can find holes in it:

Regex: [0-3][0-9]\.[0-1][0-9]\.[12][09][0-9][0-9]

Does not find:

08.08.8888

99.99.2000

But finds:

33.01.2000

01.19.1999

05.06.2999

We cannot describe the allowed values with a single range. Either we lose the 31st, or we let 39 through. If we want to validate a date, ranges alone are not enough. We need a way to list alternatives, which is what we turn to now.
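The holes described above are easy to reproduce. The sketch below is my own illustration in Python (the variable names are hypothetical); it runs both date patterns over the sample values from this section using re.fullmatch, which requires the whole string to match:

```python
import re

loose = re.compile(r"[0-9][0-9]\.[0-9][0-9]\.[0-9][0-9][0-9][0-9]")
tighter = re.compile(r"[0-3][0-9]\.[0-1][0-9]\.[12][09][0-9][0-9]")

samples = [
    "01.01.1999", "05.08.2015",   # valid dates
    "08.08.8888", "99.99.2000",   # obviously bad
    "33.01.2000", "01.19.1999", "05.06.2999",  # bad, but subtle
]

accepted_loose = [s for s in samples if loose.fullmatch(s)]
accepted_tighter = [s for s in samples if tighter.fullmatch(s)]

print(accepted_loose)
print(accepted_tighter)
```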

Listing alternatives

Square brackets [] help list the options for a single character. If we want to list whole words, it is better to use the vertical bar: |.

Regex: Оля|Олечка|Котик

Finds:

Оля

Олечка

Котик

Does not find:

Оленька

Котенка

You can use the vertical bar for a single character too, even inside a word. In that case we put the variable letter in parentheses:

Regex: А(н|л)я

Finds:

Аня

Аля

Parentheses denote a group of characters. In this group we have either the letter "н" or the letter "л". Why do we need parentheses? To show where the group starts and ends. Otherwise the vertical bar applies to all the characters, and we would be looking for either "Ан" or "ля":

Regex: Ан|ля

Finds:

Аня

Аля

Оля

Малюля

If we want exactly "Аня" or "Аля", we apply the alternation to the second character only, by putting it in parentheses.

These 2 variants return the same thing:

  • А(н|л)я

  • А[нл]я

But for substituting a single letter it is better to use [], because matching against a character class is simpler than processing a group and checking all its possible modifiers.

Let's return to the task "validate a user-entered date with regular expressions". We tried writing the range [0-3][0-9] for the day, but it lets through 33, 35, 39... That is no good!

So let's spell out the spec in more detail. Let's see... If the first digit is:

  • 0, the second can be 1 to 9 (there is no day 00)

  • 1 or 2, the second can be 0 to 9

  • 3, the second can only be 0 or 1

Let's write a regular expression for each item:

  • 0[1-9]

  • [12][0-9]

  • 3[01]

Now all that is left is to join them into one expression! We get: 0[1-9]|[12][0-9]|3[01]

The month and the year are worked out by analogy. That is left to you as homework =)

Then, once we have written separate regexes for the day, month and year, we put everything together:

(<day>)\.(<month>)\.(<year>)

Note that we put each part of the regular expression in parentheses. Why? To show the engine where the alternation ends. Look: suppose that for the month and year we still have this expression:

[0-1][0-9]\.[12][09][0-9][0-9]

Let's prepend what we wrote for the day:

0[1-9]|[12][0-9]|3[01]\.[0-1][0-9]\.[12][09][0-9][0-9]

How does this expression read?

  • EITHER 0[1-9]

  • OR [12][0-9]

  • OR 3[01]\.[0-1][0-9]\.[12][09][0-9][0-9]

See the problem? The number "19" would count as a valid date. The engine does not know that the alternation | ended at the dot after the day. For it to understand that, we have to put the alternation in parentheses. Just as in mathematics, we separate the terms.
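The effect of the missing parentheses can be demonstrated directly. In this Python sketch (my own illustration; DAY and REST are hypothetical names) the month and year part is the simplified one from the text, and only the grouping of the day alternation differs:

```python
import re

DAY = r"0[1-9]|[12][0-9]|3[01]"
REST = r"\.[0-1][0-9]\.[12][09][0-9][0-9]"  # simplified month and year

ungrouped = re.compile(DAY + REST)             # alternation swallows the rest
grouped = re.compile("(" + DAY + ")" + REST)   # alternation limited to the day

# Without parentheses a bare "19" passes and a real date does not.
print(bool(ungrouped.fullmatch("19")), bool(ungrouped.fullmatch("05.08.2015")))

# With parentheses the day is validated and the rest is still required.
print(bool(grouped.fullmatch("19")), bool(grouped.fullmatch("05.08.2015")))
print(bool(grouped.fullmatch("33.01.2000")))
```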

So remember: if an alternation sits in the middle of a word, it must be put in parentheses!

Regex: А(нн|лл|лин|нтонин)а

Finds:

Анна

Алла

Алина

Антонина

Without parentheses:

Regex: Анн|лл|лин|нтонина

Finds:

Анна

Алла

Аннушка

Кукулинка

To sum up, if we want to specify the allowed values of:

  • A single character, we use []

  • Several characters or a whole word, we use |

Metacharacters

If we want to find a digit, we write the range [0-9].

If a letter, then [а-яА-ЯёЁa-zA-Z].

Is there another way?

There is! Regular expressions have special metacharacters that stand for a specific range of values:

Character   Equivalent        Meaning
\d          [0-9]             A digit
\D          [^0-9]            A non-digit
\s          [ \f\n\r\t\v]     A whitespace character
\S          [^ \f\n\r\t\v]    A non-whitespace character
\w          [[:word:]]        A letter, a digit or an underscore
\W          [^[:word:]]       Any character except a letter, a digit or an underscore
.                             Any character at all
These are the most common ones, the ones you will use most often. But let's look at the "Equivalent" column. For \d everything is clear: it is just digits. But what are "whitespace characters"? They include:

Character   Meaning
(space)     Space
\r          Carriage return (CR)
\n          Line feed (LF)
\t          Tab
\v          Vertical tab
\f          Form feed
[\b]        Backspace

Of these you will most often use the space itself and the line break, written as "\r\n" (the Windows line ending; on Unix-like systems it is just "\n"). Let's write some text on several lines:

Первая строка

Вторая строка

For a regular expression this is:

Первая строка\r\nВторая строка

And what is a backspace in text? How could you even see it? It is what happens when you type a character and erase it. In the end the character is gone! Is the erasure really stored somewhere in memory? That would be terrible: we could never find anything, because how would we know how many times the text was edited and where the invisible [\b] characters now are?

Relax: this character does not find every place where the text was edited. Backspace is simply an ASCII character that can appear in text (ASCII code 8, or 10 in octal). You can "create" one by typing this in the browser console (which runs JavaScript):

console.log("abc\b\bdef");

Result of the command, in a console that interprets the backspace character:

adef

We wrote "abc" and then erased "b" and "c". The user does not see them in the console, but they are there, because we put the delete character right into the code. We did not just delete text, we wrote that character out. That is the character the regular expression [\b] will find.

See also:

What’s the use of the [\b] backspace regex? (more about this character)

But usually, when we write \s, we mean a space, a tab or a line break.

OK, those equivalents are sorted out. But what does [[:word:]] mean? It is one way of replacing a range. To make them easier to remember, the values are written in English and the characters are grouped into classes. The classes are:

Character class   Meaning
[[:alnum:]]       Letters or digits: [а-яА-ЯёЁa-zA-Z0-9]
[[:alpha:]]       Letters only: [а-яА-ЯёЁa-zA-Z]
[[:digit:]]       Digits only: [0-9]
[[:graph:]]       Visible characters only (spaces, control characters and so on are excluded)
[[:print:]]       Visible characters and spaces
[[:space:]]       Whitespace characters: [ \f\n\r\t\v]
[[:punct:]]       Punctuation: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[[:word:]]        A letter, a digit or an underscore: [а-яА-ЯёЁa-zA-Z0-9_]

Now we can rewrite the date regex, which selects only dates in the DD.MM.YYYY format and filters out everything else:

[0-9][0-9]\.[0-9][0-9]\.[0-9][0-9][0-9][0-9]

\d\d\.\d\d\.\d\d\d\d

You have to admit the metacharacter notation looks nicer =))
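A short Python sketch (my own illustration, with a made-up sample sentence) shows that the two notations find the same dates. One caveat that is specific to Python and not stated in the article: for ordinary text strings its \d and \w are Unicode-aware, so \w also matches Cyrillic letters and \d is slightly wider than [0-9].

```python
import re

text = "Приходи на ДР 09.08.2015! Дедлайн 01.01.2000."

with_ranges = re.findall(r"[0-9][0-9]\.[0-9][0-9]\.[0-9][0-9][0-9][0-9]", text)
with_meta = re.findall(r"\d\d\.\d\d\.\d\d\d\d", text)

# \w covers Latin letters, digits, the underscore and (in Python 3) Cyrillic.
words = re.findall(r"\w+", "olga_31 и Оля")

print(with_ranges, with_meta, words)
```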

Special characters

Most characters in a regular expression stand for themselves, with the exception of the special characters:

[ ] \ / ^ $ . | ? * + ( ) { }

These characters are needed to denote a range of allowed values or the boundary of a phrase, to specify the number of repetitions, or to do something else. The set differs between regex flavors (see "flavors of regular expressions").

If you want to find one of these characters in your text, you have to escape it with \ (backslash).

Regex: 2\^2 = 4

Finds: 2^2 = 4

You can escape a whole sequence of characters by enclosing it between \Q and \E (though not in every flavor).

Regex: \Q{кто тут?}\E

Finds: {кто тут?}
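As an illustration, here is how escaping looks in Python. Note the assumption: Python's re module is one of the flavors without \Q...\E, so this sketch uses re.escape, which adds the backslashes for you and serves a similar purpose.

```python
import re

# Escaped caret: matches the literal text "2^2 = 4".
escaped_hit = re.search(r"2\^2 = 4", "2^2 = 4")

# Unescaped caret means "start of string" and can never match after the "2".
unescaped_hit = re.search(r"2^2 = 4", "2^2 = 4")

# re.escape builds a pattern that matches the given text literally.
literal = re.escape("{кто тут?}")
literal_hit = re.search(literal, "Эй, {кто тут?}")

print(escaped_hit is not None, unescaped_hit is not None, literal_hit.group())
```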

Quantifiers (number of repetitions)

Let's make the task harder. We have some text and need to extract all email addresses from it. For example:

  • test@mail.ru

  • olga31@gmail.com

  • pupsik_99@yandex.ru

How do you build a regular expression? Study the data you want to get as output and build a pattern from it. An email has two separators: the at sign "@" and the dot ".".

Let's write the spec for the regular expression:

  • Letters / digits / _

  • Then @

  • Again letters / digits / _

  • A dot

  • Letters

So, before the at sign we clearly have the metacharacter "\w": it covers plain text (test), digits (olga31) and the underscore (pupsik_99). But there is a problem: we do not know how many such characters there will be. With a date everything is clear: 2 digits, 2 digits, 4 digits. Here there may be 2 characters or 22.

This is where quantifiers come to the rescue. That is the name for special characters in regular expressions that specify how many times the preceding element repeats.

The "+" character means "one or more repetitions", which is exactly what we need! We get: \w+@

After the at sign comes \w again, and again at least once. We get: \w+@\w+\.

After the dot there are usually letters, but for simplicity we can write \w again. And again we expect several characters without knowing exactly how many. So we end up with an expression that finds an email of any length:

Regex: \w+@\w+\.\w+

Finds:

test@mail.ru

olga31@gmail.com

pupsik_99_and_slonik_33_and_mikky_87_and_kotik_28@yandex.megatron
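A minimal Python sketch of pulling addresses out of running text with this pattern. The sample sentence is made up for illustration, and the pattern is the article's simplified one, not a full email validator:

```python
import re

EMAIL = re.compile(r"\w+@\w+\.\w+")

text = "Пишите: test@mail.ru, olga31@gmail.com или pupsik_99@yandex.ru."

# findall returns every non-overlapping match as a list of strings.
found = EMAIL.findall(text)
print(found)
```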

What quantifiers are there besides "+"?

Quantifier   Number of repetitions
?            Zero or one
*            Zero or more
+            One or more

The * character is often used with the dot: when we do not care what text precedes the phrase we are interested in, we replace it with ".*", any character zero or more times.

Regex: .*\d\d\.\d\d\.\d\d\d\d.*

Finds:

01.01.2000

Приходи на ДР 09.08.2015! Будет весело!

But be careful! If you use ".*" everywhere, you can get a lot of false positives:

Regex: .*@.*\..*

Finds:

test@mail.ru

olga31@gmail.com

pupsik_99@yandex.ru

But also finds:

@yandex.ru

test@.ru

test@mail.

Better to use \w, and a plus instead of the asterisk.
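To see the false positives for yourself, compare the two patterns on the broken addresses from the example. This is a Python sketch of my own; loose and strict are hypothetical names:

```python
import re

loose = re.compile(r".*@.*\..*")
strict = re.compile(r"\w+@\w+\.\w+")

samples = ["test@mail.ru", "@yandex.ru", "test@.ru", "test@mail."]

# ".*" may match nothing at all, so every broken address passes.
loose_ok = [s for s in samples if loose.fullmatch(s)]

# "\w+" demands at least one character in each part.
strict_ok = [s for s in samples if strict.fullmatch(s)]

print(loose_ok)
print(strict_ok)
```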

But if we want to find all numbered log files (log, log1, log2... log133), then * fits well:

Regex: log\d*\.txt

Finds:

log.txt

log1.txt

log2.txt

log3.txt

log33.txt

log133.txt

The question mark (zero or one repetition) helps us find people with a particular surname, all of them, men and women alike:

Regex: Назина?

Finds:

Назин

Назина

If we want to apply a quantifier to a group of characters or to several words, they must be put in parentheses:

Regex: (Хихи)*(Хаха)*

Finds:

ХихиХаха

ХихиХихиХихи

Хихи

Хаха

ХихиХихиХахаХахаХаха

(the empty string: yes, this regex matches that too)

A quantifier applies to the character or parenthesized group that stands immediately before it.

What if I need a specific number of repetitions? Say I want to write a regular expression for a date. So far we only know the option "repeat the metacharacter the required number of times": \d\d\.\d\d\.\d\d\d\d.

Fine, that is 2 to 4 repetitions, but what if it is 10? What if a whole phrase has to be repeated? Do we write it out 10 times? Not very convenient. And we cannot use *:

Regex: \d*\.\d*\.\d*

Finds:

.0.1999

05.08.20155555555555555

03444.025555.200077777777777777

To specify an exact number of repetitions, write it inside curly braces:

Quantifier   Number of repetitions
{n}          Exactly n times
{m,n}        From m to n inclusive
{m,}         At least m
{,n}         At most n (not supported in every flavor; {0,n} is the portable form)

So to check a date we can either repeat \d n times or use a quantifier:

\d\d\.\d\d\.\d\d\d\d

\d{2}\.\d{2}\.\d{4}

Both are valid. But the second reads a little easier: no need to count the repetitions yourself, just look at the number.

Don't forget: a quantifier applies to the last character!

Regex: data{2}

Finds: dataa

Does not find: datadata

Or to a group of characters, if they are in parentheses:

Regex: (data){2}

Finds: datadata

Does not find: dataa
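The same checks as a small Python sketch (my own illustration; re.fullmatch requires the entire string to match, which makes the difference easy to see):

```python
import re

# {2} applies to the last character only: "dat" + "a" twice.
last_char = [s for s in ["dataa", "datadata"] if re.fullmatch(r"data{2}", s)]

# {2} applies to the whole parenthesized group.
whole_group = [s for s in ["dataa", "datadata"] if re.fullmatch(r"(data){2}", s)]

# Counted repetitions make the date pattern shorter and easier to read.
date_ok = re.fullmatch(r"\d{2}\.\d{2}\.\d{4}", "05.08.2015") is not None
date_too_long = re.fullmatch(r"\d{2}\.\d{2}\.\d{4}", "05.08.20155") is not None

print(last_char, whole_group, date_ok, date_too_long)
```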

Since curly braces are used to specify the number of repetitions, if you are looking for a literal curly brace in the text, it must be escaped:

Regex: x\{3\}

Finds: x{3}

Sometimes a quantifier finds not quite what we need.

Regex: <.*>

Expectation:

<req>
<query>Ан</query>
<gender>FEMALE</gender>

Reality:

<req> <query>Ан</query> <gender>FEMALE</gender></req>

We want to find all HTML or XML tags individually, but the regular expression returns the whole line, which contains several tags.

A reminder that regular expressions may work slightly differently in different implementations. In most of them a quantifier matches the longest string it can. Such quantifiers are called greedy.

If we realize we found the wrong thing, there are two ways to go:

  1. Take into account the characters that do not fit the desired pattern

  2. Make the quantifier non-greedy (lazy): most implementations let you do that by adding a question mark after it.

How do we take characters into account? For the tag example we can write this regular expression:

<[^>]*>

It looks for an opening "<", then anything at all except the closing ">", and only then the closing ">". This way we do not let it capture too much. As for the second way, keep in mind that lazy quantifiers can cause the opposite problem, where the expression matches a string that is too short, in particular an empty one.

Greedy   Lazy
*        *?
+        +?
{n,}     {n,}?

There is also super-greedy quantification, also known as possessive. But you can read about that on Wikipedia =)
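The three approaches to the tag example can be compared in a few lines of Python (my own sketch; the XML line is the one from the "Reality" example above):

```python
import re

xml = "<req> <query>Ан</query> <gender>FEMALE</gender></req>"

greedy = re.findall(r"<.*>", xml)        # grabs as much as possible
lazy = re.findall(r"<.*?>", xml)         # grabs as little as possible
negated = re.findall(r"<[^>]*>", xml)    # forbids ">" inside the tag

print(greedy)
print(lazy)
print(negated)
```

For this input the lazy quantifier and the negated character class give the same list of tags, while the greedy version returns the whole line as a single match.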

Position within a string

By default regular expressions search "by inclusion".

Regex: арка

Finds:

арка

чарка

аркан

баварка

знахарка

That is not always what we need. Sometimes we want to find one specific word.

If we are looking not for a single word but for a whole line, the problem is solved with spaces:

Regex: Товар №\d+ добавлен в корзину в \d\d:\d\d

Finds: Товар №555 добавлен в корзину в 15:30

Does not find: Товарный чек №555 добавлен в корзину в 15:30

Or like this:

Regex: .* арка .*

Finds: Триумфальная арка была…

Does not find: Знахарка сегодня…

But what if it is not a space next to our word? It could be a punctuation mark: "И вот перед нами арка.", or "…арка:".

If we are looking for a specific word, we can use the metacharacter \b, which denotes a word boundary. If we put it at both ends of the word, we find exactly that word:

Regex: \bарка\b

Finds:

арка

Does not find:

чарка

аркан

баварка

знахарка

You can constrain only the front: "find all words that start with this value":

Regex: \bарка

Finds:

арка

аркан

Does not find:

чарка

баварка

знахарка

You can constrain only the end: "find all words that end with this value":

Regex: арка\b

Finds:

арка

чарка

баварка

знахарка

Does not find:

аркан

The metacharacter \B finds a NON-boundary of a word:

Regex: \Bакр\B

Finds:

закройка

Does not find:

акр

акрил

If we want to find a specific phrase rather than a word, we use these special characters:

^ - start of the text (line)

$ - end of the text (line)

Using them, we can be sure nothing extra has crept into our text:

Regex: ^Я нашел!$

Finds:

Я нашел!

Does not find:

Смотри! Я нашел!

Я нашел! Посмотри!

To sum up, the metacharacters that denote position in a string:

Character   Meaning
\b          word boundary
\B          not a word boundary
^           start of the text (line)
$           end of the text (line)
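The boundary examples above can be reproduced with a Python sketch. This is my own illustration and relies on one Python-specific fact: for ordinary text strings \b and \B are Unicode-aware, so they work with Cyrillic words. Some other engines treat only ASCII letters as word characters, so results there may differ.

```python
import re

words = ["арка", "чарка", "аркан", "баварка", "знахарка"]

whole_word = [w for w in words if re.search(r"\bарка\b", w)]
starts_with = [w for w in words if re.search(r"\bарка", w)]
ends_with = [w for w in words if re.search(r"арка\b", w)]
inside = [w for w in ["закройка", "акр", "акрил"] if re.search(r"\Bакр\B", w)]

# ^ and $ pin the pattern to the start and end of the text.
phrases = ["Я нашел!", "Смотри! Я нашел!", "Я нашел! Посмотри!"]
exact = [p for p in phrases if re.search(r"^Я нашел!$", p)]

print(whole_word, starts_with, ends_with, inside, exact)
```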

Using backreferences

Say that while testing an application you found a funny bug in the text, a duplicated preposition "на": "Поздравляем! Вы прошли на на новый уровень". Then you decided to check whether the code has more errors like it.

The developer gave you a file with all the texts. How do you find the repeats? With a backreference. When we put something in parentheses inside a regular expression, we create a group. Each group gets a number by which we can refer to it.

Regex: [ ]+(\w+)[ ]+\1

Text: Поздравляем! Вы прошли на на новый уровень. Так что что улыбаемся и и машем.

Let's work out what this regular expression means:

[ ]+ → one or more spaces, this is how we delimit the word. In principle it could be replaced with the metacharacter \b.

(\w+) → any letter, digit or underscore. The quantifier "+" means the character must occur at least once. And the fact that we put the whole expression in parentheses says it is a group. We do not yet know what it is for, since there is no quantifier next to it. So it is not for repetition. Either way, the character or word found is group 1.

[ ]+ → again one or more spaces.

\1 → a repetition of group 1. This is the backreference. This is how it is written in JavaScript.
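Here is the same search as a Python sketch (my own illustration; Python also writes the backreference as \1 inside the pattern). The second pattern, with \b on both sides, is my variation rather than the article's: it keeps a word such as "и" from pairing with the first letters of the next word, and it is used with re.sub to remove the duplicates.

```python
import re

text = ("Поздравляем! Вы прошли на на новый уровень. "
        "Так что что улыбаемся и и машем.")

# The article's pattern: spaces, a word (group 1), spaces, the same word again.
repeats = [m.group(1) for m in re.finditer(r"[ ]+(\w+)[ ]+\1", text)]
print(repeats)

# A variation with word boundaries, used to collapse each pair into one word.
fixed = re.sub(r"\b(\w+)[ ]+\1\b", r"\1", text)
print(fixed)
```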

Important: the syntax for referring to a group depends heavily on the regex implementation.

Language             How a group reference is written
JavaScript, vi       \1
Perl                 $1
PHP                  $matches[1]
Java, Python         group(1) on the match object
C#                   match.Groups[1]
Visual Basic .NET    match.Groups(1)

What else is a backreference good for? For example, you can check HTML markup: was it written correctly? Is each opening tag matched by the same closing tag?

Write an expression that finds correctly written tags:

<h2>Заголовок 2-ого уровня</h2>
<h3>Заголовок 3-ого уровня</h3>

But does not find the errors:

<h2>Заголовок 2-ого уровня</h3>

Lookahead and lookbehind

Sometimes you need to find a place in the text without including the found word in the match. To do that we "look around" at the surrounding text.

Syntax          Kind of lookaround     Example               Test strings
(?=pattern)     Positive lookahead     Блюдо(?=11)           Matches in: Блюдо11, Блюдо113. No match in: Блюдо1, Блюдо511
(?!pattern)     Negative lookahead     Блюдо(?!11)           Matches in: Блюдо1, Блюдо511. No match in: Блюдо11, Блюдо113
(?<=pattern)    Positive lookbehind    (?<=Ольга )Назина     Matches in: Ольга Назина. No match in: Анна Назина
(?<!pattern)    Negative lookbehind    (shown in a figure in the original)     Ольга Назина, Анна Назина

In every case the match itself contains only the word outside the lookaround ("Блюдо" or "Назина"), not the text that was looked at.
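All four kinds of lookaround can be tried in a short Python sketch. This is my own illustration. The negative lookbehind pattern (?<!Ольга )Назина is an assumption built by analogy with the positive one, since the original example for it was an image. Python also requires the lookbehind pattern to have a fixed length, which holds here.

```python
import re

dishes = ["Блюдо1", "Блюдо11", "Блюдо113", "Блюдо511"]
people = ["Ольга Назина", "Анна Назина"]

ahead = [d for d in dishes if re.search(r"Блюдо(?=11)", d)]
not_ahead = [d for d in dishes if re.search(r"Блюдо(?!11)", d)]
behind = [p for p in people if re.search(r"(?<=Ольга )Назина", p)]
not_behind = [p for p in people if re.search(r"(?<!Ольга )Назина", p)]  # assumed example

# The lookaround text is checked but not included in the match.
matched_text = re.search(r"Блюдо(?=11)", "Блюдо113").group()

print(ahead, not_ahead, behind, not_behind, matched_text)
```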

Replacement

An important function of regular expressions is not only finding text but also replacing it with other text! The simplest kind of replacement is word for word:

RegEx: Ольга

Replacement: Макар

Text before: Привет, Ольга!

Text after: Привет, Макар!

But what if the source text can contain any name? Whatever the user typed is what got saved. And now we need to replace it with Макар. How do we do such a replacement? With the dollar sign. Let's look at it in more detail.

A dollar sign in the replacement refers to a group in the search pattern. We write a dollar sign and the group number. A group is whatever we put in parentheses. Group numbering starts at 1.

RegEx: (Оля) \+ Маша

Replacement: $1

Text before: Оля + Маша

Text after: Оля

We searched for the phrase "Оля + Маша" (the parentheses are not escaped, so they should not be in the text we search for, they are just a group; the plus is escaped because it is a literal plus, not a quantifier). And we replaced it with the first group, whatever is written in the first parentheses, that is, the text "Оля".

This also works when the text we search for is inside other text:

RegEx: (Оля) \+ Маша

Replacement: $1

Text before: Привет, Оля + Маша!

Text after: Привет, Оля!

You can put each part of the text in parentheses and then vary them and swap them around:

RegEx: (Оля) \+ (Маша)

Replacement: $2 - $1

Text before: Оля + Маша

Text after: Маша - Оля

Now back to our task: there is a greeting line "Привет, кто-то там!" ("Hello, somebody!") in which any name may appear (even just digits instead of a name). We want to replace that name with "Макар".

We need to keep the text around the name, so we put it in parentheses in the regular expression, forming groups, and reuse them in the replacement:

RegEx: ^(Привет, ).*(!)$

Replacement: $1Макар$2

Text before (either one):

Привет, Ольга!

Привет, 777!

Text after:

Привет, Макар!

Let's work out how this regular expression works.

^ - start of the line.

Next a parenthesis. It is not escaped, so it is a group. Group 1. Let's find its closing parenthesis and see what the group contains. Inside the group is the text "Привет, "

After the group comes ".*": zero or more repetitions of anything. That is, any text at all. Or nothing, which the regex also accepts.

Then another opening parenthesis. It is not escaped, so aha, this is the second group. What is inside? Plain text: "!".

And then the $ character, the end of the line.

Now let's look at the replacement.

$1 - the value of group 1, that is, the text "Привет, ".

Макар - just text. Note that we either include the space after the comma in group 1 or put it in the replacement after "$1". Otherwise the output would be "Привет,Макар".

$2 - the value of group 2, that is, the text "!"

That's it!
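For comparison, a Python sketch of the same replacements. This is my own illustration, and it differs from the article's notation in one respect: Python's re.sub writes group references in the replacement as \1 or \g<1>, not $1. The helper name greet_makar is hypothetical.

```python
import re

# Keep only the first group.
only_first = re.sub(r"(Оля) \+ Маша", r"\g<1>", "Привет, Оля + Маша!")

# Swap the two groups.
swapped = re.sub(r"(Оля) \+ (Маша)", r"\g<2> - \g<1>", "Оля + Маша")

def greet_makar(line: str) -> str:
    """Replace whatever stands between 'Привет, ' and '!' with 'Макар'."""
    return re.sub(r"^(Привет, ).*(!)$", r"\g<1>Макар\g<2>", line)

print(only_first, swapped, greet_makar("Привет, 777!"))
```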

What if we need to reformat dates? We have dates in the DD.MM.YYYY format and need to change them to YYYY-MM-DD.

We already have the regular expression for the search: “\d{2}\.\d{2}\.\d{4}”. What remains is to work out the replacement. Let’s look closely at the requirements:

DD.MM.YYYY

YYYY-MM-DD

It is immediately clear that we need three groups. The result is: (\d{2})\.(\d{2})\.(\d{4})

In the output the year comes first, and that is the third group. We write: $3

Then comes a hyphen, which is plain text: $3-

Then the month. That is the second group, “$2”. So far: $3-$2

Then another hyphen, plain text again: $3-$2-

And finally the day. That is the first group, $1. The result: $3-$2-$1

That’s it!

RegEx: (\d{2})\.(\d{2})\.(\d{4})

Replacement: $3-$2-$1

Text before:

05.08.2015

01.01.1999

03.02.2000

Text after:

2015-08-05

1999-01-01

2000-02-03
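For readers who prefer code to an editor dialog, here is a minimal JavaScript sketch of the same date conversion. The function name `toIsoDates` is an assumption of this example; the pattern and the replacement string are the ones built above.

```javascript
// Rewrites every DD.MM.YYYY date in the text as YYYY-MM-DD.
function toIsoDates(text) {
  // The dots are escaped so they match a literal "." only;
  // the "g" flag makes replace() process every date, not just the first.
  return text.replace(/(\d{2})\.(\d{2})\.(\d{4})/g, "$3-$2-$1");
}

console.log(toIsoDates("05.08.2015\n01.01.1999\n03.02.2000"));
```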

Another example: I keep a notepad file of what I managed to get done during a 12-week cycle. The file is called “done”, and it is very motivating! If you just try to recall “what did I actually do?”, not much comes to mind. But once it is written down, you can admire the list.

Here is a sample of improvements to my course for testers:

  1. Wrote messages for the bot, so that it posts to the chat when new topics are published

  2. Folks: fixed the “Advanced search” article and removed the part about empty input in simple search, because it was confusing

  3. Updated the piece about the Cinderella effect (rewrote it for YouTube)

Some 10 to 25 of these pile up in a single cycle. And over a year? Wow! Each improvement is small, but together they add up.

So, when a cycle ends, I write about my progress in my blog. To paste the list into the blog I need to remove the numbering; then I recreate it with Blogger’s own list formatting and it looks nicer.

I remove it with a regular expression:

RegEx: \d+\. (.*)

Replacement: $1

Text before:

1. One

2. Two

Text after:

One

Two
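The same clean-up could be scripted. The sketch below is one possible JavaScript version; the `^` and `$` anchors and the `g` and `m` flags are additions of this example (they restrict the match to numbering at the start of each line), and `stripNumbering` is a made-up name.

```javascript
// Removes a leading "N. " from every line of a list.
function stripNumbering(text) {
  // "m" lets ^ and $ match at each line; "g" processes all lines.
  return text.replace(/^\d+\. (.*)$/gm, "$1");
}

console.log(stripNumbering("1. One\n2. Two"));
```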

You could do it by hand, but for a list longer than 5 items that is terribly dull. This way you press one button in the editor and you’re done!

So regular expressions can help even when writing an article =)

Articles and books on the topic

Books

Регулярные выражения. 10 минут на урок (the Russian edition of Ben Forta’s “Sams Teach Yourself Regular Expressions in 10 Minutes”). Highly recommended! A superb book where everything is simple, accessible, and clear. It costs 100 rubles and is enormously useful.

Articles

Wiki: https://ru.wikipedia.org/wiki/Регулярные_выражения. Yes, this is the one you will read most often. I don’t remember all the metacharacters by heart myself, so whenever I use regular expressions I google them, and Wikipedia is always at the top of the results. The article itself is good, with handy tables.

Regular expressions for beginners: https://tproger.ru/articles/regexp-for-beginners/

Summary

Regular expressions are a very useful tool for a tester. They have many uses, even if you are not a test automation engineer and are in no hurry to become one:

  1. Find all the files you need in a folder.

  2. Grep the logs: cut away everything irrelevant and find only the information you care about right now.

  3. Check the database for obviously incorrect records. Has test data been left in production? Is an adjacent system sending garbage instead of proper data?

  4. Check another system’s data if it exports them to a file.

  5. Proofread a file of website texts: are there any duplicated words?

  6. Touch up the text of an article.

If you know that your program’s code contains a regular expression, you can test it. You can also use regular expressions inside your automated tests, although some caution is in order there.

Don’t forget the joke: “A developer had a problem and decided to solve it with regular expressions. Now he has two problems.” That does happen, of course, as with any other code.

So if you write a regular expression, be sure to test it! Especially if you pair it with the rm command (file deletion in Linux). First check that the search finds the right things, and only then delete what it found.

A regular expression may fail to find what you expected, or it may find something extra, especially when you chain several of them. Think writing a correct regular expression is easy? Then try solving the puzzle from Egor or these crosswords =)


Regex or regular expression is a pattern-matching tool. It allows you to search text in an advanced manner.

Regex is like CTRL+F on steroids.

A regex that finds phone numbers in different formats

For example, regex can find all the email addresses or phone numbers in a piece of text.

The downside of the regex is it takes a while to memorize all the commands. One could say it takes 20 minutes to learn, but forever to master.

In this guide, you learn the basics of regex.

We are going to use the regex online playground in regexr.com. This is a super useful platform where you can easily practice your regex skills with useful examples.

Make sure to write down each regular expression you see in this guide to truly learn what you are doing.

Regex Tutorial

To make it as beneficial as possible, this tutorial is example-heavy. This means some of the regex concepts are introduced as part of the examples. Make sure you read everything!

Anyway, let’s get started with regex.

Regex and Flags

A regular expression (or regex) starts with forward slash (/) and ends with a forward slash.

The pattern matching happens in between the forward slashes.

For instance, let’s find the word “loud” in the text document.

Searching for the word “loud” in the text chapter below.

As you can see, this works like CTRL + F.

Next, pay attention to the letter “g” in the above regex /loud/g.

The letter “g” means that the global flag is activated. In other words, the search does not stop at the first match but finds every match in the text.

Most of the time you are going to use the “g” flag only.

But it is good to understand there are other flags as well.

In the regexr online editor, you can find all the possible flags in the top right corner.

Now that you understand what regex is and what the global flag does, let’s see an example.

Let’s search for “at” in the piece of text:

As you can see, our regular expression found three matches of “at”.

Now, if you disable the “g” flag, it is only going to match the first occurrence of “at”.
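If you want to try this outside the online editor, the following JavaScript sketch shows the difference. The sample sentence is made up for this example; `String.prototype.match` returns all matches with the `g` flag and only the first one (with its position) without it.

```javascript
const sample = "The cat sat on the mat.";

// With the global flag: every occurrence of "at".
const allMatches = sample.match(/at/g);
// Without it: only the first occurrence, plus its index.
const firstOnly = sample.match(/at/);

console.log(allMatches);      // [ 'at', 'at', 'at' ]
console.log(firstOnly.index); // 5
```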

Anyway, let’s switch the global flag back on.

So far using regex has been like using the good old CTRL+F.

However, the true power of the regular expressions shows up when we search for patterns instead of specific words.

To do this, we need to learn about the regex special characters that make pattern matching possible.

Let’s start with the + character.

The + Operator – Match One or More

Let’s search for character “s” in the example text.

This matches all “s” letters there are.

But what if you want to search for multiple “s” characters in a row?

In this case, you can use the + character after the letter “s”. This matches all the following “s” letters after the first one.

As a result, it now matches the double “s” in the text in addition to the singular “s”.

In short, the + operator matches one or more repetitions of the preceding character in a row.

Next, let’s take a look at how optional matching works.

The ? Operator – Match Optional Characters

Optional matching is characterized by the question mark operator (?).

Optional matching means to match something that might follow.

For example, to match all letters “s” and every “s” followed by “t”, you can specify the letter “t” as an optional match using the question mark.

This matches:

  • Each singular “s”
  • Each combination of “st”.

Next up, let’s take a look at a special character that combines the + and ? characters.

The * Operator – Match Any Optional Characters

The star operator (*) means “match zero or more”.

Essentially, it is the combination of the + and the ? operators.

For example, let’s match with each letter “o” and any amount of letter “s” that follow.

This matches:

  • All the singular “o” letters.
  • All occurrences of “os”.
  • All occurrences of “oss”.

As a matter of fact, this would match with “ossssssss” with any number of “s” letters as long as they are preceded by an “o”.
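Here is a small JavaScript sketch that puts the three operators covered so far side by side. The word list is invented for this example, so the matches differ from the ones in the screenshots.

```javascript
const words = "so boss stop lesson";

// + : one or more "s" in a row
const plus = words.match(/s+/g);      // [ 's', 'ss', 's', 'ss' ]
// ? : an "s" optionally followed by "t"
const optional = words.match(/st?/g); // [ 's', 's', 's', 'st', 's', 's' ]
// * : an "o" followed by zero or more "s"
const star = words.match(/os*/g);     // [ 'o', 'oss', 'o', 'o' ]

console.log(plus, optional, star);
```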

Next, let’s take a look at the wild card character.

The . Operator – Match Anything Except a New Line

In regex, the period is a special character that matches any singular character.

It acts as the wildcard.

The only character the period does not match is a line break.

For example, let’s match any character that comes before “at” in the text.

But how about matching a literal dot then? The period (.) is a reserved special character, so it cannot be used as is.

This is where escaping is used.

The \ Operator – Escape a Special Character

If you are familiar with programming, you know what escaping means.

If not, escaping means cancelling the special meaning of a reserved character by putting a special character in front of it.

As you saw in the previous example, the period character acts as a wildcard in regex. This means you cannot use it to match a dot in the text.

Using a dot matches every singular character in the text.

As you can see, /./g matches with each letter (and space) in the text, so it is not much of a help.

This is where escaping is useful.

In regex, you can escape any reserved character using a backslash (\).

Any special character preceded by a backslash is treated as a normal text character.

To match dots using regex, escape the period character with a backslash (\.).

Now it matches all the dots in the text.

Let’s play with the example. To match any character that comes before a dot, add a period before the escaped period:
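A short JavaScript sketch of the three cases discussed here: the bare wildcard, the escaped dot, and "any character before a dot". The sentence is a made-up example.

```javascript
const sentence = "It rained. We stayed in.";

// Unescaped dot: matches every single character (there are no line breaks here).
const everything = sentence.match(/./g);
// Escaped dot: matches only literal periods.
const dots = sentence.match(/\./g);        // [ '.', '.' ]
// Wildcard followed by an escaped dot: the character before each period, plus the period.
const beforeDot = sentence.match(/.\./g);  // [ 'd.', 'n.' ]

console.log(everything.length === sentence.length, dots, beforeDot);
```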

Now you understand how to match and escape characters in regex. Let’s move on to matching word characters using other special characters.

Match Different Types of Characters

You just learned how to use a backslash to escape a character.

However, the backslash has another important use case. Combining a backslash with some particular character forms an operator that can be used to match useful things.

As an example, an important special character in regex is \w.

This matches all the word characters, that is, letters, digits, and the underscore, but leaves out spaces.

For example, let’s match all the letters and digits in the text:

Another commonly used special operator is the whitespace character \s that matches any type of white space there is in the text.

For example, let’s match all the spaces in the text.

Of course, you can also match numeric characters only.

This happens via the \d operator.

For instance, let’s match all digits in the text:

This matches with “2” and “0”.

These are the very basic special character operators there are in regex.

Next, you are going to learn how to invert these special characters.

Invert Special Characters

To invert a special character in regex, capitalize it.

  • \w matches any word character –> \W matches any non-word character.
  • \s matches any whitespace character –> \S matches any non-whitespace character.
  • \d matches any digit –> \D matches any non-digit character.

Examples:

Match non-word characters.
Match non-space characters.
Match non-digit characters.
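To see the character types and their inverted forms together, here is a small JavaScript sketch. The string `roomNote` is invented for this example.

```javascript
const roomNote = "Room 20: ok";

const wordChars = roomNote.match(/\w/g).join("");   // "Room20ok"
const spaces = roomNote.match(/\s/g).length;        // 2
const digits = roomNote.match(/\d/g);               // [ '2', '0' ]

// Capitalized versions invert the match.
const nonWord = roomNote.match(/\W/g);              // [ ' ', ':', ' ' ]
const nonSpace = roomNote.match(/\S/g).join("");    // "Room20:ok"
const nonDigit = roomNote.match(/\D/g).join("");    // "Room : ok"

console.log(wordChars, spaces, digits, nonWord, nonSpace, nonDigit);
```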

Next, let’s take a look at how to match words with a specific length.

{} – Match Specific Length

Let’s say you want to capture all the words that are at least 3 characters long.

The + and * operators cannot express this on their own with the \w character, because they do not let you set a specific number of repetitions.

Instead, use the curly braces {} by specifying how many characters to match.

There are three ways to use {}:

  • {n}. Match n consecutive characters.
  • {n,}. Match n character or more.
  • {n,m}. Match between n and m in length.

Let’s see examples of each.

Example 1. Match all sets of characters that are exactly 3 in length:

Example 2. Match consecutive strings that are 3 characters or longer:

Example 3. Match any set of characters that are between 3 and 5 characters in length:
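The three forms can be tried in JavaScript as follows. The sample string and the use of \w as the repeated character are assumptions of this sketch, since the original patterns were shown only as screenshots.

```javascript
const runs = "a bb ccc dddd eeeeee";

// {n}: exactly 3 word characters per match
const exactlyThree = runs.match(/\w{3}/g);    // [ 'ccc', 'ddd', 'eee', 'eee' ]
// {n,}: 3 or more
const threeOrMore = runs.match(/\w{3,}/g);    // [ 'ccc', 'dddd', 'eeeeee' ]
// {n,m}: between 3 and 5
const threeToFive = runs.match(/\w{3,5}/g);   // [ 'ccc', 'dddd', 'eeeee' ]

console.log(exactlyThree, threeOrMore, threeToFive);
```

Note that without word boundaries a long run is simply cut into pieces of the allowed length, which is why "eeeeee" yields two matches in the first case.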

Now that you know how to deal with quantities in regex, let’s talk about grouping.

[] – Groups and Ranges

In regex, you can group characters into a set (often called a character set or character class). A set matches any single character listed in it.

To define such a set, use square brackets [].

For example, let’s match with any two letters where the last letter is “s” and the first letter is either “a” or “o”.

A really handy feature of using square brackets is you can specify a range. This lets you match any letter in the specified range.

To specify a range, use the dash with the following syntax. For example, [a-z] matches any letter from a to z.

For example, let’s match with any two-letter word that ends in “s” and starts with any character from a to z.

One thing you sometimes may want to do is to combine ranges.

This is also possible in regex.

For example, to find any two letters that end with “s” and start with any lowercase or uppercase letter, you can do:

/[a-zA-Z]s/g

Or if you want to match with any two letters that end with “s” and start with a number between 0 and 9, you can do:

/[0-9]s/g
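The set and range patterns from this section can be checked with a few lines of JavaScript. The sample string is made up for this sketch.

```javascript
const pairs = "as os Us 7s !s";

const aOrO = pairs.match(/[ao]s/g);          // [ 'as', 'os' ]
const lowercase = pairs.match(/[a-z]s/g);    // [ 'as', 'os' ]
const anyLetter = pairs.match(/[a-zA-Z]s/g); // [ 'as', 'os', 'Us' ]
const digit = pairs.match(/[0-9]s/g);        // [ '7s' ]

console.log(aOrO, lowercase, anyLetter, digit);
```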

Awesome.

Next, let’s take a look at another way to group characters in regex.

() Capturing Groups

In regex, capturing groups is a way to treat multiple characters as a single unit.

To create a capturing group, place the characters inside parentheses.

For example, let’s match with words “The” or “the”, where the first letter is either lowercase t or uppercase T.

But why parentheses? Let’s see what happens without them:

Now it matches with either any single character “t” or the word “The”.

This is the power of the capturing group. It treats the characters inside the parentheses as a single unit.

Let’s see another example where we find any words that are 2-3 letters long and each letter in the word is either a, s, e, or d.

As the last example of capturing, let’s match any words that repeat “os” two or three times in a row.

Here the “os” is not matched in the words “explosion” and “across”. This is because the “os” occurs only a single time. However, the “osososos” at the end has 3 x “os” so it gets matched.
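The group examples can be reproduced with the JavaScript sketch below. The exact patterns from the screenshots are not available, so `(t|T)he` and `(os){2,3}` are reconstructions from the description, and the sample strings are invented.

```javascript
const phrase = "The end of the other";

// With the group, the alternation applies only to the first letter.
const grouped = phrase.match(/(t|T)he/g);   // [ 'The', 'the', 'the' ]
// Without it, the alternation splits the whole pattern: "t" OR "The".
const ungrouped = phrase.match(/t|The/g);   // [ 'The', 't', 't' ]

// A quantifier after a group repeats the whole group.
const repeated = "explosion across ososos".match(/(os){2,3}/g); // [ 'ososos' ]

console.log(grouped, ungrouped, repeated);
```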

Next up, let’s take a look at yet another special character, caret (^).

The ^ operator – Match the Beginning of a Line

The caret (^) character in regex matches the beginning of the text, or the beginning of every line when the multiline flag is set.

For example, let’s match with the letter “T” at the beginning of a text chapter.

Now, let’s see what happens when we try to match with the letter “N” at the beginning of the next line.

No matches!

But why is that? There is an “N” at the beginning of the second line.

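This happens because only the “g” (global) flag is set and the multiline flag is not. Without the multiline flag, the caret matches only at the very beginning of the text, so the whole piece of text is treated as a single line.

If you want to change this, you need to set the multiline flag (m) in addition to the global flag.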

Now the match is also made at the beginning of the second line.
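In JavaScript the same behaviour looks like this. The two-line text is a stand-in for the text in the screenshots.

```javascript
const twoLines = "This is line one.\nNow comes line two.";

// Global flag only: ^ matches just at the very start of the text.
const startT = twoLines.match(/^T/g);        // [ 'T' ]
const startN = twoLines.match(/^N/g);        // null
// Adding the multiline flag lets ^ match at the start of every line.
const startNMulti = twoLines.match(/^N/gm);  // [ 'N' ]

console.log(startT, startN, startNMulti);
```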

However, it is easier to deal with the text as a single chunk of text, so we are going to disable the multiline flag for the rest of the guide.

Now that you know how the caret operator works in regex, let’s take a look at the next special character, the dollar sign ($).

$ End of Statement

To match the end of the text (or the end of each line when the multiline flag is set), use the dollar sign ($).

For instance, let’s match with a dot that ends the text chapter.

As you can see, this only matches the dot at the end of the second line. As mentioned before, this happens because we treat the text as a single line of text.

Awesome! Now you have learned most of the special characters you are ever going to use in regex.

Next, let’s take a look at how to really benefit from regex by learning about important concepts of lookahead and lookbehind.

Lookbehinds

In regex, a lookbehind means to match something preceded by something.

There are two types of lookbehinds:

  • Positive lookbehind
  • Negative lookbehind

Let’s take a look at what these do.

The (?<=) Operator – Positive Lookbehind

A positive look behind is specified by defining a group that starts with a question mark, followed by a less than sign and an equal sign, and then a set of characters.

  • (?<=)

Here < means we are going to perform a look behind, and = means it is positive.

A positive lookbehind checks that the main expression is immediately preceded by the given pattern, without including that pattern in the result.

For example, let’s match the first characters after “os” in the text.

This positive look behind does not include “os” in the matches. Instead, it checks if the matches are preceded by “os” before showing them.

This is super useful.

The (?<!) Operator – Negative Lookbehind

Another type of look behind is the negative look behind. This is basically the opposite of the positive lookbehind.

To create a negative look behind, create a group with a question mark followed by a less-than sign and an exclamation point.

  • (?<!)

Here < means look behind and ! makes it negative.

As an example, let’s perform the exact same search as we did in the positive lookbehind, but let’s make it negative:

As you can see, the negative lookbehind matches everything except the first character after “os”. This is the exact opposite of the positive lookbehind.
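Both lookbehinds can be tried in any JavaScript engine that supports them (modern browsers and Node.js do). The sample text and the use of the wildcard as the main expression are assumptions of this sketch.

```javascript
const osText = "explosion across";

// Positive lookbehind: a character that is preceded by "os".
const afterOs = osText.match(/(?<=os)./g);     // [ 'i', 's' ]
// Negative lookbehind: every character that is NOT preceded by "os".
const notAfterOs = osText.match(/(?<!os)./g);

console.log(afterOs, notAfterOs.join(""));
```

Together the two result sets cover every character of the text exactly once.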

Now that you know what the lookbehinds do, let’s move on to very similar concepts, that is, lookaheads.

Lookaheads

In regex, a lookahead is similar to lookbehind.

A lookahead checks what comes right after the main expression, without including that part in the result.

To perform a lookahead, all you need to do is remove the less-than sign.

  • (?=) is a positive lookahead.
  • (?!) is a negative lookahead.

The (?=) Operator – Positive Lookahead

For example, let’s match with any singular character followed by “es” or “os”.

And as you guessed, a negative lookahead matches the exact opposite of what a positive lookahead does.

The (?!) Operator – Negative Lookahead

For example, let’s match everything except for the single characters that occur before “os” or “es”
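A minimal JavaScript sketch of both lookaheads follows. The sample text is invented, and the patterns are reconstructed from the description rather than copied from the screenshots.

```javascript
const roses = "roses and posts";

// Positive lookahead: a character that is followed by "es" or "os".
const beforeEsOs = roses.match(/.(?=es|os)/g);     // [ 'r', 's', 'p' ]
// Negative lookahead: every character that is NOT followed by "es" or "os".
const notBeforeEsOs = roses.match(/.(?!es|os)/g);

console.log(beforeEsOs, notBeforeEsOs.join(""));
```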

Now you have all the tools to understand a slightly more advanced example using regex. Also, you are going to learn a bunch of important things at the same time, so keep on reading!

Find and Replace Phone Numbers Using Regex

Let’s say we have a text document that has differently formatted phone numbers.

Our task is to find those numbers and replace them by formatting them all in the same way.

The number that belongs to Alice is simple. Just 10 digits in a row.

The number that belongs to Bob is a bit trickier because you need to group the regex into 5 parts:

  1. A group of three digits
  2. Dash
  3. A group of three digits
  4. Dash
  5. A group of four digits.

Now it matches Bob’s number.

But our goal was to match all the numbers at the same time. Now Alice’s number is no longer found.

To fix this, we need to restructure the regex again. Instead of assuming there is always a dash between the first two groups of numbers, let’s assume it is optional. As you now know, this can be done using the question mark.

Good job.

Next up, there can also be numbers separated by space, such as Charlie’s number.

To take this into account, we must assume that the separator is either a white space or a dash. This can be done using a group with square brackets [] by placing a dash and a white space into it.

Now also Charlie’s number is matched by our regular expression.

Then there are those numbers where the first three digits are enclosed in parentheses and where the last two groups are separated by a dash.

To find these numbers, we need to add optional parentheses around the first three digits. But as you recall, a parenthesis is a special character in regex, so you need to escape it using the backslash (\).

Awesome, now David’s number is also found.

Last but not least, a phone number might be formatted such that the country-specific number is in front of the number with a + sign.

To take this into account, we need to add an optional group of a + sign followed by a digit between 0-9.

Now our regex finds every phone number there is on the list!

Next, let’s replace each number with a number such that each number is formatted in the same way.

Before we can do this, we need to capture each set of numbers by creating capturing groups for them. As you learned before, this happens by placing each set of digits inside a pair of parentheses.

If you inspect the Details section of the editor, you can see that now each set of numbers in a phone number is grouped in the capture groups.

For example, let’s click the first phone number match and see the Details:

Each number is grouped to capture groups 1-5.

As you can see, the first phone number is grouped into three capture groups 3,4, and 5.

As another example, let’s click Eric’s number to see the details:

Here you can see that the number is split into groups 1, 2, 3, 4, and 5.

However, there is one problem.

The number +7 occurs twice, in group 1 and group 2.

This is not what we want.

It happens because the regex catches both the +7 with a space and without a space. Thus the 2 groups.

To get rid of this, you can specify the expression that captures the number with space as a non-capturing group.

To do this, put ?: at the start of the group, right after its opening parenthesis:

Now Eric’s number (and all the other numbers too) is nicely split into 4 groups.

Finally, we can use these four capture groups to replace the matched numbers with numbers that are formatted in the same way.

In regex, you can refer to each capture group with $n, where n is the number of the group.

To format the numbers, let’s open up the replace tab in the editor.

Let’s say we want to replace all the numbers with a number that is formatted like this:

+7 123-900-4343

And if there is no +7 in front of the number, then we leave it as:

123-900-4343

To do this, replace each phone number by referencing their capture group in the Replace section of the editor:

$1$2-$3-$4

Amazing! Now all the numbers are replaced in the resulting piece of text and follow the same format.
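The final pattern appears only in the screenshots, so the JavaScript sketch below is a reconstruction under assumptions, not the exact expression used above. The names `PHONE_RE` and `normalizePhones` are made up, and a replacer function is used instead of the plain `$1$2-$3-$4` string so that a space is inserted after the country code only when one is present.

```javascript
// Assumed pattern: optional "+N" country code, optional parentheses around the
// first three digits, and optional space-or-dash separators between the groups.
const PHONE_RE = /(?:(\+\d)\s?)?\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/g;

function normalizePhones(text) {
  return text.replace(PHONE_RE, (_match, country, first, second, last) => {
    // `country` is undefined when the number has no "+N" prefix.
    const prefix = country ? `${country} ` : "";
    return `${prefix}${first}-${second}-${last}`;
  });
}

console.log(normalizePhones("Alice: 1239004343"));     // Alice: 123-900-4343
console.log(normalizePhones("David: (123) 900-4343")); // David: 123-900-4343
console.log(normalizePhones("Eric: +7 123 900 4343")); // Eric: +7 123-900-4343
```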

This concludes our regex tutorial.

Conclusion

Today you learned how to use regex.

In short, regex is a commonly supported tool to match patterns in text documents.

You can use it to find and replace text that matches a specific pattern.

Most programming languages support regex. This means you can use regex in your coding projects to automate a lot of manual work when it comes to text processing.

Thanks for reading.

Happy pattern-matching!


A Regular Expression – or regex for short – is a syntax that allows you to match strings with specific patterns. Think of it as a souped-up text search, with the added ability to use quantifiers, pattern collections, special characters, and capture groups to create extremely advanced search patterns.
Regex can be used any time you need to query string-based data, such as:

  • Analyzing command line output
  • Parsing user input
  • Examining server or program logs
  • Handling text files with a consistent syntax, like a CSV
  • Reading configuration files
  • Searching and refactoring code

All of these tasks are theoretically possible without regex, but regexes make them dramatically easier.

In this guide we’ll cover:

  • What does a regex look like?
  • How to read and write a regex
    • What’s a “quantifier”?
    • What’s a “pattern collection”?
    • What’s a “regex token”?
  • How to use a regex
  • What’s a “regex flag“?
  • What’s a “regex group”?

What does a regex look like?

In its simplest form, a regex in usage might look something like this:

We’re using a regular expression /Test/ to look for the word “Test”

This screenshot is of the regex101 website. All future screenshots will utilize this website for visual reference.

In the “Test” example the letters test formed the search pattern, same as a simple search.
These regexes are not always so simple, however. Here’s a regex that matches 3 numbers, followed by a “-“, followed by 3 numbers, followed by another “-“, finally ended by 4 numbers.

You know, like a phone number:

^(?:\d{3}-){2}\d{4}$

The phone number “555-555-5555” will match with the regex above, but “555-abc-5555” will not

This regex may look complicated, but two things to keep in mind:

  1. We’ll teach you how to read and write these in this article
  2. This is a fairly complex way of writing this regex.

In fact, most regexes can be written in multiple ways, just like other forms of programming. For example, the above can be rewritten into a longer but slightly more readable version:

^[0-9]{3}-[0-9]{3}-[0-9]{4}$

Most languages provide a built-in method for searching and replacing strings using regex. However, each language may have a different set of syntaxes based on what the language dictates.

In this article, we’ll focus on the ECMAScript variant of Regex, which is used in JavaScript and shares a lot of commonalities with other languages’ implementations of regex as well.

How to read (and write) regexes

Quantifiers

Regex quantifiers specify how many times the preceding character or group should be matched.

Here is a list of the quantifiers, together with the alternation operator |, which is not strictly a quantifier but is commonly used alongside them:

  • a|b – Match either “a” or “b”
  • ? – Zero or one
  • + – One or more
  • * – Zero or more
  • {N} – Exactly N times (where N is a number)
  • {N,} – N or more times (where N is a number)
  • {N,M} – Between N and M times (where N and M are numbers and N < M)
  • *? – Zero or more, but as few as possible (lazy)

For example, the following regex:

Hello|Goodbye

Matches both the string “Hello” and “Goodbye”.

Meanwhile:

Hey?

Will match “y” zero or one time, so it will match both “He” and “Hey”.

Alternatively:

Hello{1,3}

Will match “Hello”, “Helloo”, “Hellooo”, but not “Helloooo”, as it is looking for the letter “o” between 1 and 3 times.

These can even be combined with one another:

He?llo{2}

Here we’re looking for strings with zero-to-one instances of “e” and the letter “o” times 2, so this will match “Helloo” and “Hlloo”.
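These claims are easy to check in JavaScript with `RegExp.prototype.test`. One assumption of this sketch: the patterns are wrapped in ^ and $ so that the whole string has to match; without the anchors, “Helloooo” would still contain a match (“Hellooo”).

```javascript
// Each helper checks whether the WHOLE string fits the pattern.
const isGreeting = (s) => /^(?:Hello|Goodbye)$/.test(s);
const isHeOrHey = (s) => /^Hey?$/.test(s);
const isHelloUpToThreeOs = (s) => /^Hello{1,3}$/.test(s);
const isCombined = (s) => /^He?llo{2}$/.test(s);

console.log(isGreeting("Goodbye"));           // true
console.log(isHeOrHey("He"));                 // true
console.log(isHelloUpToThreeOs("Helloooo"));  // false
console.log(isCombined("Hlloo"));             // true
```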

Greedy matching

One of the regex quantifiers we touched on in the previous list was the + symbol. This symbol matches one or more repetitions of the preceding character. This means that:

Hi+

Will match everything from “Hi” to “Hiiiiiiiiiiiiiiii”. This is because all quantifiers are considered “greedy” by default.

However, if you change it to be “lazy” using a question mark symbol (?) to the following, the behavior changes.

Hi+?

Now, the i matcher will try to match as few times as possible. Since the + symbol means “one or more”, it will only match one “i”. This means that if we input the string “Hiiiiiiiiiii”, only “Hi” will be matched.

While this isn’t particularly useful on its own, when combined with broader matches like the . symbol, it becomes extremely important, as we’ll cover in the next section. The . symbol is used in regex to find “any character”.

Now if you use:

H.*llo

You can match everything from “Hillo” to “Hello” to “Helloollo”.

We’re using a regex /H.*llo/ to look for the words “Hillo”, “Hello”, and “Helloollo”

However, what if you want to only match “Hello” from the final example?

Well, simply make the search lazy with a ? and it’ll work as we want:

H.*?llo

We’re using a regex /H.*?llo/ to look for the words “Hillo”, “Hello”, and partially match the “Hello” in “Helloollo”
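The difference between greedy and lazy matching can be seen directly in JavaScript. The variable names are made up for this sketch; the patterns are the ones from this section.

```javascript
// Greedy: + and * take as many characters as they can.
const greedyHi = "Hiiii".match(/Hi+/)[0];             // "Hiiii"
const greedyHello = "Helloollo".match(/H.*llo/)[0];   // "Helloollo"

// Lazy: adding ? makes them take as few as possible.
const lazyHi = "Hiiii".match(/Hi+?/)[0];              // "Hi"
const lazyHello = "Helloollo".match(/H.*?llo/)[0];    // "Hello"

console.log(greedyHi, greedyHello, lazyHi, lazyHello);
```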

Pattern collections

Pattern collections allow you to search for a collection of characters to match against. For example, using the following regex:

My favorite vowel is [aeiou]

You could match the following strings:

My favorite vowel is a
My favorite vowel is e
My favorite vowel is i
My favorite vowel is o
My favorite vowel is u

But nothing else.

Here’s a list of the most common pattern collections:

  • [A-Z] – Match any uppercase character from “A” to “Z”
  • [a-z] – Match any lowercase character from “a” to “z”
  • [0-9] – Match any digit
  • [asdf] – Match any character that’s either “a”, “s”, “d”, or “f”
  • [^asdf] – Match any character that’s not any of the following: “a”, “s”, “d”, or “f”

You can even combine these together:

  • [0-9A-Z] – Match any character that’s either a digit or a capital letter from “A” to “Z”
  • [^a-z] – Match any character that is not a lowercase letter
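A few of these collections in action, as a JavaScript sketch. The ^ and $ anchors in the first pattern and the sample strings are additions of this example.

```javascript
// Anchored so that nothing may follow the vowel.
const vowelSentence = /^My favorite vowel is [aeiou]$/;

console.log(vowelSentence.test("My favorite vowel is e")); // true
console.log(vowelSentence.test("My favorite vowel is x")); // false

// Combined ranges and a negated collection.
const digitsAndCaps = "a1B".match(/[0-9A-Z]/g);  // [ '1', 'B' ]
const notLowercase = "abC1".match(/[^a-z]/g);    // [ 'C', '1' ]

console.log(digitsAndCaps, notLowercase);
```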

General tokens

Not every character is so easily identifiable. While keys like “a” to “z” make sense to match using regex, what about the newline character?

The “newline” character is the character that you input whenever you press “Enter” to add a new line.

  • . – Any character
  • \n – Newline character
  • \t – Tab character
  • \s – Any whitespace character (including \t, \n and a few others)
  • \S – Any non-whitespace character
  • \w – Any word character (uppercase and lowercase Latin alphabet, numbers 0-9, and _)
  • \W – Any non-word character (the inverse of the \w token)
  • \b – Word boundary: the boundaries between \w and \W, but matches in-between characters
  • \B – Non-word boundary: the inverse of \b
  • ^ – The start of a line
  • $ – The end of a line
  • \\ – The literal character “\”

So if you wanted to remove every character that starts a new word you could use something like the following regex:

\s.

And replace the results with an empty string. Doing this, the following:

Hello world how are you

Becomes:

Helloorldowreou

We’re using a regex /\s./ to look for the whitespaces alongside the following character in the string “Hello world how are you”
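In JavaScript this replacement is a one-liner. The `g` flag is an assumption of the sketch (it is what makes the replacement apply to every word rather than only the first), and `dropWordStarts` is a made-up name.

```javascript
// Removes each whitespace character together with the character that follows it.
function dropWordStarts(text) {
  return text.replace(/\s./g, "");
}

console.log(dropWordStarts("Hello world how are you")); // Helloorldowreou
```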

Combining with collections

These tokens aren’t just useful on their own, though! Let’s say that we want to remove any uppercase letter or whitespace character. Sure, we could write

[A-Z]|\s

But we can actually merge these together and place our \s token into the collection:

[A-Z\s]

We’re using a regex /[A-Z\s]/ to look for uppercase letters and whitespaces in the string “Hello World how are you”

Word boundaries

In our list of tokens, we mentioned \b to match word boundaries. I thought I’d take a second to explain how it acts a bit differently from others.

Given a string like “This is a string”, you might expect the whitespace characters to be matched – however, this isn’t the case. Instead, it matches between the letters and the whitespace:

We’re using a word boundary regex /\b/ to look for the in-between spaces in characters

This can be tricky to get your head around, but it’s unusual to simply match against a word boundary. Instead, you might have something like the following to match full words:

\b\w+\b

We’re using a regex /\b\w+\b/ to look for full words. In the string “This is a string” we match “This”, “is”, “a”, and “string”

You can interpret that regex statement like this:

“A word boundary. Then, one or more ‘word’ characters. Finally, another word boundary”.
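To make the zero-width nature of a word boundary visible, this JavaScript sketch marks every boundary with a “|” and then extracts the full words. The marker character and variable names are choices of this example.

```javascript
const boundaryText = "This is a string";

// A boundary has no width, so replacing it inserts a marker without removing anything.
const marked = boundaryText.replace(/\b/g, "|");       // "|This| |is| |a| |string|"
// One or more word characters between two boundaries: full words.
const fullWords = boundaryText.match(/\b\w+\b/g);      // [ 'This', 'is', 'a', 'string' ]

console.log(marked, fullWords);
```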

Start and end line

Two more tokens that we touched on are ^ and $. These mark off the start of a line and end of a line, respectively.

So, if you want to find the first word, you might do something like this:

^\w+

To match one or more “word” characters, but only immediately after the line starts. Remember, a “word” character is any character that’s an uppercase or lowercase Latin alphabet letter, a number 0-9, or _.

The regex /^\w+/ matches the first word in the string. In “This is a string” we match “This”

Likewise, if you want to find the last word your regex might look something like this:

\w+$

You can use /\w+$/ to match the last word in the string. In “This is a string” we match “string”

However, just because these tokens typically end a line doesn’t mean that they can’t have characters after them.

For example, what if we wanted to find every whitespace character between newlines to act as a basic JavaScript minifier?

Well, we can say “Find all whitespace characters after the end of a line” using the following regex:

$\s+

We can use /$\s+/ to find all whitespace between the end of a string and the start of the next string.
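The three patterns from this section can be exercised in JavaScript as below. The `g` and `m` flags on the last pattern are assumptions of this sketch: `m` is what lets $ match at the end of every line rather than only at the end of the whole text.

```javascript
const plainLine = "This is a string";
const firstWord = plainLine.match(/^\w+/)[0];   // "This"
const lastWord = plainLine.match(/\w+$/)[0];    // "string"

// Naive "minifier": drop the line break and the indentation that follows it.
const source = "const a = 1;\n  const b = 2;";
const squashed = source.replace(/$\s+/gm, "");  // "const a = 1;const b = 2;"

console.log(firstWord, lastWord, squashed);
```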

Character escaping

While tokens are super helpful, they can introduce some complexity when trying to match strings that actually contain tokens. For example, say you have the following string in a blog post:

"The newline character is '\n'"

Or say you want to find every instance of the “\n” string in this blog post. You can escape characters using a backslash (\). This means that your regex might look something like this:

\\n
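Here is a hedged Python sketch of the same escape (Python and the variable names are assumptions of this example). Raw strings are used so that each backslash typed is a backslash stored.

```python
import re

# A raw string, so the text really contains a backslash followed by "n"
text = r"The newline character is '\n'"

# \\ in the regex matches one literal backslash, then n matches the letter n
literal = re.search(r"\\n", text)
print(literal.group())  # \n (two characters: a backslash and the letter n)

# Without the extra backslash, \n means a real newline, which text lacks
newline = re.search(r"\n", text)
print(newline)  # None
```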

How to use a regex

Regular expressions aren’t simply useful for finding strings, however. You’re also able to use them in other methods to help modify or otherwise work with strings.

While many languages have similar methods, let’s use JavaScript as an example.

Creating and searching using regex

First, let’s look at how regex strings are constructed. 

In JavaScript (along with many other languages), we place our regex inside of // blocks. The regex searching for a lowercase letter looks like this:

/[a-z]/

This syntax then generates a RegExp object which we can use with built-in methods, like exec, to match against strings.

/[a-z]/.exec("a"); // Returns ["a"]
/[a-z]/.exec("0"); // Returns null

Because exec returns an array (truthy) when there is a match and null (falsy) when there is not, we can use its result directly in a condition to determine whether a regex matched.

Alternatively, we can call the RegExp constructor with the string we want to convert into a regex:

const regex = new RegExp("[a-z]"); // Same as /[a-z]/

Replacing strings with regex

You can also use a regex to search and replace a file’s contents. Say you wanted to replace any greeting with a message of “goodbye”. While you could do something like this:

function youSayHelloISayGoodbye(str) {
  str = str.replace("Hello", "Goodbye");
  str = str.replace("Hi", "Goodbye");
  str = str.replace("Hey", "Goodbye");
  str = str.replace("hello", "Goodbye");
  str = str.replace("hi", "Goodbye");
  str = str.replace("hey", "Goodbye");
  return str;
}

There’s an easier alternative, using a regex:

function youSayHelloISayGoodbye(str) {
  str = str.replace(/[Hh]ello|[Hh]i|[Hh]ey/, "Goodbye");
  return str;
}

However, you might notice that if you run youSayHelloISayGoodbye with “Hello, Hi there”, it won’t replace more than a single greeting:

If the regex /[Hh]ello|[Hh]i|[Hh]ey/ is used on the string “Hello, Hi there”, it will only match “Hello” by default.

Here, we should expect to see both “Hello” and “Hi” matched, but we don’t.

This is because we need to utilize a Regex “flag” to match more than once.

Flags

A regex flag is a modifier to an existing regex. These flags are always appended after the last forward slash in a regex definition. 

Here’s a shortlist of some of the flags available to you.

  • g – Global, match more than once
  • m – Multiline, make ^ and $ match at the start and end of every line instead of only at the start and end of the whole string
  • i – Make the regex case insensitive

This means that we could rewrite the following regex:

/[Hh]ello|[Hh]i|[Hh]ey/

To use the case insensitive flag instead:

/Hello|Hi|Hey/i

With this flag, this regex will now match:

Hello
HEY
Hi
HeLLo

Or any other case-modified variant.

Global regex flag with string replacing

As we mentioned before, if you do a regex replace without any flags it will only replace the first result:

let str = "Hello, hi there!";
str = str.replace(/[Hh]ello|[Hh]i|[Hh]ey/, "Goodbye");
console.log(str); // Will output "Goodbye, hi there!"

However, if you pass the global flag, you’ll match every instance of the greetings matched by the regex:

let str = "Hello, hi there!";
str = str.replace(/[Hh]ello|[Hh]i|[Hh]ey/g, "Goodbye");
console.log(str); // Will output "Goodbye, Goodbye there!"

A note about JavaScript’s global flag

When using a global JavaScript regex, you might run into some strange behavior when running the exec command more than once.

In particular, if you run exec with a global regex, it will return null every other time:

If we assign a regex to a variable then run `exec` on said variable, it will find the results properly the first and third time, but return `null` the second time

This is because, as MDN explains:

JavaScript RegExp objects are stateful when they have the global or sticky flags set… They store a lastIndex from the previous match. Using this internally, exec() can be used to iterate over multiple matches in a string of text…

The exec method starts searching at lastIndex. After the first match, lastIndex is set to the length of the string, so the second call has nothing left to search: it fails, returns null, and resets lastIndex to 0, which is why the third call succeeds again. While this feature can be useful in specific niche circumstances, it’s often confusing for new users.

To solve this problem, we can simply assign lastIndex to 0 before running each exec command:

If we run `regex.lastIndex = 0` in between each `regex.exec`, then every single `exec` runs as intended

Groups

When searching with a regex, it can be helpful to search for more than one item at a time. This is where “groups” come into play.

Here, we can see matching against both Testing 123 and Tests 123 without duplicating the “123” matcher in the regex.

/(Testing|tests) 123/ig

With the regex /(Testing|tests) 123/ig we can match “Testing 123” and “Tests 123”

Groups are defined by parentheses. There are two different types, capture groups and non-capturing groups:

  • (...) – Group matching any three characters
  • (?:...) – Non-capturing group matching any three characters

The difference between these two typically comes up in the conversation when “replace” is part of the equation. 

For example, using the regex above, we can use the following JavaScript to replace the text with “Testing 234” and “Tests 234”:

const regex = /(Testing|tests) 123/ig;

let str = `
Testing 123
Tests 123
`;

str = str.replace(regex, '$1 234');
console.log(str); // "\nTesting 234\nTests 234\n"

We’re using $1 to refer to the first capture group, (Testing|tests). We can also match more than a single group, like both (Testing|tests) and (123):

const regex = /(Testing|tests) (123)/ig;

let str = `
Testing 123
Tests 123
`;

str = str.replace(regex, '$1 #$2');
console.log(str); // "\nTesting #123\nTests #123\n"

However, this is only true for capture groups. If we change:

/(Testing|tests) (123)/ig

To become:

/(?:Testing|tests) (123)/ig;

Then there is only one captured group – (123) – and instead, the same code from above will output something different:

const regex = /(?:Testing|tests) (123)/ig;

let str = `
Testing 123
Tests 123
`;

str = str.replace(regex, '$1');
console.log(str); // "\n123\n123\n"

Named capture groups

While capture groups are awesome, it can easily get confusing when there are more than a few capture groups. The difference between $3 and $5 isn’t always obvious at a glance.

To help solve this problem, regexes have a concept called “named capture groups”:

  • (?<name>...) – Named capture group called “name” matching any three characters

You can use them in a regex like so to create a group called “num” that matches three numbers:

/Testing (?<num>\d{3})/

Then, you can use it in a replacement like so:

const regex = /Testing (?<num>\d{3})/;
let str = "Testing 123";
str = str.replace(regex, "Hello $<num>");
console.log(str); // "Hello 123"

Named back reference

Sometimes it can be useful to reference a named capture group inside of a query itself. This is where “back references” can come into play.

  • \k<name> – Reference named capture group “name” in a search query

Say you want to match:

Hello there James. James, how are you doing?

But not:

Hello there James. Frank, how are you doing?

While you could write a regex that repeats the word “James” like the following:

/.*James. James,.*/

A better alternative might look something like this:

/.*(?<name>James). \k<name>,.*/

Now, instead of having two names hardcoded, you only have one.

Lookahead and lookbehind groups

Lookahead and lookbehind groups are extremely powerful and often misunderstood.

There are four different types of lookaheads and lookbehinds:

  • (?!) – negative lookahead
  • (?=) – positive lookahead
  • (?<=) – positive lookbehind
  • (?<!) – negative lookbehind

Lookahead works the way it sounds: it checks that something either follows or does not follow the current position, depending on whether the lookahead is positive or negative.

As such, using the negative lookahead like so:

/B(?!A)/

Will allow you to match BC but not BA.

With the regex /B(?!A)/ we can match “B” in “BC” but not in “BA”

You can even combine these with ^ and $ tokens to try to match full strings. For example, the following regex will match any string that does not start with “Test”:

/^(?!Test).*$/gm

/^(?!Test).*$/gm lets us match “Hello” and “Other”, but not “Testing 123” and “Tests 123”

Likewise, we can switch this to a positive lookahead to enforce that our string must start with “Test”:

/^(?=Test).*$/gm

Inverting our previous example, /^(?=Test).*$/gm lets us match “Testing 123” and “Tests 123”, but not “Hello” and “Other”
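The lookahead examples can be reproduced with a short Python sketch. Python is an assumption of this example: the m flag is spelled re.MULTILINE there, and findall plays the role of the g flag. The sample lines mirror the ones named in the captions.

```python
import re

lines = "Testing 123\nTests 123\nHello\nOther"

# Negative lookahead: keep lines that do NOT start with "Test"
without_test = re.findall(r"^(?!Test).*$", lines, flags=re.MULTILINE)
print(without_test)  # ['Hello', 'Other']

# Positive lookahead: keep lines that DO start with "Test"
with_test = re.findall(r"^(?=Test).*$", lines, flags=re.MULTILINE)
print(with_test)  # ['Testing 123', 'Tests 123']

# The lookahead itself consumes nothing: only "B" is part of the match
b_in_bc = re.search(r"B(?!A)", "BC")
b_in_ba = re.search(r"B(?!A)", "BA")
print(b_in_bc.group())  # B
print(b_in_ba)          # None
```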

Putting it all together

Regexes are extremely powerful and can be used in a myriad of string manipulations. Knowing them can help you refactor codebases, script quick language changes, and more!

Let’s go back to our initial phone number regex and try to understand it again:

^(?:\d{3}-){2}\d{4}$

Remember that this regex is looking to match phone numbers such as:

555-555-5555

Here this regex is:

  • Using ^ and $ to define the start and end of a regex line.
  • Using a non-capturing group to find three digits then a dash
    • Repeating this group twice, to match 555-555-
  • Finding the last 4 digits of the phone number
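To see the breakdown in action, here is a minimal Python sketch that runs the phone number regex against a few candidates. The candidate strings and variable names are invented for this example.

```python
import re

# Three digits and a dash, twice, then exactly four digits, nothing else
phone = re.compile(r"^(?:\d{3}-){2}\d{4}$")

candidates = ["555-555-5555", "555-5555", "555-555-55555", "abc-def-ghij"]
results = {c: bool(phone.search(c)) for c in candidates}

for candidate, ok in results.items():
    print(candidate, ok)
```

Only the first candidate passes. The second is missing one group of three digits, the third has a fifth trailing digit that $ rejects, and the last contains no digits at all.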

Hopefully, this article has been a helpful introduction to regexes for you. If you’d like to see quick definitions of useful regexes, check out our cheat sheet.

In this tutorial, you’ll explore regular expressions, also known as regexes, in Python. A regex is a special sequence of characters that defines a pattern for complex string-matching functionality.

Earlier in this series, in the tutorial Strings and Character Data in Python, you learned how to define and manipulate string objects. Since then, you’ve seen some ways to determine whether two strings match each other:

  • You can test whether two strings are equal using the equality (==) operator.

  • You can test whether one string is a substring of another with the in operator or the built-in string methods .find() and .index().

String matching like this is a common task in programming, and you can get a lot done with string operators and built-in methods. At times, though, you may need more sophisticated pattern-matching capabilities.

In this tutorial, you’ll learn:

  • How to access the re module, which implements regex matching in Python
  • How to use re.search() to match a pattern against a string
  • How to create complex matching patterns with regex metacharacters

Fasten your seat belt! Regex syntax takes a little getting used to. But once you get comfortable with it, you’ll find regexes almost indispensable in your Python programming.

Regexes in Python and Their Uses

Imagine you have a string object s. Now suppose you need to write Python code to find out whether s contains the substring '123'. There are at least a couple of ways to do this. You could use the in operator:

>>> s = 'foo123bar'
>>> '123' in s
True

If you want to know not only whether '123' exists in s but also where it exists, then you can use .find() or .index(). Each of these returns the character position within s where the substring resides:

>>> s = 'foo123bar'
>>> s.find('123')
3
>>> s.index('123')
3

In these examples, the matching is done by a straightforward character-by-character comparison. That will get the job done in many cases. But sometimes, the problem is more complicated than that.

For example, rather than searching for a fixed substring like '123', suppose you wanted to determine whether a string contains any three consecutive decimal digit characters, as in the strings 'foo123bar', 'foo456bar', '234baz', and 'qux678'.

Strict character comparisons won’t cut it here. This is where regexes in Python come to the rescue.

A (Very Brief) History of Regular Expressions

In 1951, mathematician Stephen Cole Kleene described the concept of a regular language, a language that is recognizable by a finite automaton and formally expressible using regular expressions. In the mid-1960s, computer science pioneer Ken Thompson, one of the original designers of Unix, implemented pattern matching in the QED text editor using Kleene’s notation.

Since then, regexes have appeared in many programming languages, editors, and other tools as a means of determining whether a string matches a specified pattern. Python, Java, and Perl all support regex functionality, as do most Unix tools and many text editors.

The re Module

Regex functionality in Python resides in a module named re. The re module contains many useful functions and methods, most of which you’ll learn about in the next tutorial in this series.

For now, you’ll focus predominantly on one function, re.search().

re.search(<regex>, <string>)

Scans a string for a regex match.

re.search(<regex>, <string>) scans <string> looking for the first location where the pattern <regex> matches. If a match is found, then re.search() returns a match object. Otherwise, it returns None.

re.search() takes an optional third <flags> argument that you’ll learn about at the end of this tutorial.

How to Import re.search()

Because search() resides in the re module, you need to import it before you can use it. One way to do this is to import the entire module and then use the module name as a prefix when calling the function:

import re
re.search(...)

Alternatively, you can import the function from the module by name and then refer to it without the module name prefix:

from re import search
search(...)

You’ll always need to import re.search() by one means or another before you’ll be able to use it.

The examples in the remainder of this tutorial will assume the first approach shown—importing the re module and then referring to the function with the module name prefix: re.search(). For the sake of brevity, the import re statement will usually be omitted, but remember that it’s always necessary.

For more information on importing from modules and packages, check out Python Modules and Packages—An Introduction.

First Pattern-Matching Example

Now that you know how to gain access to re.search(), you can give it a try:

 1>>> s = 'foo123bar'
 2
 3>>> # One last reminder to import!
 4>>> import re
 5
 6>>> re.search('123', s)
 7<_sre.SRE_Match object; span=(3, 6), match='123'>

Here, the search pattern <regex> is 123 and <string> is s. The returned match object appears on line 7. Match objects contain a wealth of useful information that you’ll explore soon.

For the moment, the important point is that re.search() did in fact return a match object rather than None. That tells you that it found a match. In other words, the specified <regex> pattern 123 is present in s.

A match object is truthy, so you can use it in a Boolean context like a conditional statement:

>>> if re.search('123', s):
...     print('Found a match.')
... else:
...     print('No match.')
...
Found a match.

The interpreter displays the match object as <_sre.SRE_Match object; span=(3, 6), match='123'>. This contains some useful information.

span=(3, 6) indicates the portion of <string> in which the match was found. This means the same thing as it would in slice notation:

>>> s[3:6]
'123'

In this example, the match starts at character position 3 and extends up to but not including position 6.

match='123' indicates which characters from <string> matched.
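The same information is available programmatically from the match object. A brief sketch, using only standard match object methods (the variable names are arbitrary):

```python
import re

s = 'foo123bar'
m = re.search('123', s)

# span() returns (start, end); end is exclusive, exactly like a slice
start, end = m.span()
print(start, end)          # 3 6
print(s[start:end])        # 123

# group() returns the matched text; start() and end() return the two offsets
print(m.group())           # 123
print(m.start(), m.end())  # 3 6
```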

This is a good start. But in this case, the <regex> pattern is just the plain string '123'. The pattern matching here is still just character-by-character comparison, pretty much the same as the in operator and .find() examples shown earlier. The match object helpfully tells you that the matching characters were '123', but that’s not much of a revelation since those were exactly the characters you searched for.

You’re just getting warmed up.

Python Regex Metacharacters

The real power of regex matching in Python emerges when <regex> contains special characters called metacharacters. These have a unique meaning to the regex matching engine and vastly enhance the capability of the search.

Consider again the problem of how to determine whether a string contains any three consecutive decimal digit characters.

In a regex, a set of characters specified in square brackets ([]) makes up a character class. This metacharacter sequence matches any single character that is in the class, as demonstrated in the following example:

>>> s = 'foo123bar'
>>> re.search('[0-9][0-9][0-9]', s)
<_sre.SRE_Match object; span=(3, 6), match='123'>

[0-9] matches any single decimal digit character—any character between '0' and '9', inclusive. The full expression [0-9][0-9][0-9] matches any sequence of three decimal digit characters. In this case, s matches because it contains three consecutive decimal digit characters, '123'.

These strings also match:

>>> re.search('[0-9][0-9][0-9]', 'foo456bar')
<_sre.SRE_Match object; span=(3, 6), match='456'>

>>> re.search('[0-9][0-9][0-9]', '234baz')
<_sre.SRE_Match object; span=(0, 3), match='234'>

>>> re.search('[0-9][0-9][0-9]', 'qux678')
<_sre.SRE_Match object; span=(3, 6), match='678'>

On the other hand, a string that doesn’t contain three consecutive digits won’t match:

>>> print(re.search('[0-9][0-9][0-9]', '12foo34'))
None

With regexes in Python, you can identify patterns in a string that you wouldn’t be able to find with the in operator or with string methods.

Take a look at another regex metacharacter. The dot (.) metacharacter matches any character except a newline, so it functions like a wildcard:

>>> s = 'foo123bar'
>>> re.search('1.3', s)
<_sre.SRE_Match object; span=(3, 6), match='123'>

>>> s = 'foo13bar'
>>> print(re.search('1.3', s))
None

In the first example, the regex 1.3 matches '123' because the '1' and '3' match literally, and the . matches the '2'. Here, you’re essentially asking, “Does s contain a '1', then any character (except a newline), then a '3'?” The answer is yes for 'foo123bar' but no for 'foo13bar'.

These examples provide a quick illustration of the power of regex metacharacters. Character class and dot are but two of the metacharacters supported by the re module. There are many more. Next, you’ll explore them fully.

Metacharacters Supported by the re Module

The following table briefly summarizes all the metacharacters supported by the re module. Some characters serve more than one purpose:

Character(s)    Meaning
.               Matches any single character except newline
^               ∙ Anchors a match at the start of a string
                ∙ Complements a character class
$               Anchors a match at the end of a string
*               Matches zero or more repetitions
+               Matches one or more repetitions
?               ∙ Matches zero or one repetition
                ∙ Specifies the non-greedy versions of *, +, and ?
                ∙ Introduces a lookahead or lookbehind assertion
                ∙ Creates a named group
{}              Matches an explicitly specified number of repetitions
\               ∙ Escapes a metacharacter of its special meaning
                ∙ Introduces a special character class
                ∙ Introduces a grouping backreference
[]              Specifies a character class
|               Designates alternation
()              Creates a group
: # = !         Designate a specialized group
<>              Creates a named group

This may seem like an overwhelming amount of information, but don’t panic! The following sections go over each one of these in detail.

The regex parser regards any character not listed above as an ordinary character that matches only itself. For example, in the first pattern-matching example shown above, you saw this:

>>> s = 'foo123bar'
>>> re.search('123', s)
<_sre.SRE_Match object; span=(3, 6), match='123'>

In this case, 123 is technically a regex, but it’s not a very interesting one because it doesn’t contain any metacharacters. It just matches the string '123'.

Things get much more exciting when you throw metacharacters into the mix. The following sections explain in detail how you can use each metacharacter or metacharacter sequence to enhance pattern-matching functionality.

Metacharacters That Match a Single Character

The metacharacter sequences in this section try to match a single character from the search string. When the regex parser encounters one of these metacharacter sequences, a match happens if the character at the current parsing position fits the description that the sequence describes.

[]

Specifies a specific set of characters to match.

Characters contained in square brackets ([]) represent a character class—an enumerated set of characters to match from. A character class metacharacter sequence will match any single character contained in the class.

You can enumerate the characters individually like this:

>>> re.search('ba[artz]', 'foobarqux')
<_sre.SRE_Match object; span=(3, 6), match='bar'>
>>> re.search('ba[artz]', 'foobazqux')
<_sre.SRE_Match object; span=(3, 6), match='baz'>

The metacharacter sequence [artz] matches any single 'a', 'r', 't', or 'z' character. In the example, the regex ba[artz] matches both 'bar' and 'baz' (and would also match 'baa' and 'bat').

A character class can also contain a range of characters separated by a hyphen (-), in which case it matches any single character within the range. For example, [a-z] matches any lowercase alphabetic character between 'a' and 'z', inclusive:

>>> re.search('[a-z]', 'FOObar')
<_sre.SRE_Match object; span=(3, 4), match='b'>

[0-9] matches any digit character:

>>> re.search('[0-9][0-9]', 'foo123bar')
<_sre.SRE_Match object; span=(3, 5), match='12'>

In this case, [0-9][0-9] matches a sequence of two digits. The first portion of the string 'foo123bar' that matches is '12'.

[0-9a-fA-F] matches any hexadecimal digit character:

>>> re.search('[0-9a-fA-F]', '--- a0 ---')
<_sre.SRE_Match object; span=(4, 5), match='a'>

Here, [0-9a-fA-F] matches the first hexadecimal digit character in the search string, 'a'.

You can complement a character class by specifying ^ as the first character, in which case it matches any character that isn’t in the set. In the following example, [^0-9] matches any character that isn’t a digit:

>>> re.search('[^0-9]', '12345foo')
<_sre.SRE_Match object; span=(5, 6), match='f'>

Here, the match object indicates that the first character in the string that isn’t a digit is 'f'.

If a ^ character appears in a character class but isn’t the first character, then it has no special meaning and matches a literal '^' character:

>>> re.search('[#:^]', 'foo^bar:baz#qux')
<_sre.SRE_Match object; span=(3, 4), match='^'>

As you’ve seen, you can specify a range of characters in a character class by separating characters with a hyphen. What if you want the character class to include a literal hyphen character? You can place it as the first or last character or escape it with a backslash (\):

>>> re.search('[-abc]', '123-456')
<_sre.SRE_Match object; span=(3, 4), match='-'>
>>> re.search('[abc-]', '123-456')
<_sre.SRE_Match object; span=(3, 4), match='-'>
>>> re.search('[ab\-c]', '123-456')
<_sre.SRE_Match object; span=(3, 4), match='-'>

If you want to include a literal ']' in a character class, then you can place it as the first character or escape it with a backslash:

>>> re.search('[]]', 'foo[1]')
<_sre.SRE_Match object; span=(5, 6), match=']'>
>>> re.search('[ab\]cd]', 'foo[1]')
<_sre.SRE_Match object; span=(5, 6), match=']'>

Other regex metacharacters lose their special meaning inside a character class:

>>> re.search('[)*+|]', '123*456')
<_sre.SRE_Match object; span=(3, 4), match='*'>
>>> re.search('[)*+|]', '123+456')
<_sre.SRE_Match object; span=(3, 4), match='+'>

As you saw in the table above, * and + have special meanings in a regex in Python. They designate repetition, which you’ll learn more about shortly. But in this example, they’re inside a character class, so they match themselves literally.

dot (.)

Specifies a wildcard.

The . metacharacter matches any single character except a newline:

>>> re.search('foo.bar', 'fooxbar')
<_sre.SRE_Match object; span=(0, 7), match='fooxbar'>

>>> print(re.search('foo.bar', 'foobar'))
None
>>> print(re.search('foo.bar', 'foo\nbar'))
None

As a regex, foo.bar essentially means the characters 'foo', then any character except newline, then the characters 'bar'. The first string shown above, 'fooxbar', fits the bill because the . metacharacter matches the 'x'.

The second and third strings fail to match. In the last case, although there’s a character between 'foo' and 'bar', it’s a newline, and by default, the . metacharacter doesn’t match a newline. There is, however, a way to force . to match a newline, which you’ll learn about at the end of this tutorial.
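As a brief preview, and only as a sketch, the flag in question is re.DOTALL from the standard re module. The snippet below contrasts the default behavior with that flag.

```python
import re

# By default the dot refuses to match the newline between 'foo' and 'bar'
default = re.search('foo.bar', 'foo\nbar')
print(default)  # None

# With re.DOTALL (also spelled re.S) the dot matches any character at all
dotall = re.search('foo.bar', 'foo\nbar', flags=re.DOTALL)
print(dotall.group() == 'foo\nbar')  # True
```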

\w
\W

Match based on whether a character is a word character.

\w matches any alphanumeric word character. Word characters are uppercase and lowercase letters, digits, and the underscore (_) character, so \w is essentially shorthand for [a-zA-Z0-9_]:

>>> re.search('\w', '#(.a$@&')
<_sre.SRE_Match object; span=(3, 4), match='a'>
>>> re.search('[a-zA-Z0-9_]', '#(.a$@&')
<_sre.SRE_Match object; span=(3, 4), match='a'>

In this case, the first word character in the string '#(.a$@&' is 'a'.

\W is the opposite. It matches any non-word character and is equivalent to [^a-zA-Z0-9_]:

>>> re.search('\W', 'a_1*3Qb')
<_sre.SRE_Match object; span=(3, 4), match='*'>
>>> re.search('[^a-zA-Z0-9_]', 'a_1*3Qb')
<_sre.SRE_Match object; span=(3, 4), match='*'>

Here, the first non-word character in 'a_1*3Qb' is '*'.
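One nuance worth checking for yourself: the [a-zA-Z0-9_] equivalence holds for ASCII text. With str patterns in Python 3, the word character class is Unicode-aware by default and also accepts letters from other scripts, unless you pass re.ASCII. A small sketch with a made-up sample word:

```python
import re

word = 'caf\u00e9'  # 'café', with the accented letter written as an escape

# Default (Unicode) matching: the accented letter counts as a word character
unicode_words = re.findall(r'\w+', word)
print(unicode_words)

# re.ASCII restricts \w to [a-zA-Z0-9_], so the match stops before the accent
ascii_words = re.findall(r'\w+', word, flags=re.ASCII)
print(ascii_words)  # ['caf']

# Under re.ASCII the accented letter is a non-word character instead
ascii_nonword = re.findall(r'\W', word, flags=re.ASCII)
print(len(ascii_nonword))  # 1
```

If your input is guaranteed to be plain ASCII, the two modes behave identically.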

\d
\D

Match based on whether a character is a decimal digit.

\d matches any decimal digit character. \D is the opposite. It matches any character that isn’t a decimal digit:

>>> re.search('\d', 'abc4def')
<_sre.SRE_Match object; span=(3, 4), match='4'>

>>> re.search('\D', '234Q678')
<_sre.SRE_Match object; span=(3, 4), match='Q'>

\d is essentially equivalent to [0-9], and \D is equivalent to [^0-9].

\s
\S

Match based on whether a character represents whitespace.

\s matches any whitespace character:

>>> re.search('\s', 'foo\nbar baz')
<_sre.SRE_Match object; span=(3, 4), match='\n'>

Note that, unlike the dot wildcard metacharacter, \s does match a newline character.

\S is the opposite of \s. It matches any character that isn’t whitespace:

>>> re.search('\S', '  \n foo  \n  ')
<_sre.SRE_Match object; span=(4, 5), match='f'>

Again, \s and \S consider a newline to be whitespace. In the example above, the first non-whitespace character is 'f'.

The character class sequences \w, \W, \d, \D, \s, and \S can appear inside a square bracket character class as well:

>>> re.search('[\d\w\s]', '---3---')
<_sre.SRE_Match object; span=(3, 4), match='3'>
>>> re.search('[\d\w\s]', '---a---')
<_sre.SRE_Match object; span=(3, 4), match='a'>
>>> re.search('[\d\w\s]', '--- ---')
<_sre.SRE_Match object; span=(3, 4), match=' '>

In this case, [\d\w\s] matches any digit, word, or whitespace character. And since \w includes \d, the same character class could also be expressed slightly shorter as [\w\s].
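To make the combination concrete, here is a short sketch that uses the shorthand classes inside brackets, including a complemented class. The sample string is invented for this example, and raw strings keep the backslashes intact.

```python
import re

sample = 'id=42; name=Ann Lee'

# Inside brackets the shorthand classes combine like ordinary members:
# [\d\s] accepts a digit or a whitespace character
digits_or_space = re.findall(r'[\d\s]', sample)
print(digits_or_space)  # ['4', '2', ' ', ' ']

# A complemented class works too: anything that is neither word nor space
punctuation = re.findall(r'[^\w\s]', sample)
print(punctuation)  # ['=', ';', '=']
```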

Escaping Metacharacters

Occasionally, you’ll want to include a metacharacter in your regex, except you won’t want it to carry its special meaning. Instead, you’ll want it to represent itself as a literal character.

backslash (\)

Removes the special meaning of a metacharacter.

As you’ve just seen, the backslash character can introduce special character classes like word, digit, and whitespace. There are also special metacharacter sequences called anchors that begin with a backslash, which you’ll learn about below.

When it’s not serving either of these purposes, the backslash escapes metacharacters. A metacharacter preceded by a backslash loses its special meaning and matches the literal character instead. Consider the following examples:

 1>>> re.search('.', 'foo.bar')
 2<_sre.SRE_Match object; span=(0, 1), match='f'>
 3
 4>>> re.search('\.', 'foo.bar')
 5<_sre.SRE_Match object; span=(3, 4), match='.'>

In the <regex> on line 1, the dot (.) functions as a wildcard metacharacter, which matches the first character in the string ('f'). The . character in the <regex> on line 4 is escaped by a backslash, so it isn’t a wildcard. It’s interpreted literally and matches the '.' at index 3 of the search string.

Using backslashes for escaping can get messy. Suppose you have a string that contains a single backslash:

>>> s = r'foo\bar'
>>> print(s)
foo\bar

Now suppose you want to create a <regex> that will match the backslash between 'foo' and 'bar'. The backslash is itself a special character in a regex, so to specify a literal backslash, you need to escape it with another backslash. If that’s the case, then the following should work:

>>> re.search('\\', s)

Not quite. This is what you get if you try it:

>>> re.search('\\', s)
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    re.search('\\', s)
  File "C:\Python36\lib\re.py", line 182, in search
    return _compile(pattern, flags).search(string)
  File "C:\Python36\lib\re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Python36\lib\sre_compile.py", line 562, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Python36\lib\sre_parse.py", line 848, in parse
    source = Tokenizer(str)
  File "C:\Python36\lib\sre_parse.py", line 231, in __init__
    self.__next()
  File "C:\Python36\lib\sre_parse.py", line 245, in __next
    self.string, len(self.string) - 1) from None
sre_constants.error: bad escape (end of pattern) at position 0

Oops. What happened?

The problem here is that the backslash escaping happens twice, first by the Python interpreter on the string literal and then again by the regex parser on the regex it receives.

Here’s the sequence of events:

  1. The Python interpreter is the first to process the string literal '\\'. It interprets that as an escaped backslash and passes only a single backslash to re.search().
  2. The regex parser receives just a single backslash, which isn’t a meaningful regex, so the messy error ensues.

There are two ways around this. First, you can escape both backslashes in the original string literal:

>>> re.search('\\\\', s)
<_sre.SRE_Match object; span=(3, 4), match='\\'>

Doing so causes the following to happen:

  1. The interpreter sees '\\\\' as a pair of escaped backslashes. It reduces each pair to a single backslash and passes '\\' to the regex parser.
  2. The regex parser then sees \\ as one escaped backslash. As a <regex>, that matches a single backslash character. You can see from the match object that it matched the backslash at index 3 in s as intended. It’s cumbersome, but it works.

The second, and probably cleaner, way to handle this is to specify the <regex> using a raw string:

>>> re.search(r'\\', s)
<_sre.SRE_Match object; span=(3, 4), match='\\'>

This suppresses the escaping at the interpreter level. The string '\\' gets passed unchanged to the regex parser, which again sees one escaped backslash as desired.

It’s good practice to use a raw string to specify a regex in Python whenever it contains backslashes.
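The two spellings can be compared directly. The sketch below also uses re.escape(), a standard re helper that builds the escaped form of a literal for you. The variable names are arbitrary.

```python
import re

s = r'foo\bar'          # one literal backslash between 'foo' and 'bar'

escaped = '\\\\'        # regular literal: four backslashes typed, two stored
raw = r'\\'             # raw literal: two typed, two stored

print(escaped == raw)   # True
print(len(raw))         # 2

# Both spellings hand the same two-character pattern to the regex parser
escaped_span = re.search(escaped, s).span()
raw_span = re.search(raw, s).span()
print(escaped_span, raw_span)  # (3, 4) (3, 4)

# re.escape() produces the escaped pattern for an arbitrary literal string
print(re.escape('\\') == raw)  # True
```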

Anchors

Anchors are zero-width matches. They don’t match any actual characters in the search string, and they don’t consume any of the search string during parsing. Instead, an anchor dictates a particular location in the search string where a match must occur.

^
\A

Anchor a match to the start of <string>.

When the regex parser encounters ^ or \A, the parser’s current position must be at the beginning of the search string for it to find a match.

In other words, regex ^foo stipulates that 'foo' must be present not just any old place in the search string, but at the beginning:

>>> re.search('^foo', 'foobar')
<_sre.SRE_Match object; span=(0, 3), match='foo'>
>>> print(re.search('^foo', 'barfoo'))
None

\A functions similarly:

>>> re.search(r'\Afoo', 'foobar')
<_sre.SRE_Match object; span=(0, 3), match='foo'>
>>> print(re.search(r'\Afoo', 'barfoo'))
None

^ and \A behave slightly differently from each other in MULTILINE mode. You’ll learn more about MULTILINE mode below in the section on flags.

$
\Z

Anchor a match to the end of <string>.

When the regex parser encounters $ or \Z, the parser’s current position must be at the end of the search string for it to find a match. Whatever precedes $ or \Z must constitute the end of the search string:

>>> re.search('bar$', 'foobar')
<_sre.SRE_Match object; span=(3, 6), match='bar'>
>>> print(re.search('bar$', 'barfoo'))
None

>>> re.search(r'bar\Z', 'foobar')
<_sre.SRE_Match object; span=(3, 6), match='bar'>
>>> print(re.search(r'bar\Z', 'barfoo'))
None

As a special case, $ (but not \Z) also matches just before a single newline at the end of the search string:

>>> re.search('bar$', 'foobar\n')
<_sre.SRE_Match object; span=(3, 6), match='bar'>

In this example, 'bar' isn’t technically at the end of the search string because it’s followed by one additional newline character. But the regex parser lets it slide and calls it a match anyway. This exception doesn’t apply to \Z.
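To see the two end anchors side by side, here’s a minimal sketch that runs both against the same string with a trailing newline. The variable names are only illustrative:

```python
import re

text = 'foobar\n'  # 'bar' is followed by a single trailing newline

dollar_match = re.search(r'bar$', text)   # $ tolerates one trailing newline
z_match = re.search(r'bar\Z', text)       # \Z demands the true end of the string

print(dollar_match)
print(z_match)
```

If a trailing newline should ever disqualify a match, then \Z is the safer choice of the two.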

$ and \Z behave slightly differently from each other in MULTILINE mode. See the section below on flags for more information on MULTILINE mode.

\b

Anchors a match to a word boundary.

\b asserts that the regex parser’s current position must be at the beginning or end of a word. A word consists of a sequence of alphanumeric characters or underscores ([a-zA-Z0-9_]), the same as for the \w character class:

 1>>> re.search(r'\bbar', 'foo bar')
 2<_sre.SRE_Match object; span=(4, 7), match='bar'>
 3>>> re.search(r'\bbar', 'foo.bar')
 4<_sre.SRE_Match object; span=(4, 7), match='bar'>
 5
 6>>> print(re.search(r'\bbar', 'foobar'))
 7None
 8
 9>>> re.search(r'foo\b', 'foo bar')
10<_sre.SRE_Match object; span=(0, 3), match='foo'>
11>>> re.search(r'foo\b', 'foo.bar')
12<_sre.SRE_Match object; span=(0, 3), match='foo'>
13
14>>> print(re.search(r'foo\b', 'foobar'))
15None

In the above examples, a match happens on lines 1 and 3 because there’s a word boundary at the start of 'bar'. This isn’t the case on line 6, so the match fails there.

Similarly, there are matches on lines 9 and 11 because a word boundary exists at the end of 'foo', but not on line 14.

Using the \b anchor on both ends of the <regex> will cause it to match when it’s present in the search string as a whole word:

>>> re.search(r'\bbar\b', 'foo bar baz')
<_sre.SRE_Match object; span=(4, 7), match='bar'>
>>> re.search(r'\bbar\b', 'foo(bar)baz')
<_sre.SRE_Match object; span=(4, 7), match='bar'>

>>> print(re.search(r'\bbar\b', 'foobarbaz'))
None

This is another instance in which it pays to specify the <regex> as a raw string, as the above examples have done.

Because '\b' is an escape sequence for both string literals and regexes in Python, each use above would need to be double escaped as '\\b' if you didn’t use raw strings. That wouldn’t be the end of the world, but raw strings are tidier.
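The following sketch illustrates what goes wrong without the raw string. In an ordinary string literal, Python turns '\b' into the backspace character before the regex parser ever sees it, so the pattern silently looks for a backspace instead of a word boundary:

```python
import re

backspace_pattern = '\bbar'   # non-raw: '\b' becomes the backspace character (0x08)
boundary_pattern = r'\bbar'   # raw: the regex parser receives \b, a word boundary

is_backspace = backspace_pattern[0] == '\x08'
no_match = re.search(backspace_pattern, 'foo bar')
match = re.search(boundary_pattern, 'foo bar')

print(is_backspace, no_match, match.span())
```

No error is raised in the non-raw case, which is what makes this mistake easy to miss.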

\B

Anchors a match to a location that isn’t a word boundary.

\B does the opposite of \b. It asserts that the regex parser’s current position must not be at the start or end of a word:

 1>>> print(re.search(r'\Bfoo\B', 'foo'))
 2None
 3>>> print(re.search(r'\Bfoo\B', '.foo.'))
 4None
 5
 6>>> re.search(r'\Bfoo\B', 'barfoobaz')
 7<_sre.SRE_Match object; span=(3, 6), match='foo'>

In this case, a match happens on line 6 because no word boundary exists at the start or end of 'foo' in the search string 'barfoobaz'.

Quantifiers

A quantifier metacharacter immediately follows a portion of a <regex> and indicates how many times that portion must occur for the match to succeed.

*

Matches zero or more repetitions of the preceding regex.

For example, a* matches zero or more 'a' characters. That means it would match an empty string, 'a', 'aa', 'aaa', and so on.

Consider these examples:

 1>>> re.search('foo-*bar', 'foobar')                     # Zero dashes
 2<_sre.SRE_Match object; span=(0, 6), match='foobar'>
 3>>> re.search('foo-*bar', 'foo-bar')                    # One dash
 4<_sre.SRE_Match object; span=(0, 7), match='foo-bar'>
 5>>> re.search('foo-*bar', 'foo--bar')                   # Two dashes
 6<_sre.SRE_Match object; span=(0, 8), match='foo--bar'>

On line 1, there are zero '-' characters between 'foo' and 'bar'. On line 3 there’s one, and on line 5 there are two. The metacharacter sequence -* matches in all three cases.

You’ll probably encounter the regex .* in a Python program at some point. This matches zero or more occurrences of any character. In other words, it essentially matches any character sequence up to a line break. (Remember that the . wildcard metacharacter doesn’t match a newline.)

In this example, .* matches everything between 'foo' and 'bar':

>>> re.search('foo.*bar', '# foo $qux@grault % bar #')
<_sre.SRE_Match object; span=(2, 23), match='foo $qux@grault % bar'>

Did you notice the span= and match= information contained in the match object?

Until now, the regexes in the examples you’ve seen have specified matches of predictable length. Once you start using quantifiers like *, the number of characters matched can be quite variable, and the information in the match object becomes more useful.

You’ll learn more about how to access the information stored in a match object in the next tutorial in the series.

+

Matches one or more repetitions of the preceding regex.

This is similar to *, but the quantified regex must occur at least once:

 1>>> print(re.search('foo-+bar', 'foobar'))              # Zero dashes
 2None
 3>>> re.search('foo-+bar', 'foo-bar')                    # One dash
 4<_sre.SRE_Match object; span=(0, 7), match='foo-bar'>
 5>>> re.search('foo-+bar', 'foo--bar')                   # Two dashes
 6<_sre.SRE_Match object; span=(0, 8), match='foo--bar'>

Remember from above that foo-*bar matched the string 'foobar' because the * metacharacter allows for zero occurrences of '-'. The + metacharacter, on the other hand, requires at least one occurrence of '-'. That means there isn’t a match on line 1 in this case.

?

Matches zero or one repetitions of the preceding regex.

Again, this is similar to * and +, but in this case there’s only a match if the preceding regex occurs once or not at all:

 1>>> re.search('foo-?bar', 'foobar')                     # Zero dashes
 2<_sre.SRE_Match object; span=(0, 6), match='foobar'>
 3>>> re.search('foo-?bar', 'foo-bar')                    # One dash
 4<_sre.SRE_Match object; span=(0, 7), match='foo-bar'>
 5>>> print(re.search('foo-?bar', 'foo--bar'))            # Two dashes
 6None

In this example, there are matches on lines 1 and 3. But on line 5, where there are two '-' characters, the match fails.

Here are some more examples showing the use of all three quantifier metacharacters:

>>> re.match('foo[1-9]*bar', 'foobar')
<_sre.SRE_Match object; span=(0, 6), match='foobar'>
>>> re.match('foo[1-9]*bar', 'foo42bar')
<_sre.SRE_Match object; span=(0, 8), match='foo42bar'>

>>> print(re.match('foo[1-9]+bar', 'foobar'))
None
>>> re.match('foo[1-9]+bar', 'foo42bar')
<_sre.SRE_Match object; span=(0, 8), match='foo42bar'>

>>> re.match('foo[1-9]?bar', 'foobar')
<_sre.SRE_Match object; span=(0, 6), match='foobar'>
>>> print(re.match('foo[1-9]?bar', 'foo42bar'))
None

This time, the quantified regex is the character class [1-9] instead of the simple character '-'.
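To compare the three quantifiers in one pass, here’s a small sketch that builds each pattern from the examples above and records which dash counts it accepts. The results dictionary is just an illustrative way to collect the outcomes:

```python
import re

results = {}
for quantifier in ('*', '+', '?'):
    pattern = 'foo-' + quantifier + 'bar'
    # For zero, one, and two dashes, record whether the pattern matches.
    results[quantifier] = [
        re.search(pattern, 'foo' + '-' * dashes + 'bar') is not None
        for dashes in range(3)
    ]

for quantifier, outcomes in results.items():
    print(quantifier, outcomes)
```

Each list holds the outcome for zero, one, and two dashes, in that order.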

*?
+?
??

The non-greedy (or lazy) versions of the *, +, and ? quantifiers.

When used alone, the quantifier metacharacters *, +, and ? are all greedy, meaning they produce the longest possible match. Consider this example:

>>> re.search('<.*>', '%<foo> <bar> <baz>%')
<_sre.SRE_Match object; span=(1, 18), match='<foo> <bar> <baz>'>

The regex <.*> effectively means:

  • A '<' character
  • Then any sequence of characters
  • Then a '>' character

But which '>' character? There are three possibilities:

  1. The one just after 'foo'
  2. The one just after 'bar'
  3. The one just after 'baz'

Since the * metacharacter is greedy, it dictates the longest possible match, which includes everything up to and including the '>' character that follows 'baz'. You can see from the match object that this is the match produced.

If you want the shortest possible match instead, then use the non-greedy metacharacter sequence *?:

>>> re.search('<.*?>', '%<foo> <bar> <baz>%')
<_sre.SRE_Match object; span=(1, 6), match='<foo>'>

In this case, the match ends with the '>' character following 'foo'.

There are lazy versions of the + and ? quantifiers as well:

 1>>> re.search('<.+>', '%<foo> <bar> <baz>%')
 2<_sre.SRE_Match object; span=(1, 18), match='<foo> <bar> <baz>'>
 3>>> re.search('<.+?>', '%<foo> <bar> <baz>%')
 4<_sre.SRE_Match object; span=(1, 6), match='<foo>'>
 5
 6>>> re.search('ba?', 'baaaa')
 7<_sre.SRE_Match object; span=(0, 2), match='ba'>
 8>>> re.search('ba??', 'baaaa')
 9<_sre.SRE_Match object; span=(0, 1), match='b'>

The first two examples on lines 1 and 3 are similar to the examples shown above, only using + and +? instead of * and *?.

The last examples on lines 6 and 8 are a little different. In general, the ? metacharacter matches zero or one occurrences of the preceding regex. The greedy version, ?, matches one occurrence, so ba? matches 'b' followed by a single 'a'. The non-greedy version, ??, matches zero occurrences, so ba?? matches just 'b'.
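The practical effect of laziness is easiest to see when you collect every match instead of just the first. This sketch uses re.findall(), which isn’t covered in this section but is part of the standard re module and returns all non-overlapping matches as a list:

```python
import re

text = '%<foo> <bar> <baz>%'

greedy_tags = re.findall(r'<.*>', text)   # one long match spanning all three tags
lazy_tags = re.findall(r'<.*?>', text)    # each tag matched separately

print(greedy_tags)
print(lazy_tags)
```

The greedy pattern swallows everything between the first '<' and the last '>', while the lazy one stops at the nearest '>' each time.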

{m}

Matches exactly m repetitions of the preceding regex.

This is similar to * or +, but it specifies exactly how many times the preceding regex must occur for a match to succeed:

>>> print(re.search('x-{3}x', 'x--x'))                # Two dashes
None

>>> re.search('x-{3}x', 'x---x')                      # Three dashes
<_sre.SRE_Match object; span=(0, 5), match='x---x'>

>>> print(re.search('x-{3}x', 'x----x'))              # Four dashes
None

Here, x-{3}x matches 'x', followed by exactly three instances of the '-' character, followed by another 'x'. The match fails when there are fewer or more than three dashes between the 'x' characters.

{m,n}

Matches any number of repetitions of the preceding regex from m to n, inclusive.

In the following example, the quantified <regex> is -{2,4}. The match succeeds when there are two, three, or four dashes between the 'x' characters but fails otherwise:

>>> for i in range(1, 6):
...     s = f"x{'-' * i}x"
...     print(f'{i}  {s:10}', re.search('x-{2,4}x', s))
...
1  x-x        None
2  x--x       <_sre.SRE_Match object; span=(0, 4), match='x--x'>
3  x---x      <_sre.SRE_Match object; span=(0, 5), match='x---x'>
4  x----x     <_sre.SRE_Match object; span=(0, 6), match='x----x'>
5  x-----x    None

Omitting m implies a lower bound of 0, and omitting n implies an unlimited upper bound:

Regular Expression  Matches                                                          Identical to
<regex>{,n}         Any number of repetitions of <regex> less than or equal to n     <regex>{0,n}
<regex>{m,}         Any number of repetitions of <regex> greater than or equal to m  ----
<regex>{,}          Any number of repetitions of <regex>                             <regex>{0,}, <regex>*
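Here’s a short sketch of the open-ended bounds in action. The matched() helper is hypothetical and exists only for this illustration. It uses re.fullmatch(), a standard re function that succeeds only when the whole string matches:

```python
import re

# 'xx', 'x-x', 'x--x', 'x---x', 'x----x'
samples = ['x' + '-' * n + 'x' for n in range(5)]

def matched(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

at_most_two = matched(r'x-{,2}x')    # lower bound omitted, so it defaults to 0
at_least_two = matched(r'x-{2,}x')   # upper bound omitted, so it is unlimited
same_as_explicit_zero = at_most_two == matched(r'x-{0,2}x')

print(at_most_two)
print(at_least_two)
print(same_as_explicit_zero)
```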

If you omit all of m, n, and the comma, then the curly braces no longer function as metacharacters. {} matches just the literal string '{}':

>>> re.search('x{}y', 'x{}y')
<_sre.SRE_Match object; span=(0, 4), match='x{}y'>

In fact, to have any special meaning, a sequence with curly braces must fit one of the following patterns in which m and n are nonnegative integers:

  • {m}
  • {m,n}
  • {m,}
  • {,n}
  • {,}

Otherwise, it matches literally:

>>> re.search('x{foo}y', 'x{foo}y')
<_sre.SRE_Match object; span=(0, 7), match='x{foo}y'>
>>> re.search('x{a:b}y', 'x{a:b}y')
<_sre.SRE_Match object; span=(0, 7), match='x{a:b}y'>
>>> re.search('x{1,3,5}y', 'x{1,3,5}y')
<_sre.SRE_Match object; span=(0, 9), match='x{1,3,5}y'>
>>> re.search('x{foo,bar}y', 'x{foo,bar}y')
<_sre.SRE_Match object; span=(0, 11), match='x{foo,bar}y'>

Later in this tutorial, when you learn about the DEBUG flag, you’ll see how you can confirm this.

{m,n}?

The non-greedy (lazy) version of {m,n}.

{m,n} will match as many characters as possible, and {m,n}? will match as few as possible:

>>> re.search('a{3,5}', 'aaaaaaaa')
<_sre.SRE_Match object; span=(0, 5), match='aaaaa'>

>>> re.search('a{3,5}?', 'aaaaaaaa')
<_sre.SRE_Match object; span=(0, 3), match='aaa'>

In this case, a{3,5} produces the longest possible match, so it matches five 'a' characters. a{3,5}? produces the shortest match, so it matches three.

Grouping Constructs and Backreferences

Grouping constructs break up a regex in Python into subexpressions or groups. This serves two purposes:

  1. Grouping: A group represents a single syntactic entity. Additional metacharacters apply to the entire group as a unit.
  2. Capturing: Some grouping constructs also capture the portion of the search string that matches the subexpression in the group. You can retrieve captured matches later through several different mechanisms.

Here’s a look at how grouping and capturing work.

(<regex>)

Defines a subexpression or group.

This is the most basic grouping construct. A regex in parentheses just matches the contents of the parentheses:

>>> re.search('(bar)', 'foo bar baz')
<_sre.SRE_Match object; span=(4, 7), match='bar'>

>>> re.search('bar', 'foo bar baz')
<_sre.SRE_Match object; span=(4, 7), match='bar'>

As a regex, (bar) matches the string 'bar', the same as the regex bar would without the parentheses.

Treating a Group as a Unit

A quantifier metacharacter that follows a group operates on the entire subexpression specified in the group as a single unit.

For instance, the following example matches one or more occurrences of the string 'bar':

>>> re.search('(bar)+', 'foo bar baz')
<_sre.SRE_Match object; span=(4, 7), match='bar'>
>>> re.search('(bar)+', 'foo barbar baz')
<_sre.SRE_Match object; span=(4, 10), match='barbar'>
>>> re.search('(bar)+', 'foo barbarbarbar baz')
<_sre.SRE_Match object; span=(4, 16), match='barbarbarbar'>

Here’s a breakdown of the difference between the two regexes with and without grouping parentheses:

Regex   Interpretation                                           Matches                                          Examples
bar+    The + metacharacter applies only to the character 'r'.   'ba' followed by one or more occurrences of 'r'  'bar', 'barr', 'barrr'
(bar)+  The + metacharacter applies to the entire string 'bar'.  One or more occurrences of 'bar'                 'bar', 'barbar', 'barbarbar'
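You can check the distinction yourself with a minimal sketch. It relies on re.fullmatch(), a standard re function not otherwise used in this section, so that a match counts only if the whole string fits the pattern:

```python
import re

# bar+ repeats only the 'r', while (bar)+ repeats the whole group.
r_repeated = re.fullmatch(r'bar+', 'barrr')
group_repeated = re.fullmatch(r'(bar)+', 'barbarbar')

# Swapping the inputs shows that the two regexes aren't interchangeable.
r_on_barbar = re.fullmatch(r'bar+', 'barbar')
group_on_barrr = re.fullmatch(r'(bar)+', 'barrr')

print(r_repeated, group_repeated)
print(r_on_barbar, group_on_barrr)
```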

Now take a look at a more complicated example. The regex (ba[rz]){2,4}(qux)? matches 2 to 4 occurrences of either 'bar' or 'baz', optionally followed by 'qux':

>>> re.search('(ba[rz]){2,4}(qux)?', 'bazbarbazqux')
<_sre.SRE_Match object; span=(0, 12), match='bazbarbazqux'>
>>> re.search('(ba[rz]){2,4}(qux)?', 'barbar')
<_sre.SRE_Match object; span=(0, 6), match='barbar'>

The following example shows that you can nest grouping parentheses:

>>> re.search(r'(foo(bar)?)+(\d\d\d)?', 'foofoobar')
<_sre.SRE_Match object; span=(0, 9), match='foofoobar'>
>>> re.search(r'(foo(bar)?)+(\d\d\d)?', 'foofoobar123')
<_sre.SRE_Match object; span=(0, 12), match='foofoobar123'>
>>> re.search(r'(foo(bar)?)+(\d\d\d)?', 'foofoo123')
<_sre.SRE_Match object; span=(0, 9), match='foofoo123'>

The regex (foo(bar)?)+(\d\d\d)? is pretty elaborate, so let’s break it down into smaller pieces:

Regex           Matches
foo(bar)?       'foo' optionally followed by 'bar'
(foo(bar)?)+    One or more occurrences of the above
\d\d\d          Three decimal digit characters
(\d\d\d)?       Zero or one occurrences of the above

String it all together and you get: at least one occurrence of 'foo' optionally followed by 'bar', all optionally followed by three decimal digit characters.

As you can see, you can construct very complicated regexes in Python using grouping parentheses.

Capturing Groups

Grouping isn’t the only useful purpose that grouping constructs serve. Most (but not quite all) grouping constructs also capture the part of the search string that matches the group. You can retrieve the captured portion or refer to it later in several different ways.

Remember the match object that re.search() returns? There are two methods defined for a match object that provide access to captured groups: .groups() and .group().

m.groups()

Returns a tuple containing all the captured groups from a regex match.

Consider this example:

>>> m = re.search(r'(\w+),(\w+),(\w+)', 'foo,quux,baz')
>>> m
<_sre.SRE_Match object; span=(0, 12), match='foo,quux,baz'>

Each of the three (\w+) expressions matches a sequence of word characters. The full regex (\w+),(\w+),(\w+) breaks the search string into three comma-separated tokens.

Because the (\w+) expressions use grouping parentheses, the corresponding matching tokens are captured. To access the captured matches, you can use .groups(), which returns a tuple containing all the captured matches in order:

>>> m.groups()
('foo', 'quux', 'baz')

Notice that the tuple contains the tokens but not the commas that appeared in the search string. That’s because the word characters that make up the tokens are inside the grouping parentheses but the commas aren’t. The commas that you see between the returned tokens are the standard delimiters used to separate values in a tuple.

m.group(<n>)

Returns a string containing the <n>th captured match.

With one argument, .group() returns a single captured match. Note that the arguments are one-based, not zero-based. So, m.group(1) refers to the first captured match, m.group(2) to the second, and so on:

>>> m = re.search(r'(\w+),(\w+),(\w+)', 'foo,quux,baz')
>>> m.groups()
('foo', 'quux', 'baz')

>>> m.group(1)
'foo'
>>> m.group(2)
'quux'
>>> m.group(3)
'baz'

Since the numbering of captured matches is one-based, and there isn’t any group numbered zero, m.group(0) has a special meaning:

>>> m.group(0)
'foo,quux,baz'
>>> m.group()
'foo,quux,baz'

m.group(0) returns the entire match, and m.group() does the same.

m.group(<n1>, <n2>, ...)

Returns a tuple containing the specified captured matches.

With multiple arguments, .group() returns a tuple containing the specified captured matches in the given order:

>>> m.groups()
('foo', 'quux', 'baz')

>>> m.group(2, 3)
('quux', 'baz')
>>> m.group(3, 2, 1)
('baz', 'quux', 'foo')

This is just convenient shorthand. You could create the tuple of matches yourself instead:

>>> m.group(3, 2, 1)
('baz', 'quux', 'foo')
>>> (m.group(3), m.group(2), m.group(1))
('baz', 'quux', 'foo')

The two statements shown are functionally equivalent.
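Reordering captured groups is handy for tasks like the date conversion mentioned in the introduction. The sketch below turns DD.MM.YYYY into YYYY-MM-DD. The function name to_iso_date() and the exact two-two-four digit layout are assumptions made for this illustration:

```python
import re

def to_iso_date(text):
    """Convert 'DD.MM.YYYY' to 'YYYY-MM-DD', or return None if the format differs."""
    m = re.fullmatch(r'(\d{2})\.(\d{2})\.(\d{4})', text)
    if m is None:
        return None
    # group(3, 2, 1) returns (year, month, day) as a tuple, ready to join.
    return '-'.join(m.group(3, 2, 1))

print(to_iso_date('31.12.1999'))
print(to_iso_date('1999-12-31'))
```

Note that the regex checks only the shape of the input. It doesn’t verify that the digits form a real calendar date.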

Backreferences

You can match a previously captured group later within the same regex using a special metacharacter sequence called a backreference.

\<n>

Matches the contents of a previously captured group.

Within a regex in Python, the sequence \<n>, where <n> is an integer from 1 to 99, matches the contents of the <n>th captured group.

Here’s a regex that matches a word, followed by a comma, followed by the same word again:

 1>>> regex = r'(\w+),\1'
 2
 3>>> m = re.search(regex, 'foo,foo')
 4>>> m
 5<_sre.SRE_Match object; span=(0, 7), match='foo,foo'>
 6>>> m.group(1)
 7'foo'
 8
 9>>> m = re.search(regex, 'qux,qux')
10>>> m
11<_sre.SRE_Match object; span=(0, 7), match='qux,qux'>
12>>> m.group(1)
13'qux'
14
15>>> m = re.search(regex, 'foo,qux')
16>>> print(m)
17None

In the first example, on line 3, (\w+) matches the first instance of the string 'foo' and saves it as the first captured group. The comma matches literally. Then \1 is a backreference to the first captured group and matches 'foo' again. The second example, on line 9, is identical except that the (\w+) matches 'qux' instead.

The last example, on line 15, doesn’t have a match because what comes before the comma isn’t the same as what comes after it, so the \1 backreference doesn’t match.

Numbered backreferences are one-based like the arguments to .group(). Only the first ninety-nine captured groups are accessible by backreference. The interpreter will regard \100 as the '@' character, whose octal value is 100.
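A common use for a numbered backreference is spotting an accidentally doubled word. The sketch below is one possible way to do that, and the sample sentences are made up for illustration. It also checks the octal behavior described above:

```python
import re

# A word, a space, then the same word again, bounded on both sides.
doubled_word = r'\b(\w+) \1\b'

m = re.search(doubled_word, 'this is is a test')
repeated = m.group(1)

# Three octal digits after the backslash form an octal escape, not a group number.
at_sign = re.search(r'\100', 'user@example.com')

print(m.group(), repeated, at_sign.group())
```

The \b anchors keep the regex from pairing the tail of 'this' with the following 'is'.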

Other Grouping Constructs

The (<regex>) metacharacter sequence shown above is the most straightforward way to perform grouping within a regex in Python. The next section introduces you to some enhanced grouping constructs that allow you to tweak when and how grouping occurs.

(?P<name><regex>)

Creates a named captured group.

This metacharacter sequence is similar to grouping parentheses in that it creates a group matching <regex> that is accessible through the match object or a subsequent backreference. The difference in this case is that you reference the matched group by its given symbolic <name> instead of by its number.

Earlier, you saw this example with three captured groups numbered 1, 2, and 3:

>>> m = re.search(r'(\w+),(\w+),(\w+)', 'foo,quux,baz')
>>> m.groups()
('foo', 'quux', 'baz')

>>> m.group(1, 2, 3)
('foo', 'quux', 'baz')

The following effectively does the same thing except that the groups have the symbolic names w1, w2, and w3:

>>> m = re.search(r'(?P<w1>\w+),(?P<w2>\w+),(?P<w3>\w+)', 'foo,quux,baz')
>>> m.groups()
('foo', 'quux', 'baz')

You can refer to these captured groups by their symbolic names:

>>> m.group('w1')
'foo'
>>> m.group('w3')
'baz'
>>> m.group('w1', 'w2', 'w3')
('foo', 'quux', 'baz')

You can still access groups with symbolic names by number if you wish:

>>> m = re.search(r'(?P<w1>\w+),(?P<w2>\w+),(?P<w3>\w+)', 'foo,quux,baz')

>>> m.group('w1')
'foo'
>>> m.group(1)
'foo'

>>> m.group('w1', 'w2', 'w3')
('foo', 'quux', 'baz')
>>> m.group(1, 2, 3)
('foo', 'quux', 'baz')

Any <name> specified with this construct must conform to the rules for a Python identifier, and each <name> can only appear once per regex.
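Two related behaviors are worth trying out. The sketch below uses .groupdict(), a standard match object method that isn’t covered in this section, to collect all named groups at once. It then confirms that reusing a name causes the regex to be rejected:

```python
import re

m = re.search(r'(?P<w1>\w+),(?P<w2>\w+),(?P<w3>\w+)', 'foo,quux,baz')
by_name = m.groupdict()   # maps each group name to its captured text

try:
    # The name 'word' appears twice, which the regex parser refuses.
    re.compile(r'(?P<word>\w+),(?P<word>\w+)')
    duplicate_rejected = False
except re.error:
    duplicate_rejected = True

print(by_name)
print(duplicate_rejected)
```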

(?P=<name>)

Matches the contents of a previously captured named group.

The (?P=<name>) metacharacter sequence is a backreference, similar to \<n>, except that it refers to a named group rather than a numbered group.

Here again is the example from above, which uses a numbered backreference to match a word, followed by a comma, followed by the same word again:

>>> m = re.search(r'(\w+),\1', 'foo,foo')
>>> m
<_sre.SRE_Match object; span=(0, 7), match='foo,foo'>
>>> m.group(1)
'foo'

The following code does the same thing using a named group and a backreference instead:

>>> m = re.search(r'(?P<word>\w+),(?P=word)', 'foo,foo')
>>> m
<_sre.SRE_Match object; span=(0, 7), match='foo,foo'>
>>> m.group('word')
'foo'

(?P<word>\w+) matches 'foo' and saves it as a captured group named word. Again, the comma matches literally. Then (?P=word) is a backreference to the named capture and matches 'foo' again.
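Named backreferences read well when the captured piece has an obvious role. As an illustration, this sketch matches text enclosed in matching quote characters. The group name quote and the sample strings are assumptions made for the example:

```python
import re

# Capture the opening quote character, then require the same one to close.
pattern = r'''(?P<quote>["']).*?(?P=quote)'''

m = re.search(pattern, '''say "it's fine" now''')
mismatched = re.search(pattern, '''"oops' ''')

print(m.group(), m.group('quote'))
print(mismatched)
```

Because the closing quote must equal whatever the group captured, the apostrophe inside the double-quoted text doesn’t end the match early.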

(?:<regex>)

Creates a non-capturing group.

(?:<regex>) is just like (<regex>) in that it matches the specified <regex>. But (?:<regex>) doesn’t capture the match for later retrieval:

>>> m = re.search(r'(\w+),(?:\w+),(\w+)', 'foo,quux,baz')
>>> m.groups()
('foo', 'baz')

>>> m.group(1)
'foo'
>>> m.group(2)
'baz'

In this example, the middle word 'quux' sits inside non-capturing parentheses, so it’s missing from the tuple of captured groups. It isn’t retrievable from the match object, nor would it be referable by backreference.

Why would you want to define a group but not capture it?

Remember that the regex parser will treat the <regex> inside grouping parentheses as a single unit. You may have a situation where you need this grouping feature, but you don’t need to do anything with the value later, so you don’t really need to capture it. If you use non-capturing grouping, then the tuple of captured groups won’t be cluttered with values you don’t actually need to keep.

Additionally, it takes some time and memory to capture a group. If the code that performs the match executes many times and you don’t capture groups that you aren’t going to use later, then you may see a slight performance advantage.
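Here’s a minimal sketch of that decluttering effect, reusing the (ba[rz]){2,4}(qux)? regex from earlier. Both versions match the same text, but only the capturing version records the repeated group:

```python
import re

text = 'bazbarbazqux'

capturing = re.search(r'(ba[rz]){2,4}(qux)?', text)
non_capturing = re.search(r'(?:ba[rz]){2,4}(qux)?', text)

print(capturing.groups())       # the repeated group keeps only its last repetition
print(non_capturing.groups())   # only the (qux) group is recorded
```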

(?(<n>)<yes-regex>|<no-regex>)
(?(<name>)<yes-regex>|<no-regex>)

Specifies a conditional match.

A conditional match matches against one of two specified regexes depending on whether the given group exists:

  • (?(<n>)<yes-regex>|<no-regex>) matches against <yes-regex> if a group numbered <n> exists. Otherwise, it matches against <no-regex>.

  • (?(<name>)<yes-regex>|<no-regex>) matches against <yes-regex> if a group named <name> exists. Otherwise, it matches against <no-regex>.

Conditional matches are better illustrated with an example. Consider this regex:

regex = r'^(###)?foo(?(1)bar|baz)'

Here are the parts of this regex broken out with some explanation:

  1. ^(###)? indicates that the search string optionally begins with '###'. If it does, then the grouping parentheses around ### will create a group numbered 1. Otherwise, no such group will exist.
  2. The next portion, foo, literally matches the string 'foo'.
  3. Lastly, (?(1)bar|baz) matches against 'bar' if group 1 exists and 'baz' if it doesn’t.

The following code blocks demonstrate the use of the above regex in several different Python code snippets:

Example 1:

>>> re.search(regex, '###foobar')
<_sre.SRE_Match object; span=(0, 9), match='###foobar'>

The search string '###foobar' does start with '###', so the parser creates a group numbered 1. The conditional match is then against 'bar', which matches.

Example 2:

>>> print(re.search(regex, '###foobaz'))
None

The search string '###foobaz' does start with '###', so the parser creates a group numbered 1. The conditional match is then against 'bar', which doesn’t match.

Example 3:

>>> print(re.search(regex, 'foobar'))
None

The search string 'foobar' doesn’t start with '###', so there isn’t a group numbered 1. The conditional match is then against 'baz', which doesn’t match.

Example 4:

>>> re.search(regex, 'foobaz')
<_sre.SRE_Match object; span=(0, 6), match='foobaz'>

The search string 'foobaz' doesn’t start with '###', so there isn’t a group numbered 1. The conditional match is then against 'baz', which matches.

Here’s another conditional match using a named group instead of a numbered group:

>>> regex = r'^(?P<ch>\W)?foo(?(ch)(?P=ch)|)$'

This regex matches the string 'foo', preceded by a single non-word character and followed by the same non-word character, or the string 'foo' by itself.

Again, let’s break this down into pieces:

Regex              Matches
^                  The start of the string
(?P<ch>\W)         A single non-word character, captured in a group named ch
(?P<ch>\W)?        Zero or one occurrences of the above
foo                The literal string 'foo'
(?(ch)(?P=ch)|)    The contents of the group named ch if it exists, or the empty string if it doesn’t
$                  The end of the string

If a non-word character precedes 'foo', then the parser creates a group named ch which contains that character. The conditional match then matches against <yes-regex>, which is (?P=ch), the same character again. That means the same character must also follow 'foo' for the entire match to succeed.

If 'foo' isn’t preceded by a non-word character, then the parser doesn’t create group ch. <no-regex> is the empty string, which means there must not be anything following 'foo' for the entire match to succeed. Since ^ and $ anchor the whole regex, the string must equal 'foo' exactly.

Here are some examples of searches using this regex in Python code:

 1>>> re.search(regex, 'foo')
 2<_sre.SRE_Match object; span=(0, 3), match='foo'>
 3>>> re.search(regex, '#foo#')
 4<_sre.SRE_Match object; span=(0, 5), match='#foo#'>
 5>>> re.search(regex, '@foo@')
 6<_sre.SRE_Match object; span=(0, 5), match='@foo@'>
 7
 8>>> print(re.search(regex, '#foo'))
 9None
10>>> print(re.search(regex, 'foo@'))
11None
12>>> print(re.search(regex, '#foo@'))
13None
14>>> print(re.search(regex, '@foo#'))
15None

On line 1, 'foo' is by itself. On lines 3 and 5, the same non-word character precedes and follows 'foo'. As advertised, these matches succeed.

In the remaining cases, the matches fail.

Conditional regexes in Python are pretty esoteric and challenging to work through. If you ever do find a reason to use one, then you could probably accomplish the same goal with multiple separate re.search() calls, and your code would be less complicated to read and understand.
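As a rough illustration of that alternative, the sketch below replaces the named-group conditional regex with two simpler checks. The helper name wrapped_foo() is hypothetical, and the two approaches are only verified to agree on the sample strings used above. Edge cases, such as a trailing newline that $ would tolerate, may differ:

```python
import re

conditional = r'^(?P<ch>\W)?foo(?(ch)(?P=ch)|)$'

def wrapped_foo(text):
    """Hypothetical helper: 'foo' alone, or 'foo' wrapped in one repeated non-word character."""
    if re.fullmatch(r'foo', text):
        return True
    return re.fullmatch(r'(\W)foo\1', text) is not None

samples = ['foo', '#foo#', '@foo@', '#foo', 'foo@', '#foo@', '@foo#']

# Compare the conditional regex with the two-step helper on every sample.
agree = all(
    (re.search(conditional, s) is not None) == wrapped_foo(s)
    for s in samples
)
print(agree)
```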

Lookahead and Lookbehind Assertions

Lookahead and lookbehind assertions determine the success or failure of a regex match in Python based on what is just behind (to the left) or ahead (to the right) of the parser’s current position in the search string.

Like anchors, lookahead and lookbehind assertions are zero-width assertions, so they don’t consume any of the search string. Also, even though they contain parentheses and perform grouping, they don’t capture what they match.

(?=<lookahead_regex>)

Creates a positive lookahead assertion.

(?=<lookahead_regex>) asserts that what follows the regex parser’s current position must match <lookahead_regex>:

>>> re.search('foo(?=[a-z])', 'foobar')
<_sre.SRE_Match object; span=(0, 3), match='foo'>

The lookahead assertion (?=[a-z]) specifies that what follows 'foo' must be a lowercase alphabetic character. In this case, it’s the character 'b', so a match is found.

In the next example, on the other hand, the lookahead fails. The next character after 'foo' is '1', so there isn’t a match:

>>> print(re.search('foo(?=[a-z])', 'foo123'))
None

What’s unique about a lookahead is that the portion of the search string that matches <lookahead_regex> isn’t consumed, and it isn’t part of the returned match object.

Take another look at the first example:

>>> re.search('foo(?=[a-z])', 'foobar')
<_sre.SRE_Match object; span=(0, 3), match='foo'>

The regex parser looks ahead only to the 'b' that follows 'foo' but doesn’t pass over it yet. You can tell that 'b' isn’t considered part of the match because the match object displays match='foo'.

Compare that to a similar example that uses grouping parentheses without a lookahead:

>>> re.search('foo([a-z])', 'foobar')
<_sre.SRE_Match object; span=(0, 4), match='foob'>

This time, the regex consumes the 'b', and it becomes a part of the eventual match.

Here’s another example illustrating how a lookahead differs from a conventional regex in Python:

 1>>> m = re.search('foo(?=[a-z])(?P<ch>.)', 'foobar')
 2>>> m.group('ch')
 3'b'
 4
 5>>> m = re.search('foo([a-z])(?P<ch>.)', 'foobar')
 6>>> m.group('ch')
 7'a'

In the first search, on line 1, the parser proceeds as follows:

  1. The first portion of the regex, foo, matches and consumes 'foo' from the search string 'foobar'.
  2. The next portion, (?=[a-z]), is a lookahead that matches 'b', but the parser doesn’t advance past the 'b'.
  3. Lastly, (?P<ch>.) matches the next single character available, which is 'b', and captures it in a group named ch.

The m.group('ch') call confirms that the group named ch contains 'b'.

Compare that to the search on line 5, which doesn’t contain a lookahead:

  1. As in the first example, the first portion of the regex, foo, matches and consumes 'foo' from the search string 'foobar'.
  2. The next portion, ([a-z]), matches and consumes 'b', and the parser advances past 'b'.
  3. Lastly, (?P<ch>.) matches the next single character available, which is now 'a'.

m.group('ch') confirms that, in this case, the group named ch contains 'a'.
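A lookahead is useful when you want to select text by what follows it without including that context in the result. This sketch, with a made-up sample string, pulls out only the numbers that are followed by 'px'. It uses re.findall() from the standard re module to collect every match:

```python
import re

css = 'width: 10px; margin: 2em; height: 30px'

# \d+ consumes the digits, and (?=px) only checks that 'px' comes next.
pixel_values = re.findall(r'\d+(?=px)', css)

print(pixel_values)
```

Because the lookahead consumes nothing, the 'px' suffix never appears in the results.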

(?!<lookahead_regex>)

Creates a negative lookahead assertion.

(?!<lookahead_regex>) asserts that what follows the regex parser’s current position must not match <lookahead_regex>.

Here are the positive lookahead examples you saw earlier, along with their negative lookahead counterparts:

 1>>> re.search('foo(?=[a-z])', 'foobar')
 2<_sre.SRE_Match object; span=(0, 3), match='foo'>
 3>>> print(re.search('foo(?![a-z])', 'foobar'))
 4None
 5
 6>>> print(re.search('foo(?=[a-z])', 'foo123'))
 7None
 8>>> re.search('foo(?![a-z])', 'foo123')
 9<_sre.SRE_Match object; span=(0, 3), match='foo'>

The negative lookahead assertions on lines 3 and 8 stipulate that what follows 'foo' should not be a lowercase alphabetic character. This fails on line 3 but succeeds on line 8. This is the opposite of what happened with the corresponding positive lookahead assertions.

As with a positive lookahead, what matches a negative lookahead isn’t part of the returned match object and isn’t consumed.
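
A negative lookahead is a natural fit for "match this, unless it's followed by that" rules. The following sketch assumes a made-up sample string and again borrows re.findall() to collect all matches. It keeps whole numbers that aren't immediately followed by a percent sign:

```python
import re

# Hypothetical sample text, used only for illustration.
text = "10% of 250 items, 5% off"

# \b...\b pins the match to a whole number, so the parser can't
# backtrack to a shorter run of digits just to dodge the '%'.
plain_numbers = re.findall(r"\b\d+\b(?!%)", text)
print(plain_numbers)  # ['250']
```

The word boundaries matter here. Without them, \d+ could give back a digit and match '1' from '10%', since '1' is followed by '0' rather than '%'.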

(?<=<lookbehind_regex>)

Creates a positive lookbehind assertion.

(?<=<lookbehind_regex>) asserts that what precedes the regex parser’s current position must match <lookbehind_regex>.

In the following example, the lookbehind assertion specifies that 'foo' must precede 'bar':

>>> re.search('(?<=foo)bar', 'foobar')
<_sre.SRE_Match object; span=(3, 6), match='bar'>

This is the case here, so the match succeeds. As with lookahead assertions, the part of the search string that matches the lookbehind doesn’t become part of the eventual match.

The next example fails to match because the lookbehind requires that 'qux' precede 'bar':

>>> print(re.search('(?<=qux)bar', 'foobar'))
None

There’s a restriction on lookbehind assertions that doesn’t apply to lookahead assertions. The <lookbehind_regex> in a lookbehind assertion must specify a match of fixed length.

For example, the following isn’t allowed because the length of the string matched by a+ is indeterminate:

>>> re.search('(?<=a+)def', 'aaadef')
Traceback (most recent call last):
  File "<pyshell#72>", line 1, in <module>
    re.search('(?<=a+)def', 'aaadef')
  File "C:\Python36\lib\re.py", line 182, in search
    return _compile(pattern, flags).search(string)
  File "C:\Python36\lib\re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Python36\lib\sre_compile.py", line 566, in compile
    code = _code(p, flags)
  File "C:\Python36\lib\sre_compile.py", line 551, in _code
    _compile(code, p.data, flags)
  File "C:\Python36\lib\sre_compile.py", line 160, in _compile
    raise error("look-behind requires fixed-width pattern")
sre_constants.error: look-behind requires fixed-width pattern

This, however, is okay:

>>> re.search('(?<=a{3})def', 'aaadef')
<_sre.SRE_Match object; span=(3, 6), match='def'>

Anything that matches a{3} will have a fixed length of three, so a{3} is valid in a lookbehind assertion.
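
If you build patterns at runtime, you can catch the fixed-width error instead of letting it crash your program. The sketch below is one possible approach, not the only one. It uses re.compile() and re.error, neither of which has been introduced in this tutorial, and then shows a common workaround: match the variable-length prefix normally and pull out the part you care about with a capturing group.

```python
import re

# A variable-length lookbehind is rejected when the regex is compiled.
try:
    re.compile(r"(?<=a+)def")
    compiled = True
except re.error as exc:
    compiled = False
    print("Rejected:", exc)

# Workaround: consume the prefix and capture only what you need.
m = re.search(r"a+(def)", "aaadef")
print(m.group(1), m.span(1))  # def (3, 6)
```

The trade-off is that the prefix is now consumed, so m.group() as a whole is 'aaadef' and only group 1 holds 'def'.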

(?<!<lookbehind_regex>)

Creates a negative lookbehind assertion.

(?<!<lookbehind_regex>) asserts that what precedes the regex parser’s current position must not match <lookbehind_regex>:

>>> print(re.search('(?<!foo)bar', 'foobar'))
None

>>> re.search('(?<!qux)bar', 'foobar')
<_sre.SRE_Match object; span=(3, 6), match='bar'>

As with the positive lookbehind assertion, <lookbehind_regex> must specify a match of fixed length.
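
To see both kinds of lookbehind side by side, here's a small sketch that separates dollar amounts from other numbers. The sample string and variable names are invented for this example, and re.findall() is used only to list all the matches:

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Price: $45, qty 3, total $135"

# Positive lookbehind: digits that come right after a dollar sign.
prices = re.findall(r"(?<=\$)\d+", text)
print(prices)  # ['45', '135']

# Negative lookbehind: digits NOT preceded by a dollar sign or another
# digit. Excluding digits stops the match from starting mid-number.
other_numbers = re.findall(r"(?<![$\d])\d+", text)
print(other_numbers)  # ['3']
```

Both lookbehinds check exactly one character, so they satisfy the fixed-length requirement.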

Miscellaneous Metacharacters

There are a couple more metacharacter sequences to cover. These are stray metacharacters that don’t obviously fall into any of the categories already discussed.

(?#...)

Specifies a comment.

The regex parser ignores anything contained in the sequence (?#...):

>>> re.search('bar(?#This is a comment) *baz', 'foo bar baz qux')
<_sre.SRE_Match object; span=(4, 11), match='bar baz'>

This allows you to specify documentation inside a regex in Python, which can be especially useful if the regex is particularly long.

Vertical bar, or pipe (|)

Specifies a set of alternatives on which to match.

An expression of the form <regex1>|<regex2>|...|<regexn> matches whenever any one of the specified <regexi> expressions matches:

>>> re.search('foo|bar|baz', 'bar')
<_sre.SRE_Match object; span=(0, 3), match='bar'>

>>> re.search('foo|bar|baz', 'baz')
<_sre.SRE_Match object; span=(0, 3), match='baz'>

>>> print(re.search('foo|bar|baz', 'quux'))
None

Here, foo|bar|baz will match any of 'foo', 'bar', or 'baz'. You can separate any number of regexes using |.

Alternation is non-greedy. The regex parser looks at the expressions separated by | in left-to-right order and returns the first match that it finds. The remaining expressions aren’t tested, even if one of them would produce a longer match:

 1>>> re.search('foo', 'foograult')
 2<_sre.SRE_Match object; span=(0, 3), match='foo'>
 3>>> re.search('grault', 'foograult')
 4<_sre.SRE_Match object; span=(3, 9), match='grault'>
 5
 6>>> re.search('foo|grault', 'foograult')
 7<_sre.SRE_Match object; span=(0, 3), match='foo'>

In this case, the pattern specified on line 6, 'foo|grault', would match on either 'foo' or 'grault'. The match returned is 'foo' because that appears first when scanning from left to right, even though 'grault' would be a longer match.
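
The effect of ordering is easiest to see when two alternatives can both match at the same position. In this short sketch (the variable names are just for illustration), swapping the alternatives changes the result:

```python
import re

# Both alternatives can match at position 0 of 'foobar'.
# The parser tries them left to right and stops at the first success.
short_first = re.search("foo|foobar", "foobar").group()
long_first = re.search("foobar|foo", "foobar").group()

print(short_first)  # foo
print(long_first)   # foobar
```

If you want the longest possible match, list the longer alternative first.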

You can combine alternation, grouping, and any other metacharacters to achieve whatever level of complexity you need. In the following example, (foo|bar|baz)+ means a sequence of one or more of the strings 'foo', 'bar', or 'baz':

>>> re.search('(foo|bar|baz)+', 'foofoofoo')
<_sre.SRE_Match object; span=(0, 9), match='foofoofoo'>
>>> re.search('(foo|bar|baz)+', 'bazbazbazbaz')
<_sre.SRE_Match object; span=(0, 12), match='bazbazbazbaz'>
>>> re.search('(foo|bar|baz)+', 'barbazfoo')
<_sre.SRE_Match object; span=(0, 9), match='barbazfoo'>

In the next example, ([0-9]+|[a-f]+) means a sequence of one or more decimal digit characters or a sequence of one or more of the characters 'a-f':

>>> re.search('([0-9]+|[a-f]+)', '456')
<_sre.SRE_Match object; span=(0, 3), match='456'>
>>> re.search('([0-9]+|[a-f]+)', 'ffda')
<_sre.SRE_Match object; span=(0, 4), match='ffda'>

With all the metacharacters that the re module supports, the sky is practically the limit.

That’s All, Folks!

That completes our tour of the regex metacharacters supported by Python’s re module. (Actually, it doesn’t quite—there are a couple more stragglers you’ll learn about below in the discussion on flags.)

It’s a lot to digest, but once you become familiar with regex syntax in Python, the complexity of pattern matching that you can perform is almost limitless. These tools come in very handy when you’re writing code to process textual data.

If you’re new to regexes and want more practice working with them, or if you’re developing an application that uses a regex and you want to test it interactively, then check out the Regular Expressions 101 website. It’s seriously cool!

Modified Regular Expression Matching With Flags

Most of the functions in the re module take an optional <flags> argument. This includes the function you’re now very familiar with, re.search().

re.search(<regex>, <string>, <flags>)

Scans a string for a regex match, applying the specified modifier <flags>.

Flags modify regex parsing behavior, allowing you to refine your pattern matching even further.

Supported Regular Expression Flags

The table below briefly summarizes the available flags. All flags except re.DEBUG have a short, single-letter name and also a longer, full-word name:

Short Name   Long Name       Effect
re.I         re.IGNORECASE   Makes matching of alphabetic characters case-insensitive
re.M         re.MULTILINE    Causes start-of-string and end-of-string anchors to match embedded newlines
re.S         re.DOTALL       Causes the dot metacharacter to match a newline
re.X         re.VERBOSE      Allows inclusion of whitespace and comments within a regular expression
----         re.DEBUG        Causes the regex parser to display debugging information to the console
re.A         re.ASCII        Specifies ASCII encoding for character classification
re.U         re.UNICODE      Specifies Unicode encoding for character classification
re.L         re.LOCALE       Specifies encoding for character classification based on the current locale

The following sections describe in more detail how these flags affect matching behavior.

re.I
re.IGNORECASE

Makes matching case insensitive.

When IGNORECASE is in effect, character matching is case insensitive:

 1>>> re.search('a+', 'aaaAAA')
 2<_sre.SRE_Match object; span=(0, 3), match='aaa'>
 3>>> re.search('A+', 'aaaAAA')
 4<_sre.SRE_Match object; span=(3, 6), match='AAA'>
 5
 6>>> re.search('a+', 'aaaAAA', re.I)
 7<_sre.SRE_Match object; span=(0, 6), match='aaaAAA'>
 8>>> re.search('A+', 'aaaAAA', re.IGNORECASE)
 9<_sre.SRE_Match object; span=(0, 6), match='aaaAAA'>

In the search on line 1, a+ matches only the first three characters of 'aaaAAA'. Similarly, on line 3, A+ matches only the last three characters. But in the subsequent searches, the parser ignores case, so both a+ and A+ match the entire string.

IGNORECASE affects alphabetic matching involving character classes as well:

>>> re.search('[a-z]+', 'aBcDeF')
<_sre.SRE_Match object; span=(0, 1), match='a'>
>>> re.search('[a-z]+', 'aBcDeF', re.I)
<_sre.SRE_Match object; span=(0, 6), match='aBcDeF'>

When case is significant, the longest portion of 'aBcDeF' that [a-z]+ matches is just the initial 'a'. Specifying re.I makes the search case insensitive, so [a-z]+ matches the entire string.
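
A typical use for IGNORECASE is scanning text where you can't predict capitalization. Here's a minimal sketch with a made-up log line. It uses re.findall(), which isn't covered in this tutorial, to gather every match:

```python
import re

# Hypothetical log text, used only for illustration.
log = "Error: disk full. ERROR: retrying. error: gave up."

# Case-sensitive search finds only the all-lowercase spelling.
strict = re.findall(r"error", log)

# With IGNORECASE, every capitalization matches.
relaxed = re.findall(r"error", log, re.IGNORECASE)

print(strict)   # ['error']
print(relaxed)  # ['Error', 'ERROR', 'error']
```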

re.M
re.MULTILINE

Causes start-of-string and end-of-string anchors to match at embedded newlines.

By default, the ^ (start-of-string) and $ (end-of-string) anchors match only at the beginning and end of the search string:

>>> s = 'foo\nbar\nbaz'

>>> re.search('^foo', s)
<_sre.SRE_Match object; span=(0, 3), match='foo'>
>>> print(re.search('^bar', s))
None
>>> print(re.search('^baz', s))
None

>>> print(re.search('foo$', s))
None
>>> print(re.search('bar$', s))
None
>>> re.search('baz$', s)
<_sre.SRE_Match object; span=(8, 11), match='baz'>

In this case, even though the search string 'foo\nbar\nbaz' contains embedded newline characters, only 'foo' matches when anchored at the beginning of the string, and only 'baz' matches when anchored at the end.

If a string has embedded newlines, however, you can think of it as consisting of multiple internal lines. In that case, if the MULTILINE flag is set, the ^ and $ anchor metacharacters match internal lines as well:

  • ^ matches at the beginning of the string or at the beginning of any line within the string (that is, immediately following a newline).
  • $ matches at the end of the string or at the end of any line within the string (immediately preceding a newline).

The following are the same searches as shown above:

>>> s = 'foo\nbar\nbaz'
>>> print(s)
foo
bar
baz

>>> re.search('^foo', s, re.MULTILINE)
<_sre.SRE_Match object; span=(0, 3), match='foo'>
>>> re.search('^bar', s, re.MULTILINE)
<_sre.SRE_Match object; span=(4, 7), match='bar'>
>>> re.search('^baz', s, re.MULTILINE)
<_sre.SRE_Match object; span=(8, 11), match='baz'>

>>> re.search('foo$', s, re.M)
<_sre.SRE_Match object; span=(0, 3), match='foo'>
>>> re.search('bar$', s, re.M)
<_sre.SRE_Match object; span=(4, 7), match='bar'>
>>> re.search('baz$', s, re.M)
<_sre.SRE_Match object; span=(8, 11), match='baz'>

In the string 'foo\nbar\nbaz', all three of 'foo', 'bar', and 'baz' occur at either the start or end of the string or at the start or end of a line within the string. With the MULTILINE flag set, all three match when anchored with either ^ or $.
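
The difference becomes obvious when you collect every match rather than just the first. This sketch assumes re.findall(), which returns all non-overlapping matches as a list and hasn't been introduced in this tutorial:

```python
import re

s = "foo\nbar\nbaz"

# Without MULTILINE, ^ and $ anchor only to the whole string.
print(re.findall(r"^\w+", s))  # ['foo']
print(re.findall(r"\w+$", s))  # ['baz']

# With MULTILINE, they anchor to every line.
line_starts = re.findall(r"^\w+", s, re.MULTILINE)
line_ends = re.findall(r"\w+$", s, re.MULTILINE)
print(line_starts)  # ['foo', 'bar', 'baz']
print(line_ends)    # ['foo', 'bar', 'baz']
```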

re.S
re.DOTALL

Causes the dot (.) metacharacter to match a newline.

Remember that by default, the dot metacharacter matches any character except the newline character. The DOTALL flag lifts this restriction:

 1>>> print(re.search('foo.bar', 'foo\nbar'))
 2None
 3>>> re.search('foo.bar', 'foo\nbar', re.DOTALL)
 4<_sre.SRE_Match object; span=(0, 7), match='foo\nbar'>
 5>>> re.search('foo.bar', 'foo\nbar', re.S)
 6<_sre.SRE_Match object; span=(0, 7), match='foo\nbar'>

In this example, on line 1 the dot metacharacter doesn’t match the newline in 'foo\nbar'. On lines 3 and 5, DOTALL is in effect, so the dot does match the newline. Note that the short name of the DOTALL flag is re.S, not re.D as you might expect.
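
DOTALL tends to matter most when the text you want spans several lines. The sketch below uses a made-up snippet of markup to show a capture that fails without the flag and succeeds with it:

```python
import re

# Hypothetical two-line snippet, used only for illustration.
html = "<p>first line\nsecond line</p>"

# By default, .* stops at the newline, so the closing tag is never reached.
without_flag = re.search(r"<p>(.*)</p>", html)
print(without_flag)  # None

# With DOTALL, .* runs across the newline.
with_flag = re.search(r"<p>(.*)</p>", html, re.DOTALL)
print(repr(with_flag.group(1)))  # 'first line\nsecond line'
```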

re.X
re.VERBOSE

Allows inclusion of whitespace and comments within a regex.

The VERBOSE flag specifies a few special behaviors:

  • The regex parser ignores all whitespace unless it’s within a character class or escaped with a backslash.

  • If the regex contains a # character that isn’t contained within a character class or escaped with a backslash, then the parser ignores it and all characters to the right of it.

What’s the use of this? It allows you to format a regex in Python so that it’s more readable and self-documenting.

Here’s an example showing how you might put this to use. Suppose you want to parse phone numbers that have the following format:

  • Optional three-digit area code, in parentheses
  • Optional whitespace
  • Three-digit prefix
  • Separator (either '-' or '.')
  • Four-digit line number

The following regex does the trick:

>>> regex = r'^(\(\d{3}\))?\s*\d{3}[-.]\d{4}$'

>>> re.search(regex, '414.9229')
<_sre.SRE_Match object; span=(0, 8), match='414.9229'>
>>> re.search(regex, '414-9229')
<_sre.SRE_Match object; span=(0, 8), match='414-9229'>
>>> re.search(regex, '(712)414-9229')
<_sre.SRE_Match object; span=(0, 13), match='(712)414-9229'>
>>> re.search(regex, '(712) 414-9229')
<_sre.SRE_Match object; span=(0, 14), match='(712) 414-9229'>

But r'^(\(\d{3}\))?\s*\d{3}[-.]\d{4}$' is an eyeful, isn’t it? Using the VERBOSE flag, you can write the same regex in Python like this instead:

>>> regex = r'''^               # Start of string
...             (\(\d{3}\))?    # Optional area code
...             \s*             # Optional whitespace
...             \d{3}           # Three-digit prefix
...             [-.]            # Separator character
...             \d{4}           # Four-digit line number
...             $               # Anchor at end of string
...             '''

>>> re.search(regex, '414.9229', re.VERBOSE)
<_sre.SRE_Match object; span=(0, 8), match='414.9229'>
>>> re.search(regex, '414-9229', re.VERBOSE)
<_sre.SRE_Match object; span=(0, 8), match='414-9229'>
>>> re.search(regex, '(712)414-9229', re.X)
<_sre.SRE_Match object; span=(0, 13), match='(712)414-9229'>
>>> re.search(regex, '(712) 414-9229', re.X)
<_sre.SRE_Match object; span=(0, 14), match='(712) 414-9229'>

The re.search() calls are the same as those shown above, so you can see that this regex works the same as the one specified earlier. But it’s less difficult to understand at first glance.

Note that triple quoting makes it particularly convenient to include embedded newlines, which qualify as ignored whitespace in VERBOSE mode.

When using the VERBOSE flag, be mindful of whitespace that you do intend to be significant. Consider these examples:

 1>>> re.search('foo bar', 'foo bar')
 2<_sre.SRE_Match object; span=(0, 7), match='foo bar'>
 3
 4>>> print(re.search('foo bar', 'foo bar', re.VERBOSE))
 5None
 6
 7>>> re.search(r'foo\ bar', 'foo bar', re.VERBOSE)
 8<_sre.SRE_Match object; span=(0, 7), match='foo bar'>
 9>>> re.search('foo[ ]bar', 'foo bar', re.VERBOSE)
10<_sre.SRE_Match object; span=(0, 7), match='foo bar'>

After all you’ve seen to this point, you may be wondering why on line 4 the regex foo bar doesn’t match the string 'foo bar'. It doesn’t because the VERBOSE flag causes the parser to ignore the space character.

To make this match as expected, escape the space character with a backslash or include it in a character class, as shown on lines 7 and 9.

As with the DOTALL flag, note that the VERBOSE flag has a non-intuitive short name: re.X, not re.V.
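
VERBOSE pairs nicely with named groups, because each piece of the pattern can carry its own label and comment. As a sketch, here's how you might parse a date written as DD.MM.YYYY and rebuild it as YYYY-MM-DD. The pattern, group names, and sample date are assumptions made up for this example:

```python
import re

# Each component sits on its own line with an explanatory comment.
date_regex = r"""
    ^
    (?P<day>\d{2})    # Two-digit day
    \.                # Literal dot separator
    (?P<month>\d{2})  # Two-digit month
    \.                # Literal dot separator
    (?P<year>\d{4})   # Four-digit year
    $
"""

m = re.search(date_regex, "31.12.2023", re.VERBOSE)
iso_date = f"{m.group('year')}-{m.group('month')}-{m.group('day')}"
print(iso_date)  # 2023-12-31
```

Note that this only checks the shape of the input. It would happily accept '99.99.2023', so real date validation needs more than a regex.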

re.DEBUG

Displays debugging information.

The DEBUG flag causes the regex parser in Python to display debugging information about the parsing process to the console:

>>> re.search('foo.bar', 'fooxbar', re.DEBUG)
LITERAL 102
LITERAL 111
LITERAL 111
ANY None
LITERAL 98
LITERAL 97
LITERAL 114
<_sre.SRE_Match object; span=(0, 7), match='fooxbar'>

When the parser displays LITERAL nnn in the debugging output, it’s showing the ASCII code of a literal character in the regex. In this case, the literal characters are 'f', 'o', 'o' and 'b', 'a', 'r'.

Here’s a more complicated example. This is the phone number regex shown in the discussion on the VERBOSE flag earlier:

>>> regex = r'^(\(\d{3}\))?\s*\d{3}[-.]\d{4}$'

>>> re.search(regex, '414.9229', re.DEBUG)
AT AT_BEGINNING
MAX_REPEAT 0 1
  SUBPATTERN 1 0 0
    LITERAL 40
    MAX_REPEAT 3 3
      IN
        CATEGORY CATEGORY_DIGIT
    LITERAL 41
MAX_REPEAT 0 MAXREPEAT
  IN
    CATEGORY CATEGORY_SPACE
MAX_REPEAT 3 3
  IN
    CATEGORY CATEGORY_DIGIT
IN
  LITERAL 45
  LITERAL 46
MAX_REPEAT 4 4
  IN
    CATEGORY CATEGORY_DIGIT
AT AT_END
<_sre.SRE_Match object; span=(0, 8), match='414.9229'>

This looks like a lot of esoteric information that you’d never need, but it can be useful. See the Deep Dive below for a practical application.

Deep Dive: Debugging Regular Expression Parsing

As you know from above, the metacharacter sequence {m,n} indicates a specific number of repetitions. It matches anywhere from m to n repetitions of what precedes it:

>>> re.search('x[123]{2,4}y', 'x222y')
<_sre.SRE_Match object; span=(0, 5), match='x222y'>

You can verify this with the DEBUG flag:

>>> re.search('x[123]{2,4}y', 'x222y', re.DEBUG)
LITERAL 120
MAX_REPEAT 2 4
  IN
    LITERAL 49
    LITERAL 50
    LITERAL 51
LITERAL 121
<_sre.SRE_Match object; span=(0, 5), match='x222y'>

MAX_REPEAT 2 4 confirms that the regex parser recognizes the metacharacter sequence {2,4} and interprets it as a range quantifier.

But, as noted previously, if a pair of curly braces in a regex in Python contains anything other than a valid number or numeric range, then it loses its special meaning.

You can verify this also:

>>> re.search('x[123]{foo}y', 'x222y', re.DEBUG)
LITERAL 120
IN
  LITERAL 49
  LITERAL 50
  LITERAL 51
LITERAL 123
LITERAL 102
LITERAL 111
LITERAL 111
LITERAL 125
LITERAL 121

You can see that there’s no MAX_REPEAT token in the debug output. The LITERAL tokens indicate that the parser treats {foo} literally and not as a quantifier metacharacter sequence. 123, 102, 111, 111, and 125 are the ASCII codes for the characters in the literal string '{foo}'.

Information displayed by the DEBUG flag can help you troubleshoot by showing you how the parser is interpreting your regex.
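
If you'd rather inspect the debug output in code than read it on the console, one option is to capture it. The sketch below is an assumption-laden illustration: debug_dump() is a hypothetical helper, it relies on the DEBUG output being written to standard output, and the exact text of that output can vary between Python versions. It also uses re.compile(), which this tutorial hasn't covered.

```python
import contextlib
import io
import re

def debug_dump(pattern):
    """Return the text that re.DEBUG prints while compiling pattern.

    Hypothetical helper for illustration only.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, re.DEBUG)
    return buffer.getvalue()

# A valid range quantifier shows up as a MAX_REPEAT token...
range_dump = debug_dump(r"x[123]{2,4}y")
print("MAX_REPEAT" in range_dump)

# ...while {foo} is parsed as a run of literal characters.
literal_dump = debug_dump(r"x[123]{foo}y")
print("MAX_REPEAT" in literal_dump)
```

A check like this could go into a quick test that confirms a quantifier in a long regex is being interpreted the way you intended.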

Curiously, the re module doesn’t define a single-letter version of the DEBUG flag. You could define your own if you wanted to:

>>> import re
>>> re.D
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 're' has no attribute 'D'

>>> re.D = re.DEBUG
>>> re.search('foo', 'foo', re.D)
LITERAL 102
LITERAL 111
LITERAL 111
<_sre.SRE_Match object; span=(0, 3), match='foo'>

But this might be more confusing than helpful, as readers of your code might misconstrue it as an abbreviation for the DOTALL flag. If you did make this assignment, it would be a good idea to document it thoroughly.

re.A
re.ASCII
re.U
re.UNICODE
re.L
re.LOCALE

Specify the character encoding used for parsing of special regex character classes.

Several of the regex metacharacter sequences (\w, \W, \b, \B, \d, \D, \s, and \S) require you to assign characters to certain classes like word, digit, or whitespace. The flags in this group determine the encoding scheme used to assign characters to these classes. The possible encodings are ASCII, Unicode, or according to the current locale.

You had a brief introduction to character encoding and Unicode in the tutorial on Strings and Character Data in Python, under the discussion of the ord() built-in function. For more in-depth information, check out these resources:

  • Unicode & Character Encodings in Python: A Painless Guide
  • Python’s Unicode Support

Why is character encoding so important in the context of regexes in Python? Here’s a quick example.

You learned earlier that \d specifies a single digit character. The description of the \d metacharacter sequence states that it’s equivalent to the character class [0-9]. That happens to be true for English and Western European languages, but for most of the world’s languages, the characters '0' through '9' don’t represent all or even any of the digits.

For example, here’s a string that consists of three Devanagari digit characters:

>>> s = '\u0967\u096a\u096c'
>>> s
'१४६'

For the regex parser to properly account for the Devanagari script, the digit metacharacter sequence \d must match each of these characters as well.

The Unicode Consortium created Unicode to handle this problem. Unicode is a character-encoding standard designed to represent all the world’s writing systems. All strings in Python 3, including regexes, are Unicode by default.

So then, back to the flags listed above. These flags help to determine whether a character falls into a given class by specifying whether the encoding used is ASCII, Unicode, or the current locale:

  • re.U and re.UNICODE specify Unicode encoding. Unicode is the default, so these flags are superfluous. They’re mainly supported for backward compatibility.
  • re.A and re.ASCII force a determination based on ASCII encoding. If you happen to be operating in English, then this is happening anyway, so the flag won’t affect whether or not a match is found.
  • re.L and re.LOCALE make the determination based on the current locale. Locale is an outdated concept and isn’t considered reliable. Except in rare circumstances, you’re not likely to need it.

Using the default Unicode encoding, the regex parser should be able to handle any language you throw at it. In the following example, it correctly recognizes each of the characters in the string '१४६' as a digit:

>>> s = '\u0967\u096a\u096c'
>>> s
'१४६'
>>> re.search(r'\d+', s)
<_sre.SRE_Match object; span=(0, 3), match='१४६'>

Here’s another example that illustrates how character encoding can affect a regex match in Python. Consider this string:

>>> s = 'sch\u00f6n'
>>> s
'schön'

'schön' (the German word for pretty or nice) contains the 'ö' character, which has the 16-bit hexadecimal Unicode value 00f6. This character isn’t representable in traditional 7-bit ASCII.

If you’re working in German, then you should reasonably expect the regex parser to consider all of the characters in 'schön' to be word characters. But take a look at what happens if you search s for word characters using the \w character class and force an ASCII encoding:

>>> re.search(r'\w+', s, re.ASCII)
<_sre.SRE_Match object; span=(0, 3), match='sch'>

When you restrict the encoding to ASCII, the regex parser recognizes only the first three characters as word characters. The match stops at 'ö'.

On the other hand, if you specify re.UNICODE or allow the encoding to default to Unicode, then all the characters in 'schön' qualify as word characters:

>>> re.search(r'\w+', s, re.UNICODE)
<_sre.SRE_Match object; span=(0, 5), match='schön'>
>>> re.search(r'\w+', s)
<_sre.SRE_Match object; span=(0, 5), match='schön'>

The ASCII and LOCALE flags are available in case you need them for special circumstances. But in general, the best strategy is to use the default Unicode encoding. This should handle any world language correctly.
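
To compare the two encodings directly, you can run the same pattern with and without re.ASCII. The strings and variable names in this sketch are chosen purely for illustration, and re.findall() is used to show every match:

```python
import re

# 'schön café', written with escapes so the source stays pure ASCII.
text = "sch\u00f6n caf\u00e9"

# Default (Unicode): accented letters count as word characters.
unicode_words = re.findall(r"\w+", text)
print(unicode_words)  # ['schön', 'café']

# ASCII: the accented letters break the words apart.
ascii_words = re.findall(r"\w+", text, re.ASCII)
print(ascii_words)  # ['sch', 'n', 'caf']

# The same applies to digits: Devanagari digits match \d only in Unicode mode.
devanagari = "\u0967\u096a\u096c"
print(re.search(r"\d+", devanagari) is not None)  # True
print(re.search(r"\d+", devanagari, re.ASCII))    # None
```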

Combining <flags> Arguments in a Function Call

Flag values are defined so that you can combine them using the bitwise OR (|) operator. This allows you to specify several flags in a single function call:

>>> re.search('^bar', 'FOO\nBAR\nBAZ', re.I|re.M)
<_sre.SRE_Match object; span=(4, 7), match='BAR'>

This re.search() call uses bitwise OR to specify both the IGNORECASE and MULTILINE flags at once.
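
To convince yourself that both flags are really needed, try each one on its own. This sketch uses a made-up three-line string and re.findall() to list the matches for each combination:

```python
import re

# Hypothetical three-line string, used only for illustration.
text = "FOO\nBAR\nbar"

only_ignorecase = re.findall("^bar$", text, re.I)
only_multiline = re.findall("^bar$", text, re.M)
both = re.findall("^bar$", text, re.I | re.M)

print(only_ignorecase)  # []
print(only_multiline)   # ['bar']
print(both)             # ['BAR', 'bar']
```

With IGNORECASE alone, the anchors still apply to the whole string, so nothing matches. With MULTILINE alone, only the lowercase line qualifies.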

Setting and Clearing Flags Within a Regular Expression

In addition to being able to pass a <flags> argument to most re module function calls, you can also modify flag values within a regex in Python. There are two regex metacharacter sequences that provide this capability.

(?<flags>)

Sets flag value(s) for the duration of a regex.

Within a regex, the metacharacter sequence (?<flags>) sets the specified flags for the entire expression.

The value of <flags> is one or more letters from the set a, i, L, m, s, u, and x. Here’s how they correspond to the re module flags:

Letter   Flags
a        re.A, re.ASCII
i        re.I, re.IGNORECASE
L        re.L, re.LOCALE
m        re.M, re.MULTILINE
s        re.S, re.DOTALL
u        re.U, re.UNICODE
x        re.X, re.VERBOSE

The (?<flags>) metacharacter sequence as a whole matches the empty string. It always matches successfully and doesn’t consume any of the search string.

The following examples are equivalent ways of setting the IGNORECASE and MULTILINE flags:

>>> re.search('^bar', 'FOO\nBAR\nBAZ\n', re.I|re.M)
<_sre.SRE_Match object; span=(4, 7), match='BAR'>

>>> re.search('(?im)^bar', 'FOO\nBAR\nBAZ\n')
<_sre.SRE_Match object; span=(4, 7), match='BAR'>

Note that a (?<flags>) metacharacter sequence sets the given flag(s) for the entire regex no matter where you place it in the expression:

>>> re.search('foo.bar(?s).baz', 'foo\nbar\nbaz')
<_sre.SRE_Match object; span=(0, 11), match='foo\nbar\nbaz'>

>>> re.search('foo.bar.baz(?s)', 'foo\nbar\nbaz')
<_sre.SRE_Match object; span=(0, 11), match='foo\nbar\nbaz'>

In the above examples, both dot metacharacters match newlines because the DOTALL flag is in effect. This is true even when (?s) appears in the middle or at the end of the expression.

As of Python 3.7, it’s deprecated to specify (?<flags>) anywhere in a regex other than at the beginning, and Python 3.11 and later reject it with an error. Here’s what happens in Python 3.8:

>>> import sys
>>> sys.version
'3.8.0 (default, Oct 14 2019, 21:29:03) \n[GCC 7.4.0]'

>>> re.search('foo.bar.baz(?s)', 'foo\nbar\nbaz')
<stdin>:1: DeprecationWarning: Flags not at the start
    of the expression 'foo.bar.baz(?s)'
<re.Match object; span=(0, 11), match='foo\nbar\nbaz'>

It still produces the appropriate match, but you’ll get a warning message.
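
The safe habit is to put inline flags at the very start of the pattern. As a quick sketch (variable names are illustrative), the inline form and the <flags> argument give the same result:

```python
import re

s = "foo\nbar\nbaz"

# Inline flag at the start of the expression.
inline = re.search(r"(?s)foo.bar.baz", s)

# Equivalent call using the flags argument.
explicit = re.search(r"foo.bar.baz", s, re.DOTALL)

print(inline.span(), explicit.span())  # (0, 11) (0, 11)
```

The inline form is useful when you can supply only the pattern string, for example when it comes from a configuration file.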

(?<set_flags>-<remove_flags>:<regex>)

Sets or removes flag value(s) for the duration of a group.

(?<set_flags>-<remove_flags>:<regex>) defines a non-capturing group that matches against <regex>. For the <regex> contained in the group, the regex parser sets any flags specified in <set_flags> and clears any flags specified in <remove_flags>.

Values for <set_flags> and <remove_flags> are most commonly i, m, s or x.

In the following example, the IGNORECASE flag is set for the specified group:

>>> re.search('(?i:foo)bar', 'FOObar')
<re.Match object; span=(0, 6), match='FOObar'>

This produces a match because (?i:foo) dictates that the match against 'FOO' is case insensitive.

Now contrast that with this example:

>>> print(re.search('(?i:foo)bar', 'FOOBAR'))
None

As in the previous example, the match against 'FOO' would succeed because it’s case insensitive. But once outside the group, IGNORECASE is no longer in effect, so the match against 'BAR' is case sensitive and fails.

Here’s an example that demonstrates turning a flag off for a group:

>>> print(re.search('(?-i:foo)bar', 'FOOBAR', re.IGNORECASE))
None

Again, there’s no match. Although re.IGNORECASE enables case-insensitive matching for the entire call, the metacharacter sequence (?-i:foo) turns off IGNORECASE for the duration of that group, so the match against 'FOO' fails.
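
Scoped flags let one part of a pattern be forgiving while the rest stays strict. In the sketch below, the patterns and sample strings are invented for illustration. The first pattern accepts any capitalization of a keyword but still requires the following word to start with an uppercase letter. The second does the reverse, keeping one prefix case sensitive inside an otherwise case-insensitive search:

```python
import re

# Keyword is case insensitive; the word after it must be capitalized.
pattern = r"(?i:warning): [A-Z]\w+"
print(re.search(pattern, "WARNING: Disk full"))   # matches 'WARNING: Disk'
print(re.search(pattern, "warning: disk full"))   # None

# Whole search ignores case, except for the 'ID' prefix.
strict_prefix = r"(?-i:ID)-\w+"
print(re.search(strict_prefix, "ID-abc", re.IGNORECASE))  # matches 'ID-abc'
print(re.search(strict_prefix, "id-abc", re.IGNORECASE))  # None
```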

As of Python 3.7, you can specify u, a, or L as <set_flags> to override the default encoding for the specified group:

>>> s = 'sch\u00f6n'
>>> s
'schön'

>>> # Requires Python 3.7 or later
>>> re.search(r'(?a:\w+)', s)
<re.Match object; span=(0, 3), match='sch'>
>>> re.search(r'(?u:\w+)', s)
<re.Match object; span=(0, 5), match='schön'>

You can only set encoding this way, though. You can’t remove it:

>>> re.search(r'(?-a:\w+)', s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/re.py", line 199, in search
    return _compile(pattern, flags).search(string)
  File "/usr/lib/python3.8/re.py", line 302, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.8/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.8/sre_parse.py", line 948, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.8/sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.8/sre_parse.py", line 805, in _parse
    flags = _parse_flags(source, state, char)
  File "/usr/lib/python3.8/sre_parse.py", line 904, in _parse_flags
    raise source.error(msg)
re.error: bad inline flags: cannot turn off flags 'a', 'u' and 'L' at
position 4

u, a, and L are mutually exclusive. Only one of them may appear per group.

Conclusion

This concludes your introduction to regular expression matching and Python’s re module. Congratulations! You’ve mastered a tremendous amount of material.

You now know how to:

  • Use re.search() to perform regex matching in Python
  • Create complex pattern matching searches with regex metacharacters
  • Tweak regex parsing behavior with flags

But you’ve still seen only one function in the module: re.search()! The re module has many more useful functions and objects to add to your pattern-matching toolkit. The next tutorial in the series will introduce you to what else the re module has to offer.
