I’m using https://regexr.com/ to test a regular expression. I’m trying to validate a name input on a form, so:
- Only letter characters
- No non-letter characters except backspace and space
In my validation function, I can have:
if (/\d/.test(charStr)) {
  return false;
}
/\d/ will match numbers. So far, so good.
Changing it to:
if (/\d|\W/.test(charStr)) {
  return false;
}
...will match numbers (\d) or (|) non-word characters (\W), which is good, except that it’s also matching whitespace characters like space and backspace.
So, I’m trying to somehow use \W, but with the exception of whitespace characters.
I tried:
if (/\d|\W[^\s]/.test(charStr)) {
  return false;
}
So, match numbers (\d), or non-word characters (\W) excepting whitespace characters ([^\s]), but my syntax appears to be wrong here.
What am I doing wrong? Thanks.
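One possible answer, offered as a sketch rather than a definitive fix: \W[^\s] does not mean "\W except whitespace". It matches two consecutive characters, a non-word character followed by a non-whitespace character. A single negated class such as [^\w\s] matches one character that is neither a word character nor whitespace. The function names below are hypothetical, and two details are assumptions drawn from the stated requirements: the underscore is rejected explicitly (because \w includes it), and backspace (\x08) is added to the allowed set (because \s does not include it).

```javascript
// Hypothetical helper names; charStr mirrors the variable in the question.
function hasInvalidNameChar(charStr) {
  // \d          -> any digit
  // _           -> underscore is a word character, so reject it explicitly
  // [^\w\s\x08] -> one character that is not a word character,
  //                not whitespace, and not backspace (U+0008)
  return /\d|_|[^\w\s\x08]/.test(charStr);
}

// Alternative sketch: accept a whole value only if it is ASCII letters and spaces.
function isValidName(name) {
  return /^[A-Za-z ]+$/.test(name);
}

console.log(hasInvalidNameChar("a"));  // false
console.log(hasInvalidNameChar(" "));  // false
console.log(hasInvalidNameChar("7"));  // true
console.log(hasInvalidNameChar("!"));  // true
console.log(isValidName("Mary Ann")); // true
```

The whitelist form is usually easier to reason about, but it rejects accented letters, so it may be too strict for real names.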
Regex Reference
The regular expression (regex) syntax and semantics implemented in BareGrep are common to PHP, Perl and Java.
Characters and Escapes
Logical Operators
Character Classes
Quantifiers
Assertions
There is also an example.
Characters and Escapes
. |
Any character: «.» matches any character. For example: ... would match any three-character sequence. To specify a literal «.», escape it with «\». For example: www\.baremetalsoft\.com would match «www.baremetalsoft.com». |
x |
The literal character x: all characters which do not have a special meaning in a regex match themselves. For example: fred would match the string «fred». |
\a |
Alert character (bell): BEL, ASCII code 07. |
\cx |
Control-x character: for example, \cM would be equivalent to the key sequence Control-M or the character with ASCII code 0D hexadecimal. |
\d |
A digit: a digit from 0 to 9. This is equivalent to the regex: [0-9] |
\D |
Any non-digit: any character which is not a digit. This is equivalent to the regex: [^0-9] |
\e |
Escape character: ESC, ASCII code 27 (1B hexadecimal). |
\f |
Form feed character: FF, ASCII code 12 (0C hexadecimal). |
\r |
Carriage return character: CR, ASCII code 13 (0D hexadecimal). Carriage return characters are automatically stripped. Note: to match the start-of-line use the «^» assertion. To match the end-of-line use the «$» assertion. |
\s |
Any whitespace character: the whitespace characters include space, tab, new-line, carriage-return and form-feed. This is equivalent to the regex: [ \t\r\n\f] |
\S |
Any non-whitespace character: this is equivalent to the regex: [^ \t\r\n\f] |
\t |
Tab character: a horizontal tab character. HT, ASCII code 09. |
\nnn |
The character with octal value nnn |
\w |
Any word character: a character in the set «A» to «Z», «a» to «z», «0» to «9» or «_». This is equivalent to the regex: [0-9_A-Za-z] |
\W |
Any non-word character: a character in the set [^0-9_A-Za-z] |
\xhh |
The character with hexadecimal value hh |
Logical Operators
XY |
Catenation: regex X, then regex Y. For example: abc would match the string «abc». |
X|Y |
Alternation: X or Y. For example: ERROR|FATAL would match «ERROR» or «FATAL». |
(?:X) |
Group: grouping and operator precedence override. For example: (?:A|B)(?:C|D) would match «AC», «BC», «AD» or «BD». Whereas: A|BC|D would match «A», «BC», or «D». |
(X) |
Capturing group: grouping and capturing of the regex X. Capturing groups also imply operator precedence override. For example: (A|B)(C|D) would match «AC», «BC», «AD» or «BD». Whereas: A|BC|D would match «A», «BC», or «D». Note: using capturing involves a significant performance overhead (the search runs slower). |
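The precedence difference between grouped and ungrouped alternation can be checked with a short sketch. It uses JavaScript's regex engine rather than BareGrep, and the ^ and $ anchors are an addition so that each test covers the whole input string.

```javascript
// Grouped: (A or B) followed by (C or D).
const grouped = /^(?:A|B)(?:C|D)$/;
// Ungrouped: alternation has the lowest precedence, so this reads A, or BC, or D.
const ungrouped = /^(?:A|BC|D)$/;

const inputs = ["AC", "BC", "AD", "BD", "A", "D"];
const groupedMatches = inputs.filter((s) => grouped.test(s));
const ungroupedMatches = inputs.filter((s) => ungrouped.test(s));

console.log(groupedMatches);   // [ 'AC', 'BC', 'AD', 'BD' ]
console.log(ungroupedMatches); // [ 'BC', 'A', 'D' ]

// A capturing group additionally remembers what each group matched.
const captured = "BD".match(/^(A|B)(C|D)$/);
console.log(captured[1], captured[2]); // B D
```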
Character Classes
[abc] |
Character set: a single a, b or c character. For example: [0123456789ABCDEFabcdef] would match any hexadecimal digit character. |
[^abc] |
Inverse character set: any character other than a, b or c. For example: [^0123456789ABCDEFabcdef] would match any character which is not a hexadecimal digit character. |
[a-b] |
Character set range: a character in the range a to b. For example: [0-9_A-Za-z] would match any word character (in the set «A» to «Z», «a» to «z», «0» to «9» or «_»). |
Quantifiers
X* |
Set closure: the regex X zero or more times. For example: .* would match anything (or nothing, because it may match zero times). For example: A\s*=\s*B would match «A=B», «A = B» or even «A= B» (ignoring whitespace around the «=»). |
X+ |
Kleene closure: the regex X one or more times. For example: \d+ would match a sequence of digits that is at least one character in length. |
X? |
Zero or one: the regex X zero or one times. For example: \d? would match zero or one digits only. |
X{n} |
Exactly n times: the regex X exactly n times. For example: \d{4} would match exactly 4 digits. |
X{n,} |
At least n times: the regex X at least n times. For example: \d{4,} would match 4 or more digits. |
X{n,m} |
Between n and m times: the regex X at least n times, but no more than m times. For example: \d{4,6} would match 4, 5 or 6 digits. |
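The quantifier examples above can be exercised with a small sketch. It runs on JavaScript's regex engine, not BareGrep, and fullMatch is a hypothetical helper that wraps a pattern in ^ and $ so that only whole-string matches count.

```javascript
// Hypothetical helper: true only when the pattern matches the entire input.
function fullMatch(source, input) {
  return new RegExp("^(?:" + source + ")$").test(input);
}

console.log(fullMatch("A\\s*=\\s*B", "A=B"));    // true
console.log(fullMatch("A\\s*=\\s*B", "A =  B")); // true
console.log(fullMatch("\\d+", ""));              // false: + needs at least one digit
console.log(fullMatch("\\d?", ""));              // true: ? allows zero digits
console.log(fullMatch("\\d{4}", "2005"));        // true
console.log(fullMatch("\\d{4,}", "123"));        // false: fewer than four digits
console.log(fullMatch("\\d{4,6}", "1234567"));   // false: seven digits is too many
```

Because the patterns are passed as strings, each backslash is doubled; in a regex literal such as /\d{4}/ a single backslash is enough.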
Assertions
^ |
Start-of-line: the start of a line. For example: ^Status would match «Status» only at the start of a line. |
$ |
End-of-line: the end of a line. For example: Status$ would match «Status» only at the end of a line. |
Example
Question
Given the following lines:
#Fields: date time c-ip cs-username s-computername s-ip cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken s-port cs(User-Agent)
2005-01-04 00:31:32 10.67.65.57 — VENUS 10.7.40.91 GET /xpedio/ — 302 288 241 0 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:32 10.67.65.57 — VENUS 10.7.40.91 GET /xpedio/login.html — 200 1337 242 125 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:32 10.67.65.57 — VENUS 10.7.40.91 GET /xpedio/images/FAHC/sm_idoclogo2.gif — 200 1898 310 16 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:32 10.67.65.57 — VENUS 10.7.40.91 GET /intradoc-cgi/idc_cgi_isapi.dll IdcService=LOGIN&Action=GetTemplatePage&Page=HOME_PAGE&Auth=Intranet 200 15431 546 141 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:32 10.67.65.57 FAHCKioskUser VENUS 10.7.40.91 GET /intradoc-cgi/idc_cgi_isapi.dll IdcService=LOGIN&Action=GetTemplatePage&Page=HOME_PAGE&Auth=Intranet 200 23943 768 390 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:33 10.67.65.57 FAHCKioskUser VENUS 10.7.40.91 GET /xpedio/images/xpedio/enthome2.gif — 200 650 494 32 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:33 10.67.65.57 — VENUS 10.7.40.91 GET /xpedio/images/xpedio/enthome.gif — 200 662 493 62 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:33 10.67.65.57 FAHCKioskUser VENUS 10.7.40.91 GET /xpedio/images/xpedio/home.gif — 200 523 490 62 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:33 10.67.65.57 — VENUS 10.7.40.91 GET /xpedio/images/xpedio/home2.gif — 200 525 491 16 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:33 10.67.65.57 FAHCKioskUser VENUS 10.7.40.91 GET /xpedio/images/xpedio/library2.gif — 200 698 494 47 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:33 10.67.65.57 — VENUS 10.7.40.91 GET /xpedio/images/xpedio/library.gif — 200 701 493 31 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:33 10.67.65.57 FAHCKioskUser VENUS 10.7.40.91 GET /xpedio/images/xpedio/search2.pdf — 200 570 493 31 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:33 10.67.65.57 — VENUS 10.7.40.91 GET /xpedio/images/xpedio/help2.gif — 200 553 491 31 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:33 10.67.65.57 FAHCKioskUser VENUS 10.7.40.91 GET /xpedio/images/xpedio/search.gif — 200 574 492 16 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
I need to generate a report with the FAHC username and the .pdf file they accessed. I can do either one individually, but I’m not sure how to do both.
Here’s the regex that works for the username:
(FAHC\S+)
and the regex that works for the .pdf file:
(\S+\.pdf)
but how do I format the «find» field for both?
Answer
In this case, as every line has the same format, I would first try to
construct a regex which matches the entire line.
So I’d start with something like:
\S+ \S+ \S+ \S+ \S+ \S+ \S+ \S+ \S+ \S+ \S+ \S+ \S+ \S+ \S+
Then I’d pick out the columns I’m interested in:
\S+ \S+ \S+ (\S+) \S+ \S+ \S+ (\S+) \S+ \S+ \S+ \S+ \S+ \S+ \S+
You can then refine the sub-regex for the two columns you’re interested in:
\S+ \S+ \S+ FAHC(\S+) \S+ \S+ \S+ (\S+\.pdf) \S+ \S+ \S+ \S+ \S+ \S+ \S+
There are various other ways this could also be done, but this is the first way that sprang to mind.
Another way would be:
FAHC(\S+).* (\S+\.pdf)
This uses «.*» in the middle which means «match anything».
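As a rough check of the shorter pattern, here is a sketch that applies it to one of the log lines above. It uses JavaScript's regex engine rather than BareGrep; the sample line has its User-Agent field shortened and its placeholder field written as a plain hyphen, and extractReportRow is a hypothetical helper name.

```javascript
// Sample line adapted from the log above (User-Agent shortened for readability).
const line =
  "2005-01-04 00:31:33 10.67.65.57 FAHCKioskUser VENUS 10.7.40.91 GET " +
  "/xpedio/images/xpedio/search2.pdf - 200 570 493 31 80 Mozilla/4.0";

// Group 1: the user name after "FAHC". Group 2: the requested .pdf path.
const pattern = /FAHC(\S+).* (\S+\.pdf)/;

// Hypothetical helper: returns both captures, or null when the line does not match.
function extractReportRow(text) {
  const m = pattern.exec(text);
  return m ? { user: m[1], file: m[2] } : null;
}

console.log(extractReportRow(line));
// { user: 'KioskUser', file: '/xpedio/images/xpedio/search2.pdf' }
```

Lines without a FAHC user or without a .pdf request return null, so they drop out of the report.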
Regex stands for Regular Expression, which is used to define a pattern for a string. It is used to find the text or to edit the text. Java Regex classes are present in java.util.regex package, which needs to be imported before using any of the methods of regex classes.
java.util.regex package consists of 3 classes:
- Pattern
- Matcher
- PatternSyntaxException
Classes in regex package
Metacharacters
Metacharacters are like short-codes for common matching patterns.
Regular Expression | Description |
---|---|
\d | Any digit, short-code for [0-9] |
\D | Any non-digit, short-code for [^0-9] |
\s | Any whitespace character, short-code for [ \t\n\x0B\f\r] |
\S | Any non-whitespace character |
\w | Any word character, short-code for [a-zA-Z_0-9] |
\W | Any non-word character |
\b | Represents a word boundary |
\B | Represents a non-word boundary |
Usage of Metacharacters
- Precede the metacharacter with a backslash (\). In a Java string literal the backslash itself must be escaped, so \d is written as "\\d".
Explanation of Metacharacters
1. Digit & Non-Digit Metacharacters: (\d, \D)
Java
import java.io.*;
import java.util.regex.*;

class GFG {
    public static void main(String[] args)
    {
        // \d matches a single digit; Pattern.matches tests the whole input
        System.out.println(Pattern.matches("\\d", "2"));
        System.out.println(Pattern.matches("\\d", "a"));
        // \D matches a single non-digit character
        System.out.println(Pattern.matches("\\D", "a"));
        System.out.println(Pattern.matches("\\D", "2"));
    }
}
Output
true
false
true
false
Explanation
- \d represents a digit from 0 to 9. When the input is a single digit, the match returns true; otherwise it returns false.
- \D represents a non-digit, that is, any character except 0 to 9. When the input is a digit, the match returns false; otherwise it returns true.
2. Whitespace and Non-Whitespace Metacharacters: (\s, \S)
Java
import java.io.*;
import java.util.regex.*;

class GFG {
    public static void main(String[] args)
    {
        // \s matches a single whitespace character
        System.out.println(Pattern.matches("\\s", " "));
        System.out.println(Pattern.matches("\\s", "2"));
        // \S matches a single non-whitespace character
        System.out.println(Pattern.matches("\\S", "2"));
        System.out.println(Pattern.matches("\\S", " "));
    }
}
Output
true
false
true
false
Explanation
- \s represents whitespace characters such as space, tab and newline. When the input is a single whitespace character, the match returns true; otherwise it returns false.
- \S represents a non-whitespace character, that is, anything except whitespace. When the input is a whitespace character, the match returns false; otherwise it returns true.
3. Word & Non-Word Metacharacters: (\w, \W)
Java
import java.io.*;
import java.util.regex.*;

class GFG {
    public static void main(String[] args)
    {
        // \w matches a single letter, digit or underscore
        System.out.println(Pattern.matches("\\w", "a"));
        System.out.println(Pattern.matches("\\w", "2"));
        System.out.println(Pattern.matches("\\w", "$"));
        // \W matches a single character that is not a word character
        System.out.println(Pattern.matches("\\W", "2"));
        System.out.println(Pattern.matches("\\W", " "));
        System.out.println(Pattern.matches("\\W", "$"));
    }
}
Output
true
true
false
false
true
true
Explanation
- \w represents a word character: a letter (uppercase or lowercase), a digit [0-9] or an underscore. When the input is a single letter or digit, the match returns true; otherwise it returns false.
- \W represents a non-word character, that is, anything except letters, digits and the underscore. When the input is a letter or digit, the match returns false; otherwise it returns true.
4. Word & Non-Word Boundary Metacharacters: (\b, \B)
Java
import java.io.*;
import java.util.regex.*;

class GFG {
    public static void main(String[] args)
    {
        // \b requires a word boundary at that position
        System.out.println(Pattern.matches("\\bGFG\\b", "GFG"));
        System.out.println(Pattern.matches("\\b@GFG\\b", "@GFG"));
        // \B requires that there is no word boundary at that position
        System.out.println(Pattern.matches("\\B@GFG@\\B", "@GFG@"));
        System.out.println(Pattern.matches("\\BGFG\\B", "GFG"));
    }
}
Output
true
false
true
false
Explanation:
- \b matches a word boundary: a position between a word character and a non-word character, or between a word character and the start or end of the input. GFG begins and ends with the word character G, so both \b assertions hold and the result is true. @GFG begins with @, which is not a word character, so there is no word boundary at the start of the input and the result is false.
- \B matches wherever \b does not. @GFG@ begins and ends with the non-word character @, so there is no word boundary at either end and the result is true. GFG begins and ends with a word character, so there is a word boundary at both ends and the result is false.
Example:
Java
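import java.io.*;
import java.util.regex.*;

class GFG {
    public static void main(String[] args)
    {
        // One character per class, in order: digit, non-digit, whitespace,
        // non-whitespace, word character, non-word character
        System.out.println(Pattern.matches("\\d\\D\\s\\S\\w\\W", "1G FG!")); // true
        System.out.println(Pattern.matches("\\d\\D\\s\\S\\w\\W", "Geeks!")); // false
    }
}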
Regex Cheatsheet
Basics
\
Escape a special character
*
Match preceding character 0 or more times
+
Match preceding character 1 or more times
.
Match any single character
Character Classes I
\s
Match a single white space character (space, tab, form feed, or line feed)
\w
Match any alphanumeric character (including underscore)
\D
Match a non-digit character
\S
Match a single character other than white space
\W
Match any non-word character
Character Classes II
[abc]
Match any one of the characters in the set ‘abc’
[^abc]
Match anything not in character set ‘abc’
Assertions
^
Match beginning of input
\B
Match a non-word boundary
Assertions II
Quantifiers
{n}
Match exactly n occurrences of preceding character
{n,m}
Match at least n and at most m occurrences of the preceding character
Special Characters I
\cX
Match control character X in a string
Special Characters II
\xhh
Match character with code hh (2 hex digits)
\uhhhh
Match character with code hhhh (4 hex digits)
Flags
y
«sticky» search match starting at current position in target string
Groups
(x)
Match ‘x’ and remember the match
(?:x)
Match ‘x’ but do not remember the match
\n
A back reference to the last substring matching the n parenthetical in the regex
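Two of the less familiar entries, the sticky flag and the back reference, may be easier to follow with a short sketch. The sample strings are made up for illustration.

```javascript
// Sticky flag: the match must begin exactly at lastIndex.
const sticky = /foo/y;
sticky.lastIndex = 3;
const atThree = sticky.test("barfoo"); // "foo" starts at index 3
const nextIndex = sticky.lastIndex;    // advanced past the match
sticky.lastIndex = 0;
const atZero = sticky.test("barfoo");  // index 0 holds "bar", so no match

// Capturing group plus back reference: \1 repeats whatever group 1 matched.
const doubledWord = /\b(\w+) \1\b/;
const repeated = doubledWord.exec("it was the the best");

console.log(atThree, nextIndex, atZero); // true 6 false
console.log(repeated[1]);                // the
```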
This checklist summarizes the most commonly used/hard to remember parts of the
regexp engine available in most parts of calibre.
Character classes¶
Character classes are useful to represent different groups of characters,
succinctly.
Examples:
Representation | Class |
[a-z] | Lowercase letters. Does not include characters with accent marks and ligatures |
[a-z0-9] | Lowercase letters from a to z or numbers from 0 to 9 |
[A-Za-z-] | Uppercase or lowercase letters, or a dash. To include the dash in a class, you must put it at the beginning or at the end so as not to confuse it with the hyphen that specifies a range of characters |
[^0-9] | Any character except a digit. The caret (^) placed at the beginning of the class excludes the characters of the class (complemented class) |
[[a-z]--[aeiou]] | The lowercase consonants. A class can be included in a class. The characters -- exclude what follows them |
[\w--[\d_]] | All letters (including foreign accented characters). Abbreviated classes can be used inside a class |
Example:
<[^<>]+> to select an HTML tag
Shorthand character classes¶
Representation | Class |
\d | A digit (same as [0-9]) |
\D | Any non-numeric character (same as [^0-9]) |
\w | An alphanumeric character ([a-zA-Z0-9]), including characters with accent marks and ligatures |
\W | Any “non-word” character |
\s | Space, non-breaking space, tab, return line |
\S | Any “non-whitespace” character |
. | Any character except newline. Use the “dot all” checkbox or the (?s) mode to include the newline character |
The quantifiers¶
Quantifier | Number of occurrences of the expression preceding the quantifier |
? | 0 or 1 occurrence of the expression. Same as {0,1} |
+ | 1 or more occurrences of the expression. Same as {1,} |
* | 0, 1 or more occurrences of the expression. Same as {0,} |
{n} | Exactly n occurrences of the expression |
{min,max} | Number of occurrences between the minimum and maximum values included |
{min,} | Number of occurrences between the minimum value included and the infinite |
{,max} | Number of occurrences between 0 and the maximum value included |
Greed¶
By default, with quantifiers, the regular expression engine is greedy: it extends the selection as much as possible. This often causes surprises at first. A ? following a quantifier makes it lazy. Avoid putting two lazy quantifiers in the same expression, as the result can be unpredictable.
Beware of nesting quantifiers, for example the pattern (a*)*, as it exponentially increases processing time.
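The difference between greedy and lazy matching is easiest to see side by side. The sketch below uses JavaScript's regex engine rather than calibre's, and the sample HTML string is made up for illustration.

```javascript
const html = "<h1>Title 1</h1> <h1>Title 2</h1>";

// Greedy: .* extends as far as possible, so the match runs to the last </h1>.
const greedy = html.match(/<h1>.*<\/h1>/)[0];

// Lazy: .*? stops at the first position where the rest of the pattern can match.
const lazy = html.match(/<h1>.*?<\/h1>/)[0];

console.log(greedy); // <h1>Title 1</h1> <h1>Title 2</h1>
console.log(lazy);   // <h1>Title 1</h1>
```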
Alternation¶
The | character in a regular expression is a logical OR. It means that either the preceding or the following expression can match.
Exclusion¶
Method 1
pattern_to_exclude(*SKIP)(*FAIL)|pattern_to_select
Example:
"Blabla"(*SKIP)(*FAIL)|Blabla
selects Blabla, in the strings Blabla or “Blabla or Blabla”, but not in “Blabla”.
Method 2
pattern_to_exclude\K|(pattern_to_select)
"Blabla"\K|(Blabla)
selects Blabla, in the strings Blabla or “Blabla or Blabla”, but not in “Blabla”.
Anchors¶
An anchor is a way to match a logical location in a string, rather than a
character. The most useful anchors for text processing are:
\b
Designates a word boundary, i.e. a transition between a word character and a non-word character. For example, you can use \bsurd to match the surd but not absurd.
^
Matches the start of a line (in multi-line mode, which is the default)
$
Matches the end of a line (in multi-line mode, which is the default)
\K
Resets the start position of the selection to its position in the pattern. Some regexp engines (but not calibre) do not allow lookbehind of variable length, especially with quantifiers. When you can use \K with these engines, it also allows you to get rid of this limit by writing the equivalent of a positive lookbehind of variable length.
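A short sketch of the \b, ^ and $ anchors follows. It runs on JavaScript's regex engine, which has no \K and needs the m flag for line-by-line anchors, so it illustrates the concepts rather than calibre's exact behaviour. The sample text is made up.

```javascript
// \bsurd requires a word boundary immediately before "surd".
const startsWord = /\bsurd/;
const inTheSurd = startsWord.test("the surd"); // boundary between " " and "s"
const inAbsurd = startsWord.test("absurd");    // "b" and "s" are both word characters

// With the m flag, ^ and $ match at the start and end of each line.
const text = "Status: ok\nNo Status";
const lineStarts = text.match(/^Status/gm).length; // only line 1 starts with "Status"
const lineEnds = text.match(/Status$/gm).length;   // only line 2 ends with "Status"

console.log(inTheSurd, inAbsurd);  // true false
console.log(lineStarts, lineEnds); // 1 1
```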
Groups¶
(expression)
Capturing group, which stores the selection and can be recalled later in the search or replace patterns with \n, where n is the sequence number of the capturing group (starting at 1 in reading order)
(?:expression)
Group that does not capture the selection
(?>expression)
Atomic Group: As soon as the expression is satisfied, the regexp engine passes, and if the rest of the pattern fails, it will not backtrack to try other combinations with the expression. Atomic groups do not capture.
(?|expression)
Branch reset group: the branches of the alternations included in the expression share the same group numbers
(?<name>expression)
Group named “name”. The selection can be recalled later in the search pattern by (?P=name) and in the replace by \g<name>. Two different groups can use the same name.
Lookarounds¶
Lookaround | Meaning |
(?=expression) | Positive lookahead (to be placed after the selection) |
(?!expression) | Negative lookahead (to be placed after the selection) |
(?<=expression) | Positive lookbehind (to be placed before the selection) |
(?<!expression) | Negative lookbehind (to be placed before the selection) |
Lookaheads and lookbehinds do not consume characters, they are zero length and
do not capture. They are atomic groups: as soon as the assertion is satisfied,
the regexp engine passes, and if the rest of the pattern fails, it will not
backtrack inside the lookaround to try other combinations.
When looking for multiple matches in a string, at the starting position of each match attempt, a lookbehind can inspect the characters before the current position. Therefore, on the string 123, the pattern (?<=\d)\d (a digit preceded by a digit) should, in theory, select 2 and 3. On the other hand, \d\K\d can only select 2, because the starting position after the first selection is immediately before 3, and there are not enough digits for a second match. Similarly, \d(\d) only captures 2. In calibre’s regexp engine practice, the positive lookbehind behaves in the same way, and selects only 2, contrary to theory.
Groups can be placed inside lookarounds, but capture is rarely useful. Nevertheless, if it is useful, it will be necessary to be very careful in the use of a quantifier in a lookbehind: the greed associated with the absence of backtracking can give a surprising capture. For this reason, use \K rather than a positive lookbehind when you have a quantifier (or worse, several) in a capturing group of the positive lookbehind.
Example of negative lookahead:
(?![^<>{}]*[>}])
Placed at the end of the pattern, this prevents selecting inside a tag or inside a style embedded in the file.
Whenever possible, it is always better to “anchor” the lookarounds, to reduce
the number of steps necessary to obtain the result.
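The 123 example can be tried in an engine other than calibre's. The sketch below uses JavaScript, where the lookbehind follows the "theory" described above; it says nothing about how calibre itself behaves. The foobar string is a made-up illustration of negative lookahead.

```javascript
// Lookbehind: every digit that is preceded by a digit.
const viaLookbehind = "123".match(/(?<=\d)\d/g);

// Capturing the second digit consumes the first one as well, so after
// matching "12" only "3" is left and no second match is possible.
const viaCapture = Array.from("123".matchAll(/\d(\d)/g), (m) => m[1]);

// Negative lookahead: "foo" only when it is not followed by "bar".
const notBeforeBar = "foobar foobaz".replace(/foo(?!bar)/g, "X");

console.log(viaLookbehind); // [ '2', '3' ]
console.log(viaCapture);    // [ '2' ]
console.log(notBeforeBar);  // foobar Xbaz
```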
Recursion¶
Representation | Meaning |
(?R) | Recursion of the entire pattern |
(?1) | Recursion of the only pattern of the numbered capturing group, here group 1 |
Recursion means that a pattern calls itself. This is useful for balanced constructs, such as quoted strings, which can contain embedded quoted strings. If, while processing a string between double quotation marks, the engine encounters the beginning of a new string between double quotation marks, the pattern already describes how to handle it, so it calls itself. This gives a pattern like:
start-pattern(?>atomic sub-pattern|(?R))*end-pattern
To select a string between double quotation marks without stopping on an embedded string:
“((?>[^“”]+|(?R))*[^“”]+)”
This template can also be used to modify pairs of tags that can be embedded, such as <div> tags.
Special characters¶
Representation | Character |
\t | tabulation |
\n | line break |
\x20 | (breakable) space |
\xa0 | no-break space |
Meta-characters¶
Meta-characters are those that have a special meaning for the regexp engine. Of these, twelve must be preceded by an escape character, the backslash (\), to lose their special meaning and become a regular character again:
^ . [ ] $ ( ) * + ? | \
Seven other meta-characters do not need to be preceded by a backslash (but can be without any other consequence):
{ } ! < > = :
Special characters lose their status if they are used inside a class (between brackets []). The closing bracket and the dash have a special status in a class. Outside the class, the dash is a simple literal, the closing bracket remains a meta-character.
The slash (/) and the number sign (or hash character) (#) are not
meta-characters, they don’t need to be escaped.
In some tools, like regex101.com with the Python engine, double quotes have the
special status of separator, and must be escaped, or the options changed. This
is not the case in the editor of calibre.
Modes¶
(?s)
Causes the dot (.) to match newline characters as well
(?m)
Makes the ^ and $ anchors match the start and end of lines instead of the start and end of the entire string.
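The effect of the two modes can be sketched in JavaScript, which exposes them as the s and m flags on the regex rather than as inline (?s) and (?m) modifiers. That mapping is an assumption about equivalent behaviour, not a description of calibre's engine.

```javascript
const text = "first\nsecond";

// s flag (dotAll), the counterpart of (?s): the dot also matches newlines.
const dotDefault = /first.second/.test(text);
const dotAll = /first.second/s.test(text);

// m flag (multiline), the counterpart of (?m): ^ and $ match at line breaks.
const anchorsDefault = /^second$/.test(text);
const anchorsMultiline = /^second$/m.test(text);

console.log(dotDefault, dotAll);               // false true
console.log(anchorsDefault, anchorsMultiline); // false true
```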