A regular expression (abbreviated as regexp or regex, with plural forms regexps, regexes, or regexen) is a string that describes or matches a set of strings according to certain syntax rules. Regular expressions are used by many text editors and utilities to search and manipulate bodies of text based on certain patterns. Many programming languages support regular expressions for string manipulation. For example, Perl and Tcl have powerful regular expression engines built directly into their syntax. The set of utilities provided by Unix distributions (including the editor sed and the filter grep) was the first to popularize the concept of regular expressions.
A regular expression, often called a pattern, is an expression that describes a set of strings. Patterns are usually used to give a concise description of a set without having to list all of its elements. For example, the set containing the three strings Handel, Händel, and Haendel can be described by the pattern "H(ä|ae?)ndel" (or, alternatively, it is said that the pattern matches each of the three strings). Most formalisms provide the following operations to construct regular expressions.
- alternation: A vertical bar separates alternatives. For example, "gray|grey" matches gray or grey, which can commonly be shortened to "gr(a|e)y".
- grouping: Parentheses are used to define the scope and precedence of the operators. For example, "gray|grey" and "gr(a|e)y" are different patterns, but they both describe the set containing gray and grey.
- quantification: A quantifier after a character or group specifies how often that preceding expression is allowed to occur. The most common quantifiers are ?, *, and +:
  - ? The question mark indicates that there is 0 or 1 of the previous expression. For example, "colou?r" matches both color and colour.
  - * The asterisk indicates that there are 0, 1 or any number of the previous expression. For example, "go*gle" matches ggle, gogle, google, etc.
  - + The plus sign indicates that there is at least 1 of the previous expression. For example, "go+gle" matches gogle, google, etc. (but not ggle).
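The three operations above can be tried out directly. The following is a minimal sketch that assumes Python's standard `re` module as the engine (the article itself is not tied to any one language); the helper name `matches` is hypothetical and exists only for this illustration. It uses `re.fullmatch` so that a string counts as matched only when the whole string fits the pattern.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full (hypothetical helper)."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# Alternation and grouping: two different patterns, same set of strings.
print(matches(r"gray|grey", ["gray", "grey", "groy"]))       # ['gray', 'grey']
print(matches(r"gr(a|e)y", ["gray", "grey", "groy"]))        # ['gray', 'grey']

# Quantifiers: ? (0 or 1), * (0 or more), + (1 or more).
print(matches(r"colou?r", ["color", "colour", "colouur"]))   # ['color', 'colour']
print(matches(r"go*gle", ["ggle", "gogle", "google"]))       # ['ggle', 'gogle', 'google']
print(matches(r"go+gle", ["ggle", "gogle", "google"]))       # ['gogle', 'google']
```

Other engines may differ in how they anchor a match, so the full-match behavior shown here is a property of this sketch rather than of regular expressions in general.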
These constructions can be combined to form arbitrarily complex expressions, very much like one can construct arithmetical expressions from the numbers and the operations +, -, * and /. As an example, the pattern "((great )*grand )?(father|mother)" matches any ancestor: father, mother, grand father, grand mother, great grand father, great grand mother, great great grand father, great great grand mother, great great great grand father, great great great grand mother and so on.
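To see how the combined pattern behaves, here is a small sketch, again assuming Python's `re` module. The names `ANCESTOR`, `HANDEL`, and `is_ancestor` are hypothetical, chosen only for this example. It checks the ancestor pattern against a few strings, including one that should be rejected, and re-checks the Handel pattern from earlier in the article.

```python
import re

# Patterns taken from the text; the constant names are illustrative only.
ANCESTOR = re.compile(r"((great )*grand )?(father|mother)")
HANDEL = re.compile(r"H(ä|ae?)ndel")

def is_ancestor(text):
    """True if the whole string is described by the ancestor pattern."""
    return ANCESTOR.fullmatch(text) is not None

for s in ["father", "grand mother", "great great grand father", "great father"]:
    print(s, "->", is_ancestor(s))

# The optional group requires "grand ", so "great father" is rejected.
for s in ["Handel", "Händel", "Haendel", "Hendel"]:
    print(s, "->", HANDEL.fullmatch(s) is not None)
```

Note that the whole prefix "((great )*grand )" is optional as a unit, which is why any number of "great " repetitions is accepted only when "grand " follows.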
Here is a table to help you understand how to write regular expressions. It describes the characters and sequences that can be used in a pattern, with examples.
Regular Expression Syntax
| Character | Description |
| \ | Marks the next character as either a special character or a literal. For example, "n" matches the character "n". "\n" matches a newline character. The sequence "\\" matches "\" and "\(" matches "(". |
| ^ | Matches the beginning of input. |
| $ | Matches the end of input. |
| * | Matches the preceding character zero or more times. For example, "zo*" matches either "z" or "zoo". |
| + | Matches the preceding character one or more times. For example, "zo+" matches "zoo" but not "z". |
| ? | Matches the preceding character zero or one time. For example, "a?ve?" matches the "ve" in "never". |
| . | Matches any single character except a newline character. |
| (pattern) | Matches pattern and remembers the match. The matched substring can be retrieved from the resulting Matches collection, using Item [0]...[n]. To match parentheses characters ( ), use "\(" or "\)". |
| x|y | Matches either x or y. For example, "z|wood" matches "z" or "wood". "(z|w)oo" matches "zoo" or "woo". |
| {n} | n is a nonnegative integer. Matches exactly n times. For example, "o{2}" does not match the "o" in "Bob", but matches the first two o's in "foooood". |
| {n,} | n is a nonnegative integer. Matches at least n times. For example, "o{2,}" does not match the "o" in "Bob" and matches all the o's in "foooood". "o{1,}" is equivalent to "o+". "o{0,}" is equivalent to "o*". |
| {n,m} | m and n are nonnegative integers. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". "o{0,1}" is equivalent to "o?". |
| [xyz] | A character set. Matches any one of the enclosed characters. For example, "[abc]" matches the "a" in "plain". |
| [^xyz] | A negative character set. Matches any character not enclosed. For example, "[^abc]" matches the "p" in "plain". |
| [a-z] | A range of characters. Matches any character in the specified range. For example, "[a-z]" matches any lowercase alphabetic character in the range "a" through "z". |
| [^m-z] | A negative range of characters. Matches any character not in the specified range. For example, "[^m-z]" matches any character not in the range "m" through "z". |
| \b | Matches a word boundary, that is, the position between a word and a space (or another non-word character, or the start or end of input). For example, "er\b" matches the "er" in "never" but not the "er" in "verb". |
| \B | Matches a non-word boundary. "ea*r\B" matches the "ear" in "never early". |
| \d | Matches a digit character. Equivalent to [0-9]. |
| \D | Matches a non-digit character. Equivalent to [^0-9]. |
| \f | Matches a form-feed character. |
| \n | Matches a newline character. |
| \r | Matches a carriage return character. |
| \s | Matches any white space including space, tab, form-feed, etc. Equivalent to "[ \f\n\r\t\v]". |
| \S | Matches any nonwhite space character. Equivalent to "[^ \f\n\r\t\v]". |
| \t | Matches a tab character. |
| \v | Matches a vertical tab character. |
| \w | Matches any word character including underscore. Equivalent to "[A-Za-z0-9_]". |
| \W | Matches any non-word character. Equivalent to "[^A-Za-z0-9_]". |
| \num | Matches num, where num is a positive integer. A reference back to remembered matches. For example, "(.)\1" matches two consecutive identical characters. |
| \n | Matches n, where n is an octal escape value. Octal escape values must be 1, 2, or 3 digits long. For example, "\11" and "\011" both match a tab character. "\0011" is the equivalent of "\001" & "1". Octal escape values must not exceed 256. If they do, only the first two digits comprise the expression. Allows ASCII codes to be used in regular expressions. |
| \xn | Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, "\x41" matches "A". "\x041" is equivalent to "\x04" & "1". Allows ASCII codes to be used in regular expressions. |
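Several rows of the table can be checked with a short script. This is a sketch under the assumption that Python's `re` module is the engine; the table appears to describe a different engine (it mentions a Matches collection), so details such as octal escapes or how captured groups are retrieved may not carry over exactly. The helper `first_match` is hypothetical.

```python
import re

def first_match(pattern, text):
    """Return the first substring of text matched by pattern, or None (hypothetical helper)."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# Counted quantifiers: {n}, {n,}, {n,m}
print(first_match(r"o{2}", "Bob"))            # None
print(first_match(r"o{2}", "foooood"))        # oo
print(first_match(r"o{2,}", "foooood"))       # ooooo
print(first_match(r"o{1,3}", "fooooood"))     # ooo

# Character sets and negative character sets
print(first_match(r"[abc]", "plain"))         # a
print(first_match(r"[^abc]", "plain"))        # p
print(first_match(r"[^m-z]", "plain"))        # l

# Word boundary and non-word boundary
print(first_match(r"er\b", "never"))          # er
print(first_match(r"er\b", "verb"))           # None
print(first_match(r"ea*r\B", "never early"))  # ear

# Remembered matches: in Python, groups are read with .group(n)
m = re.search(r"(z|w)oo", "wood")
print(m.group(0), m.group(1))                 # woo w

# Back reference and hexadecimal escape
print(first_match(r"(.)\1", "bookkeeper"))    # oo
print(first_match(r"\x41", "ABC"))            # A

# Shorthand classes
print(first_match(r"\d+", "room 101"))        # 101
print(re.findall(r"\w+", "snake_case, 42!"))  # ['snake_case', '42']
print(re.split(r"\s+", "a b\tc"))             # ['a', 'b', 'c']
```

In Python 3, `\d`, `\w`, and `\s` also match non-ASCII digits, letters, and spaces by default; passing `re.ASCII` restricts them to the ASCII equivalents listed in the table.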