Intelligentedu
Best New Free Computer IT Training Tutorial Resources


 




June 29, 2006

Writing Regular Expressions

A regular expression (abbreviated as regexp or regex, with plural forms regexps, regexes, or regexen) is a string that describes or matches a set of strings according to certain syntax rules. Many text editors and utilities use regular expressions to search and manipulate bodies of text based on patterns, and many programming languages support them for string manipulation. For example, Perl and Tcl each have a powerful regular expression engine built directly into their syntax. The utilities provided by Unix distributions (including the editor sed and the filter grep) were the first to popularize the concept of regular expressions.

A regular expression, often called a pattern, is an expression that describes a set of strings. Patterns are usually used to give a concise description of a set without having to list all of its elements. For example, the set containing the three strings Handel, Händel, and Haendel can be described by the pattern "H(ä|ae?)ndel" (alternatively, the pattern is said to match each of the three strings). Most formalisms provide the following operations to construct regular expressions.
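The idea of a pattern describing a set of strings can be checked directly. The following is a minimal sketch using Python's re module, which is an assumption here since the post does not name a particular engine; the two non-matching names in the list are made up for illustration.

```python
import re

# Pattern from the text: "ä", or "a" optionally followed by "e".
pattern = re.compile(r"H(ä|ae?)ndel")

# The last two names are hypothetical counterexamples, not from the post.
names = ["Handel", "Händel", "Haendel", "Hendel", "Haaendel"]

# fullmatch() succeeds only when the whole string belongs to the set
# described by the pattern.
matched = [name for name in names if pattern.fullmatch(name)]
print(matched)  # ['Handel', 'Händel', 'Haendel']
```

Using fullmatch() rather than search() keeps the test about set membership: search() would also accept longer strings that merely contain one of the three spellings.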

alternation
A vertical bar separates alternatives. For example, "gray|grey" matches gray or grey, which can commonly be shortened to "gr(a|e)y".

grouping
Parentheses are used to define the scope and precedence of the operators. For example, "gray|grey" and "gr(a|e)y" are different patterns, but they both describe the set containing gray and grey.

quantification
A quantifier after a character or group specifies how often that preceding expression is allowed to occur. The most common quantifiers are ?, *, and +:
?     The question mark indicates there is 0 or 1 of the previous expression. For example, "colou?r" matches both color and colour.
*     The asterisk indicates there are 0, 1 or any number of the previous expression. For example, "go*gle" matches ggle, gogle, google, etc.
+     The plus sign indicates that there is at least 1 of the previous expression. For example, "go+gle" matches gogle, google, etc. (but not ggle).
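The three operations above can be tried out with a short script. This is a sketch that assumes Python's re module; the helper name accepted and the extra test strings (gry, graey, colouur, gooogle) are hypothetical additions for illustration.

```python
import re

def accepted(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# Alternation and grouping: two different patterns, same set of strings.
spellings = ["gray", "grey", "gry", "graey"]
by_alternation = accepted(r"gray|grey", spellings)
by_grouping = accepted(r"gr(a|e)y", spellings)
print(by_alternation == by_grouping, by_grouping)  # True ['gray', 'grey']

# Quantifiers: ? is 0 or 1, * is 0 or more, + is 1 or more.
optional_u = accepted(r"colou?r", ["color", "colour", "colouur"])
words = ["ggle", "gogle", "google", "gooogle"]
star = accepted(r"go*gle", words)
plus = accepted(r"go+gle", words)
print(optional_u)  # ['color', 'colour']
print(star)        # ['ggle', 'gogle', 'google', 'gooogle']
print(plus)        # ['gogle', 'google', 'gooogle']
```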

These constructions can be combined to form arbitrarily complex expressions, very much like one can construct arithmetical expressions from the numbers and the operations +, -, * and /.

As an example, the pattern "((great )*grand )?(father|mother)" matches any ancestor: father, mother, grand father, grand mother, great grand father, great grand mother, great great grand father, great great grand mother, great great great grand father, great great great grand mother and so on.
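To see how the nested groups in the ancestor pattern behave, here is a small sketch, again assuming Python's re module. The function name generations and the idea of counting generations are illustrative additions, not part of the original example.

```python
import re

# Pattern from the text. The spaces inside the groups are significant:
# each "great " and the "grand " carry their own trailing space.
ancestor = re.compile(r"((great )*grand )?(father|mother)")

def generations(phrase):
    """Return how many generations back the phrase refers to, or None
    if the phrase is not in the set described by the pattern."""
    if ancestor.fullmatch(phrase) is None:
        return None
    # 1 for father/mother, plus 1 for "grand ", plus 1 per "great ".
    return 1 + phrase.count("grand ") + phrase.count("great ")

print(generations("mother"))                    # 1
print(generations("grand father"))              # 2
print(generations("great great grand mother"))  # 4
print(generations("great father"))              # None
```

The last call returns None because "great " may only appear inside the optional group, which must end in "grand ".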

Here is a table to help you understand how to write Regular Expressions. This Regular Expression Syntax table describes and gives an example of the characters and sequences that can be used.

Regular Expression Syntax

Character

Description

\

Marks the next character as either a special character or a literal. For example, "n" matches the character "n". "\n" matches a newline character. The sequence "\\" matches "\" and "\(" matches "(".

^

Matches the beginning of input.

$

Matches the end of input.

*

Matches the preceding character zero or more times. For example, "zo*" matches "z", "zo", and "zoo".

+

Matches the preceding character one or more times. For example, "zo+" matches "zoo" but not "z".

?

Matches the preceding character zero or one time. For example, "a?ve?" matches the "ve" in "never".

.

Matches any single character except a newline character.

(pattern)

Matches pattern and remembers the match. The matched substring can be retrieved from the resulting Matches collection, using Item [0]...[n]. To match parentheses characters ( ), use "\(" or "\)".

x|y

Matches either x or y. For example, "z|wood" matches "z" or "wood". "(z|w)oo" matches "zoo" or "woo".

{n}

n is a nonnegative integer. Matches exactly n times. For example, "o{2}" does not match the "o" in "Bob", but matches the first two o's in "foooood".

{n,}

n is a nonnegative integer. Matches at least n times. For example, "o{2,}" does not match the "o" in "Bob" but matches all the o's in "foooood". "o{1,}" is equivalent to "o+". "o{0,}" is equivalent to "o*".

{n,m}

m and n are nonnegative integers, where n is less than or equal to m. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". "o{0,1}" is equivalent to "o?".

[xyz]

A character set. Matches any one of the enclosed characters. For example, "[abc]" matches the "a" in "plain".

[^xyz]

A negative character set. Matches any character not enclosed. For example, "[^abc]" matches the "p" in "plain".

[a-z]

A range of characters. Matches any character in the specified range. For example, "[a-z]" matches any lowercase alphabetic character in the range "a" through "z".

[^m-z]

A negative range of characters. Matches any character not in the specified range. For example, "[^m-z]" matches any character not in the range "m" through "z".

\b

Matches a word boundary, that is, the position between a word character and a non-word character (such as a space) or the start or end of the input. For example, "er\b" matches the "er" in "never" but not the "er" in "verb".

\B

Matches a non-word boundary. "ea*r\B" matches the "ear" in "never early".

\d

Matches a digit character. Equivalent to [0-9].

\D

Matches a non-digit character. Equivalent to [^0-9].

\f

Matches a form-feed character.

\n

Matches a newline character.

\r

Matches a carriage return character.

\s

Matches any white space including space, tab, form-feed, etc.
Equivalent to "[ \f\n\r\t\v]".

\S

Matches any non-white-space character.
Equivalent to "[^ \f\n\r\t\v]".

\t

Matches a tab character.

\v

Matches a vertical tab character.

\w

Matches any word character including underscore.
Equivalent to "[A-Za-z0-9_]".

\W

Matches any non-word character.
Equivalent to "[^A-Za-z0-9_]".

\num

Matches num, where num is a positive integer. A reference back to remembered matches. For example, "(.)\1" matches two consecutive identical characters.

\n

Matches n, where n is an octal escape value. Octal escape values must be 1, 2, or 3 digits long. For example, "\11" and "\011" both match a tab character. "\0011" is equivalent to "\001" followed by "1". Octal escape values must not exceed 256. If they do, only the first two digits comprise the expression. Allows ASCII codes to be used in regular expressions.

\xn

Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, "\x41" matches "A". "\x041" is equivalent to "\x04" followed by "1". Allows ASCII codes to be used in regular expressions.
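Several rows of the table can be exercised in one short script. This is a sketch under the assumption that Python's re module is a reasonable stand-in: the table appears to describe a different engine (the Matches collection is not a Python concept), and details such as octal escapes and the exact meaning of \w, \d, and \s can differ between engines. Sample strings other than those quoted in the table are made up for illustration.

```python
import re

# Anchors: ^ ties the match to the start of input, $ to the end.
starts = re.search(r"^never", "never early") is not None
ends = re.search(r"early$", "never early") is not None

# Character sets and negative character sets.
in_set = re.search(r"[abc]", "plain").group()       # 'a'
not_in_set = re.search(r"[^abc]", "plain").group()  # 'p'

# Counted repetition: {n}, {n,} and {n,m}.
exactly_two = re.search(r"o{2}", "foooood").group()
two_or_more = re.search(r"o{2,}", "foooood").group()
up_to_three = re.search(r"o{1,3}", "fooooood").group()
no_double_o = re.search(r"o{2}", "Bob") is None

# Word boundaries: \b at the edge of a word, \B inside one.
er_at_end = re.search(r"er\b", "never") is not None
er_inside = re.search(r"er\b", "verb") is not None
ear = re.search(r"ea*r\B", "never early")

# Shorthand classes.
numbers = re.findall(r"\d+", "Room 101, floor 7")
words = re.findall(r"\w+", "snake_case, 2 words")
fields = re.split(r"\s+", "a \t b\nc")

# Back-reference: \1 repeats whatever group 1 matched.
doubled = re.search(r"(.)\1", "balloon").group()

# Escapes: a literal parenthesis and a hexadecimal character code.
paren = re.search(r"\(", "f(x)").group()
hex_a = re.fullmatch(r"\x41", "A") is not None

# The dot does not match a newline by default.
dot_newline = re.search(r"a.c", "a\nc") is None

print(in_set, not_in_set, exactly_two, up_to_three)
print(ear.group(), ear.span())
print(numbers, words, fields, doubled, paren)
```

Here search() finds the first match anywhere in the input, which mirrors the way the table phrases its examples ("matches the 'a' in 'plain'"), while findall() and split() show the shorthand classes applied across a whole string.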





  • Filed under: Best New Free Computer IT Training Tutorial Resources — computer_teacher @ 10:40 pm
