Regular Expressions Tutorial

A regular expression (or regex) is a way to describe a pattern to match within strings. So let’s jump in. Here’s a sentence:

The quick brown fox jumps over the lazy dog

And here’s a regular expression:

own

That expression searches the sentence for the string “own” anywhere it occurs, whether at the beginning of a word, in the middle of one, or at the end. In this case the word “brown” contains “own”, so the search succeeds. What if we wanted to match a word that ends with “umps” after a single leading character? Then we could just use this:

  .umps

A dot represents any single character except a newline. Also notice the space before the dot: the match has to begin with a space, so it will only find a separate word within the sentence (as long as it’s not the first word). In our sentence, we have one match for “ .umps”, which is “ jumps”.
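To try the two patterns so far, here is a minimal sketch. It assumes Python and its built-in re module, which is my choice for illustration; the tutorial itself isn't tied to any language, and other regex engines may differ in small details.

```python
import re

sentence = "The quick brown fox jumps over the lazy dog"

# A plain literal pattern matches wherever those characters occur.
own = re.search(r"own", sentence)
print(own.group(), own.start())  # own 12

# A space, then any one character, then "umps".
umps = re.search(r" .umps", sentence)
print(repr(umps.group()))  # ' jumps'
```

re.search scans the whole string and returns the first match, or None when nothing matches.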
Check this one out:

quick|lazy

The pipe symbol separates alternative expressions to match against, so the above regex matches either the string “quick” or the string “lazy”.
This one will match against “lazy”, “lazer”, or “lazarus”:

laz(y|er|arus)

Using grouping parentheses, I created a subexpression. Its usage is kind of obvious but less obvious to explain: the alternation applies only to what is inside the parentheses, so “laz” has to be followed by one of “y”, “er”, or “arus”.
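Here is a small sketch of alternation and grouping, again assuming Python's re module. The word list is made up for the example.

```python
import re

sentence = "The quick brown fox jumps over the lazy dog"

# Alternation: either branch may match.
found = re.findall(r"quick|lazy", sentence)
print(found)  # ['quick', 'lazy']

# Grouping limits the alternation to the suffix after "laz".
pattern = re.compile(r"laz(y|er|arus)")
words = ["lazy", "lazer", "lazarus", "laze"]
matched = [w for w in words if pattern.fullmatch(w)]
print(matched)  # ['lazy', 'lazer', 'lazarus']
```

fullmatch requires the whole string to fit the pattern, which is why “laze” is rejected.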
Now let’s take a look at these:

  • ? = Matches the preceding element zero or one time.
  • * = Matches the preceding element zero or more times.
  • + = Matches the preceding element one or more times.

This one will match “fx”, “fix”, “fox”, “fax”, “fux”, etc:

f.?x

The dot means any character (besides newline), but the question mark means that the previous character (in this case: dot) may or may not be present.
This will match “ac”, “abc”, “abbc”, “abbbc”, etc:

ab*c

The asterisk matches the previous character (b) zero or more times.
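A quick way to check the question mark and the asterisk, sketched with Python's re module (an assumption on my part, any engine should behave the same for these two):

```python
import re

# ? makes the preceding dot optional.
optional_dot = re.compile(r"f.?x")
optional_hits = [w for w in ["fx", "fox", "fix", "foox"] if optional_dot.fullmatch(w)]
print(optional_hits)  # ['fx', 'fox', 'fix']

# * allows the preceding "b" to repeat zero or more times.
star = re.compile(r"ab*c")
star_hits = [w for w in ["ac", "abc", "abbbc", "adc"] if star.fullmatch(w)]
print(star_hits)  # ['ac', 'abc', 'abbbc']
```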
This will match “Welcome” with at least one equals sign on each side:

=+Welcome=+

So the above will match something like: “=========================Welcome==========================”
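The plus sign needs at least one occurrence, which a short sketch makes easy to see (Python's re module assumed):

```python
import re

banner = re.compile(r"=+Welcome=+")

long_banner = bool(banner.fullmatch("=====Welcome====="))
short_banner = bool(banner.fullmatch("=Welcome="))
no_signs = bool(banner.fullmatch("Welcome"))
one_side = bool(banner.fullmatch("=Welcome"))

print(long_banner, short_banner, no_signs, one_side)  # True True False False
```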
Use {m,n} to match the preceding element at least m and not more than n times. If n and the comma are both left out, as in {m}, it matches the preceding element exactly m times; {m,} with the comma kept matches m or more times.

(a|b|c){3}

The above will match any three-character string made up of “a”, “b”, and “c”, such as “aaa”, “bbb”, “ccc”, or “abc”.
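One thing worth checking for yourself is that each repetition of the group picks its letter independently, so mixed strings are accepted too. A minimal sketch, assuming Python's re module and a made-up candidate list:

```python
import re

three = re.compile(r"(a|b|c){3}")
candidates = ["aaa", "abc", "cab", "ab", "abd"]
three_hits = [s for s in candidates if three.fullmatch(s)]
print(three_hits)  # ['aaa', 'abc', 'cab']

# Bounded and open-ended counts.
two_to_three = bool(re.fullmatch(r"a{2,3}", "aa"))
too_many = bool(re.fullmatch(r"a{2,3}", "aaaa"))
two_or_more = bool(re.fullmatch(r"a{2,}", "aaaa"))
print(two_to_three, too_many, two_or_more)  # True False True
```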
Another operator you could use is the dash. Inside square brackets it matches a range of characters (ordered by ASCII code), where the first character of the range comes before the dash and the last character comes after it. Outside brackets a dash is just a literal dash.

[a-z][0-9]

This will match any lowercase letter followed by a digit from 0 through 9.
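Ranges only work inside square brackets, which is easy to confirm with a sketch (Python's re module assumed, test strings invented for the example):

```python
import re

letter_then_digit = re.compile(r"[a-z][0-9]")
samples = ["a1", "z9", "A1", "ab", "7x"]
range_hits = [s for s in samples if letter_then_digit.fullmatch(s)]
print(range_hits)  # ['a1', 'z9']

# Outside brackets the dash is just a literal character.
literal_dash = bool(re.fullmatch(r"a-z0-9", "a-z0-9"))
not_a_range = bool(re.fullmatch(r"a-z0-9", "b5"))
print(literal_dash, not_a_range)  # True False
```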
Now for square brackets, which match any one character out of the characters contained in the brackets. Most special pattern-matching characters lose their special meaning inside brackets, so brackets are also a way to escape them; the main exceptions are ^, -, ], and the backslash. When ^ is the first character inside the brackets, the group matches any character that is not listed. The following will match a filename ending in .exe or .com, provided the pattern has to match the whole name:

.+[.](exe|com)

This will match any text whose first character is not ‘.’, ‘?’, or ‘$’ and that has at least one more character after it:

  [^.?$].+ 
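Both bracket examples can be tried with a short sketch. It assumes Python's re module, and the file names and sample strings are made up. I use fullmatch for the filename pattern so the whole name has to fit, and match for the second pattern so it is tested from the first character.

```python
import re

# [.] is a literal dot because the dot is inside brackets.
executable = re.compile(r".+[.](exe|com)")
names = ["setup.exe", "command.com", "notes.txt", ".exe", "setupexe"]
executables = [n for n in names if executable.fullmatch(n)]
print(executables)  # ['setup.exe', 'command.com']

# [^.?$] is any one character other than '.', '?' or '$'.
not_special_first = re.compile(r"[^.?$].+")
samples = ["hello", ".hidden", "$price", "?what", "x"]
accepted = [s for s in samples if not_special_first.match(s)]
print(accepted)  # ['hello']
```

Note that “x” is rejected as well, because .+ needs at least one character after the first one.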

Thank you for reading my tutorial. Now have a look at this reference for anchors and escape sequences:

^    =   Beginning of string (or line in multiline mode)
$    =   End of string (or line in multiline mode)
\w   =   [A-Za-z0-9_]
\W   =   [^A-Za-z0-9_]
\d   =   [0-9]
\D   =   [^0-9]
\s   =   [ \t\r\n\v\f]
\S   =   [^ \t\r\n\v\f]
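Here is a minimal sketch of the anchors and shorthand classes in use, assuming Python's re module and an invented sample string. In Python, \w also matches the underscore, and on ordinary text strings \w, \d, and \s cover Unicode characters as well unless the re.ASCII flag is passed; other engines may differ.

```python
import re

text = "Order #42: 3 items_total\tready"

digits = re.findall(r"\d+", text)
print(digits)  # ['42', '3']

words = re.findall(r"\w+", text)
print(words)  # ['Order', '42', '3', 'items_total', 'ready']

parts = re.split(r"\s+", "a  b\tc\nd")
print(parts)  # ['a', 'b', 'c', 'd']

# ^ and $ anchor the pattern to the start and end of the string.
starts = bool(re.search(r"^The", "The quick brown fox"))
ends = bool(re.search(r"fox$", "The quick brown fox"))
not_at_start = bool(re.search(r"^quick", "The quick brown fox"))
print(starts, ends, not_at_start)  # True True False
```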
