Regular Expressions

Readers are encouraged to experiment with this app while reading, to get a feel for what this wiki covers, until they can write such patterns on their own.

Regular Expressions (abbreviated regex) are among the most useful tools in string processing. If you are fond of the search and replace tool in your favorite text editor/word processor, you'll love this.

[xkcd comic]

Introduction

The term regular expression was originally borrowed from automata theory in theoretical computer science. Broadly, it refers to a pattern against which a string, or a substring of it, is matched.

The comic should have already given you an idea of what regular expressions could be useful for. It should not be surprising that many programming languages, text processing tools, data validation tools and search engines make extensive use of them.

The key idea is that a regular expression is a pattern which matches a set of target strings.

\w+@\w+\.(com|org|net|in) is a regex that matches most email addresses that end with .com, .org, .net or .in.

Concepts

Regex syntax comes in many forms that vary with the language. Here, we will examine Perl regex syntax, since most other flavors are variations on it.

Before we dive into the syntax, these are the kinds of things that the patterns consist of:

Literals: The simplest things to match. A literal such as a or 1 matches exactly that character.
Metacharacters: They do not mean what they look like and usually stand for something else. For example, \d matches any digit.
Vertical Bar: The | acts as a boolean OR. It matches any one of the alternatives it separates.
Quantifiers: They specify how many times the preceding pattern must be matched.
Grouping and Capturing: Parentheses can be used to group parts of the regex or to capture parts of the match for later use.

Syntax

Let's look at what the metacharacters do in a little more detail.

Metacharacter	Description
`^`	Start of a string
`$`	End of a string
`\t`	Tab
`\n`	Newline
`\r`	Carriage Return
`\s`	Any whitespace character
`\S`	Any non-whitespace character
`\d`	Any Digit
`\D`	Any non-digit
`\w`	Any word-character
`\W`	Any non-word character
`\b`	Any word boundary
`\B`	Any non-word-boundary
`.`	Any single character, usually barring a newline

By the way, if you want to match a metacharacter literally, you need to use \ to escape it. For example, \. would just match the . character.
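
The metacharacters in the table, and the effect of escaping, can be tried out with a short Perl script. This is only a sketch: the sample strings and variable names are made up for illustration, and each match result is normalized to 1 or 0 so it prints cleanly.

```perl
use strict;
use warnings;

# Hypothetical sample string, chosen only for illustration.
my $line = "Room 42\tis open";

# \d matches a digit; \s matches any whitespace (here, the tab).
my $has_digit = ($line =~ /\d/)     ? 1 : 0;   # 1
my $has_tab   = ($line =~ /42\sis/) ? 1 : 0;   # 1

# ^ and $ anchor the pattern to the start and end of the string.
my $starts_with_room = ($line =~ /^Room/) ? 1 : 0;   # 1
my $ends_with_room   = ($line =~ /Room$/) ? 1 : 0;   # 0

# \b matches at a word boundary, so "is" inside "this" does not count.
my $whole_word = ("this" =~ /\bis\b/) ? 1 : 0;   # 0

# An unescaped . matches any character; \. matches only a literal dot.
my $any_char    = ("3x14" =~ /3.14/)  ? 1 : 0;   # 1
my $literal_dot = ("3x14" =~ /3\.14/) ? 1 : 0;   # 0

print "$has_digit $has_tab $starts_with_room $ends_with_room ",
      "$whole_word $any_char $literal_dot\n";
```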

Now, let us look at the constructs that make patterns more flexible: character classes, quantifiers, alternation and grouping.

Expression	Meaning
`[abc]`	Matches any of `a`,`b`, or `c`
`[^abc]`	Matches anything other than `a`, `b`, or `c`
`[a-d]`	Matches any of the characters in the range `a-d`
`a*`	Matches `a` zero or more times
`a?`	Matches `a` zero or one time
`a+`	Matches `a` one or more times
`a|b`	Matches either `a` or `b`
`a{3}`	Matches exactly 3 of `a`
`a{3,}`	Matches 3 or more of `a`
`a{3,5}`	Matches 3, 4 or 5 of `a` (inclusive range)
`( )`	Groups the enclosed pattern and captures whatever it matches
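
A few of these constructs can be checked in Perl as follows. This is a minimal sketch with made-up test words; the patterns are anchored with ^ and $ so that the whole string has to fit the pattern, not just part of it.

```perl
use strict;
use warnings;

# Character class with a range, repeated one or more times.
my $class_ok = ("abba"  =~ /^[a-d]+$/) ? 1 : 0;   # 1
my $class_no = ("abbey" =~ /^[a-d]+$/) ? 1 : 0;   # 0: "e" and "y" are outside a-d

# Negated class: is there any character that is not a vowel?
my $non_vowel = ("aeiou" =~ /[^aeiou]/) ? 1 : 0;  # 0

# ? makes the preceding item optional.
my @spellings = grep { /^colou?r$/ } ("color", "colour", "colouur");

# {3,5} bounds the number of repetitions.
my @runs = grep { /^a{3,5}$/ } ("aa", "aaa", "aaaaa", "aaaaaa");

print "$class_ok $class_no $non_vowel\n";   # 1 0 0
print "@spellings\n";                       # color colour
print "@runs\n";                            # aaa aaaaa
```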

We are now ready to explain why \w+@\w+\.(com|org|net|in) does what it claims.

Firstly, what should an email look like? That's right, it should have a structure like user@domain.extension.

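The user and domain parts each consist of letters, digits or underscores, with at least one character, so we use \w+ for both.

The @ between them is matched literally, and so is the dot, which has to be escaped as \. because an unescaped . would match any character.

We restrict the extension to com, org, net or in by listing the alternatives inside parentheses, separated by |.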
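
To see the pattern at work, here is a small Perl sketch. Two things in it are assumptions rather than part of the pattern above: the anchors ^ and $ are added so that the whole string must be an address, and the @ is written as \@, a defensive habit in Perl, where @ can introduce an array name. The addresses are invented for illustration.

```perl
use strict;
use warnings;

# The pattern from the text, anchored, with @ escaped as \@.
my $email_re = qr/^\w+\@\w+\.(com|org|net|in)$/;

# Single quotes keep Perl from interpolating the @ in the sample data.
my @candidates = (
    'alice@example.com',
    'bob_99@site.in',
    'carol@example.edu',   # extension not in the list
    'no-at-sign.org',      # no @ at all
);

my %result;
for my $addr (@candidates) {
    $result{$addr} = ($addr =~ $email_re) ? "match" : "no match";
    print "$addr: $result{$addr}\n";
}
```

With the anchors in place, an address such as first.last@example.com is rejected, because \w does not match a dot or a hyphen. A pattern meant for real validation would need to be more permissive than this one.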

Regular Expressions in Action - Perl Implementation

Perl is probably the language best known for its use of regular expressions, and for good reason.

The =~ operator binds the string on its left to the regex operation on its right: a match test for m//, or an in-place modification for s///. The !~ operator reverses the sense of the match.

There are basically two regex operators in Perl:

Matching: m//
Substitution: s///

The purpose of the // is to enclose the regex. However, other delimiters such as {} or "" can be used instead. With the default / delimiter, the leading m of the match operator may be omitted.
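
Alternative delimiters are most useful when the pattern itself contains slashes. The following sketch uses a made-up file path to compare the two styles; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $path = "/usr/local/bin/perl";   # hypothetical sample path

# With / as the delimiter, every slash in the pattern must be escaped.
my $with_slashes = ($path =~ /^\/usr\/local\//) ? 1 : 0;

# With braces as delimiters, the slashes can be written as they are.
my $with_braces = ($path =~ m{^/usr/local/}) ? 1 : 0;

# The same choice is available for substitution.
# Copy $path into $short, then edit the copy.
(my $short = $path) =~ s{^/usr/local}{~};

print "$with_slashes $with_braces $short\n";   # 1 1 ~/bin/perl
```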

Matching

To test for a match, we bind the string on the left to an m// pattern on the right using the =~ operator.

The following sets $true to 1 if and only if $foo matches the regular expression foo:

$true = ($foo =~ m/foo/);

It is not difficult to see that just the opposite is achieved with !~:

$false = ($foo !~ m/foo/);
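
Put together as a runnable script, the two statements might look like this. The value given to $foo is an invented example. Since a failed match yields an empty string rather than 0 in Perl, the results are normalized before printing.

```perl
use strict;
use warnings;

my $foo = "seafood";   # hypothetical sample value

my $true  = ($foo =~ m/foo/);   # true: "seafood" contains "foo"
my $false = ($foo !~ m/foo/);   # false, for the same reason

# Normalize both results to 1 or 0 for display.
printf "matches: %d, does not match: %d\n",
    ($true ? 1 : 0), ($false ? 1 : 0);

# In practice the match is usually used directly as a condition.
my $found = 0;
if ($foo =~ m/foo/) {
    $found = 1;
    print "found foo in $foo\n";
}
```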

Capturing

As promised, the () can be used for capturing parts of a match. When the pattern inside a pair of parentheses matches, the matched text goes into the special variables $1, $2, and so on, numbered in the order of the opening parentheses.

Here's how one would extract the hours, minutes, seconds from a time string:

if ($time =~ /(\d\d):(\d\d):(\d\d)/) { # match hh:mm:ss format
    $hours = $1;
    $minutes = $2;
    $seconds = $3;
}

In list context, a successful match returns the list of captured values ($1, $2, $3, ...).

A simpler way to do the same would be:

my ($hours, $minutes, $seconds) = ($time =~ m/(\d+):(\d+):(\d+)/);
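
As a complete script, the list-context form could be used like this. The input string and the $total variable are assumptions added for illustration.

```perl
use strict;
use warnings;

my $time = "Backup finished at 07:45:09";   # hypothetical input

# In list context the match returns the three captured groups.
my ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

# On a failed match the list is empty and the variables stay undefined,
# so the result is checked before use.
my $total;
if (defined $hours) {
    $total = $hours * 3600 + $minutes * 60 + $seconds;
    print "$hours h $minutes m $seconds s = $total seconds\n";
}
```

The captured values are strings, so $hours here is "07"; Perl converts it to a number when it is used in arithmetic.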

Substitution

This is our favorite search and replace feature. Almost the same syntax rules apply here, except that there is an extra part between the second and third delimiters that specifies what to replace the match with.

Here is a self-explanatory piece of code:

$x = "Time to feed the cat!";
$x =~ s/cat/hacker/; # $x contains "Time to feed the hacker!"
if ($x =~ s/^(Time.*hacker)!$/$1 now!/) {
 $more_insistent = 1;
}
$y = "'quoted words'";
$y =~ s/^'(.*)'$/$1/; # strip single quotes,
# $y contains "quoted words"
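
The if statement above works because s/// returns a value as well as editing the string. A minimal sketch of that behavior follows; the second substitution and the angle brackets in the replacement are invented for illustration.

```perl
use strict;
use warnings;

my $x = "Time to feed the cat!";

# s/// edits $x in place and returns the number of substitutions made.
my $count = ($x =~ s/cat/hacker/);   # 1

# When nothing matches, the string is left alone and the result is false.
my $missed = ($x =~ s/dog/hacker/) ? 1 : 0;   # 0

# $1 in the replacement refers to the text captured by the first group.
(my $y = "'quoted words'") =~ s/^'(.*)'$/<$1>/;

print "$count $missed\n";   # 1 0
print "$x\n";               # Time to feed the hacker!
print "$y\n";               # <quoted words>
```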

Modifiers

Modifiers can be appended after the closing delimiter of a regex operation to change its matching behavior.

Here is a list of some important modifiers:

Modifier	Description
`i`	Case-insensitive matching
`s`	Allows the use of `.` to match newlines
`x`	Ignores most whitespace in the regex and allows `#` comments, for clarity
`g`	Globally find all matches

Here's how one might want to use the g modifier:

$x = "I batted 4 for 4";
$x =~ s/4/four/; # doesn't do it all:
# $x contains "I batted four for 4"

$x = "I batted 4 for 4";
$x =~ s/4/four/g; # does it all:
# $x contains "I batted four for four"
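
The other modifiers can be sketched the same way. The strings below are made up for illustration, and the last statement shows a second use of g: in list context, a match with g returns every match instead of only the first.

```perl
use strict;
use warnings;

# i: ignore case.
my $ci = ("HELLO world" =~ /hello/i) ? 1 : 0;   # 1

# s: let . match a newline as well.
my $text = "first\nsecond";
my $without_s = ($text =~ /first.second/)  ? 1 : 0;   # 0
my $with_s    = ($text =~ /first.second/s) ? 1 : 0;   # 1

# x: whitespace and comments inside the pattern are ignored.
my ($h, $m) = ("12:30" =~ /
    (\d\d)   # hours
    :
    (\d\d)   # minutes
/x);

# g in list context: collect every run of digits.
my @numbers = ("I batted 4 for 4 on day 12" =~ /\d+/g);

print "$ci $without_s $with_s ${h}:${m} @numbers\n";   # 1 0 1 12:30 4 4 12
```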
