Regular Expressions
Readers may find it helpful to try the patterns in this wiki with this app until they learn to program the matching themselves.
Regular expressions (abbreviated regex) are among the most useful tools in string processing. If you are fond of the search and replace tool in your favorite text editor or word processor, you'll love this.
Introduction
The term regular expression was originally borrowed from automata theory in theoretical computer science. Broadly, it refers to a pattern against which substrings are matched.
The comic should have already given you an idea of what regular expressions could be useful for. It should not be surprising that many programming languages, text processing tools, data validation tools and search engines make extensive use of them.
The key idea is that a regular expression is a pattern which matches a set of target strings.
\w+@\w+\.(com|org|net|in)
is a regex that matches most simple email addresses that end with `.com`, `.net`, `.org` or `.in`.
Concepts
There are many forms of regex syntax that vary with the language. Here, we will examine Perl regex syntax, since most other flavors are a variation on it.
Before we dive into the syntax, these are the kinds of things that the patterns consist of:
- Literals: They are the simplest things to match. When they appear in the pattern, we match them exactly as written, like `a` or `a1`.
- Metacharacters: They do not mean what they look like. They usually stand for something else. For example, `\d` refers to any digit.
- Vertical Bar: The `|` is a symbol of boolean OR. It gives an option to match any of the things it delimits.
- Quantifiers: They specify how many times the preceding pattern must be matched.
- Grouping and Capturing: Parentheses can be used to group parts of the regex or to capture parts for later use.
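The five building blocks above can be seen side by side in a small Perl sketch. The sample strings and the hash of patterns are made up for illustration; `qr//` is Perl's operator for storing a regex in a variable.

```perl
use strict;
use warnings;

# Hypothetical sample strings, chosen only to illustrate each concept.
my @samples = ('a1', 'cat', 'dog', 'room 42', 'aaa');

# Each pattern below exercises one kind of building block.
my %patterns = (
    literal       => qr/a1/,            # literal characters match themselves
    metacharacter => qr/\d/,            # \d matches any digit
    vertical_bar  => qr/cat|dog/,       # either alternative may match
    quantifier    => qr/a{3}/,          # three consecutive a's
    grouping      => qr/(room) (\d+)/,  # parentheses group and capture
);

# For every pattern, list the sample strings that contain a match.
for my $name (sort keys %patterns) {
    my @hits = grep { $_ =~ $patterns{$name} } @samples;
    print "$name: @hits\n";
}
```

Note that each pattern only needs to match somewhere inside the string, which is why `\d` accepts both `a1` and `room 42`.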
Syntax
Let's look at what the metacharacters do in a little more detail.
Metacharacter | Description |
^ | Start of a string |
$ | End of a string |
\t | Tab |
\n | Newline |
\r | Carriage Return |
\s | Any whitespace character |
\S | Any non-whitespace character |
\d | Any Digit |
\D | Any non-digit |
\w | Any word-character |
\W | Any non-word character |
\b | Any word boundary |
\B | Any non-word-boundary |
. | Any single character, usually barring a newline |
By the way, if you want to match a metacharacter literally, you need to use `\` to escape it. For example, `\.` would match just the `.` character.
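As a rough illustration of the table and of escaping, the sketch below tries a few metacharacters against hypothetical strings. The variable names are invented for this example.

```perl
use strict;
use warnings;

my $line = "pi is 3.14, roughly";   # hypothetical sample text

# An unescaped . matches any character, so it accepts the x here.
my $loose  = ("3x14" =~ /\d.\d\d/)  ? 1 : 0;   # 1
# \. matches only a literal period, so the x is rejected.
my $strict = ("3x14" =~ /\d\.\d\d/) ? 1 : 0;   # 0

# ^ and $ anchor the pattern to the start and end of the string.
my $starts_with_pi = ($line =~ /^pi/) ? 1 : 0;   # 1
my $ends_with_pi   = ($line =~ /pi$/) ? 1 : 0;   # 0

# \b is a word boundary: "is" inside "this" has no boundary before it.
my $inside_this = ("this" =~ /\bis\b/) ? 1 : 0;  # 0
my $standalone  = ($line  =~ /\bis\b/) ? 1 : 0;  # 1

print "$loose $strict $starts_with_pi $ends_with_pi $inside_this $standalone\n";
# prints "1 0 1 0 0 1"
```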
Now, let us look at the constructs that give patterns more flexibility.
Expression | Meaning |
[abc] | Matches any of a, b, or c |
[^abc] | Matches any character other than a, b, or c |
[a-d] | Matches any of the characters in the range a-d |
a* | Matches a zero or more times |
a? | Matches a zero or one time |
a+ | Matches a one or more times |
a|b | Matches either a or b |
a{3} | Matches exactly 3 of a |
a{3,} | Matches 3 or more of a |
a{3,5} | Matches 3, 4 or 5 of a (inclusive range) |
( ) | Groups and captures everything inside the parentheses |
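The rows of this table can be checked with a short Perl sketch. `whole_match` is a hypothetical helper written for this example, not a Perl built-in; it anchors the pattern so that the entire string has to match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the WHOLE string matches, 0 otherwise.
# (?: ) groups the pattern without capturing, so ^ and $ apply to all of it.
sub whole_match {
    my ($string, $pattern) = @_;
    return ($string =~ /^(?:$pattern)$/) ? 1 : 0;
}

print whole_match('b',      '[abc]'),  "\n";  # 1
print whole_match('d',      '[^abc]'), "\n";  # 1
print whole_match('',       'a*'),     "\n";  # 1: zero a's is allowed
print whole_match('',       'a+'),     "\n";  # 0: at least one a is required
print whole_match('aaaa',   'a{3,5}'), "\n";  # 1
print whole_match('aaaaaa', 'a{3,5}'), "\n";  # 0: six a's is too many
print whole_match('b',      'a|b'),    "\n";  # 1
```

Without the anchors, `a{3,5}` would also succeed on six a's, because a match only has to occur somewhere in the string.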
We are now ready to explain why `\w+@\w+\.(com|org|net|in)` does what it claims.
Firstly, what should an email look like? That's right, it should have a structure like `user@domain.extension`.
- The `user` and `domain` each consist of letters, digits or underscores, with at least one character. So, we use `\w+`.
- The `@` is a literal, and the dot before the extension is escaped as `\.` so that it matches only a period.
- We restrict the `extension` to `org`, `com`, `net` or `in` by using the `|`.
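The email pattern can be tried out in Perl as sketched below. The addresses are invented for illustration, and the `@` is written as `\@` here as a precaution against Perl reading it as the start of an array name; it still matches a literal `@`.

```perl
use strict;
use warnings;

# The pattern from the text, with @ escaped as a precaution in Perl source.
my $email = qr/\w+\@\w+\.(com|org|net|in)/;

# Hypothetical inputs; single quotes keep Perl from interpolating the @.
for my $candidate ('calvin@brilliant.org', 'user@example.edu') {
    if ($candidate =~ $email) {
        print "$candidate matches, extension is $1\n";
    } else {
        print "$candidate does not match\n";
    }
}

# The pattern is unanchored, so it also matches inside a longer string.
my $inside = ('see <bob@site.net> for details' =~ $email)    ? 1 : 0;  # 1
# Anchoring with ^ and $ requires the whole string to be an address.
my $whole  = ('see <bob@site.net> for details' =~ /^$email$/) ? 1 : 0;  # 0
# A limitation: ".in" also matches the beginning of ".info".
my $prefix = ('user@example.info' =~ $email) ? 1 : 0;                   # 1
print "$inside $whole $prefix\n";   # prints "1 0 1"
```

The last two checks suggest why a pattern like this is usually anchored when it is used for validation rather than for searching.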
Brilliant staff have emails like calvin@brilliant.org or support@brilliant.org, i.e. a single alphanumeric word (sometimes with underscores or periods) or name followed by the site address.
Kenji wants to build an app that only Brilliant staff can use. Which of the following regexes would be the best for him to use?
.*@brilliant\.org
\w+@brilliant\.org
\w+@brilliant.org
\w*@brilliant\.org
.+@brilliant\.org
Regular Expressions in Action - Perl Implementation
Perl is the language most famous for its use of regular expressions, and for good reason.
We use the `=~` binding operator to apply a match or a substitution to a string, depending on the operator on its right. The `!~` operator reverses the sense of the match.
There are basically two regex operators in Perl:
- Matching: `m//`
- Substitution: `s///`
The purpose of the `//` is to enclose the regex. However, other delimiters such as `{}` or `""` can be used instead, as in `m{foo}`. The leading `m` may be omitted only when the delimiters are `//`.
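Alternative delimiters are most useful when the pattern itself contains slashes. The sketch below, using a hypothetical file path, compares the two spellings.

```perl
use strict;
use warnings;

my $path = '/usr/local/bin/perl';   # hypothetical path for illustration

# With the default delimiters, every / in the pattern must be escaped.
my $with_slashes = ($path =~ /\/usr\/local\//) ? 1 : 0;
# Braces as delimiters make the same pattern easier to read.
my $with_braces  = ($path =~ m{/usr/local/})   ? 1 : 0;

# Substitution accepts alternative delimiters too.
# The copy in parentheses is modified, so $path itself stays unchanged.
(my $shorter = $path) =~ s{/usr/local}{};

print "$with_slashes $with_braces $shorter\n";   # prints "1 1 /bin/perl"
```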
Matching
To use the matching operator, we bind the string on the left to the `m//` operator on the right using `=~`.
The following sets `$true` to 1 if and only if `$foo` matches the regular expression `foo`:
    $true = ($foo =~ m/foo/);
It is not difficult to see that just the opposite is achieved with `!~`:
    $false = ($foo !~ m/foo/);
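To see the actual values these two operators produce, here is a small runnable sketch. The contents of `$foo` and the word list are assumptions made for the example.

```perl
use strict;
use warnings;

my $foo = 'seafood';   # hypothetical input that contains "foo"

my $true  = ($foo =~ m/foo/);   # 1, because "seafood" contains "foo"
my $false = ($foo !~ m/foo/);   # empty string, which Perl treats as false

print "matched\n"     if $true;
print "not negated\n" unless $false;

# In practice the match is usually used directly as a condition.
for my $word ('food', 'bar') {
    my $verdict = ($word =~ /foo/) ? 'contains' : 'lacks';
    print "$word $verdict foo\n";
}
```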
Capturing
As promised, the `()` can be used for capturing parts of a match. When the patterns inside the parentheses match, the matched text goes into the special variables `$1`, `$2`, etc., in the order of the opening parentheses.
Here's how one would extract the hours, minutes, seconds from a time string:
    if ($time =~ /(\d\d):(\d\d):(\d\d)/) {    # match hh:mm:ss format
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }
In list context, a successful match returns the list `($1, $2, $3, ...)`.
A simpler way to do the same would be
    my ($hours, $minutes, $seconds) = ($time =~ m/(\d+):(\d+):(\d+)/);
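The list-context form can be wrapped in a function that also handles input that is not a time. `parse_time` is a hypothetical helper for this sketch, and the anchors and `{2}` quantifiers are an added assumption about how strict the format should be.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (hours, minutes, seconds),
# or an empty list when the input is not in hh:mm:ss form.
sub parse_time {
    my ($time) = @_;
    # In list context the match returns its captures, or () on failure.
    return $time =~ /^(\d{2}):(\d{2}):(\d{2})$/;
}

for my $input ('12:34:56', 'noon') {
    # A list assignment used as a condition is true when the list is not empty.
    if (my ($hours, $minutes, $seconds) = parse_time($input)) {
        print "$input gives $hours h, $minutes min, $seconds s\n";
    } else {
        print "$input is not a time\n";
    }
}
```

Assigning the captures directly avoids reading `$1`, `$2` and `$3` after a failed match, when they would still hold values from an earlier successful one.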
Substitution
This is our favorite search and replace feature. Almost the same syntax rules apply here, except that there is an extra part between the second and third delimiters that gives the replacement text.
Here is a self-explanatory piece of code:
    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;                   # $x contains "Time to feed the hacker!"
    if ($x =~ s/^(Time.*hacker)!$/$1 now!/) {
        $more_insistent = 1;
    }
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;                  # strip single quotes,
                                           # $y contains "quoted words"
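Two details of the code above are worth a closer look: captures can be reused in the replacement, and `s///` returns a value that tells whether anything was replaced. The sketch below shows both with a made-up date string.

```perl
use strict;
use warnings;

my $date = '2024-03-15';   # hypothetical date in year-month-day order

# Captures from the pattern are available in the replacement as $1, $2, $3.
# Braces are used as delimiters so the / in the replacement needs no escaping.
my $changed = ($date =~ s{(\d+)-(\d+)-(\d+)}{$3/$2/$1});
print "$date\n";      # prints "15/03/2024"
print "$changed\n";   # prints "1", the number of substitutions made

# When nothing matches, the string is left alone and s/// returns false.
my $text  = 'no digits here';
my $count = ($text =~ s/\d+/#/);
print "unchanged: $text\n" unless $count;
```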
Modifiers
Modifiers can be appended after the closing delimiter of a regex operator to change its matching behavior.
Here is a list of some important modifiers:
Modifier | Description |
i | Case-insensitive matching |
s | Allows the use of . to match newlines |
x | Allows use of whitespace in the regex for clarity |
g | Globally find all matches |
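The `i`, `s` and `x` modifiers from the table can be sketched as follows. The strings and variable names are invented for the example.

```perl
use strict;
use warnings;

# i: letter case is ignored.
my $case = ('PERL' =~ /perl/i) ? 1 : 0;                    # 1

# s: . also matches a newline, so the pattern can span two lines.
my $two_lines = "first\nsecond";
my $without_s = ($two_lines =~ /first.second/)  ? 1 : 0;   # 0
my $with_s    = ($two_lines =~ /first.second/s) ? 1 : 0;   # 1

# x: whitespace and comments inside the pattern are ignored.
my ($h, $m) = ('09:30' =~ /
    (\d\d)   # hours
    :
    (\d\d)   # minutes
/x);

print "$case $without_s $with_s $h $m\n";   # prints "1 0 1 09 30"
```

Several modifiers can be combined, for example `/ix`.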
Here's how one might want to use the `g` modifier:
    $x = "I batted 4 for 4";
    $x =~ s/4/four/;    # doesn't do it all:
                        # $x contains "I batted four for 4"

    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;   # does it all:
                        # $x contains "I batted four for four"
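The `g` modifier is also useful with the matching operator. In the sketch below, which reuses the string from above, a match in list context collects every occurrence, and `s///g` reports how many replacements it made.

```perl
use strict;
use warnings;

my $x = "I batted 4 for 4";

# In list context, a match with g returns every match in the string.
my @numbers = ($x =~ /\d+/g);         # ("4", "4")

# s///g returns the number of replacements that were made.
my $replaced = ($x =~ s/4/four/g);    # 2

print scalar(@numbers), " $replaced $x\n";
# prints "2 2 I batted four for four"
```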