EIW Fall 2003 Lecture Notes

Perl Pattern Matching and Regular Expressions


Pattern Matching

Perl includes a number of pattern matching operators that can be used to do all kinds of fancy string manipulation operations. All these pattern matching operators share a common "language" for expressing patterns. Patterns can be as simple as a fixed sequence of characters, for example the string "foo" is a pattern. Patterns can be more complex, for example the pattern an "a" followed by either a "c" or a "d".

In the simplest case, we can use pattern matching to determine whether or not a pattern matches a string. A pattern matches a string if the string contains any substring that can be described by the pattern. For example, using the fixed pattern "foo", any string that contains the substring "foo" will be matched by the pattern. The string "My food is cold" would match the pattern "foo" since it contains the substring "foo". The string "info of interest" does not match the pattern "foo" - although it contains the substring "fo o", that is not a match, the match must be exact.

The result of a simple pattern match operation is TRUE or FALSE (in the Perl way, anything other than 0 or "" is FALSE), TRUE if the pattern matches the string, FALSE if it does not. We can also do things like replace the part of a string that matches a pattern with something new, or perhaps remove any part of a string that matches a pattern - these operations are all possible in Perl.

Regular Expressions

The "language" used to express patterns is called regular expressions - we often refer to a pattern as "a regular expression". In Perl, regular expressions are themselves strings, so we can actually build new regular expressions at run time and then apply them (attempt to match strings with them).

The simplest kind of regular expression is a simple string. The pattern "foo" is such an expression. As stated before, the pattern "foo" would match any string that contains an "f" followed immediately by an "o", followed immediately by another "o". Nothing can come between these characters, and we don't care what else is in the string.

The Perl match operator

The basic pattern matching operator is called the match operator. Although it is possible to use the match operator with any scalar variable, it is often used with the default Perl scalar variable $_. The following statement attempts to match the pattern "foo" against the string in $_:

/foo/

The slashes delimit the regular expression - in this case the regular expression is the string "foo". Notice that the regular expression does not include any quotes, the slashes server as the delimiters (so we can tell whether or not there are any blanks at the beginning or end - in our example there are none).

The match operator returns either TRUE or FALSE, so we typically use it as a conditional expression in an if or loop. Here is a sample program that uses the match operator to filter out all the lines of input that do not contain the substring "foo":


# read one line at a time in to $_
while (<>) {

    # see if the string in $_ contains "foo"
    if (/foo/) {
		
	# the string matches "foo", so print it
	print;		# prints the default variable $_
    }
}

If we put the above program in the file "foofinder.pl" and told Perl to use the program itself as input (searching the text of the program for lines that contain the string "foo"), we would see the following:

 perl foofinder.pl foofinder.pl
    # see if the string contains "foo"
    if (/foo/) {
	# the string matches "foo", so print it

We could also create a program that prints out all lines that do not contain the string "foo":

while (<>) {
    unless (/foo/) {
	print;
    }
}

Defining Patterns

The simplest part of a pattern is a single character. The example pattern we have already used ("foo") is made up of three single character patterns, each specifies a single character (the "f", the "o" and the other "o"). By putting these individual characters in a sequence we are telling Perl that a match must contain a sequence of characters that match the corresponding pattern characters, so the string "food" matches, but the string "f o o" does not since it has spaces in it.There are a few other ways to match a single character:

Character Class Shortcuts

You can define ranges of characters inside a character class definition (or negated character class) like this:

[a-z] lower case letters
[0-9] digits
[a-zA-Z] lower or upper case letters

When found inside a character class definition we now must include "-" in the growing list of characters that must be "escaped" when we want them to be used literally: "\-".

Some character classes are very common, so perl provides shortcuts to save you some typing. For example, you can use the shortcut "\d" instead of [0-9] (d stands for "digit"). Here is a complete list of the predefined character classes available in perl:

ShortcutCharacter ClassDescription
\d [0-9] digits
\w [a-zA-Z0-9_] word characters (alphanumerics and "_")
\s [ \r\n\t\f] space (whitespace)
\D [^0-9] NOT digits
\W [^a-zA-Z0-9_] NOT word characters (alphanumerics and "_")
\S [^ \r\n\t\f] NOT space (whitespace)

Examples using predefined character class shortcuts:

PatternMeaningWill MatchWon't Match
/\s/ will match any string that contains whitespace "A space"
"Lots   of   space"
"NoWhiteSpace"
"i_do_not_match"
/\d\w/ will match any string that has a digit followed by a letter or underscore. "Water is H2O"
"BMW 320i"
"Mazda RX7"
"1,834,621 grasshoppers"
/\sEIW\s/ will match any string that has "EIW" with whitespace on each side. "Schedule: EIW 10AM"
"an EIW kinda day"
"EIW does not have leading whitespace"
"The KEIWFAR frazzled"

Exercises:

answers are here

Construct regular expressions (match operators) for the following:

Grouping Patterns

There are also ways to build regular expressions that are sequences of groups of characters (so far we just looked at sequences of single characters). The asterisk "*" can follow any single character pattern (a character, dot, character class or negated character class) and means "zero or more of the previous pattern". For example, the regular expression /a*/ means any sequence of zero or more "a"s. This would match "a", "aaaaa" or "aaaaaaaaaaa". (it would actually match any string, since any string has zero or more "a"s!). The question mark "?" means "zero or one" of the previous character pattern and the plus sign "+" means "one or more". Some examples:

PatternMeaningWill MatchWon't Match
/abc*/ will match any string that has an "a" followed by a "b" followed by zero or more "c"s "abxxx"
"abccccc"
"ccccc"
"accc"
/a+[01]/ one or more "a"s followed by either a "0" or a "1". "alpha1"
"aaaaaaa0"
"alpha2"
"0"
/<.*>/ anything inside angle brackets "<HTML>"
"<H3>Hi Dave</H3>"
"<>"
"<No final bracket"
">>>>"
/A[0-9]?B/ an "A" followed by a "B" with possibly a single digit in the middle "A1B"
"AB"
"a1b"
"A123B"

Later we will worry about what part of a string is matched by a regular expression (not just that a string matches a pattern), and then it will become clear when and why we do something like /a*/ (which would match any string!). For now we will just worry about the grouping operator syntax.

Grouping with Parentheses

We can put parentheses around any part of a pattern and apply "*", "?" and "+" to everything inside the parentheses. For examples, the pattern /(a[0-9])+/ will match any part of a string that contains any number of two character sequences, where each two character sequence is an "a" followed by a digit. The string "a1a2a3a4" is matched by the pattern. Examples:

PatternMeaningWill MatchWon't Match
/a(bc)*d/ will match any string that has an "a" followed by zero or more "bc"s followed by a "d". "ad"
"abcbcd"
"abcxd"
"bcbcd"
/([Ww]indows ?XP)+/ One or more occurrences of the phrase "Windows XP" with lower or upper case leading "w" and zero or one spaces between the "Windows" and the "XP". "Windows XP"
"windowsXP"
"WindowsXPwindows XP"
"Windows98"
"XP"

Exercises:

answers are here

Develop regular expressions for the following:

Grouping and Memory

When doing a match operation, Perl can be told to remember the part of a string that matches each group in parentheses in the regular expression. The match operator will return a list of these substrings, so you can capture them by doing something like this:

($firstword,$secondword) = /(\w+)\s+(\w+)/

This results in the variable $firstword being set to whatever part of $_ was matches by the first (\w+) (the red one), and $secondword will be the part of $_ that matched the second word. Now we can think about using regular expressions to do more than simply matching a pattern to a string, we also extract parts of the string.

Exercise

What does the following do?

($proto,$host,$uri) = /([^:]+):\/\/([^\/]+)\/(.*)/;

Perl also allows you to reference the "memorized" chunks of the string being matched inside the regular expression itself. Within a regular expression a \1 will match whatever part of the string was matched by the first parenthesized group in the regular expression. For example, the following regular expression will match any line that contains a substring containing non-blank characters that is repeated (the same substring appears twice in the sentence):

/(\w+).*\1/

The \1 is replaced by whatever substring was matched by the (\w+).

The following regular expression will match any string that contains a "0" followed by any substring (could be empty substring) followed by a "1" followed by the same substring that followed the "0".

/0(.*)1\1/

The string 0Hi Dave1Hi Dave would match, but the string 0abc1ab would not.

Exercises:

answers are here

Develop regular expressions for the following:

Alternation

You can use the "|" alternation symbol in a regular expression to mean "either alternative". For example the expression a|b matches either an "a" or a "b", you could also express this as [ab]. You can also do things like this:

/Dave|Dad|baldy|dummy|cookie monster/

which would match any string that contains any of "Dave", "Dad", "baldy", "dummy" or "cookie monster".

Anchoring

You can use anchors in regular expressions to force matches only in specific places in a string. The regular expression /Joe/ without any anchors will match any string that contains the substring "Joe". But what if you only want to match strings that begin with "Joe", or strings in which "Joe" is surrounded by whitespace?

Perl provides four type of anchors to ensure that patterns match specific parts of a string:

Anchoring Examples:

PatternMeaningWill MatchWon't Match
/^[A-Z]/ will match any string that starts with a capital letter. "I am a funny guy."
"ABCDEF"
"John flunked the test"
"<H2>Hi Dave</H2>"
"something is wrong"
/^(\w+)\b.*\b\1$/ Any string that starts and ends with the same word. "Hi blah Hi"
"Sam I am Sam"
"Hello, Hello, Hello"
"hellohello"
"there is a the"
"why ask whynot"

Substitutions

Perl has a substitute operator that allows you to substitute some value for any part of a string that matches a regular expression. The general form of the substitute operator is:

s/someregexp/newvalue/modifiers;

This operator operates on (possibly modifies) the default scalar variable $_. The regular expression is matched against $_, and whatever part of $_ matches the regular expression is removed and replace with newvalue. Here is a simple example:

s/Dave/Joe/;

This would replace the first occurrence of "Dave" with the string "Joe", so if $_ was originally the string "Dave is a fool", after the above substitute command the string would be "Joe is a fool". Some examples:

Original $_commandMeaningResult
"Dave likes cookies too much" s/cookies/pizza/; replace first "cookies" with "pizza" "Dave likes pizza too much"
"Chocolate Chocolate Chip" s/Chocolate/Mint/; replace first "Chocolate" with "Mint" "Mint Chocolate Chip"
"Vitamin B9 is great" s/[A-Z][0-9]/foo/; replace first cap. letter followed by digit by "foo" "Vitamin foo is great"
"H20 H20 H20 C3PO" s/[0-9]//; remove first digit (replace with nothing!) "HO H2O H2O C3PO"

If the regular expression is never matched by any part of the string in $_ no substitution happens ($_ is unchanged).

There are some modifiers you can specify that change how the substitute operator works. For example, if you include the modifier "g" the substitution will happen to all parts of the string that match the regular expression - not just the first one. ("g" stands for global).So s/Joe/foo/g; applied to the string "Joe is a Joe Joe Joe" will result in "foo is a foo foo foo".

The "i" modifier tells Perl to ignore case when matching, so that an "a" would match a "A", etc. "i" stands for (case) "insensitive". You can include both modifiers in a substitute expression, for example:

s/mit/RPI/gi;
would replace "mit" or "MIT" or "mIT" or ..., with "RPI". The string:
"MIT is a greate place to clean your mit, but you need to
wear mittens or you will be committed"
would become:
"RPI is a greate place to clean your RPI, but you need to
wear RPI or you will be comRPIted"
.

Writing a Perl program that reads from standard input (or from a file specified on the command line), makes substitutions on each line, and prints out the result is pretty simple. Here is an example that replaces all numbers that are not part of a word (digits surrounded by whitespace) by the word "number", and also replaces the word "Honorable" with "Bald":

# read each line
while (<>) {
	s/\b([0-9]+)\b/number/g;
	s/\bHonorable\b/Bald/g;
	print;
}

Exercises

answers are here

Variable Interpolation in Regular Expressions

You can do variable interpolation inside regular expressions or in substitution text, so that all or part of a pattern can be built at run time. For example, we could write a program that personalizes form letters by replacing various strings in a document with the values of some variables (assume the variables have been retrieved from a database of customers):

# Form letter personalization
# ... some code here that sets the following variables:
#  $fname is the customer's first name
#  $lname is the customer's last name
#  $title is the customer's title
#  $email is the customer's email address

while (<>) {	# read each line
	s/CUSTOMER NAME/$title $fname $lastname/;
	s/FNAME/$fname/;
	s/EMAIL/$email/;

	print;
}

Assuming we have a customer Mr. Joe Jones with email address joe@smallbiz.com and the following letter fed as input to the Perl program:

Dear CUSTOMER NAME,

We have received your complaint and would like to assure you that
we take such matters seriously. I have personally spoken to the manager
of our Customer Satisfaction Unit (Cus-Sat-U) and he will be handling your
problem as soon as possible. 

FNAME, I would like to apologize for any inconvienence you have
suffered, please feel free to contact any of our Customer Service
staff if you have questions or furthur problems. You can contact our
Customer Service desk at 1-800-BUZ-ZOFF on any Monday in March between
the hours of 10AM EST and 6:59AM PST.

Our records show your email address as EMAIL. Please contact us
immediately if this is correct.

The output of the program would look like this:

Dear Mr. Joe Jones,

We have received your complaint and would like to assure you that
we take such matters seriously. I have personally spoken to the manager
of our Customer Satisfaction Unit (Cus-Sat-U) and he will be handling your
problem as soon as possible. 

Joe, I would like to apologize for any inconvienence you have
suffered, please feel free to contact any of our Customer Service
staff if you have questions or furthur problems. You can contact our
Customer Service desk at 1-800-BUZ-ZOFF on any Monday in March between
the hours of 10AM EST and 6:59AM PST.

Our records show your email address as joe@smallbiz.com. Please
contact us immediately if this is correct.

We can also build regular expressions at runtime, for example the following program will replace all "1"s in the first line with the word "foo", and all "2"s in the second line, "3"s in the third line, etc.


$linenumber = 1;
while (<>) {
	s/$linenumber/foo/g;
	print;
	$linenumber++;   #shorthand for $linenumber = $linenumber+1
}

The =~ Operator (matching with your own variables)

The match and substitute operators use the default perl variable $_, but sometimes you want to tell Perl to use some other variable. The =~ operator expects a scalar variable on the left and a match expression or substitute expression on the right. Perl applies the match or substitute to the variable you've specified instead of to $_. In the case of a substitute command the modified value is stored in the variable specified as well. For example, the following code:

if ($foo =~ /gumdrop/) {
   print "Gumdrop found\n";
}

uses the string in $foo to match against the regular expression /gumdrop/ instead of using $_. Here is an example using the substitute command:

$line =~ s/<H1>/<H2>/g;

This Perl code looks for "<H1>" in the string $line and replaces each "<H1>" found with "<H2>".

$line = "<H1>The effect of pop culture on worms</H1>";
$line =~ s/<H1>/<H2>/g;

# now $line would be "<H2>The effect of pop culture on worms</H1>"

Join and Split

The Perl split operator uses a regular expression to split a string in to a sequence of substrings (which are returned as an array). For example, the following expression will split the string in "$_" on whitespace, returning an array of all the non-whitespace tokens:

 @tokens = split(\s+); 

If $_ holds the string "Sometimes I dunk cookies in melted butter" then after the above split command the array @tokens would hold the value: ("Sometimes", "I", "dunk", "cookies", "in", "melted", "butter").

The general form of the split command is:

split(regular_expression,string_to_split);

If you only give split one argument it assumes you want to split the default Perl variable $_. If you don't give the split any arguments it assumes you want to split on whitespace, the default regular expression used is /\s+/. The following program counts the number of words in each input line and prints this information out:

while (<>) {
	chomp;			# get rid of newline
	@words = split;		# split on whitespace
	$numwords = @words;	# get number of elements in @words
	print "The string \"$_\" has $numwords words\n";
}

The join command creates a string from an array (sort of the opposite of split). The general form of join is:

join(glue_string,an_array);

join returns a single string that results from inserting the glue_string between each pair of array elements. The glue_string is just a string (not a regular expression). An example that creates HTML table format grade report from a student database in tab delimited format. Assume that input is a file that contains student records in the following format:

Student Name\tTest1 grade\tTest2 grade\tHomework\n

that is, each line contains a name and three grades with tabs separating individual fields. The following Perl program will convert this format to an HTML table (using split and join) and will calculate the student average.

print "<TABLE>\n";                  # start the HTML table

# add some headings
print "<TR><TH>Name</TH><TH>Test1</TH><TH>Test2</TH>";
print "<TH>HW</TH><TH>Ave</TH></TR>\n";

while (<>) {
	chomp;			    # get rid of newline
	@studentrec = split(/\t/);  # split in to fields
	
	# now $studentrec[0] is the student name
	#     $studentrec[1] is the test 1 grade
	#     $studentrec[2] is the test 2 grade
	#     $studentrec[3] is the homework grade

	print "<TR><TD>";           # start the row for this student

	print join("</TD><TD>",@studentrec);  # print fields in table format

	# calculate student average
	$ave = ($studentrec[1] + $studentrec[2] + $studentrec[3])/3;

	print "</TD><TD>$ave</TD></TR>\n";       # finish up the row
}
print "</TABLE>\n";                 # finish up the table

Exercise

Write a Perl program that creates a student record in the form used as input to the above program. Each line should contain a student name, followed by a tab (no tabs in the name are allowed), followed by a test1 grade, followed by a tab, etc. A sample output line is:

Joe Smith\t88\t92\t77\n

Your program will accept input in the form of lines that contain name, value pairs with an equal sign (=) between the name and the value. Here is a sample input file:

name = Joe Student
test1 = 86
test2 = 77
homework = 33
name = Jane Smith
test1 = 98
test2 = 35
homework = 85

for this input, the output should be this (\t is a tab):

Joe Student\t86\t77\t33
Jane Smith\t98\t35\t85

Thought Exercise: what happens here?:

while (<>) {
	s/a(.*)a/\1/;
	print;
}

  Try the following input string: "a00000a11111a"

Remember: within a regular expression, \1 is replaced by the part of the string that matched the part of the regular expression in parentheses (in this case everything that is matched by (.*)).

There are a few possibilities:

  1. The output will be "0000011111". Nope! For starters, we did not include the modifier "g", so Perl only does the substitution once. Even if we did use this: s/a(.*)a/\1/g;, once Perl did the first substitution there would only be one "a" left (and the pattern couldn't match again).

  2. The output will be "0000011111a". In this case Perl matched the first and second "a"s with the pattern and removed them, but left the last "a" alone (only matched once). NO AGAIN! This sounds reasonable, but it isn't what Perl does. Perl will make as large a match as it can - in this case it will match the "a"s in the pattern with the leftmost and rightmost "a"s in the string.

  3. The output will be "00000a11111". YES! Perl matches the leftmost and rightmost "a"s to the pattern and removes them, leaving everything in the middle. Perl tries to match as much of the string as possible with the (.*), in this case it can match (.*) to "00000a11111" so it does so.

Important Rule of Matching

Perl tries to match as much as it can to any pattern or sub-pattern. It does not simply scan the string from left to right looking for the first possible match to the entire pattern.

Exercise

Come up with a substitute command (regular expression) that will remove the < and > from an HTML tag. For example, given the input string (in $_) "This is <B>BOLD</B>" the substitute command will leave us with "B" since the first tag is a B tag.

You need to make sure that you don't end up with "B>BOLD</B"!

Harder Exercise

Come up with a substitute command that will remove the <B> and the </B> and leave whatever is between these adjacent tags. In this case, given the string "This is <B>BOLD</B>" the result of your substitute command should be "This is BOLD".

However, what if the string looks like this? : "This is <B>BOLD</B> and so is <B>THIS</B>" . I'd still like the result of the substitute command to deal with only the first set of tags: This is BOLD and so is <B>THIS</B> (whatever is in the first pair of <B> </B> tags).

First one that can come up with a substitute command to do this wins a cookie.

If you do something like this: s/<B>(.*)<\/B>/\1/; Perl will match the first (leftmost) <B> and the rightmost </B> in the string, and you will end up with this:

"BOLD</B> and so is <B>THIS"

What we need

To do the above exercise we need some way to tell Perl that we want it to stop matching as soon as it finds a </B> instead of looking through the rest of the string for a larger (more of the string) match. When we asked Perl to match the inside of an HTML tag things were easy - we could tell Perl to match everything that is not a > as in the following:

s/<([^>]*)>/\1/;

The stuff inside the parentheses matches only characters that are not ">", so Perl will stop as soon as it sees a ">". This is possible only because the end of the match is marked by a single character - in our case we want to stop as soon as a "</B>" is found and there is no way to write a regular expression that matches not"</B>".

Telling Perl to match the take the first match

Perl will match as much of the input string as possible whenever it sees something like .* or .+ (or anything that has a * or + at the end). We can, however, tell Perl to stop when it finds the first match by putting a ? right after the * or +.

Let's go back to the first exercise and observe the effect of putting a ? after a *

Exercise: what happens here?:

while (<>) {
	s/a(.*?)a/\1/;
	print;
}

  Try the following input string: "a00000a11111a"

If you run this you will find out that with the ? after the *, Perl will stop at the first match, so it will remove the first and second "a"s and leave the last one.

Solution to Harder Exercise

We can do the same thing when trying to match everything between adjacent <B> and </B> tags:

s/<B>(.*?)</B>/\1/;

The above substitute command will extract everything between the first <B> tag and the next </B> tag (remember that without the ? it will match everything up through the last </B> tag).

Exercises