EIW Fall 2000 Lecture Notes - Pattern Matching and Regular Expressions


Pattern Matching

Perl includes a number of pattern matching operators that can be used to do all kinds of fancy string manipulation operations. All these pattern matching operators share a common "language" for expressing patterns. Patterns can be as simple as a fixed sequence of characters, for example the string "foo" is a pattern. Patterns can be more complex, for example the pattern an "a" followed by either a "c" or a "d".

In the simplest case, we can use pattern matching to determine whether or not a pattern matches a string. A pattern matches a string if the string contains any substring that can be described by the pattern. For example, using the fixed pattern "foo", any string that contains the subtring "foo" will be matched by the pattern. The string "My food is cold" would match the pattern "foo" since it contains the substring "foo". The string "info of interest" does not match the pattern "foo" - although it contains the substring "fo o", that is not a match, the match must be exact.

The result of a simple pattern match operation is TRUE or FALSE, TRUE if the pattern matches the string, FALSE if it does not. We can also do things like replace the part of a string that matches a pattern with something new, or perhaps remove any part of a string that matches a pattern - these operations are all possible in perl.

Regular Expressions

The "language" used to express patterns is called regular expressions - we often refer to a single pattern as "a regular expression". In perl, regular expressions are themselves strings, so we can actually build new regular expressions at run time and then apply them (attempt to match strings with them).

The simplest kind of regular expression is a simple string. The pattern "foo" is such an expression. As stated before, the pattern "foo" would match any string that contains an "f" followed immediately by an "o", followed immediately by another "o". Nothing can come between these characters, and we don't care what else is in the string.

The perl match operator

The basic pattern matching operator is called the match operator. Although it is possible to use the match operator with any scalar variable, it is often used with the default perl scalar variable $_. The following statement attempts to match the pattern "foo" against the string in $_:
/foo/
The slashes delimit the regular expression - in this case the regular expression is the string "foo". Notice that the regular expression does not include any quotes, the slashes server as the delimiters (so we can tell whether or not there are any blanks at the beginning or end - in our example there are none).

The match operator returns either TRUE or FALSE, so we typically use it as a conditional expression in an if or loop. Here is a sample program that uses the match operator to filter out all the lines of input that do not contain the substring "foo":


# read one line at a time in to $_
while (<>) {

    # see if the string in $_ contains "foo"
    if (/foo/) {
		
	# the string matches "foo", so print it
	print;		# prints the default variable $_
    }
}

If we put the above program in the file "foofinder.pl" and told perl to use the program itself as input (searching the text of the program for lines that contain the string "foo"), we would see the following:

 perl foofinder.pl foofinder.pl
    # see if the string contains "foo"
    if (/foo/) {
	# the string matches "foo", so print it

We could also create a program that prints out all lines that do not contain the string "foo":

while (<>) {
    unless (/foo/) {
	print;
    }
}

Defining Patterns

The simplest part of a pattern is a single character. The example pattern we have already used ("foo") is made up of three single character patterns, each specifies a single character (the "f", the "o" and the other "o"). By putting these individual characters in a sequence we are telling perl that a match must contain a sequence of characters that match the corresponding pattern characters, so the string "food" matches, but the string "f o o" does not since it has spaces in it.There are a few other ways to match a single character:

Character Class Shortcuts

You can define ranges of characters inside a character class definition (or negated character class) like this:

[a-z] lower case letters
[0-9] digits
[a-zA-Z] lower or upper case letters

When found inside a character class definition we now must include "-" in the growing list of characters that must be "escaped" when we want them to be used literally: "\-".

Some character classes are very common, so perl provides shortcuts to save you some typing. For example, you can use the shortcut "\d" instead of [0-9] (d stands for "digit"). Here is a complete list of the predefined character classes available in perl:

ShortcutCharacter ClassDescription
\d [0-9] digits
\w [a-zA-Z0-9_] word characters (alphanumerics and "_")
\s [ \r\n\t\f] space (whitespace)
\D [^0-9] NOT digits
\W [^a-zA-Z0-9_] NOTword characters (alphanumerics and "_")
\S [^ \r\n\t\f] NOT space (whitespace)

Examples using predefined character class shortcuts:

PatternMeaningWill MatchWon't Match
/\s/ will match any string that contains whitespace "A space"
"here is a         tab"
"NoWhiteSpace"
"i_do_not_match"
/\d\w/ will match any string that has a digit followed by a letter or underscore. "Water is H2O"
"BMW 320i"
"Mazda RX7"
"1,834,621 grasshoppers"
/\sEIW\s/ will match any string that has "EIW" with whitespace on each side. "Schedule: EIW 10AM"
"an EIW kinda day"
"EIW does not have leading whitespace"
"The KEIWFAR frazzled"

Exercises:

answers are here

Construct regular expressions (match operators) for the following:

Grouping Patterns

There are also ways to build regular expressions that are sequences of groups of characters (so far we just looked at sequences of single characters). The asterisk "*" can follow any single character pattern (a character, dot, character class or negated character class) an means "zero or more of the previous pattern". For example, the regular expression /a*/ means any sequence of zero or more "a"s. This would match "a", "aaaaa" or "aaaaaaaaaaa". The question mark "?" means "zero or one" of the previous character pattern and the plus sign "+" means "one or more". Some examples:

PatternMeaningWill MatchWon't Match
/abc*/ will match any string that has an "a" followed by a "b" followed by zero or more "c"s "abxxx"
"abccccc"
"ccccc"
"accc"
/a+[01]/ one or more "a"s followed by either a "0" or a "1". "alpha1"
"aaaaaaa0"
"alpha2"
"0"
/<.*>/ anything inside angle brackets "<HTML>"
"<H3>Hi Dave</H3>"
"<>"
"<No final bracket"
">>>>"
/A[0-9]?B/ an "A" followed by a "B" with possibly a single digit in the middle "A1B"
"AB"
"a1b"
"A123B"

Later we will worry about what part of a string is matched by a regular expression (not just that a string matches a pattern), and then it will become clear when and why we do something like /a*/ (which would match any string!). For now we will just worry about the grouping operator syntax.

Grouping with Parentheses

We can put parentheses around any part of a pattern and apply "*", "?" and "+" to everything inside the parentheses. For examples, the pattern /(a[0-9])+/ will match any part of a string that contains any number of two character sequences, where each two character sequence is an "a" followed by a digit. The string "a1a2a3a4" is matched by the pattern. Examples:

PatternMeaningWill MatchWon't Match
/a(bc)*d/ will match any string that has an "a" followed by zero or more "bc"s followed by a "d". "ad"
"abcbcd"
"abcxd"
"bcbcd"
/([Ww]indows ?95)+/ One or more occurrances of the phrase "Windows 95" with lower or upper case leading "w" and zero or one spaces between the "Windows" and the "95". "Windows 95"
"windows95"
"Windows95windows 95"
"Windows98"
"95"

Exercises:

answers are here

Develop regular expressions for the following:

Grouping and Memory

When doing a match operation, perl can be told to remember the part of a string that matches each group in parentheses in the regular expression. The match operator will return an array of these substrings, so you can capture them by doing something like this:
($firstword,$secondword) = /(\w+)\s+(\w+)/
This results in the variable $firstword being set to whatever part of $_ was matches by the first (\w+) (the red one), and $secondword will be the part of $_ that matched the second word. Now we can think about using regular expressions to do more than simply matching a pattern to a string, we also extract parts of the string.

Exercise

What does the following do?

($proto,$host,$uri) = /([^:]+):\/\/([^\/]+)\/(.*)/;

Perl also allows you to reference the "memorized" chunks of the string being matched inside the regular expression itself. Within a regular expression a \1 will match whatever part of the string was matched by the first parenthesized group in the regular expression. For example, the following regular expression will match any line that contains a substring containing non-blank characters that is repeated (the same substring appears twice in the sentence):

/(\w+).*\1/
The \1 is replaced by whatever substring was matched by the (\w+).

The following regular expression will match any string that contains a "0" followed by any substring (could be empty substring) followed by a "1" followed by the same substring that followed the "0". /0(.*)1\1/

The string 0Hi Dave1Hi Dave would match, but the string 0abc1ab would not.

Exercises:

answers are here

Develop regular expressions for the following:

Alternation

You can use the "|" alternation symbol in a regular expression to mean "either alternative". For example the expresion a|b matches either an "a" or a "b", you could also express this as [ab]. You can also do things like this:
/Dave|Dad|baldy|dummy|cookie monster/
which would match any string that contains any of "Dave", "Dad", "baldy", "dummy" or "cookie monster".

Anchoring

You can use anchors in regular expressions to force matches only in specific places in a string. The regular expression /Joe/ without any anchors will match any string that contains the substring "Joe", but what if you only want to match strings that begin with "Joe", or strings in which "Joe" is surrounded by whitespace.

Perl provides four type of anchors to ensure that patterns match specific parts of a string:

Anchoring Examples:

PatternMeaningWill MatchWon't Match
/^[A-Z]/ will match any string that starts with a capital letter. "I am a funny guy."
"ABCDEF"
"John flunked the test"
"<H2>Hi Dave</H2>"
"something is wrong"
/^(\w+)\b.*\b\1$/ Any string that starts and ends with the same word. "Hi blah Hi"
"Sam I am Sam"
"Hello, Hello, Hello"
"hellohello"
"there is a the"
"why ask whynot"

Substitutions

Perl has a substitute operator that allows you to substitute some value for any part of a string that matches a regular expression. The general form of the substitute operator is:
s/someregexp/newvalue/modifiers
This operator operates on (possibly modifies) the default scalar variable $_. The regular expression is matched against $_ whatever part of $_ matches the regular expression is removed and replace with newvalue. Here is a simple example:
s/Dave/Joe/;
This would replace the first occurrance of "Dave" with the string "Joe", so if $_ was originally the string "Dave is a fool", after the above substitute command the string would be "Joe is a fool". Some examples:

Original $_commandMeaningResult
"Dave likes cookies too much" s/cookies/pizza/; replace first "cookies" with "pizza" "Dave likes pizza too much"
"Chocolate Chocolate Chip" s/Chocolate/Mint/; replace first "Chocolate" with "Mint" "Mint Chocolate Chip"
"Vitamin B9 is great" s/[A-Z][0-9]/foo/; replace first cap. letter follwed by digit by "foo" "Vitamin foo is great"
"H20 H20 H20 C3PO" s/[0-9]//; remove first digit (replace with nothing!) "HO H2O H2O C3PO"

If the regular expression is never matched by any part of the string in $_ not substitution happens ($_ is unchanged).

There are some modifiers you can specify that change how the substitute operator works. For example, if you include the modifier "g" the substitution will happen to all parts of the string that match the regular expression - not just the first one. ("g" stands for global).So s/Joe/foo/g; applied to the string "Joe is a Joe Joe Joe" will result in "foo is a foo foo foo".

The "i" modifier tells perl to ignore case when matching, so that an "a" would match a "A", etc. "i" stands for (case) "insensitive". You can include both modifiers in a substitute expression, for example:

s/mit/RPI/gi;
would replace "mit" or "MIT" or "mIT" or ..., with "RPI". The string:
"MIT is a greate place to clean your mit, but you need to
wear mittens or you will be committed"
would become:
"RPI is a greate place to clean your RPI, but you need to
wear RPI or you will be comRPIted"
.

Writing a perl program that reads from standard input (or from a file specified on the command line),makes substitutions one each line, and prints out the result is pretty simple. Here is an example that replaces all numbers that are not part of a word (digits surrounded by whitespace) by the word "number", and also replaces the word "Honorable" with "Bald":

# read each line
while (<>) {
	s/\b([0-9]+)\b/number/g;
	s/\bHonorable\b/Bald/g;
	print;
}

Exercises

answers are here

Variable Interpolation in Regular Expressions

You can do variable interpolation inside regular expressions or in substitution text, so that all or part of a pattern can be built at run time. For example, we could write a program that personalizes form letters by replacing various strings in a document with the values of some variables (assume the variables have been retrieved from a database of customers):

# Form letter personalization
# ... some code here that sets the following variables:
#  $fname is the customer's first name
#  $lname is the customer's last name
#  $title is the customer's title
#  $email is the customer's email address

while (<>) {	# read each line
	s/CUSTOMER NAME/$title $fname $lastname/;
	s/FNAME/$fname/;
	s/EMAIL/$email/;

	print;
}

Assuming we have a customer Mr. Joe Jones with email address joe@smallbiz.com and the following letter fed as input to the perl program:

Dear CUSTOMER NAME,

We have received your complaint and would like to assure you that
we take such matters seriously. I have personally spoken to the manager
of our Customer Satisfaction Unit (Cus-Sat-U) and he will be handling your
problem as soon as possible. 

FNAME, I would like to apologize for any inconvienence you have
suffered, please feel free to contact any of our Customer Service
staff if you have questions or further problems. You can contact our
Customer Service desk at 1-800-BUZ-ZOFF on any Monday in March between
the hours of 10AM EST and 6:59AM PST.

Our records show your email address as EMAIL. Please contact us
immediately if this is correct.

The output of the program would look like this:

Dear Mr. Joe Jones,

We have received your complaint and would like to assure you that
we take such matters seriously. I have personally spoken to the manager
of our Customer Satisfaction Unit (Cus-Sat-U) and he will be handling your
problem as soon as possible. 

Joe, I would like to apologize for any inconvienence you have
suffered, please feel free to contact any of our Customer Service
staff if you have questions or further problems. You can contact our
Customer Service desk at 1-800-BUZ-ZOFF on any Monday in March between
the hours of 10AM EST and 6:59AM PST.

Our records show your email address as joe@smallbiz.com. Please
contact us immediately if this is correct.

We can also build regular expressions at runtime, for example the following program will replace all "1"s in the first line with the word "foo", and all "2"s in the second line, "3"s in the third line, etc.


$linenumber = 1;
while (<>) {
	s/$linenumber/foo/g;
	print;
	$linenumber++;   #shorthand for $linenumber = $linenumber+1
}

The =~ Operator (matching with your own variables)

The match and substitute operators use the default perl variable $_, sometimes you want to tell perl to use some other variable. The =~ operator expects a scalar variable on the left and a match expression or substitute expression on the right. Perl applies the match or substitute to the variable you've specified instead of to $_, in the case of a substitute command the modified value is stored in the variable specified as well. For example, the following code:
if ($foo =~ /gumdrop/) {
   print "Gumdrop found\n";
}
uses the string in $foo to match against the regular expression /gumdrop/ instead of using $_. Here is an example using the substitute command:
$line =~ s/<H1>/<H2>/g;
This perl code looks for "<H1>" in the string $line and replaces each "<H1>" found with "<H2>".

$line = "<H1>The effect of pop culture on worms</H1>";
$line =~ s/<H1>/<H2>/g;

# now $line would be "<H2>The effect of pop culture on worms</H1>"

Join and Split

The perl split operator uses a regular expression to split a string in to a sequence of substrings (which are returned as an array). For example, the following expression will split the string in "$_" on whitespace, returning an array of all the non-whitespace tokens:
 @tokens = split(\s+); 
If $_ holds the string "Sometimes I dunk cookies in melted butter" then after the above split command the array @tokens would hold the value: ("Sometimes", "I", "dunk", "cookies", "in", "melted", "butter").

The general form of the split command is:

split(regular_expression,string_to_split);
If you only give split one argument it assumes you want to split the default perl variable $_. If you don't give the split any arguments it assumes you want to split on whitespace, the default regular expression used is /\s+/. The following program counts the number of words in each input line and prints this information out:

while (<>) {
	chomp;			# get rid of newline
	@words = split;		# split on whitespace
	$numwords = @words;	# get number of elements in @words
	print "The string \"$_\" has $numwords words\n";
}

The join command creates a string from an array (sort of the opposite of split). The general form of join is:

join(glue_string,an_array);
join returns a single string that results from inserting the glue_string between each pair of array elements. The glue_string is just a string (not a regular expression). An example that creates HTML table format grade report from a student database in tab delimited format. Assume that input is a file that contains student records in the following format:
Student Name\tTest1 grade\tTest2 grade\tHomework\n
that is, each line contains a name and three grades with tabs seperating individual fields. The following perl program will convert this format to an HTML table (using split and join) and will calculate the student average.

print "<TABLE>\n";                  # start the HTML table

# add some headings
print "<TR><TH>Name</TH><TH>Test1</TH><TH>Test2</TH>";
print "<TH>HW</TH><TH>Ave</TH></TR>\n";

while (<>) {
	chomp;			    # get rid of newline
	@studentrec = split(/\t/);  # split in to fields
	
	# now $studentrec[0] is the student name
	#     $studentrec[1] is the test 1 grade
	#     $studentrec[2] is the test 2 grade
	#     $studentrec[3] is the homework grade

	print "<TR><TD>";           # start the row for this student

	print join("</TD><TD>",@studentrec);  # print fields in table format

	# calculate student average
	$ave = ($studentrec[1] + $studentrec[2] + $studentrec[3])/3;

	print "</TD><TD>$ave</TD></TR>\n";       # finish up the row
}
print "</TABLE>\n";                 # finish up the table

Exercise

Write a perl program that creates a student record in the form used as input to the above program. Each line should contain a student name, followed by a tab (no tabs in the name are allowed), followed by a test1 grade, followed by a tab, etc. A sample output line is:
Joe Smith\t88\t92\t77\n

Your program will accept input in the form of lines that contain name, value pairs with an equal sign (=) between the name and the value. Here is a sample input file:

name = Joe Student
test1 = 86
test2 = 77
homework = 33
name = Jane Smith
test1 = 98
test2 = 35
homework = 85

for this input, the output should be this (\t is a tab):

Joe Student\t86\t77\t33
Jane Smith\t98\t35\t85