In the simplest case, we can use pattern matching to determine whether or not a pattern matches a string. A pattern matches a string if the string contains any substring that can be described by the pattern. For example, the fixed pattern "foo" matches any string that contains the substring "foo". The string "My food is cold" matches the pattern "foo" since it contains the substring "foo". The string "info of interest" does not match the pattern "foo": although it contains the sequence "fo o", the space between the two "o"s means it is not a match. The match must be exact.
The result of a simple pattern match operation is TRUE or FALSE: TRUE if the pattern matches the string, FALSE if it does not. We can also replace the part of a string that matches a pattern with something new, or remove any part of a string that matches a pattern. These operations are all possible in perl.
The simplest kind of regular expression is a simple string. The pattern "foo" is such an expression. As stated before, the pattern "foo" would match any string that contains an "f" followed immediately by an "o", followed immediately by another "o". Nothing can come between these characters, and we don't care what else is in the string.
By default, the match operator works on the default scalar variable $_. The following statement attempts to match the
pattern "foo" against the string in $_:
/foo/
The match operator returns either TRUE or FALSE, so we typically use it
as a conditional expression in an if or loop. Here is a sample program that uses the match operator to filter out all the lines of input that do
not contain the substring "foo":
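A minimal sketch of such a filter is shown here. The subroutine name lines_with_foo and the sample lines are assumptions made for illustration; a stand-alone filter would more typically read its input with while (<>) and print each matching line.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines that contain the substring "foo".
sub lines_with_foo {
    my @matched;
    foreach (@_) {                    # foreach places each line in $_
        push @matched, $_ if /foo/;   # /foo/ is matched against $_
    }
    return @matched;
}

# Sample input (an assumption); a real filter would loop over <> instead.
my @input = ("My food is cold\n", "info of interest\n", "no match here\n");
print lines_with_foo(@input);         # prints only "My food is cold"
```

Because the match operator returns TRUE or FALSE, it can be used directly as the condition of the if modifier.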
If we put the above program in the file "foofinder.pl" and told perl to use the program itself as input (searching the text of the program for lines that contain the string "foo"), we would see the following:
|
We could also create a program that prints out all lines that do not contain the string "foo":
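One way to sketch the inverse filter is to keep a line only when the match fails. The helper name lines_without_foo and the sample lines are illustrative assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines that do NOT contain "foo".
sub lines_without_foo {
    my @kept;
    foreach (@_) {
        push @kept, $_ unless /foo/;   # keep the line only when the match fails
    }
    return @kept;
}

# Sample input (an assumption); a real filter would loop over <> instead.
my @input = ("My food is cold\n", "info of interest\n");
print lines_without_foo(@input);       # prints only "info of interest"
```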
The "." (the dot/period character) in a regular
expression will match any single character except a newline
("\n"). The "." is a wildcard that we can
put in a regular expression when we want to make sure there is a
character, but we don't care what it is. Some examples, with
strings that do and do not contain a matching character sequence:
| Pattern | Meaning | Will Match | Won't Match |
|---|---|---|---|
| /f./ | will match any string that contains a two character sequence that starts with the character "f". | "Hello funny face" | "chocolate chip" |
| /a.b/ | will match any string that contains a three character sequence that starts with an "a" and ends with a "b". | "axb" | "axxb" |
Since the character "." means something special when used in a
regular expression, we need to tell perl when we want to match the
period character literally. We can do this the same way we tell perl to
ignore special characters in double-quoted strings: just use "\." instead of ".".
| Pattern | Meaning | Will Match | Won't Match |
|---|---|---|---|
| /a\.b/ | will match any string that contains the three character sequence "a.b". | "proga.bat" | "axb" |
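The difference between the wildcard and the escaped period can be checked with a short script. This is only a sketch; the hash names %wildcard and %literal are assumptions, and the sample strings are taken from the tables above.

```perl
use strict;
use warnings;

# Record, for each sample string, whether each pattern matched (1) or not (0).
my (%wildcard, %literal);
foreach ("axb", "axxb", "proga.bat") {
    $wildcard{$_} = /a.b/  ? 1 : 0;   # "." matches any one character
    $literal{$_}  = /a\.b/ ? 1 : 0;   # "\." matches only a period
    print "$_\twildcard=$wildcard{$_}\tliteral=$literal{$_}\n";
}
```

Only "proga.bat" matches both patterns, since its "a.b" satisfies the wildcard as well as the literal period.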
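A character class is a list of characters inside square brackets that matches any single character from the list. For example, [abc] would match a
single character as long as it is an "a", "b" or "c". Some examples
using character classes: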
| Pattern | Meaning | Will Match | Won't Match |
|---|---|---|---|
| /[aeiou]/ | will match any string that contains a lowercase vowel | "lots of vowels" | "frtzblllb" |
| /a[bc]/ | will match any string that contains "ab" or "ac". | "cab driver" | "a b a c" |
| /[aA][lL]/ | will match any string that contains the substring "al" in any combination of upper or lower case characters. | "alphanumeric" | "aa LL" |

A character class can be negated by making the first character inside the brackets a "^". For example: [^abc]
would match any single character except "a", "b" or "c". Some examples
using negated character classes:
| Pattern | Meaning | Will Match | Won't Match |
|---|---|---|---|
| /[^aeiou]/ | will match any string that contains anything that is not a lowercase vowel | "CAPSLOCK" | "aeiouaeiou" |
| /a[^bc]/ | will match any string that has an "a" followed by a character that is not a "b" or a "c". | "ad ae af" | "abacabac" |
Since we need the characters "[", "]" and "^" to create character
classes, we need to do something special if we want perl to match them
literally: "\[", "\]", "\^". For example, the regular expression
/\[[0-9]\]/ would match any string that contains a single
digit inside square brackets, like "[3]" or "[8]".
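A character class can also contain a range of characters, written with a "-" between the first and last character of the range:
| Character Class | Description |
|---|---|
| [a-z] | lower case letters |
| [0-9] | digits |
| [a-zA-Z] | lower or upper case letters |

Since "-" has a special meaning inside a character class definition, we must now add it
to the growing list of characters that must be "escaped" when we want them to be used literally: "\-".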
Some character classes are very common, so perl provides shortcuts to
save you some typing. For example, you can use the shortcut "\d"
instead of [0-9] (d stands for "digit"). Here is a list of the most commonly used predefined character classes available in perl:
| Shortcut | Character Class | Description |
|---|---|---|
| \d | [0-9] | digits |
| \w | [a-zA-Z0-9_] | word characters (alphanumerics and "_") |
| \s | [ \r\n\t\f] | space (whitespace) |
| \D | [^0-9] | NOT digits |
| \W | [^a-zA-Z0-9_] | NOT word characters (alphanumerics and "_") |
| \S | [^ \r\n\t\f] | NOT space (whitespace) |
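The character classes, negated classes, escaped brackets and shortcuts described above can be tried out with a short script. This is a sketch only; the hash names and the choice of sample strings are assumptions made for illustration.

```perl
use strict;
use warnings;

# Record, for each sample string, whether each pattern matched (1) or not (0).
my (%class, %negated, %bracketed, %shortcut);
foreach ("cab driver", "a b a c", "item [3]", "Water is H2O") {
    $class{$_}     = /a[bc]/     ? 1 : 0;   # "ab" or "ac"
    $negated{$_}   = /a[^bc]/    ? 1 : 0;   # "a", then anything but "b" or "c"
    $bracketed{$_} = /\[[0-9]\]/ ? 1 : 0;   # a digit inside literal "[" and "]"
    $shortcut{$_}  = /\d\w/      ? 1 : 0;   # a digit, then a word character
    print "$_\t$class{$_} $negated{$_} $bracketed{$_} $shortcut{$_}\n";
}
```

Note that "a b a c" matches /a[^bc]/ because a space is a character that is neither "b" nor "c".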
Examples using predefined character class shortcuts:
| Pattern | Meaning | Will Match | Won't Match |
|---|---|---|---|
| /\s/ | will match any string that contains whitespace | "A space" | "NoWhiteSpace" |
| /\d\w/ | will match any string that has a digit followed by a word character (a letter, digit or underscore). | "Water is H2O" | "Mazda RX7" |
| /\sEIW\s/ | will match any string that has "EIW" with whitespace on each side. | "Schedule: EIW 10AM" | "EIW does not have leading whitespace" |
Construct regular expressions (match operators) for the following:
The asterisk "*" can follow any single
character pattern (a character, dot, character class or negated
character class) and means "zero or more of the previous pattern". For
example, the regular expression /a*/ means any sequence
of zero or more "a"s. This would match "a", "aaaaa" or
"aaaaaaaaaaa". The question mark "?" means "zero or one"
of the previous character pattern and the plus sign "+" means
"one or more". Some examples:
| Pattern | Meaning | Will Match | Won't Match |
|---|---|---|---|
| /abc*/ | will match any string that has an "a" followed by a "b" followed by zero or more "c"s | "abxxx" | "ccccc" |
| /a+[01]/ | one or more "a"s followed by either a "0" or a "1". | "alpha1" | "alpha2" |
| /<.*>/ | anything inside angle brackets | "<HTML>" | "<No final bracket" |
| /A[0-9]?B/ | an "A" followed by a "B" with possibly a single digit in the middle | "A1B" | "a1b" |
Later we will worry about what part of a string is matched by
a regular expression (not just whether a string matches a pattern), and
then it will become clear when and why we would use something like
/a*/ (which would match any string!). For now we will
just worry about the grouping operator syntax.
Parentheses group part of a pattern so that we can apply "*", "?" and "+" to everything
inside the parentheses. For example, the pattern
/(a[0-9])+/ will match any part of a string that contains
one or more consecutive two character sequences, where each two character
sequence is an "a" followed by a digit. The string
"a1a2a3a4" is matched by the pattern. Examples:
| Pattern | Meaning | Will Match | Won't Match |
|---|---|---|---|
| /a(bc)*d/ | will match any string that has an "a" followed by zero or more "bc"s followed by a "d". | "ad" | "abcxd" |
| /([Ww]indows ?95)+/ | One or more occurrences of the phrase "Windows 95" with lower or upper case leading "w" and zero or one spaces between the "Windows" and the "95". | "Windows 95" | "Windows98" |
Develop regular expressions for the following:
<A HREF=blahblah>).
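The "*", "+" and "?" quantifiers and the grouping parentheses described earlier in this section can be compared side by side with a short script. This is a sketch; the hash name %matched_by and the sample strings are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# For each sample string, record which of the patterns matched it.
my %matched_by;
foreach ("abxxx", "alpha1", "alpha2", "A1B", "AB", "ad", "abcbcd", "abcxd") {
    my @names;
    push @names, '/abc*/'     if /abc*/;       # "ab", then zero or more "c"s
    push @names, '/a+[01]/'   if /a+[01]/;     # one or more "a"s, then "0" or "1"
    push @names, '/A[0-9]?B/' if /A[0-9]?B/;   # optional digit between "A" and "B"
    push @names, '/a(bc)*d/'  if /a(bc)*d/;    # "*" applies to the whole group "bc"
    $matched_by{$_} = join(", ", @names);
    print "$_ => $matched_by{$_}\n";
}
```

Comparing "abcbcd" and "abcxd" shows the effect of grouping: both contain "abc", but only the first is an "a", some number of "bc" pairs, and then a "d".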
Parentheses also tell perl to remember the part of the string matched by each group. Used in a list context, the match operator returns those parts:
($firstword,$secondword) = /(\w+)\s+(\w+)/;
This results in the variable $firstword being set to
whatever part of $_ was matched by the first (\w+), and
$secondword will be the part of $_ that
matched the second (\w+). Now we can think about using regular
expressions to do more than simply match a pattern to a string: we
can also extract parts of the string.
($proto,$host,$uri) = /([^:]+):\/\/([^\/]+)\/(.*)/;
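As a rough illustration, the pattern above can be applied to a sample URL held in $_. The URL and the printed labels are assumptions chosen for this example.

```perl
use strict;
use warnings;

# Assumed sample URL; the match operator works on $_ by default.
$_ = "http://www.example.com/docs/index.html";

# Each parenthesized group returns the part of $_ that it matched.
my ($proto, $host, $uri) = /([^:]+):\/\/([^\/]+)\/(.*)/;

print "protocol: $proto\n";
print "host:     $host\n";
print "uri:      $uri\n";
```

The first group takes everything up to the first ":", the second takes everything up to the next "/", and the third takes the rest of the string.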
Perl also allows you to reference the "memorized" chunks of the string
being matched inside the regular expression itself. Within a regular
expression a \1 will match whatever part of the string
was matched by the first parenthesized group in the regular
expression. For example, the following regular expression will match
any line that contains a substring of word characters
that is repeated (the same substring appears twice in the line):
/(\w+).*\1/
Here the \1 is replaced by whatever substring was matched by the
(\w+).
The following regular expression will match any string that contains a
"0" followed by any substring (could be empty substring) followed by a
"1" followed by the same substring that followed the "0".
/0(.*)1\1/
The string 0Hi Dave1Hi Dave would match, but the string
0abc1ab would not.
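The two backreference patterns above can be exercised with a short script. This is a sketch; the hash name %same_text, the variable $repeated and the sentence "the cat saw the dog" are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Does the text after the "1" repeat the text between the "0" and the "1"?
my %same_text;
foreach ("0Hi Dave1Hi Dave", "0abc1ab", "01") {
    $same_text{$_} = /0(.*)1\1/ ? 1 : 0;
    print "$_ => $same_text{$_}\n";
}

# Capture a word that appears again later in the same string.
$_ = "the cat saw the dog";
my ($repeated) = /(\w+).*\1/;
print "repeated: $repeated\n";
```

The string "01" matches because the group is allowed to match the empty substring, which then trivially repeats after the "1".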
Develop regular expressions for the following:
<H2>Hi
Dave</H2> and so should <TITLE>The Test
Answers</TITLE>, but this should not match
<TITLE>Not a match</H2>.
You can use the "|" alternation symbol in a regular
expression to mean "either alternative". For example, the expression
a|b matches either an "a" or a "b"; you could also
express this as [ab]. You can also do things like this:
/Dave|Dad|baldy|dummy|cookie monster/
which would match any string that contains any of "Dave", "Dad", "baldy", "dummy" or "cookie monster".
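A short sketch of the alternation pattern in use follows. The hash name %found and the sample strings are assumptions made for illustration.

```perl
use strict;
use warnings;

# Record whether any one of the alternatives appears in each string.
my %found;
foreach ("Ask Dad first", "cookie monster ate it", "Nobody here") {
    $found{$_} = /Dave|Dad|baldy|dummy|cookie monster/ ? 1 : 0;
    print "$_ => $found{$_}\n";
}
```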
The pattern /Joe/ without any anchors will match any string that
contains the substring "Joe", but what if you only want to match
strings that begin with "Joe", or strings in which "Joe" is surrounded
by whitespace? Perl provides four types of anchors to ensure that patterns match specific parts of a string:
\b matches any "word boundary". A word boundary is the position
between a pair of characters where one matches \w and the other matches
\W; the beginning or end of the string also counts as a boundary when the adjacent character matches \w. Remember that \w is the character class
[a-zA-Z0-9_] and \W is the negated character
class [^a-zA-Z0-9_]. So \b allows you to
make sure there is either a non-word character (such as whitespace or punctuation) or the beginning/end of the string
adjacent to some word character. It is important to realize that
\b does not match any character(s), it matches a transition.
Examples:
| Pattern | Meaning | Will Match | Won't Match |
|---|---|---|---|
| /\bJoe\b/ | will match any string that has "Joe" as a word. | "Joe" | "Joeseph" |
| /abc\bdef/ | Impossible - doesn't match anything! | | |
The \B anchor matches any position that is not
a word boundary.
The ^ anchor matches the beginning of a line. So the
regular expression /^a/ will match any line that starts
with an "a". The caret character "^" is only interpreted as an anchor
when used at the beginning of a regular expression (or any place where
it would make sense to match the beginning of a line).
The $ anchor matches the end of a line. So the regular
expression /a$/ will match any line that ends with an
"a" (a single trailing newline after the "a" is allowed). $ is only interpreted as an anchor when at the end of
a regular expression (or any place where it would make sense to match the end of a line).
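The \b, ^ and $ anchors described above can be checked with a short script. This is a sketch; the hash names and the sample strings are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Record, for each sample string, whether each anchored pattern matched.
my (%whole_word, %starts_with_a, %ends_with_a);
foreach ("Joe is here", "Joeseph", "a banana", "alphabet", "banana\n") {
    $whole_word{$_}    = /\bJoe\b/ ? 1 : 0;   # "Joe" as a complete word
    $starts_with_a{$_} = /^a/      ? 1 : 0;   # first character is "a"
    $ends_with_a{$_}   = /a$/      ? 1 : 0;   # last character (before any final newline) is "a"
}
foreach (sort keys %whole_word) {
    my $shown = $_;
    $shown =~ s/\n/\\n/;                      # make the newline visible when printing
    print "$shown\t$whole_word{$_} $starts_with_a{$_} $ends_with_a{$_}\n";
}
```

The last sample shows that /a$/ still matches a line read from input, where the "a" is followed by the line's terminating newline.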
Anchoring Examples:
| Pattern | Meaning | Will Match | Won't Match |
|---|---|---|---|
| /^[A-Z]/ | will match any string that starts with a capital letter. | "I am a funny guy." | "<H2>Hi Dave</H2>" |
| /^(\w+)\b.*\b\1$/ | Any string that starts and ends with the same word. | "Hi blah Hi" | "hellohello" |
The substitute operator has the form:
s/someregexp/newvalue/modifiers
This operator operates on (and possibly modifies) the default scalar variable
$_. The regular expression is matched against
$_, and whatever part of $_ matches the regular
expression is removed and replaced with newvalue. Here is a
simple example:
s/Dave/Joe/;
This would replace the first occurrence of "Dave" with the string "Joe", so if
$_ was originally the string "Dave is a fool", after the above
substitute command the string would be "Joe is a fool". Some examples:
| Original $_ | command | Meaning | Result |
|---|---|---|---|
| "Dave likes cookies too much" | s/cookies/pizza/; | replace first "cookies" with "pizza" | "Dave likes pizza too much" |
| "Chocolate Chocolate Chip" | s/Chocolate/Mint/; | replace first "Chocolate" with "Mint" | "Mint Chocolate Chip" |
| "Vitamin B9 is great" | s/[A-Z][0-9]/foo/; | replace first capital letter followed by a digit with "foo" | "Vitamin foo is great" |
| "H2O H2O H2O C3PO" | s/[0-9]//; | remove first digit (replace with nothing!) | "HO H2O H2O C3PO" |
If the regular expression is never matched by any part of the string
in $_, no substitution happens ($_ is unchanged).
There are some modifiers you can specify that change how
the substitute operator works. For example, if you include the
modifier "g" the substitution will happen to all parts of the
string that match the regular expression, not just the first
one ("g" stands for global). So s/Joe/foo/g;
applied to the string "Joe is a Joe Joe Joe" will result
in "foo is a foo foo foo".
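The difference between a plain substitution and one with the "g" modifier can be seen in a short sketch. The variable names $first_only, $every_match and $unchanged are assumptions made for illustration.

```perl
use strict;
use warnings;

# Without "g": only the first match is replaced.
$_ = "Joe is a Joe Joe Joe";
s/Joe/foo/;
my $first_only = $_;

# With "g": every match is replaced.
$_ = "Joe is a Joe Joe Joe";
s/Joe/foo/g;
my $every_match = $_;

# No match at all: $_ is left unchanged.
$_ = "Dave is not here";
s/Joe/foo/g;
my $unchanged = $_;

print "$first_only\n$every_match\n$unchanged\n";
```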
The "i" modifier tells perl to ignore case when matching, so that an "a" would match an "A", etc. "i" stands for (case) "insensitive". You can include both modifiers in a substitute expression, for example:
s/mit/RPI/gi;
would replace "mit" or "MIT" or "mIT" or ..., with "RPI". The string:
"MIT is a great place to clean your mit, but you need to wear mittens or you will be committed"
would become:
"RPI is a great place to clean your RPI, but you need to wear RPItens or you will be comRPIted".
Writing a perl program that reads from standard input (or from a file specified on the command line), makes substitutions on each line, and prints out the result is pretty simple. Here is an example that replaces all numbers that are not part of a word (digits surrounded by whitespace) with the word "number", and also replaces the word "Honorable" with "Bald":
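A minimal sketch of such a program is shown here. The helper name rewrite_line and the sample sentence are assumptions; a stand-alone script would typically apply the same two substitutions inside a while (<>) loop and then print $_. Because the pattern consumes the whitespace on both sides of a number, two numbers separated by a single space will not both be replaced by this simple version.

```perl
use strict;
use warnings;

# Hypothetical helper: apply both substitutions to one line and return it.
sub rewrite_line {
    local $_ = shift;          # work on a private copy held in $_
    s/\s\d+\s/ number /g;      # digits with whitespace on each side
    s/Honorable/Bald/g;        # every "Honorable" becomes "Bald"
    return $_;
}

# Sample line (an assumption); "B52" is left alone because it is part of a word.
print rewrite_line("The Honorable Joe has 3 cats and a B52 model\n");
```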
<H1>,</H1> tag pairs with
<H3>,</H3> tags.
<HEAD> tag and the
</HEAD> tag. Keep in mind that in HTML newlines
mean nothing - any part of a document can be split amongst lines any possible
way, so we could have something nice like this:
|
something like this:
|
or even this:
|
|
Assuming we have a customer Mr. Joe Jones with email
address joe@smallbiz.com and the following letter
fed as input to the perl program:
|
The output of the program would look like this:
|
We can also build regular expressions at runtime. For example, the following program will replace all "1"s in the first line with the word "foo", all "2"s in the second line, all "3"s in the third line, etc.
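One way to sketch this is to interpolate a counter variable into the pattern. The names @lines and $line_number and the sample input are assumptions; a stand-alone script would read its lines from standard input instead.

```perl
use strict;
use warnings;

# Sample input lines (an assumption for this sketch).
my @lines = ("1 and 2 and 1\n", "1 and 2 and 2\n", "3, 2, 1, 3\n");

my $line_number = 0;
foreach (@lines) {               # $_ is an alias, so s/// edits @lines in place
    $line_number++;
    s/$line_number/foo/g;        # the pattern is built from the current line number
}
print @lines;
```

The variable is interpolated into the regular expression each time the substitution runs, so the pattern is "1" for the first line, "2" for the second, and so on.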
The =~ Operator (matching with your own variables)
Although the match and substitute operators work on $_ by default, sometimes you want to tell perl to use some other
variable. The =~ operator expects a scalar variable on
the left and a match expression or substitute expression on the
right. Perl applies the match or substitute to the variable you've
specified instead of to $_. In the case of a substitute
command, the modified value is stored in the variable specified as
well. For example, the following code:
if ($foo =~ /gumdrop/) {
    print "Gumdrop found\n";
}
uses the string in $foo to match against the regular expression
/gumdrop/ instead of using $_. Here is an
example using the substitute command:
$line =~ s/<H1>/<H2>/g;
This perl code looks for "<H1>" in the string
$line and replaces each "<H1>" found with
"<H2>".
|
The split operator uses a regular expression to
split a string into a sequence of substrings (which are returned as
an array). For example, the following expression will split the string
in $_ on whitespace, returning an array of all the non-whitespace
tokens:
@tokens = split(/\s+/);
If $_ holds the string "Sometimes I dunk cookies in
melted butter" then after the above split command the array
@tokens would hold the value: ("Sometimes", "I",
"dunk", "cookies", "in", "melted", "butter").
The general form of the split command is:
split(regular_expression,string_to_split);
If you only give split one argument it assumes you want
to split the default perl variable $_. If you don't give
split any arguments it assumes you want to split $_ on
whitespace; the default behaves like the regular expression
/\s+/, except that leading whitespace is skipped. The following program counts the number of words
in each input line and prints this information out:
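A minimal sketch of such a word counter follows. The helper name count_words and the sample lines are assumptions; a stand-alone script would read each line with while (<>) instead of using an in-memory list.

```perl
use strict;
use warnings;

# Hypothetical helper: count the whitespace-separated words in one line.
sub count_words {
    local $_ = shift;      # put the line in $_ so split can use its defaults
    my @words = split;     # no arguments: split $_ on whitespace
    return scalar @words;  # number of elements in the array
}

# Sample input (an assumption for this sketch).
my @input = ("Sometimes I dunk cookies in melted butter\n", "   two words\n", "\n");
foreach my $line (@input) {
    print count_words($line), " words\n";
}
```

Using split with no arguments means that leading whitespace and the trailing newline do not produce empty extra "words".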
The join command creates a string from an array (sort of
the opposite of split). The general form of join
is:
join(glue_string,an_array);
join returns a single string that results from inserting
the glue_string between each pair of array elements. The
glue_string is just a string (not a regular expression). As an
example, consider creating an HTML table format grade report from a student
database in tab delimited format. Assume that input is a file that
contains student records in the following format:
Student Name\tTest1 grade\tTest2 grade\tHomework\n
that is, each line contains a name and three grades with tabs separating individual fields. The following perl program will convert this format to an HTML table (using
split and join) and will
calculate the student average.
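A sketch of such a conversion is shown here. The helper name grade_row, the exact HTML layout and the one-decimal formatting of the average are assumptions; the sketch also assumes that every record carries at least one grade.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one tab delimited record into an HTML table row.
sub grade_row {
    my ($record) = @_;
    chomp $record;                                # drop the trailing newline
    my ($name, @grades) = split(/\t/, $record);   # first field is the name
    my $total = 0;
    $total += $_ foreach @grades;
    my $average = $total / @grades;               # @grades here gives the number of grades
    # join glues the cells together with the closing/opening cell tags
    my $cells = join("</TD><TD>", $name, @grades, sprintf("%.1f", $average));
    return "<TR><TD>$cells</TD></TR>\n";
}

# One sample record; a stand-alone script would loop over its input lines.
print "<TABLE>\n";
print grade_row("Joe Smith\t88\t92\t77\n");
print "</TABLE>\n";
```

Here split breaks the record on tabs, and join rebuilds the fields (plus the computed average) with HTML cell tags as the glue string.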
Joe Smith\t88\t92\t77\n
Your program will accept input in the form of lines that contain
name, value pairs with an equal sign (=) between the
name and the value. Here is a sample input file:
|
for this input, the output should be this (\t is a tab):
|