| EIW Fall 2003 Lecture Notes |
Perl includes a number of pattern matching operators that can be used to do all kinds of fancy string manipulation operations. All these pattern matching operators share a common "language" for expressing patterns. Patterns can be as simple as a fixed sequence of characters; for example, the string "foo" is a pattern. Patterns can also be more complex, for example: an "a" followed by either a "c" or a "d".
In the simplest case, we can use pattern matching to determine whether or not a pattern matches a string. A pattern matches a string if the string contains any substring that can be described by the pattern. For example, using the fixed pattern "foo", any string that contains the substring "foo" will be matched by the pattern. The string "My food is cold" would match the pattern "foo" since it contains the substring "foo". The string "info of interest" does not match the pattern "foo" - although it contains the substring "fo o", that is not a match, the match must be exact.
The result of a simple pattern match operation is TRUE or FALSE (in the Perl way, values such as 0 and "" are FALSE and anything else is TRUE): TRUE if the pattern matches the string, FALSE if it does not. We can also do things like replace the part of a string that matches a pattern with something new, or perhaps remove any part of a string that matches a pattern - these operations are all possible in Perl.
The "language" used to express patterns is called regular expressions - we often refer to a pattern as "a regular expression". In Perl, regular expressions are themselves strings, so we can actually build new regular expressions at run time and then apply them (attempt to match strings with them).
The simplest kind of regular expression is a simple string. The pattern "foo" is such an expression. As stated before, the pattern "foo" would match any string that contains an "f" followed immediately by an "o", followed immediately by another "o". Nothing can come between these characters, and we don't care what else is in the string.
The basic pattern matching operator is called the match
operator. Although it is possible to use the match operator with any
scalar variable, it is often used with the default Perl scalar
variable $_. The following statement attempts to match the
pattern "foo" against the string in $_:
/foo/
The slashes delimit the regular expression - in this case the regular expression is the string "foo". Notice that the regular expression does not include any quotes; the slashes serve as the delimiters (so we can tell whether or not there are any blanks at the beginning or end - in our example there are none).
The match operator returns either TRUE or FALSE, so we typically use it
as a conditional expression in an if or loop. Here is a sample program that uses the match operator to filter out all the lines of input that do
not contain the substring "foo":
|
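A minimal sketch of such a filter. To stay self-contained it filters an in-memory list rather than reading input, and the helper name `lines_with_foo` and the sample lines are made up for illustration:

```perl
use strict;
use warnings;

# Return only the lines that contain the substring "foo".
# lines_with_foo is a hypothetical helper name, not part of the notes.
sub lines_with_foo {
    my @kept;
    for (@_) {                     # each line becomes $_ in turn
        push @kept, $_ if /foo/;   # /foo/ is matched against $_
    }
    return @kept;
}

my @input = ("My food is cold\n", "info of interest\n", "foo fighters\n");
print lines_with_foo(@input);
```

Against real input the same test is usually written as `while (<>) { print if /foo/; }`, and changing `if` to `unless` keeps the lines that do not match.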
If we put the above program in the file "foofinder.pl" and told Perl to use the program itself as input (searching the text of the program for lines that contain the string "foo"), we would see the following:
|
We could also create a program that prints out all lines that do not contain the string "foo":
|
The simplest part of a pattern is a single character. The example pattern we have already used ("foo") is made up of three single character patterns, each specifies a single character (the "f", the "o" and the other "o"). By putting these individual characters in a sequence we are telling Perl that a match must contain a sequence of characters that match the corresponding pattern characters, so the string "food" matches, but the string "f o o" does not since it has spaces in it.
There are a few other ways to match a single character:
A "." (the dot/period character) in a regular
expression will match any single character except a newline
\n. The "." is a wildcard that we can
put in a regular expression when we want to make sure there is a
character, but we don't care what it is. Some examples including
strings that do contain matching character sequences / don't contain
matching sequences :
| Pattern | Meaning | Will Match | Won't Match |
|---|---|---|---|
| /f./ | will match any string that contains a two letter sequence that starts with the character "f". | "Hello funny face" | "chocolate chip" |
| /a.b/ | will match any string that contains a three letter sequence that starts with an "a" and ends with a "b". | "axb" | "axxb" |
Since the character . means something special when used in a
regular expression - we need to tell perl if we want to actually match the
period character literally! We can do this the same way we tell perl to
ignore special characters in doubly quoted strings - just use "\." instead of ".".
| Pattern | Meaning | Will Match | Won't Match |
|---|---|---|---|
| /a\.b/ | will match any string that contains the three letter sequence "a.b". | "proga.bat" | "axb" |
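The difference between the wildcard dot and the escaped dot can be checked with a short sketch like the following. The helper name `dot_demo` is hypothetical, and the code uses the `=~` operator (covered later in these notes) so that it can test a named variable rather than `$_`:

```perl
use strict;
use warnings;

# Try both patterns against one string and report the two results.
# dot_demo is a hypothetical helper name.
sub dot_demo {
    my ($string) = @_;
    my $wildcard = ($string =~ /a.b/)  ? 1 : 0;   # "." accepts any character except newline
    my $literal  = ($string =~ /a\.b/) ? 1 : 0;   # "\." accepts only a real period
    return ($wildcard, $literal);
}

for my $s ("axb", "proga.bat", "axxb") {
    my ($wildcard, $literal) = dot_demo($s);
    print "$s: wildcard=$wildcard literal=$literal\n";
}
```

Only "proga.bat" satisfies both patterns; "axb" satisfies the wildcard version alone, and "axxb" satisfies neither.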
We can also use a character class to match a single
character. A character class is a set of characters, and matching a
character class means that we find a character that is in the set. We
express this in regular expressions by putting a list of characters
inside square brackets, for example: [abc] would match a
single character as long as it is an "a", "b" or "c". Some examples
using character classes:
| Pattern | Meaning | Will Match | Won't Match |
|---|---|---|---|
| /[aeiou]/ | will match any string that contains a lowercase vowel | "lots of vowels" | "frtzblllb" |
| /a[bc]/ | will match any string that contains "ab" or "ac". | "cab driver" | "a b a c" |
| /[aA][lL]/ | will match any string that contains the substring "al" in any combination of upper or lower case characters. | "alphanumeric" | "aa LL" |
There is also a negated character class that matches any
character except those listed. We do this by putting a list of
characters inside square brackets, but the first character inside the
brackets must be "^". For example: [^abc]
would match a single character except "a", "b" or "c". Some examples
using negated character classes:
| Pattern | Meaning | Will Match | Won't Match |
|---|---|---|---|
| /[^aeiou]/ | will match any string that contains anything that is not a lowercase vowel | "CAPSLOCK" | "aeiouaeiou" |
| /a[^bc]/ | will match any string that has an "a" followed by a character that is not a "b" or a "c". | "ad ae af" | "abacabac" |
Since we need the characters "[", "]" and "^" to create character
classes, we need to do something special if we want perl to match them
literally: "\[", "\]", "\^". For example, the regular expression
/\[[0-9]\]/ would match any string that contains a single
digit inside square brackets, like "[3]" or "[8]".
You can define ranges of characters inside a character class definition (or negated character class) like this:
| Range | Meaning |
|---|---|
| [a-z] | lower case letters |
| [0-9] | digits |
| [a-zA-Z] | lower or upper case letters |
Because "-" marks a range inside a character class definition, it joins
the list of characters that must be "escaped" when we want them to be used literally there: "\-".
Some character classes are very common, so perl provides shortcuts to
save you some typing. For example, you can use the shortcut "\d"
instead of [0-9] (d stands for "digit"). Here is a complete list of the predefined character classes available in perl:
| Shortcut | Character Class | Description |
|---|---|---|
| \d | [0-9] | digits |
| \w | [a-zA-Z0-9_] | word characters (alphanumerics and "_") |
| \s | [ \r\n\t\f] | space (whitespace) |
| \D | [^0-9] | NOT digits |
| \W | [^a-zA-Z0-9_] | NOT word characters (alphanumerics and "_") |
| \S | [^ \r\n\t\f] | NOT space (whitespace) |
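As a rough illustration of character classes, ranges and shortcuts working together, the sketch below sorts single characters into categories. The helper name `classify_char` and the category labels are assumptions made for this example, and the code uses the `=~` operator (covered later in these notes) to test a named variable:

```perl
use strict;
use warnings;

# Classify a single character using the classes described above.
# classify_char is a hypothetical helper name.
sub classify_char {
    my ($c) = @_;
    return "digit"      if $c =~ /[0-9]/;     # the notes abbreviate this as \d
    return "lowercase"  if $c =~ /[a-z]/;     # a range inside a class
    return "uppercase"  if $c =~ /[A-Z]/;
    return "whitespace" if $c =~ /\s/;        # predefined shortcut
    return "bracket"    if $c =~ /[\[\]]/;    # "[" and "]" escaped inside a class
    return "other";
}

for my $c ("7", "q", "Q", " ", "[", "-") {
    print "'$c' is ", classify_char($c), "\n";
}
```

The order of the tests matters only in that the first matching class wins; "-" falls through to "other" because none of the classes list it.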
Examples using predefined character class shortcuts:
| Pattern | Meaning | Will Match | Won't Match |
|---|---|---|---|
| /\s/ | will match any string that contains whitespace | "A space" | "NoWhiteSpace" |
| /\d\w/ | will match any string that has a digit followed by a word character (a letter, digit or underscore). | "Water is H2O" | "Mazda RX7" |
| /\sEIW\s/ | will match any string that has "EIW" with whitespace on each side. | "Schedule: EIW 10AM" | "EIW does not have leading whitespace" |
Construct regular expressions (match operators) for the following:
any string that contains an "a" or "b" followed by any 2 characters followed by an "a" or a "b". The strings "axxb", "alfa" and "blka" match, and "ab" does not.
upper case "A" followed by anything except "x", "y" or "z".
any 5 digit integer.
There are also ways to build regular expressions that are sequences of
groups of characters (so far we just looked at sequences of single
characters). The asterisk "*" can follow any single
character pattern (a character, dot, character class or negated
character class) and means "zero or more of the previous pattern". For
example, the regular expression /a*/ means any sequence
of zero or more "a"s. This would match "a", "aaaaa" or
"aaaaaaaaaaa". (it would actually match any string, since any string has zero or more "a"s!).
The question mark "?" means "zero or one"
of the previous character pattern and the plus sign "+" means
"one or more". Some examples:
| Pattern | Meaning | Will Match | Won't Match |
|---|---|---|---|
| /abc*/ | will match any string that has an "a" followed by a "b" followed by zero or more "c"s | "abxxx" | "ccccc" |
| /a+[01]/ | one or more "a"s followed by either a "0" or a "1". | "alpha1" | "alpha2" |
| /<.*>/ | anything inside angle brackets | "<HTML>" | "<No final bracket" |
| /A[0-9]?B/ | an "A" followed by a "B" with possibly a single digit in the middle | "A1B" | "a1b" |
Later we will worry about what part of a string is matched by
a regular expression (not just that a string matches a pattern), and
then it will become clear when and why we do something like
/a*/ (which would match any string!). For now we will
just worry about the grouping operator syntax.
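The three repetition operators can be tried out with a small sketch such as this one. The helper name `quantifier_demo` and the extra test strings "AB" and "A12B" are assumptions added for illustration; the patterns themselves come from the table above:

```perl
use strict;
use warnings;

# Test one string against a pattern for each of "*", "+" and "?".
# quantifier_demo is a hypothetical helper name.
sub quantifier_demo {
    my ($s) = @_;
    return (
        star     => ($s =~ /abc*/)     ? 1 : 0,   # "ab" then zero or more "c"s
        plus     => ($s =~ /a+[01]/)   ? 1 : 0,   # one or more "a"s then "0" or "1"
        optional => ($s =~ /A[0-9]?B/) ? 1 : 0,   # "A", at most one digit, then "B"
    );
}

for my $s ("abxxx", "ccccc", "alpha1", "alpha2", "A1B", "AB", "A12B") {
    my %r = quantifier_demo($s);
    print "$s: star=$r{star} plus=$r{plus} optional=$r{optional}\n";
}
```

"AB" satisfies the `?` pattern because the digit is optional, while "A12B" does not because `?` allows at most one digit.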
We can put parentheses around any part of a pattern and apply
"*", "?" and "+" to everything
inside the parentheses. For example, the pattern
/(a[0-9])+/ will match any part of a string that consists of
one or more two character sequences, where each two character
sequence is an "a" followed by a digit. The string
"a1a2a3a4" is matched by the pattern. Examples:
| Pattern | Meaning | Will Match | Won't Match |
|---|---|---|---|
| /a(bc)*d/ | will match any string that has an "a" followed by zero or more "bc"s followed by a "d". | "ad" | "abcxd" |
| /([Ww]indows ?XP)+/ | One or more occurrences of the phrase "Windows XP" with lower or upper case leading "w" and zero or one spaces between the "Windows" and the "XP". | "Windows XP" | "Windows98" |
Develop regular expressions for the following:
Any Perl scalar variable name (including the "$"). Perl variable names can contain any alphanumeric character and the "_" character.
Any string that contains nothing but whitespace.
An HTML Anchor tag (for example: <A HREF=blahblah>).
When doing a match operation, Perl can be told to remember the part of a string that matches each group in parentheses in the regular expression. The match operator will return a list of these substrings, so you can capture them by doing something like this:
($firstword,$secondword) = /(\w+)\s+(\w+)/;
This results in the variable $firstword being set to
whatever part of $_ was matched by the first (\w+), and
$secondword will be the part of $_ that
matched the second (\w+). Now we can think about using regular
expressions to do more than simply matching a pattern to a string: we can
also extract parts of the string.
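A small sketch of capturing in practice. The helper name `first_two_words` and the sample sentence are assumptions for illustration, and the match is applied to a named variable with `=~` (covered later in these notes) instead of `$_`:

```perl
use strict;
use warnings;

# Return the first two words of a string, or an empty list when the
# pattern does not match. first_two_words is a hypothetical helper name.
sub first_two_words {
    my ($s) = @_;
    # In list context the match returns what each parenthesized group captured.
    my ($first, $second) = $s =~ /(\w+)\s+(\w+)/;
    return defined $first ? ($first, $second) : ();
}

my ($firstword, $secondword) = first_two_words("Sometimes I dunk cookies");
print "first=$firstword second=$secondword\n";
```

When the pattern fails, the list assignment leaves both variables undefined, which is why the sketch checks `defined` before returning.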
What does the following do?
($proto,$host,$uri) = /([^:]+):\/\/([^\/]+)\/(.*)/;
Perl also allows you to reference the "memorized" chunks of the string
being matched inside the regular expression itself. Within a regular
expression a \1 will match whatever part of the string
was matched by the first parenthesized group in the regular
expression. For example, the following regular expression will match
any line that contains a sequence of word characters
that is repeated (the same substring appears twice in the line):
/(\w+).*\1/
The \1 is replaced by whatever substring was matched by the
(\w+).
The following regular expression will match any string that contains a "0" followed by any substring (could be empty substring) followed by a "1" followed by the same substring that followed the "0".
/0(.*)1\1/
The string 0Hi Dave1Hi Dave would match, but the string
0abc1ab would not.
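Both backreference patterns above can be exercised with a sketch along these lines. The helper names and the two extra sample strings are hypothetical; note that `(\w+)` may capture a single character, so any repeated letter is enough for the second pattern:

```perl
use strict;
use warnings;

# "0", some text, "1", then the same text again.
# Both helper names are hypothetical.
sub zero_one_repeat {
    my ($s) = @_;
    return $s =~ /0(.*)1\1/ ? 1 : 0;
}

# Some run of word characters that shows up again later in the string.
sub has_repeated_word_chars {
    my ($s) = @_;
    return $s =~ /(\w+).*\1/ ? 1 : 0;
}

for my $s ("0Hi Dave1Hi Dave", "0abc1ab", "the cat and the hat", "abc def") {
    print "$s: zero_one=", zero_one_repeat($s),
          " repeated=", has_repeated_word_chars($s), "\n";
}
```

"0abc1ab" fails the first pattern but passes the second, because the letter "a" occurs twice.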
Develop regular expressions for the following:
Any word (a word is defined as a sequence of alphanumerics - no whitespace) that contains a double letter, for example "book" has a double "o" and "feed" has a double "e".
Any string that contains an HTML tag and its corresponding end
tag. The following should match: <H2>Hi
Dave</H2> and so should <TITLE>The Test
Answers</TITLE>, but this should not match
<TITLE>Not a match</H2>.
You can use the "|" alternation symbol in a regular
expression to mean "either alternative". For example the expression
a|b matches either an "a" or a "b", you could also
express this as [ab]. You can also do things like this:
/Dave|Dad|baldy|dummy|cookie monster/
which would match any string that contains any of "Dave", "Dad", "baldy", "dummy" or "cookie monster".
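A quick way to try the alternation pattern is a sketch like this; the helper name `mentions_nickname` and the test sentences are made up for illustration:

```perl
use strict;
use warnings;

# True (1) if the string contains any one of the alternatives.
# mentions_nickname is a hypothetical helper name.
sub mentions_nickname {
    my ($s) = @_;
    return $s =~ /Dave|Dad|baldy|dummy|cookie monster/ ? 1 : 0;
}

for my $s ("Ask Dad about it", "Ask Mom about it", "dave in lowercase") {
    print "$s => ", mentions_nickname($s), "\n";
}
```

The last string does not match because the alternatives are case sensitive.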
You can use anchors in regular expressions to force matches
only in specific places in a string. The regular expression
/Joe/ without any anchors will match any string that
contains the substring "Joe". But what if you only want to match
strings that begin with "Joe", or strings in which "Joe" is surrounded
by whitespace?
Perl provides four types of anchors to ensure that patterns match specific parts of a string:
\b matches any "word boundary". A word boundary is the position
between a pair of adjacent characters where one matches \w and the other matches
\W (the beginning or end of the string also counts when the neighboring
character matches \w). Remember that \w is the character class
[a-zA-Z0-9_] and \W is the negated character
class [^a-zA-Z0-9_]. So \b allows you to
make sure there is a non-word character (such as whitespace or punctuation) or the beginning/end of the string
adjacent to some word character. It is important to realize that
\b does not match any character(s), it matches a transition.
Examples:
| Pattern | Meaning | Will Match | Won't Match |
|---|---|---|---|
| /\bJoe\b/ | will match any string that has "Joe" as a word. | "Joe" | "Joeseph" |
| /abc\bdef/ | Impossible - doesn't match anything! |   |   |
The \B anchor matches any position that is not
a word boundary.
The ^ anchor matches the beginning of a line. So the
regular expression /^a/ will match any line that starts
with an "a". The caret character "^" is only interpreted as an anchor
when used at the beginning of a regular expression (or any place where
it would make sense to match the beginning of a line).
The $ anchor matches the end of a line. So the regular
expression /a$/ will match any line that ends with an
"a". $ is only interpreted as an anchor when at the end of
a regular expression.
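The anchors described above can be compared side by side with a sketch like the following. The helper name `anchor_demo` and the sample strings other than "Joe" and "Joeseph" are assumptions for illustration:

```perl
use strict;
use warnings;

# Test one string against a word-boundary, a start and an end anchor.
# anchor_demo is a hypothetical helper name.
sub anchor_demo {
    my ($s) = @_;
    return (
        word  => ($s =~ /\bJoe\b/) ? 1 : 0,   # "Joe" as a whole word
        start => ($s =~ /^Joe/)    ? 1 : 0,   # string begins with "Joe"
        end   => ($s =~ /Joe$/)    ? 1 : 0,   # string ends with "Joe"
    );
}

for my $s ("Joe", "Joeseph", "Hi Joe", "Joe's car") {
    my %r = anchor_demo($s);
    print "$s: word=$r{word} start=$r{start} end=$r{end}\n";
}
```

"Joe's car" counts as containing the word "Joe" because the apostrophe is not a word character, so a boundary sits between "e" and "'".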
Anchoring Examples:
| Pattern | Meaning | Will Match | Won't Match |
|---|---|---|---|
| /^[A-Z]/ | will match any string that starts with a capital letter. | "I am a funny guy." | "<H2>Hi Dave</H2>" |
| /^(\w+)\b.*\b\1$/ | Any string that starts and ends with the same word. | "Hi blah Hi" | "hellohello" |
Perl has a substitute operator that allows you to substitute some value for any part of a string that matches a regular expression. The general form of the substitute operator is:
s/someregexp/newvalue/modifiers;
This operator operates on (possibly modifies) the default scalar
variable $_. The regular expression is matched against
$_, and whatever part of $_ matches the regular
expression is removed and replaced with newvalue. Here is a
simple example:
s/Dave/Joe/;
This would replace the first occurrence of "Dave" with the string
"Joe", so if $_ was originally the string "Dave is a
fool", after the above substitute command the string would be "Joe is
a fool". Some examples:
| Original $_ | Command | Meaning | Result |
|---|---|---|---|
| "Dave likes cookies too much" | s/cookies/pizza/; | replace first "cookies" with "pizza" | "Dave likes pizza too much" |
| "Chocolate Chocolate Chip" | s/Chocolate/Mint/; | replace first "Chocolate" with "Mint" | "Mint Chocolate Chip" |
| "Vitamin B9 is great" | s/[A-Z][0-9]/foo/; | replace first cap. letter followed by digit by "foo" | "Vitamin foo is great" |
| "H2O H2O H2O C3PO" | s/[0-9]//; | remove first digit (replace with nothing!) | "HO H2O H2O C3PO" |
If the regular expression is never matched by any part of the string
in $_ no substitution happens ($_ is unchanged).
There are some modifiers you can specify that change how
the substitute operator works. For example, if you include the
modifier "g" the substitution will happen to all parts of the
string that match the regular expression - not just the first
one. ("g" stands for global). So s/Joe/foo/g;
applied to the string "Joe is a Joe Joe Joe" will result
in "foo is a foo foo foo".
The "i" modifier tells Perl to ignore case when matching, so that an "a" would match an "A", etc. "i" stands for (case) "insensitive". You can include both modifiers in a substitute expression, for example:
s/mit/RPI/gi;
would replace "mit" or "MIT" or "mIT" or ..., with "RPI". The string:
"MIT is a great place to clean your mit, but you need to wear mittens or you will be committed"
would become:
"RPI is a great place to clean your RPI, but you need to wear RPItens or you will be comRPIted".
Writing a Perl program that reads from standard input (or from a file specified on the command line), makes substitutions on each line, and prints out the result is pretty simple. Here is an example that replaces all numbers that are not part of a word (digits surrounded by whitespace) by the word "number", and also replaces the word "Honorable" with "Bald":
|
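A minimal sketch of the per-line rewriting described above. It assumes that "not part of a word" can be expressed with `\b` word boundaries, which is one possible reading of the description; the helper name `rewrite_line` and the sample sentence are hypothetical, and `=~` (covered later in these notes) applies each substitution to a named variable:

```perl
use strict;
use warnings;

# Apply both substitutions to one line and return the result.
# rewrite_line is a hypothetical helper name.
sub rewrite_line {
    my ($line) = @_;
    $line =~ s/\b\d+\b/number/g;       # a run of digits that stands alone as a word
    $line =~ s/\bHonorable\b/Bald/g;   # the whole word "Honorable"
    return $line;
}

print rewrite_line("The Honorable Judge has 12 cats and a Mazda RX7\n");
```

The "7" in "RX7" is left alone because there is no word boundary between "X" and "7". To process real input, the same two substitutions would sit inside a `while (<>) { ...; print; }` loop operating on `$_`.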
Write a perl program that replaces all digits with the name of the digit, so every "0" is replaced with "zero" , "1" is replaced with "one", ... "9" is replaced with "nine".
Write a Perl program that reads in an HTML file (from STDIN) and replaces
all <H1>,</H1> tag pairs with
<H3>,</H3> tags.
Write a Perl program that removes all HTML tags (anything that looks like an HTML tag - you don't need to check each tag name).
You might need to think about this one! Write a Perl program that strips the HEAD from a HTML file
(everything between the <HEAD> tag and the
</HEAD> tag. Keep in mind that in HTML newlines
mean nothing - any part of a document can be split amongst lines any possible
way, so we could have something nice like this:
|
something like this:
|
or even this:
|
You can do variable interpolation inside regular expressions or in substitution text, so that all or part of a pattern can be built at run time. For example, we could write a program that personalizes form letters by replacing various strings in a document with the values of some variables (assume the variables have been retrieved from a database of customers):
|
Assuming we have a customer Mr. Joe Jones with email
address joe@smallbiz.com and the following letter
fed as input to the Perl program:
|
The output of the program would look like this:
|
We can also build regular expressions at runtime, for example the following program will replace all "1"s in the first line with the word "foo", and all "2"s in the second line, "3"s in the third line, etc.
|
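One way such a program could look, sketched over an in-memory list of lines so that it is self-contained. The helper name `replace_line_numbers` and the explicit counter `$n` are assumptions made for this sketch; a program reading real input could use Perl's current line number variable `$.` instead:

```perl
use strict;
use warnings;

# On line N, replace every occurrence of the number N with "foo".
# replace_line_numbers is a hypothetical helper name.
sub replace_line_numbers {
    my @lines = @_;
    my $n = 0;
    for my $line (@lines) {
        $n++;
        # $n is interpolated before matching, so the pattern is
        # effectively /1/ on the first line, /2/ on the second, and so on.
        $line =~ s/$n/foo/g;
    }
    return @lines;
}

print "$_\n" for replace_line_numbers("1 2 3", "1 2 3", "1 2 3");
```

Each line keeps the digits that do not equal its own line number, so the three output lines differ even though the input lines are identical.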
=~ Operator (matching with your own variables)
The match and substitute operators use the default perl variable
$_, but sometimes you want to tell Perl to use some other
variable. The =~ operator expects a scalar variable on
the left and a match expression or substitute expression on the
right. Perl applies the match or substitute to the variable you've
specified instead of to $_. In the case of a substitute
command the modified value is stored in the variable specified as
well. For example, the following code:
if ($foo =~ /gumdrop/) {
print "Gumdrop found\n";
}
uses the string in $foo to match against the regular expression
/gumdrop/ instead of using $_. Here is an
example using the substitute command:
$line =~ s/<H1>/<H2>/g;
This Perl code looks for "<H1>" in the string
$line and replaces each "<H1>" found with
"<H2>".
|
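The two uses of `=~` shown above can be combined in a short sketch. The helper names and sample strings are hypothetical; the sketch also relies on the fact that a substitution returns the number of replacements it made (or a false value when nothing matched):

```perl
use strict;
use warnings;

# Match against a named variable rather than $_.
# Both helper names are hypothetical.
sub has_gumdrop {
    my ($foo) = @_;
    return $foo =~ /gumdrop/ ? 1 : 0;
}

# Substitute inside a named variable; the variable itself is modified.
sub replace_h1_open {
    my ($line) = @_;
    my $count = ($line =~ s/<H1>/<H2>/g);   # number of substitutions, or false
    return ($line, $count || 0);
}

print "Gumdrop found\n" if has_gumdrop("a red gumdrop");
my ($newline, $count) = replace_h1_open("<H1>One</H1><H1>Two</H1>");
print "$count replaced: $newline\n";
```

Only the opening tags change here, exactly as in the substitute command above; the closing `</H1>` tags would need their own substitution.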
The Perl split operator uses a regular expression to
split a string into a sequence of substrings (which are returned as
an array). For example, the following expression will split the string
in $_ on whitespace, returning an array of all the non-whitespace
tokens:
@tokens = split(/\s+/);
If $_ holds the string "Sometimes I dunk cookies in
melted butter" then after the above split command the array
@tokens would hold the value: ("Sometimes", "I",
"dunk", "cookies", "in", "melted", "butter").
The general form of the split command is:
split(regular_expression,string_to_split);
If you only give split one argument it assumes you want
to split the default Perl variable $_. If you don't give
split any arguments it assumes you want to split $_ on
whitespace; the default behaves like /\s+/, except that
leading whitespace is skipped. The following program counts the number of words
in each input line and prints this information out:
|
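A minimal sketch of the word counting step, written as a function over one line so that it is self-contained. The helper name `count_words` is hypothetical, and the sketch uses the special pattern `' '`, which splits on runs of whitespace and ignores leading whitespace:

```perl
use strict;
use warnings;

# Count the whitespace-separated words in one line.
# count_words is a hypothetical helper name.
sub count_words {
    my ($line) = @_;
    my @tokens = split(' ', $line);   # ' ' splits on whitespace, skipping any at the start
    return scalar @tokens;            # number of elements in the array
}

for my $line ("Sometimes I dunk cookies in melted butter", "  two words", "") {
    print count_words($line), " words: $line\n";
}
```

A program reading real input would wrap the same call in a `while (<>)` loop and print one count per line.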
The join command creates a string from an array (sort of
the opposite of split). The general form of join
is:
join(glue_string,an_array);
join returns a single string that results from inserting
the glue_string between each pair of array elements. The
glue_string is just a string (not a regular expression). As an
example, we can create a grade report in HTML table format from a student
database in tab delimited format. Assume that input is a file that
contains student records in the following format:
Student Name\tTest1 grade\tTest2 grade\tHomework\n
that is, each line contains a name and three grades with tabs separating
individual fields. The following Perl program will convert this format to
an HTML table (using split and join) and will
calculate the student average.
|
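A sketch of how split and join might be combined for this conversion. It assumes the average is the plain mean of the three grades, printed to one decimal place, and that each record becomes one table row; the helper name `record_to_row` and the second sample student are made up for illustration:

```perl
use strict;
use warnings;

# Turn one tab-delimited record into an HTML table row with the
# average appended. record_to_row is a hypothetical helper name.
sub record_to_row {
    my ($record) = @_;
    chomp $record;                                  # drop the trailing newline
    my ($name, @grades) = split(/\t/, $record);     # fields are separated by tabs
    my $sum = 0;
    $sum += $_ for @grades;
    my $average = @grades ? $sum / @grades : 0;     # assumed: unweighted mean
    my @cells = ($name, @grades, sprintf("%.1f", $average));
    # join puts the glue string between the cells; add the outer tags by hand.
    return "<TR><TD>" . join("</TD><TD>", @cells) . "</TD></TR>";
}

my @records = ("Joe Smith\t88\t92\t77\n", "Ann Lee\t90\t85\t95\n");
print "<TABLE BORDER=1>\n";
print record_to_row($_), "\n" for @records;
print "</TABLE>\n";
```

Using `</TD><TD>` as the glue string means join supplies every cell boundary except the first opening and last closing tag.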
Write a Perl program that creates a student record in the form used as input to the above program. Each line should contain a student name, followed by a tab (no tabs in the name are allowed), followed by a test1 grade, followed by a tab, etc. A sample output line is:
Joe Smith\t88\t92\t77\n
Your program will accept input in the form of lines that contain
name, value pairs with an equal sign (=) between the
name and the value. Here is a sample input file:
|
for this input, the output should be this (\t is a tab):
|
Consider the substitute command s/a(.*)a/\1/; and what it prints.
Try the following input string: "a00000a11111a"
Remember: within a regular expression, \1 is replaced by
the part of the string that matched the part of the regular expression
in parentheses (in this case everything that is matched by (.*)).
There are a few possibilities:
The output will be "0000011111". Nope! For starters, we did not
include the modifier "g", so Perl only does the substitution once. Even if we
did use this: s/a(.*)a/\1/g;, once Perl did the first substitution
there would only be one "a" left (and the pattern couldn't match again).
The output will be "0000011111a". In this case Perl
matched the first and second "a"s with the pattern and
removed them, but left the last "a" alone (only matched once). NO AGAIN!
This sounds reasonable, but it isn't what Perl does. Perl will make as large
a match as it can - in this case it will match the "a"s in the pattern with
the leftmost and rightmost "a"s in the string.
The output will be "00000a11111". YES! Perl
matches the leftmost and rightmost "a"s to the pattern and removes
them, leaving everything in the middle. Perl tries to match as much of
the string as possible with the (.*), in this case it can
match (.*) to "00000a11111" so it does so.
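Perl tries to match as much as it can to any pattern or sub-pattern that ends in * or +. The match still starts at the leftmost position where one is possible, but Perl does not stop at the first point where the entire pattern could be satisfied; each * or + keeps as much of the string as it can while the rest of the pattern still matches.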
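The behavior described above can be confirmed with a few lines. The helper name `strip_outer_a` is hypothetical, and the replacement text uses `$1`, which is the form Perl prefers over `\1` on the replacement side:

```perl
use strict;
use warnings;

# Remove the two "a"s matched by the pattern, keeping what lies between.
# strip_outer_a is a hypothetical helper name.
sub strip_outer_a {
    my ($s) = @_;
    $s =~ s/a(.*)a/$1/;   # (.*) is greedy: it reaches to the last "a"
    return $s;
}

print strip_outer_a("a00000a11111a"), "\n";   # prints 00000a11111
```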
Come up with a substitute command (regular expression) that will
remove the < and > from an HTML tag.
For example, given the input string (in $_) "This is
<B>BOLD</B>" the substitute command will leave us
with "B" since the first tag is a B tag.
You need to make sure that you don't end up with
"B>BOLD</B"!
Come up with a substitute command that will remove the
<B> and the </B> and leave
whatever is between these adjacent tags. In this case, given the string
"This is <B>BOLD</B>" the result of your
substitute command should be "This is BOLD".
However, what if the string looks like this? : "This is
<B>BOLD</B> and so is <B>THIS</B>"
. I'd still like the result of the substitute command to deal with only the first set of tags:
This is BOLD and so is <B>THIS</B> (whatever is in the first pair of <B> </B> tags).
First one that can come up with a substitute command to do this wins a cookie.
If you do something like this: s/<B>(.*)<\/B>/\1/;
Perl will match the first (leftmost) <B> and the rightmost
</B> in the string, and you will end up with this:
"This is BOLD</B> and so is <B>THIS"
To do the above exercise we need some way to tell Perl that we want it
to stop matching as soon as it finds a </B> instead
of looking through the rest of the string for a larger (more of the
string) match. When we asked Perl to match the inside of an HTML tag
things were easy - we could tell Perl to match everything that is not
a > as in the following:
s/<([^>]*)>/\1/;
The stuff inside the parentheses matches only characters that are not
">", so Perl will stop as soon as it sees a
">". This is possible only because the end of the match is
marked by a single character - in our case we want to stop as soon as a
"</B>" is found, and a negated character class cannot express
"anything that is not the sequence </B>".
Perl will match as much of the input string as possible whenever it
sees something like .* or .+ (or anything
that has a * or + at the end). We can, however,
tell Perl to stop when it finds the first match by putting a ?
right after the * or +.
Let's go back to the first exercise and observe the effect of
putting a ? after a *
The substitute command becomes s/a(.*?)a/\1/;
Try the following input string: "a00000a11111a"
If you run this you will find out that with the ? after the
*, Perl will stop at the first match, so it will remove the
first and second "a"s and leave the last one.
We can do the same thing when trying to match everything between
adjacent <B> and </B> tags:
s/<B>(.*?)<\/B>/\1/;
The above substitute command will extract everything between the first
<B> tag and the next </B>
tag (remember that without the ? it will match everything
up through the last </B> tag).
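A side-by-side sketch of the greedy and non-greedy versions. The helper names are hypothetical, and the replacement text uses `$1`, the form Perl prefers over `\1` on the replacement side:

```perl
use strict;
use warnings;

# Greedy: (.*) runs to the last </B> in the string.
# Both helper names are hypothetical.
sub unbold_greedy {
    my ($s) = @_;
    $s =~ s/<B>(.*)<\/B>/$1/;
    return $s;
}

# Non-greedy: (.*?) stops at the first </B> it can.
sub unbold_first {
    my ($s) = @_;
    $s =~ s/<B>(.*?)<\/B>/$1/;
    return $s;
}

my $text = "This is <B>BOLD</B> and so is <B>THIS</B>";
print unbold_greedy($text), "\n";   # prints This is BOLD</B> and so is <B>THIS
print unbold_first($text), "\n";    # prints This is BOLD and so is <B>THIS</B>
```

Note that the "/" in "</B>" has to be written as "\/" because "/" is also the delimiter of the substitute command.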
Write a Perl program that prints out the title of an HTML document (given an entire HTML document it extracts just the title).
Write a Perl program that reads an HTML document and prints out a list of all hyperlinks found in the document. It might generate output something like this:
|
The general idea is that the program prints out each hyperlink
including the text inside the <A> and </A>
tags and the URL.
Get fancy and generate an HTML page as the output! We can later turn this in to a CGI program that creates a list of links found in any page on the WWW!
Write a perl program that acts as an HTML preprocessor. We
would like to save some typing when creating HTML files, so we want to
create some custom tags. For example, I want start and end tags named
<BI> and </BI> that do
the following:
Everything inside the tags is placed in a set of
<B> </B> tags (so that it is
rendered as bold). Additionally everything inside the
<BI> and </BI> tags should be
placed inside <I>, </I> tags so
that it is rendered in italics. Here is an example of what your program would
read (HTML with our new tags) and what it should print out (plain old
HTML):
| Original |
| ||
| Converted |
|
Write a perl program that will read in an HTML document and output
a new HTML document that contains a table with two cells (in one
row). In the left cell should be a copy of the complete original HTML
document inside <PRE> tags so we can see the raw HTML. You will
need to replace all "<" characters with the sequence "&lt;" and all
">" characters with the sequence "&gt;", otherwise the browser will
think they are HTML tags (and we want to see the tags in the left
cell). In the right cell just include the HTML body of the document,
so we can see what it will look like when rendered by a browser.
You can save the output of your program in a file by using the DOS ">" redirection operator like this:
perl myprog.pl input.html > output.html
Then you can view the HTML file (output.html) with Netscape or IE. Here is an example of what you should see in the browser.
<HEAD> <TITLE>This is a sample</TITLE> </HEAD> <BODY> <HTML> This sample includes a word that is in <B>boldface</B> and another in <I>italics</I>. |
This sample includes a word that is in boldface and another in
italics. Here is also a list of my favorite buildings on campus:
The answer to HW #2 is here. |
Here is a list of what you need to do:
You want to generate an HTML document - so you should start by printing
out a head and the <BODY> and <HTML>
tags.
Read in the entire HTML document using <>. You
can use something like this to put the entire document in a single variable:
|
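One common way to collect a whole document in a single variable is to join all of the input lines together. The sketch below assumes that approach; the helper name `slurp` is hypothetical, and an in-memory filehandle stands in for the real input so the example is self-contained:

```perl
use strict;
use warnings;

# Read every remaining line from a filehandle into one string.
# slurp is a hypothetical helper name.
sub slurp {
    my ($fh) = @_;
    return join('', <$fh>);   # <$fh> in list context returns all lines
}

# An in-memory "file" standing in for the HTML document on input.
my $sample = "<HTML>\n<BODY>\nHi Dave\n</BODY>\n</HTML>\n";
open(my $in, '<', \$sample) or die "cannot open in-memory file: $!";
my $htmldoc = slurp($in);
close($in);
print $htmldoc;
```

With real input the same idea is written as `$htmldoc = join('', <>);`. Because the result contains newlines, a "." in a later pattern will not match across line ends unless the pattern is written to allow for them.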
Save a copy of the entire document so you can put it in the left cell
in your table. ($origdoc = $htmldoc will do).
Extract everything that is within the body of the document. Something
that looks like
$htmldoc =~ s/someregexp/\1/;
will do the job
(you need to come up with someregexp). This will later go in the
right cell of the table.
Replace all "<" characters in $origdoc with the sequence
"&lt;" and replace ">" with "&gt;".
print out an HTML table, which will look something like this:
"<TABLE BORDER=1><TR><TD>$origdoc</TD><TD>$htmldoc</TD></TR></TABLE>"
Finish up the HTML document you are creating by printing out the
</HTML> and </BODY> tags.