Perl

Homework 2

For this homework, you will create a larger program that uses regular expressions to investigate and modify a file, along with the advanced sorting techniques we discussed in class.

You have been hired to analyze several terabytes of email correspondence, looking for specific information. This information will be vital to law enforcement operatives trying to capture a nefarious criminal. The agency has provided you with all emails sent and received, as captured in a raid. They want you to analyze the emails and provide them with:

Phone numbers
Phone numbers contain at least a three-digit prefix and a four-digit exchange. They may also begin with a three-digit area code. The three parts may be separated by hyphens, periods, or spaces, or not separated at all. The area code may or may not be enclosed in parentheses.
Email addresses
Email addresses contain any characters other than whitespace, with an at-sign and at least one period in the part after the at-sign (but not at the end of the address itself).
Physical addresses
Physical addresses contain a number, a street name, and a street type (one of: Rd, St, Cir, Blvd, Ct, Ln, Dr, or Ave). They may or may not be followed by a city name, a comma, and a two-letter (both capitals) state abbreviation.
Times
A time of day contains a one- or two-digit hour, followed by a two-digit minute and possibly a two-digit second, all separated by colons. AM or PM (in any case) may or may not follow.
Dates
Dates contain a four-digit year, a one- or two-digit month, and a one- or two-digit day, in either year, month, day order or month, day, year order. The components may be separated by hyphens or slashes.
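As one illustration of turning a description above into a pattern, here is a minimal sketch for the time-of-day case. The pattern `$time_re` and the helper `seconds_since_midnight` are hypothetical names chosen for this example, not part of the assignment, and the pattern is only one possible reading of the description. It uses the `/g` modifier to find every match in a string, and converts each match to seconds since midnight so that 24-hour and 12-hour times can be compared as plain numbers.

```perl
use strict;
use warnings;

# Hypothetical pattern: H:MM or HH:MM, optional :SS, optional AM/PM.
# Character classes stand in for case-insensitive matching here.
my $time_re = qr/\b(\d{1,2}):(\d{2})(?::(\d{2}))?(?:\s*([AaPp][Mm]))?\b/;

# Convert the captured pieces to seconds since midnight.
sub seconds_since_midnight {
    my ($hour, $min, $sec, $meridiem) = @_;
    $sec = 0 unless defined $sec;          # seconds are optional
    if (defined $meridiem) {
        $hour %= 12;                       # 12 AM and 12 PM both become 0 here
        $hour += 12 if lc $meridiem eq 'pm';
    }
    return $hour * 3600 + $min * 60 + $sec;
}

my $text = "Meet at 9:05 PM, or failing that at 21:30:15.";
my @seconds;
while ($text =~ /$time_re/g) {
    push @seconds, seconds_since_midnight($1, $2, $3, $4);
}
print "$_\n" for @seconds;
```

This sketch does not reject impossible values such as an hour of 99; a fuller solution would need to decide how to treat them.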

You must find all of these vital pieces of information. The agency wants a report of each piece of information in each paragraph of text from the emails. (See below regarding "paragraph mode".) For each email supplied to you (i.e., each file name listed on the command line), scan each paragraph, looking for all instances of the above information. Report all instances found, in the orders specified below:

  • Phone numbers: ordered by area code, least to greatest. Numbers without area codes come first, ordered by prefix, least to greatest.
  • Email addresses: ordered by domain name, ASCIIbetically. Addresses with the same domain name are ordered by user name.
  • Physical addresses: ordered by state (with city being the tie breaker), if available. Addresses without a city/state come first. Addresses without a city/state, and addresses with the same city/state, are ordered by street name, with street number being the tie breaker.
  • Times: ordered from earliest to latest.
  • Dates: ordered from earliest to latest.
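The orderings above each combine a primary key with a tie breaker. A minimal sketch of that idea for email addresses follows, using a plain `sort` block with a chained comparison; this is an assumption about one workable approach and may not be the sorting technique covered in class. The sample addresses are made up for illustration.

```perl
use strict;
use warnings;

# Made-up sample data for illustration only.
my @emails = ('zed@example.com', 'amy@example.org', 'bob@example.com');

# Split each address at its last at-sign, then compare domains first
# and fall back to user names when the domains are equal.
my @sorted = sort {
    my ($user_a, $dom_a) = $a =~ /^(.*)\@([^\@]*)$/;
    my ($user_b, $dom_b) = $b =~ /^(.*)\@([^\@]*)$/;
    $dom_a cmp $dom_b or $user_a cmp $user_b;
} @emails;

print "$_\n" for @sorted;
```

Because `cmp` returns 0 for equal strings, the `or` moves on to the user-name comparison only when the domains match. The same shape works for the other lists by swapping in `<=>` for numeric keys.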

Things to keep in mind

  • Be sure to account for both military time and 12-hour times with am/pm being in the same list.
  • There may be multiple instances of each kind of information within the same email, within the same paragraph, and even within the same line.
  • Physical addresses, which contain spaces, may span multiple lines. That is, an address may start on one line and continue on the next. Number, street, type, city, and state may all be separated by any form of whitespace, including newlines.
  • Be sure to account for differently formatted dates in the same list.
  • For each file, you should output the file name, the paragraph number, and then the list of all information for that paragraph.
  • To enable "paragraph mode", set the $/ variable to the empty string (""). This will cause the standard readline operator (<>) to read paragraphs at a time, rather than lines. Paragraphs are separated in the files by multiple consecutive newlines.
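The paragraph-mode behavior described in the last item can be sketched as follows. To keep the example self-contained it reads from an in-memory string rather than from files named on the command line; the sample text and variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

# Made-up sample text: two paragraphs separated by several newlines.
my $sample = "first paragraph, line 1\nline 2\n\n\nsecond paragraph\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

my @paragraphs;
{
    local $/ = "";                 # paragraph mode, restored at end of block
    while (my $para = <$fh>) {
        push @paragraphs, $para;   # each read returns one whole paragraph
    }
}
close $fh;

for my $i (0 .. $#paragraphs) {
    print "Paragraph ", $i + 1, ":\n$paragraphs[$i]";
}
```

Using `local` confines the change to `$/` to the enclosing block, so any other reads in the program keep their normal line-at-a-time behavior.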

Tasks to save for 2/25

The following components of this homework will be significantly easier to achieve after the lecture of 2/25. Do not put your program through hoops trying to get them to work with only the material learned on 2/18.

  • Case-insensitivity of am/pm
  • Finding more than one example of a given pattern in a single paragraph

Sample Input and Output

Sample I/O is available to aid in your development.

Grading Criteria

Criterion                      Points
Find all email addresses       5
Sort all email addresses       7.5
Find all phone numbers         10
Sort all phone numbers         7.5
Find all physical addresses    10
Sort all physical addresses    10
Find all times                 5
Sort all times                 10
Find all dates                 7.5
Sort all dates                 7.5
Error Checking                 5
Code Style                     5
Output Style                   5
No warnings                    5

Penalties

Late
A submission turned in within 14 hours past the deadline will lose 20 points. A submission turned in more than 14 hours past the deadline will receive a grade of 0.
Compilation
If your program fails to compile, it will be graded subjectively based on the code written, and then lose 50% of the remaining points.

No Warnings

Perl must not generate any compile-time or runtime warnings when evaluating your code. That is, I should never see "use of uninitialized value" or "illegal division by 0" or similar. (You are, of course, welcome and encouraged to use the warn and die functions to tell the user when he/she has done something wrong.) Your code will be executed with warnings enabled, even if you don't explicitly use warnings in your program.

Error Checking

You must check that the user has not done something wrong. In this case, that means verifying that at least one command-line argument is given and that every file named is readable.

Your program should never crash due to unexpected input. A sensible error message should be printed to the user, telling him/her what went wrong.
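A minimal sketch of these checks follows. The helper name `check_args` and the wording of its messages are hypothetical; the sketch only shows one way to collect problems before any file is processed, using Perl's built-in file test operators.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of human-readable problems with the
# command-line arguments (an empty list means everything looks fine).
sub check_args {
    my @files = @_;
    return ("Usage: $0 file [file ...]") unless @files;

    my @problems;
    for my $file (@files) {
        if    (!-e $file) { push @problems, "$file: no such file" }
        elsif (!-f _)     { push @problems, "$file: not a regular file" }
        elsif (!-r _)     { push @problems, "$file: not readable" }
    }
    return @problems;
}

# Report every problem found; a real program would likely stop here
# with a non-zero exit status if the list is not empty.
my @problems = check_args(@ARGV);
warn "$_\n" for @problems;
```

The special `_` filehandle reuses the result of the preceding file test, so each file is examined on disk only once. Checking the return value of `open` is still worthwhile, since a file can change between the test and the read.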

Code Style

Your code must be easily read by a human being. Three facets are most important: consistent indentation, meaningful variable names, and explanatory comments. For a larger guide to writing well-styled Perl code, please read perldoc perlstyle.

Output Style

Your output must be easily read by a human being. Values and data should be labeled. Prompts should be explicit. White space and newlines should be used for visual distinction. Debugging statements should be removed before submitting.

Submission Instructions

To submit, log in to solaris.remote.cs.rpi.edu and execute the program ~lallip/public/submit.pl. Follow the prompts. Please remember that the RCS Submission script is no longer valid. All work should be done on CSNet from this point forward.

You may submit as many times as you like; only the last submission will be graded. Final submission is due at 11:59:59pm, Wednesday, March 3, 2010.
