Homework 2
For this homework, you will create a larger program that uses regular expressions to investigate and modify a file and applies the advanced sorting techniques we discussed in class.
You have been hired to analyze several terabytes of email correspondence, looking for specific information. This information will be vital to law enforcement operatives trying to capture a nefarious criminal. The agency has provided you with all emails sent and received, as captured in a raid. They want you to analyze the emails and provide them with:
- Phone numbers
- Phone numbers contain at least three digits and four digits for the prefix and exchange. They may or may not also contain a preceding three digits for the area code. The three parts may be separated by hyphens, periods, spaces, or not at all. The area code may or may not be enclosed in parentheses.
- Email addresses
- Email addresses contain any characters (other than whitespace), with an at-sign, and at least one period in the part after the at sign (but not at the end of the address itself)
- Physical addresses
- Physical addresses contain a number, a street name, and a street type (one of: Rd, St, Cir, Blvd, Ct, Ln, Dr, or Ave). They may or may not be followed by a city name, a comma, and a two-letter (both capitals) state abbreviation.
- Times
- A time of day contains a one- or two-digit hour, followed by a two-digit minute and possibly a two-digit second, all separated by colons. AM or PM (any case) may or may not follow.
- Dates
- Dates contain either a four-digit year, a one- or two-digit month, and a one- or two-digit day, in that order, or the same components in month, day, year order. The components may be separated by hyphens or slashes.
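As an illustration of how one of the descriptions above might translate into a pattern, here is a minimal sketch for the phone-number rule. It is not a reference solution: the names `$phone_re` and `find_phones` are hypothetical, and the lookarounds that refuse to match inside a longer run of digits are an assumption the assignment does not state.

```perl
use strict;
use warnings;

# Hypothetical pattern for the phone-number description above.
my $phone_re = qr{
    (?<!\d)              # assumption: do not start inside a longer digit run
    (?:
        \( (\d{3}) \)    # area code in parentheses ($1) ...
      |
        (\d{3})          # ... or bare ($2)
    )?                   # the area code is optional
    [-.\ ]?              # optional hyphen, period, or space
    (\d{3})              # prefix ($3)
    [-.\ ]?
    (\d{4})              # exchange ($4)
    (?!\d)               # assumption: do not stop inside a longer digit run
}x;

# Hypothetical helper: return one hash reference per phone number found.
sub find_phones {
    my ($text) = @_;
    my @found;
    while ($text =~ /$phone_re/g) {
        my $area = defined $1 ? $1 : $2;    # undef when no area code is present
        push @found, { area => $area, prefix => $3, exchange => $4 };
    }
    return @found;
}
```

Keeping the parts in separate captures, rather than only the whole match, makes the later ordering by area code and prefix straightforward.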
You must locate all of these vital pieces of information. The agency wants a report of each kind of information found in each paragraph of text from the emails (see below regarding "paragraph mode"). For each email supplied to you (i.e., each file name listed on the command line), scan each paragraph, looking for all instances of the above information. Report all instances found, in the orders specified below:
- Phone numbers: ordered by area code, least to greatest. Numbers without area codes come first, ordered by prefix, least to greatest.
- Email addresses: ordered by domain name, ASCIIbetically. Addresses with the same domain name are ordered by user name.
- Physical addresses: ordered by state (with city being the tie breaker), if available. Addresses without city/state come first. Addresses without city/state, and addresses with the same city/state, are ordered by street name, with street number being the tie breaker.
- Times: ordered from earliest to latest.
- Dates: ordered from earliest to latest.
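To make an ordering such as "earliest to latest" concrete, the sketch below sorts time strings by first converting each one to seconds since midnight. It uses the Schwartzian transform, a common Perl idiom; whether that is one of the techniques covered in class is an assumption. The names `time_to_seconds` and `sort_times` are hypothetical, and the sketch assumes every input has already been matched as a valid time.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a time string to seconds since midnight.
sub time_to_seconds {
    my ($time) = @_;
    my ($hour, $min, $sec, $ampm) =
        $time =~ /^(\d{1,2}):(\d{2})(?::(\d{2}))?\s*([ap]m)?$/i
        or die "Unrecognized time: $time\n";
    $sec = 0 unless defined $sec;
    if (defined $ampm) {
        $hour %= 12;                          # 12:xx am becomes 0:xx
        $hour += 12 if lc($ampm) eq 'pm';     # 1:xx pm becomes 13:xx
    }
    return $hour * 3600 + $min * 60 + $sec;
}

# Schwartzian transform: compute each sort key once, sort on it, then
# recover the original strings.
sub sort_times {
    return map  { $_->[1] }
           sort { $a->[0] <=> $b->[0] }
           map  { [ time_to_seconds($_), $_ ] }
           @_;
}
```

Computing the key once per item avoids re-parsing each string on every comparison. The same shape should carry over to the other orderings by swapping in a different key function and comparison.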
Things to keep in mind
- Be sure to account for both military (24-hour) times and 12-hour times with am/pm appearing in the same list.
- There may be multiple instances of each kind of information within the same email, within the same paragraph, and even within the same line.
- Physical addresses, which contain spaces, may span multiple lines. That is, it may start on one line, and continue to the next. Number, street, type, city, state may all be separated by any form of whitespace, including newlines.
- Be sure to account for differently formatted dates in the same list.
- For each file, you should output the file name, the paragraph number, and then the list of all information for that paragraph.
- To enable "paragraph mode", set the $/ variable to the empty string (""). This will cause the standard readline operator (<>) to read a paragraph at a time, rather than a line. Paragraphs are separated in the files by multiple consecutive newlines.
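The paragraph-mode setting described in the last item can be sketched as follows. This is only one possible arrangement: the helper name `read_paragraphs` is hypothetical, and opening each file explicitly (rather than relying on the bare `<>` operator) is an assumption made here so that the file name and paragraph number are easy to track.

```perl
use strict;
use warnings;

# Hypothetical helper: return the paragraphs of one file as a list.
sub read_paragraphs {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open '$file': $!\n";
    local $/ = '';              # paragraph mode; restored when the sub returns
    my @paragraphs = <$fh>;     # each element is one whole paragraph
    close $fh;
    return @paragraphs;
}
```

A caller could loop over `@ARGV`, call this helper for each file name, and keep a counter to label each paragraph in the report. Using `local` keeps the change to `$/` from leaking into code that still expects line-at-a-time reads.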
Tasks to save for 2/25
The following components of this homework will be significantly easier to achieve after the lecture of 2/25. Do not put your program through hoops trying to get them to work with only the material learned on 2/18.
- Case-insensitivity of am/pm
- Finding more than one example of a given pattern in a single paragraph
Sample Input and Output
Sample I/O is available to aid in your development.
Grading Criteria
| Criterion | Points |
|---|---|
| Find all email addresses | 5 |
| Sort all email addresses | 7.5 |
| Find all phone numbers | 10 |
| Sort all phone numbers | 7.5 |
| Find all physical addresses | 10 |
| Sort all physical addresses | 10 |
| Find all times | 5 |
| Sort all times | 10 |
| Find all dates | 7.5 |
| Sort all dates | 7.5 |
| Error Checking | 5 |
| Code Style | 5 |
| Output Style | 5 |
| No warnings | 5 |
Penalties
- Late
- A submission turned in within 14 hours past the deadline will lose 20 points. A submission turned in more than 14 hours past the deadline will receive a grade of 0.
- Compilation
- If your program fails to compile, it will be graded subjectively based on the code written, and then lose 50% of the remaining points.
No Warnings
Perl must not generate any compile-time or runtime warnings when evaluating
your code. That is, I should never see "use of uninitialized value" or
"illegal division by 0" or similar. (You are, of course, welcome and
encouraged to use the warn and die functions to tell the user when he/she has
done something wrong.) Your code will be executed with warnings enabled, even
if you don't explicitly use warnings in your program.
Error Checking
You must check to make sure the user has not done something wrong. In this case, that means verifying that the files are readable and that at least one command line argument is given.
Your program should never crash due to unexpected input. A sensible error message should be printed to the user, telling him/her what went wrong.
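The two checks named above might be sketched like this. The helper name `check_arguments` and the wording of its messages are hypothetical, and returning a list of problems (instead of dying at the first one) is a design assumption, not a requirement of the assignment.

```perl
use strict;
use warnings;

# Hypothetical helper: return a list of problems found on the command line.
# An empty list means every argument looks usable.
sub check_arguments {
    my @files = @_;
    return ("Usage: $0 file [file ...]") unless @files;

    my @problems;
    for my $file (@files) {
        if    (!-e $file) { push @problems, "'$file' does not exist" }
        elsif (!-f $file) { push @problems, "'$file' is not a plain file" }
        elsif (!-r $file) { push @problems, "'$file' is not readable" }
    }
    return @problems;
}
```

A program could call `check_arguments(@ARGV)` before doing any work, pass each returned message to `warn`, and exit with a non-zero status if the list is not empty.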
Code Style
Your code must be easily read by a human being. Most important are three
facets: consistent indentation, meaningful variable names, and explanatory
comments. For a larger guide to writing well-styled Perl code, please
read perldoc perlstyle.
Output Style
Your output must be easily read by a human being. Values and data should be labeled. Prompts should be explicit. White space and newlines should be used for visual distinction. Debugging statements should be removed before submitting.
Submission Instructions
To submit, log in to solaris.remote.cs.rpi.edu and execute the program
~lallip/public/submit.pl. Follow the prompts. Please remember that the
RCS Submission script is no longer valid. All work should be done on CSNet from this
point forward.
You may submit as many times as you like; only the last submission will be graded. Final submission is due at 11:59:59pm, Wednesday, March 3, 2010.
