Network Programming
Spring 2001

Homework 2 - Web Crawler
Due Date: Feb 21

Submit to netprog-submit@cs.rpi.edu with the subject line "2"
Complete submission instructions are given below under "Submitting your files".


IMPORTANT!: All testing will be done on the CS Sun workstations. Make sure your code works on a Sun!


Your assignment is to write a TCP server that provides a web crawling service. The client sends the server a URL and a depth. The server retrieves the requested URL, extracts all the hyperlinks from the document, and sends the list of hyperlinks it found back to the client. In addition, the server repeats this process for each of the hyperlinks found (this is the crawl: the server follows each link, crawling around the WWW). For each hyperlink found in the first document, the server retrieves the document specified by the link and lists all the hyperlinks found in that document.

The resulting list of hyperlinks forms a tree, with the originally requested URL at the root. Your server should build this tree to the depth given by the depth parameter in the original request. If a depth of 1 is requested, the result should be just a list of the links found in the original URL. If a depth of 2 is requested, your server should build a list by following all the links found in the original URL and extracting links from those documents.
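As an illustration of the crawl described above, here is a minimal Python sketch. It is not the required implementation. The names LinkExtractor, extract_links and crawl are hypothetical, and so is the tree representation (a list of (link, subtree) pairs). The page retrieval is passed in as a fetch function so the tree-building logic can be read and tested on its own. Resolving relative links against the page they came from, and skipping non-HTTP links such as mailto:, are assumptions the assignment does not spell out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base, html):
    """Return the http links in html as 'host/resource' strings.

    base is the 'host/resource' of the page the HTML came from.
    Assumption: query strings and fragments are dropped for simplicity.
    """
    parser = LinkExtractor()
    parser.feed(html)
    found = []
    for href in parser.links:
        parts = urlsplit(urljoin("http://" + base, href))
        if parts.scheme == "http" and parts.netloc:
            found.append(parts.netloc + (parts.path or "/"))
    return found


def crawl(url, depth, fetch):
    """Return a list of (link, subtree) pairs, depth levels deep.

    fetch(url) is supplied by the caller and returns the page's HTML,
    or None if the page could not be retrieved.
    """
    if depth < 1:
        return []
    html = fetch(url)
    if html is None:
        return []
    return [(link, crawl(link, depth - 1, fetch))
            for link in extract_links(url, html)]
```

Because the recursion stops when the depth runs out, the crawl terminates even when pages link back to each other. This sketch will fetch the same page more than once if several documents link to it.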

The output of the server should be a list of all the links found, with indenting used to indicate the depth at which each link was found. This is simply an "outline" view of the list of URLs found. (Please refer to the demo system for an example of the output format required.) The output of your server will be viewed using a browser, so you should use HTML to format the output!
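One possible way to produce an indented "outline" in HTML is with nested unordered lists, as in the Python sketch below. The demo system's exact markup is not reproduced here, so treat the tags as an assumption. The function names and the tree representation (a list of (link, subtree) pairs, where each subtree has the same form) are hypothetical.

```python
from html import escape


def render_outline(tree):
    """Render a list of (link, subtree) pairs as nested <ul> lists."""
    if not tree:
        return ""
    items = []
    for link, subtree in tree:
        text = escape(link)  # links come from untrusted pages
        items.append('<li><a href="http://%s">%s</a>%s</li>'
                     % (text, text, render_outline(subtree)))
    return "<ul>" + "".join(items) + "</ul>"


def render_report(url, depth, tree):
    """Wrap the outline in a complete HTML page."""
    heading = "Crawl Report for %s (depth %d)" % (escape(url), depth)
    return ("<html><head><title>%s</title></head><body><h1>%s</h1>%s"
            "</body></html>" % (heading, heading, render_outline(tree)))
```

Each nesting level of the list corresponds to one level of the crawl, so the browser's default list indentation shows the depth at which each link was found.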

The format of the requests that your server will handle is based on the format of HTTP requests (the client will actually be a browser, so all it can send are HTTP requests). The request method will be GET, although you can ignore the request method (you can assume it will always be a GET). The request URI will be composed of 2 parts: one part indicates the starting point for the web crawl (the starting URL) and the other part indicates the depth of the crawl. The starting URL comes first, followed by the character '?', followed by the depth. The starting URL is given as a hostname followed by a resource name (this is just an HTTP URL without the protocol indicator). For example, the following URI tells your server to start the crawl at www.cs.rpi.edu and to build a list of links of depth 2:

/www.cs.rpi.edu/?2
The complete HTTP request line that would be sent would look like this:
GET /www.cs.rpi.edu/?2 HTTP/1.1
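
The request line above could be taken apart as in the following Python sketch. The function name parse_request_line is hypothetical, and rejecting depths below 1 is an assumption (the assignment only describes depths of 1 and up).

```python
def parse_request_line(line):
    """Parse 'GET /host/resource?depth HTTP/1.1' into (start_url, depth).

    Raises ValueError if the line does not have that shape.
    """
    parts = line.split()
    if len(parts) != 3:
        raise ValueError("malformed request line")
    method, uri, version = parts  # the method is ignored, as permitted
    if not uri.startswith("/"):
        raise ValueError("request URI must start with '/'")
    # Split on the last '?': everything before it is the starting URL.
    start_url, sep, depth_text = uri[1:].rpartition("?")
    if not sep or not start_url:
        raise ValueError("request URI must look like /host/resource?depth")
    depth = int(depth_text)  # int() raises ValueError for non-numeric text
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return start_url, depth
```

For the example request this returns ("www.cs.rpi.edu/", 2). A request that does not fit the format (browsers may ask for other resources such as /favicon.ico on their own) raises ValueError, which the server can turn into an error response.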

Keep in mind that the client will be a browser, and that you need to make sure your server can understand an HTTP request as sent by a browser!
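
A browser sends more than the request line: header lines follow it, and a blank line marks the end of the request. The Python sketch below shows one way to read the whole request head from a connected socket and to build a reply the browser will accept. The function names and the 64 KB limit are hypothetical choices, and answering with HTTP/1.0 plus "Connection: close" is just one simple option.

```python
import socket


def read_request_head(conn, limit=65536):
    """Read from conn up to the blank line that ends the HTTP headers.

    Returns the request line and header lines as a list of strings.
    """
    data = b""
    while b"\r\n\r\n" not in data:
        chunk = conn.recv(4096)
        if not chunk:  # peer closed before finishing the headers
            break
        data += chunk
        if len(data) > limit:
            raise ValueError("request headers too large")
    head = data.split(b"\r\n\r\n", 1)[0]
    return head.decode("iso-8859-1").split("\r\n")


def build_response(html, status="200 OK"):
    """Return a complete HTTP response carrying html as its body."""
    body = html.encode("utf-8")
    head = ("HTTP/1.0 %s\r\n"
            "Content-Type: text/html; charset=utf-8\r\n"
            "Content-Length: %d\r\n"
            "Connection: close\r\n"
            "\r\n" % (status, len(body)))
    return head.encode("ascii") + body
```

The first element of the returned list is the request line; the remaining elements are headers that this assignment lets you ignore.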

Here is sample output for the above request (list all links found in the document www.cs.rpi.edu/ and for each link found, follow each link and list the links found):

Netprog sample hw2

Crawl Report for www.cs.rpi.edu/ (depth 1)


Demo System

A sample server is running on monte.cs.rpi.edu on port 1200. You can test out the system by entering a URL like the following into your browser:


http://monte.cs.rpi.edu:1200/www.cs.rpi.edu/?2

monte has been having some troubles lately - send Dave email if you can't access the sample server...

The above request would ask the server to send back a list of depth 2 of the links found in the document at www.cs.rpi.edu/

Warning: Using a depth of more than 2 on any non-trivial web page will probably result in a long wait... Keep in mind that the server is becoming an HTTP client and retrieving each page so it can look for links in the page. A depth of more than 2 can easily result in hundreds or thousands of web page accesses.
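
To make the "server becomes an HTTP client" step concrete, here is a minimal Python sketch of retrieving one page over a plain TCP socket. The names split_url, build_get and fetch_page are hypothetical. Using HTTP/1.0 (so the remote server closes the connection at the end of the document), a 10 second timeout, and treating anything other than a 200 status as a failure are all simplifying assumptions.

```python
import socket


def split_url(url):
    """Split 'host[:port]/resource' into (host, port, resource)."""
    hostport, _, rest = url.partition("/")
    host, colon, port = hostport.partition(":")
    return host, (int(port) if colon else 80), "/" + rest


def build_get(host, port, resource):
    """Return the bytes of a minimal HTTP/1.0 GET request."""
    host_header = host if port == 80 else "%s:%d" % (host, port)
    return ("GET %s HTTP/1.0\r\nHost: %s\r\n\r\n"
            % (resource, host_header)).encode("ascii")


def fetch_page(url, timeout=10.0):
    """Return the body of url as text, or None if it cannot be fetched."""
    try:
        host, port, resource = split_url(url)
        request = build_get(host, port, resource)
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(request)
            chunks = []
            while True:  # HTTP/1.0: read until the server closes
                chunk = s.recv(4096)
                if not chunk:
                    break
                chunks.append(chunk)
    except (OSError, ValueError):  # DNS, connect, timeout, or bad URL
        return None
    head, sep, body = b"".join(chunks).partition(b"\r\n\r\n")
    status = head.split(b"\r\n", 1)[0].split()
    if not sep or len(status) < 2 or status[1] != b"200":
        return None
    return body.decode("iso-8859-1")
```

The timeout matters for a crawler: a single unresponsive host would otherwise stall the whole report.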

To look at the links in www.rpi.edu/rpinfo you could tell your browser to retrieve this URL:


http://monte.cs.rpi.edu:1200/www.rpi.edu/rpinfo/?1

Suggestions

Deliverables

You must also include in your submission a file named README that includes a brief description of your submission, including the name of each file submitted along with a one-line description of what is in the file. You should also include instructions for building your code, problems you had, or anything else you think might be helpful to us. You must make sure your code works on the CS Sun workstations, as this is the only platform that will be used to test your submission.

Grading

Your project will be tested to make sure it works properly - part of this testing will make sure that you check for error conditions and that you don't have a server that keeps growing (memory leaks). If you submit a concurrent server, make sure you take care of zombies...
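
For a concurrent server that forks one child per client, zombies are avoided by collecting the exit status of every finished child. The Python sketch below shows one common pattern on Unix systems: a SIGCHLD handler that calls waitpid in a non-blocking loop. The function names are hypothetical, and this is only one of several ways to handle the problem.

```python
import os
import signal


def reap_children(signum=None, frame=None):
    """Collect every child that has already exited, without blocking.

    Returns the list of process ids that were reaped.
    """
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:  # this process has no children at all
            break
        if pid == 0:  # children exist, but none has exited yet
            break
        reaped.append(pid)
    return reaped


def install_reaper():
    """Run reap_children whenever a child process terminates."""
    signal.signal(signal.SIGCHLD, reap_children)
```

The loop is needed because several children may exit before the handler runs, and a single signal delivery can stand for more than one terminated child.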

25% of your homework grade depends on how well we can understand your code - this means everything should be commented!

Submitting your files

Submission of your homework is via email. The general idea is to send an email message with all your files as attachments. There is an automated email submission system that will respond to your submission right away, so you will have a record that we got your files.

All projects must be submitted via email to netprog-submit@cs.rpi.edu. The subject line of the submission message should contain a single number indicating the project number (1 for HW1, 2 for HW2, etc.). You must include your files as attachments; feel free to send a zip file or a tar file.

Don't send compiled code!

You can expect a return email indicating receipt of your project submission immediately. This receipt will include a list of all the files that were successfully extracted by the submission script - please look over the receipt carefully to make sure your submission worked.

Multiple Submissions: You can resubmit up to 10 times for each project. We will always grade the last submission received unless you tell us otherwise.