WWWPal - A System for Analysis and Synthesis of Web Pages

John R. Punin
Dept. of Computer Science, Rensselaer Polytechnic Institute, USA, E-mail: puninj@cs.rpi.edu
Dr. Mukkai Krishnamoorthy
Dept. of Computer Science, Rensselaer Polytechnic Institute, USA, E-mail: moorthy@cs.rpi.edu
 

1. Introduction

A number of systems have been developed in the past few years for the analysis of web pages. Systems such as WebCutter (Mapuccino) [Maarek & Shaul 97], WebQuery [Carriere  & Kazman 95] [Carriere  & Kazman 97], WebCiao [Chen & Koutsofios 97] and SiteHelper [Ngu & Wu 97] have been recently reported to perform organization of Web documents. Our system, WWWPal,  is different from those as our system can handle large graphs [Isakowit et al. 95] [Piroli et al. 96] [Sano 96] [Takahashi & Liang 97], has a better display using clustering algorithms, has a skeletal graph browser, and has an interface to the HTML Synthesizer and browser. WWWPal is modular and can be added more functionality if necessary. Our system can also synthesize site maps in many different forms.

We present the design, implementation, algorithms used, and examples of WWWPal in the following sections.

2. System Overview - Components of the System

The components of the System are:
 
Figure 1: Components of the WWWPal System.
 
The whole system is implemented in C (in an object-oriented fashion). We have used the Libwww and Motif libraries. An important feature is that our system consists of a number of modules (such as graph library, graph analyzer) which others can use as libraries. Our whole system currently runs under Solaris 5.2 and it can be ported to Linux. The system (binary image) takes 1 M in storage.

3. Web Robot

WWWPal System uses Libwww from the World Wide Web Consortium. The Web Robot uses Libwww to navigate automatically in the Web. Parameters given to the Web Robot are the initial URL, depth of navigation, and conditions of navigation. The initial URL is where the Web Robot starts the navigation and continues using a Breadth-First Search (BFS) algorithm. The depth of navigation is a parameter of the Robot to prevent it from navigating infinitely in the Web. Other parameters of the navigation are names of Web servers and conditions such as which directories are allowed to visit or not. The output of the navigation is a text file that describes the Web graph generated by the Robot.  [Fig. 2] shows a sample output of the graph from navigating the WebNet site.

The Web Robot handles recursive paths, redirection of URLs, and semantically incorrect URLs. Recursive paths can be caused from navigation through file directories (through links) so that the Robot is not allowed to visit them. All visited URLs are kept in hash tables to maximize the performance of the navigation time. If the visited Web Server redirects a URL, the Web Robot updates that information in the Web Graph erasing the old URL and creating a new Web Node with the redirection. The Web Robot verifies that each visited URL is semantically correct: for example, URLs with blank spaces are not visited.

Furthermore, the Web Robot uses the standard file "robots.txt" to obtain additional information for visiting Web sites. This information determines the Web directories the Robot is allowed to visit and the directories the Robot is not allowed to visit.

The Web Robot gathers meta information from each Web Node such as title of document, size of document, date, mime types and general HEAD info (LINK, META tags). Each Web Node is given in an unique number the first time the corresponding page is first visited by the Web Robot.  This information is important for visualization of the Web graphs and for the system graph Analyzer. The Robot saves this information as  properties of a Web Node.

The Robot first marks all the links at a given level; it then explores the first link and continues its traversal. This exploration stops if we maintain all nodes encountered are either visited or marked. In order to do this, a queue is used to maintain the parent links.

The Robot traversal takes the longest time in our system. Because the system saves the Web graph in a plain text file, large sites tend to produce large Web graphs. From our current experience, we found that for a medium-size computer science department, the Web graph file was  6 M bytes, for a medium-size technical university the Web graph was 12 M bytes and for an average user or a medium-size company it was 0.1 M bytes. For example, the Web graph for the WebNet site on 02/20/98 was 0.01M bytes.

4. Graph Visualizer

The Graph Visualizer is designed to display large graphs.  Graphs can be directed or undirected and the nodes can have different geometrical shapes. The Graph Visualizer interacts with a Graph Library that we developed. The Graph Library stores information about nodes and edges in structures that can be accessed through hash tables. Labels (identifiers) and positions are the sources used to generate the hash numbers in order to store and access this information.

The Graph Library has several methods of drawing graphs such as a horizontal tree, vertical tree, radial tree, barycentric, spring algorithms, circular, as well as incremental algorithms. These different drawing algorithms use "heuristics" to place the vertices of a graph according to "some specifications" so that the resulting embedding of the graph looks "nice."  The Zoom function helps to look at a graph in great detail or to get an overview of a graph. The Graph Library allows the user to change information of nodes and edges so the Graph Visualizer can edit the graph. Graphs can be saved in text format which can be read by the Graph Library's parser. The Graph Library is object-oriented so it is easy to add other methods to read different graph file formats. It is also easy to add more drawing methods.

The Graph Visualizer of the WWWPal system interprets the  information of nodes and edges of the graph in this way:

Graph Visualizer can display large graphs (200,000 nodes).  We recommend partitioning large graphs into medium-sized graphs (~ 500 nodes) so the drawing of the graph is reasonably fast. An automatic partition can be made by the Clustering Tool of the WWWPal system.  A sample graph of the WebNet site is given in [Fig. 2]. The same graph  under a spring loaded embedding is shown in [Fig. 3].
 
 
Figure 2: Web Graph of the WebNet Site.
 

 

           Figure 3: Web Graph of the WebNet Site (spring loaded embedding).
 
Web graphs can be displayed as a hierarchical list of documents. If one of the Web Nodes has children, this document is displayed as a directory of the list. We used the ListTree Widget of Robert W. McMullen. [Fig. 4 ] shows the WebNet web site as a hierarchical list of documents.
 
 
Figure 4: Web Graph of the WebNet Site shown as a hierarchical list of documents.
 

5. Clustering Tool

The Clustering Tool of the WWWPal system helps partition the graph using heuristic algorithms. There are two major reasons as to why the system performs clustering algorithms.  The first reason is in visualization of large graphs. Clustering [Broder 97] [Storey 96]  nodes reduce the size of the graph. The second reason is to get a site map of a given Web site. We have implemented three algorithms and the user can apply any one of  them on a Web graph.  All three algorithms are recursive and they follow this general algorithm.
URL selectandgroup(URL urlbase,Graph g)
{
   Graph g1,g2;
   URL url1,url2;

   g1 = select(g);
   if (selection conditions are met)
    {
      save g in a new url;
      return new url based on urlbase;
    }
   else
    {
      g2 = group(g,g1);
      url1=selectandgroup(urlbase, g1);
      update g2 with url1;

      url2=selectandgroup(urlbase , g2);
      return url2;
    }
}
The essence of this algorithm is to first select a subset of nodes (g1) so that the selected nodes are more related than the rest of the nodes. From this, two graphs are generated: a browse graph (g1) and a group graph (g2). The new vertex group in the group graph (g2) is updated with the location of the browse graph (url1). The parameters of the algorithm are the minimum number of nodes in the graph (MIN_NODES) and the minimum number of the selected nodes (MIN_SEL_NODES). If these parameters are met, a new selection of nodes is made and the process is repeated recursively.

All three heuristics are different in the selection of the nodes. For all the three heuristics a BFS algorithm first classifies the edges of a Web graph. Once the clustering is done, the clustered nodes are represented as pentagons.  There is an edge between any two clustered nodes, if there is an edge between any node in one cluster to any node in the second cluster.

The first clustering algorithm counts the number of nodes of the subtree of the immediate children of the root node. The subtree that has the larger number of nodes and is bigger than the parameter MIN_SEL_NODES is selected.  If the number of nodes in the current graph are smaller the the parameter MIN_NODES, no selection is made. The first clustering algorithm has been used to get site maps for medium-size companies.

The second clustering algorithm chooses a group of nodes with the same URL template. For example, all web nodes that are under http://www.cs.rpi.edu/guide97/ form a tree. This helps us to visualize a cluster of nodes that are related with the same pattern of URLs. The second clustering algorithm has been used for an average user.

The third clustering algorithm is designed for a large Web site. First the selection is made by the user using the second algorithm (For example the URL template http://www.cs.rpi.edu/~moorthy/* describes all Web nodes of the user moorthy).  When no selection can be made, the first algorithm is applied to partition the graph in smaller graphs. This algorithm can be used to visualize the Web graph as a set of Web graphs of the users' Web sites. The third clustering has been used for a Web site consisting of several users. It has been effectively used to obtain site maps of university web sites.

These clustering algorithms generate a new graph that contains an overview relation between all generated graphs. The shapes of the nodes of this graph are hexagons and the identifiers are the URLs where the files of the graphs are saved. This graph is always a tree and helps the user visualize the whole structure of large Web site. [Fig. 5] shows the graph overview of the Computer Science Department of Rensselaer Polytechnic Institute Web site. [Fig. 6] shows the subgraph of the nodes that are clustered in the root node of [Fig.5]. It is also possible to get a HTML site map from the clustering algorithms as described in the next section.
 
 

 
Figure 5: Web Graph of the Web Site of the Computer Science Dept. RPI after clustering.
 
 
Figure 6: First Node of the Web Graph of the Web Site of the Dept. of Comp. Sci. RPI.

 6. Graph Analyzer

Graph Analyzer has a number of modules (filters) to analyze the constructed Web graph. These filters obtain specific data from the Web graph. They also report groups of nodes that are related under any specific property.

These filters are  graph algorithms with the following caveats that the graphs are very large (so the storage space for these algorithm is limited) and there are a number of attributes associated with nodes of the Web graph. In this section we describe a few filters that we have found useful.

 
 
 
Figure 7: Site map of AACE Web Site (Table view).
 
 
Figure 8: Site map of AACE Web Site (List view).

7. ASHE (xhtml)

ASHE (A Simple HTML Editor) was presented in the Second WWW Conference at Chicago in October 94  [Punin & Krishnamoorthy 94]. Our first ASHE release was in January of 1995 and since then we have had 2 more releases.

The main features of ASHE are:

ASHE is used in the synthesis of documents for the WWWPal system. It is specifically useful when we generate skeletal documents by constructing a graph. As an example, for the graph shown in [Fig. 9], the HTML document is shown in [Fig.  10].
 
 
Figure 9: Graph of a Skeletal Document.
 
 
Figure 10: HTML Document of a Skeletal Document

8. Applications

WWWPal has a number of applications some of which have been mentioned earlier. In this section, we mention two other applications of the system: 1) Graph Browser and 2) HTML Synthesizer.

8.1 Graph Browser

In our graph visualization module, it is possible not only to edit, create, display and modify graphs (Web graphs), but also to open a URL. When a URL is specified, a node gets created and displayed. Furthermore all the links that are connected to that URL are displayed as edges originating from that Web node. Now interactively, we can explore any node in the created graph (Web graph). When that node gets explored, all the links connected to that URL page get created as edges emanating from that node. If a Web node has already been visited, it is no longer expanded. If one wants to look at the content of any Web node, then by using either ASHE or Netscape interface one can see the actual content of the URL. Thus, we have created a browser, which may be thought of as an overview browser. The advantage of this kind of browser is that it does not load the URL document itself, but only gets the information such as the type, size, link and the protocol. [Fig. 11] shows a sample browser of one of the authors' home page.

 

 
Figure 11: Web Graph generated after navigating in a Web Site

8.2 HTML Synthesizer

WWWPal also helps us create an outline of a HTML document. Using the graph Visualizer module, a user starts by creating Web nodes and edges joining these Web nodes. The user also specifies the URL attributes of the Web nodes and the order of traversal. WWWPal provides an option of creating a URL document for the created Web Graph. The created URL document consists of only links and does not contain any text or images. The contents (textual, image and audio) can be added using ASHE, which has been interfaced with our system. [Fig. 9] shows a sample Web graph and  [Fig. 10] shows the created Web document.

9. Conclusion

This paper describes the WWWPal system that we designed and implemented. Our system is useful for both analysis and synthesis of Web Documents. WWWPal also helps in the organization of Web sites. We have provided a number of examples to illustrate the uses of our system. Our future work involves adding more modules to Graph Analyzer and to obtain many different types of Site maps. Our immediate future work is to make our system freely available to all interested Web servers and clients.

10. References

[Bevirt 96] B. Bevirt (1996), Designing Hypertext Navigation Tools, Proceedings of WebNet 96 Conference, San Francisco, USA.

[Broder 97] A. Broder, S. Glassman, M. Manasse, G. Zweig (1997), Syntactic Clustering of the Web, Proceedings of WWW6 Conference, Santa Clara, USA.

[Carriere  & Kazman 95] J. Carriere and R. Kazman (1995), WebQuery, Interacting with Huge Hierarchies: Beyond Cone Trees, Proceedings of IEEE Information Visualization, Atlanta, USA, 74-81.

[Carriere  & Kazman 97] J. Carriere and R. Kazman (1997), WebQuery, Searching and Visualizing the Web through Connectivity, Proceedings of WWW 6 Conference, Santa Clara, USA, 701-711.

[Chen & Koutsofios 97] Y. Chen, E. Koutsofios (1997), WebCiao: A Website Visualization and Tracking System, Proceedings of WebNet 97 Conference, Toronto, Canada.
 
[DiBattista et al. 93] G. DiBattista, P. Eades, R. Tamassia and I. Tollis (1993), Algorithms for Drawing Graphs: An Annotated Bibliography, Technical Report, Computer Science Department, Brown University.

[Isakowit et al. 95] T. Isakowitz, E. A. Stohr and P. Balasubramanian (1995), RMM: A Methodology for Structured Hypermedia Design, Comm. ACM, 38, 34-44.

[Maarek & Shaul 97] Y. Maarek and I. Shaul (1997), WebCutter: A System for Dynamic and Tailorable Site Mapping, Proceedings of WWW 6 Conference, Santa Clara, USA,  713-722.

[Mukherjea & Foley 95] S. Mukherjea and J. Foley (1995). Visualizing the World-Wide Web with the Navigational View Builder, Proceedings of WWW 3 Conference, Darmstadt, Germany.

[Ngu & Wu 97] D. Ngu and X. Wu  (1997), SiteHelper: A Localized Agent that Helps Incremental Exploration of the World Wide Web, Proceedings of WWW 6 Conference, Santa Clara, USA, 691-699.

[Piroli et al. 96] P. Piroli, J. Pitkow and R. Rao (1996), Silk from a Sow's Ear: Extracting Usable Structures from the Web, Proceedings of the CHI 1996, Vancouver, Canada.

[Punin & Krishnamoorthy 94] J. Punin and M. Krishnamoorthy (1994), Environment for Preparing HTML Documents, Proceedings of WWW2 Conference, Chicago, USA, 147-172.

[Sano 96] D. Sano (1996), Designing Large-Scale Web Sites: A Visual Design Methodology, Wiley Computer Publishing.

[Storey 96] M. D. Storey, H Muller, K. Wong (1996), Manipulating and Documenting Software Structures, Software Visualization.

[Takahashi & Liang 97] K. Takahashi and E. Liang (1997), Analysis and Design of Web-based Information Systems, Proceedings of WWW 6 Conference, Santa Clara, USA, 377-389.

[Walker 90]  J. Walker II (1990), A Node-positioning Algorithm for General Trees, Software-Practice and Experience, 20(7), 685-705.

11. URL References

Mapuccino
http://www.ibm.com/java/mapuccino/

WebQuery
http://www.cgl.uwaterloo.ca/Vanish/

Libwww
http://www.w3.org/Library/

World Wide Web Consortium
http://www.w3.org/

WebNet Site
http://www.aace.org/conf/webnet/

ASHE(xhtml)
http://www.cs.rpi.edu/~puninj/TALK/head.html

Netscape
http://www.netscape.com/