WWWPal - A System for Analysis and Synthesis of Web Pages
John R. Punin
Dept. of Computer Science, Rensselaer Polytechnic
Institute, USA, E-mail: puninj@cs.rpi.edu
Dr. Mukkai Krishnamoorthy
Dept. of Computer Science, Rensselaer Polytechnic
Institute, USA, E-mail: moorthy@cs.rpi.edu
Abstract:
WWWPal is a system that helps in the analysis and synthesis of Web documents.
The system eliminates a common problem of obscure organization of Web documents
in Web information systems. WWWPal consists of the following six components:
1) Web Robot, 2) Graph Visualizer, 3) Graph Analyzer, 4) Clustering Tool,
5) Synthesizer, and 6) Interface Package. The Web Robot is used to
visit all the URLs that can be accessed from a given URL in a given Web
server and to construct a graph. Graph Visualizer is used to display graphs
(up to 200,000 nodes) in a variety of ways and also to edit a graph.
Graph Analyzer is used to report different problems such as unreachable
nodes, HTML documents that contain formatting errors, and broken links.
The Clustering Tool is used to partition the nodes of the graph in different
clusters using several heuristics. The partitioned graph is used to get
better overall site maps of the Web server. In addition, this Clustering
Tool helps in the visualization of large graphs. The Synthesizer helps
to create a skeletal HTML document from a given graph. The Synthesizer
interfaces with ASHE (A Simple HTML Editor) to edit a document. An interface
package enables our system, WWWPal, to communicate with a browser, such
as Netscape. This interface and Graph Visualizer form a skeletal graph
browser (of the URLs) in our system.
1. Introduction
A number of systems have been developed in the past few years
for the analysis of web pages. Systems such as WebCutter (Mapuccino)
[Maarek & Shaul 97], WebQuery
[Carriere & Kazman 95] [Carriere & Kazman 97], WebCiao
[Chen & Koutsofios 97] and SiteHelper [Ngu & Wu 97] have been recently
reported to perform organization of Web documents. Our system, WWWPal,
is different from those as our system can handle large graphs [Isakowit
et al. 95] [Piroli et al. 96] [Sano 96] [Takahashi & Liang 97], has
a better display using clustering algorithms, has a skeletal graph browser,
and has an interface to the HTML Synthesizer and browser. WWWPal is modular
and can be added more functionality if necessary. Our system can also synthesize
site maps in many different forms.
We present the design, implementation, algorithms used,
and examples of WWWPal in the following sections.
2. System Overview - Components of the System
The components of the System are:
-
Web Robot
This component is responsible for navigating through
Web sites and gathering raw data (Web graph). The input is the given initial
URL and the output is a collection of nodes and edges, representing the
URLs and the connections, respectively. Protocol information, such as HTTP
and FTP, is an attribute of a node.
-
Graph Visualizer
The visual interface is designed to display general graphs.
Properties of nodes are label, color, shape, weight, and metadata information.
Properties of the edges are color and weight. These properties help us
visualize general Web graph information such as types of files, protocols
used, size of web documents, titles, dates of creation, etc. The visual
interface interacts with the graph library to display graphs in different
kinds of drawings [DiBattista et al. 93] [Mukherjea & Foley 95] [Walker
90] that include radial [Fig. 2], spring loaded (to model edges as springs)
[Fig. 3] and incremental. It is possible to display graphs with 200,000
nodes. This module has three main functions: browsing graphs, creating
and editing graphs, and helping Web navigation using an interface with
a Web browser, such as Netscape.
-
Graph Analyzer
This tool has a number of modules (filters) to analyze
the constructed Web graph. These filters help the user obtain specific
data from the Web graph. They report groups of nodes that are related under
specific properties. It is easy to add additional modules to the Graph
Analyzer.
Examples of filters that are implemented:
-
Pages of a specific user.
-
Broken links.
-
Web books. These are filtered following the HTML tag LINK
or groups of nodes that are related under a specific property: URL template
(For example all documents of the Guide of the RPI Computer Science Department
http://www.cs.rpi.edu/guide97/* ).
-
Linearization of a Web graph. This is used for printing the
Web document.
-
Site Maps. These are reported using Web graphs [Fig. 5] or
HTML files as lists [Fig. 8] or tables [Fig. 7].
-
Popular paths of a Web site. This is achieved using log files
of the Web server.
-
Report of problems of a Web site such as HTML pages that
are large or dead ends (no outgoing hyperlinks).
-
Clustering Tool
This tool helps to visualize large Web graphs. If the
number of nodes of a given Web graph is larger than a given number, the
Web graph is partitioned using clustering algorithms. Our system currently
has three clustering heuristics.
-
Synthesizer
The Synthesizer helps create a skeletal HTML document
from a given graph. Synthesizer also interfaces with ASHE to edit a document
-
Interfaces
The system currently supports two interfaces. The interface
with Netscape helps navigate a URL interactively. The user sees the corresponding
graph to that URL in the visual interface that is generated when he/she
navigates the Web. The second interface is with the ASHE (xhtml) HTML editor
that helps edit the content of any Web node that is an HTML document.
Figure 1: Components of the WWWPal System.
The whole system is implemented in C (in an object-oriented
fashion). We have used the Libwww and Motif libraries. An important feature
is that our system consists of a number of modules (such as graph library,
graph analyzer) which others can use as libraries. Our whole system currently
runs under Solaris 5.2 and it can be ported to Linux. The system (binary
image) takes 1 M in storage.
3. Web Robot
WWWPal System uses Libwww
from the World Wide Web Consortium. The
Web Robot uses Libwww to navigate automatically in the Web. Parameters
given to the Web Robot are the initial URL, depth of navigation, and conditions
of navigation. The initial URL is where the Web Robot starts the navigation
and continues using a Breadth-First Search (BFS) algorithm. The depth of
navigation is a parameter of the Robot to prevent it from navigating infinitely
in the Web. Other parameters of the navigation are names of Web servers
and conditions such as which directories are allowed to visit or not. The
output of the navigation is a text file that describes the Web graph generated
by the Robot. [Fig. 2] shows a sample output of the graph from navigating
the WebNet site.
The Web Robot handles recursive paths, redirection of
URLs, and semantically incorrect URLs. Recursive paths can be caused from
navigation through file directories (through links) so that the Robot is
not allowed to visit them. All visited URLs are kept in hash tables to
maximize the performance of the navigation time. If the visited Web Server
redirects a URL, the Web Robot updates that information in the Web Graph
erasing the old URL and creating a new Web Node with the redirection. The
Web Robot verifies that each visited URL is semantically correct: for example,
URLs with blank spaces are not visited.
Furthermore, the Web Robot uses the standard file "robots.txt"
to obtain additional information for visiting Web sites. This information
determines the Web directories the Robot is allowed to visit and the directories
the Robot is not allowed to visit.
The Web Robot gathers meta information from each Web Node
such as title of document, size of document, date, mime types and general
HEAD info (LINK, META tags). Each Web Node is given in an unique number
the first time the corresponding page is first visited by the Web Robot.
This information is important for visualization of the Web graphs and for
the system graph Analyzer. The Robot saves this information as properties
of a Web Node.
The Robot first marks all the links at a given level;
it then explores the first link and continues its traversal. This exploration
stops if we maintain all nodes encountered are either visited or marked.
In order to do this, a queue is used to maintain the parent links.
The Robot traversal takes the longest time in our system.
Because the system saves the Web graph in a plain text file, large sites
tend to produce large Web graphs. From our current experience, we found
that for a medium-size computer science department, the Web graph file
was 6 M bytes, for a medium-size technical university the Web graph
was 12 M bytes and for an average user or a medium-size company it was
0.1 M bytes. For example, the Web graph for the WebNet site on 02/20/98
was 0.01M bytes.
4. Graph Visualizer
The Graph Visualizer is designed to display large graphs.
Graphs can be directed or undirected and the nodes can have different geometrical
shapes. The Graph Visualizer interacts with a Graph Library that we developed.
The Graph Library stores information about nodes and edges in structures
that can be accessed through hash tables. Labels (identifiers) and positions
are the sources used to generate the hash numbers in order to store and
access this information.
The Graph Library has several methods of drawing graphs
such as a horizontal tree, vertical tree, radial tree, barycentric, spring
algorithms, circular, as well as incremental algorithms. These different
drawing algorithms use "heuristics" to place the vertices of a graph according
to "some specifications" so that the resulting embedding of the graph looks
"nice." The Zoom function helps to look at a graph in great detail
or to get an overview of a graph. The Graph Library allows the user to
change information of nodes and edges so the Graph Visualizer can edit
the graph. Graphs can be saved in text format which can be read by the
Graph Library's parser. The Graph Library is object-oriented so it is easy
to add other methods to read different graph file formats. It is also easy
to add more drawing methods.
The Graph Visualizer of the WWWPal system interprets the
information of nodes and edges of the graph in this way:
-
URLs are the label identifiers of the nodes of the graphs.
-
Shapes of nodes are set depending on the protocol of the
URL. For example, squares are visited web documents, circles are
non-visited web documents, rhombi are ftp documents, ellipses are mailto
URLs, triangles are broken links, etc.
-
Colors of nodes represent types of documents such as HTML
files, text files, compressed files, images files, etc. This information
can be set using the mime type of the Web Node.
-
The Web Graph can be represented as a tree using either DFS
(Depth-First Search) or BFS (Breadth-First Search) from an initial Web
Node. The Graph Visualizer assigns a color to the edges so we can distinguish
between the Tree edges (blue), Forward edges (cyan), Backward edges (green)
and edges to unreachable Web Nodes (magenta).
-
We can partition the graph using the selection methods of
the Graph Library. Selection can be individual, global, children of a Web
Node, subtree of the Web Graph, and Web Nodes with related property such
as URL templates. Once the selection is made, the Graph Visualizer can
partition the graph in two new graphs: One graph contains all selected
nodes, this graph is called "Browse Graph." The second graph is formed
by all non-selected vertices and a new node that groups all selected nodes.
The shape of this node is a pentagon and the label identifier can be
the URL of the file where "Browse Graph" is saved. Clicking the mouse in
the "Group Vertex" opens a window with the "Browse Graph."
-
The Graph Visualizer interacts with ASHE
(xhtml) and Netscape to edit
or browse any of the Web Nodes.
Graph Visualizer can display large graphs (200,000 nodes).
We recommend partitioning large graphs into medium-sized graphs (~ 500
nodes) so the drawing of the graph is reasonably fast. An automatic partition
can be made by the Clustering Tool of the WWWPal system. A sample
graph of the WebNet site is given in [Fig. 2]. The same graph under
a spring loaded embedding is shown in [Fig. 3].
Figure 2: Web Graph of the WebNet Site.

Figure
3: Web Graph of the WebNet Site (spring loaded embedding).
Web graphs can be displayed as a hierarchical list of documents.
If one of the Web Nodes has children, this document is displayed as a directory
of the list. We used the ListTree Widget of Robert W. McMullen. [Fig. 4
] shows the WebNet web site as a hierarchical list of documents.
Figure 4: Web Graph of the WebNet Site shown
as a hierarchical list of documents.
5. Clustering Tool
The Clustering Tool of the WWWPal system helps partition
the graph using heuristic algorithms. There are two major reasons as to
why the system performs clustering algorithms. The first reason is
in visualization of large graphs. Clustering [Broder 97] [Storey 96]
nodes reduce the size of the graph. The second reason is to get a site
map of a given Web site. We have implemented three algorithms and the user
can apply any one of them on a Web graph. All three algorithms
are recursive and they follow this general algorithm.
URL selectandgroup(URL urlbase,Graph g)
{
Graph g1,g2;
URL url1,url2;
g1 = select(g);
if (selection conditions are met)
{
save g in a new url;
return new url based on urlbase;
}
else
{
g2 = group(g,g1);
url1=selectandgroup(urlbase, g1);
update g2 with url1;
url2=selectandgroup(urlbase , g2);
return url2;
}
}
The essence of this algorithm is to first select a subset
of nodes (g1) so that the selected nodes are more related than the rest
of the nodes. From this, two graphs are generated: a browse graph (g1)
and a group graph (g2). The new vertex group in the group graph (g2) is
updated with the location of the browse graph (url1). The parameters of
the algorithm are the minimum number of nodes in the graph (MIN_NODES)
and the minimum number of the selected nodes (MIN_SEL_NODES). If these
parameters are met, a new selection of nodes is made and the process is
repeated recursively.
All three heuristics are different in the selection of
the nodes. For all the three heuristics a BFS algorithm first classifies
the edges of a Web graph. Once the clustering is done, the clustered nodes
are represented as pentagons. There is an edge between any two clustered
nodes, if there is an edge between any node in one cluster to any node
in the second cluster.
The first clustering algorithm counts the number of nodes
of the subtree of the immediate children of the root node. The subtree
that has the larger number of nodes and is bigger than the parameter MIN_SEL_NODES
is selected. If the number of nodes in the current graph are smaller
the the parameter MIN_NODES, no selection is made. The first clustering
algorithm has been used to get site maps for medium-size companies.
The second clustering algorithm chooses a group of nodes
with the same URL template. For example, all web nodes that are under http://www.cs.rpi.edu/guide97/
form a tree. This helps us to visualize a cluster of nodes that are related
with the same pattern of URLs. The second clustering algorithm has been
used for an average user.
The third clustering algorithm is designed for a large
Web site. First the selection is made by the user using the second algorithm
(For example the URL template http://www.cs.rpi.edu/~moorthy/* describes
all Web nodes of the user moorthy). When no selection can be made,
the first algorithm is applied to partition the graph in smaller graphs.
This algorithm can be used to visualize the Web graph as a set of Web graphs
of the users' Web sites. The third clustering has been used for a Web site
consisting of several users. It has been effectively used to obtain site
maps of university web sites.
These clustering algorithms generate a new graph that
contains an overview relation between all generated graphs. The shapes
of the nodes of this graph are hexagons and the identifiers are the URLs
where the files of the graphs are saved. This graph is always a tree and
helps the user visualize the whole structure of large Web site. [Fig. 5]
shows the graph overview of the Computer Science Department of Rensselaer
Polytechnic Institute Web site. [Fig. 6] shows the subgraph of the nodes
that are clustered in the root node of [Fig.5]. It is also possible to
get a HTML site map from the clustering algorithms as described in the
next section.
Figure 5: Web Graph of the Web Site of the
Computer Science Dept. RPI after clustering.
Figure 6: First Node of the Web Graph of the
Web Site of the Dept. of Comp. Sci. RPI.
6. Graph Analyzer
Graph Analyzer has a number of modules (filters) to analyze
the constructed Web graph. These filters obtain specific data from the
Web graph. They also report groups of nodes that are related under any
specific property.
These filters are graph algorithms with the following
caveats that the graphs are very large (so the storage space for these
algorithm is limited) and there are a number of attributes associated with
nodes of the Web graph. In this section we describe a few filters that
we have found useful.
-
Pages of a specific user. This is accomplished by traversing
the Web graph from the Web node root of the user and selecting all those
nodes whose URL's have the user name. We also select one more level beyond
the last level.
-
Broken links. These are the nodes whose URL's point to a
nonexistent document. These are obtained from the Web graph and they are
triangles in shape. For example, the only broken link of AACE Web Site
on 02/17/98 is:
-
1998 Author Guidelines
http://www.aace.org/conf/edmedia/guidelines.htm
1.http://www.aace.org/home.html
-
Web books - These are filtered following the HTML tag LINK
or groups of nodes that are related under a specific property : URL template
or graph cluster. For example once we apply a latex2html to a document,
we get a collection of HTML documents and we lose the structure of the
document. But this structure can be regained by following the URL template
property.
-
Linearization of a Web graph. This can be used for printing
the Web document. This specifies a particular traversal of a graph. When
we print the whole book, we assume that individual pages are printed properly.
-
Site Maps [Bevirt 96]. These are reported using Web graphs
or HTML files as lists or tables. This site map is obtained from the clustering
algorithm described in the last section. The site map is obtained by meaningfully
expanding the clusters at various levels. The site map of AACE (Association
for Advancement of Computing in Education) web site in tabular fashion
and in list fashion are given in [Fig. 7] and [Fig. 8]. If we have a different
format for a site map, then that could also be generated fairly easily.
-
Popular paths of a Web site. This is achieved using log files
of the Web server. These are obtained as attributes of nodes in Web
graph.
-
Report of problems of a Web site such as HTML pages that
are huge or dead ends (no outgoing hyper links). It is often desirable
to have a pointer back to the previous or home page in a given HTML document.
The dead end nodes are those Web nodes which do not have any edge going
out of them in the Web graph.
Figure 7: Site map of AACE Web Site (Table
view).
Figure 8: Site map of AACE Web Site (List
view).
7. ASHE (xhtml)
ASHE (A Simple HTML Editor) was presented in the Second WWW
Conference at Chicago in October 94 [Punin & Krishnamoorthy 94].
Our first ASHE release was in January of 1995 and since then we have had
2 more releases.
The main features of ASHE are:
-
WYSYWYG HTML Browser Area.
-
Menus to help the user edit HTML tags
-
Emacs Key Binding.
-
Hypertext Help.
-
Export HTML Document as plain text or postscript files.
-
Multiple Frames.
-
Multiple Fonts.
-
Capable of displaying Forms.
-
Capable of displaying GIF and Bitmaps Images.
-
Capable of displaying simple tables.
-
Local Navigation.
-
Connection with Mosaic and Netscape.
ASHE is used in the synthesis of documents for the WWWPal
system. It is specifically useful when we generate skeletal documents by
constructing a graph. As an example, for the graph shown in [Fig. 9], the
HTML document is shown in [Fig. 10].
Figure 9: Graph of a Skeletal Document.
Figure 10: HTML Document of a Skeletal Document
8. Applications
WWWPal has a number of applications some of which have been
mentioned earlier. In this section, we mention two other applications of
the system: 1) Graph Browser and 2) HTML Synthesizer.
8.1 Graph Browser
In our graph visualization module, it is possible not only
to edit, create, display and modify graphs (Web graphs), but also to open
a URL. When a URL is specified, a node gets created and displayed. Furthermore
all the links that are connected to that URL are displayed as edges originating
from that Web node. Now interactively, we can explore any node in the created
graph (Web graph). When that node gets explored, all the links connected
to that URL page get created as edges emanating from that node. If a Web
node has already been visited, it is no longer expanded. If one wants to
look at the content of any Web node, then by using either ASHE or Netscape
interface one can see the actual content of the URL. Thus, we have created
a browser, which may be thought of as an overview browser. The advantage
of this kind of browser is that it does not load the URL document itself,
but only gets the information such as the type, size, link and the protocol.
[Fig. 11] shows a sample browser of one of the authors' home page.
Figure 11: Web Graph generated after navigating
in a Web Site
8.2 HTML Synthesizer
WWWPal also helps us create an outline of a HTML document.
Using the graph Visualizer module, a user starts by creating Web nodes
and edges joining these Web nodes. The user also specifies the URL attributes
of the Web nodes and the order of traversal. WWWPal provides an option
of creating a URL document for the created Web Graph. The created URL document
consists of only links and does not contain any text or images. The contents
(textual, image and audio) can be added using ASHE, which has been interfaced
with our system. [Fig. 9] shows a sample Web graph and [Fig. 10]
shows the created Web document.
9. Conclusion
This paper describes the WWWPal system that we designed and
implemented. Our system is useful for both analysis and synthesis of Web
Documents. WWWPal also helps in the organization of Web sites. We have
provided a number of examples to illustrate the uses of our system. Our
future work involves adding more modules to Graph Analyzer and to obtain
many different types of Site maps. Our immediate future work is to make
our system freely available to all interested Web servers and clients.
10. References
[Bevirt 96] B. Bevirt (1996), Designing Hypertext Navigation
Tools, Proceedings of WebNet 96 Conference, San Francisco, USA.
[Broder 97] A. Broder, S. Glassman, M. Manasse, G. Zweig
(1997), Syntactic Clustering of the Web, Proceedings of WWW6 Conference,
Santa Clara, USA.
[Carriere & Kazman 95] J. Carriere and R. Kazman
(1995), WebQuery, Interacting with Huge Hierarchies: Beyond Cone Trees,
Proceedings of IEEE Information Visualization, Atlanta, USA, 74-81.
[Carriere & Kazman 97] J. Carriere and R. Kazman
(1997), WebQuery, Searching and Visualizing the Web through Connectivity,
Proceedings of WWW 6 Conference, Santa Clara, USA, 701-711.
[Chen & Koutsofios 97] Y. Chen, E. Koutsofios (1997),
WebCiao: A Website Visualization and Tracking System, Proceedings of
WebNet 97 Conference, Toronto, Canada.
[DiBattista et al. 93] G. DiBattista, P. Eades,
R. Tamassia and I. Tollis (1993), Algorithms for Drawing Graphs: An Annotated
Bibliography, Technical Report, Computer Science Department, Brown
University.
[Isakowit et al. 95] T. Isakowitz, E. A. Stohr and P.
Balasubramanian (1995), RMM: A Methodology for Structured Hypermedia Design,
Comm. ACM, 38, 34-44.
[Maarek & Shaul 97] Y. Maarek and I. Shaul (1997),
WebCutter: A System for Dynamic and Tailorable Site Mapping, Proceedings
of WWW 6 Conference, Santa Clara, USA, 713-722.
[Mukherjea & Foley 95] S. Mukherjea and J. Foley (1995).
Visualizing the World-Wide Web with the Navigational View Builder, Proceedings
of WWW 3 Conference, Darmstadt, Germany.
[Ngu & Wu 97] D. Ngu and X. Wu (1997), SiteHelper:
A Localized Agent that Helps Incremental Exploration of the World Wide
Web, Proceedings of WWW 6 Conference, Santa Clara, USA, 691-699.
[Piroli et al. 96] P. Piroli, J. Pitkow and R. Rao (1996),
Silk from a Sow's Ear: Extracting Usable Structures from the Web, Proceedings
of the CHI 1996, Vancouver, Canada.
[Punin & Krishnamoorthy 94] J. Punin and M. Krishnamoorthy
(1994), Environment for Preparing HTML Documents, Proceedings of WWW2
Conference, Chicago, USA, 147-172.
[Sano 96] D. Sano (1996), Designing Large-Scale Web
Sites: A Visual Design Methodology, Wiley Computer Publishing.
[Storey 96] M. D. Storey, H Muller, K. Wong (1996), Manipulating
and Documenting Software Structures, Software Visualization.
[Takahashi & Liang 97] K. Takahashi and E. Liang (1997),
Analysis and Design of Web-based Information Systems, Proceedings of
WWW 6 Conference, Santa Clara, USA, 377-389.
[Walker 90] J. Walker II (1990), A Node-positioning
Algorithm for General Trees, Software-Practice and Experience, 20(7),
685-705.
11. URL References
Mapuccino
http://www.ibm.com/java/mapuccino/
WebQuery
http://www.cgl.uwaterloo.ca/Vanish/
Libwww
http://www.w3.org/Library/
World Wide Web Consortium
http://www.w3.org/
WebNet Site
http://www.aace.org/conf/webnet/
ASHE(xhtml)
http://www.cs.rpi.edu/~puninj/TALK/head.html
Netscape
http://www.netscape.com/