Keywords: XML, XGMML, LOGML, Web Usage Mining, Web Characterization, Web Graph, WWWPal System, Frequent Pattern Mining
XGMML (Extensible Graph Markup and Modeling Language) [21] is an XML 1.0 application [23] based on Graph Modeling Language (GML) [6] which is used for graph description. XGMML uses tags to describe nodes and edges of a graph. The purpose of XGMML is to make possible the exchange of graphs between different authoring and browsing tools for graphs. The conversion of graphs written in GML to XGMML is straight forward. Using XSL (Extensible Stylesheet Language) [26] with XGMML allows the translation of graphs to different formats. In Section 2, we present details of XGMML.
Log Markup Language (LOGML) [7] is an XML 1.0 application designed to describe log reports of web servers. Web data mining is one of the current hot topics in computer science. Mining data that has been collected from web server logfiles, is not only useful for studying customer choices, but also helps in organizing web pages. This is accomplished by knowing which web pages are most frequently accessed by the web surfers. In section 2 we explain how the structure of a web site can be represented as a web graph using XGMML. In mining the data from the log statistics, we use the web graph in annotating the log information. Further we give summary reports, comprising of information such as client sites, types of browsers and the usage time statistics. We also gather the client activity in a web site as a subgraph of the web site graph. This subgraph can be used to get better understanding of general user activity in the web site. In LOGML, we create a new XML vocabulary to structurally express the contents of the logfile information. In section 3, we present details of LOGML. Section 4 describes LOGML generator as an additional module for the WWWPal system [13].
Web data mining has been gaining a lot of attention because of its potential commercial benefits. For example, consider a web log database at a popular site, where an object is a web user and an attribute is a web page. The mined patterns could be the sets or sequences of most frequently accessed pages at that site. This kind of information can be used to restructure the web-site, or to dynamically insert relevant links in web pages based on user access patterns. Furthermore, click-stream mining can help E-commerce vendors to target potential online customers in a more effective way, at the same time enabling personalized service to the customers.
Web Mining is an umbrella term that refers to mainly two distinct tasks. One is Web Content Mining [3], which deals with problems of automatic information filtering and categorization, intelligent search agents, and personalize web agents. Web Usage Mining [3] on the other hand relies on the structure of the site, and concerns itself with discovering interesting information from user navigational behavior as stored in web access logs.
The focus of this paper is primarily on using web usage mining. While extracting simple information from web logs is easy, mining complex structural information is very challenging. Data cleaning and preparation constitute a very significant effort before mining can even be applied. The relevant data challenges include: elimination of irrelevant information such as image files and cgi scripts, user identification, user session formation, incorporating temporal windows in the user modeling, and so on. After all this pre-processing one is ready to mine the resulting database.
The proposed LOGML and XGMML languages have been designed to facilitate this web mining process in addition to storing additional detailed summary information extracted from web logs. Using the LOGML generate documents the pre-processing steps of mining are considerably simplified. We also propose a new mining paradigm, called Frequent Pattern Mining, to extract increasingly informative patterns from the LOGML database. Our approach and its application to real log databases is discussed further in Section 5.
Section 6 illustrates how LOGML can be applied for web characterization.
We provide an example to demonstrate the ease with which information
about a website can be generated using LOGML with style sheets (XSLT).
Additional information about web characterization can also be extracted
from the mined data.
Example 1
<?xml version="1.0"?> <!DOCTYPE graph PUBLIC "-//John Punin//DTD graph description//EN" "http://www.cs.rpi.edu/~puninj/XGMML/xgmml.dtd"> <graph directed="1" id="2"> <node id="1" label="Node 1"/> </graph>XGMML well formed documents can be part of other XML documents using namespaces [10]. The following example is a graph inside of an XHTML [22] document :
Example 2
<?xml version="1.0" encoding="UTF-8"?> <html xmlns="http://www.w3.org/1999/xhtml" xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance" xmlns:xgmml="http://www.cs.rpi.edu/XGMML" xsi:schemaLocation="http://www.w3.org/1999/Style/Transform http://www.w3.org/1999/Style/Transform/xslt.xsd http://www.w3.org/1999/xhtml http://www.w3.org/1999/xhtml/xhtml.xsd http://www.cs.rpi.edu/XGMML http://www.cs.rpi.edu/~puninj/XGMML/xgmml.xsd" xml:lang="en"> <head> <title>Graph Information</title> </head> <body> <!-- XHTML Document here --> <xgmml:graph directed="1" graphic="1" Layout="points"> <xgmml:node id="1" label="1" weight="0"> <xgmml:graphics type="circle" x="250" y="90" /> </xgmml:node> <xgmml:node id="2" label="2" weight="0"> <xgmml:graphics type="circle" x="190" y="150" /> </xgmml:node> <xgmml:edge source="1" target="2" weight="0" /> </xgmml:graph> <!-- XHTML Document here --> </body> </html>RDF (Resource Description Framework) [14] is one way to describe metadata about resources. XGMML includes metadata information for a graph, node and/or edge using the att tag. Example 3 is part of a graph describing a website. The nodes represent web pages and the edges represent hyperlinks. The metadata of the webpages is included as attribute of a node. RDF and DC (Dublin Core)[5] vocabularies have been used to describe the metadata of the nodes.
Example 3
<?xml version="1.0"?> <graph xmlns = "http://www.cs.rpi.edu/XGMML" xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance" xsi:schemaLocation="http://www.cs.rpi.edu/XGMML http://www.cs.rpi.edu/~puninj/XGMML/xgmml.xsd" directed="1" > <node id="3" label="http://www.cs.rpi.edu/courses/" weight="5427"> <att> <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.0/"> <rdf:Description about="http://www.cs.rpi.edu/courses/" dc:title="Courses at Rensselaer Computer Science Department" dc:subject="www@cs.rpi.edu; M.S. requirements; CSCI-1190 Beginning C Programming for Engineers; Courses; People; Graduate Program; CSCI-4020 Computer Algorithms; CSCI- 2220-01 Programming in Java; Research; Course Selection Guide; CSCI-4961-01, CSCI-6961-01 Advanced Robotics; Programming in Java; CSCI-2400 Models of Computation" dc:date="2000-01-31" dc:type="Text" > <dc:format> <rdf:Bag rdf:_1="text/html" rdf:_2="5427 bytes" /> </dc:format> </rdf:Description> </rdf:RDF> </att> </node> .... <edge source="1" target="3" weight="0" label="SRC IMG gfx/courses2.jpg" /> <edge source="7" target="3" weight="0" label="SRC IMG ../gfx/courses2.jpg" /> </graph>
Nodes and edges can reference XGMML documents. For example, a node may represent a graph that can be shown when the user points in the node. This behavior is similar to hyperlinks in HTML documents. XGMML uses XLink framework [23] to create hyperlinks either in nodes or edges. The XLink attributes: type, role, title, show, actuate and href, are added as attributes of the node and edge elements. All these attributes are taken directly from the XLink Working Draft.
The node element describes the properties of a node object. The node can be rendered as a graphic object and also can have additional meta information to be used for the application program. The only elements allowed inside the node are graphics and att. The graphic representation of the node is reported on the graphics element. For example, a graphical representation of a node can be a rectangle, a circle or a bitmap. The additional meta information is reported on the att element. For example, if a node is a representation of a web page, useful metadata is the title, date of creation and size of the web page.
The edge element describes the properties of an edge object. For each edge element at least two node elements have to be included in the graph element. An edge is between a source node and a target node. The application program must verify if the source node and target node are included in the XGMML document. The weight attribute is used to save the weight number for weighted graphs. The edge element as the node element can have a graphical representation and additional metadata information. The graphics element shows the graphical representation of an edge. For example, a graphical representation of an edge can be a line or an arc. An att element is used to attach additional meta information related to an edge. For example, if an edge is a representation of a hyperlink, useful metadata is the anchor string and the type of the hyperlink (Typed Links)[17].
An att element is used to hold meta information about the element that contains the att element. An att element can contain other att elements, for example to represent structured metadata such as records, lists, etc. For example, the metadata of an object person A is name: John, ssn: 123456789 and e-mail: john@rpi.edu. To attach this metadata to a node of a graph using the att element, the following lines must be included in the node element:
<att type="list" name="person_description"> <att name="name" value="John"/> <att name="ssn" value="123456789"/> <att name="e-mail" value="john@rpi.edu"/> </att>The graphics element defines the graphical representation a graph, a node or an edge. A graphics element must be included in a graph, node or edge element. Line, center and att elements are the only elements that can be contained in a graphics element. Line element is defined between two point elements and it is used to represent edges. A center element is a special point element to represent the central point of the graphical representation of a node. The att element permits to add information to the graphical representation. All these elements are inherited from GML[6].
LOGML files are large files so example 4 shows part of a LOGML file.
Example 4
<?xml version="1.0"?> <logml xmlns="http://www.cs.rpi.edu/LOGML" xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance" xsi:schemaLocation="http://www.cs.rpi.edu/LOGML http://www.cs.rpi.edu/~puninj/LOGML/logml.xsd" start_date="12/Oct/2000:05:00:05" end_date="12/Oct/2000:16:00:01"> <graph xmlns="http://www.cs.rpi.edu/XGMML" xmlns:lml="http://www.cs.rpi.edu/LOGML" xsi:schemaLocation="http://www.cs.rpi.edu/XGMML http://www.cs.rpi.edu/~puninj/XGMML/xgmml.xsd http://www.cs.rpi.edu/LOGML http://www.cs.rpi.edu/~puninj/LOGML/logml.xsd" directed="1"> <node id="234" label="http://www.cs.rpi.edu/~puninj/JAVA/projects/lfarrw.gif" lml:hits="1" weight="1"> <att name="title" value="No title"/> <att name="mime" value="image/gif"/> <att name="size" value="1291"/> <att name="date" value="Sun Jun 11 02:14:28 2000"/> <att name="code" value="200"/> </node> <node id="228" label="http://www.cs.rpi.edu/~puninj/XGMML/POSTER/IMG/pptgraph.gif" lml:hits="2" weight="2"> <att name="title" value="No title"/> <att name="mime" value="image/gif"/> <att name="size" value="27689"/> <att name="date" value="Wed Sep 22 14:17:15 1999"/> <att name="code" value="200"/> </node> .... <edge source="191" target="234" label="SRC IMG lfarrw.gif" lml:hits="1" weight="1"> <att value="image"/> </edge> <edge source="161" target="228" lml:hits="2" weight="2"> <att value="image"/> </edge> .... <edge source="550" target="561" lml:hits="1" weight="1" lml:indp="1"/> <edge source="550" target="562" lml:hits="1" weight="1" lml:indp="1"/> </graph> <hosts count="35"> <host name="vamos.inria.fr" access_count="43" bytes="487397" html_pages="43"/> <host name="kbl-ternzn1200.zeelandnet.nl" access_count="13" bytes="46354" html_pages="1"/> .... </hosts> <domains count="9"> <domain name="fr" access_count="44" bytes="488509" html_pages="44"/> <domain name="unknown" access_count="25" bytes="388608" html_pages="16"/> <domain name="com" access_count="21" bytes="229979" html_pages="19"/> .... </domains> <directories count="30"> <directory name="http://www.cs.rpi.edu/~puninj/XGMML" access_count="21" total_count="49" bytes="1116521"/> <directory name="http://www.cs.rpi.edu/~puninj/TALK" access_count="19" total_count="22" bytes="91460"/> .... </directories> <userAgents count="23"> <userAgent name="xyro_(xcrawler@cosmos.inria.fr)" access_count="43" bytes="487397" html_pages="43"/> <userAgent name="Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)" access_count="27" bytes="670815" html_pages="9"/> .... </userAgents> <hostReferers count="14"> <hostReferer name="No Referer" access_count="66" bytes="945527"/> <hostReferer name="http://www.cs.rpi.edu" access_count="41" bytes="701097"/> <hostReferer name="http://home.xnet.com" access_count="1" bytes="1112"/> .... </hostReferers> <referers count="11"> <referer name="No referer" access_count="66" bytes="945527"/> <referer name="http://boss.cae.wisc.edu/hppd/hpux/Networking/WWW/xhtml-1.3/" access_count="1" bytes="35272" target="8"/> <referer name="http://informant.dartmouth.edu/" access_count="1" bytes="1112" target="2"/> .... </referers> <keywords count="10" search_count="9"> <keyword name="java" count="3"/> <keyword name="xhtml" count="2"/> .... </keywords> <summary requests="132" sessions="6" bytes="1796173" html_pages="56" nhtml_pages="17" inline_objects="10" hyperlink_html="7" hyperlink_nhtml="16" html_entry_pages="55" nhtml_entry_pages="4" unique_sites="35" unique_host_referers="8" unique_se_referers="6" unique_external_url_referers="7" unique_internal_url_referers="4" unique_user_agents="23" requests_hour="12.00" requests_day="288.03" kbytes_day="159.48" kbytes_hour="3827.46" searches="9" unique_keywords="10"> <httpCode code="200" name="200 - OK " count="118" bytes="1793393" html_pages="83"/> <httpCode code="301" name="301 - Moved Permanently" count="3" bytes="1058" html_pages="3"/> <httpCode code="304" name="304 - Not Modified" count="6" bytes="0" html_pages="5"/> <httpCode code="404" name="404 - Not Found" count="5" bytes="1722" html_pages="5"/> <httpMethod name="GET" count="131" bytes="1796173" html_pages="95"/> <httpMethod name="HEAD" count="1" bytes="0" html_pages="1"/> <httpCode name="HTTP/1.0" count="97" bytes="1399288" html_pages="83"/> <httpCode name="HTTP/1.1" count="35" bytes="396885" html_pages="13"/> <dateStat> <monthStat month="10" hits="132" bytes="1796173" html_requests="96"/> <dayStat day="12" hits="132" bytes="1796173" html_requests="96"/> <hourStat hour="5" hits="12" bytes="15622" html_requests="12"/> <hourStat hour="6" hits="15" bytes="103280" html_requests="14"/> <hourStat hour="7" hits="41" bytes="642786" html_requests="28"/> <hourStat hour="8" hits="16" bytes="105435" html_requests="9"/> <hourStat hour="10" hits="2" bytes="346" html_requests="2"/> <hourStat hour="11" hits="7" bytes="54889" html_requests="5"/> <hourStat hour="12" hits="22" bytes="505379" html_requests="14"/> <hourStat hour="13" hits="2" bytes="1444" html_requests="2"/> <hourStat hour="14" hits="12" bytes="364297" html_requests="7"/> <hourStat hour="15" hits="3" bytes="2695" html_requests="3"/> </dateStat> </summary> <userSessions count="2" max_edges="100" min_edges="2"> <userSession name="proxy.artech.com.uy" ureferer="No referer" entry_page="http://www.cs.rpi.edu/~puninj/XGMML/" start_time="12/Oct/2000:12:50:11" access_count="4"> <path count="3"> <uedge source="3" target="10" utime="12/Oct/2000:12:50:12"/> <uedge source="3" target="21" utime="12/Oct/2000:12:51:41"/> <uedge source="21" target="22" utime="12/Oct/2000:12:52:02"/> </path> </userSession> <userSession name="207.234.33.12" ureferer="http://search.excite.com/search.gw?search=XHTML" entry_page="http://www.cs.rpi.edu/~puninj/TALK/head.html" start_time="12/Oct/2000:14:05:10" access_count="3"> <path count="2"> <uedge source="2" target="7" utime="12/Oct/2000:14:05:24"/> <uedge source="2" target="8" utime="12/Oct/2000:14:06:14"/> </path> </userSession> </userSessions> </logml>
The global attributes are used by most of the LOGML elements and they are:
The elements of the second section are:
Example 5
<userSession name="proxy.artech.com.uy" ureferer="No referer" entry_page="http://www.cs.rpi.edu/~puninj/XGMML/" start_time="12/Oct/2000:12:50:11" access_count="4"> <path count="3"> <uedge source="3" target="10" utime="12/Oct/2000:12:50:12"/> <uedge source="3" target="21" utime="12/Oct/2000:12:51:41"/> <uedge source="21" target="22" utime="12/Oct/2000:12:52:02"/> </path> </userSession>
The information that we extract from the common logs is: host name or IP, date of the request, relative URI of the requested page, HTTP version, HTTP status code, HTTP method and bytes transferred to the web client. The extended log files contain additionally the absolute URI of the referer web page and a string that describes the User Agent (web browser or web crawler) that has made the request. These information is saved in a data structure to generate the LOGML Document. The LOGML Generator also can write HTML reports making this module a powerful tool for web administrators.
Several algorithms have been developed to find the user sessions in the log files [9,11,20]. A simple algorithm uses the IP or host name of the web client to identify a user. SpeedTracer [20] also checks the User Agent and date of the request to find the user session. Straight ways to find user session requires "cookies" or remote user identification [9]. LOGML Generator algorithm to find user sessions is very similar to the algorithm used by SpeedTracer. An user is identified by IP and User Agent. We only consider a User session if the user has traversed at least one hyperlink and a user session is considered finished when there are not more requests made by the user in a given period of time. The LOGML Generator also identifies spiders either by the User Agent name or by the high number of the requests in a brief period of time. Spiders are not considered as user sessions. User sessions are being reported as a set of hyperlinks between HTML pages. We can expand the user sessions inline objects using the log graph of the first section of the LOGML document. The hyperlinks are reported with the timestamp for Web Data Mining purposes.
We used the Graph Visualizer of WWWPal System to display the log graph of the LOGML document or any of the user sessions that has been identified in the log files. Figure 6 shows part of the log graph of the Rensselaer Info web site (http://www.rpi.edu/rpinfo/). The numbers on the edges are the times that a user has traversed that edge (hyperlink).
Figure 6: Log graph of RPI Info Website
Figure 7 shows the log graph of the Rensselaer News website (http://www.rpi.edu/web/News/).
The number in the nodes are the times that a user has requested that node
(web page). For visualization purposes just the main nodes of the log graph
have been highlighted and the title of the web page displayed.

The same underlying LOGML document that stores the web graph, as well as the user sessions, which are subgraphs of the web graph, can be used to extract increasingly complex and more informative patterns. Given a LOGML document extracted from the database of web access logs at a popular site, one can perform several mining tasks. The simplest is to ignore all link information from the user sessions, and to mine only the frequent sets of pages accessed by users. The next step can be to form for each user the sequence of links they followed, and to mine the most frequent user access paths. It is also possible to look at only the forward accesses of a user, and to mine the most frequently accessed subtrees at that site. Generalizing even further, a website can be modeled as a directed graph, since in addition to the forward hyperlinks, it can have back references, creating cycles. Given a database of user accesses (with full information about their traversal, including forward and backward links) one can discover the frequently occurring subgraphs.
In the rest of this section, we first formulate the FPM problem. We show how LOGML facilitates the creation of a database suitable for web mining. Using real examples we show our experimental results on one day of RPI logs, and we describe several increasingly complex mining tasks that can be performed on the same LOGML database.
A structural rule is an expression
,
where X and Y are structures. The support of the rule
in the database of structures is the joint probability of X and
Y,
and the confidence is the conditional probability that a structure
contains Y, given that it contains X. A rule is strong
if its confidence is more than a user-specified minimum confidence (min_conf).
The frequent pattern mining task is to generate all structural rules
in the database, which have a support greater than min_sup and have
confidence greater than min_conf. This task can be broken into two
main steps: 1) Find all frequent structures having minimum support and
other constraints. This step is the most computationally and I/O intensive
step, since the search space for enumeration of all frequent substructures
is exponential in the worst case. The minimum support criterion is very
successful in reducing the search space. In addition other constraints
can be induced, such as finding maximal, closed or correlated substructures.
2) Generate all strong structural rules having minimum confidence.
Rule generation is also exponential in the size of the longest substructure.
However, this time we don't have to access the database; we only need the
set of frequent structures.
1 http://www.cs.rpi.edu/ 4 http://www.cs.rpi.edu/guide/machines/ 6 http://www.cs.rpi.edu/courses/ 8 http://www.cs.rpi.edu/current-events/ 10 http://www.cs.rpi.edu/grad/ 12 http://www.cs.rpi.edu/People/ 14 http://www.cs.rpi.edu/research/ 16 http://www.cs.rpi.edu/undergrad/ 31 http://www.cs.rpi.edu/guide/ ...For enabling web mining we make use of the third section of the LOGML document that stores the user sessions organized as subgraphs of the web graph. We have complete history of the user clicks including the time a page is requested. Each user session has a session id (the IP or host name), a path count (the number of source and destination node pairs) and the time a link is traversed. We simply extract the relevant information depending on the mining task at hand. For example if our goal is to discover frequent sets of pages accessed, we ignore all link information and note down the unique source or destination nodes in a user session. For example, let the user session have the following information in the LOGML document:
<userSession name=''ppp0-69.ank2.isbank.net.tr'' ...> <path count=''6''> <uedge source=''5938'' target=''16470'' utime=''24/Oct/2000:07:53:46''/> <uedge source=''16470'' target=''24754'' utime=''24/Oct/2000:07:56:13''/> <uedge source=''16470'' target=''24755'' utime=''24/Oct/2000:07:56:36''/> <uedge source=''24755'' target=''47387'' utime=''24/Oct/2000:07:57:14''/> <uedge source=''24755'' target=''47397'' utime=''24/Oct/2000:07:57:28''/> <uedge source=''16470'' target=''24756'' utime=''24/Oct/2000:07:58:30''/>We can then extract the set of nodes accessed by this user:
#format: user name, number of nodes accessed, node list ppp0-69.ank2.isbank.net.tr 7 5938 16470 24754 24755 47387 47397 24756After extracting this information from all the user sessions we obtain a database that is ready to be used for frequent set mining, as we shall see below.
On the other hand if out task is to perform sequence mining, we look for the longest forward links, and generate a new sequence each time a back edge is traversed. Using a simple stack-based implementation all maximal forward node sequences can be found. For the example user session above we would get:
#format: user name, sequence id, node position, node accessed ppp0-69.ank2.isbank.net.tr 1 1 5938 ppp0-69.ank2.isbank.net.tr 1 2 16470 ppp0-69.ank2.isbank.net.tr 1 3 24754 ppp0-69.ank2.isbank.net.tr 2 1 5938 ppp0-69.ank2.isbank.net.tr 2 2 16470 ppp0-69.ank2.isbank.net.tr 2 3 24755 ppp0-69.ank2.isbank.net.tr 2 4 47387 ppp0-69.ank2.isbank.net.tr 3 1 5938 ppp0-69.ank2.isbank.net.tr 3 2 16470 ppp0-69.ank2.isbank.net.tr 3 3 24755 ppp0-69.ank2.isbank.net.tr 3 4 47397 ppp0-69.ank2.isbank.net.tr 4 1 5938 ppp0-69.ank2.isbank.net.tr 4 2 16470 ppp0-69.ank2.isbank.net.tr 4 3 24756For more complex mining task like tree or graph mining, once again the appropriate information can be directly read from the LOGML user sessions.
![]() |
For specific instances of the FPM paradigm in web mining, consider the
example in Figure 1, which pictorially depicts the original web graph of
a particular website. There are 7 pages, forming the set of primitive items
,
connected with hyperlinks. Now the LOGML document already stores in a systematic
manner the user sessions, each of them being a subgraph of the web graph.
The figure shows the pages visited by 6 users. We'll see below how this
user browsing information can be used for mining different kinds of increasingly
complex substructures, starting with the frequently accessed pages, to
the frequently traversed paths, to the frequent subtrees, etc.
|
|
Consider the example web logs database shown in Figure 2. For each user
(in Figure 1) we only record the pages accessed by them, ignoring the path
information. The mining task is to find all frequently accessed sets of
pages. Figure 2 shows all the frequent k-itemsets
that
are contained in at least three user transactions, i.e.,
.
ABC,
AF
and CF, are the maximal frequent itemsets.
We applied the Eclat association mining algorithm [28] to a real LOGML document from the RPI web site. There were 200 user sessions with an average of 56 distinct nodes in each session. It took us 0.03s to do the mining. An example frequent set found is shown below:
FREQUENCY = 22 , NODE IDS = 25854 5938 25649 25650 25310 16511 http://www.cs.rpi.edu/~sibel/poetry/poems/nazim_hikmet/turkce.html http://www.cs.rpi.edu/~sibel/poetry/sair_listesi.html http://www.cs.rpi.edu/~sibel/poetry/frames/nazim_hikmet_1.html http://www.cs.rpi.edu/~sibel/poetry/frames/nazim_hikmet_2.html http://www.cs.rpi.edu/~sibel/poetry/links.html http://www.cs.rpi.edu/~sibel/poetry/nazim_hikmet.html
The structure database
consists of a collection of sequences, and
denotes the subsequence relation. The mining goal is to discover all frequent
subsequences. For example, consider the sequence database shown in Figure
3, by storing all paths from the starting page to a leaf (note: there are
other ways of constructing user access paths; this is just one example).
With minimum support of 3 we find that
,
,
are
the maximal frequent sequences.
We applied the SPADE sequence mining algorithm [29] to a real LOGML document from the RPI web site. From the 200 user sessions, we obtain 8208 maximal forward sequences, with an average sequence size of 2. It took us 0.09s to do the mining. An example frequent sequence found is shown below:
FREQUENCY = 21 , NODE IDS = 37668 -> 5944 -> 25649 -> 31409 http://www.cs.rpi.edu/~sibel/poetry/ -> http://www.cs.rpi.edu/~sibel/poetry/translation.html -> http://www.cs.rpi.edu/~sibel/poetry/frames/nazim_hikmet_1.html -> http://www.cs.rpi.edu/~sibel/poetry/poems/nazim_hikmet/english.html
|
|
Given a database
of trees (i.e., a forest) on the vertex set
,
the frequent tree mining problem [30] is to find all subtrees that appear
in at least min_sup trees. For example, given the user access subtrees
shown in Figure 1, we mine the frequent subtrees shown in Figure 4. There
are two maximal frequent subtrees,
and
.
We already have an initial implementation of TreeMiner [30], an algorithm
for mining frequent trees. We will apply it to the LOGML database in near
future. .
There are many other generalizations that are possible. For example, we can generalize the tree mining problem to directed acyclic graphs, and more generally to directed and undirected graphs. Continuing the web mining example, a general website can be modeled as a directed graph, since in addition to the forward hyperlinks, it can have back references, creating cycles. Figure 5 shows an example web graph. Given a database of user accesses (with full information about their traversal, including forward and backward links) one can discover the frequently occurring subgraphs, such as the one shown.
Example 5
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:output method="text"/> <xsl:template match="/"> <xsl:apply-templates select="logml"/> </xsl:template> <xsl:template match="logml"> <xsl:apply-templates select="graph"/> </xsl:template> <xsl:template match="graph"> <xsl:apply-templates select="node"/> </xsl:template> <xsl:template match="node"> <xsl:value-of select="@id"/> <xsl:text> </xsl:text> <xsl:value-of select="@hits"/> <xsl:text> </xsl:text> <xsl:value-of select="@label"/> <xsl:text> </xsl:text> </xsl:template> </xsl:stylesheet>Once that the number of requests of the web pages are obtained from the LOGML file, this information can be plotted in a line graph as shown in Figure 8.
![]() |
Figure 8: Plot of the Top Request of the RPI Computer Science Web Site (http://www.cs.rpi.edu/)
Several web characterizations have been reported [12] such as: Requested file popularity, File sizes, Periodic nature of HTTP traffic, Site popularity, Rate of Broken Links, Session time outs. All of them can be extracted from the XGMML and LOGML files either by applying style sheets (XSLT) or by programs using DOM [4] or SAX XML parsers.
Future work includes obtaining user graph mining, as well as visualization of mined data using WWWPal system[13]. To perform web content mining, we need keyword information and content for each of the nodes. To obtain this information will involve analyzing each of the web pages and collecting relevant keywords. Work is under way to accomplish this task.
[2] R. Agrawal and R. Srikant.Mining sequential patterns. In 11th Intl. Conf. on Data Engg., 1995.
[3] R. Cooley, B. Mobasher, and J. Srivastava. Web Mining: Information and Pattern Discovery on the World Wide Web. In 8th IEEE Intl. Conf. on Tools with AI, 1997.
[4] W3C, Document Object Model (DOM), http://www.w3.org/DOM/
[5] Dublin Core Metadata for Resource Discovery, Internet RFC 2413. http://purl.oclc.org/dc/
[6] M. Himsolt. GML: A portable Graph File Format, Technical Report, Universität Passau, 94030 Passau, Germany, 1997.
[7] LOGML, Log Markup Language Specification, http://www.cs.rpi.edu/~puninj/LOGML/draft-logml.html
[8] H. Mannila, H. Toivonen, and I. Verkamo. Discovering frequent episodes in sequences. In 1st Intl. Conf. Knowledge Discovery and Data Mining, 1995.
[9] B. Mobasher et al., Web Mining: Pattern Discovery from World Wide Web Transactions, Technical Report 96-050, Department of Computer Science, University of Minnesota, Minneapolis (September 1996).
[10] Namespaces in XML, http://www.w3.org/TR/REC-xml-names.
[11] P. Pirolli, R. Rao, and J. Pitkow, Silk from a Sow's Ear: Extracting Usable Structures from the Web. In Proceedings of 1996 Conference on Human Factors in Computing Systems, pp. 118-125, 1996.
[12] J. Pitkow, Summary of WWW Characterizations, In Proc. Seventh International World Wide Web Conference, Brisbane, Australia, April 14-18, 1998
[13] J. Punin, M. Krishnamoorthy. WWWPal System - A System for Analysis and Synthesis of Web Pages. In Proceedings of the WebNet 98 Conference, Orlando, November, 1998.
[14] W3C, Resource Description Framework (RDF) Model and Syntax
Specification . http://www.w3.org/TR/REC-rdf-syntax
[15] R. Srikant and R. Agrawal. Mining generalized association
rules. In 21st VLDB Conf., 1995.
[16] W3C, Scalable Vector Graphics (SVG) 1.0 Specification, http://www.w3.org/TR/SVG/
[17] M. Thuring, J. Hannemann, J.M Haake. Designing for Comprehension: A Cognitive Approach to Hypermedia Development. Communications of the ACM, 38(8), 57-66, 1995.
[18] W3C, Web Characterization Terminology & Definitions Sheet, http://www.w3.org/1999/05/WCA-terms/
[19] webbot and Libwww ,W3C Sample Code Library, http://www.w3.org/Library/
[20] Wu K.L, Yu P.S., and Ballman A. Speed Tracer: A Web usage mining and analysis tool. In Internet Computing Vol 37. No 1 p89, 1997.
[21] XGMML, Extensible Graph Markup and Modeling Language Specification, http://www.cs.rpi.edu/~puninj/XGMML/draft-xgmml.html
[22] W3C, XHTML[tm] 1.0: The Extensible HyperText Markup Language, http://www.w3.org/TR/xhtml1
[23] W3C, XML Linking Language (XLink), http://www.w3.org/TR/xlink
[24] W3C, XML Schema Part 1: Structures, http://www.w3.org/TR/xmlschema-1
[25] W3C, XML Specification, http://www.w3.org/TR/REC-xml
[26] W3C, XSL, Extensible Stylesheet Language Specification, http://www.w3.org/TR/xsl/
[27] M. J. Zaki. Efficient enumeration of frequent sequences. In 7th Intl. Conf. on Information and Knowledge Management, November 1998.
[28] M. J. Zaki. Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(3):372-390, May-June 2000.
[29] M. J. Zaki. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning Journal, 42(1/2), Jan/Feb 2001. Special issue on Unsupervised Learning.
[30] M. J. Zaki, T. Bingol, and R. Kulkarni. TREEMINER: Mining trees in a forest. Technical Report 00-x, Computer Science Dept., Rensselaer Polytechnic Institute, in preparation 2000.