* Research

Ph.D. Theses

Mining Interesting Subgraphs by Output Space Sampling

By Mohammad Al Hasan
Advisor: Mohammed J. Zaki
July 23, 2009

Lack of scalability and information overload are two prevailing problems in graph mining that hinder the applicability of graph mining in real-life knowledge mining tasks. The earlier arises due to the NP-Hardness of the subgraph isomorphism test and the latter results from the enormous size of the frequent subgraph space. To cope with the above problems, several data mining researchers propose to mine a succinct set of interesting subgraph patterns instead of the entire frequent pattern-set. But, the majority of these methods follow a two-step solution that enumerate the entire frequent subgraphs and then obtain a small set of interesting subgraphs by filtering. They fail for large datasets for which the complete enumeration of the frequent subgraphs is infeasible.

In this thesis, we propose output space sampling (OSS) of frequent subgraphs, which elicits a new avenue in subgraph pattern mining by adopting randomization in candidate subgraph space exploration. Through Markov Chain Monte Carlo (MCMC) sampling, OSS solves both the above problems by avoiding a complete enumeration of the output space. It further guarantees that any sample it returns is taken from a distribution that is proportional to the interestingness score of that pattern. The process is generic as it can adapt to any interestingness function based on the specific knowledge discovery application at hand. Moreover, OSS works identically for other subclass of patterns like trees, sequences, itemsets, etc. This thesis also presents ORIGAMI and MUSK, two algorithms for representative subgraph mining.

* Return to main PhD Theses page