Ph.D. Theses

Exploring Similarities Across High-Dimensional Datasets

By Karlton Sequeira
Advisor: Mohammed Zaki
August 2, 2005

Data are often collected by several independent sources. These sources may be unable to share their entire datasets because of confidentiality agreements or sheer dataset size, among other reasons. However, they may be willing to share a condensed representation of their datasets. If some subset of these condensed representations, drawn from different sources, is found to be unusually similar, policies successfully applied to one dataset may be considered for application to the others.

In this dissertation, we tackle the problem of finding similarities across high-dimensional datasets. We propose a framework in which condensed representations of the datasets obfuscate details and limit noise. We provide algorithms to find interesting regions within datasets; these regions become the components of the condensed representations. We propose similarity measures for these components based on their structure. We then use a graph-matching-based formulation to find structurally similar components across the condensed representations of the datasets.
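The pipeline described above (condense, compare components by structure, match across datasets) can be illustrated with a toy sketch. Nothing here is taken from the dissertation: the grid-based density threshold, the translation-invariant shape signature, the Jaccard similarity, the greedy one-to-one matching, and all function names (`dense_cells`, `shape_signature`, `jaccard`, `match_components`) are assumptions chosen only to make the idea concrete.

```python
from collections import Counter


def dense_cells(points, bins=4, min_count=3):
    """Hypothetical 'interesting region' finder: grid cells holding at
    least min_count points. Assumes every coordinate lies in [0, 1]."""
    counts = Counter(
        tuple(min(int(x * bins), bins - 1) for x in p) for p in points
    )
    return {cell for cell, n in counts.items() if n >= min_count}


def shape_signature(cells):
    """Shift a set of cells so its minimum index on every axis is 0.
    The result describes the component's shape, not its position."""
    if not cells:
        return frozenset()
    dims = len(next(iter(cells)))
    mins = [min(c[d] for c in cells) for d in range(dims)]
    return frozenset(
        tuple(c[d] - mins[d] for d in range(dims)) for c in cells
    )


def jaccard(a, b):
    """Assumed structural similarity: overlap of two shape signatures."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def match_components(comps_a, comps_b, threshold=0.5):
    """Greedy stand-in for graph matching: repeatedly pair the most
    similar unmatched components whose similarity reaches threshold."""
    scored = sorted(
        (
            (jaccard(a, b), i, j)
            for i, a in enumerate(comps_a)
            for j, b in enumerate(comps_b)
        ),
        reverse=True,
    )
    used_a, used_b, matches = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        matches.append((i, j, score))
    return matches


# Two sources share only shape signatures, never raw points.
source_a = [shape_signature({(0, 0), (0, 1), (1, 1)}), shape_signature({(3, 3)})]
source_b = [shape_signature({(2, 2)}), shape_signature({(2, 1), (2, 2), (3, 2)})]
matches = match_components(source_a, source_b)
```

In this sketch each source would publish only its list of shape signatures, which hides where the dense regions sit in the original attribute space. A real graph-matching formulation would optimize the pairing globally rather than greedily.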

We test our algorithms on a wide array of synthetic datasets and conduct experiments on real datasets from a number of domains. We find that structure-based similarity amplifies weaker patterns. It also allows the discovery of structural patterns across related datasets with differing schemas.
