Ph.D. Theses

Data Clustering with Distance Thresholds

By Scott D. Epter
Advisor: Mukkai S. Krishnamoorthy
December 3, 1999

There exists much current work in pattern-discovery in large, numeric (e.g. market basket) databases. One common form treats items as binary values and does not consider item quantity as significant. Somewhat orthogonally to this work, a large body of research has been conducted in the field of data clustering, where dense groups of points are discovered in large data sets. Two common applications of such clustering are the discovery of patterns in spatial data, where the clusters are detected as they would be seen by the human eye, and pattern recognition, where clusters indicate such things as printable characters. One common clustering model works from a perspective of knowing a priori the number of clusters desired.

This thesis examines the functional dual of such methods, namely the use of tight distance thresholds to drive the clustering. Not only does this allow the natural number of clusters to be discovered automatically from the input, it is also shown to improve the accuracy and performance of clustering algorithms significantly.

The use of such methods for both numeric and spatial data is explored. In the case of spatial data, we present algorithms that view the data set from the perspective of its edges rather than its points. Such algorithms are fast and completely insensitive to the order of the input. For numeric data we present centroid-based algorithms characterized by highly-accurate results attained in only a single-pass over the data set. Several clustering-related tasks, including clustering tendency investigation, filtering of outliers and applications of clustering algorithms to specific problem domains are also investigated.

Return to main PhD Theses page