Ph.D. Theses

Validity Guided Robust Fuzzy Clustering Methods for Characterizing Clusters in Noisy Data

By Mary Anne L. Egan
Advisor: Mukkai S. Krishnamoorthy
December 3, 1997

This thesis extends the robustness of fuzzy clustering and genetic algorithms in uncovering structure in noisy data. It presents three general-purpose clustering techniques applicable in the presence of noise. The objective of cluster analysis is to group observed data in such a way that the entities within a cluster are more similar to each other than to those in other clusters. It is one of the most widely used procedures for exploratory data analysis in many disciplines, including crystallographic statistics, economic trend analysis, image processing and data mining. These problems consist of a set of n points in R^d and require the accurate description of cluster center, shape and size despite the high percentage of noise points. The work presented in this dissertation extends traditional clustering techniques through the combination of noise models and validity measures.

This thesis examines several instances of clustering problems and solutions. The first clustering instance is the characterization of diffuse clusters embedded in data sets with noise levels typically greater than 50%. The algorithm developed for this instance accurately locates clusters in both dense and sparse data sets and remains sensitive even when a cluster contains only a few points. A statistical test based on Ripley's K function is used to test for complete spatial randomness, that is, the absence of any clustering. If clustering is present, a robust noise-clustering algorithm is run for increasing numbers of clusters. For each cluster assignment, a validity measure based on the combination of fuzzy hypervolume, partition density and average fuzzy density is used to identify the optimum number of clusters. This is an objective technique that requires no prior knowledge of the number or the size of the clusters in the data.
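
As a rough illustration of the randomness test mentioned above, the following sketch computes a naive estimate of Ripley's K function for planar points and compares it with the value expected under complete spatial randomness, K(r) = pi * r^2. It is not the thesis's implementation: the function names ripley_k and csr_k are hypothetical, no edge correction is applied, and the decision rule used in the thesis is not stated in this abstract.

```python
import math


def ripley_k(points, r, area):
    """Naive estimate of Ripley's K at radius r for points in a region of the given area.

    Assumption: no edge correction, so the estimate is biased low near the
    region boundary. Uses the common estimator
        K(r) = area / (n * (n - 1)) * (number of ordered pairs i != j with d_ij <= r).
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    close_pairs = 0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(points[i], points[j]) <= r:
                close_pairs += 1
    return area * close_pairs / (n * (n - 1))


def csr_k(r):
    """Theoretical K(r) for a homogeneous Poisson process in the plane."""
    return math.pi * r * r


# Four points packed tightly inside the unit square (illustrative data only).
tight = [(0.50, 0.50), (0.51, 0.50), (0.50, 0.51), (0.51, 0.51)]
k_hat = ripley_k(tight, r=0.05, area=1.0)
print(k_hat, csr_k(0.05))  # the estimate far exceeds the CSR value
```

An estimate well above pi * r^2 suggests clustering at scale r, while values close to it are consistent with spatial randomness. A practical test would also need edge correction and a significance envelope, for example from simulated random patterns.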

The second clustering instance is an elliptical shell characterization problem. Characterizing ellipses is difficult due to their nonlinear definition and the background noise associated with the images. An ellipsoidal shell cluster prototype is obtained by using a norm induced by an adaptive symmetric positive definite matrix as a distance metric in the objective function. A validity measure based on the combination of fuzzy circumference, average fuzzy density and shell thickness is used to identify the optimal number of clusters.
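
To make the distance metric concrete, the sketch below evaluates the norm induced by a symmetric positive definite matrix A, sqrt((x - v)^T A (x - v)), and a shell distance of the form (norm - 1)^2, which is zero for points lying exactly on the ellipse defined by center v and matrix A. This form is borrowed from adaptive fuzzy c-shells style formulations and is an assumption here. The abstract does not give the exact objective function, and the names induced_norm and shell_distance_sq are hypothetical.

```python
import math


def induced_norm(x, v, A):
    """Return sqrt((x - v)^T A (x - v)) for a symmetric positive definite matrix A.

    x and v are sequences of equal length d; A is a d-by-d nested list.
    """
    diff = [xi - vi for xi, vi in zip(x, v)]
    d = len(diff)
    quad = sum(diff[i] * A[i][j] * diff[j] for i in range(d) for j in range(d))
    return math.sqrt(quad)


def shell_distance_sq(x, v, A):
    """Squared distance of x from the shell {y : (y - v)^T A (y - v) = 1} (assumed form)."""
    return (induced_norm(x, v, A) - 1.0) ** 2


# Ellipse centered at the origin with semi-axes 2 (along x) and 1 (along y):
# A = diag(1 / 2^2, 1 / 1^2).
center = (0.0, 0.0)
A = [[0.25, 0.0], [0.0, 1.0]]
print(shell_distance_sq((2.0, 0.0), center, A))  # on the ellipse: 0.0
print(shell_distance_sq((0.0, 0.0), center, A))  # at the center: 1.0
```

In an adaptive scheme, A and v would be updated along with the fuzzy memberships so that the shell fits the data. That update step is omitted here.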

This thesis also investigates the utility of genetic algorithms in locating clusters in noisy data. A genetic algorithm with nonoverlapping populations is presented, incorporating a fitness function similar to the objective function of the robust fuzzy clustering method. Empirical results compare its efficiency and accuracy with those of the robust fuzzy clustering method.
