Expectation-Maximization Clustering (50 points)

Due Date: 7th December, before midnight

Implement the Expectation-Maximization (EM) algorithm for clustering (see Algorithm 13.3 in Chapter 13). Run the code on the datasets specified below. Use the last attribute as the class; it will not be used for clustering, but only for the purity-based clustering evaluation (see below). In your implementation, you should estimate the full covariance matrix for each cluster.

For EM initialization, you can start with a random assignment of points to clusters, or you can randomly select the initial means; after assigning each point to the closest mean, compute the initial means, covariances, and prior probabilities. For convergence testing, you can compare the sum of the Euclidean distances between the old means and the new means over the \(k\) clusters. If this distance is less than some small \(\epsilon\), you can stop the method.
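As a concrete illustration, here is a minimal sketch of the second initialization option and the convergence test described above (all function and variable names are illustrative; `X` is an \(n \times d\) data matrix):

```python
import numpy as np

def init_em(X, k, rng):
    """Pick k random points as initial means, assign each point to its
    closest mean, then compute the initial means, covariances, and prior
    probabilities from that hard assignment."""
    n, d = X.shape
    means = X[rng.choice(n, size=k, replace=False)].astype(float)
    dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    covs, priors = [], []
    for i in range(k):
        pts = X[labels == i]
        means[i] = pts.mean(axis=0)
        diff = pts - means[i]
        # MLE covariance, plus a tiny ridge so it stays invertible
        covs.append(diff.T @ diff / len(pts) + 1e-6 * np.eye(d))
        priors.append(len(pts) / n)
    return means, np.array(covs), np.array(priors)

def converged(old_means, new_means, eps):
    """Stop once the summed Euclidean distance between old and new means
    over the k clusters falls below eps."""
    return np.linalg.norm(new_means - old_means, axis=1).sum() < eps
```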

Your program output should consist of the following information:

  1. The final mean for each cluster
  2. The final covariance matrix for each cluster
  3. The number of iterations the EM algorithm took to converge
  4. Final cluster assignment of all the points, where each point will be assigned to the cluster that yields the highest probability \(P(C_i|\mathbf{x}_j)\)
  5. Final size of each cluster
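Items 4 and 5 amount to an argmax over posteriors followed by a count. A sketch, assuming the fitted `means`, `covs`, and `priors` (names illustrative):

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Multivariate normal density evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm

def hard_assign(X, means, covs, priors):
    """Assign each point to the cluster with the highest posterior
    P(C_i | x_j), and report the resulting cluster sizes."""
    # The denominator P(x_j) is common to all clusters, so the argmax of
    # the joint P(C_i) f(x_j | C_i) gives the same assignment.
    joint = np.stack([p * gaussian_pdf(X, m, c)
                      for m, c, p in zip(means, covs, priors)], axis=1)
    labels = joint.argmax(axis=1)
    sizes = np.bincount(labels, minlength=len(priors))
    return labels, sizes
```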

Finally, you must compute the purity score for your clustering, computed as follows: assume that \(C_i\) denotes the set of points assigned to cluster \(i\) by the EM algorithm, and let \(T_j\) denote the set of points whose true class, based on the last attribute, is \(j\). The purity score is defined as: $$Purity = \frac{1}{n} \sum_{i=1}^k \max_{j=1}^K |C_i \cap T_j| $$ where \(K\) is the true number of clusters, and \(k\) is the input number of clusters to find.
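The purity score above can be computed directly from the predicted and true labels, passed here as parallel sequences (a sketch; names are illustrative):

```python
def purity(pred, truth):
    """Purity = (1/n) * sum_i max_j |C_i intersect T_j|."""
    n = len(pred)
    total = 0
    for c in set(pred):
        # true labels of the points the algorithm placed in cluster c
        members = [t for p, t in zip(pred, truth) if p == c]
        # size of the largest overlap with any single true class
        total += max(members.count(t) for t in set(members))
    return total / n
```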

As a practical point, if you get an error when inverting the covariance matrix, consider adding a small \(\lambda\) value to each diagonal entry to make the matrix invertible. This can be considered a regularized estimate of the covariance matrix, i.e., $$\Sigma_i + \lambda \mathbf{I}$$
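One way to apply this fix only when needed is to fall back to the regularized estimate on failure (a sketch; the default \(\lambda\) here is an illustrative choice):

```python
import numpy as np

def safe_inv(cov, lam=1e-6):
    """Invert cov, falling back to the regularized estimate cov + lam*I
    when the plain inverse fails on a singular matrix."""
    try:
        return np.linalg.inv(cov)
    except np.linalg.LinAlgError:
        return np.linalg.inv(cov + lam * np.eye(cov.shape[0]))
```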

Run the code on the following datasets:

Since the initialization is random, run the code several times and report the best results as determined by the purity score.

Extra Credit: Density Based Clustering (50 points)

Implement the DENCLUE algorithm (Algorithm 15.2 in the book), using the Gaussian density kernel. The algorithm should take as input parameters the spread of the Gaussian kernel \(h\), the minimum density value \(\xi\), and the convergence threshold \(\epsilon\).
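With the Gaussian kernel, the density estimate and the hill-climbing step toward a density attractor can be sketched as follows. Note the weighted-mean update here is the mean-shift style step that gradient ascent reduces to for a Gaussian kernel; function names are illustrative:

```python
import numpy as np

def kde(x, X, h):
    """Gaussian-kernel density estimate at point x, with spread h."""
    n, d = X.shape
    sq = ((X - x) ** 2).sum(axis=1)
    return np.exp(-sq / (2 * h * h)).sum() / (n * (2 * np.pi * h * h) ** (d / 2))

def find_attractor(x, X, h, eps):
    """Hill-climb x toward its density attractor; stop once the step
    taken is smaller than the convergence threshold eps."""
    while True:
        w = np.exp(-((X - x) ** 2).sum(axis=1) / (2 * h * h))
        x_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < eps:
            return x_new
        x = x_new
```

Points whose attractor has density below \(\xi\) would then be treated as noise.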

An important implementation point comes after finding the density attractors, namely in step 11, where we try to find the maximal sets of "density reachable" attractors. For this step, it is best to simply merge into one cluster any two attractors that are within some small distance \(\delta\) of each other. You can make this an input parameter and see what works for each dataset. Also, if a final cluster has very few points, it can be discarded and those points treated as noise. However, the overall coverage should remain high, say above 95%, i.e., do not discard more than 5% of the points.
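The \(\delta\)-merging rule above can be implemented with a small union-find so the merge is transitive, as a simple stand-in for the density-reachability test in step 11 (a sketch; names are illustrative):

```python
import math

def merge_attractors(attractors, delta):
    """Give attractors within delta of each other (transitively) the same
    cluster id; returns a list of consecutive ids, one per attractor."""
    n = len(attractors)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(attractors[i], attractors[j]) <= delta:
                parent[find(i)] = find(j)

    # relabel the union-find roots as consecutive cluster ids
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]
```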

Run your code on the same three datasets above, but with appropriate values of \(h, \xi, \delta\), to try to get the correct number of clusters. Also, output the purity of your clustering.

What to turn in

  • Write a Python script, and submit a text file that contains the output of the script via Submitty.

Your script should read the filename from the command line as a parameter, and you should also read the \(k\) value (the number of clusters to find) and the \(\epsilon\) value from the command line, e.g., it will be run as " FILE k eps"
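The argument handling described above is a few lines of `sys.argv` parsing; a sketch (the helper name is illustrative, and the script name itself precedes these arguments on the command line):

```python
import sys

def parse_args(argv):
    """Parse the "FILE k eps" argument form: filename, number of
    clusters k, and convergence threshold eps."""
    filename = argv[0]
    k = int(argv[1])
    eps = float(argv[2])
    return filename, k, eps

if __name__ == "__main__" and len(sys.argv) == 4:
    filename, k, eps = parse_args(sys.argv[1:])
```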

  • If you do the extra credit, submit another file and its output. Your code will be run as

" FILE \(h\) \(\xi\) \(\delta\) eps"

Page last modified on November 30, 2018, at 08:34 AM