Clustering via Expectation-Maximization
Due Date: 2nd December, before midnight
Implement the Expectation-Maximization (EM) algorithm for clustering (see Algorithm 16.3 in Chapter 16). Run the code on the Attach:iris.txt dataset. Use the first four attributes for clustering, and use the labels only for the purity-based clustering evaluation (see below).
You can assume that the number of clusters to find is k=3. In your implementation, you should estimate the full covariance matrix for each cluster.
For EM initialization, use the first 50 points for cluster 1, the next 50 for cluster 2, and the last 50 for cluster 3. For convergence testing, use \epsilon=0.001.
Your program output should consist of the following information:
- The final mean for each cluster
- The final covariance matrix for each cluster
- Number of iterations the EM algorithm took to converge.
- Final cluster assignment of all the points, where each point will be assigned to the cluster that yields the highest probability P(C_i|\mathbf{x}_j)
- Final size of each cluster
Finally, you must compute the 'purity score' for your clustering, computed as follows: Assume that C_i denotes the set of points assigned to cluster i by the EM algorithm, and let T_i denote the points in class i based on the last attribute. Purity score is defined as:
That is, for each cluster C_i find the majority class via the intersection over the T_j's. The purity is the fraction of points in the clusters that are "pure", based on majority class assignment for each cluster.
What to turn in
- Write a python script called RCSID-Assign6.py, and submit a PDF file named RCSID-Assign6.pdf that contains the output in the format mentioned above.
- Submit the assignment as a zip or tar file via email to: dmcourse.cs@gmail.com. The subject of your email should be "RCSID-Assign6 Submission".
Solutions
1 0.688945569587
2 0.763807902029
3 0.829473766893
4 0.630824689135
5 0.446047886491
6 0.449186725419
7 0.403128914227
8 0.355218930626
9 0.330388273834
10 0.344876596236
11 0.368330802924
12 0.245729206066
13 0.0756555895108
14 0.052419988351
15 0.0452564106221
16 0.0308250626211
17 0.0175835051888
18 0.0118673261561
19 0.00757490162551
20 0.0047680778933
21 0.00300240092021
22 0.00189194564847
23 0.00119161380751
24 0.000750410088635 <-- number of iterations to converge
cluster: C0
mean [ 5.00624812 3.41854816 1.46408011 0.24397199]
cov [[ 0.12169829 0.09806288 0.01578305 0.01035496]
[ 0.09806288 0.14173294 0.0113637 0.01124421]
[ 0.01578305 0.0113637 0.02950527 0.00559116]
[ 0.01035496 0.01124421 0.00559116 0.01126782]]
size 50.0
cluster: C1
mean [ 6.24586024 2.97293536 5.08270195 1.90247301]
cov [[ 0.28646255 0.07063223 0.22505393 0.12231976]
[ 0.07063223 0.06611998 0.06822495 0.04667828]
[ 0.22505393 0.06822495 0.31724512 0.18971589]
[ 0.12231976 0.04667828 0.18971589 0.14787381]]
size 46.0
cluster: C2
mean [ 6.27718474 2.77135983 4.72853404 1.45013946]
cov [[ 0.58342265 0.17453364 0.6797531 0.21658121]
[ 0.17453364 0.13271296 0.17943592 0.06647733]
[ 0.6797531 0.17943592 0.97375385 0.30384379]
[ 0.21658121 0.06647733 0.30384379 0.1080279 ]]
size 54.0
confusion matrix <-- not required
C0 C1 C2
Iris-versicolor [ 0. 12. 38.]
Iris-setosa [ 50. 0. 0.]
Iris-virginica [ 0. 34. 16.]
purity 0.813333333333
The python code is as follows:
#!/usr/bin/env python
import sys
import random
import numpy as np
#helper routines
clmap = {"Iris-versicolor": 0, "Iris-setosa": 1,
"Iris-virginica": 2}
def class2int(c):
c =c.strip('"')
return clmap[c]
def dmvnorm(x, mu, cov):
part1 = np.power(2*np.pi, d/2)
part2 = np.power(np.linalg.det(cov),0.5)
dev = x-mu
icov = np.linalg.inv(cov)
prod = np.dot(np.dot(dev,icov), dev)
part3 = np.exp(-0.5*prod)
res = part3/(part1*part2)
return res
#Estep
def Estep(X, mu, cov, p):
w = np.zeros((n,k))
for j in range(n):
for i in range(k):
w[j,i] = dmvnorm(X[j,:], mu[i], cov[i])
w[j,:] = w[j,:]/sum(w[j,:]) #normalize
return w
#Mstep
def Mstep(X, w):
mu = []
cov = []
p = []
for i in range(k):
mui = np.sum(w[:,i]*X.T, axis=1)/np.sum(w[:,i])
mu.append(mui)
Zi = X - mui
covi = np.zeros((d,d))
for j in range(n):
zij = np.asmatrix(Zi[j,:]).T
covi += w[j,i]*np.dot(zij, zij.T)
covi = covi/np.sum(w[:,i])
cov.append(covi)
pi = np.sum(w[:,i])/n
p.append(pi)
return (mu, cov, p)
##########main#########
#read data
D = np.loadtxt("iris.txt",delimiter=",", converters={4:class2int})
(n, d) = np.shape(D) #get input dimensions
d = d-1
X = D[:,0:d]
Y = D[:,d]
k = 3
#initialize the weights
w = np.zeros((n,k))
w[0:50,0] = 1.0
w[50:100,1] = 1.0
w[100:150,2] = 1.0
(mu, cov, p) = Mstep(X,w)
#EM algorithm
eps = 0.001
notconverged = True
t = 0
while(notconverged):
oldmu = mu
w = Estep(X, mu, cov, p)
(mu, cov, p) = Mstep(X,w)
t = t+1
#check if convereged
diff = 0
for i in range(k):
diff += np.linalg.norm(mu[i] - oldmu[i])
print t, diff
notconverged = diff > eps
#assign points to clusters
w = Estep(X, mu, cov, p)
Tc = np.zeros((n,k)) #true classes
Cc = np.zeros((n,k)) #cluster ids
Nc = np.zeros(k)
for j in range(n):
m = np.argmax(w[j,:]) #which cluster to assign xj to
Nc[m] += 1 #increment cluster m's size
Cc[j,m] = 1 #set m'th entry in row j to 1
Tc[j,Y[j]] = 1 #keep track of true labels
#print cluster information
for i in range(k):
print "cluster: C"+str(i)
print "mean", mu[i]
print "cov", cov[i]
print "size", Nc[i]
Imat = np.dot(Tc.T, Cc)
print "confusion matrix"
print "\tC0\tC1\tC2"
inv_clmap = {v:k for k, v in clmap.items()}
for i in sorted(inv_clmap):
print inv_clmap[i], Imat[i,:]
purity = 0
for i in range(k):
purity += np.max(Imat[:,i])
purity = purity/n
print "purity", purity