Assignment 1: Numeric Data Analysis

Due Date: Mon 26th Sep, before midnight

Download the iris.txt data file. More information on this dataset can be obtained from the UCI ML Repository.

Write a script in Python to answer the following questions. Unless stated otherwise, every question must use all four numeric dimensions. You may ignore the last column, except when a question specifically asks you to use the class information for each point.

  1. Compute the mean vector.
  2. Compute the sample variance for Attribute 1 via vector operations.
  3. Assuming that Attribute 1 is normally distributed, plot its probability density function.
  4. Compute the sample covariance between Attributes 1 and 2 via vector operations, and then compute the correlation between them as the cosine of the angle between the two centered attribute vectors. Plot the scatter plot between these two attributes (you may use the Python matplotlib module for this).
  5. Compute the sample covariance matrix via two methods, namely as inner products between the columns, and as outer products over the points. You should write two different functions, one for each of the methods.
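
Question 4 treats correlation geometrically: once both attributes are mean-centered, their correlation is the cosine of the angle between the two centered vectors. A minimal sketch of that idea follows, using made-up toy values rather than the iris data; the variable names and the 1/n normalization are assumptions for illustration only.

```python
import numpy as np

# Hypothetical toy data standing in for two attribute columns (not iris values).
x1 = np.array([2.0, 4.0, 6.0, 8.0])
x2 = np.array([1.0, 3.0, 2.0, 6.0])
n = x1.shape[0]

# Means via a dot product with the ones vector, avoiding np.mean.
ones = np.ones(n)
z1 = x1 - np.dot(ones, x1) / n   # centered attribute vector 1
z2 = x2 - np.dot(ones, x2) / n   # centered attribute vector 2

# Covariance as a scaled inner product (dividing by n is an assumption here).
cov12 = np.dot(z1, z2) / n

# Correlation as the cosine of the angle between the centered vectors:
# inner product divided by the product of the two vector lengths.
corr12 = np.dot(z1, z2) / (np.sqrt(np.dot(z1, z1)) * np.sqrt(np.dot(z2, z2)))

print("covariance = %s" % cov12)
print("correlation = %s" % corr12)
```

The normalization constant cancels in the cosine, so the correlation is the same whether the covariance is divided by n or by n - 1.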

The requisite background for these questions is available in chap1.pdf and chap2.pdf.

Note that you may not use any of the built-in operations for means and covariances in Python/numpy. However, you may use the built-in function for computing and plotting the density function. You must of course use the matrix operations available in numpy.
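
One way to stay within this rule while still guarding against mistakes is to keep the built-ins out of the computation and use them only as a private sanity check. The sketch below assumes hypothetical helper names (mean_vector, covariance_matrix) and toy data, and it assumes the 1/n normalization; it is not part of the required submission.

```python
import numpy as np

def mean_vector(D):
    """Column means via a ones-vector product (no np.mean)."""
    n = D.shape[0]
    return np.ones(n).dot(D) / n

def covariance_matrix(D):
    """Covariance as Z^T Z / n on the centered data (no np.cov)."""
    n = D.shape[0]
    Z = D - mean_vector(D)
    return Z.T.dot(Z) / n

# Hypothetical toy data: 5 points, 3 attributes.
D = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.0, 1.5],
              [3.0, 4.0, 2.5],
              [4.0, 3.0, 2.0],
              [5.0, 5.0, 4.0]])

# Built-ins appear only in the checks, never in the computation itself.
assert np.allclose(mean_vector(D), np.mean(D, axis=0))
# bias=True makes np.cov divide by n, matching the normalization assumed here.
assert np.allclose(covariance_matrix(D), np.cov(D, rowvar=False, bias=True))
print("hand-written results agree with the built-ins")
```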

What to turn in

  • Write a Python script called RCSID-Assign1.py. Use comment lines to separate your code/function for each question. Here RCSID is your RPI email id (without the rpi.edu part).
  • Submit a PDF file named RCSID-Assign1.pdf that includes your solutions to each of the questions (just cut and paste the output from Python). The figures (probability density function and scatter plot) should also be part of this file.
  • Do not hard code any of the file path names (you may hard code just the filename). You can assume that any data file will be in the local directory.
  • Submit the assignment as a zip or tar file that includes the RCSID-Assign1.py script and the RCSID-Assign1.pdf file. Email both as an attachment to the course assignment submission email address: . The subject of your email should be "RCSID-Assign1 Submission", where RCSID is your RPI email id.

Solution

Here are the correct answers.

  • Q1: mean = [[ 5.84333333 3.054 3.75866667 1.19866667]]
  • Q2: variance for attribute 1 = 0.681122222222
  • Q3: The plot of the normal is here: PDF
  • Q4: covariance for attributes 1&2 = -0.0390066666667
    correlation for attributes 1&2 = -0.109369249951
    The scatter plot is here: PDF
  • Q5:
    Covariance Matrix (inner products)
    [[ 0.68112222 -0.03900667  1.26519111  0.51345778]
     [-0.03900667  0.18675067 -0.319568   -0.11719467]
     [ 1.26519111 -0.319568    3.09242489  1.28774489]
     [ 0.51345778 -0.11719467  1.28774489  0.57853156]]
    Covariance Matrix (outer products)
    [[ 0.68112222 -0.03900667  1.26519111  0.51345778]
     [-0.03900667  0.18675067 -0.319568   -0.11719467]
     [ 1.26519111 -0.319568    3.09242489  1.28774489]
     [ 0.51345778 -0.11719467  1.28774489  0.57853156]]
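
The answers above follow the 1/n normalization used by the script, which divides each inner product by n. A solution that divides by n - 1 instead would differ by a factor of n/(n - 1). A minimal sketch of the conversion for the Q2 value, assuming iris.txt holds the usual 150 points:

```python
# Assumption: the data file has the standard 150 iris points.
n = 150

# Q2 answer from the list above, computed with the 1/n normalization.
var_biased = 0.681122222222

# Rescale to the 1/(n-1) normalization.
var_unbiased = var_biased * n / (n - 1)

print("variance with 1/n     = %.6f" % var_biased)
print("variance with 1/(n-1) = %.6f" % var_unbiased)
```

The same factor applies to every entry of the covariance matrix, while the correlation is unaffected because the factor cancels.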
    

The python script file is as follows:

#!/usr/bin/env python

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

#read the data
D = np.loadtxt('iris.txt',delimiter=",",usecols=(0,1,2,3))
X = np.asmatrix(D)
(n,d) = np.shape(X)
print("X: points(n)=%d, dims(d)=%d" % (n, d))


#Q1. compute the mean vector (column sums divided by n; no built-in mean)
mu = np.sum(X,axis=0)
mu = mu/n
print("mean = %s" % mu)

#Q2. compute the variance for attr1
#centered data
Z = X - mu
#inner product of the centered column with itself, normalized by n
s1 = Z[:,0].T * Z[:,0] / n
print("variance for attribute 1 = %s" % s1.item())

#Q3. plot the normal density for attr1
m1 = mu[0,0]
sd1 = np.sqrt(s1.item())
k = 5 #5 std-devs
sample = np.arange(m1-(k*sd1), m1+(k*sd1), 0.001)
plt.plot(sample, 1/(sd1 * np.sqrt(2 * np.pi)) * np.exp( - (sample - m1)**2 / (2 * sd1**2) ), linewidth=2, color='r')
plt.xlabel("x")
plt.ylabel("Normal PDF: f(x)")
#mark the mean on the x-axis using the computed value rather than a hard-coded one
plt.plot([m1], [0], 'o')
plt.annotate('mean=%.2f' % m1, xy=(m1, 0), xytext=(m1, 0.1),
             arrowprops=dict(facecolor='black', shrink=0.05))
plt.savefig('normal1.pdf', format="pdf")
#plt.show()

#Q4. covariance/correlation between attr1 and attr2
s12 = Z[:,0].T * Z[:,1] / n
print("covariance for attributes 1&2 = %s" % s12.item())

s2 = Z[:,1].T * Z[:,1] / n
#correlation = covariance normalized by both standard deviations
r12 = s12/np.sqrt(s1*s2)
print("correlation for attributes 1&2 = %s" % r12.item())
plt.clf()
plt.scatter(Z[:,0], Z[:,1])
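#the plotted columns come from Z, so the axes show mean-centered values
plt.xlabel('sepal length (mean-centered)')
plt.ylabel('sepal width (mean-centered)')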
plt.savefig('scatter12.pdf', format="pdf")
#plt.show()

#Q5. covariance matrix
#inner product version
Sin = Z.T*Z/n
print("Covariance Matrix (inner products)")
print(Sin)
#outer product version: sum the d x d outer product of each centered point
Sout = 0
for i in range(n):
    Sout += Z[i,:].T*Z[i,:]
Sout = Sout/n

print("Covariance Matrix (outer products)")
print(Sout)
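
The loop in the outer product version can also be written without an explicit Python loop. The following is a sketch of one alternative, not the method used in the script above: it assumes plain numpy arrays instead of np.matrix, uses made-up toy data, and builds all n outer products at once with broadcasting.

```python
import numpy as np

# Hypothetical toy data: 4 points, 2 attributes (not iris values).
D = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 5.0],
              [6.0, 4.0]])
n, d = D.shape

# Center the data using column sums (no built-in mean).
Z = D - D.sum(axis=0) / n

# Stack of n outer products z_i z_i^T with shape (n, d, d), then average.
outer_stack = Z[:, :, np.newaxis] * Z[:, np.newaxis, :]
S_outer = outer_stack.sum(axis=0) / n

# Inner product form for comparison: entry (a, b) is Z[:, a] . Z[:, b] / n.
S_inner = Z.T.dot(Z) / n

print(S_outer)
print(np.allclose(S_outer, S_inner))
```

The broadcast version stores all n outer products at once, so it trades memory for speed; for a data set as small as iris either form is fine.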