Download the Airfoil Data Set from the UCI Machine Learning Repository. The dataset has 6 real-valued attributes. We will treat all attributes equally for this assignment, which consists of two parts. Part I must be done by students in both sections, namely CSCI4390 and CSCI6390. For the second part, Part II-CSCI4390 must be done by those in CSCI4390, and Part II-CSCI6390 must be done by students registered for CSCI6390.

Compute the mean vector \(\mathbf{\mu}\) for the 6-dimensional data matrix, and then compute the total variance \(var(\mathbf{D})\); **see Eq. (1.4)** for the latter.
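As a rough illustration of this step, the sketch below computes the mean vector and the total variance with plain array sums. It assumes the total variance is the average squared distance of the points from the mean with a \(1/n\) normalization; the referenced equation is not reproduced on this page, so check the normalization against the book. The function names and the tiny data matrix are made up for the example.

```python
import numpy as np

def mean_vector(D):
    """Column-wise mean of an n x d data matrix, via an explicit sum."""
    n = D.shape[0]
    return D.sum(axis=0) / n

def total_variance(D):
    """Average squared Euclidean distance of the points from the mean.

    Assumption: 1/n normalization (not n - 1).
    """
    n = D.shape[0]
    Z = D - mean_vector(D)           # centered data matrix
    return (Z * Z).sum() / n

# Tiny made-up data matrix (not the airfoil data): 3 points, 2 attributes
D = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]])
mu = mean_vector(D)
tv = total_variance(D)
print(mu, tv)
```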

Compute the sample covariance matrix \(\mathbf{\Sigma}\) as **inner products** between the attributes of the centered data matrix (see **Eq. (2.30)** in chapter 2). Next compute the sample covariance matrix as the sum of the **outer products** between the centered points (see **Eq. (2.31)**).
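The two formulations can be sketched as follows. This is only one possible reading: it assumes a \(1/n\) normalization (switch to \(n-1\) if the referenced equations use it), and the helper names `center`, `cov_inner`, and `cov_outer` are invented for this example. Both functions should return the same \(d \times d\) matrix up to floating-point error.

```python
import numpy as np

def center(D):
    """Subtract the column-wise mean from every row of D."""
    return D - D.sum(axis=0) / D.shape[0]

def cov_inner(D):
    """Covariance entries as inner products between centered attribute vectors."""
    Z = center(D)
    n, d = Z.shape
    S = np.empty((d, d))
    for i in range(d):
        for j in range(d):
            S[i, j] = np.dot(Z[:, i], Z[:, j]) / n   # assumption: 1/n
    return S

def cov_outer(D):
    """Covariance as the averaged sum of outer products of centered points."""
    Z = center(D)
    n, d = Z.shape
    S = np.zeros((d, d))
    for z in Z:                      # one centered point (row) at a time
        S += np.outer(z, z)
    return S / n                     # assumption: 1/n

# Tiny made-up data matrix (not the airfoil data)
D = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]])
S_in = cov_inner(D)
S_out = cov_outer(D)
```

Comparing `S_in` and `S_out` entry by entry is a cheap sanity check before moving on to the correlation matrix.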

Compute the correlation matrix for this dataset using the formula for the cosine between centered attribute vectors (see **Eq. (2.25)**). Which attributes are the most correlated, the most anti-correlated, and the least correlated?
Create the scatter plots for these interesting pairs using **matplotlib** and visually confirm the trends, i.e., describe how each of the three cases results in a particular type of plot.
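A minimal sketch of the cosine formulation is given below, assuming no attribute is constant (otherwise its centered vector has zero length and the division fails). The helper `extreme_pairs` is a hypothetical convenience for picking out the three pairs of interest; it interprets "least correlated" as the correlation closest to zero, which is an assumption.

```python
import numpy as np

def correlation_matrix(D):
    """Correlation as the cosine between centered attribute vectors."""
    Z = D - D.sum(axis=0) / D.shape[0]          # centered data matrix
    norms = np.sqrt((Z * Z).sum(axis=0))        # length of each centered attribute
    return np.dot(Z.T, Z) / np.outer(norms, norms)

def extreme_pairs(R):
    """Return (value, i, j) for the most, most anti-, and least correlated pairs."""
    d = R.shape[0]
    pairs = [(R[i, j], i, j) for i in range(d) for j in range(i + 1, d)]
    most = max(pairs, key=lambda p: p[0])
    anti = min(pairs, key=lambda p: p[0])
    least = min(pairs, key=lambda p: abs(p[0]))  # assumption: closest to zero
    return most, anti, least

# Made-up data: column 1 = 2 * column 0 + 1, column 2 = -column 0,
# column 3 is uncorrelated with the others
x = np.array([1.0, 2.0, 3.0, 4.0])
D = np.column_stack((x, 2 * x + 1, -x, np.array([1.0, -1.0, -1.0, 1.0])))
R = correlation_matrix(D)
most, anti, least = extreme_pairs(R)
```

The index pairs returned by `extreme_pairs` are the natural candidates for the scatter plots.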

Compute the dominant eigenvalue and eigenvector of the covariance matrix \(\mathbf{\Sigma}\) via the power-iteration method. One can compute the dominant eigenvector and eigenvalue of the covariance matrix iteratively as follows.

Let $$\mathbf{x}_0 = \begin{pmatrix} 1 \\ 1\\ \vdots \\ 1 \end{pmatrix} $$ be the starting vector in \(\mathbb{R}^d\), where \(d\) is the number of dimensions.

In each iteration \(i\), we compute the new vector: $$\mathbf{x}_i = \mathbf{\Sigma} \; \mathbf{x}_{i-1}$$ We then find the element of \(\mathbf{x}_i\) that has the maximum absolute value, say at index \(m\). For the next round, to avoid numerical issues with large values, we re-scale \(\mathbf{x}_i\) by dividing all elements by \(x_{im}\), so that the largest value is always 1 before we begin the next iteration.

To test convergence, you may compute the norm of the difference between the scaled vectors from the current iteration and the previous one, and you can stop if this norm falls below some threshold. That is, stop if $$\|\mathbf{x}_i - \mathbf{x}_{i-1}\|_2 < \epsilon$$ For the final eigen-vector, make sure to normalize it, so that it has unit length.

Also, the ratio \(\frac{x_{im}}{x_{i-1,m}}\) gives you the largest eigenvalue. If you did the scaling as described above and the maximum was at the same index \(m\) in the previous iteration, then the denominator will be 1, and the numerator will be the updated value of that element before scaling.
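The iteration described above might be sketched as follows. It assumes the matrix has a nonzero dominant eigenvalue (so the scaling never divides by zero); the `max_iter` safety cap and the function name are additions for this example and are not part of the assignment text.

```python
import numpy as np

def power_iteration(S, eps=1e-4, max_iter=10000):
    """Dominant eigenvalue and unit-length eigenvector of a square matrix S."""
    x = np.ones(S.shape[0])              # starting vector x_0 = (1, ..., 1)^T
    lam = 0.0
    for _ in range(max_iter):            # max_iter is a safety cap (assumption)
        y = np.dot(S, x)                 # x_i = Sigma x_{i-1}, before scaling
        m = np.argmax(np.abs(y))         # index of the largest-magnitude element
        lam = y[m] / x[m]                # ratio x_{im} / x_{i-1,m}
        y = y / y[m]                     # rescale so the largest element is 1
        converged = np.sqrt(((y - x) ** 2).sum()) < eps
        x = y
        if converged:
            break
    u = x / np.sqrt((x * x).sum())       # normalize the final vector to unit length
    return lam, u

# Small made-up symmetric matrix standing in for a covariance matrix
S = np.array([[4.0, 1.0], [1.0, 3.0]])
lam, u = power_iteration(S, eps=1e-10)
```

Checking that `np.dot(S, u)` is close to `lam * u` is a simple way to confirm the result without calling a library eigen-solver.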

Once you have obtained the dominant eigenvector, \(\mathbf{u}_1\), project each of the original data points \(\mathbf{x}_i\) onto this vector, and print the coordinates for the new points along this "direction".
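For a unit vector \(\mathbf{u}_1\), the coordinate of a point \(\mathbf{x}_i\) along that direction is the scalar \(\mathbf{u}_1^T \mathbf{x}_i\). A small sketch, using a made-up data matrix and a made-up unit vector rather than the actual eigenvector:

```python
import numpy as np

def project_onto(D, u):
    """Scalar coordinate of every row of D along the direction u."""
    u = u / np.sqrt(np.dot(u, u))    # make sure u has unit length
    return np.dot(D, u)              # one coordinate per data point

# Made-up data and direction (not the airfoil data or its eigenvector)
D = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 5.0]])
u1 = np.array([0.6, 0.8])
coords = project_onto(D, u1)
print(coords)
```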

Compute the first two eigenvectors of the covariance matrix \(\mathbf{\Sigma}\) using a generalization of the above iterative method.

Let \(\mathbf{X}_0\) be a \(d \times 2\) (random) matrix with two non-zero \(d\)-dimensional column vectors with unit length. We will iteratively multiply \(\mathbf{X}_0\) with \(\mathbf{\Sigma}\) on the left.

The first column will not be modified, but the second column will be orthogonalized with respect to the first one by subtracting its projection along the first column (see section 1.3.3 in chapter 1). That is, let \(\mathbf{a}\) and \(\mathbf{b}\) denote the first and second column of \(\mathbf{X}_1\), where $$\mathbf{X}_1 = \mathbf{\Sigma} \; \mathbf{X}_0$$

Then we orthogonalize \(\mathbf{b}\) as follows: $$ \mathbf{b} = \mathbf{b} - \left({\mathbf{b}^T \mathbf{a} \over \mathbf{a}^T\mathbf{a}}\right) \mathbf{a} $$ After this \(\mathbf{b}\) is guaranteed to be orthogonal to \(\mathbf{a}\). This will yield the matrix \(\mathbf{X}_1\) with the two column vectors denoting the current estimates for the first and second eigenvectors.

Before the next iteration, normalize each column to be unit length, and repeat the whole process. That is, from \(\mathbf{X}_1\) obtain \(\mathbf{X}_2\) and so on, until convergence.

To test for convergence, you can look at the distance between \(\mathbf{X}_{i}\) and \(\mathbf{X}_{i-1}\). If the difference is less than some threshold \(\epsilon\) then we stop.
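One way to sketch this two-column iteration is shown below. Several details are assumptions made for the example: the random start is seeded for reproducibility, the distance between successive matrices is measured with the Frobenius norm, `max_iter` is a safety cap, and the matrix is assumed to have at least two positive, distinct leading eigenvalues so that neither column collapses to zero.

```python
import numpy as np

def unit(v):
    """Scale v to unit length."""
    return v / np.sqrt(np.dot(v, v))

def top_two_eigenvectors(S, eps=1e-4, max_iter=10000, seed=0):
    """Estimate the first two eigenvectors of a symmetric PSD matrix S."""
    rng = np.random.RandomState(seed)            # seeded random start (assumption)
    X = rng.rand(S.shape[0], 2)
    X = np.column_stack((unit(X[:, 0]), unit(X[:, 1])))
    for _ in range(max_iter):
        Y = np.dot(S, X)                         # multiply by Sigma on the left
        a = Y[:, 0]                              # first column is left as is
        b = Y[:, 1]
        b = b - (np.dot(b, a) / np.dot(a, a)) * a    # orthogonalize b against a
        X_new = np.column_stack((unit(a), unit(b)))  # normalize both columns
        diff = np.sqrt(((X_new - X) ** 2).sum())     # Frobenius norm (assumption)
        X = X_new
        if diff < eps:
            break
    return X[:, 0], X[:, 1]

# Small made-up symmetric matrix with eigenvalues 3 + sqrt(3), 3, 3 - sqrt(3)
S = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
u1, u2 = top_two_eigenvectors(S, eps=1e-10)
```

Because the orthogonalization is repeated in every round, `u1` and `u2` stay orthogonal throughout, which is easy to confirm with `np.dot(u1, u2)`.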

Once you have obtained the two eigenvectors: \(\mathbf{u}_1\) and \(\mathbf{u}_2\), project each of the original data points \(\mathbf{x}_i\) onto those two vectors, to obtain the new projected points in 2D. Plot these projected points in the two new dimensions.
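The projection and plot could look roughly like the sketch below. The data and the two basis vectors are made up, and the output file name `projection.pdf` is a placeholder. The matplotlib import sits inside the plotting function only so that the projection helper can be used on its own.

```python
import numpy as np

def project_2d(D, u1, u2):
    """Return the n x 2 matrix of coordinates of each row of D along u1 and u2."""
    U = np.column_stack((u1, u2))    # d x 2 basis matrix
    return np.dot(D, U)

def plot_projection(A, path="projection.pdf"):
    """Scatter plot of the projected points; the file name is a placeholder."""
    # Imported here so that project_2d works even without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")            # non-interactive backend, writes files only
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    ax.scatter(A[:, 0], A[:, 1], s=5)
    ax.set_xlabel("u1")
    ax.set_ylabel("u2")
    fig.savefig(path)
    plt.close(fig)

# Made-up data and orthonormal directions (not the airfoil eigenvectors)
D = np.array([[3.0, 4.0, 0.0], [1.0, 0.0, 2.0], [0.0, 5.0, 1.0]])
u1 = np.array([1.0, 0.0, 0.0])
u2 = np.array([0.0, 1.0, 0.0])
A = project_2d(D, u1, u2)
```

Calling `plot_projection(A)` afterwards writes the scatter plot to the placeholder file.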

Write a script named **assign1.py** that takes as input the data filename and the epsilon parameter for convergence. You may assume that the data file resides in the local directory where the script will be called from. You can use epsilon as 0.001 or 0.0001. Save all your output to a PDF file named **assign1.pdf**. The output should comprise the mean vector, total variance, covariance matrix via inner and via outer product formulas, correlation matrix, the observations, and the dominant eigenvectors and eigenvalues. The scatter plots should also be part of this output file, with any required comments.
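A possible starting point for the command-line handling is sketched below. The argument names, the use of `argparse`, and the example file name `airfoil.dat` are all assumptions; the loader assumes one data point per line with whitespace-separated numeric columns.

```python
# Python 3
import argparse
import numpy as np

def parse_args(argv=None):
    """Parse the data filename and the convergence threshold."""
    parser = argparse.ArgumentParser(description="Assignment 1 driver (sketch)")
    parser.add_argument("filename", help="data file in the current directory")
    parser.add_argument("epsilon", type=float,
                        help="convergence threshold, e.g. 0.001")
    return parser.parse_args(argv)

def load_data(path):
    """Read the data matrix; assumes whitespace-separated numeric columns."""
    return np.loadtxt(path)

# Example of parsing an explicit argument list; "airfoil.dat" is a
# hypothetical file name used only for illustration.
args = parse_args(["airfoil.dat", "0.001"])
print(args.filename, args.epsilon)
```

In the actual script, calling `parse_args()` with no argument list reads the values from the command line, for example `python assign1.py airfoil.dat 0.001`.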

You should submit your python script and the output PDF file via the Submitty page: https://submitty.cs.rpi.edu//index.php?semester=f17&course=csci4390

This link will work for both CSCI4390 and CSCI6390 students; your account has already been set up. You can log in via your RCS username (NOT RIN) and RCS password. Submission deadline is 11:59:59pm (strict) on Thurs, Sep 21st. You must test out the Submitty login right away. If you cannot log in on the 21st and consequently cannot submit your assignment, you will be held responsible.

Your script must use Python 2.7 or Python 3 (please specify which version you are using in the script itself, as a comment on the first line). Please note that you can use built-in NumPy/Python functions for reading and parsing the text input, but you should **NOT** use any of the built-in functions to compute the quantities this assignment asks for. You may, however, verify your answers by comparing them to the results from built-in methods like **cov** or **eigen**, and so on.

If you are not that familiar with Python, you may search for tutorials, e.g. the Python tutorial. For information on the NumPy package, see the NumPy tutorial.

You are free to discuss how to tackle the assignment, but all coding must be your own. Please do not cut, copy and/or paste from anyone else, including code on the web. Any students caught violating the academic honesty principle will get an automatic **F** grade on the course and will be referred to the dean of students for disciplinary action.

Retrieved from http://www.cs.rpi.edu/~zaki/www-new/pmwiki.php/Dmcourse/Assign1

Page last modified on September 15, 2017, at 02:41 PM