Assignment 2
Due Date: Fri 28th Sep, Before Midnight
Part I and II have to be done by both CSCI4390 and CSCI6390. There is an extra question for CSCI6390. The bonus question can be attempted by both CSCI4390 and CSCI6390.
(Part I) Diagonals in High Dimensions (50 points total)
Your goal is the compute the probability mass function
for the random variable \(X\) that represents the angle (in
degrees) between any two diagonals in high dimensions.
Assume that there are \(d\) primary dimensions (the standard axes
in cartesian coordinates), with each of them ranging from
-1 to 1. There are \(2^{d}\) additional
half-diagonals in this space, one for each corner of the \(d\)-dimensional hypercube.
- (20 points) Write a python script that randomly generates \(n=100,000\) pairs of
half-diagonals in the d-dimensional hypercube, and computes the angle
between them (in degrees).
- (30 points CSCI4390/ 20 Points CSCI6390) Plot the probability mass function (PMF) for three different
values of \(d\), as follows \(d=\{10,100,1000\}\). Recall that PMF is simply the plot of the angle versus the probability of observing that angle in the sample of \(n\) points for a given value of \(d\).
What is the min, max, value range, mean and variance of \(X\) for each value of \(d\)?
- Required Extra Question for CSCI6390 Only (10 points)
What would you have expected to have happened analytically? In other words, derive formulas for what should happen to angle between half-diagonals as \(d \to \infty\). Does the EPMF conform to this trend? Explain why? or why not?
- Bonus Question for both CSCI4390 and CSCI6390 (20 points)
What is the expected number of occurrences of a given angle \(\theta\) between two half-diagonals, as a function of d (the dimensionality) and n (the sample size)?
(Part II) Kernel Principal Components Analysis (50 points total)
- (20 points) You will implement the Kernel PCA (KPCA) algorithm as described in Algorithm 7.2 (Chapter 7, page 207). You need to compute the kernel and then center it, followed by extracting the dominant eigenvectors, which will give you the components of the directions in feature space. Next you will project and visualize the data. To compute the principal components (PCs) of the kernel matrix, you may use the inbuilt numpy function eigh.
- Run KPCA on the UCI dataset Concrete_Data.txt. You can read the description of the data at the UCI repository. I have already converted the xls file into text, and removed the titles for easy import with Numpy. The data has 1030 points and 9 attributes.
- (10 points) Using the linear kernel, how many dimensions are required to capture 95% of the total variance? For the same linear kernel,
compute the projected points along the first two kernel PCs, and create a scatter plot of the projected points.
- (10 points) Next, use the covariance matrix for the original data to compute the regular principal components, i.e., for the covariance matrix. Project the data onto the first two PCs. How do the eigenvalues and the projection compare with that obtained via Kernel PCA with linear kernel?
- (10 points) Finally, use the gaussian kernel and repeat the exercise for kernel PCA. Project the points onto the first two PCs and plot the scatter plot. For the variance or spread of the gaussian kernel, namely \(\sigma^2\), read it from the command line. Try different values and submit your plot for the value that makes most sense to you (observe the projected plots for various spread values and then decide).
What to submit
- Write two python scripts named as Assign2-part1.py and Assign2-part2.py, one for each of the parts.
- For part1, the script will be run as Assign2-part1.py.
- For part2, read the filename from the command line, assume it is in the local directory. So, part2 will be run as Assign2-part2.py FILENAME SPREAD. FILENAME is the datafile name, and SPREAD is the \(\sigma^2\) value.
- Submit a PDF file named Assign2.pdf that should include your solutions to each of the questions (just cut and paste the output from python). The figures should also be part of this file. Report in your output the spread value that made most sense to you.
- Submit the scripts and pdf file via submitty: https://submitty.cs.rpi.edu//index.php?semester=f18&course=csci4390