From Data Mining and Analysis

Main: HighDimensionalAnalysis

High Dimensional Data Analysis

Write a script to do the following:

Hypersphere Volume: Plot the volume of a unit hypersphere as a function of dimension. Plot for \(d=1, \cdots, 50\).

Hypersphere Radius: What value of radius would one need to maintain a hypersphere volume of 1 with increasing \(d\). Plot this value for \(d = 1, \cdots, 100\).

Nearest Neighbors: Assume we have a unit hypercube centered at \((0.5, \cdots, 0.5)\). Generate \(n=10000\) uniformly random points in \(d\) dimensions, in the range \((0,1)\) in each dimension. Find the ratio of the nearest and farthest point from the center of the space. Also store the actual distance of the nearest \(d_n\) and farthest \(d_f\) points from the center. Plot these value for \(d = 1, \cdots, 100\).

Fraction of Volume: Assume we have a hypercube of edge length \(l=2\) centered at the origin \((0,0, \cdots, 0)\). Generate \(n=10,000\) points uniformly at random for increasing dimensionality \(d = 1, \cdots, 100\). Now answer the following questions:

Diagonals in High Dimensions

Your goal is the compute the empirical probability mass function (EPMF) for the random variable \(X\) that represents the angle (in degrees) between any two diagonals in high dimensions.

Assume that there are \(d\) primary dimensions (the standard axes in cartesian coordinates), with each of them ranging from -1 to 1. There are \(2^{d}\) additional half-diagonals in this space, one for each corner of the \(d\)-dimensional hypercube.

Write a script that randomly generates \(n=100000\) pairs of half-diagonals in the \(d\)-dimensional hypercube, and computes the angle between them (in degrees).

Plot the EPMF for three different values of \(d\), as follows \(d=10,100,1000\). What is the min, max, value range, mean and variance of \(X\) for each value of \(d\)?

What would you have expected to have happened analytically? In other words, derive formulas for what should happen to angle between half-diagonals as \(d \to \infty\). Does the EPMF conform to this trend? Explain why? or why not?

What is the expected number of occurrences of a given angle \(\theta\) between two half-diagonals, as a function of \(d\) (the dimensionality) and \(n\) (the sample size)?

Retrieved from http://www.cs.rpi.edu/~zaki/dataminingbook/pmwiki.php/Main/HighDimensionalAnalysis
Page last modified on September 06, 2014, at 01:51 PM