CSCI-4973 - Spring 2012
Introduction to Visualization
Assignment #5: Wrangling High Dimensional Data
For your assignment this week, we focus on high dimensional datasets and on different techniques for visualizing information and patterns that can be difficult to detect because the data can't trivially be plotted in two or even three dimensions.
Perhaps the most challenging and important task for this assignment is to identify a suitable high dimensional dataset that can't trivially be mapped to a 2D visualization. High dimensional datasets can be scientific-type data that has a clear spatial embedding in 3D plus additional scalar (e.g., temperature, time) or vector (e.g., color, velocity) information at each 3D point. Alternatively, high dimensional datasets may be more "InfoVis"-style tables with many rows of points and many columns of data. First, consider the data you have used for previous assignments. Is any of this data high dimensional or can it be extended by acquiring additional dimensions for each sample?
After identifying an appropriate and interesting dataset, you should hypothesize a pattern or correlation that might exist in the data. Now experiment with different strategies for high dimensional data visualization (you may need to try out new visualization toolkits, too). Also consider strategies for pre-processing to cluster or simplify this data (e.g., k-means clustering), or to reduce the dimensionality of the data (e.g., principal components analysis) allowing it to be plotted using more traditional tools.
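As a rough illustration of the pre-processing strategies mentioned above, here is a minimal sketch of principal components analysis followed by k-means clustering. It uses only the Python standard library so it runs anywhere; libraries such as NumPy or scikit-learn offer the same operations and are likely a better fit for real datasets. The function names (pca_project, kmeans, blob), the farthest-point initialization, and the synthetic 4D data are all assumptions made for this example and are not part of the assignment.

```python
import math
import random


def pca_project(rows, n_components=2, iters=200):
    """Project rows onto their top principal components (power iteration)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    components = []
    for _component in range(n_components):
        # Start from a fixed pseudo-random unit vector.
        v = [rng.random() + 0.1 for _ in range(d)]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        for _step in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break  # no variance left in the deflated matrix
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * cov[i][j] * v[j]
                         for i in range(d) for j in range(d))
        components.append(v)
        # Deflate: remove the found direction so the next pass finds another.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    projected = [[sum(r[j] * c[j] for j in range(d)) for c in components]
                 for r in centered]
    return projected, components


def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=50):
    """Lloyd's algorithm with a deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _step in range(iters):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members)
                                for col in zip(*members)])
            else:
                updated.append(centroids[c])  # keep an empty cluster in place
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Synthetic stand-in for a real dataset: two well-separated groups in 4D.
data_rng = random.Random(42)


def blob(center, count):
    return [[c + data_rng.uniform(-0.5, 0.5) for c in center]
            for _ in range(count)]


rows = blob([0.0] * 4, 20) + blob([5.0] * 4, 20)
projected, components = pca_project(rows, n_components=2)
labels, centroids = kmeans(projected, k=2)
print("first 2D point:", projected[0])
print("cluster sizes:", [labels.count(c) for c in range(2)])
```

The 2D points in projected can be handed to any ordinary scatter-plot tool, with labels used to color the points. Whether two components and two clusters are appropriate depends entirely on your data, so treat both numbers as starting guesses.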
Analyze the results, revise the visualization, draw conclusions, and compare your findings to your hypotheses. Describe your data, hypothesis, visualization process, and analysis in your README.txt file.
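If your hypothesis is a correlation between two columns, one simple numeric check to report alongside the visualization is the Pearson correlation coefficient. The sketch below is only an illustration: the function name pearson and the toy column values are made up for this example, and a single coefficient is no substitute for looking at the plot.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric columns."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two columns of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant column has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


# Toy values (invented for illustration): two columns of one dataset.
column_a = [1.0, 2.0, 3.0, 4.0, 5.0]
column_b = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson(column_a, column_b)
print("Pearson r =", round(r, 3))
```

A value near +1 or -1 suggests a strong linear relationship, while a value near 0 suggests none; nonlinear patterns can still be present when r is small.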
Working as a team (pick a new partner!) and/or revisiting an idea from an earlier assignment are encouraged, but not required.
Target Visualization Stage: Analysis & Validation and Visualization Execution
How to Submit
Each team or individual should make a group post to LMS (one teammate makes a post for the whole team) to the "Assignment #5 Wrangling High Dimensional Data" discussion. The post should include a description of the data, the collection, parsing, and organizing process, the visualization design, the visualization toolkit used, the identified challenges in working with progressively larger datasets, and the steps taken or proposed to mitigate these issues. The post should include images/links with the visualization results.
Each student (each teammate) should submit an individual plaintext README.txt (using the provided template) to the homework server (homework server info). Your README.txt file should focus on your contributions to the team, but you can cut & paste from the group LMS post. Also submit to the homework server any source code you wrote for the assignment. (You do not need to submit a complete buildable software project; just submit the source code/scripts you wrote.) Do not try to submit the dataset! But include a small sample of the data as appropriate in your writeup.
Assignment Due: Tuesday February 28th, 11:59pm