CSCI 4550/6550 Interactive Visualization
Spring 2018

Homework #4: Data Collection and Preparation

For this homework, you may work in a team of 2 or individually. (Teams of 2 are encouraged!) You are encouraged to work with someone you hadn't met before this course.

This week your primary task is to identify and collect a new and interesting (to you!) dataset that is also interestingly large. You are expected to use your programming skills to obtain and/or wrangle this data into a file format you can visualize and analyze.

Some examples of where to start:

  • Take a nontrivial computer program (for example, a simulation or a solver) you have written and add dense logging information. How often does each function get called? How many times does an inner loop get called? What is the pattern of data stored in a variable or passed into a function?

  • Monitor your own computer activity: which keys do you press, where does your mouse move, which files do you open?

  • Scrape the GPS data off of your phone to gather your location over time, or collect your heart rate from a smart watch.

  • Set up a microphone or video camera and collect a stream of audio and/or images.
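
As one illustration of the first idea above, here is a minimal Python sketch of dense call logging. Everything in it is an assumption made for illustration: the `logged` decorator, the `fib` stand-in for your own program, the column names, and the `calls.csv` output file are all hypothetical, and your own program will need its own choice of what to record per row.

```python
import csv
import functools
import time

CALL_LOG = []  # one dict per function call; each dict becomes one CSV row


def logged(func):
    """Record the name, first positional argument, and duration of every call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        CALL_LOG.append({
            "function": func.__name__,
            "arg": args[0] if args else None,
            "seconds": time.perf_counter() - start,
        })
        return result
    return wrapper


@logged
def fib(n):
    # Hypothetical stand-in for "a program you have written":
    # deliberately naive recursion so that the log grows quickly.
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def write_log(path):
    """Dump the collected rows to a CSV file with a header line."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["function", "arg", "seconds"])
        writer.writeheader()
        writer.writerows(CALL_LOG)


if __name__ == "__main__":
    fib(15)
    write_log("calls.csv")  # hypothetical output file name
    print(len(CALL_LOG), "calls logged")
```

Each row answers "which function, with what input, for how long", which is already 3 columns. Note that the logging itself adds overhead, so the recorded timings are only approximate.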

Try to find a dataset that's not simply "download a file". You should be doing a moderate amount of work (writing code) to either collect or parse/reorganize/simplify/post-process this data.

NOTE: Grad students working on a thesis or undergraduates working on a research project are strongly encouraged (required?) to work with a research-related data source.

Once you've selected a data source...

  • Design the detailed format for your raw data (the columns of your data "spreadsheet") and decide on the action or sampling frequency for each "row" of the data. What is the datatype for each column? Binary (true/false), category, string, integer, floating point, etc.? Make sure you are able to acquire an "interesting" amount of data, both number of samples (at least 1000 rows?) and dimensions per sample (at least 3 columns?). Note: These estimates are not requirements. If your data has many more columns, things can be quite interesting even with far fewer rows.

  • Let's apply a (new to you) visualization type to your dataset. This will depend on the details of your dataset.

    • Parallel Coordinates: This will make sense if your data has 5-15 dimensions. Carefully designing the order and orientation of the axes can improve the effectiveness of this visualization.

    • LineUp: This will make sense if your data has 3-10 floating point dimensions and you would like to compare and rank the datapoints.

    • Convex Hull: This will make sense if your data has 2 (or possibly more) interesting floating point dimensions. The shape and area (or volume) of the convex hull can be informative. We can also identify outliers (data points that are on the convex hull rather than in the interior).

    • Voronoi Diagram: This will make sense if your data has 2 interesting floating point dimensions. You'll want to choose a moderate number of interesting sample points (maybe 10-40). Additional data points can be quickly compared to the original sample points.

    • k-Means Clustering: This will make sense if your data has 2 interesting floating point dimensions. If your dataset is known to represent 2 or more underlying types, k-Means Clustering can be used to automatically extract and label these groups.

  • Identify a visualization tool that enables this visualization type. If it's new to you, learn how to use this tool and wrangle your data into the appropriate format for the tool. Depending on your prior experience with the tool (if any) and the learning curve for the tool, you may or may not have time to revise and iterate and improve the end result of your visualization.
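
To make the Convex Hull option above concrete, here is a small Python sketch that computes a 2D hull, its area, and the points left in the interior. It uses Andrew's monotone chain algorithm, which is just one common way to do this and not a required method; the function names and the sample points are made up for illustration. A visualization tool or library you choose may well provide its own hull routine.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # Pop while the last two points and p do not make a left turn.
        # Using <= 0 also drops points that lie exactly on a hull edge.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def hull_area(hull):
    """Area of the hull polygon via the shoelace formula."""
    n = len(hull)
    twice_area = sum(
        hull[i][0] * hull[(i + 1) % n][1] - hull[(i + 1) % n][0] * hull[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


def interior_points(points, hull):
    """Points that are not hull vertices, in their original order."""
    on_hull = set(hull)
    return [p for p in points if p not in on_hull]


if __name__ == "__main__":
    # Made-up sample data: a 4 x 3 rectangle with two points inside it.
    data = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1), (1, 2)]
    hull = convex_hull(data)
    print("hull:", hull)
    print("area:", hull_area(hull))
    print("interior:", interior_points(data, hull))
```

One design choice to be aware of: because of the `<= 0` test, points lying exactly on a hull edge are reported as interior points. Change it to `< 0` if you want them kept as hull vertices.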

When you're ready to submit:

  • Prepare a writeup for this assignment as either a .pdf with inline images or a plaintext README.txt with well-named image files.

    • Include an overview of your collected dataset, your motivation/interest in this topic, and the source of your data. Describe your efforts to collect, parse, reorganize, simplify, and/or post-process this data source. Be sure to document any unexpected challenges.

    • Which visualization type did you choose and why?

    • Which visualization tool did you choose to execute your visualization? Was this tool new to you? Give a brief review of the tool. What are the strengths and weaknesses? Do you think you will use this tool in the future? What advice do you give to your peers who might try this tool?

    • At least one image visualizing the dataset (or a portion of the data). If you had time to iterate and revise on the visualization, include up to 5 images and describe the evolution of your visualization.

  • In a code directory, include the source code you wrote to collect the data. (Don't include 3rd party libraries; your code won't be compiled or run for grading purposes.)

  • In a data directory, include interesting samples of the data. Don't attempt to upload the entire dataset (it might be too big!); instead, include a sample that shows the format and range of values. Document the overall size of the data (# of rows and/or file size for context). Depending on any work you had to do to wrangle the data into an alternate format, include samples of the data at intermediate and final stages as well.

  • Note: Teams of two should indicate "who did what" in the writeup.
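
If your data ends up in a CSV file with a header row, a short Python script along these lines could produce the sample file and the size numbers for the data directory. This is only a sketch under that assumption: the function names, the "every 100th row" sampling rule, and the three-way type guess are all choices made here for illustration, and a sample of interesting rows chosen by hand may serve your writeup better.

```python
import csv
import os


def infer_type(values):
    """Guess a column type from its string values: integer, floating point, or string."""
    for caster, name in ((int, "integer"), (float, "floating point")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "string"


def summarize(path, sample_path, every=100):
    """Copy every `every`-th data row of a CSV file to sample_path.

    Returns the total row count, the file size in bytes, and a type guess
    per column. The type guess is based on the sampled rows only, so it
    can be wrong for columns with rare non-numeric values.
    """
    with open(path, newline="") as src, open(sample_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)
        writer.writerow(header)
        sampled = [[] for _ in header]  # sampled values, one list per column
        n_rows = 0
        for row in reader:
            if n_rows % every == 0:
                writer.writerow(row)
                for column, value in zip(sampled, row):
                    column.append(value)
            n_rows += 1
    return {
        "rows": n_rows,
        "bytes": os.path.getsize(path),
        "types": {name: infer_type(col) for name, col in zip(header, sampled)},
    }
```

Calling, for example, `summarize("full.csv", "data/sample.csv")` (both file names are hypothetical) returns a dictionary whose values can be pasted into the writeup to document the overall size and column datatypes.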