CSCI-4973 - Spring 2012
Introduction to Visualization

Assignment #4: Big Data & Summarization

This week your primary task is to work with bigger datasets, and you will be expected to use your programming skills to obtain and/or wrangle this data. Your first goal is to locate a sufficiently large dataset that you have access to and that interests you. This dataset might be something you can download from the web, or it could come from your research or a class project. It might be something that you generate from a simulation or collect from a sensor. Just make sure there's plenty of data, that it very likely contains some interesting patterns, and that you are interested in exploring the data and searching for those patterns. An excellent source of datasets (large & small) can be found at

Once the data & source have been identified, grab an initial dataset. Start small; there is no need to collect everything right away. Ultimately you'll want to create datasets of several sizes from the same general source to test the robustness of your visualization toolkit and the scalability of your visualization design.
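One way to build test datasets of several sizes from the same source is to draw random samples of increasing size. The sketch below uses reservoir sampling, which works in a single pass without knowing how many records the source contains. The function name reservoir_sample and the fixed seed are illustrative choices, not part of the assignment.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream.

    The stream is read once and only k items are held in memory,
    so this also works on files too large to load whole.
    """
    rng = random.Random(seed)  # fixed seed so samples are repeatable
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Stand-in for a real data source (an assumption for this demo).
    source = range(1_000_000)
    for size in (100, 1_000, 10_000):
        sample = reservoir_sample(source, size)
        print(size, len(sample))
```

With a real file you could pass the open file object directly, since iterating over it yields one line at a time. Remember to handle any header line separately.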

Now parse & organize & process this data as needed. Convert the data from the format of the source to the format needed by your visualization toolkit (this may be trivial or a lot of work!). You may need to write a preprocessing script or program for data conversion. You may need to do some basic processing: selecting certain columns or entities from the dataset, creating new tables/maps with correspondences, etc. In your writeup for this assignment (in your README.txt), document all the stages of data collection, parsing, and processing, including any efficient (or inefficient) data structures or algorithms you used.
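As a rough illustration of a preprocessing script, the sketch below keeps only selected columns of a CSV file and skips rows where any of those columns is empty. It assumes your source is CSV with a header row. The function name select_columns and the example column names are hypothetical, and your data may need a different format or different cleaning rules.

```python
import csv
import io
from typing import Sequence, TextIO


def select_columns(src: TextIO, dst: TextIO, columns: Sequence[str]) -> int:
    """Copy only the named columns from src to dst, one row at a time.

    Rows with a missing or empty value in any selected column are skipped.
    Returns the number of data rows written.
    """
    reader = csv.DictReader(src)
    # extrasaction="ignore" drops every column not listed in fieldnames.
    writer = csv.DictWriter(dst, fieldnames=list(columns), extrasaction="ignore")
    writer.writeheader()
    written = 0
    for row in reader:
        if any(row.get(col) in (None, "") for col in columns):
            continue  # incomplete record
        writer.writerow(row)
        written += 1
    return written


if __name__ == "__main__":
    # Tiny made-up input, for demonstration only.
    raw = io.StringIO("city,temp,notes\nTroy,3,cold\nAlbany,,missing\n")
    out = io.StringIO()
    kept = select_columns(raw, out, ["city", "temp"])
    print(kept)
    print(out.getvalue())
```

Because rows are streamed rather than loaded all at once, memory use stays flat as the input grows. That is the kind of design decision worth noting in your README.txt.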

Create a simple visualization plan and feed the data into a visualization toolkit of your choice. Create a preliminary visualization. Try progressively larger datasets. Keep going until you identify significant challenges or flaws inherent in working with data of this size. For example:

  • "the data is so big it... crashes my favorite visualization toolkit"
  • "the data is so big it... runs really, really slow"
  • "the data is so big it... the font is too small and I can't read the results"
  • "the data is so big it... the data overlaps and I can't reliably interpret the pattern"
  • "the data is so big it... my computer runs out of RAM/disk and makes funny noises"
  • "the data is so big it... my graphics card starts acting flakey"
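To make "runs really, really slow" measurable, you could time your visualization step at several dataset sizes. The sketch below is one possible harness. The render argument is a placeholder for whatever function drives your toolkit, and the demo uses sorted only as a stand-in workload.

```python
import time
from typing import Callable, List, Sequence, Tuple


def time_at_sizes(
    render: Callable[[Sequence], object],
    data: Sequence,
    sizes: Sequence[int],
) -> List[Tuple[int, float]]:
    """Run render on the first n records for each n in sizes.

    Returns (records actually used, elapsed seconds) pairs.
    """
    timings = []
    for n in sizes:
        subset = data[:n]  # shorter than n if data runs out
        start = time.perf_counter()
        render(subset)
        elapsed = time.perf_counter() - start
        timings.append((len(subset), elapsed))
    return timings


if __name__ == "__main__":
    records = list(range(200_000, 0, -1))
    for count, seconds in time_at_sizes(sorted, records, [1_000, 10_000, 100_000]):
        print(f"{count:>7} records: {seconds:.4f} s")
```

Recording these numbers in your writeup gives concrete evidence of where the toolkit stops scaling.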

As time permits, propose/implement changes to the visualization design and/or process to accommodate this large data. For example, you could preprocess the data to summarize or simplify it. Alternatively, you could redesign the visualization strategy, perhaps using color instead of words, or switching to a new chart type or a new visualization tool, etc.
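One simple form of summarization is to bin points into a grid and plot the count per cell (for example as a heatmap) instead of drawing every overlapping point. The sketch below assumes 2D numeric data. The function name bin_points and the cell size are illustrative, and other summaries (averages, sampling, clustering) may suit your data better.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def bin_points(
    points: Iterable[Tuple[float, float]], cell_size: float
) -> Dict[Tuple[int, int], int]:
    """Count how many points fall in each square grid cell.

    Cell (i, j) covers x in [i*cell_size, (i+1)*cell_size)
    and y in [j*cell_size, (j+1)*cell_size).
    """
    counts: Counter = Counter()
    for x, y in points:
        # Floor division keeps negative coordinates in the correct cell.
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell] += 1
    return dict(counts)


if __name__ == "__main__":
    # Made-up points, for demonstration only.
    pts = [(0.1, 0.2), (0.9, 0.4), (1.5, 0.1), (-0.5, 0.5)]
    for cell, count in sorted(bin_points(pts, 1.0).items()):
        print(cell, count)
```

The output size depends on the number of occupied cells rather than the number of input points, so the chart stays readable as the dataset grows.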

Working as a team (pick a new partner!) and/or revisiting an idea from an earlier assignment are encouraged, but not required.

Target Visualization Stage: Data Collection (primary) & Visualization Execution (secondary)

How to Submit

Each team or individual should make a group post to LMS (one teammate makes a post for the whole team) to the "Assignment #4 Big Data" discussion. The post should include a description of the data, the collection & parsing & organizing process, the visualization design, the visualization toolkit used, the challenges identified in working with progressively larger datasets, and the steps taken or proposed to mitigate these issues. The post should include images/links with the visualization results.

Each student (each teammate) should submit an individual plaintext README.txt (using the provided template) to the homework server (homework server info). Your README.txt file should focus on your contributions to the team, but you can cut & paste from the group LMS post. Also submit to the homework server any source code you wrote for the assignment. (You do not need to submit a complete buildable software project; just submit the source code/scripts you wrote.) Do not try to submit the dataset! But include a small sample of the data as appropriate in your writeup.

Assignment Due: Tuesday February 21st, 11:59pm