CSCI 6968 - Cloud Computing - Spring 2016

General Information

Class Time and Place: TF 2:00pm - 3:50pm, Location: Sage 2112
Instructor: Stacy Patterson     sep AT
Office Hours: M 10:30am - 11:30am or by appointment

Course Syllabus

Paper Presentation Schedule

Paper reviews are due by noon on the day of the presentation.

Info for Project Proposal Presentations

Slides on Final Project Presentations and Reports

Course Description

In this course, we will study significant tools and applications that comprise today's cloud computing platform, with a special focus on using the cloud for big data applications. The course content will come directly from research papers, articles, and documentation of cloud and data center architectures and technologies. We will work together to develop a deep understanding of this content through class presentations and discussions of this material. Students will also create a research project of their choosing that uses several cloud computing components.

This course was inspired by a recent posting on LinkedIn by Anil Mada, Senior Director of Engineering at PayPal, entitled "100 open source Big Data architecture papers for data professionals." While we will not cover all 100 papers, papers from each component of the architecture will be included. I have also included some relevant papers for non-open-source projects and from the Google Research page on Distributed Systems and Parallel Computing.

There are no course prerequisites. Students are expected to have the ability to read and understand systems research papers such as those listed below. Students should also be comfortable implementing reasonably complex software applications. Undergraduates who are interested in taking this class should contact the instructor for permission. There is also the possibility of taking this course for independent study credit.

Useful Links
Cloud Computing Resources

You can get $75 credit for the Amazon cloud by signing up for AWS Educate. RPI is a member institution.

You are also allotted a 6-month pass for Microsoft Azure, with $100 credit per month. Please email the professor to obtain a pass code.

You should do your development on a local machine and only use the cloud for testing and evaluation. Make sure to shut down your instances when you are done to avoid extra charges.

Project Info

Project presentations will be Tuesday, May 3 in class and Wednesday, May 11, 2pm - 4pm in AE 216.

Project reports must be 4-6 pages, formatted in two-column IEEE conference format. LaTeX and Word templates can be found here.

Paper List
  1. The Google File System
  2. MapReduce: Simplified Data Processing on Large Clusters
  3. Bigtable: A Distributed Storage System for Structured Data
  4. The Hadoop Distributed File System
  5. Dynamo: Amazon's highly available key-value store
  6. Cassandra: a decentralized structured storage system
  7. Adapting Microsoft SQL Server for Cloud Computing
  8. Megastore: Providing Scalable, Highly Available Storage for Interactive Services
  9. Spanner: Google’s Globally-Distributed Database
  10. Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing
  11. Mesos: a platform for fine-grained resource sharing in the data center
  12. The Chubby lock service for loosely-coupled distributed systems
  13. ZooKeeper: Wait-free coordination for Internet-scale systems
  14. Spark: Cluster Computing with Working Sets
  15. Pregel: A System for Large-Scale Graph Processing
  16. Storm @Twitter
  17. Discretized Streams: Fault-Tolerant Streaming Computation at Scale
  18. Dremel: Interactive Analysis of Web-Scale Datasets
  19. Dryad: distributed data-parallel programs from sequential building blocks
  20. Druid: a real-time analytical data store
  21. Pig Latin: A Not-So-Foreign Language for Data Processing
  22. Hive – A Petabyte Scale Data Warehouse Using Hadoop
  23. Kafka: a Distributed Messaging System for Log Processing
  24. Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications
  25. Large-scale cluster management at Google with Borg
  26. Fast crash recovery in RAMCloud
  27. Gorilla: A Fast, Scalable, In-Memory Time Series Database
  28. The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
  29. GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
  30. Low Latency Analytics of Geo-distributed Data in the Wide Area