Instructor: Mohammed J. Zaki
Office: Lally IT Bldg 307
Phone: x6340
Email: zaki.AT.cs.rpi.edu
Class Hours: Carnegie 112, TF 10:00-11:20am
Office Hours: TF 11:20-12:00pm, or by appointment
Web Page: http://www.cs.rpi.edu/~zaki/cs6460/
Introduction
Because data mining is still in its infancy, it is not clear how or
where to start in building a Data Mining System that can handle
terabyte-sized (or larger) central or distributed datasets.
Part of the problem stems from the fact that Data Mining draws input
from diverse areas that have been traditionally studied in
isolation. Typically, the mining process is supported by a
hierarchical architecture consisting of the following layers (from
bottom to top): I/O support, file system, database system, query
manager, and data mining.
In this course we will design and implement a Large Scale Data Mining Server, building on top of existing database support (e.g. MySQL, PostgreSQL, Predator, Shore). The system must support not only the mining algorithms but also the entire mining process. We will look at each layer of a database system and customize it to our domain. These layers include data storage, data representation, index structures, multidimensional indexing, and mining query execution. We will also cover advanced topics such as parallel and distributed databases, OLAP, and data cubes.
Prerequisites include CSCI-4380 Database Systems and good implementation skills. Knowledge of data mining will be helpful, but is not assumed. The course format will consist of paper reading and an intensive (group) implementation project. This is a hands-on implementation course, mainly for graduate students, but motivated undergraduates are also encouraged to enroll. Note: if you are not familiar with C++, do not enroll in this course.
Text
We will be using the following book as a recommended text:
Database System Implementation, by H. Garcia-Molina,
J.D. Ullman and J. Widom, Prentice Hall, 2000 (ISBN: 0-13-040264-8).
Class Format and Requirements
The class will be a mix of lectures and student presentations of
papers. Generally, I'll introduce a topic, and then we'll read a
recent paper on that topic.
For the implementation project, students will be assigned to groups.
The project is expected to culminate in a presentation
during the last week of class, and also a report on the experimental results
obtained. The project will be broken into several parts to ensure
timely completion.
There will be two comprehensive exams, covering all the material up to that point.
The final grade will be determined as follows:
10% paper presentation/reading
10% class attendance & HW
15% Exam I
15% Exam II
50% project
Timeline for Topics
See the course website for the timeline. The topics may change as
the semester progresses.
Academic Integrity
The school takes cases of academic dishonesty very seriously; anyone
caught cheating will receive an automatic "F" grade for the course.
Students should familiarize themselves with the relevant portion of the
Rensselaer Handbook (http://www.rpi.edu/dept/doso/judicial) on this topic.