CSCI-4390/6390: Data Mining, Fall 2012


Class Time: MR 10-11:50AM
Room: Greene 120
Instructor Office Hours: MR 12-1PM, Lally 307


TA: Nilothpal Talukder
TA Office Hours: W 2-4PM, Amos Eaton 119
TA Contact:




Announcements

  • Nov 8: Assign5 has been posted. It is due on 16th Nov, before midnight.
  • Oct 23: Assign4 has been posted. It is due on 30th Oct, before midnight.
  • Oct 12: Assign3 has been posted. It is due on 19th Oct, before midnight.
  • Sep 24: Assign2 has been posted. It is due on 1st Oct, before midnight.
  • Sep 14: Assign1 has been posted. It is due on 21st Sep, before midnight.
  • Sep 7: Everyone enrolled in the course should have already signed up for a Piazza account (or should have received an email invitation to do so). Please sign up immediately to receive class announcements and emails.
  • Aug 6: Course website is up, with the syllabus and tentative calendar.
  • Aug 6: Students in the class must sign up for the Piazza course discussion site. All discussions and Q&A will be carried out using Piazza.



Calendar & Lecture Notes

A tentative sequence of topics to be covered in the classes; changes are likely as the course progresses.

Day: Date Topic Readings Lectures
M: Aug 27 NO CLASS
R: Aug 30 NO CLASS
M: Sep 3 Labor Day Holiday
R: Sep 6 Data Mining and Analysis (DA): Algebraic and Probabilistic Views Attach:chap1.pdf Attach:dmintro.pptx, Attach:Lecture1.PDF
M: Sep 10 DA: Numeric Attributes Attach:chap2.pdf Attach:Lecture2.PDF
R: Sep 13 DA: Numeric Attributes: Eigen-decomposition Attach:Lecture3.PDF
M: Sep 17 DA: Dimensionality Reduction Attach:chap7.pdf Attach:Lecture4.PDF
R: Sep 20 DA: High Dimensional Analysis Attach:chap6.pdf Attach:Lecture5.PDF
M: Sep 24 DA: Categorical Data & Attach:chap3.pdf Attach:Lecture6.PDF
R: Sep 27 DA: Kernel Methods Attach:chap5.pdf Attach:Lecture7.PDF
M: Oct 1 DA: Kernels Attach:Lecture8.PDF
R: Oct 4 EXAM I
Tue: Oct 9 Classification (CLASS): Linear Discriminants, SVMs Attach:chap22.pdf Attach:Lecture9.PDF
R: Oct 11 CLASS: SVMs Attach:chap23.pdf Attach:Lecture10.PDF
M: Oct 15 CLASS: Bayesian Classifier, Decision Trees Attach:chap21.pdf, Attach:chap19.pdf Attach:Lecture11.PDF
R: Oct 18 CLASS: Classifier Evaluation Attach:chap24.pdf Attach:Lecture12.PDF
M: Oct 22 CLASS: Classifier Evaluation Attach:Lecture13.PDF
R: Oct 25 Clustering (CLUS): Partitional Attach:chap13.pdf Attach:Lecture14.PDF
M: Oct 29 NO CLASS
R: Nov 1 EXAM II
M: Nov 5 CLUS: EM-based Attach:Lecture15.PDF
R: Nov 8 CLUS: Hierarchical, Density-based Clustering Attach:chap14.pdf, Attach:chap15.pdf Attach:Lecture16.PDF
M: Nov 12 CLUS: Spectral & Graph Clustering Attach:chap17.pdf Attach:Lecture17.PDF
R: Nov 15 CLUS: Spectral & Graph Clustering Attach:Lecture18.PDF
M: Nov 19 CLUS: Evaluation & Assessment Attach:chap18.pdf Attach:Lecture19.PDF
R: Nov 22 Thanksgiving Break
M: Nov 26 Frequent Pattern Mining (FPM): Itemset Mining Attach:chap8.pdf, Attach:chap9.pdf Attach:Lecture20.PDF
R: Nov 29 FPM: Sequence Mining Attach:chap10.pdf Attach:Lecture21.PDF
M: Dec 3 FPM: Graph Mining Attach:chap11.pdf Attach:Lecture22.PDF
R: Dec 6 EXAM III




Syllabus

Introduction

Data mining is the process of automatic discovery of patterns, models, changes, associations and anomalies in massive databases. This course will provide an introduction to the main topics in data mining and knowledge discovery, including: algebraic and statistical foundations, pattern mining, classification, and clustering. Emphasis will be laid on the algorithmic approach.

Learning Objectives

After taking this course students will be

  • able to describe the fundamental data mining tasks like pattern mining, classification and clustering
  • able to analyze the key algorithms for the main tasks
  • able to implement and apply the techniques to real world datasets

Prerequisites

The prerequisites for this course include data structures and algorithms, and discrete mathematics. Linear algebra and probability & statistics are also essentially prerequisites, though an attempt will be made to review the basic concepts. Assignments will require the use of the Python language, with the NumPy package for numeric computations. You are expected to learn Python on your own via web tutorials, etc. Assignments must be submitted via email to .

Textbook

Students will be given draft chapters from the forthcoming book

  • Data Mining and Analysis: Foundations and Algorithms, Mohammed J. Zaki and Wagner Meira, Jr, Cambridge University Press, 2013.

The following textbooks are also good references:

  • Introduction to Data Mining, by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Addison Wesley, 2006.
  • Data Mining: Concepts and Techniques (2nd edition), by Jiawei Han and Micheline Kamber, Morgan Kaufmann, 2006.

Grading Policy

Your grade will be a combination of the following items.

  • Assignments (40%): The assignments are meant to be practically oriented. You'll be asked to implement some algorithms and apply them to real datasets, to complement the theory. There will be roughly one assignment every two weeks.
  • Exams (60%): There will be three exams covering the main topics of the course. The tentative exam schedule is posted on the class schedule table. There is no comprehensive final exam. All exams are open book.
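
As a rough illustration of the 40/60 split above, the following Python sketch combines percentage scores into a course grade. The function name `course_grade` is hypothetical, and the sketch assumes that all assignments count equally and all exams count equally, which the syllabus does not state; the instructor's actual computation may differ.

```python
def course_grade(assignment_scores, exam_scores):
    """Combine percentage scores (0-100) using the 40% / 60% split.

    Assumption (not stated in the syllabus): items within each
    category are weighted equally.
    """
    assignment_avg = sum(assignment_scores) / len(assignment_scores)
    exam_avg = sum(exam_scores) / len(exam_scores)
    return 0.40 * assignment_avg + 0.60 * exam_avg


# Hypothetical scores for five assignments and three exams.
print(course_grade([90, 85, 100, 70, 95], [80, 75, 88]))
```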

Other Policies

  • Attendance: Students are strongly encouraged to participate in the class, and should try to attend all classes. Students are responsible for any topics and assignments for the missed classes.
  • Laptops: Absolutely no laptops will be allowed in class during lectures. The only exception is during exams, to access the class notes online and to use the calculator functions. Even during the exam, you may not use any other software (e.g., R, Python, MATLAB, etc.) for the computations.
  • Late Assignments: Most assignments will be due just before midnight on the due date. Students get an automatic one-day extension with a 20% penalty. No late assignments will be accepted after the midnight following the due date.
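
The late-assignment rule can be sketched in Python as follows. This is only an illustration: the function name `late_adjusted_score` is hypothetical, and the sketch assumes the 20% penalty is deducted from the score earned (rather than from the maximum possible score), which the policy does not specify.

```python
from datetime import datetime, timedelta


def late_adjusted_score(raw_score, due, submitted):
    """Apply the late policy; return None if the work is not accepted.

    Assumption: the 20% penalty is taken off the earned score.
    """
    if submitted <= due:
        return raw_score            # on time: no penalty
    if submitted <= due + timedelta(days=1):
        return raw_score * 0.8      # within the one-day extension
    return None                     # past the extension: not accepted


# Example with a hypothetical deadline just before midnight on Sep 21.
due = datetime(2012, 9, 21, 23, 59, 59)
print(late_adjusted_score(90, due, datetime(2012, 9, 22, 10, 0)))
```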

Academic Integrity

You may consult other members of the class on the assignments, but you must submit your own work. For instance, you may discuss general approaches to solving a problem, but you must implement the solution on your own (similarity detection software may be used). Any time you borrow material from the web or elsewhere, you must acknowledge the source.

The school takes cases of academic dishonesty very seriously; such cases result in an automatic "F" grade for the course. Students should familiarize themselves with the relevant portion of the Rensselaer Handbook of Student Rights and Responsibilities on this topic.