CSCI-4390/6390: Data Mining, Fall 2009


Class: 10-11:50AM, MR, Low 3045
Instructor Office Hours: 12-1PM, MR



Announcements

  • Nov 17: Assignment 6 posted.
  • Nov 12: Exam II solutions have been posted.
  • Nov 4: Solutions to Assignment 5 posted on the assignment page.
  • Oct 31: Solutions to Assignment 4 posted on the assignment page.
  • Oct 24: Assignment 5 posted.
  • Oct 13: Exam I solutions have been posted.
  • Oct 10: Assignment 4 has been posted.
  • Oct 4: Solutions for Assignment 3 posted.
  • Sep 27: Solutions for Assignments 1 and 2 have been posted on the respective pages.
  • Sep 26: Assignment 3 is now available.
  • Sep 18: Assignment 2 is now available.
  • Sep 12: I have posted the notes below. They are time-stamped so that, if I update them, you can check whether your copy is the latest.
  • Sep 8: Assignment 1 has been posted. See the general R/pmwiki instructions at Assignments and the specific assignment at Assign1.
  • Sep 2: Passwords for the assignment submission wiki were sent out yesterday. Contact me if you did not get the email.
  • Aug 30: Slight update of the syllabus.
  • Aug 19: Course website is up, with the tentative calendar and syllabus.



Calendar & Lecture Notes/Videos

A tentative sequence of topics to be covered in the classes; changes are likely as the course progresses.

Day | Date | Topic | Chapters / Lecture Notes / Video
M | Aug 31 | Data Mining Overview | PDF
R | Sep 3 | Exploratory Data Analysis (EDA): Numeric Attributes | PDF, PDF, Video
M | Sep 7 | Labor Day Holiday |
R | Sep 10 | EDA: Numeric & Categorical Attributes | PDF, PDF, Video
M | Sep 14 | Frequent Pattern Mining (FPM): Itemset Mining | PDF, PDF, Video
R | Sep 17 | Clustering (CLUS): Partitional (KMeans, EM) | PDF, PDF, Video
M | Sep 21 | Classification (CLASS): Decision Trees | PDF, PDF, Video
R | Sep 24 | EDA: High Dimensional Data | PDF, PDF, Video
M | Sep 28 | EDA: Dimensionality Reduction: PCA | PDF, PDF, Video
R | Oct 1 | EDA: Dimensionality Reduction: PCA/SVD | PDF, PDF, Video
M | Oct 5 | EXAM I |
R | Oct 8 | EDA: Linear Discriminant Analysis: LDA | PDF, PDF, Video
Tue | Oct 13 | FPM: Itemset Summaries | PDF, PDF, Video
R | Oct 15 | FPM: Sequence Mining | PDF, PDF, Video
M | Oct 19 | FPM: Sequence Mining, CLASS: Probabilistic | PDF, PDF, Video
R | Oct 22 | CLASS: Support Vector Machines (SVM) | PDF, PDF, Video
M | Oct 26 | CLASS: SVM contd. | PDF, PDF, Video
R | Oct 29 | CLASS: Kernel SVM, Rule-based | PDF, PDF, Video
M | Nov 2 | CLASS: Classifier Evaluation | PDF, Video
R | Nov 5 | EXAM II |
M | Nov 9 | CLUS: Hierarchical/Density-based Clustering | PDF, Video
R | Nov 12 | CLUS: Density-based Clustering (Kernel Density Estimation) | PDF, PDF, Video
M | Nov 16 | CLUS: Subspace Clustering | PDF, PDF, Video
R | Nov 19 | CLUS: Spectral Clustering | PDF, Video
M | Nov 23 | CLUS: Cluster Validity |
R | Nov 26 | Thanksgiving Break |
M | Nov 30 | CLASS: Kernel PCA/LDA |
R | Dec 3 | EXAM III |
M | Dec 7 | Social Network Analysis (SNA) |
R | Dec 10 | SNA: Graph Mining |




Syllabus

Introduction

Data mining is the process of automatic discovery of patterns, models, changes, associations and anomalies in massive databases. This course will provide an introduction to the main topics in data mining and knowledge discovery, including statistical foundations, pattern mining, classification, and clustering. The emphasis will be on the algorithmic foundations.

Learning Objectives

After taking this course, students will be:

  • knowledgeable about the fundamental data mining tasks, such as pattern mining, classification, and clustering
  • able to understand the key algorithms for the main tasks
  • able to implement and apply the techniques to real-world datasets

Prerequisites

The prerequisites for this course are data structures and algorithms, and discrete mathematics. A basic background in linear algebra and in probability and statistics will also be very useful. Assignments will require the use of the R software, which students are expected to learn on their own. Assignments must be submitted online at the wiki site; learning pmwiki markup is your responsibility.

Textbook

There is no required text for the course. Notes will be handed out in class.

The following textbooks are good references:

  • Introduction to Data Mining, by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Addison Wesley, 2006.
  • Data Mining: Concepts and Techniques (2nd edition), by Jiawei Han and Micheline Kamber, Morgan Kaufmann, 2006.

Grading Policy

Your grade will be a combination of the following items. Note that the final distribution may change somewhat depending on the number of assignments, but exams will count for at least 60%.

  • Assignments (40%): The assignments are meant to be practically oriented. You'll be asked to run mining methods on real datasets, or to implement algorithms, to complement the theory. There will be roughly one assignment per week, to be submitted via the course wiki site. User accounts will be created after the first day of class.
  • Exams (60%): There will be three exams covering the main topics of the course. The tentative exam schedule is given in the class schedule table. There is no comprehensive final exam.

Attendance: Students are strongly encouraged to participate in class and should try to attend every session.

Academic Integrity

You may consult other members of the class on the homework, but you must submit your own work. Any time you borrow material from the web or elsewhere, you must acknowledge the source.

The school takes academic dishonesty very seriously; any case will result in an automatic "F" grade for the course. Students should familiarize themselves with the relevant portion of the Rensselaer Handbook of Student Rights and Responsibilities on this topic.