You may download the Book Draft (PDF). Readers may download one copy for personal online use only. The book is available for purchase from Cambridge University Press, Amazon.com, and other standard distribution channels. Unauthorized distribution is not allowed.

This book is an outgrowth of data mining courses at RPI and UFMG; the RPI course has been offered every Fall since 1998, and the UFMG course since 2002. While there are several good books on data mining and related topics, we felt that many of them are either too high-level or too advanced. Our goal was to write an introductory text that focuses on the fundamental algorithms in data mining and analysis. It lays the mathematical foundations for the core data mining methods, with key concepts explained when first encountered; the book also tries to build the intuition behind the formulas to aid understanding.

The main parts of the book cover exploratory data analysis, frequent pattern mining, clustering, and classification. The book lays the basic foundations of these tasks, and it also covers cutting-edge topics like kernel methods, high-dimensional data analysis, and complex graphs and networks. It integrates concepts from related disciplines like machine learning and statistics, and is also ideal for a course on data analysis. Most of the prerequisite material is covered in the text, especially linear algebra, probability, and statistics.

The book includes many examples to illustrate the main technical concepts. It also has end-of-chapter exercises, which have been used in class. All of the algorithms in the book have been implemented by the authors. We suggest that readers use their favorite data analysis and mining software to work through our examples, and to implement the algorithms we describe in the text; we recommend the R software, or the Python language with its NumPy package. The datasets used and other supplementary material, like project ideas, slides, and so on, are available online at the book's companion site and its mirrors at RPI and UFMG:

- http://dataminingbook.info
- http://www.cs.rpi.edu/~zaki/dataminingbook
- http://www.dcc.ufmg.br/dataminingbook
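In the spirit of the preface's recommendation to work through the examples in Python with NumPy, here is a minimal sketch of one of the earliest concepts in the book, the mean vector and total variance of a data matrix (Section 1.3.2). The toy data matrix below is hypothetical, not taken from the book:

```python
import numpy as np

# Hypothetical toy data matrix D: n=4 points, d=2 attributes.
D = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0],
              [4.0, 8.0]])

# Mean vector: column-wise average over the n points.
mu = D.mean(axis=0)

# Total variance: average squared distance of the points to the mean,
# which equals the sum of the per-attribute variances.
total_var = ((D - mu) ** 2).sum(axis=1).mean()

print(mu)         # mean vector, here [2.5, 5.0]
print(total_var)  # total variance, here 6.25
```

Working through small computations like this by hand and then in code is one way to build the intuition behind the formulas that the book emphasizes.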

Having understood the basic principles and algorithms in data mining and data analysis, readers will be well equipped to develop their own methods or to apply more advanced techniques.

1.1 Data Matrix
1.2 Attributes
1.3 Data: Algebraic and Geometric View
1.3.1 Distance and Angle
1.3.2 Mean and Total Variance
1.3.3 Orthogonal Projection
1.3.4 Linear Independence and Dimensionality
1.4 Data: Probabilistic View
1.4.1 Bivariate Random Variables
1.4.2 Multivariate Random Variable
1.4.3 Random Sample and Statistics
1.5 Data Mining
1.5.1 Exploratory Data Analysis
1.5.2 Frequent Pattern Mining
1.5.3 Clustering
1.5.4 Classification
1.6 Further Reading
1.7 Exercises

2.1 Univariate Analysis
2.1.1 Measures of Central Tendency
2.1.2 Measures of Dispersion
2.2 Bivariate Analysis
2.2.1 Measures of Location and Dispersion
2.2.2 Measures of Association
2.3 Multivariate Analysis
2.4 Data Normalization
2.5 Normal Distribution
2.5.1 Univariate Normal Distribution
2.5.2 Multivariate Normal Distribution
2.6 Further Reading
2.7 Exercises

3.1 Univariate Analysis
3.1.1 Bernoulli Variable
3.1.2 Multivariate Bernoulli Variable
3.2 Bivariate Analysis
3.2.1 Attribute Dependence: Contingency Analysis
3.3 Multivariate Analysis
3.3.1 Multi-way Contingency Analysis
3.4 Distance and Angle
3.5 Discretization
3.6 Further Reading
3.7 Exercises

4.1 Graph Concepts
4.2 Topological Attributes
4.3 Centrality Analysis
4.3.1 Basic Centralities
4.3.2 Web Centralities
4.4 Graph Models
4.4.1 Erdos-Renyi Random Graph Model
4.4.2 Watts-Strogatz Small-world Graph Model
4.4.3 Barabasi-Albert Scale-free Model
4.5 Further Reading
4.6 Exercises

5.1 Kernel Matrix
5.1.1 Reproducing Kernel Map
5.1.2 Mercer Kernel Map
5.2 Vector Kernels
5.3 Basic Kernel Operations in Feature Space
5.4 Kernels for Complex Objects
5.4.1 Spectrum Kernel for Strings
5.4.2 Diffusion Kernels on Graph Nodes
5.5 Further Reading
5.6 Exercises

6.1 High-Dimensional Objects
6.2 High-Dimensional Volumes
6.3 Hypersphere Inscribed within Hypercube
6.4 Volume of Thin Hypersphere Shell
6.5 Diagonals in Hyperspace
6.6 Density of the Multivariate Normal
6.7 Appendix: Derivation of Hypersphere Volume
6.8 Further Reading
6.9 Exercises

7.1 Background
7.2 Principal Component Analysis
7.2.1 Best Line Approximation
7.2.2 Best Two-dimensional Approximation
7.2.3 Best r-dimensional Approximation
7.2.4 Geometry of PCA
7.3 Kernel Principal Component Analysis (Kernel PCA)
7.4 Singular Value Decomposition
7.4.1 Geometry of SVD
7.4.2 Connection between SVD and PCA
7.5 Further Reading
7.6 Exercises

8.1 Frequent Itemsets and Association Rules
8.2 Itemset Mining Algorithms
8.2.1 Level-Wise Approach: Apriori Algorithm
8.2.2 Tidset Intersection Approach: Eclat Algorithm
8.2.3 Frequent Pattern Tree Approach: FPGrowth Algorithm
8.3 Generating Association Rules
8.4 Further Reading
8.5 Exercises

9.1 Maximal and Closed Frequent Itemsets
9.2 Mining Maximal Frequent Itemsets: GenMax Algorithm
9.3 Mining Closed Frequent Itemsets: Charm Algorithm
9.4 Non-Derivable Itemsets
9.5 Further Reading
9.6 Exercises

10.1 Frequent Sequences
10.2 Mining Frequent Sequences
10.2.1 Level-Wise Mining: GSP
10.2.2 Vertical Sequence Mining: SPADE
10.2.3 Projection-Based Sequence Mining: PrefixSpan
10.3 Substring Mining via Suffix Trees
10.3.1 Suffix Tree
10.3.2 Ukkonen's Linear Time Algorithm
10.4 Further Reading
10.5 Exercises

11.1 Isomorphism and Support
11.2 Candidate Generation
11.2.1 Canonical Code
11.3 The gSpan Algorithm
11.3.1 Extension and Support Computation
11.3.2 Canonicality Checking
11.4 Further Reading
11.5 Exercises

12.1 Rule and Pattern Assessment Measures
12.1.1 Rule Assessment Measures
12.1.2 Pattern Assessment Measures
12.1.3 Comparing Multiple Rules and Patterns
12.2 Significance Testing and Confidence Intervals
12.2.1 Fisher Exact Test for Productive Rules
12.2.2 Permutation Test for Significance
12.2.3 Bootstrap Sampling for Confidence Interval
12.3 Further Reading
12.4 Exercises

13.1 K-means Algorithm
13.2 Kernel K-means
13.3 Expectation Maximization (EM) Clustering
13.3.1 EM in One Dimension
13.3.2 EM in d-Dimensions
13.3.3 Maximum Likelihood Estimation
13.3.4 Expectation-Maximization Approach
13.4 Further Reading
13.5 Exercises

14.1 Preliminaries
14.2 Agglomerative Hierarchical Clustering
14.2.1 Distance between Clusters
14.2.2 Updating Distance Matrix
14.2.3 Computational Complexity
14.3 Further Reading
14.4 Exercises and Projects

15.1 The DBSCAN Algorithm
15.2 Kernel Density Estimation
15.2.1 Univariate Density Estimation
15.2.2 Multivariate Density Estimation
15.2.3 Nearest Neighbor Density Estimation
15.3 Density-based Clustering: DENCLUE
15.4 Further Reading
15.5 Exercises

16.1 Graphs and Matrices
16.2 Clustering as Graph Cuts
16.2.1 Clustering Objective Functions: Ratio and Normalized Cut
16.2.2 Spectral Clustering Algorithm
16.2.3 Maximization Objectives: Average Cut and Modularity
16.3 Markov Clustering
16.4 Further Reading
16.5 Exercises

17.1 External Measures
17.1.1 Matching Based Measures
17.1.2 Entropy Based Measures
17.1.3 Pair-wise Measures
17.1.4 Correlation Measures
17.2 Internal Measures
17.3 Relative Measures
17.3.1 Cluster Stability
17.3.2 Clustering Tendency
17.4 Further Reading
17.5 Exercises

18.1 Bayes Classifier
18.1.1 Estimating the Prior Probability
18.1.2 Estimating the Likelihood
18.2 Naive Bayes Classifier
18.3 Further Reading
18.4 Exercises

19.1 Decision Trees
19.2 Decision Tree Algorithm
19.2.1 Split-point Evaluation Measures
19.2.2 Evaluating Split-points
19.2.3 Computational Complexity
19.3 Further Reading
19.4 Exercises

20.1 Optimal Linear Discriminant
20.2 Kernel Discriminant Analysis
20.3 Further Reading
20.4 Exercises

21.1 Linear Discriminants and Margins
21.2 SVM: Linear and Separable Case
21.3 Soft Margin SVM: Linear and Non-Separable Case
21.3.1 Hinge Loss
21.3.2 Quadratic Loss
21.4 Kernel SVM: Nonlinear Case
21.5 SVM Training Algorithms
21.5.1 Dual Solution: Stochastic Gradient Ascent
21.5.2 Primal Solution: Newton Optimization

22.1 Classification Performance Measures
22.1.1 Contingency Table Based Measures
22.1.2 Binary Classification: Positive and Negative Class
22.1.3 ROC Analysis
22.2 Classifier Evaluation
22.2.1 K-fold Cross-Validation
22.2.2 Bootstrap Resampling
22.2.3 Confidence Intervals
22.2.4 Comparing Classifiers: Paired t-Test
22.3 Bias-Variance Decomposition
22.3.1 Ensemble Classifiers
22.4 Further Reading
22.5 Exercises

Retrieved from http://www.cs.rpi.edu/~zaki/www-new/pmwiki.php/Books/DataMining

Page last modified on May 27, 2014, at 02:10 PM