Data Mining

Book has been published!

Data Mining and Analysis: Fundamental Concepts and Algorithms

Mohammed J. Zaki and Wagner Meira, Jr.

Cambridge University Press, May 2014



You may download the Book Draft (PDF). The reader may take one copy only for personal online use. The book is now available for purchase from Cambridge University Press, and other standard distribution channels. No unauthorized distribution shall be allowed.


This book is an outgrowth of data mining courses at RPI and UFMG; the RPI course has been offered every Fall since 1998, whereas the UFMG course has been offered since 2002. While there are several good books on data mining and related topics, we felt that many of them are either too high-level or too advanced. Our goal was to write an introductory text which focuses on the fundamental algorithms in data mining and analysis. It lays the mathematical foundations for the core data mining methods, with key concepts explained when first encountered; the book also tries to build the intuition behind the formulas to aid understanding.

The main parts of the book include exploratory data analysis, frequent pattern mining, clustering and classification. The book lays the basic foundations of these tasks, and it also covers cutting edge topics like kernel methods, high dimensional data analysis, and complex graphs and networks. It integrates concepts from related disciplines like machine learning and statistics, and is also ideal for a course on data analysis. Most of the prerequisite material is covered in the text, especially on linear algebra, and probability and statistics.

The book includes many examples to illustrate the main technical concepts. It also has end of chapter exercises, which have been used in class. All of the algorithms in the book have been implemented by the authors. We suggest that the reader use their favorite data analysis and mining software to work through our examples, and to implement the algorithms we describe in text; we recommend the R software, or the Python language with its NumPy package. The datasets used and other supplementary material like project ideas, slides, and so on, are available online at the book's companion site and its mirrors at RPI and UFMG:

Having understood the basic principles and algorithms in data mining and data analysis, the readers will be well equipped to develop their own methods or use more advanced techniques.

Table of Contents

1 Data Mining and Analysis
    1.1     Data Matrix  
    1.2     Attributes  
    1.3     Data:  Algebraic and Geometric View  
            1.3.1       Distance and Angle 
            1.3.2       Mean and Total Variance    
            1.3.3       Orthogonal Projection   
            1.3.4       Linear Independence and Dimensionality  
    1.4     Data:  Probabilistic View  
            1.4.1       Bivariate Random Variables  
            1.4.2       Multivariate Random Variable    
            1.4.3       Random Sample and Statistics   
    1.5     Data Mining 
            1.5.1       Exploratory Data Analysis 
            1.5.2       Frequent Pattern Mining 
            1.5.3       Clustering  
            1.5.4       Classification   
    1.6     Further Reading    
    1.7     Exercises

Part I Data Analysis Foundations

2 Numeric Attributes
    2.1     Univariate Analysis  
            2.1.1       Measures of Central Tendency 
            2.1.2       Measures of Dispersion  
    2.2     Bivariate Analysis    
            2.2.1       Measures of Location and Dispersion  
            2.2.2       Measures of Association   
    2.3     Multivariate Analysis  
    2.4     Data Normalization 
    2.5     Normal Distribution   
            2.5.1       Univariate Normal Distribution  
            2.5.2       Multivariate Normal Distribution  
    2.6     Further Reading    
    2.7     Exercises 
3 Categorical Attributes
    3.1     Univariate Analysis  
            3.1.1       Bernoulli Variable    
            3.1.2       Multivariate Bernoulli Variable   
    3.2     Bivariate Analysis    
            3.2.1       Attribute Dependence:  Contingency Analysis   
    3.3     Multivariate Analysis  
            3.3.1       Multi-way Contingency Analysis    
    3.4     Distance and Angle  
    3.5     Discretization  
    3.6     Further Reading    
    3.7     Exercises 
4 Graph Data
    4.1     Graph Concepts 
    4.2     Topological Attributes   
    4.3     Centrality Analysis  
            4.3.1       Basic Centralities  
            4.3.2       Web Centralities   
    4.4     Graph Models  
            4.4.1       Erdos-Renyi Random Graph Model  
            4.4.2       Watts-Strogatz Small-world Graph Model   
            4.4.3       Barabasi-Albert Scale-free Model  
    4.5     Further Reading    
    4.6     Exercises 
5 Kernel Methods
    5.1     Kernel Matrix  
            5.1.1       Reproducing Kernel Map    
            5.1.2       Mercer Kernel Map 
    5.2     Vector Kernels    
    5.3     Basic Kernel Operations in Feature Space   
    5.4     Kernels for Complex Objects    
            5.4.1       Spectrum Kernel for Strings  
            5.4.2       Dioeusion Kernels on Graph Nodes   
    5.5     Further Reading    
    5.6     Exercises 
6 High-Dimensional Data
    6.1     High-Dimensional Objects  
    6.2     High-Dimensional Volumes 
    6.3     Hypersphere Inscribed within Hypercube  
    6.4     Volume of Thin Hypersphere Shell   
    6.5     Diagonals in Hyperspace  
    6.6     Density of the Multivariate Normal  
    6.7     Appendix:  Derivation of Hypersphere Volume  
    6.8     Further Reading    
    6.9     Exercises 
7 Dimensionality Reduction
    7.1     Background   
    7.2     Principal Component Analysis 
            7.2.1       Best Line Approximation    
            7.2.2       Best Two-dimensional Approximation    
            7.2.3       Best r-dimensional Approximation   
            7.2.4       Geometry of PCA    
    7.3     Kernel Principal Component Analysis (Kernel PCA)   
    7.4     Singular Value Decomposition  
            7.4.1       Geometry of SVD    
            7.4.2       Connection between SVD and PCA 
    7.5     Further Reading    
    7.6     Exercises 

Part II Frequent Pattern Mining

8 Itemset Mining
    8.1     Frequent Itemsets and Association Rules  
    8.2     Itemset Mining Algorithms    
            8.2.1       Level-Wise Approach:  Apriori Algorithm    
            8.2.2       Tidset Intersection Approach:  Eclat Algorithm   
            8.2.3       Frequent Pattern Tree Approach:  FPGrowth Algorithm   
    8.3     Generating Association Rules   
    8.4     Further Reading    
    8.5     Exercises
9 Summarizing Itemsets
    9.1     Maximal and Closed Frequent Itemsets  
    9.2     Mining Maximal Frequent Itemsets:  GenMax Algorithm   
    9.3     Mining Closed Frequent Itemsets:  Charm algorithm    
    9.4     Non-Derivable Itemsets 
    9.5     Further Reading    
    9.6     Exercises
10 Sequence Mining
    10.1    Frequent Sequences  
    10.2    Mining Frequent Sequences    
            10.2.1      Level-Wise Mining:  GSP 
            10.2.2      Vertical Sequence Mining:  SPADE   
            10.2.3      Projection-Based Sequence Mining:  PreoxSpan   
    10.3    Substring Mining via Suffix Trees  
            10.3.1      Suffix Tree    
            10.3.2      Ukkonen's Linear Time Algorithm   
    10.4    Further Reading    
    10.5    Exercises 
11 Graph Pattern Mining
    11.1    Isomorphism and Support   
    11.2    Candidate Generation   
            11.2.1      Canonical Code  
    11.3    The gSpan Algorithm    
            11.3.1      Extension and Support Computation  
            11.3.2      Canonicality Checking  
    11.4    Further Reading    
    11.5    Exercises 
12 Pattern and Rule Assessment
    12.1    Rule and Pattern Assessment Measures    
            12.1.1      Rule Assessment Measures 
            12.1.2      Pattern Assessment Measures  
            12.1.3      Comparing Multiple Rules and Patterns   
    12.2    Significance Testing and Confidence Intervals   
            12.2.1      Fisher Exact Test for Productive Rules 
            12.2.2      Permutation Test for Significance  
            12.2.3      Bootstrap Sampling for Confidence Interval   
    12.3    Further Reading    
    12.4    Exercises

Part III Clustering

13 Representative-based Clustering
    13.1    K-means Algorithm 
    13.2    Kernel K-means  
    13.3    Expectation Maximization (EM) Clustering   
            13.3.1      EM in One Dimension   
            13.3.2      EM in d-Dimensions   
            13.3.3      Maximum Likelihood Estimation   
            13.3.4      Expectation-Maximization Approach  
    13.4    Further Reading    
    13.5    Exercises 
14 Hierarchical Clustering
    14.1    Preliminaries   
    14.2    Agglomerative Hierarchical Clustering   
            14.2.1      Distance between Clusters  
            14.2.2      Updating Distance Matrix  
            14.2.3      Computational Complexity   
    14.3    Further Reading    
    14.4    Exercises and Projects
15 Density-based Clustering
    15.1    The DBSCAN Algorithm    
    15.2    Kernel Density Estimation  
            15.2.1      Univariate Density Estimation    
            15.2.2      Multivariate Density Estimation    
            15.2.3      Nearest Neighbor Density Estimation 
    15.3    Density-based Clustering:  DENCLUE    
    15.4    Further Reading    
    15.5    Exercises 
16 Spectral and Graph Clustering
    16.1    Graphs and Matrices  
    16.2    Clustering as Graph Cuts   
            16.2.1      Clustering Objective Functions:  Ratio and Normalized Cut  
            16.2.2      Spectral Clustering Algorithm  
            16.2.3      Maximization Objectives:  Average Cut and Modularity 
    16.3    Markov Clustering   
    16.4    Further Reading    
    16.5    Exercises
17 Clustering Validation
    17.1    External Measures   
            17.1.1      Matching Based Measures  
            17.1.2      Entropy Based Measures  
            17.1.3      Pair-wise Measures  
            17.1.4      Correlation Measures  
    17.2    Internal Measures  
    17.3    Relative Measures    
            17.3.1      Cluster Stability   
            17.3.2      Clustering Tendency   
    17.4    Further Reading    
    17.5    Exercises

Part IV Classification

18 Probabilistic Classification
    18.1    Bayes Classifier  
            18.1.1      Estimating the Prior Probability   
            18.1.2      Estimating the Likelihood  
    18.2    Naive Bayes Classifier    
    18.3    Further Reading    
    18.4    Exercises 
19 Decision Tree Classifier
    19.1    Decision Trees 
    19.2    Decision Tree Algorithm  
            19.2.1      Split-point Evaluation Measures    
            19.2.2      Evaluating Split-points  
            19.2.3      Computational Complexity   
    19.3    Further Reading    
    19.4    Exercises 
20 Linear Discriminant Analysis
    20.1    Optimal Linear Discriminant    
    20.2    Kernel Discriminant Analysis   
    20.3    Further Reading    
    20.4    Exercises 
21 Support Vector Machines
    21.1    Linear Discriminants and Margins    
    21.2    SVM: Linear and Separable Case   
    21.3    Soft Margin SVM: Linear and Non-Separable Case   
            21.3.1      Hinge Loss    
            21.3.2      Quadratic Loss   
    21.4    Kernel SVM: Nonlinear Case    
    21.5    SVM Training Algorithms   
            21.5.1      Dual Solution:  Stochastic Gradient Ascent 
            21.5.2      Primal Solution:  Newton Optimization  
22 Classification Assessment
    22.1    Classification Performance Measures   
            22.1.1      Contingency Table Based Measures  
            22.1.2      Binary Classification:  Positive and Negative Class    
            22.1.3      ROC Analysis 
    22.2    Classifier Evaluation   
            22.2.1      K -fold Cross-Validation   
            22.2.2      Bootstrap Resampling   
            22.2.3      Confidence Intervals   
            22.2.4      Comparing Classifiers:  Paired t-Test   
    22.3    Bias-Variance Decomposition   
            22.3.1      Ensemble Classifiers   
    22.4    Further Reading    
    22.5    Exercises