# Table of Contents

### 1 Data Mining and Analysis 1

1.1 Data Matrix 1
1.2 Attributes 3
1.3 Data: Algebraic and Geometric View 4
1.4 Data: Probabilistic View 14
1.5 Data Mining 25
1.7 Exercises 30

## PART I. DATA ANALYSIS FOUNDATIONS

### 2 Numeric Attributes 33

2.1 Univariate Analysis 33
2.2 Bivariate Analysis 42
2.3 Multivariate Analysis 48
2.4 Data Normalization 52
2.5 Normal Distribution 54
2.7 Exercises 60

### 3 Categorical Attributes 63

3.1 Univariate Analysis 63
3.2 Bivariate Analysis 72
3.3 Multivariate Analysis 82
3.4 Distance and Angle 87
3.5 Discretization 89
3.7 Exercises 91

### 4 Graph Data 93

4.1 Graph Concepts 93
4.2 Topological Attributes 97
4.3 Centrality Analysis 102
4.4 Graph Models 112
4.6 Exercises 132

### 5 Kernel Methods 134

5.1 Kernel Matrix 138
5.2 Vector Kernels 144
5.3 Basic Kernel Operations in Feature Space 148
5.4 Kernels for Complex Objects 154
5.6 Exercises 161

### 6 High-dimensional Data 163

6.1 High-dimensional Objects 163
6.2 High-dimensional Volumes 165
6.3 Hypersphere Inscribed within Hypercube 168
6.4 Volume of Thin Hypersphere Shell 169
6.5 Diagonals in Hyperspace 171
6.6 Density of the Multivariate Normal 172
6.7 Appendix: Derivation of Hypersphere Volume 175
6.9 Exercises 180

### 7 Dimensionality Reduction 183

7.1 Background 183
7.2 Principal Component Analysis 187
7.3 Kernel Principal Component Analysis 202
7.4 Singular Value Decomposition 208
7.6 Exercises 214

## PART II. FREQUENT PATTERN MINING

### 8 Itemset Mining 217

8.1 Frequent Itemsets and Association Rules 217
8.2 Itemset Mining Algorithms 221
8.3 Generating Association Rules 234
8.5 Exercises 237

### 9 Summarizing Itemsets 242

9.1 Maximal and Closed Frequent Itemsets 242
9.2 Mining Maximal Frequent Itemsets: GenMax Algorithm 245
9.3 Mining Closed Frequent Itemsets: Charm Algorithm 248
9.4 Nonderivable Itemsets 250
9.6 Exercises 256

### 10 Sequence Mining 259

10.1 Frequent Sequences 259
10.2 Mining Frequent Sequences 260
10.3 Substring Mining via Suffix Trees 267
10.5 Exercises 277

### 11 Graph Pattern Mining 280

11.1 Isomorphism and Support 280
11.2 Candidate Generation 284
11.3 The gSpan Algorithm 288
11.5 Exercises 297

### 12 Pattern and Rule Assessment 301

12.1 Rule and Pattern Assessment Measures 301
12.2 Significance Testing and Confidence Intervals 316
12.4 Exercises 328

## PART III. CLUSTERING

### 13 Representative-based Clustering 333

13.1 K-means Algorithm 333
13.2 Kernel K-means 338
13.3 Expectation-Maximization Clustering 342
13.5 Exercises 361

### 14 Hierarchical Clustering 364

14.1 Preliminaries 364
14.2 Agglomerative Hierarchical Clustering 366
14.4 Exercises and Projects 373

### 15 Density-based Clustering 375

15.1 The DBSCAN Algorithm 375
15.2 Kernel Density Estimation 379
15.3 Density-based Clustering: DENCLUE 385
15.5 Exercises 391

### 16 Spectral and Graph Clustering 394

16.1 Graphs and Matrices 394
16.2 Clustering as Graph Cuts 401
16.3 Markov Clustering 416
16.5 Exercises 423

### 17 Clustering Validation 425

17.1 External Measures 425
17.2 Internal Measures 440
17.3 Relative Measures 448
17.5 Exercises 462

## PART IV. CLASSIFICATION

### 18 Probabilistic Classification 467

18.1 Bayes Classifier 467
18.2 Naive Bayes Classifier 473
18.3 K Nearest Neighbors Classifier 477
18.5 Exercises 479

### 19 Decision Tree Classifier 481

19.1 Decision Trees 483
19.2 Decision Tree Algorithm 485
19.4 Exercises 496

### 20 Linear Discriminant Analysis 498

20.1 Optimal Linear Discriminant 498
20.2 Kernel Discriminant Analysis 505
20.4 Exercises 512

### 21 Support Vector Machines 514

21.1 Support Vectors and Margins 514
21.2 SVM: Linear and Separable Case 520
21.3 Soft Margin SVM: Linear and Nonseparable Case 524
21.4 Kernel SVM: Nonlinear Case 530
21.5 SVM Training Algorithms 534
21.7 Exercises 546

### 22 Classification Assessment 548

22.1 Classification Performance Measures 548
22.2 Classifier Evaluation 562
22.3 Bias-Variance Decomposition 572