Online Course: Machine Learning From Data, Magdon-Ismail

Machine Learning From Data, Rensselaer
Online Course

A course on the foundations of machine learning suitable for an advanced undergraduate or beginning graduate student interested in the theory and applications of learning. Familiarity with calculus, linear algebra and probability is helpful.

Instructor: Malik Magdon-Ismail
Text Book: Learning From Data by Abu-Mostafa, Magdon-Ismail, Lin (plus e-Chapters)
Part Ⅰ: Foundations. (lectures 1-15) A companion to the textbook.
Part Ⅱ: Techniques. (lectures 16-27) e-Chapters 6-9 (book-forum, www.amlbook.com.)

The slides and lectures are available as is with no explicit or implied warranties.
The copyright for all material remains with the original copyright holder (in almost all cases the authors of the "Learning From Data" book).

I: Foundations

01. The Learning Problem
02. Linear Model: Perceptron
03. Is Learning Feasible
04. Real Learning is Feasible
05. An Effective Model Size
06. Bounding Model Size
07. VC-Dimension
08. Classification, Regression
09. Logistic Regression
10. Nonlinear Transforms
11. Overfitting
12. Regularization
13. Validation
14. 3 Learning Principles
15. Epilogue to Part I
II: Techniques

16. Nearest Neighbor
17. Efficiency of k-NN
18. Radial Basis Functions
19. Unsup. Learning: A Peek
20. Deep Networks
21. Backpropagation
22. Deep Networks Overfit
23. Support Vector Machine
24. Max. Margin Regularizes
25. The Kernel Trick
26. Choosing a Kernel
27. Learning Aides & PCA
28. Bonus: Blending and RL

Ⅰ. Foundations

1: The Learning Problem.
Class motto (sing it loud):

A pattern exists.
We don't know it.
We have data to learn it.
Modules:
   (04min) Storyline: What is learning? Can we do it? How?
   (13min) Applications, "What is a tree?"
   (16min) Netflix and Credit.
   (38min) Learning problem setup.

Reading: LFD Chapter 1 (The Learning Problem), § 1.1
Slides: (full) (compact)
Full video on youtube (74min)

2: Linear Model: Perceptron.
We can deduce a simple but powerful classifier using an analogy to a credit score. We continue with a general overview of learning paradigms and end on a puzzle which forces us to consider if learning is feasible.

Modules:
   (19min) Importance of testing and Edward Jenner's story.
   (28min) Deriving the perceptron from a credit score.
   (04min) Geometry of the perceptron.
   (29min) The Perceptron Learning Algorithm (PLA).
   (09min) Overview of learning paradigms.
   (10min) A puzzle. Is learning feasible, even for humans?

Reading: LFD Chapter 1 (The Learning Problem), § 1.1, 1.2
Slides: (full) (compact)
Full video on youtube (98min)
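
The PLA update covered in this lecture can be sketched in a few lines. This is a minimal illustration, not the course's own code: the names `sign`, `predict` and `pla`, the random choice of misclassified point, and the toy data are all assumptions made for the example. Each input carries a leading 1 so the bias is just another weight.

```python
import random


def sign(value):
    """Return +1 for positive input and -1 otherwise."""
    return 1 if value > 0 else -1


def predict(w, x):
    """Perceptron hypothesis h(x) = sign(w . x), with x[0] = 1 as the bias coordinate."""
    return sign(sum(wi * xi for wi, xi in zip(w, x)))


def pla(points, labels, max_updates=10_000, seed=0):
    """Run the perceptron learning algorithm until no point is misclassified."""
    rng = random.Random(seed)
    w = [0.0] * len(points[0])
    for _ in range(max_updates):
        wrong = [i for i, (x, y) in enumerate(zip(points, labels)) if predict(w, x) != y]
        if not wrong:
            break  # every training point is classified correctly
        i = rng.choice(wrong)
        # Update rule: w <- w + y * x for one misclassified point (x, y).
        w = [wi + labels[i] * xi for wi, xi in zip(w, points[i])]
    return w


# Toy, linearly separable data (illustrative only): label is +1 when x1 + x2 > 1.
X = [(1.0, 0.0, 0.0), (1.0, 1.0, 1.0), (1.0, 2.0, 0.5),
     (1.0, 0.2, 0.3), (1.0, 0.0, 2.0), (1.0, 0.4, 0.1)]
y = [-1, 1, 1, -1, 1, -1]
w = pla(X, y)
```

Because the toy data is separable, the loop stops once `w` classifies every point correctly. On non-separable data it would simply run until `max_updates` is exhausted.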

PAUSE. Reinforce and Practice: Assignment 1.

3: Is Learning Feasible?
To learn, we must be able to reach outside the data. Setting our sights low, we ask: Can we reach outside the data to infer something, however small, about the target function? We'll need probability to rescue us.

Modules:
   (07min) Learning setup and our puzzle from last time.
   (40min) Can we learn something outside the data.
   (31min) Relating learning to the bin model: verification.
   (18min) Verification vs. real learning.
   (05min) Medical testing and repeated trials.

Reading: LFD Chapter 1 (The Learning Problem), § 1.3
Slides: (full) (compact)
Full video on youtube (102min)

4: Real Learning is Feasible.
We get real learning in two steps: fit and predict. Your error measure and noisy target functions also play a role.

Modules:
   (28min) A Hoeffding bound for real learning.
   (22min) Interpreting the bound: fit vs. error bar.
   (21min) Learning is feasible in the 2-step approach.
   (09min) Complex and noisy target functions.
   (14min) The error measure.
   (07min) Full learning setup - it must be possible to fail.
   (04min) Supplement: 5 Questions on the feasibility of learning.
   (03min) Supplement: Do humans follow the 2-step approach?
   (02min) Supplement: Commercial tools and data exploration.
   (04min) Supplement: Sequential search through H - do you pay for the full H?
   (04min) Supplement: Repeated hypothesis testing.
   (03min) Supplement: Supervisor's dilemma.

Reading: LFD Chapter 1 (The Learning Problem), § 1.3, 1.4
Slides: (full) (compact)
Full video on youtube (100min)
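
To get a feel for the numbers, here is a small sketch of the Hoeffding bound for a finite hypothesis set with M hypotheses, P[|Ein - Eout| > ε] ≤ 2M exp(-2ε²N), as in LFD § 1.3. The function names and the example values are assumptions chosen for illustration.

```python
import math


def hoeffding_bound(num_hypotheses, num_samples, epsilon):
    """Upper bound on P[|Ein - Eout| > epsilon] for a finite hypothesis set."""
    return 2 * num_hypotheses * math.exp(-2 * epsilon**2 * num_samples)


def samples_needed(num_hypotheses, epsilon, delta):
    """Smallest N for which the bound above is at most delta."""
    return math.ceil(math.log(2 * num_hypotheses / delta) / (2 * epsilon**2))


# Verification of one fixed hypothesis (M = 1) versus choosing among M = 1000.
single = samples_needed(1, epsilon=0.05, delta=0.05)
many = samples_needed(1000, epsilon=0.05, delta=0.05)
```

With these values, `single` comes out to 738 data points. `many` is larger, but only by a term logarithmic in M, which is why finite models are manageable.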

PAUSE. Reinforce and Practice: Assignment 2.

5: An Effective Model Size.
We can handle finite models. In practice, all models are infinite. Can we also handle infinite models? We need a theory which allows us to link in-sample to out-of-sample for infinite models. The first step is to recast the size of a model into something manageable.

Modules:
   (08min) What will the theory achieve: infinite models.
   (05min) Model complexity through the lens of a data set.
   (24min) Growth function: an effective model size.
   (29min) 2-d perceptron; positive ray; positive rectangle.
   (12min) Shattering the data and a combinatorial puzzle.

Reading: LFD Chapter 2 (Training Vs. Testing), § 2.1.1
Slides: (full) (compact)
Full video on youtube (77min)

PAUSE. Reinforce and Practice: Assignment 3.

6: Bounding Model Size and the VC-Error Bar.
We tamed infinite models by defining an effective size. When is the effective size small? Can the effective size be used to link in-sample to out-of-sample? Indeed, real learning is feasible, but only for good infinite models with small effective size, not bad infinite models.

Modules:
   (17min) Review the growth function and shattering.
   (37min) The combinatorial quantity B(N,k).
   (15min) A polynomial bound for the growth function.
   (08min) The good and bad hypothesis sets. No ugly ones.
   (07min) The Vapnik-Chervonenkis (VC) error bound.

Reading: LFD Chapter 2 (Training Vs. Testing), § 2.1.2
Slides: (full) (compact)
Full video on youtube (83min)
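
The polynomial bound can be checked numerically. The sketch below assumes the standard statement from LFD § 2.1.2: if k is a break point, the growth function is at most the sum of C(N, i) for i = 0, ..., k-1. It also fills in B(N,k) with the recursion B(N,k) ≤ B(N-1,k) + B(N-1,k-1), treated as an equality, to show that both give the same numbers. The function names are illustrative.

```python
from math import comb


def growth_bound(n, k):
    """Polynomial bound sum_{i=0}^{k-1} C(n, i) on the growth function with break point k."""
    return sum(comb(n, i) for i in range(k))


def b_recursive(n, k):
    """Upper bound on B(N, k) obtained by treating the recursion as an equality."""
    if k == 1:
        return 1  # no point may be shattered, so only one dichotomy is allowed
    if n == 1:
        return 2  # a single point admits at most two dichotomies
    return b_recursive(n - 1, k) + b_recursive(n - 1, k - 1)


# 2-d perceptron has break point k = 4; positive rays have break point k = 2.
perceptron_bound_at_4 = growth_bound(4, 4)
positive_ray_bound_at_10 = growth_bound(10, 2)
```

For the 2-d perceptron on 4 points the bound is 15, below 2^4 = 16. For positive rays on 10 points it is 11, which is N + 1.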

PAUSE. Reinforce and Practice: Assignment 4.

7: VC-Dimension: Fit vs. Predict.
The approximation vs. generalization tradeoff is inescapable. It is a tug-of-war between fitting in-sample and predicting out-of-sample. The starring role goes to the VC-dimension, a single parameter that captures model complexity. We also cover another approach to this tradeoff, bias and variance.

Modules:
   (09min) Recap on the story of learning.
   (24min) VC-dimension: a model's complexity.
   (08min) VC-dimension of the perceptron in d-dimensions.
   (19min) VC-dimension in theory vs. practice.
   (30min) Bias-variance analysis.
   (04min) Learning curves.

Reading: LFD Chapter 2 (Training Vs. Testing), § 2.1.3, 2.2, 2.3
Slides: (full) (compact)
Full video on youtube (93min)

PAUSE. Reinforce and Practice: Assignment 5.

8: Linear Classification & Regression.
The theory says: (i) fit the data and (ii) ensure the link from in-sample to out-of-sample. The simple linear model has the link to out-of-sample. Can we fit? We revisit the linear model with non-separable data. We also consider regression for predicting real values. With regression it is "easy" to fit.

Modules:
   (12min) Recap of the approximation vs. generalization.
   (04min) Classification, regression, logistic regression.
   (09min) The linear signal.
   (12min) Classification, error and the pocket algorithm.
   (19min) The digits data and choosing good features.
   (24min) Regression and the pseudoinverse algorithm.
   (04min) Linear regression used for classification.

Reading: LFD Chapter 3 (The Linear Model), § 3.1, 3.2
Slides: (full) (compact)
Full video on youtube (85min)
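
As a rough sketch of why regression is "easy" to fit, the pseudoinverse algorithm computes the least-squares weights in one step, w = pinv(X) y. The example below assumes NumPy is available. The helper names and the noiseless toy data are illustrative, not part of the course materials.

```python
import numpy as np


def linear_regression(X, y):
    """One-step least-squares fit: w = pinv(X1) @ y, where X1 is X with a bias column."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return np.linalg.pinv(X1) @ y


def classify(w, X):
    """Use regression weights as a classifier by taking the sign of the linear signal."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return np.sign(X1 @ w)


# Illustrative data generated from y = 1 + 2*x1 - 3*x2 with no noise.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 1 + 2 * X[:, 0] - 3 * X[:, 1]
w = linear_regression(X, y)
```

Since the data is noiseless, `w` recovers the generating weights up to floating-point error. `classify` shows the last module's idea of reusing the regression weights for classification.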

PAUSE. Reinforce and Practice: Assignment 6.

Zip Code Digits Data:
(Training Data), (Test Data), (Some info on the data)

Sample MATLAB code for plotting digits:
Wrapper to plot images, Plots a single image

9: Logistic Regression and Gradient Descent.
Your data contains realizations of events (e.g. credit defaults), and you want to estimate the probability of the event. How do you estimate a probability? We need the right error measure and a method to fit the data by minimizing the error measure. The linear model will take care of the link from in-sample to out-of-sample.

Modules:
   (03min) Good features, classification and regression.
   (20min) Predicting a probability -- the squared error is bad.
   (21min) Max. Likelihood and the cross-entropy error.
   (29min) Gradient descent to minimize the error.
   (08min) Stochastic gradient descent (SGD).

Reading: LFD Chapter 3 (The Linear Model), § 3.3
Slides: (full) (compact)
Full video on youtube (79min)
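
A minimal sketch of fitting logistic regression by gradient descent, assuming labels in {-1, +1} and the cross-entropy error Ein(w) = (1/N) Σ ln(1 + exp(-y w·x)). The fixed learning rate, the step count, the function names and the toy data are assumptions made for illustration.

```python
import math


def cross_entropy_error(w, points, labels):
    """In-sample error (1/N) * sum ln(1 + exp(-y * w.x)) for labels y in {-1, +1}."""
    total = 0.0
    for x, y in zip(points, labels):
        signal = sum(wi * xi for wi, xi in zip(w, x))
        total += math.log(1 + math.exp(-y * signal))
    return total / len(points)


def gradient(w, points, labels):
    """Gradient of the error above: -(1/N) * sum y * x / (1 + exp(y * w.x))."""
    grad = [0.0] * len(w)
    for x, y in zip(points, labels):
        signal = sum(wi * xi for wi, xi in zip(w, x))
        scale = -y / (1 + math.exp(y * signal))
        for j, xj in enumerate(x):
            grad[j] += scale * xj / len(points)
    return grad


def gradient_descent(points, labels, learning_rate=0.1, steps=500):
    """Fixed-learning-rate gradient descent starting from w = 0."""
    w = [0.0] * len(points[0])
    for _ in range(steps):
        g = gradient(w, points, labels)
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
    return w


# Noisy 1-d toy data with a leading bias coordinate (illustrative only).
X = [(1.0, -2.0), (1.0, -1.0), (1.0, -0.5), (1.0, 0.5), (1.0, 1.0), (1.0, 2.0)]
y = [-1, -1, 1, -1, 1, 1]
w = gradient_descent(X, y)
```

At w = 0 the error is ln 2, so any useful fit must end up below that. Stochastic gradient descent would use the same `scale * x` term, but computed from one randomly chosen point per step.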

10: Nonlinear Feature Transforms.
What do you do when you suspect the linear model can't get a good fit to the data? Nonlinear transforms are an add-on to linear models that allows us to separate with a nonlinear classification boundary while using all the machinery of the linear models. Elegant and very powerful, but dangerous.

Modules:
   (04min) Recap of linear models.
   (25min) Mechanics of the nonlinear feature transform.
   (08min) Pick your transform before seeing the data.
   (17min) Qth-order polynomial transform. Mega power.
   (03min) Advertisement: use the linear model. It's solid.

Reading: LFD Chapter 3 (The Linear Model), § 3.4
Slides: (full) (compact)
Full video on youtube (58min)
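
The mechanics of the transform can be sketched for two input dimensions: map (x1, x2) to all monomials of degree at most Q, then run a linear model on the result. The function names and the circle example are illustrative assumptions.

```python
def poly_transform(x1, x2, q):
    """Map (x1, x2) to all monomials x1**i * x2**j with i + j <= q, constant term first."""
    features = []
    for degree in range(q + 1):
        for j in range(degree + 1):
            features.append(x1 ** (degree - j) * x2 ** j)
    return features


def linear_classify(w, z):
    """Ordinary linear classifier sign(w . z), applied in the transformed space."""
    return 1 if sum(wi * zi for wi, zi in zip(w, z)) > 0 else -1


# For q = 2 the features are (1, x1, x2, x1^2, x1*x2, x2^2).
# These weights encode the circular boundary 1 - x1^2 - x2^2 = 0.
w_circle = [1, 0, 0, -1, 0, -1]
inside = linear_classify(w_circle, poly_transform(0.5, 0.5, 2))
outside = linear_classify(w_circle, poly_transform(1.0, 1.0, 2))
```

The boundary is a circle in the original space but a hyperplane in the transformed one. The price is the number of features, which grows as (Q+1)(Q+2)/2 in two dimensions, and that is where the danger comes from.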

PAUSE. Reinforce and Practice: Assignment 7.

11: Overfitting.
Overfitting separates the pro from the amateur. What is overfitting? When does it occur? What causes it? Is it just the use of complex models on little data? Even simple models can overfit when there is stochastic or deterministic noise.

Modules:
   (06min) Recap of linear models.
   (08min) What is overfitting?
   (25min) A case study in complexity and noise.
   (23min) Stochastic and deterministic noise. The culprits.
   (09min) Bias variance decomposition and noise.
   (04min) Overfitting is the disease. Noise is the virus. Cures?

Reading: LFD Chapter 4 (Overfitting), § 4.1
Slides: (full) (compact)
Full video on youtube (75min)

12: Regularization.
We need weapons against overfitting. Regularization works by constraining a model toward simpler hypotheses. Noise (stochastic or deterministic) is complex. Regularization helps by combating the effects of noise without hindering your ability to fit signal.

Modules:
   (03min) Overfitting: stochastic and deterministic noise.
   (09min) Regularization in a nutshell: constrain the model.
   (18min) Linear model, Legendre polynomials, constraints.
   (34min) A concrete setting: linear models on a budget.
   (09min) More noise needs more regularization.
   (07min) Regularization fights noise, not signal. It's a MUST.

Reading: LFD Chapter 4 (Overfitting), § 4.2
Slides: (full) (compact)
Full video on youtube (79min)

PAUSE. Reinforce and Practice: Assignment 8.

13: Validation and Model Selection.
Can we get a sneak peek at the out-of-sample error? This would be a reality check that can help prevent overfitting as well as allow us to make high-level choices in learning (hyperparameters, model selection, etc.).

Modules:
   (03min) Recap of regularization for fighting overfitting.
   (26min) Validation simulates a test set.
   (09min) Restoring the full data and the validation set size.
   (17min) Model selection, an application of validation.
   (16min) Can we use validation set size 1: cross validation.
   (07min) Validation on the digits data: selecting features.

Reading: LFD Chapter 4 (Overfitting), § 4.3
Slides: (full) (compact)
Full video on youtube (79min)
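
Validation with a set of size 1, repeated over every data point, is leave-one-out cross validation. The sketch below uses it to choose between a constant model and a line in one dimension. The function names and the four toy points are assumptions for illustration only.

```python
def fit_constant(xs, ys):
    """Constant model: always predict the mean of the training targets."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_line(xs, ys):
    """Least-squares line through the points (needs at least two distinct x values)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def leave_one_out_error(fit, xs, ys):
    """Average squared error on each point when it alone is held out of training."""
    total = 0.0
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (model(xs[i]) - ys[i]) ** 2
    return total / len(xs)


# Roughly linear toy data (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.1, 2.9]
e_constant = leave_one_out_error(fit_constant, xs, ys)
e_line = leave_one_out_error(fit_line, xs, ys)
```

Model selection picks whichever model has the smaller cross validation error, here the line, and then retrains it on all the data.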

PAUSE. Reinforce and Practice: Assignment 9.

14: Occam's Razor, Sampling Bias, Snooping.
We take stock of the foundations with three simple but important learning principles to address the three main steps in learning: (i) choosing the model; (ii) getting the data; (iii) handling the data.

Modules:
   (25min) Occam's razor: pick your model carefully.
   (12min) The postal scam.
   (07min) Sampling bias: get your data carefully.
   (09min) Social media; medical studies; standardized tests.
   (04min) Extrapolation vs. interpolation.
   (04min) Credit approval and sampling bias.
   (08min) Data snooping: handle the data with care. Stocks.
   (10min) Data snooping is a subtle happy hell.

Reading: LFD Chapter 5 (Three Learning Principles)
Slides: (full) (compact)
Full video on youtube (78min)

15: Epilogue to Part I.
Let us relax a bit and reflect on our path and the storyline. We earned it. We'll revisit our learning principles, try to give you a feel for how precipitous learning is, build a taxonomy for what is out there and plot our path forward.

Modules:
   (04min) Occam (give the data a chance), bias, snooping.
   (06min) Pause and take in a zen moment.
   (08min) Storyline. What is learning? Can we do it? How?
   (21min) The ML jungle: theory, techniques, paradigms.

Reading: LFD
Slides: (full) (compact)
Full video on youtube (39min)

Ⅱ. Techniques

16: Similarity and Nearest Neighbor.
A five-year-old, if asked to classify an apple, would search their experience for things that look similar and classify accordingly. It is the simplest learning rule of all, yet it is very powerful. So, let us take a systematic look at such methods.

Modules:
   (06min) Measuring similarity.
   (13min) Nearest-neighbor. Ein=0 (no link to Eout)!
   (22min) Nearest-neighbor is within 2 times optimal.
   (12min) K-Nearest Neighbor. Optimal Eout!
   (06min) Parametric vs. nonparametric.
   (07min) Multiclass, regression and logistic regression.

Reading: LFD e-Chapter 6 (Similarity), § 6.2
Slides: (full) (compact)
Full video on youtube (65min)
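
A minimal sketch of the k-nearest neighbor rule, assuming Euclidean distance as the similarity measure and a simple majority vote. The function name and the toy data are illustrative.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=1):
    """Label a query by majority vote among its k nearest training points."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Two small, well-separated groups (illustrative only).
points = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6), (3.0, 3.0), (3.4, 2.8), (2.9, 3.5)]
labels = ["apple", "apple", "apple", "pear", "pear", "pear"]
guess = knn_classify(points, labels, (0.3, 0.3), k=3)
```

With k = 1 and distinct training points, every training point is its own nearest neighbor, which is why Ein = 0 says nothing about Eout. Note that the whole data set is scanned for every query, which is the cost the next lecture addresses.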

17: Efficiency of Nearest Neighbor.
Nearest neighbor is heavy and slow: you need all the data for every test classification. The 5-year-old does not store all horses. Can we make nearest neighbor more practical? We don't want to carry around all the data. We don't want to spend hours searching for a nearest neighbor.

Modules:
   (10min) Memory and speed load of Nearest Neighbor.
   (11min) Data condensing.
   (24min) Condensed nearest neighbor (CNN).
   (19min) Search using branch and bound and clustering.
   (07min) Clustering and Lloyd's algorithm.

Reading: LFD e-Chapter 6 (Similarity), § 6.2.3
Slides: (full) (compact)
Full video on youtube (71min)

PAUSE. Reinforce and Practice: Assignment 10.

18: Radial Basis Functions (RBF).
Nearest neighbor uses only some data to classify a test point. That seems an unnecessary limitation. We develop a "soft" extension of nearest neighbor that uses all data to classify. When we simplify this model to the RBF-network, we get our first nonlinear model.

Modules:
   (25min) Nonparametric RBF: soft nearest neighbor.
   (27min) Parametric RBF: fixed bumps on data points.
   (12min) RBF-network: reducing the number of bumps.
   (15min) Fitting data: linear model + similarity features.

Reading: LFD e-Chapter 6 (Similarity), § 6.3
Slides: (full) (compact)
Full video on youtube (79min)

19: A Peek at Unsupervised Learning.
Unsupervised learning made two appearances: to organize the data for nearest neighbor search and as the first step in RBF-learning. Let us make a very short digression to look at two simple important unsupervised techniques for organizing data: k-means and Gaussian Mixture Models.

Modules:
   (11min) The clustering problem.
   (18min) K-means clustering and Lloyd's algorithm.
   (06min) Probability density estimation: Parzen windows.
   (10min) Gaussian Mixture Model.
   (18min) EM-Algorithm for learning a GMM.

Reading: LFD e-Chapter 6 (Similarity), § 6.4
Slides: (full) (compact)
Full video on youtube (62min)
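
Lloyd's algorithm for k-means alternates two steps: assign each point to its nearest center, then move each center to the mean of its points. The sketch below assumes the initial centers are supplied by the caller. The function name, the handling of empty clusters and the toy data are illustrative choices.

```python
import math


def lloyd(points, centers, iterations=20):
    """Alternate between nearest-center assignment and recomputing the centers."""
    centers = [tuple(c) for c in centers]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = []
        for center, cluster in zip(centers, clusters):
            if cluster:
                new_centers.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centers.append(center)  # keep an empty cluster's center in place
        if new_centers == centers:
            break  # the centers stopped moving, so the assignments are stable
        centers = new_centers
    return centers, clusters


# Two obvious groups (illustrative only), with one starting center in each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = lloyd(data, [(0, 0), (10, 10)])
```

The result depends on the starting centers. Lloyd's algorithm only finds a local minimum of the clustering error, so in practice it is often run from several initializations.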

PAUSE. Reinforce and Practice: Assignment 11.

20: Multilayer (Deep) Perceptron.
What would happen if you started from the fundamental simple linear perceptron and cascaded several of them together? You get awesome power. Such a generalization of the perceptron ultimately leads us to the neural network, a powerful biologically inspired model.

Modules:
   (14min) A biography of deep networks.
   (27min) Multilayer perceptron. Cascaded linear models.
   (15min) Approximation: 3-layers can fit any target function.
   (18min) Notation for Deep Multilayer Networks.

Reading: LFD e-Chapter 7 (Neural Networks), § 7.1
Slides: (full) (compact)
Full video on youtube (74min)

21: Fitting Deep Networks to Data.
Unlike the simple perceptron, we don't have an explicit formula for the deep network hypothesis or its derivative. We need an algorithm to compute the hypothesis. We also need to efficiently compute gradients and learn weights by using gradient descent to fit the data.

Modules:
   (10min) Recap of neural network notation.
   (22min) Forward propagation: computing the output.
   (42min) Backward propagation: getting the gradient.
   (03min) Fitting a network to digits data. Or is it overfitting?

Reading: LFD e-Chapter 7 (Neural Networks), § 7.2
Slides: (full) (compact)
Full video on youtube (77min)

22: Deep Networks: Overfitting / Faster Fitting.
The awesome power of deep networks leads to overfitting, and so this power must be reined in, or else we don't get the link to out-of-sample. We can minimize a regularized in-sample error, but it's not that easy anymore. We need better, more efficient tools for fitting/minimizing error measures. Can we beef up gradient descent?

Modules:
   (23min) Deep network vs. RBF vs. nonlinear transforms.
   (19min) Generalization and VC-dimension of cascading.
   (13min) Regularization and early stopping.
   (18min) Beefing up gradient descent.
   (10min) Conjugate gradients.

Reading: LFD e-Chapter 7 (Neural Networks), § 7.3-7.5
Slides: (full) (compact)
Full video on youtube (82min)

PAUSE. Reinforce and Practice: Assignment 12, Problems 1,2.

23: Support Vector Machine.
Many hyperplanes can fit the data equally well. Which one is the best? We move in the direction of robustness to noise in the inputs. The linear model that maximizes the "margin" for error is the most robust to noise and has links to automatic regularization. Can such an optimal hyperplane be efficiently found?

Modules:
   (09min) Robustness to input noise: fattest hyperplane.
   (22min) Geometry of hyperplanes.
   (20min) Finding the fattest maximum margin hyperplane.
   (20min) Quadratic programming.
   (06min) Comparing SVM with PLA.

Reading: LFD e-Chapter 8 (SVM), § 8.1.1
Slides: (full) (compact)
Full video on youtube (77min)

24: SVM Overfits Less.
The optimal (max. margin) hyperplane overfits less. We show evidence that its generalization does not explicitly depend on the dimension, opening up a world of possibilities.

Modules:
   (18min) Large margin is better than small margin.
   (08min) Fat hyperplanes have smaller VC-dimension.
   (10min) Optimal hyperplane cross-validation error.
   (08min) SVM generalization not controlled by dimension.
   (17min) Non-separable data: soft margin SVM.
   (06min) Non-separable data: feature transforms.

Reading: LFD e-Chapter 8 (SVM), § 8.1.2, 8.1.3
Slides: (full) (compact)
Full video on youtube (68min)

25: SVM: The Kernel Trick.
The support vector machine has the power to efficiently use nonlinear transforms without physically transforming to the nonlinear feature-space. This allows us to learn in infinite dimensions!

Modules:
   (52min) Deriving the dual for the optimal hyperplane.
   (12min) SVM dual version is an inner product algorithm.
   (05min) A kernel gives inner products in feature space.
   (09min) Kernel for polynomial features.
   (13min) Learning in infinite dimensions.

Reading: LFD e-Chapter 8 (SVM), § 8.2--8.4
Slides: (full) (compact)
Full video on youtube (91min)

PAUSE. Reinforce and Practice: Assignment 12, Problems 3,4,5.

26: SVM: Choosing A Kernel.
The support vector machine with the kernel trick can simulate other methods. But, at the end of the day, the kernel measures similarity, which brings us full circle back to similarity-based methods. If you think about it, what else could there be? We'll touch on popular kernels, design choices and kernels in different applications.

Modules:
   (10min) The polynomial and Gaussian (RBF) kernels.
   (07min) RBF-kernel is a fully-automatic RBF-network.
   (09min) Tanh-kernel is a fully automatic neural network.
   (10min) The kernel computes a scaled similarity.
   (11min) Task dependence: strings, text, graphs, images.

Reading: LFD e-Chapter 8 (SVM), § 8.2--8.4
Slides: (full) (compact)
Full video on youtube (min)

27: Learning Aides & PCA.
Typical models have default settings and expect the data to conform to these defaults. Learning aides help the learning by putting data in the desired state, and can be used with any learning technique. Beware: it is easy to data snoop your test set when using a learning aid.

Modules:
   (10min) Data scales can affect learning outcome.
   (14min) Preprocess: center, normalize & whiten.
   (32min) Principal Components Analysis.
   (04min) Quick overview of other learning aides.

Reading: LFD e-Chapter 9 (Learning Aides), § 9.1 -- 9.4
Slides: (full) (compact)
Full video on youtube (60min)

28: Model Blending & Reinforcement Learning.
If you have multiple final hypotheses, must you pick one, or can you combine them? You can combine hypotheses either during learning or after the fact. We will end with some flavors of reinforcement learning. It is the predominant way animals learn, and that makes it an appealing paradigm.

Modules:
   (26min) Boosting, Bagging and Blending.
   (15min) The reinforcement learning setting.
   (09min) Bandit problems: exploration vs. exploitation.
   (19min) Online decisions. Magdon's 1/e-theorem for dating.

Reading:
Slides: (full) (compact)
Full video on youtube (68min)

CELEBRATE! You came this far.
Apply your learning.
Take the FINAL

Lectures 1-27
LFD Chapters 1-9
Assignments 1-12
Fall 2019 (solution).
Fall 2018 (solution).
Fall 2017 (solution).
Fall 2016 (solution).