Machine Learning From Data, Rensselaer
Online Course
A course on the foundations of machine learning suitable for an advanced undergraduate or beginning graduate student interested in the theory and applications of learning. Familiarity with calculus, linear algebra and probability is helpful.
Instructor: Malik Magdon-Ismail
Text Book: Learning From Data by Abu-Mostafa, Magdon-Ismail, Lin (plus eChapters)
Part Ⅰ: Foundations. (lectures 1-15) A companion to the textbook.
Part Ⅱ: Techniques. (lectures 16-27) eChapters 6-9 (book forum, www.amlbook.com).

1: The Learning Problem. Class motto (sing it loud): A pattern exists. Modules: (04min) Storyline: What is learning? Can we do it? How? (13min) Applications, "What is a tree?" (16min) Netflix and Credit. (38min) Learning problem setup. Reading: LFD Chapter 1 (The Learning Problem), § 1.1 Slides: (full) (compact) Full video on youtube (74min)

2: Linear Model: Perceptron. We can deduce a simple but powerful classifier using an analogy to a credit score. We continue with a general overview of learning paradigms and end on a puzzle which forces us to consider if learning is feasible. Modules: (19min) Importance of testing and Edward Jenner's story. (28min) Deriving the perceptron from a credit score. (04min) Geometry of the perceptron. (29min) The Perceptron Learning Algorithm (PLA). (09min) Overview of learning paradigms. (10min) A puzzle. Is learning feasible, even for humans? Reading: LFD Chapter 1 (The Learning Problem), § 1.1, 1.2 Slides: (full) (compact) Full video on youtube (98min)
PAUSE. Reinforce and Practice: Assignment 1.
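The Perceptron Learning Algorithm from lecture 2 can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the names `classify` and `pla`, the `max_updates` cap and the toy data are assumptions made for the example. The update rule itself (pick a misclassified point and add `y * x` to the weights) is the standard PLA step.

```python
def classify(weights, x):
    """Return +1 or -1 from the sign of the linear signal w . (1, x)."""
    augmented = [1.0] + list(x)  # prepend the constant coordinate for the bias
    signal = sum(w * xi for w, xi in zip(weights, augmented))
    return 1 if signal > 0 else -1


def pla(points, labels, max_updates=1000):
    """Run PLA until no point is misclassified (or the update cap is hit)."""
    weights = [0.0] * (len(points[0]) + 1)
    for _ in range(max_updates):
        # Find the first misclassified point, if any.
        wrong = next(
            ((x, y) for x, y in zip(points, labels) if classify(weights, x) != y),
            None,
        )
        if wrong is None:
            return weights  # every training point is classified correctly
        x, y = wrong
        augmented = [1.0] + list(x)
        # PLA update: w <- w + y * x
        weights = [w + y * xi for w, xi in zip(weights, augmented)]
    return weights


# Hypothetical, linearly separable toy data (two features per point).
toy_points = [(2.0, 1.0), (3.0, 3.0), (-1.0, -2.0), (-2.0, -1.0)]
toy_labels = [1, 1, -1, -1]
toy_weights = pla(toy_points, toy_labels)
```

On separable data the loop stops as soon as every training point is on the correct side. On non-separable data it only stops at the `max_updates` cap, which is why that cap is included here.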
 

3: Is Learning Feasible? To learn, we must be able to reach outside the data. Setting our sights low, we ask: Can we reach outside the data to infer something, however small, about the target function? We'll need probability to rescue us. Modules: (07min) Learning setup and our puzzle from last time. (40min) Can we learn something outside the data. (31min) Relating learning to the bin model: verification. (18min) Verification vs. real learning. (05min) Medical testing and repeated trials. Reading: LFD Chapter 1 (The Learning Problem), § 1.3 Slides: (full) (compact) Full video on youtube (102min) 

4: Real Learning is Feasible. We get real learning in two steps: fit and predict. Your error measure and noisy target functions also play a role. Modules: (28min) A Hoeffding bound for real learning. (22min) Interpreting the bound: fit vs. error bar. (21min) Learning is feasible in the 2-step approach. (09min) Complex and noisy target functions. (14min) The error measure. (07min) Full learning setup: it must be possible to fail. (04min) Supplement: 5 Questions on the feasibility of learning. (03min) Supplement: Do humans follow the 2-step approach? (02min) Supplement: Commercial tools and data exploration. (04min) Supplement: Sequential search through H: do you pay for the full H? (04min) Supplement: Repeated hypothesis testing. (03min) Supplement: Supervisor's dilemma. Reading: LFD Chapter 1 (The Learning Problem), § 1.3, 1.4 Slides: (full) (compact) Full video on youtube (100min)
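To make the "fit vs. error bar" reading of the bound concrete, here is a small sketch that evaluates the Hoeffding bound for a finite hypothesis set of size M, P[|Ein - Eout| > eps] <= 2 M exp(-2 eps^2 N), and the error bar obtained by solving for eps at confidence 1 - delta. The function names and the example numbers are illustrative assumptions.

```python
import math


def hoeffding_bound(num_hypotheses, num_samples, epsilon):
    """Upper bound on P[|Ein - Eout| > epsilon] for a finite hypothesis set."""
    return 2 * num_hypotheses * math.exp(-2 * epsilon**2 * num_samples)


def error_bar(num_hypotheses, num_samples, delta):
    """Epsilon such that Eout <= Ein + epsilon holds with probability >= 1 - delta."""
    return math.sqrt(math.log(2 * num_hypotheses / delta) / (2 * num_samples))


# Hypothetical setting: 100 hypotheses, 1000 data points, 95% confidence.
example_bar = error_bar(num_hypotheses=100, num_samples=1000, delta=0.05)
```

The error bar grows only logarithmically with the number of hypotheses and shrinks like one over the square root of the number of data points.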

5: An Effective Model Size. We can handle finite models. In practice, all models are infinite. Can we also handle infinite models? We need a theory which allows us to link in-sample to out-of-sample for infinite models. The first step is to recast the size of a model into something manageable. Modules: (08min) What will the theory achieve: infinite models. (05min) Model complexity through the lens of a data set. (24min) Growth function: an effective model size. (29min) 2d perceptron; positive ray; positive rectangle. (12min) Shattering the data and a combinatorial puzzle. Reading: LFD Chapter 2 (Training Vs. Testing), § 2.1.1 Slides: (full) (compact) Full video on youtube (77min)
PAUSE. Reinforce and Practice: Assignment 3.
 

6: Bounding Model Size and the VC-Error Bar. We tamed infinite models by defining an effective size. When is the effective size small? Can the effective size be used to link in-sample to out-of-sample? Indeed, real learning is feasible, but only for good infinite models with small effective size, not bad infinite models. Modules: (17min) Review the growth function and shattering. (37min) The combinatorial quantity B(N,k). (15min) A polynomial bound for the growth function. (08min) The good and bad hypothesis sets. No ugly ones. (07min) The Vapnik-Chervonenkis (VC) error bound. Reading: LFD Chapter 2 (Training Vs. Testing), § 2.1.2 Slides: (full) (compact) Full video on youtube (83min)
PAUSE. Reinforce and Practice: Assignment 4.
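The polynomial bound from lecture 6 is easy to evaluate numerically. If a hypothesis set has break point k, its growth function on N points is at most the sum of the binomial coefficients C(N, i) for i = 0, ..., k-1. The sketch below computes that sum; the function name is an assumption made for the example.

```python
from math import comb


def growth_function_bound(num_points, break_point):
    """Sum of C(N, i) for i = 0 .. k-1: a polynomial in N of degree k-1."""
    return sum(comb(num_points, i) for i in range(break_point))


# Positive rays have break point 2, so the bound is N + 1.
positive_ray_bound = growth_function_bound(num_points=10, break_point=2)

# The 2d perceptron has break point 4, so on 4 points the bound is below 2**4.
perceptron_bound = growth_function_bound(num_points=4, break_point=4)
```

For a fixed break point the bound is polynomial in N, whereas a hypothesis set with no break point has growth function 2^N.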
 

7: VC-Dimension: Fit vs. Predict. The approximation vs. generalization tradeoff is inescapable. It is a tug-of-war between fitting in-sample and predicting out-of-sample. The starring role goes to the VC-dimension, a single parameter that captures model complexity. We also cover another approach to this tradeoff, bias and variance. Modules: (09min) Recap on the story of learning. (24min) VC-dimension: a model's complexity. (08min) VC-dimension of the perceptron in d dimensions. (19min) VC-dimension in theory vs. practice. (30min) Bias-variance analysis. (04min) Learning curves. Reading: LFD Chapter 2 (Training Vs. Testing), § 2.1.3, 2.2, 2.3 Slides: (full) (compact) Full video on youtube (93min)
PAUSE. Reinforce and Practice: Assignment 5.
 

8: Linear Classification & Regression. The theory says: (i) fit the data and (ii) ensure the link from in-sample to out-of-sample. The simple linear model has the link to out-of-sample. Can we fit? We revisit the linear model with nonseparable data. We also consider regression for predicting real values. With regression it is "easy" to fit. Modules: (12min) Recap of the approximation vs. generalization. (04min) Classification, regression, logistic regression. (09min) The linear signal. (12min) Classification, error and the pocket algorithm. (19min) The digits data and choosing good features. (24min) Regression and the pseudoinverse algorithm. (04min) Linear regression used for classification. Reading: LFD Chapter 3 (The Linear Model), § 3.1, 3.2 Slides: (full) (compact) Full video on youtube (85min)
PAUSE. Reinforce and Practice: Assignment 6.
 

9: Logistic Regression and Gradient Descent. Your data contains realizations of events (e.g. credit defaults), and you want to estimate the probability of the event. How do you estimate a probability? We need the right error measure and a method to fit the data by minimizing the error measure. The linear model will take care of the link from in-sample to out-of-sample. Modules: (03min) Good features, classification and regression. (20min) Predicting a probability: the squared error is bad. (21min) Max. Likelihood and the cross-entropy error. (29min) Gradient descent to minimize the error. (08min) Stochastic gradient descent (SGD). Reading: LFD Chapter 3 (The Linear Model), § 3.3 Slides: (full) (compact) Full video on youtube (79min)
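As a rough illustration of lecture 9, the sketch below minimizes the cross-entropy error of a logistic regression model with fixed-step batch gradient descent. It assumes labels in {-1, +1} and inputs that already carry a leading constant coordinate of 1. The function names, the learning rate, the step count and the toy data are all assumptions made for the example.

```python
import math


def cross_entropy_error(weights, inputs, labels):
    """Ein(w) = average of ln(1 + exp(-y * w.x)) over the data."""
    total = 0.0
    for x, y in zip(inputs, labels):
        signal = sum(w * xi for w, xi in zip(weights, x))
        total += math.log(1.0 + math.exp(-y * signal))
    return total / len(inputs)


def gradient(weights, inputs, labels):
    """Gradient of Ein: average of -y * x / (1 + exp(y * w.x))."""
    grad = [0.0] * len(weights)
    for x, y in zip(inputs, labels):
        signal = sum(w * xi for w, xi in zip(weights, x))
        scale = -y / (1.0 + math.exp(y * signal))
        for i, xi in enumerate(x):
            grad[i] += scale * xi
    return [g / len(inputs) for g in grad]


def gradient_descent(inputs, labels, learning_rate=0.1, num_steps=500):
    """Start at w = 0 and repeatedly step against the gradient."""
    weights = [0.0] * len(inputs[0])
    for _ in range(num_steps):
        grad = gradient(weights, inputs, labels)
        weights = [w - learning_rate * g for w, g in zip(weights, grad)]
    return weights


def probability(weights, x):
    """Estimated P[y = +1 | x] from the logistic function of the signal."""
    signal = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-signal))


# Hypothetical noisy 1-d data; each input is (1, x) so weights[0] is the bias.
toy_inputs = [(1.0, -2.0), (1.0, -1.0), (1.0, -0.5), (1.0, 0.5), (1.0, 1.0), (1.0, 2.0)]
toy_labels = [-1, -1, 1, -1, 1, 1]
fitted = gradient_descent(toy_inputs, toy_labels)
```

Stochastic gradient descent would replace the full average in `gradient` with the term for a single randomly chosen data point at each step.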

10: Nonlinear Feature Transforms. What do you do when you suspect the linear model can't get a good fit to the data? Nonlinear transforms are an add-on to linear models which allows us to separate with a nonlinear classification boundary while still using all the machinery of the linear models. Elegant and very powerful, but dangerous. Modules: (04min) Recap of linear models. (25min) Mechanics of the nonlinear feature transform. (08min) Pick your transform before seeing the data. (17min) Qth-order polynomial transform. Mega power. (03min) Advertisement: use the linear model. It's solid. Reading: LFD Chapter 3 (The Linear Model), § 3.4 Slides: (full) (compact) Full video on youtube (58min)
PAUSE. Reinforce and Practice: Assignment 7.
 

11: Overfitting. Overfitting separates the pro from the amateur. What is overfitting? When does it occur? What causes it? Is it just the use of complex models on little data? Even simple models can overfit when there is stochastic or deterministic noise. Modules: (06min) Recap of linear models. (08min) What is overfitting? (25min) A case study in complexity and noise. (23min) Stochastic and deterministic noise. The culprits. (09min) Bias-variance decomposition and noise. (04min) Overfitting is the disease. Noise is the virus. Cures? Reading: LFD Chapter 4 (Overfitting), § 4.1 Slides: (full) (compact) Full video on youtube (75min)

12: Regularization. We need weapons against overfitting. Regularization works by constraining a model toward simpler hypotheses. Noise (stochastic or deterministic) is complex. Regularization helps by combating the effects of noise without hindering your ability to fit signal. Modules: (03min) Overfitting: stochastic and deterministic noise. (09min) Regularization in a nutshell: constrain the model. (18min) Linear model, Legendre polynomials, constraints. (34min) A concrete setting: linear models on a budget. (09min) More noise needs more regularization. (07min) Regularization fights noise, not signal. It's a MUST. Reading: LFD Chapter 4 (Overfitting), § 4.2 Slides: (full) (compact) Full video on youtube (79min)
PAUSE. Reinforce and Practice: Assignment 8.
 

13: Validation and Model Selection. Can we get a sneak peek at the out-of-sample error? This would be a reality check that can help prevent overfitting as well as allow us to make high-level choices in learning (hyperparameters, model selection, etc.). Modules: (03min) Recap of regularization for fighting overfitting. (26min) Validation simulates a test set. (09min) Restoring the full data and the validation set size. (17min) Model selection, an application of validation. (16min) Can we use validation set size 1: cross validation. (07min) Validation on the digits data: selecting features. Reading: LFD Chapter 4 (Overfitting), § 4.3 Slides: (full) (compact) Full video on youtube (79min)
PAUSE. Reinforce and Practice: Assignment 9.
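Cross validation with a validation set of size 1 (leave-one-out) can be sketched as follows. Each data point is held out in turn, the model is fit on the rest, and the squared errors on the held-out points are averaged. The two candidate models (a constant and a straight line), the helper names and the toy data are assumptions chosen to illustrate model selection.

```python
def leave_one_out_error(data, fit, predict):
    """Average squared error on each point when it is held out of training."""
    total = 0.0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]  # train on everything except point i
        model = fit(rest)
        total += (predict(model, x) - y) ** 2
    return total / len(data)


def fit_constant(data):
    """Constant model: predict the mean of the training targets."""
    return sum(y for _, y in data) / len(data)


def predict_constant(model, x):
    return model


def fit_line(data):
    """Least-squares line for 1-d inputs; returns (intercept, slope)."""
    mean_x = sum(x for x, _ in data) / len(data)
    mean_y = sum(y for _, y in data) / len(data)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in data) / sum(
        (x - mean_x) ** 2 for x, _ in data
    )
    return (mean_y - slope * mean_x, slope)


def predict_line(model, x):
    intercept, slope = model
    return intercept + slope * x


# Hypothetical noiseless data on the line y = x + 1.
toy_data = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
cv_constant = leave_one_out_error(toy_data, fit_constant, predict_constant)
cv_line = leave_one_out_error(toy_data, fit_line, predict_line)
```

Model selection then picks the candidate with the smaller cross validation error and refits it on the full data.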
 

14: Occam's Razor, Sampling Bias, Snooping. We take stock of the foundations with three simple but important learning principles to address the three main steps in learning: (i) choosing the model; (ii) getting the data; (iii) handling the data. Modules: (25min) Occam's razor: pick your model carefully. (12min) The postal scam. (07min) Sampling bias: get your data carefully. (09min) Social media; medical studies; standardized tests. (04min) Extrapolation vs. interpolation. (04min) Credit approval and sampling bias. (08min) Data snooping: handle the data with care. Stocks. (10min) Data snooping is a subtle happy hell. Reading: LFD Chapter 5 (Three Learning Principles) Slides: (full) (compact) Full video on youtube (78min)

15: Epilogue to Part I. Let us relax a bit and reflect on our path and the storyline. We earned it. We'll revisit our learning principles, try to give you a feel for how precipitous learning is, build a taxonomy for what is out there and plot our path forward. Modules: (04min) Occam (give the data a chance), bias, snooping. (06min) Pause and take in a zen moment. (08min) Storyline. What is learning? Can we do it? How? (21min) The ML jungle: theory, techniques, paradigms. Reading: LFD Slides: (full) (compact) Full video on youtube (39min)

16: Similarity and Nearest Neighbor. A five-year-old, if asked to classify an apple, would search their experience for things that look similar and classify accordingly. It is the simplest learning rule of all, yet it is very powerful. So, let us take a systematic look at such methods. Modules: (06min) Measuring similarity. (13min) Nearest-neighbor. Ein=0 (no link to Eout)! (22min) Nearest-neighbor is within 2 times optimal. (12min) K-Nearest Neighbor. Optimal Eout! (06min) Parametric vs. nonparametric. (07min) Multiclass, regression and logistic regression. Reading: LFD eChapter 6 (Similarity), § 6.2 Slides: (full) (compact) Full video on youtube (65min)
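The nearest-neighbor rule from lecture 16 fits in a few lines. The sketch below uses Euclidean distance as the similarity measure and a majority vote among the k closest training points. The function name, the fruit labels and the toy data are assumptions made for the example.

```python
import math
from collections import Counter


def k_nearest_neighbor(train_points, train_labels, query, k=1):
    """Classify `query` by majority vote among its k nearest training points."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature data with two well-separated classes.
toy_points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
toy_labels = ["apple", "apple", "apple", "orange", "orange", "orange"]
prediction = k_nearest_neighbor(toy_points, toy_labels, (0.5, 0.5), k=3)
```

With k=1 and distinct training points, every training point is its own nearest neighbor, which is why Ein=0 says nothing by itself about Eout.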

17: Efficiency of Nearest Neighbor. Nearest neighbor is heavy and slow: you need all the data, and you need it for every test classification. The 5-year-old does not store all horses. Can we make nearest neighbor more practical? We don't want to carry around all the data. We don't want to spend hours searching for a nearest neighbor. Modules: (10min) Memory and speed load of Nearest Neighbor. (11min) Data condensing. (24min) Condensed nearest neighbor (CNN). (19min) Search using branch and bound and clustering. (07min) Clustering and Lloyd's algorithm. Reading: LFD eChapter 6 (Similarity), § 6.2.3 Slides: (full) (compact) Full video on youtube (71min)
PAUSE. Reinforce and Practice: Assignment 10.
 

18: Radial Basis Functions (RBF). Nearest neighbor uses only some data to classify a test point. That seems an unnecessary limitation. We develop a "soft" extension of nearest neighbor that uses all data to classify. When we simplify this model to the RBF-network, we get our first nonlinear model. Modules: (25min) Nonparametric RBF: soft nearest neighbor. (27min) Parametric RBF: fixed bumps on data points. (12min) RBF-network: reducing the number of bumps. (15min) Fitting data: linear model + similarity features. Reading: LFD eChapter 6 (Similarity), § 6.3 Slides: (full) (compact) Full video on youtube (79min)

19: A Peek at Unsupervised Learning. Unsupervised learning made two appearances: to organize the data for nearest neighbor search and as the first step in RBF-learning. Let us make a very short digression to look at two simple but important unsupervised techniques for organizing data: k-means and Gaussian Mixture Models. Modules: (11min) The clustering problem. (18min) K-means clustering and Lloyd's algorithm. (06min) Probability density estimation: Parzen windows. (10min) Gaussian Mixture Model. (18min) EM-Algorithm for learning a GMM. Reading: LFD eChapter 6 (Similarity), § 6.4 Slides: (full) (compact) Full video on youtube (62min)
PAUSE. Reinforce and Practice: Assignment 11.
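Lloyd's algorithm for k-means alternates two steps: assign every point to its nearest center, then move each center to the mean of its assigned points. The sketch below is a minimal version with caller-supplied initial centers and a fixed number of iterations. Those two choices, the rule of keeping an empty cluster's center in place, and all names are assumptions made for the example.

```python
import math


def lloyd_kmeans(points, initial_centers, num_iterations=10):
    """Run Lloyd's algorithm and return the final list of cluster centers."""
    centers = [tuple(center) for center in initial_centers]
    for _ in range(num_iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for point in points:
            nearest = min(
                range(len(centers)), key=lambda j: math.dist(point, centers[j])
            )
            clusters[nearest].append(point)
        # Update step: each center moves to the mean of its cluster.
        new_centers = []
        for center, cluster in zip(centers, clusters):
            if cluster:
                mean = tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
                new_centers.append(mean)
            else:
                new_centers.append(center)  # empty cluster: leave center as is
        centers = new_centers
    return centers


# Hypothetical data with two obvious groups.
toy_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
toy_centers = lloyd_kmeans(toy_points, initial_centers=[(0, 0), (10, 10)])
```

The result depends on the initial centers, so in practice the algorithm is often run from several starting points.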
 

20: Multilayer (Deep) Perceptron. What would happen if you started from the fundamental simple linear perceptron and cascaded several of them together? You get awesome power. Such a generalization of the perceptron ultimately leads us to the neural network, a powerful biologically inspired model. Modules: (14min) A biography of deep networks. (27min) Multilayer perceptron. Cascaded linear models. (15min) Approximation: 3 layers can fit any target function. (18min) Notation for Deep Multilayer Networks. Reading: LFD eChapter 7 (Neural Networks), § 7.1 Slides: (full) (compact) Full video on youtube (74min)

21: Fitting Deep Networks to Data. Unlike the simple perceptron, we don't have an explicit formula for the deep network hypothesis or its derivative. We need an algorithm to compute the hypothesis. We also need to efficiently compute gradients and learn weights by using gradient descent to fit the data. Modules: (10min) Recap of neural network notation. (22min) Forward propagation: computing the output. (42min) Backward propagation: getting the gradient. (03min) Fitting a network to digits data. Or is it overfitting? Reading: LFD eChapter 7 (Neural Networks), § 7.2 Slides: (full) (compact) Full video on youtube (77min)

22: Deep Networks: Overfitting / Faster Fitting. The awesome power of deep networks leads to overfitting, and so this power must be reined in, or else we don't get the link to out-of-sample. We can minimize a regularized in-sample error, but it's not that easy anymore. We need better, more efficient tools for fitting/minimizing error measures. Can we beef up gradient descent? Modules: (23min) Deep network vs. RBF vs. nonlinear transforms. (19min) Generalization and VC-dimension of cascading. (13min) Regularization and early stopping. (18min) Beefing up gradient descent. (10min) Conjugate gradients. Reading: LFD eChapter 7 (Neural Networks), § 7.3-7.5 Slides: (full) (compact) Full video on youtube (82min)
PAUSE. Reinforce and Practice: Assignment 12, Problems 1,2.
 

23: Support Vector Machine. Many hyperplanes can fit the data equally well. Which one is the best? We move in the direction of robustness to noise in the inputs. A most robust linear model that maximizes the "margin" for error is robust to noise and has links to automatic regularization. Can such an optimal hyperplane be efficiently found? Modules: (09min) Robustness to input noise: fattest hyperplane. (22min) Geometry of hyperplanes. (20min) Finding the fattest maximum margin hyperplane. (20min) Quadratic programming. (06min) Comparing SVM with PLA. Reading: LFD eChapter 8 (SVM), § 8.1.1 Slides: (full) (compact) Full video on youtube (77min) 

24: SVM Overfits Less. The optimal (max. margin) hyperplane overfits less. We show evidence that its generalization does not explicitly depend on the dimension, opening up a world of possibilities. Modules: (18min) Large margin is better than small margin. (08min) Fat hyperplanes have smaller VC-dimension. (10min) Optimal hyperplane cross-validation error. (08min) SVM generalization not controlled by dimension. (17min) Nonseparable data: soft margin SVM. (06min) Nonseparable data: feature transforms. Reading: LFD eChapter 8 (SVM), § 8.1.2, 8.1.3 Slides: (full) (compact) Full video on youtube (68min)

25: SVM: The Kernel Trick. The support vector machine has the power to efficiently use nonlinear transforms without physically transforming to the nonlinear feature space. This allows us to learn in infinite dimensions! Modules: (52min) Deriving the dual for the optimal hyperplane. (12min) SVM dual version is an inner product algorithm. (05min) A kernel gives inner products in feature space. (09min) Kernel for polynomial features. (13min) Learning in infinite dimensions. Reading: LFD eChapter 8 (SVM), § 8.2-8.4 Slides: (full) (compact) Full video on youtube (91min)
PAUSE. Reinforce and Practice: Assignment 12, Problems 3,4,5.
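The statement in lecture 25 that a kernel gives inner products in feature space can be checked directly for a small case. For 2-d inputs, the second-order polynomial kernel K(x, x') = (1 + x.x')^2 equals the inner product of explicitly transformed 6-d feature vectors. The particular scaling of the transform (the factors of sqrt(2)), the function names and the example inputs are choices made for this illustration.

```python
import math


def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(x, x_prime):
    """Second-order polynomial kernel, computed in the original input space."""
    return (1.0 + inner_product(x, x_prime)) ** 2


def explicit_transform(x):
    """Map a 2-d input to the 6-d feature space matching the kernel above."""
    x1, x2 = x
    root2 = math.sqrt(2.0)
    return (1.0, root2 * x1, root2 * x2, x1 * x1, x2 * x2, root2 * x1 * x2)


# Hypothetical inputs.
a = (1.0, 2.0)
b = (3.0, -1.0)
kernel_value = polynomial_kernel(a, b)
feature_value = inner_product(explicit_transform(a), explicit_transform(b))
```

The kernel needs only one inner product in the input space, while the explicit route builds the full feature vector first. That gap is what makes very high (even infinite) dimensional feature spaces usable.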
 

26: SVM: Choosing A Kernel. The support vector machine with the kernel trick can simulate other methods. But, at the end of the day, the kernel measures similarity, which brings us full circle back to similarity-based methods. If you think about it, what else could there be? We'll touch on popular kernels, design choices and kernels in different applications. Modules: (10min) The polynomial and Gaussian (RBF) kernels. (07min) RBF-kernel is a fully automatic RBF-network. (09min) Tanh-kernel is a fully automatic neural network. (10min) The kernel computes a scaled similarity. (11min) Task dependence: strings, text, graphs, images. Reading: LFD eChapter 8 (SVM), § 8.2-8.4 Slides: (full) (compact) Full video on youtube (min)

27: Learning Aides & PCA. Typical models have default settings and expect the data to conform to these defaults. Learning aides help the learning by putting data in the desired state, and can be used with any learning technique. Beware: it is easy to data snoop your test set when using a learning aid. Modules: (10min) Data scales can affect learning outcome. (14min) Preprocess: center, normalize & whiten. (32min) Principal Components Analysis. (04min) Quick overview of other learning aides. Reading: LFD eChapter 9 (Learning Aides), § 9.1-9.4 Slides: (full) (compact) Full video on youtube (60min)

28: Model Blending & Reinforcement Learning. If you have multiple final hypotheses, must you pick one, or can you combine them? You can combine hypotheses either during learning or after the fact. We will end with some flavors of reinforcement learning. It is the predominant way animals learn, and that makes it an appealing paradigm. Modules: (26min) Boosting, Bagging and Blending. (15min) The reinforcement learning setting. (09min) Bandit problems: exploration vs. exploitation. (19min) Online decisions. Magdon's 1/e-theorem for dating. Reading: Slides: (full) (compact) Full video on youtube (68min)

Fall 2019 (solution). Fall 2018 (solution). Fall 2017 (solution). Fall 2016 (solution).