# CSCI 4961/6961 Machine Learning and Optimization, Fall 2020

## Overview

Modern (i.e. large-scale, or “big data”) machine learning and data science typically proceed by formulating the desired outcome as the solution to an optimization problem, then applying randomized algorithms to solve these problems efficiently. This class introduces the probability and optimization background necessary to understand these randomized algorithms, and surveys several popular randomized algorithms, placing the emphasis on those widely used in ML applications. The homeworks will involve hands-on applications and empirical characterizations of the behavior of these algorithms.

## Course Logistics

The syllabus is available as an archival pdf, and is more authoritative than this website.

**Instructor**: Alex Gittens (`gittea` at `rpi` dot `edu`)

**Lectures (via Zoom)**: MTh 10am-12pm (email me if you have not yet received the Zoom invitation--- this may happen if you enrolled close to the start of the semester)

**Questions and Discussions**: Piazza

**Office Hours (use Submitty Instructor OH queue with code mloptoh)**: MTh 12pm-1pm ET, or by appointment

**TA**: Owen Xie (`xieo` at `rpi` dot `edut`)

**TA Office Hours (use Submitty TA OH queue with code mloptoh)**: T 9-10am ET, Th 4-5pm ET

** Course Text**: None

** Grading Criteria**:

- Homeworks, 50%
- Weekly Participation, 15%
- Project, 35%

Letter grades will be computed from the semester average. Lower-bound cutoffs for A, B, C and D grades are 90%, 80%, 70%, and 60%, respectively. These bounds may be moved lower at the instructor's discretion.

## Lecture Schedule

The recorded lectures are made available on the course's MediaSite channel shortly after lectures: https://mediasite.mms.rpi.edu/Mediasite5/Channel/mlandopt.

- Monday, August 31. Lecture 1. Course logistics, what is ML, k-nearest neighbors, SVMs. lecture notes.
- Thursday, September 3. Lecture 2. Finish SVMs, gradient vs stochastic first-order methods, probability distributions, discrete and continuous random variables. Lecture notes.
- Tuesday, September 8. Lecture 3. Mean and variance, random vectors, marginals, covariances, covariance matrix, multivariate Gaussians. Lecture notes.
- Thursday, September 10. Lecture 4. Independence and conditioning, Law of Large Numbers, Bayes' rule, Law of Total Probability, Law of Total Expectation, Tower Property, Von Neumann's algorithm for using a biased coin to simulate a fair one. Lecture notes.
- Monday, September 14. Lecture 5. Runtime of Von Neumann's algorithm, transformations of random variables, properties of multivariate gaussians, empirical risk minimization. Lecture notes.
- Thursday, September 17. Lecture 6. Regression losses; conditional mean as Bayes optimal estimator for squared loss; decomposition of risk of empirical risk minimizer into approximation error, generalization gap and optimization error; maximum likelihood estimation and logistic regression. Lecture notes.
- Monday, September 21. Lecture 7. Multinomial Logistic regression: logits, softmax, logsumexp. Regularization: overfitting, l2 and l1 regularization, regularized empirical risk minimization. Lecture notes. Scikit-learn code/example of overfitting and regularization.
- Thursday, September 24. Lecture 8. Confusion matrices, convex sets, convex functions, convex optimization, consequences of convexity for optimization, Jensen's inequality, first- and second-order characterizations of convexity, example: logsumexp is convex. Lecture notes.
- Monday, September 28. Lecture 9. Jensen's inequality (proof); examples of convex functions; operations that preserve function convexity; examples of convex sets; examples of convex optimization problems. Lecture notes.
- Thursday, October 1. Lecture 10. Project expectations and guidelines; optimality conditions for smooth and nonsmooth, constrained and unconstrained convex optimization; Cauchy-Schwarz and angles between vectors; gradient descent and projected gradient descent; subdifferentials and subgradients; examples of subdifferentials: the positive part and the absolute value function. Lecture notes.
- Monday, October 5. Lecture 11. Subdifferential of l1 norm; calculation rules for subdifferentials; subdifferential of l2 norm; subdifferential of l1-regularized SVM; computational example: error-correcting output coding vclassification using gradient descent. Lecture notes. Example of gradient descent for ECOCC on Fashion-MNIST.
- Thursday, October 8. Lecture 12. Stochastic gradient descent; stochastic subgradient descent; minibatches and epochs. Lecture notes
**(Updated since original posting)**. - Thursday, October 15. Lecture 13. Backtracking line search, method of steepest descent, Newton's method. Lecture notes.
- Monday, October 19. Lecture 14. Adaptive gradient descent methods: a motivating example where we benefit from feature-specific learning rates, SGD with momentum, Nesterov's Accelerated Gradient Descent, AdaGrad. Lecture notes. Example code comparing accelerated methods.
- Thursday, October 22. Lecture 15. RMSProp and ADAM. The kernel trick for nonlinear machine learning, with regression over quadratic features as an example. Lecture notes.
- Monday, October 26. Lecture 16. Polynomial and gaussian kernels; the finite dimensional kernel machine optimization problem; kernel regression and kernel logistic regression. Lecture notes.
- Thursday, October 29. Lecture 17. Kernel logistic regression example using heavyball method. Nystrom approximations. Random Feature Maps approximations. Lecture notes. Example code comparing linear, exact kernel, and approximate kernel regression on the adult dataset (see HW4 for this dataset).
- Monday, November 2. Lecture 18. Random feature maps for gaussian and polynomial kernels. Neural networks: neurons, fully connected neural networks, feedforward networks. Lecture notes. Example Keras code using a simple fully connected neural network for a synthetic regression problem.
- Thursday, November 5. Lecture 19. Activation functions, expressivity of neural networks, vector notation for MLPs, training MLPs via sgd, intuition behind backpropagation. Lecture notes.
- Monday, November 9. Lecture 20. The backpropagation algorithm. Autoencoders. Lecture notes. Autoencoder example in Keras.
- Thursday, November 12. Lecture 21. Convolutional neural networks: convolutions, pooling. Lecture notes. LeNet-5 example in Keras.
- Monday, November 16. Lecture 22. CNNs continued: working with multichannel images, backprop for CNNs, vanishing and exploding gradients. Lecture notes.
- Thursday, November 19. Lecture 23. Issues with DL: overfitting, vanishing/exploding gradients, hyperparameter selection. Dropout for preventing feature co-adaptation. Batch normalization to enable gradient training of deeper architectures. Lecture notes. Example of using pretrained Keras models.
- Monday, November 23. Lecture 24. Data augmentation. Inception architecture. Skip connections and residual blocks. ResNet-152. Lecture notes.
- Monday November 30. Lecture 25. Recursive Neural Networks for sequential data. Design patterns for RNNs, vanilla RNNs, architecture diagrams for RNNs, backpropagation through time (BTT), truncated BTT. Lecture notes.
- Monday December 7. Lecture 26. LSTM (Long Short Term Memories). Tokenization, encoding, and word embeddings for using neural networks for NLP tasks. Keras examples of FFNs, simple RNNs, and LSTMs for sentiment classification of IMDB reviews. Lecture notes. Example of sentiment determination using FFNs, simple RNNs, and LSTMs on the IMDB review data set.
- Thursday December 10. Lecture 27. Bidirectional and Deep RNNs. Sequence-to-Sequence learning with encoder-decoder RNN architectures. A recap of what we learned in the class, and some topics we didn't cover, to whet your interest. Lecture notes. Example of machine translation of Spanish to English, using character level sequence to sequence learning.

## Homeworks and Weekly Participation

**Homework and Weekly participation submission link**: pdf and python code only, 1MB limit

** Late assignments will not be accepted, unless you contact the instructor at least two days before the due date to receive a deferral**. Deferrals will be granted at the instructor’s discretion, of course.

- Weekly Participation 1. Posted 8/31/2020, due 9/4/2020.
- Weekly Participation 2. Posted 9/8/2020, due 9/11/2020.
- Weekly Participation 3. Posted 9/15/2020, due 9/18/2020.
- Weekly Participation 4. Posted 9/22/2020, due 9/25/2020.
- Weekly Participation 5. Posted 9/28/2020, due 10/2/2020.
- Weekly Participation 6. Posted 10/5/2020,
**extended to 10/15/2020 due to Internet outages**. - Weekly Participation 7. Posted 10/15/2020, due 10/16/2020.
- Weekly Participation 8. Posted 10/20/2020, due 10/23/2020.
- Weekly Participation 9. Posted 10/29/2020, due 11/6/2020.
- Weekly Participation 10. Posted 11/12/2020, due 11/20/2020.
- Homework 1. Posted 9/10/2020, due 9/17/2020.
- Homework 2. Posted 9/21/2020, due 9/28/2020.
- Homework 3. Posted 9/28/2020, due 10/5/2020.
- Homework 4. Posted 10/5/2020,
**extended to 10/15/2020 due to Internet outages**. - Homework 5. Posted 10/15/2020, due 10/26/2020.
- Homework 6. Posted 11/13/2020, due 11/23/2020.

## Project

In teams of up to five, you will present either an original research project or an exposition on a topic relevant to the course. See the project page for more details and deadlines. Your group assignments will be posted to Piazza.

## Supplementary Materials

For your background reading, if you are unfamiliar with the linear algebra and probability being used:- Introduction to Applied Linear Algebra: Vectors, Matrices, and Least Squares. Boyd and Vandenberghe.
- Jeff Erickson's notes on discrete probability. Erickson.
- Introduction to Probability, Statistics, and Random Processes. Pishro-Nik.
- Chapter 3 of "Deep Learning". Goodfellow, Bengio, and Courville.
- Chapter 1 of "Bayesian Reasoning and Machine Learning". Barber.

- Convexity and Optimization. Lecture notes by R. Tibshirani.
- Optimization for Machine Learning. Lecture notes by E. Hazan.
- Optimization Methods for Large-scale Machine Learning. SIAM Review article. Bottou, Curtis, and Nocedal.
- Theory of Convex Optimization for Machine Learning. Bubeck
- Convex Optimization. Boyd and Vandenberghe.

- Deep Learning with Python. Chollet.