CSCI 6961/4961 Machine Learning and Optimization, Fall 2021

Overview

Modern (i.e. large-scale, or “big data”) machine learning and data science typically proceed by formulating the desired outcome as the solution to an optimization problem, then using suitable algorithms to solve these problems efficiently.

The first portion of this course introduces the probability and optimization background necessary to understand the randomized algorithms that dominate applications of ML and large-scale optimization, and surveys several popular randomized and deterministic optimization algorithms, placing the emphasis on those widely used in ML applications.

The second portion of the course introduces architectures used in modern machine learning because of the proven effectiveness of their inductive biases, and presents common regularization techniques used to mitigate the issues that arise in solving the nonlinear optimization problems ubiquitous within modern machine learning.

The homeworks involve hands-on applications and empirical characterizations of the behavior of these algorithms and model architectures. A project gives the students experience in critically reading the research literature and crafting articulate technical presentations.

Course Logistics

The syllabus is available as an archival PDF, and is more authoritative than this website.

Instructor: Alex Gittens (gittea at rpi dot edu)

Lectures: MTh 10am-12pm ET (in person, Sage 3101)

Questions and Discussions: Campuswire

Office Hours: MTh 12pm-1pm ET in the #office-hours live chat on Campuswire, or by appointment

TA: Ian Bogle (boglei at rpi dot edu)

TA Office Hours: Tue/Wed 1-2pm ET

Course Text: None

Grading Criteria:

Letter grades will be computed from the semester average. Lower-bound cutoffs for the A, B, C, and D grades are 90%, 80%, 70%, and 60%, respectively. These cutoffs may be lowered at the instructor's discretion.

Lecture Schedule

  1. Monday, August 30. Lecture 1. Course logistics; what is ML; examples of ML models: k-NN and SVMs. lecture notes
  2. Thursday, September 2. Lecture 2. Probability theory for modeling and analysis in ML, and as an algorithmic tool in optimization. Ordinary least squares. Population and empirical risk minimization. Basic probability: sample spaces, pmfs/pdfs, probability measures, random variables, random vectors, expectations, examples, joint pmfs/pdfs. lecture notes
  3. Tuesday, September 7. Lecture 3. Joint random variables, marginals, conditional distributions, independence, conditional independence, Naive Bayes assumption for classification. lecture notes
  4. Thursday, September 9. Point Estimates: expectation and variance. Independence, expectation, and variance. Law of Large Numbers. Conditional expectation. Conditional expectation as a point estimate for least squares regression; the regression function. lecture notes
  5. Monday, September 13. Parameterized ML models. Generalized linear models: Poisson regression, Bernoulli regression (logistic regression), Categorical regression (multiclass logistic regression). Maximum likelihood estimation via minimization of negative log-likelihood. MLE for Gaussian model is ordinary least squares. lecture notes
  6. Thursday, September 16. Binary and multiclass logistic regression from a different viewpoint: geometric interpretations and linear separability, MLE for both leads to ERM, the logistic loss function, softmax, and logsumexp (a numerically stable logsumexp sketch appears after this schedule). lecture notes
  7. Monday, September 20. Train/validation/test splits. Regularization: l2/ridge regression, l1/LASSO (a closed-form ridge regression sketch appears after this schedule). Risk decomposition into approximation, generalization, and optimization errors. lecture notes
  8. Thursday, September 23. Convex sets. Convex functions, first and second-order characterizations. Examples of convex sets and functions. lecture notes
  9. Monday, September 27. Project details. Second-order characterization of convexity, application to logsumexp. Jensen's inequality. Operations that preserve convexity of functions. Examples of convex optimization problems. lecture notes
  10. Thursday, September 30. More examples of convex optimization problems. Cauchy-Schwarz inequality and its geometric interpretation as a generalized law of cosines. Optimality conditions for smooth unconstrained and constrained optimization. Examples of using the optimality conditions to solve some optimization problems. lecture notes
  11. Monday, October 4. Revisited the lecture notes from last time to see how the optimality conditions can be used to solve simple problems. Gradient Descent and Projected Gradient Descent, β-smoothness and the descent lemma (a minimal gradient descent sketch appears after this schedule). lecture notes. Visualization of gradient descent. Optional: see Chapter 3 of Bubeck for the full proof of convergence of (projected) gradient descent.
  12. Thursday, October 7. Subdifferentials and subgradients. Fermat's Optimality Condition. Rules for manipulating subdifferentials. Subgradient descent. lecture notes. Optional: see Chapter 3 of Beck for more on subgradients and plenty of examples (the chapter pdfs are accessible through the Get it @ RPI link if you're on campus).
  13. Thursday, October 14. Exact and backtracking line search. Steepest Descent Method. Newton's method. lecture notes. Optional: see sections 9.4 (steepest descent) and 9.5 (Newton's method) of Boyd and Vandenberghe.
  14. Monday, October 18. Stochastic subgradient descent (SGD). Minibatch SGD (a minibatch SGD sketch appears after this schedule). Convergence rate of SGD for strongly convex functions. Tips and tricks for SGD. lecture notes. Optional: see Bottou's notes on stochastic gradient descent for more tips on SGD.
  15. Thursday, October 21. Momentum, preconditioning, and adaptive (stochastic) gradient methods: heavy-ball method, Nesterov's Accelerated Gradient Descent, AdaGrad, RMSProp, ADAM. lecture notes. Implementations and comparisons of gradient descent, heavy-ball method, NAG, AdaGrad, RMSProp.
  16. Monday, October 25. Feature maps, kernels, reproducing kernel Hilbert spaces (RKHSes). The Representer Theorem for empirical risk minimization in an RKHS. Common kernels: polynomial, Gaussian, Laplacian. lecture notes. Example code for comparing the performance of kernel SVM and linear SVM on the UCI Adult data set, which requires the data files a9a.txt and a9a.t to be in the working directory. Optional: A recent tutorial and survey chapter on kernels, kernel methods, and some applications.
  17. Thursday, October 28. Neural networks: neurons, neural network architectures as DAGs, common activation functions, fully-connected feedforward NNs and MLPs. Expressivity of neural networks: universal approximation using two-layer MLPs, how all linear models are subsumed into MLPs. Vector notation for computing with and parametrizing MLPs. lecture notes. Optional: An overview of activation functions with some guidance on when to use which activation function.
  18. Monday, November 1. Nonconvex optimization: lack of global convergence guarantees, approximate stationarity. (Stochastic) first-order methods find approximately stationary points of smooth non-convex functions. The chain rule, and backpropagation for fully connected feedforward neural networks. lecture notes. Example code for training a (hand-written) two-layer MLP for FashionMNIST.
  19. Thursday, November 4. Backpropagation for MLPs, and ML frameworks. Autoencoders as nonlinear generalizations of the SVD. Using Pytorch to implement a two-layer NN for FashionMNIST. lecture notes. Example Pytorch 2-layer NN implementation.
  20. Monday, November 8. Convolutional Neural Networks: cross-correlation (convolution), locality and translational invariance as useful inductive biases (a direct cross-correlation sketch appears after this schedule). LeNet implementation in PyTorch. lecture notes. Example LeNet implementation.
  21. Thursday, November 11. Multi-channel convolutional neural networks. Backpropagation for CNNs. Issues with deep neural networks: overfitting, vanishing and exploding gradients, hyperparameter selection. Batch Normalization. lecture notes.
  22. Monday, November 15. Dropout regularization and model ensembling. Inception architecture. lecture notes. Optional references: the original dropout paper, and the original inception paper.
  23. Thursday, November 18. Recurrent neural networks (RNNs) for sequential data, common model configurations, backpropagation through time, an example NLP task (classifying news clips). lecture notes. Example AG_NEWS classification using an RNN.
  24. Monday, November 22. LSTMs (Long Short Term Memory) to handle the vanishing/exploding gradients issue exacerbated by BPTT, bidirectional RNNs, deep RNNs, sequence-to-sequence models, teacher forcing. lecture notes. Optional: see the original Sequence to Sequence paper.
  25. Monday, November 29. Seq2Seq models using attention to allow selective focus on relevant historical information. Dot-product attention. Start of transformers: replacing hidden state representations in encoders with contextual representations from self-attention. Multi-head self-attention, to capture different aspects of relevance between different elements in the sequence. Multi-head self-attention + dense layers for learning contextual representations. lecture notes. Optional: the original papers introducing the use of attention for soft alignment between input and output sequences, Neural Machine Translation by Jointly Learning to Align and Translate, and the transformer model, Attention Is All You Need.
  26. Thursday, December 2. Skip-connections and residual connections to address vanishing gradients. Layer normalization as an alternative to batch normalization that helps stabilize training and does not depend on minibatch size. Formulae for scaled dot-product self-attention and general attention (a scaled dot-product attention sketch appears after this schedule). The Transformer encoder-decoder architecture and positional encodings. lecture notes. Optional: the original Transformer paper, the example of coding the Transformer model from scratch in Python that we went through in class.
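
The short sketches below accompany individual lectures. They are illustrative only: they are not the course's own code, and all function names, step sizes, and test data are made up for demonstration. First, for Lecture 6, a minimal NumPy sketch of the numerically stable logsumexp trick, softmax, and the multiclass logistic (negative log-likelihood) loss built on it.

```python
import numpy as np

def logsumexp(z, axis=-1):
    """Numerically stable log(sum(exp(z))) along `axis`: shift by the max before exponentiating."""
    m = np.max(z, axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.sum(np.exp(z - m), axis=axis, keepdims=True)), axis=axis)

def softmax(z, axis=-1):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j), computed with the same max shift."""
    e = np.exp(z - np.max(z, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def multiclass_logistic_loss(W, X, y):
    """Average negative log-likelihood of integer labels y under the scores X @ W.

    X: (n, d) features, W: (d, k) weights, y: (n,) labels in {0, ..., k-1}.
    Uses the identity -log softmax(s)_y = logsumexp(s) - s_y."""
    scores = X @ W
    return np.mean(logsumexp(scores, axis=1) - scores[np.arange(len(y)), y])

# Tiny usage example on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
W = rng.normal(size=(3, 4))
y = rng.integers(0, 4, size=5)
print(multiclass_logistic_loss(W, X, y))      # a scalar loss value
print(softmax(X @ W, axis=1).sum(axis=1))     # each row of probabilities sums to 1
```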
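
For Lecture 7, a sketch of the closed-form ridge regression solution w = (XᵀX + λI)⁻¹Xᵀy; the data and λ values are arbitrary and only meant to show how the penalty shrinks the weights.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve min_w ||X w - y||^2 + lam * ||w||^2 via the regularized normal equations."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=50)

for lam in [0.0, 1.0, 100.0]:
    w = ridge_fit(X, y, lam)
    print(lam, np.linalg.norm(w))   # larger lam shrinks the fitted weights toward zero
```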
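
For Lecture 11, a minimal gradient descent sketch on a quadratic objective; the step size 1/β is the one suggested by the descent lemma for a β-smooth function, with β taken to be the largest eigenvalue of A.

```python
import numpy as np

def gradient_descent(grad, x0, step, iters=200):
    """Plain gradient descent: x_{t+1} = x_t - step * grad(x_t)."""
    x = x0.copy()
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# Quadratic objective f(x) = 0.5 * x^T A x - b^T x, minimized at x* = A^{-1} b.
rng = np.random.default_rng(2)
M = rng.normal(size=(5, 5))
A = M @ M.T + np.eye(5)               # symmetric positive definite
b = rng.normal(size=5)

beta = np.linalg.eigvalsh(A).max()    # smoothness constant of f
x_star = np.linalg.solve(A, b)
x_hat = gradient_descent(lambda x: A @ x - b, np.zeros(5), step=1.0 / beta)
print(np.linalg.norm(x_hat - x_star))  # small after enough iterations
```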
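
For Lecture 14, a sketch of minibatch SGD on a least-squares objective with a decaying step size; the batch size, step-size schedule, and data are illustrative choices, not prescriptions from the lecture.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=8, epochs=50, step0=0.05):
    """Minibatch SGD for min_w (1/n) ||X w - y||^2, with a slowly decaying step size."""
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(0)
    t = 0
    for _ in range(epochs):
        perm = rng.permutation(n)                      # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)   # stochastic gradient estimate
            t += 1
            w -= (step0 / (1 + 0.01 * t)) * grad
    return w

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 20))
w_true = rng.normal(size=20)
y = X @ w_true + 0.05 * rng.normal(size=500)
print(np.linalg.norm(minibatch_sgd(X, y) - w_true))   # small relative to ||w_true||
```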
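
For Lecture 20, a direct loop-based implementation of the 2D cross-correlation computed by convolutional layers (single channel, "valid" padding); real frameworks use much faster implementations, so this is only for building intuition.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """'Valid' 2D cross-correlation: slide the kernel over the image without flipping it."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge detector applied to a simple two-tone image.
image = np.zeros((6, 6))
image[:, 3:] = 1.0
kernel = np.array([[1.0, -1.0]])
print(cross_correlate2d(image, kernel))   # nonzero only in the column where the edge sits
```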
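
For Lectures 25 and 26, a NumPy sketch of scaled dot-product attention, softmax(QKᵀ/√d_k)V, and a single self-attention head built from it; the token count and embedding dimensions are arbitrary.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)     # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                               # each output is a weighted average of the values

def self_attention_head(X, Wq, Wk, Wv):
    """A single self-attention head: queries, keys, and values all come from X."""
    return scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)

rng = np.random.default_rng(4)
X = rng.normal(size=(7, 16))        # 7 tokens, 16-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
print(self_attention_head(X, Wq, Wk, Wv).shape)   # (7, 8): one contextual vector per token
```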

Homeworks and Weekly Participation

Late assignments will not be accepted unless you contact the instructor at least two days before the due date to receive a deferral. Deferrals are granted at the instructor's discretion.

Project

In teams of up to five, you will present either an original research project or an exposition on a topic relevant to the course. See the project page for more details and deadlines. Your group assignments will be posted to Campuswire.

Supplementary Materials

For your background reading, if you are unfamiliar with the linear algebra and probability being used:

If you want further reading on convexity and convex optimization:

As a reference for PyTorch, the machine learning framework we will use in the deep learning portion of the course: