CSCI 6962/4962 Machine Learning and Optimization, Fall 2023


Machine learning and data science typically proceed by formulating the desired outcome as the solution to an optimization problem, then using suitable algorithms to solve these problems efficiently. Neural networks (aka deep learning) have become the dominant paradigm of modern machine learning, and their impressive successes are due to four factors:

As a second course in machine learning, this course focuses on the optimization algorithms used in machine learning and the neural architectures used in modern deep learning.

The first portion of the course introduces the optimization background necessary to understand the optimization algorithms that dominate applications of ML and large-scale optimization, and surveys several popular randomized and deterministic optimization algorithms, placing the emphasis on those widely used in ML applications.

The second portion of the course introduces architectures widely used in modern machine learning because of the proven effectiveness of their inductive biases, and presents common regularization techniques used to mitigate the issues that arise in solving the nonlinear optimization problems ubiquitous within modern machine learning.

The homeworks involve hands-on applications and empirical characterizations of the behavior of these algorithms and model architectures. A project gives the students experience in critically reading the research literature and crafting articulate technical presentations.

Course Logistics

The syllabus is available as an archival pdf, and is more authoritative than this website.

Instructor: Alex Gittens (gittea at rpi dot edu)

Lectures: TF 10am-11:50am ET in JEC 5119

Questions and Discussions: Piazza

Office Hours: T 8:30am-9:30am ET and F 12pm-1pm ET in Lally 316, or by appointment

TA: Dong Hu (hud3 at rpi dot edu)

TA Office Hours: M/Th 8:30-10:30am ET in Lally 09 (at the basement level)

Course Text: None

Grading Criteria:

CSCI 4962 CSCI 6962
  • Homeworks, 50%
  • Project, 35%
  • Weekly Participation, 15%
  • Homeworks, 50%
  • Project, 45%
  • Weekly Participation, 5%

Letter grades will be computed from the semester average. Lower-bound cutoffs for A, B, C and D grades are 90%, 80%, 70%, and 60%, respectively. These bounds may be moved lower at the instructor's discretion.

Lecture Schedule

  1. Lecture 1, August 29/2023. Course logistics; introduction to optimization; ordinary least squares, gradients and their computation; why gradient descent is not always the best optimization algorithm. Lecture notes. Code comparing gradient descent and conjugate gradient descent for solving linear systems.
  2. Lecture 2, September 1/2023. Examples of probabilistic ML systems: ChatGPT and MidJourney; ML is using data to learn models that make accurate predictions for a given task in a specified domain; Domains as probability spaces; Empirical Risk Minimization; Soft-margin support vector machines for binary classification. Lecture notes.
  3. Lecture 3, September 8/2023. PyTorch introductory example: autograd, fitting a hinge-loss l2-regularized SVM on FashionMNIST. Probability theory: named distributions; joint, conditional, and marginal distributions; (conditional) expectation and (conditional variance). Parameterized distributions and regression functions. Lecture notes. Jupyter notebook for PyTorch SVM fitting.
  4. Lecture 4, September 12/2023. Variance of sums of i.i.d random variables, weak law of large numbers; how LLN justifies the ERM principle; parameterized ML model; Gaussian noise model for regression problems; Bernoulli noise model for binary classification problems; Categorical noise model for multiclass classification problems. Lecture notes.
  5. Lecture 5, September 15/2023. Geometric interpretation of logits for Bernoulli and Categorical noise models. Maximum likelihood principle and maximum likelihood estimation (MLE) for fitting parametric models from data. Negative log-likelihood minimization is MLE. MLE for Gaussian noise model (OLS), Bernoulli noise model (logsitic regression), Categorical noise model (multiclass logistic regression). Lecture notes.
  6. Lecture 6, September 19/2023. MLE for multiclass logistic regression gives an ERM problem with cross-entropy loss. Train, validation, and test set splits. Linear algebra recap: vectors, independence, bases, inner and outer products, Cauchy-Schwarz inequality and angles and law of cosines and Pythagorean theorem, eigenvalue decompositions of symmetric matrices, PSD matrices. Lecture notes.
  7. Lecture 7, September 22/2023. Multivariate Taylor series review, iterative (oracle-based) optimization, nonconvex vs convex optimization, convex sets, examples of convex functions, convex functions. Lecture notes.
  8. Lecture 8, September 26/2023. Local minima of convex functions are global minina; strict and strong convexity; strictly convex functions have unique global minima; first and second-order characterizations of convex functions, and example of their use; operations that preserve convexity of functions; examples of convex optimization problems. Lecture notes.
  9. Lecture 9, September 29/2023. Projection onto convex sets. Optimality conditions (Fermat's condition) for unconstrained and constrained smooth convex optimization problems. Example: minimizing l2-regularized logsumexp. Lecture notes.
  10. Lecture 10, October 3/2023. Proof of the form of the projection onto the unit ball for l2 using Fermat's condition. Convex separation theorem, separating and supporting hyperplanes. Subdifferential and subgradients, and examples of their computation from the definitions. Extended value convex functions, convex indicator functions and their subdifferentials. Fermat's condition for general non-smooth convex optimization problems. Subdifferential calculus rules. Examples of computing subdifferentials using these rules. Lecture notes. Lecture video.
  11. Lecture 11, October 6/2023. The kernel trick and kernel machines for l2-regularized ERMs, via Fermat's condition. β-smoothness and its consequences. The descent lemma. Gradient descent for unconstrained convex problems and its convergence rate and iteration complexity for β-smooth convex objectives. Lecture notes. See also Chapter 3 of Bubeck's manuscript.
  12. Lecture 12, October 10/2023. Strong convexity. Hessian-based characterization of strong convexity and β-smoothness. Polyak-Lojasiewicz inequality. Linear convergence of Gradient Descent for β-smooth and strongly convex function; convex condition number. Convergent rates for Gradient Descent and corresponding stepsizes for different classes of smooth convex optimization problems. Subgradient descent algorithm for non-smooth convex optimization. Projected subgradient descent/gradient descent for constrained convex optimization. Lecture notes. If interested, see also Revisiting Polyak's step size.
  13. Lecture 13, October 13/2023. Newton's method, unguarded and damped versions. Damped phase and purely Newton phase. (Optional) analysis of the quadratic convergence during the purely Newton phase. Drawbacks to Newton's method: expensive per iteration. Drawbacks to gradient descent: expensive per iteration. Minibatch stochastic gradient descent, Randomized Reshuffling. Linear parameter convergence of SGD with constant stepsize applied to strongly convex functions, up to the noise level of the gradient approximation. Monotonicity of the gradient operator for convex functions. Lecture notes.
  14. Lecture 14, October 17/2023. Convergence rate of stochastic gradient descent on non-convex objectives. Step-size optimization a la Bottou. Tools for obtaining faster convergence of first-order methods: momentum, exponential averaging, preconditioning. Adaptive gradient descent methods: SGD with momentum, Nesterov's accelerated gradient descent, AdaGrad, RMSProp, ADAM. Numerical comparison of adaptive GD methods to standard GD. Lecture notes. Code comparing adaptive gradient descent methods to gradient descent. If you are interested, you can read Bottou's short paper on SGD: Stochastic Gradient Descent Tricks
  15. Lecture 15, October 20/2023. Motivations for using neural network models. Neural networks as directed acyclic computational graphs. Fully-connected feed-forward neural networks (FCFNNs), perceptron-type neurons, and multilayer perceptron neural networks (MLPs). Activation functions and common choices for them. Linear, logistic, and multiclass logistic regression as MLPs. XOR as an example of a problem that is not linearly separable, and example of using an MLP to generate nonlinear features under which the problem is linearly separable. Universal approximation theorem for non-polynomial activation functions. Vector notation for computing the outputs of MLPs. Lecture notes.
  16. Lecture 16, October 27/2023. Autoencoders (MLPs) as a nonlinear generalization of PCA/SVD. The chain rule for multivariate functions, and its application to backpropagation in MLPs. Lecture notes. Two-layer MLP for classifying FashionMNIST images, in PyTorch.
  17. Lecture 17, October 31/2023. Recap of backpropagation. Convolutional neural networks: convolutional filters, inductive biases, convolutional layers, multichannel convolutional layers, CNNs, pooling, LeNet5. Lecture notes. LeNet5 implementation in PyTorch, applied to FashionMNIST.
  18. Lecture 18, November 3/2023. Transposed convolutions and the U-net architecture for image to image problems, efficient convolutions as GEMM operations, sketch of backprop for convolutional layers. Issues with deep networks: overfitting, vanishing and exploding gradients, hyperparameter selection. GoogLeNet Inception v1 architecture, auxiliary classification losses. Lecture notes. References: U-Net paper, Inception v1 paper, overview of Inception architectures, Anatomy of a high-speed convolution, A guide to convolution arithmetic for deep learning.
  19. Lecture 19, November 7/2023. Batch and Layer Normalization for smoother propagation of gradient information; modified batch normalization for CNNs. Skip connections and ResNets for the same. Lecture notes. References: Batch Normalization paper, Layer Normalization paper, ResNets paper. Pytorch code illustrating impact of batch normalization.
  20. Lecture 20. November 10/2023. Dropout to prevent overfitting; interpretation as an ensembling method. RNNs for sequential data; design patterns for RNNs, backpropagation through time (BPTT), truncated BPTT. Lecture notes. The Dropout paper.
  21. Lecture 21. November 14/2023. NLP preprocessing: tokenization and special tokens, normalization, vocabulary generation and handling unknown tokens, embedding layers. Sequence classification using a many to one RNN architecture.Long-short term memory units (LSTMs). Bidirectional RNNs, Deep RNNs. Lecture notes. (TO COME) Python code for classifying AG_NEWS data set using an RNN.
  22. Lecture 22. November 17/2023. Sequence-to-Sequence modeling using encoder-decoder RNN architectures; teacher forcing; seq2seq modeling using RNNs with attention. Lecture notes. For more details, (on, e.g. beam decoding) see the original papers Sequence to Sequence Learning With Neural Networks and Neural Machine Translation by Jointly Learning to Align and Translate.
  23. Lecture 23. November 21/2023. Recap of sequence-to-sequence modeling using RNNs with general/cross- attention. Self-attention and multi-headed attention. The encoder-decoder Transformer architecture: encoder blocks using self-attention, decoder blocks using general attention and causal self-attention. Lecture notes. See the original paper Attention is all you need.
  24. Lecture 24. November 28/2023. Low data machine learning: data augmentation and transfer learning. Supervised training and full or partial fine-tuning. Self- and unsupervised pre-training and fine-tuning. Model reprogramming. BERT, an encoder-only language model. GPT, a decoder-only language model. See the original papers for BERT and GPT, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding and Improving Language Understanding by Generative Pre-Training.

Homeworks and Weekly Participation

Homework/Participation submission links: