CSCI 6962/4962 Machine Learning and Optimization, Fall 2023
Overview
Machine learning and data science typically proceed by formulating the desired outcome as the solution to an optimization problem, then using suitable algorithms to solve these problems efficiently. Neural networks (aka deep learning) have become the dominant paradigm of modern machine learning, and their impressive successes are due to four factors:
 the availability of massive training data sets,
 access to teraFLOPs of compute power,
 powerful general purpose stochastic optimization algorithms, and
 welldesigned neural architectures.
As a second course in machine learning, this course focuses on the optimization algorithms used in machine learning and the neural architectures used in modern deep learning.
The first portion of the course introduces the optimization background necessary to understand the optimization algorithms that dominate applications of ML and largescale optimization, and surveys several popular randomized and deterministic optimization algorithms, placing the emphasis on those widely used in ML applications.
The second portion of the course introduces architectures widely used in modern machine learning because of the proven effectiveness of their inductive biases, and presents common regularization techniques used to mitigate the issues that arise in solving the nonlinear optimization problems ubiquitous within modern machine learning.
The homeworks involve handson applications and empirical characterizations of the behavior of these algorithms and model architectures. A project gives the students experience in critically reading the research literature and crafting articulate technical presentations.
Course Logistics
The syllabus is available as an archival pdf, and is more authoritative than this website.
Instructor: Alex Gittens (gittea at rpi dot edu)
Lectures: TF 10am11:50am ET in JEC 5119
Questions and Discussions: Piazza
Office Hours: T 8:30am9:30am ET and F 12pm1pm ET in Lally 316, or by appointment
TA: Dong Hu (hud3 at rpi dot edu)
TA Office Hours: M/Th 8:3010:30am ET in Lally 09 (at the basement level)
Course Text: None
Grading Criteria:
CSCI 4962  CSCI 6962 



Letter grades will be computed from the semester average. Lowerbound cutoffs for A, B, C and D grades are 90%, 80%, 70%, and 60%, respectively. These bounds may be moved lower at the instructor's discretion.
Lecture Schedule
 Lecture 1, August 29/2023. Course logistics; introduction to optimization; ordinary least squares, gradients and their computation; why gradient descent is not always the best optimization algorithm. Lecture notes. Code comparing gradient descent and conjugate gradient descent for solving linear systems.
 Lecture 2, September 1/2023. Examples of probabilistic ML systems: ChatGPT and MidJourney; ML is using data to learn models that make accurate predictions for a given task in a specified domain; Domains as probability spaces; Empirical Risk Minimization; Softmargin support vector machines for binary classification. Lecture notes.
 Lecture 3, September 8/2023. PyTorch introductory example: autograd, fitting a hingeloss l2regularized SVM on FashionMNIST. Probability theory: named distributions; joint, conditional, and marginal distributions; (conditional) expectation and (conditional variance). Parameterized distributions and regression functions. Lecture notes. Jupyter notebook for PyTorch SVM fitting.
 Lecture 4, September 12/2023. Variance of sums of i.i.d random variables, weak law of large numbers; how LLN justifies the ERM principle; parameterized ML model; Gaussian noise model for regression problems; Bernoulli noise model for binary classification problems; Categorical noise model for multiclass classification problems. Lecture notes.
 Lecture 5, September 15/2023. Geometric interpretation of logits for Bernoulli and Categorical noise models. Maximum likelihood principle and maximum likelihood estimation (MLE) for fitting parametric models from data. Negative loglikelihood minimization is MLE. MLE for Gaussian noise model (OLS), Bernoulli noise model (logsitic regression), Categorical noise model (multiclass logistic regression). Lecture notes.
 Lecture 6, September 19/2023. MLE for multiclass logistic regression gives an ERM problem with crossentropy loss. Train, validation, and test set splits. Linear algebra recap: vectors, independence, bases, inner and outer products, CauchySchwarz inequality and angles and law of cosines and Pythagorean theorem, eigenvalue decompositions of symmetric matrices, PSD matrices. Lecture notes.
 Lecture 7, September 22/2023. Multivariate Taylor series review, iterative (oraclebased) optimization, nonconvex vs convex optimization, convex sets, examples of convex functions, convex functions. Lecture notes.
 Lecture 8, September 26/2023. Local minima of convex functions are global minina; strict and strong convexity; strictly convex functions have unique global minima; first and secondorder characterizations of convex functions, and example of their use; operations that preserve convexity of functions; examples of convex optimization problems. Lecture notes.
 Lecture 9, September 29/2023. Projection onto convex sets. Optimality conditions (Fermat's condition) for unconstrained and constrained smooth convex optimization problems. Example: minimizing l2regularized logsumexp. Lecture notes.
 Lecture 10, October 3/2023. Proof of the form of the projection onto the unit ball for l2 using Fermat's condition. Convex separation theorem, separating and supporting hyperplanes. Subdifferential and subgradients, and examples of their computation from the definitions. Extended value convex functions, convex indicator functions and their subdifferentials. Fermat's condition for general nonsmooth convex optimization problems. Subdifferential calculus rules. Examples of computing subdifferentials using these rules. Lecture notes. Lecture video.
 Lecture 11, October 6/2023. The kernel trick and kernel machines for l2regularized ERMs, via Fermat's condition. βsmoothness and its consequences. The descent lemma. Gradient descent for unconstrained convex problems and its convergence rate and iteration complexity for βsmooth convex objectives. Lecture notes. See also Chapter 3 of Bubeck's manuscript.
 Lecture 12, October 10/2023. Strong convexity. Hessianbased characterization of strong convexity and βsmoothness. PolyakLojasiewicz inequality. Linear convergence of Gradient Descent for βsmooth and strongly convex function; convex condition number. Convergent rates for Gradient Descent and corresponding stepsizes for different classes of smooth convex optimization problems. Subgradient descent algorithm for nonsmooth convex optimization. Projected subgradient descent/gradient descent for constrained convex optimization. Lecture notes. If interested, see also Revisiting Polyak's step size.
 Lecture 13, October 13/2023. Newton's method, unguarded and damped versions. Damped phase and purely Newton phase. (Optional) analysis of the quadratic convergence during the purely Newton phase. Drawbacks to Newton's method: expensive per iteration. Drawbacks to gradient descent: expensive per iteration. Minibatch stochastic gradient descent, Randomized Reshuffling. Linear parameter convergence of SGD with constant stepsize applied to strongly convex functions, up to the noise level of the gradient approximation. Monotonicity of the gradient operator for convex functions. Lecture notes.
 Lecture 14, October 17/2023. Convergence rate of stochastic gradient descent on nonconvex objectives. Stepsize optimization a la Bottou. Tools for obtaining faster convergence of firstorder methods: momentum, exponential averaging, preconditioning. Adaptive gradient descent methods: SGD with momentum, Nesterov's accelerated gradient descent, AdaGrad, RMSProp, ADAM. Numerical comparison of adaptive GD methods to standard GD. Lecture notes. Code comparing adaptive gradient descent methods to gradient descent. If you are interested, you can read Bottou's short paper on SGD: Stochastic Gradient Descent Tricks
 Lecture 15, October 20/2023. Motivations for using neural network models. Neural networks as directed acyclic computational graphs. Fullyconnected feedforward neural networks (FCFNNs), perceptrontype neurons, and multilayer perceptron neural networks (MLPs). Activation functions and common choices for them. Linear, logistic, and multiclass logistic regression as MLPs. XOR as an example of a problem that is not linearly separable, and example of using an MLP to generate nonlinear features under which the problem is linearly separable. Universal approximation theorem for nonpolynomial activation functions. Vector notation for computing the outputs of MLPs. Lecture notes.
 Lecture 16, October 27/2023. Autoencoders (MLPs) as a nonlinear generalization of PCA/SVD. The chain rule for multivariate functions, and its application to backpropagation in MLPs. Lecture notes. Twolayer MLP for classifying FashionMNIST images, in PyTorch.
 Lecture 17, October 31/2023. Recap of backpropagation. Convolutional neural networks: convolutional filters, inductive biases, convolutional layers, multichannel convolutional layers, CNNs, pooling, LeNet5. Lecture notes. LeNet5 implementation in PyTorch, applied to FashionMNIST.
 Lecture 18, November 3/2023. Transposed convolutions and the Unet architecture for image to image problems, efficient convolutions as GEMM operations, sketch of backprop for convolutional layers. Issues with deep networks: overfitting, vanishing and exploding gradients, hyperparameter selection. GoogLeNet Inception v1 architecture, auxiliary classification losses. Lecture notes. References: UNet paper, Inception v1 paper, overview of Inception architectures, Anatomy of a highspeed convolution, A guide to convolution arithmetic for deep learning.
 Lecture 19, November 7/2023. Batch and Layer Normalization for smoother propagation of gradient information; modified batch normalization for CNNs. Skip connections and ResNets for the same. Lecture notes. References: Batch Normalization paper, Layer Normalization paper, ResNets paper. Pytorch code illustrating impact of batch normalization.
 Lecture 20. November 10/2023. Dropout to prevent overfitting; interpretation as an ensembling method. RNNs for sequential data; design patterns for RNNs, backpropagation through time (BPTT), truncated BPTT. Lecture notes. The Dropout paper.
 Lecture 21. November 14/2023. NLP preprocessing: tokenization and special tokens, normalization, vocabulary generation and handling unknown tokens, embedding layers. Sequence classification using a many to one RNN architecture.Longshort term memory units (LSTMs). Bidirectional RNNs, Deep RNNs. Lecture notes. (TO COME) Python code for classifying AG_NEWS data set using an RNN.
 Lecture 22. November 17/2023. SequencetoSequence modeling using encoderdecoder RNN architectures; teacher forcing; seq2seq modeling using RNNs with attention. Lecture notes. For more details, (on, e.g. beam decoding) see the original papers Sequence to Sequence Learning With Neural Networks and Neural Machine Translation by Jointly Learning to Align and Translate.
 Lecture 23. November 21/2023. Recap of sequencetosequence modeling using RNNs with general/cross attention. Selfattention and multiheaded attention. The encoderdecoder Transformer architecture: encoder blocks using selfattention, decoder blocks using general attention and causal selfattention. Lecture notes. See the original paper Attention is all you need.
 Lecture 24. November 28/2023. Low data machine learning: data augmentation and transfer learning. Supervised training and full or partial finetuning. Self and unsupervised pretraining and finetuning. Model reprogramming. BERT, an encoderonly language model. GPT, a decoderonly language model. See the original papers for BERT and GPT, BERT: Pretraining of Deep Bidirectional Transformers for Language Understanding and Improving Language Understanding by Generative PreTraining.
Homeworks and Weekly Participation
Homework/Participation submission links: CSCI 4962: pdfs and Jupyter notebooks only, 5MB limit
 CSCI 6962: pdfs and Jupyter notebooks only, 5MB limit
 Weekly Participation 1. Due 9/8/2023.
 Weekly Participation 2. Due 9/15/2023.
 Weekly Participation 3. Due 9/22/2023.
 Weekly Participation 4. Due 10/3/2023.
 Weekly Participation 5. Due 10/10/2023.
 Weekly Participation 6. Due 10/20/2023.
 Weekly Participation 7. Due 10/27/2023.
 Weekly Participation 8. Due 11/14/2023.
 Homework 1. Assigned Tuesday September 12, 2023. Due by 11:59pm ET Friday September 22, 2023.
 Homework 2. Assigned Tuesday October 3, 2023. Due by 11:59pm ET Tuesday October 17, 2023.
 Homework 3. Assigned Tuesday October 17, 2023. Due by 11:59pm ET Tuesday October 31, 2023.
 Homework 4. Assigned Tuesday October 31, 2023. Due by 11:59pm ET Tuesday November 14, 2023.
 Homework 5. Everyone got credit
 Homework 6. Assigned Monday December 4, 2023. Due by 11:59pm ET Friday December 8, 2023. This is an optional assignment: if you choose to submit, it will be used in computing your grade average over 6 assignments; if not, it will not be used in computing your grade average over 5 homework assignments.
 Introduction to Applied Linear Algebra: Vectors, Matrices, and Least Squares. Boyd and Vandenberghe.
 Jeff Erickson's notes on discrete probability. Erickson.
 Introduction to Probability, Statistics, and Random Processes. PishroNik.
 Chapter 3 of "Deep Learning". Goodfellow, Bengio, and Courville.
 Chapter 1 of "Bayesian Reasoning and Machine Learning". Barber.
 Convexity and Optimization. Lecture notes by R. Tibshirani.
 Optimization for Machine Learning. Lecture notes by E. Hazan.
 Optimization Methods for Largescale Machine Learning. SIAM Review article. Bottou, Curtis, and Nocedal.
 Convex Optimization: Algorithms and Complexity. Bubeck
 Convex Optimization. Boyd and Vandenberghe.
Late assignments will not be accepted, unless you contact the instructor at least two days before the due date to receive a deferral. Deferrals will be granted at the instructorâ€™s discretion, of course.
Project
In teams of up to five, you will present either an original research project or an exposition on a topic relevant to the course. See the project page for more details and deadlines. Your group assignments will be posted to Piazza.