Modern (i.e. large-scale, or “big data”) machine learning and data science typically proceed by formulating the desired outcome as the solution to an optimization problem, then using suitable algorithms to solve these problems efficiently.
The first portion of this course introduces the probability and optimization background necessary to understand the randomized algorithms that dominate applications of ML and large-scale optimization, and surveys several popular randomized and deterministic optimization algorithms, placing the emphasis on those widely used in ML applications.
The second portion of the course introduces architectures used in modern machine learning because of the proven effectiveness of their inductive biases, and presents common regularization techniques used to mitigate the issues that arise in solving the nonlinear optimization problems ubiquitous within modern machine learning.
The homeworks involve hands-on applications and empirical characterizations of the behavior of these algorithms and model architectures. A project gives the students experience in critically reading the research literature and crafting articulate technical presentations.
The syllabus is available as an archival pdf, and is more authoritative than this website.
Instructor: Alex Gittens (gittea at rpi dot edu)
Lectures: TF 2pm-3:50pm ET in Troy 2012
Questions and Discussions: Piazza
Office Hours: Tues 4:30pm-5:30pm ET and Thurs 3pm-4pm ET in Lally 316, or by appointment
TA: Jesse Ellin (ellinj2 at rpi dot edu)
TA Office Hours: Wed 2-4pm ET in AE 118
Course Text: None
Course sections: CSCI 4968 and CSCI 6968.
Letter grades will be computed from the semester average. Lower-bound cutoffs for A, B, C and D grades are 90%, 80%, 70%, and 60%, respectively. These bounds may be moved lower at the instructor's discretion.
- Lecture 1, January 10/2023. Course logistics; introduction to optimization and machine learning; support vector machines and ordinary least squares. Lecture notes.
- Lecture 2, January 13/2023. Recap of probability: pdfs, expectation, variance, independence, conditioning, marginalization; parameterized probability distributions. Lecture notes.
- Lecture 3, January 17/2023. Parameterized ML models: Gaussian noise model for regression, Bernoulli noise model for binary classification, Categorical noise model for multiclass classification; Maximum likelihood estimation (MLE) for fitting the first two of these models: leading to OLS, then binary logistic regression. Lecture notes.
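The MLE fit for the Bernoulli noise model can be sketched numerically; below is a minimal numpy example of fitting binary logistic regression by gradient ascent on the average log-likelihood. The synthetic data, weights, step size, and iteration count are all illustrative choices, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for the Bernoulli noise model (all values here are illustrative)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])          # hypothetical "true" weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = (rng.uniform(size=n) < sigmoid(X @ w_true)).astype(float)

# MLE by gradient ascent on the average log-likelihood:
# grad = (1/n) * X^T (y - sigmoid(Xw))
w = np.zeros(d)
lr = 0.1
for _ in range(500):
    w += lr * X.T @ (y - sigmoid(X @ w)) / n
```

Maximizing the Bernoulli log-likelihood this way is exactly binary logistic regression; the fitted `w` should classify most of the training points correctly.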
- Lecture 4, January 20/2023. MLE for fitting the categorical (multiclass) model; KL-divergence and cross-entropy. Training, test, and validation data set splits. Lecture notes.
- Lecture 5, January 24/2023. Maximum A Posteriori (MAP) estimation for machine learning models. Regularization (ℓ1/2) and Regularized Empirical Risk Minimization. Risk Decomposition: approximation, generalization, and optimization errors. Lecture notes.
- Lecture 6, January 27/2023. Review of Taylor series. Oracle models of optimization. Convex sets, functions, and optimization problems. Strict convexity. Uniqueness of minimizers of strictly convex functions. Lecture notes.
- Lecture 7, January 31/2023. First and second order characterizations of convex functions. Positive semidefinite matrices. Examples of convex functions. Operations that preserve convexity. Examples of convex optimization problems. Jensen's inequality. Lecture notes.
- Lecture 8, February 3/2023. Projection onto a convex set. The Cauchy-Schwarz inequality and geometric interpretation. Optimality conditions for smooth convex optimization problem, unconstrained and constrained. Lecture notes.
- Lecture 9, February 7/2023. Example of using the smooth optimality condition for constrained optimization to find the formula for the projection onto the unit Euclidean ball. Subgradients, subdifferentials, and Fermat's condition for nonsmooth convex optimization problems. Extended value convex functions and convex indicator functions. Subgradient calculus rules. Subgradient of the l1-regularized svm objective function. Lecture notes.
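The projection formula derived in this lecture, for the unit Euclidean ball, is simple to express in code; a minimal numpy sketch:

```python
import numpy as np

def project_unit_ball(x):
    """Euclidean projection onto {z : ||z||_2 <= 1}: the smooth optimality
    condition gives x itself inside the ball, and x / ||x|| (rescaling to
    the boundary) outside it."""
    nrm = np.linalg.norm(x)
    return x if nrm <= 1.0 else x / nrm

print(project_unit_ball(np.array([3.0, 4.0])))  # rescaled to norm 1: [0.6 0.8]
print(project_unit_ball(np.array([0.3, 0.4])))  # already inside: unchanged
```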
- Lecture 10, February 10/2023. Gradient descent. Formulations of approximate convergence. Smoothness of convex functions and the descent lemma. Projected gradient descent. Projected subgradient descent and Polyak averaging. Visualization of gradient descent. Lecture notes.
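Projected gradient descent alternates a gradient step with a projection back onto the constraint set. A minimal sketch, minimizing a least-squares objective over the unit Euclidean ball (the problem instance and iteration budget are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)

def f(x):
    return 0.5 * np.sum((A @ x - b) ** 2)

def project(x):
    # Euclidean projection onto the unit ball
    nrm = np.linalg.norm(x)
    return x if nrm <= 1.0 else x / nrm

# Projected gradient descent: gradient step, then project onto the set.
# Step size 1/L, where L = ||A||_2^2 is the smoothness constant of f.
L = np.linalg.norm(A, 2) ** 2
x = np.zeros(5)
for _ in range(300):
    x = project(x - A.T @ (A @ x - b) / L)
```

Every iterate stays feasible because the projection is applied after each gradient step.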
- Lecture 11, February 14/2023. Strong convexity, convex condition numbers. Polyak-Lojasiewicz (PL) inequality. Gradient Descent converges linearly for strongly convex functions. Gradient Descent convergence rates for different classes of convex functions. Lecture notes.
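Linear convergence on strongly convex functions is easy to observe on a quadratic; in this sketch the error shrinks by a constant factor per iteration (the quadratic and step count are illustrative):

```python
import numpy as np

# f(x) = 0.5 x^T H x is mu-strongly convex and L-smooth with
# mu = 1 and L = 10 (the eigenvalues of H), so the condition number is 10.
H = np.diag([1.0, 10.0])
L = 10.0
x = np.array([1.0, 1.0])
errs = []
for _ in range(40):
    errs.append(np.linalg.norm(x))
    x = x - (H @ x) / L          # gradient descent with step size 1/L

# Linear (geometric) convergence: the distance to the minimizer shrinks
# by a constant factor < 1 at every step.
ratios = [errs[k + 1] / errs[k] for k in range(len(errs) - 1)]
```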
- Lecture 12, February 17/2023. Newton's method and the importance of modeling curvature. Stochastic gradient descent, minibatch stochastic gradient descent, convergence of SGD for non-convex functions with various step size choices. Lecture notes.
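Minibatch SGD replaces the full gradient with an unbiased estimate computed from a random subset of the data. A minimal least-squares sketch (data, batch size, and step size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)   # small label noise

# Minibatch SGD on f(w) = (1/n) sum_i (x_i^T w - y_i)^2: each step uses
# an unbiased gradient estimate computed from a random minibatch.
w = np.zeros(d)
batch, lr = 32, 0.05
for _ in range(2000):
    idx = rng.choice(n, size=batch, replace=False)
    Xb, yb = X[idx], y[idx]
    w -= lr * 2.0 * Xb.T @ (Xb @ w - yb) / batch
```

Each step costs O(batch * d) rather than O(n * d), which is the point of SGD at scale.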
- Lecture 13. February 24/2023. Practical stepsize selection for SGD. Drawbacks of SGD. Adaptive gradient descent methods: Polyak heavy-ball method, Nesterov's accelerated gradient descent, AdaGrad. Lecture notes. Code comparing adaptive gradient descent methods.
- Guest Lecture. February 28/2023. Adversarial Robustness of Neural Networks. Lecture notes.
- Guest Lecture. March 3/2023. Sketching for the solution of linear systems; the Kaczmarz method. (Lecture notes unavailable)
- Lecture 16. March 14/2023. Adaptive gradient descent: RMSProp and Adam. Neural networks and motivation for modern NN usage. Fully connected feedforward neural networks (FCFFNs), perceptron-type neurons, activation functions, expressivity of neural networks, vector notation for multilayer perceptron networks (MLPs). Lecture notes.
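The Adam update discussed here combines RMSProp-style second-moment scaling with a momentum-like first moment, both bias-corrected. A minimal sketch of a single update step (the default hyperparameters follow the Adam paper; the toy usage below is illustrative):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: RMSProp-style scaling by a running average of
    squared gradients (v), plus a momentum-like running average of the
    gradient itself (m), both bias-corrected."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)           # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)           # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage: minimize f(w) = 0.5 ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 3001):
    w, m, v = adam_step(w, w, m, v, t, lr=0.01)
```

Setting b1 = 0 recovers (bias-corrected) RMSProp, which makes the relationship between the two methods concrete.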
- Lecture 17. March 17/2023. Autoencoders, backpropagation, backpropagation for MLPs. Lecture notes (modified from in class). Two-layer MLP for classifying FashionMNIST images, in PyTorch.
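Backpropagation for an MLP is the chain rule applied layer by layer, reusing forward-pass activations. A minimal numpy sketch for a tiny two-layer network, checked against a finite difference (the network sizes and data are illustrative, not the in-class example):

```python
import numpy as np

rng = np.random.default_rng(3)

# Tiny MLP: h = ReLU(W1 x), out = W2 h, squared loss against target y_t.
x, y_t = rng.normal(size=4), rng.normal(size=2)
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(2, 3))

def loss(W1, W2):
    return 0.5 * np.sum((W2 @ np.maximum(W1 @ x, 0.0) - y_t) ** 2)

# Backpropagation: apply the chain rule layer by layer, reusing the
# activations computed in the forward pass.
z = W1 @ x
h = np.maximum(z, 0.0)
d_out = W2 @ h - y_t          # dL/d(out) for the squared loss
gW2 = np.outer(d_out, h)      # dL/dW2
d_h = W2.T @ d_out            # backprop through the second linear layer
d_z = d_h * (z > 0)           # backprop through the ReLU
gW1 = np.outer(d_z, x)        # dL/dW1

# Sanity check against a finite difference on one entry of W1
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
fd = (loss(W1p, W2) - loss(W1, W2)) / eps
```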
- Lecture 18. March 21/2023. Convolutional neural networks: convolutional filters, inductive biases, convolutional layers, multichannel convolutional layers, CNNs, pooling, LeNet5. Lecture notes. LeNet5 implementation in PyTorch, applied to FashionMNIST.
- Lecture 19. March 24/2023. Transposed convolutions and the U-net architecture for image to image problems, efficient convolutions as GEMM operations, sketch of backprop for convolutional layers. Issues with deep networks: overfitting, vanishing and exploding gradients, hyperparameter selection. GoogLeNet Inception v1 architecture, auxiliary classification losses. Lecture notes. References: U-Net paper, Inception v1 paper, overview of Inception architectures.
- Lecture 20. March 28/2023. Batch and Layer Normalization for smoother propagation of gradient information; modified batch normalization for CNNs. Skip connections and ResNets for the same. Lecture notes. References: Batch Normalization paper, Layer Normalization paper, ResNets paper. Pytorch code illustrating impact of batch normalization.
- Lecture 21. March 31/2023. Dropout to prevent overfitting; interpretation as an ensembling method. RNNs for sequential data; design patterns for RNNs, backpropagation through time (BPTT), truncated BPTT. Lecture notes.
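The inverted-dropout formulation makes the train/test distinction from this lecture concrete: scaling the surviving activations during training keeps their expectation unchanged, so test time needs no correction. A minimal numpy sketch:

```python
import numpy as np

def dropout(h, p, rng, train=True):
    """Inverted dropout: during training each activation is zeroed with
    probability p and the survivors are scaled by 1/(1-p), so the
    expected activation is unchanged; at test time it is the identity."""
    if not train:
        return h
    mask = (rng.uniform(size=h.shape) >= p) / (1.0 - p)
    return h * mask

rng = np.random.default_rng(4)
h = np.ones(100_000)
out = dropout(h, p=0.5, rng=rng)
# About half the entries are zeroed and the rest doubled,
# so the mean stays near 1.
```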
- Lecture 22. Apr 4/2023. NLP preprocessing: tokenization and special tokens, normalization, vocabulary generation and handling unknown tokens, embedding layers. Sequence classification using a many-to-one RNN architecture. Long short-term memory units (LSTMs). Bidirectional RNNs, deep RNNs. Lecture notes. Python code for classifying the AG_NEWS data set using an RNN.
- Lecture 23. Apr 7/2023. Sequence-to-sequence modeling using encoder-decoder RNN architectures; teacher forcing; seq2seq modeling using RNNs with attention. Lecture notes. For more details (e.g., on beam decoding), see the original papers Sequence to Sequence Learning With Neural Networks and Neural Machine Translation by Jointly Learning to Align and Translate.
Homeworks and Weekly Participation
Homework/Participation submission link: pdf and python code only, 1MB limit
Late assignments will not be accepted unless you contact the instructor at least two days before the due date to request a deferral. Deferrals are granted at the instructor's discretion.
- Weekly Participation 1.
- Weekly Participation 2.
- Weekly Participation 3.
- Weekly Participation 4.
- Weekly Participation 5.
- Weekly Participation 6.
- Weekly Participation 7. To be posted.
- Weekly Participation 8. To be posted.
- Weekly Participation 9. To be posted.
In teams of up to five, you will present either an original research project or an exposition on a topic relevant to the course. See the project page for more details and deadlines. Your group assignments will be posted to Piazza.
Supplementary Materials
For your background reading, if you are unfamiliar with the linear algebra and probability being used:
- Introduction to Applied Linear Algebra: Vectors, Matrices, and Least Squares. Boyd and Vandenberghe.
- Jeff Erickson's notes on discrete probability. Erickson.
- Introduction to Probability, Statistics, and Random Processes. Pishro-Nik.
- Chapter 3 of "Deep Learning". Goodfellow, Bengio, and Courville.
- Chapter 1 of "Bayesian Reasoning and Machine Learning". Barber.
- Convexity and Optimization. Lecture notes by R. Tibshirani.
- Optimization for Machine Learning. Lecture notes by E. Hazan.
- Optimization Methods for Large-scale Machine Learning. SIAM Review article. Bottou, Curtis, and Nocedal.
- Theory of Convex Optimization for Machine Learning. Bubeck.
- Convex Optimization. Boyd and Vandenberghe.