# CSCI 6971/4971 Large Scale Matrix Computations and Machine Learning, Spring 2019

## Overview

Modern machine learning routinely deals with millions of points in high-dimensional spaces. Classical linear algebra and optimization algorithms can be prohibitively costly in such applications, as they aim at machine precision and/or scale super-linearly in the size of the input data. Randomization can be used to bring the costs of machine learning algorithms closer to linear in the size of the input data; this is done by sacrificing, in a principled manner, computational accuracy for increased speed. This course surveys modern randomized algorithms and their applications to machine learning, with the goal of providing a solid foundation for the use of randomization in large-scale machine learning.

Topics covered will include time-accuracy tradeoffs, stochastic first-order and second-order methods, applications of low-rank approximation, approximate kernel learning, distributed optimization, and hyperparameter optimization.

## Course Information

The syllabus is available as an archival pdf, and is more authoritative than this website.

** Course Text**: Lecture notes (the ones you scribe for yourself).

** Grading Criteria**:

- Homeworks, 50%
- In-class Pop Quizzes, 20%
- Project, 30%

Students are expected to have writing supplies on hand in each class to complete the in-class pop quizzes. If you are an athlete or for some other reason are not able to attend each class, make alternative arrangements with the instructor in the first two weeks of the course.

Letter grades will be computed from the semester average. Maximum lower bound cutoffs for A, B, C and D grades are 90%, 80%, 70%, and 60%, respectively. These bounds may be moved lower at the instructor's discretion.

## Topics and Schedule

- Lecture 1, Thursday 1/10: course logistics, large-scale considerations in ML, supervised and unsupervised learning, approximate nearest neighbors, hinge-loss linear classification
- Lecture 2, Monday 1/14: risk minimization, empirical risk minimization, least squares loss and ordinary least squares regression. probability recaps: joint distributions, expectations, conditional and marginal distributions.
- Lecture 3, Thursday 1/17: regularized empirical risk minimization, l2- and l1- regularization, lasso, logistic regression.
- Lecture 4, Thursday 1/24: convex sets and convex functions. some rules for building convex functions, first order convexity condition. Pop Quiz: Polarization Identity.
- Lecture 5, Monday 1/28: second order convexity condition, convex optimization problems, uniqueness of solutions and strict convexity, uniqueness of convex projection onto closed convex sets. Pop Quiz: Convexity of l2-regularized SVM (hinge-loss) objective function.
- Lecture 6, Thursday 1/31: Lipschitz continuity, beta-smoothness, gradient descent algorithm, subgradients, subgradient descent algorithm, outline of convergence proof.
- Lecture 7, Monday 2/4: projected gradient descent, review of conditional expectation and law of total expectation, idea of stochastic (sub)gradient descent
- Lecture 8, Thursday 2/7: chain rule for subdifferentials, stochastic gradient descent convergence, benefits of SGD. See the supplementary materials for optional reading, particularly practical tips for using SGD.
- Lecture 9, Monday 2/11: Frechet differentiability and derivatives of matrix-valued functions (for use on HW2), drawbacks of SGD, strong convexity: its relationship to the spectrum of the Hessian, optimization properties it implies (strict convexity, objective suboptimality controls distance of parameter from true minimizer, gradient size controls objective suboptimality, etc.), geometric interpretation of smooth and strongly convex functions, condition number of a convex optimization problem.
- Lecture 10, Thursday 2/14: Exact and inexact (backtracking) line search. Linear convergence of gradient descent with exact line search. Undesirable dependence on coordinate system of gradient descent. Newton's method, convergence phases, and guarantees. Desirable and undesirable properties of Newton's method.
- Lecture 11, Monday 2/18: Quasi-newton algorithms, BFGS and L-BFGS. Locally superlinear convergence (w/o proof).
- Class canceled Thursday 9/22.
- Lecture 12, Monday 2/25: Kernelization of ridge regression via the matrix inversion lemma or SVD, Reproducing Kernel Hilbert Spaces for convex, finite-dimensional fitting of nonparametric ML models. See the supplementary material for more on RKHSes and their application in ML.
- Lecture 13, Thursday 2/28: A representation theorem for convex optimization in RKHSes, and the finite-dimensional convex optimization problem this leads to. Equivalence of RKHSes, kernel functions, and feature mappings. Several standard kernel functions used in machine learning. The computational cost of exact kernel methods.
- Spring Break the week of 3/4
- Lecture 14, Monday 3/11: Duality, support functions, Fenchel conjugation.
- Lecture 15, Thursday 3/14: Fenchel-Young inequality, Fenchel-Rockafellar duality theorem, the dual of kernel problems, and the consequences for using low-rank approximations to kernel matrices to reduce the cost of kernel methods.
- Lecture 16, Monday 3/18: Kernel approximations schemes: Nystrom and Random Feature Maps. Linear-algebraic properties of Nystrom approximation, equivaelence to column subset selection problem on square-root of the kernel matrix.
- Lecture 17, Thursday 3/21: Limits of column subset selection, coherence and leverage scores, outline of Nystrom error guarantees, concentration inequalities, Matrix Bernstein inequality.
- Lecture 18, Monday 3/25: Matrix Chernoff inequality and application to uniform column subsampling
- Lecture 19, Thursday 3/28: Matrix Bernstein inequality and application to leverage score based column subsampling.
- Lecture 20, Monday 4/1: the shrinkage/soft-thresholding operator, the scaled and unscaled ADMM algorithm, convergence guarantees and stopping criteria, examples of application to Lasso, Robust PCA, and constrained optimization. See the supplementary material for optional reading on ADMM.
- Lecture 21, Thursday 4/5: Adaptive gradient descent methods: motivation via momentum, variance reduction, and coordinate-specific update rates; heavy-ball method, Nesterov accelerated gradient descent, AdaGrad, RMSProp, ADAM; caveats about generalization performance; sublinear convergence of adaptive gradient descent methods to approximate critical points of nonconvex functions.
- Lecture 22, Monday 4/8: Feedforward neural networks, their representational power, common activation functions, optimization using SGD.
- Lecture 23, Thursday 4/11: The backpropagation algorithm, one hidden-layer autoencoders.
- Lecture 24, Monday 4/15: Deep autoencoders, sparse autoencoders, recursive neural networks and back propagation through time. See the supplementary material for a recent survey on RNNs.
- Lecture 25, Thursday 4/18: Language modeling with RNNs, LSTMs, and CNNs (roughly). See the supplementary material for information on CNNs.
- TBA

*Tentative topics (subject to change, and in no particular order)*: block coordinate descent, projected stochastic (quasi-)newton methods, solvers for l1-regularized problems, large-scale latent factor models, large-scale ranking, extreme multi-class and multi-label classification, low-rank matrix factorization and completion algorithms, ADMM, large-scale kernel machines, SVRG methods, distributed and asynchronous SGD, large-scale gaussian processes, hyperparameter optimization, online learning.

## Homeworks

All assignments must be typed (preferably in LaTeX) and are due at the start of class (defined as the first 15 minutes) via email.
*Late assignments will be penalized and accepted at the instructor's discretion.*

- Self-assessment, due Monday 1/14 (in class, you may handwrite the answers)
- HW1, due Thursday 1/31.
- HW2, due Thursday 2/21. Get started early, this will take time.
- HW3, due Monday 2/25.
- HW4, due Monday 4/22.

## Projects

As described in the syllabus, each student will present a paper to the class; this entails

*critically*reading the paper- summarizing its theoretical results and emprical results
- discussing how the theory presented relates or does not relate to the experiments
- reimplementing the main algorithm and appropriately empirically evaluating it on reasonable data sets
- delivering a 20 minute presentation to the class summarizing your findings
- submitting your implementation and presentation slides in a github or gitlab repository

*CSCI6971 students*: read the CS graduate seminar skills presentation on reading papers, and in your presentation, directly address the questions
it suggests asking yourself as you are reading the paper.

You may choose any paper related to the material covered in class, and this choice must be approved. Here are some possibilities (if you choose one, let me know so I can remove it):

~~Iterative Hessian Sketch: Fast and Accurate Solution Approximation for Constrained Least-Squares~~- Newton Sketch: A Near Linear-Time Optimization Algorithm with Linear-Quadratic Convergence
- Scalable Kernel Methods via Doubly Stochastic Gradients
- Fast Large-scale Optimization by Unifying Stochastic Gradient and Quasi-Newton Methods
- Stochastic L-BFGS: Improved Convergence Rates and Practical Acceleration Strategies
~~Sketchy Decisions: Convex Low-Rank Matrix Optimization with Optimal Storage~~- Asynchronous Stochastic Gradient Descent with Delay Compensation
~~Scalable Adaptive Stochastic Optimization Using Random Projections~~~~Sever: A Robust Meta-Algorithm for Stochastic Optimization~~- Coupling Adaptive Batch Sizes with Learning Rates
~~Asynchronous Accelerated Stochastic Gradient Descent~~- HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
- Large Scale Optimization with Proximal Stochastic Newton-Type Gradient Descent
- A Proximal Stochastic Quasi-Newton Algorithm

**Paper selections are due March 28, your completed git repo is due April 22, and the presentations will be held the week of April 22.**
In the interim period, you must attend at least two office hours to discuss this paper with me:

- Once after you have read your selection, the week of
**April 8**or earlier. - Again after you have completed the implementation and experiments, the week of
**April 16**or earlier.

## Supplementary Materials

- Stochastic Gradient Descent Tricks. Leon Bottou.
- A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. S. Lacoste-Julien, M. Schmidt, F. Bach. Arxiv, 2012.
- Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization. A. Rakhlin, O. Shamir, K. Sridharan. ICML, 2012.
- Solving Large Scale Linear Prediction Problems Using Stochastic Gradient Descent Algorithms. Tong Zhang, ICML, 2004.
- Kernel Methods in Machine Learning. T. Hofmann, B. Scholkopf, and A. J. Smola, 2008. Has a list of domain-specific kernels.
- Gretton's notes on kernels in Machine Learning.
- Handout on Full and Reduced SVD
- Yuxin Chen's lecture notes on ADMM. Motivation and derivation of the algorithm using duality, some examples of using ADMM for specific problems, and convergence proof for ADMM.
- "Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers" by Boyd, Parikh, et al. A standard modern reference for ADMM. Very readable, and covers many applications and examples.
- Recent Advances in Recurrent Neural Networks by Salehinejad, et al. A survey of RNN fundamentals, training methods (including a detailed discussion of Backprop through Time), and recent applications.
- TBA