Modern machine learning routinely deals with millions of points in high-dimensional spaces. Classical numerical linear algebra algorithms can be prohibitively costly in such applications, as they aim at machine precision and scale super-linearly in the size of the input data. Fortunately, machine precision is seldom necessary, and recent developments have established that randomization can be used to bring the costs of matrix algorithms closer to linear in the size of the input data; this is done by sacrificing, in a principled manner, computational accuracy for increased speed. This course surveys modern randomized numerical linear algebra and its applications to machine learning, with the goal of providing a solid foundation for developing and analyzing the performance of randomized matrix algorithms.
Topics will include fast matrix multiplication, fast linear regression, column subset selection, fast approximate low- rank decompositions, fast k-means and spectral clustering, fast approximate kernel methods, and promising research directions in the field.
The syllabus is available as an archival pdf, and is more authoritative than this website.
Course Text: Lecture notes.
- Homeworks, 50%
- In-class Pop Quizzes, 20%
- Project, 30%
Students are expected to have writing supplies on hand in each class to complete the in-class pop quizzes. If you are an athlete or for some other reason are not able to attend each class, make alternative arrangements with the instructor in the first two weeks of the course.
Letter grades will be computed from the semester average. Maximum lower bound cutoffs for A, B, C and D grades are 90%, 80%, 70%, and 60%, respectively. These bounds may be moved lower at the instructors discretion.
Topics and Schedule
- Lecture 1, Thursday 1/18: Large-scale considerations in ML, supervised and unsupervised learning
- Lecture 2, Monday 1/22: Kernel learning, review of SVD
- Lecture 3, Thursday 1/25: Linear algebra review, approximate matrix multiplication, using randomization to speed up direct and iterative solvers
- Lecture 4, Monday 1/29: Kaczmarz algorithm introduction
- Thursday 2/1, Lecture is canceled
- Lecture 5, Monday 2/5: Review of expectations, conditional expectation; Kaczmarz algorithm and convergence rate with uniform row sampling
- Lecture 6, Thursday 2/8: Central Limit Theorem, Berry-Esseen Theorem, asymptotic vs Non-asymptotic tail bounds, Markov's inequality
- Lecture 7, Monday 2/12: Moment inequalities, moment generating functions, Laplace transform method, sub-gaussian r.v.s, Hoeffding bound for sub-gaussian sums; see the supplementary material
- Lecture 8, Thursday 2/15: Sub-exponential r.v.s, Bernstein's inequality, sub-gaussian random vectors, the sub-gaussian JLT; see the supplementary material
- Lecture 9, Tuesday 2/19: the matrix Bernstein inequality
- Lecture 10, Thursday 2/22: matrix Bernstein inequality application to proving the number of rows needed for column sampling to give subspace embedding, motivations for low-rank approximation, Eckart-Young theorem and truncated SVD cost
- Lecture 11, Monday 2/26: column subset selection and CX algorithms, fast subspace embeddings
- Lecture 12, Thursday 2/29: CX guarantees, random projections for sketching; see the supplementary material
- Lecture 13, Monday 3/5: leverage score CX approximation, approximation of the leverage scores of tall skinny matrices
- Lecture 14, Thursday 3/8: CUR approximation, fast CUR approximation, leverage-score CUR approximation, Nystrom approximations
- Monday 3/12 and Thursday 3/15: Spring Break
- Lecture 17, Monday 3/19: Regularized Empirical Risk Minimization, examples of linear and logistic regression, hinge-loss linear SVM
- Lecture 18, Thursday 3/22: Reproducing Kernel Hilbert Spaces, Representation Theorem, the kernel formulation of RERM on RKHSes
- Lecture 19, Monday 3/26: RKHSes on non-Euclidean spaces (graphs, text), creating kernels, convex conjugation, the impact of kernel approximation on kernel learning
- Lecture 20, Thursday 3/29: impact of kernel approximation on kernel learning; see the supplementary material
All assignments must be typed (preferably in LaTeX) and are due at the start of class (defined as the first 15 minutes) via email. Late assignments will be penalized and accepted at the instructor's discretion.
- Self-assessment, due Monday 1/22 (in class, you can write in the answers)
- HW1, due Monday 2/5. Get started early, as you will need to familiarize yourself with Python+Numpy and you will need run the experiments.
- HW2, due Thursday 2/22. Only problem 1 is due, as we did not cover the matrix Bernstein inequality in time.
- HW3, due Thursday 2/29 (ignore the date in the pdf). Problem 2 is now due.
- HW4, due Monday 3/19.
- HW5, due Monday 3/26.
As described in the syllabus, each student will present a paper to the class; this entails
- critically reading the paper
- summarizing its theoretical results and emprical results
- discussing how the theory presented relates or does not relate to the experiments
- reimplementing the main algorithm and appropriately empirically evaluating it on reasonable data sets
- delivering a 20 minute presentation to the class summarizing your findings
Graduate students: read the CS graduate seminar skills presentation on reading papers, and in your presentation, directly address the questions it suggests asking yourself as you are reading the paper.
You may choose any paper related to the material covered in class, and this choice must be approved. Here are some possibilities (if you choose one, let me know so I can remove it):
Preconditioned Data Sparsification for Big Data with Applications to PCA and K-means Fast and Robust Least Squares Estimation in Corrupted Linear Models Time and Space Efficient Spectral Clustering via Column Sampling
- Effective Dimensionality Reduction for Canonical Correlation Analysis
CUR Algorithm for Partially Observed Matrices Single Pass PCA of Matrix Products Recursive Sampling for the Nystrom Method
- Faster Kernel Ridge Regression Using Sketching and Preconditioning
- Iterative Hessian Sketch: Fast and Accurate Solution Approximation for Constrained Least-Squares
- Newton Sketch: A Near Linear-Time Optimization Algorithm with Linear-Quadratic Convergence
GIANT: Globally Improved Approximate Newton Method for Distributed Optimization
- Scalable Kernel Methods via Doubly Stochastic Gradients
Paper selections are due March 19, and the presentations will be held the week of April 23. In the interim period, you must attend at least two office hours to discuss this paper with me: once after you have read your selection, the week of March 26 or earlier, and again after you have completed the implementation and experiments, the week of April 16 or earlier. Your efforts during those discussions, as well as your final presentation, will determine your project grade.
- Lectures 7 and 8 drew from the first and second of Vershynin's excellent lectures on probabilistic methods for data science
- We mentioned that sums of gaussians are gaussian
- A write-up of the relative error guarantee for CX decompositions.
- We mentioned in class that the matrix inversion lemma lets us use low-rank approximations of our kernel matrix to efficiently approximately solve ridge regression problems, and suggested that it is a general phenomenon that algorithms for solving kernel problems can be adapted to take advantage of low-rank approximations of the kernel. One of the most famous examples of this adaptation is the interior point algorithm suggested by Fine and Scheinberg for fitting hinge-loss SVMs using kernel approximations.