Intended Schedule

Course References:
  1. [B] C. Bishop, Neural Networks for Pattern Recognition, Clarendon Press, Oxford. (Suggested Text)
  2. [M] T. M. Mitchell, Machine Learning, McGraw Hill.
  3. [SB] R. Sutton, A. Barto, Reinforcement Learning, MIT Press.
  4. [V] V. Vapnik, Statistical Learning Theory, Wiley.
  5. [HKP] J. Hertz, A. Krogh and R. Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley.
  6. [DH] R. Duda & P. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons.
  7. [H] S. Haykin, Neural Networks -- A Comprehensive Foundation, Macmillan Publishing, New Jersey.
  8. [N] N. Nilsson, The Mathematical Foundations of Learning Machines, Morgan Kaufmann Publishing Company, San Mateo, California.
  9. [R] B. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press.
General Mathematical Background (not course-specific):
  1. [P] J. Pitman, Probability, Springer-Verlag.
  2. [D] M. DeGroot, Probability and Statistics, Addison-Wesley.
  3. [L] S. Lang, Undergraduate Algebra, Springer-Verlag.
  4. [HJ] R. Horn and C. Johnson, Matrix Analysis, Cambridge University Press.
Note that the lecture contents can change without notice.

Lecture 1:
Motivating examples. Concrete example: a Boolean function of 3 Boolean variables. Outline of the general problem.
(1.1-1.7,[B]), (Chapter 1,[DH]), (Chapter 1,[M]), (Chapter 1,2,[H]), (Chapter 1,[R])

Lecture 2:
Formal setup. Probability tools; Bayes' theorem and example; Bayes optimal rule/decision function.
(1.8-1.10,[B]), (Chapter 2,[DH]), (Chapter 1,2,3,[N]), (2.1-2.4,[R])

Lecture 3:
Review of the Bayes optimal rule. Minimum error rate loss matrix for the 2-class/2-action problem. Gaussian class conditional density and derivation of the nearest mu (nearest mean) rule. Derivation of the perceptron rule.
(3.1-3.5,[B]), (Chapter 2,[DH]), (Chapter 6.1-6.5,[M]), (Chapter 1,2,3,[N]), (3.6,[R])

Lecture 4:
Approach to the Bayes optimal rule by starting at the perceptron and determining v, v0 by minimizing R_emp. Expressions for R_emp. Perceptron learning algorithm. Overview of a second approach by smoothing the surface.
(3.1-3.5,[B]), (Chapter 5,[DH]), (6.1-6.5,[M])
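
The perceptron learning algorithm named above can be sketched as follows. This is a minimal illustration under stated assumptions, not necessarily the formulation used in the lecture: labels are assumed to be +1/-1, the symbols v and v0 follow the lecture summary, and the function names and toy data are hypothetical.

```python
def perceptron_train(xs, ys, max_epochs=100):
    """Classic perceptron updates; labels are assumed to be +1 or -1."""
    v = [0.0] * len(xs[0])  # weight vector
    v0 = 0.0                # bias / threshold term
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(xs, ys):
            activation = sum(vi * xi for vi, xi in zip(v, x)) + v0
            if y * activation <= 0:  # misclassified (or exactly on the boundary)
                v = [vi + y * xi for vi, xi in zip(v, x)]
                v0 += y
                mistakes += 1
        if mistakes == 0:  # every training point is classified correctly
            break
    return v, v0


def predict(v, v0, x):
    """Thresholded linear rule: +1 if v.x + v0 > 0, else -1."""
    return 1 if sum(vi * xi for vi, xi in zip(v, x)) + v0 > 0 else -1


# Toy data (an assumption for illustration): logical AND, which is linearly separable.
xs = [(0, 0), (0, 1), (1, 0), (1, 1)]
ys = [-1, -1, -1, 1]
v, v0 = perceptron_train(xs, ys)
predictions = [predict(v, v0, x) for x in xs]
```

On linearly separable data such as this, the loop stops once a full pass makes no mistakes, which corresponds to zero training error R_emp.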

Lecture 5:
Perceptron learning model. Softening the threshold. Minimizing R_emp. Gradient descent and normalized gradient descent. Expression for the gradients of the perceptron.
(3.1-3.5,[B]), (Chapter 5,[DH]), (Chapter 4,[M]), (Chapter 3,[H])

Lecture 6:
Problems with the Perceptron. Generalization to the multilayer perceptron. Computation of the output, forward propagation.
(4.1-4.6,[B]), (Chapter 4,[M]), (Chapter 3,4,[H]), (Chapter 5,[R])

Lecture 7:
Gradients via Backpropagation. Algorithm for minimization of R_emp(w).
(4.8-4.9,[B]), (Chapter 4,[M]), (Chapter 5,[R]), (Chapter 4,[H])

Lecture 8:
Universal approximation with neural networks. Summary. Introduction to the generalization question. Coin model for functions.
(9.9,[B])

Lecture 9:
Coin model for functions. Generalization performance of a learning model with a single function. Generalization performance of a learning model with a finite number of functions.
(Notes), (Chapter 7,[M])

Lecture 10:
Axiom of Non-falsifiability. Definition of m(N), the growth function for a learning model L.
(3.10-3.11,[V]), (Chapter 7,[M]), (2.8,[R]), (Chapter 2,[H])

Lecture 11:
Computation of m(N) for various learning models: 1) positive ray, 2) positive interval, 3) positive rectangle, 4) convex subsets. Bound for m(N): growth is either exponential or at most polynomial. Definition of the VC dimension d_vc.
(Chapter 4,[V]), (Chapter 7,[M]), (2.8,[R]), (Chapter 2,[H])
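
As a rough illustration of the growth function for the first two models, the sketch below counts by brute force the distinct labelings (dichotomies) of N points on the line. It assumes the usual textbook definitions, which the lecture may state differently: a positive ray labels x as +1 iff x > a, and a positive interval labels x as +1 iff a < x < b. The function names and sample points are hypothetical.

```python
from itertools import combinations


def cut_points(pts):
    """One cut below all points, one between each adjacent pair, one above."""
    return [pts[0] - 1] + [(p + q) / 2 for p, q in zip(pts, pts[1:])] + [pts[-1] + 1]


def dichotomies_positive_ray(points):
    """Labelings produced by h(x) = +1 iff x > a, over all thresholds a."""
    pts = sorted(points)
    return {tuple(1 if x > a else -1 for x in pts) for a in cut_points(pts)}


def dichotomies_positive_interval(points):
    """Labelings produced by h(x) = +1 iff a < x < b, over all a < b."""
    pts = sorted(points)
    labelings = {tuple(-1 for _ in pts)}  # interval containing no point
    for a, b in combinations(cut_points(pts), 2):
        labelings.add(tuple(1 if a < x < b else -1 for x in pts))
    return labelings


points = [0.5, 1.0, 2.5, 4.0, 7.0]  # N = 5 distinct points
N = len(points)
ray_count = len(dichotomies_positive_ray(points))
interval_count = len(dichotomies_positive_interval(points))
```

For these two models the count does not depend on which distinct points are chosen, so it equals the growth function: N + 1 for the positive ray and N(N+1)/2 + 1 for the positive interval, both polynomial in N.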

Lecture 12:
Proof of bound on m(N). Separation of learning models into good learning models and bad ones. VC dimension for perceptron. Bound on VC dimension for neural networks.
(Notes), (Chapter 4,[V]), (2.8,[R]), (Chapter 2,[H])

Lecture 13:
VC theorem: bound on the probability of generalization error. Computation of sample complexity. Test error bound. The complexity-approximability tradeoff.
(Notes), (Chapter 4,[V]), (Chapter 2,[H])


Lecture 14:
Relationship between the VC bound and the no-free-lunch (NFL) theorem. Use of prior information. Bias and variance.
(9.1,[B])

Lecture 15:
Bias and Variance continued. Introduction to regularization - general approaches. Early Stopping.
(9.1,9.2.4,[B]), (Chapter 5,[R])

Lecture 16:
Cross Validation.
(9.8.1,[B]), (2.6,[R]), (4.14,[H])
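
One common form of cross validation is the k-fold scheme, sketched below as index bookkeeping only. Whether the lecture uses k-fold, leave-one-out, or another variant is not stated, so treat this as an assumption; the function name is hypothetical.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


splits = list(k_fold_splits(10, 3))
```

Each sample is held out exactly once; a model would be trained on each `train` set and scored on the matching `validation` set, and the scores averaged.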

Lecture 17:
Complexity penalties. Use of noise models to obtain error functions - maximum likelihood. "Derivation" of weight decay.
(6.0-6.1,9.2,9.4,9.5,[B]), (Chapter 5,[R])

Lecture 18:
Regularization by addition of penalty terms to the error function including other complexity regularizers. Enforcing hints and other prior information such as rotation/reflection symmetry and monotonicity using penalty terms. Choice of error functions from risk preferences. Choice of regularization parameters. Bagging and bootstrap.
(6.0-6.11,9.2,9.4,9.5,[B]), (4.15,[H]), (4.6.5, 4.8.1,[M]), (2.7,[R])

Lecture 19:
Committees/Voting/Boosting. Road map and where we go from here.
(9.6,9.7,10.7,[B]), (Chapter 7,[H])

Lecture 20:
Weight initialization and input preprocessing. When to stop training. Approach to optimization algorithms.
(Chapter 8, [B])

Lecture 21:
Optimization. Zeroth order model: exhaustive search. First order model: gradient descent with fixed and variable learning rate. Steepest descent.
(Chapter 7, [B]), (5.3, [R]), (4.17,4.18, [R])
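
Gradient descent with a fixed learning rate can be sketched on a one-dimensional quadratic. The objective, learning rate, and step count below are illustrative assumptions, not values from the course.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeat w <- w - learning_rate * grad(w) for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w


# Toy objective (an assumption): E(w) = (w - 3)^2, so dE/dw = 2 * (w - 3).
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

With this learning rate each step shrinks the distance to the minimizer w = 3 by a factor of 0.8, so the iterate converges geometrically; a rate above 1.0 would make this particular example diverge.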

Lecture 22:
Steepest descent, momentum and conjugate gradient.
(Chapter 7, [B]), (5.3, [R]), (4.17,4.18, [R])

Lecture 23:
Conjugate gradient. Second order methods: Newton step, Levenberg-Marquardt methods.
(Chapter 7, [B]), (5.3, [R]), (4.17,4.18, [R])


Lecture 24:
The Nearest Neighbor Rule.
(6.2, [R]), (4.6, [DH]), (2.5.4, [B]), (8.2, [M])
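
A minimal sketch of the nearest neighbor rule, assuming Euclidean distance and a single neighbor (1-NN); the function name and toy data are hypothetical.

```python
import math


def nearest_neighbor_predict(train_x, train_y, query):
    """Return the label of the training point closest to `query` (1-NN)."""
    best_index = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return train_y[best_index]


# Toy data (an assumption): two well-separated clusters in the plane.
train_x = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
train_y = ["a", "a", "b", "b"]
label_near_origin = nearest_neighbor_predict(train_x, train_y, (0.2, 0.4))
label_far = nearest_neighbor_predict(train_x, train_y, (5.5, 4.0))
```

The rule stores the training set and defers all computation to query time, so each prediction costs one distance evaluation per training point in this naive form.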


Lecture 25:
Radial Basis Functions.
(Chapter 5, [B]), (Chapter 5, [H])


Lecture 26:
Gaussian Processes.

Lecture 27:
Support Vector Machines.