blank

Teaching kernel learning

2023-10-05T15:11:56-04:00

Here’s a neat approach to teaching kernel learning (for empirical risk minimization), following section 2.2.6 of the book “First-order and Stochastic Optimization Methods for Machine Learning” by Guanghui Lan. It is a straightforward approach, based on the first-order optimality condition, and I have not seen it elsewhere.

Consider the convex empirical risk minimization problem

\[ \omega^\star = \text{argmin}_{\omega \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \ell(\omega^T x_i, y_i) + \lambda \|\omega\|_2^2.\]

Fermat’s condition implies that the optimal model weights are in the span of the training data points, \(\omega^\star = \sum_{i=1}^n \alpha_i x_i \), so the problem can be recast as

\[ \alpha^\star = \text{argmin}_{\alpha \in \mathbb{R}^n} \frac{1}{n} \sum_{i=1}^n \ell( \sum_{j=1}^n \alpha_j \langle x_i, x_j \rangle, y_i) + \lambda \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j \langle x_i, x_j \rangle. \]

Define the linear kernel matrix \(K = [ \langle x_i, x_j \rangle ]_{i,j=1,\ldots,n}\), and this becomes

\[ \alpha^\star = \text{argmin}_{\alpha \in \mathbb{R}^n} \frac{1}{n} \sum_{i=1}^n \ell( (K \alpha)_i, y_i) + \lambda \alpha^T K \alpha, \]

which is again a convex optimization problem. Once \(\alpha^\star\) is known, the model can be used to predict the value of the target given a new input, also in terms of the kernel function \( \kappa(x_1, x_2) = \langle x_1, x_2 \rangle \) :

\[ (\omega^\star)^T x = \sum_{i=1}^n \alpha_i \langle x_i, x \rangle = \sum_{i=1}^n \alpha_i \kappa(x_i, x). \]

This reformulation of the ERM is not particularly useful unless \(d > n\): the original ERM is a \(d\)-dimensional optimization problem, while the reformulated one is an \(n\)-dimensional optimization problem that also requires forming and working with an \(n \times n\) kernel matrix. However, if \(d\) is larger than \(n\), the kernel trick that led to the second ERM is useful, as the latter ERM may be more efficient to work with.
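
As a minimal sketch of this equivalence, assume the squared loss \(\ell(p, y) = (p - y)^2\) (my choice here, made only because both problems then have closed-form solutions). The NumPy snippet solves the ERM in \(\omega\) and the ERM in \(\alpha\) on synthetic data and checks that they produce the same weights; the names `lam`, `omega_primal`, and `omega_dual` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 50, 0.1          # d > n, the regime where the kernel form pays off
X = rng.standard_normal((n, d))  # rows are the training points x_i
y = rng.standard_normal(n)

# Primal: minimize (1/n)||X w - y||^2 + lam ||w||^2, a d x d linear system.
omega_primal = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

# Kernel form: minimize (1/n)||K a - y||^2 + lam a^T K a, an n x n linear system.
K = X @ X.T                      # linear kernel matrix
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
omega_dual = X.T @ alpha         # omega = sum_i alpha_i x_i

print(np.allclose(omega_primal, omega_dual))
```

The primal solve costs a \(d \times d\) factorization and the kernel solve an \(n \times n\) one, which is the whole trade-off in this paragraph.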

Now, consider that we want to learn using linear combinations of some fixed functions of the raw features \(x\), e.g. we want to learn using the features \(x_1, x_2, x_1^2, x_2^2, x_1 x_2, \ldots\). One can imagine that using such features could be of benefit over simply using the raw features. Let \(\phi(x)\) denote the feature map that maps the raw features from \(d\) dimensions into \(D\) new features, then the ERM of interest is

\[ \min_{\omega \in \mathbb{R}^D} \frac{1}{n} \sum_{i=1}^n \ell(\omega^T \phi(x_i), y_i) + \lambda \|\omega\|_2^2,\]

and the same reasoning as before tells us that \(\omega^\star = \sum_{i=1}^n \alpha_i \phi(x_i)\). Defining the nonlinear kernel function \( \kappa(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle \) and the nonlinear kernel matrix \(K = [ \kappa(x_i, x_j) ]_{i,j=1,\ldots,n}\) corresponding to the feature map \(\phi\), we see that we can fit this model by solving

\[ \min_{\alpha \in \mathbb{R}^n} \frac{1}{n} \sum_{i=1}^n \ell( (K \alpha)_i, y_i) + \lambda \alpha^T K \alpha, \]

the exact same convex problem as before, and also predict using

\[ (\omega^\star)^T \phi(x) = \sum_{i=1}^n \alpha_i \langle \phi(x_i), \phi(x) \rangle = \sum_{i=1}^n \alpha_i \kappa(x_i, x). \]

Note that \(\phi\) could even be infinite-dimensional, and this approach still works: the ERM stated in terms of the feature map is not solvable on a computer, but the latter is, as it involves working with only an \(n\)-dimensional optimization problem. In practice, even if \(\phi\) is finite-dimensional, \(D\) is taken to be sufficiently high that the kernel version of the ERM is more efficiently solvable than the feature map version of the ERM. Further, \(\phi\) is chosen so that the kernel function is efficiently evaluable.

The canonical example of the feature map used in kernel learning is the Gaussian/radial basis feature map, an infinite-dimensional feature map that consists of all the monomials in the raw features, each downweighted by an appropriate exponential weight (it’s tedious to write out; you can look it up). The nice thing is that this choice of feature map leads to the RBF kernel function that we all know and love:

\[\kappa(x_1, x_2) = \exp\left(-\frac{\|x_1 - x_2\|_2^2}{2\sigma^2} \right),\]

where \(\sigma^2\) is a hyperparameter.
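
To make the fit-and-predict recipe concrete, here is a hedged sketch with the RBF kernel. It again assumes the squared loss, for which \(\alpha^\star = (K + n\lambda I)^{-1} y\) minimizes the kernel ERM; the helper names (`rbf_kernel`, `fit_alpha`, `predict`) and the toy sine-curve data are mine, not from Lan’s book.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma):
    """kappa(x1, x2) = exp(-||x1 - x2||^2 / (2 sigma^2)) for all pairs of rows."""
    sq_dists = (
        np.sum(X1**2, axis=1)[:, None]
        + np.sum(X2**2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

def fit_alpha(K, y, lam):
    """Minimizer of (1/n)||K a - y||^2 + lam a^T K a (squared loss assumed)."""
    n = K.shape[0]
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def predict(X_train, alpha, X_new, sigma):
    """sum_i alpha_i kappa(x_i, x) for each row x of X_new."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha

rng = np.random.default_rng(0)
X_train = rng.uniform(-3.0, 3.0, size=(200, 1))
y_train = np.sin(X_train[:, 0])
alpha = fit_alpha(rbf_kernel(X_train, X_train, sigma=0.5), y_train, lam=1e-6)

X_test = np.linspace(-2.5, 2.5, 50)[:, None]
y_pred = predict(X_train, alpha, X_test, sigma=0.5)
max_err = np.max(np.abs(y_pred - np.sin(X_test[:, 0])))
print(f"max abs error on the test grid: {max_err:.2e}")
```

Note that the feature map \(\phi\) never appears: only kernel evaluations between pairs of points are needed, both for fitting and for prediction.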

Nystrom vs Random Feature Maps

2014-12-26T22:37:49-05:00

I haven’t seen a truly convincing study comparing Nystrom approximations to Random Feature Map approximations. On the one hand, a NIPS 2012 paper compared the two and argued that because the bases Nystrom approximations use are adaptive to the problem, whereas those used by RFMs are not, Nystrom approximations are more efficient.

This is an indisputable point, but the experiments done in the paper are not convincing: they used the same number of samples in Nystrom approximations as random features in RFMs. Instead, the fair comparison is to allot both methods the same number of FLOPs; since Nystrom methods involve an additional pseudoinversion of a (huge, for a large number of samples) matrix, one can potentially use more random features than sample points for the same number of FLOPs. Also, as always, it is important to choose an appropriate kernel; this paper only considered RBF kernels.

On the other hand, recently IBM researchers have used large-scale RFM approaches to get state-of-the-art performance on vision and speech tasks. Their results use the simplest of RFM approaches: linear regression on top of a very large number (~400K) of random Fourier features. The key to their success is a well-engineered ADMM approach to parallelizing the solution of the system. It’s not clear to me that a similar approach couldn’t be used to scale up a Nystrom-based solution and obtain similar results. Also, I’ve not seen anyone implement Wainwright et al.’s divide and conquer approach to kernel regression; theoretically, this could also be used to distribute the cost of a truly large-scale Nystrom implementation.

Personally, I’m of the opinion that a well-engineered Nystrom solution (using uniform sampling, even) should always outperform a well-engineered RFM solution. But, I’m interested in seeing this convincingly demonstrated.
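
For readers who have not seen the two constructions side by side, here is a small sketch of both approximations to an RBF kernel matrix. It is not the experiment from either paper: the data are synthetic, the function names are mine, and it uses the same number of landmarks and random features, which is exactly the equal-count comparison criticized above. It illustrates the two methods and settles nothing.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma):
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def nystrom_approx(X, m, sigma, rng):
    """K ~= C W^+ C^T, with C the kernel columns of m uniformly sampled points."""
    idx = rng.choice(X.shape[0], size=m, replace=False)
    C = rbf_kernel(X, X[idx], sigma)       # n x m block of kernel columns
    W = C[idx]                             # m x m kernel among the landmarks
    return C @ np.linalg.pinv(W, rcond=1e-10) @ C.T

def rff_approx(X, D, sigma, rng):
    """K ~= Z Z^T with D random Fourier features sqrt(2/D) cos(Omega^T x + b)."""
    Omega = rng.standard_normal((X.shape[1], D)) / sigma
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    Z = np.sqrt(2.0 / D) * np.cos(X @ Omega + b)
    return Z @ Z.T

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 3))
sigma = 2.0
K = rbf_kernel(X, X, sigma)

def rel_err(K_hat):
    """Relative Frobenius-norm error of an approximation to K."""
    return np.linalg.norm(K - K_hat) / np.linalg.norm(K)

err_nystrom = rel_err(nystrom_approx(X, m=50, sigma=sigma, rng=rng))
err_rff = rel_err(rff_approx(X, D=50, sigma=sigma, rng=rng))
print(f"Nystrom (50 landmarks): {err_nystrom:.3e}")
print(f"RFF (50 features):      {err_rff:.3e}")
```

A FLOP-matched version would give `rff_approx` a larger `D` to account for the pseudoinverse in `nystrom_approx`.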

Mirror descent is, in a precise sense, a second order algorithm

2014-10-01T22:47:15-04:00

For one of our projects at eBay, I’ve been attempting to do a Poisson MLE fit on a large enough dataset that Fisher scoring is not feasible. The problem is that the data also has such large variance in the scales of the observations that stochastic gradient descent does not work, period: because of the exponentiation involved, you need to take a very tiny step size to avoid overflow errors, but this step size is shared by all the parameters, so you can’t make progress in this way.

An alternative is Adagrad, which maintains separate stepsizes for each parameter, but that seems to run into the same divergence issue, albeit much more slowly; slowly enough that it’s unclear to me whether the fit is actually diverging, or if it ‘just’ needs to run a couple hundred iterations before it converges. So for the past week I’ve been massaging the initial conditions and the amount of information I hard-bake into the parametrization of the problem to see if I can get Adagrad to work reasonably. Still no luck.

I just came across Raskutti and Mukherjee’s paper “The information geometry of mirror descent”, which seems relevant to my situation and is a nice (albeit in need of proofreading) read. The main result of the paper is that the mirror descent algorithm associated with the Bregman divergence of a function \(G\) is equivalent to natural gradient descent in the dual manifold with metric tensor defined by the Hessian of the convex conjugate of \(G.\) This sounds wonderful, because the connection between exponential families and Bregman divergences suggests that one can then perform a first-order optimization in a certain dual manifold, and reap all the benefits of having done Fisher scoring, a second-order algorithm, in parameter space. I have to reread the paper carefully to get a handle on the precise manipulations required, but this may be a nice alternative to Adagrad for my problem.

I wonder: is there a similarly geometric interpretation of what composite mirror descent does?

Update: A more readable recent paper, “Stochastic Discriminative EM” from UAI 2014, does a better job of explaining the interpretation of the dual manifold and has a very similar algorithm.

Algebra: it matters

2014-09-15T16:42:38-04:00

I’m looking at two different models for learning polynomial functions, and trying to determine if they are equivalent. After a couple days of thinking, I’ve reduced the question to the following:

Can every symmetric polynomial of degree \(r\) in \(d\) variables that has no constant term be written as a sum of the \(r\)-th powers of linear polynomials in \(d\) variables and a homogeneous polynomial of degree \(r\) each of whose monomials involves at most \(d-1\) variables?

Eigenvector two-condition number for a product of PSD matrices

2014-04-07T09:31:20-04:00

I’m pushing to submit a preprint on the Nystrom method that has been knocking around for the longest time.

I find myself running into problems centering around expressions of the type \(B^{-1}A\), where \(A, B\) are SPSD matrices satisfying \(B \preceq A\). This expression will be familiar to numerical linear algebraists: there \(B\) would be a preconditioner for a linear system \(A x = b,\) and the relevant quantity of interest is the spectral radius of \(B^{-1} A\).

It’s not hard to show that the spectral radius of this product is at most 1… but instead, I’m interested in the norm of this product. Because the spectral radius of the product is at most 1, we can use the bound
\[
\|B^{-1} A\|_2 \leq \kappa(U_{B^{-1}A})
\]
where \(\kappa(U_{B^{-1}A})\) is the two-condition number of the matrix of eigenvectors of \(B^{-1}A \).

In the applications I’m interested in, some rough numerical experiments show that this bound is good enough for my purposes (the two terms are of the same order). Assuming this is the case, how can I get a good theoretical bound on this condition number?
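
For reference, the bound comes from diagonalizing \(B^{-1}A = U \Lambda U^{-1}\), which gives \(\|B^{-1}A\|_2 \leq \|U\|_2 \|\Lambda\|_2 \|U^{-1}\|_2 = \kappa(U)\,\rho(B^{-1}A)\) in general. Below is a rough numerical check of that general inequality on random positive-definite matrices; the test matrices are illustrative and are not the Nystrom-related matrices from my application, and no ordering between \(A\) and \(B\) is imposed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30

def random_spd(n, rng):
    """A random symmetric positive-definite matrix (illustrative test data)."""
    G = rng.standard_normal((n, n))
    return G @ G.T + 1e-2 * np.eye(n)

A, B = random_spd(n, rng), random_spd(n, rng)
M = np.linalg.solve(B, A)                  # B^{-1} A without forming B^{-1}

eigvals, U = np.linalg.eig(M)              # M = U diag(eigvals) U^{-1}
spectral_radius = np.max(np.abs(eigvals))
kappa_U = np.linalg.cond(U, 2)             # two-condition number of the eigenvectors
norm_M = np.linalg.norm(M, 2)

print(f"||B^-1 A||_2 = {norm_M:.3e}, kappa(U) * rho = {kappa_U * spectral_radius:.3e}")
```

Comparing the two printed numbers on matrices from a real application is a quick way to see how much slack the bound has.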

Canonical Correlation Analysis (CCA)

2014-04-07T09:20:11-04:00

I am not completely satisfied with the expositions of CCA that I’ve come across, so I decided to write one that reflects my own intuition.

CCA is useful in the case where you observe two random variables that are both noisy linear functions of some underlying latent random variable, and you want to use this fact to help you guess the latent variable. Formally, assume
\[
x = T_X z + n_x, \quad \text{ and } \quad y = T_Y z + n_y,
\]
where, without loss of generality, we assume that the entries of \(z\) are uncorrelated and unit variance. Here \(T_X\) and \(T_Y\) are matrices whose image spaces may be of different dimension, and \(n_x\) and \(n_y\) are white noise. For convenience, we assume that \(z\) is mean-zero, so that \(x\) and \(y\) are also.

In order to recover \(z\) from \(x\) and \(y\), we phrase the following question: find two transformations \(\Phi_X\) and \(\Phi_Y\) so that the entries of \(\Phi_X x\) are uncorrelated and unit variance, as are those of \(\Phi_Y y\), and the correlation of \(\Phi_X x\) with \(\Phi_Y y\) is maximized.

We could use PCA to whiten \(x\) and \(y\) individually to get two different noisy estimates of \(z\), but this would ignore the fact that knowing both \(x\) and \(y\) gives us more information on \(z\). Indeed, the requirement that the correlation of \(\Phi_X x\) and \(\Phi_Y y\) be maximized should tend to choose \(\Phi_X\) and \(\Phi_Y\) which remove the directions containing the noise \(n_x\) and \(n_y\).

CCA can then be formulated as the following:
\[
\max_{\Phi_X, \Phi_Y} \mathbb{E} \langle \Phi_X x, \Phi_Y y \rangle \text{ subject to }
\mathbb{E} \Phi_X x (\Phi_X x)^T = \mathbb{E} \Phi_Y y (\Phi_Y y)^T = I,
\]
or equivalently,
\[
\max_{\Phi_X, \Phi_Y} \text{Trace}(\Phi_Y C_{y x} \Phi_X^T) \text{ subject to } \Phi_X C_{xx} \Phi_X^T = \Phi_Y C_{yy} \Phi_Y^T = I,
\]
where \(C_{xx}, C_{yy}\) are the covariance matrices of \(x\) and \(y\) and \(C_{yx}\) is the cross-covariance matrix of \(y\) and \(x\) given by \(C_{yx} = \mathbb{E} (yx^T).\)

Since \(C_{xx}\) and \(C_{yy}\) are full-rank, by taking \(\Phi_X^\prime = \Phi_X C_{xx}^{1/2}\) and \(\Phi_Y^\prime = \Phi_Y C_{yy}^{1/2},\) this program can be transformed to
\[
\max_{\Phi_X^\prime, \Phi_Y^\prime} \text{Trace}(\Phi_Y^\prime C_{yy}^{-1/2} C_{yx} C_{xx}^{-1/2} \Phi_X^{\prime T}) \text{ subject to } \Phi_X^\prime \Phi_X^{\prime T} = \Phi_Y^\prime \Phi_Y^{\prime T} = I.
\]

From the standard characterization of the SVD, it is clear that the optimal \(\Phi_X^\prime\) and \(\Phi_Y^\prime\) are given by the top right and left singular vectors, respectively, of the matrix \(C_{yy}^{-1/2} C_{yx} C_{xx}^{-1/2}.\) Equivalently, they are the top eigenvectors of the two matrices \(C_{xx}^{-1/2} C_{xy} C_{yy}^{-1} C_{yx} C_{xx}^{-1/2}\) and \(C_{yy}^{-1/2} C_{yx} C_{xx}^{-1} C_{xy} C_{yy}^{-1/2}.\) After multiplying by \(C_{xx}^{-1/2}\) and \(C_{yy}^{-1/2}\) respectively, the eigenvectors of these two unwieldy matrices become the top eigenvectors of the generalized eigenvalue problems given by
\begin{align*}
C_{xy} C_{yy}^{-1} C_{yx} \phi_i^x & = \lambda_i C_{xx} \phi_i^x \\
C_{yx} C_{xx}^{-1} C_{xy} \phi_i^y & = \lambda_i C_{yy} \phi_i^y.
\end{align*}
Here the \(\lambda_i\) are the squared canonical correlations, and the \(\phi^x_i, \phi^y_i\) are the canonical correlation basis vectors.
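
The SVD characterization translates almost line for line into NumPy. The following is a sketch under the assumption that sample covariances stand in for the expectations; the function names, the synthetic latent-variable data, and the noise level are all illustrative.

```python
import numpy as np

def inv_sqrt(C):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, Q = np.linalg.eigh(C)
    return (Q / np.sqrt(w)) @ Q.T

def cca(X, Y, k):
    """Rows of X, Y are paired samples. Returns Phi_X, Phi_Y, the top-k
    canonical correlations, and the sample covariances that were used."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx, Cyy, Cyx = X.T @ X / n, Y.T @ Y / n, Y.T @ X / n
    Cxx_is, Cyy_is = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Left/right singular vectors of the whitened cross-covariance.
    U, s, Vt = np.linalg.svd(Cyy_is @ Cyx @ Cxx_is)
    Phi_X = Vt[:k] @ Cxx_is        # Phi_X = Phi_X' C_xx^{-1/2}
    Phi_Y = U[:, :k].T @ Cyy_is    # Phi_Y = Phi_Y' C_yy^{-1/2}
    return Phi_X, Phi_Y, s[:k], (Cxx, Cyy, Cyx)

rng = np.random.default_rng(0)
n, k = 5000, 2
z = rng.standard_normal((n, k))                       # latent variable
T_X, T_Y = rng.standard_normal((5, k)), rng.standard_normal((4, k))
X = z @ T_X.T + 0.5 * rng.standard_normal((n, 5))     # x = T_X z + n_x
Y = z @ T_Y.T + 0.5 * rng.standard_normal((n, 4))     # y = T_Y z + n_y

Phi_X, Phi_Y, corrs, (Cxx, Cyy, Cyx) = cca(X, Y, k)
print("canonical correlations:", corrs)
```

The singular values returned here are the canonical correlations themselves, so their squares are the \(\lambda_i\) of the generalized eigenvalue problems.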

None of the non-variational characterizations given so far seems terribly efficient, since they all involve two matrix inversions. It turns out that CCA can be done by using a QR decomposition instead. See Björck and Golub, 1973 for the details of the algorithm, and the connection between CCA and the principal angles between subspaces.

Adagrad and projections onto ellipsoids

2014-02-06T22:16:32-05:00

((Caveat! I am not sure the manipulations done in this post are correct, but the gist is certainly there.))

One of my favorite optimization techniques is Adagrad, a first-order technique that approximates the Hessian by using all the gradients up to that point. It calls for updates of the form:
\[
x_{t+1} = \Pi_{\mathcal{X}}^{G_t^{1/2}} (x_t - \eta G_{t}^{-1/2} g_t),
\]
or more practically for high-dimensional problems,
\[
x_{t+1} = \Pi_{\mathcal{X}}^{\text{diag}(G_t)^{1/2}} (x_t - \eta \text{diag}(G_{t})^{-1/2} g_t).
\]
Here, \(g_t\) denotes the gradient at step \(t\), \(\eta\) is a fixed stepsize, \(\mathcal{X}\) is the constraint set for your optimization problem,
\[
G_t = \sum_{i=1}^t g_i g_i^T,
\]
and \(\Pi_{\mathcal{X}}^A\) is the projection onto the constraint set with respect to a distance function defined by a positive-definite matrix \(A\):
\[
\Pi_{\mathcal{X}}^A (y) := \text{argmin}_{x \in \mathcal{X}} \langle x - y, A (x-y) \rangle = \text{argmin}_{x \in \mathcal{X}} \|A^{1/2} (x-y)\|_2^2.
\]
By convention \(\Pi_{\mathcal{X}} = \Pi_{\mathcal{X}}^I\) denotes the Euclidean projection operator.

The neat thing about Adagrad is that it chooses different effective step sizes for each variable, rather than using one fixed step size for all of them. I’ve found that Adagrad outperforms SGD in my unconstrained optimization problems.
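
Here is a minimal sketch of the diagonal update in the unconstrained case (so the projection is the identity), run on a badly scaled quadratic of my own choosing. The small `eps` in the denominator is a common implementation safeguard and is an assumption on my part, not part of the update as written above.

```python
import numpy as np

def adagrad_diag(grad, x0, eta, n_steps, eps=1e-12):
    """Unconstrained diagonal Adagrad: x <- x - eta * diag(G_t)^{-1/2} g_t."""
    x = np.array(x0, dtype=float)
    G_diag = np.zeros_like(x)          # running sum of squared gradients
    for _ in range(n_steps):
        g = grad(x)
        G_diag += g**2
        x -= eta * g / (np.sqrt(G_diag) + eps)   # eps guards against 0/0
    return x

# A badly scaled quadratic f(x) = 0.5 * sum_i c_i x_i^2, with gradient c * x.
c = np.array([1.0, 100.0, 1e4])
x_final = adagrad_diag(lambda x: c * x, x0=np.ones(3), eta=0.5, n_steps=200)
print(x_final)
```

Plain gradient descent on this quadratic needs a step size below \(2/10^4\) to avoid diverging along the steepest coordinate, whereas the per-coordinate normalization lets all three coordinates shrink at the same rate.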

Now I need to use Adagrad on a ball-constrained problem: \(\mathcal{X} = \{ x : \|x\|_2 \leq B \}\). Computing the associated projection operator \(\Pi_{\mathcal{X}}^A\) was a nice enough exercise in basic convex optimization that it seems worth recording here. Actually, given the ubiquity of this constraint set, I’m surprised the authors didn’t seem to provide this as an example in their paper (I may have missed it, since I only read the absolute minimum to get the gist of the idea and then skimmed the rest).

Adagrad’s connection to Euclidean projection onto ellipsoids

Our first step is to observe that
\[
\min_{\|x\|_2 \leq B} \|A^{1/2} (x- y)\|_2^2 = \min_{\|A^{-1/2} x\|_2 \leq B} \|x - A^{1/2}y\|_2^2
\]
since \(A^{1/2}\) is invertible, and that the unique minimizer of the first program is a linear transformation of the unique minimizer of the second. Concretely,
\begin{align*}
\Pi_{\|x\|_2 \leq B}^{A}(y) & = A^{-1/2} \text{argmin}_{\|A^{-1/2} x\|_2 \leq B} \|x - A^{1/2}y\|_2^2 \\
& = A^{-1/2} \Pi_{\|A^{-1/2} x\|_2 \leq B} (A^{1/2} y).
\end{align*}

The Lagrangian dual to Euclidean projection onto an ellipsoid

Thus, it suffices to determine the Euclidean projection operator onto an ellipsoid, \(\Pi_{\|Q x\|_2 \leq B}(z)\), where \(Q\) is positive-semidefinite. To do so, let’s write out the optimization problem that defines this projection operator:
\[
\min_{x^T Q^T Q x \leq B^2} (x - z)^T(x-z).
\]
The associated Lagrangian is:
\begin{align*}
L(x,\lambda) &= (x-z)^T(x-z) - \lambda ( B^2 - x^TQ^TQx) \\
& = x^T( I + \lambda Q^TQ) x - 2z^T x + z^Tz - \lambda B^2.
\end{align*}
Recall that \(L(x,\lambda)\) is a lower bound on the objective whenever \(x\) is feasible and \(\lambda \geq 0.\) To find a uniform lower bound, we minimize \(L\) with respect to \(x\) by setting \(\nabla_x L = 0\). This occurs when
\[
x = (I + \lambda Q^TQ)^{-1} z,
\]
and gives the uniform lower bound
\[
g(\lambda) = \min_x L(x, \lambda) = z^T[ I - (I + \lambda Q^TQ)^{-1} ]z - \lambda B^2.
\]
The optimization problem we’re looking at satisfies strong duality, so we know that maximizing \(g\) over the set of nonnegative \(\lambda\) gives the same optimal value as the original problem, and corresponding to the optimal \(\lambda\) there is an optimal \(x\). We now find the optimal \(\lambda\) and show how to recover the optimal \(x\).

From the dual optimal point to the primal optimal point

First write the SVD of \(Q^TQ\) as \(Q^TQ = U \Sigma U^T.\) It follows that
\[
I - (I + \lambda Q^TQ)^{-1} = U \, \mathrm{diag}\left( \frac{\lambda \sigma_{i}}{1 + \lambda \sigma_{i}} \right) U^T,
\]
where \(\sigma_i = \Sigma_{ii}.\) The optimal \(\lambda\) then is the nonnegative \(\lambda\) which maximizes
\begin{align*}
g(\lambda) & = z^T[ I - (I + \lambda Q^TQ)^{-1} ]z - \lambda B^2 \\
& = \sum\nolimits_i (U^Tz)_i^2 \frac{\lambda \sigma_{i}}{1 + \lambda \sigma_{i}} - \lambda B^2.
\end{align*}

Observe that \(g(\lambda)\) is concave on \(\mathbb{R}^+,\) so its maximum occurs at \(0\) or the point where \(g^\prime(\lambda) = 0\). Thus, if the equation
\[
\sum\nolimits_i (U^Tz)_i^2 \frac{\sigma_i}{(1 + \lambda \sigma_i)^2} = \|Q(I + \lambda Q^TQ)^{-1} z\|_2^2 = B^2
\]
has a nonnegative solution, that solution is the optimal \(\lambda\). If not, then the optimal \(\lambda\) is \(0.\)

Given this optimal \(\lambda,\) the corresponding optimal \(x\) is given by
\[
x = \Pi_{\|Qx\|_2 \leq B}(z) = (I + \lambda Q^TQ)^{-1} z.
\]

From the Euclidean projection on an ellipsoid back to the Adagrad projector

Putting all the pieces together, we get the following expression for the projection operator needed in Adagrad:
\begin{align*}
\Pi_{\|x\|_2 \leq B}^A(y) & = A^{-1/2} \Pi_{\|A^{-1/2}x\|_2 \leq B}(A^{1/2}y) \\
& = A^{-1/2} ( I + \lambda A^{-1})^{-1} A^{1/2} y \\
& = (I + \lambda A^{-1})^{-1}y = A(A + \lambda I)^{-1}y,
\end{align*}
where \(\lambda\) is the nonnegative solution to the nonlinear equation
\begin{align*}
\sum\nolimits_i (\Sigma^{1/2} U^T y)_{n-i+1}^2 \frac{\sigma_i(A^{-1})}{(1 + \lambda \sigma_i(A^{-1}))^2} & = \|(I + \lambda A^{-1})^{-1} y \|_2^2 \\
& = \|A(A + \lambda I)^{-1} y\|_2^2 = B^2,
\end{align*}
with \(A = U \Sigma U^T,\) if such a solution exists, and \(0\) otherwise.
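
In the diagonal case used in practice this is cheap to implement. Here is a sketch, assuming \(A = \text{diag}(a)\) with positive entries and using bisection to solve \(\|A(A + \lambda I)^{-1} y\|_2 = B\); the bisection, the function name `project_ball_A`, and the tolerances are my choices, and any scalar root finder would do.

```python
import numpy as np

def project_ball_A(y, a_diag, B, tol=1e-12, max_iter=200):
    """argmin over ||x||_2 <= B of (x - y)^T A (x - y), for A = diag(a_diag) > 0.

    Uses x(lam) = A (A + lam I)^{-1} y and bisection on ||x(lam)||_2 = B.
    Returns the projection and the multiplier lam.
    """
    y = np.asarray(y, dtype=float)
    a = np.asarray(a_diag, dtype=float)
    if np.linalg.norm(y) <= B:           # already feasible: lam = 0
        return y.copy(), 0.0
    x_of = lambda lam: a * y / (a + lam)
    lo, hi = 0.0, 1.0
    while np.linalg.norm(x_of(hi)) > B:  # ||x(lam)||_2 is decreasing in lam
        hi *= 2.0
    for _ in range(max_iter):
        lam = 0.5 * (lo + hi)
        if np.linalg.norm(x_of(lam)) > B:
            lo = lam
        else:
            hi = lam
        if hi - lo <= tol * max(1.0, hi):
            break
    lam = 0.5 * (lo + hi)
    return x_of(lam), lam

a_diag = np.array([0.1, 1.0, 2.0, 5.0, 10.0])
y = np.array([3.0, -2.0, 1.5, 0.5, -4.0])
x_proj, lam = project_ball_A(y, a_diag, B=1.0)
print(np.linalg.norm(x_proj), lam)
```

Coordinates with large \(a_i\) are shrunk the least, which is the sense in which this differs from the Euclidean projection \(B\,y/\|y\|_2\).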

Back of the envelope calculations of how fast your computer can do linear algebra operations

2014-01-23T15:28:14-05:00

Let’s talk about CPU speed, practically. By practically, I mean, how fast can your CPU do linear algebra operations. And by linear algebra operations, I mean matrix-matrix multiplies.

First, you need to calculate how many FLOPS your computer can do. The following formula comes in handy:
\[
\text{nFLOPS} = \text{cores} \cdot \frac{\text{clock cycles}}{\text{second}} \cdot \frac{\text{FLOPS}}{\text{cycle}}.
\]
You probably already know the number of cores in your computer, and the number of clock cycles per second. The interesting thing here is the number of FLOPS per cycle: this depends on the architecture of your CPU and what exactly you take to be the size of a float.

It’s standard to take a float to consist of 32 bits, so the number of FLOPS per cycle depends on how many multiples of 32 bits can fit into your registers. SSE capable CPUs have 128 bit registers, so can do 4 FLOPS per cycle (this is the most common set of CPUs). AVX capable CPUs have 256 bit registers, so can do 8 FLOPS per cycle (e.g. the latest Macbook Pros are AVX capable).

Putting these bits together, I get that my workstation, which has 2 hexa-core SSE-capable CPUs each running at 2 GHz, achieves
\[
\text{nFLOPS} = (2*6) * (2*10^9)*4 = 96 \text{GFLOPS}.
\]

The cost of a matrix-matrix multiply of two \(n\)-by-\(n\) matrices is essentially \(n^3\) floating point operations. Thus it should take this workstation about \(\frac{n^3}{96} * 10^{-9}\) seconds to do this multiply.

E.g., in my case, the predicted time of multiplying two \(16\text{K} \times 16\text{K}\) matrices is about 42.6 seconds. A quick Matlab check shows it does take about 43 seconds.
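
The arithmetic is easy to script. This sketch just encodes the formula above; the function names are mine, and I am taking 16K to mean 16,000, which is the reading that reproduces the 42.6 second figure.

```python
def peak_flops(cores, clock_hz, flops_per_cycle):
    """Theoretical peak: cores * cycles/second * FLOPs/cycle."""
    return cores * clock_hz * flops_per_cycle

def matmul_seconds(n, flops):
    """Predicted time for an n-by-n times n-by-n multiply at ~n^3 operations."""
    return n**3 / flops

# The workstation from the text: 2 hexa-core SSE CPUs at 2 GHz, 4 FLOPs/cycle.
workstation = peak_flops(cores=2 * 6, clock_hz=2e9, flops_per_cycle=4)
print(workstation / 1e9, "GFLOPS")
print(matmul_seconds(16_000, workstation), "seconds")
```

Swapping in `flops_per_cycle=8` gives the corresponding estimate for an AVX-capable machine.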

A useful trick for computing gradients w.r.t. matrix arguments, with some examples

2013-11-12T19:56:52-05:00

I’ve spent hours this week and last week computing, recomputing, and checking expressions for matrix gradients of functions. It turns out that except in the simplest of cases, the most pain-free method for finding such gradients is to use the Frechet derivative (this is one of the few concrete benefits I derived from the differential geometry course I took back in grad school).

Remember that the Frechet derivative of a function \(f : X \rightarrow \mathbb{R}\) at a point \(x\) is defined as the unique linear operator \(d\) that is tangent to \(f\) at \(x\), i.e. that satisfies
\[
f(x+h) = f(x) + d(h) + o(\|h\|).
\]
This definition of differentiability makes sense whenever \(X\) is a normed linear space. If \(f\) has a gradient, then the Frechet derivative exists and the gradient satisfies the relation \(d(h) = \langle \nabla f(x), h \rangle.\)

Simple application

As an example application, let’s compute the gradient of the function
\[
f(X) = \langle A, XX^T \rangle := \mathrm{trace}(A^T XX^T) = \sum_{ij} A_{ij} (XX^T)_{ij}
\]
over the linear space of \(m\) by \(n\) real-valued matrices equipped with the Frobenius norm. First we can expand out \(f(X+H)\) as
\[
f(X + H) = \langle A, (X+H)(X+H)^T \rangle = \langle A, XX^T + XH^T + HX^T + HH^T \rangle
\]
Now we observe that the terms which involve more than one power of \(H\) are \(O(\|H\|^2) = o(\|H\|)\) as \(H \rightarrow 0\), so
\[
f(X + H) = f(X) + \langle A, XH^T + HX^T \rangle + o(\|H\|).
\]
It follows that
\[
d(H) = \langle A, XH^T + HX^T \rangle = \mathrm{trace}(A^TXH^T) + \mathrm{trace}(A^THX^T),
\]
which is clearly a linear function of \(H\) as desired. To write this in a way that exposes the gradient, we use the cyclicity properties of the trace, and exploit its invariance under transposes to see that
\begin{align}
d(H) & = \mathrm{trace}(HX^TA) + \mathrm{trace}(X^TA^T H) \\
& = \mathrm{trace}(X^TAH) + \mathrm{trace}(X^TA^T H) \\
& = \langle A^TX, H \rangle + \langle AX, H \rangle \\
& = \langle (A + A^T)X, H \rangle.
\end{align}
The gradient of \(f\) at \(X\) is evidently \((A + A^T)X\).
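
Whenever I derive a gradient this way, I like to confirm it against finite differences. A minimal sketch of such a check for this example follows; the helper `numerical_grad`, the step size `h`, and the random test matrices are illustrative choices.

```python
import numpy as np

def numerical_grad(f, X, h=1e-6):
    """Central finite-difference gradient of a scalar function of a matrix."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2.0 * h)
    return G

rng = np.random.default_rng(0)
m, n = 4, 3
A = rng.standard_normal((m, m))   # deliberately not symmetric
X = rng.standard_normal((m, n))

f = lambda X: np.trace(A.T @ X @ X.T)      # f(X) = <A, X X^T>
grad_formula = (A + A.T) @ X

print(np.max(np.abs(grad_formula - numerical_grad(f, X))))
```

Using a nonsymmetric `A` matters here: with a symmetric `A` the check could not tell \((A + A^T)X\) apart from \(2AX\).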

More complicated application

If you have the patience to work through a lot of algebra, you could probably calculate the above gradient component by component using the standard rules of differential calculus, then back out the simple matrix expression \((A + A^T)X\). But what if we partitioned \(X\) into \(X = [\begin{matrix}X_1^T & X_2^T \end{matrix}]^T\) and desired the derivative of
\[
f(X_1, X_2) = \mathrm{trace}\left(A \left[\begin{matrix} X_1 \\ X_2 \end{matrix}\right] \left[\begin{matrix}X_1 \\ X_2 \end{matrix} \right]^T\right)
\]
with respect to \(X_2\)? Then the bookkeeping necessary becomes even more tedious if you want to compute component by component derivatives (I imagine, not having attempted it). On the other hand, the Frechet derivative route is not significantly more complicated.

Some basic manipulations allow us to claim
\begin{align}
f(X_1, X_2 + H) & = \mathrm{trace}\left(A \left[\begin{matrix} X_1 \\ X_2 + H \end{matrix}\right] \left[\begin{matrix}X_1 \\ X_2 + H \end{matrix} \right]^T\right) \\
& = f(X_1, X_2) + \mathrm{trace}\left(A \left[\begin{matrix} 0 & X_1 H^T \\
H X_1^T & H X_2^T + X_2 H^T + H H^T \end{matrix} \right]\right)
\end{align}
Once again we drop the \(o(\|H\|)\) terms to see that
\[
d(H) = \mathrm{trace}\left(A \left[\begin{matrix} 0 & X_1 H^T \\
H X_1^T & H X_2^T + X_2 H^T \end{matrix} \right]\right).
\]
To find a simple expression for the gradient, we partition \(A\) (conformally with our partitioning of \(X\) into \(X_1\) and \(X_2\)) as
\[
A = \left[\begin{matrix} A_1 & A_2 \\ A_3 & A_4 \end{matrix} \right].
\]
Given this partitioning,
\begin{align}
d(H) & = \mathrm{trace}\left(\left[\begin{matrix}
A_2 H X_1^T & \\
& A_3 X_1 H^T + A_4 H X_2^T + A_4 X_2 H^T
\end{matrix}\right] \right) \\
& = \langle A_2^TX_1, H \rangle + \langle A_3X_1, H \rangle + \langle A_4^T X_2, H \rangle + \langle A_4X_2, H \rangle \\
& = \langle (A_2^T + A_3)X_1 + (A_4^T + A_4)X_2, H \rangle.
\end{align}
The first equality comes from noting that the trace of a block matrix is simply the trace of its diagonal parts, and the second comes from manipulating the traces using their cyclicity and invariance to transposes.

Thus \(\nabla_{X_2} f(X_1, X_2) = (A_2^T + A_3)X_1 + (A_4^T + A_4)X_2.\)
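
The same kind of finite-difference check works for the partitioned case. In this sketch the block sizes and random matrices are illustrative, and only the entries of \(X_2\) are perturbed.

```python
import numpy as np

rng = np.random.default_rng(0)
m1, m2, n = 3, 2, 4
X1 = rng.standard_normal((m1, n))
X2 = rng.standard_normal((m2, n))
A = rng.standard_normal((m1 + m2, m1 + m2))
# Partition A conformally with [X1; X2].
A2, A3, A4 = A[:m1, m1:], A[m1:, :m1], A[m1:, m1:]

def f(X2_):
    """f(X1, X2) = trace(A [X1; X2] [X1; X2]^T), with X1 held fixed."""
    X = np.vstack([X1, X2_])
    return np.trace(A @ X @ X.T)

grad_formula = (A2.T + A3) @ X1 + (A4.T + A4) @ X2

# Central finite differences with respect to the entries of X2 only.
h = 1e-6
grad_fd = np.zeros_like(X2)
for idx in np.ndindex(*X2.shape):
    E = np.zeros_like(X2)
    E[idx] = h
    grad_fd[idx] = (f(X2 + E) - f(X2 - E)) / (2.0 * h)

print(np.max(np.abs(grad_formula - grad_fd)))
```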

A masterclass application

Maybe you didn’t find the last example convincing. Here’s a function I needed to compute the matrix gradient for (a task which I defy you to accomplish using standard calculus operations):
\[
f(V) = \langle 1^T K^T, \log(1^T \mathrm{e}^{VV^T}) \rangle = \log(1^T \mathrm{e}^{VV^T})K1.
\]
Here, \(K\) is an \(n \times n\) matrix (nonsymmetric in general), \(V\) is an \(n \times d\) matrix, and \(1\) is a column vector of ones of length \(n\). The exponential \(\mathrm{e}^{VV^T}\) is computed entrywise, as is the \(\log\).

To motivate why you might want to take the gradient of this function, consider the situation that \(K_{ij}\) measures how similar items \(i\) and \(j\) are in a nonsymmetric manner, and the rows of \(V\) are coordinates for representations of the items in Euclidean space. Then \((1^T K)_j\) measures how similar item \(j\) is to all the items, and
\[
(1^T \mathrm{e}^{VV^T})_j = \sum_{\ell=1}^n \mathrm{e}^{v_\ell^T v_j}
\]
is a measure of how similar the embedding \(v_j\) is to the embeddings of all the items. Thus, if we constrain all the embeddings to have norm 1, maximizing \(f(V)\) with respect to \(V\) ensures that the embeddings capture the item similarities in some sense. (Why do you care about this particular sense? That’s another story altogether.)

Ignoring the constraints (you could use a projected gradient method for the optimization problem), we’re now interested in finding the gradient of \(f\). In the following, I use the notation \(A \odot B\) to indicate the pointwise product of two matrices.
\begin{align}
f(V + H) & = \langle 1^T K, \log(1^T \mathrm{e}^{(V+H)(V+H)^T}) \rangle \\
& = \langle 1^T K, \log(1^T [\mathrm{e}^{VV^T} \odot \mathrm{e}^{VH^T} \odot \mathrm{e}^{HV^T} \odot \mathrm{e}^{HH^T} ]) \rangle
\end{align}
One can use the series expansion of the exponential to see that
\begin{align}
\mathrm{e}^{VH^T} & = 11^T + VH^T + o(\|H\|), \\
\mathrm{e}^{HV^T} & = 11^T + HV^T + o(\|H\|), \text{ and}\\
\mathrm{e}^{HH^T} & = 11^T + o(\|H\|).
\end{align}
It follows that
\begin{multline}
f(V + H) = \langle 1^T K, \log(1^T [\mathrm{e}^{VV^T} \odot (11^T + VH^T + o(\|H\|)) \\
\odot (11^T + HV^T + o(\|H\|)) \odot (11^T + o(\|H\|)) ]) \rangle.
\end{multline}
This readily simplifies to
\begin{align}
f(V + H) & = \langle 1^T K, \log(1^T [\mathrm{e}^{VV^T} \odot(11^T + VH^T + HV^T + o(\|H\|) )]) \rangle \\
& = \langle 1^T K, \log(1^T [\mathrm{e}^{VV^T} + \mathrm{e}^{VV^T} \odot (VH^T + HV^T) + o(\|H\|) ]) \rangle
\end{align}
Now recall the linear approximation of \(\log\):
\[
\log(x) = \log(x_0) + \frac{1}{x_0} (x-x_0) + O(|x- x_0|^2).
\]
Apply this approximation pointwise to conclude that
\begin{multline}
f(V + H) = \langle 1^T K, \log(1^T \mathrm{e}^{VV^T}) + \\
\{1^T \mathrm{e}^{VV^T}\}^{-1}\odot (1^T [\mathrm{e}^{VV^T} \odot (VH^T + HV^T) + o(\|H\|)]) \rangle,
\end{multline}
where \(\{x\}^{-1}\) denotes the pointwise inverse of a vector.
Take \(D\) to be the diagonal matrix with diagonal entries given by \(1^T \mathrm{e}^{VV^T}\). We have shown that
\[
f(V + H) = f(V) + \langle K^T1, D^{-1} [\mathrm{e}^{VV^T} \odot (VH^T + HV^T)]1 \rangle + o(\|H\|),
\]
so
\begin{align}
d(H) & = \langle K^T1, D^{-1} [\mathrm{e}^{VV^T} \odot (VH^T + HV^T)]1 \rangle \\
& = \langle D^{-1}K^T 11^T, \mathrm{e}^{VV^T} \odot (VH^T + HV^T) \rangle \\
& = \langle \mathrm{e}^{VV^T} \odot D^{-1}K^T 11^T, (VH^T + HV^T) \rangle.
\end{align}
The second equality follows from the standard properties of inner products and the third from the observation that
\[
\langle A, B\odot C \rangle = \sum_{ij} A_{ij}*B_{ij}*C_{ij} = \langle B \odot A, C \rangle.
\]
Finally, manipulations in the vein of the two preceding examples allow us to claim that
\[
\nabla_V f(V) = [\mathrm{e}^{VV^T} \odot (11^T K D^{-1} + D^{-1} K^T 11^T)] V.
\]

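As a caveat, note that the derivation above pairs the logarithm with \(1^T K\), so the last expression is the gradient of \(\langle 1^T K, \log(1^T \mathrm{e}^{VV^T}) \rangle = \log(1^T \mathrm{e}^{VV^T}) K^T 1\). If instead \(f(V) = \log(1^T \mathrm{e}^{VV^T}) K 1\), then one should interchange \(K\) and \(K^T\) in the last expression.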
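
With this many transposes in play, a numerical check is cheap insurance. The sketch below compares the final gradient expression against central finite differences for the form of \(f\) used throughout the derivation, \(\langle 1^T K, \log(1^T \mathrm{e}^{VV^T}) \rangle\); the matrix sizes, the random data, and the step size are illustrative.

```python
import numpy as np

def f(V, K):
    """f(V) = <1^T K, log(1^T exp(V V^T))>, with exp and log entrywise."""
    col_sums = np.exp(V @ V.T).sum(axis=0)      # 1^T e^{V V^T}
    return K.sum(axis=0) @ np.log(col_sums)     # (1^T K) . log(...)

def grad_f(V, K):
    """[e^{VV^T} * (1 1^T K D^{-1} + D^{-1} K^T 1 1^T)] V, with * entrywise."""
    E = np.exp(V @ V.T)
    c = K.sum(axis=0) / E.sum(axis=0)           # (1^T K)_j / D_jj
    # (1 1^T K D^{-1})_{ij} = c_j and (D^{-1} K^T 1 1^T)_{ij} = c_i.
    return (E * (c[None, :] + c[:, None])) @ V

rng = np.random.default_rng(0)
n, d = 6, 3
K = rng.uniform(0.0, 1.0, size=(n, n))          # nonsymmetric similarities
V = 0.5 * rng.standard_normal((n, d))

h = 1e-6
grad_fd = np.zeros_like(V)
for idx in np.ndindex(*V.shape):
    P = np.zeros_like(V)
    P[idx] = h
    grad_fd[idx] = (f(V + P, K) - f(V - P, K)) / (2.0 * h)

print(np.max(np.abs(grad_f(V, K) - grad_fd)))
```

Replacing `K.sum(axis=0)` with `K.sum(axis=1)` in both functions gives the variant with \(K\) and \(K^T\) interchanged.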

Quick note on the Chen, Chi, Goldsmith covariance sketching paper

2013-10-07T15:19:12-04:00

NB: I will update this post as I read the paper, in case it turns out that the first issue I raised is not legitimately a concern.

Covariance estimation (and the natural extension, precision estimation) has always been an interesting topic for me because it can represent a concise, concrete, and very broadly applicable instance of applied nonasymptotic random matrix theory. Likewise, I’m also quite interested in matrix sketching algorithms. Thus I was very excited to see the latest preprint by Chen, Chi, and Goldsmith on arXiv, which presents a convex optimization algorithm for recovering covariance matrices from randomized sketches obtained in a streaming manner. Their convex optimization problem is essentially the PhaseLift formulation used for recovering phase from magnitude, but their proof shows that it works for covariance matrix recovery.

I have only just started reading this paper, and I’m still excited, but I have two concerns already: first, it is not clear to me that the algorithm they propose is actually the algorithm they provide a guarantee for! At the least, the text must be corrected to make it clear that this is indeed the case. To be more precise, their algorithm is to compute a few random sketching vectors ahead of time, then as the rows of the matrix come in, compute the magnitude of the projection of each row onto *one* randomly chosen sketching vector. The measurement model described mathematically seems to compute the magnitude of the projection of each row onto *each* of the sketching vectors. Big difference there.

Second, their algorithm provides a Frobenius norm guarantee, which is par for the course, but they make claims about things like subspace detection, for which afaik, Frobenius norm guarantees are too weak to really be of interest. But here, this may be a matter of preference, and practitioners probably don’t care about sharp guarantees as long as it works in practice and has at least a semblance of theoretical support.