Handy Matrix Cheat Sheet, with Gradients

This post is an (unpaid) advertisement for the following extremely useful resource:

  • Petersen, K. B. and M. S. Pedersen. 2008. The Matrix Cookbook. Technical Report, Technical University of Denmark.

It contains 70+ pages of useful relations and derivations involving matrices. What grabbed my eye was the computation of gradients for matrix operations ranging from eigenvalues and determinants to multivariate normal density functions. I had no idea the multivariate normal had such a clean gradient (see section 8).
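
To give a flavor of section 8, here's the kind of identity involved, written from memory, so check the exact form (and the symmetric-matrix variants) against the Cookbook itself. For the multivariate normal log density

  \log N(x \mid \mu, \Sigma) = -\tfrac{1}{2}\log|2\pi\Sigma| - \tfrac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu),

the gradients are

  \frac{\partial}{\partial \mu} \log N(x \mid \mu, \Sigma) = \Sigma^{-1}(x-\mu)

  \frac{\partial}{\partial \Sigma} \log N(x \mid \mu, \Sigma) = \tfrac{1}{2}\left( \Sigma^{-1}(x-\mu)(x-\mu)^\top \Sigma^{-1} - \Sigma^{-1} \right).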

We’ve been playing around with Hamiltonian (aka Hybrid) Monte Carlo for sampling from the posterior of hierarchical generalized linear models with lots of interactions. HMC speeds up Metropolis sampling by using the gradient of the log probability to drive samples in the direction of higher probability density, which is particularly useful for correlated parameters that mix slowly with standard Gibbs sampling. Matt “III” Hoffman’s already got it working in Cython/numpy for the discrete-predictor case.
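
For concreteness, the core of HMC is the leapfrog integrator that simulates Hamiltonian dynamics on the log posterior. Here's a minimal sketch in plain C++ (my own illustration, not Matt's Cython code; it assumes an identity mass matrix, and grad_log_prob is a stand-in for whatever computes the gradient):

    #include <cstddef>
    #include <functional>
    #include <vector>

    // One leapfrog step: half-step on the momentum, full step on the
    // position, another half-step on the momentum.  eps is the step size.
    void leapfrog(std::vector<double>& theta,
                  std::vector<double>& momentum,
                  double eps,
                  const std::function<std::vector<double>(const std::vector<double>&)>&
                      grad_log_prob) {
      std::vector<double> g = grad_log_prob(theta);
      for (std::size_t i = 0; i < theta.size(); ++i)
        momentum[i] += 0.5 * eps * g[i];          // half step on momentum
      for (std::size_t i = 0; i < theta.size(); ++i)
        theta[i] += eps * momentum[i];            // full step on position
      g = grad_log_prob(theta);
      for (std::size_t i = 0; i < theta.size(); ++i)
        momentum[i] += 0.5 * eps * g[i];          // second half step on momentum
    }

A proposal is a number of these steps followed by a Metropolis accept/reject on the total energy (negative log posterior plus the kinetic energy of the momentum).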

To really get this going, we need to be able to handle the varying intercept/varying slope models used in the Red State/Blue State analysis of Gelman, Shor, Bafumi and Park, which is also explained by Andrew and Jennifer in their regression book. The slopes and intercepts get multivariate normal priors, the covariance matrices of which require hyperpriors. That means log probability functions involving operations like matrix-inverse products or eigenvalues (depending on how you model the covariance prior).
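
Concretely, the kind of model I have in mind looks roughly like this (my paraphrase of the notation Andrew and Jennifer use, so don't treat it as a quote from the book):

  y_i \sim \mathrm{N}\!\left(\alpha_{j[i]} + \beta_{j[i]} x_i,\ \sigma_y^2\right)

  \begin{pmatrix} \alpha_j \\ \beta_j \end{pmatrix} \sim \mathrm{N}\!\left(\begin{pmatrix} \mu_\alpha \\ \mu_\beta \end{pmatrix},\ \Sigma\right)

with a hyperprior on the covariance matrix \Sigma (e.g., a scaled inverse-Wishart), which is where the matrix-inverse and determinant gradients come in.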

I’ve been following up on some earlier suggestions on this blog about automatic differentiation. I’ve pretty much settled on using David Gay’s elegant little Reverse Automatic Differentiation (RAD) package, which is a very straightforward C++ template-based library for computing gradients. All you need to do is replace the floating-point calculations in your code with templated versions.
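
The flavor of the change is sketched below with a generic scalar type T; I'm not reproducing RAD's actual class names here, just the pattern, and I'm assuming the AD type supplies the usual mixed double/T arithmetic operators:

    #include <cmath>

    // A univariate normal log density written generically over the scalar
    // type.  With T = double it computes the plain value; with T set to a
    // reverse-mode AD variable, the same code records the operations needed
    // to back out the gradient.
    template <typename T>
    T normal_log_density(const T& y, const T& mu, const T& sigma) {
      using std::log;                    // falls back to std::log when T = double
      T z = (y - mu) / sigma;
      return -0.5 * z * z - log(sigma) - 0.9189385332046727;  // constant is log(sqrt(2*pi))
    }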

This in turn means we need a templated C++ vector/matrix library with BLAS/LAPACK-like functionality. An exchange on Justin Domke’s blog led me to the Eigen package, which is a beautifully templated implementation of BLAS and at least what I need from LAPACK. Their documentation led me to the Cookbook.
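
As an example of the kind of thing I need, here's a multivariate normal log density computed through Eigen's Cholesky (LLT) decomposition, so the covariance matrix never has to be inverted explicitly. This is just my sketch against the Eigen API as I understand it; multi_normal_log_density is my own name, not anything from the library:

    #include <Eigen/Dense>

    // log N(x | mu, Sigma) via the factorization Sigma = L L^T.
    double multi_normal_log_density(const Eigen::VectorXd& x,
                                    const Eigen::VectorXd& mu,
                                    const Eigen::MatrixXd& Sigma) {
      const double LOG_TWO_PI = 1.8378770664093453;
      Eigen::LLT<Eigen::MatrixXd> llt(Sigma);                   // Cholesky factorization
      Eigen::MatrixXd L = llt.matrixL();
      double log_det = 2.0 * L.diagonal().array().log().sum();  // log |Sigma|
      Eigen::VectorXd diff = x - mu;
      double quad = diff.dot(llt.solve(diff));                  // diff^T Sigma^{-1} diff
      return -0.5 * (x.size() * LOG_TWO_PI + log_det + quad);
    }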

14 thoughts on “Handy Matrix Cheat Sheet, with Gradients”

  1. The momentum's what helps you take bigger steps in the Metropolis proposals without sacrificing a high acceptance rate. So you get faster mixing. The math of how all this works is still beyond me, but there are basic introductions in both MacKay's and Bishop's machine learning books.

    The paper that everyone's buzzing about now is Girolami and Calderhead's Riemann Manifold Langevin and Hamiltonian Monte Carlo, which apparently mixes even faster.

  2. I checked out uBLAS for templated BLAS-level operations, and even managed to install it and get it working.

    Overall, the Boost C++ libraries, of which uBLAS is a part, look a bit more mature and more widely used than Eigen. But they also look much more complex and don't have any solvers (not even matrix inverse or LU decomposition) built on top of uBLAS.

    Boost also has some stats libraries as part of the Boost Math Toolkit, but for some reason they include none of the multivariate distributions we need, not even the multivariate normal.

  3. HMC is much, much better than Langevin methods for most problems because it avoids the inefficient exploration of the state space by a random walk that Langevin (and random-walk Metropolis) methods do. HMC also has better scaling with dimensionality. See my review at http://www.cs.utoronto.ca/~radford/ham-mcmc.abstr….

    The paper by Geyer and Thompson referenced just above tackles the problem of multiple modes, which HMC doesn't try to solve. (At least standard HMC doesn't try to solve it, see the section on tempering in my HMC review for a variation that does.)

  4. Andrew, your link goes to a book review that doesn't seem to be related to the paper you refer to. Perhaps check the link and re-post?

  5. Radford:

    Yes, one of the longstanding confusions in this field is when people think that slow convergence has to come from multiple modes. I remember this coming up back in 1991: when I demonstrated the use of multiple sequences to monitor convergence, some people were skeptical, pointing out that my method wouldn't work if all the sequences were started at the same mode. But the real problem was their intuition that convergence within any mode is instantaneous. They didn't realize that you can have slow convergence even aside from any issues with well-separated modes.

  6. The MCMSki presentation by Omar Ghattas was about the use of the Metropolis Adjusted Langevin Algorithm (MALA), where a pre-conditioning matrix is used to locally adjust the gradient and the covariance of the proposal. The methods presented were not Riemannian Hamiltonian (RM-HMC) or HMC based.

    The Girolami & Calderhead paper will appear in the March issue of JRSS-B. There are a huge number of really interesting discussion contributions published with the paper, fizzing with great ideas for further development of the general methodology, which makes it quite long: ninety-one pages in total.

    Girolami, M. and B. Calderhead. 2011. Riemann manifold Langevin and Hamiltonian Monte Carlo methods (with discussion). Journal of the Royal Statistical Society, Series B 73(2):123–214.

    http://www.ucl.ac.uk/statistics/research/rmhmc

    Regarding the question about why Hamiltonian Monte Carlo is preferred to Langevin methods: if the Hamiltonian is defined on an appropriate Riemann manifold for the statistical model, it is straightforward to show that the RM-HMC proposals actually follow the paths (geodesics) required to minimise the variance of the estimated log-ratio of normalising constants of the densities; see the discussion in the Girolami & Calderhead paper.

    Interestingly, these geodesic equations were developed in the Gelman & Meng Statistical Science paper (1998 vintage) to define the variance-minimising path for bridge sampling, which the Hamilton-Jacobi equations solve; see the rejoinder in the Girolami paper.

    There is also a chapter in the Mixtures: Estimation and Applications book edited by Xian Robert et al. on employing manifold Hamiltonian and Langevin MCMC methods for mixture models (both RM-HMC and MMALA).

    For Gaussian mixtures, the preliminary results reported suggest substantially improved sampling performance over Gibbs sampling and HMC (where manual tuning of the mass matrix is carried out).

    See Xian's Og for details of the book.
