Hamiltonian Monte Carlo using an adjoint-differentiated Laplace approximation: Bayesian inference for latent Gaussian models and beyond

Charles Margossian, Aki Vehtari, Daniel Simpson, and Raj Agrawal write:

Gaussian latent variable models are a key class of Bayesian hierarchical models with applications in many fields. Performing Bayesian inference on such models can be challenging as Markov chain Monte Carlo algorithms struggle with the geometry of the resulting posterior distribution and can be prohibitively slow. An alternative is to use a Laplace approximation to marginalize out the latent Gaussian variables and then integrate out the remaining hyperparameters using dynamic Hamiltonian Monte Carlo, a gradient-based Markov chain Monte Carlo sampler. To implement this scheme efficiently, we derive a novel adjoint method that propagates the minimal information needed to construct the gradient of the approximate marginal likelihood. This strategy yields a scalable differentiation method that is orders of magnitude faster than state of the art differentiation techniques when the hyperparameters are high dimensional. We prototype the method in the probabilistic programming framework Stan and test the utility of the embedded Laplace approximation on several models, including one where the dimension of the hyperparameter is ∼6,000. Depending on the cases, the benefits can include an alleviation of the geometric pathologies that frustrate Hamiltonian Monte Carlo and a dramatic speed-up.

“Orders of magnitude faster” . . . That’s pretty good!
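To spell out the key object (my notation, which may not match the paper's): write theta for the latent Gaussian parameters and phi for the hyperparameters. The scheme runs dynamic HMC on the approximate marginal posterior

$$
p(\phi \mid y) \;\propto\; p(\phi)\,\hat p(y \mid \phi),
\qquad
\hat p(y \mid \phi) = p(y \mid \hat\theta, \phi)\, p(\hat\theta \mid \phi)\,(2\pi)^{d/2}\,\bigl|H(\phi)\bigr|^{-1/2},
\quad
H(\phi) = -\nabla_\theta^2 \log p(y, \theta \mid \phi)\big|_{\theta = \hat\theta},
$$

where theta-hat maximizes log p(y, theta | phi) over the d-dimensional latent vector theta. The adjoint method is about computing the gradient of log p-hat(y | phi) with respect to phi cheaply, even when phi has thousands of components.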

12 thoughts on “Hamiltonian Monte Carlo using an adjoint-differentiated Laplace approximation: Bayesian inference for latent Gaussian models and beyond”

    • To be more descriptive, the title would have to be paragraphs long!

      Hamiltonian Monte Carlo using an adjoint-differentiated Laplace approximation: Bayesian inference for latent Gaussian models and beyond

      Monte Carlo: technique for computing integrals based on random numbers

      Hamiltonian Monte Carlo: an efficient form of Markov chain Monte Carlo (MCMC) that uses gradients of the log posterior; this is what Stan does.

      Laplace approximation: an approximation of a density by a multivariate normal centered at the density’s mode
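      A tiny illustration of that (my toy example, not from the paper): approximate a Gamma(5, 1) density by a normal centered at its mode, with variance equal to the inverse of the negative curvature of the log density at the mode.

        import numpy as np
        from scipy.optimize import minimize_scalar
        from scipy.stats import gamma, norm

        # Toy example: Laplace-approximate a Gamma(5, 1) density.
        a = 5.0
        neg_log_p = lambda z: -gamma.logpdf(z, a)            # negative log density
        mode = minimize_scalar(neg_log_p, bounds=(0.1, 20.0), method="bounded").x
        h = 1e-4                                             # finite-difference curvature at the mode
        curv = (neg_log_p(mode + h) - 2 * neg_log_p(mode) + neg_log_p(mode - h)) / h**2
        laplace = norm(loc=mode, scale=1.0 / np.sqrt(curv))  # N(mode, [-(log p)'']^{-1})
        print(mode, laplace.std())                           # mode = a - 1 = 4, sd = 2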

      Bayesian inference: this is all about computing posterior expectations, which are expectations of quantities of interest conditioned on observations, and include predictions for future quantities, parameter estimates, and event probability estimates.

      Gaussian model: one where parameters get a normal distribution (I’d like to see Gauss’s citation count!)

      latent Gaussian model: one where there are unobserved parameters that get a normal distribution

      adjoint differentiated: adjoints are derivatives of final values w.r.t. intermediate values; reverse-mode autodiff is a form of adjoint algorithm

      Now what this all does is use the Laplace approximation to marginalize the latent Gaussians out of a model. So if your model is p(alpha, beta) and beta are latent Gaussian parameters, then we want to compute p(alpha) by marginalizing out beta. That’s a lot easier with the Laplace approximation. It’s what INLA does for Bayes and lme4 does for max marginal likelihood. Having an adjoint algorithm for differentiating p(alpha), the marginalized form of p(alpha, beta), is huge.
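      Here is a minimal numpy sketch of that marginalization for a toy model (mine, not the paper's code): Poisson counts with a latent Gaussian effect drawn from a GP prior, where the Laplace approximation gives an approximate log p(y | alpha) with the latent theta integrated out. The paper's contribution, the adjoint method, is about computing the gradient of this quantity with respect to alpha efficiently, which the sketch doesn't attempt.

        import numpy as np
        from scipy.special import gammaln

        # Toy latent Gaussian model: y_i ~ Poisson(exp(theta_i)), theta ~ N(0, K(alpha)),
        # with a squared-exponential kernel on 1-D inputs x and alpha = (sd, lengthscale).
        def kernel(x, alpha):
            sd, ls = alpha
            d = x[:, None] - x[None, :]
            return sd**2 * np.exp(-0.5 * (d / ls)**2) + 1e-8 * np.eye(len(x))

        def laplace_log_marginal(y, x, alpha, n_newton=50):
            # Approximate log p(y | alpha) with theta marginalized out via Laplace.
            K = kernel(x, alpha)
            theta = np.zeros(len(y))
            for _ in range(n_newton):                 # Newton iterations for the mode theta_hat
                W = np.exp(theta)                     # -d2/dtheta2 of the Poisson log likelihood
                grad = y - np.exp(theta) - np.linalg.solve(K, theta)
                H = np.linalg.inv(K) + np.diag(W)     # negative Hessian of log p(y, theta | alpha)
                theta = theta + np.linalg.solve(H, grad)
            W = np.exp(theta)
            log_lik = np.sum(y * theta - np.exp(theta) - gammaln(y + 1))
            quad = -0.5 * theta @ np.linalg.solve(K, theta)
            _, logdet = np.linalg.slogdet(np.eye(len(y)) + K * W)   # |I + K diag(W)|
            return log_lik + quad - 0.5 * logdet

        # Illustrative call with made-up inputs:
        # x = np.linspace(0, 1, 20); y = np.random.poisson(2.0, size=20)
        # laplace_log_marginal(y, x, alpha=(1.0, 0.3))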

      • Gauss’ citation count would certainly be impressive, but wasn’t the normal distribution first introduced by De Moivre and then Laplace before Gauss?

    • I’d also suggest fixing the y axis.

      With a linear scale, it’s impossible to read anything from the graph other than the growth of the non-adjoint method and that the adjoint method is a lot faster. How much faster? Can’t tell. A log scale for y would let us read that off of the plot.

      I like to go one step further and normalize by the baseline comparison system, so the baseline sits at 1 (0 on the log scale). Then the y axis is just time as a fraction of baseline time, and you can read the speedups directly off the plot. That’s how I presented results in the autodiff paper.
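      Something like this, say in matplotlib, with invented placeholder timings just to illustrate the axes (not numbers from the paper):

        import numpy as np
        import matplotlib.pyplot as plt

        # Placeholder timings purely to show the axes; not measurements from the paper.
        dims = np.array([10, 100, 1000, 6000])        # hyperparameter dimension
        t_benchmark = 0.01 * dims                     # pretend the benchmark scales linearly
        t_adjoint = 0.02 * np.sqrt(dims)              # pretend the adjoint method grows slowly

        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))

        # Panel 1: log y axis keeps both curves readable across orders of magnitude.
        ax1.plot(dims, t_benchmark, label="benchmark")
        ax1.plot(dims, t_adjoint, label="adjoint")
        ax1.set(xscale="log", yscale="log", xlabel="hyperparameter dimension", ylabel="wall time (s)")
        ax1.legend()

        # Panel 2: divide by the benchmark so it sits at 1 and speedups read off directly.
        ax2.plot(dims, t_benchmark / t_benchmark, label="benchmark (= 1)")
        ax2.plot(dims, t_adjoint / t_benchmark, label="adjoint")
        ax2.set(xscale="log", yscale="log", xlabel="hyperparameter dimension", ylabel="time / benchmark time")
        ax2.legend()
        plt.show()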

      • I’d go for the aesthetic purity of self-referential symmetry.

        Keep Figure 1, but label it Figure 1a. For Figure 1b, rescale the y axis so that the adjoint method curve in 1b matches the benchmark curve in 1a, and the benchmark curve in 1b becomes a straight vertical line.

    • Yes, one way or another.

      There’s also a proposal from Philip Greengard et al. in their paper, A Fast Linear Regression via SVD and Marginalization. The title’s a bit misleading because it’s considering a hierarchical model, not a simple linear regression.

      And then there are longstanding proposals from Andrew (don’t know if there’s a public reference anywhere) similar to what we use for autodiff variational inference (ADVI), which look like generic forms of the Markov chain Monte Carlo Expectation Maximization (MCMC-EM) algorithm, but with a Laplace approximation in the middle.

  1. Very interesting

    This will work for multivariate probit models (which augment the data with unobserved latent normally distributed variables), right?

    Are there plans to implement this into Stan?
