## Hamiltonian Monte Carlo stories

Tomas Iesmantas had asked me for advice on a regression problem with 50 parameters, and I’d recommended Hamiltonian Monte Carlo.

A few weeks later he reported back:

After trying several modifications (HMC for all parameters at once, HMC just for first level parameters and Riemman manifold Hamiltonian Monte Carlo method), I finally got it running with HMC just for first level parameters and for others using direct sampling, since conditional distributions turned out to have closed form.

However, even in this case it is quite tricky, since I had to employ mass matrix and not just diagonal but at the beginning of algorithm generated it randomly (ensuring it is positive definite). Such random generation of mass matrix is quite blind step, but it proved to be quite helpful.

Riemman manifold HMC is quite vagarious, or to be more specific, metric of manifold is very sensitive. In my model log-likelihood I had exponents and values of metrics matrix elements was very large and when inverting this matrix, algorithm often produced singular matrices.

P.S. I even tried adaptive HMC (Bayesian Adaptive Hamiltonian Monte Carlo with and Application to High-Dimensional BEKK GARCH Models of Martin Burda), but it did not worked. Adaptation seemed strange, since there were no vanishing adaptation, just half of sum of previous and new metrics.

I passed this on to Bob and Matt3. Bob asked:

How did HMC for all the parameters work compared to using HMC for low-level ones and direct sampling for others? Radford Neal discusses this approach in his MCMC handbook chapter on HMC, but we were hoping (backed up by some back of the envelope calculations) that we could just do all the parameters at once.

To which Tomas replied:

When applying HMC for all parameter set I never got good mixing. Sometimes first level parameters showed “wish to mix”, but second level parameters almost didn’t move at all, and usually all parameters were stuck on some values and didn’t move at all. If I reduced leapfrog intergration step more, I just obtained very correlated chains and no matter how I tried to vary intergration step, number of steps, or mass matrix, second level parameters showed (almost) no will to mix.

As for HMC just for first level parameters, everything worked better
than HMC for all parameters at once.

It seemes that second level parameters gave some very odd topology. And for Riemman manifold HMC Fisher information metric wasn’t the right one in my case.

And Matt3 wrote:

Interesting. So there are no negative elements in the mass matrix? I think this should help if there are more positive correlations in your posterior than negative ones, but hurt if there are many negative correlations. I wonder if there’s something about your model that makes positive correlations more common than negative ones.

1. jsalvatier says:

Interesting, thanks for this.

2. Iain Murray says:

For hierarchical models it can be important to have the right parameterization, whatever Monte Carlo method is being used. Two relevant papers (the first is my own and improves on the second in some ways, although in some ways is less general):

• I agree – in my experience HMC can work quite well in hierarchical models, but it depends on choosing the right parameterization. I have used it to sample covariance hyperparameteres in a latent Gaussian model.

http://mikkelschmidt.dk/icml2009/

3. Joe says:

All technical points aside, I love the word “vagarious”.

4. Does this mean HMC is dead?

• Andrew says:

No. Why do you say that?

• I meant that there’s this promise hanging out there that someday we’ll get a method that “just (always) works”. I don’t really expect that sort of method to appear yet (especially when, like HMC, it’s been around for a while), but when I saw the title of the post I was hoping the rest of the post would be dramatically positive.

On a related note, would you be willing to comment about what made you bet on HMC as a generally applicable sampling method? There are so many competing approaches that hearing about the rationale would be very interesting.

• Andrew says:

Krzysztof:

WIthin MCMC, Metropolis is the standard. HMC is a generalization of Metropolis so should perform better if tuned right. There’s a cost to HMC because you need to calculate gradients, but that’s not such a problem with autodiff. In some examples I’ve been involved with, HMC has solved the problem where Metropolis and various natural extensions such as parallel tempering have failed.

Maybe MCMC itself isn’t the best; maybe something like variational Bayes is better. But within MCMC, Hamiltonian seems like the current state of the art.

• I don’t see why there should be a method that “just (always) works”.