## How Hamiltonian Monte Carlo works

Marco Inancio posted this one on the Stan users list:

( Statement 1) If the kinetic energy equation comes from a distribution $L$ which is not a symmetric distribution, then thanks to the “Conservation of the Hamiltonian” property we’ll still be able to accept the proposal with probability 1 if we are computing the Hamiltonian’s equations exactly.

( Statement 2) But if we are approximating the Hamilton’s equations by discretizing time (using leapfrog) then the acceptance probability [from $(q_0, p_0)$ to $(q_1, p_1)$] won’t be simply $\exp(U_{0}+K_{0}-U_{1}-K_{1})$

We would have to find $p*$ for which $(q_{0}, p_{0})$ will be the proposal values if we start from $(q_{1}, p*)$

And, in this case, the acceptance probability will be: $\exp(U_{0}+K_{0}-U_{1}-K_{1})L(p*)/L(p_0)$

Are both statements correct? If so, it seems that it’s pretty difficult to find the value of $p*$…

I replied that this is a job for Betancourt. And Betancourt responded:

Let’s review the basics of HMC. We have our target distribution, pi(q) = exp(-V(q)), and the conditional distribution of the momenta, pi(p|q) = exp(-T(q, p)). The HMC transition is then (i) sample the momenta, p ~ pi(p|q) (ii) evolve using Hamiltonian flow, (q, p) -> \phi_{t} (q, p) (iii) marginalize the momenta, (q, p) -> q. Each step in the operation preserves pi(q) and so the entire transition preserves pi(q) and yields a desired sample. T(q, p) can be almost anything and there is no Metropolis correction here.

In practice we can’t do (ii) exactly and we have to approximate the flow with a symplectic integrator. This introduces error and pi(q) is no longer preserved exactly. Typically we fix this by treating the approximate flow as a proposal and add a Metropolis correction, but this requires that the flow be reversible which it is not. To make the flow reversible we can do a few things — we can sample the integration time from any distribution symmetric around zero or, if the kinetic energy is symmetric with respect to p, we can add a momentum flip at the end of the flow, (q, p) -> (q, -p). Outside of NUTS people typically consider a fixed integration time which leaves only the second option, hence the importance of the symmetry of the kinetic energy.

What you’ve proposed is to sample from a distribution pi(p|q) = exp(-L(q, p)) that is not related to the kinetic energy. Immediately this will cause a problem because (i) will preserve only exp(-L) while (ii) preserves only exp(-T) and the combined transition no longer has a stationary distribution. So the exact HMC algorithm doesn’t work. We can still try to use this as a Metropolis-Hastings proposal (although a poorly performing one) if we can make it reversible. How we make it reversible depends on the choice of L(q, p), but it general it will not be an easy problem.

Choosing a kinetic energy is a somewhat subtle problem. Nominally there are no constraints on T(q, p) which is usually a really bad sign — the more the math constrains our options the less tuning we have to do and the more robust the algorithm will be. One way to introduce constraints is to introduce more structure, such as a metric. It turns out that a metric gives a canonical family of kinetic energies with nice properties, and every member of that family is symmetric because of the symmetry of the metric. So asymmetric kinetic energies are not only awkward to use they’re actually really hard to motivate in the first place.

For all the gory details see http://arxiv.org/abs/1410.5110.

1. hjk says:

Cool paper. Betancourt’s other stuff on arxiv looks nice too, thanks for posting.

2. jrc says:

“Symplectic integrators” and “symmetric kinetic energy”, sure. But how is an applied social science researcher supposed to understand “sample from a distribution?” I mean, the word “sample/sampling” came up 70+ times in 187 comments on p-values, and what did we really learn there? I think it was something about state sponsored hip hop, but I don’t remember exactly.

3. I remember reading an article that proved that the condition of detailed balance is stronger than what is actually required for a stationary distribution. Wikipedia also makes this claim without reference. I think the paper made a distinction between “detailed balance” and simply “balance”. The paper was something someone on this blog mentioned which is how I found out about it.

Manousiouthakis and Deem, “Strict detailed balance is unnecessary in Monte Carlo simulation” J. Chem. Phys. 110 1999

Can you comment on what that means in this context? I think intuitively what the result shows is that it’s possible to construct a flow through the parameter space such that on net there is instantaneously a flow of probability from one place to another, while on long time-scales this instantaneous value averages to zero, so the flow doesn’t “pile up” anywhere and has on average a given stationary distribution. Could that be useful in the context of Stan?

• Hamiltonian flow by itself is not reversible and hence does not satisfy detailed balance, but because it’s a measure-preserving transformation it can be used to construct a valid Markov transition. There’s growing interest in non-reversible strategies because detailed balance induces random walk behavior and inefficient sampling. There’s an upcoming workshop in Warwick on this topic if anyone is around, http://www2.warwick.ac.uk/fac/sci/statistics/crism/workshops/nonrevmcmc/.

The problem arises when you can’t simulate Hamiltonian flow exactly (as in anytime you’re trying to use this in practice). The usual trick is to treat the flow as a proposal and apply a Metropolis correction, and _this_ is where the requirement of detailed balance arises. In Stan we use the No-U-Turn sampler which applies a more sophisticated Metropolis correction that avoids random walk behavior.
correction is a

• Yes, I can see where a correction could be required, but it seems even with the Metropolis correction, *detailed* balance isn’t actually required (though it is sufficient).

• It is for the Metropolis-Hastings algorithm. Intuitively you can think about the algorithm as living on the double of the sample space and comparing (q_initial, q_propose) to (q_propose, q_initial). Such comparisons are always symmetric and hence require detailed balance.

• I guess that does make sense if you look only at the first and last location in the leapfrog sequence, you only have two locations to think about, and there seems to be no way to maintain “balance” without this detailed balance.

I wonder though, if there’s some way to use intermediate locations along the leapfrog path to improve the chance of moving, which as Radford Neal’s paper (linked below) shows, can only reduce estimation variance. In particular, I wonder if there can be some emphasis placed on accepting proposals that move THROUGH low probability regions, thereby potentially improving multi-modal performance.

At this point, it’s just speculation, maybe wishful thinking, but it does seem like designing flows through the parameter space that favor moving first into low probability regions and then back into high probability regions makes some sense.

• “I wonder though, if there’s some way to use intermediate locations along the leapfrog path to improve the chance of moving, which as Radford Neal’s paper (linked below) shows, can only reduce estimation variance.”

It’s not so much that intermediate locations improve the chance of moving, it’s that the entire trajectory has useful information and the more you use it the more precise your estimation. More formally this is known as Rao-Blackwellization; it is very naturally compatible with HMC as discussed in future papers currently in the pipeline.

“At this point, it’s just speculation, maybe wishful thinking, but it does seem like designing flows through the parameter space that favor moving first into low probability regions and then back into high probability regions makes some sense.”

The problem is that moving through regions of low probability is exactly counter to the idea of a well-behaved Markov chain. What you need to do is modify the target distribution to fill that gap, explore, and then undo the modification to return to your target distribution. And that’s exactly what Adiabatic Monte Carlo (http://arxiv.org/abs/1405.3489) does!

• Giri says:

Hi Daniel,

Indeed reversibility (detailed balance), while sufficient, is not a necessary condition to build a Markov chain such that the fixed point (stationary distribution) is a distribution one would like to sample from.

See a paper by Diaconis, Holmes, and Neal on this point: http://projecteuclid.org/euclid.aoap/1019487508

However, detailed balance is a simple constructive principle with physical motivation for building a Markov chain with a desired fixed point, which might explain its pervasiveness in MCMC algorithms.

• Giri says:

I should also add that by itself detailed balance does not guarantee uniqueness of the stationary distribution.

• Thank you for that link. I am very interested in this general topic of alternative ways to sample, adaptive sampling, HMC, etc and I love a good Diaconis or Neal paper ;-)