So, Charles and I wrote this article. It’s for the forthcoming Handbook of Markov Chain Monte Carlo, 2nd edition. We still have an opportunity to revise it, so if you see anything wrong or misleading, or if there are other issues you think we should mention, please let us know right here in comments!
Remember, it’s just one chapter of an entire book—it’s intended to be an updated version of my chapter with Kenny, Inference from Simulations and Monitoring Convergence, from the first edition. For our new chapter, Charles liked the snappier, “How many iterations,” title.
In any case, our chapter does not have to cover MCMC, just inference and monitoring convergence. But I think Charles is right that, once you think about inference and monitoring convergence, you want to step back and consider design, in particular how many chains to run, where do you start them, and how many iterations per chain.
Again, all comments will be appreciated.
Towards the bottom of page 5: ‘Each run entails 4 Matkov chains with 1000’
Personally, I prefer Matkov chains to Markov chains. ;)
Thank you for spotting the typo — will correct in the upcoming revision!
Andrew, it’s the middle of the night and my dog woke me up to go out so what better thing to do than read about MCMC?
The chapter is really nice, it gives a very good overview of the way to think about MCMC.
One thought I’ve had in the past is that often we actually know the sample size we want, and we could think of computing as a way to make ESS approach sample size. In other words, suppose I want 100 samples. Now I run MCMC to get 100 samples, but ESS is 18. So I could select each of my samples and run them through a chain and take the endpoint, resulting in 100 samples but now ESS is say 50. Repeat again and ESS is maybe 90… Maybe that’s good enough for me. In other words the sample converges towards an independent random sample.
In addition you can use different algorithms, or even choose randomly from among algorithms. HMC is good for moving around the distribution, but it can fail to get into certain corners because of divergent transitions. Diffusive or langevin type transitions never “go off to infinity” so the mixture of two methods can benefit from the two strengths. Maybe diffusion can make samples settle into funnels or cracks that HMC would never get into.
In general I think of the process as relaxation from an initial position towards a final position. It’s my guess that it could help even to sample from different distributions than the target distro. Suppose you start sampling from the posterior to get some very basic estimate… Then construct a diagonal normal distribution from that estimate and sample from that (a kind of variational inference, but also umbrella sampling). Then from those starting places sample towards the posterior again.
The biggest problems I’ve had with HMC involve divergences, the biggest problems with diffusive samplers is slowness of movement. The biggest problem with variational inference is bias… Each of them have a different problem, in combination it seems like the strengths and weaknesses can complement each other.
Hi Daniel,
> One thought I’ve had in the past is that often we actually know the sample size we want, and we could think of computing as a way to make ESS approach sample size., So I could select each of my samples and run them through a chain and take the endpoint, resulting in 100 samples but now ESS is say 50. Repeat again and ESS is maybe 90…
I don’t think ESS needs to converge to the sample size N. It’s fine if the sample size is larger than ESS, as long as ESS itself is sufficiently large. Thinning is a technique to selectively discard samples; it reduces both N and ESS, but N decreases more than ESS, so you can get roughly ESS = N. Still, thinning increases the variance of our Monte Carlo estimators. The main reason to do thinning is to save memory, but it doesn’t improve our estimators.
> Maybe diffusion can make samples settle into funnels or cracks that HMC would never get into.
Can you share references on how diffusion algorithms can handle high curvature, such as funnels? I was under the impression they suffered from the same issues as HMC. In fact, there is way of defining divergent transitions for Langevin methods.
> It’s my guess that it could help even to sample from different distributions than the target distro…
Interesting. We’ve definitely thought about starting with variational inference (or any parametric approximation) and then running MCMC. Doing the opposite, i.e. running MCMC and then doing a density approximation, can make sense when doing online learning and we want to use our posterior as a prior for the same model with new data.
> The biggest problems I’ve had with HMC involve divergences, the biggest problems with diffusive samplers is slowness of movement.
I’d be interested to read about how HMC and diffusive samplers are complementary here. It seems to me that both HMC and diffusive samplers would need a small step size in order to go down the neck of a funnel. But I like the idea of exploring different algorithms.
Charles:
Thinning isn’t just to save memory or computation time; it’s also to make the simulations easier to interpret. For example, suppose you have a model with two parameters, a and b, and further suppose that you do MCMC with high posterior correlation, and you end up with a 1000 x 2 matrix of simulations of (a, b). If you take a scatterplot of a vs. b in these simulations, and they’re autocorrelated, then the scatterplot can look patchy in a way that you would not get from 1000 independent draws. It can make sense to do some thinning so that graphs or other computations of posterior simulations will have simpler properties.
Exactly. In the end we want to use a sample as a summary of the properties of the distribution. Of course 80000 samples tends to gives you more information than 800 but if 800 is what you’re allotted in terms of what you’re going to store and transmit to others, then you’re better off with 800 that have ESS near 800 than ESS near 27. My point was that we often have a sense that we are willing to use some number of samples for our purposes, we should think of computation time as an effort to make that set of 800 or whatever we want more like independent samples.
My point about diffusion sampling is that if you’ve already sampled approximately enough diversity, you don’t need to move the samples around the space, you just need to move them better into equilibrium. Small diffusion steps can be ok and they won’t get “kicked to infinity”.
Often people think about moving around the space as the hard part, but if you start with over dispersed samples from something like variational inference, or samples from an umbrella sampler or tempered distribution then you can have plenty of diversity already, you just need to modify the sample to be more in equilibrium with the proper density.
Think specifically of samples from a model with 20000 parameters (a few parameters per each county in the US or each hospital in a country wide survey of surgery outcomes) people don’t want 80000 samples with an ESS of 800 (80000*20000 parameters * 8 bytes = 13 gigabytes) they want 800 samples with ESS 720 or something like that (128 megabytes)
One way to get your 800 samples is to exactly sample from the distribution of interest and then thin. But a better method might be to sample from an easier distribution to sample from (like variational) and then equilibriate. This is especially true if you can avoid having to take derivatives during computation. For example sampling fromnthe distribution that is constant over the region of space where the log density exceeds some threshold. You can move in a straight line until you slam into the edge of this region without calculating derivatives to get diversity. Then after you have 800 of those samples you can equilibrate it by diffusion.
To give some more specific idea… suppose you’re markov sampling from logp(q)/T(i) where logp(q) is the log probability density of the posterior and T(i) is a temperature variable for the ith sample. You can “schedule” the temperature to change on some either standard or random schedule.
Now suppose you set T=4 and fit a diagonal gaussian Variational model, then sample 800 samples from the variational model. This should be extremely fast, but it’s a sample from a fairly severely overdispersed model. In fact because it’s gaussian and because the dimensionality of the model is high (I’m assuming at least 10 parameters up to tens or hundreds of thousands), the samples will be on the surface of a hyper-ellipsoid which extends far beyond the region of interest due to the high temperature tempering. Our goal in the next step will be to shrink the points towards the region of interest and get them distributed over a non-hyperellipsoidal surface. (“high” dimensional models are always very near to some surface of near-constant logp, it’s already basically true for n = 10 dimensions but especially true for n=100 or 1000 or more)
Now think of this ensemble of 800 samples as one single ensemble sample from the product distribution (which makes some of the high dimensional arguments valid even when the dimension of the individual model is small like 3 or 4 parameters). Take any given sample and select a random direction, perform one cycle of hit-and-run sampling, again not requiring any derivatives or gradients, just move in a random direction and it’s opposite direction until you hit a boundary where logp equals a sufficiently low value, for example the minimum value from your initial set of 800… Now select randomly from among all the points along this straight line path with multinomial probabilities (ie. probability of each point proportional to exp(logp) at the point). Repeat this process for each of the 800 samples in a kind of gibbs update.
Now, just keep doing this over and over, at the end of each cycle of N (ie. 800) evaluate the sum of the logp… eventually this converges to a constant and is constant to within epsilon for several cycles. At this point stop.
Now you’ve got 800 “very good” samples or alternatively, one sample from the N=800 product distribution. It’s on the hypersurface of appropriate constant logp. At no time did you do gradients or autodiff, the required calculations are not obviously more than the autodiff for NUTS… It seems promising to me in general to think about shrinking an overdispersed sample from a simple “blanketing” gaussian towards a good sample from the real distribution.
42
Adede:
That’s the number– you win! ;)
A good question. As I wrote in my ISBA2008 spoof of the song “Sweet Home Chicago” (titled “Markov Chain Monte Carlo”):
“1 and 1 is 2
2 and 2 is 4
Has your chain converged or must you iterate some more?”