When MCMC fails: The advice we’re giving is wrong. Here’s what you should be doing instead. (Hint: it’s all about the folk theorem.)

In applied Bayesian statistics we often use Markov chain Monte Carlo: a family of iterative algorithms that yield approximate draws from the posterior distribution. For example, Stan uses Hamiltonian Monte Carlo.

One annoying thing about these iterative algorithms is that they can take a while, but on the plus side they spin off all sorts of auxiliary information that can be used to identify and diagnose problems.

That’s what today’s post is about. What to do when your MCMC has problems.

I think much of current default practice is wrong. What do I mean by that? What I mean is that the typical default behavior with MCMC issues is to just throw more computational resources at the problem. And that’s wrong, because problems with your computation often imply problems with your model (that’s the folk theorem of statistical computing).

Whassup?

When you have problems with MCMC mixing, the usual first step is to run your chains longer. 10,000 iterations, 100,000 iterations, etc. Or to tweak parameters in your HMC such as adapt_delta and max_treedepth to make it explore more carefully. Or to get into heavy math things like reparameterizing your model. I’m not saying these steps are never useful, but they should not be the first thing you try. Or the second thing. Or the third.

Rather, when you have mixing problems you should immediately try to figure out what went wrong. Some common possibilities:

1. Priors on some parameters are weak or nonexistent, data are too weak to identify all the parameters in the model, and the MCMC drifts all over the place. This could be a flat-out improper posterior, or it could just be so poorly constrained that it includes all sorts of weird regions that you don’t really care about but which make it harder for the chains to mix. You should be able to notice this sort of problem because the posterior simulations go to some really weird places, things like elasticity parameters of -20 or people with 8 kg livers. (A toy sketch of this sort of non-identification appears just after this list.)

2. Bugs, plain and simple. For example, coding a Poisson regression with poisson instead of poisson_log, so you’re leaving out the log link. Coding the standard deviation as sigma^2 rather than sigma. Forgetting a line of code. Screwing up your array indexing. All sorts of things. You should be able to notice this sort of problem by debugging: going through the code line by line and saving intermediate quantities. (The Poisson example is sketched just after this list.)

3. Minor modes in the tails of the distribution. An example is in section 11 of our Bayesian workflow paper. You should be able to notice this sort of problem because different chains will cluster in different places, and the target function (“lp__”) will be much lower in some modes than others. The problem can sometimes be fixed by just using starting points near the main mode, or else you can use stacking to discard minor modes. Soon we should have Pathfinder implemented and this will get rid of lots of these minor modes automatically.

4. Sometimes an apparent problem isn’t a problem at all. You can get computational overflows in warmup, during the period where the chains are exploring all sorts of faraway places that they then wisely give up on. Overflows, divergences, etc., during warmup are typically nothing to be concerned about. Dan Simpson refers to this as the “awkward adolescence” period of the simulation, and all we need to worry about is the “sensible adult” stage. (To continue the analogy, there’s no need to run these simulations all the way to senility.)

5. The model can be reparameterized. The two big ways of doing this are: (a) rescaling so parameters are roughly on unit scale. For example, don’t have a country-level regression where one of the predictors is the country’s population, so that you’d have coefficients like 0.000002 or whatever. Instead, use log population or population in millions or whatever. In general, try to have parameters be scale-free, introducing a multiplier in the model if necessary. And (b) rescaling hierarchical models so that the group-level errors are unit-scale error terms, the so-called non-centered parameterization (sketched just after this list). Most of the time when you have mixing problems, it’s not a parameterization issue, but these do arise so I’m mentioning them here. In any case it’s good practice to model your parameters on unit scale.

6. And, of course, sometimes you have legit slow mixing. But, even in those cases, often you can change your model (not just “reparameterizing” the model while keeping it the same, but actually changing it) and the chains will mix better. I’m not saying you should make your model worse just to improve computation. What I’m saying is that typically there’s a lot of arbitrariness to your model, and if some arbitrary feature is making your model run slowly, what’s the point of that? This loops back to point 1 above. An improved model can often be thought of as a (softly) constrained version of what you’ve been fitting before. You can always go back and relax the constraints later to see what’s going on.

And that brings us to our final point:

7. You’re not fitting just one model, nor should you want to fit just one model. Statistical workflow involves fitting a sequence of models in order to understand a problem. You’re gonna be fitting lots of models already, so don’t obsess on this one model that’s mixing poorly. Rather, take the poor mixing as a signal that something’s going on that you don’t understand, and go figure it out—using the ideas of multiple modeling that you should be doing anyway!
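To make point 1 concrete, here is a minimal Stan sketch (my own toy illustration, not a model from this post) of a non-identified model: the data inform only the sum a + b, so with flat priors the posterior is improper and the chains drift along a ridge, while weakly informative priors keep things constrained.

    data {
      int<lower=0> N;
      vector[N] y;
    }
    parameters {
      real a;
      real b;
      real<lower=0> sigma;
    }
    model {
      // With flat priors the posterior is improper in the a - b direction
      // and the chains wander off to weird places. Weakly informative
      // priors act as a soft constraint that can be relaxed later.
      a ~ normal(0, 1);
      b ~ normal(0, 1);
      sigma ~ normal(0, 1);
      y ~ normal(a + b, sigma);  // only a + b is identified by the data
    }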
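For point 2, here is a minimal sketch of the Poisson-regression bug described above (variable names are made up for the illustration): passing the linear predictor straight to poisson drops the log link, whereas poisson_log applies it.

    data {
      int<lower=0> N;
      vector[N] x;
      array[N] int<lower=0> y;
    }
    parameters {
      real alpha;
      real beta;
    }
    model {
      alpha ~ normal(0, 2);
      beta ~ normal(0, 2);
      // Bug: no log link, so the "rate" alpha + beta * x can go negative
      // and the sampler struggles with a misspecified model:
      //   y ~ poisson(alpha + beta * x);
      // Intended: the rate is exp(alpha + beta * x).
      y ~ poisson_log(alpha + beta * x);
    }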
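And for point 5, a minimal sketch of the non-centered parameterization of a simple hierarchical model (an eight-schools-style toy example of my own, not from the post): the group effects are built from unit-scale error terms instead of being sampled directly on their own scale.

    data {
      int<lower=1> J;            // number of groups
      vector[J] y;               // one estimate per group
      vector<lower=0>[J] se;     // known standard errors
    }
    parameters {
      real mu;
      real<lower=0> tau;
      vector[J] theta_raw;       // unit-scale group-level errors
    }
    transformed parameters {
      // The centered version would declare theta as a parameter and put
      // theta ~ normal(mu, tau) in the model block.
      vector[J] theta = mu + tau * theta_raw;
    }
    model {
      mu ~ normal(0, 5);
      tau ~ normal(0, 5);        // half-normal, given the lower bound
      theta_raw ~ std_normal();  // non-centered: unit-scale errors
      y ~ normal(theta, se);
    }

Whether the centered or non-centered form mixes better depends on how strongly the data inform each group, a point that comes up again in the comments below.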

Why the current default behavior is bad

You might ask, What’s so bad about running longer and more slowly? What’s bad about it is two things:

– It’s a waste of time. Time spent running and running and running a model that you don’t want to be running at all, given that, as is typically the case, your model has major problems.

– By running longer and more slowly, you’re just putting off the inevitable moment when you have to go back and debug your model. Better to get to it right away. Get into the workflow already.

18 thoughts on “When MCMC fails: The advice we’re giving is wrong. Here’s what you should be doing instead. (Hint: it’s all about the folk theorem.)”

  1. Thanks Andrew, this is awesome! Basically, where I’ve evolved to in my own practice with Stan is that if I’m seeing problems at 1000-2000 iterations, it almost *never* gets fixed by longer runs. Stan is that good :) It would be good to get more published papers out there on this kind of workflow. Those of us in applied fields are often lucky to get a single reviewer comfortable with Bayesian stats, and those that are often have 10-20 year old info on ‘best practices’, usually from the BUGS days of conjugate priors, the quest for uninformativeness, etc.

  2. I don’t find your first argument that running longer is a waste of time that compelling. A waste of whose time? The computer’s? I can at least do other things while it runs. Performing the 7 steps when I could just let the computer get on with it might be the waste of time! I’m more persuaded by the fact that mixing problems may indicate a bug or something I haven’t appreciated about the model (which you obviously do also mention).

    • Andrew (other):

      It’s a waste of time because “just letting the computer get on with it” won’t get me anywhere! It just puts off the day of reckoning when you have to figure out what’s wrong with the model. If you have poor mixing and you run it overnight for a zillion iterations, then you’ve wasted a day.

  3. I rarely see “+1”s given to original posts, but I will do so in this case:

    +1

    Some combination of these things is what I tend to do myself and what I recommend when working with others. Things like bugs or typos should ideally have been ironed out by building tiny versions that exemplify each moving part in the model and then running those parts with some simulated data so you can verify each component is working as intended. But neither I nor anyone I know is always that disciplined, so it is crucial to keep in mind that odd behavior can sometimes be traced back to using an “i” instead of a “j” somewhere.

    I tend to think of 1 and 5 as different aspects of the same idea: Whether you are changing priors, changing scales, or reparameterizing, you are ultimately changing the *meaning* of the quantities in the model. While these are mechanically different, they all represent what you might call “semantic” fixes. By changing the meaning of what the model is meant to represent, you are also changing various technical features of the model to make it run better.

    A side benefit to doing these “semantic” fixes is that they are deepening your understanding of the model and its relationship to the data. This guides you in steps 6/7 when you explore more substantive changes to model structure.

  4. It seems the computing culture has shifted from 20 years ago, when a bold dictum from the handbook of MCMC read: “One should start a run when the paper is submitted and keep running until the referees’ reports arrive. This cannot delay the paper.”

    This shift can be attributed to many factors: the idea of workflow integrates computing into the bigger modeling problem; the flexibility of the Bayesian formulation enables the free creation of more models; the referees’ reports take longer to arrive; and we have so many more papers to write that there is no free computer left to wait for the referees’ reports.

  5. I recently discovered people in my lab were just running more iterations or increasing tree depth (or both) on their fake data (here I mean the data we simulate to test that our Stan code is doing what we want it to). They had relatively simple models where their test data and code took an hour to run (!) when obviously something deeper was wrong.

    I honestly blamed (and continue to blame) myself for being a poor trainer, but now I can also blame the Stan warnings (but not the developers, you guys are great!).

  6. Regarding 5), couldn’t you just think about the priors more carefully and place priors that make sense with the scale of the variables, rather than rescaling so everything is on unit scale?

    • The reason to put everything on unit scale is so that MCMC proposals that are relatively “isotropic” will work well. It’s a sampling issue, not a modeling issue. If you have one parameter that’s 1e6 +- 1e5 and another parameter that’s .002 +- .0003, then it’s really hard for a sampler to figure out that it should propose a movement that’s around 100,000 in one dimension and around .0001 in another.

        • I think a lot could be done to give hints; Bob would be the person to ask. But the real reason it can’t be done automatically is that you want the **posterior** distribution to be on unit scale, and you can’t really know the posterior scale automatically without running the sampler.

        • One thing I’ve wondered about before involves the automated exploration of centered and non-centered parameterizations of hierarchical (or flat) models. I feel that when implementing these in Stan I must often rely on vague heuristics for when one form or the other is appropriate, and sampling of some parameters in the same model may benefit from the non-centered form where sampling for other parameters may not. Often the final arbiter of what’s appropriate where is observed performance, since occasionally the non-hierarchical form will sample more efficiently.

          Since reparameterization is such a straightforward & rote procedure, why not incorporate it into something like the warm-up process? A user could specify the model in its fully non-centered form (which, read aloud, is often much easier for me to parse, at least), and an algorithm could toggle different parameterizations on and off and try to adaptively identify which transformed target distributions are easiest to explore (according to convergence & mixing diagnostics like E-FMI or w/e). Is the reason this is not done just that warm-up would have to start from scratch with each parameterization? (But would it?)

          Even so, it seems like something that would be fairly straightforward to implement, even at the cost of lots of early redundancy (certainly faster than doing everything by hand). Just something that enumerates (some subset of) possible non-centered parameterizations, launches however many independent samplers, and then prunes the ones that are sampling poorly might offer considerably improved convenience and efficiency.

        • You might be interested in Maria Gorinova’s paper w/ myself and Matt Hoffman from a couple years ago on ‘Automatic Reparameterisation of Probabilistic Programs’ (https://arxiv.org/abs/1906.03028). A few of the takeaways:

          – Converting a model from centered to non-centered parameterizations is quite mechanical, while the reverse in general requires a symbolic algebra system, so you want to start with the centered parameterization.

          – A model with N sampling sites has 2**N ways to choose centering vs non-centering, and that’s intractable to search over in general, but you can do pretty well with a gradient-based / hill-climbing optimization strategy.

          – With an automatic search you can actually find intermediate parameterizations, partway between centered and noncentered, that work better than either extreme. Intuitively: non-centering works well when you have weak or no evidence (so the posterior is not very different from the prior), but poorly when you have very strong evidence (bc it induces coupling in your latent variables via ‘explaining away’ effects), and in real models the strength of the evidence is usually neither zero nor infinity, but somewhere in between.

        • Thank you for the link! I will give it a look.

          (also see some typos in my earlier comment — was distracted on mobile. Sorry! Had meant to write:

          >since occasionally the non-*centered* form will sample more efficiently.

          >specify the model in its fully *centered* form (which, read aloud, is often much easier for me to parse, at least)

          )

  7. Well… Just so you have a data point of someone (me) who has interpreted the default fixes to be the ones you listed, mostly starting at 7.

    I guess if you JUST read the warnings you could come away assuming the defaults are what you see people doing. But if you read any manuals, guides, tutorials, recommendations, papers, etc., then the defaults are as you presented.

    So good work Stan peeps.

  8. Shouldn’t the absolute waiting-time prescriptions mentioned here be read in light of the size of your data set and the dimensionality of your model? Like, if you have a model with a million observations and 10,000 parameters, it seems fair for it to be an overnight job vs. a 15-minute job; the time complexity of calculating the likelihood would be at least linear in n. Otherwise, is the idea that the HMC / NUTS sampler should sample equally well on a per-iteration basis from all posteriors of arbitrary dimensionality?

    • The “longer time” in this post isn’t wall time, but rather number of iterations. HMC should mix at the same rate per iteration on arbitrary dimensionality, but yeah, each iteration is going to take longer on larger data and dimensionality due to all the extra computation.

      • This is a good first-order explanation, but for models with many parameters the dependencies between parameters will often make the sampler slower, as it will have to take smaller timesteps and do more of them per iteration. Bigger models are harder to sample in general.

        • Would it take more computation to fully characterize the joint distribution of a collection of p independent random variables than doing each marginally? Specifically, thinking in curse-of-dimensionality terms: would assurance that one has mixed adequately over the whole target distribution not involve some check that you haven’t completely ignored some lonely orthant in the high-dimensional space (i.e., if you don’t know a priori that the parameters are indeed independent)? Or would it just scale linearly with p?

          Most of the time I see a high-dimensional parameter space, it is in the context of a hierarchical model. It seems like if the hyperparameters are well identified, adding more members to some population of effects won’t induce interdependence between parameters in the same way that adding more top-level parameters would.
