It’s time to review the folk theorem, an old saw on this blog, on the Stan forums, and in all of Andrew’s and my applied modeling.

**Folk Theorem**

Andrew uses “folk” in the sense of being folksy as opposed to rigorous.

The Folk Theorem of Statistical Computing (Gelman, 2008): When you have computational problems, often there’s a problem with your model.

**Isn’t computation often the culprit?**

Better samplers like the no-U-turn sampler (NUTS) are able to fit models to data sets that were previously unfittable. ML computation is accelerating even faster than MCMC tech, fitting ever deeper, wider, and more highly structured neural networks.

This seems to imply the folk theorem is backwards. If we had better computation, maybe we *could* fit our data with our preferred model.

**A second folk theorem**

Here’s another folk theorem, stemming from decades of experience in statistical computing.

Folk Theorem II: When you have computational problems, often there’s a problem with your computation.

Here’s the rub. That problem in the computation might take years or decades or even centuries to get solved.

**Now what?**

In the meantime, you have data to analyze. In these cases, I take Andrew’s folk theorem as pragmatic advice for users. Someone trying to fit a problematic model in Stan is unlikely to solve their problem by building a better sampler (though they might solve it by switching to a different tool, such as INLA). But they can fit simpler models. Unfortunately for Stan, “simpler” is largely the geometric matter of the conditioning of the posterior curvature, so it can involve things like tightening the tails, standardizing scales, and reducing posterior correlation.
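
To make the geometry point concrete, here is a minimal NumPy sketch (not from the post; the sizes and scales are made up) showing how standardizing a wildly scaled predictor improves the conditioning of the design matrix, which is the same kind of geometric repair that helps a sampler:

```python
import numpy as np

# Made-up regression design with one predictor on a wildly different scale.
rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(0.0, 1.0, size=n)
x2 = rng.normal(50.0, 1000.0, size=n)      # huge scale relative to x1
X = np.column_stack([np.ones(n), x1, x2])

# Standardize the non-intercept columns to mean 0, sd 1.
Xs = X.copy()
Xs[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)

# The Gram matrix of the raw design is far worse conditioned than the
# standardized one; badly conditioned curvature is what slows HMC down.
print(np.linalg.cond(X.T @ X) > np.linalg.cond(Xs.T @ Xs))
```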

**But sometimes it really is the model**

For example, if you have an unidentifiable likelihood, you probably don’t want to just sort that out in the prior. You’re better off reparameterizing with one fewer degree of freedom (e.g., by pinning a value or enforcing a sum-to-zero constraint). This should always be the first recourse: trying to figure out whether the model does in fact make sense. Bad computation is still a useful clue to this.
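
A minimal numerical sketch of the two ideas above, with made-up values (none of this code is from the post): shifting both parameters of an additively non-identified likelihood changes nothing, and a sum-to-zero construction removes that spare degree of freedom:

```python
import numpy as np

# Additive non-identifiability: if the likelihood depends on a and b only
# through a - b, then (a + c, b + c) fits identically for any constant c.
a, b = 1.3, -0.7
logit = a - b                          # linear predictor
shifted_logit = (a + 5.0) - (b + 5.0)  # same value for any shift
assert np.isclose(logit, shifted_logit)

# One fix: enforce a sum-to-zero constraint by parameterizing only
# J - 1 free values and defining the last as the negative sum.
rng = np.random.default_rng(0)
a_free = rng.normal(size=4)            # J - 1 = 4 free parameters
a_full = np.append(a_free, -a_free.sum())
print(np.isclose(a_full.sum(), 0.0))   # the constraint holds by construction
```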

**Bottom line**

The frontier at which you run into computational problems is a feature of the fitting tools you have at your command. What we’ve been trying to do with Stan is push that frontier. In the meantime, you can heed Andrew’s folk theorem and try to simplify your model.

Bob,

I think you and the rest of the Stan team are often deliberately working to make problems computable that were previously impractical to compute, so you’re often working with stuff that is barely computable in reasonable time.

Most of us, though, are working with whatever model seems appropriate to our data and not too crazy to compute, and…well, I guess I can’t speak for ‘most of us’, but I can speak for me: there are models I wouldn’t dream of trying to fit because they are clearly way too big, and then there are the models I actually try to fit in practice. In that latter category, the Folk Theorem of Computation works almost infallibly: just about every time my model won’t converge in a reasonable amount of time, it’s because I’ve either coded a model that isn’t what I actually had in mind, or I’ve underspecified the model, as in your ‘sometimes it really is the model’ example.

+1 here

In my experience (and I used to code my computational methods myself), every time convergence failed, I was better off changing the model instead of improving the computation.

So I feel about all advanced computation methods the same way I feel about gyroscooters: a fancy new way of breaking one’s limbs.

Absolutely this. In my experience, many modelling errors show themselves as computational issues: either the model converges suspiciously fast or not at all.

I have mixed feelings about this one. On the one hand, it’s true that technological advances are relentless, and computation getting better necessarily means some things that are not computable today will be computable tomorrow. On the other, I feel like there’s a categorical difference between the things in the Bayesian world that were not computable before NUTS and HMC and the things that are not computable now. With Gibbs sampling, most models you’d arrive at naturally couldn’t be estimated without unrealistic conjugacy assumptions. With general Metropolis-Hastings, sampling could fail all the time for reasons that are hard to investigate. With modern HMC, posteriors that can’t be sampled are typically also posteriors that can’t be interpreted easily: multimodal, with long troughs of nearly stationary points, etc. Though I don’t actually have any hands-on experience with the world before HMC; I’m really just guessing based on what I read on Wikipedia.

I’ve always thought that the folk theorem was less about the theoretical limits of computations and more about where errors are likely to occur. So if you have a model that you can’t fit, it’s more likely to be because your model has a typo or is nonsensical rather than because you’re running into fundamental limits of Stan.

That interpretation is probably more useful for someone like me who’s relatively inexperienced with Stan though.

Yes, if things go wrong, a bug is always a possible cause. See Another Modeler’s response below.

But usually we’re talking about the post-debugging stage when the problem arises because of modeling decisions or due to mismatches of the model to the data.

For example, suppose we have an IRT model with a likelihood y[n] ~ bernoulli(inv_logit(a[jj[n]] - b[ii[n]])). There’s an additive non-identifiability between a and b. Now you can identify in the prior, say by taking student abilities a[1:J] ~ normal(0, sigma) and question difficulties b[1:I] ~ normal(mu, tau). By centering the abilities a around 0, I’ve technically identified the location of a and b, so this should be OK. There’s nothing wrong with this model per se as far as posterior inference goes. But it’s going to be a nightmare to fit with HMC for two reasons.

First, the non-identifiability in the likelihood is only weakly identified in the prior. You can mitigate this problem by forcing the model to respect mean(a) = 0 and mean(b) = mu by reducing the number of free parameters to I - 1 and J - 1, defining a[J] = -sum(a[1:(J - 1)]) (and similarly for b), and pulling mu out into an intercept term in the likelihood. Now you could try setting a[1] = 0, as is commonly recommended, but that doesn’t work as well computationally. I discuss this in the Stan user’s guide chapter on problematic posteriors.

Second, we have a centered parameterization of the hierarchical prior. This is going to introduce difficult geometry into the posterior. So we’re going to have to reparameterize, again not because the model’s wrong, but because our compute can’t handle the model in its more natural centered parameterization.
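
Here is a small NumPy sketch of the two fixes, with invented sizes and values (nothing here is the actual model from the comment): a sum-to-zero ability vector and a non-centered draw of the difficulties:

```python
import numpy as np

rng = np.random.default_rng(1)
J, I = 5, 4                        # students, questions (made-up sizes)

# Sum-to-zero: parameterize J - 1 free abilities; the last is determined,
# so mean(a) = 0 holds exactly rather than only weakly through the prior.
a_free = rng.normal(size=J - 1)
a = np.append(a_free, -a_free.sum())
assert np.isclose(a.mean(), 0.0)

# Non-centered hierarchical prior for difficulties: instead of drawing
# b ~ normal(mu, tau) directly, draw b_raw ~ normal(0, 1) and set
# b = mu + tau * b_raw, which decorrelates b from tau in the posterior.
mu, tau = 0.5, 1.2
b_raw = rng.normal(size=I)
b = mu + tau * b_raw

# Likelihood term for one observation: student jj answers question ii.
jj, ii = 2, 3
p = 1.0 / (1.0 + np.exp(-(a[jj] - b[ii])))
print(0.0 < p < 1.0)
```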

Other cases come up with things like too-wide priors and not enough data. Andrew convinced a bunch of people that half-Cauchy priors for scale parameters are a good thing; then we saw that they can blow up the posterior variance if there’s not much data. I think this was a case of the folk theorem catching up with us. Stan really works hard on those tails, and when you see the consequences of your assumption of a Cauchy prior, you realize it’s unreasonable. Hence our latter-day emphasis on prior predictive checks as well as posterior predictive checks. Your prior will be your posterior in small-data situations!
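
A quick prior predictive sketch of the point about half-Cauchy tails, using made-up settings (a unit half-Cauchy scale prior and normal outcomes; this is not code from the post):

```python
import numpy as np

# Prior predictive check: draw scales from a half-Cauchy prior and
# simulate outcomes under them. The heavy tail occasionally produces
# absurdly wide draws, which the check makes visible before seeing data.
rng = np.random.default_rng(2)
sigma = np.abs(rng.standard_cauchy(size=10_000))   # half-Cauchy(0, 1) scales
y_sim = rng.normal(0.0, sigma)                     # prior predictive draws

# The typical draw is tame, but the extremes are enormous.
print(np.median(np.abs(y_sim)) < 10)
print(np.max(np.abs(y_sim)) > 100)
```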

Phil and Mikhail and somebody: I agree that the folk theorem is great from a user perspective when you don’t want to undertake a computational stats research program. My point was just that if we applied it in a comp stats research setting, we’d all just go home.

Mikhail: a lot of people consider any Bayesian inference a fancy gyroscooter (still sounds safer than a jetpack!). It’s all a matter of how desperate you are to fit a model.

Multimodal posteriors remain problematic, especially the combinatorial ones like in Ising models, mixture models, or neural nets. But I’m really hoping we can do better on the models with lots of weakly identified effects, like multilevel models. I’m having a really hard time fitting a couple of applied models which I’d really like to fit.

One is a spatio-temporal Covid prevalence model with millions of observations, a half-dozen covariates, hundreds of spatial regions, and dozens of weeks. We need to adjust for test sensitivity and specificity to make this sensible, and would like to jointly estimate varying lab effects in a hierarchical model. We’d also like hierarchical priors on all the effects. I’d very much like to use a trend-following AR(2) time-series prior and an ICAR spatial prior, but we can’t get even close to fitting that model. The simple random-walk model with fixed sensitivity and specificity and no hierarchical priors takes roughly two days to fit. Now what? I really want to smooth in both space and time, and I want to adjust for uncertainty in sensitivity and specificity. But so far, no luck. We may just scale back all the priors to be fixed and use an MLE.

Another is a genomics model of splice variation with biological replicates, where there’s natural hierarchical structure, but adding the dispersion necessary for the model to make sense makes it very hard to fit. That’s not just a matter of “tightening up the priors.”

A third case where I have problems is models of data annotation/coding/rating, where I very much want to fit IRT-like difficulty, discrimination, and guessing parameters, but adding them all makes the models very hard to fit.

Thanks for these examples of conceptually natural and simple models that still run into sampling difficulties.

As someone who works in another field of computational modeling, I tell my grad students and anyone that listens that in the US legal system you are presumed innocent until proven guilty, but in computation you should assume your calculation is messed up and incorrect until proven correct. In other words, always assume you have made errors in your computation and think hard about how to look for them.

Absolutely. The important point for production software is to encapsulate how to look for bugs.

Yes!

At the process level, we try to agree on designs first. Then we do code review.

At the code API level, we do unit testing, integration testing, and regression testing. Running all those tests is quite expensive in compute for a project our size.

At the Stan language level, you have strong static typing, which I find helps tremendously with catching errors at compile time. At run time, we check domain conditions and report exceptions if they’re violated, which also helps with debugging.

At the level of statistical methodology, we recommend testing with simulated data, then doing posterior predictive checks on real data, and finally verifying with cross-validation.

Even better, think about how to automate tests that look for them. That’s the key to having maintainable code.

Another modeler said,

“I tell my grad students and anyone that listens that in the US legal system you are presumed innocent until proven guilty, but in computation you should assume your calculation is messed up and incorrect until proven correct. In other words, always assume you have made errors in your computation and think hard about how to look for them.”

Well put — and this applies to lots of other things as well — it’s part of being honest with yourself (sometimes called “intellectual honesty”)

Yea, good scientific process/thinking.

All is less unwell, however you get to being less wrong.

Part of Bob’s insight here is that sometimes better computation gets you there better than switching models. Even if the model was too wrong, it’s advantageous to have learned in which ways.

Having worked in simulation myself, I cringe to think of the number of PhD theses whose conclusions are crap simply due to unreliable code.

I tend to emphasize a slightly different aspect of the Folk Theorem: If you *need* heavy computing resources to simulate/estimate your model, this in itself is a sign that you might want to reconsider your model. The main problem is not that it will take too long, or won’t converge, or that you made a typo (although these are all important problems that need to be addressed too), the problem is that a model that requires heavy computing resources is also likely to be a model that will be difficult to interpret and communicate. As a result, the model may be limited in its scientific and even practical value.

We do modeling for many purposes, and they are not mutually exclusive. Sometimes we need a model to crank through the numbers and give us predictions that we can use for decision making. Sometimes we need a model to help us identify important quantitative features of the data. And sometimes we use a model to embody causal relations for the purpose of understanding or explaining a system. Many times, we want our model to accomplish more than one of these goals. But these goals exist on a continuum from “practical” (needing to make a decision) to “scientific” (explaining a system).

The need for heavy computing resources limits the scientific value of a model for at least two reasons. First, it makes the model inaccessible to anyone without those resources—you just have to trust that whoever has the computing cluster has done it right, and you can’t explore possibilities beyond those imagined by the original modelers. Second, it makes it difficult to understand and communicate why the model does what it does. Is it just some quirk of the parameters? Would a different ancillary assumption about the distribution of some quantity matter? The need for heavy computation hampers the communicative and explanatory power of the model.

It also hampers the practical utility of the model, for both obvious and less obvious reasons. The obvious reasons include that it takes a long time to run, requires money to get the technology needed to run it, and it is not always possible to guarantee the validity of the final output. The less obvious reasons include the fact that it is hard to apply the model again as circumstances and knowledge change into the future (you have to expend those resources again), and it is costly to adjust or adapt the model to new settings, if it is possible at all (which parameter corresponds to what?). For these reasons, a model that requires heavy computation is one that is unlikely to be picked up again in the future, meaning those resources were expended for basically a one-shot problem.

Where do I think modeling languages like Stan, JAGS, or Julia fit into this? They redefine what counts as “heavy” computing resources; you don’t have to code up a sampler and estimation/simulation can be done with low-cost technology. I think this is what Bob is emphasizing in his post, and I think it is extremely important. But where it seems like Bob emphasizes the practical gains, I think there are also scientific ones too, which I view as potentially more important. By abstracting away the computational aspects of the model, it makes it easier to focus on the causal structures the model is meant to embody. This makes it easier to understand what is going on as well as how to apply the model in new situations, because it is obvious where new variables or knowledge should “slot in” to the model.

But this is also why I think Andrew’s original Folk Theorem will be with us no matter how good Stan gets at computation. The ultimate bottleneck in modeling is the human ability to understand and express the causal relations a model is meant to represent. Better computing makes it easier to express more complex relationships, and when these run into computational problems, it is likely not the fault of the computer, which is doing the best with what it’s got, but of the human modeler failing to understand the implications of the relationships they were trying to express.

> The need for heavy computation hampers the communicative and explanatory power of the model.

An addendum, but there are plenty of models where the computational complexity comes from trying to make the computation tractable (here: https://statmodeling.stat.columbia.edu/2021/03/26/question-on-multilevel-modeling-reminds-me-that-we-need-a-good-modeling-workflow-building-up-your-model-by-including-varying-intercepts-slopes-etc-and-a-good-computing-workflow/#comment-1768255)

So I guess indirectly in that case, need for computation -> need for approximation/complex inference -> hampering of communication