Yea, good scientific process/thinking.

All is well however you get to being less wrong.

Part of Bob’s insight here is that sometimes better computation gets you there faster than switching models. Even if the model turned out to be wrong, it’s advantageous to have learned in which ways it was wrong.

]]>The need for heavy computation hampers the communicative and explanatory power of the model.

An addendum: there are plenty of models where the complexity comes from trying to make the computation tractable (here: https://statmodeling.stat.columbia.edu/2021/03/26/question-on-multilevel-modeling-reminds-me-that-we-need-a-good-modeling-workflow-building-up-your-model-by-including-varying-intercepts-slopes-etc-and-a-good-computing-workflow/#comment-1768255)

So I guess indirectly in that case, need for computation -> need for approximation/complex inference -> hampering of communication

]]>We do modeling for many purposes, and they are not mutually exclusive. Sometimes we need a model to crank through the numbers and give us predictions that we can use for decision making. Sometimes we need a model to help us identify important quantitative features of the data. And sometimes we use a model to embody causal relations for the purpose of understanding or explaining a system. Many times, we want our model to accomplish more than one of these goals. But these goals exist on a continuum from “practical” (needing to make a decision) to “scientific” (explaining a system).

The need for heavy computing resources limits the scientific value of a model for at least two reasons. First, it makes the model inaccessible to anyone without those resources—you just have to trust that whoever has the computing cluster has done it right, and you can’t explore possibilities beyond those imagined by the original modelers. Second, it makes it difficult to understand and communicate why the model does what it does. Is it just some quirk of the parameters? Would a different ancillary assumption about the distribution of some quantity matter? The need for heavy computation hampers the communicative and explanatory power of the model.

It also hampers the practical utility of the model, for both obvious and less obvious reasons. The obvious reasons include that it takes a long time to run, requires money to get the technology needed to run it, and it is not always possible to guarantee the validity of the final output. The less obvious reasons include the fact that it is hard to apply the model again as circumstances and knowledge change into the future (you have to expend those resources again), and it is costly to adjust or adapt the model to new settings, if it is possible at all (which parameter corresponds to what?). For these reasons, a model that requires heavy computation is one that is unlikely to be picked up again in the future, meaning those resources were expended for basically a one-shot problem.

Where do I think modeling languages like Stan, JAGS, or Julia fit into this? They redefine what counts as “heavy” computing resources; you don’t have to code up a sampler and estimation/simulation can be done with low-cost technology. I think this is what Bob is emphasizing in his post, and I think it is extremely important. But where it seems like Bob emphasizes the practical gains, I think there are also scientific ones too, which I view as potentially more important. By abstracting away the computational aspects of the model, it makes it easier to focus on the causal structures the model is meant to embody. This makes it easier to understand what is going on as well as how to apply the model in new situations, because it is obvious where new variables or knowledge should “slot in” to the model.

But this is also why I think Andrew’s original Folk Theorem will be with us no matter how good Stan gets at computation. The ultimate bottleneck in modeling is the human ability to understand and express the causal relations a model is meant to represent. Better computing makes it easier to express more complex relationships, and when these run into computational problems, it is likely not the fault of the computer, which is doing the best with what it’s got, but of the human modeler failing to understand the implications of the relationships they were trying to express.

]]>Absolutely this. In my experience many modelling errors show themselves as computational issues. Either it converges suspiciously fast or not at all.

]]>Having worked in simulation myself I cringe to think of the number of PhD theses whose conclusions are crap simply due to unreliable code.

]]>Another modeler said,

“I tell my grad students and anyone that listens that in the US legal system you are presumed innocent until proven guilty, but in computation you should assume your calculation is messed up and incorrect until proven correct. In other words, always assume you have made errors in your computation and think hard about how to look for them.”

Well put, and this applies to lots of other things as well; it’s part of being honest with yourself (sometimes called “intellectual honesty”).

]]>less about the theoretical limits of computations and more about where errors are likely to occur

Yes, if things go wrong, a bug is always a possible cause. See Another Modeler’s response below.

But usually we’re talking about the post-debugging stage when the problem arises because of modeling decisions or due to mismatches of the model to the data.

For example, suppose we have an IRT model with a likelihood y[n] ~ bernoulli(inv_logit(a[jj[n]] - b[ii[n]])). There’s an additive non-identifiability between a and b. Now you can identify in the prior, say by taking student abilities a[1:J] ~ normal(0, sigma) and question difficulties b[1:I] ~ normal(mu, tau). By centering the abilities a around 0, I’ve technically identified the location of a and b, so this should be OK. There’s nothing wrong with this model per se as far as posterior inference goes. But it’s going to be a nightmare to fit with HMC for two reasons.

First, the non-identifiability in the likelihood is only weakly identified in the prior. You can mitigate this problem by forcing the model to respect mean(a) = 0 and mean(b) = mu: reduce the number of parameters to J - 1 and I - 1, define a[J] = -sum(a[1:J-1]), and pull mu out into an intercept term in the likelihood. You could instead try setting a[1] = 0, as is commonly recommended, but that doesn’t work as well computationally. I discuss this in the Stan user’s guide chapter on problematic posteriors.

Second, we have a centered parameterization of the hierarchical prior. This is going to introduce difficult geometry into the posterior. So we’re going to have to reparameterize, again not because the model’s wrong, but because our compute can’t handle the model in its more natural centered parameterization.
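As a sketch of those two fixes, here is a hypothetical Python/NumPy illustration (not the Stan implementation; the sizes and scales are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

J = 5        # number of students (illustrative size)
sigma = 1.3  # ability scale (illustrative value)

# Fix 1: sum-to-zero constraint. Sample J-1 free abilities and define
# the last one as the negative sum, so mean(a) == 0 holds exactly by
# construction, rather than only weakly through the prior.
a_free = rng.normal(0.0, sigma, size=J - 1)
a = np.append(a_free, -a_free.sum())

# Fix 2: non-centered parameterization. Sample standardized effects and
# scale by sigma, instead of sampling a ~ normal(0, sigma) directly;
# this decouples the effects from their scale and avoids the
# funnel-shaped posterior geometry.
a_raw = rng.normal(0.0, 1.0, size=J)
a_noncentered = sigma * a_raw
```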

Other cases come up with things like too-wide priors and not enough data. Andrew convinced a bunch of people that half-Cauchy priors for scale parameters are a good thing; then we saw that they can blow up posterior variance if there’s not much data. I think this was a case of the folk theorem catching up with us. Stan really works hard on those tails, and when you see the consequences of assuming a Cauchy prior, you realize it’s unreasonable. Hence our latter-day emphasis on prior predictive checks as well as posterior predictive checks. Your prior will be your posterior in small-data situations!
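The prior predictive point takes only a few lines of simulation to see. A sketch, assuming a half-Cauchy(0, 5) scale prior (the 5 is just an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)

# Prior predictive draws of a group-level scale tau ~ half-Cauchy(0, 5):
# absolute values of scaled standard-Cauchy draws.
tau = np.abs(5.0 * rng.standard_cauchy(100_000))

# The heavy tail shows up immediately: a nontrivial fraction of prior
# draws put the group-level scale above 100, i.e. wildly implausible
# variation before seeing any data.
print(np.median(tau), np.mean(tau > 100))
```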

]]>Absolutely. The important point for production software is to encapsulate how to look for bugs.

in computation you should assume your calculation is messed up and incorrect until proven correct

Yes!

At the process level, we try to agree on designs first. Then we do code review.

At the code API level we do unit testing, integration testing, and regression testing. It is quite expensive in compute to run all those tests for a project our size.

At the Stan language level, you have strong static typing, which I find helps tremendously with catching errors at compile time. At run time, we check domain conditions and report exceptions if they’re violated, which also helps with debugging.
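A toy version of that kind of runtime domain checking, sketched in Python rather than Stan (the function and its checks are just for illustration):

```python
import math

def lognormal_lpdf(y: float, mu: float, sigma: float) -> float:
    """Log density of lognormal(mu, sigma) at y, with domain checks."""
    # Check domain conditions up front and fail with an informative
    # exception, instead of silently propagating NaN downstream.
    if not sigma > 0:
        raise ValueError(f"sigma must be positive; found sigma = {sigma}")
    if not y > 0:
        raise ValueError(f"y must be positive; found y = {y}")
    return (-math.log(y * sigma) - 0.5 * math.log(2 * math.pi)
            - (math.log(y) - mu) ** 2 / (2 * sigma ** 2))
```

The exception message pinpoints which argument went out of range, which is exactly the debugging help the runtime checks provide.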

At the statistical methodology level, we recommend testing with simulated data, then doing posterior predictive checks on real data, and finally verifying with cross-validation.

assume you have made errors in your computation and think hard about how to look for them.

Even better, think about how to automate tests that look for them. That’s the key to having maintainable code.
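For a statistical workflow, automating the tests can be as simple as a parameter-recovery check on simulated data. A minimal sketch with a deliberately trivial model (a normal mean) and made-up numbers:

```python
import numpy as np

def test_recovers_known_mean():
    # Simulate data from known parameters, fit, and assert that the
    # estimate lands within a few standard errors of the truth. If a
    # code change breaks the estimator, this test fails loudly.
    rng = np.random.default_rng(42)
    mu_true, sigma, n = 2.5, 1.0, 10_000
    y = rng.normal(mu_true, sigma, size=n)
    mu_hat = y.mean()
    se = sigma / np.sqrt(n)
    assert abs(mu_hat - mu_true) < 4 * se

test_recovers_known_mean()
```

The same pattern scales up to full Bayesian models, where it becomes simulation-based calibration.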

]]>Thanks for these examples of conceptually natural and simple models that still run into sampling difficulties

]]>Mikhail: a lot of people consider any Bayesian inference a fancy gyroscooter (still sounds safer than a jetpack!). It’s all a matter of how desperate you are to fit a model.

(from somebody) With modern HMC, posteriors that can’t be sampled are typically also posteriors that can’t be interpreted easily: multimodal, long troughs of stationary points, etc.

Multimodal posteriors remain problematic, especially the combinatorial ones like in Ising models, mixture models, or neural nets. But I’m really hoping we can do better on models with lots of weakly identified effects, like multilevel models. I’m having a really hard time fitting a couple of applied models I’d really like to fit.

One is a spatio-temporal Covid prevalence model with millions of observations, a half-dozen covariates, hundreds of spatial regions, and dozens of weeks. We need to adjust for test sensitivity and specificity to make this sensible, and would like to jointly estimate varying lab effects in a hierarchical model. We’d also like hierarchical priors on all the effects. I’d very much like to use a trend-following AR(2) time-series prior and an ICAR spatial prior, but we can’t get even close to fitting that model. The simple random-walk model with fixed sensitivity and specificity and no hierarchical priors takes roughly two days to fit. Now what? I really want to smooth in both space and time, and I want to adjust for uncertainty in sensitivity and specificity. But so far, no luck. We may just scale back all the priors to be fixed and use an MLE.

Another is a genomics model of splice variation with biological replicates, where there’s natural hierarchical structure, but adding the dispersion necessary for the model to make sense makes it very hard to fit. That’s not just a matter of “tightening up the priors.”

Another instance where I have problems is in models of data annotation/coding/rating, where I very much want to fit IRT-like difficulty, discrimination, and guessing parameters, but adding them all makes the models very hard to fit.
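For what it’s worth, the sensitivity/specificity adjustment mentioned for the prevalence model is simple on its own; the hard part is embedding it in the hierarchical model. A standalone sketch with illustrative numbers:

```python
def positive_rate(prev: float, sens: float, spec: float) -> float:
    # Probability a test comes back positive, marginalizing over true
    # status: true positives plus false positives.
    return prev * sens + (1.0 - prev) * (1.0 - spec)

def adjusted_prevalence(p_obs: float, sens: float, spec: float) -> float:
    # Invert the relation above (the Rogan-Gladen correction) to back
    # out true prevalence from the observed positive rate.
    return (p_obs + spec - 1.0) / (sens + spec - 1.0)

# Round trip with illustrative values: 10% true prevalence,
# 90% sensitivity, 95% specificity.
p_obs = positive_rate(0.10, 0.90, 0.95)
print(p_obs, adjusted_prevalence(p_obs, 0.90, 0.95))
```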

]]>That interpretation is probably more useful for someone like me who’s relatively inexperienced with Stan though.

]]>+1 here

In my experience (and I used to code my computational methods myself), every time convergence failed, I was better off changing the model instead of improving the computation.

So I feel about all advanced computation methods the same way as I feel about gyroscooters: a fancy new way of breaking one’s limbs.

]]>I think you and the rest of the Stan team are often deliberately working to make problems computable that were previously impractical to compute, so you’re often working with stuff that is barely computable in reasonable time.

Most of us, though, are working with whatever model seems appropriate to our data and not too crazy to compute, and…well, I guess I can’t speak for ‘most of us’, but I can speak for me: there are models I wouldn’t dream of trying to fit because they are clearly way too big, and then there are the models I actually try to fit in practice…and in that latter category the Folk Theorem of Computation works almost infallibly: just about every time my model won’t converge in a reasonable amount of time it’s because I’ve either coded a model that isn’t what I actually had in mind, or I’ve underspecified the model as in your ‘sometimes it really is the model’ example.

]]>