Simulation-based calibration: Some challenges and directions for future research

Simulation-based calibration is the idea of checking Bayesian computation using the following steps, originally from Samantha Cook’s Ph.D. thesis, later appearing in this article by Cook et al., and then elaborated more recently by Talts et al.:

1. Take one draw of the vector of parameters theta_tilde from the prior distribution, p(theta).

2. Take one draw of the data vector y_tilde from the data model, p(y|theta_tilde).

3. Take a bunch of posterior draws of theta from p(theta|y_tilde). This is the part of the computation that typically needs to be checked. Call these draws theta^1,…,theta^L.

4. For each scalar component of theta or quantity of interest h(theta), compute the quantile of the true value h(theta_tilde) within the distribution of values h(theta^l). For a continuous parameter or summary, this quantile will take on one of the values 0/L, 1/L, …, 1.

If all the computations above are correct, then the result of step 4 should be uniformly distributed over the L+1 possible values. To do simulation-based calibration, repeat the above 4 steps N times independently and then check that the distribution of the quantiles from step 4 is approximately uniform.
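To make the recipe concrete, here is a minimal sketch in R (my own toy example, not from Cook et al. or Talts et al.), using a conjugate normal-normal model so that step 3 can use exact posterior draws; in real use, step 3 would be whatever sampler is being checked, for example a Stan fit.

```r
# Minimal SBC sketch: prior theta ~ normal(0, 1), data y_1..y_n ~ normal(theta, 1).
# The posterior here is exact by conjugacy; in practice step 3 is the sampler under test.
set.seed(123)
N <- 1000   # number of SBC replications
L <- 99     # posterior draws per replication
n <- 10     # observations per simulated dataset
ranks <- integer(N)
for (i in 1:N) {
  theta_tilde <- rnorm(1, 0, 1)                      # step 1: draw from the prior
  y_tilde <- rnorm(n, theta_tilde, 1)                # step 2: simulate data
  post_var <- 1 / (1 + n)                            # exact posterior, by conjugacy
  post_mean <- post_var * sum(y_tilde)
  theta_post <- rnorm(L, post_mean, sqrt(post_var))  # step 3: posterior draws
  ranks[i] <- sum(theta_post < theta_tilde)          # step 4: rank, one of 0, ..., L
}
hist(ranks, breaks = seq(-0.5, L + 0.5, by = 1),
     main = "SBC ranks: should be roughly uniform")
```

With a correct implementation the rank histogram should look roughly uniform; a bug in the prior, the data simulation, or the posterior sampler typically shows up as a slope, a U shape, or a hump in this histogram.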

Connection to simulated-data experimentation

Simulation-based calibration is a special case of the more general activity of simulated-data experimentation. In simulation-based calibration, you’re comparing the assumed true value theta_tilde to the posterior simulations theta. More generally, we can do all sorts of experiments. For example, simulated-data experimentation is useful when designing a study: you make some assumptions, simulate some fake data, and then see how accurate the parameter estimates are. The goal here is not necessarily to check that the posterior inference is working well—indeed, there’s no requirement that the inference be Bayesian at all—but rather just to see what might happen under some scenarios. Here’s a discussion from earlier this year on why this can be so useful, and here’s a simple example.

Another application of simulated-data experimentation is to assess the bias in some existing estimation method. For example, suppose we’re concerned about selection bias in an experiment. We can simulate fake data under some assumed true parameter values and some assumed selection model, then do a simple uncorrected estimate and compare this inference with the assumed parameter values. This is similar to simulation-based calibration but not quite the same thing, because we’re not trying to assess whether the fit is calibrated; rather, we’re using the simulation to assess the bias of the estimate (given some assumptions).
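As a toy illustration of that kind of bias check (the selection model and the numbers here are made up for the sketch), one could do something like:

```r
# Simulate data under assumed true values and an assumed selection model,
# then compare the naive, uncorrected estimate to the truth.
set.seed(456)
mu_true <- 2   # assumed true parameter value
bias_draws <- replicate(1000, {
  y <- rnorm(100, mu_true, 1)                 # simulate outcomes
  p_obs <- plogis(y - mu_true)                # assumed selection: larger y more likely observed
  keep <- rbinom(100, 1, p_obs) == 1
  mean(y[keep]) - mu_true                     # uncorrected estimate minus the truth
})
mean(bias_draws)   # estimated bias of the naive estimate under these assumptions
```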

A similar idea arises when assessing the bias of a computational algorithm. For example, variational inference can underestimate uncertainty of local parameters in a hierarchical model. One way to get a sense of this inferential bias is to simulate data from the assumed model, then fit using variational inference, check coverage of the resulting 50% intervals (for example), and loop this to see how often these intervals contain the true value. Or this could be done in other ways; the point is that this is similar to, but not the same as, straight simulation based calibration as outlined at the top of this post.
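Here's a rough sketch of that kind of coverage loop. Running ADVI itself would take an actual model fit, so as a stand-in this uses the exact conjugate posterior with its sd deliberately shrunk by 30% to mimic an underdispersed approximation:

```r
# Loop: simulate data from the model, form an (overconfident) approximate posterior,
# and record how often the nominal 50% interval covers the true value.
set.seed(789)
n <- 10
covered <- replicate(1000, {
  theta_tilde <- rnorm(1, 0, 1)                      # draw "truth" from the prior
  y <- rnorm(n, theta_tilde, 1)                      # simulate data
  post_var <- 1 / (1 + n)                            # exact conjugate posterior
  post_mean <- post_var * sum(y)
  approx_sd <- 0.7 * sqrt(post_var)                  # stand-in for an underdispersed approximation
  ci <- post_mean + qnorm(c(0.25, 0.75)) * approx_sd # nominal 50% interval
  ci[1] < theta_tilde && theta_tilde < ci[2]
})
mean(covered)   # falls short of 0.50 because the intervals are too narrow
```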

Perhaps most important is the use of simulated-data experimentation in modeling workflow. As Martin Modrák put it: simulation-based calibration was presented originally as validating an algorithm given that you trust your model, but in practice we are typically interested in validating a model implementation, given that you trust your sampling algorithm. Or maybe we want to be checking both the model and the computation.

Problems with simulation-based calibration

OK, now back to the specific idea of using simulated data to check calibration of parameter inferences. There are some problems with the method of Cook et al. (2006) and Talts et al. (2021):

– The method can be computationally expensive. If you loop the above steps 1-4 many times, then you’ll need to do posterior inference—step 3—lots of times. You might not want to occupy your cluster with N=1000 parallel simulations.

– The method is a hypothesis test: it’s checking whether the simulations are exactly correct (to within the accuracy of the test). But our simulations are almost never exact. HMC/NUTS is great, but it’s still only approximate. So there’s an awkwardness to the setup, in that we’re testing a hypothesis we don’t expect to be true and counting on the slop in the uncertainty from a finite number of replications to make this work.

– Sometimes we want to test an approximate method that’s not even simulation consistent, something like ADVI or Pathfinder. Simulation-based calibration still seems like a good general idea, but it won’t quite work as stated because we’re not expecting perfect coverage.

– In practice, we don’t necessarily care about the calibration of a procedure everywhere. In a particular application, we typically care about calibration in the neighborhood of the data. If the model being fit has a weak prior distribution, simulation-based calibration drawing from the prior (as in steps 1-4 above) might miss the areas of parameter and data space that we really care about.

– Because of the above concerns, we commonly perform simulation-based calibration with a simpler approach: a single fixed value theta_tilde set to a reasonable value, where “reasonable” is defined based on our understanding of how the model will be used. This has the virtue of testing the computation where we care about it, but now that we’re no longer averaging over the prior, we can’t make any general statements about calibration, even if the posterior simulations are perfect.

Possible solutions

So, what to do? I’m not sure. But I have a few thoughts.

First, in practice we can often learn a lot from just a single simulated dataset, that is, N=1. Modeling or computational problems are often so severe that they show up from just one simulation: for example, we’ll see parameter estimates blowing up or getting stuck somewhere far away from where they should be. In my workflow, I’m often doing this sort of informal experimentation where I set the parameter vector to some reasonable value, simulate fake data, fit the model, and check that the parameter inferences are in the ballpark. It would be good to semi-automate this process so that it’s easy to do in the cmdstanr and cmdstanpy environments, as well as in rstanarm and brms.
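For instance, the informal N=1 check might look something like this (a sketch with made-up parameter values, using rstanarm since it is mentioned above):

```r
# One fake dataset, one fit, one ballpark check.
library(rstanarm)
set.seed(2021)
a_true <- 1; b_true <- 0.5; sigma_true <- 2          # "reasonable" assumed values
x <- runif(100, 0, 10)
y <- rnorm(100, a_true + b_true * x, sigma_true)     # simulate fake data
fit <- stan_glm(y ~ x, data = data.frame(x, y), refresh = 0)
print(fit)                                           # are the estimates near (1, 0.5, 2)?
posterior_interval(fit, prob = 0.9)                  # do the intervals cover the true values?
```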

Second, we’re often fitting hierarchical models, and then we can follow a hybrid approach. For a hierarchical model with hyperparameters phi and local parameters alpha, we can set phi_tilde to a reasonable value and draw alpha_tilde from its distribution, p(alpha|phi_tilde). If the number of groups is large, there can be enough internal replication in alpha that much can be learned about coverage from just a single simulated dataset—although we have to recognize that this can all depend strongly on phi, so it might make sense to repeat this for a few values of phi.
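A sketch of this hybrid approach, again with a conjugate normal model standing in for whatever fit is actually being checked, so that the per-group posteriors are available in closed form:

```r
# Fix phi = (mu, tau), draw the alpha_j's, simulate one dataset,
# and check coverage of 50% intervals across the J groups.
set.seed(99)
J <- 50                                        # number of groups
mu <- 0; tau <- 1; sigma <- 0.5                # fixed "reasonable" hyperparameters and noise sd
alpha_tilde <- rnorm(J, mu, tau)               # draw local parameters given phi
y <- rnorm(J, alpha_tilde, sigma)              # one observation per group
post_var <- 1 / (1 / tau^2 + 1 / sigma^2)      # conjugate posterior for each alpha_j
post_mean <- post_var * (mu / tau^2 + y / sigma^2)
lower <- qnorm(0.25, post_mean, sqrt(post_var))   # 50% posterior intervals
upper <- qnorm(0.75, post_mean, sqrt(post_var))
mean(lower < alpha_tilde & alpha_tilde < upper)   # should be near 0.50 across the J groups
```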

Third, when checking the computation of a model fit to a particular dataset, I have the following idea. Start by fitting the model, which gives you posterior simulations of theta. Then, for each of several random draws of theta from the posterior, simulate a new dataset y_rep, re-fit the model to this y_rep, and check the coverage of these inferences with respect to the posterior draw of theta. This has the form of simulation-based calibration except that we’re drawing theta from the posterior, not the prior. This makes a lot of sense to me—after all, it’s the posterior distribution that I care about—but we’re no longer simply drawing from the joint distribution, so we can’t expect the same coverage. On the other hand, if we had really crappy coverage, that would be a problem, right? I’m thinking we should try to understand this in the context of some simple examples.
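Here's what that might look like in the same toy conjugate setting (the point being the structure of the loop, not the particular model):

```r
# Fit once to the "observed" data, then repeatedly: draw theta from the posterior,
# simulate y_rep, re-fit, and check coverage with respect to that posterior draw.
set.seed(321)
n <- 20
y_obs <- rnorm(n, 1.3, 1)                           # stand-in for the observed data
post_var <- 1 / (1 + n)                             # posterior from the original fit
post_mean <- post_var * sum(y_obs)
covered <- replicate(200, {
  theta_draw <- rnorm(1, post_mean, sqrt(post_var)) # draw theta from the posterior
  y_rep <- rnorm(n, theta_draw, 1)                  # simulate a replicated dataset
  v <- 1 / (1 + n); m <- v * sum(y_rep)             # re-fit the model to y_rep
  ci <- qnorm(c(0.25, 0.75), m, sqrt(v))
  ci[1] < theta_draw && theta_draw < ci[2]
})
mean(covered)   # need not be 0.50, since theta was drawn from the posterior, not the prior
```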

Fourth, I’d like any version of simulation-based calibration to be set up as a measure of miscalibration rather than as a hypothesis test. My model here is R-hat, which is a potential scale reduction factor and which is never expected to be exactly 1.00000.

Fifth, this last idea connects to the use of simulation-based calibration for explicitly approximate computations such as ADVI or Pathfinder, where the goal should not be to check whether the method is calibrated but rather to measure the extent of the miscalibration. One measure that we could try is ((computed posterior mean) – (true parameter value)) / (computed posterior sd). Or maybe that’s not quite right, I’m not sure. The point is that we want a measure of how bad the fit is, not a yes/no hypothesis test.
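As a quick check of how such a measure behaves, here's a sketch in the toy conjugate model: for an exact fit, this z-score is standard normal over replications, while an artificially narrowed posterior (sd shrunk by 30%, as a stand-in for an overconfident approximation) shows up as sd(z) > 1.

```r
# z = (computed posterior mean - true value) / (computed posterior sd),
# computed over replications for an exact fit and for a too-narrow stand-in.
set.seed(654)
n <- 10
z_exact <- z_approx <- numeric(1000)
for (i in 1:1000) {
  theta_tilde <- rnorm(1, 0, 1)
  y <- rnorm(n, theta_tilde, 1)
  post_var <- 1 / (1 + n)
  post_mean <- post_var * sum(y)
  z_exact[i]  <- (post_mean - theta_tilde) / sqrt(post_var)          # calibrated fit
  z_approx[i] <- (post_mean - theta_tilde) / (0.7 * sqrt(post_var))  # overconfident stand-in
}
c(sd(z_exact), sd(z_approx))   # roughly 1 for the exact fit, about 1.4 for the narrow one
```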

To summarize, I see three clear research directions:

(a) Adjusting for the miscalibration that will arise when we’re no longer averaging over the prior (because the number of replication draws N is not large, or because we want to focus on the posterior or some other region of interest in parameter space).

(b) Coming up with an interpretable measure of miscalibration rather than framing as a test of the hypothesis of perfect calibration.

(c) Incorporating this into workflow so that it’s convenient and not computationally expensive.

13 thoughts on “Simulation-based calibration: Some challenges and directions for future research”

  1. I’ve never thought very deeply about (a), but the SBC package team will hopefully have something to say about (b) and (c) reasonably soon! In particular, a primary design consideration for the SBC package API was to let you easily and gradually transition from 1 simulation on your laptop to 1000 simulations on a cluster (as needed) without really changing anything in your code and with very little boilerplate. We’re not quite there yet (the code will fail on a cluster and there is _some_ boilerplate), but I am quite confident that the infrastructure is right and this goal is attainable without much additional effort.

    A specific way to test miscalibration from just a single simulation (or a few) in a complex model with a lot of parameters is to pool the ranks/z-scores/whatever you have for all parameters together and do your coverage test (or anything else you want) with this bigger pool – you lose a lot of theoretical guarantees as the variables are not independent, but you greatly increase your sensitivity to issues. I’ve tested this informally in my past model development adventures and it worked reasonably well in practice, so it’s definitely coming to the SBC package soon.

  2. > Start by fitting the model, which gives you posterior simulations of theta. Then, for each of several random draws of theta from the posterior, simulate a new dataset y_rep, re-fit the model to this y_rep, and check the coverage of these inferences with respect to the posterior draw of theta.

    Sounds a lot like the double bootstrap.

    The term “simulated-data experimentation” does label it as experimentation, and experimentation is always open ended and continually evolves as one learns.

    Now the experimentation is really on the probability models as abstract models.

    Simulation is just drawing pseudo-random numbers as _coarsened_ implications of the probability models being logically connected to them by the way they were drawn.

  3. One question I have about simulating data sets. Does this kind of SBC-like approach conflict in any way with the idea, which appears often in classical Bayesian texts, that data are fixed once collected? We are not assuming that the data are fixed when simulating. Of course, once the real data come in, we can treat them as a fixed set of numbers (which is what they are).

    • My understanding is that there are two types of testing for simulation models: verification and validation. Verification usually precedes validation, and it checks internal consistency, as opposed to the external consistency checked by validation (see Boehm, Verifying and Validating Software Requirements and Design Specifications; his software development philosophy seems to have inspired some ideas in Bayesian workflow, according to Aki’s lecture: https://youtu.be/ppKpwtGy8KQ?t=1402).

      What you mentioned as data being fixed can be understood in a validation setting. A predictive check where the simulated outcome is compared with the fixed outcome to test the validity of the model is an example.

      On the other hand, simulation-based calibration is a verification test in that it checks the internal consistency of the assumptions and tools at hand. To be precise, it checks the consistency of the three components (parameter value, outcome model p(y|theta), inference engine f(theta|y)) that modelers assumed for their task.

    • Loosely, the joint model is the major premise and the data is the minor premise and by deduction the posterior follows.

      The data is just the dead past but in science we are more interested in (betting about) the future.

      The posterior model is just a representation of reality that informs our future actions. One way to better understand that representation would be to simulate from the posterior to see what repeatedly would happen.

      For a wider sense of reference of trying to understand how the joint model would repeatedly perform, one could sample from the joint model (draw parameters from the prior and then fake data from the data generating model) to see how it would repeatedly perform. Here we know the truth – the parameter point drawn from the prior – so the performance of the repeated use of the joint model is directly assessed.

      But as with all deduction, it depends on the joint model being exactly true, and we all know it’s not. To get beyond that, you can simulate from a different joint model than is used in the original analysis. Or you can evaluate those horrid frequency properties to get a safety net: if my joint model is so wrong that the implied posterior is not an applicable posterior, you can at least know the interval coverage is still reasonable, at least for parameters that had reasonable prior probability.

      • These go into much more depth about going beyond just doing deduction in Bayesian analyses (accepting the posterior without reservations) and might be of interest:

        https://statmodeling.stat.columbia.edu/2021/08/18/measuring-the-information-in-an-empirical-prior/

        In this video David Dunson discusses the topic in a number of places – David Dunson | Advancing Statistical Science | Philosophy of Data Science https://www.youtube.com/watch?v=-rWf0TvPyis

        • Interesting; thanks. Interesting video too. I think the most amazing part of that video recording was around the 29th minute, when the astonishing birds of Townsville start live-tweeting his lecture. [I’m just crazy about the birds in Townsville. I could listen to them all day].

        • I watched the video to the end. Some observations:

          – It’s great that Dunson is pretty open about how cavalier Bayesians are about decisions about priors. It’s ironic that we don’t generally put enough thought into it, given the central role that priors play in our lives.

          – As an advisor for PhD students, I found it very interesting that Dunson says he has no strategy for mentoring PhD students. He seems to have a self-organizing and fairly “random search” approach to deciding what his students should work on. The most important thing he said was that he doesn’t give his students projects to work on; he has them develop their own projects. After 20 years of advising, I think this was the biggest mistake I made: in most of the cases where I assigned the topic to the PhD student, things didn’t work out as well as they could have. What works best is when the student owns their own problem by developing it themselves. It’s really important to let the student take up a problem they themselves choose to work on.

          About the main topic of this thread, I will respond to Ben’s comment.

    • For EACH set of draws whose PDF is p(theta|y_tilde), the draws of theta are conditional on the (simulated) data, y_tilde. So, the distribution of the indicator function I{ (theta | y_tilde) > theta_tilde} is a valid function of THAT posterior distribution. However, with SBC, we are interested in the distribution of this indicator function over the prior distribution of theta_tilde (which each y_tilde is a function of). So, it is more like we are making an inference (about an algorithm rather than about a data-generating process) using an indicator function of posterior distributions induced by a prior distribution. While I would agree that the conclusion — namely, an algorithm does or does not draw from the intended posterior distribution — is not a Bayesian conclusion (in the sense, that we are not saying something like “there is a greater than 99% chance that the algorithm will work correctly given the data”), it is nevertheless a coherent, probability-based conclusion about a purported Bayesian algorithm.

      • Thanks to Hyunji, Ben, and Keith for their responses.

        I understand what you are all saying, and I agree with everything you say of course.

        I’m talking about how simulating *data* implies that the data come from a random variable, i.e., they are not fixed but random. That’s a frequentist move and it wasn’t anything I ever saw discussed when I started studying Bayesian methods (except of course in Andrew’s books).

        When we compute a posterior f(theta | y), the y is a fixed quantity. It isn’t a random variable when we fit the model.

        In classical Bayesian presentations in textbooks, the y is always treated as a fixed quantity; there is no assumption that we could repeatedly generate hypothetical new instances of y, or any consideration of what the properties of such repeated hypothetical instances of y would be.

        Now, when we study the posterior predictive distribution f(ypred|y), the y is still fixed. But ypred is now a random variable.

        If we now study the properties of the repeated instances of ypred with respect to the model specification, e.g., by fitting the model repeatedly on ypred to study the variation in the posterior under repeated sampling, we are stepping away from treating the “data” (here, ypred) as a fixed quantity. I understand the goal here of course. I’m just saying that we have stepped away from treating the data as a constant quantity, and this needs to be spelled out (e.g., for students). It’s a frequentist way of thinking.

        • Hi,
          I think that’s a super interesting perspective – there definitely are frequentist connections to some aspects of SBC. I admit I’ve never thought too deeply about them, because I’ve always seen SBC from the perspective of software testing. I.e., I don’t think SBC or similar approaches carry many philosophical assumptions with them – it’s IMHO more like computing (x + y)^2 and then computing x^2 + 2*x*y + y^2 and checking how well the results match: we know from mathematical theory that they should be equal, so any discrepancy tells us we haven’t implemented the theory precisely. And indeed, for the standard floating-point implementation on modern computers, the values will generally not be equivalent (and may differ substantially).

          With that said, I see where this might be confusing; spelling out the philosophy and the connections to frequentist statistics is probably something we want to be explicit about, and thanks for reminding me of that.

        • > It’s a frequentist way of thinking.
          I believe it is, and there is no way around that, but that may be why so many Bayesians avoid being explicit about it.

          It is simply good scientific practice, and the chagrin about it being frequentist is just bad metaphysics.
