I really liked this paper, and am curious what other people think before I base a grant application around applying Stan to this problem in a machine-learning context.

- Gneiting, T., Balabdaoui, F., & Raftery, A. E. (2007). Probabilistic forecasts, calibration and sharpness.
*Journal of the Royal Statistical Society: Series B (Statistical Methodology)*, 69(2), 243–268.

Gneiting et al. define what I think is a pretty standard notion of *calibration* for Bayesian models based on coverage, but I’m not 100% sure if there are alternative sensible definitions.

They also define a notion of *sharpness*, which for continuous predictions essentially means narrow posterior intervals, hence the name.

By way of analogy to point estimators, calibration is like unbiasedness and sharpness is like precision (i.e., inverse variance).

I seem to recall that Andrew told me that calibration is a frequentist notion, whereas a true Bayesian would just believe their priors. I’m not so worried about those labels here as about the methodological ramifications of taking the ideas of calibration and sharpness seriously.

I’ve seen this paper before and also like it. These proper scoring rules make sense for probabilistic predictions on out-of-sample data. I’ve used CRPS to compare models before, and it makes intuitive sense when you check individual predictive distributions against the actual values to see why one model scores higher than another on a per-example basis.
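For concreteness, the per-example CRPS comparison can be sketched from posterior draws using the identity CRPS(F, y) = E|X − y| − ½ E|X − X′|. This is a minimal numpy illustration with made-up normal forecasts, not code from any of the papers discussed:

```python
import numpy as np

def crps_samples(draws, y):
    """Empirical CRPS of a sample-based predictive distribution for one
    observed value y (lower is better): E|X - y| - 0.5 * E|X - X'|."""
    draws = np.asarray(draws, dtype=float)
    term1 = np.mean(np.abs(draws - y))                     # E|X - y|
    term2 = 0.5 * np.mean(np.abs(draws[:, None] - draws))  # 0.5 * E|X - X'|
    return term1 - term2

rng = np.random.default_rng(0)
y_obs = 0.1
# A sharp, well-located forecast should score better (lower) than a diffuse one.
sharp = crps_samples(rng.normal(0.0, 1.0, 2000), y_obs)
diffuse = crps_samples(rng.normal(0.0, 5.0, 2000), y_obs)
print(sharp < diffuse)  # expect True
```

Scoring each held-out observation this way and averaging gives the per-model comparison, while the per-example scores show where one model’s predictive distribution beats the other’s.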

I know that proper scoring methods generally and CRPS specifically are advocated by Briggs.

It’s a nice paper. Posterior predictive checking and especially LOO predictive checking (e.g., LOO-PIT, implemented, e.g., in bayesplot) assess calibration. Sharpness under the log score is the same as the negative entropy of the predictive distribution (i.e., small entropy corresponds to a sharp distribution), which can also be called the expected log self-predictive density (see Vehtari & Ojanen, 2012). elpd in loo is good if both calibration and sharpness are good, but it doesn’t separate them, which is why we recommend using PPC and LOO-PIT in addition to making model comparisons with elpd_loo.
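The identification of sharpness under the log score with negative entropy is easy to see in closed form for a normal predictive distribution; a small sketch (my own illustration, not from the thread):

```python
import numpy as np

def normal_entropy(sigma):
    """Differential entropy of N(mu, sigma^2): 0.5 * log(2*pi*e*sigma^2).
    It depends only on the spread, not the mean."""
    return 0.5 * np.log(2.0 * np.pi * np.e * sigma**2)

# A sharper (smaller sigma) predictive distribution has smaller entropy,
# i.e., larger negative entropy / expected log self-predictive density.
print(normal_entropy(0.5) < normal_entropy(2.0))  # expect True
```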

Thanks, Aki. That’s exactly the kind of methodological advice I was looking for. Specifically, it verified my hunch that sharpness was just (negative) entropy and that calibration was tied to PPCs. I’ll definitely check out your paper next.

The only thing I was previously familiar with for assessing calibration is the way we’re now using simulation-based calibration. That is indeed just repeated posterior predictive checks based on simulated parameters and data, using the parameter draws as the statistic of interest. In general, PPCs can check any quantity of interest, not just parameter estimates.
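The “repeated PPCs with parameter draws as the statistic” view can be sketched end to end in a toy conjugate model where the posterior is exact, so the uniformity of the simulation-based calibration ranks is easy to verify. This is a hypothetical illustration, not Stan’s SBC implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy conjugate model: theta ~ N(0, 1), y | theta ~ N(theta, 1), one observation.
# Exact posterior: theta | y ~ N(y / 2, 1 / 2), so no MCMC is needed here.
def sbc_ranks(n_sims=2000, n_draws=99):
    ranks = []
    for _ in range(n_sims):
        theta = rng.normal(0.0, 1.0)                        # draw from the prior
        y = rng.normal(theta, 1.0)                          # simulate data
        post = rng.normal(y / 2.0, np.sqrt(0.5), n_draws)   # exact posterior draws
        ranks.append(np.sum(post < theta))                  # rank in 0..n_draws
    return np.array(ranks)

ranks = sbc_ranks()
# Under a correct model + correct posterior, the ranks are uniform on {0,...,99}.
print(ranks.mean() / 99)  # should be near 0.5
```

In practice the exact posterior draws are replaced by MCMC draws, and departures from uniformity in the rank histogram flag miscalibration of the sampler or model.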

P.S. I was about to ask Aki Vehtari and Dan Simpson and Andrew Gelman directly, but figured I’d ask everyone instead.

Bob, Aki, thanks for this. I, too, have been trying to figure out where calibration-in-the-small fits into a Bayesian framework, and now it just seems obvious that it’s a special case of posterior predictive checking.

Haven’t read more than the abstract (yet), but: I always thought of calibration as being about the accuracy of *interval* estimates, i.e., analogous to a method for constructing confidence intervals having nominal coverage (check plots, etc.). I don’t see how it’s connected to unbiasedness/point estimates.

And BTW, it looks like the link you give for the paper leads back to this page (rather than to the paper at https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9868.2007.00587.x )

Yes, calibration is definitely about the coverage.

Let me try again, though it may just be a doomed analogy. Unbiasedness means getting the right estimate in expectation, whereas calibration is about getting the right posterior in terms of expected coverage. In the same way, sharpness seems related to the precision (inverse variance) of an estimator.

The reason I brought this up is that in point estimation you can reduce variance at the cost of an increase in bias (e.g., by adding a prior or hierarchical prior). So I was wondering if you could do something similar w.r.t. calibration and sharpness (e.g., increasing sharpness by sacrificing some calibration).
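The coverage sense of calibration is directly checkable by simulation: draw parameters from the prior, simulate data, and count how often a nominal posterior interval covers the truth. A minimal sketch with an exact conjugate posterior (my own toy example, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

# Same toy conjugate model: theta ~ N(0, 1), y | theta ~ N(theta, 1).
# Over repeated simulations from the model, a nominal 90% central posterior
# interval should cover the true theta about 90% of the time.
def coverage(n_sims=4000, level=0.90):
    hits = 0
    for _ in range(n_sims):
        theta = rng.normal(0.0, 1.0)
        y = rng.normal(theta, 1.0)
        post = rng.normal(y / 2.0, np.sqrt(0.5), 1000)  # exact posterior draws
        lo, hi = np.quantile(post, [(1 - level) / 2, (1 + level) / 2])
        hits += (lo <= theta <= hi)
    return hits / n_sims

cov = coverage()
print(cov)  # should be close to 0.90
```

A sharpened-but-miscalibrated posterior (e.g., one with artificially shrunk intervals) would show up here as empirical coverage below the nominal level.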

If so, then calibration of a Bayesian posterior seems like the same kind of condition as unbiasedness of an estimator.

I also fixed the link.

When I plot a high-resolution calibration curve for a probability prediction model, I’m plotting the actual proportion of cases at each level of predicted risk. The interval of interest here is the entire range of predicted risk, but I also get information about the accuracy of predicted risk at any point along the (0,1) domain. And when comparing two models, I can see whether one model is well-calibrated across a broader range of the (0,1) domain than another.
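The binning behind such a calibration (reliability) curve is simple to sketch; here is a minimal numpy version, with simulated outcomes from a perfectly calibrated model as a sanity check (my own illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

def calibration_curve(p_pred, y, n_bins=10):
    """Bin predicted risks and return (mean predicted risk, observed proportion
    of events) per non-empty bin -- the points of a calibration curve."""
    p_pred, y = np.asarray(p_pred, dtype=float), np.asarray(y, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p_pred, edges) - 1, 0, n_bins - 1)
    bins = [b for b in range(n_bins) if np.any(idx == b)]
    pred_mean = np.array([p_pred[idx == b].mean() for b in bins])
    obs_prop = np.array([y[idx == b].mean() for b in bins])
    return pred_mean, obs_prop

# Simulate a perfectly calibrated model: outcomes drawn with the predicted risk.
p = rng.uniform(0.0, 1.0, 20000)
y = rng.binomial(1, p)
pred, obs = calibration_curve(p, y)
print(np.max(np.abs(pred - obs)))  # small for a well-calibrated model
```

For a miscalibrated model the (pred, obs) points drift off the diagonal, and plotting two models’ curves together shows where on (0,1) each one is trustworthy.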

I assume you know the classic paper in JASA by Dawid, but I found the discussion of it and related problems on the afinetheorem blog useful: https://afinetheorem.wordpress.com/2011/01/23/the-well-calibrated-bayesian-a-p-dawid-1982/

Yes, but I found it much, much harder to read. I wrote a similar blog post a few years back, titled *Bayesian posteriors are calibrated by definition*.

The only drawback is that it requires the model to be right, which is rarely the case for real data in the wild. Since we rarely believe our models are actually correct, in practice we have to fall back on empirical coverage tests like posterior predictive checks.

I’m glad to see that paper here; I used it recently in a paper presenting a method to improve epidemic forecasts (https://journals.plos.org/plosntds/article?id=10.1371/journal.pntd.0006526). That paper uses Stan too. It is very applied, but maybe it will be of use.

There has been more work on calibration since the 2007 paper; see, for example, Tsyplakov (2013, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2236605) or Strähl and Ziegel (2017, https://projecteuclid.org/euclid.ejs/1488531637) for newer notions of calibration.

Regarding posterior evaluation, it has been argued (e.g., Little, 2006, https://doi.org/10.1198/000313006X117837) that frequentist principles are useful for evaluating Bayesian models. See also Krüger et al. (2016, https://arxiv.org/abs/1608.06802) regarding scoring rule computation (such as the CRPS) in an MCMC context.

Thanks for the references. Gelman was a student and collaborator of Little and Rubin, whereas I learned stats from Gelman. So it’s not surprising we’re all thinking the same way about a lot of these problems. Maybe I’ll be able to read them all and post again with a summary.

Proper scoring rules were used to assess the performance of prediction intervals in the recent M4 forecasting competition: See section 3.3 of https://www.sciencedirect.com/science/article/pii/S0169207019301128

Of note, the best-performing method combined statistical and machine learning methods (https://www.sciencedirect.com/science/article/pii/S0169207019301153)

Thanks for the references. Looks like I have a lot of reading to do.

By the usual definition of statistics, machine learning methods are statistical methods. The difference seems mainly to be that they’re developed algorithmically by computer scientists and focused on prediction rather than estimation or hypothesis testing. I have a lot of respect for a lot of what’s been accomplished in machine learning, not the least of which is refocusing broader attention on predictive methods. Perhaps that’s because I spent fifteen years working on machine learning applications in speech and natural language processing before moving to Columbia to work on and eventually learn Bayesian stats.

> methodological ramifications of taking the ideas of calibration and sharpness seriously.

The ramifications are currently unclear, muddled, confused and likely a long way from clarification, elucidation and general acceptance.

Some current alternatives are the Bayesian notions of calibration given in this book by Michael Evans (https://www.crcpress.com/Measuring-Statistical-Evidence-Using-Relative-Belief/Evans/p/book/9781482242799) and related papers, section 2.3 (Model Sensitivity) of Michael Betancourt’s https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html, as well as many papers Andrew has done with others.

So I would suggest some pluralism here: implement the methods of the paper with a clear warning that there isn’t any wide consensus that the methods do more good than harm. Enabling more people to get experience working with these methods might help give a better sense in the future.

Don’t worry—I know I don’t know enough to be a dogmatic statistician!

I’m also a big believer in multiple diagnostics to get at multiple aspects of a problem as long as they don’t have too many false positives. Michael Betancourt has built some of his recommendations in section 2.3 of the methodology paper into the Stan repo stat_comp_benchmarks. The nice thing about those tests is that they’re pretty powerful (reject a lot of bad stuff) and very fast compared to something like simulation-based calibration.

As far as sensitivity goes, I’d recommend checking out Ryan Giordano’s thesis work (U.C. Berkeley, though he’s now a postdoc at M.I.T.), where he uses autodiff to characterize the sensitivity of variational solutions w.r.t. parameters. This is a standard approach in applied math, where sensitivity analysis is done on solutions to differential equations with respect to their parameters [which is also where we got the technique of differentiating solutions to differential equations in Stan].

In the paper:

“The method uses off-site information from the nearby meteorological towers at Goodnoe Hills and Kennewick, identifies atmospheric regimes and fits conditional predictive models for each regime, based on a sliding 45-day training period.”

There is something “multi-level” about this, but I cannot articulate it. At which scale are these models supposed to predict?