When estimating a treatment effect with a cluster design, you need to include varying slopes, even if the fit gives warning messages.

Nick Brown and I recently published this paper, How statistical challenges and misreadings of the literature combine to produce unreplicable science: An example from psychology, in the journal Advances in Methods and Practices in Psychological Science, about a published article that claimed to find that healing of bruises could be sped or slowed by manipulating people’s subjective sense of time. The article under discussion had major problems–including weak theory, a flawed data analysis, and multiple errors in summarizing the literature–to the extent that we did not find its claims to be supported by its evidence.

That sort of thing happens–we call it everyday, bread-and-butter pseudoscience. The point of the new paper by Nick and me was not so much to shoot down one particular claim of mind-body healing but rather to explore some general reasons for scientific error and overconfidence and to provide a template to help future researchers avoid these problems.

We posted this the other day on the blog (under the heading, 7 steps to junk science that can achieve worldly success), and something came up in the comments that I hadn’t thought of before and want to share with you. The commenter, Jean-Paul, pointed to a misunderstanding that can arise when fitting multilevel models using some software.

Need for multilevel modeling or some equivalent adjustment for clustering in data

Here was the problem. The article in question estimated treatment effects with a cluster design: three treatments were applied to 33 research participants (in psychology jargon, “subjects”), with data provided by 25 raters. The total number of measurements was 2425 (not quite 3*33*25 because of some missing data, which is not the point of this story).

When you want to estimate treatment effects from a clustered design, you need to fit a multilevel model or make some equivalent adjustment. Otherwise your standard errors will be too small. In this case you should include effects for participants (some have more prominent bruises than others in this study) and for raters (who will have systematic differences in how they characterize the severity of a bruise using a numerical rating).

The researchers knew to do this, so they fit a multilevel model. Fortunately for me, they used lmer (from the R package lme4), which I’m familiar with. Here’s the output, which for clarity I show using the display function in the arm package:

lmer(formula = Healing ~ Condition + (1 | Subject) + (1 | ResponseId), 
    data = DFmodel, REML = FALSE)
            coef.est coef.se
(Intercept) 6.20     0.31   
Condition28 0.23     0.09   
Condition56 1.05     0.09   

Error terms:
 Groups     Name        Std.Dev.
 Subject    (Intercept) 1.07    
 ResponseId (Intercept) 1.22    
 Residual               1.87    
---
number of obs: 2425, groups: Subject, 33; ResponseId, 25
AIC = 10131.4, DIC = 10119.4
deviance = 10119.4 

So, fine. As expected, participant and rater effects are large–they vary by more than the magnitudes of the estimated treatment effects!–so it’s a good thing we adjusted for them. And the main effects are a stunning 2.4 and 11.1 standard errors away from zero, a big win!

Need for varying slopes, not just varying intercepts

But . . . wait a minute. The point of this analysis is not just to estimate average responses; it’s to estimate treatment effects. And, for that, we need to allow the treatment effects to vary by participant and rater–in multilevel modeling terms, we should be allowing slopes as well as intercepts to vary. This is in accordance with the well-known general principle of the design and analysis of experiments that the error term for any comparison should be at the level of analysis of the comparison.

No problem, lmer can do that, and we do so! Here’s the result:

lmer(formula = Healing ~ Condition + (1 + Condition | Subject) + 
    (1 + Condition | ResponseId), data = DFmodel, REML = FALSE)
            coef.est coef.se
(Intercept) 6.18     0.39   
Condition28 0.25     0.36   
Condition56 1.09     0.37   

Error terms:
 Groups     Name        Std.Dev. Corr        
 Subject    (Intercept) 1.71                 
            Condition28 1.99     -0.71       
            Condition56 2.03     -0.72  0.65 
 ResponseId (Intercept) 1.24                 
            Condition28 0.07      1.00       
            Condition56 0.13     -1.00 -1.00 
 Residual               1.51                 
---
number of obs: 2425, groups: Subject, 33; ResponseId, 25
AIC = 9317.9, DIC = 9285.9
deviance = 9285.9 

The estimated average treatment effects have barely changed, but the standard errors are much bigger. The fitted model shows that treatment effects vary a lot by participant, not so much by rater.

You might be happy because the estimate for one of the treatment comparisons is still about 3 standard errors away from zero, but other problems remain, both with multiple comparisons and with the interpretation of the data, as we discuss in section 2.4 of our paper.

There’s one thing you might notice if you look carefully at the above output, which is that the estimated covariance matrix for the rater effects is degenerate: the correlations are at 1 and -1. This sort of thing happens a lot with multilevel models when data are noisy and the number of groups is small. It’s an issue we discussed in this paper from 2014.

In practice, it’s no big deal here: the degeneracy is a consequence of a very noisy estimate, which arises because the noise in the data is much larger than the signal, which in turn arises because the variation in treatment effects across raters is indistinguishable from zero (look at those small estimated standard deviations of the varying slopes for rater). But that’s all pretty subtle, not something that’s in any textbook, even ours!

That scary warning on the computer

Here’s the problem. When you fit that varying slopes model, R displays a warning in scary red type:
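For reference, the warning in question is lme4’s singular-fit message, which (in recent versions; the exact wording may vary) looks like this:

boundary (singular) fit: see help('isSingular')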

It’s possible that the researchers tried this varying-slopes fit, saw the singularity, got scared, and retreated to the simpler varying-intercept model, which had the incidental benefit of giving them an estimated treatment effect that was 11 standard errors from zero.

Just to be clear, I’m not saying this is a problem with R, or with lme4. Indeed, if you type help(“isSingular”) in the R console, you’ll get some reasonable advice, none of which is to throw out all the varying slopes. But the first option suggested there is to “avoid fitting overly complex models in the first place,” which could be misunderstood by users.

If you fit the model using stan_lmer on its default settings, everything works fine:

 family:       gaussian [identity]
 formula:      Healing ~ Condition + (1 + Condition | Subject) + (1 + Condition | 
	   ResponseId)
 observations: 2425
------
            Median MAD_SD
(Intercept) 6.22   0.40  
Condition28 0.26   0.34  
Condition56 1.09   0.37  

Auxiliary parameter(s):
      Median MAD_SD
sigma 1.51   0.02  

Error terms:
 Groups     Name        Std.Dev. Corr       
 Subject    (Intercept) 1.731               
            Condition28 2.015    -0.63      
            Condition56 2.056    -0.66  0.57
 ResponseId (Intercept) 1.225               
            Condition28 0.158     0.33      
            Condition56 0.172    -0.51  0.04
 Residual               1.509               
Num. levels: Subject 33, ResponseId 25 

These estimates average over the posterior distribution, so nothing on the boundary, no problem. And, no surprise, the estimates and standard errors of the treatment effects are basically unchanged.
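For completeness, here’s roughly what the call producing that fit would look like (my reconstruction from the formula in the printed output; rstanarm defaults assumed):

library(rstanarm)
fit <- stan_lmer(Healing ~ Condition + (1 + Condition | Subject) +
                   (1 + Condition | ResponseId), data = DFmodel)
print(fit, digits = 2)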

That said, when you run stan_lmer, it gives warnings in red too!

These warning messages are a big problem! On one hand, yeah, you want to warn people; on the other hand, it would be just horrible if the warnings are so scary that they push users toward bad models that don’t happen to spit out warnings.

I also tried fitting the model using blmer, which I thought would work fine, but that produced some warning messages that scared even me:

By default, blme uses a degeneracy-avoiding prior, so this sort of thing shouldn’t happen at all. We should figure out what’s going on here!

Summary

1. If you are estimating treatment effects using clustered data, you should be fitting a multilevel model with varying intercepts and slopes (or do the equivalent adjustment in some other way, for example by bootstrapping according to the cluster structure; a sketch of that appears after this list). Varying intercepts is not enough. If you do it wrong, your standard errors can be way off. This is all consistent with the general principle of using all design information in the analysis.

2. If you apply a more complicated method on the computer and it gives you a warning message, this does not mean that you should go back to a simpler method. More complicated models can be harder to fit. This is something you might have to deal with. If it really bothers you, then be more careful in your data collection; you can design studies that can be analyzed more simply. Or you can do some averaging of your data, which might lose some statistical efficiency but will allow you to use simpler statistical methods (for example, see section 2.3, “Simple paired-comparisons analysis,” of our paper).
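Regarding the bootstrap option mentioned in point 1, here’s a rough sketch of what a cluster bootstrap could look like for this sort of design, using the variable names from the lmer calls above (this resamples whole participants only; a fuller version would also need to deal with the crossed rater structure):

# Resample whole subjects with replacement, refit the simple regression,
# and collect the treatment contrasts.
cluster_boot <- function(DF, n_boot = 1000) {
  subjects <- unique(DF$Subject)
  replicate(n_boot, {
    boot_ids <- sample(subjects, replace = TRUE)
    boot_data <- do.call(rbind, lapply(boot_ids, function(s) DF[DF$Subject == s, ]))
    coef(lm(Healing ~ Condition, data = boot_data))[-1]   # the two treatment contrasts
  })
}
# Standard errors that respect the clustering by participant:
# apply(cluster_boot(DFmodel), 1, sd)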

How far can exchangeability get us toward agreeing on individual probability?

This is Jessica. What’s the common assumption behind the following? 

    • Partial pooling of information over groups in hierarchical Bayesian models 
    • In causal inference of treatment effects, saying that the outcome you would get if you were treated (Y^a) shouldn’t change depending on whether you are assigned the treatment (A) or not
    • Acting as if we believe a probability is the “objective chance” of an event even if we prefer to see probability as an assignment of betting odds or degrees of belief to an event

The question is rhetorical, because the answer is in the post title. These are all examples where statistical exchangeability is important. Exchangeability says the joint distribution of a set of random variables is unaffected by the order in which they are observed. 

Exchangeability has broad implications. Lately I’ve been thinking about it as it comes up at the ML/stats intersection, where it’s critical to various methods: achieving coverage in conformal prediction, using counterfactuals in analyzing algorithmic fairness, identifying independent causal mechanisms in observational data, etc. 

This week it came up in the course I’m teaching on prediction for decision-making. A student asked whether exchangeability was of interest because often people aren’t comfortable assuming data is IID. I could see how this might seem like the case given how application-oriented papers (like on conformal prediction) sometimes talk about the exchangeability requirement as an advantage over the usual assumption of IID data. But this misses the deeper significance, which is that it provides a kind of practical consensus between different statistical philosophies. This consensus, and the ways in which it’s ultimately limited, is the topic of this post.

Interpreting the probability of an individual event

One of the papers I’d assigned was Dawid’s “On Individual Risk,” which, as you might expect, talks about what it means to assign probability to a single event. Dawid distinguishes “groupist” interpretations of probability that depend on identifying some set of events, like the frequentist definition of probability as the limiting frequency over hypothetical replications of the event, from individualist interpretations, like a “personal probability” reflecting the beliefs of some expert about some specific event conditioned on some prior experience. For the purposes of this discussion, we can put Bayesians (subjective, objective, and pragmatic, as Bob describes them here) in the latter personalist-individualist category. 

On the surface, the frequentist treatment of probability as an “objective” quantity appears incompatible with the individualist notion of probability as a descriptor of a particular event from the perspective of the particular observer (or expert) ascribing beliefs. If you have a frequentist and a personalist thinking about the next toss of a coin, for example, you would expect the probability the personalist assigns to depend on their joint distribution over possible sequences of outcomes, while the frequentist would be content to know the limiting probability. But de Finetti’s theorem shows that if you believe a sequence of events to be exchangeable, then your beliefs about those random variables can’t be distinguished from a model in which the events are independent with some underlying probability. Given a sequence of exchangeable Bernoulli random variables X1, X2, X3, … , you can think of a draw from their joint distribution as sampling p ~ mu, then drawing X1, X2, X3, … independently from Bernoulli(p) (where mu is a distribution on [0,1]). So the frequentist and personalist can both agree, under exchangeability, that p is meaningful for decision making. David Spiegelhalter recently published an essay on interpreting probability that he ended by commenting on how remarkable this pragmatic consensus is.
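As a quick illustration of that representation, here’s a small simulation sketch (mine, with an arbitrary Beta(3,3) playing the role of mu):

set.seed(2025)
n_seq <- 1e4; n_obs <- 20
p <- rbeta(n_seq, 3, 3)                          # p ~ mu
X <- matrix(rbinom(n_seq * n_obs, 1, rep(p, each = n_obs)),
            nrow = n_seq, byrow = TRUE)          # X1, ..., X20 iid Bernoulli(p) given p
# Marginally the X's are exchangeable but not independent:
# observing X1 shifts beliefs about X2.
c(mean(X[X[, 1] == 1, 2]), mean(X[X[, 1] == 0, 2]))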

But Dawid’s goal is to point out ways in which the apparent alignment is not as satisfactory as it may seem in resolving the philosophical chasm. It’s more like we’ve thrown a (somewhat flimsy) plank over it. Exchangeability may sometimes get us across by allowing the frequentist and personalist to coordinate in terms of actions, but we have to be careful how much weight we put on this.  

The reference set depends on the state of information

One complication is that the personalist’s willingness to assume exchangeability depends on the information they have. Dawid uses the example of trying to predict the exam score of some particular student. If they have no additional information to distinguish the target student from the rest, the personalist might be content to be given an overall limiting relative frequency p of failure across a set of students. But as soon as they learn something that makes the individual student unique, p is no longer the appropriate reference for that individual student’s probability of failing the exam.

As an aside, this doesn’t mean that exchangeability is only useful if we think of members of some exchangeable set as identical. There may still be practical benefits of learning from the other students in the context of a statistical model, for example. See, e.g., Andrew’s previous post on exchangeability as an assumption in hierarchical models, where he points out that assuming exchangeability doesn’t necessarily mean that you believe everything is indistinguishable, and if you have additional information distinguishing groups, you can incorporate that in your model as group-level predictors.

But for the purposes of personalists and frequentists agreeing on a reference for the probability of a specific event, the dependence on information is not ideal. Can we avoid this by making the reference set more specific? What if we’re trying to predict a particular student’s score on a particular exam in a world where that particular student is allowed to attempt the same exam as many times as they’d like? Now that the reference group refers to the particular student and particular exam, would the personalist be content to accept the limiting frequency as the probability of passing the next attempt? 

The answer is, not necessarily. This imaginary world still can’t get us to the generality we’d need for exchangeability to truly reconcile a personalist and frequentist assessment of the probability. 

Example where the limiting frequency is constructed over time

Dawid illustrates this by introducing a complicating (but not at all unrealistic) assumption: that the student’s performance on their next try on the exam will be affected by their performance on the previous tries. Now we have a situation where the limiting frequency of passing on repeated attempts is constructed over time. 

As an analogy, consider drawing balls from an urn that starts with 1 red ball and 1 green ball. Upon drawing a ball, we immediately return it and add an additional ball of the same color. At each draw, each ball in the urn is equally likely to be drawn, and the sequence of colors is exchangeable.

Given that p is not known, which do you think the personalist would prefer to consider as the probability of a red ball on the first draw: the proportion of red balls currently in the urn, or the limiting frequency of drawing a red ball over the entire sequence? 

It turns out that in this example the distinction doesn’t actually matter: the personalist should just bet 0.5. So why is there still a problem in reconciling the personalist assessment with the limiting frequency?

The answer is that we now have a situation where knowledge of the dynamic aspect of the process makes it seem contradictory for the personalist to trust the limiting frequency. If they know it’s constructed over time, then on what grounds is the personalist supposed to assume the limiting frequency is the right reference for the probability on the first draw? This gets at the awkwardness of using behavior in the limit to think about individual predictions we might make.
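Here’s a minimal simulation sketch of the urn (my own, just to make the “constructed over time” point concrete):

polya_urn <- function(n_draws) {
  red <- 1; green <- 1
  draws <- logical(n_draws)
  for (i in seq_len(n_draws)) {
    draws[i] <- runif(1) < red / (red + green)   # TRUE = red
    if (draws[i]) red <- red + 1 else green <- green + 1
  }
  draws
}
set.seed(1)
round(sapply(1:5, function(i) mean(polya_urn(1e4))), 2)
# Each run's long-run fraction of red settles down to a different random limit
# (uniform on [0,1]), even though the probability of red on the very first draw
# is always 0.5.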

Why this matters in the context of algorithmic decision-making

This example is related to some of my prior posts on why calibration does not satisfy everyone as a means of ensuring good decisions. The broader point in the context of the course I’m teaching is that when we’re making risk predictions (and subsequent decisions) about people, such as in deciding who to grant a loan or whether to provide some medical treatment, there is inherent ambiguity in the target quantity. Often there are expectations that the decision-maker will do their best to consider the information about that particular person and make the best decision they can. What becomes important is not so much that we can guarantee our predictions behave well as a group (e.g., calibration) but that we understand how we’re limited by the information we have and what assumptions we’re making about the reference group in an individual case. 

“The terror among academics on the covid origins issue is like nothing we’ve ever seen before”

Michael Weissman sends along this article he wrote with a Bayesian evaluation of Covid origins probabilities. He writes:

It’s a peculiar issue to work on. The terror among academics on the covid origins issue is like nothing we’ve ever seen before.

I was surprised he was talking about “terror” . . . People sometimes send me stuff about covid origins and it all seems civil enough. I guess I’m too far out of the loop to have noticed this! That said, there have been times that I’ve been attacked for opposing some aspect of the scientific establishment, so I can believe it.

I asked Weissman to elaborate, and he shared some stories:

A couple of multidisciplinary researchers from prestigious institutions were trying to write up a submittable paper. They were leaning heavily zoonotic, at least before we talked. They said they didn’t publish because they could not get any experts to talk with them. They said they prepared formal legal papers guaranteeing confidentiality but it wasn’t enough. I guess people thought that their zoo-lean was a ruse.

The extraordinarily distinguished computational biologist Nick Patterson tells me that a prospective collaborator cancelled their collaboration because Patterson had blogged that he thought the evidence pointed to a lab leak. It is not normal for a scientist to drop an opportunity to collaborate with someone like Patterson over a disagreement on an unrelated scientific question. You can imagine the effect of that environment on younger, less established scientists.

Physicist Richard Muller at Berkeley tried asking a bio colleague about an origins-related technical issue. The colleague blew him off. Muller asked if a student or postdoc could help. No way: far too risky, it would ruin their career. (See around minute 43 here.)

Come to think about it, I got attacked (or, at least, misrepresented) for some of my covid-related research too; the story is here. Lots of aggressive people out there in the academic research and policy communities.

Also, to put this in the context of the onset of covid in 2020, whatever terror we have been facing by disagreeing with powerful people in academia and government is nothing compared to the terror faced by people who were exposed to this new lethal disease. Covid is now at the level of a bad flu season, so still pretty bad but much less scary than a few years ago.

Postdoc, doctoral student, and summer intern positions, Bayesian methods, Aalto

Postdoc and doctoral student positions in developing Bayesian methods at Aalto University, Finland! This job post is by Aki (maybe someday I’ll write an actual blog post).

The positions are funded by the Finnish Center for Artificial Intelligence (FCAI), and there are many other topics, but if you specify me as the preferred supervisor then it’s going to be Bayesian methods, workflow, cross-validation, and diagnostics. See my video on Bayesian workflow, the Bayesian workflow paper, my publication list, and my talk list for more about what I’m working on.

There are also plenty of other topics and supervisors in

  1. Reinforcement learning
  2. Probabilistic methods
  3. Simulation-based inference
  4. Privacy-preserving machine learning
  5. Collaborative AI and human modeling
  6. Machine learning for science

Join us, and learn why Finland has been ranked the happiest country in the world for seven years in a row!

See how to apply at fcai.fi/winter-2025-researcher-positions-in-ai-and-machine-learning

I might also hire one summer intern (BSc or MSc level) to work on the same topics. Applications go through the Aalto Science Institute call.

Progress in 2024 (Aki)

Here’s my 2024 progress report. There are five publications in common with Andrew in 2024.

The Active Statistics book is the biggest in size, but personally, getting the Pareto smoothed importance sampling paper published nine years after the first submission was a big event, too. I think I only blogged the 2023 progress report and job ads (I sometimes have blog post ideas, but as I’m a slow writer, it’s difficult to find time to turn them into actual posts). I’m very happy with the progress in 2024, but also excited about what we are going to get done in 2025!

Book

Papers published or accepted for publication in 2024

  • Yann McLatchie, Sölvi Rögnvaldsson, Frank Weber, and Aki Vehtari (2025). Advances in projection predictive inference. Statistical Science, accepted for publication. arXiv preprint arXiv:2306.15581. Software: projpred, kulprit.

  • Christopher Tosh, Philip Greengard, Ben Goodrich, Andrew Gelman, Aki Vehtari, and Daniel Hsu (2025). The piranha problem: Large effects swimming in a small pond. Notices of the American Mathematical Society, 72(1):15-25. arXiv preprint arXiv:2105.13445.

  • Kunal Ghosh, Milica Todorović, Aki Vehtari, and Patrick Rinke (2025). Active learning of molecular data for task-specific objectives. The Journal of Chemical Physics, doi:10.1063/5.0229834.

  • Charles C. Margossian, Matthew D. Hoffman, Pavel Sountsov, Lionel Riou-Durand, Aki Vehtari, and Andrew Gelman (2024). Nested Rhat: Assessing the convergence of Markov chain Monte Carlo when running many short chains. Bayesian Analysis, doi:10.1214/24-BA1453. Software: posterior.

  • Yann McLatchie and Aki Vehtari (2024). Efficient estimation and correction of selection-induced bias with order statistics. Statistics and Computing, 34(132). doi:10.1007/s11222-024-10442-4.

  • Frank Weber, Änne Glass, and Aki Vehtari (2024). Projection predictive variable selection for discrete response families with finite support. Computational Statistics, doi:10.1007/s00180-024-01506-0. Software: projpred.

  • Aki Vehtari, Daniel Simpson, Andrew Gelman, Yuling Yao, and Jonah Gabry (2024). Pareto smoothed importance sampling. Journal of Machine Learning Research, 25(72):1-58. Online. Software: loo, posterior, ArviZ.

  • Manushi Welandawe, Michael Riis Andersen, Aki Vehtari, and Jonathan H. Huggins (2024). A framework for improving the reliability of black-box variational inference. Journal of Machine Learning Research, 25(219):1-71. Online.

  • Noa Kallioinen, Topi Paananen, Paul-Christian Bürkner, and Aki Vehtari (2024). Detecting and diagnosing prior and likelihood sensitivity with power-scaling. Statistics and Computing, 34(57). Online.
    Supplementary code.
    Software: priorsense

  • Erik Štrumbelj, Alexandre Bouchard-Côté, Jukka Corander, Andrew Gelman, Håvard Rue, Lawrence Murray, Henri Pesonen, Martyn Plummer, and Aki Vehtari (2024). Past, present, and future of software for Bayesian inference. Statistical Science, 39(1):46-61. Online.

  • Alex Cooper, Dan Simpson, Lauren Kennedy, Catherine Forbes, and Aki Vehtari (2024). Cross-validatory model selection for Bayesian autoregressions with exogenous regressors. Bayesian Analysis, doi:10.1214/23-BA1409.

  • Marta Kołczyńska, Paul-Christian Bürkner, Lauren Kennedy, and Aki Vehtari (2024). Trust in state institutions in Europe, 1989–2019. Survey Research Methods, 18(1). doi:10.18148/srm/2024.v18i1.8119.

  • Alex Cooper, Aki Vehtari, Catherine Forbes, Lauren Kennedy, and Dan Simpson (2024). Bayesian cross-validation by parallel Markov chain Monte Carlo. Statistics and Computing, 34:119. doi:10.1007/s11222-024-10404-w.

  • Ryoko Noda, Michael Francis Mechenich, Juha Saarinen, Aki Vehtari, Indrė Žliobaitė (2024). Predicting habitat suitability for Asian elephants in non-analog ecosystems with Bayesian models. Ecological Informatics, 82:102658. doi:10.1016/j.ecoinf.2024.102658.

  • Petrus Mikkola, Osvaldo A. Martin, Suyog Chandramouli, Marcelo Hartmann, Oriol Abril Pla, Owen Thomas, Henri Pesonen, Jukka Corander, Aki Vehtari, Samuel Kaski, Paul-Christian Bürkner, Arto Klami (2024). Prior knowledge elicitation: The past, present, and future. Bayesian Analysis, 19(49):1129-1161. doi:10.1214/23-BA1381.

arXived in 2024

  • Marvin Schmitt, Chengkun Li, Aki Vehtari, Luigi Acerbi, Paul-Christian Bürkner, and Stefan T. Radev (2024). Amortized Bayesian Workflow (Extended Abstract). arXiv preprint arXiv:2409.04332.

  • Måns Magnusson, Jakob Torgander, Paul-Christian Bürkner, Lu Zhang, Bob Carpenter, and Aki Vehtari (2024). posteriordb: Testing, benchmarking and developing Bayesian inference algorithms. arXiv preprint arXiv:2407.04967. Database and software: posteriordb

  • David Kohns, Noa Kallioinen, Yann McLatchie, and Aki Vehtari (2024). The ARR2 prior: flexible predictive prior definition for Bayesian auto-regressions. arXiv preprint arXiv:2405.19920.

  • Anna Elisabeth Riha, Nikolas Siccha, Antti Oulasvirta, and Aki Vehtari (2024). Supporting Bayesian modelling workflows with iterative filtering for multiverse analysis. arXiv preprint arXiv:2404.01688.

  • Guangzhao Cheng, Aki Vehtari, and Lu Cheng (2024). Raw signal segmentation for estimating RNA modifications and structures from Nanopore direct RNA sequencing data. bioRxiv preprint.

Software

Case studies

  • Aki Vehtari (2024). Nabiximols. Model checking and comparison, comparison of continuous and discrete models, LOO-PIT checking, calibration plots, prior sensitivity analysis, model refinement, treatment effect, effect of model mis-specification.

  • Aki Vehtari (2024). Birthdays. Workflow example for iterative building of a time series model. In 2024, added demonstration of Pathfinder for quick initial results and MCMC initialization.

FAQ

Video

Echoing Eco: From the logic of stories to posterior predictive simulation

“When I put Jorge in the library I did not yet know he was the murderer. He acted on his own, so to speak. And it must not be thought that this is an ‘idealistic’ position, as if I were saying that the characters have an autonomous life and the author, in a kind of trance, makes them behave as they themselves direct him. That kind of nonsense belongs in term papers. The fact is that the characters are obliged to act according to the laws of the world in which they live. In other words, the narrator is the prisoner of his own premises.” — Umberto Eco, Postscript to The Name of the Rose (translated by William Weaver)

Perfectly put. As I wrote less poetically a few years ago, the development of a story is a working-out of possibilities, and that’s why it makes sense that authors can be surprised at how their own stories come out. In statistics jargon, the surprise we see in a story is a form of predictive check, a recognition that a scenario, if carried out logically, can lead to unexpected places.

In statistics, one reason we make predictions is to do predictive checks, to elucidate the implications of a model, in particular what it says (probabilistically) regarding observable outcomes, which can then be checked with existing or new data.

To put it in storytelling terms, if you tell a story and it leads to a nonsensical conclusion, this implies there’s something wrong with your narrative logic or with your initial scenario.

Again, I really like how Eco frames the problem, reconciling the agency of the author (who is the one who comes up with the premise and the rules of the game and who works out their implications) and the apparent autonomy of the character (which is a consequence of the logic of the story).

This also connects to a discussion we had a year ago about chatbots. As I wrote at the time, a lot of what I do at work—or when blogging!—is a sort of autocomplete, where I start with some idea and work out its implications. Indeed, an important part of the writing process is to get into a flow state where the words, sentences, and paragraphs come out smoothly, and in that sense there’s no other way to do this than with some sort of autocomplete. Autocomplete isn’t everything—sometimes I need to stop and think, make plans, do some math—but it’s a lot.

Different people do autocomplete in different ways. Just restricting ourselves to bloggers here: give the same prompt to Jessica, Bob, Phil, and Lizzie, and you’ll get four very different posts–and, similarly, Umberto Eco’s working out of the logic of a murder in a medieval monastery will come out differently from yours. And that’s not even considering the confounding factor that we get to choose the “prompts” for our blog posts, and that Eco picked his scenario because he thought it would be fruitful ground for a philosophical novel. Writing has value, in the same way that prior or posterior simulation has value: we don’t know how things will come out any more than we can know the millionth digit of pi without doing the damn calculation.

Calibration “resolves” epistemic uncertainty by giving predictions that are indistinguishable from the true probabilities. Why is this still unsatisfying?

This is Jessica. The last day of the year seems like a good time for finishing things up, so I figured it’s time for one last post wrapping up some thoughts on calibration.

As my previous posts got into, calibrated prediction uncertainty is the goal of various posthoc calibration algorithms discussed in machine learning research, which use held out data to learn transformations on model predicted probabilities in order to achieve calibration on the held out data. I’ve reflected a bit on what calibration can and can’t give us in terms of assurances for decision-making. Namely, it makes predictions trustworthy for decisions in the restricted sense that a decision-maker who will choose their action purely based on the prediction can’t do better than treating the calibrated predictions as the true probabilities. 

But something I’ve had trouble articulating as clearly as I’d like involves what’s missing (and why) when it comes to what calibration gives us versus a more complete representation of the limits of our knowledge in making some predictions. 

The distinction involves how we express higher order uncertainty. Let’s say we are doing multiclass classification, and fit a model fhat to some labeled data. Our “level 0” prediction fhat(x) contains no uncertainty representation at all; we check it against the ground truth y. Our “level 1” prediction phat(.|x) predicts the conditional distribution over classes; we check it against the empirical distribution that gives a probability p(y|x) for each possible y. Our “level 2” prediction is trying to predict the distribution of the conditional distribution over classes, p(p(.|x)), e.g. a Dirichlet distribution that assigns probability to each distribution p(.|x), which we can distinguish using some parameters theta.
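To make the levels concrete, here’s a small sketch for a three-class problem (my own illustration, with arbitrary numbers):

level0 <- 2                                   # level 0: a single predicted class
level1 <- c(0.2, 0.7, 0.1)                    # level 1: phat(.|x), a distribution over classes
# level 2: a distribution over such distributions, here Dirichlet(2, 7, 1),
# sampled via normalized gamma draws:
rdirichlet <- function(n, alpha) {
  g <- matrix(rgamma(n * length(alpha), shape = alpha), nrow = n, byrow = TRUE)
  g / rowSums(g)
}
level2_draws <- rdirichlet(1000, c(2, 7, 1))  # each row is a candidate p(.|x)
colMeans(level2_draws)                        # collapsing level 2 back to a level 1 point prediction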

From a Bayesian modeling perspective, it’s natural to think about distributions of distributions. A prior distribution over model parameters implies a distribution over possible data-generating distributions. Upon fitting a model, the posterior predictive distribution summarizes both “aleatoric” uncertainty due to inherent randomness in the generating process and “epistemic” uncertainty stemming from our lack of knowledge of the true parameter values. 

In some sense calibration “resolves” epistemic uncertainty by providing point predictions that are indistinguishable from the true probabilities. But if you’re hoping to get a faithful summary of the current state of knowledge, it can seem like something is still missing. In the Bayesian framework, we can collapse our posterior prediction of the outcome y for any particular input x to a point estimate, but we don’t have to. 

Part of the difficulty is that whenever we evaluate performance as loss over some data-generating distribution, having more than a point estimate is not necessary. This is true even without considering second order uncertainty. If we train a level 0 prediction of the outcome y using the standard loss minimization framework with 0/1 loss, then it will learn to predict the mode. And so to the extent that it’s hard to argue one’s way out of loss minimization as a standard for evaluating decisions, it’s hard to motivate faithful expression of epistemic uncertainty.

For second order uncertainty, the added complication is there is no ground truth. We might believe there is some intrinsic value in being able to model uncertainty about the best predictor, but how do we formalize this given that there’s no ground truth against which to check our second order predictions? We can’t learn by drawing samples from the distribution that assigns probability to different first order distributions p(.|x) because technically there is no such distribution beyond our conception of it. 

Daniel Lakeland previously provided an example I found helpful about putting Bayesian probability on a predicted frequency, where there’s no sense in which we can check the calibration of the second order prediction.

Related to this, I recently came across a few papers by Viktor Bengs et al. that formalize some of this in an ML context. Essentially, they show that there is no well-defined loss function that can be used in the typical ML learning pipeline to incentivize the learner to make correct predictions that are also faithful as expressions of the epistemic uncertainty. This can be expressed in terms of trying to find a proper scoring rule. In the case of first order predictions, as long as we use a proper scoring rule as the loss function, we can expect accurate predictions, because a proper scoring rule is one for which one cannot expect a better score by deviating from reporting one’s true beliefs. But there is no loss function that incentivizes a second-order learner to faithfully represent its epistemic uncertainty like a proper scoring rule does for a first order learner.
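As a tiny concrete check of what “proper” means at the first order (my own illustration, not from those papers): with a binary outcome whose true probability is p = 0.7, the expected log score of a reported probability q is maximized by reporting q = p.

p <- 0.7
q <- seq(0.01, 0.99, by = 0.01)
expected_log_score <- p * log(q) + (1 - p) * log(1 - q)
q[which.max(expected_log_score)]   # 0.7: honesty is optimal in expectation

For second order predictions, the Bengs et al. results say there is no analogous loss function.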

This may seem obvious, especially if you’re coming from a Bayesian tradition, considering that there is no ground truth against which to score second order predictions. And yet, various loss functions have been proposed for estimating level 2 predictors in the ML literature, such as minimizing the empirical loss of the level 1 prediction averaged over possible parameter values. These results make clear that one needs to be careful interpreting the predictors they give, because, e.g., they can actually incentivize predictors that appear to be certain about the first order distribution. 

I guess a question that remains is how to talk about incentives for second order uncertainty at all in a context where minimizing loss from predictions is the primary goal. I don’t think the right conclusion is that it doesn’t matter since we can’t integrate it into a loss minimization framework. Having the ability to decompose predictions by different sources of uncertainty and be explicit about what our higher order uncertainty looks like going in (i.e., by defining a prior) has scientific value in less direct ways, like communicating beliefs and debugging when things go wrong. 

Bayesian inference (and mathematical reasoning more generally) isn’t just about getting the answer; it’s also about clarifying the mapping from assumptions to inference to decision.

Palko writes:

I’m just an occasional Bayesian (and never an exo-biologist), so maybe I’m missing some subtleties here, but I believe educated estimates for the first probability vary widely, with some close to 0 and some close to 1 and no good sense of the distribution. Is there any point in applying the theorem at that point? From this Wired article:

If or when scientists detect a putative biosignature gas on a distant planet, they can use a formula called Bayes’ theorem to calculate the chance of life existing there based on three probabilities. Two have to do with biology. The first is the probability of life emerging on that planet given everything else that’s known about it. The second is the probability that, if there is life, it would create the biosignature we observe. Both factors carry significant uncertainties, according to the astrobiologists Cole Mathis of Arizona State University and Harrison Smith of the Earth-Life Science Institute of the Tokyo Institute of Technology, who explored this kind of reasoning in a paper last fall.

My reply: I guess it’s fine to do the calculation, if only to make it clear how dependent it is on assumps. Bayesian inference isn’t just about getting the answer; it’s also about clarifying the mapping from assumptions to inference to decision.

Come to think about it, that last paragraph remains true if you replace “Bayesian inference” with “Mathematics.”
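To see what I mean about the dependence on assumptions, here’s a toy version of the calculation with made-up inputs (nothing here comes from the Wired article or from Mathis and Smith):

p_sig_if_life <- 0.8                      # probability of the biosignature given life
p_sig_if_none <- 0.1                      # probability of the biosignature without life
p_life <- c(0.01, 0.1, 0.5, 0.9, 0.99)    # the contested first probability
posterior <- p_sig_if_life * p_life /
  (p_sig_if_life * p_life + p_sig_if_none * (1 - p_life))
round(posterior, 2)                       # 0.07 0.47 0.89 0.99 1.00

The answer is driven almost entirely by that first, highly uncertain input.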

The marginalization or Jeffreys-Lindley paradox: it’s already been resolved.

Nicole Wang writes:

I am a PhD student in statistics starting this year. I read the Dawid et al. (1973) marginalization paradoxes paper and found several approaches claiming to “resolve” this paradox. I am very curious whether you think there is a complete resolution to this paradox so far. If there is one, I sincerely wonder what it is. If not, which approach or approaches do you think come closest to a resolution?

I replied by pointing to these two posts:
Christian Robert on the Jeffreys-Lindley paradox; more generally, it’s good news when philosophical arguments can be transformed into technical modeling issues
and
My problem with the Lindley paradox.
It’s not a topic I’ve thought about much lately because to me it’s already been resolved.

Stanford medical school professor misrepresents what I wrote (but I kind of understand where he’s coming from)

This story is kinda complicated. It’s simple, but it’s complicated.

The simple part is the basic story, which goes something like this:

– In 2020, a study was done at Stanford–a survey of covid exposure–that I publicly criticized: I wrote that this study had statistical problems, that its analysis did not adjust for uncertainty in the false positive rate of the test they were using, and that they did something wrong by not making use of the statistical expertise that is readily available at Stanford.

– In 2023, one of the people involved in that study wrote a long post about that study and its aftermath. In that post, my comments from 2020 on that study were misrepresented–hence the title of the present post.

– Just last week someone pointed me to the 2023 post. I was unhappy to have been misrepresented, so I emailed the author of the post. He didn’t respond–I don’t take that personally: he’s very busy, the holiday season is coming, the post came out over a year ago (I only happened to hear about it the other day), and my complaint concerns only one small paragraph in a long post–so, just to correct the record, I’m posting this here.

The more complicated, and interesting, part involves the distinction between evidence and truth. It’s something we’ve talked about before–indeed, I published a short article on the topic back in 2020 using this very example!–and here it has come up again, so here goes:

And now, the details:

Stanford professor Jay Bhattacharya wrote about a covid study from 2020 that he was involved in, which attracted some skepticism at the time:

Some serious statisticians also weighed in with negative reviews. Columbia University’s Andrew Gelman posted a hyperbolic blog that we should apologize for releasing the study. He incorrectly thought we had not accounted for the possibility of false positives. He later recanted that harsh criticism but wanted us to use an alternative method of characterizing the uncertainty around our estimates.

On the plus side, I appreciate that he characterizes me as a serious statistician. He also called my post “hyperbolic.” He doesn’t actually link to it, so I’ll give the link here so you can make your own judgment. The title of that post is, “Concerns with that Stanford study of coronavirus prevalence,” and I don’t think it’s hyperbolic at all! But that’s just a matter of opinion on Bhattacharya’s part so I can’t say it’s wrong.

He does have two specific statements there that are wrong, however:

1. It’s not true that I “incorrectly thought their study had not accounted for the possibility of false positives.” In my post, I explicitly recognized that their analysis accounted for the possibility of false positives. What I wrote is that they were “focusing on the point estimates of specificity.” Specificity = 1 – false positive rate. What I wrote is that they didn’t properly account for uncertainty in the false positive rate. I did not say they had not accounted for the possibility of false positives.

2. I never “recanted that harsh criticism.” What I wrote in my post is that their article “does not provide strong evidence that the rate of people in Santa Clara county exposed by that date was as high as claimed.” But I also wrote, “I’m not saying that the claims in the above-linked paper are wrong . . . The Bendavid et al. study is problematic if it is taken as strong evidence for those particular estimates, but it’s valuable if it’s considered as one piece of information that’s part of a big picture that remains uncertain.” And I clarified, “When I wrote that the authors of the article owe us all an apology, I didn’t mean they owed us an apology for doing the study, I meant they owed us an apology for avoidable errors in the statistical analysis that led to overconfident claims. But, again, let’s not make the opposite mistake of using uncertainty as a way to affirm a null hypothesis.”

I do not see this as a “recanting,” nor did I recant at any later time, but I’m fine if Bhattacharya or anyone else wants to quote me directly to make clear that at no time did I ever say that their substantive claims were false; I only said that the data offered in that study did not supply strong evidence.

This bothers me. I’d hate for people to think that I’d incorrectly thought they had not accounted for the possibility of false positives, or that I’d recanted my criticism. Again I emphasize that my criticism was statistical and involved specification of uncertainty; it was not a claim on my part that the percentage of people who’d been exposed to Covid was X, Y, or Z.

The good news is that this post from Bhattacharya appeared in 2023 and I only heard about it just the other day, so I guess it did not get wide circulation. Maybe more people will see this correction than the original post! In any case I’m glad to have the opportunity to correct the record.

Stanford contagion

I have no problem with Bhattacharya making arguments about covid epidemiology and policy–two topics he’s thought a lot about. It’s completely reasonable for him and his colleagues to say that their original study was inconclusive but that it was consistent with their larger message.

Bhattacharya also writes:

In the end, Stanford’s leadership undermined public and scientific confidence in the results of the Santa Clara study. Given this history, members of the public could be forgiven if they wonder whether any Stanford research can be trusted.

He doesn’t fully follow up on this point, but I think he’s right.

A few weeks before the above-discussed covid study came out, Stanford got some press when law professor Richard Epstein published something through Stanford’s Hoover Institution predicting that U.S. covid deaths would max out at 500, a prediction he later updated to 5000 (see here for details). I’ve never met Epstein or corresponded with him, but he comes off as quite the asshole, having said this to a magazine interviewer: “But, you want to come at me hard, I am going to come back harder at you. And then if I can’t jam my fingers down your throat, then I am not worth it. . . . But a little bit of respect.” A couple years later, he followed up with some idiotic statements about the covid vaccine. Fine–the guy’s just a law professor, not a health economist or anyone else with relevant expertise here–the point is that Stanford appears to be stuck with him. In Bhattacharya’s words, “Given this history, members of the public could be forgiven if they wonder whether any Stanford research can be trusted.”

This sort of guilt-by-Stanford-association would represent poor reasoning. Just cos Stanford platforms idiots like Richard Epstein, it doesn’t mean we shouldn’t trust the research of serious scholars such as Rob Tibshirani and Jay Bhattacharya. But I guess that members of the public could be forgiven if they show less trust in the Stanford brand. Just as my Columbia affiliation is tarnished by my employer’s association with Mehmet Oz, Robert Hadden, and its willingness to fake its U.S. News numbers. And Harvard is tarnished by its endorsement of various well-publicized bits of pseudoscience.

Reputation goes both ways. By publishing Epstein’s uninformed commentary, Stanford’s leadership undermined public and scientific confidence, and then Bhattacharya had to pay some of the price for this undermined confidence. That’s too bad, and it’s also too bad that he ended up on the advisory board of an organization that gave the following advice: “Currently, there is no one for whom the benefit would outweigh the risk of these [covid] vaccines–even the most vulnerable, elderly nursing home patients.”

The challenge is that legitimate arguments about policy responses under uncertainty get tangled with ridiculous claims such as that covid would only kill 500 Americans, or ridiculous policies such as removing the basketball hoops in the local park. On one side, you had people spreading what can only be called denial (for example, the claim that the pandemic was over in the summer of 2020); on the other side were public health authorities playing the fear card and keeping everyone inside.

So I can see how Bhattacharya was frustrated by the response to his study. But he’s missing the mark when he misrepresents what I wrote, and when, elsewhere in his post, he disparages Stephanie Lee for doing reporting on his study. We’re doing our jobs–I’m assessing the strength of statistical evidence, Lee is tracking down leads in the story–just like Bhattacharya is doing his job by making policy recommendations.

“Accounting for Nonresponse in Election Polls: Total Margin of Error”

Jeff Dominitz and Chuck Manski write:

The potential impact of nonresponse on election polls is well known and frequently acknowledged. Yet measurement and reporting of polling error has focused solely on sampling error, represented by the margin of error of a poll. Survey statisticians have long recommended measurement of the total survey error of a sample estimate by its mean square error (MSE), which jointly measures sampling and non-sampling errors. Extending the conventional language of polling, we think it reasonable to use the square root of maximum MSE to measure the total margin of error. This paper demonstrates how to measure the potential impact of nonresponse using the concept of the total margin of error, which we argue should be a standard feature in the reporting of election poll results. We first show how to jointly measure statistical imprecision and response bias when a pollster lacks any knowledge of the candidate preferences of non-responders. We then extend the analysis to settings where the pollster has partial knowledge that bounds the preferences of non-responders.

Good stuff. This relates to two of my papers on nonsampling errors and differential nonresponse:

[2018] Disentangling bias and variance in election polls. Journal of the American Statistical Association 113, 607-614. (Houshmand Shirani-Mehr, David Rothschild, Sharad Goel, and Andrew Gelman)

[1998] Modeling differential nonresponse in sample surveys. Sankhya 60, 101-126. (Thomas C. Little and Andrew Gelman)

It’s funny about that last paper: Tom Little and I did that project nearly 30 years ago, and I forgot about it even while doing applied research on differential nonresponse. I think there’s more to be done here, integrating these different perspectives. I like the idea of modeling the nonsampling error rather than just treating it as another error term.
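Here’s a toy numerical sketch of the root-max-MSE idea for a single proportion (my own stripped-down illustration, not the paper’s notation):

n <- 1000                                  # number of respondents
p_hat <- 0.52                              # estimated support among respondents
sampling_var <- p_hat * (1 - p_hat) / n
bias_bound <- 0.03                         # assumed worst-case nonresponse bias
sqrt(sampling_var)                         # sampling error only, about 0.016
sqrt(sampling_var + bias_bound^2)          # square root of worst-case MSE, about 0.034

Even a modest bound on nonresponse bias can swamp the sampling error that the conventional margin of error reports.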

On the general point of reporting surveys and margins of errors, I recommend Ansolabehere and Belin’s paper from 1993.

Applications of (Bayesian) variational inference?

I’m curious about whether anyone’s using variational inference, and more specifically, using variational approximations to estimate posterior expectations for applied work. And if so, what kinds of reactions have you gotten from readers or reviewers?

I see a lot of talk in papers about how variational inference (VI) scales better than MCMC at the cost of only approximating the posterior. MCMC, which is often characterized as “approximate,” is technically asymptotically exact; its approximation error is a matter of limited precision (only so many decimal places of accuracy) rather than bias, at least in cases where MCMC can sample the posterior.

But I don’t recall ever seeing anyone use VI for inference in applied statistics. In particular, I’m curious if there are any Bayesian applications of VI, by which I mean applications where the variational approximation is used to estimate Bayesian posterior expectations in the usual way for an applied statistics problem of interest. That is, I’m wondering if anyone uses a variational approximation q(theta | phi), where phi is fixed as usual, to approximate a Bayesian posterior p(theta | y) and use it to estimate expectations as follows.

E[f(theta) | y] ≈ ∫ f(theta) q(theta | phi) dtheta.

This could be computed with Monte Carlo when it is possible to sample from q(theta | phi).
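For concreteness, here’s a minimal sketch of that computation, with a fitted multivariate normal standing in for q(theta | phi) (all numbers hypothetical):

library(MASS)                                         # for mvrnorm
mu_hat <- c(0.2, 1.1)                                 # fitted variational mean (part of phi)
Sigma_hat <- matrix(c(0.04, 0.01, 0.01, 0.09), 2, 2)  # fitted variational covariance
theta_draws <- mvrnorm(4000, mu = mu_hat, Sigma = Sigma_hat)
f <- function(theta) theta[1] / theta[2]              # some function of interest
mean(apply(theta_draws, 1, f))                        # Monte Carlo estimate of E[f(theta) | y]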

I’m using our Pathfinder variational inference system (now in Stan) to initialize MCMC, but I wouldn’t trust inference based on Pathfinder because of the very restrictive variational family (i.e., multivariate normal with low rank plus diagonal covariance). Similarly, most of the theoretical results I’ve been seeing around VI are for normal approximating families, particularly of the mean field (diagonal covariance) variety. Mean field approximations are easy to manipulate theoretically and computationally, but seem to make poor candidates for predictive inference, where there is often substantial posterior correlation and non-Gaussianity.

I know that there are ML applications to autoencoding that use variational inference, but I’m specifically asking about applied statistics applications that would be published in an applied journal, not a stats methodology or ML journal. I’ve seen some applications of point estimates from VI to “fit” latent Dirichlet allocation (LDA) models, but the ones I’ve seen don’t compute any expectations other than point estimates of parameters from a local mode among combinatorially many modes.

I’m curious about applications using ML techniques like normalizing flows as the variational family. I would expect those to be of more practical interest to applied statisticians than all the VI that has come before. I’ve seen cases where VI outperforms NUTS from Abhinav Agrawal and Justin Domke using a 10-layer deep, 20-ish neuron wide, real non-volume preserving (realNVP) flow touched up with importance sampling—their summary paper’s still under review and Abhinav’s thesis is being revised. But it requires a lot of compute, which isn’t cheap these days. The cases where realNVP outperforms include funnels, multimodal targets, bananas and other varying curvature models (like from an IRT 2PL posterior). I suspect the costs and accessibility of the equivalent of an NVIDIA H100 GPU will drop to a point where everyone will be able to use these methods in 10 years. It’s what I’m spending at least half my time on these days—JAX is fun and ChatGPT can (help) translate Stan programs pretty much line for line into JAX.

Iterative imputation and incoherent Gibbs sampling

Seeing this post by Tim Morris on the difference between iterative imputation and Gibbs sampling reminded me of some articles that my colleagues and I have written on the topic:

[2001] Using conditional distributions for missing-data imputation. Discussion of “Conditionally specified distributions” by B. Arnold et al. Statistical Science 3, 268-269. (Andrew Gelman and T. E. Raghunathan).

[2014] Multiple imputation for continuous and categorical data: Comparing joint and conditional approaches. Political Analysis 22, 497-519. (Jonathan Kropko, Ben Goodrich, Andrew Gelman, and Jennifer Hill).

[2014] On the stationary distribution of iterative imputations. Biometrika 1, 155-173. (Jingchen Liu, Andrew Gelman, Jennifer Hill, Yu-Sung Su, and Jonathan Kropko).

It’s something that’s been discussed in the Bayesian statistics community for a long time—I remember talking about it with Jun Liu around 1992 or so.

For a very simple example, consider these two incoherent conditional specifications:
x|y ~ normal(0, 1)
y|x ~ normal(x, 1).

These are obviously incoherent: in the specification for x|y, x and y are independent; in the specification for y|x, they are dependent.

What happens if you do Gibbs updating (following Morris, I won’t call it “Gibbs sampling,” as there is no underlying joint distribution)?
Iteration 1: Draw x from normal(0, 1)
Iteration 2: Draw y from normal(x, 1)
Iteration 3: Draw x from normal(0, 1)
Iteration 4: Draw y from normal(x, 1)
Etc.

What is the joint distribution of (x,y)? It depends on when you stop. If you stop at an odd iteration, the joint distribution is (x,y) ~ MVN((0,0), ((1,0),(0,2))). If you stop at an even iteration, the joint distribution is (x,y) ~ MVN((0,0), ((1,1),(1,2))). If you want a single joint distribution, you could stop at a random iteration, in which case the joint distribution will depend on your random process, but for any reasonable choice of random stopping point will be approximately (x,y) ~ 0.5*MVN((0,0), ((1,0),(0,2))) + 0.5*MVN((0,0), ((1,1),(1,2))).
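A quick simulation confirms this (a minimal sketch, running many chains in parallel so we can look at the joint distribution at each stopping point):

set.seed(123)
n_chains <- 1e5
x <- rnorm(n_chains)                 # iteration 1: x ~ normal(0, 1)
y <- rnorm(n_chains, mean = x)       # iteration 2: y | x ~ normal(x, 1)
x <- rnorm(n_chains)                 # iteration 3: x ~ normal(0, 1)
cov(cbind(x, y))                     # approximately ((1, 0), (0, 2))
y <- rnorm(n_chains, mean = x)       # iteration 4: y | x ~ normal(x, 1)
cov(cbind(x, y))                     # approximately ((1, 1), (1, 2))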

Things get more complicated when you have more than two variables, because then there are other possible updating schedules. You could update x,y,z,x,y,z,…, or x,y,x,z,y,z,…, or do a random update—and these could lead to different joint distributions, even after averaging over a random stopping point. There have to be some conditions under which you could bound the variation around these possible distributions—at the extreme case where the conditional specifications are coherent and the process is ergodic, there’s no variation at all—but I don’t know what they are.

Iterative imputation for missing data is kinda like the above, except that the conditional distributions are themselves estimated from available data. In some ways this makes the problem harder to analyze (which is why the math in the third article linked above gets so complicated), but in other ways it helps that the distributions are estimated from a common dataset, as this should enforce some approximate coherence among the conditional specifications.

Lots to chew on here, from practical, theoretical, and computational perspectives.

Answering two questions, one about Bayesian post-selection inference and one about prior and posterior predictive checks

Richard Artner writes:

I have two questions. The first one is, of course, related to Bayesian post-selection inference (which I first asked you about five years ago). Back then, you admitted that Bayesians are not immune to overfitting when using the same data to modify/extend a model and to predict with it. Recently, you even mentioned this issue in your new article on Bayesian workflow. However, I have not yet seen research from you (or others) that investigates how a data-driven (and thought-driven) workflow that replaces models with better ones (better likelihoods and better priors) affects overfitting and out-of-sample prediction. After all, invalid p-values, confidence/credible intervals, and biased point estimates are just one side of the coin, with overfitting and poor predictions being the other side. I am currently considering starting a research line that focuses on post-selection inference in Bayesian workflows, and any guidance (literature?) would be greatly appreciated.

The other question concerns the prior predictive distribution, whose meaning I struggle with a lot. Here is a quick summary of my issues/thoughts.

A classical, parametric statistical model for measurements is a family of probability distributions P that depend on a finite number of unknown parameters (e.g., the Gaussian family for a single measurement, which is determined by a location and a scale parameter). In the M-closed case, the probability distribution corresponding to the true parameter values is the true data-generating process (TDGP). A Bayesian model consists of the likelihood in a conjunction with a prior distribution. Because of this prior distribution, we can make predictions about the measurements via the prior predictive distribution, but when does it make sense to do so? The prior predictive is the marginal distribution of the measurements (i.e., the result of integrating out all model parameters in the joint distribution). Given its definition, the meaning of the prior predictive depends on the choice of prior distribution for the parameter vector. Here are three distinct interpretations of the prior distribution which result in different interpretations of the prior predictive distribution:

View 1: The prior distribution is chosen such that the prior predictive distribution corresponds to (our best guess of) the TDGP.
View 2 (subjective Bayesianism): The prior describes our beliefs about the relative plausibilities of the probability distributions in the model family P.
View 3 (your preferred view?): The prior works as a regularization device that helps mitigate the impact of sampling error on model fit. Similar to View 1, the prior serves a pragmatic goal under this view, and the prior predictive can be very different from the TDGP.

Under View 1, the Bayesian model is akin to a fitted classical model. This view requires very strong priors as well as hierarchies (see the toy example below). Furthermore, it can be sensible to compare Bayesian models via Bayes factors under View 1. Under Views 2 and 3, Bayes-factor tests don’t make any sense, as they solely depend on the prior predictive distributions of the two models (each evaluated at the observed data).

Personally, I have drawn the following conclusions:
The role of the prior predictive (and the posterior predictive) distribution is unclear under Views 2 and 3.
The prior predictive can only be close to the TDGP if there is a hierarchy (e.g., a random effects model where the random effects (the parameters) are first drawn from a certain probability distribution. Afterwards, data are generated using the drawn random effect. Rinse and repeat.) We can illustrate this with a toy example: 3 six-sided dice of equal size are in a bag and one of them is drawn to generate data. One die is fair, one has a 1 on two of its sides and the third one has a 1 on three of its sides.
Situation 1: We draw a die from the bag, roll it, and report whether it showed a 1. Afterwards, we return it to the bag and repeat the process.
Situation 2: We draw one die from the bag and repeatedly roll it (and always report if it showed a 1).
For both situations, an obvious choice for a Bayesian model is a binomial likelihood with a parameter prior of P(theta=1/6)=P(theta=1/3)=P(theta=1/2)=1/3, since we know that each die has the same chance of being drawn. In Situation 1, the prior predictive is the TDGP. In Situation 2, it is not: the prior predictive is very different from the TDGP regardless of which die was drawn. Nevertheless, our Bayesian model is arguably the best possible model for this toy example (at least from View 2).

Given the above observations/conclusions, I find it difficult to agree with conclusions such as “Bayes factors measure prior predictive performance” and “Bayes factors evaluate priors, cross validations evaluate posteriors.” To me, the conclusions drawn by Bob Carpenter in those two blog posts only hold under View 1 (which is really only possible in special cases that involve hierarchies) but not under Views 2 and 3.

How do you interpret the prior predictive? What is its relation to the TDGP? When are Views 1, 2, and 3 reasonable? Are there other relevant views on the prior predictive that are not considered here?

Given that I am unsure about the meaning of the prior predictive, I also struggle with the meaning of the posterior predictive. In particular, it is unclear to me why one would want to make predictions via the posterior predictive under Views 2 and 3. If there is little available data, it will be similar to the prior predictive and very different from the TDGP. If there is plenty of data, we could use the posterior predictive and we would be fine (due to the Bernstein-von Mises theorem), but we could also just work with posterior quantiles in that case. Isn’t it always better to predict using posterior quantiles (e.g., the median for our best guess and the .1 and .9 quantiles to express uncertainty) instead of the posterior predictive under View 3?

My reply:

In answer to the first question about workflow and post-selection inference, I’d recommend that you look at model-building workflow (building models, checking models, expanding models) as part of a larger scientific workflow, which also includes substantive theory, measurement, and data collection. In statistics and machine learning—theory and application alike—we focus so much on the analysis of some particular dataset in the context of some existing theory. But real science and engineering almost always involves designing new experiments, incorporating new data into our analyses, and trying out new substantive models. From that perspective, looking at post-selection inference is fine—it represents a sort of minimal adjustment of an analysis, in the same way that the sampling standard error from a survey is a minimal statement of uncertainty, representing uncertainty under ideal conditions, not real conditions.

In my own applied work, the sorts of adjustments that would be needed to adjust for model selection would be pretty minor, so it’s not where I focus my theoretical or methodological research efforts.

For bad analyses, it’s another story. For example, if Daryl Bem of those notorious ESP papers were to have used Bayesian methods, he still could’ve selected the hell out of things and found big fat posterior probabilities. But then I think the appropriate way to address this problem would not be to take these (hypothetical) apparently strong results and adjust them for model selection. Rather, I would recommend modeling all the data together using some sort of hierarchical model, which would have the effect of partially pooling estimates toward zero.

To me, doing poorly-motivated model selection and then trying to clean it up statistically is kinda like making a big mess and then trying to clean it up, or blowing something up and then trying to put it back together. I’d rather try to do something reasonable in the first place. And then, yes, there are still selection issues—there’s not one single reasonable hierarchical model, or only one single reasonable regularized machine learning algorithm, or whatever—but the selection becomes a much smaller part of the problem, which in practice gets subsumed by multiple starting points, cross validation, new data and theories, external validation, etc.

In answer to the second question about prior and predictive distributions, let me start by correcting this statement of yours: “A Bayesian model consists of the likelihood in a conjunction with a prior distribution.” The more accurate way to put this is: A Bayesian model consists of a data model in a conjunction with a prior distribution. The data model is the family of probability distributions p(y|theta). The likelihood is p(y|theta), considered as a function of theta for the observed data, y. As discussed in chapters 6, 7, and 8 of BDA3, the data model and the likelihood are not the same thing. There can be many data models that correspond to the same likelihood function. For Bayesian inference conditional on the data and model, you don’t need the data model + prior, you only need the likelihood + prior. But for model checking—prior predictive checking, posterior predictive checking, and everything in between (really, all of this can be considered as different forms of posterior predictive checking, conditioning on different things)—the likelihood isn’t enough; you need the data model too. Again, we have some examples of this in BDA, the simplest of which is a switch from a binomial sampling model to a negative-binomial sampling model with the same likelihood but different predictive distributions.
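To make the binomial/negative-binomial point concrete, here is a minimal R sketch (a toy illustration of the idea, not the worked example from BDA): the posterior for theta is the same under both data models, but the replicated datasets look different because the data models differ.

set.seed(123)
y <- 3; n <- 10                           # observed: 3 successes in 10 trials
theta <- rbeta(4000, 1 + y, 1 + (n - y))  # posterior draws (same under both data models, Beta(1,1) prior)

# Data model 1: binomial -- n was fixed by design and y was random.
y_rep <- rbinom(length(theta), size = n, prob = theta)

# Data model 2: negative binomial -- sampling continued until y = 3 successes,
# so the number of trials was random.
n_rep <- y + rnbinom(length(theta), size = y, prob = theta)

quantile(y_rep, c(.05, .5, .95))  # replicated successes out of 10 trials
quantile(n_rep, c(.05, .5, .95))  # replicated number of trials needed to reach 3 successes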

You ask, “when does it make sense” to make prior or posterior predictions. My answer is that you can interpret such predictions directly, in a literal sense as predictions of hypothetical future or alternative data based on the model. Suppose you have a hierarchical model with modeled data y, local parameters alpha, hyperparameters phi, and unmodeled data x, and your posterior distribution is p(alpha,phi|x,y) proportional to p(phi|x)p(alpha|phi,x)p(y|alpha,phi,x). Just to fix ideas, think of alpha as corresponding to an “urn” from which the data y are drawn, and think of phi as a “room” that is full of urns, each of which corresponds to a different value of alpha. Finally, think of the prior distribution of phi as a “building” full of rooms. The building is your model. So the generative model is: go to the building, sample a room at random from that building, then sample an urn at random from the urns in that room, then sample data y from your urn.

Suppose you fit the model and obtain posterior simulations (alpha,phi)^s, s=1,…,S.
– You can simulate new data y* from the prior predictive distribution, which would correspond to picking a new room, a new urn, and new data. For each simulation s, you can do this by drawing phi* from p(phi|x), then drawing alpha* from p(alpha|phi*,x), then drawing y* from p(y|alpha*,phi*,x).
– Or you can simulate new data y* from the posterior predictive distribution for new data from new groups, which would correspond to staying in the same room but then drawing a new urn and new data from that urn. For each simulation s, you can do this by keeping phi^s, then drawing alpha* from p(alpha|phi^s,x), then drawing y* from p(y|alpha*,phi^s,x).
– Or you can simulate new data y* from the posterior predictive distribution for new data from existing groups, which would correspond to staying in the same room and keeping the same urn and then drawing new data from that urn. For each simulation s, you can do this by keeping phi^s, keeping alpha^s, then drawing y* from p(y|alpha^s,phi^s,x).
These three different distributions correspond to different scenarios. They can all be interpreted directly. This has nothing to do with “relative plausibility” or whatever; they’re just predictive distributions of what you might see in alternative rooms and urns, if the model were true.
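Here is a hedged R sketch of the three replications for a simple hierarchical normal model; the toy model, the stand-in posterior draws, and all the names here are mine for illustration, not anything from a real analysis:

set.seed(1)
J <- 8; n_per_group <- 5; S <- 1000
phi_draws   <- rnorm(S, 0, 0.3)                          # stand-in posterior draws of phi
alpha_draws <- matrix(rnorm(S * J, phi_draws, 1), S, J)  # stand-in posterior draws of alpha

# 1. Prior predictive: new room, new urn, new data.
prior_pred <- replicate(S, {
  phi_star   <- rnorm(1, 0, 1)                 # draw phi from its prior
  alpha_star <- rnorm(J, phi_star, 1)
  rnorm(J * n_per_group, rep(alpha_star, each = n_per_group), 1)
})

# 2. Posterior predictive for new groups: keep phi^s, draw new alpha*, new data.
post_pred_new_groups <- sapply(1:S, function(s) {
  alpha_star <- rnorm(J, phi_draws[s], 1)
  rnorm(J * n_per_group, rep(alpha_star, each = n_per_group), 1)
})

# 3. Posterior predictive for existing groups: keep phi^s and alpha^s, new data only.
post_pred_existing <- sapply(1:S, function(s) {
  rnorm(J * n_per_group, rep(alpha_draws[s, ], each = n_per_group), 1)
})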

It’s not necessary to describe this all using multilevel models—you can distinguish between prior and posterior predictive checks with a simple model with just y and theta—but I find the multilevel modeling framework to be helpful in that it allows me to better visualize the different sorts of replications being considered. You could also consider other distributions in a non-nested model, for example predictive data on new patients, new time periods, new hospitals, etc.

Regarding Bob’s posts about Bayes factors and prior predictive checks: I take his point to be not philosophical but mathematical. His point is that the Bayes factor is mathematically an integration over the prior predictive distribution. This is obvious—you can just look at the integral—but it seems that people get confused about the Bayes factor because they look at it in terms of what it is supposed to do (give the posterior probability that a model is true, something that I think is typically the wrong question to ask, for reasons discussed in chapter 7 of BDA3 and also this article with Shalizi) rather than what it does. In that sense, Bob’s post fits into a long tradition of statistical research and exposition, including Neyman, Tukey, and others who work to understand methods in terms of what they actually do. This does not address your questions about prior and posterior predictive distributions (for that, I refer you to the above paragraph about rooms and urns, which is largely drawn from my 1996 paper with Meng and Stern); you just have to read Bob’s posts literally, not as Bayesian positions but as agnostic “machine learning” descriptions of what prior and posterior predictive checks do.

Understanding p-values: Different interpretations can be thought of not as different “philosophies” but as different forms of averaging.

The usual way we think about p-values is that they’re part of null-hypothesis-testing inference, and if that’s not your bag, you have no use for p-values.

That summary is pretty much the case. I once did write a paper for the journal Epidemiology called “P-values and statistical practice,” and in that paper I gave an example of a p-value that worked (further background is here), but at this point my main interest in p-values is that other people use them, so it behooves me to understand what they’re doing.

Theoretical statistics is the theory of applied statistics, and part of applied statistics is what other people do.

What this means is that, just as non-Bayesians should understand enough about Bayesian methods to be able to assess the frequency properties of said methods, so should I, as a Bayesian, understand the properties of p-values. Bayesians are frequentists.

The point is, a p-value is a data summary, and it should be interpretable under various assumptions. As we like to say, it’s all about the averaging.

Below are two different ways of understanding p-values. You could think of these as the classical interpretation or the Bayesian interpretation, but I prefer to think of them as conditioning-on-the-null-hypothesis or averaging-over-an-assumed-population-distribution.

So here goes:

1. One interpretation of the p-value is as the probability of seeing a test statistic as extreme as, or more extreme than, the data, conditional on a null hypothesis of zero effects. This is the classical interpretation.

2. Another interpretation of the p-value is conditional on some empirically estimated distribution of effect sizes. This is what we did in our recent article with Erik van Zwet and others, “A new look at p-values for randomized clinical trials,” using the Cochrane database of medical trials.

Both interpretations 1 and 2 are valid! No need to think of interpretation 2 as a threat to interpretation 1, or vice versa. It’s the same p-value, we’re just understanding it by averaging over different predictive distributions.
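Here is a hedged simulation sketch of the two interpretations for a simple z-test. The effect-size distribution below (true signal-to-noise ratios drawn from normal(0, 1)) is an assumption for illustration only, not the empirical Cochrane distribution used in that article:

set.seed(2025)
n_sims <- 1e6
snr <- rnorm(n_sims, 0, 1)       # assumed population of true signal-to-noise ratios
z   <- rnorm(n_sims, snr, 1)     # observed z-statistics
p   <- 2 * pnorm(-abs(z))        # two-sided p-values

# Interpretation 1: conditional on the null, p = 0.03 just says |z| = 2.17.
qnorm(1 - 0.03 / 2)

# Interpretation 2: averaging over the assumed effect-size distribution,
# ask what the true effects look like when we happen to observe p near 0.03.
sel <- abs(p - 0.03) < 0.002
mean(sign(snr[sel]) != sign(z[sel]))      # chance the estimate has the wrong sign
median(abs(z[sel]) / abs(snr[sel]))       # typical exaggeration of the effect size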

What to do with all this theory and empiricism is another question, and there is a legitimate case to be made that following procedures derived from interpretation 2 could lead to worse scientific outcomes, just as of course there is a strong case to be made that procedures derived from interpretation 1 have already led to bad scientific outcomes.

Following that logic, one could argue that interpretation 1, or interpretation 2, or both, are themselves pernicious in leading, inexorably or with high probability, toward these bad outcomes. One can continue with the statement that interpretation 1, or interpretation 2, or both, have intellectual or institutional support that props them up and allows the related bad procedures to continue; various people benefit from these theories, procedures, and outcomes.

To the extent there are, or should be, disputes about p-values, I think such disputes should focus on the bad outcomes for which there is concern, not on the p-values themselves or on interpretations 1 and 2, both of which are mathematically valid and empirically supported within their zones of interpretation.

Keith O’Rourke’s final published paper: “Statistics as a social activity: Attitudes toward amalgamating evidence”

Keith O’Rourke passed away two years ago. Here’s his obituary, which was sent to me by Bart Harvey:

Lover of earth, wood, water and fire, Keith left us after a brief illness on November 27, 2022. Born to Evelyn and Frank O’Rourke, he was the second of four sons. He met Marlene and they shared their life together for 39 years.

Keith worked in landscaping throughout his undergraduate years at the University of Toronto, then at Moore Business Forms, and in the western and northern provinces and territories in the field of compressed air before returning to UofT to complete an MBA and undertake an MSc. For many years he worked as a biostatistician at the Toronto General Hospital and the Ottawa Hospital on numerous studies in the fields of cancer, diabetes, SARS and infectious diseases research before completing a PhD at Oxford University in 2004. Having worked at Duke University, McGill and Queen’s, he joined Health Canada in the late 2000s as a biostatistician, initially in health care and then in pesticide management. A conscientious intellectual and deep thinker, Keith endeavoured to make our world a better and safer place.

Known to enjoy the occasional three fingers of Glenfiddich or an IPA, Keith pondered the mysteries of health and safety while taking long walks in nature, nurturing many maple, birch and oak saplings, cutting up downed trees at our cottage in the Lanark Highlands, building bonfires and stoking every wood stove he had access to. A black belt in Kung Fu and an assistant coach in boxing at Oxford, he was known to jump in the ring and spar with up-and-coming kick-boxers into his mid-sixties, until Covid ended access to the ring.

We had an unpublished project together which we had not touched since 2016. I recently revised the paper, and it will be published. The article was originally titled, “Attitudes toward amalgamating evidence in statistics”; the final version is called, “Statistics as a social activity: Attitudes toward amalgamating evidence,” and here’s the abstract:

Amalgamation of evidence in statistics is done in several ways. Within a study, multiple observations are combined by averaging or as factors in a likelihood or prediction algorithm. In multilevel modeling or Bayesian analysis, population or prior information are combined with data using the weighted averaging derived from probability modeling. In a scientific research project, inferences from data analysis are interpreted in light of mechanistic models and substantive theories. Within a scholarly or applied research community, data and conclusions from separate laboratories are amalgamated through a series of steps including peer review, meta-analysis, review articles, and replication studies.

These issues have been discussed for many years in the philosophy of science and statistics, gaining attention in recent decades first with the renewed popularity of Bayesian inference and then with concerns about the replication crisis in science. In this article, we review amalgamation of statistical evidence from different perspectives, connecting the foundations of statistics to the social processes of validation, criticism, and consensus building.

I’m pretty sure that this will be Keith’s final published article. I very much regretted not having him around when revising the paper; we would’ve had lots to talk about.

“Why do medical tests always have error rates?”

John Cook writes:

Someone recently asked me why medical tests always have an error rate. It’s a good question.

A test is necessarily a proxy, a substitute for something else. You don’t run a test to determine whether someone has a gunshot wound: you just look.

Good point! I hadn’t thought about it that way before, but, yeah, “Let’s test for condition X” implies an indirectness of measurement, in a way that “Let’s look for X” does not.

Also, yeah, sure, sometimes looking isn’t enough to tell if someone has a gunshot wound; there have to be some edge cases. But we get Cook’s general point here.

What should Yuling include in his course on statistical computing?

Yuling writes:

I have been creating a new course for a PhD level core class on statistical computing in my department. Due to my personal bias, this course will be largely computing methods for Bayesian inference (or probabilistic ML).

I have created a tentative schedule for this course.

Any suggestions on the topics or reading materials that I try to cover?

Please leave your suggestions for topics and reading materials, or any other thoughts you have on statistical computing, in the comments.

My suggestion is that Yuling have them read my article with Aki, What are the most important statistical ideas of the past 50 years?, which is not just about computing but could be helpful to give students a sense of the motivation behind many of the developments in computational statistics.

Calibration for everyone and every decision problem, maybe

This is Jessica. Continuing on the theme of calibration for decision-making that I’ve been writing about, recall that a calibrated prediction model or algorithm is one for which, among the cases where the predicted probability is b, the event occurs at a rate of b, and this holds for all values of b. In practice we settle for approximate calibration and do some binning.

As I mentioned in my last post on calibration for decision-making, an outcome indistinguishability definition of perfect calibration says that the true distribution of individual-outcome pairs (x, y*) is indistinguishable from the distribution of pairs (x, ytilde) generated by simulating a y using the true (empirical) probability, ytilde ∼ Ber(ptilde(x)). This makes clear that calibration can be subject to multiplicity, i.e., there may be many models that are calibrated over X overall but which make different predictions on specific instances.
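As a concrete toy illustration of the binned, approximate check described at the top of this post, here is a minimal R sketch; the data-generating process and the deliberately miscalibrated predictor are my own inventions:

set.seed(7)
n <- 1e5
p_true <- plogis(rnorm(n))                # hypothetical true event probabilities
y      <- rbinom(n, 1, p_true)            # realized outcomes
p_hat  <- plogis(0.8 * qlogis(p_true))    # a miscalibrated predictor, for illustration

bins <- cut(p_hat, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
data.frame(
  mean_prediction = tapply(p_hat, bins, mean),
  event_rate      = tapply(y, bins, mean),
  n               = as.vector(table(bins))
)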

This is bad for decision-making in a couple ways. First, being able to show that predictions are calibrated in aggregate doesn’t prevent disparities in error rates across subpopulations defined on X. One perspective on why this is bad is socially motivated. A fairness critique of the standard ML paradigm where we minimize empirical risk is that we may be trading off errors in ways that disadvantage some groups more than others. Calibration and certain notions of fairness, like equalized odds, can’t be achieved simultaneously. Going a step further, ideally we would like calibrated model predictions across subgroups without having to rely on the standard, very coarse notions of groups defined by a few variables like sex and race. Intersectionality theory, for example, is concerned with the limitations of the usual coarse demographic buckets for capturing people’s identities. Even if we put social concerns aside, from a technical standpoint, it seems clear that the finer the covariate patterns on which we can guarantee calibration, the better our decisions will be.

The second problem is that a predictor being calibrated doesn’t tell us how informative it is for a particular decision task. Imagine predicting a two-dimensional state. We have two predictors that are equally well calibrated. The first predicts perfectly on the first dimension but is uninformed on the second. The second predicts perfectly on the second dimension but is uninformed on the first. Which predictor we prefer depends on the specifics of our decision problem, like what the loss function is. But once we commit to a loss function in training a model, it is hard to go back and get a different loss-minimizing predictor from the one we got. And so, a mismatch between the loss function we train the model with and the loss function that we care about when making decisions down the road can mean leaving money on the table in the sense of producing a predictor that is not as informative as it could have been for our downstream problem. 

Theoretical computer science has produced a few intriguing concepts to address these problems (at least in theory). They turn out to be closely related to one another. 

Predictors that are calibrated over many possibly intersecting groups

Multicalibration, introduced in 2018 in this paper, guarantees that for any possibly intersecting set of groups that are supported by your data and can be defined by some function class (e.g., decision trees up to some max depth), your predictions are calibrated. Specifically, if C is a collection of subsets of X and alpha takes a value in [0, 1], a predictor f is (C, alpha)-multicalibrated if for all S in C, f is alpha-calibrated with respect to S. Here alpha-calibrated means the expected difference between the “true” probability and the predicted probability is less than alpha.

This tackles the first problem above, that we may want calibrated predictions without sacrificing some subpopulations. It’s natural to think about calibration from a hierarchical perspective, and multicalibration is a way of interpolating between what is sometimes called “strong calibration”, i.e., calibration on every covariate pattern (which is often impossible) and calibration only in aggregate over a distribution. 
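To make this a bit more concrete, here is a toy R sketch of checking calibration error across possibly intersecting groups. Note that it only checks group-level mean error, a simplification of the full definition (which requires calibration across prediction values within each group), and the data and groups are my own invention:

set.seed(11)
n  <- 5e4
x1 <- rbinom(n, 1, 0.5); x2 <- rbinom(n, 1, 0.3)
p_true <- plogis(-1 + x1 + 0.5 * x2 + 0.8 * x1 * x2)
y      <- rbinom(n, 1, p_true)
p_hat  <- plogis(-1 + x1 + 0.5 * x2)      # a predictor that ignores the interaction

groups <- list(all = rep(TRUE, n), x1 = x1 == 1, x2 = x2 == 1,
               x1_and_x2 = x1 == 1 & x2 == 1)
sapply(groups, function(S) mean(y[S]) - mean(p_hat[S]))   # calibration error by group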

Predictors that simultaneously minimize loss for many downstream decision tasks 

Omniprediction, introduced here, starts with the slightly different goal of identifying predictors that can be used to optimize any loss in a family of loss functions relevant to downstream decision problems. This is useful because we often don’t know, at the time we are developing the predictive model, what sort of utility functions the downstream decision-makers will be facing.

An omnipredictor is defined with respect to a class of loss functions and (again) a class of functions or hypotheses (e.g., decision trees, neural nets). But this time the hypothesis class fixes a space of possible functions we can learn that will benchmark the performance of the (omni)predictor we get. Specifically, for a family L of loss functions, and a family C of hypotheses c : X → R, an (L,C)-omnipredictor is a predictor f with the property that for every loss function l in L, there is a post-processing function k_l such that the expected loss of the composition of k_l with f measured using l is almost as small as that of the best hypothesis c in C.

Essentially an omnipredictor is extracting all the predictive power from C, so that we don’t need to worry about the specific loss function that characterizes some downstream decision problem. Once we have such a predictor, we post-process it by applying a simple transformation function to its predictions, chosen based on the loss function we care about, and we’re guaranteed to do well compared with any alternative predictor that can be defined within the class of functions we’ve specified.  
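As a toy illustration of the post-processing idea (my own example, not one from the omniprediction paper): for cost-weighted binary decisions, the loss-specific post-processing k_l of a calibrated probability predictor is just a threshold determined by the loss, so the same predictor can serve many downstream losses.

set.seed(3)
n <- 1e5
p <- runif(n)                  # calibrated predictions
y <- rbinom(n, 1, p)           # outcomes consistent with those predictions

# Loss: acting costs 1; failing to act when y = 1 costs c_fn.
expected_loss <- function(act, y, c_fn) mean(act + (1 - act) * y * c_fn)
k_l <- function(p, c_fn) as.numeric(p > 1 / c_fn)   # optimal threshold for this loss

expected_loss(k_l(p, c_fn = 2), y, c_fn = 2)    # decision rule for one downstream loss
expected_loss(k_l(p, c_fn = 10), y, c_fn = 10)  # different rule, same underlying predictor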

Omnipredictors and multicalibration are related

Although the goal of omniprediction is slightly different than the fairness motivation for multicalibration, it turns out the two are closely related. Multicalibration can provide provable guarantees on the transferability of a model’s predictions to different loss functions. Multicalibrated predictors are convex omnipredictors. 

One way to understand how they are related is through covariance. Say we partition X into different disjoint subsets whose union is X. Multicalibration implies that for an average state in the partition of X, for any hypothesis c in C, conditioning on the label does not change the expectation of c(x). No c has extra predictive power for y once we’re in one of these groups. We can think of multicalibration as producing a model of nature that fools correlation tests from C.

Finding such predictors (in theory) 

Naturally there is interest in how hard it is to find predictors that satisfy these definitions. For omnipredictors, a main result is that for any hypothesis class C, a weak agnostic learner for C is sufficient to efficiently compute an (L, C)-omnipredictor where L consists of all decomposable convex loss functions obeying mild Lipschitz conditions. Weak agnostic learning here means that we can expect to be able to efficiently find a hypothesis in C that is considerably better than random guessing. Both multicalibration and omniprediction are related to boosting – if our algorithm returns a predictor that is only slightly better than chance, we can train another model to predict the error, and repeat this until our “weak learners” together give us a strong learner, one that is well correlated with the true function.
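In that boosting spirit, here is a rough R sketch of iteratively “patching” a predictor until no group in C shows an average calibration error larger than alpha. This is a simplified caricature of the multicalibration algorithms in the literature (which also condition on the predicted value, not just the group), reusing the groups list from the sketch above:

multicalibrate <- function(p_hat, y, groups, alpha = 0.01, max_iter = 100) {
  for (iter in 1:max_iter) {
    errs  <- sapply(groups, function(S) mean(y[S] - p_hat[S]))
    worst <- which.max(abs(errs))
    if (abs(errs[worst]) <= alpha) break                   # every group within tolerance
    S <- groups[[worst]]
    p_hat[S] <- pmin(pmax(p_hat[S] + errs[worst], 0), 1)   # patch and clip to [0, 1]
  }
  p_hat
}
# e.g., p_cal <- multicalibrate(p_hat, y, groups)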

Somewhat surprisingly, the complexity of finding the omnipredictor turns out to be comparable to that of finding the best predictor for some single loss function. 

But can we achieve them in practice?  

If it wasn’t already obvious, work on these topics is heavily theoretical. Compared to other research related to quantifying prediction uncertainty, like conformal prediction, there are far fewer empirical results. Part of the reason I took an interest in the first place was that I was hearing multiple theorists respond to discussions on how to communicate prediction uncertainty by proposing that this was easy because you could just use a multicalibrated model. Hmmm.

It seems obvious that we need a lot of data to achieve multicalibration. How should researchers working on problems where the label spaces are very large (e.g., some medical diagnosis problems) or where the feature space is very high dimensional and hard to slice up think about these solutions? 

Here it’s interesting to contrast how more practically-oriented papers talk about calibration. For example, I came across some papers related to calibration for risk prediction in medicine where the authors seem to be arguing that multicalibration is mostly hopeless to expect in practice. E.g., this paper implies that “moderate calibration” (the definition of calibration at the start of this post) is all that is needed, and “strong calibration,” which requires that the event rate equals the predicted risk for every covariate pattern, is utopic and not worth wasting time on. The authors argue that moderate calibration guarantees non-harmful decision-making, sounding kind of like the claims about calibration being sufficient for good decisions that I was questioning in my last post. Another paper implies that the data demands for approaches like multicalibration are too extreme, but that it’s still useful to test for strong calibration to see whether there are specific groups that are poorly calibrated.

I have more to say on the practical aspects, since as usual it’s the gap left between the theory and practice that intrigues me. But this post is already quite long so I’ll follow up on this next week.

Average predictive comparisons

A political science colleague asked me what I thought of this recent article by Carlisle Rainey, which begins:

Some work in political methodology recommends that applied researchers obtain point estimates of quantities of interest by simulating model coefficients, transforming these simulated coefficients into simulated quantities of interest, and then averaging the simulated quantities of interest . . . But other work advises applied researchers to directly transform coefficient estimates to estimate quantities of interest. I point out that these two approaches are not interchangeable and examine their properties. I show that the simulation approach compounds the transformation-induced bias . . .

By quantities of interest, he’s referring to summaries such as “predicted probabilities, expected counts, marginal effects, and first differences.”

Statistical theory is focused on the estimation of parameters in a model, not so much on inference for derived quantities, and it turns out that these questions are subtle. There are no easy answers, as Iain Pardoe and I realized years ago when trying to come up with general rules for estimating average predictive comparisons.

Here’s how we put it in Section 14.4 of Regression and Other Stories:

It’s a fun problem, typical of many math problems in that there’s no way to talk about the correct answer until you decide what’s the question! The predictive comparison—sometimes we call it the average predictive effect of u on y, other times we avoid the term “effect” because of its implication of causality—depends on the values chosen for u^lo and u^hi, it depends on the values of the other predictors v, and it depends on the parameter vector theta. For a linear model with no interactions, all these dependences go away, but in general the average predictive comparison depends on all these things.

So, along with evaluating different estimators of the average predictive comparison (or other nonlinear summaries such as predicted probabilities and expected counts), we need to decide what we want to be estimating. And that’s no easy question!
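For a sense of how the two approaches Rainey contrasts can differ, here is a toy R sketch (my own example) for a predicted probability from a logistic regression: transforming the coefficient estimates directly versus averaging over simulated coefficients.

library(MASS)
set.seed(5)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 + 0.8 * x))
fit <- glm(y ~ x, family = binomial)

x_new    <- 1
beta_hat <- coef(fit)

# Approach 1: plug in the coefficient estimates, then transform.
plogis(beta_hat[1] + beta_hat[2] * x_new)

# Approach 2: simulate coefficients from their approximate sampling
# distribution, transform each draw, then average.
sims <- mvrnorm(10000, beta_hat, vcov(fit))
mean(plogis(sims[, 1] + sims[, 2] * x_new))   # generally differs from approach 1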

For more on the topic, I recommend my 2007 paper with Pardoe, Average predictive comparisons for models with nonlinearity, interactions, and variance components.

The topic has confused people before; see for example here.

I think the problem arises because of a mistaken intuition that there should be a single correct answer to this question. The issues that Rainey discusses in his article are real, and they arise within a larger context of trying to provide methods for researchers who are producing inferential summaries without always having a clear sense of what they are trying to estimate. I’m not slamming these researchers—it’s not always clear to me what summaries I’m interested in.

Again, I recommend our 2007 paper. Many open questions remain!