Understanding exchangeability in statistical modeling: a Thanksgiving-themed post

Several years ago, on the day after Thanksgiving, I was on the phone with my sister telling her how I’d used turkey leftovers to make a delicious chili. The process started by pulling out the pieces that could be eaten on their own—the drumsticks and the big chunks of meat. The rest went into the chili. The next step was to strip off the small bits of meat, pull off the skin and connective tissue and cut them into little pieces (no matter how long you cook the skin, it won’t soften and separate on its own), throw in several onions, a bunch of red bell peppers, a couple tomatoes, a variety of different sorts of hot pepper, a couple squares of unsweetened chocolate, and a mushed banana, and then cook it all in beer for several hours over low heat. Add more hot peppers to taste, and more red peppers to make it sweeter.

After this long description, there was a pause on the other end of the phone, and then my sister said, “I was with you until the bit about the connective tissue.”

Similarly, many classically-trained statisticians are impressed by Bayesian methods and see their appeal, but they can’t find their way to accept the assumption of exchangeability, which stands behind the hierarchical models that are so central to modern Bayesian practice.

In the simplest hierarchical model, the variation among the underlying effects (relative to the uncertainties in their estimation) determines the degree to which the estimated effects are pooled toward the grand mean. This has been commonplace to us at least since Rubin’s 8-schools paper from 1981, but if you go back to the pre-Bayesian literature from decades ago, you will find discussions of the appropriateness of pooling data, which was considered a big step requiring strong substantive motivation.

As late as 1972, in the discussion of the landmark paper by Lindley and Smith, the respected statistician Oscar Kempthorne wrote, “Is it ‘practically realistic’ to use an exchangeable prior? Information is available in the records to show that schools differ widely, students of different social and ethnic backgrounds perform differently on tests, and so on.” Kempthorne was thinking of exchangeability as a substantive concern, to be addressed by subject-matter knowledge. And subject-matter knowledge is indeed relevant, most notably in setting up the model for the varying parameters. But he was completely wrong to think that “exchangeable” meant “identical.”

An attractive feature of hierarchical Bayesian inference is that it translates questions of statistical method into questions about substantive modeling. For example, the amount of partial pooling in a hierarchical model is connected to the substantive parameter that represents group-level variation.
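As a rough numerical sketch of that connection (assuming a flat prior on the grand mean and fixing the group-level standard deviation tau at a few illustrative values), here is what partial pooling of the 8-schools estimates from Rubin (1981) looks like: each school’s conditional posterior mean is a precision-weighted compromise between its own estimate and the pooled mean.

```python
import numpy as np

# The 8-schools estimates and standard errors (Rubin, 1981).
y     = np.array([28., 8., -3., 7., -1., 1., 18., 12.])   # estimated effects
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.]) # standard errors

def partial_pool(y, sigma, tau):
    """Conditional posterior means of the school effects for a fixed group-level sd tau,
    with a flat prior on the grand mean."""
    w = 1.0 / (sigma**2 + tau**2)          # weights for the pooled mean
    mu_hat = np.sum(w * y) / np.sum(w)
    lam = sigma**2 / (sigma**2 + tau**2)   # shrinkage factor toward mu_hat
    return lam * mu_hat + (1 - lam) * y

for tau in [0.1, 5.0, 50.0]:               # illustrative values of the group-level sd
    print(tau, np.round(partial_pool(y, sigma, tau), 1))
# tau near 0: nearly complete pooling toward the grand mean;
# tau large: estimates stay close to the raw y_j (little pooling).
```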

In contrast, Stein (1956) and James and Stein (1961) proved that shrinkage gives lower mean squared error, compared to unpooled estimates—even when estimating parameters with no relation to each other. This has mystified researchers to the extent that it has been called a paradox (Efron and Morris, 1977).

Our resolution of “Stein’s paradox” is that if the parameters θj being estimated are truly unrelated, then they will likely be very different from each other, in which case the estimated group-level variance will probably be large, and very little actual shrinkage will go on. Conversely, if we have a set of unrelated parameters that happen to be close to each other—perhaps because of measurement protocols under which each is expected to have a value near 1.0, say—then, yes, this is information that can improve inferences in each specific case.
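A small simulation sketch of this reasoning, under arbitrary choices (50 parameters, unit-variance noise, and a simple moment estimate of the group-level variance): when the true θj are widely spread, the estimated group-level variance is large and the shrinkage estimates barely move; when the θj happen to sit close together, shrinkage clearly reduces the error.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 50  # number of "unrelated" parameters (arbitrary choice for illustration)

def mse_raw_vs_shrunk(theta, n_sims=2000):
    """Average squared error of raw estimates vs. estimates shrunk toward the sample mean,
    with the group-level variance estimated by a simple method of moments."""
    mse_raw = mse_shrunk = 0.0
    for _ in range(n_sims):
        y = theta + rng.normal(size=K)          # y_j ~ Normal(theta_j, 1)
        tau2 = max(np.var(y) - 1.0, 0.0)        # estimated between-parameter variance
        lam = 1.0 / (1.0 + tau2)                # shrinkage factor sigma^2 / (sigma^2 + tau^2), sigma = 1
        theta_hat = lam * y.mean() + (1 - lam) * y
        mse_raw    += np.mean((y - theta)**2) / n_sims
        mse_shrunk += np.mean((theta_hat - theta)**2) / n_sims
    return mse_raw, mse_shrunk

print(mse_raw_vs_shrunk(rng.normal(scale=10.0, size=K)))  # spread-out thetas: almost no shrinkage, similar errors
print(mse_raw_vs_shrunk(rng.normal(scale=0.5, size=K)))   # thetas close together: shrinkage clearly helps
```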

Beyond this, nonexchangeability—in the sense of additional information distinguishing the groups in a study—can be included in a Bayesian hierarchical model, typically by including the extra information as group-level predictors. In Kempthorne’s example above, one can use previous school records as a covariate in a regression model. Or, in a more difficult problem involving several parameters, some of which are believed to be similar to each other and others much different, the researcher could fit a mixture model. The point is that aspects of “nonexchangeability” might be better viewed as structure that can be modeled.
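Schematically (with made-up numbers, and with the regression coefficients and group-level standard deviation treated as known rather than estimated), a group-level predictor changes the target of the shrinkage: each effect is pulled toward its own regression prediction rather than toward a single grand mean.

```python
import numpy as np

# Hypothetical groups: estimates y_j, standard errors sigma_j, and a
# group-level predictor x_j (e.g., a summary of previous school records).
y     = np.array([12., 3., -5., 9.])
sigma = np.array([8., 6., 10., 7.])
x     = np.array([1.2, 0.1, -0.9, 0.8])

alpha, beta, tau = 0.0, 8.0, 3.0   # treated as known here; in practice they are estimated

pred = alpha + beta * x                      # regression prediction for each group
lam = sigma**2 / (sigma**2 + tau**2)         # shrinkage factor
theta_hat = lam * pred + (1 - lam) * y       # shrink toward the prediction, not a grand mean
print(np.round(theta_hat, 1))
```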

13 thoughts on “Understanding exchangeability in statistical modeling: a Thanksgiving-themed post”

  1. Nice post. I find it strange that classical statisticians would have a problem with exchangeability when they seemingly have no problem with (conditional) iid assumptions of observables.

  2. I’ve taken all the randomized trials in the Cochrane Database of Systematic Reviews (CDSR) to be exchangeable here: “Default informative priors for effect sizes: where do they come from?”

    Now, I claim that taking all the trials in the CDSR to be exchangeable is not an assumption but a choice. It is a matter of choosing which information to use. If I present someone with a trial from the CDSR and tell them nothing about it, then as far as they’re concerned it’s exchangeable with all the trials in the CDSR. If I tell them that it’s actually a phase 2 trial, then for them it becomes exchangeable with the phase 2 trials in the CDSR.

    So, if I take a trial to be exchangeable with the entire CDSR, it means that I’m using the information that it’s a trial from the CDSR and nothing more (which seems nicely objective to me). I can understand how someone would want to include more information (like the fact that it’s a phase 2 trial), but I don’t think it makes sense to use less, i.e. ignore that it’s a trial that meets the criteria of the CDSR.

      • I don’t think I understand the distinction between mathematical and substantive assumptions. If I assume some counts are i.i.d. Poisson, is that a mathematical or substantive assumption? Or a little bit of both?

        I would say that exchangeability is the absence of distinguishing features. This can be “achieved” by ignoring them. In fact, it can only be achieved by ignoring them because there are always distinguishing features if you look close enough.

  3. Model assumptions generally are formal mathematical constructs and reality is generally not like that, so assumptions won’t be fulfilled in reality, and an assumption can be very helpful and the method based on it useful even if the assumption is in fact violated. Then there are also assumption violations that are dangerous, and the hard bit is to tell the dangerous ones apart from the rather harmless ones. This applies to frequentist as well as Bayesian statistics with epistemic probabilities. Even if one insists that probabilities are epistemic and refer to a belief/state of knowledge rather than “reality out there”, implications of probability models including prior choices are complex, and it is hardly checked whether they match a person’s beliefs precisely; normally they won’t, but that doesn’t necessarily affect the use of the resulting analysis.

    That said, one should at least acknowledge that exchangeability (as an assumption for epistemic probability) is unrealistic and unreasonable to believe in (even though it may be helpful for analyses, as is the i.i.d. assumption for frequentists). Exchangeability implies that the order of observations *will not be taken into account* in the analysis. This means that when observing binary data, you commit yourself in advance to making no distinction between a sequence of, say, 001000110110111001010110 and 000111111111111000000000 when it comes to predicting the next outcome (see the numerical sketch at the end of this comment). In particular, if you start with exchangeability, you can never get out of it whatever happens – unless you are prepared to drop your model and violate coherence (which I believe Andrew is indeed prepared to do).

    In the post Andrew states that “aspects of ‘nonexchangeability’ might be better viewed as structure that can be modeled.” However, exchangeability doesn’t allow for “structure” that you are not aware of before looking at the data (and even when looking at the data you may not necessarily know how to explain what you see).

    Of course, if the analyst is not dogmatic and is prepared to check and, if necessary, reject their model based on the data, this may not be that much of a problem. Such an analyst can tentatively start with an exchangeability model if there is no prior information that suggests specific deviations from it, and drop it when the data rebel. (I should acknowledge here that Andrew has repeatedly stated that he doesn’t think of the probabilities he uses in Bayesian analyses as epistemic, but rather that Bayesian statistics can well be used with probabilities that refer to data-generating processes in the world. In this way modelled probabilities can actually be contradicted by the data, which purely epistemic probabilities can’t.) A frequentist i.i.d. assumption is probably neither better nor worse in this respect. But note that this violates Bayesian principles, and Bayesians should then not use those same principles to argue their superiority over frequentists.

    A thing that I find important to acknowledge is that ultimately something like exchangeability or i.i.d. assumptions are required to allow for learning about later observations from earlier observations. Such assumptions imply that in some sense the later observations behave like the earlier ones (“identity”), and that what happens later is not determined by the precise individual behaviour of what happened earlier (“independence”, with exchangeability being basically i.i.d. conditionally on something unobserved). We can model structures that cause deviations from exchangeability, and this is fine, but a useful model still needs to imply that what happens later is like what happens earlier and not determined by idiosyncrasies on a higher level, even if not at the level of the plain observations. For example we could postulate that over time the regression relation, time series structure, the way within-individual effects play out etc. is the same, independently of what happened earlier. We can’t really get around such assumptions, not because they’d be true (be it in reality or epistemically), but because they formally enable the statistical methods to learn from the past.
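    A minimal numerical sketch of the point about order, assuming a uniform Beta(1,1) prior: under an exchangeable Beta-Binomial model the predictive probability of the next outcome depends only on the counts of 0s and 1s, so the two sequences above (each with twelve 1s in twenty-four observations) give exactly the same prediction.

    ```python
    def prob_next_is_one(seq, a=1.0, b=1.0):
        """P(next = 1 | data) under a Beta(a, b) prior on the success probability."""
        ones = seq.count("1")
        return (a + ones) / (a + b + len(seq))

    print(prob_next_is_one("001000110110111001010110"))  # 0.5
    print(prob_next_is_one("000111111111111000000000"))  # 0.5 -- identical, whatever the pattern
    ```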

    • Christian:

      One way to think about exchangeability is to suppose you have a group-level predictor and you’re including it in a regression model. Then exchangeability implies that the coefficient of that predictor is exactly zero, an assumption which will be false in just about any real-world situation. On the other hand, if you don’t have any predictor, then exchangeability is required of the model.

  4. I like the interpretability of Stein’s shrinkage effect, a topic that needs expansion in this age of big data. For more on Stein’s impact see this new pub https://projecteuclid.org/journals/statistical-science/advance-publication/Steins-Method-Meets-Computational-Statistics–A-Review-of-Some/10.1214/22-STS863.short

    On a personal note, when Stein is mentioned, I keep seeing him play Go at the Stanford cafeteria. He used to spend hours there…

  5. Efron refers to anxiety about shrinkage as a concern for “relevance”, but Robbins (1951) wasn’t worried:

    “No relation whatever is assumed to hold among the unknown parameters. To emphasize this point, X1 could be an observation on a butterfly in Ecuador, X2 on an oyster in Maryland, X3 the temperature of a star, and so on, all observations being taken at different times.”

    My favorite Stein image is him sitting on the floor at the Stanford “old union” with several policemen standing over him, with the newspaper caption, “Stein deviates from the statistical norm,” in the process of being arrested in an anti-apartheid demonstration.
