Survey Statistics: toy example for energy balancing weights

Posted on July 7, 2026 8:08 PM by shira

Last week we talked about The Big Changes Coming to the Times/Siena Poll:

New weighting variable: support score = E(2024 vote | other X variables).
New weighting method: energy balancing (Huling & Mak, 2024)

Ben Schneider helpfully blogged about energy balancing as well:

Raking and similar calibration methods are based on balancing means or totals for specific variables…The energy balancing method does something different: it calibrates based on an entire multivariate distribution, as measured by an empirical cumulative distribution function (ECDF).

Jared Huling (of Huling & Mak, 2024) helpfully answered questions in the comments. I’m still puzzling over how energy balancing handles empty cells (unsampled regions of the joint covariate space). I need a toy example.

Consider 2 binary variables, so 4 population cells, with known population shares:

       k=0    k=1    total
j=0    .4     .2     .6
j=1    .2     .2     .4
total  .6     .4

Say the sample is missing folks in cell 11:

       k=0    k=1    total
j=0    .5     .3     .8
j=1    .2     0      .2
total  .7     .3

Consider 4 methods:

1. Classical Poststratification: not defined because of division by 0.

2. Raking: match only the margins. Correct when Y | X1, X2 is additive.

       k=0    k=1    total
j=0    .2     .4     .6
j=1    .4     0      .4
total  .6     .4
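
The raked table can be checked with a few lines of iterative proportional fitting. This is a minimal base-R sketch on the toy numbers, not the Times/Siena procedure; the names `samp_tab`, `row_target`, `col_target`, and `rake2` are made up for illustration.

```r
# Toy sample cell shares (rows j = 0, 1; columns k = 0, 1), filled column-wise.
samp_tab   <- matrix(c(.5, .2, .3, 0), nrow = 2,
                     dimnames = list(j = c("0", "1"), k = c("0", "1")))
row_target <- c(.6, .4)  # population margin for j
col_target <- c(.6, .4)  # population margin for k

# Iterative proportional fitting: alternately rescale rows and columns.
# rake2 is a hypothetical helper name, not a function from any package.
rake2 <- function(tab, row_target, col_target, iters = 200) {
  for (i in seq_len(iters)) {
    tab <- tab * (row_target / rowSums(tab))        # match the row margins
    tab <- t(t(tab) * (col_target / colSums(tab)))  # match the column margins
  }
  tab
}

raked <- rake2(samp_tab, row_target, col_target)
round(raked, 3)
```

The empty cell stays at 0 because raking only rescales cells that are already in the sample, so the mass of cell 11 has to be absorbed by cells 01 and 10.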

3. Energy balancing: minimize the Energy-Distance(F_w, F_pop) between the weighted sample distribution of X1, X2 and the population distribution. Correct when Y | X1, X2 is such that nearby cells have similar means.

Say X1 = young/old, X2 = man/woman, Y = percent Democrats, and no old women are sampled.

Raking is correct when additivity holds: old women = young women + (old men − young men)

Energy balancing is correct approximately when: old women = (old men + young women)/2 ?

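library(WeightIt)

# Population: 100 units with cell shares .4, .2, .2, .2
pop  <- data.frame(X1 = rep(c(0, 0, 1, 1), c(40, 20, 20, 20)),
                   X2 = rep(c(0, 1, 0, 1), c(40, 20, 20, 20)))

# Sample: 100 units with cell shares .5, .3, .2 and nobody in cell 11
samp <- data.frame(X1 = rep(c(0, 0, 1), c(50, 30, 20)),
                   X2 = rep(c(0, 1, 0), c(50, 30, 20)))

# Stack the two; A = 1 marks the population (the target), A = 0 the sample
dat <- rbind(cbind(pop,  A = 1),
             cbind(samp, A = 0))

# Energy balancing weights that move the sample toward the population,
# using Euclidean distances on the raw 0/1 codes
W <- weightit(A ~ X1 + X2, data = dat, method = "energy",
              estimand = "ATT", focal = "1",
              dist.mat = as.matrix(dist(dat[, c("X1", "X2")])))

# Weighted sample share in each sampled cell
w <- W$weights[dat$A == 0]
tapply(w, interaction(samp$X1, samp$X2), sum) / sum(w)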

       k=0    k=1    total
j=0    .381   .309   .69
j=1    .309   0      .309
total  .69    .309
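
For this toy problem the energy balancing solution can also be worked out without WeightIt, which gives a direct answer to what the method assumes about the empty cell. The sketch below is a hedged check, not the package's algorithm. It assumes normalized weights, Euclidean distances on the 0/1 codes (as in the dist.mat above), and that only cell totals matter because units within a cell are identical. The names `cells`, `kkt`, `delta`, and `impute` are made up for illustration.

```r
# Cells in the order (X1, X2) = (0,0), (1,0), (0,1), (1,1).
cells <- expand.grid(X1 = 0:1, X2 = 0:1)
p     <- c(.4, .2, .2, .2)          # population shares, same order
D     <- as.matrix(dist(cells))     # Euclidean distances between cells

# With q the weighted sample shares, the energy distance is -(q - p)' D (q - p).
# Cell (1,1) is unsampled, so q[4] = 0 and the other three shares sum to 1.
free  <- 1:3
A     <- D[free, free]
b     <- D[free, 4]

# Stationarity plus the sum constraint give a small linear system in
# delta = q[free] - p[free] and a Lagrange multiplier.
kkt   <- rbind(cbind(2 * A, 1), c(1, 1, 1, 0))
rhs   <- c(2 * p[4] * b, p[4])
delta <- solve(kkt, rhs)[free]

q <- p[free] + delta
round(q, 3)

# Implied stand-in for the missing cell mean, as coefficients on the
# (0,0), (1,0), (0,1) cell means.
impute <- delta / p[4]
round(impute, 3)
```

Under these assumptions the shares agree with the WeightIt output (.381, .309, .309), and the weighted estimate is unbiased when old women = 0.547 × (old men + young women) − 0.094 × young men. That is close to the average of the two adjacent cells, with a small negative weight on the far cell.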

4. MRP: fit a model for Y | X1, X2. The interaction term’s posterior equals its prior, propagating uncertainty around additivity.

Am I understanding this correctly ?

Survey Statistics: Big Changes in the Times/Siena Poll

Posted on June 30, 2026 4:01 PM by shira

Yesterday Nate Cohn wrote about The Big Changes Coming to the Times/Siena Poll, with more details in their poll of Maine.

Say we want to estimate average Platner support in Maine’s likely electorate, E(Y). But we only have survey respondents, R = 1.

The NYT uses survey weights to weight respondents, E(YW | R = 1). In contrast, some pollsters use MRP, fitting a Multilevel Regression model for Platner support, then applying it to the population, E(E_model(Y | X, R = 1)).

Nate discusses 2 Big Changes to how they construct the weights W.

(The polar bear has not yet hiked in ME, but he is training for it. This above is in TN.)

Big Change 1: Support score

A few weeks ago we saw the NYT started weighting on “synthetic 2024 vote”, which is recalled 2024 vote that is validated with the voter file and imputed if needed.

Now they’re also weighting on support score = E(2024 vote | other X variables). Nate explains the motivation:

While a poll can’t weight on dozens of variables, the support score lets us pile a lot of information into a single measure.

This reminded me of the causal inference context, where D’Amour and Franks (2021) “see especially strong performance for propensity weights computed with respect to the prognostic score”, where the prognostic score is E(Y | X, control). In our survey context, this would be a model for Platner support Y. Instead, the NYT uses 2024 vote, perhaps for applicability across multiple outcomes Y ?

Big Change 2: Energy balancing

Beyond adding new weighting variables, they’re also changing how they calculate the weights. Nate notes the challenge of weighting on many variables and interactions with typical sample sizes. So they are turning to the R package WeightIt, which implements the energy balancing method from Huling & Mak (2024):

This article introduces a new weighting method, called energy balancing, which instead aims to balance weighted covariate distributions. By directly targeting distributional imbalance, the proposed weighting strategy can be flexibly utilized in a wide variety of causal analyses without the need for careful model or moment specification.

The energy balancing weights do not use outcome Y, but the paper notes that estimates can be improved with a model for Y.

How do energy balancing weights handle the challenge of jointly weighting on many variables with typical sample sizes “without the need for model specification” ?

Structural equation modeling (SEM) and positive definiteness

Posted on June 25, 2026 3:00 PM by Bob Carpenter

This post is from Bob.

Mitzi and I were swotting up on structural equation models (SEM) for our class this past Monday at the Modern Modeling and Methods (M3) conference at Fordham University. It was a lot of fun and now I think I understand SEM notation. I really like these applied conferences and this was a group of psychometricians, econometricians, and sociometricians. Many if not most of them thought about models in terms of SEM, so we thought we should figure it out. But I was left with a concern you may be able to help me sort out.

The example

The first worked example in Ken Bollen’s seminal 1989 textbook on SEM is a study of how industrialization relates to democracy. It comes from his paper,

Bollen, Kenneth A. (1979). “Political Democracy and the Timing of Development.” American Sociological Review, 44(4).

and was reprised in his book

Bollen, Kenneth A. (1989). Structural Equations with Latent Variables. Wiley.

I had the pleasure of sitting across from Ken at the invited speakers dinner at the conference, so I’m glad I looked into SEM before that. Good news for the SEM devotees—he released a completely revised guide to SEM a few months ago.

Bollen, Kenneth A. 2026. Elements of Structural Equation Models. Cambridge University Press.

The data and parameters

The data consists of eleven covariates (called “indicators” in SEM) for each of 75 countries. Four of the covariates are related to democracy in 1960 (y1, y2, y3, y4), the same four measurements were taken again in 1965 (y5, y6, y7, y8), and there were three measurements of industrialization in 1960 (x1, x2, x3).

The SEM model the original researcher came up with here assumes three latent scalars per country, industrialization in 1960 (IND60), level of democracy in 1960 (DEM60), and level of democracy in 1965 (DEM65). These latent parameters are related in the following way: democracy in 1960 is a regression on industrialization in 1960, and democracy in 1965 is a regression on both democracy in 1960 and industrialization in 1960.

The covariates are then modeled like a seemingly unrelated regression in econometrics. The four democracy 1965 measurements are treated as regressions on the latent level of democracy in 1965, and similarly for democracy in 1960 and industrialization in 1960.

Rather than independent errors, a SEM model explicitly indicates with arrows which pairs of observations are allowed to have non-zero correlation in the covariance matrix for the observations. The three industrialization observations are assumed to have zero correlation—there are no arrows between any of the three measurements in the SEM diagram. Each of the four measurements in 1960 is assumed to covary with the same measurement taken in 1965. In addition, the second and fourth measurement in each year are assumed to be correlated with each other, which leads to a box-like structure.

The SEM diagram

Here are the arrows in the diagram, where I’m not using their standard LISREL notation, but writing them in R expression syntax to indicate what is regressed on what. In their graphical notation, just replace ~ with <-. All three latent variables and all eleven measurements are indexed by country.

IND60
DEM60 ~ IND60
DEM65 ~ DEM60, IND60

x1, x2, x3 ~ IND60
y1, y2, y3, y4 ~ DEM60
y5, y6, y7, y8 ~ DEM65

The covariance structure is indicated by stating which pairs of measurements are modeled with non-zero correlation. The first four just pair the measurements of the same thing across 1960 and 1965.

y1 <-> y5
y2 <-> y6
y3 <-> y7
y4 <-> y8

The last pair of correlations are within 1960 and within 1965.

y2 <-> y4
y6 <-> y8

Together, these induce an odd box structure, where y2 is correlated with y6 and y4, both of which are correlated with y8, but y2 and y8 are assumed to have zero correlation.

y2 <-> y6
^      ^
|      |
v      v
y4 <-> y8

Stan implementation

We didn’t get this far in my half of the class, so I will share here the Stan Playground example where I fit Bollen’s example (you can get the data and the Stan model through the Playground link):

Stan implementation of Bollen’s SEM example.

It gets the right answer compared to lavaan/blavaan, which is nice. In the Stan code, xi is IND60 and eta1, eta2 are DEM60, DEM65. The relations among the latent parameters are modeled directly as regressions. The correlations among the observations are modeled using soft zeroing, where I just put a tight prior around zero on the structural zero elements, because Stan doesn’t give you a good way of setting up structural zeroes in a covariance matrix (Sean Pinkney or Ben Goodrich might know how to do this?).

This makes me curious how the lavaan package in R manages this. There’s a Bayesian version of lavaan built on top of Stan, blavaan. The first example right at the top of the home pages for both lavaan and blavaan is Bollen’s democracy model. I guess it’s like the Scottish lip cancer data set for spatial modeling or Fisher’s iris data for regressions.
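
For comparison, here is how the arrows from the diagram section translate into lavaan's model syntax. This is a sketch, assuming lavaan's lowercase labels (ind60, dem60, dem65) and its bundled PoliticalDemocracy data; the fit only runs if the package is installed.

```r
# lavaan model syntax for the arrows listed in the diagram section:
#   =~ defines a latent variable by its indicators,
#   ~  is a regression, and ~~ is a residual covariance.
bollen_model <- '
  # measurement model
  ind60 =~ x1 + x2 + x3
  dem60 =~ y1 + y2 + y3 + y4
  dem65 =~ y5 + y6 + y7 + y8

  # regressions among the latent variables
  dem60 ~ ind60
  dem65 ~ ind60 + dem60

  # residual covariances (the six pairs in the diagram)
  y1 ~~ y5
  y2 ~~ y6
  y3 ~~ y7
  y4 ~~ y8
  y2 ~~ y4
  y6 ~~ y8
'

# Fit only if lavaan is installed; PoliticalDemocracy ships with the package.
if (requireNamespace("lavaan", quietly = TRUE)) {
  fit <- lavaan::sem(bollen_model, data = lavaan::PoliticalDemocracy)
  print(lavaan::parameterEstimates(fit)[, c("lhs", "op", "rhs", "est")])
}
```

Any pair of measurements without a `~~` line gets a residual covariance fixed at exactly zero, which is the structural zero discussed in this post.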

My questions

Consider a simple diagram among measurements like the following.

x <-> y
y <-> z

This says there can be non-zero correlation between x/y and also between y/z, but the correlation between x/z is zero. It’s a simplified case of the box we saw in the actual example. These arrows imply the correlation matrix looks as follows.

|        1  rho[x,y]         0 |
| rho[x,y]         1  rho[y,z] | = Omega
|        0  rho[y,z]         1 |

Given that the correlation matrix Omega must be positive definite, this limits the range of rho[x,y] and rho[y,z]. For example, we can’t have rho[x,y] = rho[y,z] = 0.9, or rho[x,z] would have to be greater than zero to maintain positive definiteness.
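
The constraint is easy to check numerically. The following base-R sketch uses made-up helper names (`omega`, `is_pd`, `rho_xz_range`) and relies only on the standard fact that a 3 × 3 correlation matrix is positive definite exactly when its determinant is positive and each pairwise correlation is strictly between -1 and 1.

```r
# Correlation matrix with a structural zero between x and z by default.
omega <- function(rho_xy, rho_yz, rho_xz = 0) {
  matrix(c(1,      rho_xy, rho_xz,
           rho_xy, 1,      rho_yz,
           rho_xz, rho_yz, 1), nrow = 3, byrow = TRUE)
}

# Positive definite means all eigenvalues are strictly positive.
is_pd <- function(m) {
  all(eigen(m, symmetric = TRUE, only.values = TRUE)$values > 0)
}

is_pd(omega(0.9, 0.9))   # structural zero with two 0.9 correlations
is_pd(omega(0.5, 0.5))   # a milder pair

# With rho_xz = 0 the determinant is 1 - rho_xy^2 - rho_yz^2, so the
# structural zero requires rho_xy^2 + rho_yz^2 < 1.
det(omega(0.9, 0.9))

# Range of rho_xz that keeps Omega positive definite for given neighbors.
rho_xz_range <- function(rho_xy, rho_yz) {
  half_width <- sqrt((1 - rho_xy^2) * (1 - rho_yz^2))
  rho_xy * rho_yz + c(lower = -half_width, upper = half_width)
}
rho_xz_range(0.9, 0.9)
```

For rho[x,y] = rho[y,z] = 0.9 the admissible interval for rho[x,z] is (0.62, 1), so 0.62 is the lower bound that any "minimum value" in that case would have to respect.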

Q1: Why doesn’t SEM instead say that the correlation rho[x,z] is just the minimum value it can be given rho[x,y] and rho[y,z]? I’m suggesting that we instead treat the above diagram as implying no additional correlation between x and z other than that implied by the correlation between x and y and the correlation between y and z. That is, why try to shrink rho[x,z] all the way to zero? From the text, it feels like the motivation is to enforce zero correlation in the model. But all this is doing is simplifying regressions; it won’t actually enforce zero correlation among the measurements that are modeled with zero correlation. I wish I’d asked Ken this question at dinner, but I’ll ping him about this blog post and hopefully get a response.

Of course, in the pragmatic Bayesian workflow, we’d use posterior predictive checks to evaluate whether there’s unmodeled correlation between x and z.

Q2: I’m also curious what Andrew and others think about enforcing structural zeroes in correlation between measurements as opposed to just estimating a dense covariance matrix and inspecting where the correlations fall.

Ph.D. student opening in Sweden on Earth Observation, Data Science, and AI for poverty estimation

Posted on June 15, 2026 5:37 PM by Andrew

Adel Daoud writes:

I’m writing to ask for your help circulating a PhD opening in my group at Chalmers, the AI and Global Development Lab (www.aidevlab.org). The position is in Earth Observation, Data Science, and AI for poverty estimation, in the Data Science and AI division (Department of Computer Science and Engineering). We are looking for candidates with a strong grounding in data science, computer science, deep learning, statistics, or similar; remote sensing experience and causal inference are a welcome bonus.

Ad and application portal: https://www.chalmers.se/en/about-chalmers/work-with-us/vacancies/?rmpage=job&rmjob=14818&rmlang=UK
Deadline: 20 June 2026.

Here’s the description of their center:

The AI & Global Development Lab fuses AI with Earth Observation to illuminate the causes and consequences of human development across time and space.

Our interdisciplinary team, comprising data scientists, computer scientists, and social scientists, develops methods to better understand the multi-scale dynamics of pressing global issues, including poverty, conflict, sustainability, and the effectiveness of policy interventions.

By analyzing satellite imagery from 1984 to the present, AI search agent swarms for large-scale knowledge discovery, and other planetary-scale sources, we are reconstructing historical and geographical development trajectories at a level of detail never before possible, working to offer new insights into the changing face of development worldwide.

We also invite you to visit PlanetaryCausalInference.org for more information about the causal arm of our project.

They call it “Planetary causal inference,” which seems to fit the themes of this blog.

To what extent is it true that “All intelligence, human or artificial, must extract structure from correlational data”?

Posted on June 13, 2026 9:49 AM by Andrew

Someone pointed me to this article, “Does AI already have human-level intelligence?” You can click through to read the whole thing; spoiler alert: their answer is Yes.

I don’t have much to say about the main argument of the article–it’s a topic we’ve gone over all too much in past comment threads–also, as a non-user of chatbots, I’m really the worst person to ask for an opinion on the topic. Indeed, the other day I was contacted by a reporter for a story about “vibe analytics” where people use chatbots to write code to perform data analysis. I shared my thoughts for a few minutes but then referred the reporter to Bob and Jessica, as they both have thought a lot more about this than I have. I continue to (a) think that it can make sense to consider chatbots and ping-pong playing robots as having human-level intelligence, and (b) agree with Gary Smith that it remains a big problem when people think chatbots have a level of understanding that they don’t actually have. But, again, my thoughts on this shouldn’t count for much.

But there is one thing in this new article that I did want to comment on. It was just an aside, not the main point by any means, but interesting:

“All intelligence, human or artificial, must extract structure from correlational data.”

Is this true? I don’t know about that, for two reasons. First, I can’t think of many cases where I (that is, my human intelligence) have extracted structure from correlational data. Setting aside my professional life as a statistician and social scientist, when have I done this? I’m not sure. Yes, I’ve estimated parameters from correlational data–for example, if I’m playing sports I make inferences about the abilities of other players based on what they’ve done on the field in the past. But that’s not structure, exactly. There is structure in the world, like the difference between cats and dogs. You can dress a dog up like a cat but it’s still a dog. Essentialism and natural kinds and all that. But that’s not anything I extracted from correlational data: I know it because people told me.

One way that I’ve extracted structure from correlational data is that as a kid I heard lots of talking and read lots of books and I extracted lots of structure of the language from that. But that’s just one example–an important example, sure, but I don’t know that it’s a characteristic of “all intelligence.”

Another way to look at this is that, as a community, we’ve extracted a lot of structure in the world–it’s called doing science–and some of this is from correlational data (Kepler figuring out planetary orbits, Galton and his table of heights, etc.) but lots of the structure we’ve extracted comes either from logical reasoning (Newtonian mechanics, relativity theory) or from experimentation–they say Galileo did a bit of that.

This doesn’t invalidate the argument made in the linked article (after all, there’s no reason a computer program can’t do pure theory or conduct experiments); I just thought it was interesting. Speaking in some fundamental sense, it seems to me that experimentation, not just observation, is a crucial part of how we often extract structure. We experiment a lot when speaking. On the other hand, sometimes, as with Kepler or with someone learning a language from reading books, the information is all, or almost all, correlational.

It’s an interesting thing to think about. We could throw this at a chatbot and see what it would say–or, more precisely, we could see what it could extract from what humans have said about related topics. But humans have said a lot; it’s a mark of intelligence to be able to read a million books and then extract their key points.

P.S. After reading a bunch of comments, I realize that I kind of missed the point of the passage I was quoting.

My argument above is that intelligence doesn’t learn about structure only by extracting structure from correlational data. Intelligence also learns about structure from logical reasoning and experiment.

But my argument doesn’t refute the quoted line, “All intelligence, human or artificial, must extract structure from correlational data.” That quote doesn’t posit that intelligence only learns from correlations. It just says that learning from correlation is part of the mix, and I agree with that.

So, as long as that passage is interpreted as saying that “extracts structure from correlational data” is necessary for “intelligence,” I’m ok with it. My problem was my interpretation (or misreading) that correlational analysis was sufficient.

“The Data Analyst’s Guide to Cause and Effect”

Posted on June 10, 2026 8:38 PM by Andrew

Theiss Bendixen and Benjamin Grant Purzycki wrote this book. He writes:

The website holds:

– All data and code used in the book
– Free sample chapters
– Bonus material

These aren’t quite the same methods for causal inference that I’m inclined to use (for my own approach, see chapters 18-21 of Regression and Other Stories), but their presentation is clear and has code, and it’s always good to see another perspective.

What is the relation between interactions in a regression model and correlations among the predictors?

Posted on June 6, 2026 9:31 AM by Andrew

I’ve often seen confusion between interactions in a regression model and correlations among the predictors. To keep it simple, consider the model y = b0 + b1*x1 + b2*x2 + b3*x1*x2 + error, and assume the predictors have been signed so that both b1 and b2 are positive. Then b3 represents the interaction. This has nothing to do with the joint distribution of x1 and x2 in the data, or in the population. (For simplicity, assume the data to which the model is being fit are a random sample from the population of interest.)

The interaction depends on the model of y given x1 and x2, while the correlation depends on the model for x1 and x2. These are two completely different parts of the model. And yet, they often seem connected.
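
A small simulation makes the separation concrete. This is a sketch with made-up coefficients and a hypothetical helper `fit_b3`: the same interaction b3 = 0.5 is used whether the predictors are uncorrelated or strongly correlated, and a least-squares fit should recover it in both cases. It says nothing about whether interactions and correlations tend to go together in real data.

```r
set.seed(1)

# Simulate y = b0 + b1*x1 + b2*x2 + b3*x1*x2 + error with predictors whose
# correlation is rho, then return the estimated interaction coefficient.
fit_b3 <- function(rho, n = 1e4, b = c(1, 2, 3, 0.5)) {
  x1 <- rnorm(n)
  x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)   # cor(x1, x2) is about rho
  y  <- b[1] + b[2] * x1 + b[3] * x2 + b[4] * x1 * x2 + rnorm(n)
  unname(coef(lm(y ~ x1 * x2))["x1:x2"])
}

b3_hat <- sapply(c(uncorrelated = 0, correlated = 0.8), fit_b3)
round(b3_hat, 2)
```

The correlation enters only through how x1 and x2 are generated; the interaction enters only through the line that generates y.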

I have the general impression that I’d be more likely to expect a positive interaction of x1 and x2 when predicting y, if x1 and x2 are positively correlated in the population.

For example, when predicting income from height and sex, being taller and being male both predict higher income, and they also interact–the coefficient for height is higher for men than for women–and of course the two predictors, height and male, are positively correlated in the population.

I’m not sure how to think about this connection or even whether it’s a real pattern! But there might be something there so I wanted to share it with you.

The issue of interactions comes up in the context of the concept of intersectionality, which is a form of interaction that comes up in sociology. It started for me with this email from Elin Waring:

I’ve been working on data on intersectionality and retention of students in STEM majors. My little group is specifically looking at data from Lehman College and trying to model graduation with a STEM degree. There are a lot of details, but basically we have come to the conclusion that the right way to describe this is with a discrete time competing risk model (the competing risks being graduation with a STEM degree and graduation with a non-STEM degree). I won’t go into all the details. We have data for between 1 and 20 semesters enrolled for students starting as freshmen. For us, intersectional identity is defined by 5 variables that yield 32 distinct combinations or strata as used in the next articles.

In trying to think about how to account for intersectional identities we came across the “MAIHDA Method.” I was wondering if you had seen this discussion before or have any thoughts about it.

Evans, Clare R., George Leckie, and Juan Merlo. 2020. “Multilevel versus Single-Level Regression for the Analysis of Multilevel Information: The Case of Quantitative Intersectional Analysis.” Social Science & Medicine (1982) 245:112499. doi:10.1016/j.socscimed.2019.112499.

They essentially argue for treating the strata as random effects in a multilevel model where the individual components of the combinations are introduced as fixed effects describing the combinations.

The next article criticizes that approach and argues for fixed effects all around.

Wilkes, Rima, and Aryan Karimi. 2024. “What Does the MAIHDA Method Explain?” Social Science & Medicine 345:116495. doi:10.1016/j.socscimed.2023.116495.

Responded to here:

Evans, Clare R., Luisa N. Borrell, Andrew Bell, Daniel Holman, S. V. Subramanian, and George Leckie. 2024. “Clarifications on the Intersectional MAIHDA Approach: A Conceptual Guide and Response to Wilkes and Karimi (2024).” Social Science & Medicine 350:116898. doi:10.1016/j.socscimed.2024.116898.

I was wondering if you have any thoughts about this? For me, intersectionality as a theoretical approach does mean that it makes sense to look at the strata rather than thinking of the strata as just the most complex level of creating statistical models of the intersection of the variables. But then it seems as though treating this as a random effect more or less undermines its centrality to the theory. And is treating both the strata and the individual characteristics as variables at the same level basically a way to decompose?

In the end, I feel like the pro-MAIHDA people retreat to “we are just descriptive” in a way that isn’t very helpful. That said, they are right that this seems to have some traction in the world of health disparity research.

I replied that I’d never heard of this method before. I couldn’t actually muster the energy to read the above articles, as all this debate seems to be missing the key issues. I don’t really care if something is called a fixed effect or a random effect (see here); my current preferred way of thinking of these problems is by framing them as a generative model.

Regarding intersectionality, the natural way I would see it is that this would show up as an interaction term, the idea that the interaction is more than the sum of its parts? For a simple example, if there are 5 binary variables and each has the same effect on its own (which they wouldn’t, this is just a simple hypothetical example), then you could create a variable which is the total number of identities, thus a number from 0 to 5, and “intersectionality” would show up as a super-linear or convex relation between the outcome and this total predictor?

Waring responded:

Sure, but the idea you suggested about intersectionality itself isn’t right. You can’t just sum the number of identities, everyone has identities and the idea is that it is not just about concentrated disadvantage of having all or some specific identities. If we have 5 dichotomous identity/group variables everyone has 5 dimensions of identity. Intersectionality is about the idea that something like “white, native born, woman, high income” shapes what happens because of how those come together to shape (in the case of my analysis) whether, as an undergraduate, you persist in STEM fields.

I replied as follows:

Yes, I was actually thinking this when I wrote that! I was imagining that each of the 5 factors has an “off” and “on” setting, and intersectionality kicks in when there are multiple “on” settings, where “on” represents the group that faces more difficulty (nonwhite, non-native born, female, low income, gender nonconformist, etc.). Once you allow arbitrary possibilities for intersectionality, then my simple superadditive model wouldn’t fit. On the other hand, if you were to allow all 32 possibilities to take on any value, then realistically you would not be able to estimate anything much at all: this is the usual problem in sociology of approximating a complex social structure by a simple model that explains most of the variance. For predicting persistence in STEM (or any academic field), one possible factor that could enter in a complicated way is conservative political ideology, in that for many attitudes and behavior its predictive effect goes in the opposite of the “on” categories listed above, but grad students, in STEM and other fields, are predominantly politically on the left. I could well imagine that conservative political ideology, like the other “on” categories, is predictive of not persisting in STEM but that this could interact in unexpected ways with those other categories.

From a statistical perspective, my main message is to choose such a model based on its explanatory power, while recognizing that it’s an approximation, rather than using methods such as statistical significance or Bayes factors which in different ways are driven by sample size, as we discussed in this 1995 paper.

Another interesting statistical feature of this and similar discussions is that it’s natural for the discussion to go back and forth between the correlation between two predictors in the data (or the population) and the interaction between their predictive effects, as discussed at the top of this post.

I’m not sure if this interaction thing is a general pattern that has some statistical explanation, or just a faulty intuition of mine based on just a couple of special cases. But I have noticed a general confusion that when people talk about interactions, often they seem to be talking about correlation between the predictors.

Epidemiologist Donna Spiegelman sez: SUTVA is “mostly not necessary for valid causal estimation and inference most of the time”

Posted on June 3, 2026 9:12 AM by Andrew

Donna Spiegelman shares this presentation she gave at the recent American Causal Inference Conference. I like what she has to say.

Here are the two parts of the stable unit treatment value assumption (SUTVA):

1. No interference between units. As Spiegelman says, nowadays it’s not hard to model spillovers. As I say, untangling spillovers is an ill-posed inverse problem that can be solved using Bayesian inference with reasonable priors. Serious practical work has moved past the demonstrate-that-spillover-doesn’t-matter stage to the just-model-the-spillover-directly stage.

2. Deterministic potential outcomes. As Spiegelman says, in the real world, outcomes are stochastic. Jonas and I talk about this in our Russian roulette paper.

The part that I’m less sure about is Spiegelman’s claim that adjustments for pre-treatment variables usually don’t matter. I’m persuaded that they usually don’t matter in the epidemiology and biostatistics applications she’s worked on, but I think that in social science, such adjustments can be important. Especially if there are big treatment interactions and your population is a lot different from your sample.

In any case, I recommend you look through Spiegelman’s slides, as she offers a refreshing perspective compared to our usual obsessive focus on the details of causal identification.

Survey Statistics: GREG

Posted on May 19, 2026 5:33 PM by shira

I just got to chat with Andrew and some of the authors of the MrPlew paper: Ryan Giordano, Erin Hartman, and Avi Feller. Lots more I have to digest here ! The paper came out while the polar bear and I were crossing from TN into VA.

We talked about using a model for response R, a model for outcome Y, or both. So GREG came up, and Andrew asked “what’s GREG ?” Good question.

GREG is the Generalized REGression estimator. Särndal, Swensson, Wretman (1992) has a nice section that writes it in a few alternative ways:

1. Adjust an estimate based on the model with a Horvitz-Thompson estimate of the error:

2. Or on the flip side, you can see it as adjusting the Horvitz-Thompson estimate with the model:
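
In standard textbook notation (an assumption here, with π_k the inclusion probability of unit k, U the population, s the sample, and ŷ_k = x_k'B the prediction from a linear working model), form 1 is Σ_U ŷ_k + Σ_s (y_k − ŷ_k)/π_k and form 2 is Σ_s y_k/π_k + (Σ_U x_k − Σ_s x_k/π_k)'B. The sketch below checks on simulated data that the two give the same number. The population, sample design, and variable names are all made up for illustration.

```r
set.seed(2)

# Hypothetical finite population with one auxiliary variable x known for everyone.
N <- 1000
x <- runif(N, 0, 10)
y <- 3 + 2 * x + rnorm(N)

# Simple random sample without replacement: inclusion probability n / N.
n    <- 100
s    <- sample(N, n)
pi_s <- rep(n / N, n)
d_s  <- 1 / pi_s                      # design weights

# Linear working model fitted on the sample with design weights.
samp   <- data.frame(x = x[s], y = y[s], d = d_s)
fit    <- lm(y ~ x, data = samp, weights = d)
yhat_U <- predict(fit, newdata = data.frame(x = x))  # predictions for all N units
yhat_s <- yhat_U[s]

# Form 1: model-based total plus Horvitz-Thompson estimate of the residual total.
greg_1 <- sum(yhat_U) + sum(d_s * (y[s] - yhat_s))

# Form 2: Horvitz-Thompson total plus a regression adjustment for x.
X_U    <- cbind(1, x)                 # includes the intercept column
ht_y   <- sum(d_s * y[s])
ht_X   <- colSums(d_s * X_U[s, ])
greg_2 <- ht_y + sum((colSums(X_U) - ht_X) * coef(fit))

c(form_1 = greg_1, form_2 = greg_2, truth = sum(y))
```

The two forms are algebraically identical for any coefficient vector B, so the agreement does not depend on how the working model was fitted.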

It’s called GREG for Generalized REGression estimator, but what is being generalized ?

Lumley 2010 made me think we were generalizing to continuous X variables.

Sharon Lohr’s book made me think we were generalizing beyond simple random samples:

Sampling Design and Analysis: Third Edition — Sharon Lohr

Särndal, Swensson, Wretman (1992) made me think we were generalizing to multiple X variables:

Model Assisted Survey Sampling (Springer Series in Statistics), ISBN 9780387406206, by Carl-Erik Särndal, Bengt Swensson, and Jan Wretman

Regardless of the exact origin of the name, GREG has connections to the Doubly Robust literature in causal inference (as Coston et al. (2020) note in a footnote). Any favorite references making these connections ?

Recent discoveries on the acquisition of the highest levels of statistical fallacies

Posted on May 13, 2026 9:09 AM by Andrew

Mark Goldstein points us to this post by Alex Dimakis, who writes:

A paper was recently published in Science on highest level of human performance across athletics, science, math and music. I think the paper makes some classical statistics mistakes that still fool many smart people. The paper “Recent discoveries on the acquisition of the highest levels of human performance” by Gullich et al. claims: “In summary, when comparing performers across the highest levels of achievement, the evidence suggests that eventual peak performance is negatively associated with early performance.”

The paper makes two mistakes. Base-rate fallacy and . . . Berkson’s paradox . . .

The study says simply that the very top at young age are not identical with the very top adults. (As one would expect, since there are *many many more non-elite young candidates*). Still, elite young performers are 40 times more likely to be in the top adults compared to the general population. This is acknowledged in the paper but on pages 6-7, a bit buried in the technical analysis and not sufficiently discussed in abstract or conclusions. . . .

The paper claims “Across the highest adult performance levels, peak performance is negatively correlated with early performance.” This is a classic example of Berkson’s paradox. Here is a simplified example to understand this: Assume that to be a successful actor you have to be either extremely good looking or extremely talented. Assume also that talent and looks are independent in the population. However, among successful actors you will observe a negative correlation between looks and talent. This doesn’t mean anything beyond the selection process and should not be extrapolated. My favorite example-joke of this is that basketball points scored is negatively associated with height among NBA players. (because to be an NBA player you have to be very tall OR be very good at scoring). From this, I extrapolated that since I’m 5’7, I will be scoring 80+ points per NBA game. . . .

Here’s the paper in question, “Recent discoveries on the acquisition of the highest levels of human performance.”

Yeah, this sort of thing comes up all the time! For example, some celebrity academics a couple years ago wrote a book that included the false statement, “while correlation does not imply causation, causation does imply correlation.” Even more amusingly, they prefaced this by “We must, however, remember that”. I guess we must remember a lot of false things! Economist Rachael Meager gave a quick example showing why they were wrong; see details here.

This new example also looks a lot like the well-known regression-to-the-mean fallacy (for more on that, I recommend Section 6.5 of our book, Regression and Other Stories, which includes some simulation code to demonstrate the problem). Of course, just because lots of people know about a fallacy, that doesn’t stop people from making the error in new settings. That’s why it’s a fallacy!

P.S. An anonymous commenter points out that Dimakis (and, by extension, Goldstein and I) are being unfair to this paper. The descriptive results are what they are. I remain skeptical of the paper’s claim that “similar developmental pattern across different domains suggests widespread, and possibly universal, principles underlying the acquisition of the highest levels of achievement,” as I do suspect that much of what they have seen arises from the usual statistical selection artifacts. So maybe it’s ok to caution about the interpretation of these numbers. But now I’m thinking it wasn’t fair of us to slam the paper for presenting some interesting data findings.

The Application Matters: Medical Ethics and Counterfactual Utilities

Posted on May 12, 2026 2:00 PM by Jonas Mikhaeil

I believe, as applied statisticians, we need to get our hands dirty and immerse ourselves in the applications we try to address. This post is mostly about medical ethics and the famous “first, do no harm” principle. It is also an attempt to understand how statistics can serve medical practice. The motivation for this comes from a recent debate in the statistics literature about counterfactual losses, which often invokes this “first, do no harm” principle as a motivation. Much has been written about the theory of these counterfactual losses, and I’m sure they will find a fruitful application, but do they actually speak to the challenge of medical decision-making that the “first, do no harm” principle seeks to address?

I will argue that they cannot, because this principle is concerned with medicine at its most human: medical practice centered on the relationship between an individual patient and an individual physician. But what can statistics help with? Modern medical obligations acknowledge that medicine is embedded in society; they highlight medical practitioners’ concern with justice and with reducing health disparities. These are concerns statistics can help to address.

But let me start at the beginning. There’s a recent literature that considers decision making under counterfactual loss: what if the utility of your decisions depends not only on the realized outcome but also on what could have been, on a counterfactual? A paradigmatic example is the following “first, do no harm” utility: Suppose you’re administering a drug and there are only two extreme outcomes. The patient either lives or dies. The literature (e.g., Bordley, 2009; Ben-Michael et al., 2023; Christy and Kowalski, 2026) has interpreted the medical aphorism “first, do no harm” as requiring a utility function that assigns asymmetric weights to saving a life and causing a patient’s death. The disutility from killing a patient who, counterfactually, would have survived outweighs the positive utility of saving a patient who otherwise would have died. Although this may initially seem attractive, several authors have pointed out complications that arise when decisions are based on such counterfactual losses (e.g., Dawid and Senn, 2023; Sarvet and Stensrud, 2023).

Andrew and I contributed to this literature with a small example that seemingly produces a counterintuitive recommendation, which I discuss below.

In response, Koch and co-authors write:

[T]his seemingly nonsensical result can be reasonable in a different setting. […] It may be reasonable for a physician to prefer standard care, prioritizing the avoidance of adverse counterfactual outcomes over improvements in expected benefits. Indeed, such a decision reflects the Hippocratic principle of “do no harm”. […] This example underscores the fact that a utility function represents the preferences of the decision-maker and is therefore inherently subjective and context-dependent.

This uncovers a problem with our argument based on intuition: see, this decision doesn’t make sense, does it? Intuition, of course, can be misleading. One way our example might be misleading, as Koch et al. point out, is that it may describe a setting in which we simply do not hold these counterfactual utilities. If we were to transplant the same recommendation into an appropriate setting, it might no longer appear nonsensical and might instead conform to how we think we should behave.

This has me very excited. I believe statistics is at its best when it takes its applications seriously. So, in this blog post, I want to do just that.

I will briefly give the example Andrew and I came up with to show that a “do no harm” utility can lead to counterintuitive decision recommendations. We do so through an example involving Russian roulette. It is a useful example, but by no means an accurate representation of what we would consider plausible in real medical settings. What it does show, however, is that we need to be really careful with these “do no harm” utilities: if we don’t really hold them, they may lead to nonsensical decisions.

Taking the application seriously, we will dive into medical ethics to ask whether the proposed counterfactual “do no harm” utilities help with medical decisions. We do so by briefly examining the origin and history of the “first, do no harm” principle. We will see that “do no harm” is perhaps best understood in the context of a professional ethic that commits physicians to the rules of their craft and to respect for each individual patient. Statistics cannot truly speak to this individual-level patient-physician relationship. Since the Hippocratic Oath, however, medicine has changed substantially. With the advent of scientific methods in clinical medicine, doctors face new moral obligations not captured by the “do no harm” principle. Some of these new obligations arise from the relationship between medicine and society; others arise from the use of scientific methods themselves. We will look at modern medical oaths to get a glimpse of these new obligations and of how statistics can help fulfill them.

Russian Roulette

As a starting point, let me present our simple and somewhat morbid example in which counterfactual utilities give a counterintuitive decision recommendation: Imagine we are choosing between two games of Russian roulette. In the first game, the status quo, we play with a six-chamber gun, one chamber of which is loaded. That is, we face a one-in-six chance of death. We are then offered the option to switch to a seven-chamber gun, the new alternative “treatment.” If we switch, we face better odds: only a one-in-seven chance of dying. By switching games, we lower our probability of death, which to me seems preferable.

What would the counterfactual “do no harm” utility function recommend? To figure this out, we treat the outcomes under either game of Russian roulette as (independent) potential outcomes and divide the population of players into four principal strata based on survival status. Only two of the principal strata are relevant for our decision, those in which a player would survive one game but die playing the other. It’s easy to work out that with probability 6/42 switching to the new gun saves you: you would die under the status quo but survive under the treatment. But with probability 5/42, you would have survived under the status quo, but switching to the new gun, you will die. Suppose we interpret “first, do no harm” as mandating that the negative repercussions of our treatment choice, the death of a player, outweigh the benefits of saving a life. For example, suppose saving a life has utility +1, while the death of a player has utility −2. Then the 6/42 chance that the treatment saves you is outweighed by the 5/42 chance that the treatment kills you in cases where, counterfactually, you would have lived.

Under this counterfactual utility, we ought not to switch. It recommends we stick to the status quo, under which we face a higher chance of death. This strikes me as a counterintuitive decision recommendation.
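The arithmetic behind this recommendation can be written out in a few lines of R. This is a minimal sketch: the probabilities and the +1 and −2 utilities come from the example above, while the variable names and the symmetric comparison at the end are additions for illustration.

```r
# Expected "do no harm" utility of switching from the six-chamber gun
# (status quo) to the seven-chamber gun (treatment), treating the two
# games as independent potential outcomes, as in the example.
p_die_status_quo <- 1 / 6
p_die_treatment  <- 1 / 7

# The two principal strata that matter for the decision
p_saved  <- p_die_status_quo * (1 - p_die_treatment)   # 6/42: die under status quo, live under treatment
p_harmed <- (1 - p_die_status_quo) * p_die_treatment   # 5/42: live under status quo, die under treatment

u_save <- 1     # utility of saving a life (from the example)
u_harm <- -2    # utility of causing a death (from the example)

expected_utility_switch <- u_save * p_saved + u_harm * p_harmed   # -4/42

# For comparison only (an assumption, not part of the example):
# with symmetric weights, switching has positive expected utility.
expected_utility_symmetric <- p_saved - p_harmed                  # 1/42

print(c(asymmetric = expected_utility_switch,
        symmetric  = expected_utility_symmetric))
```

With these numbers the asymmetric utility of switching is −4/42, so the status quo is preferred, whereas symmetric weights give +1/42. More generally, under this setup the switch is rejected whenever a death caused is weighted more than 6/5 times as heavily as a life saved.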

The “First, do no harm” Principle

There is, however, a limit to the force of this argument based on intuition. One might argue that the recommendation in the Russian roulette example is not evidence against counterfactual utilities in general, but rather an indication that, when playing Russian roulette, we do not hold utilities of this kind. When transplanted to a setting where we have such asymmetric counterfactual utilities, the same recommendation might be sensible. The counterfactual-utility literature often motivates asymmetric counterfactual utilities by appealing to the “first, do no harm” principle in medicine.

For the rest of this post, I will discuss whether counterfactual utilities are useful in this paradigmatic application: medical decision-making.

In a paper frequently cited by advocates of counterfactual utilities, Cedric Smith (2005) discusses the origin and limitations of the “first, do no harm” principle. It is actually not part of the Hippocratic Oath, or the wider Hippocratic corpus, as is often implied, but has somewhat nebulous roots. Smith traces its origin to the seventeenth-century English physician Thomas Sydenham. While undoubtedly catchy, this principle is not embedded in a larger ethical framework that would give guidance on its interpretation or justifications for its use.

This is a problem because, taken literally, this “first, do no harm” principle is a poor guide to medical decision-making. Let me cite Louis Lasagna, an American physician of the last century who was very involved in rethinking the Hippocratic Oath:

“To observe this advice [first, do no harm] literally is to deny important therapy to everyone, since only inert nostrums [quack medicine without active pharmaceutical ingredients] can be guaranteed to do no harm. It is more reasonable to ask doctors to balance the potential gains against the possible harm; would that we could only quantify these probabilities more precisely!” (Lasagna cited in Smith, 2005)

A call to action for us statisticians if I ever saw one. Of course, the counterfactual-utility literature that cites this principle is not advocating what Lasagna warns against: doing absolutely no harm. Its proponents are well aware that benefits and risks must be carefully weighed against each other. If the principle is not meant to be taken literally, then its obscure origin becomes a problem: it gives us little insight into what actually matters to medical practitioners, because it is disconnected from any wider tradition that would help us interpret it.

Luckily, we can find a similar, more nuanced statement in the Hippocratic corpus (Epidemics I):

“Declare the past, recognize the present, foretell the future: attend to these things. As to diseases, make a habit of two things—to help, or at least to do no harm. The art has three factors, the disease, the patient, the physician. The physician is the servant of the art.”

The Greek word here is technē (orig. τέχνη) which we might also want to translate as “craft”. Medicine is a craft because the decisions a physician has to face cannot be made by rote application of knowledge. As a craftsperson, the physician as an individual becomes relevant. That is why the Hippocratic Oath commits the physician, as an individual, to be benevolent in each patient interaction. Medical ethics based on the Hippocratic Oath is not focused on outcomes, let alone utility, but concerned with the character of the physician and their obligations toward their patient (Pellegrino, 2006). It centers the patient-physician relationship.

With this background in mind, we can understand why the “benevolence” implied in the imperative to help is qualified with the phrase “or at least do no harm.” After all, if I’m already committed to help, it may seem that I’m already committed to do no harm. Lynn Jansen (2022) argues that this is where the professional aspect of medicine enters: As a professional, the physician needs to restrict their actions to those that align with their profession. That is, while they strive for benevolence in the sense of furthering the patient’s overall well-being, they reject all courses of action that would harm the patient’s medical well-being. This second aspect is often called non-maleficence.

Statistics and Medicine

In modern medicine, this tension is heightened. Taking the patient’s moral agency seriously, a physician must be careful not to “confuse technical with moral authority” (Pellegrino, 2006) or override patients’ values. This is worth keeping in mind. The patient must be involved in weighing benefits and risks. Thus, the medical professional does not have sole discretion to choose an optimal treatment. “Help, or at least do no harm” is a professional mantra that guides a physician in their interactions with patients. It is not a constraint on optimal decision-making; it is a moral commitment to respect each patient.

This conception of medicine is in stark contrast to the world seen through the lens of statistics. Compare this focus on the individuality of both patient and physician with the following quotation from an 1835 report to the Academy of Sciences, written by a committee of four mathematicians, including Poisson, on operations for gallstones:

“In statistical affairs … the first care before all else is to lose sight of the man taken in isolation in order to consider him only as a fraction of the species. It is necessary to strip him of his individuality to arrive at the elimination of all accidental effects that individuality can introduce into the question.” (taken from Hacking, 1990)

Statistics’ power lies in constructing aggregates, making disparate things hold together (Desrosières, 1998). Historically, these aggregates were useful for the emerging nation-state and were quickly adopted to address large-scale social problems, such as public health. Many professions, including medicine, strongly resisted losing sight of the particular (in our case, the individual patient) in favor of aggregates. Even randomized experiments, which we nowadays all too easily accept as the gold standard of evidence, had a hard time entering clinical medicine (Porter, 2020).

Due to this tension, modern medicine has a dual nature. On the one hand, doctors are still committed to treating their patients as individuals: medicine is the art of healing. Yet with advances in scientific methods within medicine, and with the recognition that health must be understood in the context of society, doctors face new moral obligations (Pellegrino, 2006).

Modern Medical Oaths

To get a glimpse of these new obligations and the self-understanding of doctors in the twenty-first century, we can look to modern versions of medical oaths. While many doctors still take the ancient Hippocratic Oath, many medical schools revise the original text or students take an additional self-formulated oath. In 2005, for example, students at Weill Cornell Medical College began taking a revised Hippocratic Oath. Let me highlight a brief excerpt:

I vow […]

That above all else I will serve the highest interests of my patients through the practice of my science and my art; That I will be an advocate for patients in need and strive for justice in the care of the sick.

Notice the emphasis on justice; it’s not idiosyncratic to this oath. Two further examples show similar themes. The University of Pittsburgh School of Medicine’s class of 2024 took an oath that highlighted the social determinants of health and advocated for a more equitable health care system. Harvard Medical School’s class of 2019 vowed to combat structural oppression and promote social justice. In this admittedly selective set of examples, much emphasis is placed on how medicine relates to society. Core commitments are justice and the building of an equitable health care system.

So, how can we statisticians help modern medical practice? Modern medical ethics places great emphasis on patients’ autonomy and their freedom to choose based on their own values. For a patient’s decision to be well informed, deliberation about benefits and risks is central — but the decision ultimately depends on a personal tradeoff shaped by the patient’s values. For this reason, our goal should perhaps not be to optimize treatment decisions. We do need to help estimate the benefits and risks of treatments more accurately, but treatment decisions remain part of the individual patient-physician relationship. Instead, we should put more emphasis on identifying and reducing disparities in the health care system, focusing on medicine as embedded in society. The most important task may not be deciding which drug to administer, but reducing inequalities in access to treatment in the first place. I believe statistics has an important role to play in making health care systems more equitable and more just.

“An Axiomatic Foundation for Decisions with Counterfactual Utility”

Posted on May 8, 2026 6:19 PM by Andrew

Benedikt Koch, Kosuke Imai, and Tomasz Strzalecki write:

Counterfactual utilities evaluate decisions not only by the realized outcome under a given decision, but also by the counterfactual outcomes that would arise under alternative decisions. By generalizing standard utility frameworks, they allow decision-makers to encode asymmetric criteria, such as avoiding harm and anticipating regret. Recent work, however, has raised fundamental concerns about the coherence and transitivity of counterfactual utilities. We address these concerns by extending the von Neumann-Morgenstern (vNM) framework to preferences defined on the extended space of all potential outcomes rather than realized outcomes alone. We show that expected counterfactual utility satisfies the vNM axioms on this extended domain, thereby admitting a coherent preference representation. We further examine how counterfactual preferences map onto the realized outcome space through menu-dependent and context-dependent projections. This axiomatic framework reconciles apparent inconsistencies highlighted by the Russian roulette example in the statistics literature and resolves the well-known Allais paradox from behavioral economics. We also derive an additional axiom required to reduce counterfactual utilities to standard utilities on the same potential outcome space, and establish an axiomatic foundation for additive counterfactual utilities, which satisfy a necessary and sufficient condition for point identification. Finally, we show that our results hold regardless of whether individual potential outcomes are deterministic or stochastic.

I have to admit that I don’t see the appeal of utility functions based on counterfactuals. For example, I’ve never thought that the decision-theoretic concept of “regret” makes sense. That said, I know that a lot of people are interested in the topic, so I hope the above paper is useful to people in clearing up these issues, and I’m glad that they were able to use our Russian roulette example.

An economist writes: “the fulminations over the #1 pick seem overheated to me.”

Posted on May 3, 2026 9:21 AM by Andrew

Jonathan Falk writes:

I [Falk] am always amazed at the amount of (digital) ink spilled on the perverse incentives involved in tanking to get the #1 draft pick. The current local woes of the Giants and Jets obviously contribute a lot to these discussions, but they happen all the time. As an economist, it’s clear to me that the value of a draft pick is the incremental value, not the absolute value. I’m completely aware that the upper tails of distributions have much more dispersion than the center, or even the 80th-90th percentile does, but the fulminations over the #1 pick still seem overheated to me.

First, of course, is the fact that assessment is made with error, and there are plenty of #1 busts in every sport. #2s can be busts as well, of course, but that merely lowers the expected difference between #1 and #2 as the true value of both is attenuated towards 0 — #1 loses more.

Second, there is the issue of team fit. Greatness is a vector, not a number, and if the teams ahead of you in draft order need something else, you still stand a chance of getting the player optimized for your needs. Going the other way, of course, is that higher draft picks absolutely lower the number of teams that can steal your guy.

Third, teams are… teams. One person can only contribute so much. So the relevant assessment is not how much better A is than B, but how much the addition of A versus the addition of B will change the prospects of your team, which I think is pretty obviously a lower difference, though I guess your rationale for voting runs in the other direction: you ought to judge a small incremental addition by the gigantic difference between winning a championship or not.

Fourth, more narrowly economic, every incrementally higher pick costs more. I don’t think that effect is huge in the context of overall payrolls, but isn’t that then another anomaly? If #1 picks are so dramatically better than, say, #5 picks, why aren’t they paid multiples more?

I don’t really have anything to say here, because I have no sense of how much teams are paying for #1 or #2 picks. I do remember a couple years ago that everyone was talking about Wemby, but basketball’s different than football because there are only 5 players on the court, so one player can make more of a difference.

The case of Wemby makes me think that one way this could be studied would be to compare different years. In some years there is a clear consensus #1 pick, other years not.

John Carlin says, “‘Identifying variables that independently predict…’ is not a well-defined research task”

Posted on May 2, 2026 9:29 AM by Andrew

John “Bayesian Data Analysis” Carlin writes:

Recent developments in the methodology of epidemiological research have emphasized the importance of achieving clarity of purpose by classifying research questions into one of three types: descriptive, predictive, and causal. . . .

I [Carlin] do not believe that studies aiming to “identify” independent predictors or “prognostic factors” are addressing well-defined research questions. Indeed, beyond the issues already raised, there is a broader question of the extent to which it is ever sensible to frame a research question as if it could be answered dichotomously, as in “is this an (independent) prognostic factor?” Prediction questions, which include prognosis, are those that involve the development of a model or algorithm to provide predictions of outcomes using available variables that are potential predictors.

This all makes sense. I kinda think that descriptive, predictive, and causal are all the same thing–or, more precisely, that “descriptive” and “causal” are special cases of “predictive,” under different conditions. But if you want to divide them into three tasks, sure, go for it. Personally, I’d rather divide statistics into the goals of exploration, estimation, and discrimination, but I think that’s because I’m thinking in a more general “data science” perspective, whereas John is focusing more on the more traditional problem of inference.

But, yes, I agree with him 100% on avoiding dichotomization, a topic that Sander Greenland, I, and others have been screaming about for a long time–indeed, John and I contributed to the anti-dichotomization theme in our book Bayesian Data Analysis, in that we focused on model building and inference within a model, rather than on the then-fashionable problem of choosing among or comparing models using Bayes factors. So, yes on that.

John continues:

Some variables may have greater predictive value than others, but this should be assessed by comparing the predictive value of the model or algorithm with and without the use of that variable, not by examining its “independent effect” in a multivariable regression model.

I’m confused on this point. I mean, sure, I agree that you shouldn’t label a regression coefficient as an “independent effect”; indeed, I always use the terms “predictors” and “outcome” rather than “independent and dependent variables.” Beyond this, I’m not quite sure what John is suggesting. Suppose you have a predictor of interest, x3, and you’ve fit the model y ~ x1 + x2 + x3 (for convenience using standard R notation). I guess John is saying, don’t just look at the coefficient for x3 in that model; also compare it to the model y ~ x1 + x2. Maybe this is a good idea–it’s not something I’ve thought about for a while. Is this the same as what used to be called “partial regression coefficients”? I remember from the statistical literature in the 1960s and 1970s that there was a lot of work on methods for understanding what happens in linear regression when you add one variable at a time. Perhaps it would be good to revisit some of those ideas, and maybe it’s a mistake that we don’t cover them in Regression and Other Stories.
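Here is a minimal sketch of one possible reading of that comparison, using the R notation above: fit y ~ x1 + x2 and y ~ x1 + x2 + x3 and compare their out-of-sample predictive error, alongside the coefficient for x3. Everything in it is an assumption for illustration: the simulated data, the coefficient values, the choice of 10-fold cross-validated RMSE as the measure of predictive value, and the helper name cv_rmse. It is not taken from John’s article.

```r
# Simulated data (hypothetical): x3 is correlated with x1 and also
# has its own contribution to y.
set.seed(42)                       # arbitrary seed, for reproducibility
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- 0.5 * x1 + rnorm(n)
y  <- 1 + x1 + 0.5 * x2 + 0.5 * x3 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)

# Assign each row to one of k folds once, so both models are
# evaluated on exactly the same held-out sets.
k <- 10
folds <- sample(rep(seq_len(k), length.out = n))

# Cross-validated root mean squared error for a linear model.
# Assumes the outcome column in `data` is named y.
cv_rmse <- function(formula, data, folds) {
  sq_err <- numeric(nrow(data))
  for (i in unique(folds)) {
    held_out <- folds == i
    fit  <- lm(formula, data = data[!held_out, ])
    pred <- predict(fit, newdata = data[held_out, ])
    sq_err[held_out] <- (data$y[held_out] - pred)^2
  }
  sqrt(mean(sq_err))
}

rmse_without <- cv_rmse(y ~ x1 + x2, dat, folds)
rmse_with    <- cv_rmse(y ~ x1 + x2 + x3, dat, folds)
coef_x3      <- coef(lm(y ~ x1 + x2 + x3, data = dat))["x3"]

print(round(c(rmse_without = rmse_without,
              rmse_with    = rmse_with,
              coef_x3      = unname(coef_x3)), 3))
```

The two summaries answer different questions. The coefficient describes the comparison of x3 within the fitted model, holding x1 and x2 constant; the change in cross-validated error describes how much predictions improve when x3 is available. Because x3 is correlated with x1 here, part of its information is already carried by the smaller model.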

I also want to plug my paper with Guido Imbens (also included as Section 21.5 in Regression and Other Stories), Why ask why? Forward causal inference and reverse causal questions. Our point there is that it can be a good idea to search for prognostic factors in observational data, not with the idea this will identify causal effects but rather as a way of understanding what’s missing from our existing models.

Finally, John writes:

More broadly, debates on whether to “adjust” or not for certain variables in a regression model can only be answered by situating the analysis within a sharply defined research question and a sharply defined rationale for specifying a regression model in the first place.

I don’t get this at all. First, I don’t get why “adjust” is in scare quotes; second, ummm, yeah, it’s always good to have a sharply defined research question, but in the meantime people are always making comparisons, and so let’s do what adjusting we can. For example, in an epidemiology study it should pretty much always be a good idea to adjust for age and smoking history. Or maybe John would say that the rationale for adjusting for age and smoking history is sharply defined, in which case maybe we’re in agreement.

To put it another way, it’s often a good idea to have a sharply defined research question–but that applies in general, not just for statistical adjustments. I think it’s also true that it’s better to have a sharply defined research question when performing a randomized clinical trial. A randomized clinical trial gives identification for the sample average treatment effect in any case–but without a sharply defined research question, it’s not clear what can be done with such an estimate.

So I’m wary of John singling out adjustment in his criticisms, as I fear his article will be taken as implying that, if you don’t try to adjust, everything will be ok.

Two Health Economists Walk into a Bar: What bothered me in that conversation of Jay Bhattacharya and Emily Oster

Posted on April 28, 2026 9:49 AM by Andrew

Last week I was at a conference on enhancing scientific integrity (as I reported here), and one of the sessions was an interview of Jay Bhattacharya, the current director of the National Institutes of Health, and Emily Oster, a professor of economics at Brown University.

I referred to that session in a post the other day regarding the recent case of a report from the Centers for Disease Control and Prevention that was pulled by Bhattacharya, in his additional capacity as acting director of the CDC. I’ll get back to that story in a bit, but here I wanted to talk about some larger things that bothered me in the interview.

Before getting to my disagreements, let me give my positive take, which is that both the people in the interview had an air of moral seriousness.

This is important. So much of the discourse in politics and social science these days is polluted with cynicism, whether it be from history professor Niall Ferguson decrying the “wokeness” on college campuses when he’s not encouraging college students to do “oppo research” on each other, or Lawrence Summers sleazing around with a sex trafficker and then trying to enlist his rich friends to intimidate student journalists, or Cass Sunstein writing an entire book on a topic he knows nothing about, or Sunstein’s friend Adrian Vermeule promoting election denial, or Mehmet Oz and Andrew Huberman trading off their medical and scientific credentials to hawk dietary supplements, or Steven Levitt promoting dubious claims on mind-body healing and global warming denialism (presumably because they’re cool and transgressive, respectively), or Matthew Walker torturing the data, etc etc. I’m talking about researchers who see science as a path to glory, not to understanding, and politically-minded academics who will happily promote stupid ideas that push their agenda. Beyond that there are straight-up politicians who lie, cheat, and steal, and that’s bad too–but here I’m talking about that nexus between government, policy, and the human sciences.

Anyway, Bhattacharya and Oster weren’t like that. They recognize that we’re talking about serious issues here. When asked about disruptions to NIH funding, Bhattacharya emphasized the larger goal of improving public health, making the point that they want to fund a portfolio of projects to address health challenges. I have no sense of how things are run internally within NIH, so I’m not saying I agree or disagree with his particular administrative directions, but I appreciated that he kept his eye on the ball by emphasizing ultimate goals. For her part, Oster questioned Bhattacharya on a number of issues. She too gave the sense that this is a serious topic, not just a political game.

How to do better is another question! Last month Oster wrote positively about some silly dietary guidelines recently released by the FDA, and if you read her op-ed carefully she doesn’t actually seem to agree with most of those guidelines (the best thing she could say about them was that they were “not crazy”), so I take it that in writing that piece she was making a sort of persuasion calculation that the best way to be effective is to mix the criticism with a gallon of sugar. That’s not my style. So, Oster uses a different approach than I do, and I’m sure we’d have our differences in how to interpret statistical evidence. But, again, I think she’s engaging with moral seriousness.

And it’s possible to be morally serious while still having fun. Consider Nate Silver. Nate’s an entertaining writer–I try to be too!–and I’ve had my disagreements with him regarding statistics and communication, but I think he’s coming from a place of intellectual and moral seriousness that shows respect for the challenges of political analytics and the stakes involved. Indeed, sometimes when he’s disagreed with me, it’s on the implicit grounds that he’s making progress in understanding the real world, doing some analytical engineering that is outpacing the statistical theory. I still think there’s a benefit to interrogating the edge cases where our methods break down . . . anyway, my point is that I’m not just using the term “moral seriousness” to refer to things that I agree with. I’m talking about an attitude that I see in Bhattacharya, Oster, and Silver that I don’t see in, say, Niall Ferguson or Andrew Huberman.

Now, to return to our main thread, these are the parts of last week’s interview that bothered me:

1. When asked about some news reports regarding the NIH and CDC, Bhattacharya dismissed them as “fake news.” This annoyed me for two reasons. First, he offered no evidence that the reports were untrue. Second, he was appointed by a man who spews out false statements at an amazing rate, including on the topic of public health. Who are we supposed to trust here? News reports or a political appointee? Also, Bhattacharya himself has a record of being sloppy with the facts, as I happen to know because it happened to me.

Now, don’t get me wrong, I’m not saying that Bhattacharya was lying or misinformed regarding recent NIH and CDC policies. It could well be that the news items were erroneous or misleading–and, if so, I can see how Bhattacharya would be legitimately annoyed. And he should feel free to express his annoyance! But just dismissing the reports as “fake news” . . . that’s not a serious response.

As I wrote above, I appreciate that Bhattacharya treats the nation’s public health spending with the seriousness it deserves. As a statistician, I think information needs to be treated with respect as well. Which means he should be addressing serious news reports and, for that matter, respecting the institution of journalism. Which he wasn’t doing here.

2. When the topic of vaccines came up, Bhattacharya came out strongly in favor of vaccination, and he expressed the view that it is better for vaccination to be voluntary rather than mandatory. This could be. I guess it depends on the context. For almost all my life, childhood vaccines were mandatory, just about everybody got vaccinated, and just about nobody complained about it. So mandatory vaccination can work just fine–we have decades of experience on this one. The bad news is that in the past few years, vaccination has become politicized and anti-vax attitudes have become embedded in right-wing politics. So it could be that Bhattacharya is right and the mandates will have to go, and we’ll just have to accept more sick and dead kids and adults as the price to pay for this aspect of political dysfunction. I don’t know, but it could be, so I’m not going to criticize Bhattacharya for his hot take on this issue.

What bothered me was . . . if you are going to go with a voluntary vaccination strategy, I think you’d want a strong strategy of encouraging people to choose vaccination for themselves and their kids. So I think his response would’ve been stronger if he’d also said something about how to vigorously promote vaccine usage. That’s part of public health policy too. Also, Bhattacharya doesn’t have a great track record on this issue: just a few years ago he was part of an anti-vax organization. See here for the ugly story. OK, fine, everybody makes mistakes and has lapses in judgment. But then at least he should address that, in the past, he’s been part of the problem. To just say that you want vaccines to be optional but without addressing that history, that’s not right.

3. The un-publishing of that CDC report. Bhattacharya said he stopped the CDC from publishing the report because it was using an approach called a test-negative design, which he thinks is a bad statistical method. When he said this, Oster jumped in and said that she too thought it was a bad method. It was only a brief exchange and there was no time for either of them to give a reference or to explain why they think the method is bad. In the meantime, it seems that the report has been leaked; see here. One of the authors of the report said, “I’m strongly opposed to this kind of censorship . . . It should be out in the world at large for the scientific community to judge it for what it is.”

I think the best next step would be for the CDC to release the report officially, along with a critical response from a statistician explaining how the method is flawed. Bhattacharya said it was common knowledge that the method was terrible; on the other hand, it seems that this “test-negative design” is a standard approach for studying the effect of vaccines in the population after they have been released; see also here. So at the very least it would be a valuable educational opportunity to see this article that was on the verge of publication, and to understand its purported problems. Publishing the report along with a companion article discussing its problems, that could make sense. Canceling the report without explaining why (and, no, just saying you don’t like this method isn’t enough of an explanation) . . . that’s not serious science. Scientific integrity is not being advanced by this sort of behavior.

I was also upset that Oster just jumped into the discussion to say that she, too, hates the test-negative design. Neither Bhattacharya nor Oster is a statistician. They’re health economists. It’s fine for a health economist to have an opinion on a statistical method, but to be so sure about it doesn’t seem right to me. To the extent that Bhattacharya and Oster have legitimate concerns about the statistical method, they can work with a statistician to express these concerns openly and scientifically.

I’m not saying that statisticians or epidemiologists are always right or that other professionals should defer to them. Statisticians can be wrong, really wrong, and the errors can be compounded by a presumption that they know what they’re doing. So question these reports all you want. But that is the time to bring in an expert of your own, not to wing it.

Above I talked about moral seriousness regarding outcomes. There’s also moral seriousness regarding methods, and neither of the two people in that interview was displaying it. Also important is moral seriousness about communication, which has not been displayed by Bhattacharya, who has yet to come to grips with the fact that he was on the board of an anti-vax organization.

P.S. Dorothy Bishop provides a detailed discussion of this event.

Did Taylor Swift kill a bunch of people?

Posted on April 21, 2026 9:09 AM by Andrew

In a post entitled “FARCE: FARS Album Release Coincidence Examination,” Gaurav Sood writes:

Replication and extended analysis of Patel, Worsham, Liu & Jena (2026), “Smartphones, Online Music Streaming, and Traffic Fatalities,” NBER Working Paper 34866.

Key Findings

1. The Statistical Effect Is Real

Traffic fatalities are elevated on major album release days:

Estimator	Effect (Tier 1)	SE	t-stat
Local (±10 day)*	+23.0 deaths	5.1	4.5
Donut-global	+16.2 deaths	5.1	3.2
Forecast	+22.8 deaths	4.9	4.6

. . .

2. But The Causal Story Doesn’t Hold Up

No dose-response relationship:

Album	Streams	Effect
Tortured Poets (2024)	313M	-2 deaths
Her Loss (2022)	97M	+63 deaths
Midnights (2022)	185M	+5 deaths

. . .

Out-of-sample replication fails (2023-2024):

The paper analyzed 2017-2022 releases. We tested 7 major 2023-2024 albums as a true out-of-sample test:

Album	Streams	Effect
Tortured Poets	313M	-2.1
UTOPIA	128M	+10.5
For All The Dogs	109M	-12.8
Cowboy Carter	76M	-0.4
Hit Me Hard and Soft	73M	+7.0
SOS	68M	+9.4
One Thing at a Time	52M	-1.5

Average effect: +1.4 deaths (vs. +22.8 for original sample). The biggest streaming day in Spotify history (Tortured Poets, 313M) shows a negative effect. The pattern found in 2017-2022 does not replicate forward.
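The arithmetic behind that average is easy to check from the seven rows quoted above. Here is a minimal Python sketch; the variable names and the pearson_r helper are mine, not Sood’s, and the correlation is computed on these seven albums only, so there is no reason to expect it to match the dose-response figure reported for the larger sample.

```python
from statistics import mean

# (album, streams in millions, estimated effect in deaths),
# copied from the out-of-sample table quoted above.
albums = [
    ("Tortured Poets", 313, -2.1),
    ("UTOPIA", 128, 10.5),
    ("For All The Dogs", 109, -12.8),
    ("Cowboy Carter", 76, -0.4),
    ("Hit Me Hard and Soft", 73, 7.0),
    ("SOS", 68, 9.4),
    ("One Thing at a Time", 52, -1.5),
]
streams = [s for _, s, _ in albums]
effects = [e for _, _, e in albums]

# Unweighted average of the seven per-album effects.
avg_effect = mean(effects)


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


r = pearson_r(streams, effects)

print(f"average effect: {avg_effect:+.1f} deaths")  # average effect: +1.4 deaths
print(f"streams-vs-effect correlation (7 albums): {r:+.2f}")
```

With only seven points the correlation is very noisy, so the sign is the most one should read from it.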

Single outlier dominates: Her Loss accounts for 34% of the total Tier 1 effect.

3. Methodology Concerns

The ±10 day estimator uses post-treatment days as controls. The paper compares release-day fatalities to the average of the surrounding ±10 days—but this includes days after the release. Standard event studies use only pre-treatment periods. If the effect persists beyond day 0, the control mean is biased upward.
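To see which way this pushes the estimate, here is a noise-free toy calculation in Python. All the numbers are hypothetical (a flat baseline of 120 deaths per day and an effect that decays over the two days after release); they are not taken from the paper or from the replication. If the control mean is biased upward, the estimated release-day effect is biased downward.

```python
# Hypothetical numbers for illustration only; not from the paper or replication.
BASELINE = 120.0                            # deaths on an ordinary day
EFFECT_BY_DAY = {0: 20.0, 1: 10.0, 2: 5.0}  # release-day effect and assumed carryover


def deaths(day):
    """Noise-free daily deaths, with day 0 as the release day."""
    return BASELINE + EFFECT_BY_DAY.get(day, 0.0)


def control_mean(days):
    """Average deaths over a set of control days."""
    days = list(days)
    return sum(deaths(d) for d in days) / len(days)


pre_days = list(range(-10, 0))   # days -10 .. -1
post_days = list(range(1, 11))   # days +1 .. +10

# Symmetric window: controls include post-release days that still carry some effect.
symmetric_estimate = deaths(0) - control_mean(pre_days + post_days)

# Pre-only window: controls are untouched by the release.
pre_only_estimate = deaths(0) - control_mean(pre_days)

print(symmetric_estimate)  # 19.25
print(pre_only_estimate)   # 20.0
```

In this toy setup the symmetric window understates the true release-day effect of 20 by 0.75 deaths, so on its own this particular concern would make the reported effect conservative rather than inflated.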

What The Paper Claims

Patel et al. (2026) find:

139.1 deaths on release days vs 120.9 on control days (+18.2 deaths, +15%)

123.3M streams on release days vs 86.1M control (+43%)

Proposed mechanism: smartphone distraction from streaming while driving

What We Did

Analysis	Description
Extended data	FARS 2007-2024 (vs. 2017-2022)
Forecast estimator	Train model on non-release days, predict counterfactual
Dose-response	Test if more streams → more deaths
Extended sample	Added 2023-2024 albums (27 total vs. original 10)
Placebo tests	Pre-trends, year permutation, window sensitivity

Results Summary

Finding	Result	Interpretation
In-sample effect	+22.8 deaths/release	Statistically significant (2017-2022)
Out-of-sample	+1.4 deaths/release	Effect vanishes in 2023-2024
Dose-response	r = -0.18	Wrong sign for causal story
Her Loss outlier	34% of total effect	Results driven by one album
Tier 2 ratio	0.80 (expected 0.50)	Effect doesn’t scale with streams

My talk at Stanford later this month: “What to do when your estimate is 1 standard error away from 0?”

Posted on April 7, 2026 9:21 AM by Andrew

Tuesday 28 Apr 2026, 4pm in CoDa E160:

What to do when your estimate is 1 standard error away from 0?

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

We provide a new answer to this simple yet very important question. Thinking clearly about this problem leads us to bring in many ideas in statistical analysis and computing, including causal identification, meta-analysis, Mister P, expectation propagation, decision analysis, experimental design, and the fundamental unity of Bayesian and frequentist statistics. We demonstrate our approach in examples from many applications, including medicine, social science, business, sports, and public policy.

This work is joint with Witold Więcek and Erik van Zwet.

In addition to all the above, I’ll probably drift into some related general topics such as the role of experimentation in science and engineering and the limitations of thinking about policy analysis in terms of causal inference.

2026 American Causal Inference Conference

Posted on March 28, 2026 5:18 PM by Andrew

This one looks great.

Jonas Mikhaeil will be speaking at a session, “Controversies about Counterfactual Utilities,” which is related to our paper, Russian roulette: The need for stochastic potential outcomes when utilities depend on counterfactuals. The other speakers in that panel are Kosuke Imai, Mats Stensrud, Benedikt Koch, and Amanda Kowalski.

The sessions at the conference cover some of my favorite topics, including varying treatment effects, Bayesian inference, generalizability, latent variables, experimental design, network analysis, and causal discovery.

So, yeah, good stuff all around. This conference has been going for a while now–it was originally the Atlantic Causal Inference Conference and we held the first one here at Columbia in 2005, not long after we launched this blog which includes Causal Inference in its name. So it’s good to see this going stronger than ever, and keeping its balanced focus on theory, methods, and applications.

This year’s meeting is in Salt Lake City on 11-14 May 2026, so it’s not too late for you to register!

Survey Statistics: sampling-weighted loss

Posted on March 3, 2026 4:00 PM by shira

We’ve mostly focused on a population mean E(Y) as our quantity of interest. We saw how methods extend to estimating a subgroup mean E(Y | V=1), e.g. voters.

What about estimating a general conditional mean E(Y | X) ? We talked a lot (4 posts) about calibrating this to a known population mean E(Y), e.g. via the “logit shift”. But first we start with an estimate of E(Y | X) from survey data.
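A minimal sketch of what a sampling-weighted loss can look like, assuming squared-error loss and assuming each respondent i carries a weight w_i equal to the number of population units they represent. The data and the weighted_least_squares helper are hypothetical, written in Python with NumPy for illustration. Minimizing sum_i w_i (y_i - x_i'beta)^2 over the sample gives the same fit as ordinary least squares on the pseudo-population obtained by copying each respondent w_i times.

```python
import numpy as np


def weighted_least_squares(X, y, w):
    """Minimize sum_i w_i * (y_i - X_i @ beta)**2 over beta."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta


# Hypothetical sample: 4 respondents, each standing in for w_i population units.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 1.5, 4.0, 4.5])
w = np.array([10, 30, 40, 20])

X = np.column_stack([np.ones_like(x), x])  # intercept + slope

beta_weighted = weighted_least_squares(X, y, w)
beta_unweighted = weighted_least_squares(X, y, np.ones_like(x))

# Pseudo-population: repeat each respondent w_i times, then fit unweighted.
X_pop = np.repeat(X, w, axis=0)
y_pop = np.repeat(y, w)
beta_pop, *_ = np.linalg.lstsq(X_pop, y_pop, rcond=None)

print("weighted sample fit:   ", beta_weighted)
print("pseudo-population fit: ", beta_pop)
print("unweighted sample fit: ", beta_unweighted)
```

The weighted and pseudo-population fits agree, while the unweighted fit differs because the sample over- and under-represents different values of X. This only covers the point estimate; standard errors need the survey design and are not handled here.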

Lumley 2010 Section 5.2 says:

The polar bear has been going thru the pile of papers he was sitting on last week and found this:

Replace R (whether you respond to a survey) with T (whether you are treated) and you can see that my drawing is heavily inspired by Johansson et al. (2022) Figure 3:

We’ve talked about connections between survey random sampling and randomized experiments. There are also connections between nonprobability surveys and observational studies. We will explore more analogies between survey statistics and causal inference. Favorite references ?

My online talk Tues 24 Feb, 9am NY time at the Behind-the-Scenes seminar series: Russian Roulette and stochastic potential outcomes

Posted on February 16, 2026 6:05 PM by Andrew

I’m speaking at this online seminar Tues 24 Feb, 9am NY time:

The Behind-the-Scenes Seminar Series is designed to learn about the production process of research papers, offering an opportunity for students and researchers in all fields and at all career stages to engage with the challenges encountered during project development and how they were overcome.

Unlike most research seminars that focus on the research findings, this series will be dedicated to discussing the research process. Not only this, the seminars will also feature a live survey to gauge the audience’s expectations regarding the journey of the paper and compare them with the speaker’s actual experience.

What happened is that a few months ago the seminar organizers (three economists: Vatsal Khandelwal, Séverine Toussaert, and Jasmin Baier) wrote to me:

Speakers not only present their findings but also share the story behind their research, from the initial idea and design choices to data or modelling challenges and unexpected results.

Our aim is to foster openness, reflection, and engagement in the research community by highlighting the often-invisible processes that shape scientific work.

Would you be willing to suggest a paper you could cover? Ideally, it would be something that has already been accepted for publication, so that we can discuss the full journey, including the submission and review process.

I replied:

Here’s a list of our published research from last year.

If you go to that link and scroll down to “The stories behind the papers,” you’ll see where each paper came from.

So, if you want, you can pick one or more papers from that list that have good origin stories.

They responded that, as economists, they were most interested in the Russian roulette project.

It should be fun, to speak not just on the research itself but on where it came from and how it came to be published. It’s a joint paper with Jonas Mikhaeil, and we came up with the idea after hearing from Amanda Kowalski about her recent paper with Neil Christy, which got us thinking about what you can get from stochastic models for potential outcomes.

Here’s our published paper, “Russian roulette: The need for stochastic potential outcomes when utilities depend on counterfactuals,” and here’s the abstract:

It has been proposed in medical decision analysis to express the “first do no harm” principle as an asymmetric utility function in which the loss from killing a patient would count more than the gain from saving a life. Such a utility depends on unrealized potential outcomes, and we show how this yields a paradoxical decision recommendation in a simple hypothetical example involving games of Russian roulette. The problem is resolved if we abandon the stable unit treatment value assumption and allow the potential outcomes to be random variables. This leads us to conclude that, if you are interested in this sort of asymmetric utility function, you need to move to the stochastic potential outcome framework. We discuss the implications of the choice of parameterization in this setting.

We learned a lot from writing this paper and we’re continuing to think about the topic.

So, if you want to hear more, you can go to the Behind the Scenes website and sign up to get the zoom link. And here’s our blog discussion of the paper from last year.

Statistical Modeling, Causal Inference, and Social Science
