What is the relation between interactions in a regression model and correlations among the predictors?

I’ve often seen confusion between interactions in a regression model and correlations among the predictors. To keep it simple, consider the model y = b0 + b1*x1 + b2*x2 + b3*x1*x2 + error, and assume the predictors have been signed so that both b1 and b2 are positive. Then b3 represents the interaction. This has nothing to do with the joint distribution of x1 and x2 in the data, or in the population. (For simplicity, assume the data to which the model is being fit are a random sample from the population of interest.)

The interaction depends on the model of y given x1 and x2, while the correlation depends on the model for x1 and x2. These are two completely different parts of the model. And yet, they often seem connected.
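Here is a minimal simulation sketch of that separation, using only the Python standard library. The coefficient values, sample size, and helper names (solve, fit_interaction) are hypothetical choices for illustration. The data are generated from the model above with the same b3 each time while the correlation between x1 and x2 is varied, and b3 is then estimated by least squares.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_interaction(rho, n=20000, seed=1):
    """Simulate predictors with correlation rho; return least-squares (b0, b1, b2, b3)."""
    rng = random.Random(seed)
    b0, b1, b2, b3 = 1.0, 2.0, 3.0, 0.5  # hypothetical true coefficients
    xtx = [[0.0] * 4 for _ in range(4)]
    xty = [0.0] * 4
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x1 = z1
        x2 = rho * z1 + (1 - rho ** 2) ** 0.5 * z2  # corr(x1, x2) = rho
        y = b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2 + rng.gauss(0, 1)
        row = [1.0, x1, x2, x1 * x2]
        # Accumulate the normal equations X'X and X'y.
        for i in range(4):
            xty[i] += row[i] * y
            for j in range(4):
                xtx[i][j] += row[i] * row[j]
    return solve(xtx, xty)

estimates = {rho: fit_interaction(rho) for rho in (0.0, 0.5, 0.9)}
for rho, coefs in estimates.items():
    print(f"corr(x1, x2) = {rho:.1f}  ->  estimated b3 = {coefs[3]:.2f}")
```

In this setup the estimated b3 should land near the assumed 0.5 whatever the correlation, because b3 belongs to the model for y given x1 and x2, not to the model for x1 and x2.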

I have the general impression that I’d be more likely to expect a positive interaction of x1 and x2 when predicting y, if x1 and x2 are positively correlated in the population.

For example, when predicting income from height and sex, being taller and being male both predict higher income, and they also interact: the coefficient for height is higher for men than for women. And of course the two predictors, height and male, are positively correlated in the population.

I’m not sure how to think about this connection or even whether it’s a real pattern! But there might be something there so I wanted to share it with you.

The issue of interactions comes up in the context of the concept of intersectionality, which is a form of interaction that comes up in sociology. It started for me with this email from Elin Waring:

I’ve been working on data on intersectionality and retention of students in STEM majors. My little group is specifically looking at data from Lehman College and trying to model graduation with a STEM degree. There are a lot of details, but basically we have come to the conclusion that the right way to describe this is with a discrete time competing risk model (the competing risks being graduation with a STEM degree and graduation with a non-STEM degree). I won’t go into all the details. We have data for between 1 and 20 semesters enrolled for students starting as freshmen. For us, intersectional identity is defined by 5 variables that yield 32 distinct combinations or strata as used in the next articles.

In trying to think about how to account for intersectional identities we came across the “MAIHDA Method.” I was wondering if you had seen this discussion before or have any thoughts about it.

Evans, Clare R., George Leckie, and Juan Merlo. 2020. “Multilevel versus Single-Level Regression for the Analysis of Multilevel Information: The Case of Quantitative Intersectional Analysis.” Social Science & Medicine (1982) 245:112499. doi:10.1016/j.socscimed.2019.112499.

They essentially argue for treating the strata as random effects in a multilevel model, with the individual components of the combinations introduced as fixed effects describing the combinations.

The next article criticizes that approach and argues for fixed effects all around.

Wilkes, Rima, and Aryan Karimi. 2024. “What Does the MAIHDA Method Explain?” Social Science & Medicine 345:116495. doi:10.1016/j.socscimed.2023.116495.

Responded to here:

Evans, Clare R., Luisa N. Borrell, Andrew Bell, Daniel Holman, S. V. Subramanian, and George Leckie. 2024. “Clarifications on the Intersectional MAIHDA Approach: A Conceptual Guide and Response to Wilkes and Karimi (2024).” Social Science & Medicine 350:116898. doi:10.1016/j.socscimed.2024.116898.

I was wondering if you have any thoughts about this? For me, intersectionality as a theoretical approach does mean that it makes sense to look at the strata rather than thinking of the strata as just the most complex level of creating statistical models of the intersection of the variables. But then it seems as though treating this as a random effect more or less undermines its centrality to the theory. And is treating both the strata and the individual characteristics as variables at the same level basically a way to decompose?

In the end, I feel like the pro-MAIHDA people retreat to “we are just descriptive” in a way that isn’t very helpful. That said, they are right that this seems to have some traction in the world of health disparity research.

I replied that I’d never heard of this method before. I couldn’t actually muster the energy to read the above articles, as all this debate seems to be missing the key issues. I don’t really care if something is called a fixed effect or a random effect (see here); my current preferred way of thinking of these problems is by framing as a generative model.

Regarding intersectionality, the natural way I would see it is that this would show up as an interaction term, the idea that the interaction is more than the sum of its parts? For a simple example, if there are 5 binary variables and each has the same effect on its own (which they wouldn’t, this is just a simple hypothetical example), then you could create a variable which is the total number of identities, thus a number from 0 to 5, and “intersectionality” would show up as a super-linear or convex relation between the outcome and this total predictor?
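To make the “super-linear or convex” idea concrete, here is a small sketch in Python. The function name and the numbers (a common effect of 1.0 per identity plus an extra 0.5 for every pair of identities held together) are made up for illustration and are not estimates of anything. A convex relation shows up as increasing first differences, that is, positive second differences.

```python
def predicted_outcome(k, base=0.0, per_identity=1.0, extra=0.5):
    """Hypothetical outcome for someone with k of the 5 'on' identities.

    per_identity is the common additive effect; extra > 0 adds a
    superadditive component for each pair of identities held together.
    """
    pairs = k * (k - 1) // 2  # number of pairwise combinations among k identities
    return base + per_identity * k + extra * pairs

outcomes = [predicted_outcome(k) for k in range(6)]
first_diffs = [b - a for a, b in zip(outcomes, outcomes[1:])]
second_diffs = [b - a for a, b in zip(first_diffs, first_diffs[1:])]

print(outcomes)      # outcome by total number of identities, 0 through 5
print(first_diffs)   # increment from each additional identity
print(second_diffs)  # positive values indicate a convex relation
```

With extra set to 0 the second differences are all zero and the model is purely additive. This is only the simple hypothetical in which every identity has the same effect.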

Waring responded:

Sure, but the idea you suggested about intersectionality itself isn’t right. You can’t just sum the number of identities, everyone has identities and the idea is that it is not just about concentrated disadvantage of having all or some specific identities. If we have 5 dichotomous identity/group variables everyone has 5 dimensions of identity. Intersectionality is about the idea that something like “white, native born, woman, high income” shapes what happens because of how those come together to shape (in the case of my analysis) whether, as an undergraduate, you persist in STEM fields.

I replied as follows:

Yes, I was actually thinking this when I wrote that! I was imagining that each of the 5 factors has an “off” and “on” setting, and intersectionality kicks in when there are multiple “on” settings, where “on” represents the group that faces more difficulty (nonwhite, non-native born, female, low income, gender nonconformist, etc.). Once you allow arbitrary possibilities for intersectionality, then my simple superadditive model wouldn’t fit. On the other hand, if you were to allow all 32 possibilities to take on any value, then realistically you would not be able to estimate anything much at all: this is the usual problem in sociology of approximating a complex social structure by a simple model that explains most of the variance. For predicting persistence in STEM (or any academic field), one possible factor that could enter in a complicated way is conservative political ideology, in that for many attitudes and behaviors its predictive effect goes in the opposite direction of the “on” categories listed above, but grad students, in STEM and other fields, are predominantly politically on the left. I could well imagine that conservative political ideology, like the other “on” categories, is predictive of not persisting in STEM but that this could interact in unexpected ways with those other categories.

From a statistical perspective, my main message is to choose such a model based on its explanatory power and recognizing that it’s an approximation, rather than using methods such as statistical significance or Bayes factors which in different ways are driven by sample size, as we discussed in this 1995 paper.

Another interesting statistical feature of this and similar discussions is that it’s natural for the discussion to go back and forth between the correlation between two predictors in the data (or the population) and the interaction between their predictive effects, as discussed at the top of this post.

I’m not sure if this interaction thing is a general pattern that has some statistical explanation, or just a faulty intuition of mine based on just a couple of special cases. But I have noticed a general confusion that when people talk about interactions, often they seem to be talking about correlation between the predictors.

Epidemiologist Donna Spiegelman sez: SUTVA is “mostly not necessary for valid causal estimation and inference most of the time”

Donna Spiegelman shares this presentation she gave at the recent American Causal Inference Conference. I like what she has to say.

Here are the two parts of the stable unit treatment value assumption (SUTVA):

1. No interference between units. As Spiegelman says, nowadays it’s not hard to model spillovers. As I say, untangling spillovers is an ill-posed inverse problem that can be solved using Bayesian inference with reasonable priors. Serious practical work has moved past the demonstrate-that-spillover-doesn’t-matter stage to the just-model-the-spillover-directly stage.

2. Deterministic potential outcomes. As Spiegelman says, in the real world, outcomes are stochastic. Jonas and I talk about this in our Russian roulette paper.

The part that I’m less sure about is Spiegelman’s claim that adjustments for pre-treatment variables usually don’t matter. I’m persuaded that they usually don’t matter in the epidemiology and biostatistics applications she’s worked on, but I think that in social science, such adjustments can be important. Especially if there are big treatment interactions and your population is a lot different from your sample.

In any case, I recommend you look through Spiegelman’s slides, as she offers a refreshing perspective compared to our usual obsessive focus on the details of causal identification.

Survey Statistics: GREG

I just got to chat with Andrew and some of the authors of the MrPlew paper: Ryan Giordano, Erin Hartman, and Avi Feller. Lots more I have to digest here ! The paper came out while the polar bear and I were crossing from TN into VA.

We talked about using a model for response R, a model for outcome Y, or both. So GREG came up, and Andrew asked “what’s GREG ?” Good question.

GREG is Generalized REGression estimator. Särndal, Swensson, Wretman (1992) has a nice section that writes it in a few alternative ways:

1. Adjust an estimate based on the model with a Horvitz-Thompson estimate of the error:

2. Or on the flip side, you can see it as adjusting the Horvitz-Thompson estimate with the model:
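In one standard notation, the two forms can be written as follows. The symbols here are my own labels rather than a transcription from the book: U is the population, s the sample, \pi_k the inclusion probability of unit k, and \hat{y}_k the model prediction for unit k.

```latex
% Form 1: model-based total plus a Horvitz-Thompson estimate of the total error
\hat{t}_{\mathrm{GREG}}
  = \sum_{k \in U} \hat{y}_k
  + \sum_{k \in s} \frac{y_k - \hat{y}_k}{\pi_k}

% Form 2: Horvitz-Thompson estimate of the total plus a model-based adjustment
\hat{t}_{\mathrm{GREG}}
  = \sum_{k \in s} \frac{y_k}{\pi_k}
  + \left( \sum_{k \in U} \hat{y}_k - \sum_{k \in s} \frac{\hat{y}_k}{\pi_k} \right)
```

The two expressions are the same quantity with the terms regrouped. With a linear working model \hat{y}_k = x_k^\top \hat{B}, the adjustment in the second form becomes (t_x - \hat{t}_{x,\pi})^\top \hat{B}, the gap between the known population totals of x and their Horvitz-Thompson estimates, multiplied by the fitted coefficients.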

So if GREG stands for Generalized REGression estimator, what is being generalized ?

Lumley 2010 made me think we were generalizing to continuous X variables:

Sharon Lohr’s book made me think we were generalizing beyond simple random samples:

Sampling Design and Analysis: Third Edition — Sharon Lohr

Särndal, Swensson, Wretman (1992) made me think we were generalizing to multiple X variables:

Model Assisted Survey Sampling (Springer Series in Statistics), by Carl-Erik Särndal, Bengt Swensson, and Jan Wretman

Regardless of the exact origin of the name, GREG has connections to the Doubly Robust literature in causal inference (as Coston et al. (2020) note in a footnote). Any favorite references making these connections ?

Recent discoveries on the acquisition of the highest levels of statistical fallacies

Mark Goldstein points us to this post by Alex Dimakis, who writes:

A paper was recently published in Science on highest level of human performance across athletics, science, math and music. I think the paper makes some classical statistics mistakes that still fool many smart people. The paper “Recent discoveries on the acquisition of the highest levels of human performance” by Gullich et al. claims: “In summary, when comparing performers across the highest levels of achievement, the evidence suggests that eventual peak performance is negatively associated with early performance.”

The paper makes two mistakes. Base-rate fallacy and . . . Berkson’s paradox . . .

The study says simply that the very top at young age are not identical with the very top adults. (As one would expect, since there are *many many more non-elite young candidates*). Still, elite young performers are 40 times more likely to be in the top adults compared to the general population. This is acknowledged in the paper but on pages 6-7, a bit buried in the technical analysis and not sufficiently discussed in abstract or conclusions. . . .

The paper claims “Across the highest adult performance levels, peak performance is negatively correlated with early performance.” This is a classic example of Berkson’s paradox. Here is a simplified example to understand this: Assume that to be a successful actor you have to be either extremely good looking or extremely talented. Assume also that talent and looks are independent in the population. However, among successful actors you will observe a negative correlation between looks and talent. This doesn’t mean anything beyond the selection process and should not be extrapolated. My favorite example-joke of this is that basketball points scored is negatively associated with height among NBA players. (because to be an NBA player you have to be very tall OR be very good at scoring). From this, I extrapolated that since I’m 5’7”, I will be scoring 80+ points per NBA game. . . .
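The actor example is easy to check by simulation. The sketch below, in standard-library Python, draws looks and talent independently and then keeps only the people who clear a cutoff on at least one of them. The cutoff of 1.5 standard deviations, the sample size, and the helper name pearson are arbitrary choices for illustration.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)
# Looks and talent are independent standard normal draws in the population.
people = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20000)]

cutoff = 1.5  # hypothetical bar: a "successful actor" clears it on looks OR talent
actors = [(looks, talent) for looks, talent in people
          if looks > cutoff or talent > cutoff]

r_population = pearson(*zip(*people))
r_actors = pearson(*zip(*actors))
print(f"correlation in the population: {r_population:.2f}")
print(f"correlation among actors:      {r_actors:.2f}")
```

The population correlation should come out near zero and the correlation among the selected group clearly negative, even though nothing connects looks and talent except the selection rule.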

Here’s the paper in question, “Recent discoveries on the acquisition of the highest levels of human performance.”

Yeah, this sort of thing comes up all the time! For example, some celebrity academics a couple years ago wrote a book that included the false statement, “while correlation does not imply causation, causation does imply correlation.” Even more amusingly, they prefaced this by “We must, however, remember that”. I guess we must remember a lot of false things! Economist Rachael Meager gave a quick example showing why they were wrong; see details here.

This new example also looks a lot like the well-known regression-to-the-mean fallacy (for more on that, I recommend Section 6.5 of our book, Regression and Other Stories, which includes some simulation code to demonstrate the problem). Of course, just because lots of people know about a fallacy, that doesn’t stop people from making the error in new settings. That’s why it’s a fallacy!
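For readers who want to see the regression-to-the-mean pattern directly, here is a minimal sketch in standard-library Python. It is not the code from the book. It assumes a made-up setup in which early and adult performance are each a shared underlying ability plus independent noise, with all standard deviations set to 1.

```python
import random

rng = random.Random(0)
n = 50000
# Hypothetical model: both measurements share one ability term and have independent noise.
ability = [rng.gauss(0, 1) for _ in range(n)]
early = [a + rng.gauss(0, 1) for a in ability]
adult = [a + rng.gauss(0, 1) for a in ability]

# Select the top 1% of early performers and compare their two averages.
cutoff = sorted(early)[int(0.99 * n)]
top_early = [i for i in range(n) if early[i] >= cutoff]
mean_early = sum(early[i] for i in top_early) / len(top_early)
mean_adult = sum(adult[i] for i in top_early) / len(top_early)

print(f"top early performers, mean early score: {mean_early:.2f}")
print(f"same people, mean adult score:          {mean_adult:.2f}")
```

Under these assumptions the early stars remain well above average as adults but sit much closer to the mean than their early scores would suggest, with no developmental mechanism involved.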

P.S. An anonymous commenter points out that Dimakis (and, by extension, Goldstein and me) are being unfair to this paper. The descriptive results are what they are. I remain skeptical of the paper’s claim that “similar developmental pattern across different domains suggests widespread, and possibly universal, principles underlying the acquisition of the highest levels of achievement,” as I do suspect that much of what they have seen arises from the usual statistical selection artifacts. So maybe it’s ok to caution about the interpretation of these numbers. But now I’m thinking it wasn’t fair of us to slam the paper for presenting some interesting data findings.

The Application Matters: Medical Ethics and Counterfactual Utilities

I believe, as applied statisticians, we need to get our hands dirty and immerse ourselves in the applications we try to address. This post is mostly about medical ethics and the famous “first, do no harm” principle. It is also an attempt to understand how statistics can serve medical practice. The motivation for this comes from a recent debate in the statistics literature about counterfactual losses, which often invokes this “first, do no harm” principle as a motivation. Much has been written about the theory of these counterfactual losses, and I’m sure they will find a fruitful application, but do they actually speak to the challenge of medical decision-making that the “first, do no harm” principle seeks to address?

I will argue that they cannot, because this principle is concerned with medicine at its most human: medical practice centered on the relationship between an individual patient and an individual physician. But what can statistics help with? Modern medical obligations acknowledge that medicine is embedded in society; they highlight medical practitioners’ concern with justice and with reducing health disparities. These are concerns statistics can help to address.

But let me start at the beginning. There’s a recent literature that considers decision making under counterfactual loss: what if the utility of your decisions depends not only on the realized outcome but also on what could have been, on a counterfactual? A paradigmatic example is the following “first, do no harm” utility: Suppose you’re administering a drug and there are only two extreme outcomes. The patient will either live or die. The literature (e.g., Bordley, 2009; Ben-Michael et al., 2023; Christy and Kowalski, 2026) has interpreted the medical aphorism “first, do no harm” as requiring a utility function that assigns asymmetric weights to saving a life and causing a patient’s death. The disutility from killing a patient who, counterfactually, would have survived outweighs the positive utility of saving a patient who otherwise would have died. Although this may initially seem attractive, several authors have pointed out complications that arise when decisions are based on such counterfactual losses (e.g., Dawid and Senn, 2023; Sarvet and Stensrud, 2023).

Andrew and I contributed to this literature with a small example that seemingly produces a counterintuitive recommendation, which I discuss below.

In response, Koch and co-authors write:

[T]his seemingly nonsensical result can be reasonable in a different setting. […] It may be reasonable for a  physician to prefer standard care, prioritizing the avoidance of adverse counterfactual outcomes over  improvements in expected benefits. Indeed, such a decision reflects the Hippocratic principle of “do  no harm”. […] This example underscores the fact that a utility function represents the preferences of the  decision-maker and is therefore inherently subjective and context-dependent.

This uncovers a problem with our argument, which rested on intuition: see, this decision doesn’t make sense, does it? Intuition, of course, can be misleading. One way our example might be misleading, as Koch et al. point out, is that it may describe a setting in which we simply do not hold these counterfactual utilities. If we were to transplant the same recommendation into an appropriate setting, it might no longer appear nonsensical and might instead conform to how we think we should behave.

This has me very excited. I believe statistics is at its best when it takes its applications seriously. So, in this blog post, I want to do just that.

I will briefly give the example Andrew and I came up with to show that a “do no harm” utility can lead to counterintuitive decision recommendations. We do so through an example involving Russian roulette. It is a useful example, but by no means an accurate representation of what we would consider plausible in real medical settings. What it does show, however, is that we need to be really careful with these “do no harm” utilities: if we don’t really hold them, they may lead to nonsensical decisions.

Taking the application seriously, we will dive into medical ethics to ask whether the proposed counterfactual “do no harm” utilities help with medical decisions. We do so by briefly examining the origin and history of the “first, do no harm” principle. We will see that “do no harm” is perhaps best understood in the context of a professional ethic that commits physicians to the rules of their craft and to respect for each individual patient. Statistics cannot truly speak to this individual-level patient-physician relationship. Since the Hippocratic Oath, however, medicine has changed substantially. With the advent of scientific methods in clinical medicine, doctors face new moral obligations not captured by the “do no harm” principle. Some of these new obligations arise from the relationship between medicine and society; others arise from the use of scientific methods themselves. We will look at modern medical oaths to get a glimpse of these new obligations, and of how statistics can help fulfill them.

Russian Roulette 

As a starting point, let me present our simple and somewhat morbid example in which counterfactual utilities give a counterintuitive decision recommendation: Imagine we are choosing between two games of Russian roulette. In the first game, the status quo, we play with a six-chamber gun, one chamber of which is loaded. That is, we face a one-in-six chance of death. We are then offered the option to switch to a seven-chamber gun, the new alternative “treatment.” If we switch, we face better odds: only a one-in-seven chance of dying. By switching games, we lower our probability of death, which to me seems preferable. 

What would the counterfactual “do no harm” utility function recommend? To figure this out, we treat the outcomes under either game of Russian roulette as (independent) potential outcomes and divide the population of players into four principal strata based on survival status. Only two of the principal strata are relevant for our decision, those in which a player would survive one game but die playing the other. It’s easy to work out that with probability 6/42 switching to the new gun saves you: you would die under the status quo but survive under the treatment. But with probability 5/42, you would have survived under the status quo, but switching to the new gun, you will die. Suppose we interpret “first, do no harm” as mandating that the negative repercussions of our treatment choice, the death of a player, outweigh the benefits of saving a life. For example, suppose saving a life has utility +1, while the death of a player has utility −2. Then the 6/42 chance that the treatment saves you is outweighed by the 5/42 chance that the treatment kills you in cases where, counterfactually, you would have lived.

Under this counterfactual utility, we ought not to switch. It recommends we stick to the status quo, under which we face a higher chance of death. This strikes me as a counterintuitive decision recommendation.
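The arithmetic behind this recommendation can be checked in a few lines. The sketch below uses exact fractions from the Python standard library; the variable names are mine, and the utilities +1 and −2 are the illustrative values from the example. The two games are treated as independent potential outcomes, as in the text.

```python
from fractions import Fraction

p_die_status_quo = Fraction(1, 6)  # six-chamber gun
p_die_treatment = Fraction(1, 7)   # seven-chamber gun

# The two principal strata in which the choice of gun changes the outcome.
p_saved = p_die_status_quo * (1 - p_die_treatment)    # 1/6 * 6/7 = 6/42
p_harmed = (1 - p_die_status_quo) * p_die_treatment   # 5/6 * 1/7 = 5/42

u_saved, u_harmed = 1, -2  # asymmetric "do no harm" utilities
counterfactual_utility_of_switching = u_saved * p_saved + u_harmed * p_harmed

# For comparison: the plain reduction in the probability of death from switching.
risk_reduction = p_die_status_quo - p_die_treatment

print(counterfactual_utility_of_switching)
print(risk_reduction)
```

The counterfactual utility of switching is 6/42 − 10/42 = −4/42, which is negative, while the probability of death falls by 1/42. The first number says to stay; the second says to switch.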

The “First, do no harm” Principle

There is, however, a limit to the force of this argument based on intuition. One might argue that the recommendation in the Russian roulette example is not evidence against counterfactual utilities in general, but rather an indication that, when playing Russian roulette, we do not hold utilities of this kind. When transplanted to a setting where we have such asymmetric counterfactual utilities, the same recommendation might be sensible. The counterfactual-utility literature often motivates asymmetric counterfactual utilities by appealing to the “first, do no harm” principle in medicine.

For the rest of this post, I will discuss whether counterfactual utilities are useful in this paradigmatic application: medical decision-making.

In a paper frequently cited by advocates of counterfactual utilities, Cedric Smith (2005) discusses the origin and limitations of the “first, do no harm” principle. It is actually not part of the Hippocratic Oath, or the wider Hippocratic corpus, as is often implied, but has somewhat nebulous roots. Smith traces its origin to the seventeenth-century English physician Thomas Sydenham. While undoubtedly catchy, this principle is not embedded in a larger ethical framework that would give guidance on its interpretation or justifications for its use.

This is a problem because, taken literally, the “first, do no harm” principle is a poor guide to medical decision-making. Let me cite Louis Lasagna, an American physician of the last century who was very involved in rethinking the Hippocratic Oath:

“To observe this advice [first, do no harm] literally is to deny important therapy to everyone, since only inert nostrums [quack medicine without active pharmaceutical ingredients] can be guaranteed to do no harm. It is more reasonable to ask doctors to balance the potential gains against the possible harm; would that we could only quantify these probabilities more precisely!” (Lasagna cited in Smith, 2005)

A call to action for us statisticians if I ever saw one. Of course, the counterfactual-utility literature that cites this principle is not advocating what Lasagna warns against: doing absolutely no harm. Its proponents are well aware that benefits and risks must be carefully weighed against each other. If the principle is not meant to be taken literally, then its obscure origin becomes a problem: it gives us little insight into what actually matters to medical practitioners, because it is disconnected from any wider tradition that would help us interpret it. 

Luckily, we can find a similar, more nuanced statement in the Hippocratic corpus (Epidemics I):

“Declare the past, recognize the present, foretell the future: attend to these things. As to diseases, make a habit of two things—to help, or at least to do no harm. The art has three factors, the disease, the patient, the physician. The physician is the servant of the art.”

The Greek word translated here as “art” is technē (τέχνη), which we might also translate as “craft”. Medicine is a craft because the decisions a physician has to face cannot be made by rote application of knowledge. As a craftsperson, the physician as an individual becomes relevant. That is why the Hippocratic Oath commits the physician, as an individual, to be benevolent in each patient interaction. Medical ethics based on the Hippocratic Oath is not focused on outcomes, let alone utility, but concerned with the character of the physician and their obligations toward their patient (Pellegrino, 2006). It centers the patient-physician relationship.

With this background in mind, we can understand why the “benevolence” implied in the imperative to help is qualified with the phrase “or at least to do no harm”. The qualification can look redundant: if I’m already committed to help, it may seem that I’m already committed to do no harm. Lynn Jansen (2022) argues that this is where the professional aspect of medicine enters: as a professional, the physician needs to restrict their actions to those that align with their profession. That is, while they strive for benevolence in the sense of furthering the patient’s overall well-being, they reject all courses of action that would harm the patient’s medical well-being. This second aspect is often called non-maleficence.

Statistics and Medicine 

In modern medicine, this tension is heightened. Taking the patient’s moral agency seriously, a physician must be careful not to “confuse technical with moral authority” (Pellegrino, 2006) or override patients’ values. This is worth keeping in mind. The patient must be involved in weighing benefits and risks. Thus, the medical professional does not have sole discretion to choose an optimal treatment. “Help, or at least do no harm” is a professional mantra that guides a physician in their interactions with patients. It is not a constraint on optimal decision-making; it is a moral commitment to respect each patient.

This conception of medicine is in stark contrast to the world seen through the lens of statistics. Compare this focus on the individuality of both patient and physician with the following quotation from an 1835 report to the Academy of Sciences, written by a committee of four mathematicians, including Poisson, on operations for gallstones: 

“In statistical affairs … the first care before all else is to lose sight of the man taken in isolation in order to consider him only as a fraction of the species. It is necessary to strip him of his individuality to arrive at the elimination of all accidental effects that individuality can introduce into the question.” (taken from Hacking, 1990)

Statistics’ power lies in constructing aggregates, making disparate things hold together (Desrosières, 1998). Historically, these aggregates were useful for the emerging nation-state and were quickly adopted to address large-scale social problems, such as public health. Many professions, including medicine, strongly resisted losing sight of the particular (in our case, the individual patient) in favor of aggregates. Even randomized experiments, which we nowadays all too easily accept as the gold standard of evidence, had a hard time entering clinical medicine (Porter, 2020).

Due to this tension, modern medicine has a dual nature. On the one hand, doctors are still committed to treating their patients as individuals: medicine is the art of healing. On the other hand, with advances in scientific methods within medicine, and with the recognition that health must be understood in the context of society, doctors face new moral obligations (Pellegrino, 2006).

Modern Medical Oaths

To get a glimpse of these new obligations and the self-understanding of doctors in the twenty-first century, we can look to modern versions of medical oaths. While many doctors still take the ancient Hippocratic Oath, many medical schools revise the original text or students take an additional self-formulated oath. In 2005, for example, students at Weill Cornell Medical College began taking a revised Hippocratic Oath. Let me highlight a brief excerpt:

I vow […]

That above all else I will serve the highest interests of my patients through the practice of my science and my art; That I will be an advocate for patients in need and strive for justice in the care of the sick.

Notice the emphasis on justice; it’s not idiosyncratic to this oath. Two further examples show similar themes. The University of Pittsburgh School of Medicine’s class of 2024 took an oath that highlighted the social determinants of health and advocated for a more equitable health care system. Harvard Medical School’s class of 2019 vowed to combat structural oppression and promote social justice. In this admittedly selective set of examples, much emphasis is placed on how medicine relates to society. Core commitments are justice and the building of an equitable health care system.

So, how can we statisticians help modern medical practice? Modern medical ethics places great emphasis on patients’ autonomy and their freedom to choose based on their own values. For a patient’s decision to be well informed, deliberation about benefits and risks is central — but the decision ultimately depends on a personal tradeoff shaped by the patient’s values. For this reason, our goal should perhaps not be to optimize treatment decisions. We do need to help estimate the benefits and risks of treatments more accurately, but treatment decisions remain part of the individual patient-physician relationship. Instead, we should put more emphasis on identifying and reducing disparities in the health care system, focusing on medicine as embedded in society. The most important task may not be deciding which drug to administer, but reducing inequalities in access to treatment in the first place. I believe statistics has an important role to play in making health care systems more equitable and more just. 

“An Axiomatic Foundation for Decisions with Counterfactual Utility”

Benedikt Koch, Kosuke Imai, and Tomasz Strzalecki write:

Counterfactual utilities evaluate decisions not only by the realized outcome under a given decision, but also by the counterfactual outcomes that would arise under alternative decisions. By generalizing standard utility frameworks, they allow decision-makers to encode asymmetric criteria, such as avoiding harm and anticipating regret. Recent work, however, has raised fundamental concerns about the coherence and transitivity of counterfactual utilities. We address these concerns by extending the von Neumann-Morgenstern (vNM) framework to preferences defined on the extended space of all potential outcomes rather than realized outcomes alone. We show that expected counterfactual utility satisfies the vNM axioms on this extended domain, thereby admitting a coherent preference representation. We further examine how counterfactual preferences map onto the realized outcome space through menu-dependent and context-dependent projections. This axiomatic framework reconciles apparent inconsistencies highlighted by the Russian roulette example in the statistics literature and resolves the well-known Allais paradox from behavioral economics. We also derive an additional axiom required to reduce counterfactual utilities to standard utilities on the same potential outcome space, and establish an axiomatic foundation for additive counterfactual utilities, which satisfy a necessary and sufficient condition for point identification. Finally, we show that our results hold regardless of whether individual potential outcomes are deterministic or stochastic.

I have to admit that I don’t see the appeal of utility functions based on counterfactuals. For example, I’ve never thought that the decision-theoretic concept of “regret” makes sense. That said, I know that a lot of people are interested in the topic, so I hope the above paper is useful to people in clearing up these issues, and I’m glad that they were able to use our Russian roulette example.

An economist writes: “the fulminations over the #1 pick seem overheated to me.”

Jonathan Falk writes:

I [Falk] am always amazed at the amount of (digital) ink spilled on the perverse incentives involved in tanking to get the #1 draft pick. The current local woes of the Giants and Jets obviously contribute a lot to these discussions, but they happen all the time. As an economist, it’s clear to me that the value of a draft pick is the incremental value, not the absolute value. I’m completely aware that the upper tails of distributions have much more dispersion than the center, or even the 80th-90th percentile does, but the fulminations over the #1 pick still seem overheated to me.

First, of course, is the fact that assessment is made with error, and there are plenty of #1 busts in every sport. #2s can be busts as well, of course, but that merely lowers the expected difference between #1 and #2 as the true value of both is attenuated towards 0 — #1 loses more.

Second, there is the issue of team fit. Greatness is a vector, not a number, and if the teams ahead of you in draft order need something else, you still stand a chance of getting the player optimized for your needs. Going the other way, of course, is that higher draft picks absolutely lower the number of teams that can steal your guy.

Third, teams are… teams. One person can only contribute so much. So the relevant assessment is not how much better A is than B, but how much the addition of A versus the addition of B will change the prospects of your team, which I think is pretty obviously a lower difference, though I guess your rationale for voting runs in the other direction: you ought to judge a small incremental addition by the gigantic difference between winning a championship or not.

Fourth, more narrowly economic, every incrementally higher pick costs more. I don’t think that effect is huge in the context of overall payrolls, but isn’t that then another anomaly? If #1 picks are so dramatically better than, say, #5 picks, why aren’t they paid multiples more?

I don’t really have anything to say here, because I have no sense of how much teams are paying for #1 or #2 picks. I do remember a couple years ago that everyone was talking about Wemby, but basketball’s different than football because there are only 5 players on the court, so one player can make more of a difference.

The case of Wemby makes me think that one way this could be studied would be to compare different years. In some years there is a clear consensus #1 pick, other years not.
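Falk's first point, that assessment error shrinks the expected gap between the #1 and #2 picks, can be illustrated with a small simulation. This is only a sketch under made-up assumptions: true values are standard normal, scouts see true value plus independent normal error, and the function name `top_two_gap` and all the settings are hypothetical.

```python
import numpy as np

def top_two_gap(noise_sd, n_prospects=30, n_drafts=20000, seed=0):
    """Average true-value gap between the prospects that scouts rank #1 and #2."""
    rng = np.random.default_rng(seed)
    true_value = rng.normal(size=(n_drafts, n_prospects))
    # Assumption: scouts observe true value plus independent assessment error.
    assessed = true_value + rng.normal(scale=noise_sd, size=true_value.shape)
    order = np.argsort(-assessed, axis=1)   # best assessed prospect first
    rows = np.arange(n_drafts)
    first = true_value[rows, order[:, 0]]
    second = true_value[rows, order[:, 1]]
    return float(np.mean(first - second))

gap_perfect = top_two_gap(noise_sd=0.0)   # scouts see true value exactly
gap_noisy = top_two_gap(noise_sd=1.0)     # error as large as the spread in talent
print(f"perfect scouting: {gap_perfect:.2f}, noisy scouting: {gap_noisy:.2f}")
```

Under these assumptions the average gap in true value between the top two picks is smaller with noisy scouting than with perfect scouting. How much smaller depends entirely on the assumed noise level.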

John Carlin says, “‘Identifying variables that independently predict…’ is not a well-defined research task”

John “Bayesian Data Analysis” Carlin writes:

Recent developments in the methodology of epidemiological research have emphasized the importance of achieving clarity of purpose by classifying research questions into one of three types: descriptive, predictive, and causal. . . .

I [Carlin] do not believe that studies aiming to “identify” independent predictors or “prognostic factors” are addressing well-defined research questions. Indeed, beyond the issues already raised, there is a broader question of the extent to which it is ever sensible to frame a research question as if it could be answered dichotomously, as in “is this an (independent) prognostic factor?” Prediction questions, which include prognosis, are those that involve the development of a model or algorithm to provide predictions of outcomes using available variables that are potential predictors.

This all makes sense. I kinda think that descriptive, predictive, and causal are all the same thing–or, more precisely, that “descriptive” and “causal” are special cases of “predictive,” under different conditions. But if you want to divide them into three tasks, sure, go for it. Personally, I’d rather divide statistics into the goals of exploration, estimation, and discrimination, but I think that’s because I’m thinking in a more general “data science” perspective, whereas John is focusing more on the more traditional problem of inference.

But, yes, I agree with him 100% on avoiding dichotomization, a topic that Sander Greenland, I, and others have been screaming about for a long time–indeed, John and I contributed to the anti-dichotomization theme in our book Bayesian Data Analysis, in that we focused on model building and inference within a model, rather than on the then-fashionable problem of choosing among or comparing models using Bayes factors. So, yes on that.

John continues:

Some variables may have greater predictive value than others, but this should be assessed by comparing the predictive value of the model or algorithm with and without the use of that variable, not by examining its “independent effect” in a multivariable regression model.

I’m confused on this point. I mean, sure, I agree that you shouldn’t label a regression coefficient as an “independent effect”; indeed, I always use the terms “predictors” and “outcome” rather than “independent and dependent variables.” Beyond this, I’m not quite sure what John is suggesting. Suppose you have a predictor of interest, x3, and you’ve fit the model y ~ x1 + x2 + x3 (for convenience using standard R notation). I guess John is saying, don’t just look at the coefficient for x3 in that model; also compare it to the model y ~ x1 + x2. Maybe this is a good idea–it’s not something I’ve thought about for a while. Is this the same as what used to be called “partial regression coefficients”? I remember from the statistical literature in the 1960s and 1970s that there was a lot of work on methods for understanding what happens in linear regression when you add one variable at a time. Perhaps it would be good to revisit some of those ideas, and maybe it’s a mistake that we don’t cover them in Regression and Other Stories.
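One possible reading of John's suggestion is to compare out-of-sample predictive error for the models with and without x3. Here is a minimal sketch on simulated data, in Python rather than R. The helper `cv_rmse`, the simulated coefficients, and the choice of 5-fold cross-validation are all assumptions for illustration, not anything John specifies.

```python
import numpy as np

def cv_rmse(X, y, n_folds=5):
    """K-fold cross-validated RMSE of a least-squares fit with an intercept."""
    n = len(y)
    folds = np.arange(n) % n_folds
    X1 = np.column_stack([np.ones(n), X])
    sq_err = np.empty(n)
    for k in range(n_folds):
        test = folds == k
        beta, *_ = np.linalg.lstsq(X1[~test], y[~test], rcond=None)
        sq_err[test] = (y[test] - X1[test] @ beta) ** 2
    return float(np.sqrt(sq_err.mean()))

# Simulated data, purely for illustration
rng = np.random.default_rng(1)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
x3 = 0.7 * x1 + rng.normal(size=n)   # x3 is correlated with x1
y = 1.0 + 0.5 * x1 - 0.3 * x2 + 0.8 * x3 + rng.normal(size=n)

rmse_without = cv_rmse(np.column_stack([x1, x2]), y)      # y ~ x1 + x2
rmse_with = cv_rmse(np.column_stack([x1, x2, x3]), y)     # y ~ x1 + x2 + x3
print(f"CV RMSE without x3: {rmse_without:.3f}, with x3: {rmse_with:.3f}")
```

The comparison is between two predictive summaries, not between two coefficients, which is what makes it different from just reading off the estimate for x3.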

I also want to plug my paper with Guido Imbens (also included as Section 21.5 in Regression and Other Stories), Why ask why? Forward causal inference and reverse causal questions. Our point there is that it can be a good idea to search for prognostic factors in observational data, not with the idea this will identify causal effects but rather as a way of understanding what’s missing from our existing models.

Finally, John writes:

More broadly, debates on whether to “adjust” or not for certain variables in a regression model can only be answered by situating the analysis within a sharply defined research question and a sharply defined rationale for specifying a regression model in the first place.

I don’t get this at all. First I don’t get why “adjust” is in scare quotes; second, ummm, yeah, it’s always good to have a sharply defined research question, but in the meantime people are always making comparisons, and so let’s do what adjusting we can. For example, in an epidemiology study it should pretty much always be a good idea to adjust for age and smoking history. Or maybe John would say that the rationale for adjusting for age and smoking history is sharply defined, in which case maybe we’re in agreement.

To put it another way, it’s often a good idea to have a sharply defined research question–but that applies in general, not just for statistical adjustments. I think it’s also true that it’s better to have a sharply defined research question when performing a randomized clinical trial. A randomized clinical trial gives identification for the sample average treatment effect in any case–but without a sharply defined research question, it’s not clear what can be done with such an estimate.

So I’m wary of John singling out adjustment in his criticisms, as I fear his article will be taken as implying that, if you don’t try to adjust, everything will be ok.

Two Health Economists Walk into a Bar: What bothered me in that conversation of Jay Bhattacharya and Emily Oster

Last week I was at a conference on enhancing scientific integrity (as I reported here), and one of the sessions was an interview of Jay Bhattacharya, the current director of the National Institutes of Health, and Emily Oster, a professor of economics at Brown University.

I referred to that session in a post the other day regarding the recent case of a report from the Centers for Disease Control and Prevention that was pulled by Bhattacharya, in his additional capacity as acting director of the CDC. I’ll get back to that story in a bit, but here I wanted to talk about some larger things that bothered me in the interview.

Before getting to my disagreements, let me give my positive take, which is that both the people in the interview had an air of moral seriousness.

This is important. So much of the discourse in politics and social science these days is polluted with cynicism, whether it be from history professor Niall Ferguson decrying the “wokeness” on college campuses when he’s not encouraging college students to do “oppo research” on each other, or Lawrence Summers sleazing around with a sex trafficker and then trying to enlist his rich friends to intimidate student journalists, or Cass Sunstein writing an entire book on a topic he knows nothing about, or Sunstein’s friend Adrian Vermeule promoting election denial, or Mehmet Oz and Andrew Huberman trading off their medical and scientific credentials to hawk dietary supplements, or Steven Levitt promoting dubious claims on mind-body healing and global warming denialism (presumably because they’re cool and transgressive, respectively), or Matthew Walker torturing the data, etc etc. I’m talking about researchers who see science as a path to glory, not to understanding, and politically-minded academics who will happily promote stupid ideas that push their agenda. Beyond that there are straight-up politicians who lie, cheat, and steal, and that’s bad too–but here I’m talking about that nexus between government, policy, and the human sciences.

Anyway, Bhattacharya and Oster weren’t like that. They recognize that we’re talking about serious issues here. When asked about disruptions to NIH funding, Bhattacharya emphasized the larger goal of improving public health, making the point that they want to fund a portfolio of projects to address health challenges. I have no sense of how things are run internally within NIH, so I’m not saying I agree or disagree with his particular administrative directions, but I appreciated that he kept his eye on the ball by emphasizing ultimate goals. For her part, Oster questioned Bhattacharya on a number of issues. She too gave the sense that this is a serious topic, not just a political game.

How to do better is another question! Last month Oster wrote positively about some silly dietary guidelines recently released by the FDA, and if you read her op-ed carefully she doesn’t actually seem to agree with most of those guidelines (the best thing she could say about them was that they were “not crazy”), so I take it that in writing that piece she was making a sort of persuasion calculation that the best way to be effective is to mix the criticism with a gallon of sugar. That’s not my style. So, Oster uses a different approach than I do, and I’m sure we’d have our differences in how to interpret statistical evidence. But, again, I think she’s engaging with moral seriousness.

And it’s possible to be morally serious while still having fun. Consider Nate Silver. Nate’s an entertaining writer–I try to be too!–and I’ve had my disagreements with him regarding statistics and communication, but I think he’s coming from a place of intellectual and moral seriousness that shows respect for the challenges of political analytics and the stakes involved. Indeed, sometimes when he’s disagreed with me, it’s on the implicit grounds that he’s making progress in understanding the real world, doing some analytical engineering that is outpacing the statistical theory. I still think there’s a benefit to interrogating the edge cases where our methods break down . . . anyway, my point is that I’m not just using the term “moral seriousness” to refer to things that I agree with. I’m talking about an attitude that I see in Bhattacharya, Oster, and Silver that I don’t see in, say, Niall Ferguson or Andrew Huberman.

Now, to return to our main thread, these are the parts of last week’s interview that bothered me:

1. When asked about some news reports regarding the NIH and CDC, Bhattacharya dismissed them as “fake news.” This annoyed me for two reasons. First, he offered no evidence that the reports were untrue. Second, he was appointed by a man who spews out false statements at an amazing rate, including on the topic of public health. Who are we supposed to trust here? News reports or a political appointee? Also, Bhattacharya himself has a record of being sloppy with the facts, as I happen to know because it happened to me.

Now, don’t get me wrong, I’m not saying that Bhattacharya was lying or misinformed regarding recent NIH and CDC policies. It could well be that the news items were erroneous or misleading–and, if so, I can see how Bhattacharya would be legitimately annoyed. And he should feel free to express his annoyance! But just dismissing the reports as “fake news” . . . that’s not a serious response.

As I wrote above, I appreciate that Bhattacharya treats the nation’s public health spending with the seriousness it deserves. As a statistician, I think information needs to be treated with respect as well. Which means he should be addressing serious news reports and, for that matter, respecting the institution of journalism. Which he wasn’t doing here.

2. When the topic of vaccines came up, Bhattacharya came out strongly in favor of vaccination, and he expressed the view that it is better for vaccination to be voluntary rather than mandatory. This could be. I guess it depends on the context. For almost all my life, childhood vaccines were mandatory, just about everybody got vaccinated, and just about nobody complained about it. So mandatory vaccination can work just fine–we have decades of experience on this one. The bad news is that in the past few years, vaccination has become politicized and anti-vax attitudes have become embedded in right-wing politics. So it could be that Bhattacharya is right and the mandates will have to go, we’ll just have to accept more sick and dead kids and adults, just the price to pay for this aspect of political dysfunction. I don’t know, but it could be, so I’m not going to criticize Bhattacharya for his hot take on this issue.

What bothered me was . . . if you are going to go with a voluntary vaccination strategy, I think you’d want a strong strategy of encouraging people to choose vaccination for themselves and their kids. So I think his response would’ve been stronger if he’d also said something about how to vigorously promote vaccine usage. That’s part of public health policy too. Also, Bhattacharya doesn’t have a great track record on this issue: just a few years ago he was part of an anti-vax organization. See here for the ugly story. OK, fine, everybody makes mistakes and has lapses in judgment. But then at least he should address that, in the past, he’s been part of the problem. To just say that you want vaccines to be optional but without addressing that history, that’s not right.

3. The un-publishing of that CDC report. Bhattacharya said he stopped the CDC from publishing the report because it was using an approach called a test-negative design, which he thinks is a bad statistical method. When he said this, Oster jumped in and said that she too thought it was a bad method. It was only a brief exchange and there was no time for either of them to give a reference or to explain why they think the method is bad. In the meantime, it seems that the report has been leaked; see here. One of the authors of the report said, “I’m strongly opposed to this kind of censorship . . . It should be out in the world at large for the scientific community to judge it for what it is.”

I think the best next step would be for the CDC to release the report officially, along with a critical response from a statistician explaining how the method is flawed. Bhattacharya said it was common knowledge that the method was terrible; on the other hand, it seems that this “test-negative design” is a standard approach for studying the effect of vaccines in the population after they have been released; see also here. So at the very least it would be a valuable educational opportunity to see this article that was on the verge of publication, and to understand its purported problems. Publishing the report along with a companion article discussing its problems, that could make sense. Canceling the report without explaining why (and, no, just saying you don’t like this method isn’t enough of an explanation) . . . that’s not serious science. Scientific integrity is not being advanced by this sort of behavior.

I was also upset that Oster just jumped into the discussion to say that she, too, hates the test-negative design. Neither Bhattacharya nor Oster are statisticians. They’re health economists. It’s fine for a health economist to have an opinion on a statistical method, but, to be so sure about it, that doesn’t seem right to me. To the extent that Bhattacharya and Oster have legitimate concerns about the statistical method, they can work with a statistician to express these concerns openly and scientifically.

I’m not saying that statisticians or epidemiologists are always right or that other professionals should defer to them. Statisticians can be wrong, really wrong, and the errors can be compounded by a presumption that they know what they’re doing. So question these reports all you want. But then is the time to bring in an expert of your own, not to wing it.

Above I talked about moral seriousness regarding outcomes. There’s also moral seriousness regarding methods, and neither of the two people in that interview were displaying it. Also important is moral seriousness about communication, which has not been displayed by Bhattacharya, who has yet to come to grips with the fact that he was on the board of an anti-vax organization.

P.S. Dorothy Bishop provides a detailed discussion of this event.

Did Taylor Swift kill a bunch of people?

In a post entitled “FARCE: FARS Album Release Coincidence Examination,” Gaurav Sood writes:

Replication and extended analysis of Patel, Worsham, Liu & Jena (2026), “Smartphones, Online Music Streaming, and Traffic Fatalities,” NBER Working Paper 34866.

Key Findings

1. The Statistical Effect Is Real

Traffic fatalities are elevated on major album release days:

Estimator | Effect (Tier 1) | SE | t-stat
Local (±10 day)* | +23.0 deaths | 5.1 | 4.5
Donut-global | +16.2 deaths | 5.1 | 3.2
Forecast | +22.8 deaths | 4.9 | 4.6

. . .

2. But The Causal Story Doesn’t Hold Up

No dose-response relationship:

Album | Streams | Effect
Tortured Poets (2024) | 313M | -2 deaths
Her Loss (2022) | 97M | +63 deaths
Midnights (2022) | 185M | +5 deaths

. . .

Out-of-sample replication fails (2023-2024):

The paper analyzed 2017-2022 releases. We tested 7 major 2023-2024 albums as a true out-of-sample test:

Album | Streams | Effect
Tortured Poets | 313M | -2.1
UTOPIA | 128M | +10.5
For All The Dogs | 109M | -12.8
Cowboy Carter | 76M | -0.4
Hit Me Hard and Soft | 73M | +7.0
SOS | 68M | +9.4
One Thing at a Time | 52M | -1.5

Average effect: +1.4 deaths (vs. +22.8 for original sample). The biggest streaming day in Spotify history (Tortured Poets, 313M) shows a negative effect. The pattern found in 2017-2022 does not replicate forward.
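As a quick check on the arithmetic, the seven out-of-sample numbers in the table above can be averaged directly. This is a minimal sketch, not the code from the linked analysis. The correlation it prints uses only these seven albums, so it need not match the r reported in the results summary, which may be computed on a different set.

```python
import numpy as np

# (streams in millions, estimated effect in deaths), copied from the table above
albums = {
    "Tortured Poets": (313, -2.1),
    "UTOPIA": (128, 10.5),
    "For All The Dogs": (109, -12.8),
    "Cowboy Carter": (76, -0.4),
    "Hit Me Hard and Soft": (73, 7.0),
    "SOS": (68, 9.4),
    "One Thing at a Time": (52, -1.5),
}
streams = np.array([s for s, _ in albums.values()], dtype=float)
effects = np.array([e for _, e in albums.values()], dtype=float)

mean_effect = effects.mean()
r = np.corrcoef(streams, effects)[0, 1]   # dose-response check on these 7 only
print(f"average effect: {mean_effect:+.1f} deaths")
print(f"correlation of streams with effect: {r:+.2f}")
```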

Single outlier dominates: Her Loss accounts for 34% of the total Tier 1 effect.

3. Methodology Concerns

The ±10 day estimator uses post-treatment days as controls. The paper compares release-day fatalities to the average of the surrounding ±10 days—but this includes days after the release. Standard event studies use only pre-treatment periods. If the effect persists beyond day 0, the control mean is biased upward.
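A toy calculation shows the direction of this bias. All numbers here are hypothetical (a flat baseline, a release-day effect of 20, and an effect that halves each following day), and there is no noise, so this is only a sketch of the mechanism and not a reanalysis of the data.

```python
import numpy as np

baseline = 100.0      # hypothetical deaths per day absent any release
effect_day0 = 20.0    # hypothetical release-day effect
carryover = 0.5       # hypothetical: the effect halves each following day

offsets = np.arange(-10, 11)             # days relative to release
days_since = np.maximum(offsets, 0)
effect = np.where(offsets >= 0, effect_day0 * carryover ** days_since, 0.0)
deaths = baseline + effect

release_day = deaths[offsets == 0][0]
# Symmetric window: the 10 days before and the 10 days after serve as controls.
est_symmetric = release_day - deaths[offsets != 0].mean()
# Pre-only window: only the 10 days before serve as controls.
est_pre_only = release_day - deaths[offsets < 0].mean()
print(f"symmetric window: {est_symmetric:.2f}, pre-only window: {est_pre_only:.2f}")
```

In this setup the pre-only estimate recovers the injected effect exactly, while the symmetric window understates it because the post-release days raise the control mean.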

What The Paper Claims

Patel et al. (2026) find:

  • 139.1 deaths on release days vs 120.9 on control days (+18.2 deaths, +15%)
  • 123.3M streams on release days vs 86.1M control (+43%)
  • Proposed mechanism: smartphone distraction from streaming while driving

What We Did

Analysis | Description
Extended data | FARS 2007-2024 (vs. 2017-2022)
Forecast estimator | Train model on non-release days, predict counterfactual
Dose-response | Test if more streams → more deaths
Extended sample | Added 2023-2024 albums (27 total vs. original 10)
Placebo tests | Pre-trends, year permutation, window sensitivity
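The forecast estimator described in the table can be sketched as follows: fit a model on non-release days only, predict what release days would have looked like, and take the difference. This is a bare-bones illustration on synthetic data with only a day-of-week pattern. The weekday levels, release days, and injected effect are all made up, and the actual analysis presumably uses a richer model (trend, season, holidays).

```python
import numpy as np

n_days = 364                           # 52 full weeks of synthetic data
weekday = np.arange(n_days) % 7        # day-of-week index, 0..6
weekday_level = np.array([95, 92, 94, 98, 115, 120, 105], dtype=float)  # made up
deaths = weekday_level[weekday].copy()

release_days = np.array([60, 200, 305])   # hypothetical release dates
true_effect = 20.0                        # injected effect, so we can check recovery
deaths[release_days] += true_effect

is_release = np.zeros(n_days, dtype=bool)
is_release[release_days] = True

# One indicator column per weekday, so no separate intercept is needed.
X = np.eye(7)[weekday]
# Train on non-release days only.
beta, *_ = np.linalg.lstsq(X[~is_release], deaths[~is_release], rcond=None)

# Predicted deaths on release days had there been no release.
counterfactual = X[is_release] @ beta
estimated_effect = float(np.mean(deaths[is_release] - counterfactual))
print(f"estimated effect: {estimated_effect:.1f} deaths per release")
```

Because the synthetic data have no noise, the estimate matches the injected effect. With real data the quality of the counterfactual forecast is the whole question.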

Results Summary

Finding | Result | Interpretation
In-sample effect | +22.8 deaths/release | Statistically significant (2017-2022)
Out-of-sample | +1.4 deaths/release | Effect vanishes in 2023-2024
Dose-response | r = -0.18 | Wrong sign for causal story
Her Loss outlier | 34% of total effect | Results driven by one album
Tier 2 ratio | 0.80 (expected 0.50) | Effect doesn’t scale with streams

More details at the link.

 

And Matt Thachet writes in with further thoughts:

I was wondering if you saw this paper. I first saw it written up in the New York Times, but it generated a fair number of articles in other outlets, too. The main claim is that the 10 biggest album releases (by Spotify streams) were associated with a 15% increase in fatal car crashes in the US.

I see the logic: higher streaming activity indicates more distracted driving, which causes more car crashes, but something feels flimsy to me. For one thing, it’s not clear to me that streaming music would actually be that distracting. If I wanted to listen to a new album I would put it on and then drive. There’s not much more to it, but maybe I’m underestimating the amount of other smartphone use that comes from this, like posting my reaction on social media.

The other part that sounds challenging is controlling for the day. Most albums are released on Fridays, which will have higher car crashes than other weekdays, but they control for this by comparing the 10 day periods before and after release date, which will include the same day of the week before and after the release date.
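To see what a ±10 day window does to the day-of-week mix, here is a small count for a Friday release. The date is an arbitrary Friday chosen for illustration, and this sketch says nothing about whatever additional day-of-week adjustment the paper may apply.

```python
from collections import Counter
from datetime import date, timedelta

release = date(2024, 4, 19)     # an example Friday, chosen only for illustration
assert release.weekday() == 4   # Monday is 0, so 4 is Friday

# The 20 control days: 10 before and 10 after, excluding the release day itself.
control_days = [release + timedelta(days=k) for k in range(-10, 11) if k != 0]
counts = Counter(d.weekday() for d in control_days)

names = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
for i, name in enumerate(names):
    print(name, counts[i])
```

The window contains two Fridays (one week before and one week after) and three of every other weekday, so a raw average over the window gives Fridays less weight than the other days.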

They include this list of albums and 5 of them were released within 10 days of another album in the list, which presumably makes the 10 day before and after control trickier. The other thing I wondered about, but they don’t seem to mention is whether the albums in the bottom half of the list have half the fatalities associated with the ones at the top, having half the streams. The average number of traffic fatalities per day is about 100, so maybe this would be too hard to tell.

Anyway, I’m curious to hear your reaction to it, if you have time. Like I said, the causal mechanism makes sense to me, but 15% is a huge increase and it just seems like controlling for day, season, holidays, etc. would make this almost impossible to be sure about.

I don’t have the energy to look into this myself.  Gaurav and Matt seem to have the right general approach, which is to look at the effect in the context of specific cases and to study variation.  In contrast, the common approach to quantitative research in published social science is to find some statistically significant relationship and hold onto it for dear life.

Or maybe I’m just saying this because I don’t want to believe that musicians are killing people.  I have a soft spot for pop stars, as compared to the culture heroes of today.

My talk at Stanford later this month: “What to do when your estimate is 1 standard error away from 0?”

Tuesday 28 Apr 2026, 4pm in CoDa E160:

What to do when your estimate is 1 standard error away from 0?

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

We provide a new answer to this simple yet very important question. Thinking clearly about this problem leads us to bring in many ideas in statistical analysis and computing, including causal identification, meta-analysis, Mister P, expectation propagation, decision analysis, experimental design, and the fundamental unity of Bayesian and frequentist statistics. We demonstrate our approach in examples from many applications, including medicine, social science, business, sports, and public policy.

This work is joint with Witold Więcek and Erik van Zwet.

In addition to all the above, I’ll probably drift into some related general topics such as the role of experimentation in science and engineering and the limitations of thinking about policy analysis in terms of causal inference.

2026 American Causal Inference Conference

This one looks great.

Jonas Mikhaeil will be speaking at a session, “Controversies about Counterfactual Utilities,” which is related to our paper, Russian roulette: The need for stochastic potential outcomes when utilities depend on counterfactuals. The other speakers in that panel are Kosuke Imai, Mats Stensrud, Benedikt Koch, and Amanda Kowalski.

The sessions at the conference cover some of my favorite topics, including varying treatment effects, Bayesian inference, generalizability, latent variables, experimental design, network analysis, and causal discovery.

So, yeah, good stuff all around. This conference has been going for a while now–it was originally the Atlantic Causal Inference Conference and we held the first one here at Columbia in 2005, not long after we launched this blog which includes Causal Inference in its name. So it’s good to see this going stronger than ever, and keeping its balanced focus on theory, methods, and applications.

This year’s meeting is in Salt Lake City on 11-14 May 2026 so not too late for you to register!

Survey Statistics: sampling-weighted loss

We’ve mostly focused on a population mean E(Y) as our quantity of interest. We saw how methods extend to estimating a subgroup mean E(Y | V=1), e.g. voters.

What about estimating a general conditional mean E(Y | X) ? We talked a lot (4 posts) about calibrating this to a known population mean E(Y), e.g. via the “logit shift”. But first we start with an estimate of E(Y | X) from survey data.
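As a concrete instance of a sampling-weighted loss, here is a minimal sketch of fitting a linear model for E(Y | X) by minimizing the weighted squared error, sum_i w_i (y_i - a - b x_i)^2. The assumptions are loud ones: the weights are taken to be inverse inclusion probabilities w_i = 1/pi_i, the model is linear, and the function name and toy numbers are hypothetical.

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """Minimize sum_i w_i * (y_i - a - b'x_i)^2, with w_i the sampling weight."""
    X1 = np.column_stack([np.ones(len(y)), X])
    sw = np.sqrt(w)
    # Scaling rows by sqrt(w) turns the weighted problem into ordinary least squares.
    beta, *_ = np.linalg.lstsq(X1 * sw[:, None], y * sw, rcond=None)
    return beta

# Toy survey: hypothetical inclusion probabilities pi_i, weights w_i = 1 / pi_i
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])    # exactly y = 1 + 2x
pi = np.array([0.5, 0.1, 0.2, 0.5, 0.25])
beta = weighted_least_squares(x, y, 1.0 / pi)
print(beta)
```

In this toy example the data lie exactly on a line, so the weights do not change the fit. They matter when the model is misspecified or when the relationship differs across units sampled with different probabilities.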

Lumley 2010 Section 5.2 says:

The polar bear has been going thru the pile of papers he was sitting on last week and found this:

Replace R (whether you respond to a survey) with T (whether you are treated) and you can see that my drawing is heavily inspired by Johansson et al. (2022) Figure 3:

We’ve talked about connections between survey random sampling and randomized experiments. There are also connections between nonprobability surveys and observational studies. We will explore more analogies between survey statistics and causal inference. Favorite references ?

My online talk Tues 24 Feb, 9am NY time at the Behind-the-Scenes seminar series: Russian Roulette and stochastic potential outcomes

I’m speaking at this online seminar Tues 24 Feb, 9am NY time:

The Behind-the-Scenes Seminar Series is designed to learn about the production process of research papers, offering an opportunity for students and researchers in all fields and at all career stages to engage with the challenges encountered during project development and how they were overcome.

Unlike most research seminars that focus on the research findings, this series will be dedicated to discussing the research process. Not only this, the seminars will also feature a live survey to gauge the audience’s expectations regarding the journey of the paper and compare them with the speaker’s actual experience.

What happened is that a few months ago the seminar organizers (three economists: Vatsal Khandelwal, Séverine Toussaert, and Jasmin Baier) wrote to me:

Speakers not only present their findings but also share the story behind their research, from the initial idea and design choices to data or modelling challenges and unexpected results.

Our aim is to foster openness, reflection, and engagement in the research community by highlighting the often-invisible processes that shape scientific work.

Would you be willing to suggest a paper you could cover? Ideally, it would be something that has already been accepted for publication, so that we can discuss the full journey, including the submission and review process.

I replied:

Here’s a list of our published research from last year.

If you go to that link and scroll down to “The stories behind the papers,” you’ll see where each paper came from.

So, if you want, you can pick one or more papers from that list that have good origin stories.

They responded that, as economists, they were most interested in the Russian roulette project.

It should be fun, to speak not just on the research itself but on where it came from and how it came to be published. It’s a joint paper with Jonas Mikhaeil, and we came up with the idea after hearing from Amanda Kowalski about her recent paper with Neil Christy, which got us thinking about what you can get from stochastic models for potential outcomes.

Here’s our published paper, “Russian roulette: The need for stochastic potential outcomes when utilities depend on counterfactuals,” and here’s the abstract:

It has been proposed in medical decision analysis to express the “first do no harm” principle as an asymmetric utility function in which the loss from killing a patient would count more than the gain from saving a life. Such a utility depends on unrealized potential outcomes, and we show how this yields a paradoxical decision recommendation in a simple hypothetical example involving games of Russian roulette. The problem is resolved if we abandon the stable unit treatment value assumption and allow the potential outcomes to be random variables. This leads us to conclude that, if you are interested in this sort of asymmetric utility function, you need to move to the stochastic potential outcome framework. We discuss the implications of the choice of parameterization in this setting.

We learned a lot from writing this paper and we’re continuing to think about the topic.

So, if you want to hear more, you can go to the Behind the Scenes website and sign up to get the zoom link. And here’s our blog discussion of the paper from last year.

Hanging out: An observational study

Paolo Parigi writes:

Bruno Abrahao and I are dealing with a complex data structure and are unsure of the best way to analyze it.

We have data from approximately 9,000 Airbnb users who participated in an online investment game designed to measure trust. Bruno and I published a paper on this back in 2017, in collaboration with Alok Gupta and Karen Cook. This was phase 1 of the study.

We then invited all phase 1 participants to return six weeks later and play the game again (phase 2). About 5,000 people came back. In the interim between the two phases, Airbnb tracked platform usage, and roughly 3,500 users traveled.

Among those who traveled, some reported having an experience they described as similar to a “hangout” in a follow-up survey. We have treated having a hangout as a quasi-random experience, given that it requires two people to occur.

We are unsure of the best way to analyze this data. One approach we’ve explored is to focus only on those who traveled between phases 1 and 2, treating the “hangout” experience as the treatment, and excluding non-travelers from the sample. Using a difference-in-differences model to examine trust differences between the two phases, we found that those who experienced the hangout treatment became more sensitive to reviews and reacted more strongly to homophily compared to those who travelled but did not have a hangout experience (the control group). We used IPW [inverse estimated probability weighting] to adjust for attrition, specifically for those who traveled but did not return for phase 2.
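The approach described in this paragraph can be sketched as a weighted difference-in-differences. This is not the authors' code. The function `ipw_did`, the variable names, and the numbers are all hypothetical, and the return probabilities are assumed to come from some separately fitted attrition model.

```python
import numpy as np

def ipw_did(trust1, trust2, hangout, p_return):
    """Weighted difference-in-differences among travelers who returned for phase 2.

    trust1, trust2: outcome in phases 1 and 2
    hangout: 1 if the traveler reported a hangout, else 0
    p_return: estimated probability of returning for phase 2 (assumed given)
    """
    change = np.asarray(trust2, float) - np.asarray(trust1, float)
    w = 1.0 / np.asarray(p_return, float)   # inverse probability weights
    treated = np.asarray(hangout) == 1
    treated_change = np.average(change[treated], weights=w[treated])
    control_change = np.average(change[~treated], weights=w[~treated])
    return treated_change - control_change

# Made-up numbers for six returning travelers
trust1 = [5, 6, 4, 5, 7, 6]
trust2 = [7, 7, 6, 5, 8, 6]
hangout = [1, 1, 1, 0, 0, 0]
p_return = [0.5, 0.5, 0.25, 0.5, 0.25, 0.5]
estimate = ipw_did(trust1, trust2, hangout, p_return)
print(f"IPW difference-in-differences estimate: {estimate:.2f}")
```

The weights correct only for attrition that is explained by whatever goes into the return-probability model. They do nothing about selection into traveling or into having a hangout.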

However, this approach excludes many users who did not travel but still participated in phase 2. Dropping so many observations raises the question of whether we could retain all users who returned for phase 2. Yet, we are uncertain how to model the data, especially given the nested hierarchical treatment (e.g., travel at level 1 and travel + hangout at level 2), as well as the multiple sources of attrition (self-selection into participating in the experiment, into traveling, and into having a hangout).

Moreover, including all phase 2 returnees in the analysis–especially those who did not travel–may complicate the interpretation of the treatment, as only those who traveled could have had exposure to the treatment, i.e., a hangout experience. At the same time, those who did not travel give us data on subjects who were not exposed to the treatment.

I told Parigi that there were three things about the above description that I didn’t understand:
– What is a “hangout”?
– What do you mean by “more sensitive to reviews”?
– What do you mean by “reacted more strongly to homophily”?

He replied:

Based on the responses to the Airbnb followup survey, we grouped participants into four categories:
– Did not travel
– Traveled (impersonal hotel-like experience)
– Traveled (social/friendly experience)
– Traveled and “hung out” with hosts

We’re now analyzing how these experiences affected trust behavior between the two phases, focusing on shifts in reliance on homophily and reputation. To isolate effects, we compared “Hangout” vs. “Did not travel” (excluding the intermediate groups), and “Hangout” vs. the other travel categories (excluding “Did not travel”).

We recognize this means excluding some data, but our goal is to sharpen the comparison between clearly distinct treatment groups.

Our difference-in-differences analysis shows two key effects:

– Hangout experiences reinforced reaction to homophily: Subjects showed a stronger preference for demographically similar profiles in phase 2.

– Hangout experiences reinforced sensitivity to reputation: Hangout participants became more trusting of profiles with higher reputation in phase 2.

OK, given all that, here’s my reply:

1. You ask about “the best way” to analyze these data. There is no best way! I say this not just in the trivial sense that we never know what to do, we can always learn more, etc., but also in the more specific sense that you can often learn multiple things from the same dataset. You might want to use these data to perform an observational study estimating the effect of “hanging out” and also to perform various descriptive analyses comparing different groups of customers. So even in an ideal setting there’d be no “best way”; rather, there are many things you can do.

2. Before getting to the causal analysis, let me just make a pitch for descriptive comparisons. You can look at different outcomes, comparing average responses of people with different demographics, different past travel behavior, etc. It’ll just be good to get a look at the data and see what’s going on!

3. To estimate the effect of hanging out, yes, it makes sense to consider this as a natural experiment of sorts. The important thing here will be to adjust for pre-treatment variables: again, demographics, also geography (where the respondents are from and also where they traveled to), their past frequency of traveling, etc. I don’t see any benefit from thinking of this as a difference-in-differences comparison. To me, it’s an observational study.

4. Regarding the question of whether to include the non-travelers in your study: this depends on what is the causal effect you’re trying to estimate. If you want to estimate the effect of “hanging out,” you need to decide what you are comparing it to.

5. Some of your specific questions can be framed as treatment interactions. Here my suggestion is to look at all interactions of potential interest, and you can display these estimates in some sort of graph. This will be better than just pulling out one or two interactions regarding homophily or trust or whatever.
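Points 3 and 5 can be sketched in a few lines. What follows is a toy simulation, not the Airbnb data: the variable names (`frequent`, `hangout`, `y`), the effect sizes, and the single binary pre-treatment variable are all assumptions made up for illustration, and stratifying on one variable is a simple stand-in for the fuller regression adjustment described above. It compares the raw treated-versus-control difference to an estimate that adjusts for the pre-treatment variable, and it reports the treatment interaction as the difference in estimated effects between strata.

```python
import random
from statistics import mean

random.seed(1)

def simulate(n=20000):
    """Simulate (frequent, hangout, y) rows; every number here is invented."""
    rows = []
    for _ in range(n):
        frequent = 1 if random.random() < 0.5 else 0      # pre-treatment variable
        p_hangout = 0.6 if frequent else 0.2               # frequent travelers hang out more
        hangout = 1 if random.random() < p_hangout else 0  # the "treatment"
        effect = 2.0 if frequent else 1.0                  # effect differs by stratum
        y = 3.0 * frequent + effect * hangout + random.gauss(0, 1)
        rows.append((frequent, hangout, y))
    return rows

def diff_in_means(rows):
    """Average outcome among the treated minus average outcome among controls."""
    treated = [y for _, h, y in rows if h == 1]
    control = [y for _, h, y in rows if h == 0]
    return mean(treated) - mean(control)

rows = simulate()

# Raw comparison, ignoring the pre-treatment variable.
naive = diff_in_means(rows)

# Compare within each stratum, then average using the stratum shares.
strata = {s: [r for r in rows if r[0] == s] for s in (0, 1)}
effect_by_stratum = {s: diff_in_means(strata[s]) for s in (0, 1)}
adjusted = sum(len(strata[s]) / len(rows) * effect_by_stratum[s] for s in (0, 1))

# Treatment interaction: how much the estimated effect differs between strata.
interaction = effect_by_stratum[1] - effect_by_stratum[0]

print(f"naive comparison:     {naive:.2f}")
print(f"adjusted estimate:    {adjusted:.2f}")
print(f"effect by stratum:    {effect_by_stratum[0]:.2f}, {effect_by_stratum[1]:.2f}")
print(f"interaction estimate: {interaction:.2f}")
```

The simulation is built so that the average effect is 1.5 and the interaction is 1.0, while the raw comparison is inflated because frequent travelers both hang out more and have higher outcomes. With real data there would be many pre-treatment variables and many interactions, which is where a regression model and a graph of the estimates come in.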

“Statistics is widely understood to provide a body of techniques for ‘modeling data.’”

John “Bayesian Data Analysis” Carlin writes:

Recent developments in the methodology of epidemiological research have emphasized the importance of achieving clarity of purpose by classifying research questions into one of three types: descriptive, predictive, and causal. . . .

I [Carlin] do not believe that studies aiming to “identify” independent predictors or “prognostic factors” are addressing well-defined research questions. Indeed, beyond the issues already raised, there is a broader question of the extent to which it is ever sensible to frame a research question as if it could be answered dichotomously, as in “is this an (independent) prognostic factor?” Prediction questions, which include prognosis, are those that involve the development of a model or algorithm to provide predictions of outcomes using available variables that are potential predictors.

This all makes sense. I kinda think that descriptive, predictive, and causal are all the same thing–or, more precisely, that “descriptive” and “causal” are special cases of “predictive,” under different conditions. But if you want to divide them into three tasks, sure, go for it. Personally, I’d rather divide statistics into the goals of exploration, estimation, and discrimination, but I think that’s because I’m thinking in a more general “data science” perspective, whereas John is focusing more on the more traditional problem of inference.

But, yes, I agree with him 100% on avoiding dichotomization, a topic that Sander Greenland, I, and others have been screaming about for a long time–indeed, John and I contributed to the anti-dichotomization theme in our book Bayesian Data Analysis, in that we focused on model building and inference within a model, rather than on the then-fashionable problem of choosing among or comparing models using Bayes factors. So, yes on that.

John continues:

Some variables may have greater predictive value than others, but this should be assessed by comparing the predictive value of the model or algorithm with and without the use of that variable, not by examining its “independent effect” in a multivariable regression model.

I’m confused on this point. I mean, sure, I agree that you shouldn’t label a regression coefficient as an “independent effect”; indeed, I always use the terms “predictors” and “outcome” rather than “independent and dependent variables.” Beyond this, I’m not quite sure what John is suggesting. Suppose you have a predictor of interest, x3, and you’ve fit the model y ~ x1 + x2 + x3 (for convenience using standard R notation). I guess John is saying, don’t just look at the coefficient for x3 in that model; also compare it to the model y ~ x1 + x2. Maybe this is a good idea–it’s not something I’ve thought about for a while. Is this the same as what used to be called “partial regression coefficients”? I remember from the statistical literature in the 1960s and 1970s that there was a lot of work on methods for understanding what happens in linear regression when you add one variable at a time. Perhaps it would be good to revisit some of those ideas, and maybe it’s a mistake that we don’t cover them in Regression and Other Stories.
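Here is a minimal sketch of that with-and-without comparison, under one reading of the suggestion (it may not be exactly what John has in mind). The data are simulated, the coefficients are invented, and the helper names `solve`, `ols`, and `mse` are hypothetical; the models are y ~ x1 + x2 + x3 and y ~ x1 + x2, each fit by least squares and compared on held-out prediction error.

```python
import random

random.seed(2)

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

def mse(X, y, beta):
    """Mean squared prediction error of the fitted linear predictor."""
    preds = [sum(b * x for b, x in zip(beta, row)) for row in X]
    return sum((yi - p) ** 2 for yi, p in zip(y, preds)) / len(y)

def simulate(n):
    """Invented data: x3 is correlated with x1, and y depends on all three."""
    X, y = [], []
    for _ in range(n):
        x1 = random.gauss(0, 1)
        x2 = random.gauss(0, 1)
        x3 = 0.8 * x1 + 0.6 * random.gauss(0, 1)
        X.append([1.0, x1, x2, x3])  # leading 1.0 is the intercept column
        y.append(1.0 + 0.5 * x1 + 0.5 * x2 + 1.0 * x3 + random.gauss(0, 1))
    return X, y

def drop_x3(X):
    """Keep the intercept, x1, and x2 columns only."""
    return [row[:3] for row in X]

X_train, y_train = simulate(2000)
X_test, y_test = simulate(2000)

beta_full = ols(X_train, y_train)              # y ~ x1 + x2 + x3
beta_reduced = ols(drop_x3(X_train), y_train)  # y ~ x1 + x2

mse_full = mse(X_test, y_test, beta_full)
mse_reduced = mse(drop_x3(X_test), y_test, beta_reduced)

print("coefficient for x3 in full model:", round(beta_full[3], 2))
print("coefficient for x1, full vs. reduced:", round(beta_full[1], 2), round(beta_reduced[1], 2))
print("held-out MSE, full vs. reduced:", round(mse_full, 2), round(mse_reduced, 2))
```

The two summaries answer different questions. The change in held-out error says how much x3 adds to prediction once x1 and x2 are available, while the coefficient for x3 is a comparison of predictions for units that differ in x3 but are identical in x1 and x2. Note also how the coefficient for x1 shifts when x3 is dropped, since in this simulation x1 and x3 are correlated.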

I also want to plug my paper with Guido Imbens (also included as Section 21.5 in Regression and Other Stories), Why ask why? Forward causal inference and reverse causal questions. Our point there is that it can be a good idea to search for prognostic factors in observational data, not with the idea this will identify causal effects but rather as a way of understanding what’s missing from our existing models.

Finally, John writes:

More broadly, debates on whether to “adjust” or not for certain variables in a regression model can only be answered by situating the analysis within a sharply defined research question and a sharply defined rationale for specifying a regression model in the first place.

I don’t get this at all. First I don’t get why “adjust” is in scare quotes; second, ummm, yeah, it’s always good to have a sharply defined research question, but in the meantime people are always making comparisons, and so let’s do what adjusting we can. For example, in an epidemiology study it should pretty much always be a good idea to adjust for age and smoking history. Or maybe John would say that the rationale for adjusting for age and smoking history is sharply defined, in which case maybe we’re in agreement.

To put it another way, it’s often a good idea to have a sharply defined research question–but that applies in general, not just for statistical adjustments. I think it’s also true that it’s better to have a sharply defined research question when performing a randomized clinical trial. A randomized clinical trial gives identification for the sample average treatment effect in any case–but without a sharply defined research question, it’s not clear what can be done with such an estimate.

So I’m wary of John singling out adjustment in his criticisms, as I fear his article will be taken as implying that, if you don’t try to adjust, everything will be ok.

I sent the above comments to John, and he replied:

– I [John] don’t think it helps to say everything is actually prediction — it kinda goes against the idea of sharp specification of purpose! I’m not sure if my interest is “the traditional problem of inference” (not sure what you mean by that). I see it being about drawing conclusions that will be relevant to policies, decisions and practices in medicine and health. In this area, the question is usually causal (people want to say something about the effect of an intervention or exposure) but could be descriptive (“burden of disease” etc) or prediction (we want an algorithm for advising patients etc). The desired answer to the question (if causal or descriptive), and therefore the embodiment of the “sharp” question, should be a well-defined population parameter — as Don Rubin used to say, what would the number be if you had the whole population of interest (and could measure potential outcomes under both conditions for a causal question)…

– You write: “Suppose you have a predictor of interest, x3, and you’ve fit the model y ~ x1 + x2 + x3…” I can’t make sense of this unless you tell me why you’ve specified this model and what you hope to learn (re question of interest) from it. See all the stuff about “true model myth” in our papers.

– Re “it should pretty much always be a good idea to adjust for age and smoking history”, no!!! This is the whole point of a lot of discussion in modern epidemiology/causal inference, that you should consider very carefully what you “adjust for” and why. The “scare” quotes are actually real quotes, reflecting that people use this terminology (of adjustment) with a whole range of ill-defined intentions, so the term “adjust for” is not well defined. There are some brief examples in the short paper that you’re commenting on that try to highlight why “adjusting” is often poorly understood and not serving a well-defined purpose. There’s a lot of writing in the epi causal inference literature about the dangers of over- or under-adjusting, and methods (usually using DAGs and always requiring substantive non-statistical assumptions) to figure out what should be adjusted for (in causal questions).

In response to John’s response, let me just say:

– The problems that he refers to, of causal inference, decision making, and description, can all be characterized as predictions conditional on some inputs. I agree 100% with John that it’s important to consider your applied goals. My statement, “‘descriptive’ and ‘causal’ are special cases of ‘predictive,’ under different conditions,” is not at all meant to deny the relevance of applied goals. The key question is what to predict, and “prediction” does not just mean prediction for new data that look just like your observed data, or for some specified test set, or whatever.

– I can change my example with x1, x2, and x3 to a real-world example from political science: a model of legislative elections from our 1990 paper, where y = Republican share of the two-party vote in a district, x1 = Republican vote share in the previous election, x2 = incumbency status in the current election, x3 = incumbent party indicator. I don’t actually think that linear regression is the best model here (see our update from 2008). Anyway, in this model our goal was to estimate the incumbency advantage, defined as the expected difference in vote share for a party if an incumbent was running, compared to if it were an open seat. Our reason for including multiple predictors in our regression is not because we think any of our models are approximating a true data generating process or because we think that adjustment is sufficient to control for all confounding bias. We’re just doing the best we can here, keeping our applied goals in mind.

– Regarding age and smoking history: Sure, it depends on the goals. I was imagining a study of some medical intervention with treatment and control groups, where the patients have a mix of ages, and the outcome of interest is survival. For that sort of problem you’ll want to adjust for age and pre-treatment smoking history: even in a randomized trial there will typically be some imbalance on these important predictors, and in an observational study the imbalance can be large. And, relevant to John’s point, how you do this adjustment can be important. For example, simply including age as a linear predictor or an indicator for age > 65 might not do a good job. It also can be important for the model to allow the treatment effect to vary by age and smoking history, and I wouldn’t want to imply that any particular adjustment will solve all problems here. What is a reasonable adjustment procedure will depend on context, and problem-specific knowledge will help.
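The age example can be sketched with a simulated randomized trial. Everything here is hypothetical: the sample size, the continuous outcome `y`, the linear dependence on age, and the assumption that the benefit of treatment shrinks with age are all invented for illustration, and `fit_line` and `effect_at` are made-up helper names. The sketch fits a separate line in each arm, which lets the estimated treatment effect vary with age, and then averages the predicted differences over the patients in the sample.

```python
import random
from statistics import mean

random.seed(3)

def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

# Invented trial: treatment assigned by coin flip, outcome declines with age,
# and the true benefit shrinks from 8 at age 40 to 4 at age 80 (average 6).
n = 2000
age = [random.uniform(40, 80) for _ in range(n)]
treat = [random.randint(0, 1) for _ in range(n)]
y = [60 - 0.5 * a + t * (8 - 0.1 * (a - 40)) + random.gauss(0, 3)
     for a, t in zip(age, treat)]

treated = [(a, yi) for a, t, yi in zip(age, treat, y) if t == 1]
control = [(a, yi) for a, t, yi in zip(age, treat, y) if t == 0]

# Unadjusted estimate and the chance imbalance in age between arms.
raw = mean(yi for _, yi in treated) - mean(yi for _, yi in control)
age_gap = mean(a for a, _ in treated) - mean(a for a, _ in control)

# Separate regression on age in each arm: the treatment effect may vary with age.
a1, b1 = fit_line([a for a, _ in treated], [yi for _, yi in treated])
a0, b0 = fit_line([a for a, _ in control], [yi for _, yi in control])

def effect_at(a):
    """Estimated treatment effect for a patient of age a."""
    return (a1 + b1 * a) - (a0 + b0 * a)

# Average the predicted differences over everyone in the sample.
adjusted = mean(effect_at(a) for a in age)

print(f"age imbalance (treated - control): {age_gap:.2f} years")
print(f"raw difference in means:           {raw:.2f}")
print(f"age-adjusted average effect:       {adjusted:.2f}")
print(f"estimated effect at age 50, 70:    {effect_at(50):.2f}, {effect_at(70):.2f}")
```

A straight line in age is itself a strong assumption, which is the point made above about how the adjustment is done. The same structure carries over to a more flexible function of age or to additional pre-treatment predictors such as smoking history.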

John also reminded me that a couple years ago we posted on John and Margarita Moreno-Betancur’s preprint, “On the Uses and Abuses of Regression Models: A Call for Reform of Statistical Practice and Teaching.” In the discussion thread, John wrote:

The idea that the method or technique is central is the fundamental problem that we identify, and we propose that this can only be addressed by turning the initial focus of teaching to the purpose or research question. This needs to be done very seriously, i.e. there should be no “general theory” of regression models, or of how to fit them “well” . . .

There has been a longstanding tendency to identify the role of the biostatistician with models and estimation techniques, undervaluing the essential role of statistical thinking in framing well-specified research questions. In contrast, our key assertion is that the three types of question need to be taken seriously, at the beginning of every analytic investigation. Identifying the type of question requires considerable reflection and discussion between statistician and collaborator. Once identified though, very briefly, if you have a prediction question, you might then think about tools A and B for developing prediction algorithms (multivariable regression being a great one if done well)… If you have a descriptive question, the first thoughts for analysis might be simple descriptive statistics, but there might be complicating issues to attend to (re sampling bias, perhaps, for which some regression technology might help). If you have a causal question, then you need to get serious about defining the question precisely, in terms of a target parameter or estimand, before thinking about the potential role for models to assist in estimating that quantity. Rather rarely, we suggest, will this target parameter be naturally defined by a coefficient in a regression model, although under carefully considered assumptions a regression coefficient might be relevant.

I think I agree with everything John is saying here–as long as you’ll accept my interpretation! When John writes, “there should be no ‘general theory’ of regression models, or of how to fit them ‘well’” . . . ummmm, my colleagues and I wrote this book, Regression and Other Stories, so obviously we do think there’s general advice we can give for regression models, and of course there’s mathematical theory (least squares, etc.). Is there a “general theory”? I guess it depends what is meant by that. In our book we recommend starting simple and then building up from there as needed based on the applied goals. This is some sort of principle, but it’s not a “general theory,” if by that John means a particular algorithm such as stepwise regression or lasso or horseshoe or whatever.

Then again, for regression there can never be a general theory, in that the most important part of a regression is typically the choice of what predictors to include. Indeed, I’ve been loudly (and, I think, correctly) critical of regression discontinuity analyses that don’t include enough predictors to account for pre-treatment differences between treatment and control. (An extreme example of this was a regression predicting remaining years of life which did not include current age as a predictor.)

The year after that discussion, John and Margarita’s paper was published; here’s the final version. The journal also published some discussions and a rejoinder.

I read these two articles in full, and I pretty much agree with everything that John and Margarita are saying. I agree, for example, with their statement that “it is not logically possible for a single multivariable regression model,” and I think the points they make in their article are consistent with the approaches recommended in our books, Bayesian Data Analysis, Data Analysis Using Regression and Multilevel/Hierarchical Models, Regression and Other Stories, and the forthcoming Bayesian Workflow. Indeed, their recommendations also jibe with the statistical analyses performed in our book, Red State, Blue State, Rich State, Poor State.

Given all this, perhaps the question is, what is the role of default statistical procedures and generic statistical advice? We wrote Regression and Other Stories based on the knowledge that researchers will be making comparisons with various predictive and conditionally predictive goals, including causal inference, and they will be applying linear and logistic regressions to this task. With that in mind, I think it’s valuable to teach people how to understand the meaning of the expression, y = a + bx + error, along with more complicated expressions, how to interpret standard errors, how to graph data and fitted regression lines, how to avoid common pitfalls in understanding, and so forth.
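As a small illustration of the kind of generic skill meant here, the following sketch simulates data from y = a + bx + error, fits the line by least squares, and computes the standard errors of the two coefficients using the usual textbook formulas. The true values (a = 2, b = 0.5, error sd = 1) and the sample size are assumptions chosen only for the demonstration.

```python
import math
import random
from statistics import mean

random.seed(4)

# Simulated data from y = a + b*x + error, with invented true values.
n = 200
x = [random.uniform(0, 10) for _ in range(n)]
y = [2.0 + 0.5 * xi + random.gauss(0, 1) for xi in x]

# Least-squares estimates of the slope b and intercept a.
mx, my = mean(x), mean(y)
sxx = sum((xi - mx) ** 2 for xi in x)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
a = my - b * mx

# Residual standard deviation, with n - 2 degrees of freedom.
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sigma = math.sqrt(sum(r ** 2 for r in resid) / (n - 2))

# Standard errors of the slope and the intercept.
se_b = sigma / math.sqrt(sxx)
se_a = sigma * math.sqrt(1 / n + mx ** 2 / sxx)

print(f"a = {a:.2f} (se {se_a:.2f})")
print(f"b = {b:.2f} (se {se_b:.3f})")
print(f"residual sd = {sigma:.2f}")
```

The estimate of b is a comparison: the average difference in y between two units that differ by one unit in x. The standard error describes how much that estimate would be expected to vary across repeated samples under the model, which says nothing by itself about whether the model answers the applied question.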

It’s kinda like . . . math is used for all sorts of purposes. You can’t “do calculus” in an applied sense without having some sense of your real-world goals. Even so, general-purpose math textbooks can be useful. There are general techniques for differentiation and integration, general ways of understanding concepts such as the slope of a line and the area under a curve, etc.

To get back to statistics and regression modeling, here’s the beginning of Section 1.1 of Regression and Other Stories:

I guess we take it as a given that the only reason people are doing this is that they have some applied goals. We continue right away in that chapter with several examples.

That said, I get it when John and Margarita lament that “statistics is widely understood to provide a body of techniques for ‘modeling data.’” This is a concern. In our books we always try to start with the applied question before getting to the data, but sometimes we take the application for granted and don’t spell out the goals. Also, just as mathematical methods exist in an abstract sense without any application (2*2=4, hey!), various statistical methods and principles exist independently of examples, and indeed these are the kind of things that are worth teaching in a statistics course. So it’s complicated. I don’t think there are any easy answers. Maybe there should be a short new book explaining the ideas in John and Margarita’s article, but I don’t think this would take the place of books like Regression and Other Stories and Bayesian Workflow, which are focused on techniques for building models, fitting them, and understanding them once they’ve been fit, nor will it take the place of more mathematical books that derive the methods that we use.

The stories behind our published research from last year

It’s January so time to look back on what we’ve done in the past year.  I thought this time I’d give a little story of background on each of our published papers.

First, here’s the list of recently published papers:

Also we completed some new work that’s not yet been published:

We have a lot on deck for 2026, including two new books (Bayesian Workflow and the second edition of the edited Handbook of Monte Carlo) and a bunch of research articles on different topics in statistical modeling, causal inference, and social science.

And you can expect another 600 or so blog posts.

The stories behind the papers

It’s hard for me to pick my favorites among all the recently published papers, so let me just say something about each of them, in the same order they were listed above (roughly inverse chronological order of publication):

  • Adaptive sequential Monte Carlo for structured cross validation in Bayesian hierarchical models:  GH took a couple of my classes and had ideas for a couple of papers, including this one.  This is his idea that I just helped on a small amount.
  • Reanalysis of “Competition and innovation: An inverted-U relationship”:  This was originally a blog post.  The editor of the Journal of Robustness Reports asked me to submit it to them.  It took a couple rounds–the reviewers made some good points!–and a fun thing about this journal is that you can go to the link and see the entire review process.
  • The ladder of abstraction in statistical graphics:  I absolutely love this paper.  It originated in a talk I gave to Ron Yurko’s statistical graphics class at CMU.  I sent it to the journal and they had some good suggestions for improvement that my friend and colleague Kaiser Fung was able to do.
  • Statistical workflow:  As many of youall know, we’ve been writing a book on Bayesian Workflow–it will appear very soon!  I felt that the workflow concept would be useful in non-Bayesian statistics too, so my colleagues and I organized a special issue of a journal, where we solicited a bunch of articles from theoretical and applied researchers, mostly not Bayesian, to get different perspectives on workflow.  The journal issue is looking good–I guess it will be out soon–and we wrote this short article to lead off that issue.  It’s a short paper and I recommend you take a look!
  • Adjusting for underreporting of child protective services involvement in the Future of Families and Child Wellbeing Study and assessing its empirical implications through illustrative analyses of young adult disconnection:  OK, I don’t have much to say about this one.  It’s by my colleagues at the school of social work at Columbia; I was involved in the survey weighting for the study.
  • A multilevel Bayesian approach to climate-fueled migration and conflict:  Hey, I don’t remember much about this at all!  But, yeah, multilevel modeling, I guess I did something useful here!
  • Artificial intelligence and aesthetic judgment:  This one’s mostly by Jessica and Ari, but I made some contributions throughout, which might be recognized from earlier appearances of some of these ideas on the blog.  It’s published in Sankhya because I think they asked me to submit something for a special issue, and we had this cool paper that we couldn’t figure out what to do with.
  • Discussion of “Statistical exploration of the manifold hypothesis”:  This journal sometimes runs papers with discussions (they did a couple of mine in the past decade), and sometimes I contributed something.  Here I saw a good opportunity to remind people of my thoughts on Tibshirani’s “bet on sparsity” principle and where it can go wrong.
  • Meta-analysis with a single study:  What can I say?  This paper has an awesome title.  Erik, Witold, and I have been meeting weekly and will be coming out with more articles soon on science and meta-science.
  • Normative scientific conflict is unavoidable and should be welcomed:  I can’t remember how, but I came across an announcement of a special issue of the journal Theory and Society on the topic of normative scientific conflict.  I had some things to say on the topic, and this seemed like a good outlet.  I like this paper!  You should read it.
  • Russian roulette:  The need for stochastic potential outcomes when utilities depend on counterfactuals:  This paper has a funny story behind it.  I was contacted by economist Amanda Kowalski about a paper she and her colleagues had written about causal inference.  That paper got me thinking about stochastic potential outcomes and asymmetric utility functions, and I had this idea of demonstrating these ideas in a simple example of Russian roulette.  Jonas joined as a collaborator and clarified a bunch of issues that I’d been sloppy with.  We asked Amanda if she wanted to join in, but she was too busy on her own stuff.  Anyway, the final paper is cool–it’s really clean, and it’s timely because lots of people are interested in going beyond the stable unit treatment value assumption.
  • Multilevel regression and poststratification using margins of poststratifiers:  Improving inference for HIV health outcomes during the COVID-19 pandemic:  Qixuan has been taking the lead on a bunch of papers we’ve been doing, generalizing MRP in various ways.  I think we’re gradually moving toward a bright future of generalizing from sample to population.
  • Statistical graphics and comics:  Parallel histories of visual storytelling:  This is an idea that I’ve had for a while.  I mentioned it in class offhandedly one day, and one of the students told me she was interested in the topic too, so we wrote this article.  It was a true collaboration.  It’s kind of a specialized topic, but I think it should have a potentially wide audience, because lots of people love comics and lots of people love statistical graphics.  We focus on the fascinating question of how it is that these two modes of communication have developed only in the past few centuries, even though they could have been invented much earlier.  This is a sister paper to the “ladder of abstraction” paper mentioned above.
  • Letter to the editor:  Long story here.   Back in 2017, a bigshot professor lied about me in a published article in the journal, Perspectives on Psychological Science.  It was the kind of crap article that should never have been accepted, but at the time that journal was run by a corrupt cabal and they were publishing their friends’ articles essentially without peer review.  At the time I complained to the journal but only got rude responses from the cabal.  But things change.  The journal is now run by civilized people and they published my letter.  Better 8 years late than never at all.  And, no, the people who wrote and published the lies never apologized.  Of course not!  Apologies are for losers, not for members of the prestigious National Academy of Sciences.
  • Rethinking approaches to analysis of global randomised controlled trials:  Epidemiologist Jay Brophy wrote this one.  I had some minor contribution, I can’t remember what.
  • Simulation-based calibration checking for Bayesian computation:  The choice of test quantities shapes sensitivity:  This is the latest version of a long series of papers on SBC, starting with Samantha Cook’s Ph.D. thesis, which we turned into a paper that was published twenty years earlier.  I continue to be interested in the idea of accompanying inferences with simulations that check the computations.
  • Visualizing distributions of covariance matrices:  This paper is nearly 20 years old!  At the time we had difficulty getting it published and we moved on to other things.  Then a couple years ago a journal asked me for an article and I sent them this one.  Unfortunately it was a so-called predatory journal, and one of my coauthors didn’t want our article appearing there.  Fair enough!  But then we thought we might as well get it published, so we sent it off.  I like the paper, and I also like that it’s on the relatively understudied topic of visualizing models (as opposed to visualizing data).
  • Interrogating the “cargo cult science” metaphor:  This topic had been bugging me for a while, and Megan and I wrote this paper which got rejected by a couple of places.  Neither of us really knows how to communicate with researchers in the field of science studies, so it was a hard paper to place, even though it makes a clean point.  Then I happened to hear about the journal Theory and Society, which seemed like the perfect place.  I don’t know if anyone read our article, but I’d like to think that, in the future, people will think twice before talking about cargo cult science.
  • A calibrated BISG for inferring race from surname and geolocation:  This is Philip’s project.  I did help out a bit, but I remain frustrated in that we haven’t been able to frame this in a fully Bayesian or generative way.  We’re continuing to work on the problem, and we have a new method, supercaliBISG, which does even better than caliBISG, which is an improvement on BISG, which itself has the word “improved” in its title (and also calls itself Bayesian, but it’s not fully so).
  • Hierarchical Bayesian models to mitigate systematic disparities in prediction with proxy outcomes:  I can’t remember exactly where this paper came from, but it was somehow associated with some conversations we had with Sharad Goel and others on statistical measures of disparity.  As is often the case, I think much is gained by framing the problem within a generative model.
  • The piranha problem:  Large effects swimming in a small pond:  This one’s important!  The basic idea–there are probabilistic or statistical constraints regarding patterns of dependence in high dimensions, and this has implications for our understanding of patterns in complex structures–was mine, but the coauthors did most of the rest, to collect some relevant mathematical results.  As I like to say, I think there’s more to be said in this area, maybe some connections to random matrix theory.  Also, the paper has an unusual publication story.  What happened was that a student from the statistics club at San Diego State University asked me to do a remote meeting with them.  I did so–it was a fun conversation–and it turned out that their faculty adviser, Richard Levine, was editor of the Notices of the American Mathematical Society, and was looking for general-interest math papers with applied or statistical relevance.  So I sent him the piranha paper.  Articles in this journal have a strict limit of no more than 10 pages and no more than 20 references.  It was hard for me to keep the references under 20 while demonstrating the applied relevance of the topic, so I cheated and wrote a blog post entitled, “Here are just some of the factors that have been published in the social priming and related literatures as having large effects on behavior,” so that just counted as 1 reference in our paper.  Kind of like if the genie gives you 3 wishes and you spend one of them on more wishes.
  • For how many iterations should we run Markov chain Monte Carlo?:  This is an update of my paper with Kenny Shirley for the new edition of the MCMC handbook.  Charles took the lead on this chapter.

Last post on the estimated effects of Mississippi school reforms

For background:

How much of “Mississippi’s education miracle” is an artifact of selection bias?

When the numbers don’t look right, check them! (Mississippi education update)

More on school reform, this time New Orleans

And now one more, from Noah Spencer, who writes:

I did have a good back-and-forth with Wainer et al., but remain unconvinced by their main critique.

– I [Spencer] address the authors’ main critique – that truncation due to retention mechanically explains the observed effects – in Section 7.2 of my paper. Basically, students who are retained in grade 3 do not just stay there forever. The typical student is retained for one year and then proceeds to grade 4, where they can write the NAEP. Based on the timing of the policy, it just would not have been the case that any NAEP-taking cohort would be artificially missing a mass of weaker students.
“One hypothesis is that the NAEP test score gains are a mechanical consequence of weaker 3rd-grade students not making it to fourth grade to write the NAEP test. Given the timing of the retention policy however, this purely mechanical explanation does not make sense. The first cohort eligible for retention under the LBPA was the 2014-2015 grade 3 cohort. Thus, the 2014-2015 grade 4 NAEP test-takers were not exposed to the new retention policy. It is true that the 2016-2017 grade 4 NAEP test-taking cohort would not have included students who were retained in grade 3 after the 2015-2016 school year (who would have been in grade 4 in 2016-2017 absent the LBPA). However, the 2016-2017 test-taking cohort would have included students who were retained in grade 3 after the 2014-2015 school year (assuming they were not retained again in 2015-2016).

Thus, the mass of weaker students taking the NAEP would not be eliminated due to the LBPA, but rather replaced by a mass of previously under-achieving students who had been retained and had now passed the necessary grade 3 reading assessment.

Similar logic follows for the 2018-2019 test-taking cohort.”
– Minor note: Being retained multiple times in grade 3 is rare in Mississippi.

– I also test in my paper whether the LBPA changed the composition of NAEP-takers beyond the above truncation concern (see Table B3). I do not find statistically significant effects on the percent of NAEP takers who: are White, are male, are English language learners, have a disability, or have a computer at home.

– The question of whether retention was the key mechanism through which the LBPA’s effects manifest is a good one. Are the average test score gains across Mississippi driven by the scores of retained students? The 2014-2015 treatment effect cannot be due to LBPA-induced retention as Mississippi’s 2014-2015 grade 4 cohort was not exposed to the retention aspect of the policy (which started in 2014-2015). The 2018-2019 treatment effect is unlikely to be substantially influenced by LBPA-induced retention given that the 2016-2017 third-grade retention rate (3.8%) was so similar to the pre-LBPA retention rate (3.3% in 2013-2014). You would have to assume incredible gains in test scores due to retention for such a small segment of students to influence a state’s average so greatly. The 2016-2017 treatment effect is the most likely to be affected by retention given that 8.1% of third-graders in 2014-2015 were retained. In Appendix C, I conduct a decomposition exercise and estimate that only about 22% of the 2016-2017 treatment effect is due to the retention aspect of the LBPA – though I should note that this decomposition exercise does require some strong assumptions.

– With respect to longer-term effects, I show in Appendix B.1 of my paper that effects persist until at least grade 7 on higher-stakes, state-level tests. There is some fadeout, but this is not unusual among educational interventions. I did not analyze effects on grade 8 NAEP reading scores in my paper partially because there was only one pre-COVID grade 8 cohort who was exposed to the LBPA and partially because I wanted to use grade 8 test scores as covariates. For what it’s worth, though, I have run the analysis quickly and find positive effects for grade 8 NAEP reading test-takers (including the 2022 and 2024 cohorts), though I would be hesitant to take much from post-COVID results because there was so much else changing at the time.

– Carefully evaluating effects on longer-term outcomes like high school completion rates, ACT scores, and post-secondary entrance rates is an important topic for future research. Mississippi’s gains on grade 4-8 assessments certainly do not guarantee longer-term effects and, again, it would not be unusual for short/medium-term effects to fade out.

– The claim that “The 2024 NAEP fourth grade mathematics scores rank the state at a tie at 50th!” is incorrect: Mississippi ranked 16th. They are also ranked 35th in 8th grade math, not 50th. I believe the authors have corrected this in an updated version of their article.

– “He improvised by using some prior years’ data as the control group, and instead of random assignment he used various bits of covariate information to equate this year’s students with the previous years…” – This was not what I did (nor what the synthetic difference-in-differences method does). I generated a control group based on a weighted average of states with similarly evolving test scores pre-treatment.

– Mississippi’s results are not entirely unique. Westall and Cummings (2023) assess early literacy policies across the country and find 0.14 SD effects for kids exposed from K-3 in the average “comprehensive policy” state. My 0.23 SD estimated effect for Mississippi is not wholly inconsistent with their national results.
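
To make the “weighted average of states” idea from Spencer’s reply a bit more concrete, here is a minimal sketch of plain synthetic-control weighting. This is not the synthetic difference-in-differences estimator he used, which adds further steps. The donor labels A, B, and C, all the scores, and the function name synthetic_weights are made up for illustration, and the brute-force grid search stands in for the constrained optimization that real software would use. The treated series is deliberately built as a known mix of the donors so that the recovered weights can be checked.

```python
from itertools import product


def synthetic_weights(target_pre, donors_pre, steps=100):
    """Grid-search non-negative weights that sum to 1 and best reproduce
    the treated unit's pre-treatment series (least squares)."""
    names = sorted(donors_pre)
    best_w, best_err = None, float("inf")
    # Enumerate weights on a grid of 1/steps; the last weight is the remainder.
    for combo in product(range(steps + 1), repeat=len(names) - 1):
        last = steps - sum(combo)
        if last < 0:
            continue  # weights would exceed 1 in total
        w = [c / steps for c in combo] + [last / steps]
        err = 0.0
        for t, y in enumerate(target_pre):
            fit = sum(wi * donors_pre[n][t] for wi, n in zip(w, names))
            err += (y - fit) ** 2
        if err < best_err:
            best_w, best_err = w, err
    return dict(zip(names, best_w))


# Hypothetical pre-treatment test scores for three donor states.
donors_pre = {
    "A": [210.0, 212.0, 211.0, 215.0],
    "B": [205.0, 209.0, 214.0, 213.0],
    "C": [220.0, 218.0, 219.0, 224.0],
}
# Hypothetical treated state, constructed as a known mix of the donors.
true_mix = {"A": 0.5, "B": 0.3, "C": 0.2}
treated_pre = [
    sum(true_mix[n] * donors_pre[n][t] for n in sorted(donors_pre))
    for t in range(4)
]

weights = synthetic_weights(treated_pre, donors_pre)

# Hypothetical post-treatment scores: compare the treated state with its
# synthetic counterpart built from the same weights.
donors_post = {"A": 216.0, "B": 214.0, "C": 225.0}
treated_post = 222.0
synthetic_post = sum(weights[n] * donors_post[n] for n in weights)
gap = treated_post - synthetic_post
print(weights, round(gap, 2))
```

The point of the sketch is only that the comparison group is a weighted blend of other states chosen to track the treated state before the policy, rather than earlier cohorts from the same state.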

How the covid vaccine almost killed me

So, I was talking on the phone with a friend the other day and she said she just got covid, and I realized that I knew a few other people who’d had covid recently, and this season’s version of the vaccine had come out. I scheduled an appointment at the doctor’s office the next day for covid and flu shots. But when I got there, all they had was the flu shot—the covid shots hadn’t come in yet. The nurse recommended I try doing it through a pharmacy. I kinda forgot about it but then a couple days later I remembered. I went on the CVS website and it was really easy to schedule . . . actually they had an appointment in 20 minutes on West 57 St in midtown. (Amusingly enough, when I typed in my location, it gave the closest locations as some places in New Jersey—I guess they were measuring as-the-crow-flies distance rather than travel time.) 20 minutes doesn’t leave much margin of error so I threw on my shoes, grabbed my bike, zipped over to the subway, went down to 59 St, and biked over to the corner where the CVS was . . . I wasn’t sure which way to go and I couldn’t see any street numbers so I took a guess and turned left, then I saw the street numbers were too low . . . I was in a real hurry now, I didn’t want to get there too late and have them retract my appointment, also I had to return home in about 40 minutes, so I decided to turn around right there in the middle of the block. As I was making that U-turn I slowed down to find a break in the traffic going the other direction and I saw a city bus barreling right at me! Fortunately there was some space in the cars so I could get into the traffic and I didn’t get run over.

Everything else went well. I got the shot and I got home in time for my 4pm meeting. But I almost got run over by a bus (entirely my fault, not the bus driver’s at all). So that’s my story: the vaccine almost killed me.

I’m reminded of the principle that the most dangerous part of a flight is the ride to the airport.

Combining a high-quality probability sample with data from larger online panels

Yajuan Si, James Wagner, and Ron Kessler write:

The traditional use of high-quality probability samples to carry out psychiatric epidemiological surveys of the household population is facing increasing financial and operational challenges. Surveys from nonprobability and probability-based online panels have emerged as cost-effective alternatives with the additional advantage of rapid turnaround time, albeit with biases that can in some cases be substantial.

We recommend a middle ground of integrating surveys from online panels with small parallel high-quality probability samples . . . The key features of such “hybrid designs” are as follows: use of a high-quality probability sample as a population surrogate to provide information about the distributions of otherwise unavailable variables that differentiate participants in online panels from the larger household population, inclusion in both surveys of measures that are both strongly associated with the outcomes of interest and strongly predictive of membership in the online panel, and use of best-practice statistical methods that blend results across the 2 samples.

Such a hybrid design should be the minimally acceptable design for psychiatric epidemiological surveys of the household population given the biases known to exist in online panels. However, we also comment on several other designs that might be used for more rapid and less expensive exploratory analyses.

This is interesting, to think of multi-frame, multi-mode sampling as best practice in itself rather than as an awkward problem to be dealt with only if absolutely necessary.

Yajuan offers some background on the project:

This is my first time writing a paper without any equations or data modeling but having to rely on solid statistical knowledge, understanding the extensive literature, and gathering lots of data. And Ron Kessler is a phenomenal collaborator. I learned a lot from working with him.

Anyway, here is the idea of the paper: We propose a hybrid data collection of large-scale nonprobability samples and small parallel high-quality probability samples as common practice for population-based research. For MRP applications, we often struggle with the availability of population information of X. We propose to estimate the population distribution of X in a small probability sample, after we identify the list of highly predictive covariates X for the outcome Y. We can also collect Y in the probability sample. We propose the sequential weighting adjustment by first weighting the probability sample to the census data (this should be based on a small list of adjustment factors, say only demographics, assuming the probability sample design is well controlled and nonresponse bias is small) and then weighting the nonprobability sample to the initially weighted probability sample (the list of adjustment factors could be large, even including the outcome). After the sequential weighting, the combined samples can give us enough power for small area estimates. I use weighting adjustment here for simplicity, but we can also use MRP for the adjustment if we have an outcome of interest.

Basically, I’m trying to push the MRP adjustment from post-collection inference to inform study design and modify data collection adaptively, releasing the burden or strong assumptions on analysis by improving the study design from the starting point.
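
The two-step weighting Yajuan describes can be sketched in a few lines. This is a toy version under strong simplifying assumptions: simple cell-based poststratification rather than raking or MRP, one demographic factor (age) with census shares, and one extra covariate (a “heavy internet user” flag) whose population distribution is not in the census and has to be estimated from the probability sample. All data, variable names, and the helper functions poststratify and weighted_shares are hypothetical.

```python
from collections import defaultdict


def poststratify(sample, key, target_shares):
    """Rescale weights so weighted cell shares match target_shares.
    Every cell in the sample must appear in target_shares."""
    totals = defaultdict(float)
    for r in sample:
        totals[key(r)] += r.get("w", 1.0)
    grand = sum(totals.values())
    out = []
    for r in sample:
        cell = key(r)
        factor = target_shares[cell] / (totals[cell] / grand)
        out.append({**r, "w": r.get("w", 1.0) * factor})
    return out


def weighted_shares(sample, key):
    """Weighted share of each cell in an already-weighted sample."""
    totals = defaultdict(float)
    for r in sample:
        totals[key(r)] += r["w"]
    grand = sum(totals.values())
    return {c: t / grand for c, t in totals.items()}


def make(age, heavy, ys):
    return [{"age": age, "heavy": heavy, "y": y} for y in ys]


# Hypothetical census shares for the demographic factor only.
census_age = {"young": 0.5, "old": 0.5}

# Small probability sample (made-up records).
prob = (make("young", True, [1, 0]) + make("young", False, [0, 0])
        + make("old", True, [0]) + make("old", False, [0, 0, 0, 0, 0]))

# Larger online panel in a real study; kept tiny here. It over-represents
# heavy internet users.
panel = (make("young", True, [1, 1, 1, 0, 0]) + make("young", False, [0])
         + make("old", True, [1, 0, 0]) + make("old", False, [0]))


def age(r):
    return r["age"]


def cell(r):
    return (r["age"], r["heavy"])


# Step 1: weight the probability sample to the census on demographics.
prob_w = poststratify(prob, age, census_age)
# Step 2: estimate the distribution of the richer cells from the weighted
# probability sample, then weight the panel to match it.
shares_prob = weighted_shares(prob_w, cell)
panel_w = poststratify(panel, cell, shares_prob)

raw_mean = sum(r["y"] for r in panel) / len(panel)
adj_mean = sum(r["w"] * r["y"] for r in panel_w) / sum(r["w"] for r in panel_w)
print(round(raw_mean, 3), round(adj_mean, 3))
```

In this made-up example the outcome is more common among heavy internet users, so the unweighted panel mean overstates it and the second weighting step pulls the estimate down. A real application would need to handle empty cells and many more adjustment factors, which is where raking or MRP comes in.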

This is interesting and potentially important for several reasons:

1. Data quality of survey responses is becoming more and more of an issue, and it makes sense to try to reach potential respondents in more comfortable places than the traditional survey interview.

2. We should be thinking more systematically about how to integrate data from multiple sources.

3. MRP can be adapted to more general data structures.

4. As Yajuan says, we should be aware of all these data collection and analysis issues in the design stage.