This empirical paper has been cited 1616 times but I don’t find it convincing. There’s no single fatal flaw, but the evidence does not seem so clear. How to think about this sort of thing? What to do? First, accept that evidence might not all go in one direction. Second, make lots of graphs. Also, an amusing story about how this paper is getting cited nowadays.

1. When can we trust? How can we navigate social science with skepticism?

2. Why I’m not convinced by that Quebec child-care study

3. Nearly 20 years on

1. When can we trust? How can we navigate social science with skepticism?

The other day I happened to run across a post from 2016 that I think is still worth sharing.

Here’s the background. Someone pointed me to a paper making the claim that “Canada’s universal childcare hurt children and families. . . . the evidence suggests that children are worse off by measures ranging from aggression to motor and social skills to illness. We also uncover evidence that the new child care program led to more hostile, less consistent parenting, worse parental health, and lower‐quality parental relationships.”

I looked at the paper carefully and wasn’t convinced. In short, the evidence went in all sorts of different directions, and I felt that the authors had been trying too hard to fit it all into a consistent story. It’s not that the paper had fatal flaws—it was not at all in the category of horror classics such as the beauty-and-sex-ratio paper, the ESP paper, the himmicanes paper, the air-rage paper, the pizzagate papers, the ovulation-and-voting paper, the air-pollution-in-China paper, etc etc etc.—it just didn’t really add up to me.

The question then is, if a paper can appear in a top journal, have no single killer flaw but still not be convincing, can we trust anything at all in the social sciences? At what point does skepticism become nihilism? Must I invoke the Chestertonian principle on myself?

I don’t know.

What I do think is that the first step is to carefully assess the connection between published claims, the analysis that led to these claims, and the data used in the analysis. The above-discussed paper has a problem that I’ve seen a lot, which is an implicit assumption that all the evidence should go in the same direction, a compression of complexity which I think is related to the cognitive illusion that Tversky and Kahneman called “the law of small numbers.” The first step in climbing out of this sort of hole is to look at lots of things at once, rather than treating empirical results as a sort of big bowl of fruit where the researcher can just pick out the juiciest items and leave the rest behind.

2. Why I’m not convinced by that Quebec child-care study

Here’s what I wrote on that paper back in 2016:

Yesterday we discussed the difficulties of learning from a small, noisy experiment, in the context of a longitudinal study conducted in Jamaica where researchers reported that an early-childhood intervention program caused a 42%, or 25%, gain in later earnings. I expressed skepticism.

Today I want to talk about a paper making an opposite claim: “Canada’s universal childcare hurt children and families.”

I’m skeptical of this one too.

Here’s the background. I happened to mention the problems with the Jamaica study in a talk I gave recently at Google, and afterward Hal Varian pointed me to this summary by Les Picker of a recent research article:

In Universal Childcare, Maternal Labor Supply, and Family Well-Being (NBER Working Paper No. 11832), authors Michael Baker, Jonathan Gruber, and Kevin Milligan measure the implications of universal childcare by studying the effects of the Quebec Family Policy. Beginning in 1997, the Canadian province of Quebec extended full-time kindergarten to all 5-year olds and included the provision of childcare at an out-of-pocket price of $5 per day to all 4-year olds. This $5 per day policy was extended to all 3-year olds in 1998, all 2-year olds in 1999, and finally to all children younger than 2 years old in 2000.

(Nearly) free child care: that’s a big deal. And the gradual rollout gives researchers a chance to estimate the effects of the program by comparing children at each age who were and were not eligible for the program.
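Just to fix ideas, here’s a minimal sketch of the kind of difference-in-differences comparison this sort of rollout invites, with simulated data and made-up variable names (quebec, post); it’s only an illustration of the design, not the authors’ actual specification.

```python
# Toy difference-in-differences setup, loosely inspired by the Quebec rollout.
# All variable names and numbers are made up; this is not the authors' model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "quebec": rng.integers(0, 2, n),   # 1 = Quebec, 0 = rest of Canada
    "post": rng.integers(0, 2, n),     # 1 = child observed after the $5/day policy
})
# Hypothetical child well-being index with a made-up -0.1 effect for
# Quebec children observed after the policy.
df["outcome"] = (0.2 * df["quebec"] + 0.1 * df["post"]
                 - 0.1 * df["quebec"] * df["post"]
                 + rng.normal(0, 1, n))

fit = smf.ols("outcome ~ quebec * post", data=df).fit()
print(fit.params["quebec:post"])  # the difference-in-differences estimate
```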

The summary continues:

The authors first find that there was an enormous rise in childcare use in response to these subsidies: childcare use rose by one-third over just a few years. About a third of this shift appears to arise from women who previously had informal arrangements moving into the formal (subsidized) sector, and there were also equally large shifts from family and friend-based child care to paid care. Correspondingly, there was a large rise in the labor supply of married women when this program was introduced.

That makes sense. As usual, we expect elasticities to be between 0 and 1.

But what about the kids?

Disturbingly, the authors report that children’s outcomes have worsened since the program was introduced along a variety of behavioral and health dimensions. The NLSCY contains a host of measures of child well being developed by social scientists, ranging from aggression and hyperactivity, to motor-social skills, to illness. Along virtually every one of these dimensions, children in Quebec see their outcomes deteriorate relative to children in the rest of the nation over this time period.

More specifically:

Their results imply that this policy resulted in a rise of anxiety of children exposed to this new program of between 60 percent and 150 percent, and a decline in motor/social skills of between 8 percent and 20 percent. These findings represent a sharp break from previous trends in Quebec and the rest of the nation, and there are no such effects found for older children who were not subject to this policy change.

Also:

The authors also find that families became more strained with the introduction of the program, as manifested in more hostile, less consistent parenting, worse adult mental health, and lower relationship satisfaction for mothers.

I just find all this hard to believe. A doubling of anxiety? A decline in motor/social skills? Are these day care centers really that horrible? I guess it’s possible that the kids are ruining their health by giving each other colds (“There is a significant negative effect on the odds of being in excellent health of 5.3 percentage points.”)—but of course I’ve also heard the opposite, that it’s better to give your immune system a workout than to be preserved in a bubble. They also report “a policy effect on the treated of 155.8% to 394.6%” in the rate of nose/throat infection.

OK, here’s the research article.

The authors seem to be considering three situations: “childcare,” “informal childcare,” and “no childcare.” But I don’t understand how these are defined. Every child is cared for in some way, right? It’s not like the kid’s just sitting out on the street. So I’d assume that “no childcare” is actually informal childcare: mostly care by mom, dad, sibs, grandparents, etc. But then what do they mean by the category “informal childcare”? If parents are trading off taking care of the kid, does this count as informal childcare or no childcare? I find it hard to follow exactly what is going on in the paper, starting with the descriptive statistics, because I’m not quite sure what they’re talking about.

I think what’s needed here is some more comprehensive organization of the results. For example, consider this paragraph:

The results for 6-11 year olds, who were less affected by this policy change (but not unaffected due to the subsidization of after-school care) are in the third column of Table 4. They are largely consistent with a causal interpretation of the estimates. For three of the six measures for which data on 6-11 year olds is available (hyperactivity, aggressiveness and injury) the estimates are wrong-signed, and the estimate for injuries is statistically significant. For excellent health, there is also a negative effect on 6-11 year olds, but it is much smaller than the effect on 0-4 year olds. For anxiety, however, there is a significant and large effect on 6-11 year olds which is of similar magnitude as the result for 0-4 year olds.

The first sentence of the above excerpt has a cover-all-bases kind of feeling: if results for 6-11 year olds are similar to those for 2-4 year olds, you can go with “but not unaffected”; if they differ, you can go with “less affected.” Various things are pulled out based on whether they are statistically significant, and they never return to the result for anxiety, which would seem to contradict their story. Instead they write, “the lack of consistent findings for 6-11 year olds confirm that this is a causal impact of the policy change.” “Confirm” seems a bit strong to me.

The authors also suggest:

For example, higher exposure to childcare could lead to increased reports of bad outcomes with no real underlying deterioration in child behaviour, if childcare providers identify negative behaviours not noticed (or previously acknowledged) by parents.

This seems like a reasonable guess to me! But the authors immediately dismiss this idea:

While we can’t rule out these alternatives, they seem unlikely given the consistency of our findings both across a broad spectrum of indices, and across the categories that make up each index (as shown in Appendix C). In particular, these alternatives would not suggest such strong findings for health-based measures, or for the more objective evaluations that underlie the motor-social skills index (such as counting to ten, or speaking a sentence of three words or more).

Health, sure: as noted above, I can well believe that these kids are catching colds from each other.

But what about that motor-skills index? Here are their results from the appendix:

[Screenshot of the appendix table: coefficients for the individual motor-social skills items]

I’m not quite sure whether + or – is desirable here, but I do notice that the coefficients for “can count out loud to 10” and “spoken a sentence of 3 words or more” (the two examples cited in the paragraph above) go in opposite directions. That’s fine—the data are the data—but it doesn’t quite fit their story of consistency.

More generally, the data are addressed in a scattershot manner. For example:

We have estimated our models separately for those with and without siblings, finding no consistent evidence of a stronger effect on one group or another. While not ruling out the socialization story, this finding is not consistent with it.

This appears to be the classic error of interpreting a non-rejection of a null hypothesis as evidence that the null is true.
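To see why “no consistent evidence of a stronger effect on one group or another” is weak evidence, here’s a little calculation with made-up numbers (not from the paper):

```python
# Made-up subgroup estimates: "effect for kids with siblings" vs. "without."
# Neither the numbers nor the comparison come from the paper.
import math

est_sib, se_sib = 0.15, 0.10
est_nosib, se_nosib = 0.05, 0.10

diff = est_sib - est_nosib
se_diff = math.sqrt(se_sib**2 + se_nosib**2)  # about 0.14
print(f"difference = {diff:.2f}, 95% CI = "
      f"({diff - 1.96 * se_diff:.2f}, {diff + 1.96 * se_diff:.2f})")
# The interval runs from about -0.18 to +0.38: consistent with no difference
# between the groups, but also with a difference larger than either estimate.
```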

And here’s their table of key results:

[Screenshot of the paper’s table of key results]

As quantitative social scientists we need to think harder about how to summarize complicated data with multiple outcomes and many different comparisons.

As I see it, the current standard ways to summarize this sort of data are:

(a) Focus on a particular outcome and a particular comparison (ideally, though not usually, chosen via preregistration), present that as the main finding and then tag all else as speculation.

Or, (b) Construct a story that seems consistent with the general pattern in the data, and then extract statistically significant or nonsignificant comparisons to support your case.

Plan (b) is what was done here, and I think it has problems: lots of stories can fit the data, and there’s a real push toward sweeping any anomalies aside.

For example, how do you think about that coefficient of 0.308 with standard error 0.080 for anxiety among the 6-11-year-olds? You can say it’s just bad luck with the data, or that the standard error calculation is only approximate and the real standard error should be higher, or that it’s some real effect caused by what was happening in Quebec in these years—but the trouble is that any of these explanations could be used just as well to explain the 0.234 with standard error 0.068 for 2-4-year-olds, which directly maps to one of their main findings.
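Just to spell out the symmetry, using the numbers above:

```python
# Both coefficients are several standard errors from zero; nothing in the
# table itself tells you to take one at face value and explain the other away.
for label, est, se in [("anxiety, ages 6-11", 0.308, 0.080),
                       ("anxiety, ages 2-4", 0.234, 0.068)]:
    print(f"{label}: z = {est / se:.1f}")   # both between 3 and 4
```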

Once you start explaining away anomalies, there’s just a huge selection effect in which data patterns you choose to take at face value and which you try to dismiss.

So maybe approach (a) is better—just pick one major outcome and go with it? But then you’re throwing away lots of data; that can’t be right.

I am unconvinced by the claims of Baker et al., but it’s not like I’m saying their paper is terrible. They have an identification strategy, and clean data, and some reasonable hypotheses. I just think their statistical analysis approach is not working. One trouble is that statistics textbooks tend to focus on stand-alone analyses—getting the p-value right, or getting the posterior distribution, or whatever, and not on how these conclusions fit into the big picture. And of course there’s lots of talk about exploratory data analysis, and that’s great, but EDA is typically not plugged into issues of modeling, data collection, and inference.

What to do?

OK, then. Let’s forget about the strengths and the weaknesses of the Baker et al. paper and instead ask, how should one evaluate a program like Quebec’s nearly-free preschool? I’m not sure. I’d start from the perspective of trying to learn what we can from what might well be ambiguous evidence, rather than trying to make a case in one direction or another. And I’d make lots of graphs, which would allow us to see more in one place; that’s much better than tables and asterisks. But, exactly what to do, I’m not sure. I don’t know whether the policy analysis literature features any good examples of this sort of exploration. I’d like to see something, for this particular example and more generally as a template for program evaluation.
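To be concrete, here’s a bare-bones sketch of the sort of display I have in mind, with placeholder outcomes and numbers standing in for the fitted estimates:

```python
# A minimal coefficient plot: every estimate with its interval in one display,
# instead of a table with asterisks. Outcomes and numbers are placeholders.
import matplotlib.pyplot as plt

outcomes  = ["hyperactivity", "anxiety", "aggression", "motor/social", "excellent health"]
est_young = [0.15, 0.23, 0.10, -0.12, -0.05]    # ages 2-4 (placeholder values)
se_young  = [0.06, 0.07, 0.05, 0.04, 0.02]
est_old   = [-0.05, 0.31, -0.03, 0.02, -0.02]   # ages 6-11 (placeholder values)
se_old    = [0.07, 0.08, 0.06, 0.05, 0.02]

ypos = range(len(outcomes))
plt.errorbar(est_young, ypos, xerr=[1.96 * s for s in se_young], fmt="o", label="ages 2-4")
plt.errorbar(est_old, [y + 0.2 for y in ypos], xerr=[1.96 * s for s in se_old], fmt="s", label="ages 6-11")
plt.axvline(0, color="gray", linewidth=1)
plt.yticks(ypos, outcomes)
plt.xlabel("estimated policy effect (95% interval)")
plt.legend()
plt.tight_layout()
plt.show()
```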

3. Nearly 20 years on

So here’s the story. I heard about this work in 2016, from a press release issued in 2006; the article appeared in preprint form in 2005, was published in a top economics journal in 2008, and was based on data collected in the late 1990s. And here we are discussing it again in 2023.

It’s kind of beating a dead horse to discuss a 20-year-old piece of research, but you know what they say about dead horses. Also, according to Google Scholar, the article has 1616 citations, including 120 citations in 2023 alone, so, yeah, still worth discussing.

That said, not all the references refer to the substance of the paper. For example, the very first paper on Google Scholar’s list of citers is a review article, Explaining the Decline in the US Employment-to-Population Ratio, and when I searched to see what they said about this Canada paper (Baker, Gruber, and Milligan 2008), here’s what was there:

Additional evidence on the effects of publicly provided childcare comes from the province of Quebec in Canada, where a comprehensive reform adopted in 1997 called for regulated childcare spaces to be provided to all children from birth to age five at a price of $5 per day. Studies of that reform conclude that it had significant and long-lasting effects on mothers’ labor force participation (Baker, Gruber, and Milligan 2008; Lefebvre and Merrigan 2008; Haeck, Lefebvre, and Merrigan 2015). An important feature of the Quebec reform was its universal nature; once fully implemented, it made very low-cost childcare available for all children in the province. Nollenberger and Rodriguez-Planas (2015) find similarly positive effects on mothers’ employment associated with the introduction of universal preschool for three-year-olds in Spain.

They didn’t mention the bit about “the evidence suggests that children are worse off” at all! Indeed, they’re just kinda lumping this in with positive studies on “the effects of publicly provided childcare.” Yes, it’s true that this new article specifically refers to “similarly positive effects on mothers’ employment,” and that earlier paper, while negative about the effect of universal child care on kids, did say, “Maternal labor supply increases significantly.” Still, when it comes to sentiment analysis, that 2008 paper just got thrown into the positivity blender.

I don’t know how to think about this.

On one hand, I feel bad for Baker et al.: they did this big research project, they achieved the academic dream of publishing it in a top journal, it’s received 1616 citations and remains relevant today—but, when it got cited, its negative message was completely lost! I guess they should’ve given their paper a more direct title. Instead of “Universal Child Care, Maternal Labor Supply, and Family Well‐Being,” they should’ve called it something like: “Universal Child Care: Good for Mothers’ Employment, Bad for Kids.”

On the other hand, for the reasons discussed above, I don’t actually believe their strong claims about the child care being bad for kids, so I’m kinda relieved that, even though the paper is being cited, some of its message has been lost. You win some, you lose some.

Of course it’s preregistered. Just give me a sec

This is Jessica. I was going to post something on Bak-Coleman and Devezer’s response to the Protzko et al. paper on the replicability of research that uses rigor-enhancing practices like large samples, preregistration, confirmatory tests, and methodological transparency, but Andrew beat me to it. But since his post didn’t get into one of the surprising aspects of their analysis (beyond the paper making a causal claim without a study design capable of assessing causality), I’ll blog on it anyway.

Bak-Coleman and Devezer describe three ways in which the measure of replicability that Protzko et al. use to argue that the 16 effects they study are more replicable than effects in prior studies deviates from prior definitions of replicability:

  1. Protzko et al. define replicability as the chance that any replication achieves significance in the hypothesized direction as opposed to whether the results of the confirmation study and the replication were consistent 
  2. They include self-replications in calculating the rate
  3. They include repeated replications of the same effect and replications across different effects in calculating the rate

Could these deviations in how replicability is defined have been decided post-hoc, so that the authors could present positive evidence for their hypothesis that rigor-enhancing practices work? If they preregistered their definition of replicability, we would not be so concerned about this possibility.  Luckily, the authors report that “All confirmatory tests, replications and analyses were preregistered both in the individual studies (Supplementary Information section 3 and Supplementary Table 2) and for this meta-project (https://osf.io/6t9vm).”

But wait – according to Bak-Coleman and Devezer:

the analysis on which the titular claim depends was not preregistered. There is no mention of examining the relationship between replicability and rigor-improving methods, nor even how replicability would be operationalized despite extensive descriptions of the calculations of other quantities. With nothing indicating this comparison or metric it rests on were planned a priori, it is hard to distinguish the core claim in this paper from selective reporting and hypothesizing after the results are known. 

Uh-oh, that’s not good. At this point, some OSF sleuthing was needed. I poked around the link above, and the associated project containing analysis code. There are a couple analysis plans: Proposed Overarching Analyses for Decline Effect final.docx, from 2018, and Decline Effect Exploratory analyses and secondary data projects P4.docx, from 2019. However, these do not appear to describe the primary analysis of replicability in the paper (the first describes an analysis that ends up in the Appendix, and the second a bunch of exploratory analyses that don’t appear in the paper). About a year later, the analysis notebooks with the results they present in the main body of the paper were added. 

According to Bak-Coleman on X/Twitter: 

We emailed the authors a week ago. They’ve been responsive but as of now, they can’t say one way or another if the analyses correspond to a preregistration. They think they may be in some documentation.

In the best case scenario where the missing preregistration is soon found, this example suggests that there are still many readers and reviewers for whom some signal of rigor suffices even when the evidence of it is lacking. In this case, maybe the reputation of authors like Nosek reduced the perceived need on the part of the reviewers to track down the actual preregistration. But of course, even those who invented rigor-enhancing practices can still make mistakes!

In the alternative scenario where the preregistration is not found soon, what is the correct course of action? Surely at least a correction is in order? Otherwise we might all feel compelled to try our luck at signaling preregistration without having to inconvenience ourselves by actually doing it.

More optimistically, perhaps there are exciting new research directions that could come out of this. Like, wearable preregistration, since we know from centuries of research and practice that it’s harder to lose something when it’s sewn to your person. Or, we could submit our preregistrations to OpenAI, I mean Microsoft, who could make a ChatGPT-enabled Preregistration Buddy who not only trained on your preregistration, but also knows how to please a human judge who wants to ask questions about what it said.

Hey! Here are some amazing articles by George Box from around 1990. Also there’s some mysterious controversy regarding his center at the University of Wisconsin.

The webpage is maintained by John Hunter, son of Box’s collaborator William Hunter, and I came across it because I was searching for background on the paper-helicopter example that we use in our classes to teach principles of experimental design and data analysis.

There’s a lot to say about the helicopter example and I’ll save that for another post.

Here I just want to talk about how much I enjoyed reading these thirty-year-old Box articles.

A Box Set from 1990

Many of the themes in those articles continue to resonate today. For example:

• The process of learning. Here’s Box from his 1995 article, “Total Quality: Its Origins and its Future”:

Scientific method accelerated that process in at least three ways:

1. By experience in the deduction of the logical consequences of the group of facts each of which was individually known but had not previously been brought together.

2. By the passive observation of systems already in operation and the analysis of data coming from such systems.

3. By experimentation – the deliberate staging of artificial experiences which often might ordinarily never occur.

A misconception is that discovery is a “one shot” affair. This idea dies hard. . . .

• Variation over time. Here’s Box from his 1989 article, “Must We Randomize Our Experiment?”:

We all live in a non-stationary world; a world in which external factors never stay still. Indeed the idea of stationarity – of a stable world in which, without our intervention, things stay put over time – is a purely conceptual one. The concept of stationarity is useful only as a background against which the real non-stationary world can be judged. For example, the manufacture of parts is an operation involving machines and people. But the parts of a machine are not fixed entities. They are wearing out, changing their dimensions, and losing their adjustment. The behavior of the people who run the machines is not fixed either. A single operator forgets things over time and alters what he does. When a number of operators are involved, the opportunities for change because of failures to communicate are further multiplied. Thus, if left to itself any process will drift away from its initial state. . . .

Stationarity, and hence the uniformity of everything depending on it, is an unnatural state that requires a great deal of effort to achieve. That is why good quality control takes so much effort and is so important. All of this is true, not only for manufacturing processes, but for any operation that we would like to be done consistently, such as the taking of blood pressures in a hospital or the performing of chemical analyses in a laboratory. Having found the best way to do it, we would like it to be done that way consistently, but experience shows that very careful planning, checking, recalibration and sometimes appropriate intervention, is needed to ensure that this happens.

Here’s an example, from Box’s 1992 article, “How to Get Lucky”:

For illustration Figure 1(a) shows a set of data designed to seek out the source of unacceptably large variability which, it was suspected, might be due to small differences in five, supposedly identical, heads on a machine. To test this idea, the engineer arranged that material from each of the five heads was sampled at roughly equal intervals of time in each of six successive eight-hour periods. . . . the same analysis strongly suggested that real differences in means occurred between the six eight-hour periods of time during which the experiment was conducted. . . .

• Workflow. Here’s Box from his 1999 article, “Statistics as a Catalyst to Learning by Scientific Method Part II-Discussion”:

Most of the principles of design originally developed for agricultural experimentation would be of great value in industry, but most industrial experimentation differed from agricultural experimentation in two major respects. These I will call immediacy and sequentiality.

What I mean by immediacy is that for most of our investigations the results were available, if not within hours, then certainly within days and in rare cases, even within minutes. This was true whether the investigation was conducted in a laboratory, a pilot plant or on the full scale. Furthermore, because the experimental runs were usually made in sequence, the information obtained from each run, or small group of runs, was known and could be acted upon quickly and used to plan the next set of runs. I concluded that the chief quarrel that our experimenters had with using “statistics” was that they thought it would mean giving up the enormous advantages offered by immediacy and sequentiality. Quite rightly, they were not prepared to make these sacrifices. The need was to find ways of using statistics to catalyze a process of investigation that was not static, but dynamic.

There’s lots more. It’s funny to read these things that Box wrote back then, that I and others have been saying over and over again in various informal contexts, decades later. It’s a problem with our statistical education (including my own textbooks) that these important ideas are buried.

More Box

A bunch of articles by Box, with some overlap but not complete overlap with the above collection, is at the site of the University of Wisconsin, where he worked for many years. Enjoy.

Some kinda feud is going on

John Hunter’s page also has this:

The Center for Quality and Productivity Improvement was created by George Box and Bill Hunter at the University of Wisconsin-Madison in 1985.

In the first few years reports were published by leading international experts including: W. Edwards Deming, Kaoru Ishikawa, Peter Scholtes, Brian Joiner, William Hunter and George Box. William Hunter died in 1986. Subsequently excellent reports continued to be published by George Box and others including: Gipsie Ranney, Soren Bisgaard, Ron Snee and Bill Hill.

These reports were all available on the Center’s web site. After George Box’s death the reports were removed. . . .

It is a sad situation that the Center abandoned the ideas of George Box and Bill Hunter. I take what has been done to the Center as a personal insult to their memory. . . .

When diagnosed with cancer my father dedicated his remaining time to creating this center with George to promote the ideas George and he had worked on throughout their lives: because it was that important to him to do what he could. They did great work and their work provided great benefits for long after Dad’s death with the leadership of Bill Hill and Soren Bisgaard but then it deteriorated. And when George died the last restraint was eliminated and the deterioration was complete.

Wow. I wonder what the story was. I asked someone I know who works at the University of Wisconsin and he had no idea. Box died in 2013 so it’s not so long ago; there must be some people who know what happened here.

Experimental reasoning in social science

As a statistician, I was trained to think of randomized experimentation as representing the gold standard of knowledge in the social sciences, and, despite having seen occasional arguments to the contrary, I still hold that view, expressed pithily by Box, Hunter, and Hunter that “To find out what happens when you change something, it is necessary to change it.”

At the same time, in my capacity as a social scientist, I’ve published many applied research papers, almost none of which have used experimental data.

In the present article, I’ll address the following questions:

1. Why do I agree with the consensus characterization of randomized experimentation as a gold standard?

2. Given point 1 above, why does almost all my research use observational data?

. . .

For the rest, you can read the article. The article in question was written in 2010, ultimately published in a book in 2014, and is still relevant today, I think! For more on this perspective you can see chapters 18-21, the causal inference chapters of Regression and Other Stories.

Also relevant is the idea that mathematical and statistical reasoning are themselves experimental, as discussed in this article on Lakatos and in this talk, “When You do Applied Statistics, You’re Acting Like a Scientist. Why Does this matter?”

Discovering what mattered: Answering reverse causal questions using change-point models

Felix Pretis writes:

I came across your 2013 paper with Guido Imbens, “Why ask why? Forward causal inference and reverse causal questions,” and found it to be extremely useful for a closely-related project my co-author Moritz Schwarz and I have been working on.

We introduce a formal approach to answer “reverse causal questions” by expanding on the idea mentioned in your paper that reverse causal questions involve “searching for new variables.” We place the concept of reverse causal questions into the domain of variable and model selection. Specifically, we focus on detecting and estimating treatment effects when both treatment assignment and timing is unknown. The setting of unknown treatment reflects the problem often faced by policy makers: rather than trying to understand whether a particular intervention caused an outcome to change, they might be concerned with the broader question of what affected the outcome in general, but they might be unsure what treatment interventions took place. For example, rather than asking whether carbon pricing reduced CO2 emissions, a policy maker might be interested in what reduces CO2 emissions in general?

We show that such unknown treatment can be detected as structural breaks in panels by using machine learning methods to remove all but relevant treatment interaction terms that capture heterogeneous treatment effects. We demonstrate the feasibility of this approach by detecting the impact of ETA terrorism on Spanish regional GDP per capita without prior knowledge of its occurrence.

Pretis and Schwarz describe their general idea in a paper, “Discovering what mattered: Answering reverse causal questions by detecting unknown treatment assignment and timing as breaks in panel models,” and they published an application of the approach in a paper, “Attributing agnostically detected large reductions in road CO2 emissions to policy mixes.”
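Here’s a toy sketch of the general idea as I understand it: encode every candidate (unit, break-year) step shift as its own indicator and let a sparsity-inducing selection step keep only the ones that matter. This is just an illustration on simulated data using a plain lasso; Pretis and Schwarz’s actual method (break detection with model selection in panels) differs in important ways.

```python
# Toy version of "detect unknown treatments as breaks": simulate a panel,
# give one unit a step change partway through, and let a lasso pick out the
# (unit, break-year) indicator. Not Pretis and Schwarz's actual estimator.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n_units, n_years = 20, 30
y = rng.normal(0, 0.5, (n_units, n_years))
y[7, 18:] -= 2.0                    # the unknown "treatment": unit 7, year 18

# One step-shift indicator per (unit, candidate break year), skipping the edges.
cols, names = [], []
for i in range(n_units):
    for t in range(5, n_years - 5):
        ind = np.zeros((n_units, n_years))
        ind[i, t:] = 1.0
        cols.append(ind.ravel())
        names.append((i, t))
X = np.column_stack(cols)

fit = LassoCV(cv=5).fit(X, y.ravel())
top = np.argsort(np.abs(fit.coef_))[::-1][:3]
print([(names[j], round(fit.coef_[j], 2)) for j in top if fit.coef_[j] != 0])
# Should put unit 7 with a break at (or near) year 18 at the top of the list.
```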

It’s so cool to see this sort of work being done, transferring general concepts about causal inference to methods that can be used in real applications.

Springboards to overconfidence: How can we avoid . . .? (following up on our discussion of synthetic controls analysis)

Following up on our recent discussion of synthetic control analysis for causal inference, Alberto Abadie points to this article from 2021, Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects.

Abadie’s paper is very helpful in that it lays out the key assumptions and decision points, which can help us have a better understanding of what went so wrong in the paper on Philadelphia crime rates that we discussed in my earlier post.

I think it’s a general concern in methods papers (mine included!) that we tend to focus more on examples where the method works well than on examples where it doesn’t. Abadie’s paper has an advantage over mine in that he gives conditions under which a method will work, and it’s not his fault that researchers then use the methods and get bad answers.

Regarding the specific methods issue, of course there are limits to what can be learned from N=1 treated units, whether analyzed using synthetic control or any other approach. It seems that researchers sometimes lose track of that point in their desire to make strong statements. On a very technical level, I suspect that, if researchers are using a weighted average as a comparison, they’d do better using some regularization rather than just averaging over a very small number of other cases. But I don’t think that would help much in that particular application that we were discussing on the blog.

The deeper problem

The question is, when scholars such as Abadie write such clear descriptions of a method, including all its assumptions, how is it that applied researchers such as the authors of that Philadelphia article make such a mess of things? The problem is not unique to synthetic control analysis; it also arises with other “identification strategies” such as regression discontinuity, instrumental variables, linear regression, and plain old randomized experimentation. In all these cases, researchers often seem to end up using the identification strategy not as a tool for learning from data but rather as a sort of springboard to overconfidence. Beyond causal inference, there are all the well-known misapplications of Bayesian inference and classical p-values. No method is safe.

So, again, nothing special about synthetic control analysis. But what did happen in the example that got this discussion started? To quote from the original article:

The research question here is whether the application of a de-prosecution policy has an effect on the number of homicides for large cities in the United States. Philadelphia presents a natural experiment to examine this question. During 2010–2014, the Philadelphia District Attorney’s Office maintained a consistent and robust number of prosecutions and sentencings. During 2015–2019, the office engaged in a systematic policy of de-prosecution for both felony and misdemeanor cases. . . . Philadelphia experienced a concurrent and historically large increase in homicides.

After looking at the time series, here’s my quick summary: Philadelphia’s homicide rate went up since 2014 during the same period that it decreased prosecutions, and this was part of a national trend of increased homicides—but there’s no easy way given the directly available information to compare to other cities with and without that policy.

I’ll refer you to my earlier post and its comment thread for more on the details.

At this point, the authors of the original article used a synthetic controls analysis, following the general approach described in the Abadie paper. The comparisons they make are to that weighted average of Detroit, New Orleans, and New York. The trouble is . . . that’s just 3 cities, and homicide rates can vary a lot from city to city. There’s no good reason to think that an average of three cities that give you numbers comparable to Philadelphia’s for the homicide rates or counts in the five previous years will give you a reasonable counterfactual for trends in the next five years. Beyond this, some outside researchers pointed out many forking paths in the published analysis. Forking paths are not in themselves a problem (my own applied work is full of un-preregistered data coding and analysis decisions); the relevance here is that they help explain how it’s possible for researchers to get apparently “statistically significant” results from noisy data.

So what went wrong? Abadie’s paper discusses a mathematical problem: if you want to compare Philadelphia to some weighted average of the other 96 cities, and if you want these weights to be positive and sum to 1 and be estimated using an otherwise unregularized procedure, then there are certain statistical properties associated with the resulting procedure, which, in this case, given various analysis decisions, leads to choosing a particular average of Detroit, New Orleans, and New York. There’s nothing wrong with doing this, but, ultimately, all you have is a comparison of 1 city to 3 cities, and it’s completely legit from an applied perspective to look at these cities and recognize how different they all are.
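For readers who haven’t seen the machinery, here’s a bare-bones sketch of the weight-fitting step being described (simulated data, ignoring the covariates and refinements in Abadie’s papers): nonnegative weights that sum to 1, chosen to match the treated unit’s pre-period numbers.

```python
# Bare-bones synthetic-control weights: minimize pre-period mismatch subject
# to w >= 0 and sum(w) = 1. Simulated data, not the Hogan/Abadie analyses.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n_controls, n_pre = 96, 5
controls_pre = rng.normal(20, 5, (n_controls, n_pre))  # control cities' pre-period rates
treated_pre = rng.normal(20, 5, n_pre)                  # the treated city's pre-period rates

def mismatch(w):
    return np.sum((treated_pre - w @ controls_pre) ** 2)

res = minimize(mismatch,
               x0=np.full(n_controls, 1 / n_controls),
               bounds=[(0, 1)] * n_controls,
               constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1}],
               method="SLSQP")
print("pre-period mismatch at the optimum:", round(res.fun, 4))
print("cities with weight > 0.01:", int(np.sum(res.x > 0.01)))
# With 96 candidate cities and only 5 numbers to match, many very different
# weight vectors fit the pre-period essentially perfectly; the fit itself says
# nothing about whether the weighted average is a sensible counterfactual.
```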

It’s not the fault of the synthetic control analysis if you have N=1 in the treatment group. It’s just the way things go. The error is to use that analysis to make strong claims, and the further error is to think that the use of this particular method—or any particular method—should insulate the analysis from concerns about reasonableness. If you want to compare one city to 96 others, then your analysis will rely on assumptions about comparability of the different cities, and not just on one particular summary such as the homicide counts during a five-year period.

You can say that this general concern arises with linear regression as well—you’re only adjusting for whatever pre-treatment variables are included in the model. For example, when we estimated the incumbency advantage in congressional elections by comparing elections with incumbents running for reelection to elections in open seats, adjusting for previous vote share and party control, it would be a fair criticism to say that maybe the treatment and control cases differed in other important ways not included in the analysis. And we looked at that! I’m not saying our analysis was perfect; indeed, a decade and a half later we reanalyzed the data with a measurement-error model and got what we think were improved results. It was a big help that we had replication: many years, and many open-seat and incumbent elections in each year. This Philadelphia analysis is different because it’s N=1. If we tried to do linear regression with N=1, we’d have all sorts of problems. Unfortunately, the synthetic control analysis did not resolve the N=1 problem—it’s not supposed to!—but it did seem to lead the authors into some strong claims that did not make a lot of sense.

P.S. I sent the above to Abadie, who added:

I would like to share a couple of thoughts about N=1 and whether it is good or bad to have a small number of units in the comparison group.

Synthetic controls were originally proposed to address the N=1 (or low N) setting in cases with aggregate and relatively noiseless data and strong co-movement across units. I agree with you that they do not mechanically solve the N=1 problem in general (and that nothing does!). They have to be applied with care and there will be settings where they do not produce credible estimates (e.g., noisy series, short pre-intervention windows, poor pre-intervention fit, poor prediction in hold-out pre-intervention windows, etc). There are checks (e.g., predictive power in hold-out pre-intervention windows) that help assess the credibility of synthetic control estimates in applied settings.

Whether a few controls or many controls are better depends on the context of the investigation and on what one is trying to attain. Precision may call for using many comparisons. But there is a trade-off. The more units we use as comparisons, the less similar those may be relative to the treated unit. And the use of a small number of units allows us to evaluate / correct for potential biases created by idiosyncratic shocks and / or interference effects on the comparison units. If the aggregate series are “noiseless enough” like in the synthetic control setting, one might care more about reducing bias than about attaining additional precision.

Getting the first stage wrong

Sometimes when you conduct (or read) a study you learn you’re wrong in interesting ways. Other times, maybe you’re wrong for less interesting reasons.

Being wrong about the “first stage” can be an example of the latter. Maybe you thought you had a neat natural experiment. Or you tried a randomized encouragement to an endogenous behavior of interest, but things didn’t go as you expected. I think there are some simple, uncontroversial cases here of being wrong in uninteresting ways, but also some trickier ones.

Not enough compliers

Perhaps the standard way to be wrong about the first stage is to think there is one when there more or less isn’t — when the thing that’s supposed to produce some random or as-good-as-random variation in a “treatment” (considered broadly) doesn’t actually do much of that.

Here’s an example from my own work. Some collaborators and I were interested in how setting fitness goals might affect physical activity and perhaps interact with other factors (e.g., social influence). We were working with a fitness tracker app, and we ran a randomized experiment where we sent new notifications to randomly assigned existing users’ phones encouraging them to set a goal. If you tapped the notification, it would take you to the flow for creating a goal.

One problem: Not many people interacted with the notifications and so there weren’t many “compliers” — people who created a goal when they wouldn’t have otherwise. So we were going to have a hopelessly weak first stage. (Note that this wasn’t necessarily weak in the sense of the “weak instruments” literature, which is generally concerned about a high-variance first stage producing bias and resulting inference problems. Rather, even if we knew exactly who the compliers were (compliers are a latent stratum), it was a small enough set of people that we’d have very low power for any of the plausible second-stage effects.)
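Here’s the rough arithmetic, with made-up numbers: to a first approximation the standard error of the complier-average effect is the intent-to-treat standard error divided by the share of compliers, so a tiny compliance rate blows up the uncertainty.

```python
# Rough power arithmetic for the encouragement design (made-up numbers):
# SE(complier effect) is approximately SE(ITT) / share of compliers.
se_itt = 0.02   # a plausible standard error for the intent-to-treat effect
for compliers in [0.50, 0.10, 0.02]:
    print(f"compliance {compliers:.0%}: SE of complier-average effect ~ {se_itt / compliers:.2f}")
# At 2% compliance the standard error is around 1.0 in outcome units, so any
# realistic second-stage effect would be hopelessly underpowered.
```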

So we dropped this project direction. Maybe there would have been a better way to encourage people to set goals, but we didn’t readily have one. Now this “file drawer” might mislead people about how much you can get people to act on push notifications, or the total effect of push notifications on our planned outcomes (e.g., fitness activities logged). But it isn’t really so misleading about the effect of goal setting on our planned outcomes. We just quit because we’d been wrong about the first stage — which, to a large extent, was a nuisance parameter here, and perhaps of interest to a smaller (or at least different, less academic) set of people.

We were wrong in a not-super-interesting way. Here’s another example from James Druckman:

A collaborator and I hoped to causally assess whether animus toward the other party affects issue opinions; we sought to do so by manipulating participants’ levels of contempt for the other party (e.g., making Democrats dislike Republicans more) to see if increased contempt led partisans to follow party cues more on issues. We piloted nine treatments we thought could prime out-party animus and every one failed (perhaps due to a ceiling effect). We concluded an experiment would not work for this test and instead kept searching for other possibilities…

Similarly, here the idea is that the randomized treatments weren’t themselves of primary interest, but were necessary for the experiment to be informative.

Now, I should note that, at least with a single instrument and a single endogenous variable, pre-testing for instrument strength in the same sample that would be used for estimation introduces bias. But it is also hard to imagine how empirical researchers are supposed to allocate their efforts if they don’t give up when there’s really not much of a first stage. (And some of these cases here are cases where the pre-testing is happening on a separate pilot sample. And, again, the relevant pre-testing here is not necessarily a test for bias due to “weak instruments”.)
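Here’s a toy simulation of that pre-testing problem, with made-up parameters (my illustration, not from any of the studies mentioned): with a weak instrument and a confounded treatment, keeping only the samples whose first stage happens to clear a significance screen leaves IV estimates that are still pulled toward the confounded comparison.

```python
# Toy simulation of pre-testing the first stage in the estimation sample and
# reporting IV only when it "passes." Made-up parameters; not any real study.
import numpy as np

rng = np.random.default_rng(3)
n, n_sims = 500, 2000
true_effect, first_stage = 0.0, 0.05       # weak instrument, zero true effect
kept = []
for _ in range(n_sims):
    z = rng.normal(size=n)                 # randomized encouragement
    u = rng.normal(size=n)                 # unobserved confounder
    d = first_stage * z + u + rng.normal(size=n)
    y = true_effect * d + u + rng.normal(size=n)
    fs_hat = np.cov(z, d)[0, 1] / np.var(z)
    fs_se = np.std(d) / (np.std(z) * np.sqrt(n))             # rough first-stage SE
    if abs(fs_hat / fs_se) > 1.96:                           # passes the pretest
        kept.append(np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1])  # Wald/IV estimate
print(f"share of samples passing the pretest: {len(kept) / n_sims:.2f}")
print(f"mean IV estimate among passers: {np.mean(kept):.2f}  (true effect is 0)")
# Conditioning on a "significant" first stage selects samples in which z happens
# to be correlated with the confounder, so the surviving IV estimates are pulled
# toward the naive, confounded comparison.
```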

Forecasting reduced form results vs. effect ratios

This summer I tried to forecast the results of the newly published randomized experiments conducted on Facebook and Instagram during the 2020 elections. One of these interventions, which I’ll focus on here, replaced the status quo ranking of content in users’ feeds with chronological ranking. I stated my forecasts for a kind of “reduced form” or intent-to-treat analysis. For example, I guessed what the effect of this ranking change would be on a survey measure of news knowledge. I said the effect would be to reduce Facebook respondents’ news knowledge by 0.02 standard deviations. The experiment ended up yielding a 95% CI of [-0.061, -0.008] SDs. Good for me.

On the other hand, I also predicted that dropping the optimized feed for a chronological one would substantially reduce Facebook use. I guessed it would reduce time spent by 8%. Here I was wrong: the reduction was more than double that, with what I roughly calculate to be a [-23%, -19%] CI.

OK, so you win some you lose some, right? I could even self-servingly say, hey, the more important questions here were about news knowledge, polarization etc., not exactly how much time people spend on Facebook.

It is a bit more complex than that because these two predictions were linked in my head: one was a kind of “first stage” for the other, and it was the first stage I got wrong.

Part of how I made that prediction for news knowledge was by reasoning that we have some existing evidence that using Facebook increases people’s news knowledge. For example, Allcott, Braghieri, Eichmeyer & Gentzkow (2020) paid people to deactivate Facebook for four weeks before the 2018 midterms. They estimate a somewhat noisy local average treatment effect of -0.12 SDs (SE: 0.05) on news knowledge. Then I figured my predicted 8% reduction, falling probably especially on “consumption” time (rather than time posting and interacting around one’s own posts), would translate into a much smaller 0.02 SD effect. I made various informal adjustments, such as a bit of “Bayesian-ish” shrinkage towards zero.

So while maybe I got the ITT right, perhaps this is partially because I seemingly got something else wrong: the effect ratio of news knowledge over time spent (some people might call this an elasticity or semi-elasticity). Now I think it turns out here that the CI for news knowledge is pretty wide (especially if one adjusts for multiple comparisons), so even if, given the “first stage” effect, I should have predicted an effect over twice as large, the CI includes that too.
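Here’s the rough arithmetic, treating the Allcott et al. deactivation effect as coming from roughly a 100% reduction in use, which is itself a simplification:

```python
# Rough effect-ratio arithmetic. Treat the Allcott et al. (2020) deactivation
# effect on news knowledge (-0.12 SD) as coming from roughly a 100% reduction
# in Facebook use (a simplification), and scale proportionally.
ratio = -0.12 / 100        # SD of news knowledge per percentage point of use
print(ratio * 8)           # my predicted 8% reduction   -> about -0.01 SD
print(ratio * 21)          # the realized ~21% reduction -> about -0.025 SD
# The published CI for news knowledge, [-0.061, -0.008] SD, contains both
# numbers, so the ITT prediction survives even though the "first stage" was
# off by a factor of about 2.5. (My actual forecast of -0.02 also folded in
# the informal adjustments mentioned above.)
```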

Effect ratios, without all the IV assumptions

Over a decade ago, Andrew wrote about “how to think about instrumental variables when you get confused”. I think there is some wisdom here. One of the key ideas is to focus on the first stage (FS) and what sometimes is called the reduced form or the ITT: the regression of the outcome on the instrument. This sidelines the ratio of the two, ITT/FS — the ratio that is the most basic IV estimator (i.e. the Wald estimator).

So why am I suggesting thinking about the effect ratio, aka the IV estimand? And I’m suggesting thinking about it in a setting where the exclusion restriction (i.e. complete mediation, whereby the randomized intervention only affects the outcome via the endogenous variable) is pretty implausible. In the example above, it is implausible that the only effect of changing feed ranking is to reduce time spent on Facebook, as if that were a homogeneous bundle. Other results show that the switch to a chronological feed increased, for example, the fraction of subjects’ feeds that was political content, political news, and untrustworthy sources:

[Figure 2 of Guess et al., showing effects on feed composition]

Without those assumptions, this ratio can’t be interpreted as the effect of the endogenous exposure (assuming homogeneous effects) or a local average treatment effect. It’s just a ratio of two different effects of the random assignment. Sometimes in the causal inference literature there is discussion of this more agnostic parameter, labeled an “effect ratio” as I have done.

Does it make sense to focus on the effect ratio even when the exclusion restriction isn’t true?

Well in the case above, perhaps it makes sense because I used something like this ratio to produce my predictions. (But maybe this was or was not a sensible way to make predictions.)

Second, even if the exclusion restriction isn’t true, it can be that the effect ratio is more stable across the relevant interventions. It might be that the types of interventions being tried work via two intermediate exposures (A and B). If the interventions often affect them to somewhat similar degrees (perhaps we think about the differences among interventions being described by a first principal component that is approximately “strength”), then the ratio of the effect on the outcome and the effect on A can still be much more stable across interventions than the total effect on Y (which should vary a lot with that first principal component). A related idea is explored in the work on invariant prediction and anchor regression by Peter Bühlmann, Nicolai Meinshausen, Jonas Peters, and Dominik Rothenhäusler. That work encourages us to think about the goal of predicting outcomes under interventions somewhat like those we already have data on. This can be a reason to look at these effect ratios, even when we don’t believe we have complete mediation.

[This post is by Dean Eckles. Because this post touches on research on social media, I want to note that I have previously worked for Facebook and Twitter and received funding for research on COVID-19 and misinformation from Facebook/Meta. See my full disclosures here.]

Debate over effect of reduced prosecutions on urban homicides; also larger questions about synthetic control methods in causal inference.

Andy Wheeler writes:

I think this back and forth may be of interest to you and your readers.

There was a published paper attributing very large increases in homicides in Philadelphia to the policies by progressive prosecutor Larry Krasner (+70 homicides a year!). A group of researchers then published a thorough critique, going through different potential variants of data and models, showing that quite a few reasonable variants estimate reduced homicides (with standard errors often covering 0):

Hogan original paper,
Kaplan et al critique
Hogan response
my writeup

I know those posts are a lot of weeds to dig into, but they touch on quite a few topics that are recurring themes for your blog—many researcher degrees of freedom in synthetic control designs, published papers getting more deference (the Kaplan critique was rejected by the same journal), a researcher not sharing data/code and using that obfuscation as a shield in response to critics (e.g. your replication data is bad so your critique is invalid).

I took a look, and . . . I think this use of synthetic control analysis is not good. I pretty much agree with Wheeler, except that I’d go further than he does in my criticism. He says the synthetic control analysis in the study in question has data issues and problems with forking paths; I’d say that even without any issues of data and forking paths (for example, had the analysis been preregistered), I still would not like it.

Overview

Before getting to the statistical details, let’s review the substantive context. From the original article by Hogan:

De-prosecution is a policy not to prosecute certain criminal offenses, regardless of whether the crimes were committed. The research question here is whether the application of a de-prosecution policy has an effect on the number of homicides for large cities in the United States. Philadelphia presents a natural experiment to examine this question. During 2010–2014, the Philadelphia District Attorney’s Office maintained a consistent and robust number of prosecutions and sentencings. During 2015–2019, the office engaged in a systematic policy of de-prosecution for both felony and misdemeanor cases. . . . Philadelphia experienced a concurrent and historically large increase in homicides.

I would phrase this slightly differently. Rather than saying, “Here’s a general research question, and we have a natural experiment to learn about it,” I’d prefer the formulation, “Here’s something interesting that happened, and let’s try to understand it.”

It’s tricky. On one hand, yes, one of the major reasons for arguing about the effect of Philadelphia’s policy on Philadelphia is to get a sense of the effect of similar policies there and elsewhere in the future. On the other hand, Hogan’s paper is very much focused on Philadelphia between 2015 and 2019. It’s not constructed as an observational study of any general question about policies. Yes, he pulls out some other cities that he characterizes as having different general policies, but there’s no attempt to fully involve those other cities in the analysis; they’re just used as comparisons to Philadelphia. So ultimately it’s an N=1 analysis—a quantitative case study—and I think the title of the paper should respect that.

Following our “Why ask why” framework, the Philadelphia story is an interesting data point motivating a more systematic study of the effect of prosecution policies and crime. For now we have this comparison of the treatment case of Philadelphia to the control of 100 other U.S. cities.

Here are some of the data. From Wheeler (2023), here’s a comparison of trends in homicide rates in Philadelphia to three other cities:

Wheeler chooses these particular three comparison cities because they were the ones that were picked by the algorithm used by Hogan (2022). Hogan’s analysis compares Philadelphia from 2015-2019 to a weighted average of Detroit, New Orleans, and New York during those years, with those cities chosen because their weighted average lined up to that of Philadelphia during the years 2010-2014. From Hogan:

As Wheeler says, it’s kinda goofy for Hogan to line these up using homicide count rather than homicide rates . . . I’ll have more to say in a bit regarding this use of synthetic control analysis. For now, let me just note that the general pattern in Wheeler’s longer time series graph is consistent with Hogan’s story: Philadelphia’s homicide rate moved up and down over the decades, in vaguely similar ways to the other cities (increasing throughout the 1960s, slightly declining in the mid-1970s, rising again in the late-1980s, then gradually declining since 1990), but then steadily increasing from 2014 onward. I’d like to see more cities on this graph (natural comparisons to Philadelphia would be other Rust Belt cities such as Baltimore and Cleveland. Also, hey, why not show a mix of other large cities such as LA, Chicago, Houston, Miami, etc.) but this is what I’ve got here. Also it’s annoying that the above graphs stop in 2019. Hogan does have this graph just for Philadelphia that goes to 2021, though:

As you can see, the increase in homicides in Philadelphia continued, which is again consistent with Hogan’s story. Why only use data up to 2019 in the analyses? Hogan writes:

The years 2020–2021 have been intentionally excluded from the analysis for two reasons. First, the AOPC and Sentencing Commission data for 2020 and 2021 were not yet available as of the writing of this article. Second, the 2020–2021 data may be viewed as aberrational because of the coronavirus pandemic and civil unrest related to the murder of George Floyd in Minnesota.

I’d still like to see the analysis including 2020 and 2021. The main analysis is the comparison of time series of homicide rates, and, for that, the AOPC and Sentencing Commission data would not be needed, right?

In any case, based on the graphs above, my overview is that, yeah, homicides went up a lot in Philadelphia since 2014, an increase that coincided with reduced prosecutions and which didn’t seem to be happening in other cities during this period. At least, so I think. I’d like to see the time series for the rates in the other 96 cities in the data as well, going from, say, 2000, all the way to 2021 (or to 2022 if homicide data from that year are now available).

I don’t have those 96 cities, but I did find this graph going up to 2020 from a different Wheeler post:

Ignore the shaded intervals; what I care about here is the data. (And, yeah, the graph should include zero, since it’s in the neighborhood.) There has been a national increase in homicides since 2014. Unfortunately, from this national trend line alone I can’t separate out Philadelphia and any other cities that might have instituted a de-prosecution strategy during this period.

So, my summary, based on reading all the articles and discussions linked above, is . . . I just can’t say! Philadelphia’s homicide rate went up since 2014 during the same period that it decreased prosecutions, and this was part of a national trend of increased homicides—but there’s no easy way given the directly available information to compare to other cities with and without that policy. This is not to say that Hogan is wrong about the policy impacts, just that I don’t see any clear comparisons here.

The synthetic controls analysis

Hogan and the others make comparisons, but the comparisons they make are to that weighted average of Detroit, New Orleans, and New York. The trouble is . . . that’s just 3 cities, and homicide rates can vary a lot from city to city. It just doesn’t make sense to throw away the other 96 cities in your data. The implied counterfactual is that if Philadelphia had continued post-2014 with its earlier sentencing policy, its homicide rates would have looked like this weighted average of Detroit, New Orleans, and New York—but there’s no reason to expect that, as this averaging is chosen by lining up the homicide rates from 2010-2014 (actually the counts and populations, not the rates, but that doesn’t affect my general point so I’ll just talk about rates right now, as that’s what makes more sense).

And here’s the point: There’s no good reason to think that an average of three cities that gives you numbers comparable to Philadelphia’s for the homicide rates in the five previous years will give you a reasonable counterfactual for trends in the next five years. To my thinking, there’s no mathematical reason we should expect the time series to work that way, nor do I see any substantive reason based on sociology or criminology or whatever to expect anything special from a weighted average of cities that is constructed to line up with Philadelphia’s numbers for those five years.
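To make that concern concrete, here’s a minimal simulation sketch. This is not Hogan’s or Wheeler’s analysis, and all the numbers are invented: the point is just that even with no treatment effect at all, weights chosen to line up a handful of donor series with the treated city’s pre-period can leave a big post-period gap.

```python
# Sketch: fit nonnegative weights on 3 "donor" cities to match a "treated"
# city's pre-period, then look at the post-period gap. There is no treatment
# effect in this simulation, so any gap is pure noise dressed up as an effect.
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_donors, n_pre, n_post = 2000, 3, 5, 5
gaps = []
for _ in range(n_sims):
    # Independent random walks standing in for city homicide rates (per 100k).
    series = 20 + np.cumsum(rng.normal(0, 2, size=(1 + n_donors, n_pre + n_post)), axis=1)
    treated, donors = series[0], series[1:]
    # Choose weights on the simplex by least squares over a crude grid -- enough for a sketch.
    best_w, best_err = None, np.inf
    for w1 in np.linspace(0, 1, 21):
        for w2 in np.linspace(0, 1 - w1, 21):
            w = np.array([w1, w2, 1 - w1 - w2])
            err = np.sum((treated[:n_pre] - w @ donors[:, :n_pre]) ** 2)
            if err < best_err:
                best_w, best_err = w, err
    synth = best_w @ donors
    # Average post-period gap between the treated series and its "synthetic control."
    gaps.append(np.mean(treated[n_pre:] - synth[n_pre:]))

print("sd of post-period gap (true effect is zero):", np.std(gaps).round(2))
```

The point is not that synthetic control can never work, just that a good pre-period fit with only three donors gives no guarantee about the post-period counterfactual.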

The other thing is that this weighted-average thing is not what I’d imagined when I first heard that this was a synthetic controls analysis.

My understanding of a synthetic controls analysis went like this. You want to compare Philadelphia to other cities, but there are no other cities that are just like Philadelphia, so you break up the city into neighborhoods and find comparable neighborhoods in other cities . . . and when you’re done you’ve created this composite “city,” using pieces of other cities, that functions as a pseudo-Philadelphia. In creating this composite, you use lots of neighborhood characteristics, not just matching on a single outcome variable. And then you do all of this with other cities in your treatment group (cities that followed a de-prosecution strategy).

The synthetic controls analysis here differed from what I was expecting in three ways:

1. It did not break up Philadelphia and the other cities into pieces, jigsaw-style. Instead, it formed a pseudo-Philadelphia by taking a weighted average of other cities. This is a much more limited approach, using much less information, and I don’t see it as creating a pseudo-Philadelphia in the full synthetic-controls sense.

2. It only used that one variable to match the cities, leading to concerns about comparability that Wheeler discusses.

3. It was only done for Philadelphia; that’s the N=1 problem.

Researcher degrees of freedom, forking paths, and how to think about them here

Wheeler points out many forking paths in Hogan’s analysis, lots of data-dependent decision rules in the coding and analysis. (One thing that’s come up before in other settings: At this point, you might ask how we know that Hogan’s decisions were data-dependent, as this is a counterfactual statement involving the analyses he would have done had the data been different. And my answer, as in previous cases, is that, given that the analysis was not pre-registered, we can only assume it is data-dependent. I say this partly because every non-preregistered analysis I’ve ever done has been in the context of the data, and also because if all the data coding and analysis decisions had been made ahead of time (which is what would have been required for these decisions to not be data-dependent), then why not preregister? Finally, let me emphasize that researcher degrees of freedom and forking paths do not in themselves represent flaws of a study; they’re just a description of what was done, and in general I don’t think they’re a bad thing at all; indeed, almost all the papers I’ve ever published include many many data-dependent coding and decision rules.)

Given all the forking paths, we should not take Hogan’s claims of statistical significance at face value, and indeed the critics find that various alternative analyses can change the results.

In their criticism, Kaplan et al. say that reasonable alternative specifications can lead to null or even opposite results compared to what Hogan reported. I don’t know if I completely buy this—given that Philadelphia’s homicide rate increased so much since 2014, it’s hard for me to see how a reasonable estimate would find that the de-prosecution policy reduced the homicide rate.

To me, the real concern is with comparing Philadelphia to just three other cities. Forking paths are real, but I’d have this concern even if the analysis were identical and it had been preregistered. Preregister it, whatever, you’re still only comparing to three cities, and I’d like to see more.

Not junk science, just difficult science

As Wheeler implicitly says in his discussion, Hogan’s paper is not junk science—it’s not like those papers on beauty and sex ratio, or ovulation and voting, or air rage, himmicanes, ages ending in 9, or the rest of our gallery of wasted effort. Hogan and the others are studying real issues. The problem is that the data are observational, sparse, and highly variable; that is, the problem is hard. And it doesn’t help when researchers are under the impression that these real difficulties can be easily resolved using canned statistical identification techniques. In that respect, we can draw an analogy to the notorious air-pollution-in-China paper. But this one’s even harder, in the following sense: The air-pollution-in-China paper included a graph with two screaming problems: an estimated life expectancy of 91 and an out-of-control nonlinear fitted curve. In contrast, the graphs in the Philadelphia-analysis paper all look reasonable enough. There’s nothing obviously wrong with the analysis, and the problem is a more subtle issue of the analysis not fully accounting for variation in the data.

Difference-in-differences: What’s the difference?

After giving my talk last month, Better Than Difference in Differences, I had some thoughts about how diff-in-diff works—how the method operates in relation to its assumptions—and it struck me that there are two relevant ways to think about it.

From a methods standpoint, the relevant point here is that I will usually want to replace differencing with regression. Instead of taking (yT – yC) – (xT – xC), where y is the post-period outcome, x is the pre-period measurement, and T and C index the treatment and control groups, I’d rather look at (yT – yC) – b*(xT – xC), where b is a coefficient estimated from the data, likely to be somewhere between 0 and 1. Difference-in-differences is the special case b=1, and in general you should be able to do better by estimating b. We discuss this with the Electric Company example in chapter 19 of Regression and Other Stories and with a medical trial in our paper in the American Heart Journal.
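Here’s a minimal simulation sketch of that point (this is not the Electric Company analysis; the parameter values are invented): both estimators are unbiased for the treatment effect, but estimating b from the control group gives a noticeably less noisy answer than fixing b = 1.

```python
# Compare plain difference-in-differences (b implicitly 1) with an estimate
# that uses a regression coefficient b fit from control-group data.
import numpy as np

rng = np.random.default_rng(1)
n, effect = 200, 2.0
dd_ests, reg_ests = [], []
for _ in range(5000):
    x_c = rng.normal(50, 10, n)                           # control pre-period
    x_t = rng.normal(50, 10, n)                           # treated pre-period
    y_c = 10 + 0.5 * x_c + rng.normal(0, 5, n)            # control post-period
    y_t = 10 + 0.5 * x_t + effect + rng.normal(0, 5, n)   # treated post-period
    # Difference in differences: subtract the full pre-period difference.
    dd_ests.append((y_t.mean() - y_c.mean()) - (x_t.mean() - x_c.mean()))
    # Estimate b by regressing y on x in the control group, then adjust.
    b = np.polyfit(x_c, y_c, 1)[0]
    reg_ests.append((y_t.mean() - y_c.mean()) - b * (x_t.mean() - x_c.mean()))

print("diff-in-diff: mean", np.mean(dd_ests).round(2), " sd", np.std(dd_ests).round(2))
print("estimated b:  mean", np.mean(reg_ests).round(2), " sd", np.std(reg_ests).round(2))
```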

Given this, what’s the appeal of diff-in-diff? I think the appeal of the method comes from the following mathematical sequence:

Control units:
(a) Data at time 0 = Baseline + Error_a
(b) Data at time 1 = Baseline + Trend + Error_b

Treated units:
(c) Data at time 0 = Baseline + Error_c
(d) Data at time 1 = Baseline + Trend + Effect + Error_d

Now take a diff in diff:

((d) – (c)) – ((b) – (a)) = Effect + Error,

where that last Error is a difference in difference of errors, which is just fine under the reasonable-enough assumption that the four error terms are independent.

The above argument looks pretty compelling and can easily be elaborated to include nonlinear trends, multiple time points, interactions, and so forth. That’s the direction of the usual diff-in-diff discussions.

The message of my above-linked talk and our paper, though, was different. Our point was that, whatever differencing you take, it’s typically better to difference only some of the way. Or, to make the point more generally, it’s better to model the baseline and the trend as well as the effect.

Seductive equations

The above equations are seductive: with just some simple subtraction, you can cancel out Baseline and Trend, leaving just Effect and error. And the math is correct (conditional on the assumptions, which can be reasonable). The problem is that the resulting estimate can be super noisy; indeed, it’s basically never the right thing to do from a probabilistic (Bayesian) standpoint.

In our example it was pretty easy in retrospect to do the fully Bayesian analysis. It helped that we had 38 replications of similar experiments, so we could straightforwardly estimate all the hyperparameters in the model. If you only have one experiment, your inferences will depend on priors that can’t directly be estimated from local data. Still, I think the Bayesian approach is the way to go, in the sense of yielding effect-size estimates that are more reasonable and closer to the truth.

Next step is to work this out on some classic diff-in-diff examples.

No, this paper on strip clubs and sex crimes was never gonna get retracted. Also, a reminder of the importance of data quality, and a reflection on why researchers often think it’s just fine to publish papers using bad data under the mistaken belief that these analyses are “conservative” or “attenuated” or something like that.

Brandon Del Pozo writes:

Born in Bensonhurst, Brooklyn in the 1970’s, I came to public health research by way of 23 years as a police officer, including 19 years in the NYPD and four as a chief of police in Vermont. Even more tortuously, my doctoral training was in philosophy at the CUNY Graduate Center.

I am writing at the advice of colleagues because I remain extraordinarily vexed by a paper that came out in 2021. It purports to measure the effects of opening strip clubs on sex crimes in NYC at the precinct level, and finds substantial reductions within a week of opening each club. The problem is the paper is implausible from the outset because it uses completely inappropriate data that anyone familiar with the phenomena would find preposterous. My colleagues and I, who were custodians of the data and participants in the processes under study when we were police officers, wrote a very detailed critique of the paper and called for its retraction. Beyond our own assertions, we contacted state agencies who went on the record about the problems with the data as well.

For their part, the authors and editors have been remarkably dismissive of our concerns. They said, principally, that we are making too big a deal out of the measures being imprecise and a little noisy. But we are saying something different: the study has no construct validity because it is impossible to measure the actual phenomena under study using its data.

Here is our critique, which will soon be out in Police Practice and Research. Here is the letter from the journal editors, and here is a link to some coverage in Retraction Watch. I guess my main problem is the extent to which this type of problem was missed or ignored in the peer review process, and why it is being so casually dismissed now. Is it a matter of economists circling their wagons?

My reply:

1. Your criticisms seem sensible to me. I also have further concerns with the data (or maybe you pointed these out in your article and I did not notice), in particular the distribution of data in Figure 1 of the original article. Most weeks there seem to be approximately 20 sex crime stops (which they misleadingly label as “sex crimes”), but then there’s one week with nearly 200? This makes me wonder what is going on with these data.

2. I see from the Retraction Watch article that one of the authors responded, “As far as I am concerned, a serious (scientifically sound) confutation of the original thesis has not been given yet.” This raises the interesting question of burden of proof. Before the article is accepted for publication, it is the authors’ job to convincingly justify their claim. After publication, the author is saying that the burden is on the critic (i.e., you). To put it another way: had your comment been in a pre-publication referee report, it should’ve been enough to make the editors reject the paper or at least require more from the authors. But post-publication is another story, at least according to current scientific conventions.

3. From a methodological standpoint, the authors follow the very standard approach of doing an analysis, finding something, then performing a bunch of auxiliary analyses (robustness checks) to rule out alternative explanations. I am skeptical of robustness checks; see also here. In some ways, the situation is kind of hopeless, in that, as researchers, we are trained to respond to questions and criticism by trying our hardest to preserve our original conclusions.

4. One thing I’ve noticed in a lot of social science research is a casual attitude toward measurement. See here for the general point, and over the years we’ve discussed lots of examples, such as arm circumference being used as a proxy for upper-body strength (we call that the “fat arms” study) and a series of papers characterizing days 6-14 of the menstrual cycle as the days of peak fertility, even though the days of peak fertility vary a lot from woman to woman, with a consensus summary being days 10-17. The short version of the problem here, especially in econometrics, is that there’s a general understanding that if you use bad measurements, it should attenuate (that is, pull toward zero) your estimated effect sizes; hence, if someone points out a measurement problem, a common reaction is to think that it’s no big deal because if the measurements are off, that just led to “conservative” estimates. Eric Loken and I wrote this article once to explain the point, but the message has mostly not been received. (See the simulation sketch following these numbered points.)

5. Given all the above, I can see how the authors of the original paper would be annoyed. They’re following standard practice, their paper got accepted, and now all of a sudden they’re appearing in Retraction Watch!

6. Separate from all the above, there’s no way that paper was ever going to be retracted. The problem is that journals and scholars treat retraction as a punishment of the authors, not as a correction of the scholarly literature. It’s pretty much impossible to get an involuntary retraction without there being some belief that there has been wrongdoing. See discussion here. In practice, a fatal error in a paper is not enough to force retraction.

7. In summary, no, I don’t think it’s “economists circling their wagons.” I think this is a mix of several factors: a high bar for post-publication review, a general unconcern with measurement validity and reliability, a trust in robustness checks, and the fact that retraction was never a serious option. Given that the authors of the original paper were not going to issue a correction on their own, the best outcome for you was to either publish a response in the original journal (which would’ve been accompanied by a rebuttal from the original authors) or to publish in a different journal, which is what happened. Beyond all this, the discussion quickly gets technical. I’ve done some work on stop-and-frisk data myself and I have decades of experience reading social science papers, but even for me I was getting confused with all the moving parts, and indeed I could well imagine being convinced by someone on the other side that your critiques were irrelevant. The point is that the journal editors are not going to feel comfortable making that judgment, any more than I would be.
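Regarding point 4 above, here’s a minimal simulation sketch (mine, not a calculation from the Loken and Gelman article) of why “bad measurement just attenuates the estimate” stops being reassuring once results are selected on statistical significance: among the comparisons that clear the significance filter, the estimates are wildly exaggerated even though the measurements are noisy.

```python
# Small true effect, badly measured outcome, selection on significance.
import numpy as np

rng = np.random.default_rng(2)
n, true_effect, n_sims = 50, 0.1, 20000
sig_ests = []
for _ in range(n_sims):
    x = rng.binomial(1, 0.5, n)                   # treatment indicator
    y = true_effect * x + rng.normal(0, 1, n)     # true outcome
    y_obs = y + rng.normal(0, 1, n)               # noisy, badly measured outcome
    diff = y_obs[x == 1].mean() - y_obs[x == 0].mean()
    se = np.sqrt(y_obs[x == 1].var(ddof=1) / (x == 1).sum()
                 + y_obs[x == 0].var(ddof=1) / (x == 0).sum())
    if abs(diff / se) > 1.96:                     # "statistically significant"
        sig_ests.append(diff)

print("true effect:", true_effect)
print("mean |estimate| among significant results:",
      np.round(np.mean(np.abs(sig_ests)), 2))
```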

Del Pozo responded by clarifying some points:

Regarding the data with outliers in my point 1 above, Del Pozo writes, “My guess is that this was a week when there was an intense search for a wanted pattern rape suspect. Many people were stopped by police above the average of 20 per week, and at least 179 of them were innocent. We discuss this in our reply; not only do these reports not record crimes in nearly all cases, but several reports may reflect police stops of innocent people in the search for one wanted suspect. It is impossible to measure crime with stop reports.”

Regarding the issue of pre-publication and post-publication review in my point 2 above, Del Pozo writes, “We asked the journal to release the anonymized peer reviews to see if anyone had at least taken up this problem during review. We offered to retract all of our own work and issue a written apology if someone had done basic due diligence on the matter of measurement during peer review. They never acknowledged or responded to our request. We also wrote that it is not good science when reviewers miss glaring problems and then other researchers have to upend their own research agenda to spend time correcting the scholarly record in the face of stubborn resistance that seems more about pride than science. None of this will get us a good publication, a grant, or tenure, after all. I promise we were much more tactful and diplomatic than that, but that was the gist. We are police researchers, not the research police.”

To paraphrase Thomas Basbøll, they are not the research police because there is no such thing as the research police.

Regarding my point 3 on the lure of robustness checks and their problems, Del Pozo writes, “The first author of the publication was defensive and dismissive when we were all on a Zoom together. It was nothing personal, but an Italian living in Spain was telling four US police officers, three of whom were in the NYPD, that he, not us, better understood the use and limits of NYPD and NYC administrative data and the process of gaining the approvals to open a strip club. The robustness checks all still used opening dates based on registration dates, which do not associate with actual opening in even a remotely plausible way to allow for a study of effects within a week of registration. Any analysis with integrity would have to exclude all of the data for the independent variable.”

Regarding my point 4 on researchers’ seemingly-strong statistical justifications for going with bad measurements, Del Pozo writes, “Yes, the authors literally said that their measurement errors at T=0 weren’t a problem because the possibility of attenuation made it more likely that their rejection of the null was actually based on a conservative estimate. But this is the point: the data cannot possibly measure what they need it to, in seeking to reject the null. It measures changes in encounters with innocent people after someone has let New York State know that they plan to open a business in a few months, and purports to say that this shows sex crimes go down the week after a person opens a sex club. I would feel fraudulent if I knew this about my research and allowed people to cite it as knowledge.”

Regarding my point 6 that just about nothing ever gets involuntarily retracted without a finding of research misconduct, Del Pozo points to an “exception that proves the rule: a retraction for the inadvertent pooling of heterogeneous results in a meta analysis that was missed during peer review, and nothing more.”

Regarding my conclusions in point 7 above, Del Pozo writes, “I was thinking of submitting a formal replication to the journal that began with examining the model, determining there were fatal measurement errors, then excluding all inappropriate data, i.e., all the data for the independent variable and 96% of the data for the dependent variable, thereby yielding no results, and preventing rejection of the null. Voila, a replication. I would be so curious to see a reviewer in the position of having to defend the inclusion of inappropriate data in a replication. The problem of course is replications are normatively structured to assume the measurements are sound, and if anything you keep them all and introduce a previously omitted variable or something. I would be transgressing norms with such a replication. I presume it would be desk rejected.”

Yup, I think such a replication would be rejected for two reasons. First, journals want to publish new stuff, not replications. Second, they’d see it as a criticism of a paper they’d published, and journals usually don’t like that either.

Beneath every application of causal inference to ML lies a ridiculously hard social science problem

This is Jessica. Zach Lipton gave a talk at an event on human-centered AI at the University of Chicago the other day that resonated with me, in which he commented on the adoption of causal inference to solve machine learning problems. The premise was that there’s been considerable reflection lately on methods in machine learning, as it has become painfully obvious that accuracy on held-out IID data is often not a good predictor of model performance in a real-world deployment. So, one computer scientist picking up the Book of Why at a time, researchers are adapting causal inference methods to make progress on problems that arise in predictive modeling.

For example, Northwestern CS now regularly offers a causal machine learning course for undergrads. Estimating counterfactuals is common in approaches to fairness and algorithmic recourse (recommendations of the minimal intervention someone can take to change their predicted label), and in “explainable AI.” Work on feedback loops (e.g., performative prediction) is essentially about how to deal with causal effects of the predictions themselves on the outcomes. 

Jake Hofman et al. have used the term integrative modeling to refer to activities that attempt to predict as-yet unseen outcomes in terms of causal relationships. I have generally been a fan of research happening in this bucket, because I think there is value in making and attempting to test assertions about how we think data are generated. Often doing so lends some conceptual clarity, even if all you get is a better sense of what’s hard about the problem you’re trying to solve. However, it’s not necessarily easy to find great examples yet of integrative modeling. Lipton’s critique was that despite the conceptual elegance gained in bringing causal methods to bear on machine learning problems, their promise for actually solving the hard problems that come up in ML is somewhat illusory, because they inevitably require us to make assumptions that we can’t really back up in the kinds of high dimensional prediction problems on observational data that ML deals with. Hence the title of this post, that ultimately we’re often still left with some really hard social science problem. 

There is an example that this brings to mind which I’d meant to post on over a year ago, involving causal approaches to ML fairness. Counterfactuals are often used to estimate the causal effects of protected attributes like race in algorithmic auditing. However, some applications have been met with criticism for not reflecting common sense expectations about the effects of race on a person’s life. For example, consider the well known 2004 AER paper by Bertrand and Mullainathan, “Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination,” which attempts to measure race-based discrimination in callbacks on fake resumes by manipulating applicant names on the same resumes to imply different races. Lily Hu uses this example to critique approaches to algorithmic auditing based on direct effects estimation. Hu argues that assuming you can identify racial discrimination by imagining flipping race differently while holding all other qualifications or personal attributes of people constant is incoherent, because the idea that race can be switched on and off without impacting other covariates is incompatible with modern understanding of the effects of race. In this view, Pearl’s statement in Causality that “[t]he central question in any employment discrimination case is whether the employer would have taken the same action had the employee been of a different race… and everything else had been the same” exhibits a conceptual error, previously pointed out by Kohler-Hausmann, where race is treated as phenotype or skin type alone, misrepresenting the actual socially constructed nature of race. Similar ideas have been discussed before on the blog around detecting racial bias in police behavior, such as use of force, e.g., here.

Path-specific counterfactual fairness methods instead assume the causal graph is known, and hinge on identifying fair versus unfair pathways affecting the outcome of interest. For example, if you’re using matching to check for discrimination, you should be matching units only on path-specific effects of race that are considered fair. To judge if a decision to not call back a black junior in high school with a 3.7 GPA was fair, we need methods that allow us to ask whether he would have gotten the callback if he were his white counterpart. If both knowledge and race are expected to affect GPA, but only one of these is fair, we should adjust our matching procedure to eliminate what we expect the unfair effect of race on GPA to be, while leaving the fair pathway. If we do this we are likely to arrive at a white counterpart with a higher GPA than 3.7, assuming we think being black leads to a lower GPA due to obstacles not faced by the white counterpart, like boosts in grades due to preferential treatment.  

One of Hu’s conclusions is that while this all makes sense in theory, it becomes a very slippery thing to try to define in practice:

To determine whether an employment callback decision process was fair, causal approaches ask us to determine the white counterpart to Jamal, a Black male who is a junior with a 3.7 GPA at the predominantly Black Pomona High School. When we toggle Jamal’s race attribute from black to white and cascade the effect to all of his “downstream” attributes, he becomes white Greg. Who is this Greg? Is it Greg of the original audit study, a white male who is a junior at Pomona High School with a 3.7 GPA? Is it Greg1, a white male who is a junior at Pomona High School with a 3.9 GPA (adjusted for the average Black-White GPA gap at Pomona High School)? Or is it Greg2, a white male who is a junior at nearby Diamond Ranch High School—the predominantly white school in the area—with a 3.82 GPA (accounting for nationwide Black-White GPA gap)? Which counterfactual determines whether Jamal has been treated fairly? Will the real white Greg please stand up?

And so we’re left with the non-trivial task of getting experts to agree on the normative interpretation of which pathways are fair, and what the relevant populations are for estimating effects along the unfair pathways.

This reminds me a bit of the motivation behind writing this paper comparing concerns about ML reproducibility and generalizability to perceived causes of the replication crisis in social science, and of my grad course on explanation and reproducibility in data-driven science. It’s easy to think that one can take methods from explanatory modeling to solve problems related to distribution shift, and on some level you can make some progress, but you better be ready to embrace some unresolvable uncertainty due to not knowing if your model specification was a good approximation. At any rate, there’s something kind of reassuring about listening to ML talks and being reminded of the crud factor.

In which we answer some questions about regression discontinuity designs

A researcher who wishes to remain anonymous writes:

I am writing with a question about your article with Imbens, Why High-Order Polynomials Should Not Be Used in Regression Discontinuity Designs. In it, you discourage the use of high-order polynomials of the forcing variable when fitting models. I have a few questions about this:

(1) What are your thoughts about the use of restricted cubic splines (RCS) that are linear in both tails?

(2) What are your thoughts on the use of a generalized additive model with local regression (rather than with splines)?

(3) What are your thoughts on the use of loess to fit the regression models?

I wonder if the use of restricted cubic splines would be less susceptible to the difficulties that you describe given that it is linear in the tails.

My quick reply is that I wouldn’t really trust any estimate that jumps around a lot. I’ve seen too many regression discontinuity analyses that give implausible answers because the jump at the discontinuity cancels a sharp jump in the other direction in the fitted curve. When you look at the regression discontinuity analyses that work (in the sense of giving answers that make sense), the fitted curve is smooth.

The first question above is addressing the tail-wagging-the-dog issue, and that’s a concern as well. I guess I’d like to see models where the underlying curve is smooth, and if that doesn’t fit the data, then I think the solution is to restrict the range of the data where the model is fit, not to try to solve the problem by fitting a curve that gets all jiggy.

My other general advice, really more important than what I just wrote above, is to think of regression discontinuity as a special case of an observational study. You have a treatment or exposure z, an outcome y, and pre-treatment variables x. In a discontinuity design, one of the x’s is a “forcing variable,” for which z_i = 1 for cases where x_i exceeds some threshold, and z_i = 0 for cases where x_i is lower than the threshold. This is a design with known treatment assignment and zero overlap, and, yeah, you’ll definitely want to adjust for imbalance in that x-variable. My inclination would be to fit a linear model for this adjustment, but sometimes a nonlinear model will make sense, as long as you keep it smooth.

But . . . the forcing variable is, in general, just one of your pre-treatment variables. What you have is an observational study! And you can have imbalance on other pre-treatment variables also. So my main recommendation is to adjust for other important pre-treatment variables as well.

For an example, see here, where I discuss a regression discontinuity analysis where the outcome variable was length of life remaining, and the published analysis did not include age as a predictor. You gotta adjust for age! The message is: a discontinuity analysis is an observational study. The forcing variable is important, but it’s not the only thing in town. The big mistakes seem to come from: (a) unregularized regression on the forcing variable, which can give you wild jumpy curves that pollute the estimate of the discontinuity, (b) not adjusting for other important pre-treatment predictors, and (c) taking statistically significant estimates and treating them as meaningful, without looking at the model that’s been fit.
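To put that advice in concrete terms, here’s a minimal sketch with simulated data (not any of the published analyses discussed here): a smooth, linear adjustment for the forcing variable plus adjustment for another pre-treatment variable such as age.

```python
# Regression discontinuity treated as an observational study: adjust for the
# forcing variable (smoothly) and for another important pre-treatment variable.
import numpy as np

rng = np.random.default_rng(3)
n, cutoff, true_effect = 500, 0.0, 1.0
x = rng.normal(0, 1, n)                  # forcing variable
age = rng.normal(60, 10, n)              # another pre-treatment variable
z = (x > cutoff).astype(float)           # treatment assignment from the threshold
y = 0.8 * x + 0.05 * age + true_effect * z + rng.normal(0, 1, n)

# Least-squares fit of y on treatment, forcing variable, and age.
X = np.column_stack([np.ones(n), z, x, age])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated discontinuity, adjusting for x and age:", coef[1].round(2))

# Note: dropping age happens to be harmless in this simulation only because
# age is balanced across the threshold by construction; in real examples
# (such as the length-of-life study mentioned above) it won't be.
```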

We discuss some of this in Section 21.3 of Regression and Other Stories.

A message to Parkinson’s Disease researchers: Design a study to distinguish between these two competing explanations of the fact that the incidence of Parkinson’s is lower among smokers

After reading our recent post, “How to quit smoking, and a challenge to currently-standard individualistic theories in social science,” Gur Huberman writes:

You may be aware that the incidence of Parkinson’s disease (PD) is lower in the smoking population than in the general population, and that the negative relation is stronger for heavier and longer-duration smokers.

The reason for that is unknown. Some neurologists conjecture that there’s something in smoked tobacco which causes some immunity from PD. Others conjecture that whatever causes PD also helps people quit or avoid smoking. For instance, a neurologist told me that dopamine (the material whose deficit causes PD) is associated with addiction not only to smoking but also to coffee drinking.

Your blog post made me think of a study that will try to distinguish between the two explanations for the negative relation between smoking and PD. Such a study will exploit variations (e.g., in geography & time) between the incidence of smoking and that of PD.

It will take a good deal of leg work to get the relevant data, and a good deal of brain work to set up a convincing statistical design. It will also be very satisfying to see convincing results one way or the other. More than satisfying, such a study could help develop medications to treat or prevent PD.

If this project makes sense perhaps you can bring it to the attention of relevant scholars.

OK, here it is. We’ll see if anyone wants to pick this one up.

I have some skepticism about Gur’s second hypothesis, that “whatever causes PD also helps people quit or avoid smoking.” I say this only because, from my perspective, and as discussed in the above-linked post, the decision to smoke seems like much more of a social attribute than an individual decision. But, sure, I could see how there could be correlations.

In any case, it’s an interesting statistical question as well as an important issue in medicine and public health, so worth thinking about.

Better Than Difference in Differences (my talk for the Online Causal Inference Seminar Tues 19 Sept)

Here’s the announcement, and here’s the video:

Better Than Difference in Differences

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

It is not always clear how to adjust for control data in causal inference, balancing the goals of reducing bias and variance. We show how, in a setting with repeated experiments, Bayesian hierarchical modeling yields an adaptive procedure that uses the data to determine how much adjustment to perform. The result is a novel analysis with increased statistical efficiency compared with the default analysis based on difference estimates. The increased efficiency can have real-world consequences in terms of the conclusions that can be drawn from the experiments. An open question is how to apply these ideas in the context of a single experiment or observational study, in which case the optimal adjustment cannot be estimated from the data; still, the principle holds that difference-in-differences can be extremely wasteful of data.

The talk follows up on Andrew Gelman and Matthijs Vákár (2021), Slamming the sham: A Bayesian model for adaptive adjustment with noisy control data, Statistics in Medicine 40, 3403-3424, http://www.stat.columbia.edu/~gelman/research/published/chickens.pdf

Here’s the talk I gave in this seminar a few years ago:

100 Stories of Causal Inference

In social science we learn from stories. The best stories are anomalous and immutable (see http://www.stat.columbia.edu/~gelman/research/published/storytelling.pdf). We shall briefly discuss the theory of stories, the paradoxical nature of how we learn from them, and how this relates to forward and reverse causal inference. Then we will go through some stories of applied causal inference and see what lessons we can draw from them. We hope this talk will be useful as a model for how you can better learn from your own experiences as participants and consumers of causal inference.

No overlap, I think.

A rational agent framework for improving visualization experiments

This is Jessica. In The Rational Agent Benchmark for Data Visualization, Yifan Wu, Ziyang Guo, Michalis Mamakos, Jason Hartline and I write: 

Understanding how helpful a visualization is from experimental results is difficult because the observed performance is confounded with aspects of the study design, such as how useful the information that is visualized is for the task. We develop a rational agent framework for designing and interpreting visualization experiments. Our framework conceives two experiments with the same setup: one with behavioral agents (human subjects), and the other one with a hypothetical rational agent. A visualization is evaluated by comparing the expected performance of behavioral agents to that of a rational agent under different assumptions. Using recent visualization decision studies from the literature, we demonstrate how the framework can be used to pre-experimentally evaluate the experiment design by bounding the expected improvement in performance from having access to visualizations, and post-experimentally to deconfound errors of information extraction from errors of optimization, among other analyses.

I like this paper. Part of the motivation behind it was my feeling that even when we do our best to rigorously define a decision or judgment task for studying visualizations,  there’s an inevitable dependence of the results on how we set up the experiment. In my lab we often put a lot of effort into making the results of experiments we run easier to interpret, like plotting model predictions back to data space to reason about magnitudes of effects, or comparing people’s performance on a task to simple baselines. But these steps don’t really resolve this dependence. And if we can’t even understand how surprising our results are in light of our own experiment design, then it seems even more futile to jump to speculating what our results imply for real world situations where people use visualizations. 

We could summarize the problem in terms of various sources of unresolved ambiguity when experiment results are presented. Experimenters make many decisions in design–some of which they themselves may not even be aware they are making–which influence the range of possible effects we might see in the results. When studying information displays in particular, we might wonder about things like:

  • The extent to which performance differences are likely to be driven by differences in the amount of relevant information the displays convey for that task. For example, different visualization strategies for showing a distribution often vary in how they summarize the data (e.g., means versus intervals versus density plots).
  • How instrumental the information display is to doing well on the task – if one understood the problem but answered without looking at the visualization, how well would we expect them to do? 
  • To what extent participants in the study could be expected to be incentivized to use the display. 
  • What part of the process of responding to the task – extracting the information from the display, or figuring out what to do with it once it was extracted – led to observed losses in performance among study participants. 
  • And so on.

The status quo approach to writing results sections seems to be to let the reader form their own opinions on these questions. But as readers we’re often not in a good position to understand what we are learning unless we take the time to analyze the decision problem of the experiment carefully ourselves, assuming the authors have even presented it in enough detail to make that possible. Few readers are going to be willing and/or able to do this. So what we take away from the results of empirical studies on visualizations is noisy to say the least.

An alternative which we explore in this paper is to construct benchmarks using the experiment design to make the results more interpretable. First, we take the decision problem used in a visualization study and formulate it in decision theoretic terms of a data-generating model over an uncertain state drawn from some state space, an action chosen from some action space, a visualization strategy, and a scoring rule. (At least in theory, we shouldn’t have trouble picking up a paper describing an evaluative experiment and identifying these components, though in practice in fields where many experimenters aren’t thinking very explicitly about things like scoring rules at all, it might not be so easy). We then conceive a rational agent who knows the data-generating model and understands how the visualizations (signals) are generated, and compare this agent’s performance under different assumptions in pre-experimental and post-experimental analyses. 

Pre-experimental analysis: One reason for analyzing the decision task pre-experimentally is to identify cases where we have designed an experiment to evaluate visualizations but we haven’t left a lot of room to observe differences between them, or we didn’t actually give participants an incentive to look at them. Oops! To define the value of information to the decision problem we look at the difference between the rational agent’s expected performance when they only have access to the prior versus when they know the prior and also see the signal (updating their beliefs and choosing the optimal action based on what they saw). 

The value of information captures how much having access to the visualization is expected to improve performance on the task in payoff space. When there are multiple visualization strategies being compared, we calculate it using the maximally informative strategy. Pre-experimentally, we can look at the size of the value of information unit relative to the range of possible scores given by the scoring rule. If the expected difference in score from making the decision after looking at the visualization versus from the prior only is a small fraction of the range of possible scores on a trial, then we don’t have a lot of “room” to observe gains in performance (in the case of studying a single visualization strategy) or (more commonly) in comparing several visualization strategies. 

We can also pre-experimentally compare the value of information to the baseline reward one expects to get for doing the experiment regardless of performance. Assuming we think people are motivated by payoffs (which is implied whenever we pay people for their participation), a value of information that is a small fraction of the expected baseline reward should make us question how likely participants are to put effort into the task.   

Post-experimental analysis: The value of information also comes in handy post-experimentally, when we are trying to make sense of why our human participants didn’t do as well as the rational agent benchmark. We can look at what fraction of the value of information unit human participants achieve with different visualizations. We can also differentiate sources of error by calibrating the human responses. The calibrated behavioral score is the expected score of a rational agent who knows the prior but instead of updating from the joint distribution over the signal and the state, they update from the joint distribution over the behavioral responses and the state. This distribution may contain information that the agents were unable to act on. Calibrating (at least in the case of non-binary decision tasks) helps us see how much. 

Specifically, calculating the difference between the calibrated score and the rational agent benchmark as a fraction of the value of information measures the extent to which participants couldn’t extract the task relevant information from the stimuli. Calculating the difference between the calibrated score and the expected score of human participants (e.g., as predicted by a model fit to the observed results) as a fraction of the value of information, measures the extent to which participants couldn’t choose the optimal action given the information they gained from the visualization.
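Here’s a minimal sketch of these calculations for a toy decision problem. The numbers are invented and this is not the setup of any study in the paper: a binary state, a noisy binary signal standing in for the visualization, and a score of 1 for a correct decision and 0 otherwise.

```python
# Rational-agent benchmark for a toy decision problem.
prior = 0.3                                # P(state = 1)
p_signal_given_state = {0: 0.2, 1: 0.8}    # P(signal = 1 | state)

def best_score(p1):
    # Expected score of the optimal action given belief P(state = 1) = p1.
    return max(p1, 1 - p1)

# Rational agent with the prior only:
score_prior = best_score(prior)

# Rational agent who also sees the signal: average the optimal score over signals.
score_signal = 0.0
for s in (0, 1):
    p_s = sum((prior if st else 1 - prior) *
              (p_signal_given_state[st] if s else 1 - p_signal_given_state[st])
              for st in (0, 1))
    post = (prior * (p_signal_given_state[1] if s else 1 - p_signal_given_state[1])) / p_s
    score_signal += p_s * best_score(post)

value_of_information = score_signal - score_prior
print("value of information:", round(value_of_information, 3))

# Post-experimental calibration would redo this computation with the observed
# behavioral responses in place of the signal, using the empirical joint
# distribution of responses and states.
```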

There is an interesting complication to all of this: many behavioral experiments don’t endow participants with a prior for the decision problem, but the rational agent needs to know the prior. Technically the definitions of the losses above should allow for loss caused by not having the right prior. So I am simplifying slightly here.  

To demonstrate how all this formalization can be useful in practice, we chose a couple prior award-winning visualization research papers and applied the framework. Both are papers I’m an author on – why create new methods if you can’t learn things about your own work? In both cases, we discovered things that the original papers did not account for, such as weak incentives to consult the visualization assuming you understood the task, and a better explanation for a disparity in visualization strategy rankings by performance for a belief versus a decision task. These were the first two papers we tried to apply the framework to, not cherry-picked to be easy targets.  We’ve also already applied it in other experiments we’ve done, such as for benchmarking privacy budget allocation in visual analysis.

I continue to consider myself a very skeptical experimenter, since at the end of the day, decisions about whether to deploy some intervention in the world will always hinge on the (unknown) mapping between the world of your experiment and the real world context you’re trying to approximate. But I like the idea of making greater use of rational agent frameworks in visualization in that we can at least gain a better understanding of what our results mean in the context of the decision problem we are studying.

“Sources of bias in observational studies of covid-19 vaccine effectiveness”

Kaiser writes:

After over a year of navigating the peer-review system (a first for me!), my paper with Mark Jones and Peter Doshi on observational studies of Covid vaccines is published.

I believe this may be the first published paper that asks whether the estimates of vaccine effectiveness (80%, 90%, etc.) from observational studies have overestimated the real-world efficacy.

There is a connection to your causal quartets/interactions ideas. In all the Covid related studies I have read, the convention is always to throw a bunch of demographic variables (usually age, sex) into the logistic regression as main effects only, and then declare that they have cured biases associated with those variables. Would like to see interaction effects in these models!

Fung, Jones, and Doshi write:

In late 2020, messenger RNA (mRNA) covid-19 vaccines gained emergency authorisation on the back of clinical trials reporting vaccine efficacy of around 95%, kicking off mass vaccination campaigns around the world. Within 6 months, observational studies report[ed] vaccine effectiveness in the “real world” at above 90% . . . there has (with rare exception) been surprisingly little discussion of the limitations of the methodologies of these early observational studies. . . .

In this article, we focus on three major sources of bias for which there is sufficient data to verify their existence, and show how they could substantially affect vaccine effectiveness estimates using observational study designs—particularly retrospective studies of large population samples using administrative data wherein researchers link vaccinations and cases to demographics and medical history. . . .

Using the information on how cases were counted in observational studies, and published datasets on the dynamics and demographic breakdown of vaccine administration and background infections, we illustrate how three factors generate residual biases in observational studies large enough to render a hypothetical inefficacious vaccine (i.e., of 0% efficacy) as 50%–70% effective. To be clear, our findings should not be taken to imply that mRNA covid-19 vaccines have zero efficacy. Rather, we use the 0% case so as to avoid the need to make any arbitrary judgements of true vaccine efficacy across various levels of granularity (different subgroups, different time periods, etc.), which is unavoidable when analysing any non-zero level of efficacy. . . .

They discuss three sources of bias:

– Case-counting window bias: Investigators did not begin counting cases until participants were at least 14 days (7 days for Pfizer) past completion of the dosing regimen, a timepoint public health officials subsequently termed “fully vaccinated.” . . . In randomised trials, applying the “fully vaccinated” case counting window to both vaccine and placebo arms is easy. But in cohort studies, the case-counting window is only applied to the vaccinated group. Because unvaccinated people do not take placebo shots, counting 14 days after the second shot is simply inoperable. This asymmetry, in which the case-counting window nullifies cases in the vaccinated group but not in the unvaccinated group, biases estimates. . . .

– Age bias: Age is perhaps the most influential risk factor in medicine, affecting nearly every health outcome. Thus, great care must be taken in studies comparing vaccinated and unvaccinated to ensure that the groups are balanced by age. . . . In trials, randomisation helps ensure statistically identical age distributions in vaccinated and unvaccinated groups, so that the average vaccine efficacy estimate is unbiased . . . However, unlike trials, in real life, vaccination status is not randomly assigned. While vaccination rates are high in many countries, the vaccinated remain, on average, older and less healthy than the unvaccinated . . .

– Background infection rate bias: From December 2020, the speedy dissemination of vaccines, particularly in wealthier nations, coincided with a period of plunging infection rates. However, accurately determining the contribution of vaccines to this decline is far from straightforward. . . . The risk of virus exposure was considerably higher in January than in April. Thus exposure time was not balanced between unvaccinated and vaccinated individuals. Exposure time for the unvaccinated group was heavily weighted towards the early months of 2021 while the inverse pattern was observed in the vaccinated group. This imbalance is inescapable in the real world due to the timing of vaccination rollout. . . .

They summarize:

[To estimate the magnitude of these biases,] we would have needed additional information, such as (a) cases from first dose by vaccination status; (b) age distribution by vaccination status; (c) case rates by vaccination status by age group; (d) match rates between vaccinated and unvaccinated groups on key matching variables; (e) background infection rate by week of study; and (f) case rate by week of study by vaccination status. . . .

The pandemic offers a magnificent opportunity to recalibrate our expectations about both observational and randomised studies. “Real world” studies today are still published as one-off, point-in-time analyses. But much more value would come from having results posted to a website with live updates, as epidemiological and vaccination data accrue. Continuous reporting would allow researchers to demonstrate that their analytical methods not only explain what happened during the study period but also generalise beyond it.

I have not looked into their analyses so I have no comment on the details; you can look into it for yourself.
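To get a feel for the magnitudes involved, here’s a back-of-the-envelope sketch of the case-counting window bias alone, with invented numbers (this is my illustration, not the authors’ calculation): a vaccine with exactly zero efficacy can come out looking moderately effective just from the asymmetric counting rule.

```python
# Zero-efficacy vaccine, constant daily infection risk, cumulative case counts
# compared over the study period, but cases among the vaccinated only counted
# starting 14 days after the second dose. All numbers are invented.
daily_risk = 0.001      # identical for vaccinated and unvaccinated: 0% efficacy
followup_days = 120
uncounted_days = 35     # time from first dose to "fully vaccinated" + 14 days

attack_rate_vaccinated = daily_risk * (followup_days - uncounted_days)
attack_rate_unvaccinated = daily_risk * followup_days

apparent_ve = 1 - attack_rate_vaccinated / attack_rate_unvaccinated
print(f"apparent vaccine effectiveness from a useless vaccine: {apparent_ve:.0%}")
```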

“Latest observational study shows moderate drinking associated with a very slightly lower mortality rate”

Daniel Lakeland writes:

This one deserves some visibility, because of just how awful it is. It goes along with the adage about incompetence indistinguishable from malice. It’s got everything..

1) Non-statistical significance taken as evidence of zero effect

2) A claim of non-significance where their own graph clearly shows statistical significance

3) The labels in the graph don’t even begin to agree with the graph itself

4) Their “multiverse” of different specifications ALL show a best estimate of about 92-93% relative risk for moderate drinkers compared to non-drinkers, with various confidence intervals most of which are “significant”

5) If you take their confidence intervals as approximating Bayesian intervals it’d be a correct statement that “there’s a ~98% chance that moderate drinking reduces all cause mortality risk”

and YET, their headline quote is: “the meta-analysis of all 107 included studies found no significantly reduced risk of all-cause mortality among occasional (>0 to <1.3 g of ethanol per day; relative risk [RR], 0.96; 95% CI, 0.86-1.06; P = .41) or low-volume drinkers (1.3-24.0 g per day; RR, 0.93; P = .07) compared with lifetime nondrinkers.” That’s right above the take-home graph, figure 1.

Take a look at the “Fully Adjusted” confidence interval in the text . . . (0.85-1.01). Now take a look at the graph . . . it clearly doesn’t cross 1.0 at the upper end. But that’s not the only fishy thing: removed_b is just weird, and the vast majority of their different specifications show both a statistically significant risk reduction and approximately the same magnitude point estimate . . . 91-93% of the nondrinker risk. Who knows how to interpret this graph / chart. It wouldn’t surprise me to find out that some of these numbers are just made up, but most likely there are some kind of cut-and-paste errors involved, and/or other forms of incompetence.

But if you assume that the graph is made by computer software and therefore represents accurate output of their analysis (except for a missing left bar on removed_b, perhaps caused by accidentally hitting delete in figure editing software?), then the correct statement would be something like “There is good evidence that low volume alcohol use is associated with lower all cause mortality after accounting for our various confounding factors.” The news media reports this as approximately “Moderate drinking is bad for you after all.”
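For what it’s worth, Lakeland’s “approximately Bayesian” reading of the reported intervals can be checked with a quick normal approximation on the log relative-risk scale (a sketch, assuming a flat prior and the interval as printed in the text; it comes out around 95% rather than 98%, and plugging in the tighter interval implied by the graph pushes it higher, which is part of his complaint about the inconsistencies).

```python
# Treat the reported RR and 95% CI as defining an approximate normal posterior
# on log(RR), then compute the posterior probability that RR < 1.
import numpy as np
from scipy.stats import norm

rr, lo, hi = 0.93, 0.85, 1.01        # point estimate and reported 95% CI
log_se = (np.log(hi) - np.log(lo)) / (2 * 1.96)
p_protective = norm.cdf((0 - np.log(rr)) / log_se)
print(f"approx. probability that RR < 1: {p_protective:.2f}")
```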

I guess the big problem is not ignorance or malice but rather the expectation that they come up with a definitive conclusion.

Also, I think Lakeland is a bit unfair to the news media. There’s Yet Another Study Suggests Drinking Isn’t Good for Your Health from Time Magazine . . . ummm, I guess Time Magazine isn’t really a magazine or news organization anymore, maybe it’s more of a brand name? The New York Times has Moderate Drinking Has No Health Benefits, Analysis of Decades of Research Finds. I can’t find anything saying that moderate drinking is bad for you. (“No health benefits” != “bad.”) OK, there’s this from Fortune, Is moderate drinking good for your health? Science says no, which isn’t quite as extreme as Lakeland’s summary but is getting closer. But none of them led with, “Latest observational study shows moderate drinking associated with a very slightly lower mortality rate,” which would be a more accurate summary of the study.

In any case, it’s hard to learn much from this sort of small difference in an observational study. There are just too many other potential biases floating around.

I think the background here is that alcohol addiction causes all sorts of problems, and so public health authorities would like to discourage people from drinking. Even if moderate drinking is associated with a 7% lower mortality rate, there’s a concern that a public message that drinking is helpful will lead to more alcoholism and ruined lives. With the news media the issue is more complicated, because they’re torn between deference to the science establishment on one side, and the desire for splashy headlines on the other. “Big study finds that moderate drinking saves lives” is a better headline than “Big study finds that moderate drinking does not save lives.” The message that alcohol is good for you is counterintuitive and also crowd-pleasing, at least to the drinkers in the audience. So I’m kinda surprised that no journalistic outlets took this tack. I’m guessing that not too many journalists read past the abstract.

U.S. congressmember commits the fallacy of the one-sided bet.

Paul Alper writes:

You have written a few times to correct the oft-heard relationship between causation and correlation; but here is a Dana Milbank article in the Washington Post about congressmember Scott Perry’s unusual take:

There have been recent shortfalls in military recruitment, and research shows that economic and quality-of-life issues are to blame, as well as a declining percentage of young people who meet eligibility standards.

But Republicans argued that the real culprit is “woke” policies, though they offered no evidence of this.

“Just because you don’t have the data or we don’t have the data doesn’t mean there’s no correlation,” argued Rep. Scott Perry (R-Pa.).

At first Perry’s statement might sound ridiculous, but if you reflect upon it you’ll realize it’s true. He was making a claim about the correlation between two variables, X and Y. He did not have any data at hand on X or Y, but that should not be taken to imply that the correlation is zero.

Indeed, I can go further than Perry and say two things with confidence: (a) the correlation between X and Y is not zero, and (b) the correlation between X and Y changes over time, it is different in different places, and it varies by context. With continuous data, nothing is ever exactly zero. I guess it’s possible that some of these variables could be measured discretely, in which case I’ll modify my statements (a) and (b) to say that the correlation is almost certainly not zero, that it almost certainly changes over time, etc.

Setting aside all issues of correlation, the mistake that Perry made is what we’ve called the fallacy of the one-sided bet. Yes, he’s correct that, even though he has no data on X and Y, these two variables could be positively correlated. But they could also be negatively correlated! Perry is free to believe anything he wants, but he should be aware that, in the absence of data, he’s just hypothesizing.
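To make that point concrete, here’s a minimal simulation sketch (my own toy example, nothing to do with the actual recruitment data): with two continuous variables that have no built-in relationship, the sample correlation is essentially never exactly zero, and its sign comes out positive about as often as negative.

```python
import numpy as np

rng = np.random.default_rng(2023)

# Two continuous variables generated with no built-in relationship.
# The sample correlation is essentially never exactly zero, and its
# sign varies from sample to sample.
signs = []
for _ in range(1000):
    x = rng.normal(size=50)
    y = rng.normal(size=50)
    r = np.corrcoef(x, y)[0, 1]
    signs.append(r > 0)

print("share of positive sample correlations:", np.mean(signs))
# Roughly half the draws come out positive and half negative, which is
# the one-sided-bet point: absent data, a hypothesized correlation could
# just as easily go the other way.
```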

Why does education research have all these problems?

A few people pointed me to a recent news article by Stephanie Lee regarding another scandal at Stanford.

In this case the problem was an unstable mix of policy advocacy and education research. We’ve seen this sort of thing before at the University of Chicago.

The general problem

Why is education research particularly problematic? I have some speculations:

1. We all have lots of experience of education and lots of memories of education not working well. As a student, it was often clear to me that things were being taught wrong, and as a teacher I’ve often been uncomfortably aware of how badly I’ve been doing the job. There’s lots of room for improvement, even if the way to get there isn’t always so obvious. So when authorities make loud claims of “50% improvement in test scores,” this doesn’t seem impossible, even if we should know better than to trust them.

2. Education interventions are difficult and expensive to test formally but easy and cheap to test informally. A formal study requires collaboration from schools and teachers, and if the intervention is at the classroom level it requires many classes and thus a large number of students. Informally, though, we can come up with lots of ideas and try them out in our classes. Put these together and you get a long backlog of ideas waiting for formal study.

3. No matter how much you systematize teaching—through standardized tests, prepared lesson plans, MOOCs, or whatever—the process of learning still occurs at the individual level, one student at a time. This suggests that the effects of any intervention will depend strongly on context, which in turn implies that the average treatment effect, however defined, won’t be so relevant to real-world implementation.

4. Continuing on that last point, the big challenge of education is student motivation. Methods for teaching X can typically be framed as some mix of methods for motivating students to want to learn X and methods for keeping students motivated to practice X with awareness. These things are possible, but they’re challenging, in part because of the difficulty of pinning down “motivation.”

5. Education is an important topic, a lot of money is spent on it, and it’s enmeshed in the political process.

Put these together and you get a mess that is not well served by the traditional push-a-button, take-a-pill, look-for-statistical-significance model of quantitative social science. Education research is full of people who are convinced that their ideas are good, with lots of personal experience that seems to support their views, but with great difficulty in getting hard empirical evidence, for reasons explained in items 2 and 3 above. So you can see how policy advocates can get frustrated and overstate the evidence in favor of their positions.

The scandal at Stanford

As Kinsley famously put it, the scandal isn’t what’s illegal; the scandal is what’s legal. It’s legal to respond to critics with some mixture of defensiveness and aggression that dodges the substance of the criticism. But to me it’s scandalous that such practices are so common in elite academia. The recent scandal involved the California Math Framework, a controversial new curriculum plan that has been promoted by Stanford professor Jo Boaler, who, as I learned in a comment thread, wrote a book called Mathematical Mindset that had some really bad stuff in it. As I wrote at the time, it was kind of horrible that this book by a Stanford education professor was making a false claim and backing it up with a bunch of word salad from some rando on the internet. If you can’t even be bothered to read the literature in your own field, what are you doing at Stanford in the first place?? Why not just jump over the bay to Berkeley and write uninformed op-eds and hang out on NPR and Fox News? Advocacy is fine, just own that you’re doing it and don’t pretend to be writing about research.

In pointing out Lee’s article, Jonathan Falk writes:

Plenty of scary stuff, but the two lines I found scariest were:

Boaler came to view this victory as a lesson in how to deal with naysayers of all sorts: dismiss and double down.

Boaler said that she had not examined the numbers — but “I do question whether people who are motivated to show something to be inaccurate are the right people to be looking at data.”

I [Falk] get a little sensitive about this since I’ve spent 40 years in the belief that people who are motivated to show something to be inaccurate are the perfect people to be looking at the data, but I’m even more disturbed by her asymmetry here: if she’s right, then it must also be true that people who are motivated to show something to be accurate are also the wrong people to be looking at the data. And of course people with no motivations at all will probably never look at the data ever.

We’ve discussed this general issue in many different contexts. There are lots of true believers out there. Not just political activists, but also many pure researchers who believe in their ideas, and then you get some people, such as those discussed above, who are true believers both on the research and activism fronts. For these people, I don’t think the problem is that they don’t look at the data; rather, they know what they’re looking for and so they find it. It’s the old “researcher degrees of freedom” problem. And it’s natural for researchers with this perspective to think that everyone operates this way, hence they don’t trust outsiders who might come to different conclusions. I agree with Falk that this is very frustrating, a Gresham process similar to the way that propaganda media are used not just to spread lies and bury truths but also to degrade trust in legitimate news media.

The specific research claims in dispute

Education researcher David Dockterman writes:

I know some of the players. Many educators certainly want to believe, just as many elementary teachers want to believe they don’t have to teach phonics.

Popularity with customers makes it tough for middle ground folks to issue even friendly challenges. They need the eggs. Things get pushed to extremes.

He also points to this post from 2019 by two education researchers, who point to a magazine article coauthored by Boaler and write:

The backbone of their piece includes three points:

1. Science has a new understanding of brain plasticity (the ability of the brain to change in response to experience), and this new understanding shows that the current teaching methods for struggling students are bad. These methods include identifying learning disabilities, providing accommodations, and working to students’ strengths.

2. These new findings imply that “learning disabilities are no longer a barrier to mathematical achievement” because we now understand that the brain can be changed, if we intervene in the right way.

3. The authors have evidence that students who thought they were “not math people” can be high math achievers, given the right environment.

There are a number of problems in this piece.

First, we know of no evidence that conceptions of brain plasticity or (in prior decades) lack of plasticity, had much (if any) influence on educators’ thinking about how to help struggling students. . . . Second, Boaler and Lamar mischaracterize “traditional” approaches to specific learning disability. Yes, most educators advocate for appropriate accommodations, but that does not mean educators don’t try intensive and inventive methods of practice for skills that students find difficult. . . .

Third, Boaler and Lamar advocate for diversity of practice for typically developing students that we think would be unremarkable to most math educators: “making conjectures, problem-solving, communicating, reasoning, drawing, modeling, making connections, and using multiple representations.” . . .

Fourth, we think it’s inaccurate to suggest that “A number of different studies have shown that when students are given the freedom to think in ways that make sense to them, learning disabilities are no longer a barrier to mathematical achievement. Yet many teachers have not been trained to teach in this way.” We have no desire to argue for student limitations and absolutely agree with Boaler and Lamar’s call for educators to applaud student achievement, to set high expectations, and to express (realistic) confidence that students can reach them. But it’s inaccurate to suggest that with the “right teaching” learning disabilities in math would greatly diminish or even vanish. . . .

Do some students struggle with math because of bad teaching? We’re sure some do, and we have no idea how frequently this occurs. To suggest, however, that it’s the principal reason students struggle ignores a vast literature on learning disability in mathematics. This formulation sets up teachers to shoulder the blame for “bad teaching” when students struggle.

They conclude:

As to the final point—that Boaler & Lamar have evidence from a mathematics camp showing that, given the right instruction, students who find math difficult can gain 2.7 years of achievement in the course of a summer—we’re excited! We look forward to seeing the peer-reviewed report detailing how it worked.

Indeed. Here’s the relevant paragraph from Boaler and Lamar:

We recently ran a summer mathematics camp for students at Stanford. Eighty-four students attended, and all shared with interviewers that they did not believe they were a “math person.” We worked to change those ideas and teach mathematics in an open way that recognizes and values all the ways of being mathematical: including making conjectures, problem-solving, communicating, reasoning, drawing, modeling, making connections, and using multiple representations. After eighteen lessons, the students improved their achievement on standardized tests by the equivalent of 2.7 years. When district leaders visited the camp and saw students identified as having learning disabilities solve complex problems and share their solutions with the whole class, they became teary. They said it was impossible to know who was in special education and who was not in the classes.

This sort of TED-worthy anecdote can seem so persuasive! I kinda want to be persuaded too, but I’ve seen too many examples of studies that don’t replicate. There are just so many ways things can go wrong.

P.S. Lee has reported on other science problems at Stanford and has afflicted the comfortable, enough that she was unfairly criticized for it.

thefacebook and mental health trends: Harvard and Suffolk County Community College

Multiple available measures indicate worsening mental health among US teenagers. Prominent researchers, commentators, and news sources have attributed this to effects of information and communication technologies (while not always being consistent on exactly which technologies or uses thereof). For example, John Burn-Murdoch at the Financial Times argues that the evidence “mounts” and he (or at least his headline writer) says that “evidence of the catastrophic effects of increased screen-time is now overwhelming”. I couldn’t help but be reminded of Andrew’s comments (e.g.) on how Daniel Kahneman once summarized the evidence about social priming in his book Thinking, Fast and Slow: “[D]isbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.”

Like the social priming literature, much of the evidence here is similarly weak, but mainly in different (perhaps more obvious?) ways. There is frequent use of plots of aggregate time series with a vertical line indicating when some technology was introduced (or maybe just became widely-enough used in some ad hoc sense). Much of the more quantitative evidence is cross-sectional analysis of surveys, with hopeless confounding and many forking paths.

Especially against the backdrop of the poor methodological quality of much of the headline-grabbing work in this area, there are a few studies that stand out as having research designs that may permit useful and causal inferences. These do indeed deserve our attention. One of these is the ambitiously-titled “Social media and mental health” by Luca Braghieri, Ro’ee Levy, and Alexey Makarin. Among other things, this paper was cited by the US Surgeon General’s advisory about social media and youth mental health.

Here “social media” is thefacebook (as Facebook was known until August 2006), a service for college students that had some familiar features of current social media (e.g., profiles, friending) but lacked many other familiar features (e.g., a feed of content, general photo sharing). The study cleverly links the rollout of thefacebook across college campuses in the US with data from a long-running survey of college students (ACHA’s National College Health Assessment) that includes a number of questions related to mental health. One can then compare changes in survey respondents’ answers during the same period across schools where thefacebook was introduced at different times. Because thefacebook was rapidly adopted and initially only had within-school functionality, perhaps this study can address the challenging social spillovers ostensibly involved in effects of social media.

Staggered rollout and diff-in-diff

This is commonly called a differences-in-differences (diff-in-diff, DID) approach because in the simplest cases (with just two time periods) one is computing differences between units (those that get treated and those that don’t) in differences between time periods. Maybe staggered adoption (or staggered introduction or rollout) is a better term, as it describes the actual design (how units come to be treated), rather than a specific parametric analysis.

Diff-in-diff analyses are typically justified by assuming “parallel trends” — that the additive changes in the mean outcomes would have been the same across all groups defined by when they actually got treatment.

This is not an assumption about the design, though it could follow from one — such as the obviously very strong assumption that units are randomized to treatment timing — but rather directly about the outcomes. If the assumption is true for untransformed outcomes, it typically won’t be true for, say, log-transformed outcomes, or some dichotomization of the outcome. That is, we’ve assumed that the time-invariant unobservables enter additively (parallel trends). Paul Rosenbaum emphasizes this point when writing about these setups, describing them as uses of “non-equivalent controls” (consistent with a longer tradition, e.g., Cook & Campbell).
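As a toy numerical illustration of that scale-dependence (the numbers here are invented for the sketch, not taken from any study): trends that are parallel on the raw scale generally stop being parallel after a log transform whenever the groups start at different levels.

```python
import numpy as np

# Invented group means: parallel trends holds additively but not after a log
# transform, because the groups start at different levels.
control_pre, control_post = 10.0, 12.0
treated_pre, treated_post = 20.0, 22.0  # counterfactual path with no treatment

# Raw-scale changes are identical (parallel trends on the additive scale):
print(control_post - control_pre)   # 2.0
print(treated_post - treated_pre)   # 2.0

# Log-scale changes differ, so parallel trends fails after the transform:
print(np.log(control_post) - np.log(control_pre))   # ~0.18
print(np.log(treated_post) - np.log(treated_pre))   # ~0.10
```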

Consider the following different variations on the simple two-period case, where some units get treated in the second period:

[Figure: Three stylized differences-in-differences scenarios.]

Assume for a moment that traditional standard errors are tiny. In which of these situations can we most credibly say the treatment caused an increase in the outcomes?

From the perspective of a DID analysis, they basically all look the same, since we assume we can subtract off baseline differences. But, with Rosenbaum, I think it is reasonable to think that credibility is decreasing from left to right, or at least that the left panel is the most credible. There we have a control group that pre-rollout looks quite similar, at least in the mean outcome, to the group that goes on to be treated. We are precisely not leaning on the double differencing — not as obviously leaning on the additivity assumption. On the other hand, if the baseline levels of the outcome are quite different, it is perhaps more of a leap to assume that we can account for this by simply subtracting off this difference. If the groups already look different, why should they change so similarly? Or maybe there is some sense in which they are changing similarly, but perhaps they are changing similarly in, e.g., a multiplicative rather than additive way. Ending up with a treatment effect estimate on the same order as the baseline difference should perhaps be humbling.
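Here’s a minimal sketch of why the large-baseline-gap case should be humbling (again, all numbers invented): if the groups would have changed multiplicatively rather than additively absent treatment, a simple 2×2 diff-in-diff on the raw scale reports zero when baselines match but a spurious “effect” when they don’t.

```python
def did(ctrl_pre, ctrl_post, trt_pre, trt_post):
    """Simple 2x2 difference-in-differences estimate on the raw scale."""
    return (trt_post - trt_pre) - (ctrl_post - ctrl_pre)

growth = 1.10  # suppose both groups grow 10% absent treatment (a multiplicative world)

# Equal baselines (like the left panel): additive DID correctly returns ~0.
print(did(10, 10 * growth, 10, 10 * growth))   # ~0.0

# Very different baselines (like the right panel): additive DID returns a
# nonzero "effect" even though nothing happened, because subtracting the
# baseline gap doesn't account for proportional change.
print(did(10, 10 * growth, 30, 30 * growth))   # ~2.0
```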

How does this relate to Braghieri, Levy & Makarin’s study of thefacebook?

Strategic rollout of thefacebook

The rollout of thefacebook started with Harvard and then moved to other Ivy League and elite universities. It continued with other colleges and eventually became available to students at numerous colleges and community colleges.

This rollout was strategic in multiple ways. First, why not launch everywhere at once? There was some school-specific work to be done. But perhaps more importantly, the leading social network service (Friendster) had spent much of the prior year being overwhelmed by traffic to the point of being unusable. Facebook co-founder Dustin Moskovitz said, “We were really worried we would be another Friendster.”

Second, the rollout worked through existing hierarchies and competitive strategy. The idea that campus facebooks (physical directories with photos distributed to students) should be digital was in the air in the Ivy League in 2003, so competition was likely to emerge, especially after thefacebook’s early success. My understanding is that thefacebook prioritized launching wherever they got wind of possible competition. Later, as this became routinized and after an infusion of cash from Peter Thiel and others, thefacebook was able to launch at many more schools.

Let’s look at the dates of the introduction of thefacebook used in this study:

Here the colors indicate the different semesters used to distinguish the four “expansion groups” in the study. There are so many schools with simultaneous launches, especially later on, that I’ve only plotted every 12th school with a larger point and its name. While there is a lot of within-semester variation in the rollout timing, unfortunately the authors cannot use that because of school-level privacy concerns from ACHA. So the comparisons are based on comparing subsets of these four groups.

Reliance on comparisons of students at elite universities and community colleges

Do these four groups seem importantly different? Certainly they are very different institutions with quite different mixes of students. They differ in more than the age, gender, race, and international-student status that many of the analyses adjust for via regression. Do the differences among these groups of students matter for assessing effects of thefacebook on mental health?

As the authors note, there are baseline differences between them (Table A.2), including in the key mental health index. The first expansion group in particular looks quite different, with already higher levels of poor mental health. This baseline difference is not small — it is around the same size as the authors’ preferred estimate of treatment effects:

[Figure: Comparison of baseline differences between expansion groups and the preferred estimate of treatment effects.]

This plot compares the relative magnitude of the baseline differences (versus the last expansion group) to the estimated treatment effects (the authors’ preferred estimate of 0.085). The first-versus-fourth comparison in particular stands out. I don’t think this is post hoc data dredging on my part, knowing what we do about these institutions and this rollout: these are students we ex ante expect to be most different; these groups also differ on various characteristics besides the outcome. This comparison is particularly important because it should yield two semesters of data where one group has been treated and the other hasn’t, whereas, e.g., comparing groups 2 and 3 basically just gives you comparisons during fall 2004, during which there is also a bunch of measurement error in whether thefacebook has really rolled out yet or not. So much of the “clean” exposed-vs.-not-yet comparisons rely on including these first and last groups.

It turns out that one needs both the first and the last (fourth) expansion groups in the analysis to find statistically significant estimates for effects on mental health. In Table A.13, the authors helpfully report their preferred analysis dropping one group at a time. Dropping either group 1 or 4 means the estimate does not reach conventional levels for statistical significance. Dropping group 1 lowers the point estimate to 0.059 (SE of 0.040), though my guess is that a Wu–Hausman-style analysis would retain the null that these two regressions estimate the same quantity (which the authors concurred on). (Here we’re all watching out for not presuming that the difference between stat. sig. and not is itself stat. sig.)
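For a rough sense of the numbers being compared here (back-of-the-envelope arithmetic on the reported figures only; a proper test would need the covariance between the two specifications):

```python
# Estimate and SE when expansion group 1 is dropped (as reported above).
est, se = 0.059, 0.040
print(est / se)        # roughly 1.5, well short of the usual 1.96 threshold

# The gap between the preferred estimate (0.085) and this one (0.059) is
# about 0.026, smaller than that 0.040 SE -- consistent with not treating
# the difference between "significant" and "not significant" as itself
# significant.
print(0.085 - 0.059)   # about 0.026
```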

One way of putting this is that this study has to rely on comparisons between survey respondents at schools like Harvard and Duke, on the one hand, and a range of community colleges on the other — while maintaining the assumption that in the absence of thefacebook’s launch they would have the same additive changes in this mental health index over this period. Meanwhile, we know that the students at, e.g., Harvard and Duke have higher baseline levels of this index of poor mental health. This may reflect overall differences in baseline risks of mental illness, which then we would expect to continue to evolve in different ways (i.e., not necessarily in parallel, additively). We also can expect they were getting various other time-varying exposures, including greater adoption of other Internet services.

Summing up

I don’t find it implausible that thefacebook or present-day social media could affect mental health. But I am not particularly convinced that analyses discussed here provide strong evidence about the effects of thefacebook (or social media in general) on mental health. This is for the reasons I’ve given — they rely on pooling data from very different schools and students who substantially differ in the outcome already in 2000–2003 — and others that maybe I’ll return to.

However, this study represents a comparatively promising general approach to studying effects of social media, particularly in comparison to much of the broader literature. For example, by studying this rollout among dense groups of eventual adopters, it can account for spillovers of peers’ use in ways neglected in other studies.

I hope it is clear that I take this study seriously and think the authors have made some impressive efforts here. And my ability to offer some of these specific criticisms depends on the rich set of tables they have provided, even if I wish we got more plots of the raw trends broken out by expansion group and student demographics.

I also want to note that there is another family of analyses in the paper (looking at students within the same schools who have been exposed to different numbers of semesters of thefacebook being present) that I haven’t addressed and which corresponds to a somewhat different research design — one that aims to avoid some of the threats to validity I’ve highlighted, though it has others. This is a less typical research design, and it is not featured prominently in the paper. Perhaps it will be worth returning to.

P.S. In response to a draft version of this post, Luca Braghieri, Ro’ee Levy, and Alexey Makarin noted that excluding the first expansion group could also lead to downward bias in estimation of average effects, since (a) some of their analysis suggests larger effects for students with demographic characteristics indicating higher baseline risk of mental illness, and (b) the effects may increase with exposure duration (as some analyses suggest), and the first group gets the most exposure. If the goal is estimating a particular, externally valid quantity, I could agree with this. But my concern is more over the internal validity of these causal inferences (really, we would be happy with a credible estimate of the causal effects for pretty much any convenient subset of these schools). There, if we think the first group has higher baseline risk, we should be more worried about the parallel trends assumption.

[This post is by Dean Eckles. Thanks to the authors (Luca Braghieri, Ro’ee Levy, and Alexey Makarin), Tom Cunningham, Andrey Fradkin, Solomon Messing, and Johan Ugander for their comments on a draft of this post. Thanks to Jonathan Roth for a comment that led me to edit “not [as obviously] leaning on the additivity assumption” above to clarify unit-level additivity assumptions may still be needed to justify diff-in-diff even when baseline means match. Because this post is about social media, I want to note that I have previously worked for Facebook and Twitter and received funding for research on COVID-19 and misinformation from Facebook/Meta. See my full disclosures here.]