“We conclude that apparent effects of growth mindset interventions on academic achievement are likely attributable to inadequate study design, reporting flaws, and bias.”

Joshua Brooks points us to this research article by Brooke Macnamara and Alexander Burgoyne, “Do growth mindset interventions impact students’ academic achievement? A systematic review and meta-analysis with recommendations for best practices,” which states:

According to mindset theory, students who believe their personal characteristics can change–that is, those who hold a growth mindset–will achieve more than students who believe their characteristics are fixed. Proponents of the theory have developed interventions to influence students’ mindsets, claiming that these interventions lead to large gains in academic achievement. Despite their popularity, the evidence for growth mindset intervention benefits has not been systematically evaluated considering both the quantity and quality of the evidence. Here, we provide such a review by (a) evaluating empirical studies’ adherence to a set of best practices essential for drawing causal conclusions and (b) conducting three meta-analyses. When examining all studies (63 studies, N = 97,672), we found major shortcomings in study design, analysis, and reporting, and suggestions of researcher and publication bias: Authors with a financial incentive to report positive findings published significantly larger effects than authors without this incentive. Across all studies, we observed a small overall effect . . . which was nonsignificant after correcting for potential publication bias. No theoretically meaningful moderators were significant. When examining only studies demonstrating the intervention influenced students’ mindsets as intended . . . the effect was nonsignificant . . . When examining the highest-quality evidence . . . the effect was nonsignificant . . . We conclude that apparent effects of growth mindset interventions on academic achievement are likely attributable to inadequate study design, reporting flaws, and bias.

I haven’t read the paper, let alone the 63 cited studies, but I thought I’d do my part by getting this into the discussion.

We talked about earlier critical work by Macnamara on growth mindset back in 2018, where I discussed how to think about effect sizes for such interventions.

My main message was that, if mindset interventions work, we’d still expect small average effects, because they won’t work for all students. As I wrote, “it’s a small effect in the context of any student, and of course it’s a small effect. It’s hard to get good grades, and there’s no magic way to get there!”

In one sense, my conclusion is negative on mindset interventions in that I’m saying we shouldn’t expect to see large effects, and any large effects that do show up are likely to be huge overestimates.

In another sense, my conclusion is positive on mindset interventions in that, given that any average effects will be small, the lack of statistically significant average effects in small or even moderately-large studies does not have to imply that mindset interventions don’t work; it just says that they only work in some settings, and individual effects will mostly be small.
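To make the arithmetic behind both conclusions concrete, here is a minimal sketch. The numbers are made up purely for illustration (20% of students helped, by 0.25 standard deviations each); they are not estimates from any mindset study, and the function names are hypothetical.

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic.
def average_effect(share_helped, effect_if_helped, effect_otherwise=0.0):
    """Average treatment effect when only a share of students respond."""
    return share_helped * effect_if_helped + (1 - share_helped) * effect_otherwise

def standard_error(sd, n_per_arm):
    """Standard error of a difference in means between two equal-sized arms."""
    return sd * (2 / n_per_arm) ** 0.5

# Assumed: 20% of students gain 0.25 SD, everyone else gains nothing.
ate = average_effect(share_helped=0.2, effect_if_helped=0.25)

for n in (100, 1000, 10000):
    se = standard_error(sd=1.0, n_per_arm=n)
    print(f"n per arm = {n:>6}: ATE = {ate:.3f}, s.e. = {se:.3f}, ATE/s.e. = {ate / se:.2f}")
```

Under these assumptions the average effect is 0.05 standard deviations, which is well under one standard error with 100 students per arm and only clearly distinguishable from zero with many thousands. A nonsignificant result in a small or moderately large study is what we would expect to see even if the intervention works as assumed.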

Also relevant is this discussion we had a few years ago on mindset interventions with contributions from Russell Warne and David Yeager. Lots to chew on here, also this example helped form my thinking on varying treatment effects, leading to our causal quartets paper and some future lines of research.

25,000 lives saved per ship sunk, $100,000 per citation, a probability of 10^-90 of a decisive vote . . . Is there a through line from B.S. numbers in junk science to B.S. numbers coming from the government?

Did a blithe disregard for innumeracy in pop social science pave the way to a blithe disregard for innumeracy in government?

I came across this:

At first I thought this a parody but maybe it’s real? Here’s the home page, which again looks like a joke but I think it really is coming from the U.S. government:

Again, I’m not sure but for the purposes of this post, let’s assume that this is actually an official government statement:

EVERY TIME WE HIT A NARCO-TRAFFICKING VESSEL, WE SAVE TWENTY-FIVE THOUSAND LIVES.

This is innumerate crazy talk.

Paul Campos discusses this in the context of the president’s cognitive degeneration, but I think there’s more to it than that.

Standard-issue innumeracy

Bill James once wrote that his innovation as a sports analyst was to think of baseball statistics as numbers that could be added, subtracted, multiplied, and divided, in contrast to the usual attitude in which statistics are treated like words (so-and-so hit .300 or led the league in stolen bases or whatever).

We often see this meaningless-numbers-as-words attitude coming from credentialed academic social scientists. Some examples we’ve discussed over the years include:
The claim that beautiful parents are 36% more likely to have girl babies (thanks, Freakonomics!),
The claims that single women were 20% more likely to support Barack Obama and three times more likely to wear red or pink clothing during certain times of the month (thanks, Psychological Science!),
The claim that every execution prevents 18 murders (thanks, Harvard!).

These are examples of what one might call standard-issue innumeracy, which is how we might characterize claims that could in theory be correct but whose plausibility disintegrates after any serious engagement with reality. These are numbers that don’t make a lot of sense but they kinda sound good. A moment’s reflection would cause immediate skepticism, but who has time for a moment’s reflection? Not Steven Levitt, Cass Sunstein, or various authors, reviewers, and editors for Psychological Science. The numbers don’t mean anything, they’re just a way to tell a story.

Standard-issue innumeracy can come by fishing in small samples of noisy data, yielding what can be massive overestimates of effect size.
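A rough simulation shows how this overestimation arises, sometimes called a type M (magnitude) error. The true effect and standard errors below are assumptions picked for illustration, not values from any of the studies mentioned above.

```python
import random
import statistics

def exaggeration_ratio(true_effect, se, n_sims=100_000, seed=1):
    """Average |estimate| / true effect among estimates that reach |z| > 1.96."""
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, se)   # one noisy study
        if abs(estimate) > 1.96 * se:           # the "statistically significant" filter
            significant.append(abs(estimate))
    return statistics.mean(significant) / true_effect

# Assumed true effect of 0.05; a noisy study (s.e. 0.14) versus a precise one (s.e. 0.014).
print(exaggeration_ratio(true_effect=0.05, se=0.14))
print(exaggeration_ratio(true_effect=0.05, se=0.014))
```

In the noisy setting, any estimate that clears the significance threshold has to be at least 1.96 × 0.14 ≈ 0.27, more than five times the assumed true effect, so the published estimates are guaranteed to be large overestimates. In the precise setting the filter does little harm.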

Hard-core innumeracy

But then there is what we might call hard-core innumeracy: those quantitative statements that don’t even require a moment’s reflection to recognize as absolutely ridiculous. For example:
The claim that the probability of a decisive vote is 10^-90 (thanks, British Journal of Politics and International Relations!),
The claim that scientific citations are worth $100,000 each (thanks, Ted talks!).

As Campos might say, these are the equivalent of saying that a baseball player is hitting 3.000 or that somebody is on track to hit 272 home runs in April.

But credentialed social scientists write these things! What’s the point? 10^-90 is a really tiny number and $100,000 is a really big number, that’s the point.

When the President of the United States says, “every time we knock out a boat, we save 25,000 American lives,” it’s kinda like when the Robert Gray Dodge Professor of Network Science and University Distinguished Professor writes, “We can, in other words calculate exactly how much a single citation is worth. . . . in the United States each citation is worth a whopping $100,000.”

Two things are going on here:
(a) They’re being hard-core innumerate, providing numbers that are orders of magnitude away from anything reasonable.
(b) They are exercising political or social authority, the power to say things that don’t make sense without getting called on it. Just like in the story of the Emperor’s New Clothes: the more power you have, the more outlandish things you can get away with.

Also as with the emperor in the story, I suspect that the President and the Distinguished Professor believe the numbers they’re stating. I don’t think they’re bullshitting, exactly; it’s more that they treat numbers as words, not as things that can be added, subtracted, multiplied, and divided.

And their interlocutors don’t care either, maybe because they too think of numbers as words or maybe because it’s better for their careers to agree with the emperors, maybe question them on some specifics while showing a careful deference to their core ability to make outrageous claims and not be questioned.

A through line?

We live in a world in which certain quarters of academia and the prestige news media give strong support to outrageously innumerate claims. So I guess no surprise to see it coming from the government too–especially given the government’s recent proclivity to cite nonexistent or fraudulent research. Scary times all around.

Again, the problem is not just the innumeracy, it’s the blithe disregard for it, the idea that being off by multiple orders of magnitude–in one case, literally dozens of orders of magnitude–just doesn’t matter.

More on school reform, this time New Orleans

Recently we discussed a debate about how much of the improvement in test scores of students in Mississippi can be attributed to a policy of holding back more students–in particular, having kids repeat third grade will be expected to improve averages for fourth graders. Education researchers Howard Wainer, Irina Grabovsky, and Daniel Robinson expressed skepticism about claimed dramatic benefits from the Mississippi plan, but then there were good arguments on the other side. One thing is that a lot of the discussion was about what happened right after the new plan was implemented in the mid-2010s, but there have been longer-term trends in Mississippi and other states. Changes in averages are always hard to interpret because of possible changes in compositional effects, including decisions of the age at which children start first grade, classification of students as disabled, and who’s taking the test in any given year. Also, all these comparisons are observational: as Wainer puts it, there’s no control group. On the other other hand, decisions need to be made in the absence of ironclad evidence. So I was left in a state of uncertainty.

A couple days later we learned that Wainer et al. had garbled some statistics, entirely misreporting Mississippi’s fourth and eighth grade math scores. Wainer et al. were making a general point about testing and selection, something they’d seen in various forms many times in their careers, but they were evidently not close to the data from Mississippi, even to the aggregate data that are easily available. As I discussed, I should’ve earlier been more suspicious of their claims about the math scores, given that in my earlier post I’d noticed a discrepancy between those and others’ claims. After all this, I remain unsure what to think about Mississippi. It’s an observational comparison, there’s selection, there’s variation between states in how much they teach to the test, and at the individual level there are the spillover effects on the kids who are not held back . . . all sorts of things. On the other hand there are these long-term trends. Selection has to be explaining some of what is happening in Mississippi–if you hold kids back and give them the test later or manage to exclude them from the tested population entirely, the average scores of the remaining students should rise–but it’s hard to say how much, and at some point you have to go with the data in front of you. As is often the case, we’re not just arguing about causal effects; we’re also trying to pin down what exactly is happening.

In the meantime, I received an email from another education researcher, Doug Harris, who writes:

Wainer et al. also got it wrong on the other cities like New Orleans. To quote them: “We have seen several previous K–12 education ‘miracles’ that turned out to be hoaxes. Five of them were in Houston, Atlanta, the District of Columbia, El Paso, and New Orleans . . . The New Orleans miracle was caused by a natural disaster. Hurricane Katrina tragically relocated about a third of the students who came from the poorest areas. Removing thousands of low scorers immediately raised the average test scores of the students who remained.”

Several people pointed this out to me [Harris], especially because I have been studying the New Orleans school reforms for more than 10 years. My center, the Education Research Alliance for New Orleans, has published more than 50 articles about it. Our Advisory Board includes both supporters and critics of the reforms.

When I first came to New Orleans the sharp upward trend in outcomes gave me and others good reason to think this fit the first rule. The school reforms were sparked by Hurricane Katrina, which changed the city in many ways. Many families never returned, at least not to their original homes and neighborhoods. The whole city was hit hard, but low-income neighborhoods were hit a bit harder. Given the correlation between demographics and education outcomes, it was reasonable to be concerned that changes in the population, not the school reforms, drove the change in outcomes. Recognizing the problem, I spent years trying to disentangle this.

In the end, to my own surprise, it became clear that the reforms really did drive substantial improvement in a wide range of education outcomes—elementary/middle school test scores, high school graduation, college entrance exams, college attendance, and college graduation. They reduced many achievement gaps and may have reduced crime in the city (this last point is more difficult to determine with confidence). These results can be found here (ungated) and in economics journal, Journal of Human Resources (gated) and in my book Charter School City (University of Chicago Press, 2020). New Orleans went from being next to last in the state on almost every measure to being about average within ten years, improvement that has been largely maintained.

How did we isolate the Katrina effects from the school reforms? You can read our much longer articles, but here is a short take:

1. We tracked the trajectories of the individual students before and after the reforms and found that those who returned to New Orleans saw improved trajectories. Since these are the exact same students (the data were anonymized, of course), demographic change cannot explain that.

2. We tracked all pre-Katrina students with test scores living in New Orleans before the reforms and compared those baseline scores for returning and non-returning students. They were nearly identical, and we controlled for the remaining differences.

3. We commissioned the U.S. Census to calculate the change in demographics of households with school-age children before and after Katrina and the reforms. Again, they were nearly identical.

4. We tracked students who switched into and out of New Orleans. Those who switched into New Orleans learned at slower rates before Katrina and learned at faster rates afterward.

Demographic change was not the only potential problem. Given the system’s strict accountability, we wondered whether data manipulation was a driving force. We found no evidence of this. Tests for strange patterns in test responses and miscoded high school graduation rates turned up no or slight differences between New Orleans and the rest of the state. We also know how the improvement occurred, which provides even more confidence.

So, why do Wainer et al. call it a “hoax”? Because they apparently never looked for any evidence to back up their claim. Any basic internet search would have turned up our work. Our findings made national news. There is a “hoax” here but it’s not the one they claim.

OK, the New Orleans test scores are another story I know nothing about! What you see above is one take on them. At this point my main role is to convey these different arguments and advertise my uncertainty.

When the numbers don’t look right, check them! (Mississippi education update)

Part 1: Reading what different sources say

The other day, as part of a long discussion about the estimated effects of Mississippi’s education plan, I quoted some education researchers, Wainer et al., who wrote:

The 2024 NAEP fourth-grade mathematics scores rank the state at a tie at 50th! The eighth-grade scores also qualify for 50th place.

I also quoted a different critic of the Mississippi claims, Ravitch, who wrote:

In math, [Mississippi’s test scores] zoomed from fiftieth to twenty-third. Adjusted for demographics, Mississippi now ranks near the top in fourth grade reading and math according to the Urban Institute’s America’s Gradebook report.

And I found this from the wikipedia page on the Mississippi Miracle:

After adjusting for demographics, in 2024, Mississippi was the nation’s #1 state in Reading as well as in Mathematics.

I wrote, “But Wainer et al. say that Mississippi is tied for 50th in math. Can they really be worst in the nation, but best after demographic adjustment? I guess it’s possible.”

Part 2: Anomalies!

Wainer et al. said Mississippi’s 4th and 8th grade math scores were the nation’s worst in 2024.

Ravitch said their 4th-grade math scores have increased to 23rd in the nation and that they’re near the top when adjusted for demographics.

Wikipedia said that Mississippi’s math scores were best after adjusting for demographics.

So, Wainer et al. and Ravitch flat-out disagree on Mississippi’s absolute ranking in 4th-grade math; Ravitch and Wikipedia disagree slightly on the result after demographic adjustment (“near the top” or “the nation’s #1 state”); and I can’t be sure, but it also seems doubtful that a state could be #50 unadjusted and #1 after adjustment. As I wrote, it’s theoretically possible but it seems like a stretch.
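To see how raw and adjusted rankings can diverge at all, here is a toy example with two made-up states and two demographic groups. Everything in it is hypothetical (the scores, the group shares, the 50/50 reference mix), and it uses simple direct standardization, which is only a stand-in for whatever adjustment a given report actually uses.

```python
# Hypothetical (share, mean score) per group for two made-up states; not NAEP data.
states = {
    "State A": {"low_income": (0.70, 235.0), "higher_income": (0.30, 255.0)},
    "State B": {"low_income": (0.30, 230.0), "higher_income": (0.70, 250.0)},
}
# Common reference mix used for the adjustment (also an assumption).
reference_mix = {"low_income": 0.5, "higher_income": 0.5}

def raw_mean(groups):
    """Unadjusted state average: group means weighted by the state's own mix."""
    return sum(share * score for share, score in groups.values())

def adjusted_mean(groups, mix):
    """Adjusted average: the same group means reweighted to a common mix."""
    return sum(mix[name] * score for name, (_share, score) in groups.items())

for name, groups in states.items():
    print(name, round(raw_mean(groups), 1), round(adjusted_mean(groups, reference_mix), 1))
```

In this toy setup State A scores higher than State B within each group but has the lower raw average (241 versus 244), because more of its students are in the lower-scoring group. After reweighting to a common mix, State A comes out ahead (245 versus 240). So a reversal is possible in principle; going all the way from #50 to #1 would require far more extreme differences in composition than this.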

Part 3: I do nothing.

One of my sayings is that an important characteristic of a good scientist is the capacity to be upset, to recognize anomalies for what they are, and to track them down and figure out what in our understanding is lacking.

In this case, though, I just let the anomaly sit there like a rotting fish. I went around it and I kept writing.

Why did I not explore this 4th-grade math test thing more closely? Partly because I didn’t have the data at hand. It turned out that a quick google was all that was needed, but I didn’t take that step. Another thing is that, in any investigation, many anomalies will come up (one of these was the average age of the students being tested; more on that below), and we can’t look into everything at once. In that way, it’s like an Agatha Christie-style mystery, where various inconsistencies and anomalies arise and are noted in turn, but then the story moves on, with the explanation happening later. The other day we saw the new Knives Out movie–it was really great! If the original Knives Out was a 10 and the sequel was a 3, this third installment was a solid 9–and it did that thing where anomalies would pop up and get discussed but then set aside. If you stopped the train at every anomaly, you’d never get to the destination.

And the math scores were not a key part of the story, so I just let my bafflement sit there and I did not follow up.

Part 4: Let’s look at the numbers.

In the discussion of our post, two commenters said that Wainer et al. were wrong on the math scores. Steve wrote:

You can look the data up on the 2024 NAEP report:

https://nces.ed.gov/nationsreportcard/

I have no idea how these researchers came up with these claims: “The 2024 NAEP fourth-grade mathematics scores rank the state at a tie at 50th! The eighth-grade scores also qualify for 50th place.”

My reading of the report is that Mississippi’s 8th grade math scores had trailed the national average by 18 points in 2000 but by only 3 points by 2024.

And SD wrote:

“The 2024 NAEP fourth-grade mathematics scores rank the state at a tie at 50th! The eighth-grade scores also qualify for 50th place.”

This is just literally made up

So I looked it up, and . . . yeah, Wainer et al. had it wrong! Here’s what it says on the NAEP page:

4th grade math: National avg 237, MS avg 239, above average!
8th grade math: National avg 272, MS avg 269, but rank is approx 35th, not 50th.

Also I went to the Urban Institute page to see their demographically adjusted numbers (“The demographics we use for the adjustment include gender, age, race or ethnicity, receipt of free and reduced-price lunch, special education status, and English language learner status”) for 2024:

4th grade math: MS 248.6, they are indeed #1!
8th grade math: MS 281.3, also #1!

You can make of this adjustment what you will. But, in any case, no way were they ranked #50. I contacted Wainer et al., and Dan Robinson, one of the authors on the paper, confirmed that this was a mistake and that they would remove those two sentences from their paper.

Part 5: Where are we now?

As I discussed a couple days ago, I’m coming at this from two directions.

On one side, Wainer, Grabovsky, and Robinson are experienced education researchers, and they are not impressed by the claimed large effects of Mississippi’s policies.

On the other side, Wainer et al. are making their arguments in general terms, and the specific numbers from Mississippi seem impressive. This “on the other side” point is even stronger when we consider that Wainer et al. based part of their argument on garbled numbers for the math scores.

There’s also a political angle, which I did not discuss in my original post but which came up in the comments, and it’s interesting because both sides’ arguments have a politically conservative flavor. It’s a conservative vs. conservative battle. The proponents of the Mississippi plan offer the conservative argument that back-to-basics education works, also the conservative (in the U.S. context) argument that Mississippians are as good as anyone else. The skeptics of the Mississippi plan offer the conservative argument that there are no miracle cures, that schooling can’t do much to alter the natural order of things, and that government statistics can’t be trusted. I’m exaggerating the political slant in both directions here, but I do think that the arguments are taking place on a conservative turf, which is interesting, and I guess reflects the discrediting in recent years of education practices associated with the left.

Before ending this discussion, though, I wanted to go back to the statistics. Not the details but more of a view from 30,000 feet.

– An intervention was done in Mississippi in the mid-2010s, and people studied state-level aggregate test scores before and after. Mississippi’s test scores improved a lot relative to the nation during this period. This was part of a longer-term improving trend.

– The estimates of the program’s effects are observational. There was no control group. The implicit control is to imagine that previous trends in the state would have continued, or that the trends in Mississippi would be like trends in other states afterward.

– We don’t have easily accessible data on individual students. Robinson asks, “For example, what students benefited most from the intervention? What happened to the scores of the retained students once they took the NAEP reading test again?”

– The critics were coming into this from a generally skeptical position based on their view of previous hype in the education field, also the clear statistical issue that if you delay the kids who are performing poorly on the test, then averages will go up, also the lack of a control group. They did not do the work to quantify these concerns in this particular case, in part because relevant data were not easily accessible, but their distance from the details was a problem, as we could see with the gross error regarding the math tests.

– Mississippi’s average test scores have been going up. How much is this due to selection of who takes the test and when they take it, how much is due to changes in accommodations for disabilities (as discussed by Kelsey Piper in comments), and how much is due to targeted test preparation, I don’t know. It is a luxury of blogging that I can openly admit my uncertainty here.

– Stepping back, it’s clear to me why Wainer et al. remain skeptical, while Piper and other reporters have a more positive take on the Mississippi program.

– Finally, it’s not all about average test scores and it’s not all about the students being held back. I’m still thinking that a key outcome is reading and math ability at the time of school leaving. The idea of the program seems to be that if you hold some kids back a year, that will help them learn by keeping them in classes that are closer to the right level for them, and that this will also allow a higher level of education for the kids who are not held back. Some commenters also argued that the threat of being held back would motivate kids to learn more in third grade. I don’t know about that, but the point is that the problem is complicated enough that I can see the virtue of a “reduced-form” approach that just looks at effects on average test scores–but then you have to be concerned about the lack of control group and about compositional effects, which is where we started!
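The “implicit control” in the second bullet above can be written out as a simple change-in-gap calculation. The sketch below uses only numbers quoted earlier in this post (the commenter’s reading of the 8th-grade math gap, and the 2024 NAEP averages); the function name is hypothetical, and the result is a descriptive comparison, not an estimate of the program’s effect.

```python
# Gaps (national average minus Mississippi), 8th-grade math, as read off the
# NAEP report by a commenter quoted above; treat them as approximate.
gap_2000 = 18
gap_2024 = 3

# Cross-check against the 2024 averages quoted from the NAEP page.
national_2024, mississippi_2024 = 272, 269
assert national_2024 - mississippi_2024 == gap_2024

def relative_gain(gap_before, gap_after):
    """Points gained relative to the comparison group (here, the nation)."""
    return gap_before - gap_after

print(relative_gain(gap_2000, gap_2024))  # 15
```

Using the nation as the comparison, Mississippi gained about 15 points. That figure still bundles together any program effect, the longer-term trend that predates the program, and compositional changes in who takes the test, which is exactly why the lack of a real control group matters.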

Part 6: Summary

– I should’ve looked into those math-score claims more carefully! Once I noticed the discrepancy between different reports, that was the time to track down what was happening. I’ve criticized statisticians for just accepting unreasonable numbers without checking, so bad on me for sloppiness here.

– As before, I don’t have a strong take on what’s happening in Mississippi. I see good arguments on both sides and no easy way to resolve them. My Bayesian inclination is to split the difference and say there’s some evidence that these policies are working but not to the extent that is advertised, but I don’t really know. Indeed, I can think of this Bayesian splitting of the difference as a kind of frequentist procedure in the sense that, on average, I think we will do well by splitting the difference in this sort of dispute. In any given problem, I’ll often come down stronger on one side or another (as here, for example), but in this case, nah, I don’t really have more for you.

P.S. I get that many readers of this post and my earlier post on the topic are frustrated because I don’t come to a strong conclusion for or against the Mississippi program. But that’s because I can’t: it’s an observational study with a lot of uncertainty about key aspects of the data. We can criticize particular aspects of various reports on the topic, but that’s not the same as coming to a strong conclusion about the effects of the program. Meanwhile, though, policymakers need to make decisions. And this sort of decision can’t wait on definitive evidence; they’ll need to rely on some mix of theory, judgment, and an assessment of political possibilities.

P.P.S. In part 5 of the above post, I remark that the Mississippi discussion has turned into a conservative vs. conservative debate with not much from the liberal direction. Jonathan Chait discusses this too: as a liberal journalist who supports Mississippi’s school policies, he’s surprised that liberal pundits are taking the conservative line that the policies don’t work.

P.P.P.S. I received the following email from Jean Gordon Cook of the Office of Communication and Government Relations of the Mississippi Department of Education:

The Mississippi Department of Education was made aware of an upcoming article that appears to be set for publication in January in Significance magazine. The article casts doubt on the accuracy of Mississippi’s gains on the National Assessment of Educational Progress

We have noted several errors/issues with the article that we sent to the editor of Significance yesterday. We are sharing these items with you because you reference this article in a blog post.

• Incorrect information is in the second-to-last paragraph on p. 33 when it states that the “2024 NAEP fourth grade mathematics scores rank the state at a tie at 50th! The eighth-grade scores also qualify for 50th place.” 2024 NAEP state profiles show Mississippi’s fourth-grade mathematics scores rank the state No. 16 in the nation, and eighth-grade scores rank No. 35.

• Regarding the discussion of retention, the article does not address the fact that students can be retained for reasons other than the Literacy-Based Promotion Act. If you look at the 2018-19 LBPA Annual Report, you will see that 5,049 (14.4%) of third graders did not pass the third-grade reading test on the initial or two retests. Of those students, 4,131 were promoted to fourth grade with a good cause exemption. That means only 918 of the 3,379 third graders who were retained that year were held back because they failed the third-grade reading test.

• The article suggests that students who are held back in third grade may never advance to fourth grade and possibly be in the sample of students who take NAEP. It also doesn’t discuss the fact that students who are retained and students who are promoted to fourth grade with a good cause exemption are required to receive intensive remediation. This is a key part of the Literacy Based Promotion Act (LBPA) and Mississippi’s work to ensure students become strong readers.
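The retention figures in the second bullet of the message can be checked with a few lines of arithmetic. The inputs are the numbers stated in the message; the implied number of test takers is derived from them and is not a figure given in the source.

```python
# Figures as stated in the message (2018-19 LBPA Annual Report).
did_not_pass = 5049              # did not pass on the initial test or two retests
share_not_passing = 0.144        # the stated 14.4%
promoted_with_exemption = 4131   # promoted with a good cause exemption
retained_total = 3379            # all third graders retained that year

retained_for_reading = did_not_pass - promoted_with_exemption
print(retained_for_reading)                               # 918
print(round(retained_for_reading / retained_total, 3))    # 0.272

# Derived, not stated in the message: approximate number of third graders tested.
implied_test_takers = did_not_pass / share_not_passing
print(round(implied_test_takers))
```

So the 918 figure is consistent with the other two counts, and by these numbers roughly 27% of that year’s retained third graders were held back because of the reading test.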

The first point is covered in my post above, but I thought it was simplest to share the whole message.

In my reply to Cook, I apologized for not checking the numbers myself the first time. The funny thing is that, as I explain in the above post, those numbers did look odd to me, but then I didn’t follow through and try to look them up.

How much of “Mississippi’s education miracle” is an artifact of selection bias?

Howard Wainer, Irina Grabovsky, and Daniel Robinson write:

We were sceptical when we read Noah Spencer’s 2024 article about “Mississippi’s education miracle” which education economics expert Harry Anthony Patrinos called a “model for global literacy reform.” The results Spencer reported from his econometric model do seem to be miraculous . . . Based on the National Assessment of Educational Progress (NAEP) fourth-grade literacy test scores, the state moved from a 49th place ranking in 2013 to the top 20 in 2023. The latest 2024 scores revealed that Mississippi is now tied for 8th place among 53 US states and territories!

Such a dramatic turnaround clearly marks a sharp deviation from what we expect given the laws of nature/education generated by a century of empirical experience. If the turnaround is indeed legitimate, then the “intervention” that is claimed to be the cause of the improvement, the Literacy-Based Promotion Act (LBPA), which started in 2013, should be seriously considered for implementation in other states.

But now comes the bad news:

The improvement in the average performance of Mississippi’s fourth-graders on NAEP was preceded by two key changes in their schooling in third grade. One was the a priori sensible idea of trying to improve classroom instruction by improved teacher training, instituting preschool, and a variety of other helpful actions. This was to be accomplished through the promise of an additional annual state expenditure . . . about $111.63 of extra funding annually for each pupil. Comparing this amount to what are annual contemporary per pupil expenditures nationally, we have to agree that if such small expenditures can make a visible difference in student performance it truly is a miracle – a Mississippi version of St. John’s loaves and fishes.

But it was the second component of the Mississippi Miracle, a new retention policy, that is likely to be the key to their success.

Third-graders who fail to meet reading standards are forced to repeat the third grade. Prior to 2013, a higher percentage of third-graders moved on to the fourth grade and took the NAEP fourth-grade reading test. After 2013, only those students who did well enough in reading moved on to the fourth grade and took the test.

Wainer et al. share the figure at the top of this post to show how this works quantitatively, and then they continue:

As previously mentioned, the latest NAEP data for 2024 show even more impressive, “miraculous” results on the fourth-grade literacy test scores – a tie for 8th place. Strangely though, for the eighth-grade literacy test, the state’s rank dropped to a tie for 42nd place! This should clear up any miracle illusions that may remain. Need more proof that Mississippi public education is without miracles? The 2024 NAEP fourth-grade mathematics scores rank the state at a tie at 50th! The eighth-grade scores also qualify for 50th place.

OK, I wouldn’t use the word “proof” here, but I take their point.

There’s still some interesting stuff going on in the data, though. Wainer et al. share this figure:

and write:

Scores gradually increased for both from 1990 to 2015. But then the scores began to decline nationally whereas they continued to increase for Mississippi. Why? It is hard to credit Mississippi’s 2013 LBPA as the cause since there does not appear to be any change in Mississippi’s continued improvement. Yet viewed in the context of the national decline, perhaps LBPA deserves some credit. But if we are to credit LBPA for the continued growth, how do we apportion that credit to the Act’s two parts? Is it due to the changes in what was taking place in fourth-grade classrooms? Or who was allowed into those classrooms?

Good question! Here’s their answer:

The most credible way to do this is with a formal experiment in which we form four groups by crossing the two factors – extra per-pupil expenditures, and promotion based on reading performance – yielding four experimental groups . . . Unfortunately, in education, such experiments are rare . . . In the current situation, the best we can do is to use the model given in Figure 1 to predict the gain in mean score from the retention rates and see how much of Mississippi’s gains shown in Figure 2 are left unaccounted for. Were we to do this we would find that most of Mississippi’s gains are due to the retention rate.

Hey, don’t just say “most”! Give a percentage! And an uncertainty estimate!

It’s happened before

In their article, Wainer et al. also provide some historical context:

We have seen several previous K–12 education “miracles” that turned out to be hoaxes. Five of them were in Houston, Atlanta, the District of Columbia, El Paso, and New Orleans.

In the first four, investigators found fraud. The people in charge (e.g., superintendents) cheated to give the impression of increased test scores. In Houston, the numbers of students who were categorised as “special education” were increased so their low test scores would not be included in the school’s overall test scores. In Atlanta, records were falsified. In the District of Columbia, high-school students graduated who should not have. And in El Paso, to inflate scores, Mexican transfer students, who typically scored lower, were prevented from taking the state-mandated tenth-grade achievement tests.

The New Orleans miracle was caused by a natural disaster. Hurricane Katrina tragically relocated about a third of the students who came from the poorest areas. Removing thousands of low scorers immediately raised the average test scores of the students who remained . . .

OK, this seems pretty clear. So how did everybody get fooled?

Wainer et al. write:

There are three possible reasons. First, to his credit, Patrinos cited the 2024 study by Spencer whose analysis concluded that the LBPA was the cause of the increase in fourth-grade reading and maths scores. The gold standard for measuring the effects of causes is to . . . randomly assign students to either treatment . . . or control . . .

Spencer did not have the data required for such an analysis. So, instead, he improvised by using some prior years’ data as the control group, and instead of random assignment he used various bits of covariate information to equate this year’s students with the previous years.

This sort of approach often works–we generally recommend adjusting for pre-treatment variables in observational studies–but not here! Why not? Because, as Wainer et al. point out, “in the current year the bottom of the class was truncated and so were very much unlike the prior years’ scores – and no covariate adjustment was going to make them equal.”

They continue:

Besides weak empirical data, educational reformers like Patrinos should have given greater weight to the extant literature on the Mississippi Miracle. The miracle had already been convincingly debunked.

This last link is to a Los Angeles Times article entitled, “How Mississippi gamed its national reading test scores to produce ‘miracle’ gains,” written by Michael Hiltzik. Hey–I recognize that name! Palko sometimes links to him.

Wainer et al. conclude:

Third, Patrinos, and others who have praised the Mississippi miracle, should know that extreme educational reform success stories are non-existent. History has shown us that a little bit of digging has, in the past, always revealed such claims of miracles to be false. This does not mean that we should give up hope. Small successes are common in education. But dramatic huge successes should always alert us to scepticism.

Other views

I have not looked at these data myself, and I hold out the possibility that Wainer et al. are mistaken, that there are aspects of the situation they’ve not fully thought through, and that a more careful analysis would legitimately show something different. To get a sense of other perspectives, I googled *Mississippi miracle*.

The first item was a wikipedia page that goes into the details of the Mississippi plan and reports it as an unqualified success: “After adjusting for demographics, in 2024, Mississippi was the nation’s #1 state in Reading as well as in Mathematics.” But Wainer et al. say that Mississippi is tied for 50th in math. Can they really be worst in the nation, but best after demographic adjustment? I guess it’s possible.

Some of the wikipedia article seems to inadvertently support the skeptical position, as here:

Oklahoma, for instance, passed a bill in 2012 that mirrored the LBPA, only to pass a new law two years later that defanged the law. This was done to avoid actually holding back the students that could not read at grade level. Oklahoma’s scores have since plummeted and the state ranks near the very bottom of the NAEP’s list.

This seems consistent with the take that holding back the lowest-performing kids was the cause of the increase in average test scores.

I guess I could add a link to the Hiltzik article to the wikipedia page.

The next link is to an article in the New York Post entitled, “Mississippi’s reading triumph is no miracle — it’s the future of education,” again, 100% buying the effect with no comment on selection bias.

This is not to say that the policies being done in Mississippi are a bad idea! As Wainer et al. emphasize, small improvements matter too. It’s just not right to go around claiming large composite effects that are really due to selection on who takes the test, not improvements for individual kids.

The third link in my Google search is an article from a website called ExcelinEd entitled, “Four Reasons Why Mississippi’s Reading Gains Are Neither Myth Nor Miracle,” and beginning, “It’s time to debunk criticism of the Magnolia State’s literacy outcomes.” They address the selection issue head on:

This progress would not have been possible without ending social promotion and implementing the so-called “third-grade gate.” . . . Critics tend to take aim at retention for two reasons: First, it can be an emotional issue for families to find out their child needs to repeat a grade . . . Second, critics want to believe gains can be made without retention, and they strategically parse the data to prove their point.

Critics have alleged that Mississippi’s outcomes are a “statistical illusion,” because of the percentage of students retained by the third-grade gate. Retained students’ test scores aren’t part of the overall results, so they argue the picture is rosier than it should be.

But they think the improvement is real:

Except that’s not true at all. Researchers at Mississippi First took a deep dive into the data and what actually happened.

Here’s the short version: The largest NAEP gains in Mississippi were from 2013-2015 when no third graders were retained—because the state had not yet implemented that part of the law. The outcomes that led to the “Mississippi Miracle” designation in 2019 were made by the 2018 cohort of third graders, less than 5% of whom were retained.

There was a one-time jump in retention in 2019, because the state raised the standard for a student to pass the third-grade gate. But the retention rate has declined every year since then, even after the pandemic.

According to Wainer et al., the retention rate was 9.6% in 2018-19 and declined to 7.2% in 2022-23, and they argue that even a 7% retention rate would cause an improvement of 0.15 standard deviations from the selection effect alone. I don’t know how this maps onto test scores or state rankings. Also, the cumulative retention rate could be higher if some kids are also held back in earlier grades.
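As a rough check on that 0.15 figure, here is a minimal sketch. It assumes that scores follow a standard normal distribution and that retention removes exactly the lowest-scoring fraction of students. Both are simplifications, and the function name is made up for illustration; this is not Wainer et al.'s actual calculation.

```python
from statistics import NormalDist

def truncation_gain(retained_fraction: float) -> float:
    """Rise in the mean score (in SD units) after dropping the lowest
    `retained_fraction` of a standard normal score distribution."""
    std_normal = NormalDist()
    cutoff = std_normal.inv_cdf(retained_fraction)
    # Mean of a standard normal truncated below at `cutoff`:
    # pdf(cutoff) / (1 - cdf(cutoff)), and cdf(cutoff) == retained_fraction.
    return std_normal.pdf(cutoff) / (1.0 - retained_fraction)

# Retention rates mentioned in the discussion (plus 5% for comparison).
for p in (0.05, 0.072, 0.096):
    print(f"remove bottom {p:.1%}: mean rises by {truncation_gain(p):.2f} SD")
```

Under these assumptions a 7% truncation gives a bit under 0.15 standard deviations, which is in line with the number quoted above. The gain would be smaller if retention decisions are only partly correlated with the score being averaged.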

I’d never heard of ExcelinEd, so I clicked through to find out who they are. The board of directors is mostly a mix of Republican politicians and political appointees who served under Republican administrations. That’s fine (no reason that a partisan group shouldn’t care about education); it just gives some sense of where they’re coming from.

In any case, similar arguments are made by nonpartisan sources. The ExcelinEd article points to this post on Chalkbeat, which is another site I’d never heard of before, but it doesn’t seem to have a partisan agenda. They seem to be doing their best to weigh the evidence:

Some have called it the “Mississippi miracle.” Others say not so fast.

In the last decade, Mississippi students have rapidly closed the test score gap with the nation as a whole, particularly in fourth grade. State officials, education wonks, and national journalists have attributed these improvements to the state’s 2013 early reading law, which included emphasizing phonics and holding back third graders who struggle to read. . . . “Mississippi has shown that it is possible to raise standards even in a state ranked dead last in the country in child poverty and hunger,” New York Times columnist Nick Kristof wrote in May. . . .

But a few commentators have pushed back on this rosy narrative. Los Angeles Times columnist Michael Hiltzik recently claimed scores had been “gamed.”

Chalkbeat says:

Hiltzik, the Los Angeles Times columnist, advanced two major critiques of the state’s test score gains in a recent column.

First, he argued that by holding back struggling third-graders, the state had inflated its test scores by removing those students from the pool of fourth grade test-takers.

In reality, this could help explain test scores jumps for a short period of time, but it doesn’t make much sense for longer-term gains. Eventually, students who are retained in early grades will move up to the next grade — they are not held back forever. Because Mississippi has seen sustained improvements, retention gaming appears to be an unlikely explanation. . . .

Andrew Ho, a testing expert at Harvard University and previously a member of the board that oversees NAEP, said his instinct is to question big test score gains. But in the case of Mississippi, he said, “I don’t see any smoking guns or red flags that make me say that they’re gaming NAEP.”

They bring up another issue:

One sometimes overlooked change in Mississippi education policy in the last decade involved not curriculum or instruction, but its testing regimen. In 2015, Mississippi overhauled its state test, including by aligning it more closely with NAEP. . . . Testing experts say that focusing on the content of a particular exam might improve scores because educators teach to that specific test. . . . “To the extent you prioritize NAEP, you risk inflating NAEP scores,” said Ho. However, the state testing shift began in 2015, while NAEP gains began in 2013. Additional scrutiny might shed more light on this issue.

So, lots to chew on.

The next Google link comes from PBS, an entirely uncritical report (sample quote, “The institute’s CEO, Kelly Butler, said she tells them there’s no secret to the strategy. ‘We know how to teach reading,’ she said. ‘We just have to do it everywhere.'”).

Next, a straight-up press release from the state of Mississippi. Sample sentence: “The results speak for themselves.” No mention of selection bias.

Next is from a website called The74 (the name refers to “the education of America’s 74 million children”), “There Really Was a ‘Mississippi Miracle’ in Reading. States Should Learn From It.” They address the statistical concerns:

A research paper last fall from Noah Spencer from the University of Toronto found that the law helped drive the state’s gains.

Spencer estimated that the third-grade retention policy alone could be responsible for about one-quarter of the gains, and it was surely the most controversial element. Some people have even tried to cast doubt on Mississippi’s NAEP gains by arguing they’re merely a function of testing older kids. But this has been debunked: Mississippi does hold back more kids than other states, but it always has, and the average age of Mississippi’s NAEP test-takers has barely budged over time.

Research on third-grade retention policies has found that students who are retained tend to have better long-term outcomes than those who are not . . .

Lots of links there! Let’s look at some of them.

First, a negative take, from education analyst Diane Ravitch:

A long-time cellar dweller in the NAEP rankings, Mississippi students have risen faster than anyone since 2013, particularly for fourth graders. In fourth grade reading results, Mississippi boosted its ranking from forty-ninth in 2013 to twenty-ninth in 2019; in math, they zoomed from fiftieth to twenty-third. Adjusted for demographics, Mississippi now ranks near the top in fourth grade reading and math according to the Urban Institute’s America’s Gradebook report.

So how have they done it? Education commentators have pointed to several possible causes: roll-out of early literacy programs and professional development (Cowen & Forte), faithful implementation of Common Core standards (Petrilli), and focus on the “science of reading” (State Superintendent Carey Wright).

But one key part of Mississippi’s formula has gotten less coverage: holding back low-performing students. . . . a “third grade gate,” making success on the reading exit exam a requirement for fourth grade promotion. This isn’t a new idea . . . But Mississippi has taken the concept further than others, with a retention rate higher than any other state. In 2018–19, according to state department of education reports, 8 percent of all Mississippi K–3 students were held back (up from 6.6 percent the prior year). This implies that over the four grades, as many as 32 percent of all Mississippi students are held back; a more reasonable estimate is closer to 20 to 25 percent, allowing for some to be held back twice.

I don’t quite follow this: are they just holding back 8% of third graders, or are they holding back 8% each year?
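The arithmetic behind the quoted range can be sketched as follows, assuming the 8% figure applies separately in each of the four grades (K, 1, 2, 3). Whether that is what the state reports actually mean is exactly the open question, so treat the inputs as assumptions.

```python
annual_rate = 0.08   # share of K-3 students held back in a year (from the quote)
grades = 4           # K, 1, 2, 3

# Upper bound: no child is ever held back more than once.
no_repeats = annual_rate * grades

# Assumption: being held back is independent from one year to the next.
independent = 1 - (1 - annual_rate) ** grades

print(f"no child held back twice: {no_repeats:.1%}")   # 32.0%
print(f"independent across years: {independent:.1%}")  # 28.4%
```

The quoted "20 to 25 percent" would then require repeat retentions to be concentrated in the same children more than independence implies. If instead only third graders are held back at that rate, the cumulative figure for a cohort is just 8%.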

Next, a positive take, from Todd Collins, writing at the site of the charter school organization, the Fordham Institute:

Mississippi didn’t cheat. Its reading gains are real. . . . the data show that it’s done so for at least twenty years, and at the same rates as under the current literacy law.

Retention by itself did nothing for them, mechanically or otherwise. Before 2013, Mississippi ranked forty-eighth for fourth grade reading, despite having one of the country’s highest retention rates. And after the reading retention law went into effect, the year-to-year rate changes had no discernible effect on NAEP results. . . .

Moreover, the pandemic provided a clear natural experiment: What happens when retention stops? In 2021, Mississippi suspended its third grade retention requirement. When those students took the fourth grade NAEP in 2022, the “statistical illusion” should have worked in reverse, sending Mississippi scores tumbling, relative to other states. Instead, although scores did fall, as they did in forty-four other states, Mississippi’s drop was less than the national average.

There’s also the Patrinos article mentioned near the top of this post, but it’s just terrible: it doesn’t acknowledge the selection issue at all, it’s just straight-up hype. Similarly, economics journalist Noah Smith writes, “Mississippi has had a big breakthrough in teaching poor kids to read! The core of the approach is an old technique called ‘phonics’ that’s coming back into vogue. But it’s also about identifying students who are struggling and giving them extra resources, while also not simply giving them a rubber stamp and letting them pass to a higher grade.” Sure, but if you don’t let them pass to a higher grade, you’re gonna see higher average scores among the students who do take the test. This is something that an economics journalist should realize!

Finally, here’s a report from Mississippi First, “a leading voice for high-quality early education, high-quality public charter schools.” They’re an interested party here, but they’re also close to the data and have a motivation to get things right, so we should look at what they have to say:

On the 4th grade reading test, Mississippi gained 20 scale points between 1992, when the first state NAEP data were released, and 2019, when we first reached the national average. . . . Mississippi had two periods of big gains: 2005-2009 and 2013-2019 . . .

On the 4th grade math assessment, we gained 18 points from 2003 to 2019. Mississippi’s math gains were very steady in this 16-year period, with the state improving little by little while the nation stood still, until we finally saw Mississippi’s gap-closing jump between 2017 and 2019.

Mississippi’s Literacy-Based Promotion Act (LBPA) did not pass until spring 2013 and the “gate” (i.e., the requirement that students score a minimum level of proficiency, which originally was a level 2 of 5) did not go into effect for 3rd graders until 2015. This means the first year that Mississippi kids who experienced the “gate” were in the NAEP sample was 2017, when 15 points of our 20-point gain had already happened. . . . the LBPA can only explain 4th grade gains beginning in 2017.

What is the “Bottom 10%” Argument, and Why Is It Unpersuasive? . . .

The LA Times column claims that Mississippi’s NAEP success–specifically our reaching the national average in 4th grade reading–is a sham based on the analysis of a blogger armed with a graph of NAEP 4th grade reading data between 2013 and 2022 and the claim that Mississippi had a “nearly 10%” retention rate in 3rd grade following the LBPA’s passage. . . . but the percentage of 3rd graders held back as a result of the LBPA in whole or in part has never been that high. Certainly, the LBPA caused an increase of between 5.74-5.91 percentage points in the retention rate over a base trend of around 3-3.4% in the years immediately prior to the 2014-2015 implementation, but after that first year, the retention rate began to drop back down to pre-LBPA levels. By 2017-2018, the retention rate as a result of the LBPA was no higher than 1.58%, and the overall rate was less than 5%. . . . Because the LBPA caused so few retentions by 2018, Mississippi actually raised the bar in 2018-2019 so that “passing” the gate meant a higher level of proficiency in reading (now a 3 of 5, instead of a 2 of 5). After that, the overall retention rate did reach a high of 9.6% for the 2019 3rd grade cohort (2020’s 4th grade cohort), but those kids weren’t in the 2019 or 2022 4th grade NAEP. . . .

I [Rachel Canter from Mississippi First] object to the whole construct of the bottom 10% methodology because retained students don’t just disappear such that one needs to “add them back in.” They actually eventually get promoted, which means they do show up in 4th grade data, including 4th grade NAEP, just after some remediation (hopefully!). Having better scores after being held back . . . is that not the point of grade retention?

So where are we, then?

I’m not sure what to think. On one hand, Wainer, Grabovsky, and Robinson are experts on educational measurement, their argument about selection effects is persuasive, and their meta-argument about skepticism given the history of education hype also makes sense to me. Also, Howard’s a friend, and he’s a reasonable person, so I’m inclined to agree with him.

On the other hand, it all depends on the numbers: how many kids of each grade are held back each year, how they do in later years, etc. And it seems likely that some of these numbers will never be available.

Another question is, what are the causal inferences we’re looking for? How would we summarize things if we had all the data, including all potential outcomes? We’d like to know the changes due to the program among kids who would not be held back in any case, kids who would be held back in any case, and kids who would be held back under the treatment but not under the control. Among those in that third group, there’s the question of whether you’re comparing later outcomes in the same year (i.e., the same age of the kid) or at the same grade (so that you’re comparing the test scores of held-back kids to the scores they would have received a year earlier had they been promoted). There’s also the frustrating way in which the discussions jump back and forth between absolute test scores and demographically-adjusted comparisons between states.

Another challenge in sorting this all out is that the Mississippi program had a lot of features, and Mississippi’s test scores had been improving for a while. Some of the “Mississippi miracle” discussions focus on what’s happened since 2013, but the article from Mississippi First seems to be arguing that state policies have been helping since 1992. So it’s kind of a moving target. There’s also the association with phonics-based language instruction and a kind of general take that Mississippi’s success comes from them holding kids to a more rigorous standard. Which could be, but there I lean toward the skepticism of Wainer et al., in that states always seem to be talking about getting back to basics in education.

So, lots of moving parts. On statistical grounds, it would seem undeniable that some large chunk of the improved test scores in Mississippi comes from the selection effect of delaying the students who were going to perform the worst, but it seems hard to put a number on this. In any case, it’s just gonna be hard to make causal attributions and estimate causal effects in a context where the national outcomes are changing so much, as can be seen in the second graph near the top of this post.

Wainer’s reactions

I sent the above to Howard Wainer, the first author of the above-linked paper that questions the claims of Mississippi’s success, and here’s how he responded:

OK — let’s take it from the top.

The basic idea is that they (Mississippi) picked out an outcome variable to measure success (NAEP score). Then they instituted a compound treatment (funding, class size, etc + retention ON THE BASIS OF THE OUTCOME VARIABLE) and the goal was to measure the causal effect of each of the components of the treatment (e.g. how much is due to class size and how much is due to focused retention). This is tough going under any circumstance, but especially without a control group. Hence my earlier comment to you about Hugo Muench’s “laws” of clinical studies, which essentially says that nothing improves the performance of an innovation more than lack of controls.

Anyway, this means that trying to figure out what is the causal effect of each part is tough and so our guess that it was mostly the newly focused retention policy we thought was a good bet. Which is why we included the plot of mean gains as a function of truncation percent to indicate that it accounted for (order of magnitude) most (all?) of the gains claimed. Yes, they had a high retention rate previously, but who was retained was based on a mixture (unknown, at least to us) of variables/causes. The new policy retained specifically on the basis of the outcome variable.

Thus we would posit that the retention rate is unlikely to have much of an effect on the height of 4th graders, but it would if only short kids were retained.

But, there is a lot of dark here, and we tried to offer the most plausible explanation all things considered. We were not inclined to give Mississippi the benefit of the doubt (based on the chicanery that has manifested itself with essentially ALL prior education miracles).

Fair enough. I still wonder what happens with those kids who are held back and are then tested a year later. I guess they improve on average a lot on their own, no matter what is done, during that year.

Also, I’m not sure what’s the ultimate policy goal: maybe to improve reading and math ability (as measured by test scores) when kids leave the K-12 system? I think this would be one of the traditional reasons to hold students back a grade, so that they have a longer time to learn the material, which could be helpful even if the treatment is having no effect on their learning trajectories.

Another way of putting this is that I don’t think it’s always clear what people are estimating here. The causal effect of the treatment would apply to individual students, but the outcomes are being compared in the aggregate, which is a challenge given that the treatment affects who’s being aggregated.
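To make the aggregation issue concrete, here is a toy simulation with made-up data, not Mississippi’s. Every student has exactly zero treatment effect, but the fourth-grade average still rises because a gate on the third-grade score changes who is in the average. The noise levels, the 7% gate, and all the variable names are assumptions chosen for illustration.

```python
import random
from statistics import mean, pstdev

random.seed(1)
n = 100_000
retained_share = 0.07  # hypothetical share stopped at the third-grade gate

# Hypothetical students: latent ability plus independent noise on each test.
ability = [random.gauss(0, 1) for _ in range(n)]
grade3 = [a + random.gauss(0, 0.5) for a in ability]
grade4 = [a + random.gauss(0, 0.5) for a in ability]  # zero individual effect

# The gate: students below the cutoff on the grade-3 test are held back.
cutoff = sorted(grade3)[int(retained_share * n)]
promoted = [g4 for g3, g4 in zip(grade3, grade4) if g3 >= cutoff]

# Compare the average among promoted students with the full-cohort average.
gain = (mean(promoted) - mean(grade4)) / pstdev(grade4)
print(f"aggregate gain with zero individual effect: {gain:.2f} SD")
```

In this setup the aggregate gain comes out on the order of a tenth of a standard deviation. That is somewhat less than a clean truncation on the outcome itself would give, because the gate uses a noisy earlier test. The sketch ignores what happens when the held-back students rejoin a year later, which is the part of the story the defenders of the program emphasize.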

P.S. More here, here, here, and here.

The three funniest items on the Kroger recall list

Palko points us to this news item, “Kroger Recall Update: Full List of Product Warnings Across 18 States.” My favorite:

Yummi Sushi, recalled October 28, 2025: Nashville, Knoxville, Georgia, and South Carolina stores. Possible contamination with metal fragments.

If you’d asked me why they were recalling sushi, “contamination with metal fragments” would not have been in my first hundred guesses. I guess the metal makes it yummier.

And my second favorite:

Face Rock Curds Vampire Slayer Garlic, recalled June 25, 2025: Affects Fred Meyer and QFC stores. Potential Listeria contamination.

I don’t see the problem. Listeria would slay a vampire too, no?

And this:

High Noon vodka Beach Variety, recalled July 28, 2025: Affects Kroger-owned stores located in Wisconsin, South Carolina and Virginia. Kroger says: “Specific lot codes of the product are being recalled due to variety packs may have cans labeled Celsius Energy Drink filled with seltzer alcohol.”

High noon, indeed.

The acupuncture paradox and its resolution

This one’s important.

The other day we had a post on when it’s ok to judge people by their worst.

Dmitri wrote in:

You know who I judge by their worst belief? Healers: doctors and such. If a doctor has one crazy health belief, I am out of there.

My mother-in-law has an acupuncturist who is into all sorts of weird Chinese traditional medicine. I could use an acupuncturist for some tendinitis but I have ruled her out because I know that some of her beliefs are totally nuts. Maybe she knows where to stick for the tendinitis but I am not taking any chances.

I told him I disagreed regarding the nutty acupuncture beliefs (more on this below), and Dmitri elaborated on his reasoning:

When I choose a health-care practitioner I am choosing on the basis of the quality of their health-care reasoning. I want them to figure out what’s wrong with me and find a way to fix it. If I have any evidence that their health-care reasoning is faulty, I want to stay away.

My comment about acupuncture was written from the standpoint of the belief that acupuncture might really work for certain kinds of ailments. I had a friend who got good results with tendinitis. But even if it’s all placebo I suspect I’ll get the best placebo effect if I have faith in the person administering the treatment.

This all makes sense, also it pleases my sense of nostalgia to see a mother-in-law joke. OK, it wasn’t really a joke on Dmitri’s part; still, it brought back memories of wacky mothers-in-law in old sitcoms.

The acupuncture paradox

But back to the acupuncture. I’ve been thinking about this for a long time. Here’s a post from 2011:

The scientific consensus appears to be that, to the extent that acupuncture makes people feel better, it is through relaxing the patient, also the acupuncturist might help in other ways, encouraging the patient to focus on his or her lifestyle.

A friend recommended an acupuncturist to me awhile ago and I told her the above line, to which she replied: No, I don’t feel at all relaxed when I go to the acupuncturist. Those needles really hurt!

I don’t know anything about this, but one thing I do know is that when I discuss the topic with any of my Chinese friends, they assure me that acupuncture is real. Real real. Not “yeah, it works by calming people” real or “patients respond to a doctor who actually cares about them” real. Real real. The needles, the special places to put the needles, the whole thing. I haven’t had a long discussion on this, but my impression is that Chinese people think of acupuncture as working in the same way that we think of TV’s or cars or refrigerators: even if we don’t know the details, we trust the basic idea.

Anyway, I don’t know what to make of this. The scientific studies finding no effect of acupuncture needles are plausible to me—but if they’re so plausible, how come none of my Chinese friends seem to be convinced?

My question here is not whether acupuncture could work (possibly through some backdoor mechanism like the needles stimulating your body in some useful way, or whatever) but on the evidence of how much it does work. As noted, I think the overwhelming impression among my Chinese friends–statisticians included–is that it does work, and not merely through some vague calming effect. But this would seem to contradict the research, so I don’t know what to think.

This does seem to be a paradox, as evidenced by some of the discussion in the 56 comments on the above-linked post.

We had another good comment thread when I brought up the topic again in 2016.

What are acupuncturists doing?

So, yeah, this paradox was bugging me for years, and then at some point I came up with a satisfying resolution.

My resolution of the acupuncture paradox might not be scientifically correct–indeed, it would be wonderful to design some experiments to study the topic and see what, if any, of my ideas in this domain hold up–but it has the virtue of being a possible solution to the problem. Which is more than I had before.

I’ve talked with a bunch of people about this idea, and I’ve mentioned it in some public lectures, but this might be the first time I’ve written it up.

My resolution of the paradox starts with the idea that the success of acupuncture, as with physical therapy, coaching, teaching, and many other things, comes from a fruitful interaction between the patient and the therapist. Good acupuncturists, like good physical therapists, coaches, and teachers, do not just push buttons and follow a template; they work closely with each patient and figure out what is needed. In addition, I assume that acupuncture is like these other endeavors in that a key function of the therapist is to motivate patients to keep up with the work.

Thinking of a treatment effect as a vector with direction and magnitude

I’ve written that you can conceptualize an education intervention as a vector, where the direction of the vector is the material being learned and the length of the vector is the amount that students are motivated to work. You want the material learned to be useful–you’d like the vector to have a positive “dot product” with the vector of skills, knowledge, and understanding that will be useful going forward–but, conditional on those two vectors being roughly aligned, the real gain is in the magnitude. And this magnitude will be an interaction between the teacher and student: there’s no button to push to create motivation, and if there were a button it would already have been pushed.

What I’m saying is that, when thinking about acupuncture, or physical therapy, or coaching, or teaching, we have to go beyond what I’ve called the penicillin model of science, the idea that innovations come from nowhere and that the job of statistics is to design and analyze experiments to reject the null hypothesis of no effect, and in which the treatment in such experiments is considered as a black box, with the goal being to estimate an average treatment effect.

I don’t think the penicillin model usually applies. Most of the time in health, education, and just about any field, improvements are incremental, and the goal is to improve the process while gaining understanding. Clinical trials and offline experiments both play a role, and you’re not going to learn much by studying a treatment as if it’s a black box.

This is not to say that there cannot be new developments in any of these fields, nor is it to deny that such developments can sometimes arise serendipitously. I just think that, in any case, you have to go beyond the average treatment effect and think about the mechanism of action.

Resolution of the acupuncture paradox

OK, now on to my answer.

Let’s suppose that acupuncture really works, not just as a placebo or as relaxation or whatever, but as a set of physical manipulations that help you heal better. Let’s also suppose that the mechanism of acupuncture is not the position of the needles or qi or whatever, but the result of the acupuncturist observing you, listening to you, watching and feeling you as you move, then getting a sense of where your problems are and doing movements and giving you advice that will improve your healing. There’s no need for either or both of these statements to be true, but they could be, and suppose they are.

In that case, we should see two things:

1. The usual controlled studies of acupuncture should show no effect. If you do an experiment where the treatment group gets acupuncture with the needles in the “correct” places and the control group gets acupuncture with needles in the “wrong” places, there will be no difference. If useful acupuncture is being done, with good interaction between the therapist and patient and the therapist giving informed, patient-specific therapy, then it will work in both treatment and control groups. If push-button acupuncture is being done, without that focus on the patient, it won’t work in either group. In either of these scenarios, the treatment effect in the experiment will be zero, or nearly zero.

2. Real-world acupuncture would work, and not just because of relaxation/placebo/etc.

In this setting, the usual experimental research method won’t work, because the experiment with the random needle placements is removing the very mechanism by which acupuncture works (or is assumed to work, in my scenario).

That’s a paradox for you: We can have an effect that is real but which will not show up under the usual controlled-trial design.

This is not to say that the effect could not be detected. You’d just need a different design, for example acupuncture (done however the therapist wants to do it) versus nothing, or versus some default therapy. In such a setting you’d want to gather lots of intermediate data to find out what the acupuncturists (and also the control-group therapists) and the patients are actually doing, how their bodies are moving as the weeks go on. Get rich data, thick description that can then be analyzed using multilevel models as necessary.
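To make the two designs concrete, here is a small simulation in R. It is only a sketch of the hypothetical scenario above: the effect sizes, the noise level, the sample size, and all the variable names are made up for illustration, and the benefit is assumed to come entirely from the engaged therapist, not from where the needles go.

```r
# Hypothetical simulation: every number here is invented for illustration.
set.seed(123)
n <- 2000                    # patients per arm
engaged_benefit <- 1.0       # assumed gain from attentive, patient-specific therapy
needle_benefit  <- 0.0       # assumed gain from "correct" needle placement
noise_sd <- 2                # assumed patient-to-patient variation in outcome

simulate_arm <- function(n, engaged, correct_needles) {
  rnorm(n,
        mean = engaged * engaged_benefit + correct_needles * needle_benefit,
        sd = noise_sd)
}

# Design 1: "correct" vs. "wrong" needle placement; both arms see an engaged therapist.
real_needles <- simulate_arm(n, engaged = 1, correct_needles = 1)
sham_needles <- simulate_arm(n, engaged = 1, correct_needles = 0)

# Design 2: acupuncture as the therapist wants to do it vs. no therapy at all.
as_practiced <- simulate_arm(n, engaged = 1, correct_needles = 1)
no_therapy   <- simulate_arm(n, engaged = 0, correct_needles = 0)

diff_sham    <- mean(real_needles) - mean(sham_needles)
diff_nothing <- mean(as_practiced) - mean(no_therapy)
se_diff      <- sqrt(2 * noise_sd^2 / n)   # standard error of each difference in means

print(round(c(sham_design = diff_sham,
              nothing_design = diff_nothing,
              se = se_diff), 2))
```

Under these assumptions the sham-controlled comparison estimates something close to zero even though every treated patient is benefiting, while the comparison to no therapy recovers the benefit. The simulation does not say whether the assumptions are true; it only shows that the two designs answer different questions.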

What about the theory?

To return to Dmitri’s original statement: What about the theory of acupuncture, the placement of the needles, the lines on the body, the qi, etc.? I don’t know. I admit I’m skeptical, and I think that acupuncture could work (and not just as relaxation etc.) even with all these theories being bogus. I’d think of the theories as a sort of checklist, a framework that gives acupuncturists some focus and gives them the space to observe the patient and figure out what to do and what to recommend.

Just as in chess it is said that planning is important even if your plan does not work out, it could also be true in acupuncture (and also in physical therapy, coaching, teaching, etc.) that even a misguided or empty theory can provide useful structure.

But we don’t always have a good language for talking about this when we talk about science. We can talk about specific mechanisms (this gene codes for this protein which unlocks this other protein which catalyzes that reaction, or whatever) or we can talk completely abstractly (this treatment works, as has been demonstrated in a bulletproof clinical trial), but we’re not so good about the steps of trying to work through a mechanism by gathering intermediate measurements and modeling them.

So, give your mother-in-law’s acupuncturist a break. Her beliefs and medical theories may be “totally nuts,” but they may be no more than a framework that she can use to do the useful things that she does.

P.S. Andrew Vickers points in comments to a meta-analysis from 2018 finding positive effects of acupuncture, beyond any placebo effect, in clinical trials.

Conflicting statistical evidence on the long-term effects on children of being whacked by their parents

A few years ago we had a post on the lack of clear evidence on the long-term effects on children of being whacked by their parents. This is sometimes called “corporal punishment” but I think that term is too mild, because from the kid’s perspective what’s relevant is not the “punishment” part (to a kid, the adult world is full of ever-changing rules, so you can be punished for pretty much anything if the adult in power decides to do so) but the bit about being hit by someone who is taking care of you and is possibly supposed to love you. It’s also sometimes called “physical abuse,” which to me seems like an accurate term but which I will avoid because the term “abuse” when applied to children brings to mind sexual abuse which is not what I’m talking about here. So I’ll stick with “whacking” which I think conveys the pain of being hit, if not the feeling of betrayal.

Back in that earlier post, I questioned some journalists who reported certain pro-whacking research (“The research, by Calvin College psychology professor Marjorie Gunnoe, found that kids smacked before age 6 grew up to be more successful . . . ‘The claims that are made for not spanking children fail to hold up. I think of spanking as a dangerous tool, but then there are times when there is a job big enough for a dangerous tool. You don’t use it for all your jobs.'”). Nothing wrong with citing this work, but then it would make sense to also cite the research pointing in the other direction.

And this goes both ways. John “not Towering Inferno” Williams writes:

My step-son is visiting, along with his wife and two boys, aged 4 and 6, and I’ve retreated upstairs to my study. My step-son and wife are following what seems to be the current norm for parenting, which involves trying to reason with or distract misbehaving children, rather than setting limits and teaching them that breaching the limits has consequences. The boys, being bright, realize that they can get away with all kinds of misbehavior, and take full advantage of this.

Now, I was raised by pretty permissive parents: my father spanked me only once, and my mother never did. However, behaving badly got me or my sibs put “out of the living room” for five minutes for the first offense, longer for subsequent ones, and this was enforced. I don’t recall that we ever resisted this, probably because of an unspoken threat of corporal punishment if we did, but in any case, some semblance of order was maintained.

Anyway, the noise from downstairs got me wondering where the current fashion for child rearing came from, so I started poking around on the web, and came across sites such as that of the Center for Parenting Education, where I found “The Case Against Corporal Punishment,” which among other things says that:

Children who are hit as toddlers have a lower IQ than children who are not spanked.

According to Murray Straus, a professor at the University of New Hampshire and Director of the Family Research Lab there, children who were spanked or slapped averaged a five-point drop in IQ.

The strongest link between corporal punishment and IQ occurs when parents continue to hit their children into their teen years. Yet, “even small amounts of spanking make a difference,” according to Straus. (Glenn).

This last seems implausible, so I looked up “Murray Straus corporal punishment IQ” on Google Scholar and found: Corporal Punishment by Mothers and Development of Children’s Cognitive Ability: A Longitudinal Study of Two Nationally Representative Age Cohorts 2009. Murray A. Straus & Mallie J. Paschall; here is the abstract:

This study tested the hypothesis that the use of corporal punishment (CP), such as slapping a child’s hand or “spanking,” is associated with restricted development of cognitive ability. Cognitive ability was measured at the start of the study and 4 years later for 806 children age 2–4 and 704 children age 5–9 in the National Longitudinal Study of Youth. The analyses controlled for 10 parenting and demographic variables. Children of mothers in both cohorts who used little or no CP at Time 1 gained cognitive ability faster than children who were not spanked. The more CP experienced, the more they fell behind children who were not spanked.

So, it is an exploratory study looking at existing data, with lots of forking paths and measurement issues (just what is a spanking?); the main result is shown in Figure 1, with no error bars, and an unexplained benefit from very frequent spankings for older kids.

Figure 1. The more spanking, the lower the child’s cognitive ability score four years later.

From the text, we learn that the data are really for the previous week, but, not to worry.

CP was measured during two 1-week assessment periods in order to identify children who experienced as close to no-CP as possible with this data. The fact that a score of zero identifies children who were not spanked in either of the 2 sample weeks over a 2-year time span makes it plausible to consider the zero group as children for whom CP was extremely rare or in some cases nonexistent. Nevertheless, in the light of the extremely high intervention rates needed to properly supervise toddlers (once every 6–10 minutes; Lee & Bates, 1985; Minton, Kagan, & Levine, 1971; Power & Chapieski, 1986), there were innumerable opportunities for the mothers to use CP as one of the disciplinary tactics and, as another national survey found, 94% of parents use CP with toddlers (Straus & Stewart, 1999). Thus, the CP scale used for this study does not eliminate the possibility that the children in the zero category experienced CP on rare occasions.

Google Scholar also lists a 2009 talk, explaining that the decrease in corporal punishment explains the recent increase in IQ scores: DIFFERENCES IN CORPORAL PUNISHMENT BY PARENTS IN 32 NATIONS AND ITS RELATION TO NATIONAL DIFFERENCES IN IQ*; here is the abstract:

A previous study found that spanking by parents of two nationally representative age cohorts of children found that the more spanking at the start of the study, the more the child fell behind in development of cognitive ability when tested again four years later. There is also evidence of a world-wide decrease in use of corporal punishment (CP) by parents and of a world-wide increase in IQ. The combination of these three sets of research results suggested the hypothesis that the decrease in use of CP is part of the explanation for increase IQ in many nations. A preliminary test of this hypothesis was tested using data on CP experienced by 17,404 university students in 32 nations and data on national average IQ scores. The results show that the higher the percent of parents who used CP, the lower the national average IQ. These results provide additional evidence on the harmful side-effects of CP. Because the historic decrease in use of CP is accelerating, these results also suggest future gains in national IQ.

Murray seems to be a big gun in the anti-corporal punishment world, so I looked up some more of his stuff, such as

Murray Straus 2010 PREVALENCE, SOCIETAL CAUSES, AND TRENDS IN CORPORAL PUNISHMENT BY PARENTS IN WORLD PERSPECTIVE

This starts with:

This article looks at corporal punishment by parents from several angles–from its links to familial behavior patterns to global variations in its use. First, it describes the prevalence of spanking and other legal forms of corporal punishment (CP) around the world. Second, it presents and illustrates a theoretical model arguing that an important part of the causes of CP are to be found in the nature of society. Third, it presents some of the evidence that a world-wide reduction in the use of CP is taking place. Fourth, it suggests changes in society that may be producing the decrease. The bulk of the research leads to the conclusion that CP has harmful side effects, and that conclusion is an underlying assumption of this article.

At least he is forthright about assuming his conclusion. Anyway, farther down:

How often parents use CP is critically important because many of the adverse effects on children are in the form of a “dose response”–that is, the more frequent the CP, the greater the probability of the adverse side effect. This is illustrated by studies of the relation of CP to depression,8 antisocial and cognitive ability.10 The dose-response pattern is also the basis for the erroneous claim that, when rarely used, spanking is harmless.11

Going to 11 in the references gets to a couple of comments on other articles.

11. See generally Diana Baumrind, Robert E. Larzelere & Philip A. Cowan, Ordinary Physical Punishment: Is It Harmful? Comment on Gershoff (2002), 128 PSYCHOL. BULL. 580 (2002); Robert E. Larzelere, Response to Oosterhuis: Empirically Justified Uses of Spanking: Toward a Discriminating View of Corporal Punishment, 21 J. PSYCHOL. & THEOLOGY 142, 146 (1993).

The main thing I learn from all this is that Straus thinks ever hitting kids is bad, but I don’t see a lot of evidence that “even small amounts of spanking make a difference.” Certainly corporal punishment can be overdone–when I was young, many decades ago, I had friends whose father whipped them badly with his belt, and that was indeed bad. I don’t doubt that even much less than that is bad–I hit my own kid only once, when he didn’t come away from the waves on a dangerous beach when I told him to–but I don’t see evidence that an occasional swat on the bum is going to hurt a kid. I’ve spent only a couple of hours looking into this, so maybe there really is good evidence that hitting kids is always bad, but it smells a lot more like ideology than science.

I think it’s fair to say that the research results on the effects of parents whacking children are not so clear, and that’s kind of inevitable given the observational nature of the data, the difficulty of recall, reporting errors, etc.

Parenting can be tough. Whether you think it’s ok to handle your frustrations with the job by occasionally whacking your kids until they cry, that’s your call. Unfortunately, your kids don’t have much of a say in it, at least not until they’re big enough to fight back.

I don’t think any of the studies under discussion considered the immediate effects of whacking, balancing out the parent’s stress release and feeling of satisfaction in exercising power with the child’s pain and feelings of betrayal.

It’s kind of funny (in the interesting, not the ha-ha sense) to consider only the long-term effects, and none of the intermediate effects, of an immediately violent action that’s pretty much being taken for the purpose of giving immediate satisfaction to the perpetrator.

The Office of Risk Assessment at the Netherlands Food and Consumer Product Authority is looking for an applied statistician with expertise in Bayesian statistics or causal inference

Joost Meekes writes:

At the Netherlands Food and Consumer Product Authority (NVWA), Office of Risk Assessment, we have a vacancy for an applied statistician (or a data scientist with expertise in statistics). We are particularly interested in candidates with knowledge of and experience with Bayesian statistics or causal inference. If you know anyone who might be interested in this position, or if you would publish the vacancy on your blog, we would be most grateful.

The position requires proficiency in Dutch.

The Netherlands Food and Consumer Product Authority is a government agency which oversees a wide variety of domains, working to guarantee public interests including food and product safety, plant health, and animal health and welfare. The position offers the opportunity to work on a wide variety of applied statistical and machine learning problems with societal impact and comes with excellent benefits.

This sounds really cool, also a great opportunity to improve jouw nederlands.

Survey Statistics: Blue Rose Research is hiring !

As readers may know, I’m a survey statistician at Blue Rose Research. We survey the public to forecast elections and test political messages, used to advise Democrats. We’ve announced hiring here a few times (e.g. in April 2025). We’ve discussed our 2024 election retrospective. And now we’re hiring again ! Looking for experts at the intersection of AI and statistical modeling.

We estimate causal effects of in-survey political messages using scaled-up versions of MRP. To get more insight, we connect this pipeline to new AI tools for generation and summarization. We are looking for a teammate with deep expertise in both LLM tools and statistical modeling to build tooling that scales our analyses with care and thoughtfulness. We want a teammate who clearly communicates assumptions, results, and uncertainty. We are a mission-driven team that values kindness and collaboration.

For more details, see the job posting.

  • Salary: $140,000 – $190,000 annually, commensurate with experience.
  • Benefits: Competitive health, dental, and vision coverage; generous leave; and a supportive, mission-driven culture.
  • Work setup: Fully remote team with an NYC office and regular in-person meetups (NYC & DC). Most of our work happens on East Coast time.

Please circulate and apply !

Reanalysis of that Nobel prizewinning study of patents and innovation (with R and Stan code)

A few days ago we discussed a paper from 2005, Competition and Innovation: An Inverted-U Relationship, two of whose authors recently won the Nobel prize in economics.

I had some concerns about the analysis, which I can express with reference to the above figure from that paper:

1. The paper’s all about an inverted U relationship, but this is driven by fitting a quadratic curve rather than, say, a curve with diminishing returns.

2. The line does not seem to go through the data points. In particular, the curve seems to be too low at the rightmost part of the graph–an artifact of fitting a quadratic, perhaps?

3. The y-axis is some weighted count of patents but it’s being used as a measure of the more abstract concept of “innovation.”

4. The x-axis is an average of profit margins but it’s being used as a measure of the more abstract concept of “competition.”

5. The model predicts patents from profit margin in the same year, but to the extent the model is appropriate I think you’d expect a lag.

6. They use Poisson regression, but the data are not counts. Also, if you don’t somehow correct for overdispersion, your standard errors will be too low.

The plan

Here are the ways I’m gonna adjust for the above issues in my reanalysis:

1. I’ll fit a quadratic curve to replicate what they did in the paper, and I’ll also fit another family of curves (the “hinge”) that allows for nonlinearity but without enforcing non-monotonicity.

2. One problem with the above graph is that it excludes 20% of the data (see the figure caption). I’ll plot the fitted curves showing all the data. Another possible reason for the problematic fit is that in the paper they say they adjust for industry and year effects, and so I’ll make a plot showing the fitted curve and the data broken down by industry. I’ll do the adjustments in two ways: “fixed effects” as in the published paper (using various R packages), and multilevel modeling (using Stan).

3, 4. I’ll relabel the axes of the graph to more accurately capture the authors’ measures of competitiveness and innovation, but I’ll swallow the concern that the analysis only uses a subset of the data that were available at the time (from the paper, “Our sample includes all firms with names beginning “A” to “L” plus all large R&D firms. After removing firms involved in large mergers or acquisitions and those with missing data . . .”).

5. This lag thing is an issue, but I have enough concerns with items 3 and 4 that it’s hard for me to take the paper’s theoretical and causal arguments seriously. But, sure, you could replicate my analyses at different lags.

6. I’ll compare four sorts of models: Poisson, quasipoisson, negative binomial, and normal regression on log(y+1). Quasipoisson and negative binomial are two different ways to correct Poisson regression for overdispersion. Modeling on log(y+1) is a completely different way to model data that are mostly positive but have some zeroes, and I think it actually makes the most sense for this example, as the data are not actually counts. When fitting Poisson and negative binomial regressions, I’ll round the data to the nearest integer, which turns out to essentially not affect the results.

The above is not a preregistration plan; I wrote it in the middle of the project after doing some of the analyses already.

The data

Bradford in comments links to a post from 2014 by Leif Nelson and Uri Simonsohn that includes a file linking to the website of Nicholas Bloom, one of the authors of the paper. Navigating the site takes me to this page with a link to a zip file with the data, which I then downloaded.

Good job by Bloom to post the data from a paper published in 2005. I’m not so good with data availability myself.

The data file has 354 records, including data from 1973-1994 and from 17 industries, and for each of these records it has the weighted patent count (“patcw”), the measure of profit margin (“Lc”), and a bunch of variables that I didn’t try to figure out, because these are all I need to replicate the main analysis. (The file does not seem to include a code book, but I’m not complaining, as they’re already way ahead of me by making the data available at all.)

Initial data analysis: quadratic regression using Poisson, quasipoisson, and negative binomial models

I get going by plotting the data and fitting some quadratic models:

Compared to the graph at the top of this post, the above plot shows more data (I’m not trimming the upper and lower 10% of observations), and the quadratic curves are much flatter than were shown in that paper:

(1) green curve: poisson fit to rounded data (using glm() in R)
(2) red curve: quasipoisson fit to rounded data (using glm())
(3) blue curve: quasipoisson fit to raw data (using glm())
(4) pink curve: negative binomial fit to rounded data (using glm.nb())
(5) orange curve: negative binomial fit to rounded data (using stan_glm())

The first four curves are so similar that the lines overlap and you can’t see the green and red curves: They’re there, just overwritten. For each model I display the point estimate of the curve (best fit for models 1-4, posterior median for model 5). In the models above, quasipoisson uses the Poisson fit and then adjusts standard errors to account for overdispersion; negative binomial is a probability distribution that accounts for overdispersion in a different way.

We went from 1 to 2 to see if switching to quasipoisson made a difference (it did; the standard error was much bigger after we allowed for overdispersion); we went from 2 to 3 to see if rounding made a difference (it didn’t for these data); we went from 3 to 4 to see if switching to negative binomial made a difference (it didn’t do much, but the standard error increased slightly); and we went from 4 to 5 to see if switching to full Bayes made a difference (it looked a lot different, actually).
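For readers who have not seen the Poisson-to-quasipoisson step before, here is a minimal sketch on simulated data. This is not the code or the data from this analysis: the predictor x, the mean function, and the negative binomial size parameter are all made up, chosen only so that the counts are strongly overdispersed.

```r
# Simulated, overdispersed counts; all settings below are invented for illustration.
set.seed(1)
n  <- 500
x  <- runif(n, 0.85, 1)                  # hypothetical predictor, roughly on the scale of Lc
mu <- exp(1 + 2 * x)                     # made-up mean function
y  <- rnbinom(n, size = 0.5, mu = mu)    # variance is mu + mu^2/size, far above the mean

fit_pois  <- glm(y ~ x, family = poisson)
fit_quasi <- glm(y ~ x, family = quasipoisson)

# The two fits share point estimates; only the standard errors differ.
se_pois    <- summary(fit_pois)$coefficients[, "Std. Error"]
se_quasi   <- summary(fit_quasi)$coefficients[, "Std. Error"]
dispersion <- summary(fit_quasi)$dispersion

print(cbind(coef = coef(fit_pois), se_pois, se_quasi))
print(c(dispersion = dispersion,
        se_ratio_squared = unname((se_quasi / se_pois)[2]^2)))
```

The quasipoisson standard errors are the Poisson ones multiplied by the square root of the estimated dispersion, which is why the coefficients in models 1 and 2 match while the uncertainties do not.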

I’ll put the code at the end of this post so as not to distract from the story. Here are the relevant pieces of console output:

(1) poisson fit to rounded data
            coef.est coef.se
(Intercept) -74.82    25.85 
Lc          164.76    54.90 
I(Lc^2)     -88.39    29.14

(2) quasipoisson fit to rounded data
            coef.est coef.se
(Intercept) -74.82    84.64 
Lc          164.76   179.78 
I(Lc^2)     -88.39    95.44

(3) quasipoisson fit to raw data
            coef.est coef.se
(Intercept) -75.01    84.01 
Lc          165.13   178.43 
I(Lc^2)     -88.55    94.72 

(4) negative binomial fit to rounded data
            Estimate Std. Error
(Intercept)   -80.80      89.49
Lc            177.20     190.10
I(Lc^2)       -94.85     100.92

(5) negative binomial fit to rounded data (Bayesian)
            Median MAD_SD
(Intercept)  -7.1   33.2 
Lc           21.1   69.5 
I(Lc^2)     -12.2   37.3

In a paper I’d prefer to display these uncertainties graphically; here I’m giving the console output to give a sense of how things might go in our usual workflow of fitting models and looking at them. Here you can see that the quadratic terms all have large standard errors–except for the very first model, the Poisson, but its standard error is too low as it does not account for overdispersion.
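As a quick check on the size of the overdispersion, the standard errors printed for models (1) and (2) can be compared directly. The numbers below are copied from the console output; the variable names are mine, and the calculation relies on the general fact that quasipoisson standard errors equal the Poisson ones times the square root of the estimated dispersion.

```r
# Standard errors copied from the console output for models (1) and (2).
se_poisson      <- c(intercept = 25.85, Lc = 54.90,  Lc2 = 29.14)
se_quasipoisson <- c(intercept = 84.64, Lc = 179.78, Lc2 = 95.44)

se_ratio           <- se_quasipoisson / se_poisson
implied_dispersion <- se_ratio^2

print(round(se_ratio, 2))
print(round(implied_dispersion, 1))

# Quadratic coefficient divided by its standard error under each model.
z_quadratic <- -88.39 / c(poisson = 29.14, quasipoisson = 95.44)
print(round(z_quadratic, 2))
```

The ratio is about 3.3 for every coefficient, implying a dispersion of roughly 10.7, and the quadratic term goes from about 3 standard errors from zero in the Poisson fit to less than 1 once overdispersion is allowed for.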

The only thing that really puzzles me here is what’s going on with model 5. Could it be that the Bayesian model uses the posterior median rather than the optimum? I’ll check by re-fitting it, running stan_glm on “optimizing” setting:

(5_opt) negative binomial fit to rounded data (Bayesian posterior mode)
            Median MAD_SD
(Intercept)  -5.3   36.8 
Lc           17.1   76.7 
I(Lc^2)     -10.9   39.4 

That’s not much different from the posterior median. So I guess the difference between the negative binomial regressions fit by glm.nb() and stan_glm() arises from differences in the fitting algorithms, or maybe something I missed in the coding. If I wanted to proceed further down this track I would have to investigate this a bit further, reading up on what glm.nb is actually doing in R, programming the negative binomial model myself in Stan from scratch, and also comparing to brms.

For now I’ll just move on. My guess is that the difference in fits has something to do with how seriously the model takes the small number of influential data points near the top of the graph, but I’ll set that aside for now because we’ll be fitting some more models.

Quadratic regression adjusting for industry and year effects

The next step is to adjust for industry and year effects, which I’ll do in a few ways. I’ll aid in the interpretation of these by plotting the data and fitted curve separately for each industry. In the data the industry codes took on 17 different values ranging from 22 to 49, which according to the paper correspond to “two-digit SIC codes.” So I googled “two-digit SIC codes,” which are listed in various places online. I could not find an official link, but various unofficial sources seemed to agree; here’s one such list.

For each plot, black dots show the data for that industry, with a large dot showing the data from the first year of the data and thin black lines showing the time sequence. The colored curves are the fitted quadratics for each industry in an average year, as explained below.

It’s kind of weird that Furniture has so many patents, and that the number of patents for machinery and computers started out high and then dropped, and that electric and gas services had no patents for the first few years and then suddenly had a lot . . . this is just what’s in the data. I checked a few things to make sure I didn’t garble the categories but it’s possible that I’m missing something.

Here are the models I fit:

(3a) blue curve: quasipoisson fit to rounded data, including factors for industry and year (using glm())
(4a) pink curve: negative binomial fit to rounded data, including factors for industry and year (using glm.nb())
(5a) orange curve: negative binomial fit to rounded data, including factors for industry and year (using stan_glm())
(6a) purple curve: multilevel negative binomial fit to rounded data, with varying intercepts for industry and year (using stan_glmer())

Models (3a), (4a), and (5a) include unmodeled coefficients for industry and year (“fixed effects,” in economics jargon); model (6a) considers these coefficients as latent variables and estimates their distributions (this is what economists call “random effects”).

From the above graph you can see that, after adjusting for industry and year effects, the quadratic curves are much stronger. Most of this comes from the industry effects; there’s not much evidence for unexplained variation at the year level. Here’s the relevant console output from model (6a), which conveniently estimates the scale of each batch of varying intercepts:

stan_glmer
 family:       neg_binomial_2 [log]
 formula:      patcw_rounded ~ Lc + I(Lc^2) + (1 | sic2) + (1 | year)
 observations: 354
------
            Median MAD_SD
(Intercept) -68.3   27.1 
Lc          146.6   57.5 
I(Lc^2)     -77.9   30.9 

Auxiliary parameter(s):
                      Median MAD_SD
reciprocal_dispersion 5.6    1.0   

Error terms:
 Groups Name        Std.Dev.
 year   (Intercept) 0.11    
 sic2   (Intercept) 2.11    
Num. levels: year 22, sic2 17

In this fitted model (6a), and also in (4a) and (5a), not shown here, the estimated coefficient for the quadratic term is a bit more than 2 standard errors away from zero, that is, statistically significant at the conventional level. The estimated quadratic term in (3a), the quasipoisson regression, is a bit more than 4 standard errors from zero; I guess this difference is attributable to the different ways that the quasipoisson and negative binomial effectively weight the extreme values in the data.
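To show what “including factors for industry and year” means in a model like (3a), here is a sketch on simulated data. It is not the code used for the fits above: the data frame d, its columns, the industry intercepts, and the true curve are all invented, and the design is balanced (17 industries by 22 years), unlike the real data with its 354 records.

```r
# Simulated panel; all values are invented for illustration.
set.seed(2)
n_ind <- 17; n_year <- 22
d <- expand.grid(industry = factor(1:n_ind), year = factor(1:n_year))
ind_effect <- rnorm(n_ind, 0, 1)          # made-up industry intercepts on the log scale
d$x <- runif(nrow(d), 0.85, 1)            # hypothetical predictor, roughly on the scale of Lc
u <- d$x - 0.85
# True curve is quadratic in x with coefficient -200 on x^2; there are no year effects.
d$y <- rnbinom(nrow(d), size = 5,
               mu = exp(2 + ind_effect[as.integer(d$industry)] + 40 * u - 200 * u^2))

# Pooled fit vs. fit with unmodeled ("fixed effects") coefficients for industry and year.
fit_pooled <- glm(y ~ x + I(x^2), family = quasipoisson, data = d)
fit_fe     <- glm(y ~ x + I(x^2) + industry + year, family = quasipoisson, data = d)

quad <- function(fit) summary(fit)$coefficients["I(x^2)", c("Estimate", "Std. Error")]
print(rbind(pooled = quad(fit_pooled), with_factors = quad(fit_fe)))
```

The factor terms add one coefficient per industry and per year beyond the baseline levels, absorbing between-industry differences in level so that the curve is estimated from variation within industries. The multilevel version in (6a) would replace those factors with (1 | industry) + (1 | year) terms in a package such as rstanarm; that step is not shown here.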

Quadratic regressions on log(y+1)

When modeling count data we usually start with the negative binomial model with log link. And here I wanted to connect to whatever version of Poisson regression was used in that published paper.

But, as noted above, these data aren’t counts–they’re not even integers, and I think it makes sense to just directly model them on the log scale. Some of the observations have zero values, though, and so I’ll follow the standard practice of modeling nonnegative data y by fitting regressions to log(y+1). My general recommendation along these lines is to model log(y+A), where A is some constant that corresponds to a baseline level of the data. In this case, the 354 data points include 46 zeros, then another 60 values between 0 and 1, then various values (none of them are integers!) ranging as high as 44.7. For the purpose of studying patent counts as measures of innovation, it seems reasonable to add 1 to these data, which blurs the lowest values (there is no big distinction between 0 and the lowest nonzero observation, 0.037) while preserving the distinction between the higher levels. This seems about right: to the extent these data will be supplying a signal on innovation, we’ll want to mostly be learning from data on the high end.
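Here is a small illustration of what the log(y+1) transformation does at the low and high ends. The values 0, 0.037, and 44.7 are the ones quoted above; 1 and 10 are filler values added only to show the middle of the range.

```r
# 0, 0.037, and 44.7 come from the text; 1 and 10 are filler for illustration.
y <- c(0, 0.037, 1, 10, 44.7)
print(data.frame(y = y, log_y_plus_1 = round(log1p(y), 3)))

# Differences on the transformed scale at the two ends of the data.
low_gap  <- log1p(0.037) - log1p(0)     # between zero and the lowest nonzero value
high_gap <- log1p(44.7) - log1p(10)     # between a moderately high and the highest value
print(round(c(low_gap = low_gap, high_gap = high_gap), 3))
```

On this scale, zero and the smallest nonzero observation are nearly indistinguishable, while differences among the larger values remain substantial. A plain log(y) would instead send the 46 zeros to minus infinity and stretch out the tiny positive values.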

Here’s what happens when we fit the quadratic regression, not adjusting for industry and year effects, to the log(y+1) transformed data. Now we can just use normal errors and we don’t have to worry about Poisson or negative binomials, so there’s just one curve:

You can see the zeros at the very bottom of the graph. The model of normally-distributed errors:

(7) green curve: normal regression fit to log(y+1) (using lm())

is not perfect, but I think it’s close enough, and we no longer have to worry about exactly how we’re modeling extremely high values. On the log scale we still see a fitted quadratic curve. The fit is noisy–the coefficient for the quadratic term has a standard error that is much higher than the estimate itself–but, again, let’s move on to the regressions that adjust for industry and year effects.

As before, we show the data and fit (this time, with both on the log(y+1) scale) broken out by industry:

The curves come from two fitted models:

(8) orange curve: normal regression fit to log(y+1), including factors for industry and year (using stan_glm())
(9) purple curve: multilevel regression fit to log(y+1), with varying intercepts for industry and year (using stan_glmer())

They’re pretty similar; I assume that the small differences between the two fits arise from the fact that the least-squares model (8) does more adjustment for industries. The estimated coefficients for the quadratic terms are 3 or 4 standard errors from zero. Here’s the output from model (9):

 family:       gaussian [identity]
 formula:      log_patcw ~ Lc + I(Lc^2) + (1 | sic2) + (1 | year)
 observations: 354
------
            Median MAD_SD
(Intercept) -86.8   24.6 
Lc          186.6   52.5 
I(Lc^2)     -98.6   27.9 

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.5    0.0   

Error terms:
 Groups   Name        Std.Dev.
 year     (Intercept) 0.09    
 sic2     (Intercept) 1.04    
 Residual             0.47    
Num. levels: year 22, sic2 17

Hinge regression on log(y+1)

As we have discussed already, the big problem with quadratic regression is that it enforces non-monotonicity. I’d like to fit a model that allows nonlinearity but without a declining slope automatically turning into a negative slope.

My first thought was to fit a model with an asymptote, something like this: b_0 + b_1*(1 – exp(-x/b_2)), as an alternative. But this curve has the opposite problem: it is restricted to be monotonic. What I really want is a model that can have diminishing returns or can asymptote or can have that inverted U shape, with the question answered by the data.

We have such a model–the hinge function:


This is a curve that smoothly connects two straight lines which can have arbitrary slopes. The parameters of the curve are:
x0: the value of x where the two lines would intersect
a: the value of y where the two lines would intersect
b0: the slope of the line on the left side of the hinge
b1: the slope of the line on the right side of the hinge
delta: the scale of the continuous curve connecting the two lines

For our purposes, we are most interested in b1: is there evidence that this slope is negative?

We program the hinge model in Stan–the code’s right there at the linked post–also including a multilevel model with varying intercepts for industry and year, as above. Here’s the whole program:

functions {
  vector logistic_hinge(vector x, real x0, real a, real b0, real b1, real delta) { 
    vector[size(x)] xdiff = x - x0;
    return a + b0 * xdiff + (b1 - b0) * delta * log1p_exp(xdiff / delta);
  }
}
data {
  real<lower=0> delta;
  int N;
  vector[N] x, y;
  int J1, J2;
  array[N] int<lower=1,upper=J1> group1;
  array[N] int<lower=1,upper=J2> group2;
}
parameters {
  real x0;
  real a, b0, b1;
  real<lower=0> sigma, sigma1, sigma2;
  vector<offset=0, multiplier=sigma1>[J1] a1;
  vector<offset=0, multiplier=sigma2>[J2] a2;
}
model {
  x0 ~ normal(1, 1);
  a ~ normal(0, 100);
  b0 ~ normal(0, 100);
  b1 ~ normal(0, 100);
  a1 ~ normal(0, sigma1);
  a2 ~ normal(0, sigma2);
  y ~ normal(logistic_hinge(x, x0, a, b0, b1, delta) + a1[group1] + a2[group2], sigma);
} 

When fitting the model, we specify the scale parameter delta, setting it to 0.05, which allows for a gentle curve within the range of the data (as you can see from the plots above, x ranges from about 0.85 to 1.0). There’s some arbitrariness here, but it’s just too hard to fit this parameter directly from the data. In effect this curvature is hard-coded into the quadratic regressions fit earlier.
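To see what the hinge function is doing, here is a minimal Python transcription of the formula in the Stan function above, with a finite-difference check of the slopes. The parameter values are made up for illustration (they are not the fitted values); only delta = 0.05 is taken from the text.

```python
import math

def log1p_exp(z):
    """Numerically stable log(1 + exp(z))."""
    if z > 0:
        return z + math.log1p(math.exp(-z))
    return math.log1p(math.exp(z))

def logistic_hinge(x, x0, a, b0, b1, delta):
    """Same formula as the Stan function above, for a scalar x."""
    xdiff = x - x0
    return a + b0 * xdiff + (b1 - b0) * delta * log1p_exp(xdiff / delta)

def slope(x, params, h=1e-6):
    """Central finite-difference slope of the hinge at x."""
    up = logistic_hinge(x + h, **params)
    down = logistic_hinge(x - h, **params)
    return (up - down) / (2 * h)

# Illustrative values only: hinge at x0 = 0.93, slopes +30 on the left and -30 on the right.
params = dict(x0=0.93, a=4.0, b0=30.0, b1=-30.0, delta=0.05)

for x in [0.5, 0.85, 0.93, 1.0, 1.4]:
    print(f"x = {x:4.2f}   hinge = {logistic_hinge(x, **params):7.3f}   slope = {slope(x, params):7.2f}")
```

Far to the left of x0 the slope approaches b0, far to the right it approaches b1, and at x0 it is their average. Within the range of the data (roughly 0.85 to 1.0) the slope is still in transition, which is the "gentle curve" that this choice of delta produces.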

We assign weak priors to the other parameters in the model, effectively excluding extreme values for the slopes of the curves. The only one of these priors that might be confusing is the prior for x0, the x-position of the hinge. We’re soft-bounding it at the high and low ends just so that the fit won’t get lost in extreme values: once x0 is far outside the range of the data, the curve effectively becomes a straight line and the location of the hinge becomes non-identified.

So, yeah, the hinge is a bit more work to fit compared to the quadratic, but that’s the price we have to pay to fit a more flexible model to this small dataset. And in this case I think the flexible model is absolutely necessary given the goal of seeing whether the data indicate that inverted U shape.

Here’s the result:

In each plot, the blue curve represents the posterior median of the parameters and the red curves correspond to 20 draws from the posterior distribution.

Here’s the summary of inferences:

 variable   mean median    sd   mad      q5    q95 rhat ess_bulk ess_tail
   lp__    71.69  71.85  6.65  6.87   60.81  82.19 1.00      838     1428
   x0       0.94   0.93  0.07  0.09    0.82   1.05 1.04      132     1204
   a        4.03   4.03  0.72  0.69    2.84   5.23 1.01     1178     1549
   b0      52.71  36.93 42.67 24.45   15.29 146.35 1.03      251     1288
   b1     -50.66 -31.22 46.50 22.85 -149.89 -11.44 1.03      146      883
   sigma    0.47   0.47  0.02  0.02    0.44   0.50 1.00     4279     2726
   sigma1   1.11   1.08  0.21  0.20    0.81   1.50 1.01      880     1217
   sigma2   0.08   0.08  0.04  0.04    0.02   0.16 1.00      999     1301
   a1[1]   -0.25  -0.25  0.30  0.30   -0.74   0.26 1.00      486      736
   a1[2]    0.23   0.22  0.30  0.30   -0.26   0.75 1.00      491     1036
...

The key parameter is b1, the slope on the right side of the curve. The posterior mean is -50.66 with standard error 46.50–so, apparently, not statistically significant–but if you look at the posterior quantiles, you’ll see that the distribution is very skewed. Indeed, it turns out that all 4000 of the posterior simulation draws of b1 are negative here.

So, after all this modeling, it seems that the data do clearly indicate an inverted U pattern!

That said, I haven’t explored the hinge model too carefully. For example, when I re-fit it with some other choices of the hinge scale parameter delta, it sometimes doesn’t mix well. I think the above graph with delta=0.05 fits these points about as well as is possible, so I’m ok with it as a data summary, but I’m not saying that my fitting procedure is fully computationally robust. If I were to be working more on the computation for this particular problem, I’d start by removing the year effects, as they can be adding stress to the computation without affecting the fit in any meaningful way.

Before getting to the interpretation of these results when it comes to competition and innovation, I want to tie up a couple of loose ends.

Quadratic regression on log(y+1), programmed in Stan

In fitting the above multilevel hinge regression, I moved from the pre-programmed Stan code of stan_glmer() to a custom Stan program. Just to check that nothing funny is going on here, I’ll go back and program the multilevel quadratic regression directly in Stan. It’s easy enough to write the program; I just replace the hinge by a quadratic function and get rid of the priors, and, indeed, we get essentially the same result as when fitting using stan_glmer(). I won’t bother showing the new graph here. No surprise, it’s just good to check.

The other thing we can do with the Bayesian fit is see whether the peak of the curve falls within the range of the data. The curve b0 + b1*x + b2*x^2 has its peak at x = -b1/(2*b2) (just take the derivative, set to 0, and solve for x), so we can compute the posterior distribution of this value from our simulations. It turns out that only 3 of the 4000 simulated curves have a peak outside the range of the data, so fair enough that, according to the fitted quadratic model (which I don’t like), the inverted U pattern is occurring within those bounds.
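The peak calculation can be sketched in a few lines of Python. This is only a plug-in illustration using the posterior medians printed for model (9) above, plus a made-up list of draws; the function names and the toy draws are assumptions, and the real computation would loop over the 4000 posterior simulations.

```python
def quadratic_peak(b1, b2):
    """x-location of the stationary point of b0 + b1*x + b2*x^2 (a maximum when b2 < 0)."""
    if b2 == 0:
        raise ValueError("no stationary point when b2 == 0")
    return -b1 / (2.0 * b2)

def share_of_peaks_inside(draws, lo, hi):
    """Fraction of (b1, b2) draws with negative curvature and a peak inside [lo, hi]."""
    inside = sum(1 for b1, b2 in draws
                 if b2 < 0 and lo <= quadratic_peak(b1, b2) <= hi)
    return inside / len(draws)

# Plug-in check with the medians from the model (9) output: Lc = 186.6, I(Lc^2) = -98.6.
peak = quadratic_peak(186.6, -98.6)
print(f"plug-in peak at x = {peak:.3f}")

# Hypothetical draws, invented for illustration; x ranges from about 0.85 to 1.0 in the data.
toy_draws = [(186.6, -98.6), (150.0, -70.0), (120.0, -75.0)]
share = share_of_peaks_inside(toy_draws, lo=0.85, hi=1.0)
print(f"share of toy draws with peak inside the data range: {share:.2f}")
```

The plug-in peak lands inside the range of the data, which is consistent with nearly all the simulated curves peaking there.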

Adjusting for the group-level mean of the predictor

In general when fitting a multilevel model, it’s a good idea to adjust for the group-level mean of the predictor–in this case, the average value of x within each industry. Otherwise you have to worry about correlations between x and the varying intercepts. In this case, adding this group-level predictor doesn’t change much. I won’t share the result here just because I’ve already done a lot of work to write this up and there are enough other concerns with the analysis, but, yeah, it’s a good idea to include this predictor too.
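For readers who haven’t done this before, here is a minimal Python sketch of constructing the group-level mean predictor. The data are toy values and the function name is an assumption; in the actual analysis the groups would be the industries and x the competition measure.

```python
from collections import defaultdict

def group_mean_predictor(x, group):
    """Return, for each observation, the mean of x within that observation's group."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for xi, g in zip(x, group):
        sums[g] += xi
        counts[g] += 1
    means = {g: sums[g] / counts[g] for g in sums}
    return [means[g] for g in group]

# Toy data: two industries with different average levels of x.
x = [0.90, 0.92, 0.94, 0.86, 0.88]
industry = ["A", "A", "A", "B", "B"]

x_bar = group_mean_predictor(x, industry)          # group-level predictor
x_within = [xi - m for xi, m in zip(x, x_bar)]     # within-group deviation

for row in zip(industry, x, x_bar, x_within):
    print(row)
```

Including x_bar as an additional predictor lets the model separate between-industry differences from within-industry variation, rather than leaving the former to be absorbed by the varying intercepts.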

Summary

OK, so what have we learned?

• The graph from the original paper (reproduced at the top of the above post) did not show a good fit because it didn’t display the adjustments for industry. We can see the pattern a lot more clearly using separate plots for each industry.

• I think it makes more sense all around to model these data on the log(y+1) scale rather than using Poisson regression or any of its variants.

• The patterns within and between industries aren’t so clear to me. I don’t get why furniture has so many patents, why the number of patents for machinery and computers dropped, why the number of patents for electric and gas services shot up, and so forth. I’m not saying these numbers are wrong, and I might have made some mistake in coding; I’m just saying I’m baffled.

• I think the hinge model is much more appropriate than the quadratic for the purpose of seeing the extent to which the data support an inverted U pattern. It wasn’t hard at all to program, debug, and fit the hinge model, and in this case it does support the inverted U. So in that sense the published paper was correct in its statistical conclusion, even if I think they got kinda lucky in finding the pattern with their quadratic curves.

• I still don’t buy the claim in the paper that they “find strong evidence of an inverted U relationship” between “product market competition and innovation.” Again, my main problem here is their measure of innovation using average weighted patent counts, an issue that further bothers me given the odd data patterns seen in the graphs for some of the industries, also with the selection in the data (“Our sample includes all firms with names beginning “A” to “L” plus all large R&D firms. After removing firms involved in large mergers or acquisitions and those with missing data . . .”).

• Also the model predicts y from x in the same year, and we’d expect a lag. I didn’t bother addressing this because of my other concerns about the data.

I guess that later work followed up with more comprehensive data sources, but I remain concerned that any inverted U pattern, even if supported by this particular dataset, could depend strongly on various artifacts of the measures they are using. There aren’t really a lot of data points at high values of x here, so this downward slope is defined by how these patents are being counted in just a couple of industries.

All this might sound like picky criticism, but it’s the authors of the paper (and the Nobel prize committee), not me, who are making these broad claims about competition and innovation, and concerns about measurement seem really important here.

That said, it’s cool to see that a careful statistical analysis does find the inverted U. It does seem like a real pattern in the data, so the only question is the relevance of these data to the larger economics question.

tl;dr

In response to one of the comments below, I wrote that I don’t at all trust the idea of using time variation of patents within an industry to measure time variation of innovation. I get that the authors were doing their best using available data; still, I don’t think their data and analysis provide evidence for their substantive conclusions.

In the immortal words of John Tukey, “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”

Looking at it that way, my statistical analysis with the transformations and the quadratic curves and the hinge function was all a waste of time, as I’m just doing a fancy analysis of bad (or, at least, irrelevant) data. But . . . summarizing the statistical information still has value in itself. My analysis just took a few hours, a small fraction of the time the authors must have spent on preparing, analyzing, and interpreting their data. I think it’s important, when doing statistical analysis, to do the best we can do, in this case accounting for all these sources of variation and uncertainty.

Or, to put it another way, yes, it would have been fine to dismiss the published results entirely given the problems with the data. But some of the problems with the data became apparent only after I made those plots showing the time series for each industry. At that point, all the fitting may well have been a waste of time–but I only thought of making the plots because I was trying to understand the puzzling fit.

All the links connecting theory, measurement, data, analysis, and conclusions are important.

P.S. All these data and measurement problems just leap out at us, but there’s a whole world of people out there who just accept these conclusions–even some of the graphs from that 2005 paper–as truth. For example I came across this post by political journalist Matthew Yglesias that just straight up accepts the iffy empirical claims from that paper. It’s tricky–journalists are busy, and it’s natural to think that a much-cited paper written by two Nobel prize winners and published in a top journal has to be correct. I don’t really know what to say, except that this is one of the useful functions of social media, to allow us to push back against default narratives. Not that the default is necessarily wrong, just that it can be wrong, and it’s hard for journalists to escape the bubble and recognize this.

P.P.S. There was some concern about the arbitrariness of the log(y+1) transformation so I also fit the model to sqrt(y), which is a standard variance-stabilizing transformation for count data. The results looked essentially the same:

So the inverted U really seems to be there–but you have to assume the curve has the same shape for all industries. If you fit a separate curve for each industry, there’s no way you’d find the U, and you can see that in the scatterplots. The trouble is that the data are too sparse and variable to try to estimate a separate curve for each industry. Just look at electric and gas services, for example. Or passenger transit.

Data and code

Generalizing Treatment Effects from Trials to EHR Populations (Qixuan Chen’s talk this Tues morning)

My biostatistics collaborator is speaking this Tues 21 Oct, 11am at the NYU Department of Population Health, room CR 314 and on zoom:

Generalizing Treatment Effects from Trials to EHR Populations using Propensity Score Predictive Inference

Although randomized controlled trials provide strong internal validity, they often lack external validity when generalized to populations. This limitation, known as generalizability, arises when trial participants are not representative of the target population. To address this, we develop an interaction-based Propensity Score Predictive Inference (PSPI) method that leverages propensity scores for trial participation combined with flexible outcome models. We introduce two robust PSPI variants that estimate potential outcomes across treatment groups by incorporating natural cubic splines of the propensity score and modeling high-dimensional covariates with Bayesian Additive Regression Trees.

Generalization is important!

“All Our Default Models Are Wrong: Causal inference for varying treatment effects”: my talk this Saturday morning in Ottawa

It’s at this colloquium on meta-analysis in economic research, Sat 18 Oct 2025, 9:30am:

All Our Default Models Are Wrong: Causal inference for varying treatment effects

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

Everybody knows that effects can vary, but the usual models we fit do not account for structure in the variation. This is relevant for generalization from sample to population, for anticipating changes over time, and when designing new studies and analyzing existing data. We discuss several directions for going beyond the usual additive model, along with the challenges of fitting such models and interpreting the results, which tend not to reach conventional “statistical significance.”

Perhaps with audience participation there will be a discussion of the particular relevance of this work to economics.

Questions about statistical claims in paper from recent Nobel prize winners; some general challenges in trying to understand nonlinear patterns using quadratic regression

A colleague who specializes in the natural selection of bad science points us to this article from 2005, Competition and Innovation: An Inverted-U Relationship, by Philippe Aghion, Nick Bloom, Richard Blundell, Rachel Griffith, and Peter Howitt.

But the above graph (if you just look at the dots and ignore the curve!) does not look like an inverted-U or like anything non-monotonic. It looks like a flattening curve with diminishing returns.

What happened?

Here’s what it says in the article:

In Figure I we show the scatter of data points in between the tenth and ninetieth deciles of the citation-weighted patent distribution, and overlay a fitted exponential quadratic curve. The same exponential quadratic curve is plotted together with a spline approximation in Figure II. It can be seen that the exponential quadratic specification provides a very reasonable approximation to the nonparametric spline, and that they both show a clear inverted-U shape.

OK, so the problem is with this “exponential quadratic curve.” It’s not that the data show an inverted U, it’s that the inverted-U is being induced by the quadratic functional form.

I don’t have the data or code from this article, but I’m guessing that if you simulated data from an underlying model where E(y|x) is an increasing function of x but with declining rate of increase, that this quadratic fit could easily find an inverted U-shape.
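Here is a minimal sketch of that kind of simulation in Python. Everything in it is an assumption made for illustration: the saturating curve 10*(1 – exp(-5x)), the noise level, the sample size, and x drawn uniformly on (0, 1). It has nothing to do with the actual patent data. The curve is strictly increasing, and we fit a quadratic by least squares and look at where the fitted curve peaks.

```python
import math
import random

def fit_quadratic(xs, ys):
    """Least-squares fit of y = c0 + c1*x + c2*x^2 via the normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]                    # sums of x^k
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]    # sums of y*x^k
    A = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]      # augmented 3x4 system
    for col in range(3):                                               # elimination with pivoting
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for k in range(col, 4):
                A[r][k] -= f * A[col][k]
    c = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):                                                # back substitution
        c[i] = (A[i][3] - sum(A[i][j] * c[j] for j in range(i + 1, 3))) / A[i][i]
    return c

def true_mean(x):
    """Monotonically increasing curve with diminishing returns (assumed for illustration)."""
    return 10.0 * (1.0 - math.exp(-5.0 * x))

random.seed(1)
n = 200
xs = [random.random() for _ in range(n)]
ys = [true_mean(x) + random.gauss(0.0, 0.5) for x in xs]

c0, c1, c2 = fit_quadratic(xs, ys)
peak = -c1 / (2.0 * c2)
print(f"quadratic coefficient: {c2:.2f}, fitted peak at x = {peak:.2f}")
```

In this setup the fitted quadratic has negative curvature and peaks well inside the range of x, so it shows a decline on the right even though the curve that generated the data never goes down.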

We’ve seen this happen before, in a notorious paper by some psychologists that claimed that, in sports, “Top talent benefited performance only up to a point, after which the marginal benefit of talent decreased and turned negative”–but when you look at the data, there is no such negative turn. The reported negative turn, or inverted-U shape, arose entirely from (a) the data being consistent with diminishing but positive returns, and (b) the quadratic curve being too restrictive. Here was their fitted curve:


and here were their data (ignore all the lines on the graph and just look at the dots):


As with the econ paper under discussion today, if you fit a quadratic curve you get this inverted-U shape, but if you look at the data, all you see is a flattening of the slope.

Another issue that arises in both these examples is that the predictor has an upper bound at 1, which means that, even if the quadratic model is correct, you can have a negative curvature–that is, a negative coefficient on the quadratic term–without there being a decline in the curve in the range of the data. So looking at the estimate and significance of the coefficient on the quadratic term is not enough. In a practical sense this shouldn’t matter because you shouldn’t be routinely fitting quadratic curves–they have the well-known problem that the fitted curve can look like a U or inverted U even if the data pattern is monotonic–but if you do this, you can’t just look at the coefficient.

But let’s continue with the paper under discussion. Here’s their Figure II:

In the above-quoted paragraph, the authors accurately state that both curves show a clear inverted-U shape.

Fine. But what about the data? In particular, how is it that the nonparametric curve goes down so fast at the right of the graph? The curve goes all the way down to E(y|x) = 2.5 at the extreme value of x=1. But if you look at the data in Figure I, there’s this whole cluster of points at the upper right, and, at least based on these data, E(y|x) is around 8 or so in the region where x=1.

I can’t figure out what’s going on. My best guess is that the fitted quadratic-like curve is what you get after adjusting for other predictors not included in the graph–from Table I, these include year effects, industry effects, and some other predictors–but I’m not sure, and it still seems weird that they’re plotting a fitted curve that isn’t close to the empirical pattern of E(y|x) in the key region of the data where they’re reporting a decline.

One possibility is that the data in the upper right of Figure I “don’t count” in that they all belong to one or two industries that have high levels of competition and high levels of innovation, so that this pattern is accounted for in the industry effects in the model. But, if that’s the case, I’m still concerned, because this sort of pattern between industries would still be relevant to the question of the correlations of competition and innovation. They write, “It is very likely that different industries will have observed levels of patenting activity that have no direct causal relationship with product market competition, but reflect other institutional features of the industry. Consequently, industry fixed effects are essential to remove any spurious correlation or ‘endogeneity’ of this type.” And I kind of get this, but to the extent that industries with lower profit margins have more patents, that could be relevant too. At the very least, I’d like to see this in the data. Once they subtract industry effects, they’re getting leverage from changes over time within industries, and these could just represent parallel time trends, no? In some sense this is addressed by their instrumental variables analysis described on pages 708-710, but in any case I still have concerns about their claimed inverted-U.

Again, it’s just crazy that their fitted curve doesn’t even go through the data. This is a self-defeating graph on the order of the notorious air pollution in China regression. Again, I haven’t seen the data and there could well be some way around this problem, but, if so, the authors should at least address the problem and explain why they believe this inverted-U pattern to be true in some underlying sense, even though it does not appear in the data.

“Inverted-U” is in the title of the paper!

And then the article has a long section, “Explaining the Inverted U.” So they’re really invested in the idea. For example:

But . . . what if you’re explaining something that isn’t really happening! Again, see Figure I.

So I don’t know what to think.

From these graphs, it looks like their pattern is an artifact of including a quadratic (rather than, say, a saturation function such as y = a*(1 – exp(-bx))), in their model, and, as noted above I’ve seen acclaimed researchers do this sort of thing before. Also the statistical analysis includes questionable confirmatory statements such as “Again, we find an inverted-U shape, although due to a substantially smaller sample, the coefficients are not statistically significant.” Also I’ve seen problems with others’ analysis of patents; see for example here and here. Data from patents can be tricky to analyze.

In addition to my concern about using patents as a proxy for innovation, I don’t know what to think about using 1 – profit margin as a proxy for competition (see pages 704-705 of the paper), both when comparing across industries and over time. At the very least, I’d prefer if they’d talk about “patents” and “profit margin” rather than “innovation” and “competition” throughout. That’s just a change in words but I think it would make the issues a lot clearer.

There are also some other data issues, like what industries they are considering, and why they’re doing analysis at the industry rather than firm level, and selection (“Our sample includes all firms with names beginning “A” to “L” plus all large R&D firms. After removing firms involved in large mergers or acquisitions and those with missing data . . .”), and the question of whether it even makes sense to try to predict number of patents (or even “innovation”) from the average profit margin (or level of “competition”) in the same year, rather than considering some sort of lag. They kind of address that last question with a robustness check, but the trouble is that I don’t believe that either, given that their only evidence is statistical significance of a quadratic term in a curve that doesn’t seem to fit the data. Also it’s not clear to me why the lagged model should be the robustness check and not the main analysis.

On the other hand, I haven’t looked into this particular case in detail so maybe it all makes sense if you look at it carefully enough.

One more thing is that I think they’re saying they’re using Poisson regression, but their data are weighted counts which aren’t integers? Also it’s well known that Poisson regression will understate uncertainty. Negative binomial regression is just about always better (see chapter 15 of Regression and Other Stories) or else you can use some sort of robust standard errors or whatever. But straight-up Poisson regression will generally give you standard errors that are too small–often much too small.
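To illustrate the overdispersion point, here is a minimal Python simulation. The settings are assumptions for illustration only: counts are drawn from a gamma-Poisson mixture (which is the negative binomial) with mean 5 and shape 2, and we compare the standard error of the mean implied by a Poisson model (variance equal to the mean) with the one implied by the empirical variance.

```python
import math
import random

def rpois(lam, rng):
    """Poisson draw by Knuth's multiplication method; fine for small lam."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_counts(n, mu, shape, rng):
    """Gamma-Poisson mixture with mean mu and variance mu + mu**2 / shape."""
    return [rpois(rng.gammavariate(shape, mu / shape), rng) for _ in range(n)]

rng = random.Random(2005)
y = simulate_counts(20000, mu=5.0, shape=2.0, rng=rng)

mean = sum(y) / len(y)
var = sum((v - mean) ** 2 for v in y) / (len(y) - 1)

se_poisson = math.sqrt(mean / len(y))   # what a Poisson model would imply
se_actual = math.sqrt(var / len(y))     # what the data actually imply

print(f"mean = {mean:.2f}, variance = {var:.2f}")
print(f"s.e. of mean under Poisson: {se_poisson:.4f}; using empirical variance: {se_actual:.4f}")
```

With these settings the variance is several times the mean, so the Poisson-based standard error is clearly too small. The same thing happens to regression coefficients when overdispersed data are fit with straight Poisson regression.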

The big picture, as I see it, is that this paper has some theoretical results and some empirical results. The theory alone could be interesting but wouldn’t count for much without the empirics. The empirical results are iffy–at best, there are some patterns there and the authors just didn’t fully display their data and explain their model, but I’m doubtful. It’s possible that future, more careful, analysis found similar results–or not! It looks to me like the authors followed a standard practice in social science research of finding a statistically significant coefficient estimate and taking this as evidence in favor of a particular theory. But there are enough gaps between data and theory, gaps that include the functional form of the model, the method used to average over industries and years, and the variables being measured, that I don’t see it. As I said, this is standard practice in social science, and we wouldn’t really be looking at this paper from 2005 had two of its authors not been in the news “for having explained innovation-driven economic growth.”

P.S. My colleague sent me this paper because two of its authors recently won the Nobel prize in economics. This would not be the first time that economics Nobel prize winners made mistakes in interpreting data analysis in high-profile studies. Two cases we’ve discussed in the past are:

Did blind orchestra auditions really benefit women?

How does a Nobel-prize-winning economist become a victim of bog-standard selection bias?

It happens!

Of course, even if the paper, “Competition and Innovation: An Inverted-U Relationship,” is absolutely terrible and even if there is no such inverted-U relationship, that does not mean that the corpus of work by Philippe Aghion, Nick Bloom, Richard Blundell, Rachel Griffith, and Peter Howitt is valueless, or that they don’t deserve a major prize. And similarly for the authors of the two papers discussed in the links immediately above. Everybody makes mistakes. I’ve felt the need to issue corrections to four of my published papers, and I don’t think that all, or even much, of my work is bad.

So don’t take this post as a criticism of this Nobel prize. Rather, we can take it as a plus. When research gets public attention, people will go back and read the original papers, and this leads to post-publication review, as in this post. This is a good thing!

P.P.S. At this point, you may well be saying that I’m just being picky, this is how people did empirical work 20 years ago, why am I being mean to these authors, I’m a hater, tall poppy syndrome, every paper has flaws and assumptions, etc etc etc. So, to keep it simple, let me just say this: I don’t believe their story is supported by the evidence of that paper from 2005. I disagree with their claim that “We find strong evidence of an inverted-U relationship using panel data.” I just don’t see it. It might be that their theory is correct, and it might be that further data analysis supplies strong empirical support; I don’t know. I’m not making a statement about reality here; I’m making a statement about evidence. Which I think is a reasonable thing to look at, given that this is what the editors of the Quarterly Journal of Economics had in their hands when they had to decide whether they wanted to publish this claim of strong evidence. Again, this is an issue with lots of empirical work, and I’m not saying this paper was worse than the accepted standard at that time, or even now.

P.P.P.S. See here for my reanalysis.

Bridging prediction and intervention in social systems

This is Jessica. Lydia Liu, Deb Raji, Angela Zhou, and a whole bunch of other people (one of whom is me) write:

Many automated decision systems (ADS) are designed to solve prediction problems— where the goal is to learn patterns from a sample of the population and apply them to individuals from the same population. In reality, these prediction systems operationalize holistic policy interventions in deployment. Once deployed, ADS can shape impacted population outcomes through an effective policy change in how decision-makers operate, while also being defined by past and present interactions between stakeholders and the limitations of existing organizational, as well as societal, infrastructure and context. In this work, we consider the ways in which we must shift from a prediction-focused paradigm to an interventionist paradigm when considering the impact of ADS within social systems. We argue this requires a new default problem setup for ADS beyond prediction, to instead consider predictions as decision support, final decisions, and outcomes. We highlight how this perspective unifies modern statistical frameworks and other tools to study the design, implementation, and evaluation of ADS systems, and point to the research directions necessary to operationalize this paradigm shift. Using these tools, we characterize the limitations of focusing on isolated prediction tasks, and lay the foundation for a more intervention-oriented approach to developing and deploying ADS.

This paper is still in working paper format, but I’m posting because it does a good job synthesizing problems that arise if you fixate on optimizing prediction accuracy when deploying models to inform decisions about people, rather than taking a more holistic perspective. I’ll probably assign it as a course text next time I teach my Prediction for Decision-making class. The perspective the paper advances is that introducing such systems is a policy change and should be implemented and evaluated as such (e.g., in comparison to “bureaucratic counterfactuals” representing decision processes in the absence of the new system). It summarizes decision theoretic and causal inference formulations and includes case studies to illustrate. 

This came out of a workshop led by Lydia, Deb, and Angela last summer, which will be followed up with another workshop at UC Berkeley’s Simons Institute next January.

When thinking about causal inference, mechanistic or process models are important. I think that the association of “causal” with black-box models leads to lots of problems.

Columbia University computer science professor Elias Bareinboim points to a new textbook he’s been developing, Causal Artificial Intelligence. He also points to a recent paper with Drago Plecko, On the Structural Basis of Conditional Ignorability, that revisits the connection between potential outcomes and graphical models. Bareinboim writes that it is intended as a more technical note that addresses specific questions and provides a mathematical grounding to some topics we discussed back in 2012, on the use of hierarchical modeling to generalize to new settings.

Bareinboim’s book and course look great. He doesn’t use the methods that I am familiar with, but it is important for students to be exposed to multiple perspectives. (If you’re curious what we say, you can take a look at chapters 18-21 of Regression and Other Stories, which can be downloaded here.) Ideally students would take both of our courses so they can be experts in both approaches.

A key theme of Bareinboim’s book is that mechanistic or process models are important, and I agree. When I first took causal inference from Rubin back in 1985, he emphasized the amazing thing about randomized experiments that you can measure a causal effect without having any mechanism. But more and more I think that a causal effect without a mechanism model is rarely useful, first because effects tend to be small and so it’s rare to precisely identify an effect size from data along with no model, second because we are always interested in generalization (in causal jargon, we almost always only care about the population distribution of treatment effects, not the sample distribution), third because we care about variation, and fourth because even without the other three items, in the rare cases when we can discover a causal effect from a black-box experiment, we’ll want to understand the mechanism going forward. This reasoning does allow space for black-box causal discovery in the screening process–we look for clear effects which can then be studied in more detail–but even there we have an implicit population of possible effects under study (e.g., many different drugs, or many different genes, or many different policy innovations), and then I’d argue we’re already halfway there to some sort of process model, which in practice I would implement as a Bayesian latent-variable model, but there are lots of ways to do it. To me, the latent variables correspond to the “gears” in a mechanistic model.

It’s not that I think Rubin was wrong on the technical point; I just think, in retrospect, that by concentrating on the estimation of the sample average treatment effect he was attaining mathematical beauty at the cost of generalizability. Of course, in practice Rubin was very interested in generalization and very sensible about such issues; it’s just that in his theoretical work he focused on the in-sample problem. His argument was that it was best to start with what could be estimated from the data with minimal assumptions. I expressed my disagreement with this focus in item 2 of my (generally positive) review of the Imbens and Rubin book: https://statmodeling.stat.columbia.edu/2015/09/07/comments-on-imbens-and-rubin-causal-inference-book/

I made some of the above points in a post a few years ago, “Causal” is like “error term”: it’s what we say when we’re not trying to model the process. Unfortunately, Judea Pearl didn’t seem to understand my point there, but overall the comments on that post are helpful. Perhaps the title of that post was misleading. In any case, my point was that the mechanistic models we use in science are indeed causal (under Pearl’s definition or Rubin’s): they say that if you do X, then Y will happen. For example, if I fit a multi-compartment model in pharmacometrics, that’s causal: it says that if you increase the concentration of the drug in one compartment, this will have predictable effects going forward in time, as governed by a certain differential equation. But in statistics and econometrics, the term “causal inference” tends to be reserved for black-box settings where there’s no mechanistic model, and inference is done using a purely design-based “identification strategy” such as regression discontinuity or whatever. Causal inference is very glamorous right now in statistics and econometrics, and that’s fine, but people who love causal inference should also love mechanistic models. I think that the association of “causal” with black-box models leads to lots of problems. So I think this puts me in agreement with much of the spirit of Bareinboim’s book even if we are using different methods.

Spillovers

There are some interesting things going on at the border of black-box causal inference and process models. One such example is spillover effects. In the traditional statistical formulation of black-box causal inference, spillovers are an annoyance, a violation of the “stable unit treatment value assumption.” (There are other reasons to abandon the stable unit treatment value assumption–see this paper–but we won’t get into that here.) So the idea would be to design the study so there would be no spillover or to get estimates that were robust to spillover or to construct estimands of total effects averaging over spillover or to fit models in which spillover wouldn’t happen . . . but all that is the old way of thinking about things. The new way to handle spillover effects is to model them using some sort of mechanistic or process model, that is, a parametric model that corresponds to some model of the spillover process. It could be a spatial model, for example. I think of this as being on the border between black-box and process models for causal inference, in that the magnitude of the treatment effect might be estimated using some black-box regression approach (as in chapters 18-20 of Regression and Other Stories) or some black-box identification approach (as in chapter 21 of that book), but in a context where the spread of the effect across units is modeled as a process.

The Desperation of Causal Inference in Ecology

This post is by Lizzie. The figure is taken from Frank 2024.

I was in a meeting a little over a year ago in which I asked a student to define causal inference. The definition he gave me focused on complex approaches often used to try to drag causality out of observational data. So I asked if causal inference involved experiments at all? “No,” came the reply. I double-checked. “No.” The student was certain. Someone else following up later did not change their mind.

Experiments cannot help with causal inference.

I knew we had a problem then, but how did it happen? I’ll tell you my version of what happened and some of what I can put together for how this happened, but I am open to other theories and ideas. And if perhaps the new ‘causal inference’ movement in ecology really has — finally — struck on a way for us to figure out ecology, then time will obviously prove me wrong, and you’re welcome to beat time to it in the comments section.

I could start back with Sewall Wright and the demes of cows I was once told he used to visit and Fisher and his fields of corn (or some agreeable consistent crop just waiting for its split plot design), but I will just start in the 1990s with path analysis in ecology. Path analysis (what I would call structural equation modeling with standardized coefficients) was hot in the 1990s in ecology. There was a chapter on it by Mitchell in the book ‘Design and Analysis of Ecological Experiments’ in 2001 (perhaps around its peak). It had this on the first page:

plant traits → visitation → pollination → reproduction

Isn’t that great? I could link plant traits to plant reproduction via those traits’ effects on (insect) visitation (to flowers) and how all that racey visiting led to pollination and then — reproduction (and then I might even make a run at … plant fitness!). I mean it is great. I like the idea. I liked the chapter. I did a path analysis. But I didn’t call it path analysis, I called it structural equation modeling because, by the time I was publishing, path analysis had hit some bumps.

Namely, everyone had done path analysis and many of those people had done it poorly in one way or another and suddenly all those little paths looked like a lot of made up stories with lots of little asterisks representing lots of significant p-values that didn’t really hold up to scrutiny. Shocker! (No, not shocker.) So, we all stopped doing path analysis and (within a few years it seems to me) we started doing structural equation modeling, sometimes with standardized coefficients. But we never called it path analysis again.

We couldn’t let go of path analysis because the dream was still alive. We wanted causality. We wanted to link things to explain how the world works. And manipulating plant traits is hard (have you ever tried to paint flowers different colors in a field? Or paste on tiny hairs (which we call trichomes)?), but measuring them is comparatively less hard. We wanted causality from observational data. That was the dream.

And, the dream is still alive. After all this time.

And the dreamers seem to have just discovered some of the basics of causal inference for observational data from the social sciences and econometrics literatures. With this, they have discovered that diversity (more species) in grasslands leads to lower productivity, not higher (Dee et al. 2023) and linked white nose syndrome in bats to increased infant mortality across the eastern US (Frank 2024). This latter paper is the one that rattled me because I attended a discussion group with colleagues and found out how many of my colleagues are excited by these ‘new techniques’ and how they have learned from them the amazing power of fixed effects for finding causality and the dangers of random effects to lead us astray.

Huh? Fixed effects to save the day and random effects of doom?

I tracked some of this down to my thinking in terms of the common ecology definition of fixed versus random (I think something closer to definition #2, page 245 of Gelman and Hill: “2. Effects are fixed if they are interesting in themselves or random if there is interest in the underlying population. Searle, Casella, and McCulloch (1992, section 1.4) explore this distinction in depth.”) whereas the ‘new’ methods in ecology are using (I believe) definition #5 (“5. Fixed effects are estimated using least squares (or, more generally, maximum likelihood) and random effects are estimated with shrinkage (“linear unbiased prediction” in the terminology of Robinson, 1991). This definition is standard in the multilevel modeling literature (see, for example, Snijders and Bosker, 1999, section 4.2) and in econometrics….”).
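To make the definition #5 contrast concrete, here is a minimal sketch of the two estimators for group intercepts. It assumes the two standard deviations are already known (in practice they would be estimated, for example by a multilevel model), and the site names, data, and function names are hypothetical. The partial-pooling formula is the usual precision-weighted average of the group mean and the overall mean.

```python
def no_pooling(groups):
    """'Fixed effects' in the definition #5 sense: each group's own mean (least squares)."""
    return {g: sum(y) / len(y) for g, y in groups.items()}


def partial_pooling(groups, sigma_y, sigma_alpha):
    """'Random effects' in the definition #5 sense: group means shrunk toward the overall mean.

    sigma_y is the within-group sd and sigma_alpha the between-group sd;
    both are treated as known here, which is an assumption of this sketch.
    """
    all_y = [v for y in groups.values() for v in y]
    overall_mean = sum(all_y) / len(all_y)
    estimates = {}
    for g, y in groups.items():
        weight_data = len(y) / sigma_y ** 2      # precision of the group mean
        weight_prior = 1.0 / sigma_alpha ** 2    # precision of the group-level distribution
        group_mean = sum(y) / len(y)
        estimates[g] = (weight_data * group_mean + weight_prior * overall_mean) / (
            weight_data + weight_prior
        )
    return estimates


# Hypothetical data: one small site and one larger site.
sites = {"site_a": [4.0, 6.0], "site_b": [10.0] * 8}
unpooled = no_pooling(sites)
shrunk = partial_pooling(sites, sigma_y=2.0, sigma_alpha=2.0)
```

With these made-up numbers, the small site is pulled further toward the overall mean than the large site, which is the "shrinkage" the definition refers to. Nothing in the sketch says which estimator is better for causal questions.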

This explains some of the interesting lines I found in these papers, including:

Random effects account for clustering in data via the error structure of the model (Bolker et al. 2009; Gelman and Hill 2006), rather than estimating cluster means as part of the data generating process of a model (i.e., via fixed effect for each cluster’s mean, using the terminology of the mixed models literature). (Byrnes & Dee 2025)

The time-varying site attributes (μ_{st}) are also modeled in a fully flexible way that allows a year- specific effect for each site (in the estimation, an indicator for each year is interacted with an indicator for each site). (Dee et al. 2023)

I think the authors of this new Tower of Babel for ecology have also defined random and mixed effects to mean only ever linear models with lmer-style partial pooling on intercepts (never slopes I presume?) with fixed effects on slopes (back to definition #2). They even go so far as to refer to this as the “Common Design in Ecology” (they also capitalize Ecology and Ecologists in Byrnes & Dee 2025, which I find odd; is this high German? Personally, as an ecologist, I don’t think I need a capital letter) and explain:

Without more variable transformations, the multi-level modeling approach does not easily lend itself to controlling for as many unobservable sources of confounding as can be done in our linear, additive, fixed-effects panel data estimator. (Dee et al. 2023, in supp)

I thought about calling this post ‘The Tower of Babeling Causality’ or ‘The Problem with Statistical Terminology,’ but the real problem is not how lost in the weeds of words we get with terminology. It’s partly how readily ecologists want ‘new’ terms and approaches that will solve everything. The authors who have come armed with econometrics panel data approaches and instrumental variable analysis (when most ecologists don’t know what an instrument is in their experiments) had the ground laid for them by all the ecologists who are enthralled by ‘random effects.’ I agree we have too many people trained to believe that chucking enough categorical covariates (site, plot, year …) on the intercept of a simple linear model will save the day. It’s a problem how much we sway from this being correct statistics to that being correct statistics. And how quickly we think a new approach will change everything in ecology. We seem to quickly learn, and re-learn, that bad stats can easily lead you astray, but never take on that good statistics alone will not save you.

To be clear, I don’t have a giant problem with these methods. I have a problem with how they are presented as saviors (and somewhat how they are presented as new, but perhaps we need the ‘new’ and ‘savior’ angle to follow Grace) but I have a bigger problem in how rapidly they are being taken up. I fear that for the next 10 years I will live in a sea of piranhas where lots of ecological problems explain 5-10% of infant mortality and plant productivity.

And that’s the other problem — the bigger one: how much people want this causality. They want to believe that we have the data and methods to show that a disease that wipes out bats leads to an 8% increase in infant mortality. Of course we should want causality, we’re scientists, but the drive for causality seems to jettison a lot of the stuff we also need as scientists, especially estimates of uncertainty and the ability to leave room for uncertainty so that we search out better methods and better answers. I don’t know if bat decline has increased infant mortality 8% (though I highly doubt that number given the language of the author and how ‘outrageous’ he thinks it is that he is expected to share all his data for people to believe his claims). I just know we have managed to do science before and make progress and it wasn’t because we got better statistical methods or memorized a glossary of one particular set of people’s definitions of DAGs and fixed effects.

I am cited in one of these papers for old work I did where I compared shifts in the timing of flowering and leafout with warming over time (due to anthropogenic climate change and natural variation) and experimental warming (due to infrared heaters or teeny tiny plastic greenhouses — also, hello instruments in ecological experiments!). Estimates from experimental and observational data were different — the effect of warming in observational data was bigger. I did lots of different statistical analyses to figure this out, I even did something probably close to the ‘Common Design in Ecology’ (although the authors don’t seem to ding me for this) and the effect never went away. With Ailene Ettinger and other colleagues, I eventually got all new data and found the same thing using slightly different statistics. But that wasn’t why we got all the new data, we got it to test hypotheses about what drove the difference. And we found out that it appeared to be two things: warming experiments dry out soils which delays leafout and flowering and warming experiments over-report their warming (so their per degree estimates look smaller than they should).

I did all of this without ever invoking the term ‘causal inference.’ And that’s what really worries me for trainees today; that ‘causal inference’ will now mean a narrow branch of amazing ‘fully-flexible’ completely un-confounded statistics. We’re ecologists; we actually can manipulate some stuff. And somehow we’re going so gaga for econometrics statistics to give us causality through time-invariant fixed effects (or whatever) that we have students who don’t know how experiments could relate to causal inference.

What’s the solution? If you ask me, be less gaga over any statistical method (and I do love my own statistical methods so I could practice a little more of what I preach) and teach everyone basic mathematical notation and basic biological models. Teach them that generative modeling doesn’t belong to any one part of statistics or to only fixed or random effects. Teach them to be able to write out a simple biological model and simulate data from it and then fit their statistical model to it. ‘Only connect!’ Connect the models you learn for ecological theory with those you learn in stats. (And maybe teach them about the long debate in conservation biology about Cassandra’s curse, but that is a topic for another post.)
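As one possible illustration of the simulate-then-fit habit, here is a minimal sketch. It assumes a toy model in which leafout day shifts linearly with warming plus noise; the baseline day, the sensitivity of -4 days per degree, the noise level, and the function names are all invented for the example and are not estimates from any study.

```python
import random


def simulate_leafout(n, baseline=120.0, days_per_degree=-4.0, noise_sd=3.0, seed=0):
    """Simulate leafout day of year under a toy linear response to warming (degrees C)."""
    rng = random.Random(seed)
    warming = [rng.uniform(0.0, 4.0) for _ in range(n)]
    leafout_day = [
        baseline + days_per_degree * w + rng.gauss(0.0, noise_sd) for w in warming
    ]
    return warming, leafout_day


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
        (a - mean_x) ** 2 for a in x
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


warming, leafout_day = simulate_leafout(n=2000)
intercept_hat, slope_hat = fit_line(warming, leafout_day)
```

Because the parameters that generated the data are known, the fitted slope and intercept can be checked against them directly. Rerunning with a smaller `n` or a larger `noise_sd` gives a feel for how well the statistical model can recover the biology under less favorable conditions.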

That external validity question: How to think about a 3-year UBI study?

Dale Lehman writes:

You may be aware of the recently completed guaranteed income experiment. There is much to be studied here – but it seems like the experiment and its goals are a bit of a mismatch, and I haven’t seen a clear recognition of that point. To the extent that the results are stated in terms of the effect of a temporary unconditional income grant, it is accurate. But much of the interest in a universal basic income is for an ongoing policy, not a temporary one. I suspect the 3 year timeframe of this study may not permit much to be said about ongoing policies such as the UBI. This experiment only recently ended, so perhaps followup a few years from now will help frame the limitations of this temporary experiment for revealing potential impacts of a “permanent” UBI. Given the resources available, perhaps only a temporary experiment was possible. But just browsing through their discussion of results, it isn’t clear to me how meaningful this experiment was for policies such as the UBI. They talk about the results in terms of how the cash was used for consumption, savings, employment decisions, etc. without distinguishing whether or not these effects might be due to their temporary nature. Looking at the anecdotal evidence they provide, I see many people speak of their experience in terms of disbelief and luck – these seem like transitory effects to me, and not indicative of what a UBI policy would entail. I’m not sure why a temporary unconditional income grant is of much policy interest, so I think the question of whether a temporary experiment can generalize to an ongoing policy might be important.

I don’t have any insights regarding universal basic income policies or this particular study–but, yeah, questions of external validity arise all the time when generalizing from a time-limited study to the real world.

My favorite (i.e., least favorite) example of this sort was the psychology study published as “The more you play, the more aggressive you become: A long-term experimental study of cumulative violent video game effects on hostile expectations and aggressive behavior”–but it wasn’t a “long-term” study at all: the study took place over 3 days! Unless you’re a fruit fly, 3 days is not long-term. Then again, the last author of that paper had some other problems.

The 3-year time frame problem discussed by Lehman is an example of something that happens in so many studies in health, policy, and business. You’re interested in long-term effects on health, but in the meantime it makes sense to publish what results you have after 6 months or 1 year or 3 years or whatever. You’re interested in long-term policy effects but in the meantime you have to decide what to do so you look at outcomes after 1 year. You do an A/B test with the goal of increasing clicks in the future, but all you get is what happened during the period of your experiment. As Lehman points out, in the UBI example, there are concerns about the outcomes being measured after a short period of time and also that the treatment itself is time limited.

My only general comment here is that any attack on this problem involves modeling (i.e., assumptions). So, in this case I’d recommend trying out some explicit models of the effects as a function of the way that the program is implemented, and how the outcomes develop over time.

Experimentation and thinking at the level of a program of experiments

Here at MIT we’re yet again hosting the Conference on Digital Experimentation (CODE@MIT) this fall. As part of the run-up to the deadline for submissions on September 12, the organizing team spent some time talking with experts in industry about relevant topics that perhaps have been underrepresented at the conference. My co-organizer David Holtz wrote about three topics that came up, but I wanted to say a little bit more about one here.

One of the things we’re hoping to see more of at the conference — and I’d like to see more of in academic literature and applied practice — is thinking about experimentation at the level of a whole program or practice of experimentation.

People often try to evaluate single experiments to figure out whether they are worthwhile or how much value they added. Maybe you try something new out, which many people were skeptical about, and the experiment shows it works well. Perhaps more often, you retain the null hypothesis of no effect and — if you’ve sized your experiment appropriately — that may reflect that this is evidence against a “worthwhile” effect. And then sometimes some idea seems like an obvious improvement, but it is — at least in its actually existing version — really bad and the experiment saves you from blindly launching something harmful.

Sure, we can focus on a single experiment and see where it might fit among these, but it may be more useful to think about this at a higher level of analysis. That “obvious improvement” that turned out to be bad is something we might be able to iterate on, eventually yielding — with the help of a series of experiments — a version that works. And if we show that the idea everyone was skeptical about works, this might empower more people to try things; the resulting series of many experiments will probably have lots of duds, but also just involve trying many more things. Many experiments also give us more of a chance at estimating effects on what we really care about — whereas many individual experiments might be too short or small for that purpose.

My sense is that this is not usually the kind of thing that statistics, econometrics, etc. looks at. To the extent that there is a focus on something like a sequence of experiments, it is often in the narrow frame of something like multi-armed bandit problems and Bayesian optimization — which can be valuable tools, but aren’t as connected with the messier, more creative process of innovation and product experimentation. And they don’t readily let us think about many, seemingly unrelated experiments conducted by the same organization with (partially) common goals.

Here are a couple of papers that put an interesting quantitative or methodological lens on whole programs of experiments.

In “A/B testing with fat tails”, Eduardo Azevedo, Alex Deng, José Luis Montiel Olea, Justin Rao, and Glen Weyl consider how the distribution of the quality (i.e. average treatment effects) of new ideas affects optimal experimentation strategy. If the distribution of treatment effects has very heavy tails, then there are some really good ideas (well and some really bad ones too) mixed in. Even small experiments could then be well-powered to detect those big effects, and finding those provides most of the value. I quite like this conceptually — and it is nice to have some data about that distribution. For that reason, I usually feature this distribution (alongside a couple others) when teaching analytics:

Distribution of estimated average treatment effects from many product experiments: Fat tails in three different data sets

Others have followed up on this idea. However, this abstract version of the problem doesn’t, in my opinion, really match a lot of product innovation and experimentation. Sure, novice experimenters are often concerned about “overlapping” experiments, so they make all of their experiments exclusive, non-overlapping — thereby creating a tradeoff between number of experiments and sample size per experiment. This can be necessary in some cases when two innovations are indeed exclusive of each other (only one audio compression algorithm can be used at once, only one ranking of content can be shown to a user at once). But many experiments should really be run in an independent way. So it isn’t clear to me that this is the real tradeoff in this situation. Nonetheless, there still may be other sorts of budget constraints on experimentation that could lead to related conclusions.

In “Evaluating decision rules across many weak experiments”, Winston Chou, Colin Gray, Nathan Kallus, Aurélien Bibaut & Simon Ejdemyr consider how to empirically evaluate and optimize the decision rules used to make launch decisions once a team runs an experiment. In practice, these decision rules are often some combination of launching if there’s a statistically significant positive effect on some proxy metric, often in the absence of a detected negative effect on some guardrail. It’s easy to see that this can lead to some odd choices that a Bayesian decision-maker would not make (always be integrating your loss function over your posterior). But can such a simple rule have good empirical performance in terms of the aggregate effects on the main metric of interest? Yes, but the status quo rule might not be the best — and might be quite bad — on that front. As this paper highlights, naive evaluation of a decision rule can get the evaluation quite wrong. This is because many of the experiments will be underpowered for effects on the main metric of interest, so there can be quite the “winner’s curse”. And, in a phenomenon related to weak instruments, naive evaluation will often misestimate how diagnostic the proxies used in a decision rule are about those effects. So this paper provides some better ways to get at the aggregate effects of applying a decision rule to many experiments. (However, it doesn’t really tell us about whether we were running the right experiments.)
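A toy simulation can make the winner's curse concrete. This is a deliberately simplified sketch, not the method of the paper: it uses a single metric rather than separate proxy and main metrics, it assumes normally distributed true effects and noise, and the function name and all numbers are made up for illustration.

```python
import random


def evaluate_launch_rule(n_experiments=20000, effect_sd=0.5, se=1.0,
                         z_cut=1.96, seed=1):
    """Compare naive and true average effects among experiments that pass a launch rule.

    Each experiment has a true effect drawn from Normal(0, effect_sd) and an
    estimate equal to the true effect plus Normal(0, se) noise. The rule
    launches when the z-statistic exceeds z_cut.
    """
    rng = random.Random(seed)
    launched = 0
    naive_total = 0.0   # sum of the estimates that triggered a launch
    true_total = 0.0    # sum of the true effects behind those launches
    for _ in range(n_experiments):
        true_effect = rng.gauss(0.0, effect_sd)
        estimate = true_effect + rng.gauss(0.0, se)
        if estimate / se > z_cut:
            launched += 1
            naive_total += estimate
            true_total += true_effect
    return launched, naive_total / launched, true_total / launched


launched, naive_mean, true_mean = evaluate_launch_rule()
```

In this underpowered setup, the average estimate among launched experiments is several times the average true effect among them, so adding up the estimates of the "winners" badly overstates what the rule delivered. The rule still selects experiments with positive true effects on average, which is consistent with a simple rule doing some good while its naive evaluation is misleading.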

 

It would be great to see more methodological work on experimentation programs — whether similar or dissimilar to these two examples. And I’m sure there are also additional, unknown-to-me literatures on this topic. Please let me know about them. And if you’re doing this kind of work, submit it to CODE@MIT, whether this year or in the future.

This post is by Dean Eckles. Since it is about analytics practice mainly at “tech companies”, I’ll note that, among other disclosures, several tech companies are financial supporters of CODE@MIT. 

 

 

What writing a failed rock-paper-scissors program taught me (or should have taught me) about sample size and uncertainty

In a recent comment thread, Phil shared the story of a program he wrote on the little computer in our high school math room. I think it was an OSI (Ohio Scientific) that used, ummm, I think it was 5-inch floppies but maybe it was 8 inches. As Phil said, we programmed in Basic. The cool thing to me was that it had a terminal! I’d learned programming using punch cards. Being able to type directly into the terminal and see the results on the screen . . . that was just amazing.

Anyway, there are two programs I remember writing for it. First was the lunar lander program–that one’s a classic, zillions of people have programmed it, actually it’s kinda fun. Your lander is subject to gravitational acceleration and you have a fixed supply of fuel. You can fire the rockets to slow down, and the goal is to land softly. Use too much fuel too early and you’ll decelerate too fast, with not enough fuel to slow you down at the end when you need it. Wait too long and it’ll be too late! This is the simple one-dimensional version–it’s easy to program (just Newton’s laws!) and a small challenge to play and survive.
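For readers who never wrote one, here is a minimal sketch of the one-dimensional version in Python rather than Basic. It is not a reconstruction of the original program: the starting height, gravity, thrust per unit of fuel, fuel supply, time step, and function name are all assumptions picked to keep the arithmetic simple.

```python
def simulate_lander(burns, height=100.0, velocity=0.0, fuel=30.0,
                    gravity=1.6, thrust=4.0, dt=1.0):
    """Step a 1-D lander until it reaches the ground.

    burns[i] is the fuel the player asks to burn in step i (0 once the list runs out).
    Upward is positive. Returns (impact speed, fuel remaining).
    """
    step = 0
    while height > 0:
        requested = burns[step] if step < len(burns) else 0.0
        burn = max(0.0, min(requested, fuel))   # cannot burn more fuel than is left
        fuel -= burn
        acceleration = -gravity + thrust * burn
        velocity += acceleration * dt            # Newton's second law, one time step
        height += velocity * dt
        step += 1
    return -velocity, fuel


# Free fall versus waiting six steps and then burning just enough to cancel gravity.
free_fall_speed, _ = simulate_lander([])
late_burn_speed, fuel_left = simulate_lander([0.0] * 6 + [0.4] * 20)
```

Under these made-up constants, the late, steady burn hits the ground much more slowly than free fall does. The game is choosing the burn schedule: burning hard early wastes fuel that is needed near the surface, and burning too late leaves no time to slow down.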

The other thing I remember is writing a program to play rock-paper-scissors. It started with random plays, then after 10 rounds or so, it tried to figure out what to do based on the opponent’s pattern of play. It was crudely fitting a lag-1 Markov process, estimating the probability of each of the 3 choices conditional on the previous play. I programmed it up, tried it out a few times myself to see that it worked, and then gave it to others to play (without revealing its rules).
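The strategy described above might look something like the following sketch. It is a guess at the general idea, not the original Basic program: the class name, the tie-breaking, and the choice to play the move that beats the single most likely next move are all assumptions.

```python
import random
from collections import defaultdict

MOVES = ("rock", "paper", "scissors")
# BEATS[m] is the move that defeats m.
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}


class Lag1Bot:
    """Plays randomly at first, then predicts the opponent's move from their previous one."""

    def __init__(self, warmup=10, rng=None):
        self.warmup = warmup
        self.rng = rng or random.Random()
        # counts[prev][nxt] = number of times the opponent played nxt right after prev
        self.counts = defaultdict(lambda: defaultdict(int))
        self.prev = None
        self.rounds = 0

    def play(self):
        row = self.counts[self.prev] if self.prev is not None else {}
        if self.rounds < self.warmup or not row:
            return self.rng.choice(MOVES)
        # Empirical estimate of the most likely next move given the previous move.
        predicted = max(MOVES, key=lambda m: row.get(m, 0))
        return BEATS[predicted]

    def observe(self, opponent_move):
        if self.prev is not None:
            self.counts[self.prev][opponent_move] += 1
        self.prev = opponent_move
        self.rounds += 1


# A perfectly predictable opponent who cycles rock, paper, scissors.
bot = Lag1Bot(warmup=10, rng=random.Random(0))
cycle = ["rock", "paper", "scissors"]
wins_after_warmup = 0
for i in range(40):
    opponent = cycle[i % 3]
    mine = bot.play()
    if i >= 10 and mine == BEATS[opponent]:
        wins_after_warmup += 1
    bot.observe(opponent)
```

Against this fully deterministic opponent the bot wins every round after the warmup. Real opponents are nowhere near this regular: there are nine transition probabilities to estimate, and ten or twenty rounds give only a handful of observations per row of the table.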

I fully expected the program to do well. People are predictable, right? Actually, though, it bombed. I don’t think its poor performance came from savvy play on the part of its opponents. That is, I don’t think my friends adapted their play against my rock-paper-scissors robot so as to defeat its algorithm. Rather, I think it’s just noise. First, playing a few rounds against someone would not be enough for my bot to estimate the conditional probabilities. I can’t remember how my program worked, but it probably used a crudely empirical estimate which would in practice be super-noisy. Second, even if the program really was somewhat better (or worse) than chance, you’d need lots of rounds to be able to notice it. To really see if the program is doing well, you’d need to collect a bunch of data.
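For a rough sense of what "lots of rounds" means, here is a back-of-envelope sketch. It assumes ties are dropped, rounds are independent, and the bot wins a fixed fraction of the remaining rounds; the function name and the two-standard-error threshold are choices made for illustration. The standard error of a win rate p over n rounds is sqrt(p(1-p)/n), so the edge over 50% reaches z standard errors once n is at least z² p(1-p) / edge².

```python
import math


def rounds_needed(edge, z=2.0):
    """Decisive rounds needed for a win rate of 0.5 + edge to sit z standard errors above 0.5."""
    p = 0.5 + edge
    return math.ceil(z ** 2 * p * (1 - p) / edge ** 2)


rounds_for_55_percent = rounds_needed(0.05)
rounds_for_60_percent = rounds_needed(0.10)
```

Under these assumptions, a bot that truly wins 55% of decisive rounds needs roughly 400 of them before its advantage is distinguishable from coin flipping, and even a 60% bot needs on the order of 100. A few games against friends in the math room would not come close.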