A colleague sent me an email with the above title and the following content:

We were talking about the Jamaica childhood intervention study. The Science paper on returns to the intervention 20 years later found a 25% increase but an earlier draft had reported a 42% increase. See here.

Well, it turns out the same authors are back in the stratosphere! In a Sept 2021 preprint, they report a 43% increase, but now 30 rather than 20 years after the intervention (see abstract). It seems to be the same dataset and they again appear to have a p-value right around the threshold (I think this is the 0.04 in the first row of Table 1 but I did not check super carefully).

Of course, no mention I could find of selection effects, the statistical significance filter, Type M errors, the winner’s curse or whatever term you want to use for it…

From the abstract of the new article:

We find large and statistically significant effects on income and schooling; the treatment group had 43% higher hourly wages and 37% higher earnings than the control group.

The usual focus of economists is earnings, not wages, so let’s go with that 37% number. It’s there in the second row of Table 1 of the new paper: the estimate is 0.37 with a standard error of . . . ummmm, it doesn’t give a standard error but it gives a t statistic of 1.68—

What????

B-b-b-but in the abstract it says this difference is “statistically significant”! I always thought that to be statistically significant the estimate had to be at least 2 standard errors from zero . . .

They have some discussion of some complicated nonparametric tests that they do, but if your headline number is only 1.68 standard errors away from zero, asymptotic theory is the least of your problems, buddy. Going through page 9 of the paper, it’s kind of amazing how much high-tech statistics and econometrics they’re throwing at this simple comparison.

Anyway, their estimate is 0.37 with standard error 0.37/1.68 = 0.22, so the 95% confidence interval is [0.37 +/- 2*0.22] = [-0.07, 0.81]. But it’s “statistically significant” cos the 1-sided p-value is 0.05. Whatever. I don’t really care about statistical significance anyway. It’s just kinda funny that, after all that effort, they had to punt on the p-value like that.
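To spell out that arithmetic (the 0.37 and 1.68 are from the paper's Table 1; the back-calculation is mine, using the rough ±2 standard error rule for the interval):

```python
from statistics import NormalDist

est = 0.37                 # reported earnings estimate
t = 1.68                   # reported t statistic
se = est / t               # implied standard error, ~0.22

lo = est - 2 * se          # rough 95% interval lower bound, ~ -0.07
hi = est + 2 * se          # upper bound, ~0.81
p_one_sided = 1 - NormalDist().cdf(t)   # ~0.046, just under 0.05

print(f"se = {se:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}], one-sided p = {p_one_sided:.3f}")
```

The point is just that everything in my paragraph above follows mechanically from the two numbers they did report.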

Going back to the 2014 paper, I came across this bit:

I guess that any p-value less than 0.10 is statistically significant. That’s fine; they should just get on the horn with the ORBITA stents people, because *their* study, when analyzed appropriately, ended up with p = 0.09, and that wasn’t considered statistically significant at all; it was considered evidence of no effect.

I guess the rule is that, if you're lucky enough to get a result between 0.05 and 0.10, you get to pick the conclusion based on what you want to say: if you want to emphasize it, call it statistically significant; if not, call it non-significant. Or you can always fudge it by using a term like "suggestive." In the above snippet they said the treatment "may have" improved skills and that treatment "is associated with" migration. I wonder if that phrasing was a concession to the fat p-value of 0.09. If the p-value had been at the more conventionally attractive 0.05 or below, maybe they would've felt free to break out the causal language.

But . . . it’s kind of funny for me to be riffing on p-values and statistical significance, given that I don’t even like p-values and statistical significance. I’m on record as saying that everything should be published and there should be no significance threshold. And I would not want to “threshold” any of this work either. Publish it all!

There are two places where I would diverge from these authors. The first is in their air of certainty. Rather than saying a “large and statistically significant effect” of 37%, I’d say an estimate of 37% with a standard error of 22%, or just give the 95% interval like they do in public health studies. JAMA would never let you get away with just giving the point estimate like that! Seeing this uncertainty tells you a few things: (a) the data are compatible (to use Sander Greenland’s term) with a null effect, (b) if the effect is positive, it could be all over the place, so it’s misleading as fiddlesticks to call it “a substantial increase” over a previous estimate of 25%, and (c) it’s empty to call this a “large” effect: with this big of a standard error, it would have to be “large” or it would not be “statistically significant.” To put it another way, instead of the impressive-sounding adjective “large” (which is clearly not necessarily the case, given that the confidence interval includes zero), it would be more accurate to use the less-impressive-sounding adjective “noisy.” Similarly, their statement, “Our results confirm large economic returns . . .”, seems a bit irresponsible given that their data are consistent with small or zero economic returns.

The second place I’d diverge from the authors is in the point estimate. They use a data summary of 37%. This is fine as a descriptive data summary, but if we’re talking policy, I’d like some estimate of treatment effect, which means I’d like to do some partial pooling with respect to some prior, and just about any reasonable prior will partially pool this estimate toward 0.
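As a toy illustration of what that partial pooling does (the Normal(0, 0.10) prior here is my assumption for demonstration purposes, not anything from the paper):

```python
y, se = 0.37, 0.22                  # data estimate and implied standard error
prior_mean, prior_sd = 0.0, 0.10    # hypothetical skeptical prior on the effect

# Conjugate normal-normal update: the posterior mean is a
# precision-weighted average of the prior mean and the data estimate.
post_precision = 1 / se**2 + 1 / prior_sd**2
post_mean = (y / se**2 + prior_mean / prior_sd**2) / post_precision
post_sd = post_precision ** -0.5

print(f"posterior: {post_mean:.2f} +/- {post_sd:.2f}")   # ~0.06 +/- 0.09
```

Because the data estimate is so noisy relative to any plausible prior, even this mild amount of skepticism pulls the 0.37 most of the way toward zero.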

Ok, look. Lots of people don’t like Bayesian inference, and if you don’t want to use a prior, I can’t make you do it. But then you have to recognize that reporting the simple comparison, conditional on statistical significance (however you define it) will give you a biased estimate, as discussed on pages 17-18 of this article. Unfortunately, that article appeared in a psychology journal so you can’t expect a bunch of economists to have heard about it, but, hey, I’ve been blogging about this for years, nearly a decade, actually (see more here). Other people have written about this winner’s curse thing too. And I’ve sent a couple emails to the first author of the paper pointing out this bias issue. Anyway, my preference would be to give a Bayesian or regularized treatment effect estimator, but if you don’t want to do that, then at least report some estimate of the bias of the estimator that you are using. The good news is, the looser your significance threshold, the lower your bias!
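Here's a quick simulation of that bias, under the hypothetical (my numbers, chosen for illustration) that the true effect is 10% and the standard error is the 0.22 implied by the paper; the one-sided z cutoff of 1.28 corresponds to p < 0.10:

```python
import random

random.seed(1)
true_effect, se = 0.10, 0.22   # hypothetical true effect; implied SE from the paper
z_cutoff = 1.28                # one-sided cutoff for p < 0.10

# Simulate many replications of the study and keep only the "significant" ones.
estimates = [random.gauss(true_effect, se) for _ in range(200_000)]
winners = [x for x in estimates if x / se > z_cutoff]

power = len(winners) / len(estimates)
exaggeration = (sum(winners) / len(winners)) / true_effect

print(f"power ~ {power:.2f}")                        # around 0.20
print(f"exaggeration factor ~ {exaggeration:.1f}")   # around 4
```

Conditional on clearing even this loose threshold, the published estimate overstates the true effect by roughly a factor of 4 in this scenario. That's the winner's curse in one picture.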

But . . . it’s early childhood intervention! Don’t you care about the children???? you may ask. My response: I do care about the children, and early childhood intervention could be a great idea. It could be great even if it doesn’t raise adult earnings at all, or if it raises adult earnings by an amount that’s undetectable by this noisy study.

Think about it this way. Suppose the intervention has a true effect of raising earnings by an average of 10%. That's a big deal, maybe not so much for an individual, but an *average* effect of 10% is a lot. Consider that some people won't be helped at all—that's just how things go—so an average of 10% implies that some people would be helped a whole lot. Anyway, this is a study where the standard deviation of the estimated effect is 0.22, that is, 22%. If the average effect is 10% and the standard error is 22%, then the study has very low power, and it's unlikely that a preregistered analysis would result in statistical significance, even at the 0.1 or 0.2 level or whatever it is that these folks are using. But, in this hypothetical world, the treatment would be awesome.
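Under that hypothetical (true effect 0.10, standard error 0.22, both my stipulations for illustration), the power calculation is a one-liner at any threshold you like:

```python
from statistics import NormalDist

Phi = NormalDist().cdf
effect, se = 0.10, 0.22

for label, z_crit in [("two-sided 0.05", 1.96),
                      ("one-sided 0.05", 1.645),
                      ("one-sided 0.10", 1.28)]:
    # Chance the estimate clears the cutoff (upper tail only;
    # the lower tail is negligible here).
    power = 1 - Phi(z_crit - effect / se)
    print(f"{label}: power ~ {power:.2f}")
```

Even at the loosest of these thresholds, the study would detect this perfectly real and important effect only about one time in five.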

My point is, there's no shame in admitting uncertainty! The point estimate is positive; that's great. There's a lot of uncertainty, and the data are consistent with a small, tiny, zero, or even negative effect. That's just the way things go when you have noisy data. As quantitative social scientists, we can (a) care about the kids, (b) recognize that this evaluation leaves us with lots of uncertainty, and (c) give this information to policymakers and let them take it from there. I feel no moral obligation to overstate the evidence, overestimate the effect size, and understate my uncertainty.

It’s so frustrating, how many prominent academics just can’t handle criticism. I guess they feel that they’re in the right and that all this stat stuff is just a bunch of paperwork. And in this case they’re doing the Lord’s work, saving the children, so anything goes. It’s the Armstrong principle over and over again.

And in this particular case, as my colleague points out, it's not just that they failed to acknowledge or deal with criticism of the prior paper: in this new paper they are actively repeating the very same error, with the very same study and data, after having been made aware of the problem on more than one occasion, without acknowledging the issue at all. Makes me want to scream.

**P.S.** When asked whether I could share my colleague’s name, my colleague replied:

Regarding quoting me, do recall that I live in the midwest and have to walk across parking lots from time to time. So please do so anonymously.

Fair enough. I don’t want to get anybody hurt.