You’ve heard it a million times: if you have an estimate of .003 (on some reasonable scale in which 1 is a meaningful effect size) and a standard error of .001, then, yes, the estimate is statistically significant, but it’s not practically significant.

And, indeed, sometimes this sort of thing comes up (and, irritatingly, such studies get publicity in part because of their huge sample size, which seems a bit unfair in that they *need* the huge sample size in order to detect anything at all), but not so often.

Much more common are small studies where estimated effects are statistically significant but the estimates are unrealistically huge (remember the statistical significance filter).

We’ve spent a lot of space on this blog recently on studies where the noise overwhelms the signal, where any comparisons in the data, statistically significant or not, are essentially meaningless.

But today (actually, whenever this post appears in the future; I’m writing it on 22 Nov), I’d like to focus on a more interesting example: a study on an important topic where the estimate was statistically significant, but where I think the estimate is biased upward, for the usual reason of the statistical significance filter.

It’s the story of an early childhood intervention that, based on a randomized experiment, was claimed by a bunch of economists to have increased the participating children’s earnings (as young adults, 20 years later) by 25% or 42%. Here’s what I wrote:

From the press release: “This study adds to the body of evidence, including Head Start and the Perry Preschool programs carried out from 1962-1967 in the U.S., demonstrating long-term economic gains from investments in early childhood development.” But, as I wrote in an earlier post on the topic, there is some skepticism about those earlier claims.

And this:

From the published article: “A substantial literature shows that U.S. early childhood interventions have important long-term economic benefits.”

From the press release: “Results from the Jamaica study show substantially greater effects on earnings than similar programs in wealthier countries. Gertler said this suggests that early childhood interventions can create a substantial impact on a child’s future economic success in poor countries.”

I don’t get it. On one hand they say they already knew that early childhood interventions have big effects in the U.S. On the other hand they say their new result shows “substantially greater effects on earnings.” I can believe that their point estimate of 25% is substantially higher than point estimates from other studies, or maybe that other studies showed big economic benefits but not big gains on earnings? In any case I can only assume that there’s a lot of uncertainty in this estimated difference.

**Here’s the point**

The problem with the usual interpretation of this study is *not* that it’s statistically significant but not practically significant. We’re not talking about an estimate of .003 with a standard error of .001. No, things are much different. The effect is statistically significant and *huge*—indeed, the small sample and high variation ensure that, if the estimate is statistically significant, it will have to be huge. But I don’t believe that huge estimate (why should I? It’s biased; it’s the product of a selection effect, the statistical significance filter).
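To see why a noisy study can only ever report huge “significant” effects, here’s a minimal numerical sketch (the standard errors below are made-up values for illustration): any estimate that clears the conventional two-sided 5% threshold must exceed roughly two standard errors, so when the standard error is large, every estimate that survives the filter is automatically large too.

```python
# Minimal sketch: the smallest estimate that can reach |z| > 1.96 is
# about two standard errors, so a noisy study can only ever report a
# large "significant" effect. The standard errors below are made up.
se_small_noisy_study = 0.5   # hypothetical s.e. from a small, noisy study
se_large_study = 0.05        # hypothetical s.e. from a large study


def min_significant_estimate(se):
    """Smallest |d| that gives |d / se| > 1.96 (two-sided 5% test)."""
    return 1.96 * se


noisy = min_significant_estimate(se_small_noisy_study)   # 0.98
precise = min_significant_estimate(se_large_study)       # 0.098
```

On a scale where 1 is a meaningful effect, the noisy study cannot report any significant estimate smaller than about 0.98, i.e., anything that gets through the filter is near-maximal.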

And all this “statistically significant but not practically significant” talk can lead us completely astray, by making us wary of very *small* estimates, when what we should really be suspicious of is very *large* estimates!

… or maybe we should just be suspicious ;)

Great post. I’ve recently been reading Mayo’s work, and she argues for ‘severe’ testing, which I’ve understood to be a sort of pre-statistical/mathematical framework for determining test validity.

On the other hand, your posts and arguments tend to rest more on a general analysis drawn from your own deep and thorough knowledge of statistics and research design.

My question for you is do you guide your analysis through something like ‘severe’ testing, or another philosophical modelling perspective, but then just write it in a way that your audience will understand? Or do you tend to just stick to your deep well of methodology knowledge and not get caught up in the philosophy of science frameworks?

Thanks!

Simon

@Andrew:

Do you disbelieve all huge effects?

All huge effects that were not produced by a pre-registered study?

Rahul:

No, I don’t disbelieve all huge effects, nor do I disbelieve all huge effects that were not produced by a preregistered study.

Perhaps it’s worth emphasizing that I’ve never done a preregistered study in my life. Then again, I’m not in the habit of making claims that seem ridiculous, backing up my claims with nothing but a flexible theory and a statement of statistical significance. If people do find my claims implausible, they’re free to go replicate them as best they can.

So, I guess the sniff test is the most reliable option after all. If it seems fishy it probably is.

There doesn’t seem to be any “objective” (or even weakly so) way of judging a claim. After all, we do believe *some* huge effect claims, even though they were not produced by pre-registered studies. It’s just that we pick and choose which ones, and there doesn’t seem to be much solid guidance from statistical metrics about this process.

I agree that in practice, the problem you are describing (implausibly large effect estimates because of significance filters) is a bigger problem than overpowered studies that detect minuscule effects.

That said, it is genuinely problematic that we use the same word (significance) for two completely different concepts. This is obvious to statisticians, but the public needs to be constantly reminded. Mixing up the two meanings will lead to mistakes of the first kind, but not the second. Ideally, we’d find a new word for statistical significance (or relocate the concept to the trash can of history, where it belongs).

Yes, completely agree.

Treating “statistically significant” as if it means “proven” is very very bad. But it’s very very bad in a different way than treating “statistically significant” as if it means “practically significant.” These are both problems.

Great point.

Perhaps the distinction is most useful for those cases where “statistical significance” is NOT achieved, and people are tricked into making a type M error (or just ignoring effect sizes altogether). People need reminders that just because an estimated effect is not statistically significant does not mean the true effect is not practically significant. A great example of this is the response to the Oregon Medicaid experiment.

McCloskey and Ziliak pound on this point through their whole book, to the extent of missing (IMO) the point that I think is more important — that large *reported* effects are biased upward through the significance filter. Replication, of course, corrects both types of errors in the longer run. The latter problem is worse (again IMO) because significance is used to try to shut people up. Insignificance rarely is, mainly because insignificance rarely gets published.

I think you are right with research about non-important stuff that people don’t care about. The next psych priming study will only see the light of day if they get the magic p-value, because hardly anybody really cares about that effect anyway.

But when it comes to policy areas that actually matter, people make the other mistake all the time, reporting that some study found “no effect,” often with little or no thought to whether the confidence interval really allows us to rule out effect sizes of practical importance. I mentioned the Oregon Medicaid experiment, but there are many other examples, just google “found no effect.”

According to Wikipedia, Perry found “At age 40 follow-up 42 percent higher median monthly income ($1,856 vs. $1,308).” (They don’t mention data for the early 20s.) As you know, for full-time jobs the Heckman group found a similar-sized effect, but in this article they focused on the participants’ current jobs, even though most of the participants who were still in school held non-full-time, non-permanent jobs of the kind students often have.

But that’s not what the Heckman study says anyway: the Perry data, AFAIK, covered normal low-income kids, while the Heckman study deals with stunted children, and it found that stunted children who received the intervention had 25% higher income than those who did not, and that they caught up with a non-stunted group from a similar economic background. I’m just not sure I buy the idea that stunting is a proxy for extremely low income. From what I can tell, their data don’t say anything about what the impact of the program on non-stunted low-income children would be. We don’t know what was going on in the home that contributed to the stunting.

Also, I think you need to consider: 25% of what? These are children from poverty in Jamaica; in absolute terms, a 25% increase even in earnings may not be that much. The World Bank says that in 2013 the GNI per capita was $5220. If we used that as a starting point, we are talking less than $53/month, probably less since these were kids in poverty, still only in their early 20s, many still in school. This is somewhat of a case where percent change sounds a lot bigger than absolute change. Would you be posting about it if the press releases had stated the effect as $53/month or $13/week? I find that amount pretty believable. I wish they would just show the raw median earnings for the groups at some point. And I’m sure that increase represents a huge, life-changing difference in the lives of those young adults and, probably even more important, in the lives of their children.

The effect size in the data is such (and much bigger on college attendance and other education variables than on income at this point) that you do have to stand up and take notice, even though it’s a small sample and there is some differential loss to follow-up. The difference in migration (in the data, migration is pretty strongly associated with the intervention) and most certainly the difference in education could be the real explanation for the income differences, though the authors do attempt to control for that. Given that they saw large effects on cognitive skills throughout the study period, I don’t think that the impact on earnings is that surprising. I just wish they had been collecting data that would have helped us understand the thinking behind the migration decision, which is really intriguing.

I don’t think the issue is that they don’t really have such an observed difference in their data. I think the problem is that, because of the significance filter and the focus on percent change, a replication would probably not find an effect size close to what they found, but that caution doesn’t mean the “real” effect of such programs is 0. If the program is scaled up, it would be for a much wider, healthier group of low-income children, and that would also probably yield a lower effect size, especially if we use percent difference as the outcome. So maybe it’s $5/week. Would $5/week make a meaningful difference in the lives of people in Jamaica? Sure; even in the US that can mean the difference between being able to ride the bus to work versus walking, or not having to go to a food pantry until the next day. Even the minimum wage increase from $8 to $8.75 here in NY, almost a 10% increase and $30/week if you work full time, is meaningful in people’s daily lives. $0.75/hour would mean a lot less to me, both in terms of percent change and in terms of daily life, but a lot more to someone in a slum in Kingston.
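The minimum-wage arithmetic in that comment checks out; as a quick sketch (using the $8 and $8.75 figures from the comment and a standard 40-hour week):

```python
# Percent change vs. absolute change, using the NY minimum-wage figures
# from the comment above ($8.00 -> $8.75, 40-hour full-time week).
old_wage, new_wage = 8.00, 8.75

pct_increase = (new_wage - old_wage) / old_wage   # 0.09375, "almost 10%"
weekly_gain = (new_wage - old_wage) * 40          # $30 for a full-time week
```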

It’s important to estimate the absolute increase, because there could, conceivably, be more cost-effective ways than education to raise average blue-collar incomes in Jamaica by a comparable amount. Off-topic, I would add that, absent full employment, Perry-type data also raises the question whether program beneficiaries have simply gotten jobs that other people would have gotten absent the intervention, leading to little or no societal gains.

That would be pretty hard to figure out. For example, the Perry kids were much less likely to have more than 5 arrests or to be incarcerated. Also, they on average received lower social benefits. So social costs are down in those respects. It’s also not clear to me how much benefit you get by just being better at a job and lasting longer at a job (which would make sense with the kinds of skills Perry preschool focused on, such as dealing with conflicts), and so possibly getting some pay increases. Keep in mind we are talking $500 a month; we are still talking about people in their 40s who are making on average under $23,000 a year. (So their annual wage differential would be about $6,000, and a year incarcerated costs about $17,000 in direct costs.)


“Much more common are small studies where estimated effects are statistically significant but the estimates are unrealistically huge (remember the statistical significance filter).

We’ve spent a lot of space on this blog recently on studies where the noise overwhelms the signal, where any comparisons in the data, statistically significant or not, are essentially meaningless.”

hi andrew, could you clarify? a test statistic can be thought of as a signal-to-noise ratio. when the result is statistically significant, the test statistic is big. thus we have a large signal-to-noise ratio, which in the scenario you describe is incorrect. but you say that the noise overwhelms the signal. however, in that sample where we have a large test statistic, the signal has overwhelmed the noise. is it that the true noise overwhelms the true signal, which can lead to an incorrect overestimated signal-to-noise ratio? is that kind of the point?

Jimmy:

I think you’re saying that if z=d/s (where d is the observed difference and s is the estimated s.e.), then d is an estimate of signal strength and s is a measure of noise strength, so if z=3 (say), then there is evidence that the signal is much stronger than the noise. Sure, but in some settings it’s only very weak evidence. There are a lot of settings where the prior knowledge makes it clear that the signal has to be very small. Hence that “This is what power=.06 looks like. Get used to it.” graph.
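Jimmy’s question can also be checked by simulation. A minimal sketch (the true effect and standard error below are made-up values, chosen so that power comes out near .06): every significant result has, by construction, a large observed z, yet conditioning on significance drastically overstates the true signal.

```python
import numpy as np

rng = np.random.default_rng(0)
true_effect = 0.1   # hypothetical small true signal
se = 0.35           # hypothetical large noise level (standard error)

# Each simulated study reports d ~ Normal(true_effect, se), and we keep
# the ones with |z| = |d / se| > 1.96, i.e., "statistically significant."
d = rng.normal(true_effect, se, size=1_000_000)
significant = np.abs(d / se) > 1.96

power = significant.mean()                                  # roughly .06
exaggeration = np.abs(d[significant]).mean() / true_effect  # type M error

# Every surviving estimate has a large observed signal-to-noise ratio,
# but the published estimates overstate the true effect many times over.
```

So, in Jimmy’s terms: the *observed* signal-to-noise ratio is large in the significant samples, but it is a biased estimate of the *true* signal-to-noise ratio, which here is tiny.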

thanks! i think i get it.
