Early p-hacking investments substantially boost adult publication record

In a post with the title “Overstated findings, published in Science, on long-term health effects of a well-known early childhood program,” Perry Wilson writes:

In this paper [“Early Childhood Investments Substantially Boost Adult Health,” by Frances Campbell, Gabriella Conti, James Heckman, Seong Hyeok Moon, Rodrigo Pinto, Elizabeth Pungello, and Yi Pan], published in Science in 2014, researchers had a great question: Would an intensive, early-childhood intervention focusing on providing education, medical care, and nutrition lead to better health outcomes later in life?

The data they used to answer this question might appear promising at first, but looking under the surface, one can see that the dataset can’t handle what is being asked of it. This is not a recipe for a successful study, and the researchers’ best course of action might have been to move on to a new dataset or a new question.

Yup, that happens. What, according to Wilson, happened in this case?

What the authors of this Science paper did instead was to torture the poor data until it gave them an answer.

Damn. Wilson continues with a detailed evisceration. You can read the whole thing; here I’ll just excerpt some juicy bits:

Red Flag 1: The study does not report the sample size.

I couldn’t believe this when I read the paper the first time. In the introduction, I read that 57 children were assigned to the intervention and 54 to control. But then I read that there was substantial attrition between enrollment and age 35 (as you might expect). But all the statistical tests were done at age 35. I had to go deep into the supplemental files to find out that, for example, they had lab data on 12 of the 23 males in the control group and 20 of the 29 males in the treatment group. That’s a very large loss-to-follow-up. It’s also a differential loss-to-follow-up, meaning more people were lost in one group (the controls in this case) than in the other (treatment). If this loss is due to different reasons in the two groups (it likely is), you lose the benefit of randomizing in the first place.

The authors state that they accounted for this using inverse probability weighting. . . . This might sound good in theory, but it is entirely dependent on how good your model predicting who will follow-up is. And, as you might expect, predicting who will show up for a visit 30 years after the fact is a tall order. . . . In the end, the people who showed up to this visit self-selected. The results may have been entirely different if the 40 percent or so of individuals who were lost to follow-up had been included.
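
[An aside to make the weighting point concrete. Here is a toy simulation, mine rather than Wilson’s or the authors’, with invented numbers: inverse probability weighting recovers the full-sample mean only when the follow-up model captures what actually drives attrition; if the model misses it, the weights buy you essentially nothing.]

```python
# Toy sketch of inverse probability weighting (IPW) for attrition.
# All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
health = rng.normal(0, 1, n)                 # latent health, drives the outcome
y = 130 - 10 * health + rng.normal(0, 5, n)  # outcome, e.g. systolic blood pressure
# Healthier people are more likely to show up for the age-35 visit.
p_followup = 1 / (1 + np.exp(-(0.5 + 1.5 * health)))
observed = rng.random(n) < p_followup

print(f"true mean outcome:         {y.mean():.1f}")
print(f"complete-case mean:        {y[observed].mean():.1f}")

# IPW with the correct follow-up model: weight each observed person by
# 1 / P(observed), which recovers the full-sample mean.
w = 1 / p_followup[observed]
print(f"IPW mean, correct weights: {np.average(y[observed], weights=w):.1f}")

# If the follow-up model misses what actually drives attrition, the weights
# are roughly constant and you are back to the biased complete-case answer.
```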

Red Flag 2: Multiple comparisons accounted for! (Not Really)

Referring to challenges with this type of analysis, the authors write in their introduction:

“Numerous treatment effects are analyzed. This creates an opportunity for ‘cherry picking’—finding spurious treatment effects merely by chance if conventional one-hypothesis-at-a-time approaches to testing are used. We account for the multiplicity of the hypotheses being tested using recently developed stepdown procedures.”

. . . The stepdown procedure they refer to does indeed account for multiple comparisons. But only if you use it on, well, all of your comparisons. The authors did not do this . . .
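
[To see what is at stake, here is a quick simulation of my own. It uses the simpler Holm stepdown rather than the bootstrap stepdown procedure the authors cite, but the logic is the same: when every outcome is pure noise, correcting across all comparisons holds the familywise error rate near 5 percent, while correcting across only the handful of most promising comparisons does not.]

```python
# Illustrative simulation: a stepdown correction only controls the familywise
# error rate if it is applied to all of the comparisons, not a favored subset.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_outcomes, n_per_group, alpha, subset = 2000, 40, 25, 0.05, 5

def holm_any_reject(pvals, alpha):
    # Under the Holm stepdown, at least one hypothesis is rejected exactly
    # when the smallest p-value clears its most stringent threshold, alpha/m.
    return np.min(pvals) <= alpha / len(pvals)

any_reject_all = any_reject_subset = 0
for _ in range(n_sims):
    # Global null: the treatment has no effect on any outcome.
    treated = rng.normal(size=(n_outcomes, n_per_group))
    control = rng.normal(size=(n_outcomes, n_per_group))
    pvals = np.array([stats.ttest_ind(t, c).pvalue for t, c in zip(treated, control)])
    any_reject_all += holm_any_reject(pvals, alpha)
    # "Cherry picking": pretend the few most promising outcomes were the only ones tested.
    any_reject_subset += holm_any_reject(np.sort(pvals)[:subset], alpha)

print(f"familywise error rate, Holm over all {n_outcomes} outcomes: {any_reject_all / n_sims:.3f}")
print(f"familywise error rate, Holm over the {subset} smallest p-values: {any_reject_subset / n_sims:.3f}")
```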

One problem here is that, as the economists like to say, incentives matter. Campbell et al. put some work into this study, and it was only going to get published in a positive form if they found statistically significant results. So they found statistically significant results.

Two of the authors of the paper (Heckman and Pinto) replied:

Dr. Perry Wilson’s “Straight Talk” dismisses our study—the first to study the benefits of an early childhood program on adult health—as a statistical artifact, where we “torture the poor data” to get findings we liked. His accusation that we tortured data is false. Our paper, especially our detailed 100-page appendix, documents our extensive sensitivity and robustness analyses and contradicts his claims.

I’ve done robustness studies too, I admit, and one problem is that these are investigations designed not to find anything surprising. A typical robustness study is like a police investigation where the cops think they already know who did it, so they look in a careful way so as not to uncover any inconvenient evidence. I’m not saying that robustness studies are necessarily useless, just that the incentives there are pretty clear, and the actual details of such studies (what analyses you decide to do, and how you report them) are super-flexible, even more so than original studies which have forking path issues of their own.

Heckman and Pinto continue with some details, to which Wilson responds. I have not read the original paper in detail, and I’ll just conclude with my general statement that uncorrected multiple comparisons are the norm in this sort of study which involves multiple outcomes, multiple predictors, and many different ways of adjusting for missing data. Everybody was doing it back in 2014 when that paper was published, and in particular I’ve seen similar issues in other papers on early childhood intervention by some of the same authors. So, sure, of course there are uncorrected multiple comparisons issues.

I better unpack this one a bit. If “everybody was doing it back in 2014,” then I was doing it back in 2014 too. And I was! Does that mean I think that all the messy, non-preregistered studies of the past are to be discounted? No, I don’t. After all, I’m still analyzing non-probability samples—it’s called “polling,” or “doing surveys,” despite what Team Buggy-Whip might happen to be claiming in whatever evidence-less press release they happen to be spewing out this month—and I think we can learn from surveys. I do think, though, that you have to be really careful when trying to interpret p-values and estimates in the presence of uncontrolled forking paths.

For example, check out the type M errors and selection bias here, from the Campbell et al. paper:

The evidence is especially strong for males. The mean systolic blood pressure among the control males is 143 millimeters of mercury (mm Hg), whereas it is only 126 mm Hg among the treated. One in four males in the control group is affected by metabolic syndrome, whereas none in the treatment group are affected.

Winner’s curse, anyone?
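
Here’s a quick simulation of the winner’s-curse arithmetic. The numbers are invented for illustration (they are not the Campbell et al. data): a modest true effect, roughly 20 people per arm, and we look at what the estimate looks like in the replications that happen to clear p < 0.05.

```python
# Type M (exaggeration) error: with a small true effect and ~20 per arm,
# the estimates that reach p < 0.05 are, on average, several times too large.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_diff, sd, n, sims = 3.0, 15.0, 20, 20000  # e.g., mm Hg of blood pressure

est, sig = [], []
for _ in range(sims):
    control = rng.normal(0.0, sd, n)
    treated = rng.normal(-true_diff, sd, n)  # treatment lowers the outcome
    diff = control.mean() - treated.mean()
    p = stats.ttest_ind(control, treated).pvalue
    est.append(diff)
    sig.append(p < 0.05)

est, sig = np.array(est), np.array(sig)
print(f"power: {sig.mean():.2f}")
print(f"mean estimate, all runs: {est.mean():.1f} (truth = {true_diff})")
print(f"mean estimate, significant runs only: {est[sig].mean():.1f}")
```

The estimates that survive the significance filter are the ones that get published and quoted, and on average they overstate the true effect several times over.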

The right thing to do, I think, is not to pick a single comparison and use it to get a p-value for the publication and an estimate for the headlines. Rather, our recommendation is to look at, and report, and graph, all relevant comparisons, and form estimates using hierarchical modeling.
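
Here is a minimal sketch of the partial-pooling idea: a toy normal hierarchical model with a crude moment estimate of the between-comparison variance. A real analysis would fit the model in Stan or the like, but the shrinkage logic is the same.

```python
# Minimal partial-pooling sketch (not the paper's analysis): normal-normal
# hierarchical model with a moment estimate of the between-comparison variance.
import numpy as np

rng = np.random.default_rng(2)
J, mu, tau, se = 20, 1.0, 2.0, 4.0          # made-up numbers
theta = rng.normal(mu, tau, J)               # true effects of J comparisons
y = rng.normal(theta, se)                    # noisy per-comparison estimates

mu_hat = y.mean()
tau2_hat = max(y.var(ddof=1) - se**2, 0.0)   # moment estimate of tau^2
shrink = tau2_hat / (tau2_hat + se**2)       # weight on the raw estimate
theta_pooled = mu_hat + shrink * (y - mu_hat)

top = np.argmax(y)                           # the comparison a headline would pick
print(f"largest raw estimate:       {y[top]:.1f}")
print(f"its partially pooled value: {theta_pooled[top]:.1f}")
print(f"its true effect:            {theta[top]:.1f}")
```

The comparison with the largest raw estimate, the one a headline would pick, gets pulled back toward the overall mean, which is exactly the correction the winner’s curse calls for.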

Reanalyzing data can be hard, and I suspect that Wilson’s right that the data at hand are too noisy and messy to shed much light on the researchers’ questions about long-term effects of early-childhood intervention.

And, just to be clear: if the data are weak, you can’t necessarily do much. It’s not like, if Campbell et al. had done a better analysis, then they’d have this great story. Rather, if they’d done a better analysis, it’s likely they would’ve had uncertain conclusions: they’d just have to report that they can’t really say much about the causal effect here. And, unfortunately, it would’ve been a lot harder to get that published in the tabloids.

On to policy

Early childhood intervention sounds like a great idea. Maybe we should do it. That’s fine with me. There can be lots of reasons to fund early childhood intervention. Just don’t claim the data say more than they really do.

32 thoughts on “Early p-hacking investments substantially boost adult publication record”

  1. I’ve been listening to and reading about different early childhood programs. Brookings, particularly, has hosted a considerable number of panels and published reports. I suppose that since we have one of the largest mass public education systems, precipitating ongoing evaluations by universities, state and federal entities, etc., it is inevitable that some percent of these studies are not going to be very good. As you suggest, Andrew, there should be ongoing awareness campaigns as to the pitfalls of using any of these measuring tools. Especially since statisticians may or may not persist in definitional promiscuity.

  2. If this were a trial, “our 100-page appendix vindicates our apparently weak results” would be the sort of false exculpatory statement that the prosecutor would love.

    • Paul:

      Here’s what I think is going on. The authors are economists who pride themselves on their rigor. When they’re criticized for what are, essentially, non-rigorous inferences, these guys get annoyed: Who are these nobodies to fault them for lack of rigor? So that’s one reason the responses to criticism are so weak: the original authors have never taken the criticism seriously.

      • Indeed. But just to be clear, I long ago stopped seeing the point in distinguishing between self-serving hubris and moral corruption. Both of them will get you into the Iraq War “despite the best of intentions.”

        The authors’ defense of their farcical multiple comparison analysis is especially damning. They understand multiple comparison adjustment precisely well enough to pretend to use it while preventing it from harming their precious, barely significant p-values.

      • Sameera, Paul:

        Yes, I think “self-serving hubris” is the right way to put it. It’s not so much that the authors are avoiding criticism; it’s more like they’re not seeing the criticism as being legitimate so they don’t bother to address it at all.

        Or, to put it another way, I’m thinking that they think it’s important to respond to the criticism for reasons of politics or public relations, but they don’t see the criticism as serious science.

        My guess is that the only criticism that they would respect would have to come from a credentialed economist—I guess that would mean a professor of economics from a top-15 university, or something like that. And, for political reasons, credentialed economists don’t like getting into public disagreements with these guys. Also, early childhood intervention is a good thing, right? So you don’t see a lot of motivation to oppose a study that supports it. I think something similar was going on with that air-pollution-in-China study we discussed a while back. For one thing, everybody’s against air pollution. For another, economists seem to be willing to argue with each other on issues of policy, but not so much on methodology. They perhaps have a common interest in being the guardians of rigor. That’s one reason I admire Angus Deaton and James Heckman in their criticisms of randomized controlled trials as the gold standard of causal inference: These economists were willing to go against the standard views in their field. Ironic that Heckman can’t handle criticism of his applied work—but, again, this criticism is not coming from credentialed economists so it doesn’t count.

        • > credentialed economists don’t like getting into public disagreements with these guys.
          That may be a common problem. In clinical research I would get emails from credentialed clinical researchers that essentially said: I agree what they did is stupid, but please do not repeat that, as it might affect my career negatively.

        • Andrew,

          You are so right. Thank you. I gotta check out these guys b/c I’m used to a lot of guys who wimp out too easily.

          I have noticed that some circles, specifically education-related & national-security fora, include more non-academics too, b/c after all, how else are they going to get any new perspectives? What is experimental education but largely an exercise in imagination & practice?

        • I guess it depends on what “nice thing” they are opposing by supporting early childhood intervention (sic). It seems to me that economists are so invested in early childhood intervention because it is a way to oppose other progressive policies, particularly access to and funding for higher education (the reasoning is “if you fund higher education for the poor, you are actually benefiting the most advantaged among them; and either way, absent early childhood intervention, they won’t benefit from investment later in their life”).

  3. Being heavily influenced by Jerome Bruner’s work, I think that more emphasis should be placed on theory and practice, particularly as we have been undergoing shifts in epistemology and pedagogy that are increasingly technologically driven: AI, machine learning, etc.

    The love affair with these gifted programs was precipitated by curiosity about how some kids can learn to read advanced texts by the time they are 6. But this shouldn’t be the obsessive standard by which to pursue early childhood education goals and objectives. The reason I have continued to suggest this is that such reading levels can’t be achieved by everyone to begin with. Nor does there seem to be any one right way to improve reading skills. Over the years I’ve been besieged with queries about which educational strategy would be optimal. Rather, I think the child who reads at higher levels has had an unusual education to begin with.

  4. ” A typical robustness study is like a police investigation where the cops think they already know who did it, so they look in a careful way so as not to uncover any inconvenient evidence. ”

    Well, I only read a narrow slice of the scientific literature, so I can’t say how typical this kind of robustness study is. But what you’re describing is a robustness study that is designed to fail. Any technique can be misused so as to purposely fail. A good robustness study will, in fact, do the opposite of what you say. It will test the ability of the findings to withstand alternative assumptions so extreme that they break the conclusions. That’s the real finding of a robustness study: how hard do you have to push before things break. If your robustness study hasn’t found that point, then it’s not a real robustness study (unless we’re checking the robustness of a parameter whose support is bounded and we’ve already gotten to the bounds and nothing breaks.)

    • Clyde:

      In theory, I agree that a robustness study can be used to explore the breakdown point of a method. In practice, though, I think the vast majority of robustness studies are performed as a reassurance. You say “designed to fail,” but I’d say these studies are designed to succeed—for the purpose of increasing confidence in the original claim, fending off critics, and getting the paper published.

  5. I’m wondering to what extent sloppy statistics (p-hacking, etc.) and publication bias are exacerbated by scientist activists.

    That is, are studies supporting large-scale top-down intervention more likely to get a pass when publishing? Are you aware of any studies looking at retraction rates and political leanings?

    What happened with social psychology (left-leaning interventionism from sociology affecting psychology negatively, e.g., implicit bias training) seems to be part of a broader trend across multiple disciplines.

    Falsification does not come naturally to humans, due to psychological biases; it is probably much harder if the researcher has an activist orientation.

    • I have argued here, however, that these early childhood intervention studies reflect a particular type of liberal ameliorist orientation that falls well short of what we usually think of as leftism. My point is that, even assuming these childhood interventions “work,” they amount to spending resources to move some disadvantaged people a little way up the steep societal ladder, above some people who didn’t get the intervention. An unspoken premise seems to be that if we found an intervention that “worked,” and we gave it to every disadvantaged person, they could all move up—but that premise is inconsistent with what we know about loose labor markets (absent full employment) as well as educational and social hierarchies. One might also ask, would it be a better use of resources to design, implement, track, and report long term on an intervention—or just to give that same amount of money directly to the people you are trying to help? One option keeps a lot of upper middle class people busy and interested, while the second option is 100% guaranteed to make some people less poor.

      • Kyle,

        I lean to your characterization. It does fall short of the typical view of leftism. But ‘leftism’ is not necessarily synonymous with ‘eclecticism,’ and it is the latter that is the fount of potential for childhood education. These childhood intervention programs produce more ‘conformity’ than ‘distinctiveness or diversity’.

        Many of the researchers come from middle and upper middle class communities. Their sensibilities are informed by their own limited perspective on the nature of intelligence and what constitutes intelligence.

  6. Very interesting reading, thanks. I was curious that Wilson argues the attrition creates problems using WWC standards, but when discussing the multiple comparisons problem he uses FDA guidance. The WWC standard is to adjust within the same outcome domain, which sounds like what the authors are doing. I do RCTs for the federal government, and this is very common in evaluation studies. Are we doing this wrong? Should we adjust all outcomes together when doing multiple comparisons in RCTs?

  7. I am not sure that economists are intrinsically more aggressive in defending their papers when they are criticized — it seems to me that in most disciplines there is an unfortunate tendency to get defensive once criticized. Once the choice is made to be aggressive in the defense of one’s work, attacking the legitimacy of the critics by questioning their expertise in the area is a natural, although again unfortunate, route to take. I feel I see that in other disciplines as well, e.g., in some of the cases where Andrew criticized papers in psychology. So, I am not really sure this is specifically an economics thing. I also disagree a bit with the claim that what Andrew calls “credentialed economists” are not willing to take on “these guys.” For a while David Card (and they don’t really come more credentialed than David!) taught a very impressive class where he had students replicate and reanalyze well-known papers by senior economists. The response to the criticisms of these papers was often quite aggressive, suggesting it is not just criticisms from non-economists that are frowned upon by economists. As another example, my recent comment on the Cartwright and Deaton piece on randomized experiments (here is a link: https://people.stanford.edu/imbens/publications) is substantially more critical than Andrew’s comment on the same piece. There are many other examples of senior economists taking issue with each other’s work. Maybe we need even more of that. It might help if there were more economics journals publishing papers with comments, the way many of the leading statistics journals do.

    • Guido:

      I didn’t mean to imply that economists are more aggressive in their defenses, compared to other academic researchers. Indeed, I’ve argued that one reason academics are so defensive in response to criticism is that we’ve been trained to think of criticism as something to deflect, rather than something to take seriously.

      And I agree that some credentialed economists are willing to criticize other work done in the economics field. And that criticism does get taken seriously.

      One thing I do think is different about economists, though (or maybe it’s something that economists share with M.D.’s) is that they typically don’t seem to accept the legitimacy of criticism coming from outside their field (or from low-status people within economics). In the response from the authors of the paper discussed above, I just had the impression that never for one second did the authors consider that the critic might have a point.

      In any case, my interest here is not so much in what makes economics special or different, but rather the general issue of not taking criticism seriously. We’ve had a lot of discussion in the past few years about researchers behaving badly when criticized, and it has recently struck me that the key problem comes a step before. The key problem is not so much that people are responding badly to criticism, but rather that they’re not taking the criticism seriously in the first place. I’d think (or at least hope) that if the authors of this original paper had considered the possibility that they were fundamentally in error, that they’d have been able to process the criticism and learn from it. But I don’t think the authors ever got to that point.

      It’s sad when researchers are given, for free, a serious critique of their work, and they squander the opportunity by not taking the critique seriously. It happens all the time, and it makes me sad every time.

  8. I knew the last author (Yi Pan) quite well. He was a good friend of mine, and an excellent young statistician. Sadly, he died of leukemia in 2016. I thought you and your readers should know.
