The latest Perry Preschool analysis: Noisy data + noisy methods + flexible summarizing = Big claims

Dean Eckles writes:

Since I know you’re interested in Heckman’s continued analysis of early childhood interventions, I thought I’d send this along: The intervention is so early, it is in their parents’ childhoods.

See the “Perry Preschool Project Outcomes in the Next Generation” press release and the associated working paper.

The estimated effects are huge:

In comparison to the children of those in the control group, Perry participants’ children are more than 30 percentage points less likely to have been suspended from school, about 20 percentage points more likely never to have been arrested or suspended, and over 30 percentage points more likely to have a high school diploma and to be employed.

The estimates are significant at the 10% level. Which may seem like quite weak evidence (perhaps it is), but actually the authors employ a quite conservative inferential approach that reflects their uncertainty about how the randomization actually occurred, as discussed in a related working paper.

My quick response is that using a noisy (also called “conservative”) measure and then finding p less than 0.10 does not constitute strong evidence. Indeed, the noisier (more “conservative”) the method, the less informative is any given significance level. This relates to the “What does not kill my statistical significance makes me stronger” fallacy that Eric Loken and I wrote about (and here’s our further discussion)—but even more so here, as the significance is at the 10% rather than the conventional 5% level.

In addition, I see lots and lots and lots of forking paths and researcher degrees of freedom in statements such as, “siblings, especially male siblings, who were already present but ineligible for the program when families began the intervention were more likely to graduate from high school and be employed than the siblings of those in the control group.”

Just like everyone else, I’m rooting for early childhood intervention to work wonders. The trouble is, there are lots and lots of interventions that people hope will work wonders. It’s hard to believe they all have such large effects as claimed. It’s also frustrating when people such as Heckman routinely report biased estimates (see further discussion here). They should know better. Or they should at least know enough to know that they don’t know better. Or someone close to them should explain it to them.

I’ll say this again because it’s such a big deal: If you have a noisy estimate (because of biased or noisy measurements, small sample size, inefficient (possibly for reasons of conservatism or robustness) estimation, or some combination of these reasons), this does not strengthen your evidence. It’s not appropriate to give extra credence to your significance level, or confidence interval, or other statement of uncertainty, based on the fact that your data collection or statistical inference are noisy.
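Here is a minimal simulation of that point (my own illustration with made-up numbers, not the authors' analysis): a modest true effect, a noisy estimator, and a filter that keeps only the estimates reaching p < 0.10. The estimates that survive the filter are several times too large; noise plus a significance threshold does not strengthen the evidence, it exaggerates it.

```python
# A sketch of the "noise + significance filter" problem, with assumed numbers.
import numpy as np

rng = np.random.default_rng(0)
true_effect = 0.05        # hypothetical modest true effect (assumed)
se = 0.10                 # standard error of a noisy estimator (assumed)
n_sims = 100_000

estimates = rng.normal(true_effect, se, n_sims)
z = estimates / se
significant = np.abs(z) > 1.645          # two-sided p < 0.10

print("share of simulations reaching p < 0.10:", significant.mean())
print("mean estimate among the significant ones:",
      estimates[significant].mean())     # several times the true effect
print("exaggeration ratio (type M error):",
      np.abs(estimates[significant]).mean() / true_effect)
```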

I’d say that I don’t think the claims in the above report would replicate—but given the time frame of any potential replication study, I don’t think they will be tested one way or another, so a better way to put it is that I don’t think the estimates are at all accurate or reasonable.

But, hey, if you pick four point estimates to display, you get this:

[chart from the press release showing the four selected point estimates]

That and favorable publicity will get you far.

P.S. Are we grinches for pointing out the flaws in poor arguments in favor of early childhood intervention? I don’t think so. Ultimately, our goal has to be to help these kids, not just to get stunning quotes to be used in PNAS articles, NPR stories, and TED talks. If the researchers in this area want to flat-out make the argument that exaggeration of effects serves a social good, that these programs are so important that it’s worth making big claims that aren’t supported by the data, then I’d like to hear them make this argument in public, for example in comments to this post. But I think what’s happening is more complicated. I think these eminent researchers really don’t understand the problems with noise, researcher degrees of freedom, and forking paths. I think they’ve fooled themselves into thinking that causal identification plus statistical significance equals truth. And they’re supported by an academic, media, and governmental superstructure that continues to affirm them. These guys have gotten where they are in life by not listening to naysayers, so why change the path now? This holds in economics and policy analysis, just as it does in evolutionary psychology, social psychology, and other murky research areas. And, as always, I’m not saying that all or even most researchers are stuck in this trap; just enough for it to pollute our discourse.

What makes me sad is not so much the prominent researchers who get stuck in this way, but the younger scholars who, through similar good intentions, follow along these mistaken paths. There’s often a default assumption that, as the expression goes, with all this poop, there must be a pony somewhere. In addition to all the wasted resources involved in sending people down blind alleys, and in addition to the statistical misconceptions leading to further noisy studies and further mistaken interpretations of data, this sort of default credulity crowds out stronger, more important work, perhaps work by some junior scholar that never gets published in a top 5 journal or whatever because it doesn’t have that B.S. hook.

Remember Gresham’s Law of bad science? Every minute you spend staring at some bad paper, trying to figure out reasons why what they did is actually correct, is a minute you didn’t spend looking at something more serious.

And, yes, I know that I’m giving attention to bad work here, I’m violating my own principles. But we can’t spend all our time writing code. We have to spend some time unit testing and, yes, debugging. I put a lot of effort into doing (what I consider to be) exemplary work, into developing and demonstrating good practices, and into teaching others how to do better. I think it’s also valuable to explore how things can go wrong.

31 thoughts on “The latest Perry Preschool analysis: Noisy data + noisy methods + flexible summarizing = Big claims”

  1. A society, even one as wealthy as this country, has limited resources. Overestimating the effect size of an intervention leads to wasteful spending and does great harm to overall social welfare. We have too many credulous “believe in science” types and too few grinches these days.

  2. “Remember Gresham’s Law of bad science? Every minute you spend staring at some bad paper, trying to figure out reasons why what they did is actually correct, is a minute you didn’t spend looking at something more serious.”

    Amen.

    “It’s hard to believe they all have such large effects as claimed”

    Yes you have to wonder with all these amazing things happening why we’re not all above average!

    • > Remember Gresham’s Law of bad science? Every minute you spend staring at some bad paper, trying to figure out reasons why what they did is actually correct, is a minute you didn’t spend looking at something more serious.

      Hey! Getting annoyed and confused at dubious science is half the fun here.

      There is good entertainment to be had here (copying these from the press release):

      > While the researchers do not have earnings data on the Perry participants’ children, they note that the children “likely earn more than those in the control group, perhaps due to enhanced cognitive and noncognitive skills.”

      We don’t know, but probably, and here’s why.

      > “About 8 percent of the second-generation male children of the male participants in the treatment group are employed college graduates compared to none in the control group,”

      Is that like 2-3 people? There’s like 200-300 kids, so assume half came from male Perries, and half of those kids are male, and then is there the treatment/control split too? (A back-of-the-envelope version of this arithmetic is sketched just after this comment.)

      From the abstract:

      > The intergenerational effects arise despite the fact that families of treated subjects live in similar or worse neighborhoods than the control families.

      In arguing for the existence of 1960s piranhas, we’re concluding that piranhas didn’t exist from like 1980-2000 (I think most of the children considered in this study were like 20-45 years old).

      And then there’s a section labeled:

      > Fertility Decisions of the Perry Participants

      which is creepy enough to justify rejection in and of itself.
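A back-of-the-envelope version of the “8 percent … compared to none” arithmetic above, with every input a guess rather than a figure from the paper:

```python
# All inputs are guesses for illustration, not numbers taken from the paper.
second_gen_children = 250           # the commenter's "like 200-300 kids"
share_from_male_participants = 0.5  # assume half the children have a male Perry parent
share_male_children = 0.5           # assume half of those children are male
share_treatment = 0.5               # assume roughly half trace back to the treatment group

cell = (second_gen_children * share_from_male_participants
        * share_male_children * share_treatment)
print("approximate size of the cell:", cell)       # ~31 people
print("8 percent of that cell:", 0.08 * cell)      # ~2-3 people
```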

  3. Ignoring the direct costs and the opportunity costs of an intervention makes it easy for investigators and observers to get blown away by good intentions+statistical significance.

  4. The Heckman and Karapakula paper does not factor in the possibility that the Perry Preschool Project was fraudulent in some way. Doing so would, of course, weaken their findings further.

    Why is this important and how do we know there is a non-trivial probability of fraud? Because Heckman proved it in his previous work. He used to cite three miraculous intervention studies, Perry Preschool, Abecedarian, and The Milwaukee Project. One of those, The Milwaukee Project, apparently turned out to be fraudulent (as best we can tell). So we know that projects that pass Heckman’s credibility detector may be fraudulent.

    https://en.wikipedia.org/wiki/Rick_Heber

    • Nope. If the Milwaukee Project was fraudulent it does not affect Heckman and Karapakula’s findings. You should check the validity of the Perry Preschool Project, exactly as you would if you believed the Milwaukee Project was OK. Relying on some authority as a “credibility detector” is not a good way to establish your priors in this kind of situation. (Not even when such authority is named Andrew).

      • I don’t understand.

        The Heckman analysis implicitly assumes zero probability of fraud. That seems like an incorrect assumption. If there is a positive probability of fraud, the analysis should acknowledge that. We know from past experience that Heckman’s assumption of a zero fraud probability for the Milwaukee Project was incorrect, so I see no reason to accept the same assumption for the Perry Preschool Project.

        • What I mean is that the probability of fraud in the Perry Preschool Project is independent of the correctness or otherwise of what Heckman assumed about the Milwaukee Project. You should not change your prior about that probability based on past experience about the assumption.

        • I don’t think this is right. It’s entirely meaningful from a Bayesian perspective to say that the probability of fraud in any of these studies is some number, say p, which is unknown… and then to observe an instance of fraud, and to revise your posterior for p upwards to a higher number, thereby altering your posterior inference for the other studies as well…

          it’s a model choice to make them independent, or equal, or unequal but dependent… whatever. (A toy numerical version of this updating is sketched just after this thread.)

        • I think the OP caused the confusion by referring to Heckman’s “credibility detector”. That’s too subjective of a statement. The causal mechanism is more likely to be: fraud caused all those studies to pop up on Heckman’s radar. Heckman is just a filter/selection mechanism. The discovery that one of those studies is fraudulent lends evidence to this causal assumption.
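A toy numerical version of the Bayesian-updating point made a couple of comments above (my own sketch with an assumed prior, not anyone's actual model): if the cited studies are treated as exchangeable draws from a common fraud rate p, then learning that one of the three turned out to be fraudulent shifts the posterior for p, and hence the credence given to the remaining studies, upward. Whether exchangeability is the right model is, as noted, itself a modeling choice.

```python
# Conjugate Beta-Binomial update, done by hand (illustrative numbers only).
a0, b0 = 1, 19                      # assumed prior: mean fraud rate 1/20 = 5%
prior_mean = a0 / (a0 + b0)

frauds, clean = 1, 2                # 1 of the 3 cited studies apparently fraudulent
a1, b1 = a0 + frauds, b0 + clean    # posterior is Beta(2, 21)
posterior_mean = a1 / (a1 + b1)     # 2/23, about 8.7%

print("prior mean fraud rate:", prior_mean)
print("posterior mean fraud rate:", posterior_mean)
```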

  5. “Are we grinches for pointing out the flaws in poor arguments in favor of early childhood intervention?” No you are not. Props to you for doing the necessary work journalists don’t do. Perry Preschool is touted & funded by liberal lobbyists in DC that use this fraudulent “research” in the Heckman and Karapakula papers to expand big government into toddlers’ lives. Andrew, see the Tennessee and Quebec preschool studies that disprove the “Big claims” in the Perry Preschool study.

      • The really surprising thing about much that is published in the social sciences is that people don’t seem to realize that, by definition, if you need statistical analysis to uncover the effect, it has to be small enough to be invisible. So really everyone should be suspicious of any education study that claims a large effect. Rather than protect their results, as they seem to have done, the authors of the Quebec study should have torn them apart, certain in the knowledge that the results couldn’t possibly be accurate.

        • I don’t think this is quite true, though it has an element of truth. There are some big things that no-one discovers because they don’t look at them right. The Birthdays model from the front of BDA is kind of an example of this.

        • The cover of Andrew’s book Bayesian Data Analysis, 3rd Ed. (BDA3) plots a model of the frequency of births at different times of year, days of the week, special holidays, etc. There are very large effects on certain holidays. Probably if you asked some nurses in the Maternity section they knew this, but I don’t think the size of these effects was widely known.

        • Thanks Daniel. I didn’t know that. Yes, I agree that would be hard to confirm without statistical analysis.

          I wasn’t sure if you were referring to the Gladwell thing about athletes commonly having birthdays just after the cut-off ages in their respective sports, meaning those kids are the oldest players in each age group, which gives them an advantage.

        • Jim:

          You write, “by definition, if you need statistical analysis to uncover the effect, it has to be small enough to be invisible.” Not quite. There are lots of examples of phenomena that were not noticed until statistical analysis was performed. Once the analysis is conducted and understood, it is often possible to go back and make a graph of raw data so the pattern is clear—but, in practice, that did not happen until the pattern was discovered using statistics.

        • The effects studied in this paper were not noticed until the statistics were performed :D.

          But the bit of what Jim said I like is:

          > if you need statistical analysis to uncover the effect, it has to be small enough to be invisible

          And you’re always saying better measurement better measurement better measurement — but if we had great measurements we just wouldn’t really need fancy statistics!

        • Hmmm…well, do you have an example? I can’t think of one but it’s not my field either.

          I really don’t think there are a lot of examples. Humans are stupidly good at detecting and exploiting patterns. In part it depends on what you mean by “statistics”. We can’t know without counting that X% of voter group Y prefer candidate Z. But is just tallying that “statistics”? I’d say that modern statistics is a method of analysis, not a method of counting.

          So I guess what I’m claiming is that if an assessment of statistical significance is necessary to support a claim, it is -almost- by definition small.

        • Jim,

          I’ve had examples from my own work. See, for example, my article How Bayesian Analysis Cracked the Red-State, Blue-State Problem. Here’s what I say in the abstract:

          Now that our analysis has been done, we believe it could be replicated using non-Bayesian methods [or even pure graphical displays], but Bayesian inference helped us crack the problem by directly handling the uncertainty that is inherent in working with sparse data.

          I’ve seen this happen a lot, that the first analysis of a problem is complicated, and once we understand what’s going on, we can simplify the analysis and sometimes remove all the analysis entirely. But if we’d been restricted to simple graphical comparisons, we might never have caught the pattern in the first place.

          Statistical analysis, when used well, can be an amplifier of human pattern-recognition skills.

          In addition, there are various technical domains such as image and sound processing where statistical methods can be used to sharpen a signal and make it more visible. There are lots of examples of images or sounds that are difficult or impossible to make out directly, but that you can see or hear clearly once some statistical analysis or signal processing has been done on the input. (A small averaging sketch appears just after this thread.)

        • Thanks Andrew. Your data is convincing, although a few states buck the trend. Have you ever tried to find a different way besides states to break up the country? For example here in WA the state is strongly split politically by the Cascades. East of the Cascades is culturally comparable to ID (think Napoleon Dynamite) while the west is comparable to CA – and has the world’s two richest people (Gates / Bezos).

          Yes, good point about image and sound data. I hadn’t thought of that.
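A small sketch of the signal-sharpening point made above (my own example with made-up numbers): a weak periodic signal that is invisible in any single noisy trial becomes obvious once you average over many trials.

```python
# Averaging across trials: the noise shrinks by sqrt(n), the signal does not.
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 500)
signal = 0.2 * np.sin(2 * np.pi * 5 * t)                    # weak 5 Hz signal (assumed)
trials = signal + rng.normal(0, 2.0, size=(1000, t.size))   # noise swamps every trial

single_trial_corr = np.corrcoef(trials[0], signal)[0, 1]
averaged = trials.mean(axis=0)
averaged_corr = np.corrcoef(averaged, signal)[0, 1]

print("correlation with the true signal, single trial:", round(single_trial_corr, 2))
print("correlation with the true signal, average of 1000 trials:", round(averaged_corr, 2))
```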

  6. What disturbs me about the discussions around the Perry Study and early childhood education is that the default ought to be that the effects are non-linear, but it seems like everyone assumes the opposite. Of course preschool of the right kind is going to help some kids, but it is also going to hurt others. Of course it will improve outcomes on some metrics, but it should also make some outcomes worse. If an intervention in a complex system is efficacious at all, it should have disparate impacts. In medicine, everyone knows “the dose makes the poison.” “Universal whatever” ought to be a red flag. There should be very few interventions in a complex system that work all the time or that are free from very bad side effects, and those should be obvious without any study. Our focus in public policy should be on the question of under what circumstances the intervention is helpful, not on whether the intervention has benefits. (A toy numerical illustration of the disparate-impacts point appears at the end of this thread.)

    • I would agree such work requires more digging. Effects across classes should be expected to be disparate – but how many subclasses of effects should be examined (we could go medieval on such a discussion)? IMNSHO, the ideology of any touted ‘educational program’ can be viewed as skeptically as one might view the plethora of diet programs that all claim some generalizability to a population.

      Where I think educational and economic research could and should be more focused is in the credible development of comparative evidence regarding what type of educational program may have what type of effect (or negative consequence) to which student profile (consider things such as race/ethnicity, SES, age interval, educational history, other priors), and over time (yes, this is terribly dynamic and requires longitudinal understanding). To me this would be far more informative, but currently every *brandable* idea needs a general audience to make it to TEDx, or perhaps an IPO :)

      There are SO many clear biases in this type of research – geographical and cultural drivers being but a few – regarding what an effective educational intervention looks like. In the United States, the typical thought in educational leadership is that STEM (now I hear ‘STEAM’ is becoming more commonplace to include the arts, not sure what that means in practice) is perfectly reasonable to push down to younger and younger students, without a single convincing shred of evidence that this is the right path that extends to a general population.

      Consider whether studies in cognition are reasonably correct when they argue that learning at younger ages is, in general, more suited to things like language learning (I’d argue this would include maths if taught as a language), music, and art. I suspect this several-decade-old push is part of the ‘make my brilliant child brillianter’ movement, wherein pushing sciences (take coding, for example, and the strangely instituted ‘financial literacy’) is modeled around the early inclinations of a rarer population (i.e. the minority of kids who are inclined at an early age toward sciences). Charter schools have fed off of the ‘science pushdown’ like gut worms – and often have student populations more heavily biased toward students who fit early inclinations in science (and use *their* outcomes rabidly to rake public ed over the coals).

      I can’t recall precisely where I read a research criticism of studies surrounding Montessori that suffered from similar problems as Perry.

      Still I get the policy discussion of all of this, educational equalizers are important and there appears not to be one size fits all – which must be recognized by parents, policymakers, and more fundamentally, researchers.

      One other thought – has anyone compared the Perry preschoolers to children who have been educated under other Pre-K to K philosophies? Here my thought is: can we determine whether the effects are simply due to the presence or absence of attendance in ANY preschool program? In particular, are effects more likely to be due to things general to preschool (like socialization, structure, caring/sharing, or discipline), or is the academic program itself [or specific tenets of it] most explanatory of the targeted outcomes?
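A toy numerical illustration of the disparate-impacts point in comment 6 (hypothetical subgroups and effect sizes of my own choosing): an intervention that helps one subgroup and hurts another can still show a modest positive average effect, which is all an analysis of averages would report.

```python
# Hypothetical heterogeneous treatment effects hidden behind a small average.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000                          # large sample so the means are stable
group_a = rng.random(n) < 0.5        # assumed: two equal-sized subgroups, A and B
treated = rng.random(n) < 0.5        # randomized treatment

# Assumed subgroup effects: the program helps subgroup A and hurts subgroup B.
effect = np.where(group_a, +0.4, -0.3) * treated
outcome = rng.normal(0, 1, n) + effect

def diff_in_means(mask):
    return outcome[mask & treated].mean() - outcome[mask & ~treated].mean()

print("overall effect estimate:", diff_in_means(np.ones(n, dtype=bool)))  # ~ +0.05
print("effect in subgroup A:", diff_in_means(group_a))                    # ~ +0.4
print("effect in subgroup B:", diff_in_means(~group_a))                   # ~ -0.3
```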
