
Happy talk, meet the Edlin factor

Mark Palko points us to this op-ed in which psychiatrist Richard Friedman writes:

There are also easy and powerful ways to enhance learning in young people. For example, there is intriguing evidence that the attitude that young people have about their own intelligence — and what their teachers believe — can have a big impact on how well they learn. Carol Dweck, a psychology professor at Stanford University, has shown that kids who think that their intelligence is malleable perform better and are more motivated to learn than those who believe that their intelligence is fixed and unchangeable.

In one experiment, Dr. Dweck and colleagues gave a group of low-achieving seventh graders a seminar on how the brain works and put the students at random into two groups. The experimental group was told that learning changes the brain and that students are in charge of this process. The control group received a lesson on memory, but was not instructed to think of intelligence as malleable.

At the end of eight weeks, students who had been encouraged to view their intelligence as changeable scored significantly better (85 percent) than controls (54 percent) on a test of the material they learned in the seminar.

I can believe that one group could do much better than the other group. But do I think there’s a 31 percentage point effect in the general population? No, I think the reported difference is an overestimate because of the statistical significance filter. How much should we scale it down? I’m not sure. Maybe an Edlin factor of 0.1 is appropriate here, so we’d estimate the effect as being 3 percentage points (with some standard error that will doubtless be large enough that we can’t make a confident claim that the average population effect is greater than 0)?
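The arithmetic behind that adjustment is trivial but worth making explicit. A quick sketch in Python; the numbers come from the discussion above, and the factor of 0.1 is a judgment call, not something estimated from data:

```python
# Edlin-factor shrinkage of the reported effect, as described above.
reported_effect = 85 - 54          # 31 percentage points, treatment minus control
edlin_factor = 0.1                 # heavy discount for the statistical significance filter
adjusted_effect = edlin_factor * reported_effect
print(adjusted_effect)             # about 3 percentage points
```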


  1. Hmm, Friedman has selected the wrong finding: The original paper states, “We conducted a manipulation check by examining differences between the experimental and control groups on their learning of the intervention material. If the intervention was successfully communicated, then we would expect the experimental group to perform better on the items that tested their incremental theory knowledge [the intervention], but the two groups should perform equally well on the items that tested their knowledge of brain structure and study skills [which was taught to both groups]…. The groups did not differ on their scores for general workshop content…. The experimental group, however, scored significantly higher on the items that tested the incremental theory intervention content than did the control group (84.5% vs. 53.9%…).”

    In other words, the treatment group scored better on a test of material that they were taught and the control group wasn’t. But does it matter? Only insofar as it indicates that the two groups received different treatments. The actual outcomes of interest were differences in mindset (theory of intelligence), classroom motivation, and academic achievement, which Friedman missed.

    • WOW, so you’re saying they taught both groups some stuff, and then only taught additional stuff to one group (say group A) and then amazingly group A scored better on a test of the additional stuff they were taught that group B wasn’t??

      So…. it’s almost like teaching things actually works!!! Shocking.

      • Indeed. But, in fairness to the study authors, they did this preliminary analysis to check that their two treatment conditions (intervention and control) were actually different before going on to the analysis of their main outcomes. It’s really just a side note that Friedman unfortunately highlighted as if it were a main finding.

    • mark says:

      In that 2007 paper Dweck and her colleagues also make the old ‘the difference between “significant” and “non-significant” is itself not statistically significant’ error that Andrew has written about (see pages 250-251).

      • Andrew says:


        Oh, sure, that happens so often I don’t always even notice when people do it!

        • mark says:

          She also makes the same error in her most recent offering in Psychological Science (Haimowitz & Dweck, 2016). Really annoying.

          • Andrew says:


            Indeed. I googled *haimowitz and dweck psych science* and found this press release from the Association for Psychological Science, which linked to this paper by Kyla Haimovitz and Carol Dweck.

            The title of the paper is, “What Predicts Children’s Fixed and Growth Intelligence Mind-Sets? Not Their Parents’ Views of Intelligence but Their Parents’ Views of Failure,” which indeed suggests they may be attributing significance to a comparison of significant to non-significant.

            The comparison in question appears to be mentioned in the abstract here: “Study 3b showed that children’s perceptions of their parents’ failure mind-sets also predicted their own intelligence mind-sets.”

            So I downloaded the paper and scrolled to Study 3b.

            I see some warning signs:

            A power analysis based on estimates from the previous studies and targeting power of 90% determined that 100 participants was an appropriate sample size.

            Nope. Those estimates from previous studies were too high. They didn’t have 90% power, they just thought they had 90% power, which is dangerous in that it leads them to be all too confident that they’ll find what they’re looking for, in this particular sample.

During normal school hours, the students filled out a questionnaire that contained the same child-reported measure of intelligence mind-set as in Study 1 (α = .81). We additionally assessed the children’s perceptions of their parents’ failure mind-sets, using a shortened two-item scale because of constraints on the survey’s length (“My parents think failure can help me learn,” “My parents think failure is bad and should be avoided”). The items had reasonable reliability (α = .64), so we reverse-scored the first item and then averaged responses to form a composite variable such that higher numbers indicated greater perceptions of a failure-is-debilitating mind-set in parents.

            Nothing wrong here, exactly, but lots of room for forking paths, of course.

            This study included a number of additional measures that are unrelated to the current research question and not reported here.

            Fine, but again, these represent a bunch more lottery tickets that they could’ve tried to cash in, had their first tries not worked.

            Let me emphasize here that I don’t believe in restricting data analysis. It’s fine to study all your data. But then you won’t learn much from isolated p-values.

            Haimovitz and Dweck continue:

            There were no effects of child’s age or gender on any of the key variables, so these demographics were not considered in further analyses.

            Points for honesty, but . . . again, forking paths. Interactions with age and sex are two of the paths not taken, that increase the total probability of finding a statistically significant comparison, even if all the responses were noise.

            And, finally, what we’ve all been waiting for:

            As expected, the children’s perceptions of their parents’ failure mind-sets were significantly related to their parents’ reports of their own failure mind-sets (β = 0.30, p = .002). However, the children’s perceptions of their parents’ intelligence mind-sets were not significantly related to their parents’ reports of their own intelligence mind-sets (β = 0.11, p > .25).

            OK, let’s do the direct comparison. The first estimate was 0.30 with a standard error of . . . ummm, the p-value is .002 so the z-score is 3.1 (that’s -qnorm(.001), in R), so the standard error is 0.10. The second estimate is 0.11, the p-value is more than .25, ok, let’s call that .3, say, then the z-score is 1.0, so the standard error is 0.11. The two regressions have approx the same standard error, which makes sense.

            Now let’s compute the difference: 0.30 – 0.11 = 0.19. And the standard error . . . hmmm, they’re not 2 different experiments so we really need the individual-level data to figure this one out, but let’s get a quick answer by treating the two regressions as independent. Then the s.e. is sqrt(.10^2 + .11^2) = 0.15.

            So . . . the key comparison has an estimate of 0.19 and a standard error of 0.15, which is, as Haimovitz and Dweck might put it, “not significant (p > .25).”
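The back-of-the-envelope calculation above can be reproduced in a few lines. A sketch in Python rather than R, where `NormalDist().inv_cdf` plays the role of `qnorm`; the independence assumption is the same rough approximation made above:

```python
from statistics import NormalDist

# Recover standard errors from the reported estimates and p-values.
z1 = -NormalDist().inv_cdf(0.001)      # two-sided p = .002  ->  z of about 3.1
se1 = 0.30 / z1                        # standard error of the first estimate, about 0.10
z2 = -NormalDist().inv_cdf(0.15)       # calling the "p > .25" result p = .30  ->  z of about 1.0
se2 = 0.11 / z2                        # standard error of the second estimate, about 0.11

# Difference and its standard error, treating the two fits as independent.
diff = 0.30 - 0.11                     # 0.19
se_diff = (se1**2 + se2**2) ** 0.5     # about 0.15
print(diff / se_diff)                  # z of about 1.3: the key comparison is not significant
```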

            I did not read the whole paper in detail but from a quick glance I see similar issues in the other results. I have no good reason to believe that these results would show up in a preregistered replication.

            I then went back and read the abstract and it’s not clear to me how the studies described there can really demonstrate the truth of the claim in the title of the paper. But I was getting tangled with all these different measures, so maybe I’m just missing something.

            I have no idea whether Haimovitz and Dweck are on to anything here, but it does seem to me that their research designs have enough degrees of freedom that they could take their data to support just about any theory at all.

I just spoke on some of this stuff at Stanford! Not the difference-between-significant stuff, but the statistical crisis in science more generally. My talk was at the b-school. I notified some people in the psych dept but I didn’t think of contacting Dweck. Too bad. But maybe she or one of her colleagues will read this blog, and maybe they can do better next time.

            • mark says:

              The same error (significance versus not-significance is not significant) is also spectacularly evident in study 1. They also get their mediation analysis wrong. The direct effect beta is almost twice the size of the correlation when it should be identical. I’ve decided to write a commentary and submit it to Psych. Science. Probably tilting at windmills but this is getting silly and perhaps someone out there will learn something (even if it is just the editor who desk rejects it).

              • Andrew says:


                Hey, I was thinking of writing a letter to the journal, even though it’s hard for me to imagine them publishing it, as I have a feeling they’ll just consider it nitpicking! If you’re writing one, you can feel free to use the material I posted in the comment above.

              • mark says:

                Thanks! I’ll keep you updated on what happens.

              • AJ says:

                I can’t access the article yet, but why would the direct effect need to be identical to the correlation? It’s the total effect that would need to be.

              • mark says:

The way that they present it, it is simply the standardized regression coefficient for the relationship between IV and DV. Of course, they say that it is the standardized coefficient, but when you rerun the analysis with the correlation matrix + descriptives you see that they actually present the unstandardized coefficients. That, of course, means that the size of the indirect effect is misreported. The whole paper is a complete mess.

              • Carol says:

Wait, wait, wait, Mark! Nick (the same Nick who posts here, with whom I correspond regularly over e-mail) brought this article to my attention a couple of days ago, asking about various analyses in Study 1, including the mediation analysis to which you refer. It is possible to compute the coefficients for all the paths in the path diagram from the correlations in Table 1. But that assumes the traditional computational method of doing a mediation analysis. The authors did not do the traditional mediation analysis, however; they did a bootstrapping analysis using the method developed by Preacher and Hayes. That could give different results. Even so, the numbers don’t look right, but it is not clear where the problem lies. I asked Nick earlier this morning to do a few exploratory analyses for me, and then, depending on the results, intended to contact the authors.

                FWIW, Nick and I caught some errors in another PSYCHOLOGICAL SCIENCE article. I contacted the authors, they agreed that they had made a computational error, and PSYCHOLOGICAL SCIENCE promptly issued a corrigendum. I was amazed.

              • Carol says:

                I believe that the “total effect” is the effect without the mediator (second predictor) which they report above the bottom path (.41). The “direct effect” is the effect with the mediator which they report below the bottom path (.16). The difference between these two numbers (.25) should equal the product of the numbers on the other two paths (.60*.42 = .25) which it does. This is the “indirect effect.” So I am not sure what number you think is too large, Mark. But I am certainly not an expert at mediation analysis, and I could be wrong.

            • Carol says:

              Also, Mark, did you note the small suppression effect in the first set of regressions for Study 1? The beta increases from .24 to .26 when an additional predictor is added to the model.

              • mark says:

                I did know about the Preacher and Hayes approach that they used and know that this can give slightly different answers but the difference here is simply too large to be plausible (although I may be wrong). I think what happened is that they accidentally reported the beta for IV-DV relationship as the indirect effect. Let me know what Nick figures out. I already contacted the action editor about the two already identified errors in Study 1 and Study 3 – I figured this was not worth the work of a commentary submission.

              • Andrew says:


                It’s a funny thing. Any one of these errors would be enough to stop the paper from being published, if the error was pointed out in the review process. But once the paper has been published, the standard for corrections is much more stringent.

            • Carol says:

              Mark, I looked at this path diagram again and I think that you are right that they are reporting unstandardized regression coefficients. If they were reporting standardized regression coefficients (beta), then the beta for the path from “parent’s failure-is-debilitating mindset” (X) to “perceived parent’s performance-versus-learning orientation” (Y) would equal .37, because the standardized regression coefficient (beta) for a one-predictor regression is equivalent to the correlation. But the beta equals .42. The formula for the unstandardized regression coefficient is the correlation times the ratio of SDY to SDX, which equals .37*(.76/.66) = .42 or .43. A little bit of variation would be expected because of the bootstrapping.
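Carol’s conversion can be checked directly. A quick Python sketch; the correlation and standard deviations are the numbers quoted in the comment above:

```python
# Unstandardized slope in a one-predictor regression: the correlation
# scaled by SD(Y)/SD(X). Numbers are from the comment above.
r, sd_y, sd_x = 0.37, 0.76, 0.66
b = r * (sd_y / sd_x)
print(round(b, 2))   # about 0.43, close to the .42 reported in the path diagram
```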

              • Carol says:

                I responded, Mark, before I saw your new comment. It did not occur to me earlier that they would switch from standardized regression coefficients to unstandardized regression coefficients in the middle of the analyses for the same study, although I did realize that something was wrong with the numbers in the path diagram. I tip my hat to you!

                It did occur to me, though, that they did not seem to understand that in a one-predictor model, the standardized regression coefficient beta is equal to the correlation. If they did, it seems odd that they would report the betas for the first two (one-predictor) regressions in the first analysis in Study 1; that information is already in the correlation matrix in Table 1. All they needed was the two-predictor regression. But this may be due to lack of experience on the part of the graduate-student author.

              • mark says:

Since Psychological Science did retract that earlier paper (Thorstenson, Pazda, & Elliot, 2015) for making the same error, I cannot see how they can avoid acting on this one.

              • mark says:

Yeah, the whole thing is very weirdly incompetent but there is no excusing either of the authors. The “Author contributions” note clearly states: “Both authors contributed to the study concept and design. K. Haimovitz collected and analyzed the data. K. Haimovitz drafted the manuscript, and C. S. Dweck provided critical revisions. Both authors approved the final version of the manuscript for submission.”

              • Carol says:

                I agree with Andrew (or do you go by Andy?) that the standards for commentaries are much higher than they are for the original article.

                As far as this article being “weirdly incompetent,” Mark, I’d say that you have not spent much time looking at social psychology research. This is the norm.

I think Stephen Lindsay, the current editor of Psychological Science, is genuinely trying to improve matters. It is going to be an uphill battle, though.

              • mark says:

                That may be true. I am an applied social psychologist but most of the problematic stuff I see is more subtle nonsense (e.g., much of what is described as SEM). I’ve had plenty of interactions with editors about articles published in their journals and most have no interest in fixing things so I am quite encouraged by the recent actions of Psychological Science to clean up its act. Now if their reviewers would just pay a little more attention.

              • David says:

                Mark and Carol:

                Apparently, the editor says that the reason the Thorstenson et al. paper was retracted was not because of the Gelman error, but because of other problems. The Gelman error alone would have caused a correction, not retraction.

            • Carol says:

              This is study 3a, not 3b, Andrew.

              • Andrew says:


                It was so hard for me to keep these straight, as it was hard for me to see how any of the studies demonstrated the claim being made in the title of the paper.

                In any case, study 3b also had forking paths and comparisons of significant to non-significant.

              • Carol says:

                I sympathize, Andrew. Just reading it made me dizzy.

            • Carol says:

What, actually, would be the best way to analyze these data, given the question posed in the title of the article? Is Jim Steiger’s article on comparing the entries in a correlation matrix (PSYCHOLOGICAL BULLETIN, 1980, 87, 245-251) useful here? Or how about a multivariate model with multiple response (outcome, criterion, dependent) variables and multiple predictor (independent) variables (including controls)? The sample sizes might not be adequate for that.

              • Andrew says:


                I’m not quite sure how I’d analyze these data—my guess is that there’s not enough signal there to find anything—but I’d want to analyze all comparisons and outcomes together, using a hierarchical model, rather than trying to select out particular comparisons as was done in the article.

              • mark says:

                Carol and Andrew:
                A quick update: the action editor got back to me and while he is going to ask the authors to clarify the mediation results he does not see problems with the “difference between significant and non-significant is not significant” issue we had identified in two of the studies because he does not think that this is central to the findings. I obviously disagree – and the title of the paper obviously supports me.
                Oh well…

              • Andrew says:


Send me the correspondence and we can write an article (for a different journal, maybe a statistics journal or even a different psychology journal) about this. When a game-changing error is found in a paper and the editors’ response is only to ask authors to “clarify” (which in practice will just about never result in a recognition of error; rather, it just means that, one way or another, they’ll come up with a reason that we should still believe every one of their published claims), this is an indication that something is broken. And it’s worth us writing it up and making the point better known.

              • mark says:

                I have forwarded my correspondence to you.

              • Carol says:

Mark and Andrew: Once one realizes that the coefficients on the path diagram are unstandardized rather than standardized as implied by the beta symbol (thanks to Mark for pointing this out), I think that they are correct. The total effect is the effect of X on Y without the mediator (M) in the regression. The direct effect is the effect of X on Y with M in the regression. These are the coefficients reported above, and below, the bottom path. The indirect effect is the product of the other two paths and should equal the difference between the total effect and the direct effect.

                For the unstandardized coefficients shown in the path diagram, this works. The coefficient for the total effect is .41 and for the direct effect .16, with a difference of .25. The product of the other two paths is .42*.60, which also equals .25. So this looks right.

                The standardized coefficients can be computed from the table of correlations. The coefficient for the total effect is .24 and for the direct effect .09, with a difference of .15. The product of the other two paths is .37*.41, which also equals .15. So this looks right, too.

                I wonder why they bothered with the bootstrapping.

                Anyway, it seems to me that the only clarification needed on this issue is either a correction pointing out that the coefficients on the path diagram are unstandardized rather than standardized (as indicated by the use of beta) or a correction giving the path diagram with the standardized coefficients, to be consistent with the other regression analyses in this Results section.

                If I am wrong, please let me know. I am not an expert on mediation analysis but this seems to be a simple straightforward example.
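Both of Carol’s checks can be verified mechanically. A Python sketch; the coefficients are those quoted in the comments above, and the tolerance allows for the small bootstrap wobble she mentions:

```python
# Mediation identity: total effect minus direct effect should equal the
# indirect effect (the product of the a and b paths).
def indirect_check(total, direct, a_path, b_path, tol=0.01):
    return abs((total - direct) - a_path * b_path) < tol

# Unstandardized coefficients from the path diagram: .41 - .16 = .25 vs. .60*.42
print(indirect_check(0.41, 0.16, 0.60, 0.42))            # True
# Standardized coefficients recomputed from Table 1: .24 - .09 = .15 vs. .37*.41
print(indirect_check(0.24, 0.09, 0.37, 0.41))            # True
```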

              • Carol says:

Andrew and Mark: Re the “difference between significant and non significant is not itself significant” issue — I am amazed at how subtle this error can be. One of the reasons for this is that authors often are not clear in their articles, and perhaps in their heads, just what question(s) they are asking.

Say A is being predicted from B and C is being predicted from D. One can test the significance of each of these against zero (say, is the correlation between A and B significantly different from zero? Is the correlation between C and D significantly different from zero?). One can also test the difference between the two correlations (say, is the correlation between A and B significantly different from the correlation between C and D?). Whether what Nick calls for short “the Gelman error” is made depends on which question is being asked. And this can be difficult to discern.

I just re-read Study 3a in the Haimovitz and Dweck article. Although they don’t explicitly state that they want to compare the relation between A and B with the relation between C and D, that does seem to be what they want to do.

I think that your writing a comment on this issue (with this article and perhaps some others as examples) would be worthwhile. Andrew’s article on this issue was published in THE AMERICAN STATISTICIAN; the similar Nieuwenhuis et al. article was in a neuroscience journal. Few psychologists read either of these journals. If you cannot write a comment for PSYCHOLOGICAL SCIENCE, you might consider PSYCHOLOGICAL METHODS, which is a quantitative psychology journal but likes readable tutorials written for non-quantitative psychologists.

              • David says:

                I have been in contact with the authors and they sent the data to Mark in May — I presume the thread died because Mark discovered the analyses were accurate?

            • David says:


As always, I really appreciate your help flagging potential problems with papers and causing us to pay more attention to the details. In this case, I recently read your post, contacted the authors, asked for the data, re-analyzed it, and also read every word of the paper very carefully. I also asked to see their letters to the editor with their submissions and revisions. I wanted to see if they made a claim that the data did not support. I have now done this, and the authors have now posted their data and our syntax online. (Disclosure: Dweck is a collaborator of mine.)

              I learned a few things. First, the title does seem to require a comparison of effect sizes. But I tried not to be too anchored in the title because I read your post first and that affected how I read it. So I focused on what the paper actually said.

              Next, I don’t see anywhere in the paper where they claim that the effect of intelligence mindset on students’ mindsets is greater than the effect of failure mindset. I also looked at their letters to the editor, and they explicitly stated that they were *not* making that claim.

              I did, however, notice that the authors imply in the abstract that intelligence mindsets predict student *perceptions* more strongly than failure mindsets. But they didn’t report that test. In your post above, you report a test for one of the studies where they have those data, but they also have the data in Study 1. I stacked the data and ran the test; it seems that failure mindsets predict perceptions more strongly than intelligence mindsets.

              I looked for possible “garden of forking paths” problems. For instance, age, gender, or other model specification decisions affecting significance of results. I did not find any, which doesn’t mean that there aren’t any. I agree it’s valid and important to ask about them, but I don’t find much support for this statement: “the research designs have enough degrees of freedom that they could take their data to support just about any theory at all.” The authors disclosed every measure on their surveys (in their osf materials supplement), and those data are posted, so maybe there is something I’m missing.

              In general, I think this was a helpful exercise that uncovered (at least) one way in which the paper could have been done better and could have produced better evidence. I for one am grateful you raised the issue, and I imagine Dweck and Haimovitz feel the same way.

Last, you’re probably not asking for feedback, but just out of fairness to the graduate student whose paper you’re commenting on, it might be helpful to casual readers to clarify statements like this: “I have no good reason to believe that these results would show up in a preregistered replication.” Do you mean the contrast in the sizes of the correlations? Or the significant correlations themselves? Or do you feel that way about all or much of psychology in general–in which case it’s not about Haimovitz & Dweck?

              If it was my own study, I would *always* be nervous my results wouldn’t show up in a replication — that seems like a good default state to be in, because it makes you replicate more.

              • Andrew says:


                1. You write, “I don’t see anywhere in the paper where they claim that the effect of intelligence mindset on students’ mindsets is greater than the effect of failure mindset.”

                The title of the article is, “What Predicts Children’s Fixed and Growth Intelligence Mind-Sets? Not Their Parents’ Views of Intelligence but Their Parents’ Views of Failure.”

Looking carefully, I do see a difference between the title and your statement: the title says “predicts” and you say “effect.” So I agree that the causal claim is not in the title. However, I don’t think this is relevant to the questions being raised regarding statistical analysis, as the paper clearly was making the claim that the predictive power of parents’ failure mindset on students’ mindsets is greater than the predictive power of parents’ intelligence mindset. I consider the title to be part of the paper.

                2. You ask about forking paths. Loken and I discuss this in our paper, but the quick version is that the concern is not multiple comparisons that are performed on the current data set, but multiple potential comparisons that could have been performed had the data come out differently. It’s great that the authors disclosed all their measures; my point in my comment above was that, given all these measures, there were many different paths in how to analyze the data, and you can see this in the paper itself (for example, statements such as, “This study included a number of additional measures that are unrelated to the current research question and not reported here” and “There were no effects of child’s age or gender on any of the key variables, so these demographics were not considered in further analyses”). Again, to point out forking paths is not to suggest that the authors of the paper were cheating or hacking or whatever; it’s just that their p-values don’t have the claimed meaning.

                3. You write that I’m “probably not asking for feedback.” Of course I’m asking for feedback! That’s what blog comments are all about. I appreciate your feedback.

                4. The reason I say I have no good reason to believe that these results would show up in a preregistered replication is that the claims seem to be supported based on statistically significant p-values obtained in a setting with many many researcher degrees of freedom, with lots of flexibility in data analysis, in the substantive theory, and in the connections between theory and data analysis. If new data were collected and the exact same analyses were reported, I’d expect the results would look pretty much like noise.

                5. I can’t comment on “all or much of psychology in general” or even all or much of the work of Haimovitz or Dweck. All I can comment on is the papers I’ve seen. In any case, I agree with you that it’s always good to replicate studies where possible.

      • David says:

        The 2007 paper does not make this error. Right after the section Mark flagged, where it says that implicit theories do not predict prior grades but do predict future grades, the paper reports the model where they predict future grades controlling for prior grades. The Gelman and Stern error presumably holds when you treat the non-significant correlation as zero, which in this case would mean not controlling for it.

Although, I’ll admit that in re-reading Gelman and Stern, I was surprised there was no technical definition of the error, no discussion of when authors are or are not making it, and no guidance on how to interpret the error as a reader, so it’s probably easy to see the error in people’s papers when they’re not making it. Andy, maybe you could update your opinion essay with a more technical treatment?

  2. Curious says:

    While there is certainly room to criticize methods and claims from the study at hand, the underlying theory appears quite plausible and worthy of research.

    That said, there are important issues to consider when designing such a study:

    1. Preexisting mindsets and how to handle that methodologically and statistically.
    2. The causal process by which the effect could take place – something similar to Albert Ellis’ Rational Emotive Behavior Therapy process in which ruminations grounded in unrealistic limiting beliefs are interrupted and re-framed into a more realistic perspective.
    3. The amount of repetition and followup interventions needed for optimal effect.

    • Andrew says:

      Yes, definitely worthy of research. It’s because I care about the subject that I don’t want them to do the research wrong and make claims not supported by the data.

    • Elin says:

      Agreed, I think it’s certainly plausible that there is such an effect, and there is enough research of various kinds (some observational, some cross-cultural) indicating that there are at least these two “mind sets” (whatever term you use for that), where some people, both students and teachers, view something like math ability as fixed and therefore just say “I/you can’t do math and that’s just too bad,” while other people see it as fluid and say “I/you can learn this if you work at it.” It certainly is something I have seen in the classroom. I just don’t necessarily think it is easy to change, and I’m always doubtful about the large effect sizes from short-term interventions.

      • Curious says:

        I too am skeptical of a large effect emerging from a single brief intervention that would last over time. This is one of the areas where I believe some well designed, large sample studies might help elucidate the repetition and sequencing of interventions.

        One of the problems I have with small-sample experimental studies on topics such as this is that there is very likely substantial between-person variation in response to such an intervention, related to the maintenance of baseline beliefs. Competing messages about fixed versus fluid skills and abilities can come from parents, teachers, classmates, friends, etc., and can make the new mindset more or less difficult to sustain.

        • David says:

          This is a critically important question that research is beginning to address but could do more of. Specifically, there have been a number of large-sample studies showing that these interventions can produce effects reliably different from zero. I would not call the scaled-up effects “large” in the absolute sense (ds are around .1). Perhaps “large compared to expectations” for an online treatment, or “meaningful.”

          For instance, there are two non-pre-registered large-scale experiments with growth mindset:

          N~1,500 high school students:

          N~7,300 first-year college students (Study 2):

          Both of these show that, for students experiencing or anticipating academic difficulty, a brief growth mindset intervention via the internet can improve objective outcomes (grades, course-failure rates, or full-time enrollment).

          While informative, a potential statistical limitation of both is that the effects are there for a sub-sample — i.e., they are moderated. Although there is strong theory for the moderator and the papers report the main effects (and, in the PNAS supplement, alternative ways of coding the moderator), the Gelman “garden of forking paths” concern applies. That is, perhaps the moderator appeared due to chance and would not replicate upon pre-registered replication.

          This is why it is helpful that another recent paper pre-registered the prior achievement moderator (using the standards of the day) and found that the online growth mindset treatment improved the grades of previously low-achieving students:

          Some notable features are that we paid a third-party firm to collect all data and merge and clean it, before we saw it, and we tried to pre-register things that might normally be researcher degrees of freedom, like the prior achievement moderator or how we were coding the “low achievement” outcome variable. In the last two years, pre-registration has advanced, and so I imagine if we did the study now there would be new things we could do better.

          But, at least based on this published evidence, it does seem like a time-limited, online growth mindset intervention can, under some conditions, increase the overall achievement of high school and college students several months later.

          Another takeaway is that, if one just looked at the first two papers, one might be tempted to say: “garden of forking paths, therefore not true,” despite the large samples and double-blind designs. But the effects *did* appear upon replication, with pre-registration. So perhaps we might say “garden of forking paths, therefore let’s do bigger and better science.” Maybe some things will turn out to be true.

    • mark says:

      Almost all “theories” in psychology appear plausible. That is why it is so easy for psychologists to fool themselves into seeing support for their “theories” (to paraphrase Feynman).

  3. David says:

    I share your skepticism about big, surprising findings. A few points.

    First, did you notice that the finding that precipitated this post was a misreading on the part of the op-ed contributor? He was summarizing the manipulation check. I didn’t see you acknowledge that, so perhaps you missed it.

    Next, you say: “I think the reported difference is an overestimate because of the statistical significance filter.” This is a plausible theory. There is some evidence for it some of the time. But effects could decline upon replication for other reasons as well, most notably sample heterogeneity. Allcott (2015) shows that initial, striking effect sizes are larger than subsequent replications in part because they are often conducted in settings that increasingly are less likely to show strong effects. He reports 111 Opower experiments with over 8 million homes and shows a strong decline effect from the 1st to the 111th experiment. No file drawer. No p-hacking. No garden of forking paths. Just differences in whether houses are the kinds of houses that are likely to be able to save energy when you get motivated to do so.
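
    Allcott’s decline-effect story can be sketched with a toy simulation (all numbers made up, not from the Opower data): if the early experiments happen to run at the sites with the largest true effects, the estimates shrink across the sequence of replications even with no file drawer and no p-hacking.

```python
import random

random.seed(1)

# Toy model of site heterogeneity (made-up numbers): each of 100 sites has
# its own true effect, and the earliest experiments run at the best sites.
true_effects = sorted((random.gauss(0.5, 0.3) for _ in range(100)), reverse=True)

def run_experiment(true_effect, n=500):
    """Average of n noisy unit-level outcomes around the site's true effect."""
    return sum(random.gauss(true_effect, 2.0) for _ in range(n)) / n

estimates = [run_experiment(mu) for mu in true_effects]
early = sum(estimates[:10]) / 10    # first 10 experiments (high-effect sites)
late = sum(estimates[-10:]) / 10    # last 10 experiments (low-effect sites)

print(round(early, 2), round(late, 2))  # early mean exceeds late mean
```

    Every experiment here is analyzed honestly; the “decline” comes entirely from which sites get studied first.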

    My point here is that it’s a stretch to imply “this effect is big, therefore the next will be small, therefore the researchers engaged in bad behavior in the initial studies.” Let’s consider other possibilities and study them scientifically.

    You have written: “it’s better to start with the presumption of treatment heterogeneity and go from there.” I agree with that.

    • Andrew says:

      You write, “it’s a stretch to imply ‘this effect is big, therefore the next will be small, therefore the researchers engaged in bad behavior in the initial studies.'”

      Perhaps I should clarify that I neither said nor implied what you put in quotes there!

      First, I don’t see any evidence that the effect is big. The effect could be big, but the only evidence for a big effect is that the point estimate is big and statistically significant. The practice of reporting statistically significant estimates leads to bias—this is what we call Type M (for magnitude) errors and it’s what John Carlin and I discuss in our recent paper. The bias can be huge. That’s why I say the published estimate is an overestimate.
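
      A quick simulation shows the size of this bias (the true effect and standard error here are hypothetical, not estimates from the mindset study): among noisy replications, the ones that clear the significance threshold overstate the true effect severalfold.

```python
import random

random.seed(0)

true_effect = 3.0   # hypothetical true effect (say, percentage points)
se = 9.0            # hypothetical standard error of a single study's estimate

# Simulate many replications; keep only the "statistically significant" ones.
estimates = [random.gauss(true_effect, se) for _ in range(100_000)]
significant = [e for e in estimates if abs(e) > 1.96 * se]

# Average magnitude of the significant estimates, relative to the truth:
exaggeration = (sum(abs(e) for e in significant) / len(significant)) / true_effect
print(round(exaggeration, 1))  # significant estimates overstate the effect severalfold
```

      With a true effect this small relative to the standard error, an estimate essentially cannot be statistically significant unless it is a large overestimate; that is the Type M error.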

      Second, I have no idea if “the researchers engaged in bad behavior in the initial studies” (to use your expression). I think they made statistical errors, but everyone makes statistical errors. Calling this “bad behavior” seems a bit strong! There’s no shame in making a mistake that many other people have made.

      Finally, I’d not noticed that the op-ed had misread the research paper. Perhaps someone can contact Richard Friedman and see if he can write another op-ed for the Times correcting his earlier claims and toning down the hype! This sort of thing really annoys me in science journalism.

  4. Carol says:

    Reply to David’s comment on January 14, 2017. We’re too far down in the chain for me to post directly under your comment. According to the retraction notice on the Thorstenson et al. (2015) PSYCHOLOGICAL SCIENCE article, there were two reasons for retraction. One was the “Gelman error.” The other was oddities in the data. You can read more at Retraction Watch, November 5, 2015. Or just type “Thorstenson” in RW’s search box.


