Authority figures in psychology spread more happy talk, still don’t get the point that much of the published, celebrated, and publicized work in their field is no good (Part 2)

Part 1 was here.

And here’s Part 2. Jordan Anaya reports:

Uli Schimmack posted this on facebook and twitter.

I [Anaya] was annoyed to see that it mentions “a handful” of unreliable findings, and points the finger at fraud as the cause. But then I was shocked to see the 85% number for the Many Labs project.

I’m not that familiar with the project, and I know there is debate on how to calculate a successful replication, but they got that number from none other than the “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%” people, as Sanjay Srivastava discusses here.

Schimmack identifies the above screenshot as being from Myers and Twenge (2018); I assume it’s this book, which has the following blurb:

Connecting Social Psychology to the world around us. Social Psychology introduces students to the science of us: our thoughts, feelings, and behaviors in a changing world. Students learn to think critically about everyday behaviors and gain an appreciation for the world around us, regardless of background or major.

But according to Schimmack, there’s “no mention of a replication failure in the entire textbook.” That’s fine—it’s not necessarily the job of an intro textbook to talk about ideas that didn’t work out—but then why mention replications in the first place? And why try to minimize the problem by talking about “a handful of unreliable findings”? A handful, huh? Who talks like that? This is a “Politics and the English Language” situation, where sloppy language serves sloppy thinking and bad practice.

Also, to connect replication failures to “fraud” is just horrible, as it’s consistent with two wrong messages: (a) that to point out a failed replication is to accuse someone of fraud, and (b) that, conversely, honest researchers can’t have replication failures. As I’ve written a few zillion times, honesty and transparency are not enuf. As I wrote here, it’s a mistake to focus on “p-hacking” and bad behavior rather than the larger problem of researchers expecting routine discovery.

So, the blurb for the textbook says that students learn to think critically about everyday behaviors—but they won’t learn to think critically about published research in the field of psychology.

Just to be clear: I’m not saying the authors of this textbook are bad people. My guess is they just want to believe the best about their field of research, and enough confused people have squirted enough ink into the water to confuse them into thinking that the number of unreliable findings really might be just “a handful,” that 85% of experiments in that study replicated, that the replication rate in psychology is statistically indistinguishable from 100%, that elections are determined by shark attacks and college football games, that single women were 20 percentage points more likely to support Barack Obama during certain times of the month, that elderly-priming words make you walk slower, that Cornell students have ESP, etc etc etc. There are lots of confused people out there, not sure where to turn, so it makes sense that some textbook writers will go for the most comforting possible story. I get it. They’re not trying to mislead the next generation of students; they’re just doing their best.

There are no bad guys here.

Let’s just hope 2019 goes a little better.

A good start would be for the authors of this book to send a public note to Uli Schimmack thanking him for pointing out their error, and then to replace that paragraph with something more accurate in their next printing. They could also write a short article for Perspectives on Psychological Science on how they got confused on this point, as this could be instructive for other teachers of psychology. They don’t have to do this. They can do whatever they want. But this is my suggestion for how they could get 2019 off to a good start, in one small way.

32 thoughts on “Authority figures in psychology spread more happy talk, still don’t get the point that much of the published, celebrated, and publicized work in their field is no good (Part 2)”

    • Ulrich:

      I looked up the book you discussed. It costs $168! Whaaa?

      I’ll have to say, though, that it looks interesting. Here’s the table of contents:

      1. An invitation to social psychology
      2. The methods of social psychology
      3. The social self
      4. Social cognition: thinking about people and situations
      5. Social attribution: explaining behavior
      6. Emotion
      7. Attitudes, behavior, and rationalization
      8. Persuasion
      9. Social influence
      10. Relationships and attraction
      11. Stereotyping, prejudice, and discrimination
      12. Groups
      13. Aggression
      14. Altruism and Cooperation

      No joke, this seems like a great set of topics. I don’t quite understand the organization, but this is all stuff I’d love to know more about. So I hope the book has some good stuff, despite its botching of the replication issues. I agree with you that it was a mistake for them to cite Gilbert et al. as representing some sort of serious criticism of the replication studies. I also agree with you that it is silly of them to focus on “sample size” as the foremost change in research in response to the replication crisis. All the sample size in the world won’t save them if their measurements are junk.

      • Social Psychology has nearly always been interesting (at least to me). It’s just too bad that much of the evidence behind it is treated as controversial for ethical reasons rather than for its methodological horrors. For example, in my first intro social psychology course we learned about all those infamous “experiments” (the Stanford Prison Experiment, Milgram, etc.), as most classes do, and our TA marveled at how interesting the studies were and fixated on the ethical controversies surrounding them, but we (our TA and the entire class) had *no* idea how badly designed the studies were and how they were practically fiction.

  1. Something to think about: How many introductory statistics textbooks give a good discussion of “replication failure” and related points? Can anyone provide any evidence of ones that do? Or what percentage of them do? Or provide a list of some that are good on this point? (Maybe we need to do more “putting our money where our mouth is”? Or getting our mouths to the right places?)

    • Martha:

      We talk a bit about these issues in Regression and Other Stories, but (a) it’s not an intro statistics book, it’s a first book on regression, which is not quite the same thing; and (b) we probably don’t discuss it enough.

    • You may be forgetting that there is one that discusses replication: Introduction to the New Statistics: Estimation, Open Science, and Beyond, 1st edition, by Geoff Cumming and Robert Calin-Jageman.

    • I’ve been working on it. It’s not quite introductory, but the following book is basic. When released as an eBook it will be open access.

      https://www.springer.com/us/book/9783030034986

      The third section presents my particular take on some issues about replication.

      A truly introductory text (for psychology undergraduates, so it focuses on hypothesis testing) is at

      https://introstatsonline.com

      Power and replication are discussed at multiple places throughout the text (a code has to be purchased to gain access, but I can provide access for someone wanting to look at the text).

    • I posted a message on the ASA Statistics Education discussion group encouraging people to look at this blog post and asking whether they have encountered introductory textbooks that include discussion of replication problems. The post has gotten a number of replies, but so far only one mention of a textbook that covers the replication “crisis”: Paul Velleman said that the newest edition of the textbook he co-authors will include a discussion of the Wansink case, emphasizing Wansink’s inability to produce his data for others to check. He indicated that they expect to expand the discussion of replication in future editions, so I suggested that another “case study” might be Carney’s vs. Cuddy’s responses to criticisms of their power pose paper.

      • Another interesting comment (by Esther Pearson) on the Stat Ed ASA discussion: she mentioned comments by Jeffrey Flier (Harvard Medical School) at the 2017 Sackler Colloquium. I easily found the video of his comments at https://www.youtube.com/watch?v=QRf1m_4JST4. A couple of things he mentioned that seem especially worth noting:

        1. In his experience, trainees are more influenced by the behavior of their mentors than by their training.

        2. As part of being put up for promotion, medical researchers should be required to submit a discussion of controversies in the areas in which they work.

        An implication of #1 is that improving statistics training for future researchers needs to be augmented by improved statistics training for established researchers.

        #2 sounds like a good idea — it works against the tendency toward excessive self-promotion.

    • Martha: The introductory book by Freedman, Pisani, and Purves discusses replication several times throughout, the first time being on p. 17, where a more careful replication found no effect. Among the others are these: On p. 547, it is in the context of data snooping. On p. 561, they use the words “failure to replicate” in discussing ESP.

  2. You guys love to bag on the “statistically indistinguishable from 100%”. Are you sure it’s really as obviously wrong as you say?

    Let’s pretend that all studies are powered at 0.8 under the researcher’s best estimate of a given effect size. If that best estimate were right 100% of the time and each study were replicated exactly (clearly false, but let’s pretend), you’d then expect power of 0.8 on the replications. In this light, “statistically indistinguishable from 100%” is not wrong just because a few studies did not replicate; failing to replicate 20% of the time is perfectly consistent with having been right 100% of the time.

    Are you sure this was not what was meant? I haven’t dug up the quote, but I keep hearing you say he won’t retract it, without much further discussion of the quote itself. It makes me concerned that maybe the quote isn’t as clearly wrong as it’s presented on this blog.

    Of course, you know you’re not exactly right 100% of the time (you’re exactly right 0% of the time for a continuous parameter), but *if* the replication rate were close to 0.8 (I don’t know what it is, but I assume it’s much lower), then at first glance there wouldn’t really be any evidence of a replication crisis. In fact, things would be looking surprisingly good in that case: even in the surprisingly good case that we got the direction of the effect right in every single study that was attempted to be replicated, we should still expect a reasonable portion of the replications to fail simply for lack of power.
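
    To make that arithmetic concrete, here is a minimal simulation sketch with made-up numbers: a true standardized effect of 0.5 and 64 subjects per group, which gives power of roughly 0.8, and 100,000 hypothetical exact replications. Even with the original effect exactly right every time, only about 80% of the replications come out significant.

```python
# Minimal sketch with made-up numbers: every original estimate is exactly
# right, every replication is an exact repeat powered at ~0.8, and we count
# how many replications reach p < .05 in the original direction.
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.5        # hypothetical standardized effect, assumed exactly right
n_per_group = 64         # two-group design; power is roughly 0.80 at alpha = 0.05
n_replications = 100_000

se = np.sqrt(2 / n_per_group)               # standard error of the difference in means
estimates = rng.normal(true_effect, se, size=n_replications)
replicated = estimates / se > 1.96          # "significant in the original direction"

print(f"fraction of replications reaching p < .05: {replicated.mean():.3f}")
# Prints roughly 0.80: a 20% "failure" rate even though the original effect
# was exactly right in every single case.
```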

    • Let’s take one of the 80% of studies that replicates in your example.

      1. Choose a predictor variable (X) from the study.
      2. Randomly select a data point that is within 2 standard errors of the mean.
      3. Randomly select a data point that is outside 2 standard errors of the mean.
      4. Assign a probability to your confidence that the two people represented by these data points are actually different on the aspect of reality represented by X.
      5. Repeat for 2 data points from a sample that represents the height (X) of adults in the United States.
      6. Is your answer different for the two?
      7. What are the implications for this to the inferences from and the application of psychometric measures?
      8. What are the potential implications for psychometric measures that become widely established and applied?

      • A couple of ways this has been addressed in psychometrics:

        1. The measure itself is the reality, and thus any variation in the scaled score is by definition real. This is true whether the test was devised in a classical framework or in an item response framework that attempts to model out variation not due to the construct, such as guessing (the result is then simply the test score; a bit confusing, I know, and there is wide latitude in how error is defined). The argument is circular, as there is no reference to reality outside the measure itself. IQ is an example of this form of testing and argument. Different tape measures will give the same rank ordering of height, and the variation between any two rulers is consistent along the continuum: two tape measures could match up to 5 ft, be off by 1/64 of an inch between 5 and 6 ft, and off by another 1/64 between 6 and 7 ft. For psychometric tests, since no two tests measure the same thing by definition, all variation between test scores must by extension be real as well, unless there is another clever way to resolve the logical incoherence. It turns out there is, by using empiricism, which transforms the construct-less measure into a “measure of the measure of the outcome.”

        2. When applying the “measure of the measure of the outcome,” a measure is assumed to validly measure whatever it predicts (which, I will grant you, gets confusing when the measure predicts more than one thing). Also confusing is that this implies that any two measures that predict the same outcome, regardless of content, are now the same “measure of the measure of the outcome.” The causes of the score have become irrelevant, except where they consistently underestimate the “measure of the measure of the outcome” for protected classes of people as defined by the EEOC.

    • A:

      1. I don’t love to bang on the “statistically indistinguishable from 100%” thing. I bring it up a lot but only because it makes me sad and it bothers me.

      2. Unfortunately, the scenario you describe in the second paragraph of your comment is not what’s been happening. What’s actually been happening is that people publish strong claims based on 95% confidence intervals excluding zero, and many of these claims do not replicate. This is no surprise, given that the confidence intervals in question have been selected (that’s what we call forking paths, or the multiverse, or p-hacking, or researcher degrees of freedom, etc).
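
      To see how that kind of selection plays out, here is a rough simulation sketch with invented numbers, using publication of significant results only as a simplified stand-in for forking paths: small and variable true effects, noisy original studies, and “publication” only when the estimate is more than 1.96 standard errors from zero. Under these made-up assumptions the published estimates come out inflated, and most exact replications fail to reach significance.

```python
# Rough sketch with invented numbers: selection on statistical significance
# inflates published estimates, so exact replications "fail" far more often
# than the nominal power calculation would suggest.
import numpy as np

rng = np.random.default_rng(1)

n_studies = 200_000
true_effects = rng.normal(0.05, 0.05, size=n_studies)   # small, variable effects
se = 0.10                                               # noisy original studies

original_est = rng.normal(true_effects, se)
published = original_est / se > 1.96                    # selected on significance

# Exact replications of the published studies: same true effects, same SE, new noise.
replication_est = rng.normal(true_effects[published], se)
replicated = replication_est / se > 1.96

print(f"fraction of originals 'published':  {published.mean():.3f}")
print(f"mean published estimate:            {original_est[published].mean():.2f}")
print(f"mean true effect among published:   {true_effects[published].mean():.2f}")
print(f"replication 'success' rate:         {replicated.mean():.3f}")
# The published estimates run well above the underlying effects, and only a
# minority of the exact replications reach p < .05 again.
```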

      • A quick Google search for “statistically indistinguishable from 100%”, besides lots of hits to this blog, brings up the following report with the quote of interest:

        http://projects.iq.harvard.edu/files/psychology-replications/files/harvard_press_release.pdf

        The summary includes the quote of interest. Then, on the first page, we have the following quote:

        “When this error is taken into account, the number of failures in their data is no greater than one would expect if all 100 of the original findings had been true.”

        To me, this implies that the author did indeed intend my interpretation of the quote in question. That is, if we had been correct about our hypotheses 100% of the time, we should still expect this level of “failed” replications.

        I really don’t follow that field closely enough to have any idea whether the rest of the author’s argument is valid. However, a very trivial amount of work seems to imply that this “obviously false” interpretation of the quote is *not* what the author meant, yet it’s repeated 100 times here on this blog and is now being parroted by others.

        I think publicly admitting a mistake is in order.

        • A:

          I agree that the people who said the “statistically indistinguishable from 100%” should admit their mistake, but realistically I don’t think it’s going to happen. The statement, “When this error is taken into account, the number of failures in their data is no greater than one would expect if all 100 of the original findings had been true,” is false. If all 100 of the original findings had been true, we’d have expected to see much more successful replication. See the above-linked post by Sanjay Srivastava for further discussion of these issues. But in any case this is no surprise given the well-known problems of forking paths, etc. Indeed, given everything we know about psychology research, it would be a shock if the replication studies revealed no problems with previously published work.

          In any case, I don’t see any particular point to hassling the authors of this statement to get them to publicly, or for that matter privately, admit their mistake. To me the statement is interesting mostly in that it indicates the extreme lengths to which some leading figures in social science will go in an effort to minimize the serious problems that have been known for years by anyone who’s read the work of Greg Francis, Uri Simonsohn, etc.

        • A reader is asking for you (Andrew) to apologize, not Gilbert et al.

          Lakens discusses a fundamental error they made in their Science paper here:
          http://daniellakens.blogspot.com/2016/03/the-statistical-conclusions-in-gilbert.html

          Ironically, they could have avoided this error had they just read the supplement of the paper they were critiquing.

          I somewhat sympathize with a reader’s point, since the quote could be interpreted as “we have no idea what the actual replication rate is, so our estimate overlaps with 100%”.

          However, I still think it’s a silly way to phrase it, and is clearly propaganda attempting to douse the fires. Personally I don’t have a problem using this quote since I have other issues with their paper, such as what Lakens covers.

          P.S. I hope I quoted them correctly, otherwise Gilbert might come after me: https://twitter.com/DanTGilbert/status/857816991008251904

        • Good legwork.

          It looks like the problem is that the summary of the press release mis-states what the body of the press release actually says.

          The summary says: “the study actually shows that the replication rate in psychology is quite high – indeed, it is statistically indistinguishable from 100%.” This is the quote Andrew (I think correctly) criticizes.

          But, as you note, the body says something else, which sounds sensible. It says “When this error is taken into account, the number of failures in their data is no greater than one would expect if all 100 of the original findings had been true.”

          So apparently whoever wrote the summary screwed up.

        • To me, this is a clear situation where (a) picking the null hypothesis one is trying to compare against shapes the interpretation, and (b) it is important to remember frequentist statistics cannot give evidence FOR a null hypothesis.

          In this example, the null hypothesis is that the replication rate is 100%. Due to low power (n = 100 on a dichotomous outcome of replicated successfully or did not replicate successfully), one can state that it is not significantly different from 100%. But this is almost surely a Type II error. One could probably just as easily say it is indistinguishable from 50%, too, given that the confidence interval for the success rate is going to be fairly wide.

          A statement like: “Assuming the replication rate was really 100%, we would expect to see this result a non-trivial amount of time” isn’t very helpful, because it is making too strong of an assumption on the very thing we are trying to investigate—the replication rate.
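
          For concreteness, here is a small sketch with hypothetical counts showing how wide a 95% Wilson score interval for a replication success rate is with about 100 replications; from the interval one can read off which candidate rates (50%, 80%, 100%) a given observed count would or wouldn’t rule out.

```python
# Small sketch with hypothetical counts: width of a 95% Wilson score interval
# for a replication success rate when about 100 replications are run.
import math

def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    phat = k / n
    denom = 1 + z**2 / n
    center = (phat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2))
    return center - half, center + half

for k in (40, 60, 80):          # hypothetical numbers of successful replications
    lo, hi = wilson_ci(k, 100)
    print(f"{k}/100 replicated -> 95% CI [{lo:.2f}, {hi:.2f}]")
# With n = 100 the interval spans roughly +/- 8-10 percentage points, so
# "not significantly different from X" can hold for a fairly wide band of X;
# but an observed failure rate of this size still rules out a true rate of
# 100% unless successes are being counted with additional error, which is
# exactly the assumption at issue in the discussion above.
```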

  3. What should be the metric for deciding whether a given replication has succeeded in replicating the original or not? I thought about that to try to understand it and came up with these notes. A bit long, but Prof. Gelman suggested I post them as a comment.

    Study A (“Original”) finds an effect of 8 with a 95% confidence interval of [2, 14]. By itself, it rejects the null of no effect.

    Study B (“Replicator”) finds an effect of 3, with a confidence interval of [-3, 9]. By itself, it fails to reject the null of no effect. It also fails to reject the null of an effect of 8.

    Has the replication succeeded?

    Dan Gilbert, Gary King, Stephen Pettigrew, and Tim Wilson (2016) say YES (http://science.sciencemag.org/content/351/6277/1037.2). Srivastava (2016), “Evaluating a new critique of the Reproducibility Project,” is unclear.

    Gilbert et al. can’t be right. They are saying that Study A, which finds an effect, is *replicated* by Study B, which does not. Or, more precisely, they are saying that we *should accept* Study A’s conclusion that an effect exists, because Study B does not offer enough evidence to reject it. But that more precise rendering is not a good way to match what we mean by “replicate”. If we use the word that way, we must also say that Study A has replicated Study B, and we should accept Study B’s conclusion that no effect exists because Study A does not offer enough evidence to reject it. But we can’t accept both conclusions.

    If Gilbert et al. want to be picky, they can say that to be even more precise, we *cannot reject* Study A’s conclusion that an effect exists, because Study B does not offer enough evidence to reject it. But they also have to say that we cannot reject Study B’s conclusion that no effect exists. And if we can’t reject any of the possible conclusions, we’re back to where we were before anybody did any studies, and we have to say that the discipline hasn’t discovered anything on this subject.

    The Gilbert et al. definition is shifting the burden of proof. The burden of proof is crucial in law and in classical statistics, so that’s not the way to do things.

    Rather, we should proceed one of two ways.

    (1) The simplest way, methodologically, is to combine the data in Studies A and B and see if the conclusion is the same as in Study A.

    That’s not what we mean in ordinary language by the word “replication”, but it is what we mean by “confirmation” or “support”.

    (2) The closest to what we mean in ordinary language is to repeat the study and see if it comes to the same conclusion.

    In that case, Study B above fails to replicate Study A. The problem relative to (1) is that combining always adds more data points, increasing the power of the test, so we could have ten replications that each just miss the 5% threshold on their own, yet together they confirm Study A.

    It would be easy to report both methods 1 and 2.
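
    For concreteness, here is a minimal sketch applying both methods to the numbers in the example above, under a normal approximation (the stated 95% intervals imply a standard error of about 3 for each study); the combination in method (1) is a simple fixed-effect, inverse-variance average.

```python
# Minimal sketch of the two criteria, using the numbers from the example above:
# Study A reports 8 with 95% CI [2, 14], Study B reports 3 with 95% CI [-3, 9],
# which under a normal approximation implies a standard error of about 3 each.
import math

def z_and_p(estimate, se):
    """Two-sided z test against zero under a normal approximation."""
    z = estimate / se
    p = 1 - math.erf(abs(z) / math.sqrt(2))
    return z, p

est_a, se_a = 8.0, 6.0 / 1.96   # SE taken as the CI half-width divided by 1.96
est_b, se_b = 3.0, 6.0 / 1.96

# Criterion (2): does Study B, on its own, reach the same conclusion as Study A?
print("Study B alone: z = %.2f, p = %.3f" % z_and_p(est_b, se_b))

# Criterion (1): pool the two studies with fixed-effect, inverse-variance weights
# and ask whether the combined estimate still supports Study A's conclusion.
w_a, w_b = 1 / se_a**2, 1 / se_b**2
combined_est = (w_a * est_a + w_b * est_b) / (w_a + w_b)
combined_se = math.sqrt(1 / (w_a + w_b))
print("Pooled:        z = %.2f, p = %.3f" % z_and_p(combined_est, combined_se))

# With these particular numbers, Study B alone does not reject zero (criterion 2
# "fails"), while the pooled estimate of about 5.5 still does (criterion 1
# "confirms"), which is the tension between the two criteria described above.
```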

    • This lays it out very well, and it shows how we should not be talking about, “Did it replicate?” In the end, we don’t really care if Study A (“Original”) replicates. What we are interested in is: Should I believe the claim made by Study A? The best answer we have for that is by combining the data in Studies A and B.

      For example, if there are 10 studies conducted on The Effect of X on Y, I do not really care whether Studies 2 through 10 come to the same conclusion as Study 1. What I really care about is: Does The Effect of X on Y exist? The best evidence for that comes from combining the evidence across the 10 studies. The effect might be small enough that it is only significant in Study 7, but all the data combined show a small but significant effect.

      I think it might be more fruitful—and cast less implicit blame—to focus on cumulative knowledge instead of failed vs. successful replications. Now that I am in the private sector, I just wish that the focus of this cumulative science could be on more pressing (in my opinion) social psychological problems (e.g., how do we get people to act on climate change?) instead of arcane things like the power pose.

      • Mark:

        I agree. People have to avoid the research incumbency effect, and I think one way to do this is to use the time-reversal heuristic.

        But I’d change slightly to say that the right underlying question is not, Does the effect exist?, but rather, How large is the effect? Where is it positive and where is it negative? Etc. One mistake is to imagine a unitary “effect” with no variation.

        Also, it’s important to distinguish between truth and evidence: from a statistical analysis we can’t always say much about truth but we can evaluate evidence. In many cases I would not feel comfortable saying that I don’t think a particular claim is true, but I’d be happy to say that I don’t see any good evidence for the claim.
