“False-positive psychology”

Everybody’s talkin bout this paper by Joseph Simmons, Leif Nelson and Uri Simonsohn, who write:

Despite empirical psychologists’ nominal endorsement of a low rate of false-positive findings (≤ .05), flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not. We [Simmons, Nelson, and Simonsohn] present computer simulations and a pair of actual experiments that demonstrate how unacceptably easy it is to accumulate (and report) statistically significant evidence for a false hypothesis. Second, we suggest a simple, low-cost, and straightforwardly effective disclosure-based solution to this problem. The solution involves six concrete requirements for authors and four guidelines for reviewers, all of which impose a minimal burden on the publication process.

Whatever you think about these recommendations, I strongly recommend you read the article. I love its central example:

To help illustrate the problem, we [Simmons et al.] conducted two experiments designed to demonstrate something false: that certain songs can change listeners’ age. Everything reported here actually happened.

They go on to present some impressive-looking statistical results, then they go behind the curtain to show the fairly innocuous manipulations they performed to attain statistical significance.

A key part of the story is that, although such manipulations could be performed by a cheater, they could also seem like reasonable steps to a sincere researcher who thinks there’s an effect and wants to analyze the data a bit to understand it further.

We’ve all known for a long time that a p-value of 0.05 doesn’t really mean 0.05. Maybe it really means 0.1 or 0.2. But, as this paper demonstrates, that p=.05 can often mean nothing at all. This can be a big problem for studies in psychology and other fields where various data stories are vaguely consistent with theory. We’ve all known about these problems but it’s only recently that we’ve been aware of how serious they are and how little we should trust a bunch of statistically significant results.
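To make the inflation concrete, here is a minimal simulation sketch. It is not the simulation from Simmons et al.: the sample sizes, the use of two independent outcome measures, the single round of extra data collection, and the z-test with known unit variance are all simplifying assumptions chosen so the code runs on the Python standard library alone. It compares a fixed analysis (one outcome, one look at the data) with a flexible one (report whichever of two outcomes works, and add subjects if neither does) when there is no true effect.

```python
import random
from statistics import NormalDist, fmean

ALPHA = 0.05


def z_test_p(a, b):
    """Two-sided p-value for a difference in means, assuming known unit variance."""
    se = (1 / len(a) + 1 / len(b)) ** 0.5
    z = (fmean(a) - fmean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def study_is_significant(rng, flexible, n=20, extra=10):
    """Simulate one two-group study with NO true effect; True if it 'finds' one."""
    def draw(k):
        return [rng.gauss(0, 1) for _ in range(k)]

    # Two outcome measures (1, 2) for each of two groups (a, b).
    a1, b1, a2, b2 = draw(n), draw(n), draw(n), draw(n)
    if not flexible:
        # Fixed analysis: one outcome, one look at the data.
        return z_test_p(a1, b1) < ALPHA
    # Flexible analysis: report whichever outcome "works" ...
    if min(z_test_p(a1, b1), z_test_p(a2, b2)) < ALPHA:
        return True
    # ... and if neither does, collect more subjects and test again.
    a1, b1 = a1 + draw(extra), b1 + draw(extra)
    a2, b2 = a2 + draw(extra), b2 + draw(extra)
    return min(z_test_p(a1, b1), z_test_p(a2, b2)) < ALPHA


def false_positive_rate(flexible, n_sims=2000, seed=0):
    """Fraction of simulated null studies that reach p < ALPHA."""
    rng = random.Random(seed)
    hits = sum(study_is_significant(rng, flexible) for _ in range(n_sims))
    return hits / n_sims


if __name__ == "__main__":
    print("fixed analysis:   ", false_positive_rate(flexible=False))
    print("flexible analysis:", false_positive_rate(flexible=True))
```

Under these assumptions the fixed analysis should land near the nominal 5%, while the flexible one comes out clearly higher even though each individual choice looks harmless. The paper combines more kinds of flexibility than this sketch does.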

Sanjay Srivastava has some comments here. My main comment on Simmons et al. is that I’m not so happy with the framing in terms of “false positives”; to me, the problem is not so much with null effects but with uncertainty and variation.

24 thoughts on “False-positive psychology”

  1. From my experience in grad school and working in university labs, I would say that fully 80% of experimental research is manipulated this way, at least to an extent. I got out of academia in disgust because of this. It is difficult to prove because as you say “manipulations […] could also seem like reasonable steps to a sincere researcher who thinks there’s an effect and wants to analyze the data a bit to understand it further”.

    The incentives are all wrong. Professors’ and grad students’ careers depend on them finding results so they have become experts in doing research that is manipulated just enough to produce them but not enough to pass as fraud. A good 50% of the professors in the universities I’ve been to need to be fired and another 30% need to be reprimanded. I am sickened when a professor lists as a requirement on his or her personal homepage that to be accepted in his or her lab, grad students need to publish once a year. That is a sure sign of a lab that produces only fraud.

    This is a very grave problem that spreads to industries such as pharmaceuticals, which use the result-faking skills developed in universities to extract money from the sick.

    The hiring process in universities desperately needs to be changed.

    The fundamental problem is this: In most fields, even after a few years of research, the likelihood of experiments producing results interesting enough to be published is something like 10%. Universities only hire professors who produce positive results almost every year. Unless the candidate is inhumanly intelligent or lucky, he or she has to have manipulated results to achieve the required publications portfolio. Therefore universities have set up a system biased towards hiring fraudsters.

    In my mind, the hiring committees should be very suspicious of candidates who have published a lot of positive results. They should instead judge the quality and impact of candidates’ best one or two experiments, published in traditional channels or simply put online for everyone to see and critique.

    • I sympathize with a lot of this — I left psychology in part because of these contradictions. Unlike you I don’t think anyone I worked with was a real fraudster; they were largely honest people trying to make sense of messy data in the ways they’d been trained to. But it was plain to me by the end of my first year of grad school that the rules on paper were not the rules anyone actually lived by, and I’m the sort of literal-minded person who finds that uncomfortable and confusing. (Now I do stats for a living on soft money, where all the same issues are still present, but the hypotheses aren’t my own babies and I’m not trying to use them to get tenure.)

      I don’t know if I agree that your proposed fixes will work, though. Ultimately, universities don’t have to care whether the work their professors are doing is correct or useful, they just have to care whether the profs can bring in grant funding. And I think it’s going to be a hard sell to funding agencies that they shouldn’t value an applicant’s quantity of publications.

      I do agree that the fundamental problem is this equation: “Results = Food”
      What I would love is a world where null findings could be published, because then when an experiment “fails,” at least you can provide some evidence that you haven’t just spent the last year sitting on your hands. I don’t know how to get there from here, though.

  2. I posted the following to Sanjay Srivastava’s blog. Since “decision theory” is one of the keywords you’ve filed this under, it may be appropriate.


    Brent Roberts noted the idea of using a p-value (say p≤0.05=alpha) as a decision rule.

    If you are going to make a decision, you should be using decision theory, not hypothesis testing with p-values. The reason is that real decisions involve not only the probabilities, but also the loss or utility of making the decision under the states of nature that your probability model describes. Just picking a particular alpha level as a decision rule isn’t adequate in the real world.
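A minimal sketch of the point about losses, with every number invented for illustration: the action names, the loss tables, and the probabilities below are hypothetical, not taken from any study. The rule picks the action with the smallest expected loss, so the probability at which it becomes worth acting depends on the loss table rather than on a fixed alpha.

```python
ACTIONS = ("act", "wait")


def expected_loss(action, p_effect, loss):
    """Average the loss of `action` over the two states of nature."""
    return (p_effect * loss[(action, "effect")]
            + (1 - p_effect) * loss[(action, "no_effect")])


def best_action(p_effect, loss):
    """Choose the action that minimizes expected loss."""
    return min(ACTIONS, key=lambda a: expected_loss(a, p_effect, loss))


# Hypothetical case 1: missing a real effect is costly, a false alarm is cheap.
cheap_false_alarm = {
    ("act", "effect"): 0, ("act", "no_effect"): 1,
    ("wait", "effect"): 10, ("wait", "no_effect"): 0,
}
# Hypothetical case 2: a false alarm is very costly, missing an effect is cheap.
costly_false_alarm = {
    ("act", "effect"): 0, ("act", "no_effect"): 100,
    ("wait", "effect"): 1, ("wait", "no_effect"): 0,
}

print(best_action(0.3, cheap_false_alarm))   # acts despite a modest probability
print(best_action(0.9, costly_false_alarm))  # waits despite a high probability
```

With the first table it is worth acting when the effect is only 30% likely. With the second it is better to wait even at 90%, which a single fixed threshold cannot express.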

  3. Their focus on “researcher degrees of freedom” gives an interesting spin to those old complaints about the subjectivity of Bayesian analyses.
    They write:
    “Although the Bayesian approach has many virtues, it actually increases researcher degrees of freedom.”
    People often claim “you must choose one of the standard frequentist tests” as a feature, not a bug, but I hadn’t heard it expressed this way before.

    It might make sense that in some specific cases (controlled experiments looking for a pre-specified effect, with the analyses *not* done by expert statisticians), limiting researchers’ freedom could make the results “safer.”
    But if your canned frequentist method underestimates uncertainty relative to a carefully-thought-out problem-appropriate Bayesian method, you may not be doing better after all…

  4. This was a good snippet from Srivastava:

    […]A student of Jonides analyzed the same fMRI dataset under several different defensible methods and assumptions and got totally different results. I can believe that […] because that’s true with any statistical analysis where there is not strong and widespread consensus on how to do things.

    The problem is statisticians often oversell their pet method (e.g. Bayesian on this blog) to the applied scientists; when they should rather be settling the procedural details at a statistical theory level.

  5. My favorite example is this study finding support for the power of retroactive prayer: http://www.bmj.com/content/323/7327/1450.full

    Praying for people after they left hospital was found to shorten the length of their (already completed) stay (statistically significant difference compared to the control group which didn’t get prayed for; as far as I’m aware they never did the ethically correct thing which would be to follow up by treating the control :-) ).

    The original purpose of the study was to make exactly the point being made here, but it subsequently got latched onto by a bunch of nuts with no sense of irony and even less understanding of statistics – so the notion of retroactive prayer took on a life of its own.

  6. I wish people would stop linking that old xkcd comic and link the old blog post on it instead http://statmodeling.stat.columbia.edu/2011/08/that_xkcd_carto/

    What is novel/interesting about this? It seems like yet another rediscovery of the problems with using hypothesis tests. I wish academia would simply move to a less-binary concept of knowledge and belief – especially for poorly-understood complex systems (this goes back to those posts on the virtues of meta-analysis) – instead of having a freak-out every other month that p-values aren’t being interpreted correctly and problems arise if the null is mis-specified.

    • Um, that wasn’t at all my understanding of the take-home message of the paper, and I don’t get the sense that it was Andrew’s either.

      The issue is not really with p-values, misspecification of the null hypothesis, the difficulties of studying poorly-understood complex systems, or even binary concepts of knowledge and belief. It’s to do with how all the judgment calls that practicing researchers routinely make can compromise the (frequentist or Bayesian, binary or non-binary) inferences they make. Yes, the example presented in the paper is couched in terms of frequentist statistical approaches, false positives, and p-values. But the take-home message is of broader applicability, for reasons that are pretty obvious and (briefly) discussed in the paper.

      As to how novel the paper is, I think what’s attractive about the paper is not novelty per se but the very nice, clear examples. I can imagine that students in particular would find this paper great fodder for lab group discussions.

      You may want to temporarily set to one side your apparent dislike of hypothesis testing and read the paper with fresh eyes.

      • I probably phrased that too critically. But it is describing a misspecification of a null in the sense that the “researcher degrees of freedom” are not incorporated in their null model. I’m not saying that the problem is easily fixed by correctly specifying the null hypothesis because there’s no practical way to fully model researcher degrees of freedom. But the fact that the false positive rate is higher than the p-value threshold reflects that the null they want to reject against (one that theoretically incorporates confounding due to researcher flexibility) is not the one they are testing against.

        My main point is also not that these problems would disappear if hypothesis testing wasn’t performed, but that I don’t see what’s new. I agree it has merit as an expository exercise.

  7. When I teach stats, I emphasize that statistics is a tool used by a community for communicating information. Understanding the communication is only possible if you really understand how it is interpreted within the community. Different fields have different methodologies and different standards of significance and different ideas about what degrees of researcher freedom are acceptable, and the researchers in each field learn to assess the work in their field based on their familiarity with this particular set of habits. This is why a familiar but slightly inappropriate statistical analysis is often to be preferred over one that is technically correct but without strong meaning to others in the field.

    • I find this comment disturbing.

      Do we have to conform to the prejudices of a community that uses statistics in a way that is quite inappropriate, when that community clearly does not understand the meaning of what they are doing?

      Wouldn’t it be better to explain to that community why what they are doing is polluting the well? And to convince them to change their ways?

      I think that that is what the authors of this paper are trying to say.

      • Bill: I’ll jump in here

        “Wouldn’t it be better to explain to that community why what they are doing is polluting the well? And to convince them to change their ways?”

        Yup: But that may be much harder than you think.

        At least my kick at this can over 20 years ago went largely un-noticed (<100 citations).


        Using the “unfreeze, redirect and re-freeze” management model.

        Will people recognize the problem?
        It is simply that the literature cannot be safely interpreted as evidence given what the community is doing in any practical way – see John Copas and Students for just how little any advanced non-Bayesian or Bayesian (selection modelling) method can help here. (By the way, RA Fisher drew attention to this problem in the 1950s – it’s not new!)

        Is there a way to get less polluting of the well?
        (It’s never simple to fix a community with strong and varying vested interests.)

        Can you head off the re-polluting of the well?
        (Smart people are very good at gaming the system for personal advantage.)

        • Thanks for bringing that paper to my attention.

          (By the way, I jumped in on your comment because I was mostly agreeing with you.)

          Some reactions are that

          1. The overall comment of the paper that the publication bias process (especially sociological/psychological) is just too complex to get a model that is anywhere near helpful was what drove my paper with Detsky – the need to make folks embarrassingly aware of it and force some change…

          2. CR Rao’s comment in Iyengar & Greenhouse seemed to make it clear that any Bayesian method would require a highly informative prior that had no hope of being checked and hence forces you back into 1.

          3. I believe Mengersen and Wolpert would agree here and Greenland has said so (and hence they clarify the need and risk for such multivariate priors in their current work), while Copas continues to press for some sensible sensitivity type or robust modeling. (See Copas and Lozada-Can. The radial plot in meta-analysis: approximations and applications. JRSSC 2009 and references therein.)

          4. For “faced with combining the summaries, a nontrivial technical problem” I did come up with a general MC based solution for this in my thesis. It’s very related to “ABC” methods that have recently become better known and I am trying to draft a half page trivial explanation that I could effortlessly publish as a note somewhere (or just unload in a comment here). Though maybe it’s just better now to use “ABC” methods …

          5. The lack of overlap with people I have worked with is intriguing but there was one – Laupacis, A. (1997). He was my boss for a couple of years and the reason I did not withdraw as an author on Man-Son-Hing M, Laupacis A, O’Rourke K, Molnar F, Mahon J, Chan K, Wells G. Determination of the clinical importance of study results. J Gen Intern Med. 2002. In the paper an “absolution was granted” for editors to refuse to publish a study that was non-significant and of low power (wide confidence interval). I was unable to convince the other authors not to include it. Too bad I did not have Andrew’s Sex and Beauty paper as an example of why this absolution was such a bad idea.

      • Bill,

        “Wouldn’t it be better” implies some sort of either/or choice that I don’t see here. You can publish papers about the right way to do the statistics, and work to educate your colleagues (and your students) about the advantages of your approach. If you succeed, the field will be easier to understand for outsiders and produce fewer squabbles arising from poor methodology.

        However, your ability to succeed will depend (in addition to many other things, not least your own rhetorical abilities) on how far you are trying to move the field from their comfort zone. If there’s a clear problem and a convincing salesman and a pretty convenient solution, a field might move pretty far. Otherwise, there might be a limit to how much you can fix things.

        Whatever the outcome of your struggle to educate the heathens, in order to work effectively in the field, you have to play by the rules of the field. In my field: “best typical” results have to be carefully selected, raw data has to be shown or it is assumed to be terrible, numerous measures of effect are tried and the best is selected but all the ones that were tried have to be listed and must give results consistent with the reported results. After 20 years in the field, I know who to trust and how they work and what their reports mean.

        My point is that to work well, a field needs agreed upon methodologies that people can come to understand deeply more than it needs an accurate value for the probability of false positives or (for Andrew) completely accurate assessments of the uncertainties associated with parameters in the model. Of course, it’s best to have both. Of course, some methodological approaches are too flawed for even smart, experienced researchers to reliably interpret results.

  8. I cannot remember ever thinking that finding p < 0.05 was enough to warrant publication because of the many choices possible in data analysis. If I could repeat a result (with p < 0.05) again and again, that is what persuaded me. At least in my area of psychology (animal learning) this is taken for granted. Papers usually consist of several experiments that repeat the main result.

    I think the original paper (in Psychological Science) does a good job of showing the problem but a poor job of showing that the problem is common or important.
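The arithmetic behind the replication argument can be sketched as follows. This is a simplification that assumes the experiments are independent and that each one has the same chance of a spurious significant result. The 0.30 figure is a hypothetical inflated rate chosen for illustration, not a number from the paper.

```python
def prob_all_significant(per_study_rate, k):
    """Chance that k independent studies of a null effect all reach significance."""
    return per_study_rate ** k


# 0.05 is the nominal rate; 0.30 is a hypothetical rate under flexible analysis.
for rate in (0.05, 0.30):
    for k in (1, 2, 3):
        print(f"rate={rate:.2f}, k={k}: {prob_all_significant(rate, k):.6f}")
```

Repetition drives the chance of a consistent false positive down quickly in both cases. The protection is much weaker, though, if each repetition is analyzed with the same flexibility.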

