“A Vast Graveyard of Undead Theories: Publication Bias and Psychological Science’s Aversion to the Null”

Erin Jonaitis points us to this article by Christopher Ferguson and Moritz Heene, who write:

Publication bias remains a controversial issue in psychological science. . . . that the field often constructs arguments to block the publication and interpretation of null results and that null results may be further extinguished through questionable researcher practices. Given that science is dependent on the process of falsification, we argue that these problems reduce psychological science’s capability to have a proper mechanism for theory falsification, thus resulting in the promulgation of numerous “undead” theories that are ideologically popular but have little basis in fact.

They mention the infamous Daryl Bem article. It is pretty much only because Bem’s claims are (presumably) false that they got published in a major research journal. Had the claims been true—that is, had Bem run identical experiments, analyzed his data more carefully and objectively, and reported that the results were consistent with the null hypothesis—then the result would be entirely unpublishable. After all, you can’t publish an article in a top journal demonstrating that a study is consistent with there being no ESP. Everybody knows that ESP, to the extent it exists, has such small effects as to be essentially undetectable in any direct study. So here you have the extreme case of a field in which errors are the only thing that gets published.

It’s science as Slate magazine is reputed to be: if it’s true, it’s obvious so no need to publish. If it’s counterintuitive, go for it. (Just to be clear, I’m not saying the actual Slate magazine is like that; this is just its reputation.)

This is indeed disturbing and I applaud yet another publication on the topic. The authors go beyond previous research by Gregory Francis and Uri Simonsohn by focusing specifically on difficulties with meta-analyses that unsuccessfully try to overcome problems of publication bias.

There’s something called the fail-safe number (FSN) of Rosenthal (1979) and Rosenthal and Rubin (1978), “an early and still widely used attempt to estimate the number of unpublished studies, averaging null results, that are required to bring the meta-analytic mean Z value of effect sizes down to an insignificant level,” but,

The FSN treats the file drawer of unpublished studies as unbiased by assuming that their average Z value is zero. This wrong assumption appears mostly not to be recognized by researchers who use the FSN to demonstrate the stability of their results. . . . Without making this computational error, the FSN turns out to be a gross overestimate of the number of unpublished studies required to bring the mean Z value of published studies to an insignificant level. The FSN thus gives the meta-analytic researcher a false sense of security.

The false sense of security persists:

Although this fundamental flaw had been spotted early, the number of applications of the FSN has grown exponentially since its publication. Ironically, getting critiques of the FSN published was far from an easy task . . .
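The flaw is easy to see numerically. Here is a minimal Python sketch of the FSN arithmetic (my own illustration, not code from Ferguson and Heene): with ten published studies each just significant at z = 2.0, the FSN formula demands 138 zero-mean file-drawer studies, but if the unpublished studies are instead what a true null would actually leave in the drawer (nonsignificant results averaging roughly z = -0.108), only about 59 are needed.

```python
import math

def fail_safe_n(z_scores, z_crit=1.645):
    """Rosenthal's fail-safe N: how many unpublished studies with
    average Z = 0 would drop the Stouffer combined Z,
    sum(Z) / sqrt(k + N), below z_crit."""
    s, k = sum(z_scores), len(z_scores)
    return max(0, math.ceil((s / z_crit) ** 2 - k))

def studies_to_kill(z_scores, drawer_mean=0.0, z_crit=1.645):
    """Smallest file-drawer size N such that the combined Z,
    (sum(Z) + N * drawer_mean) / sqrt(k + N), falls below z_crit."""
    s, k = sum(z_scores), len(z_scores)
    n = 0
    while (s + n * drawer_mean) / math.sqrt(k + n) >= z_crit:
        n += 1
    return n

# Ten published studies, each just significant at z = 2.0.
published = [2.0] * 10
print(fail_safe_n(published))  # -> 138 zero-mean studies, per Rosenthal

# Under a true null, the file drawer holds the nonsignificant results:
# a standard normal truncated below 1.645 has mean
# -phi(1.645)/Phi(1.645), about -0.108, not zero.
print(studies_to_kill(published, drawer_mean=-0.108))  # -> only 59
```

Under the (wrong) zero-mean assumption both functions agree at 138; with the truncation-consistent drawer mean, fewer than half as many hidden studies suffice, which is the sense in which the FSN "gives the meta-analytic researcher a false sense of security."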

Problems with meta-analysis

Ferguson and Heene continue:

Meta-analyses should be more objective arbiters of review for a field than are narrative reviews, but we argue that this is not the case in practice. . . . The selection and interpretation of effect sizes from individual studies requires decisions that may be susceptible to researcher biases.

It is thus not surprising that we have seldom seen a meta-analysis resolve a controversial debate in a field. Typically, the antagonists simply decry the meta-analysis as fundamentally flawed or produce a competing meta-analysis of their own . . . meta-analyses may be used in such debates to essentially confound the process of replication and falsification.

The average effect size may be largely meaningless and spurious due to the avoidance of null findings in the published literature. This aversion to the null is arguably one of the most pernicious and unscientific aspects of modern social science.

Let me interject here that, although I am in general agreement with Ferguson and Heene on these issues, I have a bit of “aversion to the null” myself. I think it’s important to separate the statistical from the scientific null hypothesis.

– The statistical null hypothesis is typically that a particular comparison is exactly zero in the population.

– The scientific null hypothesis is typically that a certain effect is nonexistent or, more generally, that the effect depends so much on situation as to be unreplicable in general.

I might well believe in the scientific null but not in the statistical null.

Virtually unkillable

Ferguson and Heene continue:

The aversion to the null and the persistence of publication bias and denial of the same, renders a situation in which psychological theories are virtually unkillable. Instead of rigid adherence to an objective process of replication and falsification, debates within psychology too easily degenerate into ideological snowball fights, the end result of which is to allow poor quality theories to survive indefinitely. Proponents of a theory may, in effect, reverse the burden of proof, insisting that their theory is true unless skeptics can prove it false (a fruitless invitation, as any falsifying data would certainly be rejected as flawed were it even able to pass through the null-aversive peer review process described above).

Indeed. We see this reversal of the burden of proof all the time. For example, after a data alignment error was uncovered in their research, Neil Anderson and Deniz Ones notoriously wrote: “When any call is made for the retraction of two peer-reviewed and published articles, the onus of proof is on the claimant and the duty of scientific care and caution is manifestly high. . . . Goldberg et al. do not and cannot provide irrefutable proof of the alleged clerical errors. . . . We continue to stand by the findings and conclusions reported in our previous publications.” Ugh! This bothered me so much when I saw it that it made me want to barf. At the time, I wrote that it’s unscientific behavior not to admit error. Unfortunately, for reasons discussed by Ferguson and Heene, much of the scientific enterprise seems to be set up to avoid admission of error. These are serious issues, and it’s interesting to me that we as a field haven’t been thinking much about them until recently.

47 thoughts on “A Vast Graveyard of Undead Theories: Publication Bias and Psychological Science’s Aversion to the Null”

  1. One very general issue that confuses me about this whole disaster: isn’t the role of null hypothesis testing purely falsification? Why is rejection of a null being used to show a theory is correct?

    Shouldn’t one’s work consist of showing that the existing model gets falsified, while your own model makes a prediction that is not? Hypothesis testing/p-values are very rarely used for the latter purpose in practice. Researchers do the former, then jump to the conclusion that their own model is true.

    Falsification would be easier if models were actually created for the hypotheses being proposed. Without this, there’s only an implicit model with too many implicit degrees of freedom, potentially consistent with any non-zero test statistic. No wonder there’s a problem with falsification.

  2. I applaud you for posting about this and therefore, bringing more attention to the article (and of course, the issue). Unfortunately, I believe the number of psychology ‘professionals’ who are fully aware of this problem is not small.
    ~ W.S.G

  3. I agree with the top comment here. Indeed, some theories may appear unkillable because no null results are published. But if those are proper theories, then they must have testable, falsifiable predictions — that should be the real test of the theory, IMO. And if the paper just says that they measured a certain correlation of X vs Y — that’s not necessarily a theory.

  4. What I don’t get is what disasters do Frequentists really think will befall Science, Truth, puppies, kittens, apple pie and all things good, if we replace:

    F: “We failed to reject H: m1=m2 at the alpha=.05 level”

    with something like the Bayesian,

    B: “It’s very likely that -.2 < m1-m2 < .5”

    It’s a serious question. If there is so much resistance to changing the way social scientists are taught to do science, what realistic fears do Frequentists have that drive this resistance?

    And please, for Heaven’s sake, don’t say “because the Bayesian implication is subjective.” The Frequentist statement “F” above, to the extent that it’s a claim about repeated experiments which usually aren’t tested and in fact usually can’t be tested, is far less objectively verifiable than “B”.
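The commenter's contrast can be made concrete with a minimal Python sketch (my own illustration, with made-up data): for two roughly normal samples, statement "F" and a flat-prior version of statement "B" are computed from the same two numbers, the mean difference and its standard error. Under a flat prior and a large-sample normal approximation the 95% credible interval coincides numerically with the 95% confidence interval; only the interpretation differs.

```python
import math
import statistics

def f_and_b(x, y, alpha=0.05):
    """Frequentist test of H: m1 = m2 and a flat-prior Bayesian
    interval for m1 - m2 (large-sample normal approximation)."""
    diff = statistics.mean(x) - statistics.mean(y)
    se = math.sqrt(statistics.variance(x) / len(x)
                   + statistics.variance(y) / len(y))
    p = math.erfc(abs(diff / se) / math.sqrt(2))  # two-sided p-value
    lo, hi = diff - 1.96 * se, diff + 1.96 * se   # ~95% interval
    f = ("reject" if p < alpha else "fail to reject") + \
        f" H: m1=m2 at the alpha={alpha} level (p = {p:.3f})"
    b = f"it's very likely that {lo:.2f} < m1-m2 < {hi:.2f}"
    return f, b

# Identical groups: F fails to reject; B gives an interval around 0.
print(f_and_b([0, 1, 2, 3], [0, 1, 2, 3]))
# Clearly separated groups: F rejects; B gives an interval far from 0.
print(f_and_b([0, 1, 2, 3], [10, 11, 12, 13]))
```

The design point: "B" reports a range of plausible effect sizes either way, while "F" collapses the same evidence into a binary verdict about an exact-zero hypothesis.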

    • We infer those parameter values that would, with high probability, have resulted in a more significant result than we obtained. So we do infer the discrepancies indicated and those ruled out, and there is no additional “likely” term that needs to be added. It is licensed by what I call the severity principle, but regardless of what it is called, it is the basis for inferences we detach from evidence, statistical or not, in our daily lives. We simply infer things (i.e., infer that specific errors are absent) having done a good job of ruling them out.

      • I understand the difference and your SEV(H). My question is: what harm in practice do Frequentists think would befall a field like psychology if they had to draw conclusions off statements like “B” rather than “F”? Will psychologists run around making claims for which they don’t have strong evidence more than they already do? Or maybe psychologists can make correct inferences using “B” but will tend to get it wrong more often in practice?

        Also, a side point: if a set of assumptions isn’t strong enough to imply a conclusion deductively, then any such conclusions actually drawn necessarily have the status of a “best guess” and inherently lack any guarantees. The words “It’s very likely” just mean “our best guess is,” and any philosophy/methods/vocabulary that obscures this fact tends to seriously retard statistics. You’d be surprised how many students of statistics are running around thinking things like “the Central Limit Theorem guarantees the outcome will be X if N is big enough.”

  5. Admitting error is very difficult. Witness Reinhart & Rogoff’s angry op-ed defending their work in which they perform the classic avoidance maneuvers. What stuck out to me is they note the data they admit leaving out of the seminal 2010 paper was included in a 2012 paper – which is a generalized longer term study – and say, in essence, “see, we didn’t act in bad faith.” But they never amended their 2010 paper. Never posted a note on their website. Never acknowledged in any of the op-eds written over the years that they had revised data. They included this data in a way that no one would ever know it had never been in the 2010 paper. Sloppy? Dishonest? Whatever. The point is they could have said, “We should have posted a change notice on our website that noted the effects of including the omitted data.” That would have been honest and forthright.

  6. Pingback: “A Vast Graveyard of Undead Theories: Publication Bias and Psychological Science’s Aversion to the Null” | Frederick Guy

  7. Andrew: “So here you have the extreme case of a field in which errors are the only thing that gets published.”

    For all I know such extreme cases are common, and not only in psychology.

    At source we need to move away from journal editors “curating” so-called interesting findings. Just publish everything that is publishable.

    If that means the number of (on-line) publications rises 100 fold, so be it. With computers we can process all that information, and then some.

    • Fernando: > With computers we can process all that information

      Given there are unknown, evolving, almost non-identified selection (and duplication) processes involved in publication, that is an interesting claim.

      Are you assuming the selection (and duplication) processes will disappear?

      • @K?

        Not sure what you mean. Perhaps I was not clear. What I meant is publish everything that meets basic scientific standards independent of result (null, not, etc.).

        That should take care of publication bias, and, to some extent, fishing. Instead, it would emphasize the soundness of the research design, which is much more sensible.

        Of course this means many more articles would be published each year. But that is ok with me. Already millions of articles are published every year. This means traditional literature reviews are out. The future is in machine assisted ones.

      • I think he means that the editor selection process goes away if there is no editor, not that the author’s selection bias on what they choose to write up goes away.

        Duplication is pretty easy to eliminate via good indexing methods, but in the absence of editors screening for “interesting” findings, we won’t have that strange bias involved at least.

        • @Daniel

          In my view it is editors’ selection parameters that are the immediate source of the problem. Peer review should be about scientific rigor, not about “interesting”/“significant”. Authors simply respond to incentives (and so do editors, of course, but that takes us further afield).

          As for duplication, I doubt any human editor has 50 million journal articles in his/her head. A computer could do a much better job of matching records.

        • @Fernando

          Editorial selection is certainly the component of the selection process that first comes to mind and is most discussed, but all components of selectivity are problematic. As Daniel pointed out, authors selectively decide what to write up versus leave for later or discard. A more refined example is one that Sander Greenland and I termed confirmation bias: in epidemiological studies, authors vary the adjustments until their estimates confirm those in their senior colleagues’ publications.

          As for duplication, in clinical research numerous groups were caught writing up the same trial as two or three different trials by varying the authors and patient inclusion/exclusion, analysis strategies, etc. So these take some detective work to find and some of that could be automated (and the details of the code kept secret).

          Getting rid of editorial selection would be a big step forward. Some interesting work on dealing with multiple studies is currently being done by Judea Pearl and student(s) (transportability) to formalise the opportunities/challenges – but I don’t think they have started to consider selection into set of accessible studies yet.

        • @K?

          I agree with you that selection operates at various levels. You can set this up as a strategic game with various players – publishers, funders, editors, reviewers, authors, university employers, etc – and solve for equilibrium strategies/beliefs. But in a blog we do partial equilibrium….

          A key aspect, however, is technological progress. Editorial selection is in place in part due to limited *print* journal space. Remove that bottleneck by going online and the game becomes a very different one.

    • Fernando: “For all I know such extreme cases are common, and not only in psychology.”

      Well said. It seems Psychology is the poster child for this issue. The issue would seem to transcend discipline. Are null results really that easy to publish in physics?

  8. On the contrary, psychology has a vast graveyard of quite dead theories, primarily because it is most definitely not a cumulative science. Areas of research get developed, theories get formulated, data are amassed. And then, apparently, everyone loses interest and goes on to some new pursuit. Hullian theory, social learning theory, Skinnerian theory, Dollard and Miller theory, balance theory, correspondence theory, dissonance theory are just a few examples. The empirical findings related to these theories were (mostly) not wrong, the conclusions didn’t just go away. But they are not incorporated in any way into current theory or current research. They are not to be found in textbooks. Sometimes it seems to me a little like physicists saying they just are not interested in friction any more, that it’s so 1940ish (when I took physics). The theory and facts of friction have been incorporated into modern physics. That sort of thing is not so in psychology.

    • Heh, actually friction is a really tough subject, and has been largely overlooked. There are a few physicists really looking carefully at it, but at the fundamental level we still don’t understand certain aspects. Part of this is that it’s very material-specific and involves unknown surface-geometry heterogeneity and deposited pollutants.

      That being said, some of the really basic approximations are of course still taught, like static-kinetic friction, and if you take earthquake physics courses you’ll hear a little about rate and state friction.

      • Friction is these days much more of an engineering concern, and you can bet that engineers pay a LOT of attention to it.

        Physicists still teach it, pretty much the same way as always, but the attention of physicists is now directed to other things like Higgs, dark matter, dark energy and so forth.

        • That was the impression I had too. There may be the odd Solid State Physicist concerned in a serious way with friction, but you’re far more likely to find impressive theoretical work on friction done in an Engineering department (perhaps Rheology or something). I really wonder if Physicists will one day come to see the neglect of down-to-earth-but-hard topics like friction in favor of pointless-but-sexy topics like string theory as a massive mistake for the profession.

        • I agree with you and Bill above, but one area where friction is of serious interest is in Geophysics where it forms the core of the faulting process problem. I saw a quite excellent presentation in the USC Geophysics group by Jay Fineberg on some fundamental aspects of friction. They are measuring the behavior of the material at an interface between two transparent plastics during frictional sliding using lasers. Good stuff:


          But it’s funny how Lee (above) mentioned friction since it is sort of an area where physics has mostly abandoned the problem as being “too 1950’s” ;-)

    • Very interesting article. Here’s a good part:

      “Stapel did not deny that his deceit was driven by ambition. But it was more complicated than that, he told me. He insisted that he loved social psychology but had been frustrated by the messiness of experimental data, which rarely led to clear conclusions. His lifelong obsession with elegance and order, he said, led him to concoct sexy results that journals found attractive. “It was a quest for aesthetics, for beauty — instead of the truth,” he said. He described his behavior as an addiction that drove him to carry out acts of increasingly daring fraud, like a junkie seeking a bigger and better high.”

      It would be easy to dismiss this as a rationalization, a post-justification. And it likely is.

      But with this rationalization, there is a certain emotional truth that we can appreciate here. Which of us hasn’t admired the beauty of a theoretical curve that the messiness of the data only suggest?

    • Interesting. Oddly enough, just yesterday I was talking with a colleague in the psychology department who mentioned the research on the stereotype threat. I’d vaguely heard of this phenomenon but not any of the details. Anyway, I asked my colleague if this was a real effect or just one of those notorious psychology studies and she said, no, it’s very robust, it’s been replicated a lot.

      Given this, I was surprised to read List’s statement that, “when you talk behind the scenes to people in the profession, they have a hard time finding” stereotype threat. I wonder if it has to do with who List is talking with? I’ll have to ask around a bit myself, now.

      Incidentally, I think List is way too optimistic when he says, “if there have been 200 studies that try to find it, 10 should find it, right?” One thing we’ve learned in recent years is that, if you’re looking for statistical significance, a lot more than 5% of your experiments will have statistically significant results. There are just so many different ways to slice your data.
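That last point is easy to check by simulation. A minimal Python sketch (my own illustration, with arbitrary numbers, not tied to the stereotype-threat literature): if each null "study" can be sliced five ways, say five outcomes or subgroups, and any p < .05 counts as a finding, the chance of at least one "significant" result is about 1 - 0.95**5, roughly 23%, not 5%.

```python
import math
import random

def p_value_null(n, rng):
    """Two-sided p-value of a one-sample z test on n draws from a
    standard normal population (so the null is exactly true)."""
    z = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))  # = 2 * (1 - Phi(|z|))

def flexible_study(rng, n=30, n_slices=5):
    """A null 'study' that looks at n_slices independent outcomes and
    claims success if ANY of them reaches p < .05."""
    return any(p_value_null(n, rng) < 0.05 for _ in range(n_slices))

rng = random.Random(1)
sims = 4000
rate = sum(flexible_study(rng) for _ in range(sims)) / sims
print(rate)  # roughly 0.23: far above the nominal 5%
```

Real researcher degrees of freedom (covariate choices, exclusions, stopping rules) are correlated rather than independent, so five of them inflate the rate somewhat less than five independent outcomes do, but the direction of the problem is the same.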

  9. At a general level, I think a fundamental problem in psychology was identified by Feynman as a mistaken attitude:


    As he puts it, “if you’re doing an experiment, you should report everything that you think might make it invalid–not only what you think is right about it: other causes that could possibly explain your results; and things you thought of that you’ve eliminated by some other experiment, and how they worked–to make sure the other fellow can tell they have been eliminated.” Such papers exist in psychology, but they are not common.

    Feynman also describes what are essentially questionable research practices, which appear to be common in psychology.

    • I think there are a lot of causal questions that social science tries to answer that are fundamentally underdetermined. If everyone followed that advice, every paper on these subjects would basically be the same long list of possible explanations, with “… and I’m gonna pick solution #X” at the end of it. A lot of the subfields simply wouldn’t exist if they took the approach Feynman advocates.

      I think that generally when there’s this kind of underdetermination problem in the questions that define a field, rhetoric and power determine the consensus.

      • ” A lot of the subfields simply wouldn’t exist if they took the approach Feynman advocates.”

        And that would be bad, why?

      • Not every paper has to be complete with a convincing theoretical conclusion. Uncertain data can be valuable if the topic is important. It does no one a service to theorize on uncertain data. Many psychologists believe that they should let the data define their theory, but that only works if the data measurements are very precise. I fear that a lot of theories in psychology are just chasing noise. If the results are unclear or the theories are speculative, then we should describe them that way.

  10. Let’s take ESP. For most of us, we start with a strong prior that ESP does not exist. Would a study showing that it indeed doesn’t exist change our belief by much?

    Isn’t the bias towards publishing counter-intuitive findings self-balancing? The first article published saying that ESP does work now opens the arena for a flood of both types of articles, because now indeed (for some of us) the conclusion isn’t foregone, so there is space to adjust our beliefs (in both directions).

    Is my naive reasoning wrong?

  11. Our recent paper, “p-curve: a key to the file-drawer” seeks to address what seems to be the underlying problem identified here: a selectively reported set of significant findings can be deemed evidential or non-evidential.

    We also provide guidelines for conducting the analyses, which should reduce the cherry-picking of effects alluded to above.

    Paper, supplement, user-guide and web-app at http://www.p-curve.com
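For readers unfamiliar with the idea, here is a minimal Python sketch of the intuition behind p-curve (my own toy simulation, not the authors' implementation): among results that cross p < .05, p-values are roughly uniform when the null is true, so about half fall below .025, but they pile up near zero when a real effect exists. The shape of the significant p-values is therefore diagnostic even when the file drawer itself is unobservable.

```python
import math
import random

def significant_ps(effect, n, studies, rng):
    """Run `studies` one-sample z tests with true mean `effect` and
    return only the p-values below .05 (i.e., the results that would
    survive publication bias)."""
    ps = []
    for _ in range(studies):
        z = sum(rng.gauss(effect, 1.0) for _ in range(n)) / math.sqrt(n)
        p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
        if p < 0.05:
            ps.append(p)
    return ps

def frac_below(ps, cut=0.025):
    """Fraction of significant p-values in the lower half of (0, .05)."""
    return sum(p < cut for p in ps) / len(ps)

rng = random.Random(2)
null_ps = significant_ps(effect=0.0, n=20, studies=5000, rng=rng)
real_ps = significant_ps(effect=0.5, n=20, studies=2000, rng=rng)
print(frac_below(null_ps))  # near 0.5: flat curve, no evidential value
print(frac_below(real_ps))  # well above 0.5: right-skewed, real effect
```

The actual p-curve method aggregates this comparison more carefully (via pp-values and combined tests), but the flat-versus-right-skewed contrast above is the core of it.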

  12. Publication bias is a nonexistent problem for Bayesians; the editor simply applies their own prior belief about which papers are worth publishing, and that’s toooootally fine. In your face, Frequentists!… What? ;)

    • Well it would be tooootally fine if the Editor’s beliefs were correct.

      So were you taught in your statistics education that Bayesians claim all prior beliefs equally useful regardless of whether those beliefs are right or wrong?

    • “worth publishing” is purely opinion, like “worth watching” for movies. Turns out there are a lot of movies that critics really like that I just won’t spend any time on, and some of the real critically panned movies turn out to be quite good. But I’d be pretty pissed if people were out there making whole complete movies at their own cost and I couldn’t watch them in theaters or on DVD simply because some “official editors” didn’t think they were worthwhile. That’s pretty much academic publishing in a nutshell.

  13. As someone who has done work collecting data for the Census, I can verify that collecting data is not a perfect process. It’s a simple fact that we have to live with. Data collected will never be 100% accurate. Some pieces of information will be missing. Some people will not collect information accurately, and yet you as the analyzer have to have faith that your data are accurate. Bias will always be a fact of life and of science. Is there really such a thing as truth, or just close approximations?

  14. Pingback: Flurry of articles and posts in response to replication dust-up, and Nature article. | Åse Fixes Science
