
The competing narratives of scientific revolution

Back when we were reading Karl Popper’s Logic of Scientific Discovery and Thomas Kuhn’s Structure of Scientific Revolutions, who would’ve thought that we’d be living through a scientific revolution ourselves?

Scientific revolutions occur on all scales, but here let’s talk about some of the biggies:

1850-1950: Darwinian revolution in biology, changed how we think about human life and its place in the world.

1890-1930: Relativity and quantum revolutions in physics, changed how we think about the universe.

2000-2020: Replication revolution in experimental science, changed our understanding of how we learn about the world.

When it comes to technical difficulty and sheer importance of the scientific theories being disputed, this recent replication revolution is far more trivial than the earlier revolutions in biology and physics. Still, within its narrow parameters, a revolution it is. And, to the extent that the replication revolution affects research in biology, medicine, and nutrition, its real-world implications do go a bit beyond the worlds of science and the news media. The replication revolution has also helped us understand statistics better, and so I think it potentially can have large indirect effects, not just about ESP, beauty and sex ratio, etc., but for all sorts of problems in science and engineering where statistical data collection and analysis are being used, from polling to genomics to risk analysis to education policy.

Revolutions can be wonderful and they can be necessary—just you try to build a transistor using 1880s-style physics, or to make progress in agriculture using the theories of Trofim Lysenko—but the memory of triumphant past revolutions can perhaps create problems in current research. Everybody wants to make a discovery, everybody wants to be a hero. The undeniable revolutionary successes of evolutionary biology have led to a series of hopeless attempted revolutions of the beauty-and-sex-ratio variety.

The problem is what Richard Feynman called cargo-cult science: researchers try to create new discoveries following the template of successes of the past, without recognizing the key roles of strong theory and careful measurement.

We shouldn’t take Kuhn’s writings as gospel, but one thing he wrote about that made sense to me is the idea of a paradigm or way of thinking.

Here I want to talk about something related: the storylines or narratives that run in parallel with the practice of science. These stories are told by journalists, or by scientists themselves; they appear in newspapers and movies and textbooks, and I think it is from these stories that many of our expectations arise about what science is supposed to be.

In this discussion, I’ll set aside, then, stories of science as savior, ensuring clean baths and healthy food for all; or science as Frankenstein, creating atomic bombs, deadly plagues, etc.; or other stories in between. Instead, I’ll focus on the process of science and not its effects on the larger world.

What, then, are the stories of the scientific process?

Narrative #1: Scientist as hero, discovering secrets of nature. The hero might be acting alone, or with a sidekick, or as part of a Mission-Impossible-style team; in any case, it’s all about the discovery. This was the narrative of Freakonomics, it’s the narrative of countless Gladwell articles, and it’s the narrative we were trying to promote in Red State Blue State. The goal of making discoveries is one of the big motivations of doing science in the first place, and the reporting of discovery is a big part of science writing.

But then some scientists push it too far. It’s no surprise that, if scientists are given nearly uniformly positive media coverage, they will start making bigger and bigger claims. It’s gonna keep happening until something stops it. There have been the occasional high-profile cases of scientific fraud, and these can shake public trust in science, but, paradoxically, examples of fraud can give “normal scientists” (to use the Kuhnian term) a false sense of security: Sure, Diederik Stapel was disgraced, but he faked his data. As long as you don’t fake your data (or you’re not in the room where it happens), you’re fine. And I don’t think many scientists are actively faking it.

And then the revolution, which comes in three steps:

1. Failed replications. Researchers who are trying to replicate respected studies—sometimes even trying to replicate their own work—are stunned to find null results.

2. Questionable research practices. Once a finding comes into question, either from a failed replication or on theoretical grounds that the claimed effect seems implausible, you can go back to the original published paper, and often then a lot of problems appear in the measurement, data processing, and data analysis. These problems, if found, were always there, but the original authors and reviewers just didn’t think to look, or didn’t notice the problems because they didn’t know what to look for.

3. Theoretical and statistical analysis. Some unreplicated studies were interesting ideas that happened not to work out. For example, could intervention X really have had large and consistent effects on outcome Y? Maybe so. Before actually gathering the data, who knows? Hence it would be worth studying. Other times, an idea never really had a chance: it’s the kangaroo problem, where the measurements were too noisy to possibly detect the effect being studied. In that beauty-and-sex-ratio study, for example, we calculate that the sample size was about 1/100 of what would be needed to detect anything. This sort of design analysis is mathematically subtle—considering the distribution of the possible results of an experiment is tougher than simply analyzing a dataset once.
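The kangaroo-problem arithmetic can be made concrete with a minimal design-analysis sketch. The numbers below are illustrative assumptions, not the published study’s: a baseline proportion of girl births around 0.49 and a plausible true difference of 0.3 percentage points. The sketch computes the power of a two-sided z-test for a small sample and for one 100 times larger.

```python
# Design-analysis sketch (illustrative numbers, not the actual study's):
# power of a two-sided z-test for a difference in the proportion of girl
# births between two groups, given a plausible true effect size.
from math import sqrt
from statistics import NormalDist

def power_two_sided(effect, se, alpha=0.05):
    """Probability that |estimate / se| exceeds the critical value
    when the true effect is `effect` and the estimate is ~ N(effect, se)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    return nd.cdf(-z_crit - effect / se) + (1 - nd.cdf(z_crit - effect / se))

p = 0.49        # approximate baseline proportion of girl births
effect = 0.003  # plausible true difference: 0.3 percentage points (assumption)

for n_per_group in (1500, 150_000):
    se = sqrt(2 * p * (1 - p) / n_per_group)  # se of a difference in proportions
    print(f"n per group = {n_per_group:>7}: power = {power_two_sided(effect, se):.3f}")
```

With the small sample, power is barely above the 5% false-positive rate, so a “significant” result is essentially uninformative; even the hundredfold larger sample detects an effect of this size only about a third of the time.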

Points 1, 2, and 3 reinforce each other. A failed replication is not always so convincing on its own—after all, in the human sciences, no replication is exact, and the question always arises: What about that original, successful study? Once we know about questionable research practices, we can understand how those original researchers could’ve reported a string of statistically significant p-values, even from chance alone. And then the theoretical analysis can give us a clue of what might be learned from future studies. Conversely, even if you have a theoretical analysis that a study is hopeless, along with clear evidence of forking paths and even more serious data problems, it’s still valuable to see the results of an external replication.

And that leads us to . . .

Narrative #2: Science is broken. The story here is that scientists are incentivized to publish, indeed to pile up publications in prestige journals, which in turn are incentivized to get citations and media exposure. Put this together and you get a steady flow of hype, with little motivation to do the careful work of serious research. This narrative is supported by high-profile cases of scientific fraud, but what really made it take off was the realization that top scientific journals were regularly publishing papers that did not replicate, and in many cases these papers had made claims that were pretty ridiculous—not necessarily a priori false, and big news if they were true, but silly on their face, and even harder to take seriously after the failed replications and revelations of questionable research practices.

The “science is broken” story has often been framed as scientists being unethical, but this can be misleading, and I’ve worked hard to separate the issue of poor scientific practice from ethical violations. A study could be dead on arrival, but if the researcher in question doesn’t understand the statistics, then I wouldn’t call the behavior unethical. One reason I prefer the term “forking paths” to “p-hacking” is that, to my ear, “hacking” implies intentionality.

At some point, ethical questions do arise, not so much with the original study as with later efforts to dodge criticism. At some point, ignorance is no excuse. But statistics is hard, and I think we should be able to severely criticize a study without that implying a criticism of the ethics of its authors.

Unfortunately, not everyone takes criticism well, and this has led some of the old guard to argue . . .

Narrative #3: Science is just fine. Hence we get claims such as “The replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%” and “Psychology is not in crisis, contrary to popular rumor. . . . All this is normal science . . . National panels will convene and caution scientists, reviewers, and editors to uphold standards. Graduate training will improve, and researchers will remember their training and learn new standards.”

But this story didn’t fly. There were just too many examples of low-quality work getting the royal treatment from the scientific establishment. The sense that something was rotten had spread beyond academia into the culture at large. Even John Oliver got in a few licks.

Hence the attempt to promote . . .

Narrative #4: Attacking the revolutionaries. This tactic is not new—a few years ago, a Harvard psychology professor made some noise attacking the “replication police” as “shameless little bullies” and “second stringers”, and a Princeton psychology professor wrote about “methodological terrorism”—but from my perspective it ramped up more recently when a leading psychologist lied about me in print, and then when various quotes from this blog were taken out of context to misleadingly imply that critics of unreplicated work in the psychology literature were “rife with vitriol . . . vicious . . . attacks . . . threatening.”

I don’t see Narrative 4 having much success. After all, the controversial scientific claims still aren’t replicating, and more and more people—scientists, journalists, and even (I hope) policymakers—are starting to realize that “p less than 0.05” ain’t all that. You can shoot the messengers all you like; the message still isn’t going anywhere. And, from a sociology of science perspective, shooting the messenger misses the point: I’m pretty sure that even if Paul Meehl, Deborah Mayo, John Ioannidis, Andrew Gelman, Uri Simonsohn, Anna Dreber, and various other well-known skeptics had never been born, a crisis would still have arisen regarding unreplicated and unreplicable research results that had been published and publicized in prestigious venues. I’d like to believe that our work, and that of others, has helped us better understand the replication crisis, and can help lead us out of it, but the revolution would be just as serious without us. Calling us “terrorists” isn’t helping any.

OK, so where do these four narratives stand now?

Narrative #1: Scientist as hero. This one’s still going strong. Malcolm Gladwell, Freakonomics, that Tesla guy who’s building a rocket to Mars—they’re all going strong. And, don’t get me wrong, I like the scientist-as-hero story. I’m no hero, but I do consider myself a seeker after truth, and I don’t think it’s all hype to say so. Just consider some analogies: Your typical firefighter is no hero but is still an everyday lifesaver. Your typical social worker is no hero but is still helping people improve their lives. Your typical farmer is no hero but is still helping to feed the world. Etc. I’m all for a positive take on science, and on scientists. And, for that matter, Gladwell and the Freakonomics team have done lots of things that I like.

Narrative #2: Science is broken. This one’s not going anywhere either. Recently we’ve had that Pizzagate professor from Cornell in the news, and he’s got so many papers full of errors that the drip-drip-drip on his work could go on forever. Meanwhile, some of the rhetoric has improved but the incentives for scientists and scholarly journals haven’t changed much, so we can expect a steady stream of weak, mockable papers in top journals, enough to continue feeding the junk-science storyline.

As long as there is science, there will be bad science. The problem, at least until recently, is that some of the bad science was getting a lot of promotion from respected scientific societies and from respected news outlets. The success of Narrative 2 may be changing that, which in turn will, I hope, lead to a decline in Narrative 2 itself. To put it in more specific terms, when a paper on “the Bible Code” appears in Statistical Science, an official journal of the Institute of Mathematical Statistics, then, yes, science—or, at least, one small corner of it—is broken. If such papers only appear in junk journals and don’t get serious media coverage, then that’s another story. After all, we wouldn’t say that science is broken just cos astrology exists.

Narratives #3 and 4: Science is just fine, and Attacking the revolutionaries. As noted above, I don’t see narrative 3 holding up. As various areas of science right themselves, they’ll be seen as fine, but I don’t think the earlier excesses will be forgotten. That’s part of the nature of a scientific revolution, that it’s not seen as a mere continuation of what came before. I’m guessing that scientists in the future will look in wonderment, imagining how it is that researchers could ever have thought that it made sense to treat science as a production line in which important discoveries were made by pulling statistically significant p-values out of the froth of noise.

As for the success of Narrative 4, who knows? The purveyors of Narrative 4 may well succeed in their short-term goal of portraying particular scientific disagreements in personal terms, but I can’t see this effort having the effect of restoring confidence in unreplicated experimental claims, or restoring the deference that used to be given to papers published in prestigious journals. To put it another way, consider that one of the slogans of the defenders of the status quo is “Call off the revolutionaries.” In the United States, being a rebel or a revolutionary is typically considered a good thing. If you’re calling the other side “revolutionaries,” you’ve probably already lost.

An alternative history

It all could’ve gone differently. Just as we can imagine alternative streams of history where the South did not fire on Fort Sumter, or where the British decided to let go of India in 1900, we can imagine a world in which the replication revolution in science was no revolution at all, but just a gradual reform: a world in which the beauty-and-sex-ratio researcher, after being informed of his statistical errors in drawing conclusions from what were essentially random numbers, had stepped back and recognized that this particular line of research was a dead end, that he had been, in essence, trying to send himself into orbit using a firecracker obtained from the local Wal-Mart; a world in which the ovulation-and-clothing researchers, after recognizing that their data were so noisy that their results could not be believed and after recognizing they had so many forking paths that their p-values were meaningless, decided to revamp their research program, improve the quality of their measurements, and move to within-person comparisons; a world in which the celebrated primatologist, after hearing from his research associates that his data codings were questionable, had openly shared his videotapes and fully collaborated with his students and postdocs to consider more general theories of animal behavior; a world in which the ESP researcher, after seeing others point out that forking paths made his p-values uninterpretable and after seeing yet others fail to replicate his study, had recognized that his research had reached a dead end—no shame in that, we all reach dead ends, and the very best researchers can sometimes spend decades on a dead end; it happens; for that matter, what if Andrew Wiles had never reached the end of his particular tunnel and Fermat’s last theorem had remained standing, would we then say that Wiles had wasted his career?
No, far from it; there’s honor in pursuing a research path to its end; a world in which the now-notorious business school professor who studied eating behavior had admitted from the very beginning—at least six years ago now it was that he first heard from outsiders about the crippling problems with his published empirical work—that he had no control over the data reported in his papers, and had stopped trying to maintain that all his claims were valid, and instead worked with colleagues to design careful experiments with clean data pipelines and transparent analyses; a world in which that controversial environmental economist had taken the criticism of his work to heart, instead of staying in debate mode had started over, instead of continuing to exercise his talent of getting problematic papers published in good journals had decided to spend a couple years disentangling these climate-and-economics models he’d been treating as data points and really working out their implications; a world in which the dozens of researchers who had prominent replication failures or serious flaws in their published work had followed those leaders mentioned above and had used this adversity as an opportunity for reflection and improvement, as an aside thanking their replicators and critics for going to the trouble of taking their work seriously enough to find its problems; a world in which thousands of researchers whose research hadn’t been checked by others had gone to check their own work, not wanting to publish claims that would not replicate. In this alternative world, there’s no replication crisis at all, just a gradual reassessment of past work, leading gently into a new paradigm of careful measurement and within-person comparison.

Why the revolution?

We tend to think of revolutions as inevitable. The old regime won’t budge, the newcomers want to install a new system, hence a revolution. Or, in scientific terms, we assume there’s no way to resolve an old and a new paradigm.

In the case of the replication crisis, the old paradigm is to gather any old data, find statistical significance in a series of experiments, and then publish and publicize the results. The experiments are important, the conclusions are important, but the actual gathering of data is pretty arbitrary. In the new paradigm, the connection of measurement to theory is much more important. On the other hand, the new paradigm is not entirely new, if we consider fields such as psychometrics.

As remarked above, I don’t think the revolution had to happen; I feel that we could’ve gone from point A to point B in a more peaceful way.

So, why the revolution? Why not just incremental corrections and adjustments? Research team X does a study that gets some attention, others follow up and apparently confirm it. But then, a few years later, team Y comes along with an attempted replication that fails. Later unsuccessful replications follow, along with retrospective close readings of the original papers that reveal forking paths and open-ended theories. So far, no problem. This is just “normal science,” right?

So here’s my guess as to what happened. The reform became a revolution as a result of the actions of the reactionaries.

Part of the difficulty was technical: statistics is hard, and when the first ideas of reform came out, it was easy for researchers to naively think that statistical significance trumped all objections. After all, if you write a paper with 9 different experiments, and each has a statistically significant p-value, then the probability of all that success, if really there were no effect, is (1/20)^9. That’s a tiny number which at first glance would seem impervious to technicalities of multiple comparisons. Actually, though, no: forking paths multiply as fast as p-values. But it took years of exposure to the ideas of Ed Vul, Hal Pashler, Greg Francis, Uri Simonsohn, and others to get this point across.
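A quick simulation shows how forking paths defeat the (1/20)^9 arithmetic. The setup below is a deliberately simplified assumption: each experiment offers the analyst ten independent null comparisons (which outcome, which subgroup, which adjustment), any one of which can be reported as the finding.

```python
# Forking-paths sketch: under a pure null, an analyst with ten
# data-dependent choices per experiment reports p < 0.05 far more
# often than 5% of the time. Treating the forks as independent is
# a simplifying assumption; real forks are correlated.
import random

random.seed(1)

def experiment_is_significant(n_forks=10):
    """True if any of n_forks null z-statistics clears |z| > 1.96."""
    return any(abs(random.gauss(0, 1)) > 1.96 for _ in range(n_forks))

n_sims = 20_000
rate = sum(experiment_is_significant() for _ in range(n_sims)) / n_sims

print(f"P(significant | null, 10 forks) ~ {rate:.2f}")  # theory: 1 - 0.95**10 = 0.40
print(f"Nine successes, no forking:   {0.05 ** 9:.1e}")
print(f"Nine successes, with forking: {rate ** 9:.1e}")
```

Under these assumptions, nine “significant” experiments in a row go from a roughly one-in-a-trillion coincidence to something on the order of one in a few thousand, before even counting forks shared across experiments or the file drawer.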

Another difficulty is attachment to particular scientific theories or hypotheses. One way I’ve been trying to help with this one is to separate the scientific models from the statistical models. Sometimes you gotta get quantitative. For example, centuries of analysis of sex ratios tell us that variations in the percentage of girl births are small. So theories along these lines will have to predict small effects. This doesn’t make the theories wrong, it just implies that we can’t discover them from a small survey, and it should also open us up to the possibility of correlations that are positive for some populations in some settings, and negative in others. Similarly in fMRI studies, or social psychology, or whatever: The theories can have validity even if they can’t be tested in sloppy experiments. This could be taken as a negative message—some studies are just dead on arrival—but it can also be taken positively: just cos a particular experiment or set of experiments are too noisy to be useful, it doesn’t mean your theory is wrong.

To stand by bad research just because you love your scientific theory: that’s a mistake. Almost always, the bad research is noisy, inconclusive research: careful reanalysis or failed replication does not prove the theory wrong, it just demonstrates that the original experiment did not prove the theory to be correct. So if you love your theory (for reasons other than its apparent success in a noisy experiment), then fine, go for it. Use the tools of science to study it for real.

The final reason for the revolution is cost: the cost of giving up the old approach to science. That’s something that was puzzling me for a while.

My thinking went like this:
– For the old guard, sure, it’s awkward for them to write off some of the work they’ve been doing for decades—but they still have their jobs and their general reputations as leaders in their fields. As noted above, there’s no embarrassment in pursuing a research dead end in good faith; it happens.
– For younger researchers, yes, it hurts to give up successes that are already in the bank, as it were, but they have future careers to consider, and so why not just take the hit, accept the sunk cost, and move on.

But then I realized that it’s not just a sunk cost; it’s also future costs. Think of it this way: If you’re a successful scientific researcher, you have a kind of formula or recipe, your own personal path to success. The path differs from scientist to scientist, but if you’re in academia, it involves publishing, ideally in top journals. In fields such as experimental biology and psychology, it typically involves designing and conducting experiments, obtaining statistically significant results, and tying them to theory. If you take this pathway away from a group of researchers—for example, by telling them that the studies that they’ve been doing, and that they’re experts in, are too noisy to be useful—then you’re not just wiping out their (reputational) savings, you’re also removing their path to future earnings. You’re not just taking away their golden eggs, you’re repossessing the goose they were counting on to lay more of them.

It’s still a bad idea for researchers to dodge criticism and to attack the critics who are trying so hard to help. But on some level, I understand it, given the cost both in having to write off past work and in losing the easy path to continuing future success.

Just remember that, for each of these people, there may well be three other young researchers who were doing careful, serious work but then didn’t get picked for a plum job or promotion because it was too hard to compete with other candidates who did sloppy but flashy work that got published in top journals. It goes both ways.

Summary (for now)

We are in the middle of a scientific revolution involving statistics and replication in many areas of science, moving from an old paradigm in which important discoveries are a regular, expected product of statistically significant p-values obtained from routine data collection and analysis, to a new paradigm of . . . weeelll, I’m not quite sure what the new paradigm is. I have some ideas related to quality control, and when it comes to the specifics of design, data collection, and analysis, I recommend careful measurement, within-person comparisons, and multilevel models. Compared to ten years ago, we have a much better sense of what can go wrong in a study, and a lot of good ideas of how to do better. What we’re still struggling with is the big picture, when we move away from the paradigm of routine discovery to a more continuous sense of scientific progress.


  1. Jag Bhalla says:

    As usual very useful thoughts. But you don’t mention causal logic, and I suspect there are method & logic problems with the “suitcase concept” of cause, which are as widespread as statistical significance misunderstanding/misuse. They’re partially addressed in Judea Pearl’s “Causal Revolution” (The Book of Why)… though his approach also risks “heterogeneity-hiding abstraction and logic-losing numbers”
    (further details in this short piece)

  2. I’m not sure that ‘replication’ necessarily portends a scientific revolution. Daniele Fanelli gave a very good keynote on this which I can’t quite locate at the moment.

    I think the collaborations forged in this time are more apt to constitute a scientific revolution. I base this on Goodman, Fanelli, and Ioannidis’s call for standardization of the terminology we use. I think, rather, that a scientific revolution is going to be more determined by whether the neurosciences and other disciplines can empirically validate their hype.

  3. Jonathan (another one) says:

    This still leaves the problem of the press and the public. Once statistics and science got too hard for the layman to understand without a lot of study we devolved to a faith-based reporting mechanism: science journalists (some of whom are great and some of whom are terrible) served as trusted intermediaries. How you (uninformed member of the public) came to trust whom you trust derives from political leanings, writing style, and conformance with naive priors among other non-science-based factors. This is the problem… without understanding the science one can’t use science-based mechanisms to establish whom to trust.

    At that point, large swaths of public policy are intermediated in ways that have fundamentally little relevant scientific component. Maybe it was always so — only a small fraction of the public truly understood science well enough to separate good work from sloppy work. But that was less of a problem when the feedback from science to policy was less important. If a few witches got burned, that’s a personal tragedy for a few witches. If potent carcinogens get into the food supply (or conversely, if cheap pest-resistant foods are mistakenly banned owing to false concerns about carcinogens) you get massive human suffering.

    • yyw says:

      People who don’t have solid analytical skills should not be science journalists. It’s probably too much to ask these days given that a nontrivial portion of so-called scientists have minimal analytical skills.

      • Jonathan (another one) says:

        Solid analytical skills are indeed rare, but in this case they aren’t even enough! Analytical skills without domain-specific knowledge are only of modest help for a variety of reasons, most of which have to do with the jargon-riddled inside-baseball unstated premises of the typical journal article. And expecting journalists to have domain-specific knowledge *in addition to* solid analytical skills is almost completely hopeless.

  4. Wonks Anonymous says:

    The idea of within-person comparisons wasn’t salient enough in my head to remember your writings, so I had to search your blog for a previous time you discussed it. There you gave a hypothetical example of a study rewritten to use within-person comparisons. Is there some existing research you’d point to as a good example of using such comparisons? My recollection of Red State Blue State was that it was all between persons.

  5. Thanatos Savehn says:

    The thought that sprang to my mind while considering this and Keith’s post on C.S. Peirce is that the revolutions perhaps parallel the paths of logic. First, abduction. Without reference to beliefs about the world that are outside of experience, what explanation, given these observations alone, best combines and makes sense of them? Second, deduction. The experimenters have found relationships between particles and energy that do not vary and that can be expressed mathematically. What sort of world is entailed if we put them together and solve for x; and what sort of testable observations does x imply? The current revolution then is non-trivial – it would be the induction revolution. In unfathomably complex or emergent systems biased observations betray us; we’re not even sure what x is and at best can only approximate it. How shall we model observations and the variation seen among the many variables involved so as to generate testable predictions which in turn refine the model?

    Each would be overlapping and each afflicted by its own peculiar weakness. Observations are made by people prone to see things through distorting lenses. Put too many things together and solve for too many x’s and you wind up wishing you were in some other universe where Peter Woit didn’t exist and where 11 or 18 or 18 zillion dimensions made sense. Making models tempts one to create the world as you would have it be rather than being predictive of how it embarrassingly is, and ill-defined foundations, imprecise language and software that makes the sublime seem obvious only exacerbate the problem.

    Whether Pearl’s causal revolution is an extension of the deductive one into relationships long thought beyond mathematical formalism or instead fits within the inductive one I’ll leave for Drs. Pearl and Dawid to sort out.

    • I don’t think that we should leave these queries simply to a few experts. I haven’t yet read Pearl’s book, though. Quality collaborations provide more opportunities to share scientific knowledge. We have enough of a gap between the small number of those who engage in these discussions and the rest. This is why our democracy is vulnerable. We need more citizen scientists.
      Besides, based on Pearl’s and Hernan’s Youtube talks, I have at times expressed some of the same viewpoints even before I heard of these two academics.

  6. Anoneuoid says:

    2000-2020: Replication revolution in experimental science, changed our understanding of how we learn about the world.

    I really don’t see how the need for reproducibility can be considered a novel idea. It’s at least as old as that original Royal Society motto.

    What really happened was people were lazy and thought they could get away with not checking each other’s work. Also, that if they did enough calculations that included probability, or asked each other their opinions in just the right way, or whatever, somehow they would know the probability their results were reproducible. It all seemed to work as long as no one thought too hard about it or actually tried to do direct replications, and they could snow anyone who questioned it under with confusing terminology and pedantic calculations you supposedly needed to spend years figuring out first (and buy into) before even having the right to question them.

    Now, as should have been 100% expected, a huge mess has been generated. There is lots of blame to go around, but I doubt any progress will be made if you let those responsible get away with “we couldn’t have known better”.

    • Mayo says:

      I agree. The problems with data-dependent selections, ad hoc saves, confirmation bias, and all the rest, are front and center in Fisher and in Neyman and Pearson (N-P), and are ancient fallacies pervasive through all science, not just statistics. To view them as new revelations is wrongheaded and only gives excuses to those who claim “how were we to know, this is all new?”. Nevertheless, I did come to the realization around 8 years ago that in some fields it was acceptable behavior to try and try again for significance—it came as a shock to me, especially given the work of Meehl and others. So maybe some people can claim to have been taught badly, I’m not doubting this. To the extent that this is no longer acceptable, the change may be seen as revolutionary. Moreover, even people like Simonsohn say they didn’t suspect some common QRPs would blow up error probabilities as much as they do. So some older results had to be rediscovered (using simulations).
      But where does this leave the so-called new paradigm of “data-driven” science? There are supposed to be constraints that prevent it from being of limited relevance when it comes to generalizing beyond data, but I hear (from data science people) that this is problematic. That’s a new area for me: philosophy of machine learning maybe?

      • Anoneuoid says:

        So maybe some people can claim to have been taught badly

        Children are taught to believe in Santa Claus and still figure it out. I don’t see any difference between that and the more egregious things going on, like p-hacking your way to statistical significance. So many people come up with excuses that blinding, controls, or whatever basic technique is not needed in their specialty…

        But where does this leave the so-called new paradigm of “data-driven” science? There are supposed to be constraints that prevent it from being of limited relevance when it comes to generalizing beyond data, but I hear (from data science people) that this is problematic. That’s a new area for me: philosophy of machine learning maybe?

        It’s all about choosing the right metric for success and using the right validation/test dataset that your “model” is naive towards. If you can get good predictive skill for the resources required, and trust that it extrapolates to your eventual purpose, then you win. I’d guess all the philosophy will really need to be in “trust that [performance on the test dataset] extrapolates to your eventual purpose”.
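        To make that concrete, here is a minimal sketch of the held-out-test-set idea in plain Python (the data, "model", and metric are all hypothetical choices for illustration): fit on one slice of the data, score on a slice the fitting step never touched.

```python
import random

random.seed(0)

# Hypothetical data: y is linear in x plus noise.
data = [(x, 2.0 * x + random.gauss(0, 1)) for x in range(200)]
random.shuffle(data)

# Hold out a test set that the fitting step never sees.
train, test = data[:150], data[150:]

# "Model": least-squares slope through the origin, fit on train only.
slope = sum(x * y for x, y in train) / sum(x * x for x, y in train)

# Metric chosen up front: mean absolute error on the held-out test set.
mae = sum(abs(y - slope * x) for x, y in test) / len(test)
print(slope, mae)
```

The philosophical load-bearing step is not in the code at all: it is the decision to trust that the test-set error extrapolates to the eventual purpose.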

  7. yyw says:

    I think the current problem in science goes beyond the replication problem in social science and maybe biology. In hard STEM, you don’t see as many flat out wrong and ridiculous findings, but there are many useless and trivial results published even in top journals.

  8. Jonathan says:

    I’m not sure I saw this explicitly, though you talk about the difficulties of statistical work: as we better understand the tools of statistics, we better understand the multi-dimensionality of the data, and better grasp that when it’s compressed to a plot or graph we can be easily misled by correlations that appear when the data is compressed or viewed in a specific manner. Statistics is viewable as invariant only when dealing with invariant things, notably in its applications within physics (and engineering, chemistry, etc.). I hesitated to include biology, for example, because there are different uses for statistical ideas in biology. Replication becomes an issue when you see the need for replication, which you see when you realize that the data is being sliced up and viewed from an inherently variant perspective. The problem becomes more acute as we see that it extends past fraud, past absurdities claiming to be fact, to the very ideas of what we conclude and why and how we conclude it.

  9. Dzhaughn says:

    “Science is broken” is a binary reduction of a continuous variable. Scientific institutions misallocate resources; so does everyone. The question is: what is the opportunity for improvement? I think Andrew and allies make a good case for a large opportunity, provided that what we care about is knowledge of the world, with a special fondness for knowledge which gives us a better ability to alter things, just a little bit, to our benefit.

    In the best case, that reform is going to mean a lot of researchers get fewer resources and much less fame than they hope for. Less than their thesis advisors got. Reductions will not be equal across departments or subdisciplines. So, don’t be surprised if the fight continues; there are a lot of capable people with big investments and strong political positions here. And lots of paths to fake reform. (I don’t find Andrew’s happy counterfactual world above so realistic.)

    In the worst case, it may be that all we really demand from these institutions is a story that we are making progress toward utopia. If so, maybe the p < .05 system works better than the reformed one. At least for the medium term; say, the duration of recognizable political systems.

  10. I agree with everything Gelman writes in his interesting overview, but I still think there are some hidden obstacles to progress that rarely get the limelight, notably underlying assumptions about the nature of statistical inference in precisely the contexts he discusses. The replication crisis has had a constructive upshot in understanding statistical foundations–or, at least, it ought to have. Notably, while significance tests are intended to bound the probabilities of erroneous interpretations of data, this error control is nullified by cherry-picking, multiple testing, trying and trying again, post-data subgroups, and other biasing selection effects. If we didn’t know this before, the replication crisis opens our eyes. However, the same flexibility can occur when the cherry-picked or p-hacked hypothesis enters into methods promoted as alternatives to significance tests in various “world beyond P-values” forums, and moves to “redefine significance”. For example, it’s admitted that

    Bayesian analysis does not base decisions on error control. Indeed, Bayesian analysis does not use sampling distributions. … As Bayesian analysis ignores counterfactual error rates, it cannot control them. (Kruschke and Liddell 2017, 13, 15)

    This is in reply to me:

    This is not true for all Bayesians, and Gelman is a notable exception (a link is in my note, or search my blog.) By the way, while error probabilities turn on hypothetical or counterfactual considerations, notably, what would have been inferred under different data, they are not themselves counterfactual error rates. The whole “redefine significance” movement is based on Bayes factors and the fact that P-values don’t match them (when a lump of prior is given to a “no effect” point null).

    “In fact, Bayes factors can be used in the complete absence of a sampling plan…” (Bayarri, Benjamin, Berger, Sellke 2016, p.100)

    Error statisticians have perfectly good grounds to appeal to, if they wish to adjust the balance of type 1 and type 2 errors (if they are working with cut-offs), but to appeal to a methodology at odds with error probabilities makes no sense. Take Wagenmakers:

    P values can only be computed once the sampling plan is fully known and specified in advance. In scientific practice, few people are keenly aware of their intentions, particularly with respect to what to do when the data turn out not to be significant after the first inspection. Still fewer people would adjust their p values on the basis of their intended sampling plan. (Wagenmakers 2007, 784)

    Rather than insist they ought to adjust, Wagenmakers dismisses a concern with “hypothetical actions for imaginary data” (ibid.). Reducing the selection effects to “intentions” in someone’s head makes it easy to ignore the consequences of what was done. Ironically, this relinquishes a main ground for criticizing those guilty of biasing selection effects (without having to resort to a high prior on a “no effect” (point) null hypothesis).
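    The inflation produced by “what to do when the data turn out not to be significant after the first inspection” is easy to see by simulation. A minimal sketch in plain Python (the batch size, maximum n, and two-sided z-test with known variance are illustrative assumptions): compare the false positive rate when the analyst peeks after every batch with the rate from a single pre-planned test. Under the null, peeking rejects far more often than the nominal 5%.

```python
import math
import random

random.seed(1)

def p_two_sided(mean, n):
    # Two-sided p-value for H0: mu = 0 with known sd = 1 (z-test).
    z = mean * math.sqrt(n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def one_experiment(peek):
    # Null data arrive in batches of 10, up to n = 100.
    xs = []
    for _ in range(10):
        xs.extend(random.gauss(0, 1) for _ in range(10))
        if peek and p_two_sided(sum(xs) / len(xs), len(xs)) < 0.05:
            return True  # stopped early on "significance": a false positive
    return p_two_sided(sum(xs) / len(xs), len(xs)) < 0.05

n_sim = 2000
peeking_rate = sum(one_experiment(True) for _ in range(n_sim)) / n_sim
fixed_rate = sum(one_experiment(False) for _ in range(n_sim)) / n_sim
print(peeking_rate, fixed_rate)  # peeking runs well above the nominal 0.05
```

Nothing about the analyst’s “intentions” appears in the code; the inflation comes entirely from what was done with the data.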

    According to Stephen Goodman, co-director of the Meta-Research Innovation Center at Stanford:

    Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value…But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense. (Goodman 1999, 1010)

    To a skeptical critic, options that alter the error probing capacities of methods have everything to do with the data and how to interpret it. To his credit, Goodman is open about his philosophical assumptions. However, the current discussions go on largely divorced from these underlying foundational disagreements (e.g., about evidence and inference).
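    The multiple-comparisons half of Goodman’s pair can be simulated in the same spirit. A minimal sketch in plain Python (the 20 hypotheses, n = 30 per test, and the Bonferroni correction are illustrative assumptions): with 20 true nulls each tested at .05, the chance of at least one “significant” result is far above 5%, while dividing the threshold by the number of tests restores the family-wise rate.

```python
import math
import random

random.seed(3)

def null_pvalue(n):
    # Two-sided z-test p-value for one study of n draws from N(0, 1).
    mean = sum(random.gauss(0, 1) for _ in range(n)) / n
    z = mean * math.sqrt(n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

m, n_sim = 20, 2000  # 20 true null hypotheses per "study", 2000 studies

# Family-wise error rate: at least one p below the threshold among the 20 nulls.
unadjusted = sum(any(null_pvalue(30) < 0.05 for _ in range(m))
                 for _ in range(n_sim)) / n_sim
bonferroni = sum(any(null_pvalue(30) < 0.05 / m for _ in range(m))
                 for _ in range(n_sim)) / n_sim
print(unadjusted, bonferroni)
```

Whether one regards the adjustment as “defying scientific sense” or as essential error control is exactly the foundational disagreement at issue; the rates themselves are not in dispute.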
    Rejecting the relevance of the error probabilities (and the corresponding sampling distribution) post data is at the very least “in tension” with providing a justification for preregistered reports, considered one of the most effective means of achieving replicable results. The critical reader of a registered report would look, in effect, at the sampling distribution, the probability that one or another hypothesis, stopping point, choice of grouping variables, etc. could have led to a false positive–even without a formal error probability computation. So I think attention to some of these hidden issues will be needed to get to the new paradigm Gelman invites. I know that Gelman’s own Bayesian position is sensitive to these issues, but they are rarely brought out into the open. That’s the rationale behind an upcoming RSS forum (Sept 3), and my new book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, Sept. 2018).

    Bayarri, M., Benjamin, D., Berger, J., and Sellke, T. (2016). “Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses,” Journal of Mathematical Psychology 72: 90-103.
    Goodman, S. N. (1999). “Toward Evidence-Based Medical Statistics. 2: The Bayes Factor,” Annals of Internal Medicine 130: 1005-1013.
    Kruschke, J. K., and Liddell, T. M. (2017). “The Bayesian New Statistics: Hypothesis Testing, Estimation, Meta-analysis, and Power Analysis from a Bayesian Perspective,” Psychonomic Bulletin & Review: 1-29.
    Wagenmakers, E.-J. (2007). “A Practical Solution to the Pervasive Problems of P Values,” Psychonomic Bulletin & Review 14(5): 779-804.

    • Andrew says:


      Thanks for the comments, which I especially appreciate given that some commenters here can be rude to you.

      Regarding a couple of the quotes you cite:

      “Redefine significance”: I think the whole “significance” thing is a mistake, an attempt to deny uncertainty.

      “In fact, Bayes factors can be used in the complete absence of a sampling plan…”: This does not bother me. We often need to analyze data that do not come from any sampling plan, and even when we do have a sampling plan, it is often not so relevant, what with missing data.

      “Hypothetical actions for imaginary data”: This is pretty much the essence of probability, including Bayesian inference (consider the famous Monty Hall problem, for example), so I don’t see why anyone should think this is a bad thing.

      • Mayo says:

        I know you often say this, but I don’t see how significance tests seek certainty when they’re all about trying to quantify error probabilities. I wonder if you mean that the form of conclusion is not a probability assignment. But that wouldn’t mean it sought certainty. Now your own work employs significance tests of a type to test models, and your inferences aren’t claimed to be certain. Instead the upshot is a form of statistical falsification–which is really the only kind of falsification we get in the real world (outside of philosophical or toy examples, like all swans are white).

        The point of my comment is that there is an important tension in current discussions of the statistical crisis of replication–one that is rooted in older Bayesian/frequentist debates, but is much more pronounced now that preregistration and the ills of biasing selection effects are so well known. Until this is brought out, key disputants will talk past each other.

        • Martha (Smith) says:

          “I don’t see how significance tests seek certainty when they’re all about trying to quantify error probabilities. “

          I don’t think it’s that significance tests inherently seek certainty, but that all too often researchers seek certainty, and misinterpret results of significance tests as giving the certainty they seek.

        • Keith O'Rourke says:

          > don’t see how significance tests seek certainty when they’re all about trying to quantify error probabilities.
          Well, the quantification of the error probabilities can easily be misperceived as overly certain.

          A couple of instances: “Jager and Leek’s key assumption is that p-values of null effects will follow a uniform distribution. We argue that this will be the case under only very limited settings,” and meta-analyses that uncritically claimed combined p-values so low that they could easily have been due to the individual p-values being even slightly non-uniform.

          More generally in applications, rather than toy problems, there are nuisance parameters, and then the error probabilities under, say, a null treatment effect are (sometimes erratic) functions of the unknown nuisance parameter values. It was only in about 2010 that it was no longer an ordeal to make that clear in plots using simulation under the null. For an example see page 68 here
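          A minimal simulation under the null makes the uniformity point directly (plain Python; the sample sizes are illustrative assumptions): plugging an estimated standard deviation into a z-test, a common shortcut, leaves the null p-values only approximately uniform, and at small n the fraction below .05 is noticeably above 5% even though the null is true in every simulated study.

```python
import math
import random
import statistics

random.seed(2)

def pretend_z_pvalue(xs):
    # Treats mean / (s / sqrt(n)) as N(0, 1) even though s is estimated,
    # i.e., a t-statistic referred to the normal distribution.
    z = statistics.mean(xs) * math.sqrt(len(xs)) / statistics.stdev(xs)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# The null (mu = 0) is true in every simulated study.
small_n = [pretend_z_pvalue([random.gauss(0, 1) for _ in range(5)])
           for _ in range(4000)]
large_n = [pretend_z_pvalue([random.gauss(0, 1) for _ in range(200)])
           for _ in range(4000)]

# If null p-values were uniform, about 5% would fall below 0.05.
print(sum(p < 0.05 for p in small_n) / 4000)  # noticeably above 0.05
print(sum(p < 0.05 for p in large_n) / 4000)  # close to 0.05
```

With nuisance parameters the picture gets worse still, since the rejection rate then varies with unknown quantities rather than converging to the nominal level.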

          As an aside, it was very difficult for me to find an example graph online – I had to wade through dozens of articles with convoluted arguments and formulas but not a single relevant graph on this point. And the example does not even have nuisance parameters. It is as if the statistical literature is trying to hide the obvious – at least from non-technical readers ;-(

    • Deborah, I just ordered a copy of your book. I think I will learn a lot from it.

      From the book description: “The book sets sail with a simple tool: if little has been done to rule out flaws in inferring a claim, then it has not passed a severe test. Many methods advocated by data experts do not stand up to severe scrutiny and are in tension with successful strategies for blocking or accounting for cherry picking and selective reporting. Through a series of excursions and exhibits, the philosophy and history of inductive inference come alive. Philosophical tools are put to work to solve problems about science and pseudoscience, induction and falsification.”

      From this I understand that some flaws can be discovered through scrutinizing an inference–before one even peruses the details of the study. Not only that, but such mental testing should be part of the original investigation as well as subsequent review.

      • I think that some people are intrinsically better diagnosticians due to a host of factors. Only a small percentage, even given the wherewithal, has the capacity to think very critically. I believe that cross-disciplinary collaborations of exceptional thinkers, going forward, will be the means by which progress is made. We see that happening already, here and in Europe.

        It’s that some experts get locked into promoting their own methods. I think Paul Feyerabend had some keen insights into scientific methods.

  11. Terry says:

    I agree the replication crisis has led to a revolution. I don’t see how serious researchers can ignore it now.

    My question: has the replication crisis/revolution affected tenure decisions?

    Has there been an increase in the reliability of the research of people getting tenure? If so, the crisis/revolution has had a serious impact on science.

    Or is the same old behavior being rewarded with tenure as it was before? If so, has there really been a significant change? If shoddy research still gets tenure, why would anyone ever change?

    I focus on recent tenure decisions because that seems like a well-defined and narrowly-focused data set to think about.

    • The simple answer is no. The more complicated answer is that this issue hasn’t been around long enough to really have an effect on those kinds of political issues. I suspect at least another 20 years will be required.

      • Jonathan (another one) says:

        Or as Max Planck said, “Science advances one funeral at a time.”

          • Thanatos Savehn says:

            I think it’s an argument that he perhaps made an M error rather than an S error. In any event, I found this claim annoying: “If a person has done valuable work in the past, this increases the probability that his current work is also valuable …” Or on second thought maybe it’s a good example of what’s wrong with most of the profundities offered up by our self-appointed thought leaders when they start talking about probabilities.

            • Andrew says:


              I agree. Like you, I find that quote annoying. Here’s what it said in the quote: “This point suggests that the use by scholarly journals of blind refereeing is a mistaken policy. It may cause them to turn down unconventional work to which they would rightly have given the benefit of doubt had they known that the author was not a neophyte or eccentric.” This is ridiculous. Prominent people have lots of places they can publish their work. If you’re famous enough, you can just write a book and people will read it. It’s fine for journals to publish the work of prominent researchers, but prominent researchers are hardly a protected class whose fragile ideas need special promotion.

  12. Nick Danger says:

    “1850-1950: Darwinian revolution in biology, changed how we think about human life and its place in the world.”

    Holy schmoley: anyone who believes that bit of nonsense probably also thinks that Copernicus punctured human pride by displacing us from the center of the universe!

  13. Carlos Ungil says:

    > 2000-2020: Replication revolution in “soft” experimental science (and pseudoscience)

  14. Pablo Verde says:

    Great discussion!

    I would like to commemorate “Douglas Altman” a great colleague, who unfortunately passed away some months ago. In his memorable editorial in BMJ (1994) he wrote:

    “What, then, should we think about researchers who use the wrong techniques (either willfully or in ignorance), use the right techniques wrongly, misinterpret their results, report their results selectively, cite the literature selectively, and draw unjustified conclusions? We should be appalled. Yet numerous studies of the medical literature, in both general and specialist journals, have shown that all of the above phenomena are common. This is surely a scandal.”

    Doug was a co-founder of the international EQUATOR Network for health research reliability.

    To me, “the Replication Revolution” in experimental science started in 1994, in honor of Doug Altman. I would say that the revolution has an open end. Just a personal story: this summer I attended a local colloquium where a neuroscientist presented some revolutionary results. At the end of the talk, I said: “your results are very interesting, but you are using a case-control study, which corresponds to a low level of evidence in clinical research”. He answered me: “the only thing that matters is the low prediction error”.

    We have a long way to go…

  15. D Kane says:

    Great post. But I don’t recognize this reference:

    > these climate-and-economics models he’d been treating as data points

    Who/what does this refer to?

  16. Christian Hennig says:

    I’m late to the party… anyway one thought is that much of statistics itself is “experimental science”; I see lots of issues like those discussed here happening in the setup and evaluation of simulation studies (or, even worse, using standard “benchmark” real or less real datasets) by which authors try to convince their readers that their newly proposed method is better than what’s already available in the literature.

    As elsewhere, cracks show more and more clearly, but surely a fully fledged revolution hasn’t happened yet. Rather, an increasing number of revolutionaries or wannabe revolutionaries is gathering, but there’s some serious infighting. As Andrew correctly states at the end, we don’t really have a new paradigm yet. Obviously one can appeal to ideals such as transparency, replication, more careful design of experiments and measurements, better understanding of issues such as forking paths, more openness to criticism, etc., but as others have written already, most of this isn’t new at all, and there are processes at work in society that reward people who ignore such ideals. These processes are alive and well, and this is what the revolution needs to target. But it’s hard.

  17. Wonks Anonymous says:

    Since studies on the effect of ovulation are a common target here, I thought you might be interested in this attempt to check some of the claims:
