No, it’s not “statistically implausible” when results differ between studies, or between different groups within a study.

James “not the cancer cure guy” Watson writes:

This letter by Thorlund et al. published in the New England Journal of Medicine is rather amusing. It’s unclear to me what their point is, other than the fact that they find the published results for the new COVID drug molnupiravir “statistically implausible.”

Background: The pharma company Merck got very promising results for molnupiravir at their interim analysis (~50% reduction in hospitalisation/death) but less promising results at their final analysis (30% reduction). Thorlund et al. were surprised that the data for the two study periods (before and after interim analysis) provided very different point estimates for benefit (goes the other way in the second period). They were also surprised to see inconsistent results when comparing across the different countries included in the study (non-overlapping confidence intervals).

They clearly had never read the subgroup analysis from the ISIS-2 trial: the authors convincingly showed that aspirin reduced vascular deaths in patients of all astrological birth signs except Gemini and Libra, see Figure 5 in this Lancet paper from 1998.

He’s not kidding—that Lancet paper really does talk about astrological signs. What the hell??

Regarding the letter in the New England Journal of Medicine, I guess the point is that different studies, and different groups within a study, have different patients and are conducted at different times and under different conditions. So it makes sense that they can have different outcomes, more different than would be expected to arise from pure chance when comparing two samples from an identical distribution. People often don’t seem to realize this, leading them to characterize differences larger than chance alone would produce as “statistically implausible” etc., rather than as just representing underlying differences across patients, scenarios, and times.

As the authors of the original study put it in their response letter in the journal:

Given the shifts in prevailing SARS-CoV-2 variants, changes in outpatient management, and inclusion of trial sites from countries with unique Covid-19 disease burdens, the trial was not necessarily conducted under uniform conditions. The differences in the results between the interim and final analyses might be statistically improbable under ideal circumstances, but they reflect the fact that several key factors could not remain constant despite a consistent trial design.

Indeed.
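To get a feel for how much subgroup estimates can bounce around from chance alone, even before adding any real differences across patients, places, and times, here is a minimal simulation sketch. It is not a reanalysis of ISIS-2 or the molnupiravir trial: the function name, sample size, event rate, and true risk ratio are all made-up illustrative assumptions. Every simulated patient gets the same true treatment effect and is assigned to one of 12 arbitrary "birth sign" groups at random.

```python
import random

def simulate_subgroup_risk_ratios(n_per_arm=6000, p_control=0.12,
                                  true_risk_ratio=0.8, n_groups=12, seed=1):
    """Estimated risk ratio in each arbitrary subgroup of a simulated trial.

    All numbers are illustrative assumptions, not taken from any real trial.
    The true risk ratio is identical for every patient, so any spread in the
    subgroup estimates comes from sampling noise alone.
    """
    rng = random.Random(seed)
    p_treated = p_control * true_risk_ratio
    # One tally per subgroup: events and patients, separately for each arm.
    tallies = [{"control": [0, 0], "treated": [0, 0]} for _ in range(n_groups)]
    for arm, p_event in (("control", p_control), ("treated", p_treated)):
        for _ in range(n_per_arm):
            group = rng.randrange(n_groups)  # "birth sign", unrelated to outcome
            tallies[group][arm][1] += 1      # count the patient
            if rng.random() < p_event:
                tallies[group][arm][0] += 1  # count the event
    ratios = []
    for t in tallies:
        control_events, control_n = t["control"]
        treated_events, treated_n = t["treated"]
        # With these settings each subgroup has hundreds of patients per arm,
        # so zero control events is vanishingly unlikely.
        ratios.append((treated_events / treated_n) / (control_events / control_n))
    return ratios

if __name__ == "__main__":
    for sign, rr in enumerate(simulate_subgroup_risk_ratios(), start=1):
        print(f"subgroup {sign:2d}: estimated risk ratio {rr:.2f}")
```

Trying a few seeds, the 12 estimates typically scatter widely around the true value of 0.8, and in many runs at least one subgroup lands above 1, so the treatment appears not to work there. Real trials then layer genuine variation on top of this noise.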

16 thoughts on “No, it’s not “statistically implausible” when results differ between studies, or between different groups within a study.”

      • Thanks Andrew for posting this. I don’t think you even need anything even as complicated as temporal drift to explain these results. Even if there was no temporal drift (and yes that’s implausible), the before/after periods are just one arbitrary split into two subgroups. The country/site splits are other arbitrary choices. The trial was only stopped (thus creating the before/after interim analysis periods) because the effect in the interim analysis was large (regression to the mean?). I pointed to the Lancet 1998 paper because this amusing subgroup analysis Table demonstrates that even in very very large randomised trials it is always possible to find subgroups with heterogeneous effects.

  1. Re the astrology signs. I attended a lecture by the great British statistician, Sir Richard Peto, once. He expressed annoyance at the desire for subset analyses. In the academic world at that time, it was common to sit through a presentation that showed no value for some treatment over placebo in a clinical trial with the presenter ending with a coda that an unplanned subset analysis retrospectively found benefit in left handed blue eyed people. Oncology seemed very prone to this data torturing to me. The aspirin trial was quite straightforward, but people kept asking for subset analyses, and Sir Peto threw in the astrological signs to shut them up.

    • Just a note to add to this true story, it was the journal that insisted they include subgroup analyses. They included star signs to encourage people not to take the results of the other subgroup analyses seriously.

      “One of the ISIS-2 trial’s striking idiosyncrasies was its astrological subgroup analysis. Even though the overall results of the trial were so dramatically positive, during peer review The Lancet asked the researchers to subdivide the patients and detail which ones had benefited and which ones had not. The Oxford team regarded this request as “bad science”, but decided to comply, although in a very unorthodox way. Making use of the horoscope column from a newspaper, they subdivided the patients according to birth sign. This showed that aspirin didn’t seem to work for those born under Libra or Gemini, but worked brilliantly for Capricorn. This was the only subgroup analysis that the Oxford researchers thought was statistically serious; rather, the really important message of the study was to be found in the overall result which provided more reliable guidance as to the likely effects in many different subgroups.”

      https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(15)61505-7/fulltext

    • That is such a great Peto anecdote. I only met him once. And, coincidentally, he was talking about this subject. He had a great sound-bite. Something like… “Subgroup analyses: always do them, but never believe them.”

  2. For some additional context on the Lancet paper: The authors clearly state, ‘All these subgroup analyses should, perhaps, be taken less as evidence about who benefits than as evidence that such analyses are potentially misleading.’ In other words, the authors defend their position not to do subgroup analyses because otherwise one might find results like Gemini and Libra patients being adversely affected, while all those born under other astrological signs experienced ‘a strikingly beneficial effect’. This translates to the criticism Mr Watson comments on: Experimental parameters might change over the course of a trial (e.g. Levenstein in her letter to the editor points it out, and the original authors confirm that ‘enrollment after the interim analysis indeed coincided with the emergence of the delta variant’), as Andrew Gelman points out above. But I think Watson’s point is that splitting the data into two subgroups (‘interim’ and ‘post-interim’) is arbitrary and similar to the situation described above by oncodoc, even if we assume that the changes in experimental parameters over time have no noteworthy effect on the outcome.

  3. Watson is unpleasant and has made many foolish predictions, but it comes across as bitter and slightly embarrassing to relentlessly harp on the failures of one of the greatest scientists of the 20th century. What do the failures matter next to the tremendous success?

  4. Hox hox Andrew, something fishy going on! My mini laptop crashes when opening your blog (Ubuntu 22.04) and my phone crashes when opening the Post about simulating measurement error (Android something)

    • My Windows 7 Professional 64-bit Acer laptop stops responding to mouse clicks for a subjectively long time during the loading of this and only this website, and randomly later on here, starting a day or so ago. I can live with it so far.

      (When Windows 8 was announced I immediately bought this as a backup for my Dell W7 laptop and hope to die with it; I grant that there may be new or better capabilities under W10 or W11, but why do they have to change all the old, familiar commands which I can do automatically, including even the shutdown procedure?) (Probably because they laid off the original programmers and had to start fresh.)

    • Hmmm . . . for some reason the file I saved was really huge, and WordPress does not want to compress it for me. I replaced with a slightly smaller (but still huge; sorry!) file. I hope this helps if you refresh.

  5. Or perhaps Thorlund et al. were pointing out that the Merck study, published and funded by Merck, would present findings in the most favourable light. Yet a much larger study, with 26,000 patients, conducted using public funds in the UK, found no benefit of molnupiravir for hospitalization or death.

    Molnupiravir plus usual care versus usual care alone as early treatment for adults with COVID-19 at increased risk of adverse outcomes (PANORAMIC): an open-label, platform-adaptive randomised controlled trial. Lancet. 2023 Jan 28;401(10373):281-293. doi: 10.1016/S0140-6736(22)02597-1. Epub 2022 Dec 22.
