Bad stuff going down in biostat-land: Declaring null effect just cos p-value is more than 0.05, assuming proportional hazards where it makes no sense

Wesley Tansey writes:

This is no doubt something we can both agree is a sad and wrongheaded use of statistics, namely an incredible reliance on null hypothesis significance testing. Here’s an example:

Phase III trial. Failed because the primary endpoint had a p-value of 0.053 instead of 0.05. Here’s the important actual outcome data, though:

For the primary efficacy endpoint, INV-PFS, there was no significant difference in PFS between arms, with 243 (84%) of events having occurred (stratified HR, 0.77; 95% CI: 0.59, 1.00; P = 0.053; Fig. 2a and Table 2). The median PFS was 4.5 months (95% CI: 3.9, 5.6) for the atezolizumab arm and 4.3 months (95% CI: 4.2, 5.5) for the chemotherapy arm. The PFS rate was 24% (95% CI: 17, 31) in the atezolizumab arm versus 7% (95% CI: 2, 11; descriptive P < 0.0001) in the chemotherapy arm at 12 months and 14% (95% CI: 7, 21) versus 1% (95% CI: 0, 4; descriptive P = 0.0006), respectively, at 18 months (Fig. 2a). As the INV-PFS did not cross the 0.05 significance boundary, secondary endpoints were not formally tested.

The odds of atezolizumab being better than chemo are clearly high. Yet the entire article is written as if the treatment failed, simply because the p-value was 0.003 too high.

He adds:

And these confidence intervals are based on proportional hazards assumptions. But this is an immunotherapy trial, and we have good evidence that these trials violate the PH assumption. Basically, you get toxicity early on with immunotherapy, but patients who survive it have a much better outcome down the road. Same story here; see the figure below. Early on the immunotherapy patients are doing a little worse than the chemo patients, but the long-term survival is much better.

As usual, our recommended solution for the first problem is to acknowledge uncertainty and our recommended solution for the second problem is to expand the model, at the very least by adding an interaction.

Regarding acknowledging uncertainty: Yes, at some point decisions need to be made about choosing treatments for individual patients and making general clinical recommendations—but it’s a mistake to “prematurely collapse the wave function” here. This is a research paper on the effectiveness of the treatment, not a decision-making effort. Keep the uncertainty there; you’re not doing us any favors by acting as if you have certainty when you don’t.
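To make the model-expansion suggestion concrete, here is a minimal sketch of one standard way to relax proportional hazards: split follow-up at a cutpoint and let the treatment effect differ before and after it, so the early and late hazard ratios are estimated separately. This is not the trial’s actual analysis; the column names, the 4-month cutpoint (roughly where commenters below note the Kaplan–Meier curves cross), and the use of lifelines’ CoxTimeVaryingFitter in Python are all illustrative assumptions.

```python
# A rough sketch (not the trial's actual analysis) of relaxing proportional
# hazards by letting the treatment effect differ before and after a cutpoint.
# The column names (time, event, treat), the 4-month cut, and the use of
# lifelines' CoxTimeVaryingFitter are all illustrative assumptions.
import pandas as pd
from lifelines import CoxTimeVaryingFitter

def split_at(df, cut=4.0):
    """Expand each patient's row into risk intervals (0, cut] and (cut, time]."""
    rows = []
    for pid, r in df.iterrows():
        if r["time"] <= cut:
            rows.append(dict(id=pid, start=0.0, stop=r["time"], event=r["event"],
                             treat=r["treat"], treat_late=0))
        else:
            # before the cut: no event yet, only the "early" treatment effect
            rows.append(dict(id=pid, start=0.0, stop=cut, event=0,
                             treat=r["treat"], treat_late=0))
            # after the cut: the event (if any) happens here, and treat_late
            # switches on the late-period interaction term
            rows.append(dict(id=pid, start=cut, stop=r["time"], event=r["event"],
                             treat=r["treat"], treat_late=r["treat"]))
    return pd.DataFrame(rows)

# df: one row per patient, with time in months, event = 1 if progressed/died,
# treat = 1 for atezolizumab, 0 for chemotherapy
# long = split_at(df)
# ctv = CoxTimeVaryingFitter()
# ctv.fit(long, id_col="id", start_col="start", stop_col="stop", event_col="event")
# ctv.print_summary()  # 'treat' is the early log-HR; 'treat' + 'treat_late' the late one
```

A smooth treatment-by-time interaction would be a less abrupt version of the same idea, and lifelines also provides CoxPHFitter.check_assumptions() as a Schoenfeld-residual-style diagnostic for the PH assumption.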

29 thoughts on “Bad stuff going down in biostat-land: Declaring null effect just cos p-value is more than 0.05, assuming proportional hazards where it makes no sense”

  1. Re: “And these confidence intervals are based on proportional hazards assumptions.” … They did not use a Cox PH model. The people designing the study will have been familiar with immunotherapy studies, which is why they used a stratified Cox regression model.

  2. As presented here, it seems pretty horrible. However, in the article, the authors say

    “Therefore, the later timepoints for PFS may better represent the outcomes of the study than the HR, as shown by the shape of the Kaplan–Meier curves, suggesting that the proportional hazards assumption may not have been met. Progression rates were initially higher in the atezolizumab arm than in the chemotherapy arm. However, the curves crossed at approximately 4 months and eventually favored atezolizumab.”

    Reading the article (admittedly without understanding the specifics of this drug or the condition at all), it didn’t seem so bad to me. The p-value is just a distraction and could have been omitted. Certainly the 0.05 “threshold” reference should have been omitted. But the article seemed much more reasonable in presenting the evidence, and the claim above that the Phase 3 trial had “failed” is not a statement from the paper but from the email.

    • Further, from the article:

      “Although there was no significant difference in PFS observed in BFAST between atezolizumab and chemotherapy in the population with high bTMB, progression-free rates were 24% (95% CI: 17, 32) in the atezolizumab arm versus 7% (95% CI: 2, 11) in the chemotherapy arm at 12 months and 14% (95% CI: 7, 21) versus 1% (95% CI: 0.0, 4), respectively, at 18 months…. Similarly, although secondary endpoints were not formally tested, patients in the atezolizumab arm had a numerically longer OS than the chemotherapy arm. Furthermore, a greater percentage of patients achieved longer-term survival at 12 and 18 months, with the Kaplan–Meier curves crossing.”

      This doesn’t sound like “this entire article is being written as the treatment failing simply because the p-value was 0.003 too high” to me.

      • Yes. Looks like they are acknowledging the direction and effect size indirectly, but have to bow to the almighty p, so they get to publish again.
        It is possible that the authors do know that the direction and magnitude are what is important, rather than p-values, but they have to tread carefully.
        OTOH, if thousands of physicians down the road are making decisions based on this article, their decisions will be binary by definition.

    • They may add some nuance later but the headline conclusion in the Abstract is:
      “Cohort C did not meet its primary endpoint of investigator-assessed progression-free survival in the population with bTMB of ≥16 (hazard ratio, 0.77; 95% confidence interval: 0.59, 1.00; P = 0.053)”

      You see conclusions like this all over the place. Not sure if it’s journals who insist on it or what.

  3. It seems that some people using Bayesian regression models are still thinking this way as well; many are still really focused on whether the credible intervals for coefficients in their Bayesian model cross zero. Or maybe I am just noticing it more… but as more people start using Bayesian regression models with friendly packages like brms and rstanarm (which are just practically way better at running complex multilevel models than their frequentist counterparts), I am noticing more people on the Stan forum who have brought that type of thinking into their use of Bayesian models. I’ve also seen cases where they hypothesize that there is no effect and use the fact that the credible interval for their model coefficient includes zero as evidence for that hypothesis, even if the interval is quite broad and most of the posterior probability mass is away from zero.

    Maybe this is extreme (and I haven’t thought it through thoroughly), but I think I would be in favor of not reporting any numerical summaries at all in research papers. If people wanted to report model coefficients, they would have to show a density plot or histogram of the posterior samples. If they wanted to report estimated effects, they would have to show plots of predictions with uncertainty (either as shaded regions or spaghetti plots). Any study, even a clinical trial, will give slightly different estimates of effects when repeated, and these differences would be even greater in most studies, which are not clinical trials. The way I see it, in almost all scenarios the total error will be great enough that numerical summaries have meaningless precision, and anyone who wanted to quantify an effect could get it precisely enough by eyeballing a plot. In many cases the answer to the research question might be a ‘maybe’.
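    As a toy illustration of the two points above (a credible interval that “includes zero” while most of the posterior mass sits away from zero, and showing the whole posterior rather than a numerical summary), here is a small sketch using simulated draws in place of real brms/rstanarm/Stan output; every number in it is a made-up assumption.

    ```python
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    beta_draws = rng.normal(loc=0.9, scale=0.5, size=4000)  # pretend posterior draws

    lo, hi = np.percentile(beta_draws, [2.5, 97.5])
    p_positive = (beta_draws > 0).mean()
    print(f"95% interval: ({lo:.2f}, {hi:.2f})   Pr(beta > 0) = {p_positive:.2f}")
    # the interval just crosses zero, yet Pr(beta > 0) comes out around 0.96

    # the kind of display jd argues for: show the whole posterior, not two numbers
    plt.hist(beta_draws, bins=60, density=True)
    plt.axvline(0, linestyle="--")
    plt.xlabel("coefficient (posterior draws)")
    plt.show()
    ```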

  4. Eligible patients were aged ≥18 years, had previously untreated histologically or cytologically confirmed unresectable Stage IIIB or IV NSCLC according to the American Joint Committee on Cancer Staging version 7, Eastern Cooperative Oncology Group (ECOG) Performance Status (PS) of 0 or 1, measurable disease per Response Evaluation Criteria in Solid Tumors (RECIST) v.1.1, bTMB ≥10 mutations (8.3 mut Mb⁻¹) as detected via the bTMB CTA and a treatment-free interval of ≥6 months if they had received previous neoadjuvant or adjuvant treatment.

    What is the distinction being made here? That some patients were previously treated for other cancers?

    Also, I wish they would at least try to find out how many people who were excluded survived to the end of the study. How do they know chemo is better than whatever random things people come up with? Historical data from before widespread internet access doesn’t apply anymore.

    • Anon–yes, subjects may have multiple primary tumors for the same organ, or sometimes a different organ. The ones that are not Stage IIIB or IV NSCLC should not have been treated with neoadjuvant or adjuvant treatment within the last 6 months.

      • Right, if you give a drug for tumor A while tumor B is still subclinical then tumor B still gets exposed. At the very least the tissue stem cells were exposed. So some of the tumors may have already developed resistance to cisplatin or whatever. Previous exposure to the chemotherapy seems like a key consideration they ignore.

        As usual, the focus on statistical error when there are huge potential sources of systematic error is misguided.

  5. “our recommended solution for the second problem is to expand the model, at the very least by adding an interaction.”

    Of course, one must be careful about potential model selection bias resulting from changing the model only after looking at the data.
    But even if the researchers had written this strategy in their protocol a priori (e.g., “in the event that the PH assumption is clearly violated, we will add an interaction term”), they would still likely get some type 1 error inflation and not be able to trust the p-value. I looked at this scenario in a StatsInMed paper:
    https://harlanhappydog.github.io/files/SiM.pdf

    • Harlan:

      No problem. I don’t care about type 1 error because the null hypothesis won’t be true anyway. If my only goal were to reject the null hypothesis, I could do that without any data (or, equivalently, by considering a hypothesized huge experiment).

        • Let’s just be clear on what the coverage of the 95% confidence interval means….

          If you hypothesize that your data is coming out of a high-quality RNG, and therefore has a known, provable, fixed distributional shape and a fixed set of distributional parameters, then 95% of the time that you run an experiment of the size you ran, the confidence interval you construct from the data will include the fixed parameter.

          In particular, this does NOT mean that in your actual experiment, which operates in the real world, 95% of the confidence intervals you construct from your actual data will contain some notional value or other.

          Confidence intervals are hypothetical constructs about what *would* happen if you were doing *something different* from what you are doing.
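          A quick simulation of that repeated-sampling statement, under an assumed generator (the choices of mu, sigma, and n here are arbitrary illustrations, nothing from the trial):

          ```python
          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(0)
          mu, sigma, n, reps = 1.0, 2.0, 25, 20_000
          covered = 0
          for _ in range(reps):
              x = rng.normal(mu, sigma, size=n)
              half = stats.t.ppf(0.975, df=n - 1) * x.std(ddof=1) / np.sqrt(n)
              covered += (x.mean() - half <= mu <= x.mean() + half)

          print(covered / reps)  # close to 0.95
          ```

          The 0.95 is a property of replications of the assumed generator; on its own it says nothing about whether the single interval from your real experiment contains anything in particular.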

  6. Fair enough. But if this is the case, why even bother with statistics at all? As “jd” says in the comment above, perhaps we should stop “reporting any numerical summaries at all in research papers”.

    Personally, I think this is an extreme position. Even if the real world isn’t a perfect reflection of our model, we should still strive towards making meaningful statements in the face of uncertainty.

    In this particular example, I would still be interested in knowing the values of a 95%CI, and I would prefer it if these values were calculated in a way that took into account the fact that, a priori, there was uncertainty about whether or not an interaction term would be included in the model.

    • >Personally, I think this is an extreme position. Even if the real world isn’t a perfect reflection of our model, we should still strive towards making meaningful statements in the face of uncertainty.

      I didn’t suggest abandoning statistical modeling or meaningful statements in the face of uncertainty. I use models all the time. I think making meaningful statements in the face of uncertainty would be better accomplished by providing plots as I described, rather than numerical summaries with meaningless precision. I doubt the precision of your CI exceeds the precision of eyeballing a plot of predicted effects, and the very lack of precision in eyeballing is what makes it good. I think this sorta relates to what the late great commenter Keith O’Rourke quoted WG Cochran as saying: https://statmodeling.stat.columbia.edu/2021/12/26/unrepresentative-big-surveys/#comment-2039947

    • IMHO the right thing to report is the complete sample from your posterior distribution as a CSV file in the supplements (along with the code to generate it and the data used).

      In the text I like to say things like “k was about 2 to 3 times as big in group A as in group B” and show a KDE graph. I also like statements like “there was an 88% chance that k was bigger in group A than in group B” rather than something like “the 95% interval for the difference A-B was [-0.2, 4.1]”.

      There’s nothing special about 95% and it’s better to concentrate on what appears to be true rather than whether p values are small.
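      A sketch of that reporting style, with made-up posterior draws standing in for real model output (the file name, the lognormal draws, and the quartile summary are all illustrative assumptions):

      ```python
      import numpy as np
      import pandas as pd

      rng = np.random.default_rng(7)
      k_A = rng.lognormal(mean=1.0, sigma=0.3, size=4000)  # pretend posterior for k, group A
      k_B = rng.lognormal(mean=0.2, sigma=0.3, size=4000)  # pretend posterior for k, group B

      # the CSV supplement suggested above, plus the plain-language summaries
      pd.DataFrame({"k_A": k_A, "k_B": k_B}).to_csv("posterior_draws.csv", index=False)

      ratio = k_A / k_B
      print(f"k was about {np.percentile(ratio, 25):.1f} to {np.percentile(ratio, 75):.1f} "
            f"times as big in group A as in group B")
      print(f"Pr(k_A > k_B) = {(k_A > k_B).mean():.2f}")
      ```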

      • There is a danger in simply reporting a difference b/w A and B. The audience will fixate on the result as if it were something that would repeat again and again in a stable manner. Keep in mind that this would be the same audience that never understood what CIs and p-values really mean in the first place.

        • Reporting a kernel density estimate of the difference helps rather than hinders people’s understanding of the uncertainty. Also, there’s a big difference between, say, a nearly uniform distribution and one that’s strongly peaked but has fat tails.

          A big advantage of Bayesian methods is that we arrive at a density, rather than “an interval that contains the value x% of the time,” which gives no sense of the relative plausibility of values within it.
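          A tiny sketch of that shape point: two sets of made-up draws with essentially the same 95% interval but very different densities, plotted as kernel density estimates (all numbers here are invented for the illustration):

          ```python
          import numpy as np
          import matplotlib.pyplot as plt
          from scipy.stats import gaussian_kde

          rng = np.random.default_rng(3)
          flat = rng.uniform(-2.0, 2.0, size=8000)          # nearly uniform posterior
          peaked = 0.6 * rng.standard_t(df=3, size=8000)    # peaked, fat-tailed posterior

          for draws, label in [(flat, "nearly uniform"), (peaked, "peaked, fat tails")]:
              lo, hi = np.percentile(draws, [2.5, 97.5])
              grid = np.linspace(-3, 3, 400)
              density = gaussian_kde(draws)(grid)
              plt.plot(grid, density, label=f"{label}: 95% interval ~ ({lo:.1f}, {hi:.1f})")
          plt.legend()
          plt.xlabel("difference A - B (posterior draws)")
          plt.show()
          ```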

    • I mentally double the size of whatever interval is reported, which amounts to assuming systematic error is about the same magnitude as sampling error.

      E.g., in this study they found about 90% of people progressed within 18 months after getting one drug vs 99% after the other type. I put myself in the shoes of a patient. If the doc tells me this drug keeps cancer at bay for 1.5 years in 10% of people, compared to only 1% for this other drug, that isn’t very impressive to begin with.

      However, they don’t mention blinding, so we have to assume the determination of “progression” was biased towards getting the desired result.

      They also didn’t report prior cancer treatments as a baseline for some reason, which you would expect to influence the growth and mutational load of the specific type of tumor looked at here. Seems an odd omission to me.

      And the absolute worst thing, which is shared by every single cancer clinical trial I have ever looked at, is failure to account for caloric restriction (see, eg, https://en.wikipedia.org/wiki/Warburg_effect_(oncology) ) due to the so-called “side effects” of reduced appetite, nausea, and vomiting. They don’t even report the weight of these patients.

  7. In medical journals I think a lot of these problems come from editors and reviewers, not authors, and then from authors predicting or assuming what reviewers and editors will want to see. The reviewers and editors are generally not methods specialists, and at least where I publish and review, there are rarely statistical reviewers.
