
Post-Hoc Power PubPeer Dumpster Fire

We’ve discussed this one before (original, polite response here; later response, after months of frustration, here), but it keeps on coming.

Latest version is this disaster of a paper which got shredded by a zillion commenters on PubPeer. There’s lots of incompetent stuff out there in the literature—that’s the way things go; statistics is hard—but, at some point, when enough people point out your error, I think it’s irresponsible to keep on with it. Ultimately our duty is to science, not to our individual careers.


  1. Anonymous says:

    I for one welcome back the cat-pictures. I’ve missed those. I’ve missed those a lot.

  2. Anoneuoid says:

    Latest version is this disaster of a paper which got shredded by a zillion commenters on PubPeer. There’s lots of incompetent stuff out there in the literature—that’s the way things go; statistics is hard—but, at some point, when enough people point out your error, I think it’s irresponsible to keep on with it. Ultimately our duty is to science, not to our individual careers.

    It looks like “not even wrong” angels dancing on a pin stuff to me, I think this sentence from one of the “shredders” pretty much sums it up:

    As you probably know, when we perform a clinical trial, we want to design the study to be sufficiently large that we will actually conclude that there is a treatment effect if one truly exists.

    1) They are actually checking if the null model is wrong. A “treatment effect” parameter equal to zero is only one possible incorrect assumption that goes into deriving the null model.

    2) The null model is always wrong and typically not even believed by anyone, especially not the people running the study (or else they wouldn’t have run it). So all we can learn from this procedure is whether our sample size was large enough to detect the deviation when using a given threshold. And sample size = money, so statistical significance is just a measure of how much money gets allocated to different beliefs (i.e., the collective prior probability that an “effect” should exist… which would be 1 for everything in a sane world).

    3) The threshold is arbitrary so can be (and is) adjusted to get the “right” number of “discoveries” for different types of data. Cheap data -> more stringent threshold.

    4) Whether a treatment effect exists is not (or would not be… in a sane world) actually of concern to people. Instead medical researchers should care about the magnitude of the apparent effect(s) of a treatment, both positive and negative. Then these magnitudes can be incorporated into a cost benefit assessment, and also explained theoretically.

    So, complaining that someone misinterpreted their post-hoc power calculation is like being concerned about a small pimple on a large tumor.

    Another (perhaps even better) analogous situation: “Did the father beget the son or was the son begotten of the father?” I think most here would agree this is entirely a non-issue, and even discussing such topics is a waste of everyone’s time, yet:

  3. Anoneuoid says:

    High-impact surgical science was routinely unable to reach the arbitrary power standard of 0.8. The academic surgical community should reconsider the power threshold as it applies to surgical investigations.


    Given the inherent complexity of surgery, comparative effectiveness studies present unique methodological challenges in patient accruement.[1] These small studies are often the only practical studies possible, and the results can provide valuable insights when properly communicated to avoid misinterpretation.

    Looking at this more… I think this is a roundabout way of saying the field needs to raise the significance threshold above 0.05 because the data is very expensive/noisy. This amounts to the same thing as lowering the power threshold for publication, right?

    If they are studying events that occur in 1–10% of the baseline group, the power vs. threshold and sample/effect sizes would look like:

    [chart not reproduced]

    So yea, to “detect” the differences in the range of 20% for such phenomena they will need 1k+ subjects per group unless they want to relax the threshold pretty drastically to above 0.2.
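
    The arithmetic behind that claim can be sketched with a normal-approximation power calculation for a two-sided two-proportion z-test. This is my own illustration (stdlib only), not the commenter’s actual chart; the baseline rate of 10% and the 20% relative effect (so 0.10 vs. 0.12) are hypothetical values picked from the ranges mentioned above.

```python
# Sketch: normal-approximation power for a two-sided two-proportion z-test.
# Hypothetical inputs: 10% baseline rate, 20% relative effect (p2 = 0.12).
from math import erf, sqrt

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def z_crit(alpha):
    """Two-sided critical value, found by bisection on phi (no scipy needed)."""
    lo, hi = 0.0, 10.0
    target = 1.0 - alpha / 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if phi(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def power_two_prop(p1, p2, n_per_group, alpha=0.05):
    """Approximate power to detect p1 vs p2 with n subjects per group."""
    se = sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    return phi(abs(p2 - p1) / se - z_crit(alpha))

print(round(power_two_prop(0.10, 0.12, 1000, alpha=0.05), 2))  # ~0.30
print(round(power_two_prop(0.10, 0.12, 1000, alpha=0.20), 2))  # higher at the relaxed threshold
```

    Even with 1,000 subjects per group, power at the conventional 0.05 threshold stays low for effects of this size, which is the trade-off between sample size and threshold the comment is pointing at.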

    Also, it is interesting to look at the same type of chart for a t-test:

    [chart not reproduced]

    Using a threshold of 0.05, power is ~50% to detect a “medium” (0.5 sd) effect size with 30 subjects per group. Intuitively, I would guess that researchers expect 10–50% of what people try should “work”. I.e., if only 1 in 10 studies got published it is “too hard”; if more than half get published it is “too easy”.
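
    The ~50% figure checks out under the usual normal approximation to a two-sided two-sample t-test (an illustrative sketch of my own, not the commenter’s chart; the 1.95996 critical value assumes alpha = 0.05):

```python
# Sketch: normal approximation to the power of a two-sided two-sample t-test
# at alpha = 0.05 (critical value 1.95996 hardcoded as an assumption).
from math import erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power_two_sample_t(d, n_per_group):
    """Approximate power for standardized effect size d, n subjects per group.
    Noncentrality is d * sqrt(n/2) for equal group sizes."""
    return phi(d * sqrt(n_per_group / 2.0) - 1.95996)

print(round(power_two_sample_t(0.5, 30), 2))  # ≈ 0.49, i.e. roughly 50%
```

    (The exact noncentral-t calculation gives slightly less, about 0.48, but the approximation is close enough to support the “~50%” reading.)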

  4. TGGP says:

    At PubPeer Andrew uses this URL to refer to something in the printed literature:
    That URL just goes to the shorturl homepage. What was it supposed to refer to?

  5. Thanatos Savehn says:

    Speaking of statistical power, I just read this sentence in a recent order dealing with the issue of causation in a toxic tort case from a New York court: “Dr. [David] Madigan defines statistical power as ‘the probability of finding a statistically significant difference between exposed and control subjects when one truly exists.’” No real complaint here, but it would be nice if folks would swap “assuming” for “when”. Alas, the court read it as intended, i.e., there’s truly a difference between the populations and thus this is significant evidence of causation. The expert statistician further opined that all of the many studies showing no statistically significant difference were underpowered.

    Glad to see that Columbia is working hard to keep me gainfully employed.

    • Thanatos Savehn says:

      And from an opinion from the New Jersey Supreme Court:

      “As Dr. Madigan explained, a power analysis examines for the risk that a study’s outcome was a ‘false negative.’” He also said that effect size should be set at 50% or greater, which, serendipitously no doubt, made all the studies that found no effect “underpowered” and thus “unreliable.”

      Currently sitting through a “legal analytics” CLE that I made the mistake of taking and I can report that lawyers talking about statistical methods for analyzing data from litigations is, unintentionally, exactly the parody of statistics that you imagine.

    • Martha (Smith) says:

      The link brings to mind a one-time roommate who went to Catholic elementary school. She said the nuns taught the girls that when they took a bath, they should sprinkle talcum powder on the bath water before undressing and stepping in the tub, so they would not see their private parts while bathing.
      So maybe women who attended Catholic schools might be a good source to seek plaintiffs for suits against J & J.

      • Thanatos Savehn says:

        Honey is to flies as ? is to plaintiffs’ lawyers. They’ll be found. Alas.

        Anyway, the courts demand “causation” but all they get is risk. So, instead of using the info on risk to inform their public policy role (setting the outer limits of potential liability and thus guiding behavior) they turn uncertainty into certainty, population harms, benefits and costs are ignored, and a dichotomized caused/not caused decision is reached. The result is that tiny risks that manifest (allegedly) in harm to a relative handful of people get turned into huge verdicts while big risks like healthcare-acquired infections that impact thousands annually get little attention in no small part because somehow the courts got it in their collective mind that statistics plus A.B. Hill’s “causal criteria” (which is invariably used crystal ball-fashion) yields certainty about causation whereas Koch’s postulates don’t because … where’s the statistically significant result from a statistically significant study using a statistically significant sample that shows anthrax is caused by B. anthracis?

        P.S. In the NJ Sct Ct case the quoted expert’s statistical methodology was found to be wanting by the trial court and the NJ Sct Ct upheld the decision to exclude his testimony. Some judges are starting to get it.
