Post-Hoc Power PubPeer Dumpster Fire

Posted on May 4, 2019 4:46 PM by Andrew

We’ve discussed this one before (original, polite response here; later response, after months of frustration, here), but it keeps on coming.

Latest version is this disaster of a paper which got shredded by a zillion commenters on PubPeer. There’s lots of incompetent stuff out there in the literature—that’s the way things go; statistics is hard—but, at some point, when enough people point out your error, I think it’s irresponsible to keep on with it. Ultimately our duty is to science, not to our individual careers.

15 thoughts on “Post-Hoc Power PubPeer Dumpster Fire”

Anonymous on May 4, 2019 5:20 PM at 5:20 pm said:

I for one welcome back the cat-pictures. I’ve missed those. I’ve missed those a lot.

- Zad Chow on May 4, 2019 6:36 PM at 6:36 pm said:
  
  +1
  
- Andrew on May 4, 2019 7:36 PM at 7:36 pm said:
  
  Anon, Zad:
  
  The only limiting factor here is the number of cat pictures people send to me. We have about 400 posts a year, so I’ll be needing about 8 pics a week coming in.
  
  - Martha (Smith) on May 5, 2019 12:05 AM at 12:05 am said:
    
    1500 free stock cat photos at https://www.pexels.com/search/cat/.
    
  - Anonymous on May 5, 2019 1:24 AM at 1:24 am said:
    
    Quote from above: “The only limiting factor here is the number of cat pictures people send to me. We have about 400 posts a year, so I’ll be needing about 8 pics a week coming in.”
    
    For what it’s worth:
    
    Perhaps a cat picture with every post is not needed, welcomed, functional, etc. I can also understand, and assume, that not everybody likes them, and/or thinks they are suitable for this blog, etc.
    
    I personally just like to see them once in a while (once every two weeks or so would be fine for me personally, I think). And perhaps mostly for posts where there is no (better) option for some sort of relevant picture/graph/etc.
    
    I reason, and think, a cat picture once in a while might contribute to providing some sort of (possibly useful) “balance” in some way or form. And I think it’s funny/endearing/etc. (if those are the appropriate words) to think about a professor looking at cat-pictures to see which one he will pick for the blog post. I also often try to see if the cat picture somehow relates to the topic of the blog post (I think this is, at least sometimes, the case), which is a fun game to play for a minute or so.
    
    - Martha (Smith) on May 5, 2019 3:47 PM at 3:47 pm said:
      
      +1
    - Anonymous on May 5, 2019 3:53 PM at 3:53 pm said:
      
      (From your link to the stock photos:)
      
      https://www.pexels.com/photo/person-giving-high-five-to-grey-cat-38867/
    - Martha (Smith) on May 5, 2019 4:05 PM at 4:05 pm said:
      
      :~)
Anoneuoid on May 5, 2019 11:33 AM at 11:33 am said:

Latest version is this disaster of a paper which got shredded by a zillion commenters on PubPeer. There’s lots of incompetent stuff out there in the literature—that’s the way things go; statistics is hard—but, at some point, when enough people point out your error, I think it’s irresponsible to keep on with it. Ultimately our duty is to science, not to our individual careers.

It looks like “not even wrong” angels dancing on a pin stuff to me, I think this sentence from one of the “shredders” pretty much sums it up:

As you probably know, when we perform a clinical trial, we want to design the study to be sufficiently large that we will actually conclude that there is a treatment effect if one truly exists.

1) They are actually checking if the null model is wrong. A “treatment effect” parameter equal to zero is only one possible incorrect assumption that goes into deriving the null model.

2) The null model is always wrong and typically not even believed by anyone, especially not the people running the study (or else they wouldn’t have run it). So all we can learn from this procedure is whether our sample size was large enough to detect the deviation when using a given threshold. And sample size = money, so statistical significance is just a measure of how much money gets allocated to different beliefs (ie, the collective prior probability that an “effect” should exist… which would be 1 for everything in a sane world).

3) The threshold is arbitrary so can be (and is) adjusted to get the “right” number of “discoveries” for different types of data. Cheap data -> more stringent threshold.

4) Whether a treatment effect exists is not (or would not be… in a sane world) actually of concern to people. Instead medical researchers should care about the magnitude of the apparent effect(s) of a treatment, both positive and negative. Then these magnitudes can be incorporated into a cost benefit assessment, and also explained theoretically.

So, complaining that someone misinterpreted their post-hoc power calculation is like being concerned about a small pimple on a large tumor. https://en.m.wiktionary.org/wiki/bikeshedding

Another (perhaps even better) analogous situation: “Did the father beget the son or was the son begotten of the father?” I think most here would agree this is entirely a non-issue, and even discussing such topics is a waste of everyone’s time, yet: https://en.m.wikipedia.org/wiki/Arian_controversy

Anoneuoid on May 5, 2019 7:59 PM at 7:59 pm said:

High-impact surgical science was routinely unable to reach the arbitrary power standard of 0.8. The academic surgical community should reconsider the power threshold as it applies to surgical investigations.

[…]

Given the inherent complexity of surgery, comparative effectiveness studies present unique methodological challenges in patient accruement. These small studies are often the only practical studies possible, and the results can provide valuable insights when properly communicated to avoid misinterpretation.

Looking at this more… I think this is a roundabout way of saying the field needs to raise the significance threshold above 0.05 because the data is very expensive/noisy. This amounts to the same thing as lowering the power threshold for publication, right?

If they are studying events that occur in 1-10% in the baseline group, the power vs threshold and sample/effect sizes would look like:

https://i.ibb.co/TvWXNxw/0-1-power.png
https://i.ibb.co/nj9xGwB/0-01-power.png

So yea, to “detect” the differences in the range of 20% for such phenomena they will need 1k+ subjects per group unless they want to relax the threshold pretty drastically to above 0.2.
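For readers who want to check the order of magnitude, here is a minimal sketch of that kind of calculation. It is not the code behind the linked charts. It assumes a two-sided two-proportion z-test with the usual normal approximation, and it reads "differences in the range of 20%" as a relative change (for example 10% vs 12%), which is an interpretation rather than something stated above. The function name `two_proportion_power` is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_power(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test.

    Uses the normal approximation with a pooled standard error under the
    null and an unpooled standard error under the alternative.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group)
    diff = abs(p2 - p1)
    # Probability that the test statistic lands in either rejection region.
    upper = 1 - z.cdf((z_crit * se_null - diff) / se_alt)
    lower = z.cdf((-z_crit * se_null - diff) / se_alt)
    return upper + lower


# Assumed scenario: 10% baseline event rate, 20% relative increase (12%).
for n in (250, 1000, 4000):
    print(n, round(two_proportion_power(0.10, 0.12, n), 2))
```

Under these assumptions, 1,000 subjects per group gives power well below the conventional 0.8 at a 0.05 threshold, and it takes several thousand per group to get there. Passing a larger `alpha` shows how much the threshold has to be relaxed to compensate.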

Also, it is interesting to look at the same type of chart for a t-test: https://i.ibb.co/569LLwk/power.png

Using a threshold of 0.05, power is ~50% to detect a “medium” (0.5*sd) (https://en.wikipedia.org/wiki/Effect_size#Cohen's_d) effect size with 30 subjects per group. Intuitively, I would guess that researchers expect 10-50% of what people try should “work”. Ie, if only 1 in 10 studies got published it is “too hard”, if more than half get published it is “too easy”.
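The ~50% figure can be sanity-checked with a short sketch. This one assumes a two-sided, equal-variance, two-sample comparison and uses a normal approximation instead of the exact noncentral t distribution, so it runs slightly optimistic for small samples. The function name `two_sample_power` is hypothetical, and this is not the code behind the linked chart.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(d, n_per_group, alpha=0.05):
    """Approximate power to detect a standardized effect size d (Cohen's d)
    with n_per_group subjects in each of two groups, two-sided test.

    Normal approximation: the test statistic is treated as N(shift, 1),
    where shift = d * sqrt(n_per_group / 2).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# "Medium" effect (d = 0.5) with 30 subjects per group at alpha = 0.05.
print(round(two_sample_power(0.5, 30), 2))
```

With d = 0.5 and 30 per group this approximation comes out just under one half, in line with the figure quoted above. An exact t-test calculation would give a slightly lower number.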

TGGP on May 7, 2019 10:34 AM at 10:34 am said:

At PubPeer Andrew uses this url to refer to something in the printed literature:
shorturl.at/bflPU
That url just goes to the shorturl homepage. What was it supposed to refer to?

Thanatos Savehn on May 19, 2019 5:48 PM at 5:48 pm said:

Speaking of statistical power, I just read this sentence in a recent order dealing with the issue of causation in a toxic tort case from a New York court: “Dr. [David] Madigan defines statistical power as ‘the probability of finding a statistically significant difference between exposed and control subjects when one truly exists.’” No real complaint here, but it would be nice if folks would swap “assuming” for “when”. Alas, the court read it as intended, i.e., there’s truly a difference between the populations and thus this is significant evidence of causation. The expert statistician further opined that all of the many studies showing no statistically significant difference were underpowered.

Glad to see that Columbia is working hard to keep me gainfully employed.

https://scholar.google.com/scholar_case?case=3732257615911793873

- Thanatos Savehn on June 19, 2019 4:21 PM at 4:21 pm said:
  
  And from an opinion from the New Jersey Supreme Court:
  
  “As Dr. Madigan explained, a power analysis examines for the risk that a study’s outcome was a ‘false negative.’” He also said that effect size should be set at 50% or greater which, serendipitously no doubt, made all the studies that found no effect “underpowered” and thus “unreliable”.
  
  Currently sitting through a “legal analytics” CLE that I made the mistake of taking and I can report that lawyers talking about statistical methods for analyzing data from litigations is, unintentionally, exactly the parody of statistics that you imagine.
  
- Martha (Smith) on June 19, 2019 4:37 PM at 4:37 pm said:
  
  The link brings to mind a one-time roommate who went to Catholic elementary school. She said the nuns taught the girls that when they took a bath, they should sprinkle talcum powder on the bath water before undressing and stepping in the tub, so they would not see their private parts while bathing.
  So maybe women who attended Catholic schools might be a good source to seek plaintiffs for suits against J & J.
  
  - Thanatos Savehn on June 19, 2019 5:41 PM at 5:41 pm said:
    
    Honey is to flies as ? is to plaintiffs’ lawyers. They’ll be found. Alas.
    
    Anyway, the courts demand “causation” but all they get is risk. So, instead of using the info on risk to inform their public policy role (setting the outer limits of potential liability and thus guiding behavior) they turn uncertainty into certainty, population harms, benefits and costs are ignored, and a dichotomized caused/not caused decision is reached. The result is that tiny risks that manifest (allegedly) in harm to a relative handful of people get turned into huge verdicts while big risks like healthcare-acquired infections that impact thousands annually get little attention in no small part because somehow the courts got it in their collective mind that statistics plus A.B. Hill’s “causal criteria” (which is invariably used crystal ball-fashion) yields certainty about causation whereas Koch’s postulates don’t because … where’s the statistically significant result from a statistically significant study using a statistically significant sample that shows anthrax is caused by B. anthracis?
    
    P.S. In the NJ Sct Ct case the quoted expert’s statistical methodology was found to be wanting by the trial court and the NJ Sct Ct upheld the decision to exclude his testimony. Some judges are starting to get it. https://scholar.google.com/scholar_case?case=12660322087934442591
    

Statistical Modeling, Causal Inference, and Social Science
