This came up already, but in the meantime this paper in the Journal of Surgical Research has just been raked over the coals, over and over and over again, in this delightful Pubpeer thread. 31 comments so far, all of them slamming the original published paper, and many with interesting insights of their own.

The original article should never have been published, but the Pubpeer thread is pretty good; really, it has all the stuff that the Journal of Surgical Research should’ve published in the first place.

The authors of the original article do not seem to have yet responded on the thread; I assume that’s because they suspect that any defense of their paper would be thoroughly rebutted.

Statistics is hard, and it’s understandable that these authors would be confused by the notion of power, but it is irresponsible for them to have written this new paper given that various people have already pointed out their error in print. Commenter #13 in the above-linked Pubpeer thread says the publication of this new paper is a failure of peer review. But it is also a failure on the part of the authors. When people make a mistake multiple times after it has been pointed out to them, it’s ultimately their fault. At some point, ignorance is no excuse.

Andrew, you say ‘The authors of the original article do not seem to have yet responded on the thread’. You’re right, but the journal did publish two letters about the paper (I’ve pinged Andrew Althouse, one of the main commenters on the Pubpeer thread, who may have more details):

The letters are:

1. Althouse https://www.sciencedirect.com/science/article/pii/S0022480420305023

2. Griffith & Feyman https://www.sciencedirect.com/science/article/pii/S0022480420305011

The authors’ response is a load of sanctimonious, wearying toss in which they repeatedly whine about social media and talk about due process. For all that talk of due process, they ignore Andrew Althouse’s letter entirely and, while acknowledging ‘Mr Griffith and his co-author’, don’t actually cite their letter. Perhaps you are only allowed to refer to surgeons in that journal.

The original article should never have been published. The journal seems to have tried to do an interesting-arguments-on-both-sides kind of thing, and the result is a joke. We haven’t seen the end of post hoc power, have we?

It is so frustrating that journals don’t try to get a statistician to review papers that are about statistical methodology. In my experience this same issue extends to study sections for NIH grants where the statistics are important (study design, power, etc.); I’ve received completely nonsensical reviews on grants I’ve been a part of from people who obviously fancy themselves as knowing statistics. I doubt it would be hard to find statisticians to review these papers or sit on study sections. So the only reasons I can think of that editors don’t recruit statisticians are over-confidence (editors thinking they know more than they do), laziness (they don’t want to deal with pesky statisticians), or pride (they’re embarrassed that they don’t understand statistics either). Drives me crazy.

“I doubt it would be hard to find statisticians to review these papers “

Really? I don’t get the sense that there are so many statisticians out there eager to review (more) terrible papers.

The time commitment for these terrible papers is really quite low. I have a colleague who is an editor of a medical journal, and when he questions the stats on a paper or the reviewers aren’t sure, he’ll ask if I have time to look at it. It usually takes less than an hour to figure out if the analysis is at least reasonable. Maybe there aren’t many people who want to do that; you could be right.

> The time commitment for these terrible papers is really quite low

Just the opposite, if your goal as a reviewer is actually to be constructive (and to get asked to review in the future).

I am often called to review papers with technical content that other reviewers/editors in my area don’t feel qualified to address. In that sense, the system is working as intended (at least for the journals that I tend to review for). As you say, it is often the case that it is easy to detect whether something fishy is going on. But it takes considerably longer to explain those problems in a way that is accessible to both the authors and the editor.

Remember, the editor already thinks the technical issue may be beyond them, so your job as a reviewer is to explain it to them in a way that helps them decide whether the paper is at least salvageable. As in any other aspect of science, “trust me” is not an acceptable (or useful) reason. Further, a cursory yea/nay review doesn’t give the authors any opportunity to learn and improve their future work.

I love getting good papers to review, they take an hour. Flawed papers take a considerable investment of time and effort, and you can only hope the authors will appreciate it.

+1

The only thing I found rewarding in reviewing papers is getting to see a much improved revision. Likely some selection bias, but that happened more than a third of the time.

I’m not confident that the good ol’ “bring in statisticians” solution actually works in practice. See, that has been the advice for decades, but things have not really changed. As for practical constraints, I suppose each university and research institute would have to triple (or more) the size of its statistics department to satisfy the need.

Instead — and bear with me as a layman (!) — I am inclined to say that statisticians should write better textbooks. As far as I recall from my classes, many of them could benefit from some pedagogy courses too…

Maybe more statisticians wouldn’t help, but I think this paper makes clear that having no statisticians review statistical papers can lead to garbage. And while that may have been the advice for decades it clearly isn’t followed.

To your second point: yes, better textbooks are always needed, but most doctors don’t take a statistics course beyond undergrad unless they get a post-MD degree in clinical research or something similar (I teach a year-long sequence specifically for this program at my university). The textbooks aren’t great, but you would also need MDs to commit to a minimum of a year of statistics training to cover even the bare minimum of statistics. Also, I’m always a little baffled that non-statisticians think they can take a single course in stats (one that probably ends with t-tests or ANOVA), maybe consult some online tutorials, and conclude that’s sufficient. This would be like me saying I took pre-med biology and can use WebMD, so why consult a doctor? The reason you consult an expert in any field is that there is simply too much to learn without a huge time commitment.

Well put.

In your earlier post you wrote:

“No! My problem is not that their recommended post-hoc power calculations are “mathematically redundant”; my problem is that their recommended calculations will give wrong answers”

What amazes me is that, from reading your original letter, and not being a statistician, I got that. The authors did not. Apparently they did not read your comment carefully enough to understand it. Instead, they defaulted to…some other pre-conceived idea and responded to that instead. This seems almost universal in the “comment and reply” sections of scientific journals. The comment says “Your Item A is a problem”. The reply says “We stand by our points for Item B”. It’s not always clear if this is a technique for distraction or creating a strawman, or if it’s a true misunderstanding. Perhaps it’s a little of both – defensive reaction and misunderstanding.

Whatever the case, it’s interesting and surprising how many disputes result from simple misunderstandings.

“Apparently they did not read your comment carefully enough to understand it. Instead, they defaulted to…some other pre-conceived idea and responded to that instead.”

Agreed. Somehow we need to learn how to educate students to be aware of this problem — and to be able to read technical material without defaulting to some “pre-conceived idea”.

(Which reminds me that when I developed a continuing-education course called “Common Mistakes in Using Statistics — Spotting Them and Avoiding Them”, I started by listing the first “common mistake” as “Expecting Too Much Certainty”.)

I was struck, reading the comments, that with a few exceptions the magnitude of the observed effect is not commented on or part of the conversation. The ex ante power calculation was based on an estimated effect size, and what emerges is the sample size required to achieve the targeted power assuming that effect size. A study that does not achieve a p value below alpha either didn’t reach its target sample size or had an observed effect smaller than the effect size originally used in the power calculation.
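To make the ex ante calculation concrete, here is a minimal sketch under assumptions of my own (a two-sample z approximation with a standardized effect size; not necessarily the method used in the article in question): the required sample size per group falls directly out of the assumed effect size, alpha, and target power.

```python
# Sketch: ex ante sample size for a two-sample z test (assumed setup,
# not taken from the paper under discussion). Uses only the stdlib.
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Required n per group for standardized effect size d."""
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_b = nd.inv_cdf(power)           # ~0.84 for 80% power
    return ceil(2 * (z_a + z_b) ** 2 / d ** 2)


print(n_per_group(0.5))   # assumed "medium" effect -> 63 per group
print(n_per_group(0.2))   # smaller assumed effect -> 393 per group
```

Note how sensitive the answer is to the assumed effect size: halving d roughly quadruples the required n, which is exactly why an optimistic ex ante effect size so often leaves a study short of its target.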

Rather than fussing with post-hoc power calculations, the authors of any paper should be reflecting on the observed effect size. Is it low and not clinically relevant, in which case who cares if the study was underpowered and unable to distinguish the observed effect from the null? Or is it large enough to be clinically relevant but imprecisely measured, leaving more uncertainty about its accuracy than we might be comfortable with? This is the discussion journal authors should be expected to present, not whether they were underpowered for the effect size observed.
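The statistical point underneath this (made in the Althouse and Griffith & Feyman letters, and familiar from the critiques of "observed power") can be shown in a few lines: for a two-sided z test, post-hoc power computed from the observed effect is a deterministic function of the p value alone, so it adds nothing to what the p value already says. A sketch, assuming the z-test setting:

```python
# Sketch: "observed power" for a two-sided z test depends only on the
# p value, which is why post-hoc power is redundant with p itself.
from statistics import NormalDist


def observed_power(p, alpha=0.05):
    """Post-hoc power implied by a two-sided z-test p value."""
    nd = NormalDist()
    z_obs = nd.inv_cdf(1 - p / 2)        # |z| corresponding to p
    z_crit = nd.inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    return nd.cdf(z_obs - z_crit) + nd.cdf(-z_obs - z_crit)


for p in (0.01, 0.05, 0.20, 0.50):
    print(f"p = {p:.2f}  ->  observed power = {observed_power(p):.2f}")
```

In particular, p = 0.05 maps to observed power of almost exactly 50%, and any non-significant result maps to observed power below that: the calculation can only restate the non-significance, never explain it.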

I recall reading a summary of two studies published in the same medical journal in the same week, for two different drugs used for similar conditions. Both had similar effect sizes, but for one the p value was less than 0.05, and for the other greater than 0.05. The first was lauded as a breakthrough treatment that should be widely disseminated, the latter as worthless. This is what comes from fetishizing p-values and ignoring the magnitude of estimated effects and their precision when interpreting study results. The focus on post-hoc power analysis rather than substantive assessment of the estimated effect is part of this problem, and reflects a more general problem with how study results and statistics are presented in clinical journals. These discussions continue the tendency of clinical journals to focus on whether the null can be rejected and, if not, treating the null as the “true” estimate of effect.

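The two-drugs anecdote is easy to reproduce with hypothetical numbers (these effect estimates and standard errors are invented for illustration, not taken from the actual studies): two nearly identical results can land on opposite sides of p = 0.05 while their confidence intervals barely differ.

```python
# Hypothetical illustration (made-up numbers, z approximation): similar
# effects straddling p = 0.05 have nearly identical 95% CIs.
from statistics import NormalDist

nd = NormalDist()


def summarize(name, effect, se):
    """Two-sided p value and 95% CI for an estimated effect."""
    z = effect / se
    p = 2 * (1 - nd.cdf(abs(z)))
    lo, hi = effect - 1.96 * se, effect + 1.96 * se
    print(f"{name}: effect = {effect:.2f}, p = {p:.3f}, "
          f"95% CI = ({lo:.2f}, {hi:.2f})")
    return p, lo, hi


summarize("Drug A", 0.50, 0.25)   # p just under 0.05: "breakthrough"
summarize("Drug B", 0.50, 0.26)   # p just over 0.05: "worthless"
```

Identical point estimates, almost identical intervals, and yet a dichotomized reading of p = 0.05 declares one drug a success and the other a failure.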