Here it is. It’s not always clear what people mean by this expression, but sometimes it seems that they’re making the “What does not kill my statistical significance makes it stronger” fallacy, thinking that the attainment of statistical significance is a particular feat in the context of a noisy study, so that they’re (mistakenly) thinking of the “limited statistical power” of that study as a further point in favor of their argument.

More from Eric Loken and me here.

It seems (from a quick glance) that people use this phrase “Despite limited statistical power” to mean several distinct things:

1. “Bearing in mind, for the record, the limited statistical power,” [we’ll go ahead and report what we found]. That is, they’re throwing in this caveat (because they know someone will bring it up) and then proceeding to report the results anyway. If someone says, “But look at the low statistical power,” they can reply, “Yes, we already mentioned that.” (I see this kind of caveat from time to time in New York Magazine articles.)

2. “Even given the low statistical power,” [look at the impressive results]. This seems like the “What does not kill my statistical significance makes it stronger” fallacy.

3. “Given both low statistical power and high intuitive correctness,” [these results not only seem right but seem to have policy implications]. That is, the study may be weak, but it turned up the results that the researchers desired or expected.

Agreed, and in my experience, the phrase very often means [1] and hardly ever [2].

The fourth on is an oddity: “… despite limited statistical power, this study found no association between atopy and vaccine exposure.”

What thought process was going on there?

Google results are not consistent across searchers. So, for example my 4th entry was:

“Despite limited statistical power, several studies have suggested a number of candidate genes in association with SSc in different populations”

But yeah, yours is weird. Rephrased, it becomes “despite the fact that we have virtually no ability to measure this question, we were unable to find out anything about this question.”

I think it comes down to taking “failure to reject the null” as “evidence that the null is true.”

From a superficial reading, I understand that they report the result from Mullooly et al. despite the study having limited statistical power.

“No association was found in that study, but they were not very likely to find it in the first place so take that with a grain of salt.”

The message is not “the study didn’t find an association despite having low power”, but “despite this study having low power, we acknowledge the result (but keep in mind it’s not very informative)”
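To put a rough number on “they were not very likely to find it in the first place,” here is a minimal sketch. It assumes a two-sided z-test at the 0.05 level and a hypothetical true effect equal to one standard error; neither number comes from the Mullooly et al. study, and the function name is made up for illustration.

```python
from statistics import NormalDist

def power_two_sided(effect, se, alpha=0.05):
    """Probability that a two-sided z-test rejects when the true effect is `effect`."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    shift = effect / se               # true effect in standard-error units
    return (1 - z.cdf(crit - shift)) + z.cdf(-crit - shift)

# Hypothetical numbers: a real effect that is only one standard error in size.
power = power_two_sided(effect=1.0, se=1.0)
p_miss = 1 - power                    # chance of "no association found" despite a real effect
print(f"power = {power:.2f}, chance of a null result anyway = {p_miss:.2f}")
```

Under these assumed numbers the study misses a real effect most of the time, so “no association was found” says very little about whether the null is true.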

Urgh Fourth ONE

Hmm, I did a google scholar search, and I was actually somewhat impressed, not dismayed. I will state that I didn’t read all the papers but just skimmed what was mentioned, but I’m guessing that’s the trend here.

Why was I impressed? In every paper that showed up, the authors were referring to a study in which a large number of effects were screened. If their results were “we scanned 100 effects and 5 were significant at the alpha = 0.05 level”, of course that would not be impressive, and I would probably go home and cry about the state of science. I assume that isn’t what I was seeing, because that is just too obvious to make it through peer review. Maybe a strong assumption on my part?

But if their results were “we scanned 100 effects and found 25 were significant at the alpha = 0.05 level, *even without adjusting for multiple comparisons*”, then I’m actually starting to think the results are good; at the very least, given appropriate calculations of p-values, this indicates the researchers are not looking at a problem where large effects are extremely rare. So we should now put some faith back into the p < 0.05 results. Remember, p < 0.05 + moderate prior distribution of strong effects = high posterior probability of strong effect.

And now, if they did multiple comparison adjustments, which would further hurt their statistical power, we should be especially convinced that they are screening types of effects that are likely to be strong.

“But if their results were ‘we scanned 100 effects and found 25 were significant at the alpha = 0.05 level, *even without adjusting for multiple comparisons*’, then I’m actually starting to think the results are good; at the very least, given appropriate calculations of p-values, this indicates the researchers are not looking at a problem where large effects are extremely rare.”

Maybe. If the 25 different effects are more or less independent, then I could agree. But sometimes when one sees this, the 25 different effects are 25 minor variants of the same thing. In that situation the 25 findings are only minimally more informative than any one of them alone. (It is, however, a showing that that one finding is robust to the minor variations used to create the 25 findings.)
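For the independent case, the 5-versus-25 contrast can be checked with a binomial tail. This is only a sketch: it assumes 100 independent tests and that every null is true, which is exactly the assumption that fails when the 25 findings are minor variants of the same thing. The helper name is made up for illustration.

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_tests, alpha = 100, 0.05

# Chance of seeing at least 5, or at least 25, "significant" results
# when all 100 nulls are true and the tests are independent.
p_at_least_5 = binom_tail(5, n_tests, alpha)
p_at_least_25 = binom_tail(25, n_tests, alpha)

print(f"P(at least 5 of 100 by chance)  = {p_at_least_5:.3f}")
print(f"P(at least 25 of 100 by chance) = {p_at_least_25:.2e}")
```

Five hits out of 100 is about what pure noise delivers; 25 hits would be wildly unlikely under pure noise, but only if the tests really are independent.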

A biostatistician of my acquaintance once told me that in the early days of high-throughput microarray experiments he was at a biology conference and, after a presentation, commented (at the mic) that the results he had just seen — specifically, of 60,000 genes screened, 3,000 passed the 5% statistical significance threshold — were exactly what should be expected if none of the genes were actually differentially expressed. He was told that that was a *very* controversial statement.
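The biostatistician’s arithmetic is easy to reproduce. The sketch below assumes that p-values are uniform on [0, 1] when no gene is differentially expressed and that the tests are independent; the seed and variable names are arbitrary.

```python
import random

N_GENES = 60_000
ALPHA = 0.05

# Expected number of genes passing the 5% threshold if every null is true.
expected_hits = N_GENES * ALPHA

# Simulate one such screen: under the null, each p-value is Uniform(0, 1).
rng = random.Random(0)
null_p_values = [rng.random() for _ in range(N_GENES)]
observed_hits = sum(p < ALPHA for p in null_p_values)

print(f"expected under the global null: {expected_hits:.0f}")
print(f"one simulated null screen:      {observed_hits}")
```

The simulated count lands close to 3,000, which is the point of the anecdote: that many “hits” is what a screen of pure noise looks like.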

But I would assume that at this point people get it.

With regard to microarray screening, false discovery rates seem to be the default summary reported, rather than unadjusted p-values, in my experience. I couldn’t put a timeline on when that became standard.

That being said, most of the individual tests used in microarray or sequencing data are very fragile to model mis-specification (i.e. parametric tests with extremely small sample sizes). I come from the school of thought of “your model is definitely non-trivially mis-specified, so what does that imply? Sometimes a problem, sometimes not. Try to be in the not category.” Since just about all multiple comparison procedures are highly sensitive to uniformity of the p-value under the null, it’s not clear that merely using FDRs gets us out of the woods.

But at the very least, standard procedures now attempt to account for the issue of multiple comparisons.
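For readers who have not seen an FDR adjustment, here is a minimal sketch of the Benjamini-Hochberg step-up procedure, one common way to control the false discovery rate. Whether the microarray papers in question used this exact procedure is an assumption, and the p-values in the example are made up.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans: True where the null is rejected at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first

    # Step-up rule: find the largest rank k with p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            k_max = rank

    # Reject the k_max smallest p-values (possibly none).
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

# Made-up p-values for illustration.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(example_p))
```

Note that the thresholds `rank * q / m` are calibrated on the assumption that null p-values are uniform, which is why mis-specified individual tests can undermine the adjustment.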

Microarrays are another simple example of a ‘far from statistical equilibrium’ scenario.

That is, num. parameters >> num. measurements.

You can’t do reliable inference for individual parameters in such cases but you can, for example, potentially do reliable inference for larger ‘collectives’ of genes.

This reduces the effective complexity of the parameter space (you are doing inference for the parameters characterising the collectives) and hence gets you closer to equilibrium, n > p.

In 2015 I attended a behavioral endocrinology conference. A senior researcher in that field gave a talk in which he defended his notoriously small sample sizes like this: “Why should I add any participants when the findings are already significant? Making the sample larger will just decrease statistical significance!” I and many others in the audience sat there aghast. Needless to say, people have a hard time replicating that researcher’s work.

Aargh!

Add an “s” to the end of that google search trying to make a quick little play on words, and, if you are me, you instead just wasted 30 minutes down an internet hole of terror. So here we go… from lazy word play to peak Open Science in 4 jokes, culminating in potential surgico-tracheal(tm) manslaughter, of which case-study it is useful to occasionally remind ourselves.

The original joke: “However, we had limited statistical powers…”

https://openarchive.ki.se/xmlui/bitstream/handle/10616/39498/thesis.pdf?sequence=1

Hehehe. Limited statistical powers.

And yeah I know, I know, we aren’t picking on junior researchers anymore (or we can but it is in poor taste, or… I dunno…I stopped reading that thread. I email people when I think they are wrong and it is worth emailing them. Then sometimes I put it on the internet, if it is worth putting on the internet.). Anyway, the point isn’t to pick on a dissertation that happened to contain the exact typo I was searching for (you know, for comedy purposes)…the point is that I just happened to notice where that dissertation came from:

Karolinska Institutet, Stockholm, Sweden

Second joke: KI might have limited statistical powers, but their powers for evil are mighty!

Didn’t those guys refuse to fire a dude who kept killing people with the same procedure so he could be all famous about it when he pretended his patients didn’t die horrible deaths? That happened and was real, right?

“Paolo Macchiarini is not guilty of scientific misconduct”

“Dragging the professional reputation of a scientist through the gutter of bad publicity before a final outcome of any investigation had been reached was indefensible.”

http://www.thelancet.com/journals/lancet/article/PIIS0140-6736(15)00118-X/fulltext

Wait what? Oh, sorry … just had to shorten-up my google search window:

“It is evident from biopsies and bronchoscopy data that epithelialisation of the graft was incomplete, that the patient suffered serious complications, and that he eventually died. The published report states that ”there were no major complications, and the patient was asymptomatic and tumour free 5 months after transplantation”. “

http://www.thelancet.com/journals/lancet/article/PIIS0140-6736(16)00520-1/fulltext

Third Joke: The Lancet’s evolving editorial stance on guys under investigation for manslaughter.

OK one more. That background was just bringing us where we needed to be. That was just a setup. For a punchline. On dud traches.

https://forbetterscience.com/2017/08/01/unpublished-macchiarini-manuscript-confirms-5-forgotten-trachea-transplant-patients-jungebluths-surgery-practice-in-italy/

“I have been forwarded a manuscript by the scandal surgeon Paolo Macchiarini, which was originally intended to present in detail all of his known 9 cadaveric trachea recipients (only 4 are recorded officially). …, with the small difference that the paper (allegedly rejected at Nature Communications) presents their clinical evolutions quite differently from reality. … The manuscript also confirms that Macchiarini’s acolyte Philipp Jungebluth was directly involved in the transplant surgeries of these patients, despite his most probably not having a permit to practice medicine in Italy. This makes Jungebluth co-responsible for up to 13 trachea transplants, 10 of them lethal. The German doctor is currently suing me in court for alleged libel…”

Fourth joke: the Abstract of Pauolo & Pals maybe-rejected follow-up manuscript to their questioned-redeemed-indicted-mayberetracted Lancet Paper

“In conclusion, tissue engineered biological tracheal scaffolds can be safely used in a clinical setting but with some risk of mechanical compromise in the intermediate and long-term post-operative course. Further improvements are required to preserve the scaffolds’ patency”.

#ScientificPatency

There may be confusion here about tests: all of testing operates on the reasoning that an observed disagreement between H and x is evidence for an alternative hypothesis H’ only if, and only to the extent that, such a difference is difficult to achieve under the assumption that H is true. Were it easy to bring about such differences, or even larger ones, under H, they would be poor evidence for H’. This is, of course, the Popperian principle of falsification. A statistical test operates the same way. It’s because a test’s power against the null hypothesis is low that a statistically significant result, at low p-value, indicates a discrepancy from the null. Since I’ve written so much on this, I’ll just cite a couple of posts.

https://errorstatistics.com/2017/05/08/how-to-tell-whats-true-about-power-if-youre-practicing-within-the-error-statistical-tribe/

https://errorstatistics.com/2014/03/12/get-empowered-to-detect-power-howlers/

It’s related to my comment earlier today on this blog.

http://statmodeling.stat.columbia.edu/2017/08/16/also-holding-back-progress-make-mistakes-label-correct-arguments-nonsensical/#comment-544697

Studying complex real world phenomena using something like a normal distribution relies on large enough sample sizes to ‘ensure’ that the real phenomenon roughly follows a somewhat stable statistical law.

You can’t really compare extremely small sample sizes to large sample sizes under the same model – you might retrospectively assume the small samples were drawn from the model that applies for large sample sizes but this isn’t valid.

Larger deviations are typically expected under small real world samples because, for one, they follow a different statistical law if they follow one at all.

Mayo: Please tell us you do not actually read Popper to say this. Falsification of H is not Popperian evidence for H’ unless H’ is merely Not H. If H’ has independent content, all complaints about NHST apply. Surely you have conceded this much in your debates with Andrew?

Is the code available somewhere for the simulations reported in the article in Science? I am curious about the details. Thanks!

There is a new fallacy in the wild, but I can’t really pinpoint the confusion:

http://www.sciencedirect.com/science/article/pii/S1053077016000082#react-root

They have somehow identified an increase in mortality without using NHST (ok, fine), but then go on to use NHST to determine “something else” about this data. It is almost as if statistical significance is only used to assess practical/clinical significance, to the exclusion of identifying deviations from the null…

Related: My impression, from several things I have read on this blog, via links from it, and elsewhere, is that “clinical significance” is often defined in terms of results of a statistical analysis (e.g., confidence intervals). To my mind, that is contrary to the idea of “practical significance,” which ought to be independent of a statistical analysis of the data currently being considered: it should be decided on before the study, so that study results can be evaluated in terms of practical significance as well as statistical analysis.

In grad school, I was always taught “clinical significance” meant parameter values people should actually care about; i.e. a risk ratio of 2 is “clinically significant” but a risk ratio of 1.01 most likely is not. There was discussion about whether clinically significant values were in a confidence interval but of course that gets a bit tricky.

But, this was a stats program and the class in which this topic came up was taught by a professor who worked at the FDA and cared a lot about these kinds of things. So I’m not sure if that’s the global standard.

This post is getting heavily cited/cross-linked — changing the initial link/search in the post to this phrase helps make the point more clear: “despite limited statistical power” -google

I’m missing something. Conditional on statistical significance, the chance of an S or M error is obviously large. But why condition on statistical significance?

Consider Study A, with a large sample, which finds that my favorite social program has a statistically significant effect on income (let’s say p=0.02).

Consider Study B, with a small sample, which finds that my favorite social program has a statistically significant effect on income (with, again, p=0.02).

Assume both studies were pre-registered with no p-hacking. Or is that the whole point?

Isn’t Study A better evidence that my favorite social program boosts income? The Bayesian definition of the strength of evidence is the ratio Odds(Outcome|X)/Odds(Outcome|Y). For any effect size X > Y:

Odds(Study A is significant | true effect X) > Odds(Study B is significant | true effect X)

Odds(Study A is significant | true effect Y) > Odds(Study B is significant | true effect Y)

so that

Odds(Study A is significant | true effect X)/Odds(Study A is significant | true effect Y) > Odds(Study B is significant | true effect X)/Odds(Study B is significant | true effect Y)

OddsRatio(Study A) > OddsRatio(Study B)

Oops, I got confused typing this out. The odds ratios are actually indeterminate, I think. But if you set X and Y such that Odds(Study B is significant | true effect X) = Odds(Study A is significant | true effect Y), then… it’s still indeterminate. Darn. OK, nevermind.
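A small numerical check suggests why the comparison comes out indeterminate. This sketch makes several assumptions that are not in the comment: it uses probabilities of significance rather than odds, treats “significant at 0.05” as the only thing observed (not the exact p = 0.02), models each study as a two-sided z-test, and picks hypothetical standard errors for Study A and Study B.

```python
from statistics import NormalDist

def power_two_sided(effect, se, alpha=0.05):
    """Probability that a two-sided z-test rejects when the true effect is `effect`."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    shift = effect / se
    return (1 - z.cdf(crit - shift)) + z.cdf(-crit - shift)

def significance_ratio(x, y, se):
    """P(significant | true effect x) / P(significant | true effect y)."""
    return power_two_sided(x, se) / power_two_sided(y, se)

SE_STUDY_A = 0.5   # hypothetical large study
SE_STUDY_B = 2.0   # hypothetical small study

# Two hypothetical (X, Y) pairs with X > Y.
for x, y in [(2.0, 0.0), (4.0, 2.0)]:
    ratio_a = significance_ratio(x, y, SE_STUDY_A)
    ratio_b = significance_ratio(x, y, SE_STUDY_B)
    print(f"X={x}, Y={y}: Study A ratio = {ratio_a:.2f}, Study B ratio = {ratio_b:.2f}")
```

With these made-up numbers, the large study has the bigger ratio when Y is zero, but the smaller ratio when both X and Y are large enough that the large study is nearly certain to reach significance either way. So which study’s bare “significant” label discriminates better depends on which pair of effect sizes is being compared.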

Katriel:

Yes, you’re right, the large study supplies more information and represents better evidence. The point of the above post is that people often think the opposite: They act as if statistical significance supplies stronger evidence under poor conditions. They say they got statistical significance “despite limited statistical power” (or, more generally, “despite a crappy research design”) and so that’s even more meaningful.

As Loken and I discuss in our “backpack” article, the argument has intuitive appeal: it’s the argument that says that, if LeBron can win with such a crappy supporting cast, he must be an even more awesome player. The reasoning works in the basketball setting but not with statistical significance.
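A quick simulation shows what conditioning on statistical significance does in a noisy study. This is a sketch under stated assumptions, not the code behind any published article: the estimate is taken to be normally distributed around a hypothetical true effect of 1, with a standard error of 0.3 (high power) or 3 (low power), and the function name and seed are arbitrary.

```python
import random
from statistics import NormalDist

def summarize(true_effect, se, n_sims=100_000, alpha=0.05, seed=1):
    """Simulate estimates and summarize only the statistically significant ones."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2) * se      # significance cutoff for |estimate|
    draws = (rng.gauss(true_effect, se) for _ in range(n_sims))
    sig = [est for est in draws if abs(est) > crit]      # keep significant estimates only
    wrong_sign = sum(1 for est in sig if est * true_effect < 0)
    return {
        "share_significant": len(sig) / n_sims,
        "type_s": wrong_sign / len(sig),                 # wrong-sign rate among significant
        "exaggeration": sum(abs(e) for e in sig) / len(sig) / abs(true_effect),
    }

high_power = summarize(true_effect=1.0, se=0.3)
low_power = summarize(true_effect=1.0, se=3.0)

for label, result in [("high power", high_power), ("low power", low_power)]:
    print(label, {k: round(v, 3) for k, v in result.items()})
```

In the low-power setting, a significant estimate has to be several times larger than the assumed true effect just to clear the cutoff, and a noticeable share of significant estimates have the wrong sign. That is why attaining significance “despite” low power is a warning sign and not a bonus.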

Can’t seem to get an answer to this one after scanning the first 3 pages of Google. I have a small sample (n=45ish) but it captures the whole population (airline firms in the US) and is a time series. Now, most discussions of power seem to imply that I will be inferring wrongly because the sample cannot reliably capture the population betas. Here I have the full population and most variables of interest. I need to run some regressions (perhaps fixed effect or IV style). Will the coefficients I get be reliable, or am I still going to suffer from the frailties of under-powered studies?