Katriel:

Yes, you’re right, the large study supplies more information and represents better evidence. The point of the above post is that people often think the opposite: They act as if statistical significance supplies *stronger* evidence under poor conditions. They say they got statistical significance “despite limited statistical power” (or, more generally, “despite a crappy research design”) and so that’s even more meaningful.

As Loken and I discuss in our “backpack” article, the argument has intuitive appeal: it’s the argument that says that, if LeBron can win with such a crappy supporting cast, he must be an even more awesome player. The reasoning works in the basketball setting but not with statistical significance.

Oops, I got confused typing this out. The odds ratios are actually indeterminate, I think. But if you set X and Y such that Odds(Study B is significant | true effect X) = Odds(Study A is significant | true effect Y), then… it’s still indeterminate. Darn. OK, nevermind.

Consider Study A, with a large sample, which finds that my favorite social program has a statistically significant effect on income (let’s say p=0.02).

Consider Study B, with a small sample, which finds that my favorite social program has a statistically significant effect on income (with, again, p=0.02).

Assume both studies were pre-registered with no p-hacking. Or is that the whole point?

Isn’t Study A better evidence that my favorite social program boosts income? The Bayesian definition of the strength of evidence is the ratio Odds(Outcome|X)/Odds(Outcome|Y). For any effect size X > Y:

Odds(Study A is significant | true effect X) > Odds(Study B is significant | true effect X)

Odds(Study A is significant | true effect Y) > Odds(Study B is significant | true effect Y)

so that

Odds(Study A is significant | true effect X)/Odds(Study A is significant | true effect Y) > Odds(Study B is significant | true effect X)/Odds(Study B is significant | true effect Y)

OddsRatio(Study A) > OddsRatio(Study B)
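The inequality between the two ratios doesn’t actually follow from the two comparisons above (as the correction elsewhere in this thread notes, the ratio is indeterminate). A quick numerical sketch makes this concrete, using probabilities of significance rather than odds, a one-sided normal-approximation power function, and made-up sample sizes and effect sizes:

```python
# Sketch: must P(sig | X) / P(sig | Y) be larger for the big study?
# Normal-approximation power for a one-sided z-test with critical value
# 1.96; sample sizes and standardized effect sizes are hypothetical.
from math import erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power(n, d, z_crit=1.96):
    """P(reject | standardized effect d, sample size n)."""
    return phi(d * sqrt(n) - z_crit)

X, Y = 0.3, 0.1          # hypothetical effect sizes, X > Y
nA, nB = 1000, 50        # Study A large, Study B small

ratio_A = power(nA, X) / power(nA, Y)
ratio_B = power(nB, X) / power(nB, Y)
print(f"Study A: {power(nA, X):.3f} / {power(nA, Y):.3f} = {ratio_A:.2f}")
print(f"Study B: {power(nB, X):.3f} / {power(nB, Y):.3f} = {ratio_B:.2f}")
# Here the *small* study has the larger likelihood ratio: its power is
# near the floor under Y but climbs steeply under X, while Study A's
# power is close to 1 under both effects.
```

With these numbers Study B’s ratio is larger (about 5.4 vs. 1.1): the bare fact of significance can discriminate between X and Y more sharply in the small study, even though the large study is better evidence overall once you look past the reject/don’t-reject outcome.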

In grad school, I was always taught that “clinical significance” meant parameter values people should actually care about; e.g., a risk ratio of 2 is “clinically significant” but a risk ratio of 1.01 most likely is not. There was discussion about whether clinically significant values were in a confidence interval, but of course that gets a bit tricky.

But, this was a stats program and the class in which this topic came up was taught by a professor who worked at the FDA and cared a lot about these kinds of things. So I’m not sure if that’s the global standard.

Related: My impression from several things I have read on, or found via links from, this blog and elsewhere is that “clinical significance” is often defined in terms of the results of a statistical analysis (e.g., confidence intervals) — which, to my mind, is contrary to the idea of “practical significance.” Practical significance ought to be independent of any statistical analysis of the data currently being considered; it should be decided on before the study, so that study results can be evaluated in terms of practical significance as well as statistical significance.

The NASA Longitudinal Study of Astronaut Health identified an increase in mortality from cancer, which fortunately has been below the significance level.

http://www.sciencedirect.com/science/article/pii/S1053077016000082#react-root

They have somehow identified an increase in mortality without using NHST (ok, fine), but then go on to use NHST to determine “something else” about this data. It is almost as if statistical significance is **only** used to assess practical/clinical significance, to the exclusion of identifying deviations from the null…

Mayo: Please tell us you do not actually read Popper to say this. Falsification of H is not Popperian evidence for H’ unless H’ is merely Not H. If H’ has independent content, all complaints about NHST apply. Surely you have conceded this much in your debates with Andrew?

Aargh!

Studying complex real world phenomena using something like a normal distribution relies on large enough sample sizes to ‘ensure’ that the real phenomenon roughly follows a somewhat stable statistical law.

You can’t really compare extremely small sample sizes to large sample sizes under the same model – you might retrospectively assume the small samples were drawn from the model that applies for large sample sizes but this isn’t valid.

Larger deviations are typically expected under small real world samples because, for one, they follow a different statistical law if they follow one at all.

https://errorstatistics.com/2017/05/08/how-to-tell-whats-true-about-power-if-youre-practicing-within-the-error-statistical-tribe/

https://errorstatistics.com/2014/03/12/get-empowered-to-detect-power-howlers/

It’s related to my comment earlier today on this blog.

http://statmodeling.stat.columbia.edu/2017/08/16/also-holding-back-progress-make-mistakes-label-correct-arguments-nonsensical/#comment-544697

Microarrays are another simple example of a ‘far from statistical equilibrium’ scenario.

That is, num. parameters >> num. measurements.

You can’t do reliable inference for individual parameters in such cases but you can, for example, potentially do reliable inference for larger ‘collectives’ of genes.

This reduces the effective complexity of the parameter space (you are doing inference for the parameters characterising the collectives) and hence gets you closer to equilibrium, n > p.

The original joke: “However, we had limited statistical powers…”

https://openarchive.ki.se/xmlui/bitstream/handle/10616/39498/thesis.pdf?sequence=1

Hehehe. Limited statistical powers.

And yeah I know, I know, we aren’t picking on junior researchers anymore (or we can but it is in poor taste, or… I dunno…I stopped reading that thread. I email people when I think they are wrong and it is worth emailing them. Then sometimes I put it on the internet, if it is worth putting on the internet.). Anyway, the point isn’t to pick on a dissertation that happened to contain the exact typo I was searching for (you know, for comedy purposes)…the point is that I just happened to notice where that dissertation came from:

Karolinska Institutet, Stockholm, Sweden

Second joke: KI might have limited statistical powers, but their powers for evil are mighty!

Didn’t those guys refuse to fire a dude who kept killing people with the same procedure so he could be all famous about it, while he pretended his patients didn’t die horrible deaths? That happened and was real, right?

“Paolo Macchiarini is not guilty of scientific misconduct”

“Dragging the professional reputation of a scientist through the gutter of bad publicity before a final outcome of any investigation had been reached was indefensible.”

http://www.thelancet.com/journals/lancet/article/PIIS0140-6736(15)00118-X/fulltext

Wait what? Oh, sorry … just had to shorten-up my google search window:

“It is evident from biopsies and bronchoscopy data that epithelialisation of the graft was incomplete, that the patient suffered serious complications, and that he eventually died. The published report states that ‘there were no major complications, and the patient was asymptomatic and tumour free 5 months after transplantation’.”

http://www.thelancet.com/journals/lancet/article/PIIS0140-6736(16)00520-1/fulltext

Third Joke: The Lancet’s evolving editorial stance on guys under investigation for manslaughter.

OK one more. That background was just bringing us where we needed to be. That was just a setup. For a punchline. On dud traches.

“I have been forwarded a manuscript by the scandal surgeon Paolo Macchiarini, which was originally intended to present in detail all of his known 9 cadaveric trachea recipients (only 4 are recorded officially). …, with the small difference that the paper (allegedly rejected at Nature Communications) presents their clinical evolutions quite differently from reality. … The manuscript also confirms that Macchiarini’s acolyte Philipp Jungebluth was directly involved in the transplant surgeries of these patients, despite his most probably not having a permit to practice medicine in Italy. This makes Jungebluth co-responsible for up to 13 trachea transplants, 10 of them lethal. The German doctor is currently suing me in court for alleged libel…”

Fourth joke: the abstract of Paolo & Pals’ maybe-rejected follow-up manuscript to their questioned-redeemed-indicted-maybe-retracted Lancet paper:

“In conclusion, tissue engineered biological tracheal scaffolds can be safely used in a clinical setting but with some risk of mechanical compromise in the intermediate and long-term post-operative course. Further improvements are required to preserve the scaffolds’ patency”.

#ScientificPatency

In regards to microarray screening, false discovery rates seem to be the default summary reported, rather than unadjusted p-values, in my experience. Couldn’t put a timeline on when that became standard.

That being said, most of the individual tests used in microarray or sequencing data are very fragile to model mis-specification (i.e., parametric tests with extremely small sample sizes). I come from the school of thought of “your model is definitely non-trivially mis-specified, so what does that imply? Sometimes a problem, sometimes not. Try to be in the not category.” Since just about all multiple-comparison procedures are highly sensitive to uniformity of the p-value under the null, it’s not clear that merely using FDRs gets us out of the woods.

But at the very least, the issue of multiple comparisons is now something that standard procedures attempt to account for.
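For concreteness, the Benjamini–Hochberg step-up procedure underlying most of those default FDR reports is only a few lines; the p-values below are invented for illustration:

```python
# Benjamini-Hochberg step-up procedure: reject the k smallest p-values,
# where k is the largest rank with p_(k) <= (k / m) * q.
def benjamini_hochberg(pvals, q=0.05):
    """Return the indices (into pvals) of the rejected hypotheses."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k = rank            # step-up: keep the largest passing rank
    return sorted(order[:k])

# Invented p-values: four smallish ones and two clear nulls.
pvals = [0.001, 0.2, 0.02, 0.5, 0.01, 0.03]
print(benjamini_hochberg(pvals, q=0.05))   # → [0, 2, 4, 5]
```

Note that the FDR guarantee assumes the null p-values are (super)uniform, which is exactly the fragility to mis-specification the comment above is pointing at.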

A biostatistician of my acquaintance once told me that in the early days of high-throughput microarray experiments he was at a biology conference and, after a presentation, commented (at the mic) that the results he had just seen — specifically, of 60,000 genes screened, 3,000 passed the 5% statistical significance threshold — were exactly what should be expected if none of the genes were actually differentially expressed. He was told that that was a *very* controversial statement.

But I would assume that at this point people get it.
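The arithmetic behind that anecdote is just the null expectation m × α; a quick all-null simulation (hypothetical numbers matching the story) reproduces it:

```python
# Under the global null, p-values are uniform on [0, 1], so a screen of
# 60,000 genes at alpha = 0.05 "discovers" about 3,000 of them.
import random

random.seed(1)
m, alpha = 60_000, 0.05
hits = sum(random.random() < alpha for _ in range(m))
print(m * alpha, hits)   # expectation is 3000.0; simulated count lands nearby
```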

“But if their results were ‘we scanned 100 effects and found 25 were significant at the alpha = 0.05 level, *even without adjusting for multiple comparisons*’, then I’m actually starting to think the results are good; at the very least, given appropriate calculations of p-values, this indicates the researchers are not looking at a problem where large effects are extremely rare.”

Maybe. If the 25 different effects are more or less independent, then I could agree. But sometimes when one sees this, the 25 different effects are 25 minor variants of the same thing. In that situation the 25 findings are only minimally more informative than any one of them alone. (It is, however, a showing that that one finding is robust to the minor variations used to create the 25 findings.)

Agreed, and in my experience, the phrase very often means [1] and hardly ever [2].

Why was I impressed? In every paper that showed up, the authors were referring to a study in which a large number of effects were screened. If their results were “we scanned 100 effects and 5 were significant at the alpha = 0.05 level”, of course that would not be impressive, and I would probably go home and cry about the state of science. I assume that isn’t what I was seeing, because that is just too obvious to make it through peer review. Maybe a strong assumption on my part?

But if their results were “we scanned 100 effects and found 25 were significant at the alpha = 0.05 level, *even without adjusting for multiple comparisons*”, then I’m actually starting to think the results are good; at the very least, given appropriate calculations of p-values, this indicates the researchers are not looking at a problem where large effects are extremely rare. So we should now put some faith back into the p < 0.05 results. Remember, p < 0.05 + moderate prior probability of strong effects = high posterior probability of a strong effect.

And now, if they did multiple comparison adjustments, which would further hurt their statistical power, we should be especially convinced that they are screening types of effects that are likely to be strong.
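That last step — moderate prior plus p < 0.05 yields a high posterior — can be sketched with the usual screening identity PPV = power·π₁ / (power·π₁ + α·π₀); all numbers below are hypothetical:

```python
# Positive predictive value of a significant result in a screen where a
# fraction pi1 of the tested effects are real. All numbers hypothetical.
def ppv(pi1, power, alpha=0.05):
    """P(effect is real | significant)."""
    pi0 = 1.0 - pi1
    return power * pi1 / (power * pi1 + alpha * pi0)

def sig_rate(pi1, power, alpha=0.05):
    """Expected fraction of tests coming up significant."""
    return power * pi1 + alpha * (1.0 - pi1)

# If roughly 25 of 100 tests come up significant, then pi1 = 0.25 with
# power 0.8 is one scenario consistent with that observation:
print(f"sig rate: {sig_rate(0.25, 0.8):.4f}")   # 0.2375
print(f"PPV:      {ppv(0.25, 0.8):.3f}")        # 0.842
```

Compare the all-null case: with pi1 = 0 the significance rate is just alpha, which is the “5 of 100, go home and cry” scenario above.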

From a superficial reading, I understand that they report the result from Mullooly et al. despite the study having limited statistical power.

“No association was found in that study, but they were not very likely to find it in the first place so take that with a grain of salt.”

The message is not “the study didn’t find an association despite having low power” but rather “despite this study having low power, we acknowledge the result (but keep in mind it’s not very informative).”

I think it comes down to taking “failure to reject the null” as “evidence that the null is true.”

Google results are not consistent across searchers. So, for example, my 4th entry was:

“Despite limited statistical power, several studies have suggested a number of candidate genes in association with SSc in different populations”

But yeah, yours is weird. Rephrased, it becomes: “despite the fact that we have virtually no ability to measure this question, we were unable to find out anything about this question.”

What thought process was going on there?

1. “Bearing in mind, for the record, the limited statistical power,” [we’ll go ahead and report what we found].–That is, they’re throwing in this caveat (because they know someone will bring it up) and then proceeding to report the results anyway. If someone says, “But look at the low statistical power,” they can reply, “Yes, we already mentioned that.” (I see this kind of caveat from time to time in New York Magazine articles.)

2. “Even given the low statistical power,” [look at the impressive results]. This seems like the “What does not kill my statistical significance makes it stronger” fallacy.

3. “Given both low statistical power and high intuitive correctness,” [these results not only seem right but seem to have policy implications]. That is, the study may be weak, but it turned up the results that the researchers desired or expected.
