E. J. Wagenmakers writes:

You may be interested in a recent article [by Nieuwenhuis, Forstmann, and Wagenmakers] showing how often researchers draw conclusions by comparing p-values. As you and Hal Stern have pointed out, this is potentially misleading because the difference between significant and not significant is not necessarily significant.

We were really surprised to see how often researchers in the neurosciences make this mistake. In the paper we speculate a little bit on the cause of the error.

From their paper:

In theory, a comparison of two experimental effects requires a statistical test on their difference. In practice, this comparison is often based on an incorrect procedure involving two separate tests in which researchers conclude that effects differ when one effect is significant (P < 0.05) but the other is not (P > 0.05). We reviewed 513 behavioral, systems and cognitive neuroscience articles in five top-ranking journals (Science, Nature, Nature Neuroscience, Neuron and The Journal of Neuroscience) and found that 78 used the correct procedure and 79 used the incorrect procedure.
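The incorrect and correct procedures the paper describes are easy to see side by side in a small simulation. This is my own illustrative sketch, not code from the paper; the sample size, effect size, and seed are arbitrary choices. Both "experiments" here have the same true effect, yet two separate tests can deliver different verdicts, while the direct test on the difference (correctly) finds nothing.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 20
# Both effects have the same true size: treatment means 0.5 SD above control.
ctrl_a, treat_a = rng.normal(0.0, 1.0, n), rng.normal(0.5, 1.0, n)
ctrl_b, treat_b = rng.normal(0.0, 1.0, n), rng.normal(0.5, 1.0, n)

# Incorrect procedure: two separate tests, then a comparison of verdicts.
p_a = stats.ttest_ind(treat_a, ctrl_a).pvalue
p_b = stats.ttest_ind(treat_b, ctrl_b).pvalue

# Correct procedure: a single test on the difference between the two effects.
est = (treat_a.mean() - ctrl_a.mean()) - (treat_b.mean() - ctrl_b.mean())
se = np.sqrt(treat_a.var(ddof=1) / n + ctrl_a.var(ddof=1) / n
             + treat_b.var(ddof=1) / n + ctrl_b.var(ddof=1) / n)
t_stat = est / se
p_diff = 2 * stats.t.sf(abs(t_stat), df=4 * (n - 1))  # approximate df

print(f"effect A alone: p = {p_a:.3f}")
print(f"effect B alone: p = {p_b:.3f}")
print(f"A vs B:         p = {p_diff:.3f}")
```

Run this over many seeds and you will regularly find one effect falling below 0.05 and the other above it, even though the difference between them is pure noise.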

I assume this has been an issue for close to a century; it’s interesting that it’s been noticed more in the past few years. I wonder what’s going on.

P.S. E. J. writes, “I know of no references that precede your work with Hal Stern.” I wonder, though. The idea is so important that I’d be surprised if Fisher, Yates, Neyman, Box, Tukey, etc., didn’t ever discuss it.

It seems to me that as neuroscientists are tackling more controversial topics such as liberalism vs conservatism and extra-sensory perception, their work is scrutinized more closely by colleagues and the media alike. Which is always a good thing, for scientists and for the lay public. Extraordinary claims demand extraordinary evidence…

I think that this has less to do with P-values per se than one might suppose. I have seen similar mistakes with confidence intervals (using two, one for each mean, to judge a difference), and presumably one could make the same error with credible intervals. A related problem is dichotomania, an obsessive-compulsive disorder that many medics seem to suffer from. Did your diastolic blood pressure drop by 10 mmHg? Then you are a responder and the drug worked for you. Did it drop by 9 mmHg? Then you are a non-responder and it didn't. Idiotic statements are then made by people in the pharma industry who should know better, like "this drug only works for 63% of patients", based on counting the proportion of "responders". Much of the hype and the hope of pharmacogenetics is founded on this lunacy, and unfortunately regulatory guidelines seem to encourage it.
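The responder fallacy described above can be made concrete with a toy simulation (my own sketch, with made-up numbers, not anything from the comment): suppose a drug lowers every patient's diastolic blood pressure by exactly 8 mmHg, and the only variation is measurement noise. A 10 mmHg "responder" cutoff still splits patients into responders and non-responders, even though the true effect is identical for everyone.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_effect = 8.0   # every patient's BP drops by exactly 8 mmHg
noise_sd = 6.0      # within-patient measurement variability, mmHg
observed_drop = true_effect + rng.normal(0.0, noise_sd, n)

# Dichotomize at the 10 mmHg "responder" threshold.
responder_rate = (observed_drop >= 10.0).mean()
print(f"'responder' rate: {responder_rate:.0%}")
```

The rate comes out well below 100% (around 37% with these numbers), yet concluding "the drug only works for 37% of patients" would be exactly the error Senn describes: the split reflects noise, not patient-to-patient differences in the true effect.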

From the paper: That is, as famously noted by Rosnow and Rosenthal, "surely, God loves the 0.06 nearly as much as the 0.05."

It occurred to me that of course 0.05 is just a rule of thumb, one that originated when (I think it was Fisher) tables of tail areas were compiled. Had we not evolved to have thumbs, or had only four-digited Disney cartoon hands and feet, no one would have chosen 0.05, but rather something closer to 0.06 (0.0625, to be precise)!

The recent issue (V8 N3) of Significance had an intriguing article about the status of significance tests in the US legal system. While the article was not without its eyebrow-raising moments (that journal is often no better than a front for certain types of 21st century propaganda), it was a good read.

The issues there were not like those mentioned here, where tests of differing significance status are taken to be themselves significantly different; rather, the article was a reasonable rebuttal to the idea of significance cutoffs as gospel. The picture was far from exhaustive but did get at a good deal of the issue. The commenter above used the phrase "dichotomania" to describe this; that has been one of my favorite phrases for some time now, as it is a scourge on science.

In the journal’s defense, the same issue had an article on regression to the mean which, while a bit snarky, hit home on some very important and usually overlooked issues in a way I never could have. One of its best articles in some time.

Andrew’s maxim that “the difference between significant & nonsignificant is not necessarily significant” would apply even if one did the test *correctly*. That is, even if the difference between the outcomes in the conditions was “statistically significant,” the size of that effect might not be significantly different from one that is *not* statistically significant. Today’s post–“The statistical significance filter”–bears this point out. So as Stephen points out, the problem identified in the review is not really one associated w/ p-value fetishism– it concerns a much more basic failure in statistical literacy. I’m surprised Andrew didn’t note this himself!

Agree with Stephen that it's not anything in particular about p-values but rather a basic logical error of dismissing uncertainty in favour of right/wrong, works/doesn't, safe/unsafe.

My favourites are testing assumptions (concluding OK given p > .1) and refusing to publish non-significant results because the confidence interval is too wide.

My old boss had this chant “lack of significance is not evidence of lack” that he had ample opportunities to use from 1988 onwards.

Personally I ran into it twice this week alone and was wondering why it is so common and persistent.

There is a recent book that traces the history of this issue: http://press.umich.edu/titleDetailDesc.do?id=186351

The Cult of Statistical Significance

How the Standard Error Costs Us Jobs, Justice, and Lives

Stephen T. Ziliak and Deirdre N. McCloskey

Interesting — I did read the Ziliak & McCloskey book, but I must have forgotten where they discuss this particular issue. Do you have a page number for me? Cheers, E.J.

I’ve only just started on Cult, so if the specific question is addressed there, I haven’t yet found it. But from the early chapters, their core criticism of significance is that it is “sizeless.” If Study A has P < 0.05 and Study B has P > 0.05, they would simply say that A may demonstrate some evidence for the existence of an effect, while study B doesn’t, but neither tells you anything about the magnitude of the effect. So it would be likely that they would also argue that a study of the differences of the effects between A and B also only addresses the existence, and not the magnitude, of a difference. Their rejoinder is the “so what?”
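The "sizeless" criticism has a simple arithmetic core, which can be shown without any simulation (this is my own illustration, not Ziliak & McCloskey's): a tiny effect in a huge sample and a tenfold larger effect in a modest sample can yield exactly the same p-value, because the test statistic confounds magnitude with sample size.

```python
import numpy as np
from scipy import stats

def two_sided_p(effect_in_sd_units, n):
    """Two-sided p-value for a one-sample z-test of a mean against zero."""
    z = effect_in_sd_units * np.sqrt(n)
    return 2 * stats.norm.sf(abs(z))

p_tiny  = two_sided_p(0.02, 10_000)  # tiny effect (0.02 SD), huge sample
p_large = two_sided_p(0.20, 100)     # tenfold larger effect, smaller sample
# both reduce to z = 2.0, so both p ≈ 0.0455
print(p_tiny, p_large)
```

Both studies report the same "significance," yet one effect is ten times the size of the other; the p-value alone says nothing about which.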

Interesting take, but some of their online material seems a bit shallow – in particular, bashing Fisher for rules to disregard non-significant findings. One of the first papers on the dangers of p-value censorship (i.e. journals disregarding non-significant findings) credited Fisher for suggesting the topic.

Also, Frank Yates did economic cost/benefit analyses of sample size in order to try to obtain more funds – but then I am repeating myself; see my old comment here: http://statmodeling.stat.columbia.edu/2008/12/some_implicatio/

K?

Have to say, I loved Cult, but found the ad hominem attacks on Fisher a little distracting. There’s a failure to recognise his primacy in likelihood theory, etc. Still, an entertaining book for its subject. – R
