There’s a special issue of the journal Ecology (vol. 95, no. 3) featuring several papers on p-values. There’s also a discussion that I wrote, which does not appear in the journal (for reasons explained below) but which I extract and link to below. First, the papers in the special section:
P values, hypothesis testing, and model selection: it’s déjà vu all over again
Aaron M. Ellison, Nicholas J. Gotelli, Brian D. Inouye, Donald R. Strong
In defense of P values
Paul A. Murtaugh
The common sense of P values
Perry de Valpine
To P or not to P?
Jarrett J. Barber, Kiona Ogle
P values are only an index to evidence: 20th- vs. 21st-century statistical science
K. P. Burnham, D. R. Anderson
Model selection for ecologists: the worldviews of AIC and BIC
Ken Aho, DeWayne Derryberry, Teri Peterson
In defense of P values: comment on the statistical methods actually used by ecologists
John Stanton-Geddes, Cintia Gomes de Freitas, Cristian de Sales Dambros
Comment on Murtaugh
Michael Lavine
Recurring controversies about P values and confidence intervals revisited
Aris Spanos
Rejoinder
Paul A. Murtaugh
Finally, there’s my own contribution, The problem with p-values is how they’re used:
I agree with Murtaugh (and also with Greenland and Poole 2013, who make similar points from a Bayesian perspective) that, with simple inference for linear models, p-values are mathematically equivalent to confidence intervals and other data reductions, so there should be no strong reason to prefer one method to another. In that sense, my problem is not with p-values but with how they are used and interpreted.
Based on my own readings and experiences (not in ecology but in a range of social and environmental sciences), I feel that p-values and hypothesis testing have led to much scientific confusion, with researchers treating non-significant results as zero and significant results as real. . . .
I have, on occasion, successfully used p-values and hypothesis testing in my own work, and in other settings I have reported p-values (or, equivalently, confidence intervals) in ways that I believe have done no harm, as a way to convey uncertainty about an estimate (Gelman 2013). In many other cases, however, I believe that null hypothesis testing has led to the publication of serious mistakes . . .
The article under discussion reveals a perspective on statistics which, by focusing on static data, is much different from mine. Murtaugh writes:
Data analysis can always be redone with different statistical tools. The suitability of the data for answering a particular scientific question, however, cannot be improved upon once a study is completed. In my opinion, it would benefit the science if more time and effort were spent on designing effective studies with adequate replication, and less on advocacy for particular tools to be used in summarizing the data.
I do not completely agree with this quotation or with its implications. First, the data in any scientific analysis are typically not set in stone, independent of the statistical tools used in the analysis. Often I have found that the most important benefit derived from a new statistical method is that it allows the inclusion of more data in drawing scientific inferences. . . .
My second point of disagreement with the quotation above is with the implication that too much time is spent on considering how to perform statistical inference. (Murtaugh writes of “advocacy,” but this seems to me to be a loaded term.) It is a well-accepted principle of research planning that the design of data collection is best chosen with reference to the analysis that will later be performed. We cannot always follow this guideline—once data have been collected, they will ideally be made available for any number of analyses by later researchers—but it still suggests that concerns about statistical methods are relevant to design.
In conclusion, I share the long-term concern (see Krantz 1999 for a review) that the use of p-values encourages and facilitates a sort of binary thinking in which effects and comparisons are treated as either zero or real, and also an old-fashioned statistical perspective under which it is difficult to combine information from different sources. The article under discussion makes a useful contribution by emphasizing that problems in research behavior will not automatically be fixed by changes in data reductions. The mistakes that people make with p-values could also be made using confidence intervals and AIC comparisons, and I think it would be good for statistical practice to move forward from the paradigm of yes/no decisions drawn from stand-alone experiments.
Hypothesis testing and p-values are compelling in that they fit in so well with the Popperian model in which science advances via refutation of hypotheses. . . . But a necessary part of falsificationism is that the models being rejected are worthy of consideration. . . . In common practice, however, the “null hypothesis” is a straw man that exists only to be rejected. In such settings, I am typically much more interested in the size of the effect, its persistence, and how it varies across different situations. I would like to reserve hypothesis testing for the exploration of serious hypotheses, not as an indirect form of statistical inference that typically has the effect of reducing scientific explorations to yes/no conclusions.
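To make the equivalence point above concrete: for simple normal-theory inference, the two-sided p-value falls below α exactly when the corresponding 1 − α confidence interval excludes zero. Here is a minimal sketch in Python (my own illustration on simulated data, not code from any of the papers) that checks this duality for a few small simulated studies:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05

for _ in range(5):
    y = rng.normal(loc=0.3, scale=1.0, size=20)   # one small simulated study
    est = y.mean()                                # estimated effect
    se = y.std(ddof=1) / np.sqrt(len(y))          # standard error of the mean
    df = len(y) - 1

    p = 2 * stats.t.sf(abs(est / se), df)         # two-sided p-value
    half = stats.t.ppf(1 - alpha / 2, df) * se    # half-width of the 95% CI
    lo, hi = est - half, est + half

    # p < alpha holds exactly when the interval excludes zero
    assert (p < alpha) == (lo > 0 or hi < 0)
    print(f"est={est:+.3f}  p={p:.3f}  95% CI=({lo:+.3f}, {hi:+.3f})")
```

Either summary carries the same information; the difference, as discussed above, is in what readers do with it.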
The journal editors sent me Murtaugh’s paper and invited me to write a short comment, which I did, and it was all set to be published when I found out that there was a $300 publication fee. I couldn’t bring myself to pay money to have the journal publish something that I wrote for them for free! I explained this to the editors who graciously let me withdraw the paper. So instead I’m posting it here, for the marginal cost of approximately $0.