Jeremy Miles pointed me to this article by Leonhard Held with what might seem like an appealing brew of classical, Bayesian, and graphical statistics:

P values are the most commonly used tool to measure evidence against a hypothesis. Several attempts have been made to transform P values to minimum Bayes factors and minimum posterior probabilities of the hypothesis under consideration. . . . I [Held] propose a graphical approach which easily translates any prior probability and P value to minimum posterior probabilities. The approach allows to visually inspect the dependence of the minimum posterior probability on the prior probability of the null hypothesis.
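For context, the kind of transformation the quote describes can be sketched with one well-known bound: the minimum Bayes factor -e·p·ln(p) of Sellke, Bayarri, and Berger, combined with a prior probability of the null via Bayes' rule on the odds scale. (That this is the particular bound Held builds on is my assumption, not something stated in the excerpt.)

```python
import math

def min_bayes_factor(p):
    """Sellke-Bayarri-Berger lower bound on the Bayes factor
    in favor of the null; valid for 0 < p < 1/e."""
    if not 0 < p < 1 / math.e:
        raise ValueError("bound applies only for 0 < p < 1/e")
    return -math.e * p * math.log(p)

def min_posterior_prob_null(p, prior_null):
    """Minimum posterior probability of the null, given a
    p-value and a prior probability of the null."""
    bf = min_bayes_factor(p)
    prior_odds = prior_null / (1 - prior_null)  # odds in favor of H0
    post_odds = prior_odds * bf                 # Bayes' rule on the odds scale
    return post_odds / (1 + post_odds)

# p = 0.05 with a 50/50 prior gives a minimum posterior
# probability of the null of about 0.29 -- far from 0.05.
print(round(min_posterior_prob_null(0.05, 0.5), 2))
```

This is the arithmetic behind the complaint that a p-value of 0.05 is much weaker evidence than it looks.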

I think the author means well, and I believe that this tool might well be useful in his statistical practice (following the doctrine that it’s just about always a good idea to formalize what you’re already doing).

That said, I really don’t like this sort of thing. My problem with this approach, as indicated by my title above, is that it’s trying to make p-values do something they’re not good at. What a p-value is good at is summarizing the evidence regarding a particular misfit of model to data.

Rather than go on and on about the general point, I’ll focus on the example (which starts on page 6 of the paper). Here’s the punchline:

At the end of the trial a clinically important and statistically significant difference in survival was found (9% improvement in 2 year survival, 95% CI: 3-15%).

Game, set, and match. If you want, feel free to combine this with prior information and get a posterior distribution. But please, please, parameterize this in terms of the treatment effect: put a prior on it, do what you want. Adding prior information can change your confidence interval, possibly shrink it toward zero–that’s fine. And if you want to do a decision analysis, you’ll want to summarize your inference not merely by an interval estimate but by a full probability distribution–that’s cool too. You might even be able to use hierarchical Bayes methods to embed this study into a larger analysis including other experimental data. Go for it.
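To make the "put a prior on the treatment effect" suggestion concrete, here is a minimal conjugate-normal sketch. The data summary comes from the quoted trial; the skeptical prior (mean 0, SD 5) is a hypothetical choice of mine, purely for illustration:

```python
import math

# Trial summary from the quote: 9% improvement, 95% CI 3%-15%,
# so the standard error is about (15 - 3) / (2 * 1.96) = 3.06.
est, se = 9.0, (15.0 - 3.0) / (2 * 1.96)

# Hypothetical skeptical prior on the treatment effect itself,
# centered at zero with SD 5 (my choice, for illustration only).
prior_mean, prior_sd = 0.0, 5.0

# Conjugate normal updating: precision-weighted average.
w_data, w_prior = 1 / se**2, 1 / prior_sd**2
post_var = 1 / (w_data + w_prior)
post_mean = post_var * (w_data * est + w_prior * prior_mean)
post_sd = math.sqrt(post_var)

lo, hi = post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd
print(f"posterior: {post_mean:.1f}% (95% interval {lo:.1f}% to {hi:.1f}%)")
```

With these (made-up) prior numbers the estimate shrinks from 9% toward zero, to about 6.5%, and the interval tightens but still excludes zero: prior information changes the inference without ever leaving the treatment-effect scale.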

But to summarize the current experiment, I’d say the classical confidence interval (or its Bayesian equivalent, the posterior interval based on a weakly informative prior) wins hands down. And, yes, the classical p-value is fine too. It is what it is, and its low value correctly conveys that a difference as large as observed in the data is highly unlikely to have occurred by chance.

P.S. This story is related to the Earl Weaver theme mentioned in a recent entry.

I agree. Treatment effect estimates combined with interval estimates of uncertainty trump p-values every day and twice on Sundays. I believe that there was even a psychiatry journal that banned p-values for a while, forcing authors to really think about what the effect of interest was and to put an interval around it. I don't think I would go quite that far, but it's definitely food for thought.

As someone more closely aligned with frequentist statistics, I'm always interested to see people bridging the gap (e.g. matching priors). Perhaps this work isn't particularly compelling, but I like where his head is at.

We use p-values for something different: setting detection thresholds for pulsar searches. If you're looking at, say, a million independent Fourier frequencies, and you want to bring up an expected one for further study, you look for a power high enough that its p-value is less than one in a million. (Similarly if you're adding multiple harmonics, coherently or incoherently, though counting your "number of trials" becomes more difficult.) I don't know whether there's another tool that can really do the job. (The low computing cost is also important, since in fact those million Fourier frequencies are multiplied by ten thousand dispersion measure trials and five thousand beams.)
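The thresholding arithmetic above can be sketched directly, assuming the textbook result that normalized periodogram power under Gaussian noise is exponentially distributed, so P(power > T) = exp(-T). The trial counts are taken from the comment:

```python
import math

n_freqs = 1_000_000  # independent Fourier frequencies searched

# Choose T so the per-trial p-value is 1/n_freqs, i.e. about one
# expected false alarm over the million-frequency search.
threshold = math.log(n_freqs)  # about 13.8 in normalized power units

expected_false_alarms = n_freqs * math.exp(-threshold)  # about 1.0

# Controlling false alarms over the whole survey instead
# (10,000 dispersion measures x 5,000 beams, per the comment)
# pushes the threshold up only logarithmically:
total_trials = n_freqs * 10_000 * 5_000
survey_threshold = math.log(total_trials)  # about 31.5

print(threshold, expected_false_alarms, survey_threshold)
```

The cheapness of this calculation (one logarithm, no simulation) is part of why the p-value framing fits this screening problem so well.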

That said, we don't really use p-values: in practice, radio-frequency interference means we have no real grasp on the statistics of our problem. There are basically always many signals that are statistically significant but not real, so we rely on ad-hoc methods to try to manage the detection rates.

I did find that leaving "it as what it is" was helpful in an introductory statistics course – the only problem was getting the tutors to stop asking for the _silly_ "reject, fail to reject, etc." jargon and putting big X's when it was missing.

David Cox wrote something fairly recently on these issues in "Frequentist statistics as a theory of inductive inference," by Deborah G. Mayo and D. R. Cox (google it).

One of the points I liked was the definition of the null hypothesis as simply being a dividing line between positive and negative effects – rather than anything of interest.

Putting any non-zero prior on it seems somehow wrong.

K