Long Shot

Frank Harrell doesn’t like p-values:

In my [Frank’s] opinion, null hypothesis testing and p-values have done significant harm to science. The purpose of this note is to catalog the many problems caused by p-values. As readers post new problems in their comments, more will be incorporated into the list, so this is a work in progress.

His attitudes are similar to mine:

I [Andrew] agree with most of the ASA’s statement on p-values but I feel that the problems are deeper, and that the solution is not to reform p-values or to replace them with some other statistical summary or threshold, but rather to move toward a greater acceptance of uncertainty and embracing of variation.

I also agree with Harrell not to take the so-called Fisher exact test seriously (see Section 3.3 of this paper from 2003).

12 thoughts on “Long Shot”

  1. I agree there are a number of issues with p-values, both theoretically and (especially) practically. The problem is of course that very few people agree on the specific issues and the specific resolutions appropriate for various users of statistics. See also the ASA meeting.

    For example, the objections Harrell lists under A, B, G seem misguided to me. Others will, on the other hand, likely strongly agree with them.

    In particular, one issue I have (in principle) is that I see the desire for what Harrell calls ‘forward probabilities’ (which are usually, or at least used to be, called ‘inverse probabilities’!) as in fact striving for more certainty than is really possible. It’s at the heart of the ‘Bayes is not compatible with falsificationist reasoning’ misconception – eg that we always really want the ‘direct’ probability of a model.

    • To elaborate, many Bayesians (especially Jaynesians) emphasise the need to (formally) ‘condition’ on background information, but almost never put a probability distribution over this. Jaynes (from what I remember) and you (Andrew) often emphasise that we ‘learn the most’ when our model or background information is thrown into doubt by the data. Is this not paradigmatic ‘backwards’ (in Harrell’s sense) reasoning?

      • There are two aspects of the problem where we might consider ‘direction’. One is the order in which information arrives and parameters are formulated. This is subject to debate. The other is the direction of the probabilities that are computed, e.g., computing probabilities of the future given the past or given the present state. It is in this sense that I speak of forward probabilities and note that p-values are backwards.
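A minimal sketch of the two ‘directions’ above, under an assumed toy setup (a single observation from a Normal(mu, 1) model, simple hypotheses H0: mu = 0 and H1: mu = 1, and an assumed 50/50 prior; none of these numbers come from the thread):

```python
# Toy contrast of the two probability "directions": data given hypothesis
# versus hypothesis given data. The setup is assumed purely for illustration.
from scipy.stats import norm

x_obs = 1.7  # hypothetical observed value of X ~ Normal(mu, 1)

# "Backwards" (sampling-direction) summary: the one-sided p-value
# Pr(X >= x_obs | H0: mu = 0), a probability of data given the hypothesis.
p_value = norm.sf(x_obs, loc=0, scale=1)

# "Forward" summary in Harrell's sense: Pr(H0 | x_obs), a probability of
# the hypothesis given the data, which requires a prior (assumed 50/50 here).
prior_h0 = 0.5
lik_h0 = norm.pdf(x_obs, loc=0, scale=1)  # likelihood under H0: mu = 0
lik_h1 = norm.pdf(x_obs, loc=1, scale=1)  # likelihood under H1: mu = 1
post_h0 = prior_h0 * lik_h0 / (prior_h0 * lik_h0 + (1 - prior_h0) * lik_h1)

print(f"p-value   Pr(X >= {x_obs} | H0) = {p_value:.3f}")
print(f"posterior Pr(H0 | X = {x_obs})  = {post_h0:.3f}")
```

The two quantities answer different questions, which is the sense in which the ‘direction’ of the computed probability matters.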

  2. I believe that Fisher’s use of the term ‘inverse probability’ was not fully honest. Forward probabilities are simply forward in terms of time and information flow. Since they are the only type of probability that leads to optimum decisions, and especially since only forward probabilities define their own error probabilities for decisions, I believe that we should prefer them in the majority of cases. Discussions about the harm caused by sensitivity and specificity are relevant here (see the sketch after the reply below).

    • An inverse problem is defined relative to a forward problem.

      If you take the forward problem to be parameter to data, ie a data generating model etc, then the inverse problem is data to parameter.

      Forward problems are often required to respect certain features of the world based on physical principles, but the induced inverse problem typically does not respect these features. That is what makes ‘inference’ difficult.
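A concrete, assumed illustration of both framings above, using the sensitivity/specificity setting mentioned in the parent comment: the forward model (parameter to data) generates test results from disease status via prevalence, sensitivity, and specificity, and Bayes’ rule inverts it (data to parameter) to give Pr(disease | positive test). All of the numbers below are made up.

```python
# Forward model (parameter -> data): assumed prevalence, sensitivity, and
# specificity describe how disease status generates test results.
# Inverse problem (data -> parameter): what a positive test says about
# disease status. All numbers are hypothetical.
prevalence = 0.01    # Pr(disease)
sensitivity = 0.90   # Pr(test + | disease)
specificity = 0.95   # Pr(test - | no disease)

# Forward direction: overall probability of observing a positive test.
p_positive = prevalence * sensitivity + (1 - prevalence) * (1 - specificity)

# Inversion via Bayes' rule: Pr(disease | test +), the quantity Harrell
# would call a forward-in-time probability and use for decisions.
p_disease_given_positive = prevalence * sensitivity / p_positive

print(f"Pr(test +)           = {p_positive:.3f}")
print(f"Pr(disease | test +) = {p_disease_given_positive:.3f}")
# With these assumed numbers the test's sensitivity and specificity look
# strong (0.90 / 0.95), yet Pr(disease | test +) is only about 0.15.
```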

  3. If we have a pair of mutually exclusive, collectively exhaustive hypotheses (H0 and H1), and our problem is to decide in favor of one or the other, then any decision rule is subject to two possible errors: we decide in favor of H1, but H0 is true (“type I error”), or we decide in favor of H0, but H1 is true (“type II error”). Consequently, the frequentist risk of any decision rule is P(decide in favor of H1; H0 is true) * L(I) = alpha * L(I) if H0 is true, or P(decide in favor of H0; H1 is true) * L(II) = beta * L(II) if H1 is true. The Bayes risk, in turn, is pi * alpha * L(I) + (1 – pi) * beta * L(II), where pi is the prior probability of H0. Any two decision rules can differ in frequentist risk or Bayes risk, then, only by differing in their type I and type II error probabilities. If a decision rule takes the form of comparing a statistic T with a critical value T*, then varying the choice of T* trades off the alpha and beta of the decision rule. Reporting T itself allows the reader to decide which value to use for T*, and thus the appropriate alpha/beta tradeoff, given her prior probability and loss function. The p-value is simply a monotonic transformation of T, which may be interpreted as the smallest alpha consistent with deciding in favor of H1. (A numerical sketch of this tradeoff appears after this thread.)

    All of this is familiar material. The question is where this reasoning goes wrong, such that p-values become the object of so much criticism. It seems to me that this reasoning is sound, and thus p-values are inappropriate only if deciding between a pair of mutually exclusive, collectively exhaustive hypotheses is not the best formulation of our scientific problem. The fact that many users misinterpret p-values, or report quantities that do not really qualify as p-values, is not the fault of p-values so much as it is the fault of inadequate training in statistics. Similarly, the use of binary hypothesis testing where it is not the best formulation of our scientific problem is driven by a demand for p-values from journals, which is itself driven by a combination of inadequate training in statistics and a perceived need for a red line to distinguish findings from non-findings. Replacing p-values with something else would not overcome this training deficit, nor would it eliminate the perceived need for a red line.

    I agree that, if science were done better, we’d see a lot fewer p-values, but I don’t think that is because p-values are objectionable in and of themselves, but because users of statistics need better training in statistical methods as well as better goals for scientific research more broadly.

      • Thanks for the pointer.

        If we’re in agreement, then my suggestion is that we focus our public education efforts on appropriate versus inappropriate applications of binary hypothesis testing, and on the arbitrariness of red lines in science, rather than on inherent problems with p-values. I know that you have emphasized these points elsewhere, but I’ve had more than a few collaborators tell me that they know all about how p-values are no good, but then insist on using 95% confidence intervals to decide whether we have a “finding” or not.

        Education about p-values should, in my view, stress their proper computation, interpretation, and application, but if the takeaway is “p-values are no good”, then I don’t think any progress has actually been made.
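A small numerical sketch of the risk calculation in comment 3 above, under an assumed setup (one observation X ~ Normal(mu, 1), simple hypotheses H0: mu = 0 and H1: mu = 1, and hypothetical losses and prior; none of these values come from the post):

```python
# Numerical sketch of the alpha/beta tradeoff described in comment 3.
# Assumed setup: one observation X ~ Normal(mu, 1), with mutually exclusive,
# collectively exhaustive (under the model) simple hypotheses H0: mu = 0 and
# H1: mu = 1. Decision rule: decide in favor of H1 when X >= T_star.
# The losses and the prior below are hypothetical.
from scipy.stats import norm

L_I, L_II = 1.0, 1.0   # losses for type I and type II errors (assumed)
pi_h0 = 0.5            # prior probability of H0 (assumed)

def risks(t_star):
    """Return alpha, beta, and the Bayes risk for the cutoff t_star."""
    alpha = norm.sf(t_star, loc=0)   # P(decide H1 | H0 true): type I error
    beta = norm.cdf(t_star, loc=1)   # P(decide H0 | H1 true): type II error
    bayes_risk = pi_h0 * alpha * L_I + (1 - pi_h0) * beta * L_II
    return alpha, beta, bayes_risk

# Moving the cutoff T* trades alpha against beta.
for t_star in (0.5, 1.0, 1.645):
    alpha, beta, bayes = risks(t_star)
    print(f"T* = {t_star:5.3f}: alpha = {alpha:.3f}, beta = {beta:.3f}, "
          f"Bayes risk = {bayes:.3f}")

# Reporting the statistic itself lets the reader choose T*; the p-value of an
# observed value is the smallest alpha at which that value would lead to
# deciding in favor of H1.
x_obs = 1.3  # hypothetical observed statistic
print(f"p-value at X = {x_obs}: {norm.sf(x_obs, loc=0):.3f}")
```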
