My problem with the Lindley paradox

From a couple years ago but still relevant, I think:

To me, the Lindley paradox falls apart because of its noninformative prior distribution on the parameter of interest. If you really think there’s a high probability the parameter is nearly exactly zero, I don’t see the point of the model saying that you have no prior information at all on the parameter. In short: my criticism of so-called Bayesian hypothesis testing is that it’s insufficiently Bayesian.

P.S. To clarify (in response to Bill’s comment below): I’m speaking of all the examples I’ve ever worked on in social and environmental science, where in some settings I can imagine a parameter being very close to zero and in other settings I can imagine a parameter taking on just about any value in a wide range, but where I’ve never seen an example where a parameter could be either right at zero or taking on any possible value. But such examples might occur in areas of application that I haven’t worked on.

19 thoughts on “My problem with the Lindley paradox”

  1. Again, my papers with Jim Berger: Compare Einstein’s theory of general relativity, which predicts quite precisely an anomalous (relative to Newtonian theory) perihelion motion of Mercury of approximately 43 seconds of arc per century. We can quantify how precise that prediction is (and did in our papers).

    This, versus a modified Newtonian physics that predicts only whatever some otherwise-unconsidered effect would produce (such as solar oblateness, “Vulcan”, or a very small modification of the inverse-square law). Again, in our papers we estimated how big such effects could be, based on their (unobserved) effects on the other planets, and how big they would have to be to produce the anomalous motion of Mercury.

    In both cases Jim and I provided what I believe to be principled priors on the parameters for both hypotheses.

    We did not put posterior probabilities on the hypotheses themselves, since we didn’t put prior probabilities on them. We only computed Bayes factors, which are independent of those prior probabilities, to measure the support that the data provided for each hypothesis, given our (I claim) principled priors.

    What is your objection?

      • OK, but I think if you go back to my response to your original blog posting you will see that I agree about the sort of unconsidered “hypothesis testing” that you complain about. I stated quite clearly that the prior on the parameter of the alternative hypothesis is very important. In some cases the best you can do is bound the Bayes factor (as with Jim Berger’s lower bound over a class of symmetric priors).

        The more important point is to note that the REAL point of the Jeffreys-Lindley “paradox” is to demonstrate that from a Bayesian point of view, p-values significantly overestimate the evidence against the null when interpreted naively (as, for example, when interpreted as “the probability that the null hypothesis is true” or “the probability that we would have gotten this result by chance”). The easiest way to see this is to consider a test where all of the prior probability on the alternative is placed precisely where the data are[*]. Even in that most favorable case, the Bayes factor is nowhere near as skeptical about the null as the naively interpreted p-value is.

        [*] Sorry, I had Latin in high school.

  2. Maybe one day Bayesians will stop alleging, erroneously, that p-values overstate the evidence against the null! With a large sample size, the discrepancy indicated by a small p-value is smaller than it would be if found with a smaller sample size. Oh, I just noticed that Jefferys has said it again! (I also don’t see why he’d want to place “all of the prior”– post-data– on the maximally likely alternative!) Many discussions of the “large n” problem and the related Jeffreys-Good-Lindley paradox are on my blog. For a comic post with links: http://errorstatistics.com/2012/04/28/3671/

    Also in my comments on J. Berger: http://www.phil.vt.edu/dmayo/personal_website/Berger%20Could%20Fisher%20Jeffreys%20and%20Neyman%20have%20agreed%20on%20testing%20with%20Commentary.pdf

    • Mayo:

      One problem here is that p(y|M), the marginal probability of the data given the model, is sometimes called the “evidence.” I hate this sort of loaded term, but now that it’s in common use, it can affect how people talk about these things. Sort of like what unfortunately happened with the term “bias.”

      • >I also don’t see why he’d want to place “all of the prior”– post-data– on the max likely alternative!

        This minimizes the Bayes factor in favor of the null. It is simply a thought experiment that gives the alternative hypothesis maximum advantage, so that the numerical value of the Bayes factor can be compared to the numerical value of the p-value. See Berger and Delampady.
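
        A minimal numerical sketch of that thought experiment (my illustration, assuming a normal test statistic with known variance, not the exact Berger-Delampady calculation): if all of the prior mass under the alternative sits exactly at the observed estimate, the Bayes factor in favor of the null is exp(-z^2/2), and even that most favorable case is far less damning for the null than the p-value sounds.

            # Sketch: smallest possible Bayes factor for a point null when the
            # alternative's prior is a point mass at the maximum-likelihood value.
            from math import exp
            from scipy.stats import norm

            for p in [0.05, 0.01, 0.001]:
                z = norm.isf(p / 2)            # two-sided z-score corresponding to p
                min_bf01 = exp(-z * z / 2)     # BF(null / alternative) at its minimum
                print(f"p = {p:.3f}   z = {z:.2f}   min BF_01 = {min_bf01:.3f}   "
                      f"max odds against null = {1 / min_bf01:.1f} to 1")

        For p = 0.05 this comes out to odds of only about 7 to 1 against the null (a posterior probability for the null of roughly 0.13 under even prior odds), nothing like what a naive reading of 0.05 as “the probability the null is true” would suggest.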

        And the reason why we say that p-values overstate the evidence against the null is because they do. I do not understand why frequentists are always saying that p-values don’t overstate the evidence against the null. Oh, but Mayo just said that they don’t!

        This is why physicists insist on 5 sigmas: it is demonstrable that p<0.05 results are overturned much more often than that number would suggest, for numerous reasons, including (but not limited to) misuse of the statistical machinery.

        • The p-value is just a number; it’s mathematically well defined, and without any interpretation it doesn’t state a strength of evidence for or against anything. It can only “overstate” evidence if you interpret it as something it isn’t.
          The p-value, by definition, doesn’t imply any statement about “how often results are overturned”.

        • One concrete example of the effect Bill is referring to is that people tend to publish only when they get a significant result. So there is a “look-elsewhere effect” (LEE) that is not being accounted for. But as far as I understand it, the same thing could happen if they were using Bayesian hypothesis testing, so I don’t see this as a good argument against p-values.

  3. Andrew: Can you give a short summary of what the Lindley paradox is? I can’t quite map your PS and the implicit internal conversation onto the statement of the paradox on Wikipedia. I would like it even more if you added your summary to the post as a PPS!

    • oh wait I think I got it:

      H_0 says the value of some parameter is *exactly* zero.

      H_1 says the value can be anything in some wide range.

      H_1 has a much better likelihood (at the best value of the parameter) than H_0; nonetheless, H_0 has a much better *marginalized* likelihood, because H_0 makes a specific prediction while H_1 spreads its prior over a wide range.
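
      A toy calculation to make that concrete (my own made-up numbers: y_i ~ N(theta, 1), n = 100, observed mean 0.2, so z = 2; H_0: theta = 0; H_1: theta ~ N(0, 10^2)):

          # Sketch: best-case likelihood favors H_1, but the marginal
          # (prior-averaged) likelihood favors the point null H_0.
          from math import sqrt
          from scipy.stats import norm

          n, sigma, tau, ybar = 100, 1.0, 10.0, 0.2
          se = sigma / sqrt(n)                             # 0.1, so z = ybar / se = 2

          like_best = norm.pdf(ybar, loc=ybar, scale=se)   # H_1 at its best-fitting theta
          like_null = norm.pdf(ybar, loc=0.0, scale=se)    # H_0
          print("max-likelihood ratio, H_1 vs H_0:", like_best / like_null)   # ~7.4

          marg_null = norm.pdf(ybar, loc=0.0, scale=se)
          marg_alt = norm.pdf(ybar, loc=0.0, scale=sqrt(tau**2 + se**2))
          print("Bayes factor BF_01:", marg_null / marg_alt)                  # ~13, favors H_0

      So the “significant” result favors H_1 by about 7 to 1 at the best-fitting parameter value, yet the marginal likelihood favors H_0 by more than 10 to 1, because the broad prior wastes most of H_1’s probability on values the data rule out.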

    • An important part of the Jeffreys-Lindley “paradox” is the behavior under large sample sizes. As Lindley stated, you can conduct an alpha-level test of an exact null for some (arbitrary) very small alpha, where H_1 says the value of the parameter can be in some fixed range R. It is important that alpha and R be fixed. Then for large amounts of data you can have simultaneously (1) the null is rejected, with p < alpha, and (2) the Bayes factor against the null is less than alpha. That’s the “paradox”.

      • It is important that this conflict between p-value and Bayes factor is certain to arise, regardless of the specific prior. You can choose an uninformative prior, a somewhat informative prior, or a highly informative prior; in all cases, with p fixed at any value, say .01, and n -> infinity, the Bayes factor will indicate infinite evidence in favor of H0. As Lindley states “the phenomenon would persist with almost any prior probability distribution”.
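
        A rough sketch of that limit (again my own illustration, for a normal mean with known sd sigma and H_1: theta ~ N(0, tau^2)): holding z fixed at the value giving p = .01 and letting n grow, BF_01 = sqrt(1 + n*tau^2/sigma^2) * exp(-(z^2/2) * n*tau^2 / (n*tau^2 + sigma^2)), which eventually favors the null no matter what tau is.

            # Sketch: fixed p-value, growing n; the Bayes factor for the point
            # null grows without bound, for narrow and diffuse priors alike.
            from math import sqrt, exp
            from scipy.stats import norm

            sigma = 1.0
            z = norm.isf(0.01 / 2)          # z for a fixed two-sided p = .01

            def bf01(n, tau):
                r = n * tau**2 / sigma**2
                return sqrt(1 + r) * exp(-0.5 * z**2 * r / (1 + r))

            for tau in [0.1, 1.0, 10.0]:
                for n in [100, 10_000, 1_000_000, 100_000_000]:
                    print(f"tau = {tau:5.1f}   n = {n:>11,}   BF_01 = {bf01(n, tau):12.2f}")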

  4. The paradox is useful as a teaching point about what BFs are and how you can use them. The “alternative model” being evaluated is often a useless model for prediction (e.g., I don’t know the sign of the effect; I have an order of magnitude or two of uncertainty in its absolute value), and when the likelihood of the data is marginalized over that predictive model you get garbage. Fixes like the partial BF (split the sample and estimate on a subset to get a real alternative model) are useful, but AFAIK interpretable only as a bound on the posterior model probabilities.
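
    A toy version of that sample-splitting idea (my own sketch for a normal mean with known variance, not anyone’s canonical partial-BF recipe): use a training fraction to turn the vague prior into a posterior, then compute the Bayes factor from the held-out data only.

        # Sketch: "partial" Bayes factor via sample splitting, normal mean with
        # known sd; the training half converts the vague prior into a usable one.
        import numpy as np
        from math import sqrt
        from scipy.stats import norm

        rng = np.random.default_rng(0)
        sigma, tau = 1.0, 10.0                     # known sd; very broad prior sd under H_1
        y = rng.normal(0.1, sigma, size=200)       # toy data with a small true effect
        y_train, y_test = y[:50], y[50:]

        # Posterior for theta under H_1 after seeing the training data
        m = len(y_train)
        post_var = 1 / (1 / tau**2 + m / sigma**2)
        post_mean = post_var * m * y_train.mean() / sigma**2

        # Partial Bayes factor from the held-out data (its mean is sufficient here)
        k = len(y_test)
        se = sigma / sqrt(k)
        bf01_partial = (norm.pdf(y_test.mean(), loc=0.0, scale=se) /
                        norm.pdf(y_test.mean(), loc=post_mean, scale=sqrt(post_var + se**2)))
        print("partial BF_01 =", bf01_partial)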

  5. But Bayesian model comparison is just a comparison between a _particular_ H_0 (let’s say a model with prior delta(0) on some parameter theta) and a _particular_ H_1 (let’s say the same model except with a broad prior p1 on theta) – typically the idea is that theta corresponds to some real-world quantity we are interested in. The result of the model comparison is to tell us which of the two models is a better description of reality, in the sense of being the more likely model to have generated the data if we assume that the data were generated by one of the models (which of course they weren’t). It does _not_ tell us whether the real-world theta is exactly equal to zero, but I think it _can_ be interpreted as telling us whether theta=0 is a better approximation than is the posterior (P(theta|D,H_1)) of theta under H_1.

    It’s worth thinking about both the delta and the broad prior:

    a) I agree it is very hard to find applications where a delta prior could conceivably correspond _exactly_ to the phenomenon being modeled. But there are many applications where it could be a very good (and very practical) _approximation_. (Of course, as the amount of data increases the approximation becomes less good and it may start making sense to replace this naive prior with one that has non-zero width.)

    b) There are many choices of p1 compared to which theta=0 is a much better description of reality. This is why Andrew is unhappy with an uninformative (i.e. very broad) choice of p1: we can often construct a much better model. In those cases, we should go ahead and use a better prior (say p2, with corresponding model H_2), which captures more of the available information and which may lead to the conclusion that theta~P(theta|D,H_2) is a better description of reality than theta=0. But this does not invalidate the conclusion drawn from the original analysis, that theta=0 is a better approximation to reality than theta~P(theta|D,H_1).
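
    To put a number on (b) (my toy example, the usual normal-mean setup: n = 100, sigma = 1, observed mean 0.2): holding the data fixed and widening the prior on theta under the alternative well beyond the scale the data can support pushes the Bayes factor steadily toward the point null, which is exactly the sensitivity to the choice of p1 at issue here.

        # Sketch: the same "significant" data judged against alternatives with
        # increasingly diffuse priors on theta; the point null fares better and better.
        from math import sqrt
        from scipy.stats import norm

        n, sigma, ybar = 100, 1.0, 0.2             # z = 2
        se = sigma / sqrt(n)

        for tau in [0.2, 1.0, 10.0, 100.0]:        # from moderately informative (p2-like) to vague (p1-like)
            bf01 = norm.pdf(ybar, 0.0, se) / norm.pdf(ybar, 0.0, sqrt(tau**2 + se**2))
            print(f"prior sd tau = {tau:7.2f}   BF_01 = {bf01:9.3f}")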

  6. If you prefer likelihood ratios, then you can say (for a one-degree-of-freedom test) that P=0.05 (two sided) corresponds to approximately a ratio of between 6 to 1 and 7 to 1 for the best supported alternative hypothesis compared to the null (the arithmetic is spelled out just below). This best supported hypothesis will change with the sample size, of course. The P-value is not a great statistic, but just as it is often used stupidly, it is often attacked stupidly. A common ridiculous criticism is that the replication probability is low. See
    Senn, S. J. (2002). “A comment on replication, p-values and evidence S.N.Goodman, Statistics in Medicine 1992; 11:875-879.” Statistics in Medicine 21(16): 2437-2444.
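
    (Spelling that arithmetic out: for z = 1.96, the likelihood ratio at the best supported alternative is exp(z^2/2) = exp(1.92) ≈ 6.8, hence “between 6 to 1 and 7 to 1”.)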

    It is also perhaps worth noting that Lindley’s original statement of the paradox is misleading (his formula does not satisfy dimensional analysis). This was corrected by Bartlett, but the correction shows that the paradox is not quite as automatic as one might (at first blush) suppose. See Bartlett, M. S. (1957). “A comment on D.V. Lindley’s statistical paradox.” Biometrika 44: 533-534.

    Finally, is it not strange that Bayesians always see the paradox as being a feature of P-values? One could argue that it is actually a (non-paradoxical) feature of Bayesian statistics, since there is no guarantee that as the sample size increases two Bayesians will converge. Thus, to dismiss P-values because they disagree with Bayesian inference ‘…would require that a procedure is dismissed because, when combined with information which it doesn’t require and which may not exist, it disagrees with a procedure that disagrees with itself.’ See Senn, S. J. (2001). “Two cheers for P-values.” Journal of Epidemiology and Biostatistics 6(2): 193-204.

    • “there is no guarantee that as the sample size increases two Bayesians will converge”

      If there were such a guarantee, that would _really_ be a paradox. As pointed out by Jaynes, it is easy to demonstrate situations where “two Bayesians” (by which I take it you mean two Bayesian analyses conditioned on different initial information states) converge on different answers as sample size increases. The division of the available information into “initial information” and “data” is arbitrary; we don’t expect to get the same answer to analyses conditioned on different data, so why should we expect the same answer to analyses conditioned on different initial information?

      Also: the only difference between H_0 and H_1 is in the prior.

  7. Konrad, I agree that the non-(necessary) convergence of Bayesian inference is perfectly logical. I described it as a non-paradox. In fact I gave a simple demonstration of strong disagreement between Bayesians in ‘Two cheers for P-values’. However, since the fact that one Bayesian inference strongly contradicts another Bayesian inference is not a paradox, how can the fact that a P-value is not compatible with a given Bayesian inference be a problem, let alone a paradox? You may find all sorts of other arguments as to why P-values are illogical, but non-agreement with a Bayesian inference can’t be one of them.

    • I agree: calling it a paradox is silly, and I don’t think the idea that it’s a non-paradox is at all controversial (e.g. see the Wikipedia page on Lindley’s paradox).
      This is also consistent with the argument I made above re Bayesian model comparison.
