The posterior distribution of the likelihood ratio as a summary of evidence

Gabriel Marinello writes:

I am a PhD student in Astrophysics and am writing to you with an enquiry about point null hypothesis testing (H0: Theta = Theta0 vs. H1: Theta != Theta0) in a Bayesian context; I think your pragmatic stance would be helpful. In astrophysics it is not rare to find attempts to use Bayes factors in this setting, despite all the problems associated with them. Recently, posterior likelihood ratios have appeared as an alternative. I read the papers of Aitkin, and they have a reasonable frequentist limit; however, they don’t feel completely Bayesian. I thought that the right quantity should be the ratio of the posterior densities p(Theta|x), not of the likelihoods p(x|Theta). With this ratio of posterior densities you recover the posterior likelihood ratio under a uniform prior (as usual in Bayesian analyses), while under an informative prior the condition p(Theta0|x)/p(Theta|x) > 1 becomes a condition of the form p(x|Theta0)/p(x|Theta) > c. This “hypothesis” is equivalent to H0: p(Theta0|x) > p(Theta|x) vs. H1: p(Theta0|x) < p(Theta|x), and the set where the latter holds is just the highest posterior density interval at level 1-P(H0|x). This seems to suggest that a test of the null hypothesis can be constructed from a posterior interval, much as in frequentist statistics, but with the test interpreted as the probability that Theta0 is better than the other parameters in the domain, in the sense that p(Theta0|x) > p(Theta|x).

I tried to find this analysis in the literature and, to my surprise, I was not able to find anything. As the construction of HPD intervals is standard, the possibility of using them as a hypothesis test is really tempting, especially because it is less sensitive to the prior than the Bayes factor. So it could be that there is an obvious mistake, and I am not a statistician. Is this idea valid?
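
To make the proposal concrete, here is a minimal sketch, assuming a conjugate normal-normal model (the prior, data, and numbers below are made up for illustration, not taken from the email): it computes the HPD level at which Theta0 sits on the boundary of the interval, which is the 1-P(H0|x) quantity described above.

```python
import numpy as np
from scipy import stats

# Illustrative normal-normal model (all numbers are hypothetical):
# y_i ~ Normal(theta, sigma^2), theta ~ Normal(mu0, tau^2)
sigma, mu0, tau = 1.0, 0.0, 2.0
y = np.array([0.8, 1.3, 0.2, 1.1, 0.7])
n = len(y)

# Conjugate posterior: theta | y ~ Normal(mu_post, sd_post^2)
prec = n / sigma**2 + 1 / tau**2
mu_post = (y.sum() / sigma**2 + mu0 / tau**2) / prec
sd_post = prec**-0.5
post = stats.norm(mu_post, sd_post)

# For this unimodal, symmetric posterior, the set
# {theta : p(theta|y) > p(theta0|y)} is the interval
# (mu_post - d, mu_post + d) with d = |theta0 - mu_post|.
# Its posterior mass is the HPD level at which theta0 sits exactly
# on the boundary, i.e. 1 - P(H0|x) in the notation above.
theta0 = 0.0
d = abs(theta0 - mu_post)
level = post.cdf(mu_post + d) - post.cdf(mu_post - d)
print(f"theta0 enters HPD intervals at level {level:.3f}")
print(f"P(H0|x) = P(p(theta0|x) > p(theta|x)) = {1 - level:.3f}")
```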

My reply:

Christian, Judith, and I published a paper a couple years ago where we expressed our problems with Aitkin’s posterior likelihood ratio.

That said, I’m starting to become more open to the posterior likelihood ratio, partly because of discussions with Dan Kahan where he is pushing me to get more quantitative on what is the evidence supplied by data to distinguish different models of reality. And, I do think that working with p(y|theta_0)/p(y|theta), integrating over theta, makes a lot more sense than trying to compare marginal posterior probabilities.

In any case, I’d like to get away from the whole “null hypothesis” thing here, and instead think of all of this as a way of quantifying the evidence in the data.

If you think of it as a summary of evidence in the data, then it makes sense to work with likelihood ratios rather than posterior density ratios, in which case the “posterior” part of the “posterior likelihood ratio” is just a way of assessing that evidence in the context of the posterior distribution, which makes a certain amount of sense, actually!
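
To see what such a summary looks like in practice, here is a minimal sketch of the posterior distribution of the likelihood ratio p(y|theta_0)/p(y|theta), assuming a normal model with a flat prior; the data and the particular summaries are made up for illustration and are not from the post.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative normal model with known sigma and a flat prior on theta
sigma = 1.0
y = np.array([0.8, 1.3, 0.2, 1.1, 0.7])
n, ybar = len(y), y.mean()
theta0 = 0.0

def loglik(t):
    # log p(y | theta = t); t may be a length-1 array or an array of draws
    return stats.norm.logpdf(y[:, None], t, sigma).sum(axis=0)

# Under a flat prior the posterior is theta | y ~ Normal(ybar, sigma^2/n);
# draw theta from it and evaluate the likelihood ratio at each draw
theta = rng.normal(ybar, sigma / np.sqrt(n), size=100_000)
ratio = np.exp(loglik(np.array([theta0])) - loglik(theta))

# Summaries of the evidence, e.g. an Aitkin-style P(LR > 1 | y)
print("P(LR > 1 | y) =", (ratio > 1).mean())
print("posterior median LR =", np.median(ratio))
```

The likelihood ratio is evaluated at posterior draws of theta, so the “posterior” part of the posterior likelihood ratio enters only through where theta is plausible given the data.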

And it also relates to LOO and WAIC.

Let’s see what X has to say.

22 thoughts on “The posterior distribution of the likelihood ratio as a summary of evidence”

  1. Suppose you have two structurally different models:

    p_1(y|theta) p(theta) is proportional to the posterior for model 1, over parameters theta which are meaningful for model 1

    p_2(y|gamma) p(gamma) is proportional to the posterior for model 2, over parameters gamma which are meaningful for model 2

    across models there is no way to compare theta to gamma; they just don’t have any meaning within the opposing model. For example, theta might be the time-averaged rate of change of gumpus, whereas gamma is the instantaneous value of forblaz… there’s no sense in which they are comparable.

    now what?

    Ultimately, for physical models we can at least start to talk about the dimension of the measurements, and there’s some commonality that is required: the structure of the models at least needs to predict measurements which have the same dimension. But this is totally not true for many social sciences, or medical issues. For example, what is the dimension of a 1-5 scale for how bad your headache is today, vs. how bad your tinnitus is today, vs. how satisfied you are in your job, vs. how sure of yourself you are in your interactions with coworkers? I mean… it’s a problem; in fact, it’s a big problem.

      • “there’s some commonality that is required… but this is totally not true for many social sciences, or medical issues”

        Not sure I understand this. If model 1 and model 2 are both models for the same y, I’d expect that you can get predictive distributions for y_new from both models. (Sure, I can conceive of a scenario where one model says that more data collection is impossible and the other says the opposite, but that very much seems like a corner case rather than a typical case.)

        • I guess what I meant there was that the interpretation of what a measurement y means can be pretty broad and vary heavily from one model to another, which is not as much the case when the measurement has physical dimensions.

          For example, if you ask “how satisfied are you in your career” one model might decide that the y is a combination of a reflection on an internal state, and a prime from the questionnaire itself, and an interaction from the priming with the subject’s gender… Whereas another model might take the answer as a face-value answer to the question about the internal state itself.

          in these two cases, the *meaning* of the data is fundamentally different based on different interpretations of what the measurement process does.

          Presumably we can predict new data from both models, but in the first model, we don’t actually care about the new data so much as we care about the internal state which is not an observable in the first model, but IS observable in the second.

        • We might, for example, conclude that although the first model badly predicts new data, it’s because it can’t do very well at predicting the effect of the prime and its interaction, but that it is actually predicting the “true” internal state better than the second model… but the person with the second model is free to ignore this statement and believe that it’s meaningless because there IS no hidden internal state, only observable output data.

          Although occasionally in physics we have some group thinking about the “true” meaning of the mass of particles, many times we can ignore that and treat a weight on a scale as directly proportional to a mass with a relatively well known gravitational constant, and two different models are likely to believe that the mass involved has fundamentally the same meaning.

  2. Yeech… First, I am not certain this is what Gabriel Marinello (who asked the original question of that post) had in mind. See e.g. his re-expressing of the null and alternative hypotheses as H0: p(θ0|x) > p(θ|x) and H1: p(θ0|x) < p(θ|x). No sign of integration there. Second, I do not see how the posterior density ratio p(θ0|x)/p(θ|x) can have a mathematical meaning, let alone a Bayesian significance. In a general model comparison, the two densities are about different entities, most likely against different and possibly orthogonal reference measures. The numerical value of the ratio is thus missing a scale. (This is also Daniel Lakeland’s point, I believe.) Furthermore, as indicated in our discussion of Murray Aitkin’s proposal, the distribution of this ratio has to be evaluated with respect to a joint distribution on both parameters θ0 and θ, which makes little sense if one model is eventually to be eliminated, since only the marginals should matter. From a Bayesian perspective, the ratio of the posteriors does not exist a priori, hence it is hard to endow it with a true prior. Further, its “posterior distribution” involves three replicas of the data, with at least one double use of the data.

    I thus stand by my earlier comments in our joint (!) paper. I wish I could write more but an Icelandic dinner is waiting for me! Hopefully without rotten shark.

    • “From a Bayesian perspective, the ratio of the posteriors does not exist a priori, hence it is hard to endow it with a true prior.”

      Taken in isolation, this claim puzzles me. The prior measure and sampling distribution together define a joint prior measure over the product of the parameter space and the sampling space. The ratio in question can be treated as a measurable function from that product space to the positive reals, i.e., as a random variable, and the prior measure will induce a probability distribution for that random variable.
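
      As a small check of this, the induced distribution can be simulated directly. A minimal sketch, assuming a conjugate normal model with made-up numbers (all values below are illustrative): draw (theta, y) from the joint prior measure and evaluate the posterior-density ratio at theta0 and at the drawn theta.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative conjugate normal model: theta ~ Normal(mu0, tau^2),
# ybar | theta ~ Normal(theta, sigma^2/n). The joint prior measure then
# induces a distribution for the ratio p(theta0|y)/p(theta|y), viewed
# as a measurable function of (theta, y).
sigma, mu0, tau, n, theta0 = 1.0, 0.0, 2.0, 5, 0.0

theta = rng.normal(mu0, tau, size=50_000)      # theta ~ prior
ybar = rng.normal(theta, sigma / np.sqrt(n))   # sufficient statistic | theta

# Conjugate posterior given ybar: theta | y ~ Normal(mu_post, sd_post^2)
prec = n / sigma**2 + 1 / tau**2
mu_post = (n * ybar / sigma**2 + mu0 / tau**2) / prec
sd_post = prec**-0.5

ratio = stats.norm.pdf(theta0, mu_post, sd_post) / stats.norm.pdf(theta, mu_post, sd_post)
print("induced prior P(ratio > 1) =", (ratio > 1).mean())
```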

      • p(theta0)/p(theta) is problematic because densities have dimensions of 1/[theta0] and 1/[theta]; the ratio therefore has units of [theta]/[theta0], which is free to be any constant we choose, since we’re free to measure theta and theta0 in any units. And since they potentially have different dimensions, we can’t just require that they both be in the same units and hence be dimensionless. (A numeric sketch of the related parameterization issue appears at the end of this thread.)

        • Nooo…. the theta variable in the denominator takes values in a specific set, and the fixed value theta0 in the numerator is an element of that same set. The ratio is dimensionless.

        • Hmm. I took X’s post to mean something about comparing across different models where the thetas are in different reference sets and hence have potentially different dimensions.

        • Well, I am glad of the lively discussion; it has been very enlightening. In the email I tried to make the point that a highest posterior density interval could be a better way to think about the problem, in a way applying the intuitive idea that if a particular value of the parameter is not in the posterior or confidence interval, it is not a good nested model and it is not worthwhile to apply model selection. However, if the ratio of posterior densities is not well defined, are HPD intervals properly defined?

        • Worse yet, even if you are considering two models in which the parameters theta come from the same reference set, and hence can be compared on a meaningful scale, typically in practice we only have each density up to a multiplicative constant, so comparing the densities isn’t possible: you need the ratio of the normalized densities for the comparison to make any sense.
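
        Related to the units discussion above: a quick numeric check (my own illustrative numbers, not from the thread) that the ratio of posterior densities at two parameter values is not invariant under a nonlinear change of variables, which is also why HPD intervals depend on the parameterization.

```python
from scipy import stats

# Illustrative posterior for theta > 0 (made-up choice)
post = stats.lognorm(s=0.5, scale=1.0)
t0, t1 = 0.8, 2.0

# Density ratio in the theta parameterization
r_theta = post.pdf(t0) / post.pdf(t1)

# Same posterior expressed in phi = log(theta):
# p_phi(phi) = p_theta(theta) * theta by the change-of-variables formula
r_phi = (post.pdf(t0) * t0) / (post.pdf(t1) * t1)

print(r_theta, r_phi)  # different values for the "same" comparison
```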

  3. I do think there is some lack of communication here, and that might be largely my fault for suggesting an alternative reference too soon.

    I did not see it as a “generic model choice problem” until Xi’an’s comment this morning.

    For clarity, my comment “Think X disagrees with him on the Infinity and Continuity” referred to what is given below.

    This footnote of Xi’an’s in http://arxiv.org/pdf/1303.5973v2.pdf would be a criticism of some of the work in Evans’s new book:
    “18 A solution to the measure-theoretic difficulty is to impose a version of Pi1 that is continuous at theta0 so that Pi1(theta0) is uniquely defined. It however equates the values of two density functions under two orthogonal measures.”

    I think I understand some of the mathematics that makes densities non-unique but not why that precludes purposely equating them to uniquely represent a statistical application. Part of this may be that I believe statistical applications only involve finite things.

  4. “In any case, I’d like to get away from the whole ‘null hypothesis’ thing here, and instead think of all of this as a way of quantifying the evidence in the data.”

    What does “evidence in the data” mean? It seems to me that evidence has to be of or for some hypothesis, theory, claim, etc… If posterior likelihood ratios are providing data-based evidence for (or against) the model in the numerator (or the denominator), or vice versa, then how are they getting us away from the whole null hypothesis thing in any meaningful way?
