(This post is by Yuling)
The likelihood principle is often phrased as an axiom in Bayesian statistics. It applies when we are (only) interested in estimating an unknown parameter $latex \theta$, and there are two data generating experiments both involving $latex \theta$, with observable outcomes $latex y_1$ and $latex y_2$ and likelihoods $latex p_1(y_1 \vert \theta)$ and $latex p_2(y_2 \vert \theta)$. If the outcome-experiment pairs satisfy $latex p_1(y_1 \vert \theta) \propto p_2(y_2 \vert \theta)$, viewed as functions of $latex \theta$, then the two experiments and the two observations provide the same information for inference about $latex \theta$.
Consider a classic example. Someone ran an A/B test and was only interested in the treatment effect. He told his manager that among all n=10 respondents, y=9 saw an improvement (assuming the metric is binary). It is natural to estimate the improvement probability $latex \theta$ with an independent Bernoulli trial likelihood: $latex y\sim \mathrm{binomial}(n=10, \theta)$. Informative priors could be added, but they are not relevant to our discussion here.
What is relevant is that the manager later found out that the experiment was not done as described. Instead of collecting data independently, the experiment was designed to keep recruiting respondents sequentially until $latex y=9$ were positive. The actual random outcome is n, while y is fixed, so the correct model is $latex n-y\sim$ negative binomial $latex (y=9, \theta)$.
Luckily, the likelihood principle kicks in, because binomial_lpmf $latex (y\vert n, \theta) =$ neg_binomial_lpmf $latex (n-y\vert y, \theta)$ + constant. No matter how the experiment was conducted, the inference remains invariant.
At the abstract level, the likelihood principle says that information about the parameters can be extracted only via the likelihood, not from experiments that could have been done but were not.
What can go wrong in model checking
The likelihood serves two purposes in Bayesian inference. For inference, it is just one component of the unnormalized density. But for model checking and model evaluation, the likelihood function is what enables the generative model to produce posterior predictions of y.
In the binomial/negative binomial example, it is fine to stop at the inference for $latex \theta$. But as soon as we want to check the model, we do need to distinguish between the two possible sampling distributions: which variable (n or y) is random.
Suppose we observe y=9 positive cases among n=10 trials, with the point estimate $latex \theta=0.9$. The likelihoods of the binomial and negative binomial models are
> y=9
> n=10
> dnbinom(n-y,y,0.9)
0.3486784
> dbinom(y,n, 0.9)
0.3874205
Not identical. But the likelihood principle does not require them to be identical: what is needed is a constant density ratio as a function of $latex \theta$, and that is easy to verify:
> dnbinom(n-y,y, prob=seq(0.05,0.95,length.out = 100))/dbinom(y,n, prob=seq(0.05,0.95,length.out = 100))
The result is a constant ratio, 0.9.
However, the posterior predictive check (PPC) will have different p-values:
> 1-pnbinom(n-y,y, 0.9)
0.2639011
> 1-pbinom(y,n, 0.9)
0.3486784
The difference between the PPC p-values can be even more dramatic with another $latex \theta$:
> 1-pnbinom(n-y,y, 0.99)
0.0042662
> 1-pbinom(y,n, 0.99)
0.9043821
Just very different!
Clearly, using the full Bayesian posterior of $latex \theta$ does not fix the issue either. The problem is that the likelihood principle ensures a constant ratio as a function of $latex \theta$, not as a function of $latex y_1$ or $latex y_2$.
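To make this point concrete, here is a minimal Python sketch mirroring the R snippets above (assuming a uniform prior, so the posterior of $latex \theta$ is Beta(10, 2), an assumption of this sketch): we draw $latex \theta$ from the posterior, simulate a replicated dataset under each sampling model, and compare the one-sided tail probabilities.

```python
import random

random.seed(1)
y_obs, n_obs = 9, 10
draws = 20000

def rbinom(n, p):
    # one binomial draw: count successes in n Bernoulli trials
    return sum(random.random() < p for _ in range(n))

def n_trials_nb(y, p):
    # negative binomial design: number of trials until y successes
    trials = successes = 0
    while successes < y:
        trials += 1
        if random.random() < p:
            successes += 1
    return trials

p_binom = p_nb = 0
for _ in range(draws):
    # posterior draw of theta under a uniform Beta(1,1) prior
    theta = random.betavariate(1 + y_obs, 1 + n_obs - y_obs)
    p_binom += rbinom(n_obs, theta) >= y_obs   # tail event under binomial design
    p_nb += n_trials_nb(y_obs, theta) >= n_obs # tail event under NB design
p_binom /= draws
p_nb /= draws
print(p_binom, p_nb)
```

The two posterior predictive tail probabilities still disagree by a wide margin, so averaging over the posterior of $latex \theta$ does not reconcile the two checks.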
Model selection?
Unlike the likelihood in the likelihood principle, which needs to be known only up to a constant, the marginal likelihood in model evaluation is required to be normalized.
In the previous A/B testing example, given data $latex (y,n)$, if we know that exactly one of the binomial and the negative binomial experiments was run, we may want to perform model selection based on the marginal likelihood. For simplicity, consider the point estimate $latex \hat \theta=0.9$. We then obtain a likelihood ratio of 0.9, slightly favoring the binomial model. In fact this marginal likelihood ratio equals the constant y/n, independent of the posterior distribution of $latex \theta$. If $latex y/n=0.001$, we would get a Bayes factor of 1000 favoring the binomial model.
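To see where the constant y/n comes from, write out the two densities at the observed data: $latex \frac{\text{neg-binomial}(n-y \mid y,\theta)}{\text{binomial}(y \mid n,\theta)} = \frac{\binom{n-1}{y-1}\theta^{y}(1-\theta)^{n-y}}{\binom{n}{y}\theta^{y}(1-\theta)^{n-y}} = \frac{\binom{n-1}{y-1}}{\binom{n}{y}} = \frac{y}{n},$ which is free of $latex \theta$ and therefore survives any averaging over the posterior; with y=9, n=10 it equals 0.9.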
Except this is wrong: it is not sensible to compare a likelihood on y with a likelihood on n.
What can go wrong in cross-validation
CV requires a loss function, and the same predictive density does not imply the same loss (L2 loss, interval loss, etc.). For concreteness, we adopt the log predictive density here.
CV also requires some part of the data to be exchangeable, which depends on the sampling distribution.
On the other hand, the computed LOO-CV of the log predictive density appears to depend on the data only through the likelihood. Consider two model-data pairs $latex M_1: p_1(\theta\vert y_1)$ and $latex M_2: p_2(\theta\vert y_2)$. We compute the LOO-CV by $latex \text{LOOCV}_1= \sum_i \log \int_\theta \frac{ p_\text{post} (\theta\vert M_1, y_1)}{ p_1(y_{1i}\vert \theta) } \left( \int_{\theta'} \frac{ p_\text{post} (\theta'\vert M_1, y_1)}{ p_1(y_{1i}\vert \theta') } d\theta'\right)^{-1} p_1 (y_{1i}\vert\theta)\, d\theta,$ where the bracketed factor normalizes the leave-one-out posterior; $latex \text{LOOCV}_2$ is obtained by replacing every subscript 1 with 2.
The likelihood principle does say that $latex p_\text{post} (\theta\vert M_1, y_1)=p_\text{post} (\theta\vert M_2, y_2)$, and if some generalized, pointwise likelihood principle further ensures $latex p_1 (y_{1i}\vert\theta)\propto p_2 (y_{2i} \vert\theta)$ for each i, then $latex \text{LOOCV}_1= \text{constant} + \text{LOOCV}_2$.
Sure, but that is an extra assumption. Arguably the pointwise likelihood principle is so strong an assumption that it would hardly hold beyond toy examples.
The basic form of the likelihood principle does not even have the notion of $latex y_i$, and it is possible that $latex y_1$ and $latex y_2$ have different sample sizes. Consider a meta-analysis of many polls, where each poll is a binomial model with $latex y_i\sim \mathrm{binomial}(n_i, \theta)$. If I have 100 polls, I have 100 data points. Alternatively, I can view the data as coming from $latex \sum_{i=1}^{100} n_i$ Bernoulli trials, and the sample size becomes $latex \sum_{i=1}^{100} n_i$.
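To make the ambiguity concrete, here is a small Python sketch with three invented polls (the counts are hypothetical): the two views give the same total log likelihood up to a constant (the log binomial coefficients), but they disagree on what counts as one pointwise term, which is exactly what a pointwise sum like LOO-CV depends on.

```python
import math

theta = 0.6
polls = [(3, 5), (4, 6), (2, 4)]  # hypothetical (y_i, n_i) for three polls

# View 1: one data point per poll (binomial terms)
ll_polls = [math.log(math.comb(n, y) * theta**y * (1 - theta)**(n - y))
            for y, n in polls]

# View 2: one data point per respondent (Bernoulli terms)
ll_bern = [math.log(theta)] * sum(y for y, n in polls) + \
          [math.log(1 - theta)] * sum(n - y for y, n in polls)

print(len(ll_polls), len(ll_bern))   # 3 terms vs 15 terms
print(sum(ll_polls) - sum(ll_bern))  # constant: sum of log binomial coefficients
```

The totals differ only by the constant $latex \sum_i \log \binom{n_i}{y_i}$, consistent with the likelihood principle, yet a leave-one-out sum over 3 terms is a different quantity from one over 15 terms.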
Finally, just as in the marginal likelihood case, even if all the conditions above hold, it is conceptually wrong to compare $latex \text{LOOCV}_1$ with $latex \text{LOOCV}_2$ regardless of the identity. They are scoring rules on two different spaces (probability measures on $latex y_1$ and $latex y_2$, respectively) and should not be compared directly.
PPC again
Although it is bad practice, we sometimes compare PPC p-values from two models for the purpose of model comparison. In the y=9, n=10, $latex \hat\theta=0.99$ case, we can compute the two-sided p-value $latex \min(\Pr(y_{sim} > y), \Pr(y_{sim} < y))$ for the binomial model and $latex \min(\Pr(n_{sim} > n), \Pr(n_{sim} < n))$ for the NB model, respectively:
> min(pnbinom(n-y,y, 0.99), 1-pnbinom(n-y,y, 0.99) )
0.0042662
> min( pbinom(y,n, 0.99), 1-pbinom(y,n, 0.99))
0.09561792
In the marginal likelihood and log score cases, we know we cannot directly compare two likelihoods or two log scores that live on two different sampling spaces. Here, the p-value is naturally normalized. Does that mean the NB model is rejected while the binomial model passes the PPC?
Still no: we should not compare p-values at all.
Model evaluation on the joint space
To avoid unfair comparisons of marginal likelihoods and log scores across two sampling spaces, a remedy is to consider the product space: both y and n are now viewed as random variables.
The binomial/negative binomial narratives specify two joint models: $latex p(n,y\vert \theta)= 1(n=n_{obs})\, p(y\vert n, \theta)$ and $latex p(n,y\vert \theta)= 1(y=y_{obs})\, p(n\vert y, \theta)$.
The ratio of these two densities admits only three values: 0, infinity, or the constant y/n (taking the negative binomial joint density over the binomial one).
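A quick numerical check of this trichotomy, in a Python sketch mirroring the earlier R snippets (the two joint densities below simply encode which margin is fixed by design):

```python
import math

y_obs, n_obs, theta = 9, 10, 0.9

def dbinom(y, n, p):
    return math.comb(n, y) * p**y * (1 - p)**(n - y) if 0 <= y <= n else 0.0

def dnbinom(f, size, p):
    # density of f failures before the size-th success
    return math.comb(f + size - 1, f) * p**size * (1 - p)**f if f >= 0 else 0.0

def joint_binom(n, y, p):   # binomial design: n fixed
    return dbinom(y, n, p) if n == n_obs else 0.0

def joint_nb(n, y, p):      # negative binomial design: y fixed
    return dnbinom(n - y, y, p) if y == y_obs else 0.0

for n, y in [(10, 9), (10, 5), (12, 9)]:
    p1, p2 = joint_binom(n, y, theta), joint_nb(n, y, theta)
    ratio = p2 / p1 if p1 > 0 else float('inf')
    print((n, y), ratio)  # 0.9 at the observed point; 0 or inf elsewhere
```

At the observed point (n, y) = (10, 9) the ratio is the constant y/n = 0.9; at any other point one of the two joint densities vanishes, giving 0 or infinity.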
If we observed several pairs of $latex (n, y)$, we could easily decide which margin is fixed. The harder problem is when we only observe one $latex (n,y)$. Based on the comparisons of marginal likelihoods and log scores in the previous sections, it seems both metrics would still prefer the binomial model (now viewed as a sampling distribution on the product space).
Well, that is almost correct, except that 1) the sample log score is not meaningful with only one observation, and 2) we need some prior on the models to go from marginal likelihoods to a Bayes factor. After all, under either sampling model, the event admitting nontrivial density ratios, $latex 1(y=y_{obs}) 1(n=n_{obs})$, has measure zero. It is legitimate to do model selection/comparison on the product space, but at this point we could do whatever we want without affecting any property in the almost-sure sense.
Some causal inference
In short, the convenient inference invariance granted by the likelihood principle also makes it hard to practice model selection and model evaluation: the latter two rely on the sampling distribution in addition to the likelihood.
To make this blog post more confusing, I would like to draw some remote connection to causal inference.
Assume we have data (binary treatment z, outcome y, covariate x) from a known model M1: y = b0 + b1 z + b2 x + iid noise. If the model is correct and there are no unobserved confounders, we estimate the treatment effect of z by b1.
The unconfoundedness assumption requires that y(z=0) and y(z=1) are independent of z given x. This assumption is only a statement about the causal interpretation; it never appears in the sampling distribution or the likelihood. Suppose there does exist a confounder c, and the true data generating process is M2: y | (x, z, c) = b0 + b1 z + b2 x + c + iid noise, with z | c = c + another iid noise. Marginalizing out c (because we cannot observe it in the data collection), M2 becomes y | x, z = b0' + b1' z + b2 x + iid noise, the same form as M1 (with shifted coefficients, since c is correlated with z). Therefore, (M1, (x, y, z)) and (M2, (x, y, z)) form an experiment-data pair on which the likelihood principle holds. It is precisely the otherwise lovely likelihood principle that excludes any method of testing the unconfoundedness assumption.
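As a hypothetical numerical sketch (all coefficient values invented for illustration), we can simulate from M2 and fit M1 by ordinary least squares. The fit is perfectly well behaved, yet under these invented values the coefficient on z converges to roughly b1 + 1/2 rather than the causal effect b1, and nothing in the likelihood signals the problem.

```python
import random

random.seed(0)
N = 20000
b0, b1, b2 = 0.5, 1.0, 2.0   # hypothetical true coefficients in M2

data = []
for _ in range(N):
    c = random.gauss(0, 1)            # unobserved confounder
    x = random.gauss(0, 1)
    z = c + random.gauss(0, 1)        # treatment depends on c
    y = b0 + b1 * z + b2 * x + c + random.gauss(0, 1)
    data.append((1.0, z, x, y))       # (intercept, z, x, outcome)

# OLS for M1: y ~ 1 + z + x, via the normal equations X'X beta = X'y
XtX = [[sum(r[i] * r[j] for r in data) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * r[3] for r in data) for i in range(3)]

# solve the 3x3 system by Gauss-Jordan elimination
A = [row[:] + [rhs] for row, rhs in zip(XtX, Xty)]
for i in range(3):
    piv = A[i][i]
    A[i] = [v / piv for v in A[i]]
    for k in range(3):
        if k != i:
            factor = A[k][i]
            A[k] = [vk - factor * vi for vk, vi in zip(A[k], A[i])]
beta = [A[i][3] for i in range(3)]
print(beta)  # beta[1] lands near 1.5 here, not the causal b1 = 1
```

The regression recovers b2 accurately but returns a biased coefficient on z; the bias Cov(z, c)/Var(z) = 1/2 under these invented values is invisible to any check based on the likelihood of the observed (x, y, z).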