This post is by Leonardo Egidi.
This Bayesian fairy tale starts in July 2016 and will reveal some mysteries of the magical world of mixtures.
Opening: the airport idea
Once upon a time a young Italian statistician was dragging his big luggage through JFK airport towards the security gates. He suddenly started thinking about how to elicit a prior distribution that is flexible but at the same time contains historical information about past similar events. Thus, he wondered, “why not use a mixture of a noninformative and an informative prior in some applied regression problems?”
Magic: mixtures like three-headed Greek monsters
The guy fell in love with mixtures many years ago: the weights, the multimodality, the ‘multi-heads’ characteristic…like Fluffy, the gigantic, monstrous three-headed dog once cared for by Rubeus Hagrid in the first Harry Potter novel. Or Cerberus, the three-headed monster of Greek mythology guarding the entrance to the underworld over which the god Hades reigned. “Mixtures are so similar to Greek monsters and so full of poetic charm, aren’t they?!”
Of course his idea was not new at all: spike-and-slab priors are very popular in Bayesian variable selection and in clinical trials to avoid prior-data conflicts and get robust inferential conclusions.
He set this thought aside for some weeks, focusing on other statistical problems. However, some months later the American statistical wizard Andrew wrote an inspiring blog entry about prior choice recommendations:
What about the choice of prior distribution in a Bayesian model? The traditional approach leads to an awkward choice: either the fully informative prior (wildly unrealistic in most settings) or the noninformative prior which is supposed to give good answers for any possible parameter values (in general, feasible only in settings where data happen to be strongly informative about all parameters in your model).
We need something in between. In a world where Bayesian inference has become easier and easier for more and more complicated models (and where approximate Bayesian inference is useful in large and tangled models such as recently celebrated deep learning applications), we need prior distributions that can convey information, regularize, and suitably restrict parameter spaces (using soft rather than hard constraints, for both statistical and computational reasons).
This blog post gave him a lot of energy by reinforcing his old idea. So, he wondered, “what’s better than a mixture to represent a statistical compromise about a prior belief, combining a fully informative prior with a noninformative prior, weighted somehow in an effective way?”. As the ancient Romans used to say, in medio stat virtus. But he still needed to dig into the land of mixtures to discover some little treasures.
Obstacles and tasks: mixtures’ open issues
Despite their wide use in theoretical and applied frameworks, as far as he knew from the current literature no statistician had explored the following issues about mixture priors:
- how to compute a measure of global informativity yielded by the mixture prior (such as a measure of effective sample size, according to this definition);
- how to specify the mixture weights in a proper and automatic way (and not, say, only by fixing them upon historical experience, or by assigning them a vague hyperprior distribution) in some regression problems, such as clinical trials.
He wrestled with these questions during that cold winter. After some months he dug something out of the earth:
- the effective sample size (ESS) yielded by a mixture prior never exceeds the information carried by any individual component density of the mixture;
- Theorem 1 here quantifies the role played by the mixture weights in reducing the prior-data conflict we can expect when using the mixture prior rather than the informative prior alone. “So, yes, mixture priors are more robust also from a theoretical point of view! Until now we only knew this heuristically”.
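As a quick numerical sanity check of the first point (a toy illustration, not the formal ESS computation of the paper), we can compare the prior variance of the two-component normal mixture used later in this post with the variance of its informative component: the mixture is more dispersed, hence carries less information.

```r
# Mixture prior used later in the post: 0.8 * N(0, 2.5) + 0.2 * N(2, 0.8)
w     <- c(0.8, 0.2)   # mixture weights
mu    <- c(0, 2)       # component means
sigma <- c(2.5, 0.8)   # component standard deviations

# Mean and variance of a normal mixture (law of total variance)
mix_mean <- sum(w * mu)
mix_var  <- sum(w * (sigma^2 + mu^2)) - mix_mean^2

# The mixture is more dispersed than its informative component,
# i.e. it conveys less information:
mix_var > 0.8^2  # TRUE (mix_var is about 5.77, versus 0.64)
```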
Happy ending: a practical regression case
Similarly to the bioassay experiment analyzed by Gelman et al. (2008), he considered a small-sample example to highlight the role played by different prior distributions, including a mixture prior, in the posterior analysis. Here is the example.
Consider a dose-response model to assess immune-depressed patients’ survival according to an administered drug x, a factor with levels from 0 (placebo) to 4 (highest dose). The survival y, registered one month after the drug is administered, is coded as 1 if the patient survives, 0 otherwise. The experiment is first performed on the sample of patients y1 at time t1, where fewer than 50% of the patients survive, and then repeated on the sample y2 at time t2, where all but one patient dies; y1 and y2 are non-overlapping samples of patients.
The aim of the experimenter is to use the information from the first sample y1 to draw inferential conclusions about the second sample y2. However, the two samples are quite different from each other in terms of survival. From a clinical point of view we have two possible naive interpretations for the second sample:
- the drug is not effective, even if there was a positive effect for y1;
- regardless of the drug, the first group of patients had a much better health condition than the second one.
Both appear to be quite extreme clinical conclusions; moreover, our information is scarce, since we do not have any other influential clinical covariate, such as sex, age, or presence of comorbidities.
Consider the following data where the sample size for the two experiments is N=15:
library(rstanarm)
library(rstan)
library(bayesplot)
n <- 15
y_1 <- c(0,0,0,0,0,0,0,0,1,1,0,1,1,1,1)
y_2 <- c(1, rep(0, n-1))
Given pi ≡ Pr(yi = 1), we fit a logistic regression logit(pi) = α+βxi to the first sample, where the parameter β is associated with the administered dose of the drug, x. The five levels of the drug are randomly assigned to groups of three people each.
x <- c(rep(0,3), rep(1,3), rep(2,3), rep(3,3), rep(4,3))
fit <- stan_glm(y_1 ~ x, family = binomial)
print(fit)
## stan_glm
## family: binomial [logit]
## formula: y_1 ~ x
## observations: 15
## predictors: 2
## ------
##             Median MAD_SD
## (Intercept) -4.8    2.0
## x            2.0    0.8
##
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
Using weakly informative priors, the drug appears effective at t1: the parameter β is positive, with posterior median 2.0 and MAD SD 0.8, meaning that there is a positive effect of 2.0 on the log-odds of survival for each additional unit of the dose.
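The effect is perhaps easier to read on the odds scale (a quick check in R, plugging in the posterior median above):

```r
# A one-unit increase in the dose multiplies the survival odds by exp(beta);
# with posterior median beta = 2.0:
exp(2.0)  # roughly 7.4
```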
Now we fit the same model to the second sample, according to three different priors for β, reflecting three different ways to incorporate/use the historical information about y1:
- weakly informative prior β ∼ N(0, 2.5) ⇒ scarce historical information about y1;
- informative prior β ∼ N(2, 0.8) ⇒ relevant historical information about y1;
- mixture prior β ∼ 0.8×N(0, 2.5)+0.2×N(2, 0.8) ⇒ weighted historical information (0.2).
(We skip the details about the choice of the mixture weights for the third prior; see here for further details.)
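For concreteness, the mixture prior in the third bullet can be written down directly as a density (a minimal base-R sketch; the second and third arguments of `dnorm` are the mean and the standard deviation):

```r
# Mixture prior density for beta: 0.8 * N(0, 2.5) + 0.2 * N(2, 0.8)
mix_prior <- function(beta) {
  0.8 * dnorm(beta, mean = 0, sd = 2.5) + 0.2 * dnorm(beta, mean = 2, sd = 0.8)
}

# A convex combination of densities is still a proper density:
integrate(mix_prior, -Inf, Inf)$value  # approximately 1
```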
fit2weakly <- stan_glm(y_2 ~ x, family = binomial)
fit2inf <- stan_glm(y_2 ~ x, family = binomial,
                    prior = normal(fit$coefficients[2],
                                   fit$stan_summary[2,3]))
x_stand <- (x - mean(x))/(5*sd(x))  # standardize the dose covariate
p1 <- prior_summary(fit)
p2 <- prior_summary(fit2weakly)
stan_data <- list(N = n, y = y_2, x = x_stand,
                  mean_noninf = as.double(p2$prior$location),
                  sd_noninf = as.double(p2$prior$adjusted_scale),
                  mean_noninf_int = as.double(p2$prior_intercept$location),
                  sd_noninf_int = as.double(p2$prior_intercept$scale),
                  mean_inf = as.double(fit$coefficients[2]),
                  sd_inf = as.double(fit$stan_summary[2,3]))
fit2mix <- stan('mixture_model.stan', data = stan_data)
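The file `mixture_model.stan` is not shown in the post; a plausible sketch of what it might contain, assuming the 0.8/0.2 weights are hard-coded in the model block, is:

```stan
// Sketch of mixture_model.stan (an assumption: the actual file is not shown)
data {
  int<lower=0> N;
  int<lower=0, upper=1> y[N];
  vector[N] x;
  real mean_noninf;      real<lower=0> sd_noninf;
  real mean_noninf_int;  real<lower=0> sd_noninf_int;
  real mean_inf;         real<lower=0> sd_inf;
}
parameters {
  real alpha;
  real beta;
}
model {
  // Weakly informative prior on the intercept
  alpha ~ normal(mean_noninf_int, sd_noninf_int);
  // Mixture prior on the slope: 0.8 * noninformative + 0.2 * informative
  target += log_mix(0.2,
                    normal_lpdf(beta | mean_inf, sd_inf),
                    normal_lpdf(beta | mean_noninf, sd_noninf));
  // Logistic regression likelihood
  y ~ bernoulli_logit(alpha + beta * x);
}
```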
Let’s figure out now what the three posterior distributions suggest.
Remember that in the second sample all but one patient dies, but we do not actually know why: the informative and the weakly informative analyses suggest almost opposite conclusions about the drug’s efficacy, both of them quite unrealistic:
- the ‘informative posterior’ suggests a non-negligible positive effect of the drug ⇒ possible overestimation;
- the ‘weakly informative posterior’ suggests a strong negative effect ⇒ possible underestimation;
- the ‘mixture posterior’, which captures the prior-data conflict between the prior on β suggested by y1 and the sample y2 and lies in the middle, is more conservative and likely more reliable for the second sample in terms of clinical justification.
In this application a mixture prior combining the two extremes, the fully informative prior and the weakly informative prior, can realistically average over them and represent a sound compromise (similar examples are illustrated here) for obtaining robust inferences.
The moral lesson
The fairy tale is over. The statistician is now convinced: after digging in the land of mixture priors he found another relevant case of prior regularization.
And the moral lesson is that we should stay in between, paraphrasing the ancient in medio stat virtus, when we have small sample sizes and possible conflicts between past historical information and current data.