(This post is by Yuling)

Gregor Pirš, Aki, Andrew, and I wrote:

Stacking is a widely used model averaging technique that yields asymptotically optimal predictions among linear averages. We show that stacking is most effective when the model predictive performance is heterogeneous in inputs, so that we can further improve the stacked mixture by a full-Bayesian hierarchical modeling. With the input-varying yet partially-pooled model weights, hierarchical stacking improves average and conditional predictions. Our Bayesian formulation includes constant-weight (complete-pooling) stacking as a special case. We generalize to incorporate discrete and continuous inputs, other structured priors, and time-series and longitudinal data. We demonstrate on several applied problems.

As per the spirit of hierarchical modeling, if some quantity could vary in the population, then it should vary. Hierarchical stacking is an approach that combines models with input-varying yet partially-pooled model weights. Besides all the points we made in the paper, I personally like our new method for two reasons:

- We started this paper from practical motivations. Last year I was trying to use model averaging to help epidemiological predictions. Complete-pooling stacking was a viable solution, until I realized that a model good at predicting new daily cases in New York state was not necessarily good at predictions in Paris. In general, how well a model fits the data depends on what input location we condition on: all models are wrong, but some are *somewhere* useful. The present paper aims to learn where that *somewhere* is, and to construct a local model averaging in which the model weight varies across the input space. This extra flexibility comes with additional complexity, which is why we need hierarchical priors for regularization. The graph above is an example in which we average a sequence of election forecast models and allow the model weight to vary by state.
- In addition to this dependence on inputs, our new approach also features a full-Bayesian formulation, whereas stacking was originally viewed as an optimization problem, or an optimization approach to a Bayesian decision problem. For a simple model like linear regression with big n, it probably makes little difference whether we use the MAP, the MLE, or average over the full posterior distribution. But for complex models, this full-Bayesian approach is appealing because it
- makes hierarchical shrinkage easier: instead of an exhaustive grid search over tuning parameters, we use gradient-informed MCMC;
- provides built-in uncertainty estimation of the model weights;
- facilitates generative modeling and open-ended data inclusion: when running hierarchical stacking on election predictions, we encode prior correlations between states to further stabilize the model weights in small states;
- enables post-processing, model checking, and approximate LOO-CV as in a usual Bayesian model.
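To make the formulation concrete, here is a minimal numpy sketch of the hierarchical stacking posterior for one discrete input (e.g. model weights varying by state). This is not the paper's Stan implementation; the function names, the toy `lpd` matrix of leave-one-out log predictive densities, and the cell labels are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_weights(alpha):
    """Map unconstrained (J, K-1) parameters to (J, K) simplex weights,
    with the K-th model as the reference category."""
    z = np.concatenate([alpha, np.zeros((alpha.shape[0], 1))], axis=1)
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def log_posterior(alpha, mu, sigma, lpd, cell):
    """Unnormalized log posterior of hierarchical stacking with one
    discrete input ('cell'):
      - log score:       sum_i log( sum_k w[cell(i), k] * exp(lpd[i, k]) )
      - partial pooling: alpha[j, k] ~ normal(mu[k], sigma[k])
    lpd[i, k] is the LOO log predictive density of model k at point i."""
    w = softmax_weights(alpha)
    loglik = np.log((w[cell] * np.exp(lpd)).sum(axis=1)).sum()
    logprior = (-0.5 * ((alpha - mu) / sigma) ** 2 - np.log(sigma)).sum()
    return loglik + logprior

# toy setup: n = 40 points, J = 4 cells, K = 3 models
n, J, K = 40, 4, 3
lpd = rng.normal(-1.0, 0.5, size=(n, K))
cell = rng.integers(0, J, size=n)
alpha = np.zeros((J, K - 1))        # at alpha = 0, every cell gets equal weights
w = softmax_weights(alpha)
print(w[0], log_posterior(alpha, np.zeros(K - 1), np.ones(K - 1), lpd, cell))
```

In the full method this log posterior would be handed to gradient-based MCMC (e.g. Stan) rather than evaluated at a single point; complete-pooling stacking is recovered as the limit where sigma goes to zero.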

Regarding the second reason, we justify why this full-Bayesian inference makes sense in our formulation (via data augmentation). Looking forward, the general principle of converting a black-box learning algorithm into a generative model (or equivalently, a chunk of Stan code) could be fruitful for many other methods (existing examples include KNN and SVM). The generative model needs to be specified on a case-by-case basis. Take linear regression for example: the least-squares estimate of the regression coefficients is equivalent to the MLE in a normal-error model. But we should not directly take the negative loss function as a log density and sample from a density proportional to exp(−loss). The latter density differs from the “full Bayesian inference” of the posited normal model unless σ is a known constant 1. From the “marginal data augmentation” point of view, σ is a working parameter that needs to be augmented in the density and averaged over in the inference; in general there could be a nuisance normalizing constant, and that is when some path sampling is useful. Because of this additional challenge, we leave other scoring rules (interval scores, CRPS) mostly untouched in this paper, but they could be interesting to investigate, too.
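The σ point above can be checked numerically. In the toy no-intercept regression below (all numbers hypothetical), treating −loss/2 as a log density reproduces the normal-error log likelihood exactly when σ = 1, but not for other σ, because the σ-dependent normalizing constant no longer cancels:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)          # true slope 2, unit noise

def sq_loss(beta):
    """Least-squares loss of a toy no-intercept regression."""
    return np.sum((y - beta * x) ** 2)

def normal_loglik(beta, sigma):
    """Log likelihood of the posited normal-error model with scale sigma."""
    return -0.5 * n * np.log(2 * np.pi * sigma ** 2) - sq_loss(beta) / (2 * sigma ** 2)

b1, b2 = 1.0, 2.0
# compare log-density *differences* between two beta values (constants cancel)
d_loss = -0.5 * (sq_loss(b1) - sq_loss(b2))     # treating -loss/2 as the log density
d_sig1 = normal_loglik(b1, 1.0) - normal_loglik(b2, 1.0)
d_sig2 = normal_loglik(b1, 2.0) - normal_loglik(b2, 2.0)
print(d_loss, d_sig1, d_sig2)   # the first two agree exactly; the third does not
```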

This preprint is the third episode of our Bayesian model averaging series (previously we’ve had complete-pooling stacking and combining non-mixing posterior chains).

Thanks for posting, Yuling. I think this work is really exciting, and I also want to refer readers to this 2014 paper by Kaniav Kamary, Kerrie Mengersen, Christian Robert, and Judith Rousseau, “Testing hypotheses via a mixture estimation model.” (I don’t like the testing hypothesis part but I like the way they set up an artificial mixture model.)

Just to echo, in the paper we discussed the connection between our method and this mixture approach, as well as some other mixture-of-mixture modeling approaches.

I also realize our new path sampling paper is a counterpart of this 2014 paper, in which we change from using-a-linear-mixture-to-test-hypothesis to using-a-geometric-mixture-to-test-hypothesis.

How does this compare to a regular Bayesian model where the mixture parameters are just part of the likelihood?

Formulating things like this would necessarily leave out the leave-one-out-cross-validation angle.

You may have this in here but easier to ask :D. Quite dense paper

Ben:

The paper is dense because it’s full of Yuling’s ideas. When I write a paper I have one idea and I stretch it out to 30 pages. Yuling will write a 30-page paper with 30 different ideas.

Ben, we discussed the linear mixture / mixture of experts in the paper. Besides the computational burden, a full linear mixture often performs poorly when models are similar: it is hard to distinguish between y = beta exp(x)^2 and y = beta exp(x)^1.8.
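A quick check of how similar those two model predictions really are, on a hypothetical grid x in [0, 1]:

```python
import numpy as np

# hypothetical grid, just to gauge how similar the two regression functions are
x = np.linspace(0.0, 1.0, 100)
f1 = np.exp(x) ** 2.0     # model 1: beta * exp(x)^2   (beta = 1 here)
f2 = np.exp(x) ** 1.8     # model 2: beta * exp(x)^1.8
r = np.corrcoef(f1, f2)[0, 1]
print(r)   # very close to 1: the two predictors are nearly collinear,
           # so the mixture weights in a joint fit are weakly identified
```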

> performs poorly when models are similar: it is hard to distinguish between

Is it that it samples slowly or the predictive performance is bad?

I guess another thing — what happens if you just do this as a two stage fit but on the second stage you aren’t doing the cross-validation thing? So first fit your models with all data, and then fit a mixture with all data?

Ben, wouldn’t that be a recipe for massive overfitting?
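A toy sketch of the mechanism, with entirely hypothetical log predictive densities: if the mixture weights are chosen to maximize the in-sample score rather than a cross-validated score, the weight piles onto whichever model fits the training data best, even if it predicts poorly out of sample:

```python
import numpy as np

def stack_weight(lpd):
    """Grid-search the weight w of model 1 (two models) that maximizes the
    stacked log score sum_i log(w * p_1i + (1 - w) * p_2i)."""
    p = np.exp(lpd)
    grid = np.linspace(0.0, 1.0, 1001)
    scores = [np.log(w * p[:, 0] + (1 - w) * p[:, 1]).sum() for w in grid]
    return grid[int(np.argmax(scores))]

rng = np.random.default_rng(2)
n = 200
# hypothetical log predictive densities: model 1 overfits, so it looks great
# in sample but poor under LOO; model 2 is stable either way
insample = np.column_stack([rng.normal(-0.2, 0.1, n), rng.normal(-1.0, 0.1, n)])
loo      = np.column_stack([rng.normal(-2.0, 0.1, n), rng.normal(-1.0, 0.1, n)])
print(stack_weight(insample), stack_weight(loo))   # near 1.0 vs near 0.0
```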

This is very interesting. A quick question, which I didn’t see addressed (though it’s a pretty dense paper, so I might have missed it): you say: “Instead of choosing fixed values, we view μ and σ as hyperparameters and aim for a full Bayesian solution: to describe the uncertainty of all parameters by their joint posterior distribution p(α, μ, σ|D), letting the data to tell how much regularization is desired.

To accomplish this Bayesian inference, we assign a hyperprior to (μ, σ).

hyperprior: μ_k ∼ normal(μ_0, τ_μ), σ_k ∼ normal+(0, τ_σ), k = 1, …, K−1.”

How do you pick which precise numbers to plug in for the parameters in these hyperpriors? In several of the examples, the difference in terms of mean test log predictive accuracy between normal (complete pooling) stacking and hierarchical stacking looks pretty small… do the results depend sensitively on the precise choice of hyperprior?

In all discrete-input experiments we use standard normal hyperpriors. It is an ad hoc choice and might be further tuned. Arguably the inference is less sensitive to the hyperpriors than to the priors.

I’m sure you’re right about that.

On an unrelated note, it occurred to me that standard Bayesian conditional modeling leaves open the possibility that model probabilities can (and in general should) depend on the input, since $p(y|x) = sum_i p(y|x, M_i)p(M_i|x)$, i.e. the model probabilities are (implicitly) conditional on $x$. Yet, I’ve never seen anyone try to do Bayesian model averaging in a way that is input-sensitive.

Olav:

That’s the whole point of our paper, that the weights now depend on x.

I know, you make the stacking weights depend on the input. I was just pointing out that it’s already implicit in the standard Bayesian conditional modeling framework (that uses the posterior probabilities of models as weights) that the model probabilities strictly speaking should depend on the input, in general.

This is an incredibly rich paper that I’ve returned to several times in the last few days. I find it very thought-provoking, so here are a few more questions and comments that have occurred to me:

1. As I said in my above post, it is implicit in standard Bayesian conditional modeling that the posterior probabilities of models can depend on the input, even though no one (prior to you, as far as I know) has taken advantage of this before. But it occurred to me that the same is true even of parametric inference inside a model. In the same way that one model may be accurate on some parts of the input space but not on others, it might also happen that a certain region of the parameter space (inside a given model) is more predictively accurate (or better supported) than another region of the parameter space, depending on the input. Implicitly, in conditional modeling, all the x-inputs are usually “pooled” in the sense that p(theta | x) = p(theta | x’), but they don’t necessarily have to be. On the other extreme, one might fit the same model separately for different regions of the input space, thereby getting different locally optimized fitted models. If the input space is subdivided too finely, this isn’t really going to work, since the model fitting will be highly unstable (in the limit, the parameters might not even be identifiable). But is there a way of compromising between the complete pooling option and the separate fitting option for inference inside a model by using ideas similar to the ones you use in your paper? Or is this just not possible or not worth it? Is this something you have thought about?

2. You show how the problem of optimizing an objective function can be transformed, on a higher level, into a Bayesian inference problem (in the case of model stacking with the log score). This is very interesting, of course. However, I think it’s also true that Bayesian inference problems can, in turn, often be regarded as optimization problems on an even higher level. This is because, given a likelihood p(x|M) and prior p(M), the posterior p_x(M) will be the distribution q that uniquely minimizes the objective function Opt(q) = −E_q[log p(x|M)] + KL[q, p(M)], i.e. the posterior distribution will be the distribution that assigns high probability to models/hypotheses that have a good log likelihood score on the data, while at the same time having a small KL divergence from the prior (this is the classical result in https://www.jstor.org/stable/687182?seq=1 ). Regarding Bayesian inference as an optimization problem has been explored fruitfully in several recent papers (e.g. https://rss.onlinelibrary.wiley.com/doi/full/10.1111/rssb.12158 and https://www.mdpi.com/1099-4300/20/6/442 ). The interplay between these two perspectives on inference is quite interesting, I think. (This is more of a comment than a question.)

3. I note that you exclusively use LOO CV in your paper. Is the reason for this just that it is computationally efficient (given the approximations available), or have you also tried 5-fold or 10-fold CV? I ask because (repeated) 5-fold or 10-fold CV is often much better than LOO CV, at least in many other problems, so I was wondering if things are different when it comes to Bayesian model stacking.
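The classical result mentioned in point 2 is easy to check numerically on a finite model space. In the sketch below the likelihood values are arbitrary; the exact posterior over K models should attain a lower value of the variational objective than any alternative distribution:

```python
import numpy as np

rng = np.random.default_rng(3)
K = 5
prior = np.full(K, 1.0 / K)
loglik = rng.normal(0.0, 1.0, K)        # arbitrary log p(x | M_k) values

post = prior * np.exp(loglik)           # exact Bayes posterior over the K models
post /= post.sum()

def opt(q):
    """Variational objective: expected negative log likelihood under q
    plus the KL divergence from q to the prior."""
    return -(q * loglik).sum() + (q * np.log(q / prior)).sum()

# the exact posterior should attain a lower objective than random alternatives
alts = rng.dirichlet(np.ones(K), size=1000)
print(opt(post), min(opt(q) for q in alts))
```

Algebraically, opt(post) equals −log of the marginal likelihood, and opt(q) − opt(post) is exactly KL[q, post], which is why the posterior is the unique minimizer.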