Bayesian hierarchical stacking: Some models are (somewhere) useful

Yuling Yao, Gregor Pirš, Aki Vehtari, and I write:

Stacking is a widely used model averaging technique that asymptotically yields optimal predictions among linear averages. We show that stacking is most effective when model predictive performance is heterogeneous in inputs, and we can further improve the stacked mixture with a hierarchical model. We generalize stacking to Bayesian hierarchical stacking. The model weights are varying as a function of data, partially-pooled, and inferred using Bayesian inference. We further incorporate discrete and continuous inputs, other structured priors, and time series and longitudinal data. To verify the performance gain of the proposed method, we derive theory bounds, and demonstrate on several applied problems.

What I really want you to do is read section 3.1, All models are wrong, but some are somewhere useful, where Yuling describes how and why a mixture of wrong models can give better predictive performance than a correct model, even if such a correct model is included in the set of candidates.

The background of this paper is that for a long time it’s been clear that when doing predictive model averaging, it can make sense to have the weights vary by x (see for example figure 12 of this paper)—as Yuling notes, different models can be good at explaining different regions in the input-response space, which is why model averaging can work better than model selection—but it’s tricky because if you let the weights vary without regularization, you’ll just overfit.
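To give a flavor of the idea, here's a minimal sketch in Python (my own toy version, not the implementation in the paper; the simulated log scores and the penalty scale tau are made up): each model's weight is a softmax of a linear function of x, and a Gaussian penalty on the slopes shrinks toward constant weights, which is the partial-pooling idea in miniature.

```python
# Minimal sketch (not the paper's implementation): input-dependent stacking
# weights w_k(x) = softmax(mu_k + beta_k * x), with a Gaussian penalty on the
# slopes beta that shrinks toward constant weights -- partial pooling in miniature.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, K = 200, 3                          # data points, candidate models
x = rng.uniform(-2, 2, size=n)

# lpd[i, k]: leave-one-out log predictive density of model k at point i
# (simulated here so that different models do well in different regions of x).
lpd = np.column_stack([-(x - c) ** 2 for c in (-1.0, 0.0, 1.0)])

def neg_objective(params, tau=1.0):
    mu, beta = params[:K], params[K:]
    scores = mu + np.outer(x, beta)                               # n x K
    logw = scores - np.logaddexp.reduce(scores, axis=1, keepdims=True)
    # stacking objective: sum_i log( sum_k w_k(x_i) * exp(lpd[i, k]) )
    loglik = np.logaddexp.reduce(logw + lpd, axis=1).sum()
    penalty = 0.5 * np.sum(beta ** 2) / tau ** 2                  # shrink slopes to 0
    return -(loglik - penalty)

fit = minimize(neg_objective, np.zeros(2 * K), method="L-BFGS-B")
mu_hat, beta_hat = fit.x[:K], fit.x[K:]
print("weight slopes in x:", np.round(beta_hat, 2))   # nonzero: weights vary with x
```

With a small tau the weights are pulled toward a single global set of weights; with a large tau they are free to chase x, which is exactly where the overfitting risk comes from.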

Also of interest is section 3.2, on the conditions under which stacking is most helpful. The work here is reminiscent of ideas from psychometrics, where if we want to average several tests to measure some ability, you don’t want tests whose scores are 100% correlated, but it also would not make sense for the tests to have zero correlation. Like many things in statistics, it’s subtle.

This paper combines lots of interesting ideas, including
– Stacking (Wolpert, 1992; Breiman, 1996), an early ensemble learning method;
– Bayesian leave-one-out cross-validation, an idea that Aki and his collaborators have been thinking about since at least 2002;
– Treating a computational problem as a statistics problem, an idea I associate with Xiao-Li Meng and his collaborators (see for example this 2003 paper by Kong, McCullagh, Meng, Nicolae, and Tan);
– Statistical workflow and model understanding (“interpretable AI”);
– Flexible Bayesian modeling (it’s all about the prior for the vector of weights), which is where lots of the technical difficulties arise (and which we can easily do, thanks to Stan);
– And, of course, multilevel modeling: good enough for Laplace, good enough for us!

It’s a pleasure to announce two such exciting papers in the same week, and I’m very lucky in my collaborators (which in turn comes from being well situated at a top university, having the time to participate in all this work, living in a country that offers generous government support for research, etc etc).

15 thoughts on “Bayesian hierarchical stacking: Some models are (somewhere) useful”

  1. > model understanding (“explainable AI”)
    You might wish to use the phrase interpretable rather than explainable AI (XAI), as the latter often refers to giving up on trying to understand the model actually in use and instead understanding a highly correlated prediction model. But correlation is not causation.

  2. This one’s tricky philosophically

    Yuling describes how and why a mixture of wrong models can give better predictive performance than a correct model

    We have to be very careful about what “correct model” and “predictive performance” mean here. Let’s say we’re given the true sampling distribution p(y | theta) of data y given parameters theta. If we knew the value of the parameters theta, then p(y | theta) provides the best predictive performance, at least if we’re evaluating with proper scoring metrics like log loss or squared error (see Gneiting et al.’s papers on calibration and on proper scoring metrics).

    In practice, we don’t know the true parameters theta. What we have instead is a finite pile of data y_obs and some prior knowledge represented as a prior p(theta). What we have for inference is the posterior predictive distribution

    p(y | y_obs) = ∫ p(y | theta) p(theta | y_obs) dtheta.

    Because the posterior p(theta | y_obs) will not be a delta function at the true parameter value theta, p(y | y_obs) will not be the same as p(y | theta) for the true parameters theta. Therefore we know that Bayesian posterior predictive inference is sub-optimal compared to the true data generating process. We only know that, under some reasonably mild conditions, as the size of y_obs grows, the posterior p(theta | y_obs) approaches a delta function around the true value of theta, and thus the posterior predictive distribution p(y | y_obs) approaches the true data generating process p(y | theta).
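    Concretely, that integral is what you approximate in practice by averaging the likelihood over posterior draws; here's a toy sketch with a conjugate normal model (all numbers made up, nothing from the paper):

```python
# Monte Carlo version of the posterior predictive integral above:
# p(y | y_obs) ~= (1/S) * sum_s p(y | theta_s), with theta_s ~ p(theta | y_obs).
# Toy conjugate normal-normal model with known sigma; all numbers illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sigma = 1.0                                   # known data sd
theta_true = 0.7
y_obs = rng.normal(theta_true, sigma, size=20)

# Normal(0, 2^2) prior on theta gives a normal posterior in closed form.
prior_mu, prior_sd = 0.0, 2.0
post_prec = 1 / prior_sd**2 + len(y_obs) / sigma**2
post_sd = post_prec ** -0.5
post_mu = (prior_mu / prior_sd**2 + y_obs.sum() / sigma**2) / post_prec

theta_draws = rng.normal(post_mu, post_sd, size=4000)
y_grid = np.linspace(-3, 4, 8)
post_pred = np.array([stats.norm.pdf(y, theta_draws, sigma).mean() for y in y_grid])
true_pred = stats.norm.pdf(y_grid, theta_true, sigma)

# The posterior predictive is a bit wider than p(y | theta_true): it carries
# the leftover uncertainty about theta, which is the sub-optimality in question.
print(np.round(post_pred, 3))
print(np.round(true_pred, 3))
```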

    I think what this paper is saying is that sometimes we can get lucky and form predictive inference p(y | y_obs) that’s not based on the true joint model p(theta, y) and outperform what we’d get from turning the Bayesian crank.

    Another way to “beat” Bayesian inference is to change the scoring metric. Even with the true joint model form p(theta, y), Bayesian posterior inference is sub-optimal for improper scoring metrics like absolute loss or 0/1 loss. Given that 0/1 loss seems to be the standard in ML, it’s challenging for Bayesian methods to make inroads. I once participated in a TREC (info retrieval) bakeoff around crowdsourced data where MS Cambridge and I built Bayesian systems that dominated on log loss, but were beaten in the actual bakeoff on the official 0/1 loss metrics.
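    Here's a made-up numerical illustration of that last point (toy numbers, not the TREC data): a better-calibrated system wins on log loss while a more aggressive one wins on 0/1 loss.

```python
# Toy illustration (made-up numbers): system A is better calibrated and wins
# on log loss, while system B's overconfident hard calls win on 0/1 loss.
import numpy as np

y   = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])                         # labels
p_a = np.array([.70, .65, .70, .30, .45, .35, .70, .45, .30, .75])      # calibrated-ish
p_b = np.array([.90, .95, .90, .10, .90, .05, .95, .90, .10, .01])      # overconfident

def log_loss(y, p):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def zero_one_loss(y, p):
    return np.mean((p >= 0.5).astype(int) != y)

for name, p in [("A", p_a), ("B", p_b)]:
    print(name, "log loss:", round(log_loss(y, p), 3),
          "0/1 loss:", round(zero_one_loss(y, p), 3))
# A: log loss ~0.45, 0/1 loss 0.2;  B: log loss ~0.54, 0/1 loss 0.1
```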

    P.S. Keith was just pointing out in the first comment that “explainable” has come to have a specialized meaning as part of XAI that doesn’t always involve “interpretation”.

  3. I am puzzled by the usefulness of the following observation. You wrote:

    “…where Yuling describes how and why a mixture of wrong models can give better predictive performance than a correct model, even if such a correct model is included in the set of candidates.”

    In a simulation setting (i.e., we know the truth), we can indeed choose a metric where wrong models *can* outperform the true model, but equally we *can* show the exact opposite (there are many wrong models that are truly worse than the true model). Thus demonstrating the former, or the latter, is of no value to me. Clearly what matters is what collection of models you are considering. Simple (or complex) strategies of “model averaging” are no guarantee of better inference if your collection contains many inherently poor models, compared with inference from a single well-chosen model. Thus we cannot simply say combining models is “better” than a single model.

    cheers Al

      • Hi Yuling/Andrew,
        Yuling, thanks for replying. I am sure we agree that we want useful models; however, I fear people may assume that Andrew advocates that “collections of wrong models” are “better/more useful” than a single, well-chosen model. I do not know your target audience, but in my world (drug development) some people advocate (naively) using model averaging on the premise that this is “smart”…however, some of the models in their collection are really stupid. It would pain me to hear someone say “…and Gelman says it is OK to average over wrong models”….hence my comments!
        cheers Al

        • If a model is really stupid (and therefore predictively inaccurate), stacking will generally award it a very low weight (quite often 0). Even several poor models won’t necessarily ruin the predictive performance of the stacked average. Stacking is better than traditional Bayesian model averaging in this respect. I assume the same is true of hierarchical stacking.
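          As a quick sketch of that point with plain (non-hierarchical) stacking and simulated held-out log densities: optimizing the log score over the weight simplex drives the weight on the poor model essentially to zero.

```python
# Plain stacking sketch (simulated numbers): maximize the held-out log score
# over the weight simplex; a predictively poor model gets weight near zero.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 300
# lpd[i, k]: held-out log predictive density of model k at data point i.
# Models 0 and 1 are decent; model 2 is a much worse ("stupid") model.
lpd = np.column_stack([
    rng.normal(-1.0, 0.3, n),
    rng.normal(-1.1, 0.3, n),
    rng.normal(-4.0, 0.3, n),
])

def neg_log_score(z):
    w = np.exp(z - z.max()); w /= w.sum()            # softmax -> weight simplex
    return -np.logaddexp.reduce(np.log(w) + lpd, axis=1).sum()

fit = minimize(neg_log_score, np.zeros(3), method="L-BFGS-B")
w = np.exp(fit.x - fit.x.max()); w /= w.sum()
print(np.round(w, 3))    # weight on the poor third model is ~0
```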

        • Hi Olav,
          It is perhaps difficult to explain in a short comment, but by “stupid” I mean “not reasonable given what we know about the science”. In particular, I am talking about dose response relationships. These are always non-linear, but given a poor design (e.g. few dose levels closely spaced), and little data per dose, a linear model may *appear* to give a similar model fit to a more complex (non-linear) one. Thus from a simple model comparison, or predictive performance perspective, the linear model may seem, at first glance, reasonable. Critically though, with a better design (wide dose range, large N), the non-linear model will always be much better than the linear model. However with a weaker and weaker design, and less and less N, the linear model will seemingly get closer and closer to the non-linear model. Thus we end up in the perverse situation that the weaker our dataset, the more credible the poor model becomes. This is an example where a “mixture of wrong models” is clearly not sensible, and hence I think we should avoid giving the impression that such a strategy is sensible. Consider the usefulness of many models, yes, but be very selective as to your credible “model space”.
          cheers Al
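          A toy simulation of that design point (an assumed Emax-type curve with made-up parameters; the linear model is compared against the true curve just to keep the sketch short): with a few closely spaced doses the linear fit looks about as good, while with a wide dose range it clearly does not.

```python
# Toy version of the design point above: under a narrow, sparse design a linear
# fit looks about as good as the true Emax curve; under a wide, dense design it
# does not. The Emax parameters and noise level are made up.
import numpy as np

def emax(dose, e0=0.0, emax_=10.0, ed50=5.0):
    return e0 + emax_ * dose / (ed50 + dose)

def residual_vars(doses, n_per_dose, rng):
    d = np.repeat(doses, n_per_dose)
    y = emax(d) + rng.normal(0, 2.0, size=d.size)
    X = np.column_stack([np.ones_like(d), d])              # straight-line fit
    resid_lin = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    resid_emax = y - emax(d)                               # "oracle" true curve
    return round(resid_lin.var(), 2), round(resid_emax.var(), 2)

rng = np.random.default_rng(3)
print("narrow, sparse design:", residual_vars(np.array([4.0, 5.0, 6.0]), 5, rng))
print("wide, dense design:   ", residual_vars(np.array([0.0, 1.0, 3.0, 10.0, 30.0, 100.0]), 50, rng))
```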

        • Check out section 3 of Yuling et al.’s paper, which among other things explicitly addresses how limited data can lead one astray. As they say, limited data—especially because those limitations can be distributed unevenly over the input space—is a reason to want a hierarchical prior to prevent wild swings in overly-confident model weights.

          Perhaps more relevant to your concern about how stacking weights are interpreted is this line:

          Unlike Bayesian model averaging (Hoeting et al., 1999) that computes the probability of a model being “true”, stacking is more related to…the probability of a model being the locally “best” fit, with respect to the true joint measure. (p. 10)

          This line emphasizes Yuling’s point that these methods are not assessing “truth”, but rather just the ability of the model to describe data patterns. Models can be bad or good for reasons that have nothing to do with their descriptive ability.

          Statistics is about designing default methods that work well enough most of the time. It is a valuable tool for science, but it cannot do our critical thinking for us. It sounds like the models Al describes are bad because they do not correctly describe the causal processes that generate data, which is indeed an important problem. But this is a scientific problem and can only be addressed by finding better ways to express those causal theories in model form.

        • Al,
          The point that “if all individual models are poor, then model averaging does not help” is well taken. We explicitly recommend feeding stacking a set of plausible models, not weak learners (in contrast to boosting or mixtures of experts).
          The point that “prediction is not causality” is well taken. In Sec 3.3 we argued why hierarchical stacking is better suited for causal inference (in contrast to vanilla stacking).

  4. How does this compare to a Kalman filter (or other filter) that takes a bunch of “noisy” predictions from the candidate models and cleverly uses their associated errors to make a model-averaged prediction?
