Comments on Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection

There is a recent preprint, Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection, by Quentin Gronau and Eric-Jan Wagenmakers. Wagenmakers asked for comments, so here are mine.

Short version: They report a known limitation of LOO when it is used in a non-recommended way for model selection. They report that their experiments show that LOO is worse than expected, as it doesn’t favor the simpler true model strongly enough. I think their experiments show that the specific non-recommended LOO comparison favors the simpler model more than expected. I enjoyed the clarity of the writing and the experiments, but the paper misses many limitations of LOO and of the alternative methods, and I disagree with the conclusions drawn from the experiments.

Long version:

The paper focuses on model comparison. I’d like to remind readers that cross-validation in general, and LOO in particular, are also useful for estimating the predictive performance of a single model and for checking it. You can find examples in Gabry, Simpson, Vehtari, Betancourt, and Gelman (2018), the loo package vignettes, and my model selection tutorial notebooks, but here I’ll focus on model comparison.

Cross-validation and LOO have many limitations. The paper focuses on a variation of one known limitation: in the “idealized case where there exist a data set of infinite size that is perfectly consistent with the simple model M_S, LOO will nevertheless fail to strongly endorse M_S.” The paper uses one specific variant of LOO model selection; the further variation is the use of idealized data in the experiments.

The strongest assumption in the paper is that the true model is one of the models being compared. This is often called the M-closed case (Bernardo and Smith, 1994). Cross-validation is usually recommended for the M-open case, where the true model is not included in the set of models compared (Bernardo and Smith, 1994), and thus we might say that the paper uses cross-validation for something for which it is not recommended.

When we estimate the predictive performance of a model for future data which are not yet available, we need to integrate over the future data distribution p_t(y), which we usually don’t know. Vehtari and Ojanen (2012) provide an extensive survey of the alternative scenarios. If we assume the M-closed case, then we know that one of the models is p_t(y), but we don’t know which one. In that case, it is sensible to model p_t(y) as a posterior-weighted average of all the models under consideration, that is, to replace p_t(y) with the Bayesian model averaging (BMA) model, as proposed by San Martini and Spezzaferri (1984). This approach has the same model selection consistent behavior as the Bayes factor. Thus, if the assumption is that one of the models is true, the BMA reference model approach by San Martini and Spezzaferri could be used, although asymptotic model selection consistency doesn’t guarantee good finite-sample performance, and BMA may perform worse than cross-validation for small sample sizes even if one of the models is true.
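
As a reminder of the quantity being estimated, here is a schematic of the idea (roughly in the spirit of Vehtari and Ojanen, 2012, not the exact notation of any of the papers): the expected log predictive density of model M_k integrates over the unknown p_t, and the San Martini and Spezzaferri approach replaces p_t with the BMA predictive distribution.

```latex
\mathrm{elpd}(M_k) \;=\; \int p_t(\tilde{y}) \, \log p(\tilde{y} \mid y, M_k) \, d\tilde{y},
\qquad
p_t(\tilde{y}) \;\approx\; \sum_{j} p(M_j \mid y)\, p(\tilde{y} \mid y, M_j).
```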

LOO is known to be model selection inconsistent in the M-closed case, but it is good to remember that there are also cross-validation hold-out rules which are model selection consistent (Shao, 1992). I don’t think these versions of cross-validation are that important, since there are better options for predictive performance estimation in the M-closed case, as mentioned above and below.

If methods that are model selection consistent in the M-closed case are used in the M-open case, they will eventually give weight 1 to the single model that is closest to the truth, which can still be arbitrarily far from the truth. In the M-open case it can be better if the model weights stay away from 0 and 1, as demonstrated also by Yao, Vehtari, Simpson, and Gelman (2018).

In the M-open case, we assume that none of the models is the true one, and we don’t trust any of our models to be even very close to the true model. Due to this lack of trust, we make minimal assumptions about p_t(y): instead of any explicit model, we re-use the data as pseudo Monte Carlo draws from the unknown distribution. Alternatively, we can say that we are using a Dirichlet distribution with unknown probability mass at the observed locations as a non-parametric model of p_t(y). This step assumes that we believe this simple non-parametric model can represent the relevant properties of the true distribution, which is one of the big limitations of cross-validation. The edge case of theta_t being very close to 0 or 1, and correspondingly all observations being 0 or 1, is known to be very sensitive to prior assumptions in any inference (see, e.g., Bernardo and Juárez, 2003), and using all-1 observations for a Dirichlet distribution clearly misses information about the possible variability in the future. With this kind of very weak data on some important features of the future distribution, I would not use the Dirichlet distribution, but would instead insist that stronger model information is included in p_t(y). A classic example is rare disease prevalence estimation, where we might observe hundreds of healthy persons and no persons with the disease. Without good prior information, the posterior uncertainty about how far from 0 the probability is remains large, and the results are necessarily sensitive to prior information (as reflected also in experiment 1). For experiments 2 and 3 the Dirichlet distribution approximation is likely to work better.
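
In this view, LOO is simply the estimate of the expected log predictive density obtained by plugging the observed data points in place of draws from p_t (this is the standard definition used in Vehtari, Gelman, and Gabry, 2017):

```latex
\mathrm{elpd}_{\mathrm{loo}} \;=\; \sum_{i=1}^{n} \log p(y_i \mid y_{-i}),
\qquad
p(y_i \mid y_{-i}) \;=\; \int p(y_i \mid \theta)\, p(\theta \mid y_{-i})\, d\theta .
```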

The non-monotonicity observed in some weight curves is likely due to the fact that the idealized data are symmetric and centered, and when one observation is left out, this symmetry and centering no longer hold. I guess that in scenario 2, a leave-two-out approach, leaving out a 0 and a 1 at the same time, would not show non-monotonicity in the curves. The same holds for example 3, where it would be better to generate the idealized data as pairs with the same absolute value but opposite signs and leave both out at the same time. This way the examples would be even more idealized while still showing the limitations of LOO.

In the M-closed case, it is known that LOO is not model selection consistent (e.g., Shao, 1992), which means that the weight for the true model never goes to 1. Gronau and Wagenmakers write that they were surprised that the model weights stayed so far from 1. I was surprised that the model weights were so close to 1. If asymptotically the simpler and the more complex model give practically the same predictions (except in example 1, where the H_0 model gives zero probability to a 0 event), then I would expect the weights to be closer to 0.5. I can think of two reasons why the weights are not closer to 0.5:

  • The idealized data make LOO, too, favor the simpler model more; if the data were generated from the true generating process in examples 2 and 3, I would expect the weights to be closer to 0.5.
  • Gronau and Wagenmakers compute the weights directly from the LOO expectations (Pseudo-BF), although at least since Breiman et al. (1984, ch. 11) it has been recommended that the uncertainty in cross-validation model comparison also be taken into account. That is, the model with the higher estimated performance is not selected automatically; the uncertainty in the estimates should be considered, too (see also Vehtari, Gelman, and Gabry, 2017). A small sketch of the difference between the two quantities follows this list.
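
A minimal base-R sketch of the two quantities, assuming lpd1 and lpd2 are vectors of pointwise LOO log predictive densities log p(y_i | y_-i, M_k) for the two models (in practice these could be taken, e.g., from the pointwise output of the loo package):

```r
# Pseudo-BF weights: computed directly from the summed LOO log predictive densities
pseudo_bf_weights <- function(lpd1, lpd2) {
  elpd <- c(sum(lpd1), sum(lpd2))
  w <- exp(elpd - max(elpd))
  w / sum(w)
}

# Comparison that also reports the uncertainty in the estimated difference
loo_difference <- function(lpd1, lpd2) {
  d <- lpd2 - lpd1                       # pointwise differences
  c(elpd_diff = sum(d),
    se_diff   = sqrt(length(d) * var(d)))
}
```

If elpd_diff is small relative to se_diff, the comparison gives no clear winner, which is exactly the situation where I would not select either model.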

My recommendation is that if the LOO comparison, taking the uncertainties into account, says that there is no clear winner, then neither of the models is selected and model averaging is used instead. In the case of two models that give very similar predictions, it doesn’t matter which one is used. In the case of nested models I would choose the more complex model, to be certain that uncertainties are not underestimated (as they are in the simpler model, where some parameters are fixed compared to the more complex model), then do strict model checking and calibration, and then proceed with the M-completed approach to decide whether some parts can be dropped, as discussed below.

Yao, Vehtari, Simpson, and Gelman (2018) propose Pseudo-BMA+ (for model weights this could be called Pseudo-BF+ weights), which takes the relevant uncertainty into account and produces weights that are further away from 0 and 1.

Yao, Vehtari, Simpson, and Gelman (2018) also propose Bayesian stacking, which uses LOO as part of the approach. Pseudo-BMA(+) has the limitation that it sees only a scalar value of the predictive performance, while stacking has the benefit that it also sees how similar or different the predictive distributions of the models are, and thus it can jointly optimize better weights (Yao, Vehtari, Simpson, and Gelman, 2018).
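
A minimal sketch of computing these weights with the loo R package (version 2.0 or later; function names may differ in other versions), assuming fit1 and fit2 are, e.g., rstanarm or brms fits for which loo() works directly:

```r
library(loo)

loo1 <- loo(fit1)
loo2 <- loo(fit2)

# Comparison reporting the elpd difference together with its standard error
loo_compare(loo1, loo2)

# LOO-based model weights that take the relevant uncertainty into account
loo_model_weights(list(loo1, loo2), method = "pseudobma")  # Pseudo-BMA+
loo_model_weights(list(loo1, loo2), method = "stacking")   # Bayesian stacking
```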

Yuling Yao commented in an email:

A quick note is that in the beta-bernoulli examples, the stacking weights can be derived analytically. It is 1 for H_0 and 0 for H_1. Surprisingly it is independent of both the prior distribution and the sample size n. The independence of n may look suspicious. But intuitively, when each data point (example 1) or each pair of data points (example 2) uniformly supports H_0 more than H_1, it does not take n → ∞ to conclude that H_0 dominates H_1. Also, only when n goes to infinity shall we observe a perfect 0-1 pair in example 2 and an exact sample mean of 0 in example 3, so the stacking weighting/selection depends on sample size implicitly.
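
For reference, the stacking weights in Yao, Vehtari, Simpson, and Gelman (2018) maximize the LOO log score of the weighted predictive distribution over the simplex; written for the two models here,

```latex
\max_{w \in [0,1]} \; \frac{1}{n} \sum_{i=1}^{n}
\log\!\Big( w \, p(y_i \mid y_{-i}, H_0) + (1 - w)\, p(y_i \mid y_{-i}, H_1) \Big).
```

When p(y_i | y_{-i}, H_0) ≥ p(y_i | y_{-i}, H_1) for every i, the maximum is at w = 1 for any n, which is the analytic result Yuling describes.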

Although stacking seems to produce something Gronau and Wagenmakers desire, I would not trust stacking in the edge case with all-1 observations, for the reasons I mentioned above. I guess that in this idealized case, stacking with symmetric leave-two-out would also converge faster.

Gronau and Wagenmakers focused on the case where the simpler model is true, but to better illustrate the limitations of LOO, it would be good to also consider the case where the more complex model is the true one. Consider the following alternative experiments where the more complex model is true.

  • In example 1, let the true theta_t = 1 - epsilon, but for the simpler model keep theta_0 = 1.
  • In example 2, let the true theta_t = 1/2 + epsilon, but for the simpler model keep theta_0 = 1/2.
  • In example 3, let the true theta_t = epsilon, but for the simpler model keep theta_0 = 0.

If we choose epsilon very small but within the limits of floating-point accuracy for the experiments, we should see the same weights as in the original experiments as long as we observe the same data; only when we occasionally observe an extra 0 in example 1, an extra 1 in example 2, or an extra positive value in example 3 would we see differences. In example 1, even a single 0 gives theta = 1 zero probability.
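
A small base-R simulation sketch of the example 1 variant with theta_t = 1 - epsilon (the Beta(1, 1) prior and the direct Pseudo-BF weighting here are my illustrative assumptions, not necessarily the exact setup of the paper):

```r
# LOO-based Pseudo-BF weights for example 1:
# H0 fixes theta = 1, H1 uses a Beta(a, b) prior on theta.
loo_weights_example1 <- function(y, a = 1, b = 1) {
  n <- length(y)
  # H0: p(y_i = 1 | y_-i) = 1 and p(y_i = 0 | y_-i) = 0
  lpd0 <- ifelse(y == 1, 0, -Inf)
  # H1: beta-bernoulli posterior predictive with observation i left out
  s_minus_i <- sum(y) - y
  p1 <- (a + s_minus_i) / (a + b + n - 1)
  lpd1 <- ifelse(y == 1, log(p1), log(1 - p1))
  elpd <- c(H0 = sum(lpd0), H1 = sum(lpd1))
  w <- exp(elpd - max(elpd))
  w / sum(w)
}

loo_weights_example1(rep(1, 100))       # idealized data: H0 weight stays near 0.73
loo_weights_example1(c(rep(1, 99), 0))  # a single extra 0: H0 gets weight 0
```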

Another experiment to illustrate the limitations of LOO (and of cross-validation and information criteria in general) would be to vary epsilon from very small to much larger, plotting how large epsilon needs to be before the more complex model is strongly favored. I’m not certain what happens for the Pseudo-BF version with no proper uncertainty handling used by Gronau and Wagenmakers, but one of the big limitations of cross-validation is that the uncertainty about the future, when no model assumptions are made, is so large that to make a confident choice between the models, epsilon needs to be larger than what would be needed if we just looked at the posterior distribution of theta in this kind of simple model (see the demonstration in a notebook, and related results by Wang and Gelman, 2014). This is the price we pay for not trusting any model and thus not getting the reduced variance that comes from properly modeling the future data distribution! This variability makes LOO bad for model selection when the differences between the models are small, and it only gets worse with a large number of models of similar true performance (see, e.g., Piironen and Vehtari, 2017). Even with just two models to compare, cross-validation has the additional limitation that the simple variance estimate tends to be optimistic (Bengio and Grandvalet, 2004).

If we are brave enough to assume the M-closed or M-completed case, then we can reduce the uncertainty in the model comparison by using a reference model for p_t(y), as demonstrated by Piironen and Vehtari (2017) (see also the demonstration in a notebook). In the M-completed case, we assume that none of the models is the true one, but that there is one model which best represents all the uncertainties we can think of (see more in Vehtari and Ojanen, 2012). The M-completed case is close to San Martini and Spezzaferri’s M-closed approach, but with the BMA model replaced by some model we trust. For example, in variable selection the reference model in the M-completed case can be a model with all variables and a sensible prior on the coefficients, which has survived model checking and assessment (which may involve cross-validation for that single model). In such a case we have started with the M-open assumption, but after model checking and assessment we trust one of the models enough that we can switch to the M-completed case for certain model selection tasks. Given a reference model, we can further reduce the variance in model selection by using the projection predictive approach, as demonstrated by Piironen and Vehtari (2017) (with a reference implementation in the projpred package). In the M-closed case with the BMA reference model, the projection predictive approach is also model selection consistent, but more importantly it has very low variance in model selection in the finite-sample case.
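
A minimal variable-selection sketch with rstanarm and projpred; the data frame df, the formula, and the prior choice are illustrative assumptions, and the function names may differ between package versions:

```r
library(rstanarm)
library(projpred)

# Reference model: all candidate variables with a regularizing (horseshoe) prior
ref_fit <- stan_glm(y ~ ., data = df, prior = hs(), refresh = 0)

# Projection predictive variable selection with cross-validation
vs <- cv_varsel(ref_fit)
size <- suggest_size(vs)            # suggested submodel size
proj <- project(vs, nterms = size)  # projection of the reference model onto it
```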

Often the discussion of cross-validation vs. Bayes factors focuses on whether M-closed is a sensible assumption. I think the M-closed case is rare, but if you insist on M-closed, then you can still use a predictive approach (Vehtari and Ojanen, 2012). If BMA is easy to compute and stable, use it as the reference model and then do the model selection as in San Martini and Spezzaferri (1984), or even better with the projection predictive approach (Piironen and Vehtari, 2017). If BMA is difficult to compute or unstable, I recommend trying Bayesian stacking (Yao, Vehtari, Simpson, and Gelman, 2018).

Mostly I don’t trust any model: I assume the M-open case and that the models may be badly mis-specified. Before any model selection I discard models based on prior, posterior, and cross-validation predictive checking (see, e.g., Gabry, Simpson, Vehtari, Betancourt, and Gelman, 2018). For M-open model selection with a small number of models I use cross-validation with uncertainty estimates (Vehtari, Gelman, and Gabry, 2017). If there is no clear winner, then I recommend model averaging with Bayesian stacking (Yao, Vehtari, Simpson, and Gelman, 2018). With a large number of models, I recommend trying to convert the problem to M-completed and using the reference predictive approach or, even better, the projection predictive approach (Vehtari and Ojanen, 2012; Piironen and Vehtari, 2017).

Since the above is a long discussion, here are my final recommendations for how to revise the paper, if I were a hypothetical reviewer:

  • Since the paper cites Vehtari, Gelman, & Gabry (2017) and Yao, Vehtari, Simpson, & Gelman (in press), the minimal requirement would be to mention that these papers explicitly recommend computing the uncertainties due to not knowing the future data, and that they do not recommend LOO weights as used in the experiments in the paper. In its current form, the paper gives a misleading impression of the recommended use of LOO.
  • Even better would be to also add experiments using the other LOO-based model weights (Pseudo-BMA+ and Bayesian stacking) introduced in Yao, Vehtari, Simpson, & Gelman (in press), or at least to mention in the discussion that additional experiments with these would be worthwhile. I would love to see those results.
  • Even better would be to add the small-epsilon and varying-epsilon experiments described above, or at least to mention in the discussion that additional experiments with these would be worthwhile. I would love to see those results.
  • Even better would be to add projection predictive experiments. I would love to see research on the limitations of the projection predictive approach. I know this one requires much more work, so I understand if the authors are not willing to go that far.
  • Minor comment: for an earlier use of Bayesian LOO, see Geisser (1975).

68 thoughts on “Comments on Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection”

  1. This article is very interesting. I have often found that the WAIC (which I understand approximates looic) always selects more complex models, leaving me with no answers from model selection at all. I thought that I just had strange models and data, but maybe it’s a problem with the WAIC…

    • Andrew:

      In general it’s not always clear what is meant by the complexity of a model. For example, consider a hierarchical dataset and a varying-intercept regression of the form y_i = a_j[i] + b*x_i + error. Consider two candidate models: (1) a model with flat priors on the a_j’s, thus no pooling; (2) a hierarchical model, a_j ~ normal(mu_a, sigma_a). Both models are equally “complex” in that they are fitting a separate intercept for each group, or maybe model (2) would be considered more “complex” in that it includes hyperparameters. But it is model (1) that will overfit the data, and it is model (1) that will have a higher “effective number of parameters” (as defined in various standard ways, as here).
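
      A minimal sketch of the two candidate models in R with rstanarm (the data frame df and its column names are illustrative assumptions, not from the comment):

      ```r
      library(rstanarm)

      # (1) no pooling: a separate, unregularized intercept for each group
      fit_nopool <- stan_glm(y ~ 0 + factor(group) + x, data = df,
                             prior = NULL, refresh = 0)  # flat priors on coefficients

      # (2) partial pooling: hierarchical prior a_j ~ normal(mu_a, sigma_a)
      fit_hier <- stan_lmer(y ~ x + (1 | group), data = df, refresh = 0)

      # LOO estimates of predictive performance for the two models
      loo(fit_nopool)
      loo(fit_hier)
      ```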

      The general issue is that with unregularized estimation such as least squares or maximum likelihood, adding parameters to a model (or making a model more complex) leads to overfitting. With regularized estimation such as multilevel modeling, Bayesian inference, lasso, deep learning, etc etc, the regularization adds complexity but in a way that reduces the problem of overfitting. So traditional notions of model complexity and tradeoffs are overturned.

      See also my favorite Radford Neal quote.

      • Thanks for your thoughts Andrew. I’ll have to mull that quote. I’ve been trained that we want model selection to ‘work’ in the sense of justifying a parsimonious model, and I hadn’t questioned it all that much.

        • The question is how to measure parsimony. As Andrew explains, it isn’t as simple as counting parameters.

          Let’s start with the assumption that you don’t have the “true” model in front of you. What’s going to happen as you collect more data is that you’ll be able to fit better and better models. Any “best model” notion is going to depend on data size and quality. As an example, consider building a language model (something that predicts the next word in a conversation or text given the previous words). How much context (number of previous words and their structure) can be exploited depends on the size of the data. That’s why tasks like speech recognition or spelling correction benefit so much from big data—they have very long tails. The “true” model involves human cognition, attention span, context, world knowledge, etc., all of which is only crudely approximated in a simple language model.

  2. I don’t have B&S nearby – can someone remind me what priors over models or parameters ‘mean’ in an m-open setting? Why assume an additive measure that integrates to one if all possibilities are wrong?

    • Not really an answer to your question, but Vehtari et al specifically recommend ‘stacking’ in the M-open case, not BMA. Although this involves a set of weights summing to 1, it’s not really assigning priors over models, in my understanding. I am no expert though!

      • Thanks for this.

        So what is the ‘Bayesian’ way of thinking about stacking etc?

        ie what is the basic principle at play here if not eg updating an additive, normalised prior measure of belief, support, information etc into a posterior one?

        I haven’t looked at the stacking stuff in any detail but it looks like predictive distributions are stacked. These predictive distributions are still obtained by averaging over posteriors which presumably result from updated priors right?

        So the idea is to take something from an M-closed setting and ‘stack’ them to address the open setting? What about directly addressing the open setting without using the parts from the closed setting?

        (I also have some qualms with how much sense it makes to want the predictive distribution – which averages models that could be quite different – to match to observable/future data, but that’s another topic I guess).

    • Chris: That’s right.

      Ojm: I agree that it seems strange to put priors over models in an M-open setting. Nonetheless, it’s mathematically possible to do so, and some people recommend this procedure, hence we wanted to evaluate it and compare it to the procedure that made more sense to us.

      • What about the situation which is neither M-open nor M-closed? Only half-joking :) I have multiple models for biogeochemical cycling which all capture aspects of the real-world, but also are fundamentally inconsistent with each other in a few respects, and by themselves certainly incorrect/incomplete/wrong. In this situation, it does actually make sense to me to explicitly use probability to weight the models and quantify uncertainty. I increasingly think the situation I describe is the norm in most every area of science, and thus maybe is just better to think of as “M-open”. OTOH, if that is so, there are vanishingly few (if any) non-toy examples of an “M-closed” scenario!

        • Chris:

          M-closed is almost never even approximately true. But the concept can still be useful for theoretical purposes, to develop methods, and to develop intuition. Similarly, the normal distribution, additive linear models, etc., are not true either, and they’re typically not even close to true, but it can be helpful to use these assumptions to build tools that work and to understand the statistical properties of the tools we use.

        • I’ve been developing an idea about Bayes which doesn’t involve treating probability as a continuous generalization of boolean logic. (Actually I sent a rough draft version of this paper to OJM a while back.)

          In this conception, the Bayesian probability measures degree of agreement of expected behavior of the model compared to reality. So for example the “peak” likelihood values are the ones that best match what you think your model should be able to do in terms of predictive power, and as you get farther away from predictive accuracy the likelihood decreases.

          In that context, there is no “M-open” or “M-closed” there’s only “all the models I’ve thought up so far” and “how well each one does relative to the other ones I’ve thought up”

          It makes perfect sense to me to use a continuous measure of agreement (aka probability) in these contexts. And, further it makes perfect sense to me that you can expand your model set and re-do things, and get *different* probabilities, which is a bit puzzling if you think of Bayesian probability as degree of plausibility of *truth*.

        • Just reading what you wrote in this comment makes me think you are essentially thinking in ‘pure likelihood’ terms?

          Plausibility based on ‘predictive’ agreement with data – yup
          No absolute plausibility just how well one does relative to another – yup

          If you consider prior likelihoods as arising from past agreement with data, multiplying these for conditionally independent data etc., then you can synthesise past info together.

          But yeah maybe I’ll read what you sent me ;-)

        • Daniel, I’m with you on the ‘there is no “M-open” or “M-closed” there’s only “all the models I’ve thought up so far” and “how well each one does relative to the other ones I’ve thought up”’. Box’s well-worn dictum comes to mind :)

          Also, I suspect that OJM is off the mark here a little in that you are ultimately coming down on the side of using probability to quantify the relevant uncertainties, not some kind of “pure likelihoodism”. But I do agree that the ‘Bayes as continuous Boolean logic’ viewpoint falls flat ultimately for a variety of reasons including the one you mention. At some level, I believe we can just be pragmatic and say that we are using probability theory as a tool to deal with uncertainty in an organized fashion, and that it works well in an astonishing variety of cases.

        • Yes I do come down on a full probability model, but the concept is that likelihood down weights regions of space and then renormalization preserves a common scale… it actually isn’t required that likelihood be normalizable necessarily. Observing data induces a semigroup operation that maps (potentially nonstandard) probability densities to new probability densities, in a commutative, associative operation. The result is a measure of agreement between theory and observation, including both a measurement / prediction component of theory and a pure theoretical component (plays role of prior)

        • Daniel:

          When I last gave a webinar on ABC I

          1. Started with a simple rejection sampling method for observed discrete data (draw a parameter from the prior, draw data using that parameter value, only keep those draws that generated the exact same data, and voilà, there’s the posterior).

          2. Pointed out that the kept percentages for each parameter had probability equal to the likelihood (probability of data given parameter value) and hence was a weighted average that we could calculate directly.

          3. Reminded people what importance sampling was (draw from a convenient distribution and re-weight to your target distribution) and re-characterized Bayes as importance sampling from the prior to get the posterior. (Then I moved on to sequential importance sampling to show why MCMC was needed for real problems).

          You seem to be doing something similar using nonstandard probability – but it just _seems_ like yet another way to characterize implementing Bayes Theorem?

        • Keith, yes it’s still Bayes theorem, but in this case it’s a characterization of what assumptions we need to make in order to wind up with probability models as the answer to our assumptions. With Cox’s theorem it was always about underlying truth. With my assumptions it’s about agreement with theory, independence of the order of data within a dataset, a measure of predictive power that is defined only up to a scale multiplier, and a mechanism for creating a common scale that depends on all points in the space.

          With those choices, we get a probability measure; that is where that set of assumptions leads.

        • So just for the information of commentators here it’s worth noting that when Daniel writes, “it actually isn’t required that likelihood be normalizable necessarily” he means as with respect to the data. In the system he’s working on the likelihood-like component of the update is a fairly arbitrary quantification of the disagreement of a theory’s prediction and data — arbitrary to the point where unnormalizability with respect to the data is permitted. Setting that component to be the likelihood is not required — it’s just one out of many ways one might want to quantify disagreement.

        • Daniel:

          I meant to restrict to proper priors – does your characterization restrict to proper priors?

        • Right, Corey, think of F(Data, Parameters) for fixed parameters as any bounded nonnegative continuous function. The meaning of this function is basically that “bigger is better”. The “good agreement” occurs if the data is in the region where F(Data,…) takes on larger values.

          So, for inference from fixed data, we can actually determine the “good” region of the theoretical quantities (parameters) even if this function isn’t normalizable with respect to the data.

          But, if we want our model to have predictive power, then predicted data is itself a *theoretical quantity* and the distribution over all theoretical quantities must be normalizable or it will be impossible to come to a calculation that can be shared between people (it would always be possible to multiply your result by a constant and claim to have “Better” agreement than someone else who had the same calculation but a smaller constant)…

          So, it’s a corner case, where if you only want to go from fixed observed data to a distribution over parameter space, you can technically relax the requirements of the likelihood, but as soon as you require that the model be able to predict unobserved “future” or “alternative conditions” data, your model has more stringent requirements and the likelihood has to be normalizable with respect to the data.

        • Keith: yes… ish. I work in nonstandard analysis, so the operations form a semigroup that is closed on the nonstandard probability densities. Some of those have no standardization.

          So in this sense, you can’t have your prior be f(x)=1 for x in the real numbers. (not normalizable). But you can have your prior be 1/2N on the nonstandard interval [-N,N] which can be normalized in nonstandard analysis, but there’s no standardization (no standard probability distribution that is “infinitesimally close” to this one)

          As long as your prior is nonstandard normalizable, operations with data will keep the posterior nonstandard normalizable… If on the other hand, there’s no *standardization* you could interpret that to mean that you don’t yet have any finite quantity of information. So, if you started with a nonstandard prior, collected some data, and still couldn’t form a standard distribution… basically the data wasn’t informative.

        • No problem with being pragmatic. It’s just that one of the most common selling points of Bayes is that it is somehow ‘more principled’ than other ‘hacky’ approaches (machine learning, frequentist inference, likelihood theory, Tukey/Huber style data analysis etc etc).

          But most of these other approaches already explicitly embrace the ‘M-open’ view and often justify their hackiness on this – better an approximate answer to the right (M-open) question than an exact answer to the wrong (M-closed) question etc.

          It always seems to me like Bayes buys its ‘principles’ by paying in ‘closed worlds’. E.g. Jaynesian robots, rationality assumptions, settable bets etc etc.

          Now we also have Bayesians saying that ‘of course we should deal with the M-open setting’ while keeping the ‘we should do it in a Bayesian way’.

          But I’m honestly not really sure in what sense this works because Bayes foundations seem so closely tied to an M-closed view. What is the clearest statement of Bayes in an M-open setting, where things are built from the ground up (e.g. not just ‘let’s stack Bayesian predictive distributions’ but why I would want to start from Bayesian predictive distributions in the first place)?

          I’m not just trolling on this – I struggle to see how Bayes is actually supposed to be formalised in an M-open view.

          An again – I’m perfectly fine to be hacky, but now that I’ve embraced hackiness I struggle to see why I would want to be hacky in a Bayesian way.

        • First, I see frequentism and Bayes as different interpretations of probability that lead to different procedures to evaluate models, not different hacks. In general, I don’t see that it’s all “hacky”, unless defining and working with “small worlds” is itself hacky. Given a small world, Bayes (i.e. full probability modeling) has several advantages, but of course it is all conditional on “this is the world”, as it were. Playing a bit fast and loose for a moment, you have to define the set across which you are allocating the conserved quantity that is probability. You are free to enlarge, or reformulate, the small world of course, but that will have profound implications for how you think about your inference no matter what philosophy you adopt. My $0.02 :)

        • I guess another way to say what I said: yes, all inference is conditional on some set of assumptions. I don’t see this as being in any way uniquely problematic for Bayes (i.e. full probability modeling).

        • > Bayes buys its ‘principles’ by paying in ‘closed worlds’
          Sounds right.

          > how Bayes is actually supposed to be formalised in an M-open view.
          How is anything to be formalized in an M-open view except by enclosing it in a bigger closed view?

          Now within the “closed world” (a representation taken as representing itself) we know how to make re-representations without making the originating representation more wrong (truth preserving operations) – so why mess that up?

        • Aiming to do the best in closed worlds can often be worse than aiming to do OK in open worlds. I tend to prefer the latter

        • > How is anything to be formalized in an M-open view except by enclosing it in a bigger closed view?

          By not starting from an additive, sum to one measure of support?

        • ojm, my view is that once you have written down a model, any model, you have a small world. Every *interesting/useful* statistical procedure I have come across, including various “non-parametrics” can be cast in equivalent model-based form. Now, whether or not this is M-open or M-closed is tangential, as far as I can see based on this discussion. In the small world, the Bayesian mechanics work just fine, and there are a ton of philosophical and pragmatic arguments to use them for learning. IMO, the key is to not be enslaved by them, or by any other mechanical operation in the small world. Things like predictive evaluation take us beyond the small world- unless we embed it in a larger small world, in which case we can use the full probability mechanics again.

        • Corey, unclear if blog ate my comment. Anyhow, I should have worded that as *interesting/useful to me*, rather than the more general claim :) I am usually after scientific interpretability with respect to some specific hypotheses or theories about the world. Most applications of ML I’ve run across in the course of my work are not satisfying in that regard, and in some cases don’t even out-predict parametric statistical equivalents.

        • > whether or not this is M-open or M-closed is tangential

          I thought that was the whole point?

          > In the small world, the Bayesian mechanics work just fine, and there are a ton of philosophical and pragmatic arguments to use them for learning.

          I’ve probably looked at more of these arguments that your average person and I find all problematic in some way or another, and for both philosophical and pragmatic reasons.

          I’m not gonna stop anyone from using Bayes, and I’m happy to use it myself sometimes, but I find the arguments for it, over other approaches, unconvincing.

          Are the goalposts:
          – Bayes is one OK approach or
          – Bayes is best

          – How to formalise M-open and M-closed tools or
          – Everything is an approximation/everyone makes assumptions of some sort?

          I’m ok to say
          – Bayes is one OK approach, with flaws like any other.
          – I find the formalisation of Bayes in what is referred to as the M-open setting unclear

          But I don’t see any good arguments that Bayes is uniquely better than anything else. Some of its assumptions strike me as particularly awkward when trying to formalise things in the formal M-open setting.

        • Ojm:

          No statistical method is uniquely better than everything else. I suppose there are some methods that are really terrible, but there are a lot of methods on the efficient frontier, considering all possible problems that might be studied.

        • Yes I know you think this, and I agree :-)

          This does mean limitations of different approaches should be acknowledged, including Bayes.

          Really I was just trying to get some clarification on how Bayes is formalised in an M-open setting and how things I see as limitations of Bayes in this context might be addressed. Eg does it make sense to use formally additive (and normalised) measures of uncertainty in this context?

          See my comment much earlier on about stacking and predictive distributions.

        • Ojm, in my take which you need to read ;-) I formalize the concept of a scientific model as a sentence in a formal language such as lambda-calculus (ie. a computer program). The M-open setting is then the setting where you potentially are unable to exclude say infinite sentences, and/or non-terminating programs.

          I’d say that non-terminating programs are simply non-scientific. It’s no use if your predictions take longer than the end of the universe to make. I realize that it’s impossible to compute whether an arbitrary piece of code terminates, but it is possible to reason about some code (ie. code that has finite loops, or what’s equivalent, code that recurses monotonically towards a base case, etc)

          So for the most part, we need to deal with some kind of “finite terminating computable” models, and furthermore, I think most people are ok with saying that a thing is only science if an actual person or reasonable size group of people can actually write down the code in a lifetime. So realistically, we’re always in an M-closed setting: finite, not too big to actually write down sentences that are designed to terminate.

          Dealing with situations where we have unobtainable models to consider is not pragmatic… for example say models involving newtonian mechanics with initial conditions on 10^80 particles where just writing down the initial conditions for the model would take longer than the human race will exist…

          So, I think taking a concrete approach like this is useful. In this concrete approach, Bayes makes good sense; whether it’s “optimal” is more or less not something I think is answerable (at least on its own, without describing the objective function you’re optimizing), but I think it’s easy to point out what Bayes does right where *every* other approach in wide use today that I’m aware of falls short in some way.

        • Ojm:

          The basic asymptotic theory of M-open Bayes is described in appendix B of BDA. See also section 3 of this paper with Shalizi from 2012, which in turn is based on various ideas of mine from 1993 or so. Sample sentence: “This is not a small technical problem to be handled by adding a special value of theta, say theta_∞, standing for ‘none of the above’; even if one could calculate p(y|theta_∞), the likelihood of the data under this catch-all hypothesis, this in general would not lead to just a small correction to the posterior, but rather would have substantial effects.” See also section 4.3 of that article. Actually, read the whole damn thing!

        • Andrew – thanks, that’s a nice paper I’ve read a few times but good to read again.

          The asymptotics section – putting aside ‘not-too-onerous regularity conditions’ that I think are actually frequently violated in reality – basically says things can easily be arbitrarily bad, right?

          A subtle point that I don’t think Bayesians ever really address is not just normalisation but additivity: once you accept all models are approximations then not only is there not one true model in the support but there is no reason I can see to say that the negation (or even the complement within the set of models considered) of a good approximate model is a bad approximate model etc.

          So it’s not just normalisation that seems suspect to me (which is perhaps more minor) but also the basic algebraic structure of probability theory. Or at least I haven’t ever really seen a convincing reason why I should want to use an additive measure over a set of approximate models. This is an implicit and/or explicit motivation for many non-Bayesian approaches.

        • Andrew, appendix B of BDA seems to me to be explicitly addressing the M-closed case, where the true model is in the prior support. Anyhow, thanks for pointing out the section in your paper with Shalizi – it really is a great read!

          ojm, I’m curious what you would do in practice. You collect or have access to dataset y, and have 3 ‘competing’ models M1, M2, M3, whose likelihoods differ in some structural respect so that they cannot simply be considered special cases of each other. Now, these structural differences also happen to embody different hypotheses about how the world works in some fundamental respect. Although grant that each model is of course, like all models, an idealization/approximate/whatever. In fact, it is even possible (if not likely) that the mechanisms embodied in each model are all plausibly defensible descriptions of the world, simply incomplete.
          Your goal: forecast the future state of some system z, from which data y have been collected, and about which M1, M2, and M3 are all, a priori, potentially in play. Do you combine models in some way or simply select one and condition all forecasts on that selection? If you combine, don’t you want countable additivity and a sum to one measure?

          Note, I am not trying to be cute or deliver some kind of “gotcha” here- I think this is fascinating philosophy of science stuff!

        • Chris:

          In Appendix B we explicitly work in an M-open framework. Here’s what we write:

          The key assumption for the results presented here is that data are independent and identically distributed: we label the data as y = (y_1, . . . , y_n), with probability density \prod_{i=1}^{n} f(y_i). We use the notation f(·) for the true distribution of the data, in contrast to p(·|θ), the distribution of our probability model. The data y may be discrete or continuous.

          In Appendix B, the density “f” is not assumed to be in the set of p. That is, we’re working in an M-open framework.

        • Andrew, thanks! I think I got confused by the verbiage about “if the likelihood model happens to be correct…” and then “If the data truly come from the hypothesized family of probability models…”. So, essentially, as data arrive, the posterior updates around the value theta_0 that minimizes KL-divergence from f(.). If ‘f’ is in the set, i.e. M-closed, then great, the posterior concentrates around the “true value”; if ‘f’ is not in the set, it is simply a distinguished point that happens to minimize KL-divergence, insofar as these data go (of course, assuming the mild regularity conditions). Is that accurate?

        • Yup. When I wrote that appendix, back in 1993 or whenever it was, I did not know about the terms “M-open” and “M-closed” so I just made it all up as I went along.

        • Hi Chris,

          Fair question.

          I’d say it depends on the goals. There are certainly cases where I’d just want to give an averaged prediction, in which case probability seems fine.

          Other cases that might occur: worst case/minimax etc. This to me is closest to possibility reasoning rather than probability reasoning. We want to bound possible behaviour under greater levels of uncertainty. How should you behave given much less info than nature’s full probability distribution?

          This is also related to work on ‘robust’ stats etc – it tends to be built on minimax ideas. I tend to think a lot of mathematical reasoning in general is built on inequality/minimax/possibility style reasoning, though I think Andrew once called it ‘ugly 60s style stuff’ or something. To me it’s a style of thinking that grew on me the more math I learned.

          Similarly you might want to determine when the models give sufficiently qualitatively different behaviours such that they become distinguishable (eg you want to learn about which is closest to the ‘underlying mechanism’ rather than just predict well on average or in the worst case). Bifurcation theory and the like. I don’t think this is well captured by probability style thinking.

          This sort of ‘bounding’ or ‘partitioning’ style reasoning is to me more qualitative and non-probabilistic. Probability really seems to me to be a more fine-grained ‘known unknowns’ sort of reasoning.

          Which is to say – many different approaches are useful. You want an averaged prediction of the future given a handful of models, it seems OK. But there are other things you might want and I think quantitative probability theory is a much more limited reasoning tool in general than often advertised.

        • ojm:

          It seems like your preference for systems that work in the M-open problem is that there is some model “out there” which you don’t yet know, *and you want to discover it using some mathematical tool or technique*

          The problem with that of course is that it’s a strongly non-unique, noncomputable problem. Suppose you come along with me and limit your model universe to strings in a formal computing language, modulo purely formal changes to the structure of the program (such as renaming variables, or lifting local functions to global scope or the like).

          Now, there are still nonformal changes to the program which keep its scientific content intact, for example instead of pow(x,1/2) we could do exp(1/2*log(x)) these call different functions but have the same result in this special case… Similarly we could imagine say series representation of functions, or trigonometric identities, etc. They’re not formal, but they are provably identical.

          Next there’s the strongly difficult problem of termination. If you allow pretty much any string in the formal language, some of them will loop infinitely. So when you do something outside probability theory, you’re left with limiting yourself to a “small” set of provably terminating models *anyway* or your calculation won’t terminate. Machine learning techniques tend to use what you might call “universal approximators” to certain subsets of functions, like neural networks and soforth, but they’re still very limited compared to say the full lambda calculus.

          It’s possible to use the same set of universal approximators within Bayes as well, though computationally difficult, because instead of finding say *one point* in the space of possibilities that does a “pretty good” job, you have to find a random sample of points in the posterior distribution that all do a reasonably good job, thereby quantifying the uncertainty in model space.

          In some sense every problem is m-open, precisely because we know that the “real” model is outside the scope of our necessarily reduced search space, even if we consider the search space to be say the finite strings of lambda calculus with length less than 10^300 symbols (I’d argue that you could simulate the universe exactly to infinitesimal accuracy if you had some “bigger” computer capable of executing such large strings and knew which string to choose ;-) there are thought to be something like 10^80 atoms in the universe)

          So, what we *always* want is some projection of the real model space onto our restricted model space that does “a good job” as measured in some way. How do we measure this? For Bayes, it’s the likelihood which describes what we expect our model to be able to do. It won’t give us our measured data exactly, but it should give us “close to” our measured data, in the sense of making the likelihood be largeish…. but there is no sensible way to describe “largeish” on an absolute universally accepted scale. Needing to define for ourselves a scale that can allow us to compare between two different people carrying out the same calculation with a likelihood differing by purely a constant multiplier leads us to probability theory.

          Now, we often take problems and make them *much more restricted* than what I’ll call the M-practically-pseudo-closed problem of say “all the neural networks less than a gajillion neurons + nonrecursive functions less than 10^9 symbols” In fact I often work with just some very limited functions like “radial basis functions with less than 100 centers” and “nonlinear regression functions with 8 parameters in a particular family” or whatever. We’re still basically doing a *computational shortcut* to what amounts to the M-pseudo-closed calculation we could be doing if we had a trillion times faster machines.

          The fact is, we use prior knowledge to simply set a-priori probability on essentially all of those other models to zero. This is a computational shortcut, not a philosophical principle.

          By the same route, random forests and deep learning and boosted foo-bars or whatever are really just computational shortcuts which limit the model search space, *and* don’t try to quantify the uncertainty in the posterior. A double-edged kind of computational shortcut.

          Principled refusal to admit the prior doesn’t really have the principled flavor it seems to have when you let the model be a lambda-calculus string and realize that, by excluding almost every conceivable model as you must to get anything done, there is effectively a prior distribution over models being imposed; it’s only the priors over the remaining parameters in the nearly-infinitely-restricted remaining domain that are being “left out”.

        • I think the additivity assumption makes sense as soon as you close the world, which you do when you specify the subset of all formal language strings that you will use to define the model space that you’re considering at the moment. Since everyone is forced to take that step at some point, I don’t think there really is an M-open type of analysis. Once you’ve subsetted your possible models to a finite number of models with possibly continuous parameter values… the additivity assumption makes good sense to me.

          What has to be remembered though is that each analysis only holds as far as you accept the assumptions. There is nothing philosophically wrong with expanding or changing your model space subset and re-analyzing in that context. There is if you think of Bayes as modeling *your actual belief* but if you think of it as modeling hypothetical compatibility of a model set with data from the world… Re-analyzing with a different model set is just asking a different hypothetical. Comparing *across these two analyses* will not work, but this doesn’t bother me. If I want to compare across two model subsets, I need an analysis in which the union of both model sets is included.

        • “Similarly you might want to determine when the models give sufficiently qualitatively different behaviours such that they become distinguishable (eg you want to learn about which is closest to the ‘underlying mechanism’ rather than just predict well on average or in the worst case). Bifurcation theory and the like. I don’t think this is well captured by probability style thinking.”

          ojm, thanks for sharing your thoughts. I think we actually agree to a large extent. I see the above as a key part of the *scientific process*, and agree that application of probability theory is deploying a tool that has some sharp limits. My goal *as a scientist* is to figure out where hypotheses/models/theories make maximally divergent predictions, and set up experimental tests to discern between them (of course, other criteria also play a role, but want to be brief here :)). But, as a forecaster/predictor-of-things, given data y on system z, and assuming that I don’t have a clean test between models M1, M2, M3, …, Mn, I think we’d want to leverage probability modeling. Or at least, stacking predictive densities with a set of weights summing to one to avoid silliness. My $0.02.

        • I’ve often thought, in an analysis in which I don’t have a particularly good idea of a mechanistic model, that I could include a “proxy” for a good model in which I take the data and perturb it with random noise, and then create a likelihood in which I expect this model to be within a certain accuracy determined by what I expect my real model to be able to do in terms of predictive precision. This “pseudo-likelihood” enters into the Bayesian analysis with a small prior mixture probability.

          If that model pops out of the analysis with a high posterior probability, it indicates that none of the “real” models fit the expected precision, and that only the proxy model can predict with the precision I expect from my real models. That indicates a failure of the real models, and I can then go looking for a better real model.

          Understanding that idea in terms of underlying “truth” values etc… makes no sense, but understanding that idea in terms of accordance with predictive expectations does make sense, and so it’s one way I think my rethinking of Bayesian foundations helps me understand how modeling should work.

        • Chris, ojm, others

          Interesting discussions here.

          Now in one of my comments http://statmodeling.stat.columbia.edu/2018/06/05/comments-limitations-bayesian-leave-one-cross-validation-model-selection/#comment-759090 perhaps I should have pointed to these ideas to clarify where I believe Bayes _should be_ used in the ongoing process – just in the quantitative inference stage. The first and third stages are open worlds.

          “speculative inference -> quantitative inference -> evaluative inference or
          abduction -> deduction -> induction -> or
          First -> Second -> Third
          Over and over again, endlessly.”

          http://statmodeling.stat.columbia.edu/2017/09/27/value-set-act-represent-possibly-act-upon-aesthetics-ethics-logic/

        • Chris – yes I think we probably agree more than we don’t.

          I would add that even for (informal) forecasting I find myself often defaulting to more minimax style reasoning than average case reasoning and I think it’s a bit of a shame that both don’t seem to be emphasised together as complementary as much anymore.

          I actually accidentally took a post grad decision theory course back in the day – before it was fashionable again! – where we did decision making under various types of uncertainty including both probabilistic and ‘extreme’. I often found myself more sympathetic to the more non-probabilistic (extreme uncertainty) and/or more qualitative methods. I’ve never been much of a fan of expected utility ideas.

        • Daniel –

          It’s not clear to me, even when we ‘close’ the world temporarily why we would want to adopt additivity. Why not say that multiple models even of those only considered so far could be equally good? Why should the support of a model be a function of the support or not of the other models, rather than of more direct ‘positive’ evidence?

          Similarly, see the link I posted above – psychologically I see no reason why you might want to give both a model and its contradictory equal support. While flat probabilistic priors have many issues for representing ignorance, in terms of how they transform under changes of variables, flat possibility distributions remain flat over their support. See e.g. https://arxiv.org/abs/1801.04369 which I wrote for fun.

        • Andrew – I looked at the appendix you mentioned, thanks.

          A big issue I have with these results is that they require identifiability (and/or a non-singular Fisher information matrix). You basically build in uniqueness and then derive asymptotic normality.

          Which is fine in a toy setting, but imo very far from an ‘open world’ in which there might be no unique solution, or many solutions.

          In such cases the asymptotics is not such that ‘as more and more data arrive, the posterior distribution of the parameter vector approaches multivariate normal’.

          To me ‘open world’ modelling should most definitely not assume identifiability (or e.g. a non-singular Fisher information matrix).

        • >It’s not clear to me, even when we ‘close’ the world
          >temporarily why we would want to adopt additivity. Why not
          >say that multiple models even of those only considered so
          >far could be equally good?

          Of course, multiple models can be equally good: if I have 3 models and posterior weights on them of 1/3, 1/3, 1/3, they’re all equally good, yes?

          Since every finite (nonzero) measure can be rescaled to a probability measure, are you asking why we choose to normalize our measures to 1? I think the reason is that our measure is a dimensionless ratio and there is no “absolute” scale for “how good” a model is that could easily be standardized among different people.

          You could choose to “normalize” your measure so that the peak density is always 1, or always 100, or you could do other things. But as long as the measure is finite, you can always rescale it to be a probability measure (one that integrates to 1). So the choice of integrating to 1 is arbitrary, I admit, but it has good properties.
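          A minimal sketch of that rescaling point (the scores below are made up, not from any real comparison):

            import numpy as np

            scores = np.array([2.0, 2.0, 2.0])   # three equally good models, on any finite scale
            print(scores / scores.sum())         # [1/3, 1/3, 1/3] once rescaled to sum to 1

            scores = np.array([5.0, 1.0, 0.5])   # unequal (hypothetical) scores
            print(scores / scores.sum())         # different weights, but they still sum to 1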

          If you’re asking “why use a measure at all?” then I think the answer is that you’re starting from the assumption that you’re going to *measure* the degree to which a certain subset of the possibilities accord with your theory. If you want to do that, you need a measure.

          If you *don’t* want to do that, then I agree Bayes isn’t for you.

          Why not? Because of the reasons discussed already! See e.g. the discussion in the Shackle link – many have come to similar conclusions. I’m saying that additivity seems unintuitive in these circumstances; to convince me, I need to see a reason to use it.

          Note that additivity is not the same as normalisation. One can instead use, e.g., a maxitive measure, as in my most recent link (a toy contrast is sketched below). So additivity is not built into the idea of a measure of uncertainty in general.

          There are many ways to represent uncertainty. Additive measures just don’t seem right to me for some circumstances, and I haven’t seen a good argument for why I should impose additivity in general. It is an assumption!

          Hampel even suggests that the original conception of probability was non-additive – google ‘non-additive probability Hampel’ or similar. Huber speculated whether statistics should instead be founded on Choquet capacities.

          The point is not that additivity is not sometimes correct, but that many have in fact questioned it, particularly in cases of ‘extreme’ or ‘large’ uncertainty.
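          As a rough sketch of the contrast being drawn (my own toy, not from Hampel or Huber): an additive measure has to split weight across disjoint events, while a maxitive (possibility) measure combines them with max, so complete ignorance can assign full possibility to both A and not-A.

            def additive_union(p_a, p_b):
                # probability of a union of two disjoint events under an additive measure
                return p_a + p_b

            def maxitive_union(pi_a, pi_b):
                # possibility of a union: Pi(A or B) = max(Pi(A), Pi(B))
                return max(pi_a, pi_b)

            # 'ignorance' about A versus not-A
            print(additive_union(0.5, 0.5))  # forced to split: 0.5 + 0.5 = 1
            print(maxitive_union(1.0, 1.0))  # both fully possible, and the union is still 1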

          The consequences of non-additivity seem kinda bizarre for the purposes of statistics. To be honest I haven’t seen people take that seriously and show how it gives very useful results. I don’t really think we ever do anything in an M-open world; it’s always closed for the moment by what kind of stuff we’re including in our analysis…

          So long as I’m willing to “go back to the drawing board” and add in additional models, I always want some measure of *which of these models / parameters best accords with my theory and data*. And if a certain region of parameter space accords well… I want the remaining regions to accord badly, precisely because accordance(RegionA) + accordance(RegionNotA) = 1 *within the restricted analysis*.

          I always think it’s important to consider if your analysis is too limited, but I think it needs to be a meta consideration until you can specify a sufficiently formalized model to be inserted into the analysis.

          If you can show me a whole bunch of applications where non-additivity makes sense I would be willing to reconsider it… but I think you’ll find that hard, because in practice thinking about the “outside model space” in the absence of a specific computable model means you can’t really analyze. I can’t do “quantum mechanics… or something else”; I can only do quantum mechanics vs strangeTheoryA, which computes different results.

          The paper you link (Hampel 2007) describes Bayesian reasoning in an older style: Savage and de Finetti building their foundations on betting arguments. He then claims that the desire to be able to place a bet at any time is an unnecessary restriction on the role of statistics, because it implies the sum-to-one condition (or at least I think that’s what he’s saying).

          But I don’t take those foundations as truly foundational for my purposes. I think the Cox lineage makes more sense. The big problem with Cox, as you’ve pointed out, is that it’s a calculation about an underlying “truth”, and when there is no underlying truth… you need a new interpretation if you are to keep Bayes.

          My point in working on my interpretation is that what Bayesian probability measures is basically compatibility with assumptions. I call it accordance in my paper. The idea is that as a scientist (not an engineer who just wants predictive ability) I want to collect data and find out what the data implies about my theory, specifically, which subsets of the parameter space make my predictions “work best” according to the theory formalized in my likelihood function, as well as the theory formalized in my base data-free theory (my prior). I see Bayes as a kind of microscope that lets me mathematically look inside my theory and find out which sub-theories are viable in the presence of observation.

          From that perspective, the ruling out of certain regions of parameter space needs to move probability to the remaining space, and I think this is where the sum-to-one comes in. If you don’t have additivity, you don’t have sums, and if you don’t have sums you don’t have sum-to-one, a.k.a. conservation of probability.

          It’s also possible to do this with multiple theories using mixture models, but again, I’m looking inside a *particular given overall mixture theory* and trying to find the viable sub-theories, the things not ruled out by what I expect from my predictions. That ruling out some regions pushes probability onto the remaining regions is a feature, not a bug (a tiny numerical sketch below).
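          Here is that renormalisation in miniature (the posterior values and the ruled-out cells are made up):

            import numpy as np

            post = np.array([0.10, 0.20, 0.30, 0.25, 0.15])          # toy posterior over a parameter grid
            ruled_out = np.array([True, True, False, False, False])  # suppose data kills the first two cells

            restricted = np.where(ruled_out, 0.0, post)
            restricted /= restricted.sum()    # the remaining regions absorb the probability
            print(restricted)                 # [0, 0, ~0.43, ~0.36, ~0.21], still sums to 1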

          I think it’s fairly well established that Cox’s theorem doesn’t convincingly show that the only or best approach to uncertainty is probability. Things like additivity etc. are effectively just assumed in.

          Fair enough if you don’t want to consider other approaches, but many do. One reason I do is that I notice that no one really believes their probability results or forecasts, because of things like M-open issues or lack of identifiability etc.

          Instead of fighting it, I take it for what it seems to indicate about how we reason under uncertainty.

        • ojm, I would be very curious to see some of your applied work using other approaches to uncertainty. I think where we are all headed, including Daniel Lakeland in his own attempt at formalization (insofar as he has explained here), is that our inference is always conditional on the assumptions we are making, i.e. our scientific models/stories/whatever. Thus, ‘nobody really believes their probability results’ because, as De Finetti famously said, “probability does not exist!” (outside of stylized balls-in-urns thought experiments or whatever). I never really have p(theta|y), rather p(theta|y,M), and if I change M, or I expand M, or I expand to a set of M(i), my probabilities can change! Using the probability mechanics requires buy-in to additivity and sum-to-one, and I am personally interested to see applied examples where this is *not* what one would want to do.

        • re applied examples – much of this is hidden in the background, eg even in work where I appear to have gone Bayes I have realised that’s not really what I’m doing and that other things are ‘doing the work’.

          I’m trying to be more honest about this in current and future work and explore new approaches, but it’s a process (first step is to acknowledge there’s a problem I guess). Luckily there are many other applied examples from other folk if you’re willing to look!

        • Slightly more concretely – anything based on pure likelihood theory a la Edwards or Royall, robust stats a la Hampel or Huber, Dempster-Shafer theory, fuzzy logic etc would be examples.

          Personally my workflow these days is more akin to classical inverse problems theory:

          – does a decent solution exist?
          – is it unique or are there many equally good solutions etc?
          – are my results stable wrt small perturbations?

          The first part is pretty much just point estimation (though I would always do ‘predictive’, i.e. data-space, checks!).

          Second step is identifiability analysis and lies outside of Bayes imo.

          The final step is where Bayes enters for me, but I increasingly care about the first two steps and am happy to just, e.g., bootstrap this last one (a rough sketch of all three steps is below).
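          For concreteness, a minimal numpy sketch of what those three steps could look like on a made-up least-squares problem (the specific checks are just stand-ins, not anyone’s actual workflow):

            import numpy as np

            rng = np.random.default_rng(1)
            X = rng.normal(size=(50, 2))
            y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.5, size=50)

            # 1. does a decent solution exist?  fit a point estimate and look at the fit
            beta_hat, rss, rank, sv = np.linalg.lstsq(X, y, rcond=None)
            print("point estimate:", beta_hat, "residual sum of squares:", rss)

            # 2. is it unique?  inspect the singular values of X (a near-zero value
            #    would flag non-identifiability / many equally good solutions)
            print("singular values:", sv)

            # 3. is it stable?  bootstrap the rows and look at the spread of estimates
            boot = [np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
                    for idx in (rng.integers(0, len(y), len(y)) for _ in range(200))]
            print("bootstrap sd of estimates:", np.array(boot).std(axis=0))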

  3. Thanks for these constructive comments, and a free review :-) We will take these comments into account in a revision.
    Cheers,
    Quentin & E.J.

  4. I also read that paper, and was wondering how different the estimates from the wrong models were compared to those from the true ones.
