## Limitations of “Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection”

“If you will believe in your heart and confess with your lips, surely you will be saved one day” (The Mountain Goats, paraphrasing Romans 10:9)

One of the weird things about working with people a lot is that it doesn’t always translate into multiple opportunities to see them talk. I’m pretty sure the only time I’ve seen Andrew talk was at a fancy lecture he gave at Columbia. He talked about many things that day, but the one that stuck with me was that the problem with p-values and null-hypothesis significance testing (NHST) wasn’t so much that the procedure was bad. (I’d not heard it phrased that well before, but this is my memory of the gist of what he was saying. Do not hold him to this opinion!) The problem is that people are taught to believe that there exists a procedure that can, given any set of data, produce a “yes/no” answer to a fairly difficult question. So the problem isn’t the specific decision rule that NHST produces, so much as the idea that a universally applicable decision rule exists at all. (And yes, I know the maths. But the problem with p-values was never the maths.)

This popped into my head again this week as Aki, Andrew, Yuling, and I were working on a discussion to Gronau and Wagenmakers’ (GW) paper “Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection”.

Our discussion is titled “Limitations of ‘Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection’” (published version) and it extends various points that Aki and I have made on this blog.

To summarize our key points:

1. It is a bad thing for GW to introduce LOO model selection in a way that doesn’t account for its randomness. In their very specialized examples this turns out not to matter because they choose such odd data that the LOO estimates have zero variance. But it is nevertheless bad practice.
2. Stacking is a way to get model weights that is more in line with the LOO-predictive concept than GW’s ad hoc pseudo-BMA weights. Although stacking is also not consistent for nested models, in the cases considered in GW’s paper it consistently picks the correct model. In fact, the model weight for the true model in each of their cases is $w_0=1$ independent of the number of data points.
3. By not recognizing this, GW missed an opportunity to discuss the limitations of the assumptions underlying LOO (namely that the observed data is representative of the future data, and each individual data point is conditionally exchangeable).  We spent some time laying these out and proposed some modifications to their experiments that would make these limitations clearer.
4. Because LOO is formulated under much weaker assumptions than are used in this paper (namely, LOO does not assume that the data is generated by one of the models under consideration, the so-called “M-Closed assumption”), it is a little odd that GW only assess its performance under this assumption. This assumption almost never holds. If you’ve ever used the famous George Box quote, you’ve explicitly stated that the M-Closed assumption does not hold!
5. GW’s assertion that when two models can support identical predictions (such as in the case of nested models), the simplest model should be preferred is not a universal truth, but rather a specific choice that is being made. This can be enforced for LOO methods, but like all choices in statistical modelling, it shouldn’t be made automatically or by authority, but should instead be critically assessed in the context of the task being performed.
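To make points 1 and 2 a little more concrete, here is a minimal sketch in Python. It assumes you already have a matrix of pointwise LOO log predictive densities (one row per data point, one column per model), for instance from PSIS-LOO. The function names `pseudo_bma_weights`, `stacking_weights` and `elpd_diff_and_se` are made up for this illustration, the toy numbers are invented, and the EM-style fixed-point update used for stacking is just one simple way to maximise the objective, not the optimiser any particular package uses.

```python
import numpy as np

def pseudo_bma_weights(lpd):
    """Weights proportional to exp(elpd_k), where elpd_k sums column k of lpd."""
    elpd = lpd.sum(axis=0)
    z = np.exp(elpd - elpd.max())  # subtract the max for numerical stability
    return z / z.sum()

def stacking_weights(lpd, max_iter=10_000, tol=1e-12):
    """Maximise sum_i log(sum_k w_k * exp(lpd[i, k])) over the simplex.

    Uses the EM fixed-point update for mixture weights with fixed components
    (an illustrative choice; the objective is concave in w).
    """
    n, k = lpd.shape
    # Rescaling each row by a constant does not change the maximiser.
    dens = np.exp(lpd - lpd.max(axis=1, keepdims=True))
    w = np.full(k, 1.0 / k)
    for _ in range(max_iter):
        resp = dens * w
        resp /= resp.sum(axis=1, keepdims=True)
        w_new = resp.mean(axis=0)
        if np.abs(w_new - w).max() < tol:
            return w_new
        w = w_new
    return w

def elpd_diff_and_se(lpd, a=0, b=1):
    """Difference in elpd between models a and b, with its standard error."""
    d = lpd[:, a] - lpd[:, b]
    return d.sum(), np.sqrt(len(d) * d.var(ddof=1))

# Invented toy data: model 0 predicts three points well, model 1 predicts the fourth well.
lpd = np.log(np.array([[0.8, 0.2],
                       [0.8, 0.2],
                       [0.8, 0.2],
                       [0.2, 0.8]]))

print("pseudo-BMA:", pseudo_bma_weights(lpd))
print("stacking:  ", stacking_weights(lpd))
print("elpd diff, SE:", elpd_diff_and_se(lpd))
```

On this toy example the standard error of the elpd difference is as large as the difference itself, which is exactly the kind of information that gets thrown away if you only report which model has the larger LOO estimate.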

All of this has made me think about the idea of doing model selection. Or, more specifically, it’s made me question whether or not we should try to find universal tools for solving this problem. Is model selection even possible? (Danielle Navarro from UNSW has a particularly excellent blog post outlining her experiences with various existing model selection methods that you all should read.)

So I guess my very nebulous view is that we can’t do model selection, but we can’t not do model selection, but we also can’t not not do model selection.

In the end we need to work out how to do model selection for specific circumstances and to think critically about our assumptions. LOO helps us do some of that work.

To close off, I’m going to reproduce the final section of our paper because what’s the point of having a blog post (or writing a discussion) if you can’t have a bit of fun.

Can you do open science with M-Closed tools?

One of the great joys of writing a discussion is that we can pose a very difficult question that we have no real intention of answering. The question that is well worth pondering is the extent to which our chosen statistical tools influence how scientific decisions are made. And it’s relevant in this context because a key difference between model selection tools based on LOO and tools based on marginal likelihoods is what happens when none of the models could reasonably generate the data.

In this context, marginal likelihood-based model selection tools will, as the amount of data increases, choose the model that best represents the data, even if it doesn’t represent the data particularly well. LOO-based methods, on the other hand, are quite comfortable expressing that they cannot determine a single model that should be selected. To put it more bluntly, marginal likelihood will always confidently select the wrong model, while LOO is able to express that no one model is correct.
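A small simulation can illustrate the contrast. This is only a sketch under invented assumptions: the data come from a 60/40 mixture of N(-2, 1) and N(2, 1), and the two candidate models are the fixed distributions N(-2, 1) and N(2, 1), so neither candidate is the truth. Because the candidates have no free parameters, the marginal likelihood is just the likelihood and the LOO predictive density is just the model density, which keeps the code short. The stacking weights are again found with a simple EM-style update, chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Assumed truth: 60/40 mixture of N(-2, 1) and N(2, 1).
from_first = rng.random(n) < 0.6
x = np.where(from_first, rng.normal(-2.0, 1.0, n), rng.normal(2.0, 1.0, n))

def log_normal_pdf(x, mu):
    """Log density of N(mu, 1)."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2

# Column 0: candidate N(-2, 1); column 1: candidate N(2, 1).
lpd = np.column_stack([log_normal_pdf(x, -2.0), log_normal_pdf(x, 2.0)])

# Posterior model probabilities under equal prior model probabilities.
log_ml = lpd.sum(axis=0)
post = np.exp(log_ml - log_ml.max())
post /= post.sum()

# Stacking weights via the EM fixed-point update for mixture weights.
dens = np.exp(lpd - lpd.max(axis=1, keepdims=True))
w = np.array([0.5, 0.5])
for _ in range(500):
    resp = dens * w
    resp /= resp.sum(axis=1, keepdims=True)
    w = resp.mean(axis=0)

print("posterior model probabilities:", post)
print("stacking weights:             ", w)
```

With this much data the posterior model probability piles up almost entirely on the first candidate, even though it is wrong, while the stacking weights stay well away from 0 and 1.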

We leave it for each individual statistician to work out how the shortcomings of marginal likelihood-based model selection balance with the shortcomings of cross-validation methods. There is no simple answer.

1. Andrew says:

Hi, Dan. I continue to feel uncool that I’m an REM fan so it makes me happy that you link to a page that includes REM lyrics (non-ironically, I hope). Perhaps I can queue up some REM-related statistics songs along the lines of this.

2. Olav says:

I understand the sentiment in the last quoted paragraph, but it seems to me that confidence in the sense you’re talking about is a matter of degree and can be achieved in a variety of ways, including by using marginal likelihood. LOO-based methods may be “non-confident” in the sense that they won’t assign a weight of 1 to any particular model, but they’re still confident on a higher level in the sense that they will assign a precise weight to each model. Marginal likelihood based methods can imitate this behavior if you embed all the models in a larger supermodel (a convex combination of the models). Given enough data, the marginal likelihood may well turn out to prefer the supermodel to any of the submodels, which would essentially be equivalent to assigning a < 1 weight to each submodel. Furthermore, it seems to me that if you want a method that's even less confident, you can make the weight assigned to each model be a free parameter with its own associated probability distribution. So, again, the kind of confidence you're talking about comes in degrees and can be achieved in multiple ways.

• Andrew says:

Olav:

Yes, I agree. If you can build the continuous supermodel, that can work better than cross validation, for many reasons. What we should really be saying is, “Don’t compare marginal likelihoods unless you have the continuous supermodel.”

There’s a whole literature in statistics on defining and computing marginal likelihoods in the absence of a continuous supermodel, and I think it’s mostly a waste of time, and I think lots of people do it without understanding the problems, doing it only because they think it’s what they’re supposed to do.

In chapter 7 of BDA3 (chapter 6 of the earlier editions), we’re pretty clear, I think, that we prefer continuous model expansion, and that LOO is a stopgap for those settings where we have not constructed the continuous supermodel.

• Dan Simpson says:

Interesting. Do people do this? Is it broadly recommended? Maybe the only challenge I can see is the usual problem with unbalanced learning rates for nested models (it will learn that the simpler model is true much more slowly than that the mixture is true. The maths is in that paper on non-local priors).

Would you ever get an answer that wasn’t the mixture?

• Olav says:

Dan, which proposal are you asking about? I’m no expert on this topic, so I feel a bit silly talking about it, but I can think of two ways of constructing a supermodel. Suppose we have models M1, M2, M3, etc. One way of making a supermodel is by combining the predictors from the various models. For example, suppose M1 says that y = ax + bw and M2 says that y = cw + dz. Then a supermodel that contains M1 and M2 as special cases would be MS: y = ax + bw + dz. Depending on what the priors over the model parameters are, MS may have a better marginal likelihood than either M1 or M2. However, the reverse can also happen (that would be an instance of the “Bayesian Ockham’s razor” phenomenon).

Another way of creating a supermodel is to just directly form the likelihood of the supermodel out of the marginal likelihoods of M1, M2, M3, etc. That is, construct MS* as follows: p(x|a, b, c, …, MS*) = a*p(x|M_1) + b*p(x|M_2) + c*p(x|M_3) +…, with the restriction that a+b+c+…=1, and a, b, c, … > 0. Then assign the parameters a, b, c, etc. prior probability distributions. Given x you can then calculate the posterior probabilities of the parameters, the marginal likelihood of MS*, and so on. Do people actually do this or recommend it? I have no idea. Would the marginal likelihood of MS* always be better than the marginal likelihood of M1, M2, etc.? I don’t see why.

• Olav says:

Actually, on the second way of forming the supermodel the supermodel will always be outperformed by the best-performing submodel (on any given data x). Suppose M1 has the highest likelihood on x, out of M1, M2, M3, etc., then p(x|a, b, c, …, MS*) = a*p(x|M_1) + b*p(x|M_2) + c*p(x|M_3) +… < a*p(x|M_1) + b*p(x|M_1) + c*p(x|M_1) +… = p(x|M_1).

• Olav says:

Actually, that last post isn’t quite right (I wish there were a way of editing posts!).

3. suncraig says:

Why is Bayes theorem still controversial?

Bayes’ Theorem in the 21st Century
http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/Science-2013-Efron.pdf

• Andrew says:

As usual, my response to this sort of thing is that the decisions of what to include in the analysis, and the assumptions regarding the form of the data model, are typically much stronger than any assumptions in the prior distribution, so I think it’s pretty silly for statisticians and practitioners to express particular concern or skepticism about Bayesian methods.

• Dan Simpson says:

I wasn’t aware it was? Definitely not in the context of this post???

4. > discuss the limitations of the assumptions underlying … experiments that would make these limitations clearer … should instead be critically assessed in the context of the task being performed … There is no simple answer.

These are very important considerations, but my sense is that the culture in writing statistical papers is to subtly and unobtrusively cover one’s exposure to criticism technically in the discussion or footnotes. This is unfortunate as it results in failing to distinguish what something is (e.g. the maths of p-values) versus what to make of it.

• Dan Simpson says:

Danielle Navarro wrote an excellent discussion of the same paper that does a great job making this point. https://osf.io/2e5yv/

• Keith O’Rourke says:

So there is someone else who blogs in your style ;-)

Paper is good but the blog post is something not to be missed.

From the final paragraph of their blog post “we should try to avoid anything that resembles a prescriptive approach to inference that instructs scientists THIS IS HOW WE DO IT”

Now, CS Peirce once argued, the real justification of induction is that even though it will mislead us [give us a false sense of reality that is beyond direct access] if inquiry is adequately persisted in, the false sense will be rescinded.

So my advice is persist in adequate inquiry where adequate can mean no more than – all (important) false senses have been rescinded :-(

5. Christian Hennig says:

“All of this has made me think about the idea of doing model selection. Or, more specifically, it’s made me question whether or not we should try to find universal tools for solving this problem. Is model selection even possible? (Danielle Navarro from UNSW has a particularly excellent blog post outlining her experiences with various existing model selection methods that you all should read.)

So I guess my very nebulous view is that we can’t do model selection, but we can’t not do model selection, but we also can’t not not do model selection.”

Well, it’s not difficult to select models, so sure we can do that. The question rather is, can we do it in a way that we can trust that the selected model is any good.

The linked blog post states correctly that usually the model selection problem is framed in terms of finding the “true” model and that this is mostly useless in reality. Once we get out of this thinking, we can realise that a central issue here is that there are lots of different uses of models, there are different meanings of what a “good” model is for different aims and how to measure this, and consequently different models can be good for different aims.

Prediction quality is an obvious use and the appeal of cross-validation for measuring it is clear, as it is clear that it comes with limitations (as explored in GW and your response). GW conclude modestly that “CV is not a panacea” but it is always easy to criticise overgeneralisations that hardly anybody would subscribe to anyway. Much depends on what we want from our model apart from predicting successfully in the dataset in hand. In what respects do the future situations in which the model is used for prediction differ from the collection of the already existing data? Apart from prediction, what other aims should the model fulfill? Do we want to know whether we could get rid of measuring certain variables because they don’t contribute much? Will policy decisions be made based on a rough qualitative interpretation of the model, which may even change the conditions of the application of the model so that in fact the future will be manipulated away from the model and predictions from the model are only used in a hypothetical manner? Are there other reasons for modelling such as getting and communicating an easily comprehensible idea of “what goes on”? Is the model even consciously set up in order to learn and highlight deviations of reality from it rather than predicting it as well as possible? Will the model rather be used to simulate new data to explore the implications of certain decisions or methods? Is the model meant to represent the data only or rather some mechanistic understanding of the process that generated them? Who uses the model, and is it necessary (and maybe an issue) that the model is simple to allow these persons to handle it? Will the use of the model in the future be monitored, should the model potentially be adapted “on the fly” to new data, and should it be set up in a way that makes this easy? To what extent should the model incorporate subjective assessments of what’s likely in advance?
Etc. etc.

Whatever the answers to these questions are, it is easy to imagine that they lead to different (combinations of) quality criteria, so nobody should be surprised in the least that there is no uniform “solution to the model selection problem”.