tl;dr If you have bad models, bad priors or bad inference choose the simplest possible model. If you have good models, good priors, good inference, use the most elaborate model for predictions. To make interpretation easier you may use a smaller model with similar predictive performance as the most elaborate model.

Merijn Mestdagh emailed me (Aki) and asked

In your blogpost “Comments on Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection” you write that

“My recommendation is that if LOO comparison taking into account the uncertainties says that there is no clear winner”…“In case of nested models I would choose the more complex model to be certain that uncertainties are not underestimated”. Could you provide some references or additional information on this claim?

I have to clarify additional conditions for my recommendation for using the encompassing model in the case of nested model

- models are used to make predictions
- the encompassing model has passed model checking
- the inference has passed diagnostic checks

The Bayesian theory says that we should integrate over the uncertainties. The encompassing model includes the submodels, and if the encompassing model has passed model checking, then the correct thing is to include all the models and integrate over the uncertainties (and I assume that inference is correct). To pass the model checking, it may require good priors on the model parameters, which maybe something many ignore and then they may get bad performance with more complex models. If the models have similar loo performance, the encompassing model is likely to have thicker tails of the predictive distribution, meaning it is more cautious about rare events. I think this is good.

The main reasons why it is so common to favor more parsimonious models

- The maximum likelihood inference is common and it doesn’t work well with more complex models. Favoring the simpler models is a kind of regularization.
- Bad model misspecification. Even Bayesian inference doesn’t work well if the model is bad, and with complex models there are more possibilities for misspecifing the model and the misspecification even in one part can have strange effects in other parts. Favoring the simpler models is a kind of regularization.
- Bad priors. Actually priors and model are inseparable, so this is kind of same as the previous one. It is more difficult to choose good priors for more complex models, because it’s difficult for humans to think about high dimensional parameters and how they affect the predictions. Favoring the simpler models can avoid the need to think harder about priors. See also Dan’s blog post and Mike’s case study.

But when I wrote my comments, I assumed that we are considering sensible models, priors, and inference, so there is no need for parsimony. See also paper Occam’s razor illustrating the effect of good priors and increasing model complexity.

In “VAR(1) based models do not always outpredict AR(1) models in typical psychological applications.” https://www.ncbi.nlm.nih.gov/pubmed/29745683 we compare AR models with VAR models (AR models are nested in VAR models). When both models perform equally we prefer, in contrast to your blogpost, the less complex AR models because

– The one standard error rule (“ Hastie et al. (2009) propose to select the most parsimonious model within the range of one standard error above the prediction error of the best model (this is the one standard error rule)”).

Hastie is not using priors or Bayesian inference, and thus he needs the parsimony rule. He may also have implicit cost function…

– A big problem (in my opinion) in psychology is the over interpretation of small effects which exacerbates by using complex models. I fear that some researchers will feel the need to interpret the estimated value of every parameter in the complex model.

Yes, it is dangerous to over-interpret especially if the estimates don’t include good uncertainty estimates, and even then it’s difficult to make interpretations in case of many collinear covariates.

My assumption above was that the model is used for predictions and I care about best possible predictive distribution.

The situation is different if we add cost of interpretation or cost of measurements in the future. I don’t know literature analysing what is a cost of interpretation for AR vs VAR model, or cost of of interpretation of additional covariate in a model, but when I’m considering interpretability I favor smaller models with similar predictive performance as the encompassing model. But if we have the encompassing model, then I also favor projection predictive approach where the full model is used as the reference so that the selection process is not overfitting to the data and the inference given the smaller model is conditional the information from the full model (Piironen & Vehtari, 2017). In case of a small number of models, LOO comparison or Bayesian stacking weights can also be sufficient (some examples here and here).

Aki:

Also relevant are these discussions from several years ago:

Against parsimony

Against parsimony, again

David MacKay and Occam’s Razor

Occam

Yes! Thanks, Andrew!

Very thoughtful and well written post.

I think it will help me get less vague regarding the good, the bad and the ugly of Occam’s razor http://statmodeling.stat.columbia.edu/2018/06/20/quest-beauty-lead-science-astray/#comment-768949

Perhaps the projection predictive approach will keep opportunities to get less wrong while currently making use of the simpler model.

This seems to be a statement about your personal risk-aversion in contrast to the TLDR which is more of a general statement about model selection.

If we both start with the same prior beliefs and examine the evidence in the same way, is it illogical for us to end up with slightly different beliefs (model choice) in the end? Would a complex model be more likely to assign small probabilities to outcomes that are actually impossible? Would a simple model be more likely to predict that rare events are nearly impossible? Are “thicker tails of the predictive distribution” always a good thing?

> Would a complex model be more likely to assign small probabilities to outcomes that are actually impossible? Would a simple model be more likely to predict that rare events are nearly impossible?

We hope to catch these cases in the model checking, but often we don’t know clear line for when possible changes to impossible.

Aki:

It is not entirely clear to me how a simpler model, necessarily, is easy to interpret. Sure, there are less things to interpret, but having the full posteriors allows for richer inference than basically setting to zero. Of course, in some situations we must set things to zero, or it is just to much effort to make inference with many many variables, but I think these are separate issue altogether.

It remains a mystery to me how not computing the posterior, for some, somehow allows for better inferences. I would suggest this is especially the case with multilevel models where I want the full covariance structure, and if the effects are negligible, then what is gained from removal, etc..

> It is not entirely clear to me how a simpler model, necessarily, is easy to interpret.

I didn’t say easy, I said easier which still can be difficult. I could also say that simpler models are easier to explain and we would assume that easier to explain is easier to interpret, but on the other hand we might hide something necessary. My examples are real life industrial, epidemiological and health care predictive models. I thin it it’s easier if I can tell 6 covariates which have 95% accuracy as the full set and adding 14 more gives as that 5%-points more. The application field experts have also been happy with this. More generally, this is how most of teaching goes, we start with simpler models of the world (biology, weather, society, physics, etc) and bit by bit make the models more elaborate. Sure we teach less, but having the full quantum theory may be too much for and not needed in all cases.

I guess a similar argument will also explain why variational inference with a “parsimonious” approximation family can sometimes out predict full inference?

I have seen this at least in cases, where vi is used to compute approximate marginal likelihood for hyperparameters. The vi approximated marginal likelihood favors hyperparameters which make the distributional approximation closer to the conditional distribution. We may say that variational inference with a “parsimonious” approximation family may introduce “an implicit prior”. I don’t like mixing explicit model and prior assumptions with implicit algorithmic assumptions, because it is difficult to diagnose and to predict what happens when we change either.

It is beginning to look more and more like statistics tests are going to be phased out. I see no other solution to the extent of controversies over tests and even modeling. The explanations are so darn technical that they obscure their explanatory power.

I guess what I’m wondering is whether we an agree to any set of conclusions. Even one will do. We seem to end with ‘more research’ is needed. Well fine. Maybe what we need to do is keep reviewing what have published already. After all there are millions of articles already in circulation, of which we access some minute number.