This post is by Aki

We mention the problem of bias induced by model selection in A survey of Bayesian predictive methods for model assessment, selection and comparison, in Understanding predictive information criteria for Bayesian models, and in BDA3 Chapter 7, but we haven’t had a good answer for how to avoid that problem (except by not selecting any single model, but instead integrating over all of them).

We (Juho Piironen and I) recently arXived a paper, Comparison of Bayesian predictive methods for model selection, which I can finally recommend as giving a useful practical answer for how to do model selection with greatly reduced bias and overfitting. We write:

The results show that the optimization of a utility estimate such as the cross-validation score is liable to finding overfitted models due to relatively high variance in the utility estimates when the data is scarce. Better and much less varying results are obtained by incorporating all the uncertainties into a full encompassing model and projecting this information onto the submodels. The reference model projection appears to outperform also the maximum a posteriori model and the selection of the most probable variables. The study also demonstrates that the model selection can greatly benefit from using cross-validation outside the searching process both for guiding the model size selection and assessing the predictive performance of the finally selected model.

Our experiments were made with Matlab, but we are working on Stan+R code, which should be available in a few weeks.

Aki:

This is really cool. I have a thought, which is related to the idea I had earlier for comparing two models. Suppose you fit M different models to the data, and you have N data points. Then you have an N*M matrix of prediction errors. Or, actually, since you don’t observe the true parameter values, you have an N*M*S array of data*models*simulation draws, where each element in that array is a log posterior density for data point n in model m based on posterior simulation draw s.
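To make the shape of that object concrete, here is a minimal numpy sketch. The log densities are random stand-ins (in practice they would come from each model's posterior simulations), and `log_mean_exp` is a hypothetical helper name; the point is just collapsing the N*M*S array over draws to the N*M matrix of log pointwise predictive densities.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, S = 50, 3, 1000  # data points, models, posterior simulation draws

# Hypothetical stand-in for log p(y_n | theta^(s), model m); in a real
# analysis these come from each fitted model's posterior draws.
logp = rng.normal(loc=-1.0, scale=0.5, size=(N, M, S))

def log_mean_exp(a, axis):
    """Numerically stable log of the mean of exp(a) along an axis."""
    amax = a.max(axis=axis, keepdims=True)
    out = amax + np.log(np.mean(np.exp(a - amax), axis=axis, keepdims=True))
    return out.squeeze(axis)

# Collapse over draws: lppd[n, m] = log (1/S) sum_s p(y_n | theta^(s), m),
# giving the N x M matrix of log pointwise predictive densities.
lppd = log_mean_exp(logp, axis=2)
print(lppd.shape)
```

From here, summing each column over n gives the usual per-model lppd that LOO or WAIC then corrects for overfitting.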

The usual approach is to analyze each of the M models separately and, for each, come up with a summary such as LOO or WAIC. Instead, perhaps it makes sense to fit a hierarchical model with data-point effects, model effects, and interactions.
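As a rough sketch of what such a decomposition looks like, here is a crude non-hierarchical version: a least-squares additive fit of data-point effects and model effects to a hypothetical N x M matrix of pointwise log-score contributions. A real implementation would use partial pooling (e.g., in Stan) rather than raw row and column means; all names and data here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 50, 3

# Hypothetical N x M matrix of pointwise log-score contributions (e.g., LOO),
# with a small built-in difference between the three models.
score = rng.normal(size=(N, M)) + np.array([0.0, 0.3, -0.2])

# Additive decomposition score[n, m] ~ mu + data_effect[n] + model_effect[m],
# estimated by row/column means (the no-pooling analogue of a hierarchical fit).
mu = score.mean()
data_effect = score.mean(axis=1) - mu    # per-data-point "difficulty"
model_effect = score.mean(axis=0) - mu   # per-model quality after removing it
resid = score - (mu + data_effect[:, None] + model_effect[None, :])

# model_effect compares models with shared data-point difficulty removed;
# resid holds the data-by-model interactions.
print(np.round(model_effect, 2))
```

The attraction is that the model comparison then borrows strength across data points instead of treating each model's summary as independent.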

I think this is related to your paper. The idea in your paper is probably better than the idea I have here, but maybe my idea could be useful in the context of what you are doing.

Also, I think the ideas in your paper are related to the recent work of Christian Robert et al. on using mixture models instead of model averaging.

My plan is to first work this out in the simple setting where just two models are being compared, so that multiple comparisons is not such a concern and we can focus on the comparison of estimated expected prediction error.

This strikes me as similar to something David Dunson did (arXiv:1403.1345), where you treat each model as a “weak learner” and look for the optimal combination. They arrived at a solution using a carefully specified Dirichlet distribution, which links to Judith and Kerrie’s over-specified mixture paper. This is basically the two-step version of Christian et al.’s mixture method.

I think the difference between this and what Aki is suggesting is that there isn’t really any selection in these methods. My favourite bit of Aki’s approach is that it uses a non-Bayesian solution to do a not-naturally-Bayesian thing (selection). Or at least that’s my reading of it.

Unrelated: I do like the thinking in the Yang and Dunson paper, where they get sparsity by explicitly constructing a prior with mass on sparse models, which strikes me as a more sensible way forward than, say, the horseshoe, where a bunch of independent priors with big tails magically give sparsity. Sparsity is a joint property, so it should be specified using dependent priors. (That being said, I can’t actually reproduce some of the calculations needed to make the proofs work, which is always nerve-wracking.)

Or, to connect it to a more fundamental thing, this strikes me as similar to “boosting” in Machine Learning. A sort of model-based boosting.

> My favourite bit of Aki’s approach is that it uses a non-Bayesian solution to do a not naturally Bayesian thing (selection). Or at least that’s my reading of it.

I think I’m using Bayesian theory following Bernardo & Smith (1994), who combine Bayesian theory and decision theory. I’m trying to make the best decision (as described in our review paper); we just need to use an approximation for the future data distribution and approximations for some integrals, but that doesn’t make it non-Bayesian. You could say it’s approximate-Bayesian, but then most of Bayesian inference is approximate. (There are people who say that using Monte Carlo is non-Bayesian, too.)

Aki

Hi, Dear Mr. Vehtari,

Where can I find the Stan+R code for your paper “Comparison of Bayesian predictive methods for model selection”?

I really need it and could not find any other way to do my analysis project.

Can you help me?
