A neuroscience graduate student named James writes in with a question regarding validating Bayesian model comparison using synthetic data:

I [James] perform an experiment and collect real data. I want to determine which of 2 candidate models best accounts for the data. I perform (approximate) Bayesian model comparison (e.g., using BIC – not ideal I know, but hopefully we can suspend our disbelief about this metric) and select the best model accordingly.

I have been told that we can’t entirely trust the results of this model comparison because 1) we make approximations when performing inference (exact inference is intractable) and 2) there may be code bugs. It has been recommended that I should validate this model selection process by applying it to synthetic data generated from the 2 candidate models; the rationale is that if the true model is recovered in each case I can rely on the results of model comparison on the real data.

My question is: is this model recovery process something a Bayesian would do/do you think it is necessary? I am wondering if it is not appropriate because it is conditional on the data having been generated from one of the 2 candidate models, both of which are presumably wrong; the true data that we collected in the experiment was presumably generated from a different model/process. I am wondering if it is sufficient to perform model comparison on the real data (without using recovery for validation) due to the likelihood principle – model comparison will tell us under which model the data is most probable.

I would love to hear your views on the appropriateness of model recovery and whether it is something a Bayesian would do (also taking practical considerations into account such as bugs/approximations).

I replied that my quick recommendation is to compare the models using leave-one-out cross validation as discussed in this article from a few years ago.

James responded with some further questions:

1. In our experiment, we assume that each participant performs Bayesian inference (in a Bayesian non-parametric switching state-space model), however we fit the (hyper)parameters of the model using maximum likelihood estimation given the behaviour of each participant. Hence, we obtain point estimates of the model parameters. Therefore, I don’t think the methods in the paper you sent are applicable as they require access to the posterior over parameters? We currently perform model comparison using metrics such as AIC/BIC, which can be computed even though we fit parameters using maximum likelihood estimation.

2. Can I ask why you are suggesting a method that assesses out-of-sample predictive accuracy? My (perhaps naive) understanding is that we want to determine which model among a set of candidate models best explains the data we have from our current experiment, not data that we could obtain in a future experiment. Or would you argue that we always want to use our model in the future so we really care about predictive accuracy?

3. The model recovery process I mentioned has been advocated for purely practical reasons as far as I can tell (e.g., the optimiser used to fit the models could be lousy, approximations are typically made to marginal likelihoods/predictive accuracies, there could be bugs in one’s code). So even if I performed model comparison using PSIS-LOO as you suggest, I could imagine that one could still advocate doing model recovery to check that the result of model comparison based on PSIS-LOO is reliable and can be trusted. The assumption of model recovery is that you really should be able to recover the true model when it is in the set of models you are comparing – if you can’t recover the true model with reasonable accuracy, then you can’t trust the results of your model comparison on real data. Do you have any thoughts on this?

My brief replies:

1a. No need to perform maximum likelihood estimation for each participant. Just do full Bayes: that should give you better inferences. And, if it doesn’t, it should reveal problems with your model, and you’ll want to know that anyway.

1b. Don’t do AIC and definitely don’t do BIC. I say this for reasons discussed in the above-linked paper and also this paper, Understanding predictive information criteria for Bayesian models.

2. You always care about out-of-sample predictive accuracy. The reason for fitting the model is that it might be applicable in the future. As described in the above-linked papers, AIC can be understood as an estimate of out-of-sample predictive accuracy. If you really only cared about within-sample prediction, you wouldn’t be using AIC at all; you’d just do least squares or maximum likelihood and never look back. The very fact that you were thinking about using AIC tells me that you care about out-of-sample predictive accuracy. And then you might as well do LOO and cut out the middleman.

3. Sure, yes, I’m a big fan of checking your fitting and computing process using fake-data simulation. So go for it!
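
To make point 3 concrete, here is a minimal sketch of a model-recovery check by fake-data simulation. It is not James's switching state-space model: the two candidate models (a constant mean versus a linear trend), the parameter values, and the function names `fit_bic`, `simulate`, and `model_recovery` are all hypothetical stand-ins, and BIC is used only because it is the criterion mentioned in the question. The idea is to simulate from each model, run the same fitting and comparison code on every fake dataset, and tabulate how often the generating model is selected.

```python
import numpy as np

def fit_bic(X, y):
    """Maximum-likelihood Gaussian linear fit; returns BIC (lower is better)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = np.mean(resid ** 2)              # ML estimate of the noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    n_params = k + 1                          # regression coefficients plus sigma
    return n_params * np.log(n) - 2 * loglik

def simulate(model, x, rng):
    """Hypothetical generating models: 0 = constant mean, 1 = linear trend."""
    slope = 0.0 if model == 0 else 0.8        # assumed values, for illustration only
    return 1.0 + slope * x + rng.normal(0.0, 1.0, size=x.size)

def model_recovery(n_sims=200, n_obs=50, seed=0):
    rng = np.random.default_rng(seed)
    x = np.linspace(-1, 1, n_obs)
    designs = [np.ones((n_obs, 1)), np.column_stack([np.ones(n_obs), x])]
    confusion = np.zeros((2, 2), dtype=int)   # rows: true model, columns: selected model
    for true_model in (0, 1):
        for _ in range(n_sims):
            y = simulate(true_model, x, rng)
            bics = [fit_bic(X, y) for X in designs]
            confusion[true_model, int(np.argmin(bics))] += 1
    return confusion

confusion = model_recovery()
print(confusion)
```

A confusion matrix that is far from diagonal points to a problem with the optimiser, the approximations, the code, or simply too little data to tell the models apart. A clean diagonal only says the pipeline works when one of the candidates is true, which is the limitation James raises in his first email.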

James may find the hBayesDM package (https://github.com/CCS-Lab/hBayesDM) helpful, as it includes examples of fully Bayesian hierarchical behavioral models of decision making (implemented via Stan).

Also, apart from sanity checking, fake-data simulation can be particularly helpful before doing the experiment (or, the next experiment, if the data is already collected). I’ll plug our recent paper where we show how this can substantially improve inference accuracy for behavioral models:

https://doi.org/10.1371/journal.pcbi.1007593

Would you advise K-fold CV to reduce the risk of overfitting the dataset, and also when the dataset is quite big?
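
For readers who have not set it up before, K-fold CV of the kind asked about here can be sketched as follows. This is only an illustration under simplifying assumptions: the data are simulated, the models are Gaussian linear regressions, and each held-out point is scored with plug-in maximum-likelihood estimates rather than a full posterior predictive distribution. The names `gaussian_logpdf` and `kfold_elpd` are made up for this example.

```python
import numpy as np

def gaussian_logpdf(y, mean, sigma2):
    """Pointwise log density of Normal(mean, sigma2)."""
    return -0.5 * (np.log(2 * np.pi * sigma2) + (y - mean) ** 2 / sigma2)

def kfold_elpd(X, y, k=10, seed=0):
    """Sum of held-out log predictive densities, using plug-in ML estimates."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    total = 0.0
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False               # hold this fold out of the fit
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        sigma2 = np.mean((y[train] - X[train] @ beta) ** 2)
        total += np.sum(gaussian_logpdf(y[test_idx], X[test_idx] @ beta, sigma2))
    return total

# Hypothetical data: a linear trend plus unit-variance Gaussian noise.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=200)
y = 1.0 + 1.5 * x + rng.normal(size=200)

X_const = np.ones((200, 1))
X_linear = np.column_stack([np.ones(200), x])
elpd_const = kfold_elpd(X_const, y)           # same seed, so both models share folds
elpd_linear = kfold_elpd(X_linear, y)
print(elpd_const, elpd_linear)                # higher is better
```

Each model is refit k times, so the cost is roughly k times that of a single fit, which matters when the likelihood is expensive.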

As someone who does very similar work (fitting hyperparameters of Bayesian models to behavioural neuroscience data) I have certainly done my share of maximum-likelihood fitting and AIC/BIC model comparison. It has some problems (as highlighted in the linked papers), although in practice I have not seen any issues (I guess I have been lucky!). Setting up a sampler using Metropolis-Hastings or DRAM (instead of simple max-likelihood) takes a little practice but is not too tricky in the type of low-dimensional models we typically use. One issue though is that it is likely to take a lot more model evaluations than a simple gradient-descent max-likelihood approach (an issue when calculating the likelihood is expensive). Then again, doing K-fold CV also takes a lot of model evaluations….

Regardless of which method you use though, you should always do parameter/model recovery.
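
As a rough illustration of the Metropolis-Hastings setup mentioned in this comment, here is a minimal random-walk sampler for a two-parameter toy model. Everything specific in it is an assumption made for the example: the model (Gaussian data with unknown mean and log standard deviation), the weak normal priors, the step size, and the names `log_posterior` and `metropolis`. It is not the DRAM variant and it is not tuned.

```python
import numpy as np

def log_posterior(theta, y):
    """Unnormalised log posterior for y ~ Normal(mu, exp(log_sigma))."""
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    loglik = np.sum(-0.5 * ((y - mu) / sigma) ** 2 - log_sigma - 0.5 * np.log(2 * np.pi))
    logprior = -0.5 * (mu / 10.0) ** 2 - 0.5 * (log_sigma / 2.0) ** 2  # assumed weak priors
    return loglik + logprior

def metropolis(y, n_iter=5000, step=0.15, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = np.random.default_rng(seed)
    theta = np.array([np.mean(y), np.log(np.std(y))])  # start near the ML estimate
    current = log_posterior(theta, y)
    draws = np.empty((n_iter, 2))
    n_accept = 0
    for i in range(n_iter):
        proposal = theta + step * rng.normal(size=2)
        cand = log_posterior(proposal, y)      # one model evaluation per iteration
        if np.log(rng.uniform()) < cand - current:
            theta, current = proposal, cand
            n_accept += 1
        draws[i] = theta
    return draws, n_accept / n_iter

# Hypothetical data with true mean 2.0 and true sd 1.5.
rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.5, size=100)
draws, accept_rate = metropolis(y)
post = draws[1000:]                            # discard early draws as warm-up
print(post.mean(axis=0), accept_rate)
```

The loop makes the cost explicit: one likelihood evaluation per iteration, so thousands of evaluations in total. Because the data here are simulated from known values, the same script doubles as a small parameter-recovery check.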

Not to be a Negative Nancy, but I’m not sure it’s possible to make too many recommendations in this case because we don’t really know what the purpose of the model comparison is.

The correspondent says they want to know which of two models “best accounts for the data”, but a literal interpretation of that question can be decided by, as Andrew says, least squares or maximum likelihood. Just pick the best one.

When the correspondent mentioned using AIC/BIC, Andrew jumped to the assumption that “You always care about out-of-sample predictive accuracy”, but I’m not sure that’s true in the sense that CV measures. CV measures the ability of a model to make accurate predictions in the same context as its original application, e.g., can you predict future trials of an experiment given what you know about how that participant performed earlier? This type of generalization is valuable when you expect to do this sort of thing, but most experiments in psychology and neuroscience are not set up like this (psychophysics and psychometrics are exceptions).

The type of generalization that matters in most of these cases is whether the model helps explain the results of other experiments and/or makes a clear recommendation about what future experiments would be valuable to run. And I don’t think any out-of-the-box statistical methods can quantify that type of generalization. Dani Navarro talked about this issue in her “Devil and the Deep Blue Sea” paper last year (https://link.springer.com/content/pdf/10.1007/s42113-018-0019-z.pdf).

Anyway, it sounds to me like your correspondent cares about which model provides the best *explanation* for their data, which is some balance between quality of fit and conciseness, and they wanted to use AIC/BIC to measure this. Andrew recommended CV instead, because it is a better measure of a certain type of generalization. But I don’t think any of those methods really address the question the correspondent wants to answer.

That said, simulation can still help a whole lot because it lets you explore the complete outcome space of each model. Similar to Myung & Pitt’s “parameter space partitioning”, simulation lets you see what data patterns the model predicts, which can clue you into whether it generalizes in a useful way. It also helps you understand how predictions are causally linked to different model mechanisms.