I love this stuff:

This article presents a simulation-based method designed to establish the computational correctness of software developed to fit a specific Bayesian model, capitalizing on properties of Bayesian posterior distributions. We illustrate the validation technique with two examples. The validation method is shown to find errors in software when they exist and, moreover, the validation output can be informative about the nature and location of such errors. We also compare our method with that of an earlier approach.

I hope we can put it into Stan.

I think I’ll love it, too. Is this also called “parametric bootstrap”?

Abraham:

No. The bootstrap (parametric and otherwise) is a method for estimating the bias and standard deviation of a statistical estimate. The Cook et al. paper is a method for checking a simulation-based inference procedure.

Ok, thanks for this. Now, to show you that I'm reading the paper (and to further exhibit my ignorance of statistical terminology): when you have an Inv-Chi^2 distribution with two parameters, is that what Wikipedia calls a scaled inverse chi-squared distribution (http://en.wikipedia.org/wiki/Scaled-inverse-chi-squared_distribution)?

Correct me if I am wrong, but these methods appear to be a special class of unit testing procedures that MCMC software developers might use, rather than per-analysis checks that software users would attempt. Once you confirm your sampling algorithm is working, I don’t see the point in repeating it for every model that you run.

I think you're right, Chris: people familiar with unit testing should interpret this as a kind of unit test for Bayesian model fitting. I think it still has a point for users, though. The division between software user and software developer is not that clear. Although I am a user of PyMC, I am a developer of specific models, whose implementation I could mess up (and recently have).

Now that I've thought about the approach from this paper some more, though, my question is this: do I gain enough from this quantile -> inverse normal CDF -> p-value -> z-score transformation to make it worthwhile? Validation is important, but I could just compare the means of the estimates to the known values and look at the median absolute relative error. If I'm trying to convince skeptics that my model is correct, I don't have much time to make my case, so simplicity is valuable.

Are there often errors in software implementations that produce the correct mean but an incorrect distribution? That would justify the additional complexity of this approach.
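Here's one answer in code. This is my own toy example, not from the paper: a conjugate normal model (theta ~ N(0,1), y | theta ~ N(theta,1), exact posterior N(y/2, 1/2)), where I fake a "buggy" sampler whose posterior draws have the right mean but half the right spread. A mean-vs-truth comparison can't see the bug; the quantile-to-z-score check from the paper can, because the z-scores stop looking standard normal. The helper name and the specific model are mine, just a sketch.

```python
import numpy as np
from statistics import NormalDist

def validation_zscores(post_sd_scale, n_reps=2000, n_draws=200, seed=1):
    """Cook-et-al.-style check on a toy conjugate model.

    post_sd_scale = 1.0 simulates correct software; < 1.0 simulates a
    bug that under-disperses the posterior but keeps its mean right.
    """
    rng = np.random.default_rng(seed)
    inv_cdf = NormalDist().inv_cdf
    z = np.empty(n_reps)
    for i in range(n_reps):
        theta = rng.normal(0.0, 1.0)               # draw theta from the prior
        y = rng.normal(theta, 1.0)                 # simulate data given theta
        # "Posterior" draws; exact posterior is N(y/2, sqrt(1/2))
        post = rng.normal(y / 2, post_sd_scale * np.sqrt(0.5), n_draws)
        count = np.sum(post < theta)               # empirical quantile of the truth
        # Uniform quantile -> z-score (continuity correction avoids +/- inf)
        z[i] = inv_cdf((count + 0.5) / (n_draws + 1))
    return z

z_ok = validation_zscores(1.0)    # correct software: z approx. N(0, 1)
z_bad = validation_zscores(0.5)   # buggy: mean of z is still ~0, variance is not ~1
print(np.mean(z_ok), np.var(z_ok))
print(np.mean(z_bad), np.var(z_bad))
```

Note that both runs give z-scores with mean near zero, so comparing posterior means to the known truth would pass the buggy sampler; only the distributional check (variance of z far from 1) flags it.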

I used this when preparing JAGS version 3.0.0, and found bugs in the Pareto and multivariate T distributions.

Andrew:

One of my favourite papers, but has anyone looked at ABC (approximate Bayesian computation) methods and techniques?

(i.e. making it approximate and easier rather than exact.)

Just wanted to say good on you for keeping working on Stan. I just read a stack of papers with extremely clever MCMC ideas but no attempt at a public implementation.