David Hogg writes:

My (now deceased) collaborator and guru in all things inference, Sam Roweis, used to emphasize to me that we should evaluate models in the data space — not the parameter space — because models are always effectively “effective” and not really, fundamentally true. Or, in other words, models should be compared in the space of their predictions, not in the space of their parameters (the parameters didn’t really “exist” at all for Sam). In that spirit, when we estimate the effectiveness of an MCMC method or tuning — by autocorrelation time or ESJD or anything else — shouldn’t we be looking at the changes in the model predictions over time, rather than the changes in the parameters over time? That is, the autocorrelation time should be the autocorrelation time in what the model (at the walker position) predicts for the data, and the ESJD should be the expected squared jump distance in what the model predicts for the data. This might resolve the concern I expressed to you a few months ago that the ESJD is not affine-invariant, etc. Thoughts?
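Hogg's proposal is easy to make concrete: instead of computing the expected squared jump distance on the parameter draws themselves, compute it on what each draw predicts for the data. Here is a toy sketch with an entirely made-up chain and a hypothetical linear prediction function; the point is only the mechanics of the two ESJD computations, not any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: design points x and a prediction function
# mapping parameters theta = (a, b) to predicted data a + b*x.
x = np.linspace(0.0, 1.0, 20)

def predict(theta):
    a, b = theta
    return a + b * x

# Fake "MCMC chain" for illustration: a random walk in (a, b).
chain = np.cumsum(rng.normal(0.0, 0.1, size=(500, 2)), axis=0) + [1.0, 2.0]

# ESJD in parameter space: mean squared jump of theta between draws.
esjd_param = np.mean(np.sum(np.diff(chain, axis=0) ** 2, axis=1))

# ESJD in prediction (data) space: mean squared jump of predict(theta).
preds = np.array([predict(t) for t in chain])
esjd_pred = np.mean(np.sum(np.diff(preds, axis=0) ** 2, axis=1))

print(esjd_param, esjd_pred)
```

The prediction-space version inherits whatever invariances the prediction function has: any reparameterization that leaves `predict` unchanged leaves `esjd_pred` unchanged, which is exactly the affine-invariance point Hogg raises.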

Hogg continues with an example:

Imagine you have a three-planet model for some radial velocity data. In the naivest implementation, you have a 3! = 6-fold exact degeneracy from swapping planet labels, but the modes are very well separated in parameter space: your autocorrelation time in the parameters is essentially infinite (because, realistically, you will never switch from one permutation of the planets to another), but in the predictions the autocorrelation time is finite and fine.
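The label-switching point can be seen in a few lines. In this toy radial-velocity model (a made-up sum of sinusoids, with phases omitted for brevity — not Hogg's actual model), swapping two planets' labels is a large jump in parameter space but a zero-length jump in prediction space.

```python
import numpy as np

t = np.linspace(0.0, 100.0, 50)  # observation times

# Toy RV model: one sinusoid per planet, each with (amplitude K, period P).
def rv(planets):
    return sum(K * np.sin(2 * np.pi * t / P) for K, P in planets)

planets = [(10.0, 3.0), (5.0, 17.0), (2.0, 60.0)]
swapped = [planets[1], planets[0], planets[2]]  # relabel planets 1 and 2

# Large jump in parameter space...
theta, theta_sw = np.ravel(planets), np.ravel(swapped)
param_jump = np.sum((theta - theta_sw) ** 2)

# ...but zero jump in prediction space: the predicted RV curve is
# identical, since the sum over planets doesn't care about their order.
pred_jump = np.sum((rv(planets) - rv(swapped)) ** 2)

print(param_jump, pred_jump)  # param_jump > 0, pred_jump == 0
```

A sampler that hops between the 3! permutation modes looks terrible by parameter-space autocorrelation but perfectly healthy by prediction-space autocorrelation, which is the behavior we actually care about here.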

My reply:

It depends on the context. Sometimes we have a redundant parameterization in which the individual parameters are not identified, but predictions are well-identified. For a simple example, suppose you have a model, y ~ N(a+b, 1), with a uniform prior distribution on (a,b). Then your data don’t tell you anything about a or b, but you can get good inference for a+b and good predictions for new data from the same model. On the other hand, if you want to make a prediction for new data z ~ N(a,1), you’re out of luck.
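To see the redundancy concretely: any two points on the ridge a + b = constant have exactly the same likelihood, hence the same predictive distribution for new y, while implying completely different predictions for z ~ N(a, 1). A minimal numerical check (simulated data, arbitrary example values):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(3.0, 1.0, size=100)  # data generated with a + b = 3

# Log-likelihood of y ~ N(a+b, 1), up to an additive constant.
def loglik(a, b):
    return -0.5 * np.sum((y - (a + b)) ** 2)

# Two parameter settings on the ridge a + b = 3: identical likelihood,
# identical predictive distribution for new y ~ N(a+b, 1)...
print(loglik(1.0, 2.0), loglik(-5.0, 8.0))  # the same number, twice

# ...but wildly different predictions for z ~ N(a, 1): the first setting
# predicts z near 1, the second near -5, and the data cannot choose
# between them. Inference about a alone is hopeless here.
```

This is why "evaluate in prediction space" needs a qualifier: it works for predictions that are functions of the identified combination a + b, and fails for predictions that depend on a alone.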

More generally, one problem I have with the hard-line predictivist stance—the idea that models and parameters are mere fictions whereas predictions are real—is that models and parameters can be thought of as bridges between the data of yesterday and the data of tomorrow. Consider the speed of light. It’s not just part of a prediction for some particular measurement. It’s also a universal constant. For a more humble example, consider our discussion of physiologically-based pharmacokinetics models in Section 4.3 of my article with Bois and Jiang. In a Bayesian model, good parameterization can be important, as it is typically through the parameters that we put in prior information. In many ways, the parameterization represents a key source of prior information.

Aren’t we dipping into murky waters when we start talking about prediction? It is well known that good models for estimation might not correspond to good models for prediction; the latter usually benefit from some overfitting, hence the widespread use of black-box algorithms.

It seems like the criterion should be to evaluate what you’re interested in learning about, which for a statistician is usually estimation.

Arik: That makes sense to me: “what you’re interested in learning about.”

I was also advised by one of my mentors to plot in the data space (fitted values) rather than in the model space (trying to interpret parameters), but if you are trying to learn about less-wrong models, I think you need to be looking in the parameter space.

As John Nelder pointed out, there is always a model that will accommodate all the data equally well (there are no outliers…), but conveniently choosing that model makes little sense to me!

I have yet to be convinced of the need to look at the parameter trees in the inference forest (http://statmodeling.stat.columbia.edu/2011/05/missed_friday_t/), but maybe I need to spend more time practicing Basbøll.

David Brooks was harping on this in his most recent op-ed: the problems with simulations. Though I am not entirely sure he understood the complexity in its entirety.