David Rossell writes:

A friend pointed out that you were having an interesting philosophical discussion on my paper with Val Johnson [on Bayesian model selection in high-dimensional settings].

I agree with the view that in almost all practical situations the true model is not in the set under consideration. Still, asking a model choice procedure to be able to pick up the correct model when it is in the set under consideration seems a minimal requirement (though perhaps not sufficient). In other words, if a procedure is unable to pick the data-generating model even when it is one of the models under consideration, I don’t have high hopes for it working well in more realistic scenarios either.

Most results in the history of statistics seem to have been obtained under an assumed model (e.g., why even do MLE or penalized likelihood if we don't trust the model?). While unrealistic, these results were useful for understanding important basic principles. In our case Val and I are defending the principle of model separation, i.e. specifying priors that guarantee that the models under consideration do not overlap probabilistically with each other. We believe that these priors are more intuitively appealing for testing than the typical Normal-Cauchy prior. For instance, under the null mu=0. Under the alternative the usual priors place the mode at mu=0 (or nearby), but mu=0 is not even a possible value under the alternative. If you ask a clinician who wants to run a clinical trial about his prior beliefs about mu, they're certainly not peaked around 0, else he wouldn't consider doing the trial.
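For concreteness, here is a minimal numeric sketch of the contrast being drawn, using the first-order moment (non-local) prior as one example of a separated prior; the scale tau=1 is an arbitrary illustration, not a value from the paper:

```python
import math

def normal_pdf(x, sd):
    return math.exp(-x * x / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

def moment_prior_pdf(x, tau):
    # First-order moment prior: pi(mu) = (mu^2 / tau^2) * N(mu; 0, tau^2).
    # It vanishes at the null value, so the models do not overlap there.
    return (x * x / (tau * tau)) * normal_pdf(x, tau)

# A local N(0, 1) prior puts its mode right at the null value mu = 0 ...
print(normal_pdf(0.0, 1.0))        # about 0.399
# ... while the non-local prior gives mu = 0 zero density under the alternative
print(moment_prior_pdf(0.0, 1.0))  # exactly 0.0
```

This makes the clinician point concrete: the non-local prior's mass is pushed away from mu=0, peaking at values the trial is actually designed to detect.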

A more pragmatic note on whether posterior model probabilities can be useful at all, inspired by your nice discussion. To me the posterior probability of a model is a proxy for the probability that it is a better approximation to the underlying truth than any other model in the set under consideration. While this is a purely informal statement, the expected log-BF always favors the model closest (in Kullback-Leibler divergence) to the true model, even when the true model is not in the set under consideration. It would be interesting to pursue this more formally…
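A minimal Monte Carlo sketch of this informal claim, with made-up models: the truth is N(0.3, 1) and neither candidate model is correct, yet the average log-BF favors the KL-closest one. The candidates are treated as simple point hypotheses (no priors on mu), an assumption made purely to keep the sketch short:

```python
import math
import random

random.seed(0)
# Truth: N(0.3, 1). Two wrong candidates: M1 = N(0, 1), M2 = N(1, 1).
# KL(truth || M1) = 0.3^2/2 = 0.045, KL(truth || M2) = 0.7^2/2 = 0.245,
# so M1 is KL-closer, and E[log BF(M1 vs M2)] = 0.245 - 0.045 = 0.2 > 0.
def log_lik(x, mu):
    return -0.5 * (x - mu) ** 2  # additive constants cancel in the BF

n = 100_000
draws = [random.gauss(0.3, 1.0) for _ in range(n)]
avg_log_bf = sum(log_lik(x, 0.0) - log_lik(x, 1.0) for x in draws) / n
print(avg_log_bf)  # close to 0.2: the BF favors the KL-closest model
```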

There is something exceedingly simple I don’t understand (feel free to ignore), but aren’t the “BFs” calculated to tell you not how to decide among models but rather how to mix models? That is, Bayesian calculations tell you the relative weights or probabilities or plausibilities of models. Any decision or selection requires some additional utility (which will certainly be subjective)? If you have three models A, B, and C, with prior probabilities p_A, p_B, and p_C and marginal likelihoods p(D|A), p(D|B), and p(D|C), don’t you just use those factors to build a “best possible model given the inputs”, which will be the linear combination of the three models weighted by factors proportional to p_A p(D|A) and so on? That mixture model will be better, according to the Bayesian, than any of the models A, B, or C individually, won’t it? And I even mean better in the sense of better for encoding the data for transmission over a channel. (In fact, this gets back to my objection to MacKay’s claim that BFs contain “Ockham’s razor”: they don’t, because they produce for you mixture models that are in fact more complex–in the traditional sense of complex–than any of the models individually.)
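The weighting described above can be sketched in a few lines; the prior probabilities and marginal likelihoods here are made-up numbers, purely for illustration:

```python
# Hypothetical prior model probabilities and marginal likelihoods p(D|M)
# for the three models A, B, C discussed above.
priors  = {"A": 0.5, "B": 0.3, "C": 0.2}
marglik = {"A": 1e-4, "B": 5e-4, "C": 1e-5}

# Posterior model probability: proportional to prior times marginal likelihood
unnorm  = {m: priors[m] * marglik[m] for m in priors}
z       = sum(unnorm.values())
weights = {m: v / z for m, v in unnorm.items()}

# The Bayesian mixture ("model averaging") predictive is then
#   p(y | D) = sum over m of weights[m] * p(y | D, m)
print(weights)
```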

Anyway, maybe I misunderstand Bayes, but I thought it leads to mixtures of all models you have ever considered, or at least until some BFs become negligibly small (in a relative sense). And if I am right, then any use of BFs for model choice steps outside of probabilistic reasoning, and therefore could never be strictly described as “Bayesian”.

David:

I think that, to the proponents of Bayes factors, model choice is simply an approximation to model averaging. But I don’t like Bayes factors for model choice or model averaging, except in some special cases where I have some comfort in the joint prior distribution of the super-model. For general use of Bayes factors, I’m bothered by the well-known problem that the Bayes factor for a model can be strongly influenced by aspects of the prior distribution that are essentially nonidentified and that have essentially no impact on the posterior distribution for the parameters given the model.

I sure agree with that!

You’re right, if the purpose is prediction, and the computational cost of predicting using all models is not an issue. If the purpose is “choosing” amongst models embodying different scientific theories, for reasons of intellectual curiosity, then “mixing” doesn’t make sense. Regarding Ockham’s razor, the more strictly correct claim would be that the mixture will have good predictive performance when both simple and complex models are considered, because the simple model will have high weight when appropriate, and low weight when not appropriate, etc. This contrasts with the (incorrect) fear that the complex model would always look good, even though it overfits (and hence doesn’t actually work well).

The problem is that this all has little justification when the true model is not amongst those considered, or when the priors on parameters within each model are not well-justified. Getting the model closest to the truth in KL divergence doesn’t guarantee that it’s the best of the bunch with regard to some other loss function (the best in a set of wrong models may be different for different loss functions, although the truth is always best for any loss function).

“the truth is always best for any loss function”

Nitpick: No it’s not. “It is difficult to get a man to understand something, when his salary depends upon his not understanding it!” — Upton Sinclair.

I’d say that the definition of the set of reasonable loss function is exactly those for which the truth is best.

Oh boy, that last sentence is a mess. Grammar, I isn’t the goodest at it.

Isn’t that the definition of a “proper” loss function?

Whatever the name, I agree it’s a sensible criterion to consider, but may not always be appropriate. There are legitimate reasons for not attempting to give an answer that’s as close to the truth as possible; e.g. giving up a bit of veracity in order to enhance stability… also known as a bias-variance tradeoff.

George:

No, that is not a good characterization of the bias-variance tradeoff. You are identifying low bias as “veracity” but that is not correct at all. If you have an estimate with low bias and high variance, it is not likely to be “close to the truth.” Reducing mean square error does not give up veracity at all!

Call it a veracity-stability tradeoff if you prefer. It’s the same tradeoff you mentioned the lasso making.

George:

No, you misunderstand me. I agree that bias-variance tradeoff is important, and indeed I was discussing that with regard to lasso. What I’m disagreeing with is your claim that there’s a veracity-anything tradeoff. Veracity is always the goal.

Using the lasso we end up with estimates that zero out lots of coefficients… even when we believe the truth does not contain any zeroes. Now, a more ‘truthful’ point estimate would contain only non-zero coefficients, but we don’t use it, even when it’s available, because we value the extra stability obtained when we zero things out.

Hence one can view the procedure as a tradeoff, between qualitatively how well the estimates reflect the truth and how variable they are.
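A small sketch of that point, using the closed-form lasso solution for an orthonormal design (soft thresholding); the coefficient values are hypothetical:

```python
# Lasso coefficient update under an orthonormal design: soft thresholding.
def soft_threshold(z, lam):
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Hypothetical noisy least-squares estimates; suppose every true
# coefficient is non-zero, as in the comment above.
raw = [2.5, 0.4, -0.3, 1.1, -0.05]
shrunk = [soft_threshold(z, 0.5) for z in raw]
print(shrunk)  # the small coefficients are set exactly to zero
```

The zeroed coefficients are knowingly "wrong" if the truth contains no exact zeros, but the estimate as a whole is more stable; that is the tradeoff being described.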

@george I think you’re still not understanding some things. Often a loss function like MSE is a more direct assessment of “veracity” than so-called “bias”. Thus, biased estimators are often preferred _for the sake of_ “veracity”.

I think you’re missing the point being made in relation to the lasso and the bias-variance tradeoff. First of all, the lasso is not the only way to introduce bias. My guess is that from Andrew’s point of view, the nice thing about the adoption of the lasso isn’t that he wants to use it himself, but that it has more people acknowledging what you seem to be missing here – that unbiasedness is not “truthfulness” or “veracity”.

@Anonymous

Thanks, but I do understand that unbiasedness in a point estimate is not “veracity”, and that the lasso is very far from being the only show in town. My point was that there can be situations (i.e. loss functions) where reporting the truth – or trying to do so – is not the best option. Deciding what to do in those situations comes down to a tradeoff, where some measure of how far one’s decision is from the truth is balanced against some competing criterion.

If “veracity” doesn’t appeal as a name for distance-from-truth, perhaps …”truthiness”?

I suspect you’re thinking of [proper scoring rules](http://en.wikipedia.org/wiki/Scoring_rule#Proper_scoring_rules). Not quite the same thing.

Nurse, get me some markdown support, stat!

Wald (1939) bakes in a “loss is zero when you’re at the truth” stipulation – with a charming “of course”!

Minimizing f(X) is the same as minimizing f(X) + C for any constant C, so from a mathematical standpoint you might as well have zero loss at the truth.

I found @Radford’s comment clarifying about Ockam, although I still don’t agree in certain ways.

Imagine the scenario I proposed (models A, B, C). Now introduce a model E that is the BF-supported best mixture of A, B, C. Model E will beat all three models A, B, C in BF by assumption/construction. Therefore–if an investigator is crazy enough to choose models on the basis of BF (which as I said I think is absurd and goes outside Bayes)–it is model E that will be chosen. Model E is not an illegitimate choice; it could only be disfavored on prior grounds; that once again goes against using BFs for selection.

I think there are some problems with this example if I’m understanding it correctly:

1) If E is already weighted, it’s not a fair comparison, because the weights were determined using the data. The standard “complex model”/”simple model” competition would be a ‘super model’ with some prior weights on A, B, C, and E, where E is a model with mixing parameters on A, B, and C and some diffuse hyperprior on those mixing parameters. In this case, if A was the generating process, A will be weighted more highly because the likelihood of the observations is higher under A than under E.

2) Setting “using the data twice” questions aside, suppose by magic we picked a model which mixed A, B, and C with weights corresponding to the BF weights. The complexity of E isn’t as high as that of a general mixture model, since the weights would be fixed, not free. Also (assuming there are enough samples), this fixed weighting would be dominated by A, so it would look like “almost A, with a small bit of residual variation due to the B/C components”, and the free parameters of B and C would not affect the behavior of the model substantially. So the model complexity of E would actually be quite close to the complexity of A.

Furthermore, if you compared A, B, C, and E using new data (generated from A), the weight of A would almost certainly be higher given a sufficiently large number of samples, since E didn’t assign A a weight of 1 only because the sample size was limited.
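A quick simulation sketch of this last point, with toy models (the model labels and parameter values are made up for illustration): as the sample size grows, the posterior weight on the data-generating model tends to 1.

```python
import math
import random

random.seed(1)
# Data generated from model A = N(0, 1); one wrong competitor B = N(1, 1),
# with equal prior odds. Watch the posterior weight on A grow with n.
def log_lik(xs, mu):
    return sum(-0.5 * (x - mu) ** 2 for x in xs)  # constants cancel

weight_a = {}
for n in (10, 100, 1000):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    log_bf = log_lik(xs, 0.0) - log_lik(xs, 1.0)
    # Posterior probability of A from the log Bayes factor (equal priors)
    weight_a[n] = 1.0 / (1.0 + math.exp(-log_bf))
print(weight_a)
```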

You are probably all aware of this, but I remember being quite impressed by the simplicity and usefulness of MacKay’s (and others’?) proof that BFs cannot systematically disfavour the true model. It’s on page 441 of MacKay, David J. C. (1992). “Bayesian interpolation.” Neural Computation 4(3): 415–447, http://authors.library.caltech.edu/13792/1/MACnc92a.pdf This (and work that cites it!) might be a good starting point for anyone looking to formalize Rossell’s informal statement about KL divergence and not-quite-true models.

Some further work related to Bayesian procedures choosing the model closest to the data-generating model.

Casella, G., Girón, F. J., Martínez, M. L. and Moreno, E. (2009). Consistency of Bayesian procedures for variable selection. Ann. Statist. 37, 1207–1228. MR2509072