One of my favorites, from 1995.

Don Rubin and I argue with Adrian Raftery. Here’s how we begin:

Raftery’s paper addresses two important problems in the statistical analysis of social science data: (1) choosing an appropriate model when so much data are available that standard P-values reject all parsimonious models; and (2) making estimates and predictions when there are not enough data available to fit the desired model using standard techniques.

For both problems, we agree with Raftery that classical frequentist methods fail and that Raftery’s suggested methods based on BIC can point in better directions. Nevertheless, we disagree with his solutions because, in principle, they are still directed off-target and only by serendipity manage to hit the target in special circumstances. Our primary criticisms of Raftery’s proposals are that (1) he promises the impossible: the selection of a model that is adequate for specific purposes without consideration of those purposes; and (2) he uses the same limited tool for model averaging as for model selection, thereby depriving himself of the benefits of the broad range of available Bayesian procedures.

Despite our criticisms, we applaud Raftery’s desire to improve practice by providing methods and computer programs for all to use and applying these methods to real problems. We believe that his paper makes a positive contribution to social science, by focusing on hard problems where standard methods can fail and exposing failures of standard methods.

We follow up with sections on:

– “Too much data, model selection, and the example of the 3x3x16 contingency table with 113,556 data points”

– “How can BIC select a model that does not fit the data over one that does?”

– “Not enough data, model averaging, and the example of regression with 15 explanatory variables and 47 data points.”

And here’s something we found on the web [link fixed] with Raftery’s original article, our discussion and other discussions, and Raftery’s reply. Enjoy.

**P.S.** Yes, I’ve blogged this one before, also here. But I found out that not everyone knows about this paper so I’m sharing it again here.

This paper has always been a point of reference for me, particularly for this insight: he “promises the impossible: the selection of a model that is adequate for specific purposes without consideration of those purposes.”

The link “And here’s something we found on the web” is broken.

Damn that linkrot!

You can view it here:

https://web.archive.org/web/20141009013313/http://irt.com.ne.kr/data/BMS_raftery.pdf

I’ve tried more than once to point out your criticisms of Raftery’s suggestions to grad students on whose committees I’ve served, but they (and maybe their supervisors?) don’t seem to pay much attention; they seem to regard BIC as some kind of magic ritual — much like p-values.

From the Rubin et al response: “It’s not “cheating” to use real-world knowledge if you’re actually interested in real-world answers.”

Ouch!

Shravan:

By “Rubin et al,” I think you mean “Gelman and Rubin.” I don’t remember who wrote that particular sentence but it was probably me.

Yes, sorry, I got the author order mixed up. Pity I can’t go back and correct my comment; please add preview and correction facilities on the blog for comments too!

Andrew, a question: What do you think about model averaging? I view it (at least as I’ve used it) as a version of hierarchical modeling with a discrete rather than continuous index. In my applications, the number we were interested in was a common parameter of all the models.

Bill:

We’re actually doing research on that right now! As we discuss in Chapter 7 of BDA3 (or Chapter 6 of the earlier editions), I prefer continuous model expansion to discrete model averaging, as straightforward discrete Bayesian model averaging (of the sort that you describe) fails with weak priors on the individual models except in some special cases in which all the models have the same size and are on the same scale. When continuous model expansion is not feasible, I think discrete model averaging can be OK, but I would do it by optimizing estimated fit to the predictive distribution rather than via a direct Bayesian analysis that applies a posterior probability to each model.

Hi Andrew, thanks. What Jim Berger and I were doing had to do with fitting periodic functions to data…the different models were models with differing numbers of terms in the Fourier expansion. We knew that the number of terms required would be small and would be the low-order ones, which is how we assigned priors on the individual models.

I suppose one might have a continuous model that continuously weighted down each term but I’m not sure how one would go about that. The common parameter of interest was the constant term in the Fourier expansion. Any guidance on how to approach a problem like this using your preferred approach?
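One way to read the “continuously weight down each term” idea (this is just my sketch of the kind of thing Gelman’s continuous-model-expansion suggestion could look like here, not his actual recommendation; the data, prior scales, and decay rule are all hypothetical) is to fit one big Fourier model in which each sine/cosine pair gets a zero-centered normal prior whose scale shrinks with frequency, replacing the discrete “how many terms?” index with continuous shrinkage. With a known noise sd this is just a ridge-type posterior mean, and the constant term of interest comes out directly:

```python
import numpy as np

rng = np.random.default_rng(0)

# simulated periodic data; the constant term (mean level) is the
# quantity of interest, as in the application described above
t = np.linspace(0, 1, 200, endpoint=False)
y = 2.0 + 0.8 * np.sin(2 * np.pi * t) + rng.normal(0, 0.3, t.size)

# design matrix: constant plus K sine/cosine pairs
K = 10
cols = [np.ones_like(t)]
for k in range(1, K + 1):
    cols.append(np.sin(2 * np.pi * k * t))
    cols.append(np.cos(2 * np.pi * k * t))
X = np.column_stack(cols)

# continuous shrinkage instead of a discrete number-of-terms index:
# zero-mean normal priors whose scale decays like 1/k with frequency,
# so high-order terms are smoothly weighted toward zero
prior_scale = np.concatenate(
    ([100.0],  # near-flat prior on the constant term
     np.repeat(1.0 / np.arange(1, K + 1), 2)))
penalty = np.diag(1.0 / prior_scale**2)

sigma = 0.3  # noise sd assumed known for this sketch
beta = np.linalg.solve(X.T @ X / sigma**2 + penalty,
                       X.T @ y / sigma**2)  # posterior mean given sigma
print(beta[0])  # estimate of the constant term, close to the true 2.0
```

The discrete question “how many Fourier terms?” never has to be answered: frequencies the data do not support are pulled toward zero by their tight priors, and inference for the constant term averages over that uncertainty automatically. In a fuller treatment one would also put a prior on the decay rate and the noise sd and fit the whole thing with, say, Stan.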