The comments to a recent entry on “what is a Bayesian” moved toward a discussion of parsimony in modeling (also noted here). I’d like to comment on something that Dan Navarro wrote. First I’ll repeat Dan’s comments, then give my reactions.
There’s a great quote by Peter Grunwald in his introductory chapter to “Advances in Minimum Description Length” (2005, p.16; MIT Press) that talks about parsimony.
It is often claimed that Occam’s razor is false — we often try to model real-world situations that are arbitrarily complex, so why should we favor simple models? In the words of Webb, “What good are simple models of a complex world?” The short answer is: even if the true data-generating machinery is very complex, it may be a good strategy to prefer simple models for small sample sizes. Thus, MDL (and the corresponding form of Occam’s razor) is a strategy for inferring models from data (“choose simple models at small sample sizes”), not a statement about how the world works (“simple models are more likely to be true”) — indeed, a strategy cannot be true or false; it is “clever” or “stupid.” And the strategy of preferring simpler models is clever even if the data-generating process is highly complex.
I think that all this comes down to the question of what we are trying to achieve with statistics. If the goal is only to describe data accurately, then parsimony is irrelevant. If the goal is to describe accurately and concisely, or predict future events in the presence of noise, then parsimony becomes our guard against over-fitting. The earliest formal result that I know of demonstrating this is Akaike (1973), but there have been several variants since then. From an information-theoretic point of view, people like Grunwald, Rissanen and Wallace have shown that parsimony is important in the compression of data, while folks like Dawid have talked a lot about predictions of future events (though I’m not as familiar with Dawid’s work as I should be, so I might be misinterpreting).
First off, regarding the first sentence of the Grunwald quote, I’d say that “Occam’s Razor” is more of a slogan than a hypothesis or conjecture, so I don’t think it’s meaningful to claim that it’s “false.” The real question is how useful it is as a guide to theory or practice.
To get to the substance of Dan’s (and Grunwald’s) claims: ideas like minimum description length, parsimony, and Akaike’s information criterion are particularly relevant when models are estimated using least squares, maximum likelihood, or some other similar optimization method.
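As a rough illustration of how a parsimony penalty interacts with least squares, here is a minimal sketch in Python. Everything in it is an assumption made for the example rather than something from the discussion above: the simulated straight-line data, the helper name `aic_for_polynomials`, and the use of the Gaussian form of AIC, n·log(RSS/n) + 2k, with additive constants dropped. The residual sum of squares can only go down as polynomial terms are added, so picking the model by fit alone always favors the most complex candidate. The penalty term is what pulls the choice back.

```python
import numpy as np

def aic_for_polynomials(x, y, max_degree):
    """Fit polynomials of degree 0..max_degree by least squares.

    Returns two arrays: the residual sum of squares and the Gaussian AIC
    (up to an additive constant) for each degree.
    """
    n = len(y)
    rss, aic = [], []
    for degree in range(max_degree + 1):
        coefs = np.polyfit(x, y, degree)
        resid = y - np.polyval(coefs, x)
        rss_d = float(np.sum(resid ** 2))
        k = degree + 2  # polynomial coefficients plus the noise variance
        rss.append(rss_d)
        aic.append(n * np.log(rss_d / n) + 2 * k)
    return np.array(rss), np.array(aic)

# Hypothetical data: a straight line plus noise, with only 20 observations.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)

rss, aic = aic_for_polynomials(x, y, max_degree=6)
best_by_rss = int(np.argmin(rss))  # raw fit keeps rewarding extra terms
best_by_aic = int(np.argmin(aic))  # the penalty can only lower the chosen degree

print("degree chosen by RSS alone:", best_by_rss)
print("degree chosen by AIC:      ", best_by_aic)
```

The degree chosen by AIC can never exceed the degree chosen by raw fit, since any more complex model with no better fit pays a strictly larger penalty. Whether AIC lands on the true degree in a particular simulated sample is not guaranteed.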
When using hierarchical models, we can avoid overfitting and get good descriptions without using parsimony: the idea is that the many parameters of the model are themselves modeled. See here for some discussion of Radford Neal’s ideas in favor of complex models, and see here for an example from my own applied research.
When using least squares, maximum likelihood, and so forth, parsimony can indeed be a guard against overfitting. With hierarchical modeling, overfitting becomes much less of a concern, allowing us to get the benefits of more-realistically complicated models without losing predictive power.
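To make the contrast concrete, here is a minimal sketch of partial pooling in Python. It is a simplified stand-in for a full hierarchical model, not the method used in any of the applications mentioned above: the within-group standard deviation `sigma` is treated as known, the between-group variance is estimated by a crude method of moments, and the function name `partial_pool` and the simulated data are hypothetical. Each group keeps its own parameter, but the estimates are pulled toward the overall mean by an amount that depends on how noisy each group's data are.

```python
import numpy as np

def partial_pool(group_means, group_sizes, sigma):
    """Shrink per-group means toward the grand mean.

    A method-of-moments sketch that assumes the within-group standard
    deviation `sigma` is known. Returns the shrunken estimates and the
    weight each group places on its own mean.
    """
    group_means = np.asarray(group_means, dtype=float)
    group_sizes = np.asarray(group_sizes, dtype=float)
    se2 = sigma ** 2 / group_sizes  # sampling variance of each group mean
    mu_hat = np.average(group_means, weights=1.0 / se2)
    # Between-group variance: spread of the group means minus the average
    # sampling noise, floored at zero.
    tau2_hat = max(np.var(group_means, ddof=1) - se2.mean(), 0.0)
    weight = tau2_hat / (tau2_hat + se2)  # in [0, 1]
    return mu_hat + weight * (group_means - mu_hat), weight

# Hypothetical setting: 30 groups, 3 noisy observations each.
rng = np.random.default_rng(1)
J, n, sigma = 30, 3, 2.0
theta = rng.normal(loc=0.0, scale=1.0, size=J)               # true group effects
y = rng.normal(loc=theta[:, None], scale=sigma, size=(J, n))

no_pool = y.mean(axis=1)  # separate least-squares estimate per group
pooled, weight = partial_pool(no_pool, np.full(J, n), sigma)

mse_no_pool = np.mean((no_pool - theta) ** 2)
mse_partial = np.mean((pooled - theta) ** 2)
print(f"no pooling MSE:      {mse_no_pool:.3f}")
print(f"partial pooling MSE: {mse_partial:.3f}")
```

Both approaches have the same number of group-level parameters. The difference is that the partially pooled estimates come from a model for those parameters, which is what keeps a model with many parameters from chasing noise. The two printed errors depend on the simulated draw, so no particular values are claimed here.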
Dan also writes about coding and data compression. Parsimony certainly seems important in these areas. Data compression hasn’t been an important issue in most of the applications I’ve worked on (mostly in social science and public health), so I haven’t thought much about that.