The comments to a recent entry on “what is a Bayesian” moved toward a discussion of parsimony in modeling (also noted here). I’d like to comment on something that Dan Navarro wrote. First I’ll repeat Dan’s comments, then give my reactions.

**Dan wrote:**

There’s a great quote by Peter Grunwald in his introductory chapter to “Advances in Minimum Description Length” (2005, p.16; MIT Press) that talks about parsimony.

It is often claimed that Occam’s razor is false — we often try to model real-world situations that are arbitrarily complex, so why should we favor simple models? In the words of Webb [1996], “What good are simple models of a complex world?” The short answer is: even if the true data-generating machinery is very complex, it may be a good strategy to prefer simple models for small sample sizes. Thus, MDL (and the corresponding form of Occam’s razor) is a strategy for inferring models from data (“choose simple models at small sample sizes”), not a statement about how the world works (“simple models are more likely to be true”) — indeed, a strategy cannot be true or false; it is “clever” or “stupid.” And the strategy of preferring simpler models is clever even if the data-generating process is highly complex.

I think that all this comes down to the question of what we are trying to achieve with statistics. If the goal is only to describe data accurately, then parsimony is irrelevant. If the goal is to describe accurately and concisely, or to predict future events in the presence of noise, then parsimony becomes our guard against over-fitting. The earliest formal result that I know of demonstrating this is Akaike (1973), but there have been several variants since then. From an information-theoretic point of view, people like Grunwald, Rissanen and Wallace have shown that parsimony is important in the compression of data, while folks like Dawid have talked a lot about prediction of future events (though I'm not as familiar with Dawid's work as I should be, so I might be misinterpreting).
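Akaike's point is easy to see in a toy simulation (my illustration, not from the discussion; all numbers are made up): fit polynomials of increasing degree by least squares and score each with AIC, which trades data fit against the number of parameters.

```python
import numpy as np

# Toy illustration: data from a cubic plus noise; fit polynomials of
# increasing degree by least squares and score each with
#   AIC = 2k - 2 * (maximized log-likelihood),
# assuming Gaussian noise. The 2k term is the parsimony penalty.
rng = np.random.default_rng(42)
n = 30
x = np.linspace(-1, 1, n)
y = x**3 - x + rng.normal(0, 0.2, n)

aics, sigma2s = {}, {}
for deg in (1, 3, 6, 9):
    coeffs = np.polyfit(x, y, deg)
    resid = y - np.polyval(coeffs, x)
    sigma2s[deg] = resid @ resid / n              # ML noise-variance estimate
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2s[deg]) + 1)
    k = deg + 2                                   # polynomial coeffs + variance
    aics[deg] = 2 * k - 2 * loglik

for deg in aics:
    print(f"degree {deg}: AIC = {aics[deg]:.1f}")
```

The fit to the training data (the noise-variance estimate) can only improve as the degree grows; AIC's penalty is what pushes back against the extra parameters.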

**My reactions:**

First off, regarding the first sentence of the Grunwald quote, I'd say that "Occam's Razor" is more of a slogan than a hypothesis or conjecture, so I don't think it's meaningful to claim that it's "false." The real question is: how useful is it as a guide to theory or practice?

To get to the substance of Dan's (and Grunwald's) claims: ideas like minimum description length, parsimony, and Akaike's information criterion are particularly relevant when models are estimated using least squares, maximum likelihood, or some other similar optimization method.

When using hierarchical models, we can avoid overfitting and get good descriptions without using parsimony–the idea is that the many parameters of the model are themselves modeled. See here for some discussion of Radford Neal’s ideas in favor of complex models, and see here for an example from my own applied research.

**In summary**

When using least squares, maximum likelihood, and so forth, parsimony can indeed be a guard against overfitting. With hierarchical modeling, overfitting becomes much less of a concern, allowing us to get the benefits of more-realistically complicated models without losing predictive power.

Dan also writes about coding and data compression. Parsimony certainly seems important in these areas. Data compression hasn’t been an important issue in most of the applications I’ve worked on (mostly in social science and public health), so I haven’t thought much about that.

P.S. I see that Dan has some related thoughts on parsimony in part 2 of this post. I’m still down on Ockham’s razor for reasons pointed out here and here.

Parsimony can be understood as a particular type of prior. Such priors have a convenient property: the posterior is approximately the same as the likelihood of a random sample from the model itself. This matters if you employ methods such as cross-validation or the bootstrap to evaluate the predictive accuracy of the model.

But it's a prior like any other. It is not commonly spoken of as a prior, because not everyone is into Bayesian statistics.

Quote from Peter Grunwald: "…even if the true data-generating machinery is very complex, it may be a good strategy to prefer simple models for small sample sizes."

This misconception is discussed in a paper of mine in the Valencia 6 proceedings. To a Bayesian, there has to be something wrong with this view, since your prior shouldn't change depending on the amount of data collected (though despite this, I've heard many Bayesians expressing this view). What I think is happening is that people are confusing prior beliefs about ACTUAL functional relationships with prior beliefs about the BEST PREDICTOR based on a small amount of data. It's true that the best predictor is likely to be simple when the data are limited. But this doesn't mean that your prior for the actual relationship should favour simple functions. The predictions from Bayesian inference are averages over the posterior, and it's perfectly possible for the average of a distribution over complex functions to be much simpler than a typical function drawn from that distribution.

Radford Neal

A short reply to Radford: if one interprets MDL or AIC as a prior, the resulting prior doesn't depend on the amount of data. Furthermore, MDL interpreted as a prior is proper, too. Finally, one can also average over the posterior to make predictions with this kind of prior.

In my experiments, the Bayesian approach of integrating over the posterior is quite consistently better than MAP. Even the recent AIC literature is proposing some sort of model averaging.

I find this odd:

"When using hierarchical models, we can avoid overfitting and get good descriptions without using parsimony…"

I have always interpreted hierarchical models as being ways of regaining parsimony from complex data sets – simplicity is gained at higher levels, even if it's implemented through increased complexity at lower levels. It's not clear to me if this is just a semantic difference, or if I'm missing something deeper.

Bob

Bob,

It is an issue. For example, consider a regression setting where you have 100 potential predictors and two alternative estimation strategies: (1) a parsimonious method that selects the 5 best predictors (in some sense), or (2) a hierarchical method that shrinks all 100 so as to reduce predictive error. Either of methods 1 or 2 could use prior information (e.g., info on which predictors should be important, or info that clusters predictors into meaningful categories, thus allowing efficient shrinkage).
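The contrast between the two strategies can be sketched in a few lines (my illustration, with all numbers invented; ridge regression stands in crudely for hierarchical shrinkage, since it is the posterior mode under a common normal prior on the coefficients):

```python
import numpy as np

# Illustration of the two strategies; all numbers are invented.
rng = np.random.default_rng(0)
n, p = 60, 100

# A "complex world": many small true effects, none exactly zero.
beta_true = rng.normal(0, 0.3, p)
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(size=n)
X_new = rng.normal(size=(200, p))
y_new = X_new @ beta_true + rng.normal(size=200)

# Strategy 1 (parsimonious): keep the 5 predictors most correlated with y,
# then fit ordinary least squares on just those.
keep = np.argsort(np.abs(X.T @ y))[-5:]
b1, *_ = np.linalg.lstsq(X[:, keep], y, rcond=None)
mse1 = np.mean((y_new - X_new[:, keep] @ b1) ** 2)

# Strategy 2 (shrinkage): ridge regression on all 100 predictors -- a crude
# stand-in for hierarchical shrinkage of the full coefficient vector.
lam = 10.0
b2 = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
mse2 = np.mean((y_new - X_new @ b2) ** 2)

print(f"5-predictor OLS    test MSE: {mse1:.2f}")
print(f"shrunken 100-pred. test MSE: {mse2:.2f}")
```

Note that method 2 keeps all 100 coefficients in the model, just pulled toward zero, while method 1 sets 95 of them to exactly zero.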

My inclination (and I suspect yours and Radford's) is to prefer approach (2)–although I freely admit that methods for (2) aren't always fully developed, and in practice, method (1) might work better, at least for now.

But method (1) is "parsimonious" in the sense that the finally-chosen model has relatively few parameters and can be described concisely. So I don't think it's just semantics. Hierarchical models really can allow us to be non-parsimonious while keeping predictive error down.

Hmm. I notice that I'm seriously outgunned here (I think I learned half my statistics from Andrew's and Radford's work). Still, I might learn something by following up on this. So, topic by topic…

**Parsimony:** I've always thought that parsimony is built into Bayesian statistics, not through any explicitly "parsimonious prior", but by the very fact that there *is* a prior (with broad support, let's assume). So if B is a submodel of A, and B is (for the sake of argument) the "true model", Bayesian inference will tend to prefer B over A, because the prior over the parameters of model A places some mass on parameter values that don't work very well. So when you find the marginal probability of the data given the model, the "true" submodel B wins out.

**Hierarchical models:** I really like the approach, having lately moved into Bayesian nonparametrics. But I guess I don't see that they constitute an argument against parsimony, in the sense that we still place a proper prior at every level of the model. At the lower levels, the posterior should still reflect the same kind of trade-off between data fit and model complexity that a Bayesian prior entails. So, say we're doing clustering. There are high-level parameters for cluster assignments, and then low-level parameters assigning likelihood to the data. If the "true" model has 6 clusters, then (because we have priors at the lower level) the posterior over cluster assignments (at the higher level) should end up reflecting this, not assigning high posterior likelihood to 73 clusters. That seems to capture a parsimony principle rather nicely. It's just been built into the hierarchical model itself. And, like Radford says, the predictive model can be very simple. I don't really see a conflict there.

**MDL and data-dependent priors:** I think I agree with Radford here. The "modern MDL" espoused by Grunwald and Rissanen has a "prior" (they hate the word, but it doesn't bother me much) that effectively depends on the choice of sample space, which has the rather annoying property of leading to violations of the likelihood principle. But isn't that also true of Bayesian reference priors? Doesn't the Bernardo-Jeffreys prior implicitly depend on the sample space, via the Fisher information? So I've always thought of this as a general criticism of non-informative objectivist methods, rather than an argument against parsimony per se.

Ok. So now that I've said all these silly things, I'd love to know where I've gone wrong (… or, more optimistically, if I've gone wrong).
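Dan's point that the marginal likelihood rewards the true submodel can be checked directly for a coin-flipping toy example (my own illustration; the counts are made up):

```python
from math import lgamma, log

# Toy check: model B fixes theta = 1/2 (a submodel of model A, which puts
# a uniform prior on theta). Suppose we observe 10 heads in 20 flips, so
# B is "true" in the for-the-sake-of-argument sense.
k, n = 10, 20

# Marginal likelihood under B: every sequence has probability (1/2)^n.
log_mB = n * log(0.5)

# Marginal likelihood under A: integral over [0, 1] of
# theta^k (1 - theta)^(n-k) dtheta, which is the Beta function B(k+1, n-k+1).
log_mA = lgamma(k + 1) + lgamma(n - k + 1) - lgamma(n + 2)

print(f"log P(data | B) = {log_mB:.2f}")  # -13.86: the submodel wins
print(f"log P(data | A) = {log_mA:.2f}")  # -15.17
```

Model A spreads its prior mass over parameter values that predict the data poorly, so its marginal likelihood loses to the submodel even though A contains the truth.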

Dan,

First off, Bayesian inference will asymptotically converge on the true model (or, if the true data generating process is not in the model, to the closest model in the K-L distance; see, e.g. Appendix B of our book). However, with any finite dataset all the other possibilities will still be in the posterior distribution, so no parsimony there. Also, realistically we can make our models more complicated and realistic as we get more data, so there's no single submodel to converge to.

On your second point, I've never done this sort of inferential cluster analysis myself. I have nothing against it, I just can't comment on it from personal experience. In the hierarchical models I've worked on, I'd just about always like to have more parameters. It's my own simplistic models that are holding me back.

A lot of this discussion depends on how you define "parsimony." Say I have a hierarchical model with zillions of parameters, which themselves are structured in a complex way, with some number of hyperparameters, which themselves have some model (for a simple example, see Section 6 of this paper). This is the kind of model I like. In some sense, this model is parsimonious, in that it can be described relatively compactly, but in another sense, it's not parsimonious at all–it has zillions of parameters!

I don't know if "parsimony" can be clearly defined, but in statistics it usually seems to be associated with reducing the number of parameters–for example, excluding certain predictors from a regression model. In that sense, the hierarchical models that I like are not parsimonious. But if you have a definition of parsimony that allows me to have hierarchical models, then, hey, I'm all for parsimony too!

I just don't want people telling me (or others) that simpler models should be preferred. Life is complicated. My ideal model is easy to understand but at the same time has lots of parameters, allowing everything to matter a little bit and with nothing set to zero.

Finally, regarding data-dependent prior distributions: these are a species of improper prior distributions, and they can sometimes be useful as starting points for an analysis. It's always possible to go back and change the prior distribution (or the data model, for that matter) and add relevant information as is appropriate. I wouldn't worry about the Jeffreys prior distribution. People used to be more obsessed with that sort of thing before there was full awareness of the power of hierarchical models.

Thanks Andrew. I think I learned something there.

"the closest model in the K-L distance"

I didn't know that. That's a really nice result. I suppose I should have read your appendix.

"in another sense, it's not parsimonious at all–it has zillions of parameters!"

Hmm. That's true, isn't it? I'd never really thought about parsimony in that sense. Had we been frequentists talking about finding maximum likelihood estimates for zillions of parameters, I'd have gotten very unnerved, and words like "overfitting" would have sprung to mind. But I guess we don't really have that problem as Bayesians, since we use priors to express our pre-existing beliefs and there's no fitting involved.

I suppose what I think of as "parsimony" relates to the strength of the conclusions that we draw from finite data. For frequentists, there are only so many parameters that can be reliably estimated (that's right, isn't it? If you throw in zillions of predictors without a prior, then there's a serious risk of overfitting, right?). For Bayesians, if our posterior distribution is diffuse, we just have to admit that there's not a lot we can say about which parameter values are most likely to be right. But like you say,

"with any finite dataset all the other possibilities will still be in the posterior distribution, so no parsimony there."

So maybe I'm using the wrong words. Maybe I'm not "pro-parsimony" after all. I'm only "anti-overfitting". For Bayesians, they're not really the same thing, are they?

By the way. You mention not worrying about Jeffreys' prior. I've always wondered about this for hierarchical models, in the sense that once I reach the top level I don't know what prior to use. Half the time I just use something that feels reasonably diffuse. Is there a good way to go about doing this? Does it even matter much in a practical sense?

Dan,

I wouldn't quite say that "we use priors to express our pre-existing beliefs." I'd rather say that the prior distribution is part of a model, all of which we take as an assumption for the purpose of performing inferences. (See here for our take on this as Popperian.)

Yes, using maximum likelihood, there's a real danger of overfitting. With Bayes, less so–at least to the extent that informative prior distributions are used. In a model with zillions of parameters, there might be thousands of hyperparameters, which themselves have to be modeled reasonably. In practice, Bayesian models too are subject to overfitting–but at a higher density of modeling, I think. This is the subject of some of our current research (see here, for example). It's a serious issue–actually, I think it's one of the five most important open problems in statistics.

This is one reason why I think the discrete model averaging ideas of Adrian Raftery, Chris Volinsky, and others can be useful–even though, in a deep sense, I don't think their models are the ultimate way to go.

Finally, regarding the prior distribution for the uppermost level of the hierarchy. I don't think there's always a great answer here. Weakly informative prior distributions can be useful. Also adding another hierarchical level. It can make a difference in practice if the number of groups is small, especially if there is a goal of prediction for new groups. I discuss some of these issues in this paper.

Dear Andrew (and Dan and Radford),

Today I found out that you have a very interesting discussion going on in which a recent quote of mine plays a central role. I think I should add a few points here (mainly to make sure that my quote doesn't get overinterpreted). Unfortunately, the third point (in which I explain my own views in detail) ended up very long and technical. Points (1) and (2) should be easier to read.

Best wishes,

Peter

(1) First, some technical remarks:

1(a) I agree with most remarks Andrew makes, but not with this one:

"First off, Bayesian inference will asymptotically converge on the true model (or, if the true data generating process is not in the model, to the closest model in the K-L distance; see, e.g. Appendix B of our book)…"

Unfortunately, neither the first nor the second statement is true in general. Many counterexamples to the first statement are given by Diaconis and Freedman. While these are rather artificial, I think they are important to analyze in this context, since they tell us something about parsimony – see remark 3 below.

As to counterexamples of the second statement: these are easier to give. See the paper 'Suboptimal behaviour of Bayes and MDL for classification under misspecification', Grunwald and Langford, COLT 2004, where we give a simple counterexample. The second statement *is* true if (a) the model under consideration is essentially convex (e.g. the set of all Gaussian mixtures with variance 1 and an arbitrary number of components); or (b) the model is 'simple' enough, in that it belongs to a finite VC class or is a finite-dimensional parametric family. But already in case (b) there may be anomalies, in that convergence can take extraordinarily long, etc. (In case (a) everything works just as in the well-specified case, even if the model is arbitrarily complex.)

Andrew's remark goes on:

"…However, with any finite dataset all the other possibilities will still be in the posterior distribution, so no parsimony there…"

– I think parsimony does have a role to play here; see remark 3 below.

(2) Regarding my quote:

"…even if the true data-generating machinery is very complex, it may be a good strategy to prefer simple models for small sample sizes."

I would like to make sure my own position on this doesn't get exaggerated. I agree with most comments by Andrew and Radford, whereas from this quote it might appear that I do not.

(a) First, Radford reacts to my quote:

'This misconception is discussed in a paper of mine in the Valencia 6 proceedings. To a Bayesian, there has to be something wrong with this view, since your prior shouldn't change depending on the amount of data collected …'

I'm not referring to a change of prior depending on the amount of data! I can see that the quote (which is shown out of its context) may suggest that I do, but if you read my tutorial you'll see that I'm certainly not suggesting changing the prior when the sample size changes. In fact, …

(b) I agree 100% with Radford, Andrew and others that there may be other strategies besides 'choosing a simple model' that work as well or sometimes even much better when used for predicting future data. Certainly, predicting using a Bayesian predictive distribution is usually preferable to singling out a single model. When writing the text on 'choosing a simple model is a clever strategy', my goal was to point out that there is no 'underlying belief that the world is simple' in MDL approaches. One can indeed formulate the whole theory in terms of trying to predict well against a log-scoring rule, and in that formulation any reference to Occam's razor etc. disappears (see Barron, Rissanen, Yu '98, Dawid at Valencia '92, or my tutorial). I have heard the criticism "MDL makes no sense since it is based on the false underlying assumption that the world is 'simple'" over and over again, and wanted to make clear once and for all that that criticism is not valid.

So my point, in that context, was certainly *not* to claim that other methods might not be even better (maybe I should have written 'somewhat clever' rather than 'clever' :-).

(3) However, when we consider, for example, predicting using a Bayesian posterior predictive distribution, then I do think data compression still has an important role to play.

To illustrate, let's consider the infamous Bayesian inconsistency examples by Diaconis and Freedman, where the posterior predictive distribution does not converge to the true distribution: they show that for some models+priors, even if data are generated by a distribution P which has non-zero prior mass in every (suitably defined) neighborhood of P, the Bayesian posterior converges to some P' quite different from P. Barron (1986, Ann. Stats.; Valencia '98) shows that consistency of the predictive distribution is very intimately related to whether or not the prior one uses allows one to *compress* the data (although I'm not a frequentist, I think consistency of any proposed method is quite important as a sanity check). Let me explain:

Let M be the Bayesian marginal distribution and x be the data. One can think of -log M(x) as a codelength, obtained when coding the data using the code that would be optimal to use if M were the data-generating distribution. Let us call the code with lengths -log M(x) the 'Bayesian code'. Barron shows, roughly speaking, that if data are distributed according to a distribution P, then the Bayes predictive distribution will converge to P in KL-divergence at rate

(coding redundancy of the Bayesian code M relative to P) / sample size.

If the redundancy (the extra number of bits that the Bayesian statistician needs compared to someone who knows the true data-generating distribution) grows linearly in n, then the Bayes predictive distribution simply does not converge. This is exactly what happens in Diaconis and Freedman's examples: they invariably use priors that do not compress the data in this sense. Barron (1998) effectively shows that with priors that make the Bayesian code into a universal code, *inconsistency of the predictive distribution cannot happen*.

('Universal code' is a central concept in information theory: a code C is universal with respect to a set of distributions M if the number of additional bits you need to encode the data using code C, compared to the code that is optimal for the data-generating distribution P, grows less than linearly in n for all P in M.)
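Barron's redundancy criterion can be illustrated for a toy Bernoulli model (my sketch, not from the comment): with a uniform prior, the Bayesian code's redundancy grows roughly like (1/2) log n, i.e., sublinearly, so the predictive distribution converges.

```python
from math import lgamma, log

def bits(nats):
    return nats / log(2)

# Codelength of a binary sequence with k ones under the Bayesian marginal
# M for a Bernoulli model with uniform prior: M(x) = B(k+1, n-k+1).
def bayes_code(k, n):
    return bits(-(lgamma(k + 1) + lgamma(n - k + 1) - lgamma(n + 2)))

# Ideal codelength for someone who knows the true theta: -log P(x).
def true_code(k, n, theta):
    return bits(-(k * log(theta) + (n - k) * log(1 - theta)))

theta = 0.7
for n in (100, 1000, 10000):
    k = int(theta * n)                # a typical sequence: 70% ones
    redundancy = bayes_code(k, n) - true_code(k, n, theta)
    print(f"n = {n:5d}: redundancy = {redundancy:.2f} bits")
```

Dividing these redundancies by n gives a convergence rate that goes to zero; a prior for which the redundancy grew linearly in n would give no such guarantee.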

Thus, in my view, *parsimony certainly has a role to play* when determining the efficacy of Bayesian procedures. One should make sure that the prior is chosen such that the Bayes code is universal, with small coding redundancies. This can be done by putting small codelength (large prior) on small ('simple') submodels, or – sometimes more effectively – by using quite complex, e.g. nonparametric, models, and using a prior that guarantees that the Bayesian marginal distribution has small coding redundancy. Examples of the latter are recent MDL codes for histogram density estimation, and Bayesian inference for Gaussian process models.

This (as well as many other considerations) provides strong evidence, to me, that in order to achieve good predictive accuracy of a statistical procedure, the most relevant thing may not be being Bayesian, but basing predictions/inference on 'universal codes'. It happens to be the case that Bayesian marginal distributions yield very good universal codes, but they are not the only ones! One can use Dawid's plug-in predictive codes, normalized maximum likelihood codes, or any type of mixture between these three forms of codes.

This 'basing inference on universal codes' is exactly what is advocated in modern types of MDL (see once again my tutorial, sections 2.5 and 2.7, and Barron, Rissanen, Yu 1998). So I'm ending up advocating MDL after all, but not in the simple form of 'picking a simple model minimizing two-part codelength of the data'. I agree once again 100% with Andrew that

"When using least squares, maximum likelihood, and so forth, parsimony [in the sense of 'picking a simple model'] can indeed be a guard against overfitting. With hierarchical modeling, overfitting becomes much less of a concern, allowing us to get the benefits of more-realistically complicated models without losing predictive power."

…it's just that I don't see any contradiction between MDL (in its modern 'universal model based' form) and the use of more-realistically complicated models – as long as the structure of models and priors is such that the Bayesian marginal distributions compress the data.

I would still call this 'a form of Occam's razor', since one can think of Bayesian marginal distributions that compress the data as short (one-part, not two-part) descriptions of the data, even if based on large models.

(4) Now, if anybody has followed me this far, they might suspect I think that 'MDL is always the way to go'. No: I certainly think that there are valid criticisms pertaining to both Bayesian and MDL approaches. These become relevant mainly if one works with severely misspecified models (which I think is often unavoidable, even if one works with very complicated models; this is certainly the case in the applications I work with, i.e. language models for (web) data – documents are certainly not hidden Markov models, neither are they mixtures thereof). In those cases, both MDL and Bayes can theoretically be far from optimal. Apparently, in those cases they sometimes also perform not that well in practice; see e.g. Bertrand Clarke (2002).

Hi Peter (and others)

Nice that you could join in. Just in case it wasn't obvious, I wasn't trying to misrepresent your position when quoting you. Obviously, I've read your MDL tutorials, but it's entirely possible that I've misunderstood things. I am, after all, the most statistically illiterate person in the room (not an experience I'm used to). Of course, that just means I've got more questions:

1. On the philosophical point about parsimony, I interpreted Andrew's position in this way. The Bayesian predictive distribution mixes across the posterior distribution, and so is not "parsimonious", in the sense that it is built from zillions (or whatever) of component models. I agree with that completely. But then, if viewed as a coding scheme for the data, this method would probably yield shorter codelengths than using a single (a priori fixed) parameter value and its corresponding distribution. So I guess the question is, switching over to a coding context, in what sense would we think of one-part codes as being "parsimonious" (as opposed to just "effective")? Is it just me, or does the word just not seem to apply anymore? It seems like we've gotten to the point where "Ockham's razor" is no longer an elementary desideratum, but something that often falls out of our more fundamental goals of prediction and compression?

2. As I understand it, there is a weak "data dependence" in the "prior" used in modern MDL. If we encode the data using the normalized maximum likelihood distribution, the normalization term sum_X f(X | theta_hat(X)) is dependent on the sample space, though not on the actual observed data. That's correct, isn't it? I've been playing with this recently (i.e., I've got a half-written tech note sitting on my hard drive), and it seems that the implied "prior" is sample-size dependent (not data-dependent). From a sequential coding perspective, that's a little odd, isn't it? Because you'd have to wait until the entire message was received before you could begin the decoding (since you wouldn't know what the coding scheme was until you knew the sample size). Is that right? Or am I missing something here?
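The NML normalizer can be computed exactly for the Bernoulli model (an illustrative sketch of my own, not from the post), making the sample-size dependence concrete:

```python
from math import comb, log2

# The NML normalizer for Bernoulli sequences of length n,
#   C_n = sum over all 2^n sequences x of f(x | theta_hat(x)),
# grouped by the count k of ones (theta_hat = k/n maximizes the likelihood,
# and there are comb(n, k) sequences with exactly k ones).
def nml_normalizer(n):
    total = 0.0
    for k in range(n + 1):
        th = k / n
        total += comb(n, k) * th**k * (1 - th) ** (n - k)
    return total

# The normalizer (and hence the implied "prior") depends only on n,
# the sample size defining the sample space -- not on the observed data.
for n in (10, 100, 1000):
    print(f"n = {n:4d}: log2 C_n = {log2(nml_normalizer(n)):.3f}")
```

log2 C_n is the parametric complexity of the model; it grows with n, so the code a decoder should assume differs at every sample size, which is exactly the sequential-coding awkwardness Dan describes.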

Cheers,

Dan.

An interesting discussion. Some random comments:

**On sample-size dependency**

First, Dan is right in that the NML distribution depends on the sample size. (To be more precise, it depends on the sample space, which is often defined as the set of all sequences of some length n. This need not be the case in general. It is just often desirable to let the regret increase in sample size, so that we incur smaller regret for short sequences than for long ones.) If this is a problem, as in the case of sequential coding, one can use another universal code, e.g., the Bayesian mixture code, which does correspond to a stochastic process, i.e., the codes for different sample sizes are compatible. Perhaps the use of NML has been emphasized too much in some work on "modern MDL" (not Peter's): there's really nothing wrong with using different universal models.

In terms of principle, it may seem odd or irrational, at least from a subjective Bayesian perspective, to let the code (and thus the distribution used) depend on the sample space. However, this is perfectly rational, in fact even optimal, if the objective is regret minimization within a fixed sample space.

**On the need for 'parsimony'**

Secondly, I agree that parsimony seems to be a by-product rather than a starting point or a desideratum if the objective can be formalized as clearly as "prediction" (or, even better, "regret minimization"). As Peter said above:

So basically he agrees too.

In fact, I think that the assumption-free results from "prediction with expert advice" quite nicely complement the MDL theory. Both in the expert setting and in MDL there is no assumption of a data-generating mechanism. The former is just more directly focused on prediction, while MDL is more about the somewhat elusive concept of modeling: it tries to answer the question "What does the data have to say?". Related to this, one interesting thing about MDL is its relation to the "universal sufficient statistics decomposition" (information vs. noise) and Kolmogorov's structure function. This relation provides a foundation for answering such questions. And to come back to the point, the structure function is intimately related to parsimony and short descriptions.

So maybe one could say that parsimony is of secondary interest when we are talking about prediction, but of primary interest when talking about MDL modeling. I'm not qualified to say much about Bayesian modeling, but Peter argues that parsimony is critical also in Bayesian modeling.

Dear Dan,

(I'll answer Andrew's points later this week).

Of course it was clear to me that you weren't trying to misrepresent me. Still it seemed a good idea to provide more background on the

context that my quote was made in.

(a) We are talking about model/prior (m/p) combinations such that, in your words, the Bayesian predictive distribution "may mix over zillions of distributions". I argued that such m/p combinations will lead to good predictions only if the corresponding Bayesian *marginal* distribution, when used as a coding scheme, leads to *short* codelengths of the data. I have to admit that the word 'parsimonious' doesn't seem to apply to such m/p combinations. Still, I think the phrase 'minimum description length' is appropriate…in fact this seems obvious – note the two words in italics above. I would still say that, if we use such m/p combinations, we are doing something in the spirit of Occam's razor. I tend to interpret Occam's razor as (1) 'look for a model with which we can somehow concisely describe the data' (e.g. using the Bayesian marginal likelihood code), not (2) 'look for a model such that the two-part description of model + data is small'. But since "Occam's razor" is not a precisely defined concept anyway, I guess this is in the end a matter of personal taste, and I can very well understand if people think that (1) should not be called 'Occam's Razor' any more.

(b) You worry that the prior implicit in modern versions of MDL is sample-size dependent. That is true, but the dependence is very mild; the prior quickly stabilizes with increasing n (to Jeffreys' prior or one of its relatives). Still, you are right: one can distinguish between 'on-line' (sequential) and 'off-line' versions of MDL, the latter only being applicable if the sample size is known in advance. It is not clear whether either of the two is 'inherently right'. There's lots of discussion of that in my forthcoming book – please bear with me: things would get far too technical to get into here.

Best,

Peter