Cosma Shalizi and Larry Wasserman discuss some papers from a conference on Ockham’s Razor. I don’t have anything new to add on this so let me link to past blog entries on the topic and repost the following from 2004:

A lot has been written in statistics about “parsimony” (that is, the desire to explain phenomena using fewer parameters), but I’ve never seen any good general justification for parsimony. (I don’t count “Occam’s Razor,” or “Ockham’s Razor,” or whatever, as a justification. You gotta do better than digging up a 700-year-old quote.)

Maybe it’s because I work in social science, but my feeling is: if you can approximate reality with just a few parameters, fine. If you can use more parameters to fold in more information, that’s even better.

In practice, I often use simple models—because they are less effort to fit and, especially, to understand. But I don’t kid myself that they’re better than more complicated efforts!

My favorite quote on this comes from Radford Neal’s book, Bayesian Learning for Neural Networks, pp. 103-104:

Sometimes a simple model will outperform a more complex model . . . Nevertheless, I believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex. Instead, if a simple model is found that outperforms some particular complex model, the appropriate response is to define a different complex model that captures whatever aspect of the problem led to the simple model performing well.

Exactly!

To put it another way, I don’t see parsimony, or Occamism, or whatever, as a freestanding principle. Simpler models are easier to understand, and that counts for a lot. I start with simple models and then work from there. I’m interested in the so-called network of models, the idea that we can and should routinely fit multiple models, not for the purpose of model choice or even model averaging, but so as to better understand how we are fitting the data. But I don’t think simpler models are better.

Part of my attitude might come from my social-science experience: we often hear people saying, “Your model is fine, but it should also include variables X, Y, and Z.” I never hear people complaining and saying that my model would be better if it did *not* include some factor or another.

In many practical settings there can be a problem when a model contains too many variables or too much complexity. But there I think the problem is typically that the estimation procedure is too simple. If you are using least squares, you have to control how many predictors you include. With regularization it’s less of an issue. So I think that, in some settings, Occam’s Razor is an alternative (and, to me, not the most desirable alternative) to using a more sophisticated estimation procedure.
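To make the point about regularization concrete, here is a minimal sketch. It assumes an orthonormal design, in which case the ridge estimate is simply the least-squares estimate shrunk by a factor of 1/(1 + lambda); the “true” coefficients and the noise offsets are made up for illustration.

```python
# Shrinkage vs. raw least squares, in the orthonormal-design special case.
# One real effect plus five null predictors; the noise values stand in for
# sampling error in the least-squares estimates. All numbers are invented.

true_beta = [2.0, 0.0, 0.0, 0.0, 0.0, 0.0]      # one real effect, five nulls
noise     = [0.3, -0.5, 0.4, -0.6, 0.5, -0.4]   # sampling error in the estimates

ols = [b + e for b, e in zip(true_beta, noise)]  # least squares keeps all the noise

def ridge(est, lam):
    """Ridge shrinkage under an orthonormal design: divide by (1 + lam)."""
    return [b / (1.0 + lam) for b in est]

def sse(est):
    """Total squared error of the estimates against the true coefficients."""
    return sum((b - t) ** 2 for b, t in zip(est, true_beta))

shrunk = ridge(ols, lam=1.0)
print(sse(shrunk) < sse(ols))  # -> True: shrinkage beats raw least squares here
```

With least squares alone, the only way to tame the five noisy null coefficients is to drop predictors; the regularized fit keeps all six terms and still estimates them better overall.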

The Occam applications I don’t like are the discrete versions such as advocated by Adrian Raftery and others, in which some version of Bayesian calculation is used to get results saying that the posterior probability is 60%, say, that a certain coefficient in a model is exactly zero. I’d rather keep the term in the model and just shrink it continuously toward zero. We discuss this sort of example further in chapter 6 of BDA. I recognize that the setting-the-coefficient-to-zero approach can be useful, especially compared to various least-squares-ish alternatives, but I still don’t really see this sort of parsimony as desirable or as some great principle; I see it more as a quick-and-dirty approximation that I’d like to move away from.
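The contrast between the discrete and continuous approaches can be sketched in a few lines. The normal-normal posterior mean below is standard; the z-cutoff rule and all the numbers are invented for illustration.

```python
# "Discrete" Occam sets a coefficient to exactly zero; continuous shrinkage
# pulls it toward zero without ever discarding it. Illustrative numbers only.

def shrink(estimate, se, prior_sd):
    """Posterior mean for a normal likelihood and a normal(0, prior_sd) prior."""
    w = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)  # weight placed on the data
    return w * estimate

def hard_zero(estimate, se, cutoff=2.0):
    """Discrete rule: keep the raw estimate only if |z| exceeds the cutoff."""
    return estimate if abs(estimate) / se > cutoff else 0.0

est, se = 1.5, 1.0                    # a coefficient estimated with some noise
print(hard_zero(est, se))             # -> 0.0 : the term vanishes entirely
print(shrink(est, se, prior_sd=1.0))  # -> 0.75: shrunk halfway, but still in the model
```

The discrete rule is all-or-nothing: the same estimate at z = 2.1 would be kept at full strength, while at z = 1.9 it disappears. The continuous version moves smoothly between those extremes.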

P.S. Neal Beck writes:

I think you cannot deal with this issue without specifying what you are doing.

1. Theoretical models (eg microeconomics): I would quote Milton Friedman’s Methodology of Positive Economics (1953): “Complete ‘realism’ is clearly unattainable, and the question whether a theory is realistic ‘enough’ can be settled only by seeing whether it yields predictions that are good enough for the purpose in hand or that are better than predictions from alternative.” The law of demand is pretty good for many purposes, but might fail in its prediction of the impact of lowering the price of Rolexes. Here simplicity has to do with understanding and prediction for the purpose at hand.

1a. I think this is related to issues where we are non-parametrically smoothing and the various bias-variance tradeoffs. I find that in practice we can often get a nice interpretable picture if we do not ask for perfect smoothness (lowest variance), but as we allow for less and less smoothness the picture becomes hard to understand. (But cross-validation, see below, may be better than esthetics here.)

2. Pure prediction. For some types of models, we typically find that simpler models yield better out-of-sample forecasts than more complex ones. I refer in particular to the choice of lag length in ARMA models (okay, not all that exciting) and to Lutkepohl’s work showing that the use of criteria like the BIC, which penalize complexity more strenuously, leads to better out-of-sample forecasts.

3. A focus on cross-validation often leads to the choice of simpler models (though of course the data could suggest a more complicated model is superior). The nice thing here is that we do not have to use an esthetic criterion to choose between models. I do not know why we do not see more cross-validation in our discipline.

4. As a Bayesian you could just put a heavier prior on the parameters being near zero as you add more parameters. This is what Radford Neal does for his Bayesian neural nets (lots of neurons, as the number of neurons gets large, the prior on them being zero gets stronger).
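Beck’s point 3 can be made concrete with a toy leave-one-out cross-validation. The data below are made up, and the two “models” being compared are deliberately simple: a constant (the mean) and a straight line, both fit by ordinary least squares.

```python
# Leave-one-out cross-validation as a non-aesthetic criterion for choosing
# between a simpler and a more complex model. Invented data, roughly y = x.

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 5.2, 5.9, 7.1]

def fit_mean(x, y):
    """Simplest model: predict the sample mean everywhere."""
    m = sum(y) / len(y)
    return lambda t: m

def fit_line(x, y):
    """One step up: ordinary least-squares straight line."""
    xb = sum(x) / len(x)
    yb = sum(y) / len(y)
    slope = sum((a - xb) * (b - yb) for a, b in zip(x, y)) \
        / sum((a - xb) ** 2 for a in x)
    return lambda t: yb + slope * (t - xb)

def loocv(fit, x, y):
    """Mean squared error of predicting each point from a fit to the rest."""
    err = 0.0
    for i in range(len(x)):
        xt, yt = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        err += (fit(xt, yt)(x[i]) - y[i]) ** 2
    return err / len(x)

print(loocv(fit_mean, xs, ys) > loocv(fit_line, xs, ys))  # -> True: the line wins
```

Here the data really are (roughly) linear, so cross-validation favors the more complex of the two models; with pure-noise data it would favor the mean. Either way, the choice comes out of held-out prediction error rather than taste.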
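Beck’s point 4 can be sketched analytically. In Neal’s setup the prior variance of each hidden-to-output weight is scaled as sigma^2 / H, so the prior variance of the summed output stays constant as the number of hidden units H grows; the specific numbers below are arbitrary.

```python
# As the number of hidden units H grows, give each output weight a prior
# with variance sigma^2 / H; the prior variance of the network output (a sum
# of H independent terms) then stays fixed instead of blowing up with H.

def output_prior_variance(H, sigma2=1.0, scaled=True):
    """Prior variance of a sum of H independent weights (unit activations)."""
    per_weight = sigma2 / H if scaled else sigma2
    return H * per_weight  # variances of independent terms add

for H in (1, 10, 1000):
    print(H, output_prior_variance(H), output_prior_variance(H, scaled=False))
```

With the scaling, the model can be made arbitrarily large (“lots of neurons”) without the prior predictions getting wilder, which is exactly the alternative to trimming the model down.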

In reply, let me just say that some of Neal’s examples are of the least-squares sort. As we discuss further in the comments below, if you’re doing least squares (for example, in fitting ARMA models), you need to penalize those big models, but this is not such a concern if you’re regularizing.

Minimum message length formalizes Occam’s Razor in Bayesian, information-theoretic terms.

Reminds me of some congressional testimony of about 4-5 decades ago, where a senator was complaining that the witness, no doubt an academic, kept saying, “On the one hand, …; on the other hand, …”, and wished he would give an unequivocal answer. The witness replied, “You give me a one-handed problem, I’ll give you a one-handed response.”

Any serious discussion of Occam’s Razor (a very common topic) quickly leads to basic questions of epistemology: how does one “know” anything at all?

Claims about what we know must be based on unproven assumptions about our fundamental reality/existence — and we can’t say what we know about reality/existence without making claims about ‘how’ we know it.

Occam’s Razor is merely a rule of thumb about what to believe in our reality (as little as possible, fundamentally). Classification problems in our ‘real world’ tend toward “simple” solutions; the Razor merely summarizes that observation, and is useful as such, apparently.

Ultimately, the only thing We/You/I ‘know’ with certainty — is that there is a non-zero probability of our consciousness existing as perceived. {Cogito Ergo Sum}

Actually, there is a technical justification of sorts for Occam’s razor. The expression for variance includes the factor (N – m) in the denominator, where N is the number of samples and m is the number of degrees of freedom. A smaller m gives smaller variance. So, all else being equal, one would prefer a smaller m. That’s exactly what Occam’s razor says, though in other (non-mathematical) terms.

With more variables, m gets larger, and when m approaches N, the variance becomes arbitrarily large. This corresponds to the situation of extreme overfitting, which we all (I hope) know to avoid. Again, this aligns with Occam’s razor (“don’t needlessly multiply hypotheses”).
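The blow-up described above is easy to see numerically. In the sketch below the residual sum of squares is held fixed at an arbitrary value just to isolate the effect of the (N – m) denominator; in a real fit, RSS would also shrink as m grows.

```python
# The residual-variance estimate divides by (N - m), so for a fixed residual
# error it grows without bound as the parameter count m approaches the
# sample size N. The RSS value of 10.0 here is arbitrary.

def s2(rss, N, m):
    """Unbiased residual variance estimate: RSS / (N - m)."""
    return rss / (N - m)

N = 20
print([round(s2(10.0, N, m), 2) for m in (1, 10, 19)])  # -> [0.53, 1.0, 10.0]
```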

So the maxim may arise from a certain philosophical attitude, but there is a statistical basis for it.

Tom:

Yes, but your argument is all based on the use of least squares or some other optimization algorithm (as discussed in my post above). For structured models with informative priors, you don’t need to divide by N – m. As discussed above, if you’re restricting yourself to least squares, you might need something like Occam’s razor. But I see that as a problem with least squares (or maximum likelihood). We can do better now.

Well, yes, the N – m in the variance can be seen as coming from a least squares calculation. Regularization is basically a method to deal with poorly conditioned matrices, some of which might arise in doing a least squares estimation. So the use of a regularization scheme would be orthogonal to the theorem about N – m. At least, that’s been my understanding of the use of the term; maybe you are using it more generally.

Other kinds of calculations might not come up with a formulation that looks the same, but fundamentally, for any set of data, there is only going to be so much information, and if you divvy it up among too many parameters, some of them at least are going to get undesirably noisy. This may get expressed in different ways for different (non-least-squares) approaches, so the formulation would differ between them.

I guess one justification for OR is that it might help avoid overfitting. A second one, which you allude to, is simplicity and ease of comprehension. A third is inductive: if one can prove the existence of one God, it is easier to prove the existence of many. But OR is more art than science. One should not be slavish about it.

What is the Bayesian estimation equivalent of OR, if any?

Keep your priors simple?

I suppose priors are what they are, but how often have you scaled back your priors for fear of criticism?

For a really neat description of Occam’s razor and the philosophical theory behind it check out what these Carnegie Mellon philosophers did:

http://www.learningepistemology.com/

I clicked on the link. They talk about “finding true theories.” That’s not what I do, when I’m doing applied statistics.

Nice.

And because the quest is not for truth but for usefulness, Occam’s Razor might help.

If two models give similar insights and one is simpler, stick with the simpler one. Given limited human cognition, simpler models are more useful.

The best argument for repealing Dodd-Frank and going back to Glass-Steagall is that the latter is a far more parsimonious law.

Anon:

I have no knowledge or opinion on financial regulation. But if we’re talking about statistical modeling, I refer you to Radford’s quote in the above post. I agree that simple models can be useful but I see no virtue in stopping there.

I agree there are no hard and fast rules. The reasons are entirely practical, and depend on the situation.

In some senses a complex model always does better, especially from a statistical point of view. But from a practical point of view, if it is to have policy consequences, simplicity is an important criterion.

In this blog you have written about problems in measuring and providing incentives for teacher performance, and how hard these things are to implement in policy and explain to voters.

Finance and other aspects of life are no different.

Statistics began its life as “political arithmetic” for a reason. Without a purpose, models and inferences are hard to evaluate.

Here is a quote from Luigi Zingales in FT:


The second reason why Glass-Steagall won me over was its simplicity. The Glass-Steagall Act was just 37 pages long. The so-called Volcker rule has been transformed into 298 pages of mumbo jumbo, which will require armies of lawyers to interpret. The simpler a rule is, the fewer provisions there are and the less it costs to enforce them. The simpler it is, the easier it is for voters to understand and voice their opinions accordingly. Finally, the simpler it is, the more difficult it is for someone with vested interests to get away with distorting some obscure facet.

http://www.ft.com/cms/s/0/cb3e52be-b08d-11e1-8b36-00144feabdc0.html#axzz1yuor4ZfS

When considering simplicity versus complexity in model-building, it’s important to note that we (meaning academics/society/etc.) use simple models all the time, even when we know they are less accurate. When considering how much a person would weigh if they stood on Mars, I am sure any physicist will use the Newtonian rule, not results from general relativity. When measuring the area of a right-triangle plot on Earth, even at the scale of a mile, I would be shocked if anyone bothered to use non-Euclidean geometry rather than “simple” geometry.

This is true even for applied statisticians, who presumably wish to speak about the future and/or related events, and not simply the dataset they are examining. In that case, the important question is not “how accurately can I describe my existing dataset” but rather “how ought we think about this general phenomenon?” This is especially true for counterfactual or purely explanatory models, such as the basic rational voting model. The basic rational voting model tells you that self-interested voting without social preferences is not an adequate explanation for voting behavior. It does not, and is not meant to, explain why people end up voting; so what reason could there be for additional complexity?

“I believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex. Instead, if a simple model is found that outperforms some particular complex model, the appropriate response is to define a different complex model that captures whatever aspect of the problem led to the simple model performing well.”

But nothing in Occam’s Razor says that you should stick with your better-performing simple model in that situation. It just says that you should prefer it unless and until you’ve developed a model (whether simpler or more complex) that explains or predicts more of the data.

I realise that there are what some purport to be “formal versions” of Occam’s Razor (e.g. AIC), but the relationship seems more metaphorical to me.

Make your models as simple as possible but no simpler

A paper I like discussing how Bayes model comparison can favor simple models, but the right complex models do well too (and how it relates to Occam) is:

http://mlg.eng.cam.ac.uk/pub/pdf/RasGha01.pdf

I also like Radford Neal’s constructive suggestion of one way to use whatever it is a simple model is doing in a more complex model:

http://www.cs.toronto.edu/~radford/pit.abstract.html

I think there’s plenty more to do in this direction.

Can you say more about “so-called network of models”?

Is that the same as or something different from the idea of “model hierarchy”?

i.e.,

as in climate models where one starts with simple models and adds increasing details of physics, biology, etc, in part to understand whether the added details actually help.

or in computer design, where models range from simple functional models to detailed gate-by-gate logic models, with detailed timing added later.

My problem with OR is that I have rarely found a situation in which two theories (or models) provide the same explanation and one is simpler. Different models explain or predict different things, so OR is useless. If I use a regularized regression (or regression with priors), different penalties on the coefficients will result in different models. And each model will be better at some parts of my sample. So how will OR be useful here?

You could have simple or complicated penalties. Few or many hyperparameters.

But in vanilla penalized regression, as when one constrains the sum of coefficients to be X, the power against overfitting comes from the stringency of the constraint.

More complex, and presumably less restrictive, constraints defeat the purpose of it.

If you were to plot predictive success out of sample against complexity of the model, I bet you’ll find a hump-shaped relation with a single optimum.

In my reading OR says don’t go beyond that optimum.

I never understood Occam’s Razor to mean that you should put a ceiling on the complexity of your models. I thought it said that if two models seem to explain the data equally well, a good rule of thumb is to go with the simpler model. I don’t think OR says that you should disregard more complicated models that do a better job of explaining the data.

Zach:

No, Occam is generally associated with penalizing more complicated models, not just breaking the tie between two models that fit the data identically well. Again, if you’re fitting using least-squares, this makes sense. But I don’t see it as a good general principle.

If I understand it correctly, the statistical justification for “Occam’s razor” is the necessity of avoiding overfitting. When fitting a structured model with informative priors you do not do any optimization (well, you might be interested in MAP estimates), but that doesn’t mean the problem of overfitting disappears. Ceteris paribus, the more complex the model, the more you pay in terms of posterior uncertainty, and so a kind of “Occam’s razor” is built into Bayesian inference. In fact, Bayesian inference is known to be closely related to various AIT methods which penalize for complexity in one way or another. I’m sure you know all this, so perhaps I am missing something here?

One issue that arose from this conference, in my mind at least, is this: what implications do new methodologies of scientific inference have for philosophy of science? (http://errorstatistics.com/2012/06/26/deviates-sloths-and-exiles-philosophical-remarks-on-the-ockhams-razor-workshop/). The machine learners, some of them, seemed to suggest that philosophy of science was just a handmaiden to technical methods in science, and were bound to embrace whatever philosophy of knowledge those methods seemed to underwrite—according to those using or developing the techniques. In particular, the view implicit in some talks was that philosophers of science ought to embrace a naive sense data instrumentalism, since that comports well with certain work in machine learning. Naturally, I regard this as wrong-headed, and rather surprising, given how very distant their conceptions of philosophy seemed to be from my own practice. But there was too much else to discuss at the conference to consider this kind of meta-issue.

Mayo:

I see what you’re saying, but I’d rather have a scientist think that philosophy is useless, than have a scientist who is a slave to some oversimplified philosophy. I vividly remember going to the Bayesian conference in 1991 and talking with people who (a) had no interest in checking the fit of their models to data and (b) seemed to think there was something illegitimate about model checking. These people were slaves to the subjective Bayesian philosophy.

Perhaps I wasn’t clear, but my remark here had nothing to do with philosophy being useless or useful for science. It had to do with scientists (e.g., machine learners) supposing that they were the deciders of philosophical debates (about which they have a certain outsider’s perspective). See my blogpost. Your example, by the way, is an example of where philosophical considerations are relevant for science, in the sense that if your statistical philosophy is at odds with justifying what you’re doing, then it is not the “philosophy” you need, and you should reject/revise it.

Wow, that is scary! I hope they’re not working with medical data, but I suppose that’s just wishful thinking.

In medical diagnostics, a more complex model requires more assumptions, each with a degree of uncertainty, leading to conjunctions with proportionally lower probability. Funny thing is, the multiple explanatory hypothesis often turns out to, apparently, be true. Sometimes a preference to be reductionistic, to apply Ockham’s razor, is a lazy surrender to minimize cognitive effort (Daniel Kahneman style).

I tend to think of a model with a lower L2 norm on the parameters (or whatever regularization you want) as “simpler” model. But maybe that’s silly.

Naive question: what do you mean when you say “if you’re doing least squares (for example, in fitting ARMA models), you need to penalize those big models, but this is not such a concern if you’re regularizing”? I thought regularization referred to things like the LASSO, where you do penalize big models.

Many cultures favor Occam’s Butterknife: he who comes up with the most complicated theory wins because he is the Smartest Guy in the Room. A comparison of the scientific and engineering productivity of Razor cultures v. Butterknife cultures might be informative.