I was asked to write an article for the Committee of Presidents of Statistical Societies (COPSS) 50th anniversary volume. Here it is (it’s labeled as “Chapter 1,” which isn’t right; that’s just what came out when I used the template that was supplied). The article begins as follows:

The field of statistics continues to be divided into competing schools of thought. In theory one might imagine choosing the uniquely best method for each problem as it arises, but in practice we choose for ourselves (and recom- mend to others) default principles, models, and methods to be used in a wide variety of settings. This article briefly considers the informal criteria we use to decide what methods to use and what principles to apply in statistics problems.

And then I follow up with these sections:

Statistics: the science of defaults

Ways of knowing

The pluralist’s dilemma

And here’s the concluding paragraph:

Statistics is a young science in which progress is being made in many areas. Some methods in common use are many decades or even centuries old, but recent and current developments in nonparametric modeling, regularization, and multivariate analysis are central to state-of-the-art practice in many areas of applied statistics, ranging from psychometrics to genetics to predictive modeling in business and social science. Practitioners have a wide variety of statistical approaches to choose from, and researchers have many potential directions to study. A casual and introspective review suggests that there are many different criteria we use to decide that a statistical method is worthy of routine use. Those of us who lean on particular ways of knowing (which might include: performance on benchmark problems, success in new applications, insight into toy problems, optimality as shown by simulation studies or mathematical proofs, or success in the marketplace) should remain aware of the relevance of all these dimensions in the spread of default procedures.

Regular blog readers will recognize many of these themes, but I hope this particular presentation has some added value. And this is as good a place as any to thank my many correspondents who’ve helped contribute to the development and expression of these ideas.

“Statistics is a young science” — I wish practitioners made this excuse less often. After all, we’d hardly allow aeronautical engineers or chemical engineers to say that when planes crash or reactors blow up.

Statistics is old enough now. Besides, in many “old” sciences too, lots of progress gets made every year.

Huh? That statement is not an excuse, it’s a description. And it’s really true that practitioners have a wide variety of statistical approaches to choose from.

Unfortunately, statistical methods do make things “blow up” in the sense of giving bad answers (see, for example, Daryl Bem). In those cases, we don’t make excuses. We try to figure out what went wrong so it won’t happen again.

I guess unsettled and groping was too awkward?

(Some Bayesian got really mad at me 15 years ago when I suggested that Bayesian analyses in clinical research were new technology, citing the Bayes publication.)


With regard to your concluding paragraph, maybe a blog entry on whether, and how, blogging, blog discussions, and the resulting social media impact have influenced your research would be nice.

Maybe an update of http://statmodeling.stat.columbia.edu/2010/05/24/blogging/

A simple example:

Lots of researchers use regression, and they default to OLS. Sure, there are times when OLS is the best choice, but usually they do not even consider orthogonal regression. Heck, I’ve never seen anyone explain why they used OLS instead of orthogonal.

It’s just the default. It’s the default for everyone.

Why? Well, I guess the math is harder for orthogonal? Today’s tools can handle both well enough for our purposes. And yet, the default is incredibly well entrenched.

(This is not meant to be an endorsement of any particular approach. Rather, it is an illustration of how even at the simplest levels, we have defaults of which we are not even aware.)

I think there’s more to it than that. I think it has to do with a hidden notion of causation and the parallels between regression and anova. If you do the orthogonal regression, you’re modeling variation that would be potentially “explained” by other factors in the anova.

OLS assumes a certain asymmetrical relationship. In some circumstances, that is the correct assumption. In other circumstances, an approach that is agnostic about the direction of the relationship is more appropriate.

But folks do not think about that. They do not make a decision. They just go with the default.

But doesn’t orthogonal regression (and forgive me if I am thinking of something else) depend on making a definite decision about the relative units in which the X and Y variables are measured?

If I’m right about this aspect of orthogonal regression (and I may be confused), it’s unfair to compare a “default” procedure [e.g. OLS] that says “here is your answer” with one [orthogonal regression, unless I’m wrong] that says “OK, after you make these additional decisions [which I can’t help you get right or wrong], only then can I give you an answer.” In that case, it would be pretty clear (sociologically) why one default is preferred. Is, indeed, incredibly strongly preferred. “There is no right answer here, but you need to make a choice based on your domain knowledge, and your choice affects the answer; I can’t give you a recipe for making your choice” utterly damns any statistical procedure for widespread use [excluding statisticians, of course, and also of course excluding commercial areas where real money is on the line].

1) No more than OLS does.

2) When using OLS, you need to decide whether to regress A on B, or to regress B on A. You need to decide which is the dependent and which is the independent variable, for regression purposes. When making this decision, you are coming close to positing causality, though you usually need to be careful to speak of association, not causality.

3) If you are not trying to make a prediction, but rather just to examine the relationship between A and B, why use OLS? If you are looking at an association (non-causally), orthogonal regression gives you a symmetrical relationship. It is the same whether you regress A on B or B on A. That is not true for OLS.

4) You CAN give a guideline.

* If you are going for accuracy of a one-way prediction, use OLS.

* If you are trying to model a relationship, but do not need to make one way predictions or do not have sufficient theory to make a strong causal claim, use orthogonal.

* If you think you might sometimes want to make predictions the other way too, and want the relationship to work equally well forwards and backwards, use orthogonal.

5) The goal is not for a particular statistical technique to gain widespread use. That’s marketing. The goal is for it to be used when it would be useful — and not used when other techniques would be more useful.

6) People should ALWAYS make decisions about statistical techniques based upon (in part) their domain knowledge. Choice of technique will ALWAYS affect the answer, and thus the choice should be made based on domain knowledge. Like any other form of advice or consulting, statisticians should make the reasoning clear to the subject domain experts so that THEY can make sure the statistician is not making incorrect assumptions about the nature of the problem or underlying phenomenon.
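The symmetry claim in point 3, and the units question raised earlier, can both be checked numerically. Here is a minimal sketch using synthetic data, with orthogonal (total least squares) regression implemented via the first principal component of the centered data; all names and data are illustrative, not from the original discussion:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(scale=0.5, size=1000)

def ols_slope(a, b):
    """Slope from regressing b on a (both centered): cov(a, b) / var(a)."""
    a = a - a.mean()
    b = b - b.mean()
    return (a @ b) / (a @ a)

def tls_slope(a, b):
    """Orthogonal (total least squares) slope: the direction of the
    first principal component of the centered (a, b) point cloud."""
    X = np.column_stack([a - a.mean(), b - b.mean()])
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[0, 1] / vt[0, 0]

# OLS is asymmetric: regressing y on x and x on y give different lines.
print(ols_slope(x, y), 1 / ols_slope(y, x))   # the two OLS slopes differ

# TLS is symmetric: swapping the axes exactly inverts the slope.
print(tls_slope(x, y), 1 / tls_slope(y, x))   # these agree

# But TLS is unit-sensitive: rescaling x does not simply rescale the
# orthogonal fit, whereas the OLS slope rescales exactly.
print(ols_slope(10 * x, y) - ols_slope(x, y) / 10)   # essentially zero
print(tls_slope(10 * x, y) - tls_slope(x, y) / 10)   # clearly nonzero
```

So both commenters have a point: the orthogonal fit is symmetric in the two variables, but it does depend on how they are scaled relative to each other, which is exactly the extra decision that makes it harder to use as a no-questions-asked default.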

The Andrew Gelman appreciation society from Australia is awake.

We love him even more now we know he is a Caskett shipper.

and yes I love your articles, I do hope more Aussies do as well!