Classics of statistics

Christian Robert is planning a graduate seminar in which students read 15 classic articles of statistics. (See here for more details and a slightly different list.)

Actually, he just writes “classics,” but based on his list, I assume he only wants articles, not books. If he wanted to include classic books, I’d nominate the following, just for starters:
– Fisher’s Statistical Methods for Research Workers
– Snedecor and Cochran’s Statistical Methods
– Kish’s Survey Sampling
– Box, Hunter, and Hunter’s Statistics for Experimenters
– Tukey’s Exploratory Data Analysis
– Cleveland’s The Elements of Graphing Data
– Mosteller and Wallace’s book on the Federalist Papers.
Probably Cox and Hinkley, too. That’s a book that I don’t think has aged well, but it seems to have had a big influence.

I think there’s a lot more good and accessible material in these classic books than in the equivalent volume of classic articles. Journal articles can be difficult to read and are typically filled with irrelevant theoretical material, the kind of stuff you need to include to impress the referees. I find books to be more focused and thoughtful.

Accepting Christian’s decision to choose articles rather than books, what would be my own nominations for “classics of statistics”? To start with, there must be some tradeoff between quality and historical importance.

One thing that struck me about the list supplied by Christian is how many of these articles I would definitely not include in such a course. For example, the paper by Durbin and Watson (1950) does not seem at all interesting to me. Yes, it’s been influential, in that a lot of people use that statistical test, but as an article, it hardly seems classic. Similarly, I can’t see the point of including the paper by Hastings (1970). Sure, the method is important, but Christian’s students will already know it, and I don’t see much to be gained by reading the paper. I’d recommend Metropolis et al. (1953) instead. And Casella and Strawderman (1981)? A paper about minimax estimation of a normal mean? What’s that doing on the list??? The paper is fine–I’d be proud to have written it, in fact I’d gladly admit that it’s better than 90% of anything I’ve ever published–but it seems more of a minor note than a “classic.” Or maybe there’s some influence here of which I’m unaware. And I don’t see how Dempster, Laird, and Rubin (1977) belongs on this list. It’s a fine article and the EM algorithm has been tremendously useful, but, still, I think it’s more about computation than statistics. As to Berger and Sellke (1987), well, yes, this paper has had an immense influence, at least among theoretical statisticians–but I think the paper is basically wrong! I don’t want to label a paper as a classic if it’s sent much of the field in the wrong direction.

For other papers on Christian’s list, I can see the virtue of including in a seminar. For example, Hartigan and Wong (1979), “Algorithm AS 136: A K-Means Clustering Algorithm.” The algorithm is no big deal, and the idea of k-means clustering is nothing special. But it’s cool to see how people thought about such things back then.

And Christian also does include some true classics, such as Neyman and Pearson’s 1933 paper on hypothesis testing, Plackett and Burman’s 1946 paper on experimental design, Pitman’s 1939 paper on inference (I don’t know if that’s the best Pitman paper to include, but that’s a minor issue), Cox’s hugely influential 1972 paper on hazard regression, Efron’s bootstrap paper, and classics by Whittle and Yates. Others I don’t really feel so competent to judge (for example, Huber (1985) on projection pursuit), but it seems reasonable enough to include them on the list.

OK, what papers would I add? I’ll list them in order of time of publication. (Christian used alphabetical order, which, as we all know, violates principles of statistical graphics.)

Neyman (1935). Statistical problems in agricultural experimentation (with discussion). JRSS. This one’s hard to read, but it’s certainly a classic, especially when paired with Fisher’s comments in the lively discussion.

Tukey (1972). Some graphic and semigraphic displays. This article, which appears in a volume of papers dedicated to George Snedecor, is a lot of fun (even if in many ways unsound).

Lindley and Smith (1972). Bayes estimates for the linear model (with discussion). JRSS-B. The methods in the paper are mostly out of date, but it’s worth it for the discussion (especially the (inadvertently) hilarious contribution of Kempthorne).

Akaike (1973). Information theory and an extension of the maximum likelihood principle. From a conference proceedings. I prefer this slightly to Mallows’s paper on Cp, written at about the same time (but I like the Mallows paper too).

Rubin (1976). Inference and missing data. Biometrika. “Missing at random” and all the rest.

Wahba (1978). Improper priors, spline smoothing and the problem of guarding against model errors in regression. JRSS-B. This stuff all looks pretty straightforward now, but maybe not so much so back in 1978, back when people were still talking about silly ideas such as “ridge regression.” And it’s good to have all these concepts in one place.

Rubin (1980). Using empirical Bayes techniques in the law school validity studies (with discussion). JASA. Great, great stuff, also many interesting points come up in the discussion. If you only want to include one Rubin article, keep this one and leave “Inference and missing data” for students to discover on their own.

Hmm . . . why are so many of these from the 1970s? I’m probably showing my age. Perhaps there’s some general principle that papers published in year X have the most influence on graduate students in year X+15. Anything earlier seems simply out of date (that’s how I feel about Stein’s classic papers, for example; sure, they’re fine, but I don’t see their relevance to anything I’m doing today, in contrast to the above-noted works by Tukey, Akaike, etc., which speak to my current problems), whereas anything much more recent doesn’t feel like such a “classic” at all.

OK, so here’s a more recent classic:

Imbens and Angrist (1994). Identification and estimation of local average treatment effects. Econometrica.

Finally, there are some famous papers that I’m glad Christian didn’t consider. I’m thinking of influential papers by Wilcoxon, Box and Cox, and zillions of papers that introduced particular hypothesis tests (the sort that have names that they tell you in a biostatistics class). Individually, these papers are fine, but I don’t see that students would get much out of reading them. If I were going to pick any paper of that genre, I’d pick Deming and Stephan’s 1940 article on iterative proportional fitting. I also like a bunch of my own articles, but there’s no point in mentioning them here!

Any other classics you’d like to nominate (or places where you disagree with me)?

31 thoughts on “Classics of statistics”

  1. I have a couple nominations:

    Hal White's 1980 paper on consistent estimators of the covariance matrix with relaxed assumptions about regression "disturbances":….
    This paper has been very influential, served as a model for many extensions, and was a great example of statistical writing in general.

    James Heckman's 1979 paper on selection bias:

    I am surprised you didn't include Gelfand and Smith 1990 (or perhaps the Geman and Geman 84):

  2. Books:

    Box and Jenkins, Time Series Analysis (Box is already on your list, but he's had a long career so it's not that much on a per-year basis).

    A.S.C. Ehrenberg, Repeat Buying. Now on line here:… This is a must for all gamma-poisson / negative binomial or Dirichlet distribution fans. This is one of the seminal works in marketing science (which is even less a science than political science, but I digress).

  3. Joe:

    Green's paper is very clever but it's still just computational. I exclude it from the "classics" for the same reason I exclude Dempster, Laird, Rubin.


    White (1980) is a good one. It's not a method I actually use, so I don't think I'm really the one to nominate it, but maybe it belongs on the list. Heckman (1979), maybe not. I know it's influential, but my impression is that these methods have gone out of style. I didn't include Gelfand and Smith for the same reason I didn't include the EM paper; to me, it's more about computation than statistics.


    Yup. I just had the 1970s on my mind.

  4. I'm intrigued by the mention of an inadvertently hilarious commentary by Kempthorne. My understanding is that he was a proponent of randomized experimental design and felt that post-data inference should be based on the randomization distribution; I can see how that might have led to hilarity, but I can't track down that specific discussion online. Can you formulate a short short version suitable for a blog comment?

  5. Thanks for starting the debate, Andrew!

    I completely agree that books are more formative than papers for our students, but if I start assigning a book a week the students will all leave the course, for sure!

    I "forgot" the Lindley+Smith (1972) when writing the course page, but it was on my mind. Thanks for the other entries. I am however a bit reluctant to include "too old" papers that are definitely classics but harder for our students to get into… Neyman, Fisher or Pitman are thus borderline from this perspective…. I even agree with you that there are too many Stein-like papers, again a reflection of my age and training!

  6. Corey:

    See here for the silly Kempthorne quote, which has nothing to do with experimental design.

    P.S. We discuss inference and experimental design in chapters 6 and 7 of BDA. The short version is that design is relevant both to inference and to model checking, although not quite in the way that Kempthorne had in mind. A weakness of classical frequentist theory (and classical Bayesian theory too, in its own way) is in not separating the steps of inference and model checking. This has led to lots of confusion. And clearing it up is worth an article of its own, I think, not to mention a blog entry!

  7. Andrew – re: silly Kempthorne quote – did you not once post that others don't want posteriors from studies but rather likelihoods (so they can supply their own priors to the likelihood of the collective of the studies)?

    Anyways, Kempthorne could easily be accused of being awkward – especially when his student's work was involved.

    I would list references people are unlikely to run into – Mallows's The zeroth problem: Probability specification, the exchangeability papers, the one with Pregibon, Draper et al. and the Lindley et al. one, a paper or two from C. S. Peirce on randomization and confidence intervals from the 1800s (once I track them down) and something on systematic and random approximation by Gauss or Laplace or Chebyshev …

    But mostly those _I_ should re-read


  8. andrew, why do you say that ridge regression is silly? i thought ridge regression is linear regression with a normal prior on the betas?

  9. Keith:

    Kempthorne simply misunderstood the concept of exchangeability. The Lindley and Smith paper gets technical in places, so I could see how Kempthorne could get confused. But if you're confused, the right thing to do is to say so, not to go out and make ill-informed attacks. It would be disturbing rather than hilarious except that people don't really make that sort of error anymore, I think. (I do remember faculty at Berkeley having that misunderstanding of exchangeability, but that was over 15 years ago; I think they're better-informed now, or at least they realize that when they cough and make skeptical-sounding noises about Bayesian models, that they're sounding a bit old-fashioned.)

  10. Thanks for the link. The problem is that a person can be so ill-informed about something that they're not even aware of being confused about it.

  11. How about Sewall Wright (1921) "Correlation and Causation," Journal of Agricultural Research 20, 557-585? Or for a real classic, Thomas Bayes (1763) "An essay towards solving a problem in the doctrine of chances," Philosophical Transactions of the Royal Society of London.

    Incidentally, what is the purpose of the course? Is it historical or is being familiar with these papers supposed to be important for technical reasons or something else? Is the course supposed to hit the most influential things *all time*? In that case, it seems like including Legendre's introduction of least squares (1805?) should be on the list, along with letters or papers by Bernoulli and by Gauss.

  12. I know you have talked about p-values and the like many times, but, if I am not asking too much, could you briefly explain why Berger and Sellke are wrong?


  13. Jimmy: I want students to all read those "classics" for two reasons (a) a deep study of a foundational paper that contributed to the core of our field and (b) a constructed presentation to their fellow students. So I want to abstain from an historical perspective and old papers of the 1800's do not fit. Reading Bayes does not either.

  14. Hi prof Gelman

    I quickly checked the list of books you recommended. Would you please comment on the differences between "The Elements of Graphing Data" and "Visualizing Data", both by Cleveland? After a cursory look, the second book seems very useful for applied statisticians.

  15. Andrew: this would be nice "people don't really make that sort of error anymore"

    I just got taken to task for selling Kempthorne short once – what ends up in print is often much less than the real story…

    On the other hand, when I last had the opportunity to present something on exchangeability – I bailed.

    Do you really think it's that easy to understand/explain?

    p.s. being called away on 2 weeks holidays – hope teh speling mistake will be tolerated

  16. I'd also like to see Akaike's AIC included, but also definitely BIC [Schwarz 1978].

    One more (soon to be) recent classic in my opinion:
    the FDR paper by [Benjamini&Hochberg 1995]

  17. Luigi:

    I prefer Cleveland's first book.


    The paper by Benjamini and Hochberg is in Christian's list already. I don't like the FDR idea but I recognize that it's been very useful to people, so I would have no objection to including it in the course. But I would not include the BIC paper. I hate BIC, and I think it's had a negative influence in statistics. (See my 1995 paper with Rubin for more discussion of the topic.)

  18. Papers: Nelder and Wedderburn JRSS A 1972 on generalised linear models.

    Books: Jeffreys "Theory of probability".
    McCullagh and Nelder "Generalized linear models".
    Miller "Beyond ANOVA".

  19. Christian, have you looked at the three-volume "Breakthroughs in Statistics" series? It's a collection of classic articles, and each paper includes an introduction (by a different expert statistician in that field) putting it in context. Fun stuff.
    I can't find a table of contents online, but Amazon reviewers have listed at least a few of the articles in each volume.

    The scope might be too broad/historical for the course you are planning. Still, some of your students might enjoy having these classic articles bound together in a nice collection.

  20. Jerzy:

    These books are fine, but a lot of those articles don't seem like "breakthroughs" to me. If I were teaching such a class, I'd definitely like to choose my own articles.

    Also, I think it would be a very good idea to include an exemplary application article such as Rubin (1980).

  21. What an interesting idea! Did anyone suggest these two?

    Kaplan, E. L.; Meier, P.: Nonparametric estimation from incomplete observations. J. Amer. Statist. Assn. 53:457–481, 1958.

    Cornfield J (1951). A method of estimating comparative rates from clinical data; applications to cancer of the lung, breast, and cervix. J Natl Cancer Inst 11: 1269–1275.

    Another classic, though it is more conceptual than statistical, is

    Hill AB. "The Environment and Disease: Association or Causation?," Proceedings of the Royal Society of Medicine, 58 (1965), 295-300.

    Another pair of papers that helped start the whole area of multilevel modeling are

    Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 13-22.

    Laird, N.M. and Ware, J.H. (1982) "Random-Effects Models for Longitudinal Data", Biometrics, 38, 963–974.

    Also worth noting as possibly the one statistical methodology paper that more doctors than statisticians have read is:

    Bland JM, Altman DG (1986). "Statistical methods for assessing agreement between two methods of clinical measurement". Lancet 1 (8476): 307–10

    Steve Simon

  22. Bill:

    Regarding Berger and Sellke, see chapter 6 of BDA, in particular Section 6.7. I don't think it's useful to form a theory around calculating Pr(theta=0), at least not in the applications in social and environmental science that I've studied. See page 6 of this article for further discussion of this point.

  23. +1 for Nelder and Wedderburn JRSS A 1972 on Generalised linear models.

    So much of use came out of that.

    I was going to add that anyway. Another paper that opened things up was the Wilkinson paper on the nesting and crossing operators for specifying models. This made so many models easier to express, communicate, code & ultimately test. I guess it is not Stats, but it is maths of a high degree.

    Funnily, it took about 25 years before there was an implementation in SAS. Excluding my ex-bosses' SAS macros from '85, of course.

  24. What about Baron, R. M. and Kenny, D. A. (1986) The Moderator-Mediator Variable Distinction in Social Psychological Research – Conceptual, Strategic, and Statistical Considerations?
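
Comment 8 above describes ridge regression as linear regression with a normal prior on the betas. Here is a minimal sketch of that equivalence for the simplest case I could think of: one predictor, no intercept, known noise standard deviation. The data, the function names, and the values of sigma and tau are all made up for illustration. Under those assumptions the ridge estimate with penalty lambda = sigma^2 / tau^2 should coincide with the posterior mean (which is also the posterior mode) of the slope under a N(0, tau^2) prior.

```python
def ridge_slope(x, y, lam):
    """Ridge estimate of the slope in y = beta * x + noise (no intercept)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


def posterior_mean_slope(x, y, sigma, tau):
    """Posterior mean of beta, assuming beta ~ N(0, tau^2) and
    noise ~ N(0, sigma^2) with sigma known (standard conjugate result)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    precision = sxx / sigma**2 + 1.0 / tau**2  # data precision + prior precision
    return (sxy / sigma**2) / precision


# Illustrative data and hyperparameters (assumptions, not from the post).
x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.9]
sigma, tau = 1.0, 0.5

lam = sigma**2 / tau**2            # penalty implied by the prior
ridge_est = ridge_slope(x, y, lam)
post_est = posterior_mean_slope(x, y, sigma, tau)
ols_est = ridge_slope(x, y, 0.0)   # lam = 0 recovers least squares

print("ridge:    ", ridge_est)
print("posterior:", post_est)
print("ols:      ", ols_est)
```

The two estimates agree, and both are shrunk toward zero relative to least squares. Written this way, the penalty is just a prior scale that happens to be fixed in advance rather than estimated from the data.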

Comments are closed.