Some Recent Progress in Simple Statistical Methods

Simple methods are great, and “simple” doesn’t always mean “stupid” . . .

Here’s the mini-talk I gave a couple days ago at our statistical consulting symposium. It’s cool stuff: statistical methods that are informed by theory but can be applied simply and automatically to get more insights into models and more stable estimates. All the methods described in the talk derived from my own recent applied research.

For more on the methods, see the full-length articles:

Scaling regression inputs by dividing by two standard deviations

A default prior distribution for logistic and other regression models

Splitting a predictor at the upper quarter or third and the lower quarter or third
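
As a rough illustration of the first title above, here is a minimal sketch of rescaling a numeric regression input by centering it and dividing by two standard deviations. The function name `scale_by_two_sd` and the toy `ages` data are assumptions made up for this example and do not come from the articles. See the full-length article for the actual recommendations, including how binary inputs are treated.

```python
import statistics

def scale_by_two_sd(values):
    """Center a numeric input and divide by two sample standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator)
    if sd == 0:
        raise ValueError("input is constant; cannot rescale")
    return [(v - mean) / (2 * sd) for v in values]

# Hypothetical data, for illustration only.
ages = [23, 35, 41, 52, 67, 29, 48]
scaled = scale_by_two_sd(ages)
```

After rescaling, the input has mean 0 and standard deviation 0.5, the same standard deviation as a binary 0/1 variable with equal proportions, so a one-unit change in the rescaled input corresponds to a two-standard-deviation change in the original.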

A message for the graduate students out there

Research is fun. Just about any problem has subtleties when you study it in depth (God is in every leaf of every tree), and it’s so satisfying to abstract a generalizable method out of a solution to a particular problem.

P.S. On the other hand, many of Tukey’s famed quick-and-dirty statistical methods don’t seem so great to me anymore. They were quick in the age of pencil-and-paper computation, and sometimes dirty in the sense of having unclear or contradictory theoretical foundations. (His stem-and-leaf plots and his methods for finding gaps and clusters in multiple comparisons seem particularly silly from the perspective of the modern era, however clever and useful they may have been at the time he proposed them.)

P.P.S. Don’t get me wrong, Tukey was great, I’m not trying to shoot him down. I wrote the above P.S. just to remind myself of the limitations of simple methods, that even the great Tukey tripped up at times.

15 thoughts on “Some Recent Progress in Simple Statistical Methods”

  1. It is indeed ironic that Tukey, who was involved in computing from the 1940s on, and who is credited with the terms "software" and "bit", pushed pencil-and-paper methods so hard. It seems to have arisen partly from his style of consultancy, while on planes or in committee meetings, etc.

    Stem-and-leaf plots can be useful for small datasets and can even be ideal if there is some pattern of digit preference which is part of the information. That's a bit limited, but then few people try stem-and-leafs with thousands or millions of data points, so no harm is done.

    A bigger problem, in my view, is the unthinking use of box plots for a small number of categories or variables, where they are often very uninformative about fine structure or even gross distribution shape.

    An instructive example: a U-shaped distribution comes out as a long box with short whiskers, but even experienced people often misread that kind of plot, and guess at a unimodal distribution with very short tails, although everyone should know that if 50% are inside the box, then 50% must be outside. But Tukey in the 1977 EDA book does have an example of exactly that form.
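
The long-box, short-whisker pattern described in this comment can be checked numerically. The sketch below is an illustration under assumptions that are not in the original comment: it draws from a Beta(0.5, 0.5) distribution as a stand-in for "a U-shaped distribution", and the sample size and seed are arbitrary.

```python
import random
import statistics

random.seed(0)
# Beta(0.5, 0.5) is U-shaped: most of the mass sits near 0 and near 1.
sample = [random.betavariate(0.5, 0.5) for _ in range(10_000)]

# Quartiles define the box; whiskers here run to the sample extremes,
# since no point lies more than 1.5 box lengths beyond the box.
q1, median, q3 = statistics.quantiles(sample, n=4)
box_length = q3 - q1
lower_whisker = q1 - min(sample)
upper_whisker = max(sample) - q3

# Share of observations lying outside the box.
outside = sum(v < q1 or v > q3 for v in sample) / len(sample)

print(f"box length:    {box_length:.3f}")
print(f"lower whisker: {lower_whisker:.3f}")
print(f"upper whisker: {upper_whisker:.3f}")
print(f"share outside: {outside:.3f}")
```

The box spans most of the range while each whisker is short, yet about half of the observations sit in those short whiskers, piled up near the two ends.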

  2. Andrew,

    How do you make such great looking slides? I especially love the mini-talk outline at the top of each slide. Do you have a PowerPoint template available for this kind of thing that you could share?

  3. I'll have to come back and read the articles you suggest; I do like the idea of simple methods.

    Don't knock the utility of paper and pencil! While it's nice to have R or some other system to do analyses, sometimes one truly is sitting at a lunch table or in a meeting in a conference room or wandering around a factory floor, and speed and ease of use are important. As a non-statistician, that's what I like about manual EDA: it provides quick answers that should have more validity than gut feel and yet don't take much more time to produce (and don't make it look like one can't make a decision without asking a computer first).

    It would be nice to have a good list of such tools. I've got the Hoaglin/Mosteller/Tukey book, I've got the booklets from a Stuart Hunter video course in statistics for managers, and I at some point read the ABCs of EDA, but I don't know what else might exist. I'd even be interested in useful nomograms that could be printed up in a Hipster PDA format!

  4. On the default prior distribution article you use fivefold cross validation, but out of curiosity what do you think about Efron & Tibshirani's 1997 article (or related literature in this line) "Improvements on Cross-Validation: The .632+ Bootstrap Method"?

    A training set of data has been used to construct a rule for predicting future responses. What is the error rate of this rule? This is an important question both for comparing models and for assessing a final selected model. The traditional answer to this question is given by cross-validation. The cross-validation estimate of prediction error is nearly unbiased but can be highly variable. Here we discuss bootstrap estimates of prediction error, which can be thought of as smoothed versions of cross-validation. We show that a particular bootstrap method, the .632+ rule, substantially outperforms cross-validation in a catalog of 24 simulation experiments. Besides providing point estimates, we also consider estimating the variability of an error rate estimate. All of the results here are nonparametric and apply to any possible prediction rule; however, we study only classification problems with 0-1 loss in detail. Our simulations include "smooth" prediction rules like Fisher's linear discriminant function and unsmooth ones like nearest neighbors.
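
To make the idea in the quoted abstract concrete, here is a minimal sketch of the plain .632 bootstrap estimate of prediction error, which weights the apparent (training) error by 0.368 and the leave-one-out bootstrap error by 0.632. This is the simpler predecessor, not the .632+ rule itself, which adds a further correction for overfit rules. The nearest-centroid classifier, the simulated one-dimensional data, and all function names are assumptions chosen to keep the example self-contained.

```python
import random
import statistics

def fit_centroids(data):
    """Nearest-centroid rule: store the mean of x for each class label."""
    labels = {y for _, y in data}
    return {label: statistics.mean(x for x, y in data if y == label)
            for label in labels}

def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def error_rate(centroids, data):
    """Fraction of (x, y) pairs misclassified under 0-1 loss."""
    return sum(predict(centroids, x) != y for x, y in data) / len(data)

def bootstrap_632(data, n_boot=200, seed=0):
    """Plain .632 estimate: 0.368 * apparent error + 0.632 * out-of-bag error."""
    rng = random.Random(seed)
    n = len(data)
    apparent = error_rate(fit_centroids(data), data)
    misses = [0] * n  # out-of-bag misclassifications per observation
    trials = [0] * n  # number of resamples that left each observation out
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        train = [data[i] for i in idx]
        if len({y for _, y in train}) < 2:
            continue  # skip degenerate resamples containing a single class
        centroids = fit_centroids(train)
        in_bag = set(idx)
        for i in range(n):
            if i not in in_bag:
                x, y = data[i]
                trials[i] += 1
                misses[i] += predict(centroids, x) != y
    # Average each observation's out-of-bag error, then average over observations.
    loo = statistics.mean(m / t for m, t in zip(misses, trials) if t > 0)
    return 0.368 * apparent + 0.632 * loo

# Simulated two-class data (hypothetical): class 0 ~ N(0, 1), class 1 ~ N(2, 1).
data_rng = random.Random(1)
data = ([(data_rng.gauss(0, 1), 0) for _ in range(30)]
        + [(data_rng.gauss(2, 1), 1) for _ in range(30)])
estimate = bootstrap_632(data)
```

Each observation is scored only by rules fit to resamples that omit it, which is what lets the bootstrap term play the role of a smoothed cross-validation estimate.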

  5. I have to disagree with the split suggestion. It is very prevalent among certain fields of epidemiology, and the harms are evident in many cases. For example, the comparability of the results from different populations becomes difficult, especially if (and when) the relation is non-linear. Also, you should count the two degrees of freedom used to estimate the cutpoints (tertiles). With proportional hazards, the recoding is not equivalent to dropping the middle group. See… for more discussion.

  6. Chris,

    Andrew is using LaTeX and the beamer package to produce the presentations. It's much better than PowerPoint, especially for equations, and it can directly include .eps files from R. A good alternative to beamer is powerdot.

  7. Nick,

    Yeah, I hate boxplots. Sometime I'll get around to writing my article, "Better than a boxplot," explaining what to do instead.


    I have to admit, I still make graphs by hand when no computer is available. My work notebook is gridded so it's easy to use the pages as graph paper. From my background as a physics student, I have long practice in accurately putting points at 1/5, 2/5, 3/5, 4/5 between the grid lines.


    This looks interesting. Maybe not critical to our particular problem, since we're using a large corpus to gain insight into a general-use prior distribution rather than trying to optimize for each dataset. But certainly it's something I should know about.


    I both agree and disagree, will post something more fully on the topic soon.

  8. Inspirational stuff!

    My wants: a graphical representation which indicates between-subject variation as well as within-subject variation, on the same plot. Start with a simple design, say two categorical predictors, two levels, and a Gaussian outcome. Data where a random intercept will do.

    Then once that's been sorted, how about a clever representation for summarising data where you need random slopes as well. Plotting each individual is cheating.

  9. More generally, it would be lovely to have some nice ways to visualize what's going on when you've got categorical predictors with more than 2 levels, and interactions with continuous predictors. At the moment I plot predictions using fixed effects to try to make sense of what's going on. Fiddling around with levels is a pain.

  10. The paper on default priors for logistic regression is very interesting. It looks like it may solve a problem my colleagues and I have had in allowing users to specify priors for covariate effects as part of a clinical trial program.

  11. Hi, Andrew:
    Sorry for digging deep into the history. The link to the mini-talk is dead now. Do you still have it?

  12. It did not work for me either, though I can find it from your homepage. You have to omit the research part of the link as given above.
