[image of cat with a checklist]

Paul Cuffe writes:

Your idea of “researcher degrees of freedom” [actually not my idea; the phrase comes from Simmons, Nelson, and Simonsohn] really resonates with me: I’m continually surprised by how many researchers freestyle their way through a statistical analysis, using whatever tests and presenting whatever results strike their fancy. Do you think there’s scope for a “standard recipe,” at least as a starting point for engaging with an arbitrary list of numbers that might pop out of an experiment?

I’ll freely admit that I am very ignorant about inferential statistics, but as an outsider it seems that a paper’s methodology often gets critiqued based on how the authors navigated various pitfalls, e.g. some sage shows up and says, “The authors forgot to check whether their errors are normally distributed, so their use of such-and-such a test is inappropriate.” It’s well known that humans can really only keep 7±2 ideas in their working memory at a time, and the list of potential statistical missteps goes well beyond this (perhaps your “Handy statistical lexicon” is intended to help a bit with regard to working memory?). I just wonder if there’s a way to codify all the relevant wisdom into a standard checklist or flowchart, so that inappropriate missteps are forbidden and the analyst is guided down the proper path without much freedom. How much of the “abstract” knowledge of pure statisticians could be baked into such a procedure for practitioners?

Atul Gawande has written persuasively on how the humble checklist can help surgeons overcome the limits of working memory to substantially improve medical outcomes. Is there any scope for the same approach in applied data analysis?

My reply:

This reminds me of the discussion we had a few years ago [ulp! actually almost 10 years ago!] on “interventions” vs. “checklists” as two paradigms for improvement.

It would be tough, though. Just to illustrate on a couple of your points above:

– I think “freestyling your way through a statistical analysis” is not such a bad thing. It’s what I do! I do agree that it’s important to share all your data, though.

– Very few things in statistics depend on the distribution of the errors, and if someone tells you that your test is inappropriate because your error terms are not normally distributed, my suggestion is to (a) ignore the criticism because, except for prediction, who cares about the error term, it’s the least important part of a regression model; and (b) stop doing hypothesis tests anyway!

But, OK, I’ll give some general advice:

1. What do practitioners need to know about regression?

2. See the advice on pages 639-640 of this article.

I hope that others can offer their checklist suggestions in the comments.

1) Make your code public. (I distrust anyone who refuses to do so.)

2) Make your data public. (This may not be possible in all cases, but a (vast?) majority of published work could make the data public.)

3) Make it easy to reproduce your work.

Special kudos to the folks behind the Maria mortality estimates, who score highly on all three.

Of course, this checklist is not exactly what is asked for. That is, it won’t prevent you from doing bad work. But, it maximizes the chance that any mistakes you make will be found and corrected, thereby serving the larger goals of Science.

Miller’s 7 +/- 2 experiment is a classic in cognitive science and it is widely known. But it’s not widely understood what we are counting with 7 +/- 2, or what short-term memory refers to. The Wikipedia article on the 7 +/- 2 memory limit is reasonably detailed.

After the background reading, it should be clear why this isn’t relevant to the discussion of doing careful statistical analyses. Not only can we use external resources, we can use our long-term memories, and even Wikipedia.

Statistical choices (in study design, data pre-processing, and analysis) are scientific choices, much like the various application-specific methodological choices involved in planning and executing a research study. Like these other choices, they should be defended based on a combination of theoretical considerations, results from previous studies (including simulation studies), and a clear statement of the research questions and operating constraints. Beyond clear statements and defenses, reproducible code would ideally make it easier for other researchers to scrutinize these choices. If such a checklist would be useful, it would mainly be a list of “don’t”s rather than “do”s, based on common choices that are widely agreed to be problematic by statisticians of essentially all stripes.

You said: “Very few things in statistics depend on the distribution of the errors, and if someone tells you that your test is inappropriate because your error terms are not normally distributed, my suggestion is to (a) ignore the criticism because, except for prediction, who cares about the error term, it’s the least important part of a regression model; and (b) stop doing hypothesis tests anyway!”

I’m not disputing the hypothesis test part of things, but prediction can be very important. In many cases of Bayesian decision-making, you need to produce the posterior predictive distribution, which requires knowing the distribution of the errors. If the errors are actually Cauchy distributed, rather than normally distributed, then this might impact the decisions you make.
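To give a sense of how much the tail assumption can matter, here’s a minimal sketch (hypothetical location and scale, Python standard library only) comparing the chance of an error landing more than five scale units out under normal versus Cauchy tails:

```python
import math

# Hypothetical location/scale; the point is only the tail comparison
loc, scale, threshold = 0.0, 1.0, 5.0
z = (threshold - loc) / scale

# P(error > threshold) under each tail assumption
p_normal = 0.5 * math.erfc(z / math.sqrt(2))
p_cauchy = 0.5 - math.atan(z) / math.pi

print(f"normal tail: {p_normal:.2e}")   # ≈ 2.9e-07
print(f"cauchy tail: {p_cauchy:.3f}")   # ≈ 0.063
```

Events the normal model calls once-in-millions, the Cauchy model calls routine, which is exactly the kind of gap that changes a decision.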

Andrew, as you quoted him, did say *except for prediction*.

My point was that he was making prediction seem unimportant and an afterthought. I don’t. So error distributions are important to me.

“Very few things in statistics depend on the distribution of the errors…”

Just wanted to clarify that very few things depend on the *marginal* distribution of the errors. Their dependence structure does often matter, and “nonparametric” methods that are robust against nonnormality are typically not robust against dependence. In general I agree with the sentiment though.
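A quick simulation of that point (hypothetical AR(1) setup, not any particular study): a naive one-sample t statistic, which assumes independent observations, rejects a true null far more often than the nominal 5% once the data are autocorrelated:

```python
import numpy as np

rng = np.random.default_rng(3)
n, rho, n_sims = 100, 0.7, 2000  # hypothetical AR(1) setup

rejections = 0
for _ in range(n_sims):
    e = rng.normal(0, 1, n)
    x = np.empty(n)
    x[0] = e[0]
    for t in range(1, n):
        x[t] = rho * x[t - 1] + e[t]  # autocorrelated, mean-zero data
    # Naive t statistic that (wrongly) treats the observations as independent
    tstat = x.mean() / (x.std(ddof=1) / np.sqrt(n))
    rejections += abs(tstat) > 1.96

rate = rejections / n_sims
print(rate)  # far above the nominal 0.05
```

Swapping in a rank-based test would not rescue this; the problem is the dependence, not the marginal distribution.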

I am confused by the statement that errors are unimportant “except for prediction”. Isn’t prediction usually the ultimate point of statistical analysis of data? Medical studies are used to choose treatments that are predicted to have the best outcomes. Political polls are used to predict the outcomes of elections and to help politicians predict which campaign tactics will help them get elected. An observation in physics is a prediction of what another experiment will observe if it measures the same quantity, and of which theoretical models are most likely to correctly predict the results of future, different measurements.

I think I must be misunderstanding something.

David:

When I say “except for prediction,” I mean “except for prediction of individual cases.” We often are interested in averages. For example, suppose you have a regression, y = a + bx + error, and the errors are not close to being normally distributed. Least squares (or, if you don’t have a lot of data, penalized least squares) can still give you good estimates of a and b and thus good predictions of the average of y, given x, even if the predictive distributions are wrong for individual cases.

Predictions of averages are not always sufficient, though. Imagine I want to invest in the S&P 500 or bonds. How risky each one is, as measured by its expected standard deviation, is very important for making the decision. In addition, the extent of fat tails, especially in the case of the S&P 500, is very important. Assuming normality might lead you to think that the world is less risky, and so you could be comfortable leveraging up your assets. Billions of dollars have been lost this way.
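A rough sketch of that point (the volatility number is made up): match a normal and a Student-t(3) return model on variance, then compare how often each produces a 5-sigma down day:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
sigma = 0.01  # made-up 1% daily volatility

normal = rng.normal(0, sigma, n)
# Student-t with 3 df, rescaled to the same variance (Var[t_3] = 3)
t3 = rng.standard_t(3, n) * sigma / np.sqrt(3)

loss = -5 * sigma  # a 5-sigma down day
freq_normal = (normal < loss).mean()
freq_t = (t3 < loss).mean()
print(freq_normal, freq_t)  # the t model produces such days orders of magnitude more often
```

Same mean, same variance; only the tails differ, and that is where the leverage decision lives.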

John:

I agree that we’re not always interested in averages, and that’s why I wrote “except for prediction” in my above post. In statistics and econometrics, people almost always focus on regression coefficients, and if that’s the goal, the distribution of the errors typically doesn’t really matter. If you’re interested in predictions of individual cases, it’s a different story.

John,

You might find it helpful to look up the difference between a “confidence interval” and a “prediction interval”. (Roughly speaking, the former is an uncertainty interval for a parameter, such as a mean, and the latter is an uncertainty interval for a future observation of the response variable.)
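For a concrete illustration of that distinction (all numbers simulated and made up): fit a simple regression and compute both intervals at a new x. The prediction interval carries the extra irreducible error term and is always the wider of the two:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
s2 = resid @ resid / (n - 2)        # residual variance estimate
XtX_inv = np.linalg.inv(X.T @ X)

x0 = np.array([1.0, 5.0])            # new point, x = 5
h = x0 @ XtX_inv @ x0                # leverage of the new point
z = 1.96                             # ~95%, normal approximation

ci_half = z * np.sqrt(s2 * h)        # confidence interval: mean response at x = 5
pi_half = z * np.sqrt(s2 * (1 + h))  # prediction interval: one future observation
print(ci_half, pi_half)              # the prediction interval is always wider
```

With more data the confidence interval shrinks toward zero, but the prediction interval never shrinks below the noise level of a single observation.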

This is something I’ve long wondered about.

In certain research niches (at least in political science), there is a “standard” analysis that people run. For example, OLS with a particular set of “standard” controls. People will come along and demonstrate results (e.g., adding a new variable) in this framework. Now, of course, the “standard” analysis is often flawed in various ways, and others will come along and run an entirely different (and perhaps more reasonable) analysis that shows something new.

Personally, I tend to find results of the first type more convincing even if I think the standard analysis is problematic. The existence of the standard analysis amounts to almost a kind of pre-registration and allays many forking paths concerns. Even if the second analysis sounds more reasonable, I become suspicious that the authors fished around for it.

I can’t decide if my feelings on this are actually rational, though. Where should we trade off between minimizing researcher degrees of freedom and getting the “right” model? This is a much more salient concern given observational data than experimental data (where the tradeoff is barely present, since you can just get more data). But I hardly ever see discussion of it.