Not frequentist enough.

I think that many mistakes in applied statistics could be avoided if people were to think in a more frequentist way.

Look at it this way:

In the usual way of thinking, you apply a statistical procedure to the data, and if the result reaches some statistical-significance threshold, and you get similar results from a robustness study that changes some things around, then you’ve made a discovery.

In the frequentist way of thinking, you consider your entire procedure (all the steps above) as a single unit, and you consider what would happen if you apply this procedure to a long series of similar problems.

The first thing to recognize is that the frequentist way of thinking requires extra effort: you need to define this potential series of similar problems and then either do some mathematical analysis or, more likely, set up a simulation on the computer.
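To make that concrete, here is a minimal sketch in Python of what such a simulation might look like. Everything in it (the normal data-generating model, the sample size, the function names) is an illustrative assumption of mine, not taken from any particular study: you define a family of similar problems, run the entire procedure on each simulated dataset, and tally the long-run behavior.

```python
# A minimal sketch (my own illustrative setup, not from any particular study):
# define a family of similar problems, run the *entire* procedure on each
# simulated dataset, and look at the long-run behavior of the output.
import numpy as np

rng = np.random.default_rng(0)

def simulate_problem(true_effect, n=200, noise_sd=10.0):
    """One hypothetical dataset from the assumed family of similar problems."""
    treated = rng.integers(0, 2, size=n)
    outcome = true_effect * treated + rng.normal(0, noise_sd, size=n)
    return treated, outcome

def whole_procedure(treated, outcome):
    """The entire analysis as a single unit: estimate, standard error, decision."""
    est = outcome[treated == 1].mean() - outcome[treated == 0].mean()
    se = np.sqrt(outcome[treated == 1].var(ddof=1) / (treated == 1).sum()
                 + outcome[treated == 0].var(ddof=1) / (treated == 0).sum())
    return est, abs(est / se) > 1.96   # (estimate, "discovery" flag)

results = [whole_procedure(*simulate_problem(true_effect=0.5)) for _ in range(5000)]
ests = np.array([r[0] for r in results])
discoveries = np.array([r[1] for r in results])
print("long-run rate of 'discoveries':", discoveries.mean())
print("average estimate among discoveries:", ests[discoveries].mean())
```

The interesting output is not any single run but the distribution across the replications: how often the procedure declares a discovery, and what the estimates look like when it does.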

The second thing to recognize is that, just because a statistical method is defined in a “classical” or hypothesis-testing framework, that doesn’t make it a “frequentist” method. For a method to be frequentist, it needs to be defined relative to some frequency calibration. A p-value or a confidence interval, by itself, is not frequentist; for it to be frequentist, there needs to be some model of what would be done in a family of replications of the procedure. This is the point that Loken and I make in section 1.2 of our forking paths paper.
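As a small illustration of that last point (again my own sketch, with made-up numbers): under a fully specified replication model with no true effect and no selection, a nominal p < 0.05 threshold fires about 5% of the time, but if the procedure quietly reports the best of several outcomes, the same p-value no longer has that frequency property.

```python
# Sketch of frequency calibration under an assumed replication model (all
# numbers are illustrative). With no true effect and one pre-specified outcome,
# p < 0.05 occurs about 5% of the time; picking the smallest of five p-values
# ("forking paths") breaks that calibration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def one_replication(n=100, n_outcomes=5):
    x = rng.integers(0, 2, size=n)             # binary predictor, no real effect
    ys = rng.normal(size=(n_outcomes, n))      # several unrelated outcomes
    pvals = [stats.ttest_ind(y[x == 1], y[x == 0]).pvalue for y in ys]
    return pvals[0], min(pvals)                # pre-specified vs. best-looking

reps = np.array([one_replication() for _ in range(2000)])
print("P(p < 0.05), single pre-specified outcome:", (reps[:, 0] < 0.05).mean())  # about 0.05
print("P(p < 0.05), best of five outcomes:", (reps[:, 1] < 0.05).mean())         # well above 0.05
```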

In the usual way of teaching statistics, the extra effort required by the frequentist approach is not clear, for two reasons. First, textbooks present the general theory in the context of simple examples such as linear models with no selection, where there are simple analytic solutions. Second, textbook examples of statistical theory typically start with an assumed probability model for the data, in which case most of the hard work has already been done. The model is just there, postulated; it doesn’t look like a set of “assumptions” at all. It’s the camel that is the likelihood (although, strictly speaking, the likelihood is not the data model; additional assumptions are required to go from an (unnormalized) likelihood function to a generative model for the data).

An example

To demonstrate this point, I’ll use an example from a recent article, “Criticism as asynchronous collaboration: An example from social science research,” in which I discussed a published data analysis that claimed to show that “politicians winning a close election live 5–10 years longer than candidates who lose.” That claim was based on a point estimate from a few hundred elections: the estimate was statistically significantly different from zero, and similar estimates were produced in a robustness study in which various aspects of the model were tweaked. The published analysis was done using what I describe above as “the usual way of thinking.”

Now let’s consider the frequentist approach. We have to make some assumptions. Suppose you start with the assumption that losing an election has the effect of increasing your lifespan by X years, where X has some value between -1 and 1. (From an epidemiological point of view, an effect of 1 year is large, really at the very high end of what could be expected as an average treatment effect of something as indirect as winning or losing an election.) From there you can work out what might happen with a few hundred elections, and you’ll see that any estimate will be super noisy, to the extent that if you fit a model and select on statistical significance, you’ll get an estimated effect that’s much higher than the real effect (a large type M error, as we say). You’ll also see that if you want to get a large effect (large effects are exciting, right?), you’ll want the standard error of your estimate to be larger, and you can get this by the simple expedient of predicting future length of life without including current age as a predictor. For more discussion of all these issues, see section 4 of the linked article. My point here is that whatever analysis we do, there is a benefit to thinking about it from a frequentist perspective (what would things look like if the procedure were applied repeatedly to many datasets?) rather than fixating on the results of the analysis as applied to the data at hand.
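Here is roughly what that calculation looks like in code. To be clear, this is my own schematic version with made-up numbers (a true effect of 1 year, a residual lifespan standard deviation of about 10 years, 400 elections), not the published analysis; the point is just to see what the significance filter does to the estimate.

```python
# Schematic frequentist check with assumed numbers (not the published analysis):
# suppose losing really adds 1 year of life, simulate a few hundred close
# elections, estimate the effect, and keep only "statistically significant" runs.
import numpy as np

rng = np.random.default_rng(2)
true_effect = 1.0      # assumed effect in years, the high end of the -1 to 1 range
n_elections = 400
lifespan_sd = 10.0     # rough residual sd of remaining lifespan, in years

exaggeration = []
n_sims = 5000
for _ in range(n_sims):
    lost = rng.integers(0, 2, size=n_elections)
    years_left = 20 + true_effect * lost + rng.normal(0, lifespan_sd, size=n_elections)
    est = years_left[lost == 1].mean() - years_left[lost == 0].mean()
    se = lifespan_sd * np.sqrt(1 / (lost == 1).sum() + 1 / (lost == 0).sum())
    if abs(est / se) > 1.96:                   # select on statistical significance
        exaggeration.append(abs(est) / true_effect)

print("share of simulations reaching significance:", len(exaggeration) / n_sims)
print("average exaggeration factor among significant estimates:", np.mean(exaggeration))
```

With these assumed numbers, the estimates that survive the significance filter come out roughly two to three times the true one-year effect, and the exaggeration gets worse if the standard error is inflated further, for example by not adjusting for current age.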

Comments on “Not frequentist enough.”

  1. > what would things look like if the procedure were applied repeatedly to many datasets?

    And, with modern computational power, you might as well create a full simulation!

  2. If you do the best you can on each individual “entire procedure … as a single unit” then you can’t help but do well in “a long series of similar problems”.

    The converse, however, is not true. Doing well in a long series doesn’t imply you’ll do the best you can on each individual procedure. A better name for this might be “The Frequentist trap”.
