Not frequentist enough.

I think that many mistakes in applied statistics could be avoided if people were to think in a more frequentist way.

Look at it this way:

In the usual way of thinking, you apply a statistical procedure to the data, and if the result reaches some statistical-significance threshold, and you get similar results from a robustness study, changing some things around, then you’ve made a discovery.

In the frequentist way of thinking, you consider your entire procedure (all the steps above) as a single unit, and you consider what would happen if you apply this procedure to a long series of similar problems.

The first thing to recognize is that the frequentist way of thinking requires extra effort: you need to define this potential series of similar problems and then either do some mathematical analysis or, more likely, set up a simulation on the computer.

In the usual way of teaching statistics, the extra effort required by the frequentist approach is not clear, for two reasons. First, textbooks present the general theory in the context of very simple examples such as linear models with no selection, where there are simple analytic solutions. Second, textbook examples of statistical theory typically start with an assumed probability model for the data, in which case most of the hard work has already been done. The model is just there, postulated; it doesn’t look like a set of “assumptions” at all. It’s the camel that is the likelihood (although, strictly speaking, the likelihood is not the data model; additional assumptions are required to go from an (unnormalized) likelihood function to a generative model for the data).
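To make that last point concrete, here’s a minimal sketch in R (my numbers, invented for illustration): a regression likelihood conditions on the predictors, so to simulate data you also need to assume a distribution for the predictors and values for the parameters.

n <- 100
x <- rnorm(n, 0, 1)                        # extra assumption: how the predictor is distributed
b <- c(1, 2); sigma <- 0.5                 # extra assumptions: parameter values
y <- b[1] + b[2]*x + rnorm(n, 0, sigma)    # only now do we have a generative model for (x, y)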

An example

To demonstrate this point, I’ll use an example from a recent article, Criticism as asynchronous collaboration: An example from social science research, where I discussed a published data analysis that claimed to show that “politicians winning a close election live 5–10 years longer than candidates who lose,” with this claim being based on a point estimate from a few hundred elections: the estimate was statistically significantly different from zero and similar estimates were produced in a robustness study in which various aspects of the model were tweaked. The published analysis was done using what I describe above as “the usual way of thinking.”

Now let’s consider the frequentist approach. We have to make some assumptions. Suppose, to start with, that losing an election has the effect of increasing your lifespan by X years, where X has some value between -1 and 1. (From an epidemiological point of view, an effect of 1 year is large, really on the high end of what could be expected from something as indirect as winning or losing an election.) From there you can work out what might happen from a few hundred elections, and you’ll see that any estimate will be super noisy, to the extent that if you fit a model and select on statistical significance, you’ll get an estimated effect that’s much higher than the real effect (a large type M error, as we say). You’ll also see that, if you want to get a large effect (large effects are exciting, right!), then you’ll want the standard error of your estimate to be larger, and you can get this by the simple expedient of predicting future length of life without including current age as a predictor. For more discussion of all these issues, see section 4 of the linked article. My point here is that whatever analysis we do, there is a benefit to thinking about it from a frequentist perspective—what would things look like if the procedure were applied repeatedly to many datasets?—rather than fixating on the results of the analysis as applied to the data at hand.
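As a rough sketch of that logic, here’s a simulation in R. The true effect of 1 year and the standard error of about 1.8 years are assumptions in the spirit of the discussion (the 1.8 matches the fake-data simulation in the comments below), not numbers from the published paper:

set.seed(123)
true_effect <- 1                         # assumed true effect, in years
se <- 1.8                                # assumed standard error for a study of this size
est <- rnorm(1e5, true_effect, se)       # sampling distribution of the estimate
signif <- abs(est/se) > 2                # select on statistical significance
mean(signif)                             # how often the procedure declares a discovery
mean(abs(est[signif]))/true_effect       # exaggeration ratio: the type M error
mean(est[signif] < 0)                    # type S error: significant but with the wrong sign

Under these assumptions, only a small fraction of replications reach significance, and the ones that do overstate the true effect several-fold.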

41 thoughts on “Not frequentist enough.”

  1. It is still important to ask: (1) What is the quantified evidence from the first analysis? and (2) Is the first analysis worth attempting to replicate? These are not frequentist questions in spirit.

    The point about analyses using an assumed probability model is especially clear and harks to the inclusion of more parameters in a (primarily Bayesian) analysis to recognize what we don’t know (e.g., degree of non-normality, degree of unequal variances, etc.). Starting the process with an assumption-laden model (whether frequentist or Bayesian methods are being used) leads to false downstream confidence.

    • >Starting the process with an assumption-laden model (whether frequentist or Bayesian methods are being used) leads to false downstream confidence.

      Aren’t all models rather assumption-laden? Are you referring to particular kinds of assumptions?

      • > Starting the process with an assumption-laden model

        And Andrew wrote:

        > additional assumptions are required to go from an (unnormalized) likelihood function to a generative model for the data

        Both of these descriptions are a bit misleading.

        The process always starts with some assumptions, which are collectively called the model. From those assumptions we then derive the likelihood.

        If the likelihood is later shown to be inconsistent with the data, you can only conclude that at least one of those assumptions was incorrect (Duhem-Quine thesis).

        Meanwhile, when it is consistent, you still cannot rule out that another set of assumptions could lead to an even better fit (affirming the consequent).

      • It’s harder to relax assumptions (unequal variance, non-normality) in the frequentist domain. With Bayes you can have parameters that take up less than one degree of freedom each (for example by putting a prior favoring normality but allowing relaxation of that assumption as more information becomes available).
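        One way to make that concrete (a sketch under my own assumptions, not the commenter’s setup) is a Student-t error model with a prior on the degrees of freedom that leans toward normality but lets the data pull toward heavier tails; the brms package and the gamma(2, 0.1) prior below are illustrative choices.

        library(brms)
        set.seed(10)
        dat <- data.frame(x = rnorm(200))
        dat$y <- 1 + 2*dat$x + rt(200, df = 5)     # fake data with mildly heavy-tailed errors
        fit <- brm(y ~ x, data = dat, family = student(),
                   prior = set_prior("gamma(2, 0.1)", class = "nu"))   # prior leans toward larger nu (near-normal) while allowing small values
        summary(fit)    # the posterior for nu shows how far the data pulled away from normality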

  2. > In the usual way of thinking, you apply a statistical procedure to the data, …

    That may be “usual”, but it would be better to come up with a model (and prior) that you believe, then use them to analyze your data.

  3. “From there you can work out what might happen from a few hundred elections, and you’ll see that any estimate will be super noisy”

    This seems like drawing the owl. You know: step 1, draw two ovals; step 2, draw a beautiful, photorealistic owl.

    It’s not entirely clear to me how you can work this out. I think this is where many (me included) stumble – these things are not so obvious to hoi polloi. I believe what you say is true (deference to authority), but I would be hard-pressed to write down the details and defend them.

    • Michael:

      OK, I guess I should write more on this. But for now let me keep it simple. Suppose you’re a researcher and you have some idea of the model you might want to fit. For example, a regression discontinuity analysis, which can be written as a linear regression with an outcome of interest (in this case, remaining years of life), a running variable (in this case, electoral vote margin), an indicator for whether the running variable is greater than zero, and some pre-treatment variables (in this case, candidate age at the time of the election).

      Then the first step is to turn your fitted model into a generative model. For linear regression, that’s easy, it’s just y = x*b + error. The next step is to make assumptions about all the parameters in the model: the intercept, the coefficients, and the error term. We can do that!

      If the outcome is remaining life in years, let’s just guess that the average age of candidates for governor is 55 years, and their average remaining years of life is 20. That’s perhaps an underestimate given that these people have already reached 55 years, but then again we’ll be dealing with old data, and life expectancy didn’t use to be so long . . . anyway, the exact number doesn’t really matter. So let’s put in 20 for the intercept on our regression. Then the coefficient for electoral vote margin: I have no idea on this, so I’ll just assume it’s 0. Again, this is not so important, as it’s just a variable we’re adjusting for in our analysis.

      Next we need the coefficient for the discontinuity. I have no idea on this either, but I’m pretty sure it won’t be more than 1, as it takes a hella lot to increase or decrease average life expectancy by a year. Let’s try 1 just to see what happens under this assumption that the effect is very large. Then we need the coef for age. Let’s assume it’s coded as age minus 55 so we don’t have to worry about the intercept from earlier. I’ll give it a coef of -0.9: the older you are, the less future life expectancy you will have. Or maybe -0.99 would be even better. We could always try it both ways, just to see if it makes a difference. Finally, we need a residual standard error. I’ll say 10, i.e. approximately two-thirds of people live to within +/-10 years of their predicted life expectancy.

      OK, now that we have the model, let’s simulate data. Suppose we have 500 elections. The only thing we need now is to simulate the predictors in the model. For simplicity, I’ll assume ages of candidates are normally distributed with mean 55 and standard deviation 10 (so that 95% are between the ages of 30 and 75), and I’ll assume that we’re only considering close elections, so the vote margin is uniformly distributed between +/- 0.1 (that is, the candidate gets between 45% and 55% of the two-party vote) and that it’s independent of candidate age.

      Here goes:

      library("rstanarm")
      n <- 500
      age <- rnorm(n, 55, 10)
      margin <- runif(n, -0.1, 0.1)
      win <- ifelse(margin > 0, 1, 0)
      y <- 20 + 0*margin + 0*win - 0.9*age + rnorm(n, 0, 10)
      fake <- data.frame(age, margin, win, y)
      fit <- stan_glm(y ~ win + margin + age, data=fake, refresh=0, algorithm="optimizing")
      print(fit)
      

      Let's check that it gives reasonable output:

      stan_glm
       family:       gaussian [identity]
       formula:      y ~ win + margin + age
       observations: 500
       predictors:   4
      ------
                  Median MAD_SD
      (Intercept) 13.8    2.6  
      win          1.2    1.8  
      margin      -4.3   15.5  
      age         -0.8    0.0  
      
      Auxiliary parameter(s):
            Median MAD_SD
      sigma 9.7    0.3    
      

      OK, now I'll loop it 100 times:

      n_loop <- 100
      b_hat <- rep(NA, n_loop)   # point estimates of the win effect
      b_se <- rep(NA, n_loop)    # standard errors of the win effect
      for (loop in 1:n_loop){
        n <- 500
        age <- rnorm(n, 55, 10)
        margin <- runif(n, -0.1, 0.1)
        win <- ifelse(margin > 0, 1, 0)
        y <- 20 + 0*margin + 1*win - 0.9*age + rnorm(n, 0, 10)
        fake <- data.frame(age, margin, win, y)
        fit <- stan_glm(y ~ win + margin + age, data=fake, refresh=0, algorithm="optimizing")
        b_hat[loop] <- coef(fit)["win"]
        b_se[loop] <- se(fit)["win"]
      }
      print(c(mean(b_hat), sd(b_hat)))   # mean and sd of the estimates across simulations
      print(mean(b_se))                  # average standard error
      

      And here's what we get:

      > print(c(mean(b_hat), sd(b_hat)))
      [1] 1.11 1.81
      > print(mean(b_se))
      [1] 1.79
      

      So, under the above assumptions, we will be able to estimate this discontinuity effect to within a standard error of about 1.8. This tells us that the study is too small to reliably estimate an effect of 1 year of age.

      Does that help?

      P.S. Yes, the above R code is kinda ugly. It's how I do things, so maybe there's a virtue here in that it demonstrates how even someone like me, who's a crude coder, can still do this sort of simulation.

      P.P.S. If you show this to the authors of the original study, they might reply that the true effect is actually 5 to 10 years, not just one year, in which case my above analysis is all wrong. My reply is that it's ludicrous to think that losing an election could cost an average of 5 to 10 years of life. Even if every losing candidate immediately took up the pastimes of smoking and sky diving and stuck with it for the rest of their (shortened) lives, and even if every winning candidate gave up cigarettes, alcohol, and steaks (unlikely for a politician, huh?) and performed regular yoga and meditation . . . even with all of that, I wouldn't expect to see such a large average effect as 5 years of life expectancy. Such a claim just contradicts everything else we know about life expectancy (except for other noisy statistical analyses selecting on statistical significance).

      But, in any case, the P.P.S. here is not really relevant to the main point of this comment, which is that, yes, it should not be difficult to set up this simulation before collecting the data and doing the analysis. Indeed, I think that being able to set up such simulations is an important part of learning applied regression, which is why we have a lot of these in Regression and Other Stories. Maybe not enough, though.

      • I also code like this…so I guess I’m an ugly coder too.

        The P.P.S. – I understand the main point of the comment was the sim, but as far as evaluation of the study goes, no sim was really needed then (other than to show how a result of 5-10 years is possible from a much smaller true effect), because it appears to all just boil down to the argument that 5-10 years is a ludicrous number. This simply reminds me of the common sense check.

        I was trying to teach some of the students in our lab some regression modeling this summer. I started by teaching them data simulation of simple scenarios like that in the above code. So instead of showing them first how to run a regression on data, I started off by showing them how to simulate data by coding a generative model.

        For me, simulation has been extremely helpful in trying to learn how to model data (and thinking about experiments, troubleshooting, etc). I wish I had learned everything in the reverse order that I learned it. Data simulation first, then analysis after. I think it might have been easier to learn modeling if I had first learned how to simulate the data that I was going to model.

        • Jd:

          The point of the simulation is that, even before seeing the data, the researchers could’ve realized that they did not have enough data to reliably study what they were trying to study.

          And, yes, 5-10 years is a ludicrous number—but I can only say that because I have some statistical numeracy. The idea of a particular event shortening one person’s life expectancy by 5-10 years, that’s plausible—just barely plausible, maybe, but plausible. It tips over into ludicrousness when it’s supposed to be the average effect in a population, cos then it would require these just-barely-plausible things happening for all or most people.

          By analogy, the idea of a perpetual motion machine is ludicrous—but only if you know enough physics, or trust enough physicists. If you just reason from intuition, you might say, “Yeah, sure, I could imagine a perpetual motion machine.” People could imagine unicorns too, even though they never seem to have actually existed.

        • “The point of the simulation is that, even before seeing the data, the researchers could’ve realized that they did not have enough data to reliably study what they were trying to study.”

          Sure, but isn’t that realization possible only by assuming that they know beforehand that 5-10 years is ludicrous? Only then could they have realized this through a simulation as you did. No simulation is going to help if one’s assumptions are so wildly off that ludicrous effect sizes are genuinely plausible and thus get programmed into the simulation (as you point out could be their counterargument to your simulation). With wildly large effect sizes, it wouldn’t be difficult to fool oneself in a simulation. Some basic knowledge of life expectancy, physics (perpetual motion machines), or equines (unicorns?) would seem a prerequisite to embarking upon a study anyway. As far as I can see, the common sense rule still applies and would actually be necessary to perform the simulation that gave the answer that you argue.

        • Jd:

          Sure, common sense helps. The point is that to do the simulation you need to put in some assumptions. If, before seeing the data, they wanted to put in the assumption that losing the election would cost on average 5 years of life, then, sure, they could go for it. And then if they were to send that analysis around, maybe someone would’ve pointed out the problem in that particular assumption. Or not! I’m not saying that doing this sort of simulation will protect from all errors.

        • “I’m not saying that doing this sort of simulation will protect from all errors.”

          Agreed. And I agree with your comment and post. My point (and maybe common sense is the wrong phrase) is that for some errors, no amount of simulation or thought about many experiments over the long run is going to help. It’s a fatal error or fundamental error. The same error that leads to failure to recognize ludicrous results might also lead to ludicrous simulation beforehand if it was attempted.

          Thus in the example, I think the error is actually a credulousness regarding large effect size that is present before any study is even attempted.

          BTW I know little about life expectancy, so I’m credulously buying that 1 year is a lot;-)

        • “I think it might have been easier to learn modeling if I had first learned how to simulate the data that I was going to model.”

          I agree. The paper “Visualization in Bayesian workflow” by Andrew and colleagues is very inspiring in this sense. I personally love how the students react when they ‘discover’ that modeling really makes sense only after the assessment of some simulations from the prior-predictive distribution (or model marginal likelihood). Looking at this, I’d say the Bayesian and frequentist ways of thinking are closer than one might think.

  4. > In the frequentist way of thinking, you consider your entire procedure (all the steps above) as a single unit, and you consider what would happen if you apply this procedure to a long series of similar problems.

    I consider all of Statistics as a single unit and one entire procedure.

      • So all of statistics is really a kind of global Quantum Mechanics problem! I like where you’re going with this.

        Finally the foundations of statistics is making progress.

        • Unfortunately frequentist statistics works just the opposite to quantum mechanics. 95% confidence intervals hold the true value 95% of the time, but once you collapse it to a particular interval for a particular data set, you just don’t know any more.

      • Wait, you wouldn’t believe the Bonferroni corrections I’m getting with this approach. Everyone needs to use an alpha = 10^(-100^1000)

        Which reminds me: why do we call large numbers “astronomical”?

        Numbers in astronomy are always something like 10^50. That’s large but not incomprehensible. Maybe there should be a more extreme category called “statistical” for truly incomprehensible numbers.

      • This puts you in good but generally disregarded company. Both John Stewart Bell and David Bohm felt that the only meaningful QM description of the world was that it was all one united wave function (and no such thing as collapses) and that no separation between “classical” and “QM” phenomena exists. This has generally been disregarded or, even worse, JS Bell’s work has been interpreted to mean kind of the exact opposite.

        • Only a wavefunction and no such thing as collapses? Bell wrote many things – including the following in one of his last papers (“Are there quantum jumps?”):

          “Either the wavefunction, as given by the Schrödinger equation, is not everything, or it is not right.

          “Of these two possibilities, that the wavefunction is not everything, or not right, the first is developed especially in the de Broglie—Bohm ‘pilot wave’ picture. […]

          “If, with Schrödinger, we reject extra variables, then we must allow that his equation is not always right. I do not know that he contemplated this conclusion, but it seems to me inescapable.”

          And he goes on to describe the GRW model of spontaneous collapse, which seemed to him “particularly simple and effective” and “a very nice illustration of how quantum mechanics, to become rational, requires only a change which is very small”.

        • Carlos,

          My understanding is that for the most part Bell took the approach that Bohm was right: there are hidden variables, namely the positions of all the particles. What that implies, given his inequalities, is that the world is nonlocal. That was for the most part his preferred picture. For example, in chapter 14 of “Speakable and Unspeakable in Quantum Mechanics”

          “I will try to interest you in the de Broglie – Bohm version of non-relativistic quantum mechanics. It is, in my opinion, very instructive. It is experimentally equivalent to the usual version insofar as the latter is unambiguous. But it does not require, in its very formulation, a vague division of the world into “system” and “apparatus,” nor of history into “measurement” and “nonmeasurement.” So it applies to the world at large, and not just to idealized laboratory procedures. Indeed the de Broglie-Bohm theory is sharp where the usual one is fuzzy, and general where the usual one is special.”

          This kind of talk is very typical of his work.

          > there are hidden variables, namely the positions of all the particles.

          The hidden assumption seems to be that particles cannot be influenced by their own past motion, e.g., in the same way a ship can be influenced by its own wake. This is assumed to be impossible in the most popular interpretations of QM.

          In other words, particle motion is only approximated by a Markov chain, and when you account for a decaying influence of past motion you can get the “strange” correlations without any weird explanations. Check out the walking oil droplet experiments.

        • World? Given the garden of forking paths, I despair of ever being able to calculate a true p-value until there’s a wavefunction for the multiverse!

        • OK, don’t let Bell’s work on spontaneous collapse theories distract us from the typical Bell presentation.

          Even then, I find your statement “the only meaningful QM description of the world was that it was all one united wave function” misleading.

          It suggests that the world is completely described using just a wave function when in fact the description is given by the actual position of every particle plus an auxiliary wave function. (The latter piece would be the truly hidden one, by the way: “That X rather than Psi is historically called a ‘hidden’ variable is a piece of historical silliness.”)

          You’re right, what I was thinking of was the alternative of one wave function per “experiment”, a distinction between observation and non-observation, a wave function that collapses, and no wave function applied to the “classical apparatus”.

          Bohm’s presentation basically says there’s just one big wave function which doesn’t collapse but pilots literally everything.

          Bell, being an excellent scientist, of course wasn’t dogmatically stuck to this view, but it appears he preferred it to the other theories that seemed to address his concerns.

  5. I think this paper is fatally flawed anyway:

    > First, healthy politicians might be more likely to win elections, e.g. if voters are more likely to reward attractive politicians

    > […]

    > We estimate the causal effect of winning the gubernatorial election using a sharp regression discontinuity design based on close elections. In doing this, we compare the longevity of candidates who narrowly win to candidates who narrowly lose the election. The underlying identification assumption is that candidates within this narrow margin are similar across all other characteristics that might affect longevity. Because election outcomes within this narrow bandwidth can be considered essentially random, the setup allows us to use candidates who narrowly lose the election as a counterfactual for the longevity of candidates who narrowly win—had they instead lost the election.

    It is much more likely that voters pick up on signs of health than that winning an election can increase longevity by 5+ years. To deal with this huge problem, they just assume that isn’t the case.

    What I’d like to know from the data is the life expectancy curve over time, and how that compares to the general population.

    • The hypothesis that voters pick up signs of health predicts a correlation of longevity with margin of victory, not a discontinuity near 50% which is what the paper analyzed.
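      A quick fake-data check (my sketch, with invented numbers) illustrates the distinction: if longevity varies smoothly with the vote margin, the margin term absorbs that relationship and the estimated discontinuity stays near zero.

      set.seed(2)
      n <- 5000
      margin <- runif(n, -0.1, 0.1)                     # close elections only
      win <- ifelse(margin > 0, 1, 0)
      years_left <- 20 + 30*margin + rnorm(n, 0, 10)    # smooth dependence on margin, no true discontinuity
      fit <- lm(years_left ~ win + margin)
      coef(summary(fit))["win", ]                       # discontinuity estimate: near zero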

      • > I think you’re falling into a common trap, which is trying to explain a pattern that can easily be explained by pure noise.

        “Pure noise” is another alternative explanation. I don’t see what is stopping them from assuming that didn’t happen either as part of their analysis. You can conclude anything if people are willing to accept enough questionable premises.

        > The hypothesis that voters pick up signs of health predicts a correlation of longevity with margin of victory, not a discontinuity near 50% which is what the paper analyzed.

        I don’t know who came up with that specific hypothesis, but it isn’t hard to hypothesize that the health issue plays a much bigger role in tight races.

  6. I really enjoyed the linked article. This parenthetical comment near the end of section 3 was great:

    > the fact that something is calculated automatically using some theory and a computer program doesn’t mean it’s correct in any particular example

    I need to convince my coworkers of this every day.

  7. > In the frequentist way of thinking, you consider your entire procedure (all the steps above) as a single unit.

    Non-statistician but fan of the blog here. This is how I came to understand (maybe) what a confidence interval is… and why “range of values with 95% probability, the range will contain the true unknown value of the parameter” (to quote a well-known introductory textbook at my desk) is really quite misleading. It’s really the whole unit (sampling -> interval calculation) that will produce intervals that will contain the true parameter x% of the time it’s repeated (as opposed to saying something about a single interval).

    It was doing some simulations that helped me finally get this, though, not parsing someone’s explanation of this single interval/procedure distinction I tried to describe above.
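    Here’s a minimal version of that kind of simulation (a sketch with arbitrary numbers): the whole sample-then-compute-interval procedure covers the true mean about 95% of the time, even though any single realized interval either contains it or doesn’t.

    set.seed(1)
    true_mean <- 5
    n <- 30
    covered <- replicate(10000, {
      y <- rnorm(n, true_mean, 2)                                # one new dataset
      ci <- mean(y) + c(-1, 1)*qt(0.975, n - 1)*sd(y)/sqrt(n)    # its 95% interval
      ci[1] < true_mean && true_mean < ci[2]                     # did this interval cover the truth?
    })
    mean(covered)    # close to 0.95 over repeated use of the procedure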

    • > It’s really the whole unit (sampling -> interval calculation) that will produce intervals that will contain the true parameter x% of the time it’s repeated (as opposed to saying something about a single interval).

      Yes, but only when the generative model is literally true. The “frequency guarantee” is not a guarantee about the real world, it’s a guarantee about what would happen in a world where instead of studying whatever you’re studying, you’re studying a high quality random number generator.

  8. I loved this in the linked article:
    “Or, to put it another way, there’s an attitude that causal identification + statistical significance = discovery, or that causal identification + robust statistical significance = discovery. But that attitude is mistaken. Even if you’re an honest and well-meaning researcher who has followed principles of open science”
    Think about all the articles that haven’t followed principles of open science. And it seems like you even had trouble setting up the reproduction due to data-to-code discrepancies.

  9. Alexander Pope would like your objection. You describe people who don’t understand basic implications of their model but who have the tools of analysis and thus of publication available. Like the hacks who Pope complained had access to the printing press. His objection was more snobbish than professional.
