“Frequentism-as-model”

Christian Hennig writes:

Most statisticians are aware that probability models interpreted in a frequentist manner are not really true in objective reality, but only idealisations. I [Hennig] argue that this is often ignored when actually applying frequentist methods and interpreting the results, and that keeping up the awareness for the essential difference between reality and models can lead to a more appropriate use and interpretation of frequentist models and methods, called frequentism-as-model. This is elaborated showing connections to existing work, appreciating the special role of i.i.d. models and subject matter knowledge, giving an account of how and under what conditions models that are not true can be useful, giving detailed interpretations of tests and confidence intervals, confronting their implicit compatibility logic with the inverse probability logic of Bayesian inference, reinterpreting the role of model assumptions, appreciating robustness, and the role of “interpretative equivalence” of models. Epistemic (often referred to as Bayesian) probability shares the issue that its models are only idealisations and not really true for modelling reasoning about uncertainty, meaning that it does not have an essential advantage over frequentism, as is often claimed. Bayesian statistics can be combined with frequentism-as-model, leading to what Gelman and Hennig (2017) call “falsificationist Bayes”.

I’m interested in this topic (no surprise given the reference to our joint paper, “Beyond subjective and objective in statistics”).

I’ve long argued that Bayesian statistics is frequentist, in the sense that the prior distribution represents the distribution of parameter values among all problems for which you might apply a particular statistical model. Or, as I put it here, in the context of statistics being “the science of defaults”:

We can understand the true prior by thinking of the set of all problems to which your model might be fit. This is a frequentist interpretation and is based on the idea that statistics is the science of defaults. The true prior is the distribution of underlying parameter values, considering all possible problems for which your particular model (including this prior) will be fit.

Here we are thinking of the statistician as a sort of Turing machine that has assumptions built in, takes data, and performs inference. The only decision this statistician makes is which model to fit to which data (or, for any particular model, which data to fit it to).

We’ll never know what the true prior is in this world, but the point is that it exists, and we can think of any prior that we do use as an approximation to this true distribution of parameter values for the class of problems to which this model will be fit.
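To make the “true prior” idea concrete, here is a minimal simulation sketch. Everything numerical in it is an illustrative assumption, not from the post: a made-up population of problems whose parameters follow a Normal(0, 0.5) “true prior,” one noisy estimate per problem, and a default conjugate normal analysis. It illustrates that a 95% posterior interval from the default analysis has roughly 95% frequentist coverage over the class of problems when the working prior approximates the true prior, and can badly undercover when it does not.

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up "world": the set of problems to which this default model gets fit.
# Assumption: true effects across problems follow Normal(0, 0.5), and each
# problem yields one estimate y ~ Normal(theta, 1) with known sigma = 1.
true_prior_sd, sigma, n_problems = 0.5, 1.0, 200_000
theta = rng.normal(0.0, true_prior_sd, n_problems)   # draws from the "true prior"
y = rng.normal(theta, sigma)                         # one noisy estimate per problem

def coverage_of_default_analysis(working_prior_sd):
    """Frequentist coverage, over this class of problems, of the 95% posterior
    interval from a conjugate normal analysis with the given working prior."""
    post_var = 1.0 / (1.0 / working_prior_sd**2 + 1.0 / sigma**2)
    post_mean = post_var * y / sigma**2
    half = 1.96 * np.sqrt(post_var)
    return np.mean((theta > post_mean - half) & (theta < post_mean + half))

print("working prior = true prior (sd 0.5):", coverage_of_default_analysis(0.5))  # ~0.95
print("working prior far too narrow (0.2): ", coverage_of_default_analysis(0.2))  # well below 0.95
```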

I like what Christian has to say in his article. I’m not quite sure what to do with it right now, but I think it will be useful going forward when I next want to write about the philosophy of statistics.

Frequentist thinking is important in statistics, for at least four reasons:

1. Many classical frequentist methods continue to be used by practitioners.

2. Much of existing and new statistical theory is frequentist; this is important because new methods are often developed and understood in a frequentist context.

3. Bayesian methods are frequentist too; see above discussion.

4. Frequentist ideas of compatibility remain relevant in many examples. It can be useful to know that a certain simple model is compatible with the data; a minimal sketch of such a check follows this list.
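Here is that sketch, in the parametric-bootstrap style; the dataset, the choice of the normal model, and the skewness statistic are all made up for illustration, not taken from the post.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up data; the question is whether a simple Normal(mu, sigma) model is
# compatible with it, judged by one chosen statistic (here, sample skewness).
y = rng.gamma(shape=2.0, scale=1.0, size=50)

def skewness(x):
    x = x - x.mean()
    return np.mean(x**3) / np.mean(x**2) ** 1.5

# "What if the simple model were true" simulation: plug in the fitted mean and
# sd, simulate many datasets of the same size, and see how often they show
# skewness at least as extreme as the observed value.
mu_hat, sd_hat = y.mean(), y.std(ddof=1)
sims = rng.normal(mu_hat, sd_hat, size=(10_000, y.size))
sim_skew = np.apply_along_axis(skewness, 1, sims)

p = np.mean(np.abs(sim_skew) >= abs(skewness(y)))
print("compatibility p-value for the normal model, skewness statistic:", p)
```

A small p-value here says only that the simple normal model, judged by this particular statistic, sits poorly with the data; a large one says the model is compatible with the data in this respect, not that it is true.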

So I’m sure we’ll be talking more about all this.

29 thoughts on ““Frequentism-as-model””

  1. I guess there are instances where you might have many people collecting data sets and then running a single model over and over… Analytical chemistry or tumor biology or environmental surveys or stuff like that come to mind… But it seems to me that the terminology should be Bayes **can be** interpreted in terms of frequency, and that we can choose a prior to represent the range of datasets… On the other hand, we can tune our prior to every single dataset without regard to what other uses we may or may not put the model to in the future. And I think this is much more common than acknowledged by your formulation.

        • The Kindle Fire is an Android tablet, so the keyboard is on-screen. It’s true that you kind of need autocorrect because it’s too easy to hit the wrong key. But I find with the Kindle Fire the autocorrect does things like the following, where

          “we can tune **our** prior …may or may not **put** the model” I promise you that I typed “our” and “put” and it decided in both cases that I meant “out” … it does this kind of thing ALL THE TIME. It’s particularly frustrating when interacting on technical websites and trying to type code symbols like “incr” or “dgamma” or something, but it will do it even with perfectly good actual dictionary words totally changing the meaning of what I am writing.

    • I wasn’t expecting Andrew blogging on this so soon, so I missed this on its day. I’ll go through comments now and respond where there is something to respond.

      Re Bayes “is” vs “can be” frequentist, I’m with you (Daniel) there, as you can see reading my paper.

  2. As someone who uses a lot of statistical tools in business contexts, but doesn’t have as much background in academia – are there any good entry-level reads about the philosophy of statistics and the typology of statistical approaches?

  3. I agree with Hennig that many just pay lip service to “All models are false” and that this is a cause of a lot of misunderstanding in statistics.

    In order to get away from this, I have found that it helps to replace assumptions with building fake worlds in which you can repeatedly observe fake data and see what repeatedly happens in them under known fake truths. Fake is fake, but maybe “idealization” does not sound as obviously fake as “fake” (though that would depend on the audience).

    I also interpret the parameter prior as a model of a parameter-generating process, in line with your Bayesian reference set of parameters view.

    But my sense is many or even most on this blog don’t seem to?

    • Perhaps it’s my background with many different kinds of programming languages, but I think of the prior as a specification of a search space. The likelihood is a relative measure of compatibility (with the peak likelihood value being the most compatible with the data). The posterior is then the region of the search space that fits relatively well with our assumptions and the data.

      You can think of a sampler as kind of the continuous version of backtracking search in prolog.

      When we ask a sampler to sample, we are asking it to “search in the space described by the prior, spending time in any small region proportional to the prior density as modified by the comparison of the data to the model predictions”

      The reason why we might specify any given prior is ultimately that we think it’s relevant to search in that space. It might be relevant because we plan to reuse the model over and over again on similar problems in, say, analytical chemistry, or it might be relevant because for this one problem that we’re trying to solve, we think the best place to search is a certain region… Insisting on a frequentist interpretation based on repeated re-use is a mistake, because repeated re-use is just a special case of a reason to search.
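A minimal sketch of the “weighted search” picture described in the comment above, using a toy normal model and a bare-bones random-walk Metropolis step; all numbers and the Normal(0, 2) prior are illustrative assumptions, not from the comment.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy problem: unknown mean theta, prior Normal(0, 2), data y ~ Normal(theta, 1).
y = rng.normal(1.5, 1.0, size=20)

def log_prior(theta):   # the search space: where we are willing to look at all
    return -0.5 * (theta / 2.0) ** 2

def log_lik(theta):     # relative compatibility of theta with the data
    return -0.5 * np.sum((y - theta) ** 2)

# Bare-bones random-walk Metropolis: propose a nearby point and accept it with
# a probability driven by prior(theta) * likelihood(theta).
theta, chain = 0.0, []
for _ in range(20_000):
    proposal = theta + rng.normal(0.0, 0.5)
    log_ratio = (log_prior(proposal) + log_lik(proposal)
                 - log_prior(theta) - log_lik(theta))
    if np.log(rng.uniform()) < log_ratio:
        theta = proposal
    chain.append(theta)

print("posterior mean of theta ≈", np.mean(chain[2_000:]))
```

Asymptotically the chain spends time in any small region in proportion to prior density times likelihood, which is exactly the “search the space described by the prior, reweighted by the data” description above.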

    • One other direction for getting away from paying just lip service to “all models are false” is to face how many models are actually compatible with the data, which I think what I call “compatibility logic” encourages us to do more than Bayesian inverse probability logic does – although of course the vast majority of people running tests and confidence intervals don’t have that on their radar either.

  4. This is also why I’ve stopped describing any differences as “Bayes vs. frequentist” and have now defaulted to always saying “Bayes vs. classical.” There might be better ways to describe it: “Bayes vs. non-Bayes,” etc.

    • Zad

      Perhaps that’s why McElreath, in his “Statistical Rethinking” (2nd edition), calls the frequentist approach non-Bayes throughout.

    • My message in this respect is that Bayes vs. frequentist is the wrong dichotomy. Frequentist statistics (together with what the philosophers call “propensities”) is an interpretation of probability, to be contrasted with epistemic probability (which is often but not always Bayesian). Bayesian probability however is about a certain way of computing things and asking questions, what I call “inverse probability logic”, to be contrasted with “compatibility logic” (such as tests and confidence intervals). Inverse probability logic is compatible with frequentist, propensity and epistemic probability, and I think there could be a way to make compatibility logic compatible with epistemic probability, too, except that those who adhere to epistemic probabilities don’t seem to be interested in that.

  5. Andrew said: “I’ve long argued that Bayesian statistics is frequentist, in the sense that the prior distribution represents the distribution of parameter values among all problems for which you might apply a particular statistical model.”

    I think you would create less confusion if you added the proviso that this sense is not what Fisher or Neyman or, as far as I know, anyone from that era who self-identified as a frequentist meant by “frequentist.” And it is not as if Fisher, Neyman, etc. would say, “Oh, if you put it that way, Bayesian statistics really is frequentist, so go ahead and use Bayesian statistics,” because they explicitly disavowed the reinterpretation you are suggesting.

    Nevertheless, if we think about the expectation of a variable across all possible variables (that have an expectation), that ensemble could be described by some unknown probability distribution. But researchers typically don’t pull a variable out of an urn, collect data on it, and estimate its expectation. They have a very non-randomized way of choosing the variables they are interested in estimating the expectation of, so I don’t see the benefit in considering the frequency properties of that choice. And while someone could say to themselves, “Hmm, what prior distribution should I use for the expectation of Y? I know! I will use the distribution of expectations across all variables with an expectation,” that is a really poor choice on their part. They should be able to come up with a non-default prior that takes into consideration the particular outcome they are planning to measure and what units it is measured in. The fact that Bayes’ rule would still hold if researchers are lazy and use default priors that are marginal over research designs, and that researchers are coherently penalized for their laziness, does not mean they should be lazy.

    • Ben:

      I don’t see Fisher or Neyman as the guardians of frequentism any more than I see De Finetti or Jaynes or whoever as guardians of Bayesianism. My take on frequentism is that it is about the properties of a statistical procedure over some sort of distribution of possible applications of the procedure. I’m interested in this perspective, because I work on statistical methods that are used in many problems.

      Regarding your last paragraph: I agree that in any particular problem it would not make sense to choose a prior by considering all possible problems where the given method would be applied. But in understanding a general procedure, I do think it can make sense to think of this sort of prior.

      So, I’m a frequentist when trying to understand the statistical properties of a procedure, which is relevant when writing textbooks or methods articles or blogs on MRP or whatever. But I’m not a frequentist when doing data analysis; then I’m a Bayesian.

      • So you’re arguing that there is a story you can tell in which the prior represents repetition of related experiments, but you don’t actually use that story for anything because you don’t think it’s actually useful in practice?

        • “I’m not a frequentist when doing data analysis; then I’m a Bayesian.” – By being Bayesian you don’t necessarily become “not a frequentist”.
          I’m reiterating things here, but that’s because I’d like to make people aware (at the risk of annoying some by being repetitive) that the location of falsificationist Bayes within “frequentism-as-model” isn’t the only or even the core message of the arxiv paper on which this blog post is based.
          I think that “being Bayesian” should rather be contrasted with using what I call “compatibility logic”, i.e., not assigning probabilities to models/parameters but rather only stating that they are, at some level and relative to a specific statistic, compatible with the data or not. Tests and confidence intervals follow compatibility logic; they are pretty much always used together with a frequentist interpretation of probability, but in principle it doesn’t have to be like that either. “Frequentist” is a misnomer for tests and confidence intervals, because the word frequentism originally doesn’t refer to specific ways of doing inference.

          Re Daniel: Interpreting probabilities according to “frequentism-as-model” can be very… idealist, when it comes to specific situations. Granted, if I think about a specific version and I want to set up a prior with a frequentist interpretation, it isn’t in practice very clear over what situations this prior should apply. All situations in which I decide to apply a certain method? All problems that are reasonably similar from a subject-matter perspective, whatever that means? Frequentism is more of a mindset than a reference to objective reality. If we use it as a mindset for designing priors, in the first place it means that we imagine the experiment as repeatable in, if you want, repeatable worlds, and we imagine that it could have a distribution of parameters then, in accordance with our prior knowledge of the world. Given this mindset, probably the way of deciding what the prior should be is not much different from interpreting the prior in an epistemic way – it should formalise, as well as possible, the knowledge that you choose to let influence the analysis. The difference is that the model is interpreted as a data generator rather than something that refers to the researcher’s prior thoughts, and therefore the data actually generated can not only, in a Bayesian way, modify the distribution, but also provide evidence that the model overall (be it the sampling model, or the parameter prior, or how they work together) is so wrong that it should be replaced by something else.
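A minimal sketch of that last point, that the generated data can provide evidence against the model as a whole (prior and sampling model together): simulate entire datasets from the imagined model and compare a simple statistic with the observed one. The Normal(0, 1) prior, unit-variance sampling model, and over-dispersed fake “observed” data are illustrative assumptions, not from the comment.

```python
import numpy as np

rng = np.random.default_rng(4)

# Imagined model (all numbers are illustrative assumptions):
#   theta ~ Normal(0, 1),   y_i | theta ~ Normal(theta, 1),   n = 30.
# "What reality throws at us": fake observed data far more dispersed than
# anything the imagined prior-plus-sampling-model tends to generate.
y_obs = rng.normal(0.0, 4.0, size=30)

# Simulate whole datasets from the imagined model (prior predictive) and
# compare a simple statistic, the sample standard deviation.
n_sims = 10_000
theta_sim = rng.normal(0.0, 1.0, size=n_sims)
y_sim = rng.normal(theta_sim[:, None], 1.0, size=(n_sims, y_obs.size))
sd_sim = y_sim.std(axis=1, ddof=1)

print("observed sd:                   ", y_obs.std(ddof=1))
print("P(simulated sd >= observed sd):", (sd_sim >= y_obs.std(ddof=1)).mean())
```

An extreme result here does not identify which part is wrong (the prior, the sampling model, or how they fit together), only that the imagined model as a whole sits badly with the data.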

        • >By being Bayesian you don’t necessarily become “not a frequentist”.
          Agree but I am not sure what to make of Andrew’s comment…

          > “we imagine the experiment as repeatable in, if you want, repeatable worlds”
          Agree, it is in our imagination, which is where representations (models) actually live. And so don’t block a way to think about what happens in this world (as long as you realize you are doing that in a fake world).

  6. From the perspective of parameter estimation (as opposed to model building), falsificationist Bayes is frequentist. Both approaches use prior information. Bayesians prefer to aggregate that information before each study as a prior distribution and then update it after each study as a posterior distribution. Frequentists prefer to conduct a bunch of so-called independent (naive) studies and then aggregate retrospectively through meta-analysis. The only difference is the timing of the aggregation and the aggregation methods. Granted, these are substantial differences in practice, but the frequentist interpretation fits both approaches. (A small numerical sketch of the aggregation point follows this comment.)

    I’m not knowledgeable enough about Bayesian model building to say the same, but certainly best practice for frequentist model building is to be guided by strong theory, and a theory is organized prior information.
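The sketch referred to above: with normal likelihoods and a very flat starting prior, sequentially updating study by study gives essentially the same answer as a retrospective fixed-effect (inverse-variance-weighted) meta-analysis. The three estimates and standard errors are made-up numbers, not from the comment.

```python
import numpy as np

# Made-up numbers: three studies estimating the same effect, each reporting an
# estimate and a standard error.
est = np.array([0.40, 0.10, 0.25])
se  = np.array([0.20, 0.15, 0.25])

# (a) Sequential Bayesian aggregation with normal likelihoods, starting from a
# very flat Normal(0, 100^2) prior: each study turns the current prior into a
# posterior, which becomes the prior for the next study.
m, v = 0.0, 100.0**2
for e, s in zip(est, se):
    v_new = 1.0 / (1.0 / v + 1.0 / s**2)
    m = v_new * (m / v + e / s**2)
    v = v_new

# (b) Retrospective fixed-effect meta-analysis: inverse-variance weighting.
w = 1.0 / se**2
meta_mean = float(np.sum(w * est) / np.sum(w))
meta_se = float(np.sqrt(1.0 / np.sum(w)))

print("sequential Bayes:", m, np.sqrt(v))
print("meta-analysis:   ", meta_mean, meta_se)   # essentially the same numbers
```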

  7. > the distribution of underlying parameter values, considering all possible problems for which your particular model (including this prior) will be fit.

    One way to understand this would be to say that for a location-scale model the prior that is appropriate for the whole class of problems where a location-scale model will be fit is the non-informative maximum-entropy prior motivated by invariance considerations, i.e., the Jeffreys prior: uniform for the location and proportional to 1/sigma for the scale. I don’t think that calling it the “true” prior is a good idea, though.

    • Not necessarily. Suppose, for example, you’re setting up an analytical chemistry assay for detecting lead in drinking water. You might try to think about the, say, 10,000 different drinking-water suppliers you might be likely to test across the country, and try to create some prior that covers the range of lead they tend to experience. Let’s say that’s between 0 and 25 ppb with a mean of 1 ppb; you might create some kind of truncated t distribution or something (a rough sketch appears after this comment).

      You might do this, even if at the moment, you’re testing a particular location which has a long history of say higher lead levels due to ancient plumbing, or even if you’re testing a location which was built in the last decade and exactly none of the plumbing has lead pipes or solder…

      But you wouldn’t necessarily use uniform(0,1) for the lead fraction, because we know for sure that molten lead doesn’t come down the pipe, so anything even within 3 or 4 orders of magnitude of 10^9 ppb (i.e., pure lead) is totally irrelevant to the problem at hand.
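A rough sketch of the kind of prior described in this comment, using a truncated half-t for lead concentration in ppb; the half-t form, df = 3, and scale = 0.9 are assumptions tuned so the prior mean lands near 1 ppb, and are not taken from the comment.

```python
import numpy as np

rng = np.random.default_rng(5)

# Half-t prior for lead concentration in ppb, truncated to [0, 25].  The df and
# scale are assumptions chosen so the truncated prior mean comes out near 1 ppb.
scale_ppb, df = 0.9, 3
draws = scale_ppb * np.abs(rng.standard_t(df, size=500_000))  # half-t draws, ppb
draws = draws[draws <= 25.0]                                  # truncate to [0, 25]

print("prior mean (ppb):  ", draws.mean())             # roughly 1
print("P(lead > 15 ppb):  ", (draws > 15.0).mean())    # small but nonzero
print("P(lead > 1000 ppb):", (draws > 1000.0).mean())  # zero after truncation
```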

      • That’s another way to understand it. Or you could think about determining the concentration of lead in water also during industrial processes or in mining operations, not exclusively in drinking water. Or you could care only about one single water supplier at different points in the life of the installation. Or about the concentration of substances other than lead. There are a myriad of “classes” apart from the extremes “the most general use of this mathematical model, with a non-informative prior making as few assumptions as possible” and “the problem at hand, with a prior fully representing the knowledge about it”.

        One may want to use not-so-informative priors, but for that one doesn’t need to claim that there exists a true prior (and a true class that the problem at hand belongs to, I imagine) which is not the one specific to the problem and our knowledge of its particulars. It’s not a well-defined thing, and as far as I can see it doesn’t help practically or conceptually.

    • Note that it is very essential for frequentism-as-model, as I advertise it in the arxiv paper on which this posting is based, that it should be very clear that a frequentist model is a model, and as such different from the “truth”. We imagine a “true” distribution for the sake of doing inference (and I explore in the paper what we can get out of that), but we do *not* need to claim that this corresponds to any truth “out there” in (non-formal) reality.

      This, however, is not a specific issue with frequentism but with mathematical modelling in general, and particularly also with epistemic probability as model for rational thinking.

  8. Andrew: Obviously I’m happy that you comment positively on my paper, however here’s a disagreement:
    “We’ll never know what the true prior is in this world, but the point is that it exists, and we can think of any prior that we do use as an approximation to this true distribution of parameter values for the class of problems to which this model will be fit.”
    The point of frequentism-as-model is to *imagine* a prior and sampling model and to see how this plays out with what reality throws at us, but I am very consciously *not* claiming existence of any truth that corresponds to it. That would be very, very hard to make precise and check! I’d rather keep up the idea that these models are artificial, not true, and that we are only temporarily thinking about them as if they were true. This can be justified (a) from the success of what we may get out of them and (b) by admitting that even if the reality-connection of this is not as strong as we would like, we hardly have any other way to do any better in that respect.
