Bayesians are frequentists

Bayesians are frequentists. What I mean is, the Bayesian prior distribution corresponds to the frequentist sample space: it’s the set of problems for which a particular statistical model or procedure will be applied.

I was thinking about this in the context of this question from Vlad Malik:

I noticed this comment on Twitter in reference to you. Here’s the comment and context:

“It’s only via significance tests that model assumptions are checked. Hence wise Bayesian go back to them e.g., Box, Gelman.” https://twitter.com/learnfromerror/status/916835081775435776 and https://t.co/eUZpH48LDZ

While I’m not qualified to comment on this, it doesn’t sound to me like something you’d say. With all the “let’s report Bayesian and Frequentist stats together” talk flying around, I’m curious where statistical significance does fit in for you.

Necessary evil or, following most of the comments on your Abandon Stat Sig post, something not so necessary? My layman impression was that you’d fall into the “do away with it” camp.

Is stat sig necessary to “evaluate a model”? Perhaps I misunderstand the terminology, but my thinking is that experience/reality is the only thing that evaluates a model, i.e. effect size, reproducibility, usefulness… I see uses for stat sig as one “clue” or indicator, but I don’t see how stat sig helps check any assumptions, given it’s based on fairly big assumptions.

Actually that quote is not a bad characterization of the views of myself and my collaborators. As we discuss in chapters 6 and 7 of BDA3, there are various ways to evaluate a model, and one of these is model checking, comparing the fitted model to data. This is the same as significance testing or hypothesis testing, but with two differences:
(1) I prefer graphical checks rather than numerical summaries and p-values;
(2) I do model checking to test the model that I am fitting, usually not to test a straw-man null hypothesis. I already know my model is false, so I don’t pat myself on the back for finding problems with the fit (thus “rejecting” the model); rather, when I find problems with fit, this motivates improvement to the model.
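As a rough illustration of checking the model being fitted rather than a straw-man null, here is a minimal sketch of a posterior predictive check in Python. Everything specific in it is an assumption made for illustration: the binary data are hypothetical, the model is independent Bernoulli trials with a Beta(1, 1) prior, and the test quantity is the number of switches in the sequence. A graphical comparison of replicated and observed data would be in the spirit of point (1); the numerical tail probability is used here only because it is easy to print.

```python
import random

# Hypothetical binary data (not from the post): 20 trials, 7 successes.
y = [1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]


def switches(seq):
    """Test quantity: number of 0->1 or 1->0 changes along the sequence."""
    return sum(a != b for a, b in zip(seq, seq[1:]))


def tail_probability(data, n_sims=4000, seed=1):
    """Share of replicated datasets with no more switches than observed."""
    rng = random.Random(seed)
    s = sum(data)
    f = len(data) - s
    t_obs = switches(data)
    hits = 0
    for _ in range(n_sims):
        # Posterior draw under the assumed Beta(1, 1) prior.
        theta = rng.betavariate(1 + s, 1 + f)
        # Replicated data from the fitted independence model.
        y_rep = [1 if rng.random() < theta else 0 for _ in data]
        if switches(y_rep) <= t_obs:
            hits += 1
    return t_obs, hits / n_sims


t_obs, p_low = tail_probability(y)
print(t_obs, p_low)
```

A small tail probability here would not be a reason to celebrate "rejecting" the model; it would point to a specific misfit (too few switches for independent trials) and so suggest a direction for improving the model, for example by allowing dependence between successive trials.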

By the way, I followed the above links and they were full of ridiculous statements such as “scientists will never accept invented prior distrib’s”—which is kind of a shocker to me as I’ve been publishing scientific papers with “invented prior distrib’s” for nearly 30 years now! But I guess people will say all sorts of foolish things on twitter.

1. Ram says:

I’m trying to understand your first paragraph. Here is my best guess: frequentists prize decision rules which uniformly minimize the sample space-averaged loss over the parameter space, while Bayesians prize decision rules which uniformly minimize the parameter space-averaged loss over the sample space.

• Andrew says:

Ram:

No. As I wrote above, Bayesians are frequentists. Bayesian and frequentist inference are both about averaging over possible problems to which a method might be applied. Just as there are many different Bayesian approaches (corresponding to different sorts of models, that is, different sorts of assumptions about the set of problems over which to average), there are many different frequentist approaches (again, corresponding to different sorts of assumptions about the set of problems over which to average).

• Ram says:

Thanks for your response.

Presumably you agree, though, that a Bayesian and a frequentist can employ the same assumptions about the distribution of the data up to an unknown parameter, and the same loss function, for a given statistical inference problem. The distinguishing feature of their approaches, in such a case, would come down to what they average over (the parameter space versus the sample space), and what they uniformly optimize over (the sample space versus the parameter space). Bayesians would further be distinguished from one another by the prior they employ.
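A minimal sketch of the two kinds of averaging described here, under assumptions chosen purely for illustration: a binomial model with a hypothetical n = 10 trials, squared-error loss, and a uniform prior. The function `risk` averages the loss over the sample space at a fixed parameter value, and `bayes_risk` averages that risk again over the parameter space.

```python
from math import comb

N = 10  # hypothetical number of binomial trials


def risk(estimator, theta, n=N):
    """Squared-error loss averaged over the sample space at a fixed theta."""
    return sum(
        comb(n, x) * theta**x * (1 - theta) ** (n - x)
        * (estimator(x, n) - theta) ** 2
        for x in range(n + 1)
    )


def bayes_risk(estimator, n=N, grid=2000):
    """Risk averaged over theta under a uniform prior (midpoint rule)."""
    thetas = [(i + 0.5) / grid for i in range(grid)]
    return sum(risk(estimator, t, n) for t in thetas) / grid


def mle(x, n):
    return x / n


def posterior_mean(x, n):
    # Posterior mean of theta under the assumed uniform prior.
    return (x + 1) / (n + 2)


for theta in (0.01, 0.5):
    print(theta, risk(mle, theta), risk(posterior_mean, theta))
print(bayes_risk(mle), bayes_risk(posterior_mean))
```

In this toy setup neither estimator has uniformly smaller risk over the parameter space (the maximum likelihood estimate does better near 0, the posterior mean does better near 0.5), while the posterior mean has the smaller prior-averaged risk.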

2. I hope Deborah Mayo, Sander Greenland, Daniel Lakens, EJ Wagenmakers, and anyone else will participate on your blog. Fascinating subject.

3. Anoneuoid says:

I do model checking to test the model that I am fitting, usually not to test a straw-man null hypothesis. I already know my model is false, so I don’t pat myself on the back for finding problems with the fit (thus “rejecting” the model); rather, when I find problems with fit, this motivates improvement to the model.

Exactly. Bayesian vs frequentist is a red herring; allowing strawman logic to pass as scientific is the main issue.

4. Ben Goodrich says:

“the Bayesian prior distribution corresponds to the frequentist sample space: it’s the set of problems for which a particular statistical model or procedure will be applied”

I fail to see how that implies Bayesians are frequentists. Frequentists define the sample space as the set of all possible outcomes of an experiment (presuming the experiment could be independently repeated many times to yield a sampling distribution of an estimator). The prior distribution isn’t a set; it is a function over a set (real numbers, positive real numbers, etc.) that is the parameter space.

Perhaps you could say that the prior distribution defines a subset of the parameter space where there is more than an epsilon’s worth of probability that the parameter could lie. Let’s say we are talking specifically about a standard normal prior. Essentially, values less than -37 and values greater than 8.3 have zero probability as far as a finite-precision computer is concerned. So, you have a subset, (-37, 8.3), of parameter values to which a particular statistical model or procedure can be applied. What happens if the true parameter value is outside this subset and you applied a particular statistical model or procedure anyway? The Bayesian machinery works the same way and has the same justification as it would if the parameter fell in the (-37, 8.3) subset, so I don’t see why the point you are making implies Bayesians are frequentists.

Unlike the sample space, a prior distribution also provides a topology for the subset of the parameter space where there is more than an epsilon’s worth of probability that the parameter could lie. So, you can say things before seeing the data like, “This parameter is seven times more likely to be 0 than to be 2” (and “as far as my finite-precision computer is concerned, an infinite number of times more likely to be 0 than to be -37”). And that is what set Fisher off. If you said that your prior was a uniform distribution over (-37, 8.3) or almost equivalently, uniform over (-oo, oo), then the outrage would be considerably muted. But that is also why you hate the uniform prior: it preposterously implies that -37 is as likely a value for the parameter as 0 is. So, the topology provided by the prior over the (-37, 8.3) interval is important beyond simply defining the (-37, 8.3) subset as holding essentially all of the prior probability.
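The two numerical claims above can be checked roughly in a few lines of Python. This is only a sketch: the standard normal CDF is computed through `math.erfc`, which is one choice among several, and the exact cutoffs at which a computer reports 0 or 1 (such as the -37 and 8.3 quoted above) depend on the routine and on whether densities or tail probabilities are being evaluated. The code therefore only checks points that are safely on either side.

```python
from math import erfc, exp, sqrt


def std_normal_cdf(x):
    """Phi(x), computed via the complementary error function."""
    return 0.5 * erfc(-x / sqrt(2.0))


def density_ratio(a, b):
    """Ratio of standard normal densities at a and b; constants cancel."""
    return exp((b * b - a * a) / 2.0)


# "Seven times more likely to be 0 than to be 2": the ratio is exp(2).
print(density_ratio(0.0, 2.0))

# Far enough into the lower tail, the CDF underflows to exactly 0.0.
print(std_normal_cdf(-30.0), std_normal_cdf(-40.0))

# The upper tail saturates much sooner, because doubles near 1 are spaced
# about 1e-16 apart, while doubles near 0 can be far smaller than that.
print(std_normal_cdf(5.0), std_normal_cdf(10.0))
```

The asymmetry between the two cutoffs comes from floating-point representation, not from the prior itself.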

But all that is ancillary to what I think most people use to distinguish Bayesians from frequentists, namely their conception of probability: degree of belief that a proposition is true vs. relative frequency of an outcome in repeated trials. How could the existence of a prior distribution make that distinction moot?

• Andrew says:

Ben:

I see a direct mapping between the frequentist reference set and the Bayesian prior distribution. Another way to put it is that a frequentist probability, defined as relative frequency in repeated trials, requires some definition of what is a repeated trial. We discuss this, from various angles, in chapter 1 of BDA.

• Ben Goodrich says:

I know you know I know what chapter 1 of BDA says, but that chapter isn’t arguing anything especially rigorously. Sure, in order to be frequentist you have to both define the sample space and say how the process of repeated trials would work. The inability to do that is why it is tortuous to give a frequentist interpretation to Bayesian statements like “Last week, I believed Brazil had a 0.25 probability of winning the 2018 World Cup”. It does not mean Bayesians are frequentists.

I get that the prior distribution implies a distinction between regions of the parameter space that have a non-negligible probability and regions that have a negligible probability. But the prior goes farther than that by saying that, within the non-negligible region, some values for the parameter are more likely than others, which is not a characteristic of defining the sample space or laying out the design of repeated trials. The fact that the prior says what values of the parameter are more likely than others is what frequentists fundamentally object to. And that is why Fisher oddly insisted that Bayes was frequentist because Bayes used a uniform prior over the parameter space in his one and only example.

To say “Bayesians are frequentists” is essentially to say that “Fisher drew a false distinction between his approach to statistics and one that Laplace, Jeffreys, Keynes, etc. described because Fisher failed to recognize / understand X”. What is X?

• Andrew says:

Ben:

I agree that different groups of statisticians, who are labeled “Bayesians” and “frequentists,” can approach problems in different ways. I just don’t think the differences are so fundamental, because I think that any Bayesian interpretation of probability has to have some frequentist underpinning to be useful. And, conversely, any frequentist sample space corresponds to some set of problems to be averaged over.

For example, consider the statement, “Brazil had a 0.25 probability of winning the 2018 World Cup.” To the extent that this statement is the result of a Bayesian analysis (and not simply a wild guess), it would be embedded in a web of data and assumptions regarding other teams, other World Cups, other soccer games, etc. And, to me, this network of reference sets is closely related to the choice in frequentist statistics of what outcomes to average over.

• I’m fine if you say that there are a whole bunch of possible states of the world which are indistinguishable from each other given the level of detail you are observing, and it’s these possible states that define a sample space. I think this is Laplace’s interpretation when he talks about the number of cases favorable vs the total number.

But frequentism is about repeating trials, and Bayesianism is about this kind of relative weight of possibilities within a single event: the 2018 World Cup.

They aren’t miles apart, I agree, but the distinctions are important.

• Andrew says:

P.S. I should also say that self-declared frequentists do have a way of handling one-off events: they call them “predictions.” In the classical frequentist world, you’re not allowed to make probabilistic inferences about parameters, but you are allowed to make probabilistic inferences about predictions. Indeed, in that classical framework, the difference between parameters and predictions or missing data is precisely that parameters are unmodeled, whereas predictions and missing data are modeled.

• a reader says:

“Indeed, in that classical framework, the difference between parameters and predictions or missing data is precisely that parameters are unmodeled, whereas predictions and missing data are modeled.”

That’s a very interesting point. In particular, there’s no reason missing data cannot be modeled as a parameter in the classic MLE framework, and yet the classic Frequentist algorithm (EM algorithm) explicitly models the missing data rather than optimizing with respect to the data values. After pondering this for a while in grad school, my final conclusion was that while in some cases it might be reasonable to think that we have enough information in our data to pin down “parameters” to relatively small intervals with high probability (which is required for many asymptotic results), this is generally not the case for individual data points.

• Ben Goodrich says:

“I think that any Bayesian interpretation of probability has to have some frequentist underpinning to be useful”

That is a very controversial position and one that is not explicated in the original blog post. But I can imagine a few ways it might be explicated.

Let’s take a beta prior for the success probability in binomial trials. I guess you would say this prior has a frequentist underpinning because alpha and beta can be interpreted as the number of past successes and failures observed. Then take another prior whose hyperparameters do not have that interpretation, such as a gamma distribution truncated to the (0,1) interval. Would you say there is a utility function such that the utility from the posterior distribution induced by the beta prior is not less than the utility from the posterior distribution induced by the truncated gamma prior for all values of the success probability on the (0,1) interval and the utility in the former case is strictly greater than the utility in the latter case for some value of the success probability? I have not been able to think of such a utility function.
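To make the two priors in this comparison concrete, here is a small Python sketch. The hyperparameter values and the data (7 successes, 3 failures) are hypothetical, chosen only for illustration. The beta prior updates in closed form, with alpha and beta acting like prior counts added to the observed counts; the truncated gamma prior has no such closed form, so its posterior mean is approximated on a grid.

```python
from math import exp, log


def beta_posterior_mean(alpha, beta, successes, failures):
    """Conjugate update: Beta(alpha, beta) -> Beta(alpha + s, beta + f)."""
    return (alpha + successes) / (alpha + beta + successes + failures)


def grid_posterior_mean(log_prior, successes, failures, grid=20000):
    """Posterior mean on a midpoint grid over (0, 1) for any log prior."""
    thetas = [(i + 0.5) / grid for i in range(grid)]
    weights = [
        exp(log_prior(t) + successes * log(t) + failures * log(1 - t))
        for t in thetas
    ]
    total = sum(weights)
    return sum(t * w for t, w in zip(thetas, weights)) / total


def log_beta_prior(alpha, beta):
    # Unnormalized log density of Beta(alpha, beta).
    return lambda t: (alpha - 1) * log(t) + (beta - 1) * log(1 - t)


def log_truncated_gamma_prior(shape, rate):
    # Gamma density restricted to (0, 1); the truncation constant cancels
    # when the posterior is normalized on the grid.
    return lambda t: (shape - 1) * log(t) - rate * t


s, f = 7, 3  # hypothetical data
print(beta_posterior_mean(2, 2, s, f))
print(grid_posterior_mean(log_beta_prior(2, 2), s, f))
print(grid_posterior_mean(log_truncated_gamma_prior(2.0, 4.0), s, f))
```

Both priors go through the same machinery and produce a posterior; only the beta hyperparameters have the pseudo-count reading, which is the distinction the question about utility functions turns on.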

Keith O’Rourke and Corey Yanofsky said below that they think you meant “priors aim to represent frequencies in the ensemble of problems one expects to face”. I am not sure whether that is what you meant or if so, how that helps to establish your point. Is it, for example, something like: Income is thought to affect many outcome variables in the social sciences, so I will contemplate the distribution of the coefficient on income across outcomes and then use that distribution as a prior when modeling a particular outcome, such as voter turnout? That is not a bad way to choose a prior, but frequentists don’t talk about the distribution of a statistic over outcomes “sampled” from a population of possible outcomes that could be measured.

Another way to argue that any Bayesian interpretation of probability has to have some frequentist underpinning would be to say something like

(1) Most people who deny that the Bayesian interpretation of probability has to have some frequentist underpinning accept the conclusion of the Cox theorem

(2) The conclusion of the Cox theorem, as originally stated, does not follow from its axioms

(3) Several people have clarified, added, or modified the axioms of the Cox theorem so that the conclusion of the Cox theorem actually follows

(4) The rationale for the changes to the axioms in (3) is either frequentist or lacking

I don’t think (4) is the case, although the rationale for the changes to the axioms in (3) is not exactly self-evident.

Finally, if any Bayesian interpretation of probability has to have some frequentist underpinning to be useful, what is the justification for applying Bayes rule to obtain the conditional distribution of an unobservable parameter given observed data? A key tenet of the frequentist approach is that parameters are not random variables and do not have probability distributions whether conditional, marginal, or joint. How do you underpin a Bayesian interpretation of probability with the frequentist one while disregarding that tenet?

• Christian says:

Intuitively, we may consider the situation of having made some observations, x, and trying to make statements about the probability of future observations, y, using p(y|x).

This is neither Bayesian nor frequentist.

We obtain the Bayesian concept of probability, if we assume that our future experiment is very, very large, such that the future observations, y, define the system, i.e., we call them parameters.

We obtain the frequentist concept of probability, if we imagine that the observations that we have made, x, are from a very, very large experiment.

In both cases, we have probabilities of observing something in the future given what we have observed so far. In one case, the past observations are from a very large experiment, in the other case the future observations.

• Ben Goodrich says:

I can see how you mean that people (often) want to make statements using p(y | x). But if you rewrite it as

p(y | x) = \int p(y | \theta) * p(\theta | x) d\theta = \int p(y | \theta) * p(\theta) * p(x | \theta) / p(x) d\theta

then you are integrating over a posterior PDF of theta | x in order to get that posterior predictive distribution, which frequentists object to. But it does not require imagining that future observations y come from a very large experiment. You can get the posterior predictive distribution for a single future y.
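The integral above can be evaluated numerically for a single future observation. The sketch below is an illustration under stated assumptions, not anything from the thread: the data are hypothetical Bernoulli outcomes, the prior is uniform on (0, 1), and the integral over theta is replaced by a sum on a midpoint grid. It follows the second form of the equation, computing the numerator and p(x) separately.

```python
def likelihood(obs, theta):
    """Bernoulli likelihood of one observation."""
    return theta if obs == 1 else 1.0 - theta


def predictive(y_new, xs, prior, grid=20000):
    """p(y_new | xs) as a ratio of two grid sums over theta."""
    joint = 0.0     # approximates the integral of p(y|theta) p(theta) p(x|theta)
    evidence = 0.0  # approximates p(x), the integral of p(theta) p(x|theta)
    for i in range(grid):
        theta = (i + 0.5) / grid
        weight = prior(theta)
        for obs in xs:
            weight *= likelihood(obs, theta)
        evidence += weight
        joint += likelihood(y_new, theta) * weight
    return joint / evidence


def uniform_prior(theta):
    return 1.0


xs = [1, 1, 0, 1]  # hypothetical past observations
p1 = predictive(1, xs, uniform_prior)
p0 = predictive(0, xs, uniform_prior)
print(p1, p0, p1 + p0)
```

With a uniform prior and s successes in n trials, the exact answer for the next success is (s + 1) / (n + 2), so the grid result for these data should be close to 4/6. Nothing in the calculation requires the future y to come from a large experiment.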

• Christian Bartels says:

(I) From a pragmatic point of view, often p(x|y) is the only thing we can look at in reality. Thus making statements about p(x|y) may be required independent of the starting standpoint.

(II) For a frequentist to arrive at statements about p(x|y): One option is to use some kind of procedure, e.g., make predictions about the future based on the maximum likelihood principle. This is commonly done but seems difficult to position properly within any framework. Alternatively, use a non-parametric model p(x|{}) with a likelihood function that does not have any parameters (empty set {} as parameters). With a non-parametric model, p(y|{}) and p(x,y|{}) are defined, giving p(x|y)=p(x,y|{})/p(y|{}). Non-parametric models include splines, Kaplan-Meier estimates or simply an integrated likelihood model that declares all parameters as nuisance. Here the latter is kind of the frequentist version of a Bayesian model.

Not sure where this leaves us exactly, but it provides kind of a bridge between the two approaches.

5. I read this as the “statistics is a science based on defaults” view where weakly informative priors aim to represent frequencies in the ensemble of problems one expects to face. This is not a universal view among Bayesians — I for one want my priors to reflect problem-specific information at my disposal, and it might be the case that the problem is a one-off for which no ensemble can be picked out in a sensible fashion.

• Keith O'Rourke says:

> the problem is a one-off for which no ensemble can be picked out in a sensible fashion.
In such cases, Peirce argued that inference just does not apply: the problem has to be embedded in an inexhaustible multitude for inductive inference to make sense. One way he did that was to think of an indefinite community of inquirers facing the problem rather than just you.

> priors aim to represent frequencies in the ensemble of problems one expects to face.
I took that to be what Andrew meant.

• Agree. Excellent, Keith. ‘One way he did that was to think of an indefinite community of inquirers facing the problem rather than just you.’

• Carlos Ungil says:

Doesn’t Peirce consider Bayesian inference always invalid, because it’s based on an epistemic interpretation of probability that he rejects?

• Keith O'Rourke says:

I don’t think there is much doubt that he rejected all the Bayesian approaches he encountered, but those weren’t post-Box; rather, they were flat-prior, literal-interpretation, omnipotent Bayes.

What he would make of post-Box Bayes would have to be assessed through his espoused methodology. And much more important than what he would make of it is what we should make of it.

6. Jouni Kerman says:

Frequentists can rejoice that there is indeed a unified way to do statistical inference, and that is the Bayesian way…

7. Ayse Tezcan says:

What are “invented prior distributions”?

• Andrew says:

Ayse:

I assume that an “invented prior distribution” is a prior distribution that comes from a model that does not directly apply to a population of parameters. This would be a special case of an “invented model.” For example, fitting logistic regression to binary data, just cos it’s a curve that’s bounded between 0 and 1, is an invented model. Or, a normal(0,2.5) prior distribution for scaled regression coefficients, implemented in order to do some regularization, is an invented prior distribution. In contrast, a non-invented prior distribution could arise in some problem where the priors directly are fitting some population distribution.

Researchers have successfully used invented models for a long time to solve real problems, so I think the online critic of “invented models” was naive and was using rhetoric in place of thinking.
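To show what an invented prior used for regularization can do in practice, here is a minimal Python sketch. The assumptions are mine, for illustration only: a slope-only logistic regression, four hypothetical data points that are perfectly separated, and a normal prior on the slope with mean 0 and standard deviation 2.5 (reading the second argument of normal(0,2.5) as a standard deviation). With separated data the likelihood keeps increasing as the slope grows, so there is no finite maximum likelihood estimate, whereas the posterior mode is finite.

```python
from math import exp

# Hypothetical, perfectly separated data: y = 1 exactly when x > 0.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def loglik_gradient(b):
    """Derivative of the logistic log-likelihood for a slope-only model."""
    return sum((y - sigmoid(b * x)) * x for x, y in zip(xs, ys))


def logpost_gradient(b, prior_sd=2.5):
    """Adds the derivative of the log of a normal(0, prior_sd) density."""
    return loglik_gradient(b) - b / prior_sd**2


def posterior_mode(lo=0.0, hi=100.0, iters=200):
    """Bisection on the gradient, which is decreasing in b because the
    log posterior is concave."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if logpost_gradient(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


b_map = posterior_mode()
print(b_map, loglik_gradient(b_map), loglik_gradient(10.0))
```

The likelihood gradient stays positive at every slope tried here, while the prior term pulls the estimate back to a finite value. The prior does not describe a real population of slopes, yet it does useful work.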

8. Christian says:

Two more things that make frequentist and Bayesian similar:

1) Aiming for decision sets of minimal average size, the same decision rules can be used for frequentist or Bayesian inference. The only difference is the dimension (parameters or observations) along which we try to control for errors.

2) The concept of loss function (frequentist) and prior distribution (Bayesian) are very much related. Defining one implies the other.
(In fact, for frequentist statistics, it may be more convenient to define the prior rather than the loss function.)

9. Mikhail Shubin says:

I think that frequentism is ultimately a Modernist project (derivation of pure knowledge, free from stains of Subjectivity) and Bayesianism is a post-modernist project (there is no absolute truth, only degrees of belief; subjectivity cannot be avoided and must be embraced). So the difference is in core values rather than practical methods.