## Bayes, statistics, and reproducibility: “Many serious problems with statistics in practice arise from Bayesian inference that is not Bayesian enough, or frequentist evaluation that is not frequentist enough, in both cases using replication distributions that do not make scientific sense or do not reflect the actual procedures being performed on the data.”

This is an abstract I wrote for a talk I didn’t end up giving. (The conference conflicted with something else I had to do that week.) But I thought it might interest some of you, so here it is:

Bayes, statistics, and reproducibility

The two central ideas in the foundations of statistics—Bayesian inference and frequentist evaluation—both are defined in terms of replications. For a Bayesian, the replication comes in the prior distribution, which represents possible parameter values under the set of problems to which a given model might be applied; for a frequentist, the replication comes in the reference set or sampling distribution of possible data that could be seen if the data collection process were repeated. Many serious problems with statistics in practice arise from Bayesian inference that is not Bayesian enough, or frequentist evaluation that is not frequentist enough, in both cases using replication distributions that do not make scientific sense or do not reflect the actual procedures being performed on the data. We consider the implications for the replication crisis in science and discuss how scientists can do better, both in data collection and in learning from the data they have.

P.S. I wrote the above abstract in January for a conference that ended up being scheduled for October. It is now June, and this post is scheduled for December. There’s no real rush, I guess; this topic is perennially of interest.

P.P.S. In writing Bayesian “inference” and frequentist “evaluation,” I’m following Rubin’s dictum that Bayes is one way among many to do inference and make predictions from data, and frequentism refers to any method of evaluating statistical procedures using their modeled long-run frequency properties. Thus, Bayes and freq are not competing, despite what you often hear. Rather, Bayes can be a useful way of coming up with statistical procedures, which you can then evaluate under various assumptions.

Both Bayes and freq are based on models. The model in Bayes is obvious: It’s the data model and the prior or population model for the parameters. The model in freq is what you use to get those long-run frequency properties. Frequentist statistics is not based on empirical frequencies: that’s called external validation. All the frequentist stuff—bias, variance, coverage, mean squared error, etc.—that all requires some model or reference set.

And that last paragraph is what I’m talkin bout, how Bayes and freq are two ways of looking at the same problem. After all, Bayesian inference has ideal frequency properties—if you do these evaluations, averaging over the prior and data distributions you used in your model fitting. The frequency properties of Bayesian (or other) inference when the model is wrong—or, mathematically speaking, when you want to average over a joint distribution that’s not the same as the one in your inferential model—that’s another question entirely. That’s one thing that makes frequency evaluation interesting and challenging. If we knew all our models were correct, statistics would simply be a branch of probability theory, hence a branch of mathematics, and nothing more.

OK, that was kinda long for a P.P.S. It felt good to write it all down, though.

1. Keith O’Rourke says:

Tried to write up some similar ideas for people who have only had a couple of stats courses.

It’s a bit long for a comment, so I append it to this comment.

Frequentist was metaphor-ed as “No parameter value left behind for all possible battles they could have engaged in,” and simulation-assessed Bayes as “Only very unlikely parameters are left behind with others kept frequency performance evaluated under the battles they engaged in more than rarely.”

• Keith O’Rourke says:

A necessary frequentist property, for instance in terms of confidence intervals, is that they have at least the claimed coverage rate (say 95%) for each and every parameter value. This coverage rate is over all “possible to generate data sets” for each set parameter value. No parameter value left behind for all possible battles they could have engaged in.

Now, most Bayesians usually claim such frequency properties are not relevant or don’t even make sense, while the “sage” Bayesians (Don Rubin Harvard and a growing? number of others) argue that they are relevant but seem reluctant to publish such clarifications.

However, we do need to keep in mind that some Bayesian models do achieve this frequency property or come very close. This is most easily established using simulation. For each parameter value, simulate fake data and perform the Bayesian analysis as if the data were real and record whether the credible interval included the parameter value. Do this, say, 1000 times to estimate the coverage for that parameter value. Now do this for all possible parameter values, or realistically for a sufficient range of them. If it’s convincing that No parameter values were left behind for sufficient simulation of battles they could have engaged in, it is frequentist (some say the Bayesian label can now be thought of as just ornamental). That is, no one should object to these analyses (position of Jim Berger my former boss at Duke), at least if posterior probabilities are largely ignored and the Bayesian interval is mostly interpreted as just a confidence interval (position of Brad Efron Stanford).
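The simulation recipe in that paragraph can be sketched in a few lines. This is only an illustration under assumptions that are mine, not Keith's: a normal mean with known sigma, a conjugate normal(0, 1) prior, n = 10 observations, and a central 95% interval. The function names `credible_interval` and `coverage_at` are hypothetical.

```python
import random

def credible_interval(ybar, n, sigma=1.0, prior_mean=0.0, prior_sd=1.0, z=1.96):
    """Central 95% posterior interval for a normal mean (known sigma, normal prior)."""
    post_prec = 1.0 / prior_sd**2 + n / sigma**2
    post_mean = (prior_mean / prior_sd**2 + n * ybar / sigma**2) / post_prec
    post_sd = post_prec ** -0.5
    return post_mean - z * post_sd, post_mean + z * post_sd

def coverage_at(theta, n=10, sigma=1.0, n_sims=1000, seed=0):
    """Fraction of simulated data sets whose interval contains the fixed theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # The mean of n draws from N(theta, sigma^2) is N(theta, sigma^2 / n).
        ybar = rng.gauss(theta, sigma / n**0.5)
        low, high = credible_interval(ybar, n, sigma)
        hits += low <= theta <= high
    return hits / n_sims

# Assumed grid of parameter values; a real check would use a finer, wider one.
for theta in (0.0, 1.0, 2.0, 4.0):
    print(f"theta = {theta}: estimated coverage = {coverage_at(theta):.3f}")
```

With these assumed settings the interval should cover at roughly the nominal rate for values near the prior mean and noticeably less often for a value like theta = 4, which the prior treats as very unlikely. That is the sense in which some parameter values can get left behind.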

Unfortunately, many practicing frequentist statisticians are not even fully aware of the necessary frequentist property and most are unaware that Bayesian approaches can sometimes meet it. At the same time, while many practicing Bayesian statisticians are overly confident that they should never care about it, most of them also aren’t aware of it.

Now, it is true frequency properties are not relevant if and only if the prior and data generating model specification are beyond reasonable criticism – that is they can be taken as essentially correct (this would be the case say with diagnostic tests that have well known properties in populations with well known prevalences). That occasionally will be the case, and in such cases Brad Efron has published that, in those (rare) cases, all statisticians should just accept the Bayesian analysis and interpret posterior probabilities as credible and meaningful. But for most of research (almost all?), the prior and data generating model specifications are not much more than vague guesses or even worse just default choices. Here some sort of evaluation of performance in repeated use is arguably the only sane? position as there are no relevant probabilities but just thoughtless default ones.

However, in addition to the usual frequentist properties there are also some Bayesian variants that may make sense given the uncertainty of prior and data generating model specifications. Rather than consider coverage rate over “all possible to generate data sets” for each set parameter value (No parameter value left behind for all possible battles they could have engaged in), a Bayesian variant considers frequentist properties over the “set of all possible parameter values” that a chosen Bayesian method of analysis likely would encounter in the same analysis setting in the future. The evaluation of this requires that parameter values are randomly sampled from the prior, fake data then generated for each, then the Bayesian credible interval is calculated and the percent of time it includes that parameter value is recorded. In this evaluation, some parameter values are left behind a given percent of the time (the less likely ones more often left behind according to their prior probabilities) for the battles they most likely would have engaged in (not all of them but just more likely ones).

Now, in this Bayesian variant, if one draws parameters from the same prior as used in the analysis, things will appear too good (Sander Greenland UCLA coined this an assessment of an omnipotent model). To reflect the real uncertainty in the prior, a different prior should be used to randomly draw the possible parameter values from, while the assumed prior is still used to calculate the intervals.
The resulting coverage is an average over all possible parameter values (weighted by the prior) and this makes sense to many of us. Especially if a wide enough prior is used so that the actual coverage rate for values in the tails (considering each one separately) is not too bad. That is, the only parameter values that have poor frequency coverage with the Bayesian method are very unlikely to ever be true. Only very unlikely parameters are left behind with others kept frequency performance evaluated under the battles they engaged in more than rarely.
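Here is a rough sketch of that prior-averaged evaluation, again assuming a normal mean with known sigma and a normal(0, 1) analysis prior (my choices for illustration only). The hypothetical `draw_sd` argument is the standard deviation of the prior used to draw the "true" parameter values, which may differ from the `analysis_sd` used to compute the intervals.

```python
import random

def credible_interval(ybar, n, sigma, prior_sd, z=1.96):
    """Central 95% posterior interval, normal mean, normal(0, prior_sd^2) prior."""
    post_prec = 1.0 / prior_sd**2 + n / sigma**2
    post_mean = (n * ybar / sigma**2) / post_prec
    post_sd = post_prec ** -0.5
    return post_mean - z * post_sd, post_mean + z * post_sd

def average_coverage(draw_sd, analysis_sd=1.0, n=10, sigma=1.0,
                     n_sims=20000, seed=1):
    """Coverage averaged over parameters drawn from normal(0, draw_sd^2)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        theta = rng.gauss(0.0, draw_sd)           # parameter from the drawing prior
        ybar = rng.gauss(theta, sigma / n**0.5)   # fake data summary for that parameter
        low, high = credible_interval(ybar, n, sigma, analysis_sd)
        hits += low <= theta <= high
    return hits / n_sims

print("same prior for drawing and analysis:", average_coverage(draw_sd=1.0))
print("wider prior for drawing:            ", average_coverage(draw_sd=3.0))
```

When the drawing prior matches the analysis prior, the averaged coverage should sit at about 95%, which is the "too good" case. Drawing from a wider prior than the one assumed in the analysis should pull the averaged coverage below nominal in this setup.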

An open question for both approaches is properly defining all possible battles they likely would have engaged in. That is defining the set of relevant battles, restricting the “possible to generate data sets” sensibly (e.g. regression which fixes the x values and Fisher Exact test which fixes the totals). For many problems that has yet to be sorted out adequately. (For instance the need to do this for the bootstrap in real applications makes it too complicated for most statisticians – or at least me).

With “Only very unlikely parameters are left behind with others kept frequency performance evaluated under the battles they engaged in more than rarely,” the restriction is built into the battles they engaged in more than rarely – but it might be too much or not enough of a restriction. Here the open question is how to make it somewhat less restricted but still relevant (a Goldilocks hope).

• I just deny both that we should ever use a thoughtless default prior, and that frequency properties have anything to do with Bayesian probabilities. Of course some times we study frequencies of things, in which case we may have a portion of our model where parameters represent frequencies. In model building I explicitly call these frequencies and don’t use the word probability, it reduces confusion. Then to evaluate the model we typically look at high posterior probability values of the frequency parameters and determine if the frequencies they imply are compatible with what actually happens. If not, adjust the model in some way, often the prior but the likelihood usually could use some tuning as well.

• Put another way, the *type* of an object is important, not just the numerical value. In a Bayesian analysis, when we refer to “probability” the *type* of this number is a plausibility assignment which essentially states “this is the region of possible values that the correct parameter value should be in”.

Another *type* of quantity is a frequency with which something occurs in the world, like say what fraction of people you might survey in California answer “yes” to a certain question, or what fraction of images collected by a satellite contain an active forest fire, or are of a desert region…

It doesn’t hurt anything and certainly helps if we call these “frequencies”.

Now, some numerical values for these frequencies are more plausibly correct than other numerical values of these frequencies, and so we may have a plausibility assessment (called a Bayesian probability) that a certain frequency of occurrence is the correct numerical value relative to the specific experiment in question. Now we have probability of a frequency.

If we keep the role that our quantities plays in mind through this type associated with the quantity, everything becomes dramatically clearer.

• Isn’t the Bayesian approach guaranteed to be calibrated in Bayesian terms if we can actually computationally fit the posterior? That is, if I generate parameters from the prior and data from the parameters, the posterior coverage is guaranteed in expectation.

I wrote a blog post about this in 2017, Bayesian Posteriors are Calibrated by Definition, based on my newfound understanding of the Cook-Gelman-Rubin diagnostics (which Andrew et al. renamed “simulation-based calibration”, I believe due to Andrew’s dislike of techniques named after people).

Edit: I forgot the link to Talts et al. arXiv paper, Validating Bayesian inference algorithms with simulation-based calibration.
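For readers who have not seen the rank-based check, here is a minimal sketch of the idea, under assumptions of my own: a conjugate normal model so that exact posterior draws are available in closed form. In practice the posterior draws would come from whatever fitting algorithm is being validated; the closed-form sampler here is only a stand-in, and `sbc_ranks` is a hypothetical name.

```python
import random
from collections import Counter

def sbc_ranks(n_sims=2000, n_draws=19, n=10, sigma=1.0, prior_sd=1.0, seed=2):
    """Rank of each prior draw among posterior draws given its own fake data."""
    rng = random.Random(seed)
    post_prec = 1.0 / prior_sd**2 + n / sigma**2
    post_sd = post_prec ** -0.5
    ranks = []
    for _ in range(n_sims):
        theta = rng.gauss(0.0, prior_sd)            # parameter from the prior
        ybar = rng.gauss(theta, sigma / n**0.5)     # data summary given theta
        post_mean = (n * ybar / sigma**2) / post_prec
        draws = [rng.gauss(post_mean, post_sd) for _ in range(n_draws)]
        ranks.append(sum(d < theta for d in draws)) # rank in 0..n_draws
    return ranks

ranks = sbc_ranks()
counts = Counter(ranks)
for r in range(20):
    print(r, counts[r])
```

If the posterior draws really come from the posterior implied by the prior and data model used to simulate, the ranks should look roughly uniform over 0 to 19. Visible departures from uniformity point to a problem in the fitting step, not to anything about the real world.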

• If “Calibrated in Bayesian Terms” means “if you generate parameters according to the posterior then data according to the data model… then the frequency with which the data comes from an RNG whose parameter is x is the posterior distribution probability of that parameter being x” then yes…. but this is irrelevant for science because the real science isn’t about RNGs whereas the Bayesian statement *is* about computer RNGs.

The data model is a thing that stands in for our knowledge, not a thing that describes what actually happens.

It is very occasionally the case that we have *huge quantities of data* so that our knowledge of the shape of the data distribution coincides with the actual shape in the actual world. But this is often far from the norm, even in areas where you have potentially a large sample size, particularly as you start to ask questions about subsets (like, among White males who live in rural portions of Arkansas and make less than 1/2 GDP/capita what is the distribution of weekly consumption of alcohol? Even a high quality random sample of 10M people in the US will not necessarily be enough to well-characterize the shape of this distribution)

The data model is about *our knowledge of what might happen* given a set of parameter values, it’s not about *how often things will actually happen*. The frequency distribution of what actually happens can easily and almost always *does* have a different shape from the Bayesian plausibility distribution we use. In many cases it’s not even a well defined thing (in the sense of having a stable distribution through time).

• Keith O’Rourke says:

First if the posterior was available in closed form, my comment and Andrew’s post would still mostly apply. From Daniel’s comment I would stress we are changing the focus from probabilities to frequencies given we have lost “confidence” in the prior (or realised one should never have had “confidence” in a default prior). The Cook et al paper is primarily about probabilities but uses the same two stage sampling math.

I should have been explicit: in “if one draws parameters from the same prior as used in the analysis, things will appear too good,” the “too good” was that the frequencies averaged over the prior exactly equal the probabilities – 50 per cent credible has 50 per cent confidence. Back in 2017 I gave this reference for a proof of credible = confidence – Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician 1984 https://projecteuclid.org/euclid.aos/1176346785 There is also one in Mike Evans’ Relative Belief book.
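The credible = confidence identity can be written out in a few lines. The notation is my own and this is a sketch, not necessarily the argument given in the cited references: C(y) denotes a posterior credible set with posterior probability 1 − α for every data set y, and the average is over the same joint distribution p(θ) p(y | θ) that was used in the analysis.

```latex
\begin{aligned}
\Pr\{\theta \in C(y)\}
  &= \iint \mathbf{1}\{\theta \in C(y)\}\, p(y \mid \theta)\, p(\theta)\, d\theta\, dy \\
  &= \int p(y) \left[ \int \mathbf{1}\{\theta \in C(y)\}\, p(\theta \mid y)\, d\theta \right] dy \\
  &= \int p(y)\, (1 - \alpha)\, dy \;=\; 1 - \alpha .
\end{aligned}
```

The second line only uses p(y | θ) p(θ) = p(θ | y) p(y), so the result holds exactly when the parameters are drawn from the prior used in the analysis and can fail when they are drawn from any other distribution.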

So my comment and I believe Andrew’s post is mostly about when to switch from probability to frequency, why, when and how.

• Christian Hennig says:

Keith: I like this a lot. Seems like an extension of Tukey’s “More honest foundations for data analysis” to Bayesian methods.
https://www.sciencedirect.com/science/article/pii/S0378375896000328

• Keith O’Rourke says:

Thanks for the reference (I’ll need to be at library to access the full paper).

But it did remind me of Tukey’s Configural Polysampling, which I haven’t looked at in 20 years but seems related.
(I did get to discuss it very briefly with Tukey and I was very interested in that approach in 1998 but it fell out of mind.)

• Keith O’Rourke says:

Christian:

Read the paper, same general topic as the Configural Polysampling book, but that book went more into computational complications (it was the cubature methods I discussed with Tukey).

And you are right, my comment was an extension to switching from probability to frequency, why, when and how.

Thanks again.

2. Harry Crane says:

Looks like the same Title/Abstract you gave last year at Rutgers

3. Yayyyy Andrew,

I get what you mean. Whew!

4. Justin Smith says:

“…Frequentist statistics is not based on empirical frequencies:…”

Hmm, I’m not sure, or maybe I don’t understand fully. Some things that are ‘empirical frequency’-ish to me, and that are fundamental to frequentism as I see it, are:

-FN David’s book (God, Games, Chance, something), discussing probability and statistics origin is from playing/analyzing games of chance
-Strong Law of Large Numbers, it is almost certain that between the mth and nth observations in a group of length n, the relative frequency of Heads will remain near the fixed value p, whatever p may be (ie. doesn’t have to be 1/2), and be within the interval [p-e, p+e], for any small e > 0, provided that m and n are sufficiently large numbers.
-Central Limit Theorem (histogram = empirical frequencies, approaching theoretical normal distribution)
-frequentism reliance on idea of sample space
-performance interpretation of confidence intervals, and general evaluation like you said. This can be looked at as empirical frequency thing too, as # passed / # evaluated.
-nonparametric thingies (histograms, ecdf, etc.)
-likelihood (where data enters) swamping priors (at times for any choice of prior)
-and simply the quantity “n”, which is a frequency?, is used in many statistical equations

“After all, Bayesian inference has ideal frequency properties—if you do these evaluations, averaging over the prior and data distributions you used in your model fitting. “

What is “the” prior? It seems there can be many choices of priors, so I don’t see how something can be claimed to have ideal frequency properties if the prior can be anything. Is the claim that it has ideal frequency properties after a prior is selected/fixed?

Thanks,

Justin Smith
http://www.statisticool.com

• Andrew says:

Justin:

1. Frequentist statistics is based on models, not empirical frequencies. The models make predictions about empirical frequencies, so there’s an empirical connection, but the methods and their frequency properties—for example, statements such as asymptotic coverage of 95% intervals—come from assumed models. Empirical frequencies are important in Bayesian statistics as well as frequentist statistics (see for example chapter 1 of BDA) but in both cases the probability statements come only after assuming some mathematical models of distributions, independence, stationarity, etc.

2. Statistical properties are based on assumptions. There is no “the prior” and there is no “the data model” or “the likelihood.” Your point, “I don’t see how something can be claimed to have ideal frequency properties if the prior can be anything,” could also be made, “I don’t see how something can be claimed to have ideal frequency properties if the data model can be anything.” My statement, “averaging over the prior and data distributions you used in your model fitting,” is conditional on the assumed model being true. It’s a theoretical statement, of the same form as theoretical statements in textbooks such as the asymptotic efficiency of least squares estimation if the data truly come from a normal distribution.

• Justin Smith says:

Thanks Andrew. I guess it depends on what “based on” means, so we’ll have to disagree there. I’d say things like Strong Law of Large Numbers and CLT for example are not just models but also empirical facts, observable in nature. For example, I don’t need a likelihood or prior to flip a coin many times and see the cumulative relative frequency of heads approach a flat line.

For 2), I believe (without evidence) that the set of likelihood models are smaller than the set of possible priors. With likelihoods + priors there are ‘more levers to pull’ to influence our result (for good or bad) than with just likelihoods alone, but I agree with you.

One question I like to think about is, if likelihoods are functions of data and can swamp priors, does this indicate that frequencies are somehow more fundamental to statistics than beliefs about parameters? Or another way, if n is small, is there any way to make sense of the world in this small n case besides making strong modelling assumptions?

Thanks,

Justin Smith
http://www.statisticool.com

• Corey says:

For example, I don’t need a likelihood or prior to flip a coin many times and see the cumulative relative frequency of heads approach a flat line.

• Justin Smith says:

Interesting! Man people are creative!

I guess my argument doesn’t totally rely on coins. I could still observe any event with an Outcome I’m interested in and do:

relfreq(Outcome)_t = ((t-1)*relfreq(Outcome)_{t-1} + I_t)/t, where
I_t = 1 if Outcome is observed on the t-th trial, 0 otherwise

where I use “relfreq” for cumulative relative frequency (I think),
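As a small sketch of that running update: the frequency after t trials is computed from the frequency after t − 1 trials and the indicator for trial t. The function name `running_relfreq` and the simulated fair coin are assumptions added for illustration.

```python
import random

def running_relfreq(indicators):
    """Cumulative relative frequency after each trial, updated recursively."""
    relfreq = 0.0
    path = []
    for t, i_t in enumerate(indicators, start=1):
        # Weighted average of the previous frequency and the new indicator.
        relfreq = ((t - 1) * relfreq + i_t) / t
        path.append(relfreq)
    return path

# Simulated flips from an assumed fair coin (a model, not real coin flips).
rng = random.Random(4)
flips = [rng.random() < 0.5 for _ in range(10000)]
path = running_relfreq(flips)
print(path[9], path[99], path[999], path[-1])
```

Note that the simulated flips already build in independence and a constant probability, so the flattening of the path in this sketch is a property of the assumed model.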

• Corey says:

The only situation where you will not be able to break that down to lack of control over initial conditions is quantum mechanics, and no one really knows what’s up with that.

• Keith O’Rourke says:

“not just models but also empirical facts, observable in nature.”
But you can't observe anything without representing it somehow (i.e. have a model for it). As Peirce put it, you don't see a rose, you hypothesise that you are seeing a rose and don't come to doubt it.

“set of likelihood models are smaller than the set of possible priors.”
But both are uncountable infinities (between any two likelihoods is another likelihood – e.g. a mixture of the two).

“frequencies are somehow more fundamental to statistics than beliefs about parameters?”
I think so. For instance, Peirce argued that in induction, in the end all that matters is how often it carries you to the truth. As Don Rubin put the same? thought to me once, smart people don't like being repeatedly wrong.

Maybe much of the argument is about whether priors facilitate or block good frequency properties. Here I believe the high road is Bayesian analyses done thoughtfully do facilitate good frequency properties. And sometimes better than direct frequency approaches.

I believe Andrew's point "based on models, not empirical frequencies" is very important. As Fisher put it, these frequencies are just hypothetical. Today they might be better stressed as being subjunctive – not what will happen if studies are endlessly repeated but what _would happen_. That is an abstract mathematical concept that is assessed by math or today simulation. In particular, this is why I don't think any objection based on there only being one study or studies can't be repeated endlessly is relevant.

It is all bad meta-statistics, all the way up from a reality we have no direct access to, to so-called principles that needlessly block inquiry.

• Corey says:

“In particular, this is why I don’t think any objection based on there only being one study or studies can’t be repeated endlessly is relevant.”

I don’t think that states the objection in the strongest way. (Some do make that objection in the weak form you quote.) The objection is that there’s a conceptual gap between knowing the properties of a method on hypothetical repetitions on the one hand and coming to a well-warranted inference or conclusion in the single case at hand on the other. People are so used to applying what they’re taught that the gap isn’t really highlighted or obvious to practitioners. Mayo’s severity construal of frequentist statistical procedures aims to bridge exactly that gap.

The same gap exists in Bayesian analysis — not everyone who uses probabilities to quantify belief thinks those beliefs should be updated by Bayes’ theorem. Saying, “yes, use Bayes’ theorem to update your probabilities in the face of data” is how the gap is bridged for Bayesians.

• Martha (Smith) says:

“Today they might be better stressed as being subjunctive – not what will happen if studies are endlessly repeated but what _would happen_. “

a good way to put it.

• Andrew says:

Justin:

(1) You write, “I don’t need a likelihood or prior to flip a coin many times and see the cumulative relative frequency of heads approach a flat line.” But your above statement is entirely model-based! You didn’t really flip that coin thousands of times (I assume), you just ran the thought experiment, which is based on . . . a model. The key assumps of your model are that the coin flips are independent and that the probability doesn’t change over time. If these assumps are wrong, your conclusions can be way off. For coin flipping, these assumps are reasonable, and that’s fine, but recognize that they are assumps.

(2) I don’t think you can make any sense of the world without strong modeling assumptions, typically of the form that the future will be similar to the past.
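To see how much point (1) leans on those assumptions, here is a toy simulation, with every number an assumption chosen for illustration: one sequence of flips with a constant probability of 0.5, and one where the probability of heads drifts linearly from 0.2 to 0.8 over the run. The helper name `relfreq_at` is hypothetical.

```python
import random

def relfreq_at(probs, checkpoints, seed=3):
    """Cumulative relative frequency of heads at the requested trial counts."""
    rng = random.Random(seed)
    heads = 0
    out = {}
    for t, p in enumerate(probs, start=1):
        heads += rng.random() < p   # independent flip with probability p of heads
        if t in checkpoints:
            out[t] = heads / t
    return out

N = 20000
stable = [0.5] * N                                      # constant probability
drifting = [0.2 + 0.6 * t / (N - 1) for t in range(N)]  # probability changes over time
checkpoints = {N // 2, N}

print("stable:  ", relfreq_at(stable, checkpoints))
print("drifting:", relfreq_at(drifting, checkpoints))
```

In the stable case the frequency at the halfway point and at the end should nearly agree. In the drifting case the two should differ substantially, so the "flat line" is a consequence of the stationarity assumption and not something guaranteed by flipping alone.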

• Thanatos Savehn says:

I printed this out and put it over my desk: ” I don’t think you can make any sense of the world without strong modeling assumptions, typically of the form that the future will be similar to the past.”

• Justin Smith says:

Hi Andrew. I have done stuff like flipped coins, tacks (http://www.statisticool.com/flippingtacks.htm), etc. thousands of times before. True about the assumptions you list, which I’d probably need to use if I wanted to make an inference. However, those assumptions are still much more general than also adding priors into the mix.

• Chris Wilson says:

Justin, if you think of Bayes as formulating a joint generative model of everything (data + parameters, or better yet, as McElreath suggests, observed + unobserved quantities), it is not really “likelihood + prior”. The use of probability theory in full generality requires distributions over both observed and unobserved quantities. From this point of view, it is not really an “additional” set of assumptions…

5. bananpeel says:

FYI there is a new entry on “Reproducibility of Scientific Results” in SEP

https://plato.stanford.edu/entries/scientific-reproducibility/

wonder what do you folks think about it