Bayesian inference completely solves the multiple comparisons problem

I’m rerunning this one from 2016 because it came up at work recently, and I think the general topic is as important as it’s ever been.

Flat priors consistently give bad inferences. Or, to put it another way, the routine use of flat priors results in poor frequency properties in realistic settings where studies are noisy and effect sizes are small. (More here.)

Saying it that way, it’s obvious: Bayesian methods are calibrated if you average over the prior. If the distribution of effect sizes that you average over is not the same as the prior distribution that you’re using in the analysis, your Bayesian inferences will in general have problems.

But, simple as this statement is, the practical implications are huge, because it’s standard to use flat priors in Bayesian analysis (just see most of the examples in our books!) and it’s even more standard to take classical maximum likelihood or least squares inferences and interpret them Bayesianly, for example interpreting a 95% interval that excludes zero as strong evidence for the sign of the underlying parameter.

In our 2000 paper, “Type S error rates for classical and Bayesian single and multiple comparison procedures,” Francis Tuerlinckx and I framed this in terms of researchers making “claims with confidence.” In classical statistics, you make a claim with confidence on the sign of an effect if the 95% confidence interval excludes zero. In Bayesian statistics, one can make a comparable claim with confidence if the 95% posterior interval excludes zero. With a flat prior, these two are the same. But with a Bayesian prior, they are different. In particular, with normal data and a normal prior centered at 0, the Bayesian interval is always more likely to include zero, compared to the classical interval; hence we can say that Bayesian inference is more conservative, in being less likely to result in claims with confidence.

Here’s the relevant graph from that 2000 paper:

[Graph from the 2000 paper: probability of making a claim with confidence, for the classical and Bayesian 95% intervals, plotted against tau/sigma]

This plot shows the probability of making a claim with confidence, as a function of the variance ratio, based on the simple model:

True effect theta is simulated from normal(0, tau).
Data y are simulated from normal(theta, sigma).
Classical 95% interval is y +/- 2*sigma
Bayesian 95% interval is theta.hat.bayes +/- 2*theta.se.bayes,
where theta.hat.bayes = y * (1/sigma^2) / (1/sigma^2 + 1/tau^2)
and theta.se.bayes = sqrt(1 / (1/sigma^2 + 1/tau^2))
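
Those two curves can be computed directly from this model. Here’s a quick sketch (not the code used for the figure, just the closed-form calculation implied by the model), using the fact that, marginally, y ~ normal(0, sqrt(tau^2 + sigma^2)):

# Probability of a claim with confidence as a function of tau/sigma
# (normalizing sigma = 1); a sketch, not the original code for the graph.
claim_prob <- function(ratio) {
  p_classical <- 2*pnorm(-2/sqrt(1 + ratio^2))  # P(|y| > 2*sigma)
  p_bayes <- 2*pnorm(-2/ratio)                  # P(|theta.hat.bayes| > 2*theta.se.bayes)
  c(classical = p_classical, bayes = p_bayes)
}
round(sapply(c(.1, .5, 1, 2, 5), claim_prob), 4)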

What’s really cool here is what happens when tau/sigma is near 0, which we might call the “Psychological Science” or “PPNAS” domain. In that limit, the classical interval has a 5% chance of excluding 0. Of course, that’s what the 95% interval is all about: if there’s no effect, you have a 5% chance of seeing something.

But . . . look at the Bayesian procedure. There, the probability of a claim with confidence is essentially 0 when tau/sigma is low. This is right: in this setting, the data only very rarely supply enough information to determine the sign of any effect. But this can be counterintuitive if you have classical statistical training: we’re so used to hearing about a 5% error rate that it can be surprising to realize that, if you’re doing things right, your rate of making claims with confidence can be much lower.

We are assuming here that the prior distribution and the data model are correct—that is, we compute probabilities by averaging over the data-generating process in our model.

Multiple comparisons

OK, so what does this have to do with multiple comparisons? The usual worry is that if we are making a lot of claims with confidence, we can be way off if we don’t do some correction. And, indeed, with the classical approach, if tau/sigma is small, you’ll still be making claims with confidence 5% of the time, and a large proportion of these claims will be in the wrong direction (a “type S,” or sign, error) or much too large (a “type M,” or magnitude, error), compared to the underlying truth.

With Bayesian inference (and the correct prior), though, this problem disappears. Amazingly enough, you don’t have to correct Bayesian inferences for multiple comparisons.

I did a demonstration in R to show this, simulating a million comparisons and seeing what the Bayesian method does.

Here’s the R code:

setwd("~/AndrewFiles/research/multiplecomparisons")
library("arm")  # for fround()

# Simulate N comparisons with true effects theta ~ normal(0, tau) and
# estimates y ~ normal(theta, sigma), then summarize the classical and
# Bayesian claims with confidence.
spidey <- function(sigma, tau, N) {
  cat("sigma = ", sigma, ", tau = ", tau, ", N = ", N, "\n", sep="")
  theta <- rnorm(N, 0, tau)
  y <- theta + rnorm(N, 0, sigma)
  # Classical claim with confidence: 95% interval y +/- 2*sigma excludes 0
  signif_classical <- abs(y) > 2*sigma
  cat(sum(signif_classical), " (", fround(100*mean(signif_classical), 1), "%) of the 95% classical intervals exclude 0\n", sep="")
  cat("Mean absolute value of these classical estimates is", fround(mean(abs(y)[signif_classical]), 2), "\n")
  cat("Mean absolute value of the corresponding true parameters is", fround(mean(abs(theta)[signif_classical]), 2), "\n")
  cat(fround(100*mean((sign(theta)!=sign(y))[signif_classical]), 1), "% of these are the wrong sign (Type S error)\n", sep="")
  # Bayesian claim with confidence: 95% posterior interval excludes 0,
  # using the normal(0, tau) prior
  theta_hat_bayes <- y * (1/sigma^2) / (1/sigma^2 + 1/tau^2)
  theta_se_bayes <- sqrt(1 / (1/sigma^2 + 1/tau^2))
  signif_bayes <- abs(theta_hat_bayes) > 2*theta_se_bayes
  cat(sum(signif_bayes), " (", fround(100*mean(signif_bayes), 1), "%) of the 95% posterior intervals exclude 0\n", sep="")
  cat("Mean absolute value of these Bayes estimates is", fround(mean(abs(theta_hat_bayes)[signif_bayes]), 2), "\n")
  cat("Mean absolute value of the corresponding true parameters is", fround(mean(abs(theta)[signif_bayes]), 2), "\n")
  cat(fround(100*mean((sign(theta)!=sign(theta_hat_bayes))[signif_bayes]), 1), "% of these are the wrong sign (Type S error)\n", sep="")
}

sigma <- 1
tau <- .5
N <- 1e6
spidey(sigma, tau, N)

Here's the first half of the results:

sigma = 1, tau = 0.5, N = 1e+06
73774 (7.4%) of the 95% classical intervals exclude 0
Mean absolute value of these classical estimates is 2.45 
Mean absolute value of the corresponding true parameters is 0.56 
13.9% of these are the wrong sign (Type S error)

So, when tau is half of sigma, the classical procedure yields claims with confidence 7% of the time. The estimates are huge (after all, they have to be at least two standard errors from 0), much higher than the underlying parameters. And 14% of these claims with confidence are in the wrong direction.

The next half of the output shows the results from the Bayesian intervals:

62 (0.0%) of the 95% posterior intervals exclude 0
Mean absolute value of these Bayes estimates is 0.95 
Mean absolute value of the corresponding true parameters is 0.97 
3.2% of these are the wrong sign (Type S error)

When tau is half of sigma, Bayesian claims with confidence are extremely rare. When there is a Bayesian claim with confidence, it will be large---that makes sense; the posterior standard error is sqrt(1/(1/1 + 1/.5^2)) = 0.45, and so any posterior mean corresponding to a Bayesian claim with confidence here will have to be at least 0.9. The average for these million comparisons turns out to be 0.95.
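
Here’s that arithmetic as a quick check in R:

theta_se_bayes <- sqrt(1 / (1/1^2 + 1/.5^2))  # 0.447
2*theta_se_bayes                              # minimum |theta_hat_bayes| for a claim: 0.894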

So, hey, watch out for selection effects! But no, not at all. If we look at the underlying true effects corresponding to these claims with confidence, these have a mean of 0.97 (in this simulation; in other simulations of a million comparisons, we get means such as 0.89 or 1.06). And very few of these are in the wrong direction; indeed, with enough simulations you'll find a type S error rate of a bit less than 2.5%, which is what you'd expect: these 95% posterior intervals exclude 0, so something less than 2.5% of the posterior probability will be on the wrong side of zero.

So, the Bayesian procedure only very rarely makes a claim with confidence. But, when it does, it's typically picking up something real, large, and in the right direction.

We then re-ran with tau = 1, a world in which the standard deviation of true effects is equal to the standard error of the estimates:

sigma <- 1
tau <- 1
N <- 1e6
spidey(sigma, tau, N)

And here's what we get:

sigma = 1, tau = 1, N = 1e+06
157950 (15.8%) of the 95% classical intervals exclude 0
Mean absolute value of these classical estimates is 2.64 
Mean absolute value of the corresponding true parameters is 1.34 
3.9% of these are the wrong sign (Type S error)
45634 (4.6%) of the 95% posterior intervals exclude 0
Mean absolute value of these Bayes estimates is 1.68 
Mean absolute value of the corresponding true parameters is 1.69 
1.0% of these are the wrong sign (Type S error)

The classical estimates remain too high, on average about twice as large as the true effect sizes; the Bayesian procedure is more conservative, making fewer claims with confidence and not overestimating effect sizes.

Bayes does better because it uses more information

We should not be surprised by these results. The Bayesian procedure uses more information and so it can better estimate effect sizes.

But this can seem like a problem: what if this prior information on theta isn't available? I have two answers. First, in many cases, some prior information is available. Second, if you have a lot of comparisons, you can fit a multilevel model and estimate tau. Thus, what can seem like the worst multiple comparisons problems are not so bad.
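
Here’s a sketch of that second point, using a simple method-of-moments shortcut rather than a full multilevel fit: with many comparisons you can back out tau from the spread of the estimates themselves, since marginally var(y) = tau^2 + sigma^2, and then shrink using the estimated value.

# Estimate tau from the comparisons themselves (a method-of-moments sketch;
# a full multilevel model would instead put a prior on tau and fit it, e.g., in Stan).
sigma <- 1
y <- rnorm(1e6, 0, .5) + rnorm(1e6, 0, sigma)  # fake data as in the simulation above
tau_hat <- sqrt(max(var(y) - sigma^2, 0))      # recovers roughly 0.5
theta_hat <- y * (1/sigma^2) / (1/sigma^2 + 1/tau_hat^2)  # shrunken estimates
theta_se <- sqrt(1 / (1/sigma^2 + 1/tau_hat^2))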

One should also be able to obtain comparable results non-Bayesianly by setting a threshold so as to control the type S error rate. The key is to go beyond the false-positive, false-negative framework and set the goal of estimating the signs and magnitudes of the thetas, rather than framing things in terms of the unrealistic and uninteresting theta=0 hypothesis.
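
Here’s a rough sketch of what that could look like, using the same simulation setup as above: choose the smallest cutoff (in standard errors) at which the type S error rate among the resulting claims falls below 2.5%. Note that computing the type S rate still requires knowing (or estimating) the distribution of true effects, so this doesn’t escape the modeling problem.

# Calibrate a classical-style threshold to control the type S error rate
# (a sketch, with tau assumed known)
sigma <- 1
tau <- 1
N <- 1e6
theta <- rnorm(N, 0, tau)
y <- theta + rnorm(N, 0, sigma)
type_s_rate <- function(cutoff) {
  claims <- abs(y) > cutoff*sigma
  mean((sign(theta) != sign(y))[claims])
}
cutoffs <- seq(2, 4, .1)
rates <- sapply(cutoffs, type_s_rate)
min(cutoffs[rates < .025])  # smallest cutoff with type S rate below 2.5%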

P.S. Sorry for the ugly code. Let this be a motivation for all of you to learn how to code better.

P.P.S. More on this topic from Erik van Zwet and Eric Cator: The Significance Filter, the Winner's Curse and the Need to Shrink.

68 thoughts on “Bayesian inference completely solves the multiple comparisons problem”

  1. In their defense, sometimes people may need to run non-Bayesian methods for multiple comparisons because of (a) computational cost: with extremely high-dimensional data, compared with full Bayesian inference with a shrinkage prior, it can be more affordable to run a least-squares regression first, pick all significant covariates, and then run a post-inference frequentist adjustment; and (b) the need for a joint generative model: some complex procedures have an intractable likelihood. For example, start with a full-Bayes regression model and add interaction terms based on posterior predictive checks. This sequential procedure has a likelihood that is too complex to write down and generally requires some fake-data simulation.

    • Yuling:

      Agreed. But I think a lot of this is researchers simply being controlled by the dead hand of the past, doing multiple comparisons adjustments because that’s what was told to them by some authority figure. As noted by van Zwet and Cator (and, before them, by Rubin, and I’m sure by others before him), the reason why multiple comparisons is such a concern in classical inference is that there’s an implicit rule not to touch the point estimates, so when you get more data you get increasingly high chances of seeing wacky estimates. Once you allow yourself to do shrinkage, that key problem goes away.

  2. It’s an instructive simulation, but even Ronald Fisher himself agreed that Bayesianism is the way to go if you have a prior distribution that is perfectly calibrated to the true distribution of effect sizes. It would perhaps be more interesting to consider what happens when the prior over tau is miscalibrated to various degrees and in different ways. A proper prior that is sufficiently miscalibrated will do more damage than a flat prior. Where are the “turning points” where that happens?

    • Olav:

      Yes, this is a great question. Also, Ronald Fisher used maximum likelihood, which assumes the data model is known, even though the logistic model is just something he made up one day and not some true property of nature. We develop methods all the time under assumptions and then consider how they work when the assumptions fail. I think it would be a great research project for someone to assess the methods of the above post, averaging over various alternative prior distributions to see where things go wrong. Indeed, one reason I post these things is to inspire researchers to study these topics in further depth.

      • Here is an interesting paper by Yoav Benjamini. In section 5.5 he comments on your blog post from 2016 and demonstrates that things go wrong if the true (data generating) prior is mixed ever so slightly with a much wider distribution.

    • > A proper prior that is sufficiently miscalibrated will do more damage than a flat prior.

      Hmm, if your prior always errs on the side of being “flatter” than the true prior, maybe then you can still almost guarantee that you do better than the flat?

      Andrew:

      How complicated, in practice, is “you can fit a multilevel model and estimate tau”? Could you do this in a fairly simple and non-Bayesian way and get away with it, when the data is too big and cumbersome to put everything into one big combined model?

  3. Andrew, I think you are over-selling this. In practice, we hardly ever know the correct priors. Yes, we have good guesses, but that’s not the same thing. At the very least, you should qualify the title of the post.

    I think that this post/position has the potential to mislead in the same way that the Gelman and Hill 2007 book did, when you wrote that the normality assumption of the residual is the least important thing in a linear model—as I mentioned to you before, that statement gave a lot of psycholinguists the license to analyze reading time data on the raw scale, and they repeatedly (even today) quote you as proof that they are doing the right thing (No, they haven’t seen your comments on your blog).

    Someone needs to re-do the von der Malsburg and Angele paper with Bayesian methods, using informative priors. I predict that we will be surprised to find that Bayesian methods don’t necessarily solve the problem; but I am happy to be proven wrong.

    von der Malsburg, T., & Angele, B. (2017). False positives and other statistical errors in standard analyses of eye movements in reading. Journal of Memory and Language, 94, 119–133.

    The approach taken in the above paper has more realistic data simulations than the toy example you are using.

    • Shravan:

      You say, “we hardly ever know the correct priors.” I’ll go one step further and say that we never know the correct priors. Or one step further than that and say that there is no correct prior. Actually one step further than that and say that there’s no correct data model either. It’s all a mess. Also we don’t know what is going on in people’s minds when they answer survey questions etc., so our measurement error models are wrong too, at least in all the human-sciences problems that I’ve worked on. So, yes, the mathematical and computational statements above are conditional on the model.

      I agree that it would be great if someone were to reanalyze old data in applications of interest using hierarchical models, to see what works and what doesn’t.

      • Many reviewers *love* to critique priors. In their mind there are objectively correct and incorrect priors. I see this all the time. Any attempt to correct them leads to cries of “Statistical Shenanigans!!!”

      • It’s not about getting to “know the correct prior”.

        The “correct prior” is by definition “what we know”.

        (Unfortunately that truism may not help a lot when it comes to writing it down.)

        • Not quite. If you interpret probability epistemically, then yes – but then also the sampling model should be interpreted epistemically in order to have a consistent meaning of probabilities. However, according to the frequentist Bayesian interpretation quoted by Eric above, the prior can refer to a distribution over a population of problems, which may be unknown to you.

        • The prior being about “the set of problems for which a particular statistical model or procedure will be applied” doesn’t make much sense to me.

          If I’m determining fish size using sonar measurements the prior is about the size of the fish, not about the (ill-defined) set of problems where I could use a similar statistical procedure.

          One could say that the “correct prior” depends on the species. For a sturgeon the prior will have some form, for a catfish another form. It could be refined with other information like the season. There may be also priors for nuisance parameters which are not really about the size of the fish but reflect what we know or ignore about other components of the model like the specifications of the equipment or the measurement process.

          But that is completely different from the “set of problems” view. It’s a problem specific question and it’s directly related to what we know. If we don’t know the species the “correct prior” will be a mixture of species-specific priors that reflects that (lack of) knowledge.

        • That’s all fine by me if you are a subjectivist. What doesn’t make sense to me is to have a probability model in which some probabilities refer to subjective knowledge and some others are meant to refer to frequencies occurring in a real data generating process. There is no foundation for bringing these together in the same calculus.

        • Why?

          If the calculus is simply meant to provide an assessment of what would repeatedly happen in these hypothesized settings or populations, if one used the analyses set out by this set of varying joint prior and data generating model assumptions?

          Prior represents fake populations and data generating models produce fake data in those fake populations.

          Now that calculus only involves posterior probabilities as devices to get estimated quantities – but maybe that is your point?

        • Keith: A fake population is still a population. It means you think of things as being generated by some process involving repetition. I understand Carlos (and the subjectivists in general) as not wanting to do this.

          I accept that one can conceptualise uncertainty in this way, by means of a hypothetical population. The point that I’m making is that if you then connect this to the real situation, this is a model for the generation of your data and not for your thinking (although of course it *shows* your thinking), and as such the data can give you evidence against the model, including the prior. A subjectivist can say, as long as the prior correctly reflects my beliefs, it is correct whatever the data say (at least the data that are observed after setting up the prior). This is how I understood Carlos.

          I don’t understand what’s the fundamental problem. Maybe because I don’t really know what you mean by “real data generating process”. Or, for that matter, whether “subjective knowledge” is intended to mean something different than “knowledge”.

          For example, let’s consider the typical urn-sampling-with-replacement setting. There are two urns, one with two red balls and one white and the other with two white balls and one red. You extract balls from one of them.

          I have no problem in mixing the probability that I assign to not knowing which urn you selected and the probability related to the frequency of balls coming from each urn.

          I can calculate the probability of the next ball being white or the probability that you selected one particular urn based on the prior for the choice of urn, the frequencies associated with each urn, and the color of the balls you showed me so far.
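
          Just to make that calculation concrete, here is a quick R sketch (the three observed draws are hypothetical):

          p_white <- c(urn1 = 1/3, urn2 = 2/3)   # chance of drawing white from each urn
          prior <- c(urn1 = .5, urn2 = .5)       # uncertainty about which urn was selected
          balls <- c("white", "red", "white")    # hypothetical observed draws
          lik <- sapply(balls, function(b) if (b == "white") p_white else 1 - p_white)
          post <- prior * apply(lik, 1, prod)
          post <- post / sum(post)               # posterior probability of each urn
          sum(post * p_white)                    # probability that the next ball is white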

        • Carlos: In situations in which your knowledge is like this, it doesn’t make much of a difference really. You can think of the prior as modelling a hypothetical urn drawing process, or as modelling your uncertainty, or as modelling some kind of non-subjective “knowledge” assuming that what you describe there is the knowledge there is, not depending on the person. Results will be the same in these situations. By “subjective” I mean that in general the state of knowledge depends on the person holding it, and more generally considerations may go into the subjective prior choice that may not be appropriately called “knowledge”, but fair enough, there are situations in which the knowledge that ultimately goes into the prior is shared and more or less uncontroversial. I’d still call it “subjective” not because I doubt it in the particular case but rather in order to indicate that what is modelled is the information/knowledge/uncertainty of a subject, even where shared and agreed.

          The ultimate issue is the following: Let’s say that you want to analyse data you haven’t collected or seen yet. You choose a prior based on your knowledge. Are you saying the prior is “correct” if it correctly formalises your uncertainty and knowledge? In this case, the data that you are going to observe can’t contradict the prior, because the prior as a model just doesn’t concern these data. However, if you think of the prior as a model for a real process generating parameters, even if artificial and “fake”, the data carry information about the prior. Let’s say your sampling model is N(a,1) and your prior for a is N(10,1). If you then observe 2000, the model (including your prior) is pretty much refuted. If the model (including your prior) is meant to model your belief, and be it called “information/knowledge”, the prior (and in the proper subjectivist’s view even the sampling model) can still stand as “correct”.

        • I would say that either the prior is “still correct” – whatever that means – after you observe 2000 or it was never “correct” to start with.

          If you set a prior based on your knowledge and that knowledge includes the prediction that if you observe 2000 you will consider your prior refuted, I think you may be doing something wrong.

          In practice, I guess every model is necessarily incomplete and tentative. There are always additional degrees of freedom with some unspecified “probability of the model being wrong”, “probability of the data being false”, “probability of this being a glitch in the matrix”, etc.

          But I still don’t understand what’s the fundamental problem in bringing together (subjective) knowledge and (knowledge about) frequencies.

          “The data that you are going to observe can’t contradict the prior” seems to be the very definition of prior (leaving aside for now the impossibility of living up to that ideal).

          When you say that you “think of the prior as a model for a real process generating parameters” it seems as if you had a hierarchical model and you were giving the name “prior” to some hyperparameter and you wanted to update that “prior” as you got more data. I find it confusing.

        • In Chapter 7 of Ten Great Ideas About Chance, Diaconis and Skyrms give a pretty good overview of how De Finetti resolved the unification problem between frequencies and ‘degrees of belief’, which is basically where our ideas about exchangeability come from. So I don’t think it’s correct to say that “there is no foundation for bringing these together in the same calculus”. The Bayesian calculus works regardless, it seems to me, and if you buy into the Representation Theorem it has exactly the interpretation that you suggest is not possible. Now, I agree that reasonable people can still disagree about the utility or wisdom of doing this in various cases.
          Just my $0.02.

        • @Chris Wilson: de Finetti is a proper subjectivist. His priors are not about data generating processes. I have no objections against that, at least not from a foundational perspective, but this doesn’t contradict what I was writing all along. Note that in de Finetti’s view the whole probability setup made before observing the data is the “prior”, and this includes what is often referred to as sampling model. Meaning that de Finetti doesn’t mix up different concepts of probability. And that’s my point. You can have epistemic probability, or you can have aleatory/frequentist probability, but you should decide which one you want. I’m with de Finetti on that one – except that I’m fine with a frequentist approach as well, or with a Bayesian one in which all probabilities have aleatory/frequentist interpretation.

        • Maybe there’s really not that much of a disagreement, because maybe Carlos you’re more of a de Finetti-style subjectivist than you have conceded up to now. It somehow seems to me that you don’t like this characterisation, but why not? What you write seems all in line with it. By the way also what Chris wrote, and what’s in Diaconis & Skyrms’ Chapter 7 (as far as I checked it quickly; I have reviewed that book but remember better how de Finetti himself put it) is properly in line with that. I’m *not* saying that Bayesians shouldn’t use or encode information about frequencies. Of course they can, as far as such information exists. What I’m saying is that if you interpret the prior in an epistemic manner, you should also interpret the sampling model in an epistemic manner. What Diaconis and Skyrms write about de Finetti basically means that you shouldn’t think that this is a big problem that stops you from doing all kinds of useful things.

        • I don’t understand what’s wrong with just saying Bayes models degrees of credence, and there are multiple ways we can arrive at a degree of credence. One is to “think holistically about what we know and assign some credence” which will generally be where priors come from, and one will be “think about what data our model predicts we would get if we knew the parameter values” in which case we can create a likelihood.

          “what data our model predicts we would get” encompasses all possible kinds of “data generating processes” including cryptographically strong random number generators, solutions to PDEs, algebraic equations, etc etc

        • > you should also interpret the sampling model in an epistemic manner

          But this is trivial:

          “my sampling model is I’m drawing from a strongly unpredictable source of random numbers with density p(x)” implies that you know (epistemically) that the probability that x is between a,b is integrate(p(x),a,b).

          It also turns out that it’s the frequency as well, because “strongly unpredictable source of random numbers” is code words for “the probability is the same as the known frequency”.

          On the other hand if your model is “I’m sampling from some kind of weird seasonal growing process and the best I can guess is that growth is not too far from my predicted value if I know the parameters” then you use some other p(x) like a normal distribution, and you don’t worry that after observing it for 3000 times the frequency isn’t quite a normal distribution, because you’re not trying to learn the sampling distribution, you’re trying to learn the parameter.

          (unless you are trying to learn the shape of the sampling distribution, in which case you need to make *that* your parameter, like through a GMM or something)

        • Daniel: “I don’t understand what’s wrong with just saying Bayes models degrees of credence, and there are multiple ways we can arrive at a degree of credence.”
          I’m not saying there’s anything wrong with that. Maybe I don’t make myself clear today…

        • Christian, thanks for follow-up. You write “What I’m saying is that if you interpret the prior in an epistemic manner, you should also interpret the sampling model in an epistemic manner.”

          I think that’s all fine and good. As I understand it, the point of the unification in De Finetti is that we can define/interpret the unknown parameter as a limiting relative frequency. So the Representation Theorem is a clean way in which – conditional on buying into exchangeability – the broader conception of probability as modeling uncertainty/degrees-of-belief contains and subsumes the frequency definition. You recover the frequencies in the limit, and can of course use the Bayesian machinery in the usual inductive way from observed frequencies in data to inference on chances.

          Anyhow, pragmatically, I think what Andrew is constantly saying (and you are agreeing with here I think) makes sense. The likelihood is just as much a modeled assumption as the prior – it’s not like our log-links and logistic regressions are ‘out there’ in the natural world or strictly dictated by our experimental designs even. Is that an ‘epistemic’ interpretation? How would a frequentist interpretation of the likelihood get around any of this? What would it add? In my view, the heart of frequentism – at least what makes sense to me and seems useful – is a concern for the average properties of a procedure over repeated use.

        • Chris Wilson said,
          “The likelihood is just as much a modeled assumption as the prior – it’s not like our log-links and logistic regressions are ‘out there’ in the natural world or strictly dictated by our experimental designs even. ”

          +1

        • Christian Hennig, I don’t see why the likelihood and prior must necessarily be given the same interpretation just for it to be legitimate (and meaningful) to combine them in a calculation. We formally combine quantities that have distinct meanings all the time. For example, multiplying mass and acceleration gives force. In a physics context, we have the added restriction that the dimensions on either side of an equation need to match up, of course. But in a Bayesian context, we don’t even have that restriction given that the probabilities are—and must be—unitless (since p(H|E) = p(H&E)/p(E), i.e. any purported units would be canceled out). Thus, on a hybrid interpretation, the likelihood (which I interpret as representing a model’s estimate of aleatory probabilities) combines with my prior epistemic degrees of belief to form my epistemic posterior degree of belief. I don’t think there’s anything wrong with this picture. Indeed, it seems more psychologically realistic than a purely epistemic interpretation. Models of the world (rather than models of my beliefs about the world) and what actually happens in the world (rather than what I believe happens in the world) have a direct influence on my beliefs all the time.

        • @Olav (and anyone interested): The discussion makes me realise that the issue is very subtle indeed and I have to probably review a bit how much (or little) I think this could affect practice… anyway, the first reason that comes to my mind why we shouldn’t mix up an epistemic and an aleatory/frequentist interpretation of probability in the same model is that all the justification in the foundations given for the axioms of probability is based on *either* an epistemic *or* an aleatory probability interpretation, but there’s nothing for mixing them, and to my knowledge nobody who has worked at this foundational level has ever considered mixing. One may however hold that this isn’t a reason against doing it in practice – it may just be a philosophical point. However when I do statistics and work with probabilities, I’m always concerned about meaning. I’m never happy with saying just that the probability of this-or-that event is x. I am aware that conflicting interpretations of probability are around, and that therefore it isn’t clear, without further explanation, what somebody means when they talk about probabilities. I’m skeptical about mixed interpretations in the first place because I haven’t seen them elaborated to the level of explanation that I’d expect. I’m not saying it isn’t possible, but chances are it hasn’t happened.

          Also there are issues with it. According to de Finetti and followers, including Diaconis and Skyrms, a major motivation for being a Bayesian (epistemic interpretation) is that there are no good reasons to say that “real probabilities” and data generating processes as modelled in the frequentist sense really exist. A “mixed” interpretation would work with them as does a frequentist one.

          Now my personal take on this is that in order to use frequentist models, one does *not* need to believe that reality really works like this. As I wrote in another posting, there’s a model world and a real world, they are different, which we can acknowledge, but we still can use the model world in order to think and learn about the real world. Using frequentist models does not mean that we imply reality is like this. We rather set up a model reality, analyse it, and can then ask, how informative is this and how different is reality from it? (A major way to learn from models is to understand what’s wrong about a wrong model; which means we shouldn’t stop ourselves from setting up models just because we worry they might be wrong.) So I agree with de Finetti, Diaconis and Skyrms that probability as we model it in a frequentist way does not exist. I do *not* agree that this is a reason not to use frequentist models, particularly because epistemic modelling comes with the same issues. It is an idealisation as well; it creates a model world as the frequentist one, with clear differences to the real one (here meaning what we really think/know about reality). A major example for this is that nobody in their right mind should really believe that anything is exchangeable, particularly because this view, taken consistently as the basis of a Bayesian model for a particular situation, implies that you can *never* learn that this belief may not be true. But exchangeability is such a helpful tool for modelling that we use it all the time anyway. I’m not objecting to this. What I’m objecting to is the idea that this solves the general modelling problem that frequentists have. In fact it is riddled with the very same issue, no worse, no better.

          Now you can ask, but if my view does *not* involve the postulate that any frequentist mechanism is real, and I’m still using frequentist models, why should these not be mixed with epistemic modelling? So much of my thinking is epistemic anyway, isn’t it? My answer is that I would like to keep things clean in the model world. Reality is ugly enough, but the model world enables clear mathematical thinking. So, despite accepting that this is not real, when I use a probability model in my model world, I want to have a clear idea how this connects to my view of reality. Meaning that I can have a model world that is frequentist, as it *models* data generation in the real world (not saying that the real world is really like that), or a model that is epistemic, *modelling* knowledge, epistemic uncertainty and its update through data (not saying that my dealing with real uncertainty is consistently like that, e.g., because I am aware that reality can do some nasty things that may destroy my naive exchangeability setup and I’m therefore prepared to leave Bayesian calculus temporarily when that happens).

          However if my model world (regarding one specific model) is half epistemic and half frequentist, I don’t know whether the resulting probabilities are one or the other; and I also don’t have a clear idea what role the data play in possibly changing my model – this is the point I made before. If your model is frequentist, it models the data generating process, so the data should be compatible with it for the model to look “realistic”. If your model is epistemic, you model the knowledge and the data come later, they can update your model but not refute it. I’m not saying people need to either hold a frequentist or an epistemic interpretation of probability. One can be frequentist for one problem and epistemic for another. But in the same model? I don’t think so. (Or rather, I’d think that it’s possible only as long as we aren’t very concerned about meaning, but I think we should.) By the way, for this reason, I object against any interpretation of Bayesian posteriors as “the (epistemic) probability that the true parameter is in this or that set” – this requires a true parameter to exist, and therefore frequentist thinking. And then I’d like to see a fully frequentist interpretation.

        • Christian, thank you for the thoughtful reply. I agree that meaning is important. In my view, neither the standard (pure) epistemic interpretation nor the frequency interpretation of Bayesian probability is best. My own view is that the interpretation of probability that fits best with practice (with a few exceptions) is partly pragmatic and partly epistemic, so I actually favor a hybrid interpretation of both the prior and likelihood (and posterior). That is, the (prior or posterior) probability assigned to a parameter in a model reflects both background knowledge as well as the goals of the analysis. The goals of the analysis will be influenced by pragmatic factors and can include considerations such as wanting a posterior probability distribution that is (in some sense) calibrated. The preceding ideas can be spelled out a bit more precisely in a few different ways (in http://philsci-archive.pitt.edu/15215/1/New%20semantics%20for%20Bayesianism%20.pdf I consider a couple of ways); but personally I’m not wedded to any particular interpretation, so much as the basic idea that Bayesian probabilities are (usually) in part pragmatic and in part epistemic. And I think this feature of probability can be practically significant (though I don’t think the example I give in the preceding paper is good; the kind of prior that Erik is interested in—i.e. one that is intended to be well calibrated across a certain range of problems—is a much better example of a prior that, in my view, is clearly based on both epistemic and pragmatic considerations).

          All of that is just to say that I agree that interpretation is important. However, I don’t see the virtue of insisting on a “pure” interpretation of probability.

        • Christian:

          Maybe I’m a de Finetti-style subjectivist. I guess I’ve been called worse things. I have nothing against de Finetti but I find him hard to read [*] and I don’t know well his work. I very much prefer Jaynes, who you consider an objectivist. I also like Lindley, who may be somewhere in between: “Objectivity is merely subjectivity when nearly everyone agrees.”

          In any case, all three recognize frequencies as something distinct from (epistemic) probabilities but related to them – with connections which arise naturally when appropriate and allow for their use within the framework of probability calculus.

          [*] However, I found yesterday this paper which is quite clear (and funnily enough elicited a response titled “Bayesianism; a threat to the statistical profession?” warning that the adoption of Bayesianism could seal the doom of applied statistics as a respectable and respected profession):

          “Bayesianism: Its Unifying Role for Both the Foundations and Applications of Statistics”

          http://www.socsci.uci.edu/~bskyrms/bio/readings/definnetti_bayesian_standpoint.pdf

        • Olav: Thanks for your response and particularly the link. It looks quite interesting and I’ll read it.
          Two quick remarks:
          (1) I find the term “pragmatic interpretation” somewhat puzzling. An attitude can be pragmatic, a use of models can be pragmatic, but I don’t quite understand what the term “pragmatic” means when it comes to giving probability statements a meaning.
          (2) I agree with this: “That is, the (prior or posterior) probability assigned to a parameter in a model reflects both background knowledge as well as the goals of the analysis.” However, for me there is a difference between what a model “reflects” – what kind of considerations go into setting it up, and what it actually models. If I use a frequentist model, it will reflect knowledge that I have as well as the goals of the analysis, as far as possible, by the way I set it up. But what it actually models is a data generating process, not my knowledge.

        • It is sometimes possible to embed a particular problem in a larger “reference” population of similar problems. Here and here my co-authors and I try to do that. If it all goes well, the posterior will be calibrated in the sense that posterior probabilities correspond to frequencies over the reference population.

        • Thanks for the references. I only had a quick look but as one of the abstracts says “is not intended to be a replacement for a prior based on full information about a particular problem; rather, it represents a familywise choice that should yield better long-term properties than the current default uniform prior, which has led to systematic overestimates of effect sizes and a replication crisis when these inflated estimates have not shown up in later studies.”

          It’s not about the “correct prior” for the problem, it’s more a “wild guess prior without looking at the problem at all”. Simplifying a bit, if in half of the clinical trials in the reference database the drug didn’t work there is a sense in which the prior for any future trial is 50%. But we could refine it more by therapeutic area or drug class. And we can also combine those statistics from previous trials with all the relevant details about the clinical trial at hand to get a “more correct” prior.

          It’s interesting but more than about “the set of problems for which a particular statistical model or procedure will be applied” it is about “one particular set of problems for which a particular statistical model or procedure has been applied”. And the resulting prior could be “calibrated” on average without being ever “correct”.

        • Carlos:

          By using all the RCTs in the Cochrane database, we’re quantifying the available information about estimates of treatment effects in RCTs. Such estimates are actually quite special because they tend to be roughly the same magnitude as their standard errors. That’s how RCTs are designed!

          Of course, for any given RCT we know much more than that it’s an RCT. And as you say, we can use this additional information to refine the prior. That will make the prior more specific, but not more correct.

          In fact, by refining the prior it will become *less* correct because we will have fewer and fewer examples to learn from.

        • > we can use this additional information to refine the prior. That will make the prior more specific, but not more correct.
          > In fact, by refining the prior it will become *less* correct because we will have fewer and fewer examples to learn from.

          Erik, I assume that you meant “will make the prior more specific, but not [necessarily] more correct.” and “it [may] become less correct”. I could agree with that. My point was that the “reference database” prior is not “the correct prior” and it can be improved by making a better use of the information in the database or adding information from other sources.

          Otherwise, I wonder why would you use in the other paper data from 178 phase 3 clinical trials if the result is going to be a “less correct” prior than the one derived from 23,747 RCTs in the Cochrane database.

        • Carlos: I agree with your more careful wording of what I wrote earlier.

          I would say that our prior based on all 20,000 RCTs is “correct” because it is very accurately estimated. However, it is not very specific. Posterior probabilities computed from this prior will be almost perfectly calibrated with respect to the very broad population of all RCTs.

          If we focus only on those 178 phase III trials, the prior will be “less correct” because it is not estimated as accurately. However, it is much more specific. Posterior probabilities computed from this prior will be *approximately* calibrated with respect to the *much narrower* population of phase III RCTs.

          In principle, the following two statements can be simultaneously correct, i.e. well calibrated:

          1. Viewed as a typical RCT, the posterior probability of a positive effect in this trial is xx

          2. Viewed as a typical phase III RCT, the posterior probability of a positive effect in this trial is yy

          The second statement is more informative, but more difficult to get right.

        • It seems that we have different notions of “more correct” and “less correct”.

          I think that “the posterior probability of a positive effect in this trial is yy (because of such and such)” can be more correct than “the posterior probability of a positive effect in this trial is xx (because the fraction of all the trials ever conducted that had a positive effect is xx)”.

          The “viewed as a typical RCT” and “viewed as a typical phase III RCT” qualifiers are irrelevant if one sees correctness as something related to the probability of a positive effect in this particular trial (viewed as a member of a hypothetical group of trials completely similar to this one, if you will).

          Even if you only want to think of correctness in the context of a group of actual trials, I think a reasonable definition of “more/less correct” will prefer a slightly-worse-calibrated probability for a more relevant comparable group than a slightly-better-calibrated probability for a less relevant comparable group.

          In the same way that an imprecise estimate of the height of Donald Trump based on some photographs of him getting down from Air Force One can be “more correct” than a precise estimate of his height based only on the average height of US citizens.

        • Carlos, Keith:

          From a frequentist-Bayesian point of view, I want to make sure that the posterior is well-calibrated with respect to some well-defined (potentially infinite) reference population. If that’s the case, I’d call posterior statements “correct”.

          One cause for miscalibration is having too few examples from the reference population to learn the prior from. This happens when the reference population is very specific. For example, all RCTs studying one particular drug.

          On the other hand, the posterior is more informative when the reference population is more specific. So, posterior statements that are well-calibrated to phase III RCTs are more informative than posterior statements that are well-calibrated to all RCTs.

          So it’s a trade-off. You want to be maximally informative but still well-calibrated.

        • Keith: There seem to be some words missing and some garbled. You probably can’t correct that – maybe you write it down again or in an email to me?

        • There is now a lot to digest on this post and for instance some points by Erik and Olav seem close to what I was trying to get at.

          So briefly for now:
          The joint probability (prior and data generating model) live in the model world – possibles.
          The data in hand live in the real world – actuals.
          The posterior _when used to decide what to make of the data_ straddles both possibles and actuals so can’t just be either but some have argued is an interpretant (or inference).

          The posterior _when used as a prior or just something to simulate from_ lives again in the model world.

        • Keith: Not sure whether this has practical implications, but I’d say things slightly differently. Obviously for the model world to be of some use, it somehow needs to be connected to the real world. There are two connections, (a) interpretation and (b) measurement. I think of measurement as dealing with the real world in such a way that it produces entities – data – that can be processed in the model world. Which means that I wouldn’t put data straight into the real world, I think it’s half way at least on the model side, although the connection to the real world is clear. If we for example talk about a student who scores 85/100 in an exam, the idea behind this is that 85/100 should be a mathematical (model) representation of the student’s ability. To what extent this is the case can be controversial; this would be a discussion of the measurement procedure, i.e., the procedure to translate something real but informal we are interested in into something formal that can be handled in the model world. The posterior as such then is again in the model world, but of course is connected to the real world by interpretation, as having accommodated the information in the data. This seems to be a different distinction from yours between possible and actual. I think of the model world as more artificial than “possible”; it is essentially different from the real world, and in order to become “possible”, as far as I understand what you mean by this, it already requires interpretation.

  4. This post and comments thread make me realize I, as a frequentist, don’t really understand what it means to assume correct specification of the prior distribution in a Bayesian simulation. In a frequentist simulation, the model is correctly specified if (in the simplest case) the correct family of distributions is assumed (e.g., Normal). But in a Bayesian simulation, if you’re assuming the prior is correctly specified, doesn’t that mean you’re assuming the prior distribution is both the correct type of distribution and has the correct moments? If so, that’s a much, much stronger assumption than we make in frequentist simulations. And if not, then how do you ensure you’re using an informative prior in the simulation without basing its moments on the population moments?

    It seems to me that a proper Bayesian simulation, meaning one that is both consistent with what researchers do in reality and that makes only plausible assumptions, would have to use a prior with at least the first moment sampled from a distribution with expected value equal to the population moment. But maybe that’s what it means when people say that Bayesians assume that the value of a parameter is a random variable?

    • Ah, I see from the linked 2012 paper that tau is a hyperparameter, which makes more sense in terms of modeling how a researcher specifies an imperfect but informative prior. But it still seems to imply that a Bayesian simulation makes more/bigger assumptions than a frequentist simulation. If that’s the case, I’d have to side with Shravan above: you can always “completely solve” a problem by adding or strengthening assumptions. Case in point: allow my frequentist simulation to make an assumption constraining the possible values of the parameter (which is what assuming tau is correctly specified does) and frequentist inference completely solves the multiple comparisons problem, too!

      • Michael:

        I completely agree that what’s important are the assumptions and the method, not the philosophy. I’d have no problem with a frequentist procedure that solves the multiple comparisons problem by partially pooling estimates based on a model. One problem I’ve found with many frequentist approaches in practice is a reluctance to move beyond maximum likelihood or least squares estimates. The key to solving the multiple comparisons problem is to be willing to move the point estimates.

        • My take on this would be that (a) you’re right, knowing the prior is a big and strong assumption, however (b) all assumptions are wrong anyway, because the “model world” is always different from the real world. Now this looks like a philosophical point, but it does have implications in practice. If you make an assumption, what in fact you’re doing is *not* that you’re saying “I know the world is like this”, but rather “Here’s a model world that seems reasonable to me, and what we get are its implications”. The key question now is, how wrong do we get things if the real world is actually different? That requires sensitivity analysis – exploring how what we get is actually affected by varying the prior over different reasonable ideas about how the real world may behave. This is actually also an issue for frequentists, and leads to robustness and the like. But the Bayesian has the prior as an additional issue to worry about… on the negative side. On the positive side, she has a tool to bring in some information that the frequentist tester tends to ignore. Which is good in case the information is worthwhile, of course, otherwise not.

        • Agree – the key point being varying the prior over different reasonable ideas about how the real world may behave.

          That’s becoming more practically feasible and no longer should it be undervalued.

          And this is nicely worded – “Here’s a model world that seems reasonable to me, and what we get are its implications”

  5. Many researchers, particularly in my field (neuroimaging), have been taking this position; I was just writing a post about why I disagree. In neuroimaging we typically have hundreds to millions of locations at which we want to perform inference, and in this situation I don’t think a shrinkage prior would be strong enough if we wanted to control our false discovery or familywise error rates.

    One thing the presentation above doesn’t make clear is that sigma is not the residual variance, it’s the estimator variance. This hides the dependence on sample size. You can replace sigma with sigma' / sqrt(N) in the derivation in the 2000 paper and you get the ratio between the two tests as sqrt(1 + sigma'^2 / (N * tau^2)). I’ve been thinking about the shrinkage strength in terms of the critical value: 1.96 for the classical test and 1.96 * sqrt(1 + sigma'^2 / (N * tau^2)) for the Bayesian test. With sigma' = 2 * tau and N = 10, you get a critical statistic of 2.31, which would correspond to an alpha = 0.02 test instead of alpha = 0.05. This wouldn’t be nearly strong enough for hundreds of tests. In one application I fit a multi-level model to 336 structures with N = 20 and found that exact sigma'/tau ratio.

    Another point is that we don’t always want to test null hypotheses: if we want to make a large number of claims based on probability thresholds from our model, we may need multiple comparisons correction if we want to limit the number of false claims we make. If I wanted to test a large collection of hypotheses that parameters were below a certain cutoff, a shrinkage prior would do the opposite of what we wanted it to do.
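
    A quick check of that arithmetic in R (just re-doing the numbers above):

    crit_bayes <- 1.96 * sqrt(1 + 2^2/(10 * 1^2))  # sigma' = 2, tau = 1, N = 10; about 2.3
    2*pnorm(-crit_bayes)                           # implied two-sided alpha, about 0.02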

    • Chris:

      I don’t see why the hierarchical modeling approach would have any problem when you have hundreds or millions of locations. The point is that you don’t need to think of doing hundreds of tests or millions of tests or whatever; you’re estimating hundreds or millions of parameters, and you have to accept some level of uncertainty in your inferences. To that end, I don’t think it’s a good idea to test a large collection of hypotheses that parameters were below a certain cutoff; to me, that sounds like an attempt to obtain certainty or near-certainty where this is inappropriate.

      • Thanks Andrew,

        Sorry I didn’t mean to imply there was something necessarily wrong with using a multi-level model, just that I don’t think it solves the multiple comparisons problem; I think the two can be used together. For visualization and display, or presenting in an abstract, we’re more or less forced to focus on a subset of locations, and if I wish to choose these using a probabilistic criterion and limit the number of mistakes I make then multiplicity correction is needed. Especially if the probabilistic criterion is something unrelated to the way in which the parameters are shrunk (my hypothesizing parameters less than a threshold example). I don’t think that surviving this correction should imply certainty.

        In another comment you mention that decision theory is the way to go for this type of selection problem, and I agree, but in my field we don’t have a concrete cost function for the decision to highlight a location, and it’s easy to imagine a cost function that depends primarily on the number of mistakes made, which would lead to something like FDR.

        Another case where classic multiple comparisons correction makes sense is when you have reason to believe that the groups aren’t exchangeable. A plausible example is that there are two latent data generating processes. These can potentially be resolved by post-hoc mixture modelling which is closely related to FDR. I’ve experimented with this a bit on my MLM posteriors but ran into difficulty with the probabilities being truncated near zero.

  6. Maybe I’m missing something here but the simulations above don’t seem very compelling in supporting the Bayesian approach. In fact, I would say they are a good example to show that the choice between classic and Bayes depends a lot on what you are willing to assume and what you want to communicate.

    In the classic approach, I don’t see why you shouldn’t apply a multiple testing correction to the p-values associated with each observation in y. I tried it (code below – hope I got it right!) and I get 9 positives with FDR < 0.05, which seems reasonable. Without correction the analysis feels just wrong to me (unless the cost of a false positive is very low and the one of a false negative very high).

    The Bayesian solution, it seems to me, depends heavily on the correct assumption on tau. With tau = 0.5 I get 58 positives but with tau = 0.6 I get 528, the difference seems quite alarming to me.

    As an aside, I'm convinced that the way forward is Bayesian although in practice I use whatever works which, more often than not, means classical methods. By "work" I mean: what people smarter than me do, it gives sensible results, it runs in a decent amount of CPU and RAM. (Bioinformatician here – one of those people…!)

    set.seed(1234)
    sigma <- 1
    tau <- 0.5
    N <- 1e6

    theta <- rnorm(N, 0, tau)
    y <- theta + rnorm(N, 0, sigma)

    p <- (1 - pnorm(abs(y), 0, sigma)) * 2
    fdr <- p.adjust(p, method= 'BH')
    sum(fdr < 0.05) # -> 9

    # Correct prior

    theta_hat_bayes <- y * (1/sigma^2) / (1/sigma^2 + 1/tau^2)
    theta_se_bayes <- sqrt(1 / (1/sigma^2 + 1/tau^2))
    signif_bayes <- abs(theta_hat_bayes) > 2*theta_se_bayes
    sum(signif_bayes) # -> 58

    # With Slightly incorrect prior

    tau <- 0.6
    theta_hat_bayes <- y * (1/sigma^2) / (1/sigma^2 + 1/tau^2)
    theta_se_bayes <- sqrt(1 / (1/sigma^2 + 1/tau^2))
    signif_bayes <- abs(theta_hat_bayes) > 2*theta_se_bayes
    sum(signif_bayes) # -> 528

    • Dario:

      There are no false positives and false negatives. In the problems I work on, all effects are real. There are type S and type M errors. If you are concerned about costs of different sorts of errors, I recommend framing as a decision problem. Setting alpha does not correspond to any cost structure. The point of the Bayesian approach is to get inference about the underlying effects; once you have this you can pipe this through any cost structure and make decisions. And, yes, Bayesian inference, like all other statistical methods, depends on assumptions, and in real problems we’re in a continuing process of assessing and modifying these assumptions.

  7. For what they're (not) worth, my latest takes are here:
    https://link.springer.com/article/10.1007/s10654-019-00552-z
    https://onlinelibrary.wiley.com/doi/full/10.1111/ppe.12711
    I sent these to Andrew with the comment that both of these point readers toward our shared preference for hierarchical modeling – but the PPE one especially tries to give the fairest hearing to Neymanian-frequentist hypothesis testing (since some ask why the one ended up in PPE, the answer is they invited it; the other was invited too);
    and that the Neymanian hypothesis-testing (behavioral decision) concepts and tools serve the highly specialized goals they were designed for. Yes those methods look clumsy to us and certainly don’t fit our applications, are vastly over-promoted and over-applied, and like most stat methods are probably misused more than used properly (including by some professors of statistics). But still, I see settings where they work well enough for the intended purpose if used knowledgeably (e.g., as even Fisher conceded, in clean industrial-process control settings that required simple mechanical rules for users once the statistician left the room, which is why they caught on like wildfire).

  8. Say I am running a three armed trial; treatment A: control, treatment B: standard, treatment C: novel.

    A person would speculate that treatments B and C should outperform A. So if someone was conducting all pairwise comparisons, the alpha wouldn’t need to be corrected, given that in each of the comparisons the presumed correct prior was used? In a model, a prior can be stated for each term, but how does one incorporate the pairwise comparison priors into the model, or is the model individually refitted for each comparison (e.g., A v B, A v C, B v C), so you would fit three models? Or is there a better way to do this? I know this inquiry is a little off the topic, but over the years I have wondered what is the preferred way to conduct Bayesian multiple comparisons in such a scenario.

    • Hayden:

      In this setting I would have three parameters—the probability of survival under A, B, or C under some standard conditions—and then model the experimental data conditional on these parameters.

      The model might need to include some other parameters corresponding to the underlying conditions. Suppose, for example, that the data come from three experiments: expt 1 compares A to B, expt 2 compares B to C, and expt 3 compares A to C. Then you might have a model where Pr(survival of treatment j in experiment k) = alpha_j + beta_k, where j = A, B, or C, and k = 1, 2, or 3. This model would have 6 parameters. Maybe there’s a logit in there too, and there could be other parameters to adjust for sex, age, etc., of participants in the experiments.

      The point is, with Bayesian analysis you’d put a prior on all these parameters and then you’d get a posterior. You can use the posterior to answer whatever questions you’d like, for example expected survival under each of the treatments under various conditions.
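
      To make that concrete, here’s a rough sketch of how such a model might be fit using rstanarm (the data frame “trials” and the prior scale are made up for illustration, not a recommendation):

      # trials: one row per arm per experiment, with columns surv, n, treatment, experiment
      library("rstanarm")
      fit <- stan_glmer(cbind(surv, n - surv) ~ 0 + treatment + (1 | experiment),
                        data = trials, family = binomial("logit"),
                        prior = normal(0, 1.5))  # illustrative prior, not a recommendation
      draws <- as.matrix(fit)
      # Any comparison comes straight from the posterior draws, e.g., B vs. C:
      b_vs_c <- draws[, "treatmentB"] - draws[, "treatmentC"]
      quantile(b_vs_c, c(.025, .5, .975))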

  9. Andrew:

    I suppose my mind gets stuck on thinking about a GLM (logit) with an intercept and two indicator variables. So I would treat Beta_0 as the control (A), Beta_1 as B, and Beta_2 as C. I would then use the generated posteriors for the three parameters to get any estimate I desired (e.g., subtract Beta_1 and Beta_2 to get the outstanding B vs C estimate). Thus given the above posted information, additional corrections would not be necessary given the priors were correct. Thank you for your time and please let me know if I am misrepresenting or making an error in logic.

    • Hayden, indicator variables are the default in many stat packages but index variables are equally valid. One of the beauties of Bayes is that with a posterior you can compute whatever derived quantities you want, as Andrew suggests. No need to pull your hair out pre-specifying contrasts or whatever…
