“Bayesian evidence synthesis”

Donny Williams writes:

My colleagues and I have a paper recently accepted in the journal Psychological Science in which we “bang” on Bayes factors. We explicitly show how the Bayes factor varies according to tau (I thought you might find this interesting for yourself and your blog’s readers). There is also a very nice figure.

Here is a brief excerpt:

Whereas BES [a new method introduced by EJ Wagenmakers] assumes zero between-study variability, a multilevel model does not make this assumption and allows for examining the influence of heterogeneity on Bayes factors. Indeed, allowing for some variability substantially reduced the evidence in favor of an effect….In conclusion, we strongly caution against BES and suggest that researchers wanting to use Bayesian methods adopt a multilevel approach.

My reply: I just have one suggestion if it’s not too late for you to make the change. In your title you refer to “Bayesian Evidence Synthesis” which is a kind of brand name for a particular method. One could also speak of “Bayesian evidence synthesis” as referring to methods of synthesizing evidence using Bayesian models, for example this would include the multilevel approach that you prefer. The trouble is that many (most) readers of your paper will not have heard of the brand name “Bayesian Evidence Synthesis”—I had not heard of it myself!—and so they will erroneously take your paper as a slam on Bayesian evidence synthesis.

This is similar to how you can be a democrat without being a Democrat, or a republican without being a Republican.

P.S. Williams replied in comments. He and his collaborators revised the paper, including changing the title; the new version is here.


  1. Richard D. Morey says:

    “Indeed, with a flat prior on τ the intervals include zero which indicates non-significance.”

    Interesting mix-up of statistical logic here…

    “In contrast, we used the Incredibility index (Schimmack, 2012) and the Test of Insufficient Variance (Schimmack, 2015) to estimate bias. Both tests found no evidence of publication bias. Thus it does not appear that publication bias inflated the evidence for an effect of social-norms messages on towel reuse.”

    Disagreements about modeling assumptions (“Bayesian evidence synthesis” vs a multilevel approach, which will almost always assume some, and sometimes substantial, variability of the effect) aside, this sort of unfortunately common meta-analytic logic strikes me as awfully naive. It’s the worst kind of misuse of significance testing, even if one *believes* the assumptions underlying whatever method one is using to “test” for publication bias.

    You can’t see what’s not there, that was removed by a process you don’t understand. Just because a test failed to “detect” it doesn’t mean much at all.

    • Donny Williams says:

      That is fine. But keep in mind this paper was written by 4 people, two that use frequentist methods and two that use Bayesian methods. As such, there was a give and take to get the paper agreed upon. I actually do not even recall who placed the “non-significance” in the paper (it was over a year ago that it was written), and there was discussion of whether the bias part even worked in this comment. Ultimately, we decided on what is there (obviously), but might do it differently if given the opportunity.

    • Donny Williams says:

      Also, while the “non-significance” is in that section (would take that back if I could), we also say this in the discussion:

      “We showed that Bayesian evidence synthesis is vulnerable to Simpson’s paradox and that a multilevel model produced weaker evidence than Bayesian evidence synthesis. Whereas Bayesian evidence synthesis assumes zero between-studies variability, a multilevel model does not operate under this assumption, which allows researchers to examine the influence of heterogeneity on BFs. Indeed, in the case of Scheibehenne et al.’s data, allowing for some variability substantially reduced the evidence in favor of an effect.”

      This is more than fair. While we did lean toward a binary “significance” in one part of the paper, it is important to consider what we concluded from the interval including 0. We do not suggest no effect or non-significance, but rather that using an MLM provided a sensitivity analysis for tau. As we said, “..allowing for some variability substantially reduced the evidence in favor of an effect.”

      Finally, from being familiar with your work, I would be very surprised if you knew the issues surrounding the estimation of tau in meta-analyses with typical sample sizes. ML, REML, etc. can get caught on the boundary of the parameter space (tau = 0), which reduces the model to a fixed-effect model. In contrast, Bayes can overestimate tau with typical studies. Assessing sensitivity to the estimate of tau is a recommended approach, and is what we did.
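
A minimal sketch of the boundary problem described in this comment. It uses the DerSimonian-Laird moment estimator rather than ML or REML, only because it has a closed form and is likewise constrained to be non-negative. The effect sizes and sampling variances are hypothetical, chosen so that one set of studies is very similar and the other is widely spread.

```python
def dl_tau2(effects, variances):
    """Return (truncated, raw) DerSimonian-Laird estimates of tau^2."""
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    # Fixed-effect (inverse-variance weighted) pooled mean.
    mu_fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the pooled mean.
    q = sum(w * (y - mu_fixed) ** 2 for w, y in zip(weights, effects))
    k = len(effects)
    c = sum_w - sum(w * w for w in weights) / sum_w
    raw = (q - (k - 1)) / c
    # A negative moment estimate is set to zero, which is the fixed-effect model.
    return max(0.0, raw), raw


# Hypothetical data: four studies that share the same sampling variance.
similar = dl_tau2([0.10, 0.12, 0.08, 0.11], [0.04] * 4)
spread = dl_tau2([-0.5, 0.1, 0.6, 1.0], [0.04] * 4)
print("similar studies (truncated, raw):", similar)
print("spread-out studies (truncated, raw):", spread)
```

With the similar studies the raw estimate is negative and gets truncated to zero, so the analysis silently becomes a fixed-effect one.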

    • Keith O'Rourke says:

      > this sort of unfortunately common meta-analytic logic strikes me as awfully naive
      I think you are right, but it’s hard to discern when, where, and how to step in.

      A hopefully more informed meta-analytic logic motivated this post of mine that the authors I was criticizing wanted some time to consider before responding. One of them also mentioned problems of repeatedly running into misconceptions of fixed-effects meta-analysis.

      Now in the latest response?, Scheibehenne et al quoted the very same paper I was somewhat critical of “this recommendation [allow treatment effect to vary] has recently been challenged on theoretical grounds (Rice, Higgins, & Lumley, 2017).”

      Briefly here, I do think everyone should read Don Rubin’s work where he conceptualized meta-analysis as building and extrapolating response surfaces in an attempt to estimate “true effects.”

      Additionally, if one is not clear on the mechanical likelihood implementation of meta-analysis from summarized and/or individual-level raw data – that is clarified in what is pretty much a tutorial here –

      Also, with Bayesian meta-analysis that treats a baseline parameter (say control rate) as varying, Simpson’s paradox can happen. It requires extremely unbalanced data to happen, but has led some (Cochrane Collaboration) to insist that baseline parameters are stratified on – to maintain as-randomized comparisons (not that I would always agree).

  2. Donny Williams says:

    Hi Andrew:
    Here is the published paper. We took your advice and changed the title (maybe you could update the post with this link):

  3. Stephen Martin says:

    The BF crew has published some other things now too, all based (ofc) on the BF. Things like, using Bayes factors in MLMs or meta-analyses to “test” for 1) Does random variation exist 2) Are random effects all positive 3) Are random effects a mix of both positive and negative effects and 4) Is the fixed effect zero with random variation across it.

    Of course, my gut reaction is “Why do we need to ‘test’ for this; just fit the more complicated model [random variation permitted, fixed effect other than zero permitted] and make posterior inferences that way.” But I digress.

    I like the linked paper (I read it a while back), and fully agree with its critiques of the BF method of synthesis.

  4. Was the title the only change in the article, Donny?

  5. ojm says:

    Seems like the article makes reasonable points to me.

    I read this and then one of the cited papers: Bayes factors: Prior sensitivity and model generalizability by CC Liu and M Aitkin. Took me a while to understand Murray Aitkin’s (and, earlier Dempster’s) points but I think they are/were onto something. If you care about Bayesian testing and having it consistent with Bayesian estimation, that is.

    • Donny Williams says:

      Thanks! I think what we were really attempting to show was that things were not so clear, and in this case, the Bayes factor really reduced uncertainty too much. It is quite the paradox, actually. With Bayesian estimation (modeling), we fully capture uncertainty with probability distributions. So much so, that I am not even sure what to make of my models sometimes (as far as if I needed to make some binary decision). In contrast, I more often see strong claims from studies (or blogs) reporting Bayes factors. Not to say all the claims are incorrect, but the Bayes factor approach really does (IMO) stand in contrast to estimation in this regard.

      • ojm says:

        Sure. My understanding of Aitkin’s approach is that it is _not_ based on Bayes factors but instead on a posterior distribution of likelihood ratios. This approach gives the same conclusions as estimation based on posteriors, in contrast to Bayes factors. No Jeffreys-Lindley paradox and all that.

        • ojm says:

          Note: Andrew and Xi’an have criticised this approach quite strongly but after talking it through with Murray Aitkin last year (this year? What year is it again?) I personally think it has some merit.

        • Richard D. Morey says:

          Estimation and Bayes factors are entirely consistent, when properly understood. If you do an analysis with a model that has nonzero prior probability on a null value (a mixture model), you’ll get precisely the same results whether you see it as “estimation” or “hypothesis testing”. You can try it yourself with sided hypotheses, without a point hypothesis, too; the posterior odds in favor of the positive effect sizes will be exactly the prior odds multiplied by the Bayes factor. This *must* be true.

          The difference is understanding that the BF is a measure of change (that is, Bayesian evidence), and the posterior is a tentative conclusion (that is, what you get when you incorporate a prior with the evidence). If they weren’t consistent with one another, Bayesian statistics would collapse into logical contradiction. And if you doubt that looking at change from prior to posterior is important, just think: don’t you want to know the effect of the data on the inference?

          What people often critique is the point hypothesis itself, but this is not an argument about Bayes factors vs estimation; it is an argument about the priors. Many distrust point hypotheses. Fine. Priors need to be discussed and skepticism is necessary. But let’s not claim that BFs and estimation don’t give “the same results”. They’re just different parts of the Bayesian inference pipeline.

          • Donny Williams says:

            Interesting and insightful comments. I am not sure how they could necessarily be consistent. For Bayes factors, specifically for so-called hypothesis testing, a measure of prior predictive success might allow for inference if the prior represented a well-defined hypothesis. Unfortunately, what we have in psychology is usually priors (claimed to be actual alternative hypotheses) that predict what most would agree is not reasonable. For example, I have centered a prior at – 0.4 on d scale with a default Cauchy width: Cauchy(-0.4, 0.707). Here, the prior is so diffuse (would not even call it weakly informative) that the “true” value can be 0.4 and still provide evidence for our supposed hypothesis. That is, we can predict a negative effect but the scale of the prior is so vague that even positive values can provide evidence in favor of our alternative.
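
A rough numerical sketch of the point about the Cauchy(-0.4, 0.707) prior. The assumptions here are not from the comment: the likelihood is approximated as normal, the observed effect is d = 0.4 with a hypothetical standard error of 0.1, and the alternative is compared against a point null at zero. The function names are illustrative only.

```python
import math


def cauchy_pdf(x, loc, scale):
    z = (x - loc) / scale
    return 1.0 / (math.pi * scale * (1.0 + z * z))


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def prior_mass_positive(loc, scale):
    """P(delta > 0) under a Cauchy(loc, scale) prior, from the Cauchy CDF."""
    return 0.5 + math.atan(loc / scale) / math.pi


def bf10(d_obs, se, loc, scale, n=4000):
    """Marginal likelihood under the Cauchy prior divided by the point-null likelihood."""
    # The normal likelihood is negligible beyond 10 standard errors of d_obs.
    lo, hi = d_obs - 10.0 * se, d_obs + 10.0 * se
    h = (hi - lo) / n
    m1 = 0.0
    for i in range(n):
        delta = lo + (i + 0.5) * h  # midpoint rule
        m1 += normal_pdf(d_obs, delta, se) * cauchy_pdf(delta, loc, scale)
    m1 *= h
    m0 = normal_pdf(d_obs, 0.0, se)
    return m1 / m0


mass_pos = prior_mass_positive(-0.4, 0.707)
bf = bf10(0.4, 0.1, -0.4, 0.707)
print("prior mass on positive effects:", round(mass_pos, 3))
print("BF10 for an observed d of +0.4:", round(bf, 1))
```

Under these assumptions about a third of the prior mass sits on positive effects, and an observed effect in the opposite direction from the prior's center still yields a Bayes factor well above 1 for the "negative effect" alternative.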

            Additionally, this says nothing about the fact that the only reason posteriors are generally consistent with classical methods in the BayesFactor package and JASP is because the priors do not reflect actual hypotheses. They are best considered as basically non-informative, which obviously is a stretch to suggest they reflect hypotheses. An actual hypothesis, in which we are making a genuine pre-data prediction, might be something like N(0.4, 0.1), and this will have a heavy influence on the posterior.

            Finally, to be clear, there is not really all that Bayesian about Bayes factors. To get the posterior, one has to complete Bayes theorem. In contrast, Bayes factors require no such thing, as they are a ratio of only the denominators (marginal likelihoods).

            • Richard D. Morey says:

              “I am not sure how they could necessarily be consistent.”

              They must be; they depend on the same marginalisation. In computing a marginal posterior probability, you just marginalize over part of the space you’ll marginalize over for a marginal likelihood.

              “They are best considered as basically non-informative”

              It’s funny, I get people complaining that they’re too informative, as well. But it must be said: the family of priors in BayesFactor includes censored Cauchy distributions with any positive scale. You can scale the Cauchy and chop it up however you like. This gives quite a flexible family of priors, though obviously if this won’t give you what you want, you shouldn’t use it (as with all prior families, including the ones defined in other software). When I teach Bayesian multilevel modelling, I don’t spend much time on these priors. They’re meant for very simple tasks.

              The general point is: you can’t critique a prior out of context like this. Priors should be judged on whether they more-or-less adequately use information we have about a parameter (unless you’re a frequentist, in which case you can critique them on error statistical grounds). It’s totally fine, as Gelman does, to say “these Bayes factors against point nulls are not interesting for questions I try to answer” because that contextualises them. But a contextless “these are basically non-informative” is inappropriate; non-informative relative to *what*? (It’s also an abuse of the term “non-informative”.)

              “Finally, to be clear, there is not really all that Bayesian about Bayes factors. To get the posterior, one has to complete Bayes theorem.”

              This is a very strange comment. It wouldn’t be a Bayes factor if it weren’t conceptually sandwiched between a prior and a posterior odds; it would just be a number. The meaning comes entirely from Bayes theorem. Under this logic there would be no such thing as a “Bayesian prior” because you haven’t “completed” Bayes theorem.

              • You marginalize over *part* of the space, yes. It’s still a different marginalization, and it seems fairly well known that the inference made from a BF can be very different from an inference made from a posterior of a quantity.

                In one case, you marginalize the likelihood across the prior parameter space. In the other, you marginalize over the posterior distributions of the other parameters. One is a prior-predictive likelihood, the other is a posterior marginalization. They can certainly lead to different inferences.

                As for the non-informative statement, I’m not entirely sure when the default Cauchy prior typically used /is/ informative; again, it’s a really broad, weird prior predictive distribution. “Fine” for estimation purposes, but from a prior-predictive perspective, it’s a weird one. Have you ever generated prior-implied datasets from a Cauchy? The prior predictive looks insane to me; I can’t really think of a scenario where I would ever predict such data.
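
A small sketch of what generating prior-implied datasets could look like. The setup is an assumption for illustration: effect sizes are drawn from a zero-centered Cauchy with scale 0.707, and each dataset is 20 observations from a unit-variance normal centered on the drawn effect. The seed and sample sizes are arbitrary.

```python
import math
import random


def draw_cauchy(rng, loc, scale):
    # Inverse-CDF sampling for a Cauchy(loc, scale) variate.
    return loc + scale * math.tan(math.pi * (rng.random() - 0.5))


def prior_predictive(n_datasets, n_obs, scale, seed=1):
    """Return (delta, sample mean) pairs simulated from the prior alone."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_datasets):
        delta = draw_cauchy(rng, 0.0, scale)
        data = [rng.gauss(delta, 1.0) for _ in range(n_obs)]
        out.append((delta, sum(data) / n_obs))
    return out


sims = prior_predictive(n_datasets=5000, n_obs=20, scale=0.707)
frac_huge = sum(abs(delta) > 2.0 for delta, _ in sims) / len(sims)
# Analytic check from the Cauchy CDF: P(|delta| > 2).
analytic = 1.0 - (2.0 / math.pi) * math.atan(2.0 / 0.707)
print("simulated share of |delta| > 2:", round(frac_huge, 3))
print("analytic share of |delta| > 2:", round(analytic, 3))
```

Under this setup roughly one in five prior draws implies a standardized effect larger than 2 in absolute value, which gives a sense of why the prior predictive can look odd for typical psychology effects.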

                As for the BF not being ‘bayesian’, I tend to think something similar, but only because it, by itself, is not a Bayesian quantity. It’s just the result of probability sums/integration, and invokes zero Bayes theorem in its computation. So *by itself*, it is not a Bayesian metric, in my opinion. And you’re right, a Bayesian “prior” /by itself/ is not a Bayesian concept – It’s just a pmf or pdf. But then people don’t make inferences from just the prior either. People /do/ make inferences from the BF without any respect to a model prior or posterior – In which case, that inference isn’t particularly Bayesian, it’s just a marginalized likelihood ratio with rules of thumb. No updating is occurring, no bayes theorem is used, no priors are employed, no posterior is gained.

              • Richard D. Morey says:

                (Response to S. Martin; not sure where it will put this reply…)

                “and it seems fairly well known that the inference made from a BF can be very different from an inference made from a posterior of a quantity.”

                These are apples and oranges. They’re not “inconsistent”, but they are different *kinds* of quantities. So yes, different, but consistent (analogous to the way that incorporating utility can make one act in ways that are “inconsistent” with one’s beliefs. An act and a belief are different things. They are not necessarily inconsistent, but might appear so if you didn’t understand them).

                “In one case, you marginalize the likelihood across the prior parameter space. In the other, you marginalize over the posterior distributions of the other parameters.”

                Apologies for being tedious. Suppose two models are subsets of a parameter space Theta, Theta_1 and Theta_2. The respective priors over these subsets are p_1 = p_0(\theta)/Pr(\Theta_1) and p_2 = p_0(\theta)/Pr(\Theta_2) where p_0 is the “full” prior. The marginal likelihoods are

                M_1 = \int_{\Theta_1}p(y | \theta)p_1(\theta) d\theta


                M_2 = \int_{\Theta_2}p(y | \theta)p_2(\theta) d\theta

                The posterior odds of \Theta_1 to \Theta_2 — a quantity any Bayesian could compute from her posterior estimates, and that anyone would think is kosher — is

                (\int_{\Theta_1}p(y | \theta)p_0(\theta) d\theta) / (\int_{\Theta_2}p(y | \theta)p_0(\theta) d\theta)

                Because p_0(\theta) only differs from p_1(\theta) over \Theta_1 by the scaling factor Pr(\Theta_1) (and likewise for \Theta_2) we can rewrite this as

                (\int_{\Theta_1}p(y | \theta)p_1(\theta)*Pr(\Theta_1) d\theta) / (\int_{\Theta_2}p(y| \theta)p_2(\theta)*Pr(\Theta_2) d\theta)

                which is

                M_1 / M_2 * Pr(\Theta_1) / Pr(\Theta_2)

                …or precisely the Bayes factor times the prior odds. Whether you consider \Theta_1 to be a “model” or a subset of a parameter space is just a matter of perspective; any “full” parameter space can be re-conceptualized as a subset of some larger parameter space, and then what was a “full” marginalisation becomes a matter of computing a posterior and prior probability.

                So you’ll get the same answer whether you treat it as a posterior estimation problem, or compute the BF and then multiply by the prior. If it weren’t true there would be something wrong! You can indeed get into (big) trouble if you treat a BF as a posterior odds, or you model average in naive ways and look at the BFs, but that’s misinterpreting the BF.
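
The identity derived in this comment can be checked numerically. The sketch below makes assumptions not in the comment: a normal likelihood with known standard error, a normal "full" prior p_0, and the two models taken as the positive and negative half-lines. The marginal likelihoods are computed by numerical integration over the renormalised (truncated) priors, while the posterior odds come from the conjugate normal posterior, so the two routes are computed independently.

```python
import math


def norm_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def norm_cdf(x, mean, sd):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def marginal_likelihood(y, se, mu0, tau, lo, hi, n=20000):
    """Midpoint-rule integral of p(y|theta) p_k(theta) over [lo, hi].

    p_k is the full prior N(mu0, tau) renormalised to the interval.
    Returns (M_k, Pr(Theta_k)).
    """
    mass = norm_cdf(hi, mu0, tau) - norm_cdf(lo, mu0, tau)
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        theta = lo + (i + 0.5) * h
        total += norm_pdf(y, theta, se) * norm_pdf(theta, mu0, tau) / mass
    return total * h, mass


# Hypothetical numbers: one observation y with standard error se.
y, se, mu0, tau = 0.3, 0.2, 0.2, 1.0

# Route 1: Bayes factor times prior odds (+/-10 stands in for infinity).
m1, pr1 = marginal_likelihood(y, se, mu0, tau, 0.0, 10.0)
m2, pr2 = marginal_likelihood(y, se, mu0, tau, -10.0, 0.0)
bf_times_prior_odds = (m1 / m2) * (pr1 / pr2)

# Route 2: posterior odds read off the conjugate normal posterior.
post_var = 1.0 / (1.0 / tau**2 + 1.0 / se**2)
post_mean = post_var * (mu0 / tau**2 + y / se**2)
p_pos = 1.0 - norm_cdf(0.0, post_mean, math.sqrt(post_var))
posterior_odds = p_pos / (1.0 - p_pos)

print(bf_times_prior_odds, posterior_odds)
```

The two printed numbers should agree up to integration error, which is the sense in which the "estimation" and "Bayes factor" routes give the same posterior odds here.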

                “In which case, that inference isn’t particularly Bayesian, it’s just a marginalized likelihood ratio with rules of thumb.”

                The justification for marginalisation itself is Bayesian probability. If it weren’t Bayesian, we couldn’t meaningfully marginalise. And as above, the marginalisation is conceptually the same.

                “No updating is occurring, no bayes theorem is used, no priors are employed, no posterior is gained.”

                Who is updating? Whose beliefs? Yours? Of course not. There is no updating in reality. Bayesian updating is a *device* for understanding statistical evidence. No one is *actually* updating any beliefs. A Bayes factor, or a likelihood, is just one part of trying to understand the epistemic implications of observed data. There are no actual Bayesian updaters, only hypothetical ones. Parameters and models aren’t real; how can you have beliefs about them?

                “Have you ever generated prior-implied datasets from a Cauchy?”

                I’ve even plotted them in papers, and for courses.

              • ojm says:

                > The justification for marginalisation itself is Bayesian probability

                Funnily enough this is one of my main issues with unrestrained Bayes – there are many cases where the Bayesian assumptions imply marginalisation makes sense when I think it is basically misleading. I don’t think averaging can be the basis of a general logic of science (I also don’t think uncertainty measures should be additive in general).

                In the Bayes factor case I think of the following simplified scenario. I have an indexed family of models p(y;theta), where each theta gives a distribution over y. Now to me it makes sense to compare

                p(y;theta1) to p(y;theta2)

                for two fixed theta values within the space indexing the family. It would also make sense to me to take eg theta2 to be some average parameter value E[theta] obtained from some distribution. As long as your parameter space is closed under convex combinations or whatever, then this is just another point value in the parameter space.

                What doesn’t make much sense to me is to compare a model p(y;theta1) to an average model E[p(y;theta)]. In particular it is much less common for your model class to be closed under convex combinations. So you often end up comparing a member of the original family to a model that lies outside the original family.

                Now, you _can_ always take averages if you want. But I don’t think it is always _meaningful_ or useful to do so. I think an issue with Bayes is that it encourages the idea that as long as we obey the basic rules of probability calculus we will get a meaningful answer. But, I would argue these rules shouldn’t be applied indiscriminately.

                So, despite finding Bayesian statistics very often a useful thing, I think these misgivings make me a non-Bayesian. Interestingly, there are prominent Bayesians who also seem to be nervous about similar things. Does this mean they aren’t really true Bayesians, or that I could still be a Bayesian despite these misgivings? I don’t know, and I don’t know how to assign a probability distribution to these possibilities either.

              • ojm says:

                E[p(y;theta)] is not p(y;E[theta]) in general and these sorts of thing make me nervous.
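
A tiny illustration of this inequality, under assumptions chosen for clarity: the family is normal with unit standard deviation and mean theta, and theta takes the values -2 and 2 with equal probability. The averaged model is then a bimodal mixture, which is not a member of the original family.

```python
import math


def normal_pdf(y, mean, sd=1.0):
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


thetas = [-2.0, 2.0]   # hypothetical support of the distribution over theta
probs = [0.5, 0.5]


def averaged_model(y):
    """E[p(y; theta)]: the mixture obtained by averaging the densities."""
    return sum(p * normal_pdf(y, t) for p, t in zip(probs, thetas))


def plug_in_model(y):
    """p(y; E[theta]): the family member at the average parameter value."""
    mean_theta = sum(p * t for p, t in zip(probs, thetas))
    return normal_pdf(y, mean_theta)


for y in (0.0, 2.0):
    print(y, round(averaged_model(y), 4), round(plug_in_model(y), 4))
```

At y = 0 the plug-in density is at its peak while the averaged density is in a trough between its two modes, and at y = 2 the ordering reverses.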

              • Keith O'Rourke says:

                > I think an issue with Bayes is that it encourages the idea that as long as we obey the basic rules of probability calculus we will get a meaningful answer.
                I think so – obeying those rules just ensures the joint model specification is not made more wrong (i.e. they are truth preserving given the model was true). One upshot of this being that when you notice a marginal prior or posterior is not sensible (too wrong) then you know the joint model specification is too wrong and needs to be revised.

                So to get a meaningful answer one may often (usually?) have to actually break those basic rules (revise the joint model in ways other than Bayes theorem). I think this drives some people to rightly become frequentist via the use and re-use of the proper rules when they don’t appear to need to be broken.

              • Chris Wilson says:

                ojm, I think that if you are willing to apply the Bayesian machinery and find it a useful route to inference in some set of problems you are functionally a “Bayesian”. Any other definition to me smacks of some kind of ‘religious test’ for office of statistician…

          • ojm says:

            Possibly. But a Bayes factor _does_ include a prior in general. I’m fine with likelihood ratios of point hypotheses, I just don’t think it makes much sense (or is very useful) to compare eg a point hypothesis to a composite hypothesis marginalised over a prior.

            Firstly, why rely on only the average likelihood ratio? This could be very misleading. And second, why marginalise over the prior? Why not incorporate the posterior in some way (yes yes, ‘double use of data’ and all that)?

            Disclaimer: I don’t really care that much about this particular topic…just an interesting thing to think about

            • Hey Ojm;

              Not that I particularly like BFs, but I interpret the prior in BFs as a probabilistic expression of a statistical hypothesis that maps onto a discrete substantive hypothesis (e.g., H1). So p(theta|H1) is an expression of a statistical hypothesis. The BF is then a ratio of “prior”-predictive success. In a way, the “prior” in a BF is a misnomer (in my opinion), because there is no posterior to the parameter, and thus nothing to be “prior” to; really it’s a statistical-hypothesis implied predictive success ratio, or statistical hypothesis implied likelihood of the observations in the prior-predictive distribution.

              Anyway, I say that, because the goal of a BF is very different. Given substantive H1, what is the statistical hypothesis mapping in the form of a “prior” – $p(\theta | H1)$, and what is the predictive success of such a mapping: $\int p(y|\theta,H1)p(\theta|H1)d\theta$ ? This represents the density of the observations given the prior predictive distribution, and the goal isn’t to update the state of certainty about $\theta$, but about the state of certainty of a discrete hypothesis that $\theta$ maps onto.

              That said, I totally agree with you. Imo, the only useful BF is an “informed” one, wherein we derive posteriors for quantities for differing theoretical models, then assess on a _new_ sample which informed model best predicts the new sample. With that said, this is approximated by the LOO error anyway (can test this; the elpd is roughly equivalent to getting posteriors, then using those as priors and computing the marginal likelihood; the elpd and the “informed” marginal likelihood will be fairly similar, within elpd standard error).

  6. Marcel van Assen says:

    Both the comment “banging” on BFs and the response to the comment by the original authors increased my insight, so, great & thanks.

    Estimating heterogeneity in meta-analysis is quite awful, as uncertainty is rather huge (with impossible values of heterogeneity in its CI). Bayesian methods incorporating a prior on tau2 make a lot of sense, given this huge uncertainty of heterogeneity.

    • Donny Williams says:

      I said “banging” in reference to Gelman’s paper about BIC.

      For more on estimating tau, there is a lot of work in the biomedical fields. This has been an active area of research for a long time. People in psychology might be surprised to know that error rates in typical meta-analyses are not optimal, and this is directly related to the estimation of tau. If tau is too small, the mu interval is too narrow. If it is too big, the mu interval is too wide. “Too” in the sense that the intervals are not calibrated. Most of this is probably trivial if not using binary decisions!
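
A short sketch of how the assumed value of tau drives the width of the interval for mu. It uses the standard inverse-variance random-effects formula with tau^2 treated as known, which is a simplification because it ignores the uncertainty in estimating tau itself. The study effects and variances are hypothetical.

```python
import math


def mu_interval(effects, variances, tau2, z=1.96):
    """Normal-theory interval for the pooled mean, given a fixed tau^2."""
    weights = [1.0 / (v + tau2) for v in variances]
    sum_w = sum(weights)
    mu_hat = sum(w * y for w, y in zip(weights, effects)) / sum_w
    half_width = z / math.sqrt(sum_w)
    return mu_hat - half_width, mu_hat + half_width


# Hypothetical meta-analysis of five studies.
effects = [0.05, 0.30, 0.12, 0.45, 0.20]
variances = [0.02, 0.03, 0.02, 0.05, 0.04]

widths = []
for tau2 in (0.0, 0.05, 0.20):
    lo, hi = mu_interval(effects, variances, tau2)
    widths.append(hi - lo)
    print(f"tau^2 = {tau2:.2f}: [{lo:.3f}, {hi:.3f}]")
```

Setting tau^2 to zero gives the narrowest interval, and the interval widens as more between-study variability is allowed, which is why a mis-estimated tau leads to miscalibrated intervals for mu.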
