Default informative priors for effect sizes: Where do they come from?

To coincide with the publication of our article, A Proposal for Informative Default Priors Scaled by the Standard Error of Estimates, Erik van Zwet sends along an explainer. Here’s Erik:

1 Set-up

This note is meant as a quick explainer of a set of three pre-prints at The Shrinkage Trilogy. All three have the same simple set-up: We abstract a “study” as a triple (beta,b,s) where

– beta is the parameter of interest
– b is an unbiased, normally distributed estimate of beta
– s is the standard error of b.

In other words, we are assuming that our estimate b has the normal distribution with mean beta and standard deviation s. We do not observe beta, but we do observe the pair (b,s).

We define the z-value z=b/s and the signal-to-noise ratio SNR=beta/s. Note that the z-value is the sum of the SNR and independent standard normal “noise”. This means that the distribution of the z-value is the convolution of the distribution of the SNR with the standard normal density.
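
As a quick illustration of that convolution (the SNR distribution here is made up purely for the example):

set.seed(1)
snr.sim = rnorm(10^5, 0, 2)       # a hypothetical SNR distribution, for illustration only
z.sim   = snr.sim + rnorm(10^5)   # z-value = SNR + independent standard normal noise
c(var(snr.sim), var(z.sim))       # variances differ by about 1, as the convolution implies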

It is not difficult to estimate the distribution of z-values if we have a sample of study results from a particular field of study. Subsequently, we can obtain the distribution of the SNRs in that field by deconvolution. Moreover, we also know the conditional distribution of the z-value given the SNR; it’s just normal with mean SNR and standard deviation 1. So, we can actually get the joint distribution of the z-value and the SNR.

So, we’re going to estimate the distribution of z=b/s, deconvolve to get the distribution of SNR=beta/s, and scale that distribution by s to get a prior for beta given s. We can then use conjugate theory to compute the posterior of beta given b and s. The posterior mean of beta is a useful shrinkage estimator. Shrinkage is very important, because the signal-to-noise ratio is often very low and therefore |b| tends to overestimate (exaggerate) |beta|. This is especially bad when we condition on statistical significance (|z|>1.96).

2 z-value and SNR

To estimate the distribution of the z-value in some particular field of research, we need an “honest” sample that is free from publication bias, file drawer effect, fishing, forking paths etc. Recently, Barnett and Wren (2019) collected more than a million confidence intervals from Medline publications (data are here). We converted those to z-values and display the histogram below. The striking shortage of z-values between -2 and 2 suggests strong publication bias. This biased sample of z-values is not suitable for our purpose.
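
For reference, a 95% confidence interval of the form estimate ± 1.96 standard errors converts to a z-value as follows (a sketch with made-up numbers; ci2z is just an illustrative helper):

ci2z = function(lower, upper) {
  est = (lower + upper) / 2            # point estimate (midpoint of a symmetric interval)
  se  = (upper - lower) / (2 * 1.96)   # standard error implied by the interval width
  est / se                             # z-value
}
ci2z(0.2, 1.0)   # example: about 2.94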

Simon Schwab (2020) collected more than 20,000 z-values from RCTs from the Cochrane database (data are here). The histogram shows much less publication bias. This may be due to the fact that many studies in the Cochrane database are pre-registered, and to the efforts of the Cochrane collaboration to find unpublished results.

We fitted a mixture of 3 normal distributions to the z-values from the Cochrane database. We show the fit in the histogram above, and note that it is quite satisfactory.

fit=flexmix::flexmix(z ~ 1, k = 3)  # estimate mixture distribution of z
p=summary(fit)@comptab$prior        # mixture proportions
mu=parameters(fit)[1,]              # mixture means
sigma=parameters(fit)[2,]           # mixture standard deviations
round(data.frame(p,mu,sigma),2)
##           p    mu sigma
## Comp.1 0.46 -0.25  1.25
## Comp.2 0.08 -0.76  5.39
## Comp.3 0.46 -0.13  2.18

We can now get the distribution of the SNR by deconvolution of the distribution of the z-value with the standard normal distribution. Deconvolution is not easy in general, but in our case it is trivial. Since we estimated the distribution of the z-value as a normal mixture, we can simply subtract 1 from the variances of the mixture components. We plot the densities of z and SNR together, and see that the density of the z-value is a “smeared-out” version of the density of the SNR.

tau=sqrt(sigma^2-1); round(tau,2)  # deconvolution; standard deviations of the SNR  
## Comp.1 Comp.2 Comp.3 
##   0.75   5.30   1.94

3 Power

The power of the two-sided test of H_0 : beta=0 at level 5% is

P(|z|>1.96 | beta,s) = pnorm(SNR-1.96) + 1 - pnorm(SNR+1.96).

Since the power is just a function of the SNR, we can transform a sample from the distribution of the SNR into a sample from the distribution of the power (see also the histogram below).

rmix = function(n,p,mean,sd){   # sample from a normal mixture
  d=rmultinom(n,1,p)
  rnorm(n,mean%*%d,sd%*%d)
}
snr=rmix(10^6,p,mu,tau) 
power=pnorm(snr - 1.96) + 1 - pnorm(snr + 1.96)
S=summary(power); round(S,2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.05    0.07    0.14    0.28    0.39    1.00

We see that the median power across the Cochrane database is about 14%, while the average power is about 28%. The average power can be interpreted as the probability that a randomly selected result from the Cochrane collection is significant. And indeed, 29% of our z-values exceed 1.96 in absolute value. The fact that the (achieved) power is often very low should not surprise us; see The “80% power” lie. However, it also does not imply that the usual sample size calculations aiming for 80% or 90% power are necessarily wrong. The goal of such calculations is to have high power against a particular alternative that is considered to be of clinical interest – the effect “you would not want to miss”. That such high power is often not achieved just goes to show that medical research is hard, and treatments often do not provide the benefit that was hoped for.
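
We can check that correspondence directly (here z is still the vector of Cochrane z-values and power is the simulated sample from above):

c(mean(power), mean(abs(z) > 1.96))   # average achieved power vs. observed significance rate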

It is also possible to condition the power on a specific z-value. This allows us to assess the probability that a replication of a specific result will be significant.
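
For example, a rough way to do this with the simulated sample from above is to select the simulated studies whose z-value came out near some observed value, say 2.5, and summarize their achieved power; the mean of that conditional distribution estimates the probability that an exact replication would again be significant:

z.rep = rnorm(10^6, snr, 1)        # simulate a z-value for each sampled SNR
sel   = abs(z.rep - 2.5) < 0.05    # condition on observing a z-value close to 2.5
round(summary(power[sel]), 2)      # conditional distribution of the achieved power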

4 Exaggeration

Low power is a serious problem as it leads to overestimation of effects (type M error) and over-confidence about their sign (type S error), see This is what “power = .06” looks like. Get used to it.

The “exaggeration ratio” is |b/beta| = |z/SNR|. Since it is just a function of the z-value and the SNR, we can easily get a sample from its distribution (see also the histogram below).

z=rnorm(10^6,snr,1)
exaggeration=abs(z/snr)
Q=quantile(exaggeration,c(0.1,0.25,0.5,0.75,0.9))
round(Q,2)
##  10%  25%  50%  75%  90% 
## 0.36 0.76 1.23 2.30 5.44

We see that the median exaggeration is 1.23. That means that half the studies overestimate the effect by at least 23%.

It is also possible to condition the exaggeration on a specific z-value. This allows us to assess the exaggeration of a specific estimate. We can then correct for this by shrinking the estimate.
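
For example, conditioning again on z near 2.5 with the simulated sample from above:

sel = abs(z - 2.5) < 0.05                                   # studies with z close to 2.5
round(quantile(exaggeration[sel], c(0.25, 0.5, 0.75)), 2)   # conditional exaggeration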

5 Shrinkage

We can use shrinkage (regularization) to correct the exaggeration. We have estimated the distribution of the SNR as a normal mixture parameterized by (p,mu,tau). Recalling that SNR=beta/s, we scale this distribution by s to get a distribution for beta. So, the distribution of beta is a normal mixture parameterized by (p,s*mu,s*tau).

We can now compute the conditional (or posterior) distribution of beta given the pair (b,s). It is again a normal mixture distribution.

posterior = function(b,s,p,mu,sd){  # compute conditional distr of beta given (b,s)
                                    # mixture distr of beta given by (p,mu,sd)
  s2=s^2
  sd2=sd^2
  q=p*dnorm(b,mu,sqrt(sd2+s2))      # conditional mixing probs
  q=q/sum(q)
  pm=(mu*s2 + b*sd2)/(sd2+s2)       # conditional means
  pv=sd2*s2/(sd2+s2)                # conditional variances
  ps=sqrt(pv)                       # conditional std devs
  data.frame(q,pm,ps)
}

As an example, we compute the conditional (posterior) distribution of beta given b=2 and s=1. It is a normal mixture with the following parameters:

b=2; s=1
post=posterior(b,s,p,s*mu,s*tau)
round(post,2)
##           q   pm   ps
## Comp.1 0.34 0.56 0.60
## Comp.2 0.06 1.90 0.98
## Comp.3 0.60 1.55 0.89

In particular, we can use the conditional (posterior) mean as an estimator.

post.mean=sum(post$q * post$pm) 
round(post.mean,2)              # posterior mean of beta
## [1] 1.24
round(post.mean/b,2)            # shrinkage factor
## [1] 0.62
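
One could also report a 95% posterior interval by inverting the CDF of the normal-mixture posterior; a sketch using KScorrect::qmixnorm (the package must be installed; any normal-mixture quantile function would do):

# 95% posterior interval for beta from the mixture parameters computed above
KScorrect::qmixnorm(c(0.025, 0.975), mean = post$pm, sd = post$ps, pro = post$q)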

6 Conclusion

Low power is very common. It leads to overestimation of effects (a.k.a. exaggeration, inflation or type M error) which must be corrected by shrinkage. For more details, we refer to The Shrinkage Trilogy. Here we end with three remarks.

Undoubtedly, the Cochrane database suffers from at least some publication bias, file drawer effects, fishing, forking paths etc. Unfortunately, this means that the power is likely to be even lower and the exaggeration even greater.

Strictly speaking, our analysis concerns the results in the Cochrane database. However, we believe that problems of low power and exaggeration are similar – if not worse – in many other areas of research.

Our estimate of the distribution of the SNR is quite close to the standard Cauchy distribution, which we recommend as a default prior without reference to the Cochrane database. Of course, nothing beats real, substantive prior information that is specific to the study of interest.
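
As a quick check, one can overlay the estimated SNR density on the standard Cauchy density:

x    = seq(-10, 10, by = 0.01)
dsnr = rowSums(sapply(seq_along(p), function(i) p[i] * dnorm(x, mu[i], tau[i])))
plot(x, dsnr, type = "l", ylab = "density")   # estimated SNR density (normal mixture)
lines(x, dcauchy(x), lty = 2)                 # standard Cauchy density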

15 thoughts on “Default informative priors for effect sizes: Where do they come from?”

  1. This is a very alarming recommendation and I would strenuously oppose it.

    Two comments:

    – achieved power is just post-hoc power. I don’t think this is a sensible way to talk about what they are computing. I would be more careful, and say instead that *if* the published estimates are treated as true values, *then* for a future study, power would be … So, I would talk only in terms of prospective power, otherwise we are going to land in the same confused space again where people compute power after collecting data to justify that they had enough data.

    Reference: Hoenig, J. M., and Heisey, D. M. (2001). The abuse of power: the pervasive fallacy of power calculations for data analysis. The American Statistician, 55(1), 19-24.

    – We can’t seriously be talking about using Cauchy priors today, at least not in the context of finding out whether an effect is “present” or “absent”, and especially not in medicine (I’m OK with doing this in useless areas where nothing matters anyway). There is already enough abuse of Bayesian methods happening as it is (as someone predicted years ago). E.g., Montero-Melis et al. (2020) compute a Bayes factor in a Cortex pre-registration with a vague prior (Normal(0,2^2)) on the parameter of interest (see: https://osf.io/krvt2/). They criticize work by Pulvermueller for having a low BF of 1.7, but they biased the BF in favor of the null by setting such a vague prior. Yeah, sure, I can show that all published results have a low BF with a vague prior, no problem. See the extensive discussion about the sensitivity of priors on BFs in Kruschke, Wagenmakers and Rouder’s work.

    • Shravan:

      1. Thanks for commenting, but the achieved power is certainly not the same as post-hoc power. By achieved power we really mean the probability of rejecting H0 under the true value of beta. Of course, the true value of beta is never observed, so we never observe the achieved power either. However, as we describe above, we can estimate the distribution of the achieved power. This estimated distribution is shown in the histogram.

      2. We’re not talking about Bayes factors, or finding out whether an effect is present or absent. We are only saying that the distribution of the signal-to-noise ratio (SNR) among the RCTs in the Cochrane database is fairly close to standard Cauchy. It’s the red distribution in the second figure above.

      • My experience has been that unless one explicitly says that one is not talking about hypothesis testing using BFs, the reader (a newcomer to Bayes) will infer that a default prior is the way to go for any kind of Bayesian modeling, including BFs. So I would suggest explicitly pointing out that BFs are off the table here. People may easily be misled by the bottom line of this work: “Our estimate of the distribution of the SNR is quite close to the standard Cauchy distribution, which we recommend as a default prior without reference to the Cochrane database.” Personally, I feel one should focus more on showing examples of how to elicit priors for their own particular problems (echoing gec’s points below). BTW, the last time I used Cauchy priors, they caused all manner of problems in Stan.

        Is achieved power a term in the field? I have never heard it. I know about “post-hoc power” and power as in prospective power for a future study. What is the exact definition of “achieved power”? From the name it sounds like something you compute given some data and it seems to refer to the power properties of the *already observed* data. Is that incorrect?

        • The power is P(|z| > 1.96 | beta,s), which is a function of beta (and of the standard error s, but we’ll ignore that). For sample size calculations, we would take beta to be “the effect we wouldn’t want to miss”. If we take beta to be equal to the estimated effect b, then we get the abominable post-hoc power. If beta is the true effect, then we get the “achieved power”. We generated a sample of size 1 million from the estimated distribution of this achieved power over the RCTs in the Cochrane database, and that sample is shown in the histogram above. I think I got the term “achieved power” from Stephen Senn.

    • I echo many of Shravan’s concerns, but I think they have to do with the fact that this paper is doing a lot of cool things and the framing makes it somewhat difficult to tell what the message is.

      Using a database that represents as unbiased a sample as you could get means that you actually have a good handle on estimating effect sizes and therefore power. This is really cool, and I think it’s cool that we can, post-hoc, infer the degree to which the single-study effect sizes are likely to over-estimate the true effect sizes. This is like the Ioannidis paper but, instead of making educated guesses, you have legit estimates.

      This is all a really cool exploration of the relationship between data and inference within a particular area and is useful qualitative guidance. But then there is the recommendation for a particular Cauchy prior as default, and this is where Shravan and I get stuck. I think we’d all agree that a genuine informed prior is best, and that if you had to pick between a uniform and a Cauchy, you should probably pick the Cauchy. But I have two worries:

      1) The distribution of effect sizes in the paper was inferred based on a large corpus of studies, and wouldn’t it make sense for a researcher to do the same for their particular domain, rather than import a default from somewhere else that doesn’t necessarily apply? Indeed, I think a great value of the paper is showing a way to estimate effect sizes from a corpus so researchers could follow this step.

      2) Although you are careful to say in the paper that the default is meant as a starting point, in practice most users of statistical software don’t adjust from the starting point. Sure, we can say that, given that people will just use the default settings, we should try to make the defaults as good as we can. But I think it would be better to not have any defaults at all. Yes, this would result in fewer analyses being performed (or maybe fewer people doing them) because they would require making more decisions, but maybe that’s not a bad thing.

      • gec: I’m glad you think our paper has many cool things. Thanks!

        1. Yes, I agree. It would be best if a researcher uses a corpus that’s suited to their application. The next best thing would be a large “general purpose” corpus. For example, “all phase III RCTs” or even “all RCTs” as discussed here. The Cauchy is a last resort – we call it a “default default” prior in the paper.

        2. As Andrew puts it here: “There’s always a default. Choose your default, or your default will choose you.” The fact is that the uniform prior is currently the default in 99% of the cases, and that’s pretty much the worst possible choice.

        • Indeed, as the renowned statistician Neil Peart once wrote, “If you choose not to decide // you still have made a choice”!

        • Great, now that will be stuck in my head for the rest of the day…

          I wonder if there is another word besides “default” that we could use that would highlight the need to not accept them thoughtlessly. Placeholder prior? Stopgap prior? Canned prior?

  2. I regard the t(1) prior, used as a default for a logistic coefficient beta, as no better in any meaningful way than the improper uniform prior, as can be seen by how little difference it makes (apart from the extreme case of complete separation, where any method is doing dubious interpolation): It places way too much posterior probability on absurdly large values, e.g., it implies an upper 2.5% prior percentile for the odds ratio exp(beta) of about 330,000, which, if the outcome were a death indicator, corresponds to a treatment that would kill over 99.99% of the treated if the background (control) risk of death were only 1 per thousand (0.1%).
    Hence the t(1) is just too vague to bother with over the uniform prior – both are absurd in any reasonable application of logistic regression.

    The next-least-vague option is the Jeffreys prior, which is data-dependent but in a good way, insofar as it removes 2nd-order frequentist bias in the estimated logistic coefficients – its use corresponds to the Firth ML bias adjustment. But it’s still absurdly vague as a treatment prior.

    My favorite practical default prior for logistic regression, in the sense of not being informative enough for anyone to seriously object to it for a treatment coefficient beta, is the logistic prior density
    f(beta) = exp(beta)/(1+exp(beta))^2
    which is a natural extension of the Laplace prior to logistic coefficients: It contributes two coin-tosses (two bits) worth of information, one heads, one tails, to the estimation of beta. It’s symmetric about zero and places 95% of its mass on odds ratios exp(beta) between 1/39 and 39. Even our best vaccines are well within this range (e.g., 95% effectiveness corresponds to an odds ratio of no more than 19 for the unvaccinated vs. the vaccinated).
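
    (A quick check of that range: the central 95% interval of the standard logistic distribution, mapped to the odds-ratio scale.)

    round(exp(qlogis(c(0.025, 0.975))), 3)   # roughly 0.026 and 39, i.e., 1/39 to 39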

    The logistic density is a special case (with a=b=1) of the conjugate prior for the binomial-logistic model, the log-F(2a,2b) prior with easily graphed density
    f(beta) = exp(a*beta)/(1+exp(beta))^(a+b)
    in which a+b is the number of bits of information added by the prior: the same information as that added to the binomial parameter
    p = exp(beta)/(1+exp(beta)) upon observing a heads and b tails in a+b independent tosses.

    Setting a = b and m = 2a = 2b creates a log-F(m,m) prior symmetric about zero; the tuning parameter m controls the dispersion about zero and equals the information content of the prior measured in bits.
    As m increases this prior approaches a normal(0,4/m) density, but has heavier tails when m is small, albeit not crazy heavy like the t(1).
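
    A quick simulation check of that limit, using the Beta-to-log-F correspondence described further down in this thread (if p is Beta(a,b) then logit(p) is log-F(2a,2b)):

    set.seed(1)
    m = 16
    y = qlogis(rbeta(10^5, m/2, m/2))   # draws from the log-F(m,m) distribution
    c(sd(y), 2/sqrt(m))                 # empirical SD vs. the normal(0, 4/m) approximation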

    I find that no other proposed default comes close to the log-F(2a,2b) density in elegance, ease of comprehension in straightforward information terms, and ease of use. They can be used in a posterior sampler, but unlike with other priors, the resulting posterior statistics can be computed from ordinary ML logistic software just by adding one pseudo-record per coefficient with a successes and b failures. The resulting “MLEs” are the maximum-penalized-likelihood estimates, which are the exact posterior modes and approximate posterior means obtained by using the log-prior density as a likelihood penalty function (and to a better approximation than standard ML), while the CIs are approximate posterior intervals. Using the profile-likelihood interval option in SAS and Stata, in no real dataset I’ve had were these statistics different to any practical degree from the corresponding MCMC results.
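
    As a minimal sketch of that pseudo-record trick (simulated data, purely for illustration): a log-F(2,2) prior on a treatment coefficient is imposed by appending one record carrying 1 success and 1 failure whose only nonzero regressor is the treatment, while a “real data” indicator plays the role of the intercept so the intercept itself gets no prior.

    set.seed(1)
    n     = 50
    trt   = rbinom(n, 1, 0.5)
    y     = rbinom(n, 1, plogis(-1 + 0.5 * trt))
    dat   = data.frame(succ = y, fail = 1 - y, real = 1, trt = trt)
    prior = data.frame(succ = 1, fail = 1, real = 0, trt = 1)   # a = b = 1 pseudo-record
    fit   = glm(cbind(succ, fail) ~ 0 + real + trt, family = binomial,
                data = rbind(dat, prior))
    coef(fit)   # penalized-ML estimates: posterior modes under the log-F(2,2) prior on trt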

    For further details about general log-F priors (allowing location and scale parameters) and how to code prior pseudo-records for them see Greenland S. Prior data for non-normal priors. Statistics in Medicine 2007;26:3578-90.
    For more on the log-F(m,m) as a default family and its comparison to a Jeffreys prior see Greenland S, Mansournia M. Penalization, bias reduction, and default priors in logistic and related categorical and survival regressions. Statistics in Medicine, 2015;34:3133–43.

    • Sander:

      Regarding the t(1) prior, I don’t use it myself anymore, but it did perform well on a corpus. I’ve used it in other applications too, and I much prefer it to the uniform prior. I’ve not tried the log-F density, but I guess we could try it in Stan and see how it goes. We use MCMC for convenience, but when speed is an issue we can use a mode-based approximation for non-hierarchical models.

  3. I think any of the proposed default priors will cure separation and outperform ML in real problems where using logistic regression is a sensible tool, provided the prior is dispersed enough;
    and in that case the defaults will not display serious practical differences in our sort of “soft science” logistic regression applications which are marked by the absence of known overwhelming effects.

    From your Annals of Applied Statistics 2008 paper you seem to agree, albeit in using the t family you take a somewhat more dispersed view of “dispersed enough”. My working definition of “dispersed enough” for a targeted treatment would start with the logistic = log-F(2,2) distribution, which corresponds to the Laplace reference point of adding 1 success and 1 failure, and for other covariates or situations perhaps wider, but not more so than the Haldane = log-F(1,1) distribution, which you take as a reference point at the bottom of p. 1364 of your 2009 paper as adding a half success and a half failure.
    I also want no prior on the intercept (which you practically do as well, by using an enormously dispersed prior for it).

    It seems then we are very close about what we are after and the practical differences in the answers from our defaults are small.
    What if anything then is left to choose among the defaults?
    All I can see at the moment are two very correlated properties:
    1) ease of alternative explanations of the meanings of the priors (reflecting my concern about fostering better use by nonmethodologists), and
    2) ease of correct implementation (by nonmethodologists who are stuck with SAS or similar, not about to use R or the like).
    For those purposes I’ve argued that the log-F family seems better suited than other proposals because
    a) it always has an exact correspondence to adding successes and failures, and so
    b) because of that exact correspondence, its implementation with ordinary ML logistic software is as simple as one can get: As described in my 2007 SIM article one just adds in the successes and failures in a record with zero regressor entries everywhere except 1 for the covariate those prior data refer to (the intercept is counted, becoming the covariate of a coefficient which indicates that a record is of real data). If no prior is desired for a coefficient, simply add no prior record for it.

    Items (a) and (b) are of course the conceptual and computational analogs of the conjugacy of the log-F for logistic coefficients, which provides an even more intuitive correspondence than that of the beta prior for proportions:
    The pesky -1 decrements of the added counts in the beta density disappear upon transformation to the log-F(2a,2b), as can be seen from the fact that
    if p is beta with density p^(a-1)*(1-p)^(b-1) then y = logit(p) is log-F(2a,2b) with density exp(ay)/(1+exp(y))^(a+b). So now a and b can both be more clearly seen as added counts, a bit (actually a+b bits) more concrete than the more general treatment of a prior as added information (in the likelihood sense).

    It’s also not hard to add in prior correlations using the log-F, with no need to increase the number, or reduce the simplicity, of the added data records. I find the easiest way to do that with any method is to specify a hierarchical prior to create conditional independencies (I think you do as well).

    Having tried to state the reasons for my preference precisely, I wonder in what precise way you could express holding a preference for the t-family (as in your AAS article) or whatever you prefer now.
    I’ll hazard a guess it has something to do with wanting more tail weight, which I could provide by lowering the degrees of freedom in the log-F distribution and, if I want, rescaling just as you do for the t distribution.
    It would then come down to whether we take tails around inverse-square order seriously as opposed to being content with inverse exponential tails (which for me are heavy enough compared to the inverse exponential-square tails of the normal). If so maybe you were thinking more of physics than I would (again just a guess).

  4. Dear Erik,

    I’ve been reading your “Shrinkage Trilogy” and would like to congratulate you on the amazing work. Truly insightful. I have a quick “mathematical” question:

    In this post, you estimated the mean of the posterior distribution as “sum(post$q * post$pm)”, as you did in your “A Proposal for Informative Default Priors Scaled by the Standard Error of Estimates” article with Dr. Gelman.

    Similarly, in your “The statistical properties of RCTs and a proposal for shrinkage” article (Reference 2), you estimated the mean of beta_hat (shrinkage estimator), as “s * sum(res$q * res$m)”, and then the 95% interval using KScorrect::qmixnorm().

    Instead of the 95% interval, could one estimate the variance of beta_hat as “s * sum(res$q * res$sigma)”, following the code from Reference 2?
