## What’s a good default prior for regression coefficients? A default Edlin factor of 1/2?

**The punch line**

“Your readers are my target audience. I really want to convince them that it makes sense to divide regression coefficients by 2 and their standard errors by sqrt(2). Of course, additional prior information should be used whenever available.”

**The background**

It started with an email from Erik van Zwet, who wrote:

In 2013, you wrote about the hidden dangers of non-informative priors:

Finally, the simplest example yet, and my new favorite: we assign a non-informative prior to a continuous parameter theta. We now observe data, y ~ N(theta, 1), and the observation is y=1. This is of course completely consistent with being pure noise, but the posterior probability is 0.84 that theta>0. I don’t believe that 0.84. I think (in general) that it is too high.

I agree – at least if theta is a regression coefficient (other than the intercept) in the context of the life sciences.

In this paper [which has since been published in a journal], I propose that a suitable default prior is the normal distribution with mean zero and standard deviation equal to the standard error SE of the unbiased estimator. The posterior is the normal distribution with mean y/2 and standard deviation SE/sqrt(2). So that’s a default Edlin factor of 1/2. I base my proposal on two very different arguments:

1. The uniform (flat) prior is considered by many to be non-informative because of certain invariance properties. However, I argue that those properties break down when we reparameterize in terms of the sign and the magnitude of theta. Now, in my experience, the primary goal of most regression analyses is to study the direction of some association. That is, we are interested primarily in the sign of theta. Under the prior I’m proposing, P(theta > 0 | y) has the standard uniform distribution (Theorem 1 in the paper). In that sense, the prior could be considered to be non-informative for inference about the sign of theta.

2. The fact that we are considering a regression coefficient (other than the intercept) in the context of the life sciences is actually prior information. Now, almost all research in the life sciences is listed in the MEDLINE (PubMed) database. In the absence of any additional prior information, we can consider papers in MEDLINE that have regression coefficients to be exchangeable. I used a sample of 50 MEDLINE papers to estimate the prior and found the normal distribution with mean zero and standard deviation 1.28*SE. The data and my analysis are available here.

The two arguments are very different, so it’s nice that they yield fairly similar results. Since published effects tend to be inflated, I think the 1.28 is somewhat overestimated. So, I end up recommending the N(0,SE^2) as default prior.

I think it makes sense to divide regression coefficients by 2 and their standard errors by sqrt(2). Of course, additional prior information should be used whenever available.
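Erik's recommendation is easy to state in code. Here's a minimal sketch of the conjugate-normal algebra behind it (the function name and the y = 1 check are mine, not his):

```python
import math

def shrink(y, se, prior_sd=None):
    """Posterior mean and sd for theta, given y ~ N(theta, se^2)
    and the prior theta ~ N(0, prior_sd^2). The default prior_sd = se
    is van Zwet's proposal, which gives an Edlin factor of 1/2."""
    if prior_sd is None:
        prior_sd = se
    w = prior_sd**2 / (prior_sd**2 + se**2)   # weight on the data
    post_mean = w * y
    post_sd = math.sqrt(1.0 / (1.0 / se**2 + 1.0 / prior_sd**2))
    return post_mean, post_sd

m, s = shrink(y=1.0, se=1.0)   # the y = 1 example: posterior N(0.5, 1/2)
```

With prior sd equal to the standard error, the weight on the data is always 1/2, which is the whole proposal in one line.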

Hmmm . . . one way to think about this idea is to consider where it doesn’t make sense. You write, “a suitable default prior is the normal distribution with mean zero and standard deviation equal to the standard error SE of the unbiased estimator.” Let’s consider two cases where this default won’t work:

– The task is to estimate someone’s weight with one measurement on a scale where the measurements have standard deviation 1 pound, and you observe 150 pounds. You’re not going to want to partially pool that all the way to 75 pounds. The point here, I suppose, is that the goal of the measurement is not to estimate the sign of the effect. But we could do the same reasoning where the goal was to estimate the sign. For example, I weigh you, then I weigh you again a year later. I’m interested in seeing if you gained or lost weight. The measurement was 150 pounds last year and 140 pounds this year. The classical estimate of the difference of the two measurements is 10 +/- 1.4. Would I want to partially pool that all the way to 5? Maybe, in that these are just single measurements and your weight can fluctuate. But that can’t be the motivation here, because we could just as well take 100 measurements at one time and 100 measurements a year later, so now maybe your average is, say, 153 pounds last year and 143 pounds this year: an estimated change of 10 +/- 0.14. We certainly wouldn’t want to use a super-precise prior with mean 0 and sd 0.14 here!

– The famous beauty-and-sex-ratio study where the difference in probability of girl birth, comparing children of beautiful and non-beautiful parents, was estimated from some data to be 8 percentage points +/- 3 percentage points. In this case, an Edlin factor of 0.5 is not enough. Pooling down to 4 percentage points is not enough pooling. A better estimate of the difference would be 0 percentage points, or 0.01 percentage points, or something like that.

I guess what I’m getting at is that the balance between prior and data changes as we get more information, so I don’t see how a fixed amount of partial pooling can work.
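To make that concrete, here's a small sketch contrasting the scale-tracking default with a fixed informative prior (the 5-pound prior sd is an assumed value, purely for illustration):

```python
def post_mean(y, se, prior_sd):
    # conjugate normal posterior mean, prior N(0, prior_sd^2)
    return y * prior_sd**2 / (prior_sd**2 + se**2)

# The same estimated 10-pound weight change at two precisions.
# The default prior tracks se, so the estimate is halved either way;
# a fixed informative prior (sd = 5 pounds, an assumed value) instead
# lets the data dominate as the standard error shrinks.
for se in (1.4, 0.14):
    default = post_mean(10.0, se, prior_sd=se)   # always 5.0
    fixed = post_mean(10.0, se, prior_sd=5.0)    # about 9.3, then about 10.0
```

The fixed prior behaves the way we expect partial pooling to behave: more data, less shrinkage. The default's shrinkage factor never budges from 1/2.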

That said, maybe I’m missing something here. After all, a default can never cover all cases, and the current default of no partial pooling or flat prior has all sorts of problems. So we can think more about this.

P.S. In the months since I wrote the above post, Zwet sent along further thoughts:

Since I emailed you in the fall, I’ve continued thinking about default priors. I have a clearer idea now about what I’m trying to do:

In principle, one can obtain prior information for almost any research question in the life sciences via a meta-analysis. In practice, however, there are (at least) three obstacles. First, a meta-analysis is extra work and that is never popular. Second, the literature is not always reliable because of publication bias and such. Third, it is generally unclear what the scope of the meta-analysis should be.

Now, researchers often want to be “objective” or “non-informative”. I believe this can be accomplished by performing a meta-analysis with a very wide scope. One might think that this would lead to very diffuse priors, but that turns out not to be the case! Using a very wide scope to obtain prior information also means that the same meta-analysis can be recycled in many situations.

The problem of publication bias in the literature remains, but there may be ways to handle that. In the paper I sent earlier, I used p-values from univariable regressions that were used to “screen” variables for a multivariable model. I figure that those p-values should be largely unaffected by selection on significance, simply because that selection is still to be done!

More recently, I’ve used a set of “honest” p-values that were generated by the Open Science Collaboration in their big replication project in psychology (Science, 2015). I’ve estimated a prior and then computed type S and M errors. I attach the results together with the (publicly available) data. The results are also here.

Zwet’s new paper is called Default prior for psychological research, and it comes with two data files, here and here.

It’s an appealing idea, and in practice it should be better than the current default Edlin factor of 1 (that is, no partial pooling toward zero at all). And I’ve talked a lot about constructing default priors based on empirical information, so it’s great to see someone actually doing it. Still, I have some reservations about the specific recommendations, for the reasons expressed in my response to Zwet above. Like him, I’m curious about your thoughts on this.

I also wrote something on this in our Prior Choice Recommendations wiki:

Default prior for treatment effects scaled based on the standard error of the estimate

Erik van Zwet suggests an Edlin factor of 1/2. Assuming that the existing or published estimate is unbiased with known standard error, this corresponds to a default prior that is normal with mean 0 and sd equal to the standard error of the data estimate. This can’t be right–for any given experiment, as you add data, the standard error should decline, so this would suggest that the prior depends on sample size. (On the other hand, the prior can often only be understood in the context of the likelihood; http://www.stat.columbia.edu/~gelman/research/published/entropy-19-00555-v2.pdf, so we can’t rule out an improper or data-dependent prior out of hand.)

Anyway, the discussion with Zwet got me thinking. If I see an estimate that’s 1 se from 0, I tend not to take it seriously; I partially pool it toward 0. So if the data estimate is 1 se from 0, then, sure, the normal(0, se) prior seems reasonable as it pools the estimate halfway to 0. But if the data estimate is, say, 4 se’s from zero, I wouldn’t want to pool it halfway: at this point, zero is not so relevant. This suggests something like a t prior. Again, though, the big idea here is to scale the prior based on the standard error of the estimate.

Another way of looking at this prior is as a formalization of what we do when we see estimates of treatment effects. If the estimate is only 1 standard error away from zero, we don’t take it too seriously: sure, we take it as some evidence of a positive effect, but far from conclusive evidence–we partially pool it toward zero. If the estimate is 2 standard errors away from zero, we still think the estimate has a bit of luck to it–just think of the way in which researchers, when their estimate is 2 se’s from zero, (a) get excited and (b) want to stop the experiment right there so as not to lose the magic–hence some partial pooling toward zero is still in order. And if the estimate is 4 se’s from zero, we just tend to take it as is.
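One way to see the difference is to compute the posterior mean under a heavier-tailed prior numerically. The sketch below uses a t prior with 3 degrees of freedom and unit scale, illustrative choices of mine rather than anything proposed in the discussion; the normal(0, se) prior always pools halfway, while the t prior pools small estimates but mostly leaves 4-se estimates alone:

```python
import math

def post_mean_t_prior(z, df=3.0, scale=1.0):
    """Posterior mean of theta (in se units), given z ~ N(theta, 1) and a
    t prior on theta, computed by grid integration. The df=3 and scale=1
    are illustrative choices, not values from the post."""
    num = den = 0.0
    n, half = 40001, 20.0
    for i in range(n):
        t = -half + 2 * half * i / (n - 1)
        # unnormalized t density times the normal likelihood
        prior = (1 + (t / scale) ** 2 / df) ** (-(df + 1) / 2)
        lik = math.exp(-0.5 * (z - t) ** 2)
        num += t * prior * lik
        den += prior * lik
    return num / den

for z in (1.0, 2.0, 4.0):
    pooled_half = z / 2              # normal(0, se) prior: always halfway
    pooled_t = post_mean_t_prior(z)  # heavy tails: less pooling for big z
```

The proportional shrinkage under the t prior falls off as z grows, which is exactly the behavior described above: take 1-se estimates with a grain of salt, take 4-se estimates more or less as is.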

I sent some of the above to Zwet, who replied:

I [Zwet] proposed the default Edlin factor of 1/2 only when the estimate is less than 3 se’s away from zero (or rather, p<0.001). I used a mixture of two zero-mean normals: one with sd=0.68 and the other with sd=3.94. I’m quite happy with the fit. The shrinkage is a little more than 1/2 when the estimate is close to zero, and disappears gradually for larger estimates. It’s in the data! You can see it when you do a “wide scope” meta-analysis.
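That two-component prior has a closed-form posterior mean, sketched below. He doesn't state the mixing weight in this note, so the w = 0.5 is an assumed placeholder, not his fitted value:

```python
import math

def mixture_post_mean(z, w=0.5, s1=0.68, s2=3.94):
    """Posterior mean of theta (in se units), given z ~ N(theta, 1) and the
    prior w*N(0, s1^2) + (1-w)*N(0, s2^2). The sds are from Zwet's fit;
    the weight w = 0.5 is an assumed placeholder."""
    parts = []
    for wt, s in ((w, s1), (1 - w, s2)):
        v = s * s
        # marginal density of z under this component, times its weight
        marg = wt * math.exp(-0.5 * z * z / (1 + v)) / math.sqrt(1 + v)
        parts.append((marg, z * v / (1 + v)))  # component posterior mean
    total = sum(p for p, _ in parts)
    return sum(p * pm for p, pm in parts) / total

# Shrinkage is a bit more than 1/2 near zero and fades for large z:
for z in (0.5, 1.0, 2.0, 4.0):
    est = mixture_post_mean(z)
```

Even with the placeholder weight, the qualitative behavior he describes shows up: roughly halving near zero, with the shrinkage dissolving once the estimate is several se's out.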

1. Thomas Passin says:

When you get right down to it, using a flat prior amounts to assuming that the system has a distribution that you know is wrong, often wildly wrong – because not many real measurements have a flat distribution of the outcome. That doesn’t seem to be a good idea.

When you combine two sets of real measurements, you weight them by their (inverse) variances. But if they disagree by too much, you ought to be suspicious that there is something wrong about one or both of the data sets, rather than just pooling them and being done with it.

In combining a prior with data from a measurement, you should be no less stringent. Using assumed data (the prior) has to bring in more uncertainty than using actual data. So the weight on the prior ought to be smaller than it would have been for real data. Basically, this amounts to taking into account the reliability of the prior.

So for practical purposes, you should increase the variance of the prior over what you might otherwise have thought it should be. How much? That would depend on how reliable you know the prior to be. If you don’t know much, then it can’t be very reliable. If you know it’s very reliable (perhaps because of previous measurements), you hardly have to do another measurement at all!
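A minimal sketch of that scheme: treat the prior as one more pseudo-measurement in an inverse-variance pool, but inflate its variance by a reliability factor (the factor of 2 here is arbitrary, just to show the mechanics):

```python
def precision_weighted(estimates):
    """Inverse-variance weighting: combine (mean, variance) pairs."""
    total_precision = sum(1.0 / v for _, v in estimates)
    mean = sum(m / v for m, v in estimates) / total_precision
    return mean, 1.0 / total_precision

# The prior enters as assumed data, so its variance is inflated
# (here doubled, an arbitrary choice) relative to a real measurement.
prior = (0.0, 1.0 * 2)   # prior pseudo-measurement, variance doubled
data = (1.0, 1.0)        # actual observation
combined = precision_weighted([prior, data])
```

Doubling the prior's variance halves its weight in the pool, which is one simple formalization of "assumed data is less reliable than real data."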

There is also a difference between measuring some quantity that has a point value, like the weight of an object, and measuring something which amounts to a distribution, i.e., is not represented by just the mean. For the former, more measurements reduce the standard error. For the latter, more measurements help reduce the uncertainty of both the mean and standard deviation.

I haven’t yet thought through whether the role of uncertainty in the prior affects these two cases the same way.

2. Daniel says:

To me all this reflects a misunderstanding. The idea that there is a “default prior” is really odd when you think about it. Suppose you’re estimating the frequency of people attending church, the frequency of people eating, the frequency of people getting married… are these all supposed to use the same “default frequency of human action prior”? I mean, you could make one pretty easily: the frequency of any human action is somewhere between eye blinking, around 1/second, and things you’ll do only once, 1/lifetime = 1/100yrs ~ 1/(pi * 10^9 s). But if you’re estimating the frequency of blinking and you’re using that prior range… you’re doing it wrong… and if you’re estimating getting married and using a prior range that includes 1/second, you’re doing it wrong…

And then there’s setting your prior based on the observed data. We had a big conversation about that at one point, but I think the general idea is this: don’t do it. It’s possible to do correctly, that is, in a way that obeys the rules of logic, but it’s not easy, and it basically results in mixing your “real” prior with the likelihood in a way that just hides things under the rug where you can’t figure out what’s going on.

For example, you could set your prior for something to be centered on the observed average and the observed standard deviation… but then you’ll need to do something to account for this fact, because if you use the information “i know the observed average exactly” to set the prior, then after seeing N-1 data points, you can predict the Nth *precisely* using the previous N-1 and the knowledge of the sample average… I’m sure there’s something similar in the standard deviation calculation.

Prior choice should use *real information* every time in my opinion. If you want something where you don’t use that much information, just start with your real information and widen the prior somewhat from there and claim “conservativeness,” like when Civil Engineers say “when crowded, you can fit around 25 people on this footbridge, so we’ll design it for up to 100 to be conservative.”

• Andrew says:

Daniel:

I agree with you that prior choice, actually model choice, “should use real information every time.” But there’s still the question of the default, the thing you try before putting in that real information, beyond all that information you used in design, data collection, etc.

• Seriously, we’re going to go to the trouble of spending thousands, or hundreds of thousands of dollars of research money, maybe half a year of several people’s time collecting data, at least a couple days massaging that data into a form you can read it into your software reliably, and then we’re not going to bother to go get a cup of coffee and stand next to the white board, and talk with our coworkers for 15 minutes about what kind of information we should put in the prior?

• Basically that IS my default prior: whatever I can figure out in about 5 to 15 minutes of thinking about what I know and making sure all those possibilities are somehow included, and that I’ve excluded anything I am very sure is nonsense, probably using just a few typical distributions like normal, gamma, lognormal, t, dirichlet, etc.

We’ll call it an informative prior when, after doing that, I do a bunch of simulation tests and have tuned my prior to exclude anything that results in fake-data simulations that lead to strange data, etc.

• Erik says:

Author here; thanks for commenting! I agree that the prior should reflect *real information*. One good way (perhaps the best way) to get real prior information is by doing a meta-analysis. But how do you decide the scope? Do you only include studies that are identical to your own, or is it OK to go more general?

I’m arguing that the wider you choose the scope, the less informative the prior becomes. But even if you choose the scope *very* wide, you *still* don’t get a diffuse prior. You get a general-purpose prior that reflects basic information about how regression coefficients behave. The second paper Andrew refers to (Default prior for psychological research, still under construction) is a nice example.

• Chris Wilson says:

I think this is an interesting idea. Almost empirical Bayes like. Basically, you are defining a large reference set of “studies kinda like this one”, and using that to set priors.

• Erik van Zwet says:

Exactly! And I take “kinda like this one” in a very broad sense so that the estimated prior is very broadly applicable. Now, if you have additional prior information that everybody can agree on, then you should use that instead of the default. For example, in that beauty-and-sex-ratio study we know that large effects are biologically impossible and we should use a much more narrow prior.

• Keith says:

Erik:

Just some history: this 1987 paper was the result of having to learn about meta-analysis in order to get *real information* for priors to use in the cost-benefit analysis of funding clinical trials – https://annals.org/aim/article-abstract/702054/meta-analysis-clinical-research
– still behind a paywall :-(

Sure, that was not the first, but it gives a sense of how long *good* ideas take to be widely picked up.

p.s. The sense was, and I think still is, to flatten the combined likelihood function to account for unaccounted-for uncertainties.

• Erik van Zwet says:

Thanks, Keith! I didn’t know that paper, but at a first glance it seems pretty relevant to what I’m doing!

3. Garnett says:

“In principle, one can obtain prior information for almost any research question in the life sciences via a meta-analysis. In practice, however, there are (at least) three obstacles. First, a meta-analysis is extra work and that is never popular.”

…but it’s a great way for students and post-docs to get publications that can be useful for more than advancing their careers!

An important point regarding “meta-analytic priors” from Spiegelhalter et al.’s book on Bayesian approaches to clinical trials (p. 150):

“It is important to note that the appropriate prior distribution for [the parameter] is the predictive distribution of the effect in a new study, and not the posterior distribution of the ‘average’ effect…”

• Thomas Passin says:

@Garnett: “In principle, one can obtain prior information for almost any research question in the life sciences via a meta-analysis. In practice, however, there are (at least) three obstacles. First, a meta-analysis is extra work and that is never popular.”

I remember reading – somewhere – that some researcher studied the accuracy of meta-analyses. As I recall, the meta-analyses didn’t hold up too well in cases where better studies later became available. Wish I could remember more details!