## Bayes for estimating a small effect in the context of large variation

Shira Mitchell and Mariel Finucane, two statisticians at Mathematica Policy Research (that’s the policy-analysis organization, not the Wolfram software company) write:

We here at Mathematica have questions about priors for a health policy evaluation. Here’s the setting:

In our dataset, healthcare expenditures (per person per month) are highly variable (sd = \$2500), but from prior studies we don’t expect the policy being studied to have an average impact much greater than \$20. The sample sizes are large enough to give us hope of estimating small effects (medical claims records for millions of beneficiaries), but prior knowledge is still useful.

As described here, we consider these levels of priors for the average impact:
1. (essentially) flat prior
2. generic weakly-informative prior: normal(0,1) scaled by sd of the data, so normal with mean 0 and sd \$2500 [that’s normal(0, \$2500) using the mean, sd parameterization of the distribution]
3. specific informative prior: “One principle: write down what you think the prior should be, then spread it out.”
3a. If we think about a large collection of prior studies, we’d believe normal(0, \$20), so we could use something like normal(0, \$50) as a weakly informative prior.
3b. Or a specific prior study that was related: we could center the prior at the point estimate from that evaluation, with a standard deviation twice the SE of that estimate?

Questions:
Thoughts on #2 vs. #3 (a vs. b)? For #2, which sd should we use: the raw sd of expenditures, or a residual sd? Expenditures are highly non-normal; does this affect the usefulness of the sd as a scale?

First, I’m concerned about the challenges of estimating an average treatment effect of 20 or so, in the context of variation that is 100 times as large. I’m not saying it’s impossible—you just need a sample size in the zillions, of the sort that you can get from administrative records—just that your inference will be highly sensitive to small biases.

[Shira and Mariel elaborate: Say sd = 1 and we want to estimate a true effect = 1 with n = 100, so SE = 0.1. If bias is 0.01, this is only 1% of the true effect, not a big deal. Now say we want to estimate a true effect = 0.01 with n = 1000000, so SE = 0.001. If bias is again 0.01, this is 100% of the true effect, which is a big deal. Larger sample sizes won’t reduce bias, and we expect bias proportional to the sd of the raw data, and not the true effect.]
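Their arithmetic can be spelled out in a few lines (a minimal sketch in Python, using the numbers from the example above):

```python
import math

sd = 1.0      # sd of the raw outcome
bias = 0.01   # a small bias that does not shrink with sample size

# Scenario 1: large true effect, modest n
n1, effect1 = 100, 1.0
se1 = sd / math.sqrt(n1)
ratio1 = bias / effect1   # bias as a fraction of the true effect

# Scenario 2: tiny true effect, huge n
n2, effect2 = 1_000_000, 0.01
se2 = sd / math.sqrt(n2)
ratio2 = bias / effect2

print(se1, ratio1)  # the SE shrinks with n ...
print(se2, ratio2)  # ... but the bias now swamps the effect
```

More data drives the standard error toward zero but leaves the bias untouched, so the bias-to-effect ratio is the quantity that blows up.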

A randomized trial gets rid of one sort of bias, but there’s still dropout, nonresponse, and plain old variation in the treatment effect. So it would be best if you could somehow decompose the outcome, pull out the part that’s responsive to the treatment, and then focus on that part, in the hope that its sd will be a lot less than 2500. [Shira and Mariel say: for example, what if we imagine the program decreases hospital expenditures specifically (instead of total expenditures)? Or decreases expenditures primarily for high-risk patients?]

Second, to get to the problem at hand, I have no problem with something like a conservative normal(0,20^2) prior or a weakly informative normal(0,200^2) prior on the treatment effect. To me, the “conservative” choice is not the flat prior but rather the prior that pulls strongly toward 0; for more on this point, see this paper with Aleks Jakulin.

Shira and Mariel follow up:

We share your notion of conservative. How should this be balanced with prior information we have from a previous (very related) study? Do we center at 0 or the previous study’s estimate? Somewhere in-between? If centered at a non-zero value, then a strong prior is no longer your version of conservative, correct?

My response:

In this context, yes, I feel that centering the prior at a positive value rather than 0 would not be conservative. But I suppose it depends on the context. For example, suppose you label the treatment effect in a previous study as theta_1 and the treatment effect in the current study as theta_2 (these represent the underlying parameters, not the point estimates), and you think of theta_1, theta_2, etc., as draws from some distribution of treatment effects with center mu and scale sigma. Then you could have a conservative prior on mu with mean 0. This will be different from a conservative prior on theta_2 with mean 0. I guess the lesson is that a conservative prior is defined with respect to the problem being studied and the data being modeled.

Shira and Mariel point out that in the sort of hierarchical model I’m suggesting, you’ll want an informative prior on the variation in the treatment effect. That’s right. In statistical practice we’re not used to setting up these priors, but I think full Bayes is the way to go here, and that it’s not so horrible to have to justify your particular choice of prior distribution as part of the analysis, in the same way that you’re already expected to justify your choice of predictor variables, transformations, and so forth.
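As a concrete illustration of how a conservative prior interacts with a previous study’s estimate, here is the standard conjugate normal-normal update; the function name and all numbers below are hypothetical, not from the post:

```python
import math

def posterior_normal(prior_mean, prior_sd, est, se):
    """Conjugate normal-normal update: combine a normal prior with one
    normally distributed estimate, weighting each by its precision."""
    w_prior = 1.0 / prior_sd**2
    w_data = 1.0 / se**2
    mean = (w_prior * prior_mean + w_data * est) / (w_prior + w_data)
    sd = math.sqrt(1.0 / (w_prior + w_data))
    return mean, sd

# A conservative prior centered at 0 with sd $20, combined with a
# (made-up) related-study estimate of $30 with SE $15:
m, s = posterior_normal(0.0, 20.0, 30.0, 15.0)
print(round(m, 1), round(s, 1))   # the estimate is pulled well back toward 0
```

The precision-weighted average makes the trade-off explicit: the tighter the conservative prior, the harder the previous study’s estimate gets pulled toward zero.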

1. Daniel Lakeland says:

A kind of prior I have used when I’ve got a reasonable range to consider is a “soft uniform”

```
u ~ uniform(a, b)
n ~ normal(0, scale)

par = u + n
```

and use par as my parameter of interest. The point is to generate a broad plateau of reasonable values between a and b while still giving support to the whole real line outside that range, with nice normal-style tails (parabolic on the log scale).

So, for example, if you expect something near zero, and not much more than about \$20 in absolute value, you could do

```
u ~ uniform(-20, 20)
n ~ normal(0, 10)

par = u + n
```

and now the main prior probability mass of par is between -20 and 20, with a normal tail on either side extending out another few multiples of 10; the central 95% of the mass is roughly between -29 and 29, and the interval from -40 to 40 covers about 99.6% of it.

You could get something similar from normal(0,30) or the like, but the soft uniform gives you a little stronger tail and a little more uniform plateau in the high probability region.
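A quick Monte Carlo check of this construction, using the example values above (the simulation itself is an illustration, not from the comment):

```python
import random
import statistics

random.seed(1)
N = 200_000
# The "soft uniform": par = u + n with u ~ Uniform(-20, 20), n ~ Normal(0, 10)
par = [random.uniform(-20, 20) + random.gauss(0, 10) for _ in range(N)]

inside_40 = sum(abs(x) <= 40 for x in par) / N
cuts = statistics.quantiles(par, n=40)   # cut points at 2.5%, 5%, ..., 97.5%
print(round(inside_40, 3), round(cuts[0], 1), round(cuts[-1], 1))
```

With these numbers the central 95% of the draws sit near ±29, and the interval (-40, 40) holds roughly 99.6% of the mass.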

• Corey Yanofsky says:

u ~ uniform(-20,20)
n ~ normal(0,10)
par = u+n

This is
par ∝ Φ[(par + 20)/10] – Φ[(par – 20)/10]

in which Φ[⋅] is the standard normal CDF.

• Corey Yanofsky says:

dang it, that should be

prior density for par ∝ Φ[(par + 20)/10] – Φ[(par – 20)/10]

• Daniel Lakeland says:

I usually think of it in terms of a convolution. You start with the “table” uniform(a,b) and then smooth it out by convolving with normal(0,scale)

So the density at a given point x is

p(x) = ∫_a^b pn(s; x, scale) · 1/(b - a) ds

where pn(s; x, scale) is the normal density centered at x with the given scale, evaluated at the point s.

That is probably how you get your Φ-based representation.

I think the useful fundamental way to think about this class of priors is in terms of the convolution, since you can construct any kind of “smoothed Foo” for all kinds of reasonable base distributions “Foo” and the smoothing kernel can be normal, or something less smooth, or have finite tails, or whatever you want.
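The closed form and the convolution integral can be checked against each other numerically (Φ written via the error function; a = -20, b = 20, scale = 10 are the example values used above):

```python
import math

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def soft_uniform_pdf(x, a=-20.0, b=20.0, scale=10.0):
    """Closed form of the Uniform(a, b) convolved with Normal(0, scale)."""
    return (Phi((x - a) / scale) - Phi((x - b) / scale)) / (b - a)

def soft_uniform_pdf_numeric(x, a=-20.0, b=20.0, scale=10.0, steps=20_000):
    """Midpoint Riemann sum of the same convolution integral."""
    ds = (b - a) / steps
    total = 0.0
    for i in range(steps):
        s = a + (i + 0.5) * ds
        total += math.exp(-0.5 * ((x - s) / scale) ** 2) / (scale * math.sqrt(2 * math.pi))
    return total * ds / (b - a)

for x in (0.0, 20.0, 40.0):
    print(round(soft_uniform_pdf(x), 6), round(soft_uniform_pdf_numeric(x), 6))
```

The two agree to many decimal places, which also confirms the sign convention in the Φ-based expression above.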

• Eric says:

Would this run into sampling problems at the boundaries of the “uniform” plateau, due to discontinuities in the density’s derivative around those boundaries?

• Daniel Lakeland says:

No, I wouldn’t expect that at all. The resulting prior is infinitely smooth, since it’s the convolution of the “table” with an infinitely smooth Gaussian. In Stan you’ll have to make sure your support is declared properly, though:

```
data {
  real a;
  real b;
  real<lower=0> scale;
}
parameters {
  real<lower=a, upper=b> u;  // bounded, so implicitly uniform(a, b)
  real n;
}
transformed parameters {
  real par = u + n;
}
model {
  n ~ normal(0, scale);
}
```

• Daniel Lakeland says:

of course, the blog ate my stan boundaries. basically you need u to have lower=a,upper=b boundaries.

2. MJT says:

“`
Shira and Mariel follow up:

We share your notion of conservative. How should this be balanced with prior information we have from a previous (very related) study? Do we center at 0 or the previous study’s estimate? Somewhere in-between? If centered at a non-zero value, then a strong prior is no longer your version of conservative, correct?

My response:

In this context, yes, I feel that centering the prior at a positive value rather than 0 would not be conservative. But I suppose it depends on the context.

“`

the way i like to think about umbrella ‘shrinkage’ is: shrinkage to some value c.
* if c is 0, then you get lasso type of shrinkage
* if c is basically the mle, then you have ‘no shrinkage’

with bayes + multilevel models, you have the flexibility to structure c.

in turn, what is ‘conservative’ or ‘liberal’ will better match your context
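MJT’s picture of shrinkage toward an arbitrary center c can be written down directly (a toy sketch; the function and numbers are hypothetical):

```python
def shrink(mle, c, lam):
    """Pull an estimate toward center c by a factor lam in [0, 1]:
    lam = 0 leaves the MLE untouched; lam = 1 collapses it to c."""
    return c + (1.0 - lam) * (mle - c)

mle = 30.0
print(shrink(mle, 0.0, 0.6))    # c = 0: lasso-style pull toward zero
print(shrink(mle, mle, 0.6))    # c = the MLE itself: no shrinkage at all
```

In a multilevel model, c is not fixed in advance but estimated from the group-level distribution, which is what gives the structure MJT describes.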

3. Eric says:

Would you recommend trying to fit the larger multilevel model to the old data (theta_1) and the new data (theta_2) all at once or is there a more computationally efficient way to estimate both mu and theta_2 given the prior data on theta_1 and its estimated error?