Informative priors for treatment effects

Biostatistician Garnett McMillan writes:

A PI recently completed a randomized trial where the experimental treatment showed a large, but not quite statistically significant (p=0.08) improvement over placebo. The investigators wanted to know how many additional subjects would be needed to achieve significance. This is a common question, which is very hard to answer for non-statistical audiences. Basically, I said we would need to conduct a new study.

I took the opportunity to demonstrate a Bayesian analysis of these data using skeptical and enthusiastic priors on the treatment effect. I also showed how the posterior is conditional on the accumulated data, and naturally lends itself to sequential analysis with additional recruitment. The investigators, not surprisingly, loved the Bayesian analysis because it gave them ‘hope’ that the experimental treatment might really help their patients.

Here is the problem: The investigators want to report BOTH the standard frequentist analysis AND the Bayesian analysis. In their mind the two analyses are simply two sides of the same coin. I have never seen this (outside of statistics journals), and have a hard time explaining how one reconciles statistical results where the definition of probability is so different. Do you have any help for me in explaining this problem to non-statisticians? Any useful metaphors or analogies?

My reply: I think it’s fine to consider the classical analysis as a special case of the Bayesian analysis under a uniform prior distribution. So in that way all the analyses can be presented on the same scale.

But I think what’s really important here is to think seriously about plausible effect sizes. It is not in general a good idea to take a noisy point estimate and use it as a prior. For example, suppose the study so far gives an estimated odds ratio of 2.0, with a (classical) 95% interval of (0.9, 4.4). I would not recommend a prior centered around 2. Indeed, the same sort of problem—or even worse—comes from taking previous published results as a prior. Published results are typically statistically significant and thus can grossly overestimate effect sizes.

Indeed, my usual problem with the classical estimate, or its Bayesian interpretation, is with the uniform prior, which includes all sorts of unrealistically large treatment effects. Real treatment effects are usually small. So I’m guessing that with a realistic prior, estimates will be pulled toward zero.
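
To make that shrinkage concrete, here is a minimal sketch in Python, using a normal approximation to the reported estimate on the log-odds-ratio scale. The skeptical prior scale of 0.35 on the log odds ratio is an illustrative assumption, not a recommendation, and a very wide prior stands in for the classical analysis:

```python
import numpy as np

# Reported classical result: estimated odds ratio 2.0 with 95% interval (0.9, 4.4).
# Work on the log-odds-ratio scale, where a normal approximation is reasonable.
y = np.log(2.0)                                 # point estimate on the log scale
se = (np.log(4.4) - np.log(0.9)) / (2 * 1.96)   # back out the standard error (~0.40)

def posterior(y, se, prior_sd):
    """Conjugate normal-normal update with a prior centered at zero."""
    post_var = 1.0 / (1.0 / prior_sd**2 + 1.0 / se**2)
    post_mean = post_var * y / se**2
    return post_mean, np.sqrt(post_var)

# prior sd 0.35 says most plausible odds ratios lie in roughly (0.5, 2);
# prior sd 100 is effectively flat and should reproduce the classical interval.
for prior_sd in (0.35, 100.0):
    m, s = posterior(y, se, prior_sd)
    lo, hi = m - 1.96 * s, m + 1.96 * s
    print(f"prior sd {prior_sd:g}: posterior OR {np.exp(m):.2f}, "
          f"95% interval ({np.exp(lo):.2f}, {np.exp(hi):.2f})")
```

Under the skeptical prior the point estimate is pulled substantially toward an odds ratio of 1 and the interval tightens; under the effectively flat prior the output just reproduces the classical result.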

On the other hand, I don’t see the need for requiring 95% confidence. We have to make our decisions in the meantime.

Regarding the question of whether the treatment helps the patients: I can’t say more without context, but in many settings we can suppose that the treatment is helping some people and hurting others, so I think it makes sense to consider these tradeoffs.

31 thoughts on “Informative priors for treatment effects”

  1. Thanks for your insights. Interestingly, a statistical reviewer asked for an uninformative prior, e.g. N(0, sd=100), with the concern that informative priors increase the risk of Type I error. I don’t know if I agree with that assessment, but I see that fairly regularly: concerns about Type I or Type II error in Bayesian analysis of trials.

    • Garnett:

      Wow, your reviewer is clueless. You can point them here and here and here for discussions of how classical estimates (or Bayesian inferences using noninformative priors) are prone to jumping to conclusions, while informative Bayesian inferences should be more conservative.

      • I think that reviewers feel that informative priors lead the investigators to over-promote efficacy (hence Type I error concerns). The diffuse prior is seen as “allowing the data to speak for themselves,” and is a symptom of uneasy suspicion about statistical inference.

        FWIW a colleague recently attended a one-day Bayesian analysis course where diffuse priors were promoted. I think this is common so as not to ‘scare’ novices into thinking that Bayesian analysis just leads to foregone conclusions.

        • Garnett:

          Yes, I’m thinking of priors that are centered on zero or something like that, not priors that are centered on optimistic estimates based on researchers’ hopes, dreams, and biased reviews of the literature.

          One reason to think about informative priors is that noninformative priors have been leading to such disasters. Himmicanes, power pose, ovulation and clothing, etc etc etc.

          If classical or noninformative statistics were working just fine, I’d say, hey, go for it, only unleash the Bayes if you really need the efficiency. But given the disasters that noninformative analyses have given us, I for one am highly motivated to move on.

        • I’m not the best person to reflect on this, but I imagine that diffuse priors were/are promoted because they give the impression of ‘objectivity’ and, I think more importantly, they allow non-statisticians to do Bayesian analysis on automatic pilot.

          I find that establishing reasonable priors is easily as much work as the data analysis itself.

        • “I find that establishing reasonable priors is easily as much work as the data analysis itself.”

          Nah, I don’t buy this at all. In your example, suppose that on the scale you’re measuring things, N(0,100) is already seen as “uninformative” by basically everyone. This tells you that everyone believes that effect sizes are O(10) (i.e., absolute size is a small multiple of 10 or less). How much variability is there in the placebo/control group? Say, for the sake of illustration, that it’s sd = 0.5; this tells you that 0.5 is “not big”.

          So, just based on that information, I’d start out with an N(0,5) prior, and then start to ask people whether they might think that there should be some asymmetry, and/or maybe increased or decreased diffuseness, and on what grounds. After about 10 minutes you should be able to put down a standard deviation that is a small multiple of 5 that everyone can agree on, maybe 5 itself, maybe 20 for example. If there’s some asymmetry you can use an exponentially modified normal or something like that. If people feel that it should be centered on 0 and 5 isn’t a bad guess but that it could be several multiples of 5, perhaps a Laplace prior centered on 0 and with scale 5 would be better. (A quick comparison of what these candidate priors actually imply is sketched after this comment.)

          The key to remember is that there *IS NO* true prior; there are only approximate representations of your state of information about what is and isn’t reasonable.

          The hard part isn’t coming up with priors for single parameters like this, it’s more coming up with priors for complex objects, like matrices or functions (Gaussian process covariance functions, for example) or splines, non-orthogonal basis expansion coefficients, etc. You don’t want to accidentally put some kind of unwarranted strong structure on things, and that is more common in high dimensions.
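
          As a rough illustration of the comparison mentioned above, the sketch below just asks each candidate prior what it actually claims about plausible absolute effect sizes. The specific choices (the reviewer-style N(0, 100), the N(0, 5) starting point, a wider N(0, 20), and a Laplace with scale 5) are the ones floated in this thread, used purely for illustration:

```python
import numpy as np
from scipy import stats

# Candidate priors from the discussion, all centered at zero; the second
# argument is the scale (the sd for the normals, the Laplace scale otherwise).
priors = {
    "N(0, 100)":     stats.norm(0, 100),
    "N(0, 20)":      stats.norm(0, 20),
    "N(0, 5)":       stats.norm(0, 5),
    "Laplace(0, 5)": stats.laplace(0, 5),
}

# For a symmetric prior centered at zero, P(|effect| > c) = 2 * P(effect > c).
for name, dist in priors.items():
    p1, p10, p50 = (2 * dist.sf(c) for c in (1, 10, 50))
    print(f"{name:>13}:  P(|effect|>1) = {p1:.2f}   "
          f"P(|effect|>10) = {p10:.2f}   P(|effect|>50) = {p50:.2f}")
```

          The point is simply that the “uninformative” N(0, 100) is far from neutral: it places most of its probability on effect sizes that everyone in the room already agreed are implausible.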

        • Agreed. I was thinking more of the latter example that you gave. For randomized trials it’s usually not too hard to define priors. Usually.

    • I want to echo Andrew here. I don’t tend to think in terms of Type I error. I prefer to say that flat priors overfit the sample. Overfitting is bad. So skeptical priors centered on zero are nice model features.

        • Yes. That said, I’ve sometimes been surprised in my own work by how the data pool across levels, such that sometimes “a lot” depends subtly upon the model structure.

      • My understanding is that the FDA requires that device licensing studies evaluate Type I error rates in Bayesian trials. I get their point of view: they don’t want manufacturers to promote ineffectual or even harmful therapies.

        Still, the focus on Type I errors seems to be a classical way to skirt around the fact that Bayesian analysis requires potentially non-unique prior input.

        (Love your book, BTW)

        • Interesting. I am not actually against the Type I/Type II framework. It’s just that my PhD training emphasized fitting non-null models to data, not testing for existence of effects. So it’s not my default mode of thinking. I appreciate the cultural difference to epi/pharma.

        • Richard:

          The FDA has a regulatory perspective that drives this, and in this role Don Berry and other Bayesians have argued that the Type I/Type II framework is most appropriate here.

          Now, I am not sure how often the Type I/Type II error rates of Bayesian methods are properly evaluated. It usually seems to involve a simulation from a point in the null hypothesis and a few points in the alternative hypothesis.

        • Keith:

          I can believe that, with effort, the type 1 and 2 error framework can be adapted to this problem. But I can’t believe it’s an appropriate framework. We should be able to do much better, I think. Do you have references on what Berry and others have written, so we can suggest improvements from that starting point? This is a big deal, no?

        • I took the phrase “Type I/Type II framework” from Richard’s comment – perhaps I should have said just Type I/Type II assessment – regulatory agencies need to credibly assess, if something does not work (has a trivial effect), what the chance is that they will approve it, and vice versa.

          Not that regulatory agencies should make that (assessed) chance always less than 5% or base important decisions solely on Type I/Type II assessments. It is a minimum standard, not a gold standard.

          I’ll locate the Berry reference early next week.

        • Also lots of Don and Scott Berry’s work is about adaptive trials, where you have to take decisions as the trial proceeds, so probability of wrong decisions is quite important. But generally there is a tendency to interpret trials dichotomously – treatment works or it doesn’t – which is obviously rarely realistic. Driven no doubt by significance testing and obsessing about Type I error rates.

          I have a similar situation at the moment to the one described in the post; I’m suggesting a Bayesian analysis for a trial that is probably not going to find a traditionally “significant” result because the event rate is much lower than expected. Resistance comes (perhaps surprisingly, but then again, perhaps not) largely from the statisticians involved.
          I wouldn’t see a problem with doing more than one analysis, as different analyses give you different information, or, more accurately, Bayesian analysis adds a lot to a traditional analysis. In other contexts it is common to do and present more than one analysis, for example adjusted and unadjusted models. David Spiegelhalter for one has advocated doing Bayesian analyses in addition to traditional ones, as an aid to interpretation of results (though to my mind that doesn’t go nearly far enough!).

        • Not able to find the article I recalled with a discussion of why regulators require knowing Type I and Type II error rates, but perhaps these will do for now.

          “P-values Are Not What They’re Cracked Up to Be,” a comment on the Statement on Statistical Significance and P-values: “When is it appropriate to use p-values for inference? An archetype is drug regulation. Drug sponsors must develop a protocol and a statistical analysis plan in advance of an experiment. These explicitly and unambiguously state the primary endpoint and how it will be analyzed. After the experiment a robot could calculate the p-value.”

          Bayesian clinical trials (http://www.nature.com/nrd/journal/v5/n1/full/nrd1927.html): “Institutional review boards and others involved in clinical research, including regulators when the trial is for drug or medical device registration, require knowing the trial design’s operating characteristics. These include false-positive rate and power (the probability of concluding a benefit when there is actually a benefit), average total sample size, average proportion of patients assigned to the various treatment arms, probability of identifying the most effective dose and so on. Moreover, these bodies can request modifications in the design so as to ensure that the operating characteristics meet conventional benchmarks, such as having no greater than a 5% false-positive rate.”

    • > a statistical reviewer asked for an uninformative prior e.g. N(0, sd=100) with the concern that informative priors increase the risks of Type I error.

      As in he was concerned about informative priors like N(mean=0, sd=NotAHugeNumber)? Or was he concerned about an informative prior like N(mean=SomethingNonzero, sd=Whatever)?

      If he was concerned about N(0, NotAHugeNumber), then I am very confused. I would think that N(0,100) would lead to more Type I errors than N(0,5). Hopefully I’m correctly understanding that a Type I error is “the incorrect rejection of a true null hypothesis” (wikipedia), and in this case the null hypothesis is an effect size of 0.
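
      A toy simulation can check that intuition. Everything here is invented for illustration: a conjugate normal model, an observed estimate with standard error 2, zero-centered priors, and a “Type I error” defined as a 95% posterior interval that excludes zero when the true effect is exactly zero:

```python
import numpy as np

rng = np.random.default_rng(2024)

# Simulate many trials with a TRUE effect of zero; each observed estimate has
# standard error 2 (an arbitrary illustrative value).
se, n_sims = 2.0, 200_000
y_hat = rng.normal(0.0, se, size=n_sims)

# Conjugate normal update under a zero-centered prior, then count how often
# the 95% posterior interval excludes zero (a "Type I error" in this setup).
for prior_sd in (100.0, 5.0):
    post_var = 1.0 / (1.0 / prior_sd**2 + 1.0 / se**2)
    post_mean = post_var * y_hat / se**2
    rate = np.mean(np.abs(post_mean) > 1.96 * np.sqrt(post_var))
    print(f"prior N(0, {prior_sd:g}): interval excludes zero in {rate:.1%} of null trials")
```

      At least in this simple setup your reading is right: the diffuse prior reproduces the familiar rate of about 5%, while the tighter zero-centered prior is more conservative, not less.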

      • Honestly, I don’t know. I think the response was a gut reaction against non-diffuse priors potentially being abused to over-promote a therapy.

        The funny thing about reviewers complaining about priors is that it’s really irrelevant, no? If we publish the data, couldn’t anyone just reanalyze with their own priors?

        • Garnett:

          Nothing special about priors here. You could more generally say, if we publish the data, anyone could reanalyze with their own models. On the other hand, data analysis is a valuable skill. So I think it makes sense to (a) publish the data, and also (b) analyze these data as well as we can.

        • Thanks for your reply.

          >I think the response was a gut reaction against non-diffuse priors potentially being abused to over-promote a therapy.

          Wouldn’t a diffuse prior like N(0,100) be more prone to over-promote a therapy (over-estimate effect size) than a non-diffuse prior like N(0,5)? I fear I have some error in my thinking.

          >If we publish the data, couldn’t anyone just reanalyze with their own priors?

          Theoretically yes, but even trivial obstacles can make large differences in human behavior. Like how having a tempting food within hand’s reach versus stowed out of sight in a pantry can make a huge difference in how much of it you eat.

        • No error, I think.

          I _suspect_ that people working in industry are less inclined to think of a prior as a state of knowledge and more as the necessary (evil) cost of getting the benefits of Bayesian analysis (e.g. interpretability, adaptability, prediction). Thus the focus on Type I error.

          I guess it makes sense if you work in an environment with one-shot data analyses done to promote a therapy.

        • Garnett:

          I work with people in industry, and they’re interested in prior distributions for (at least) two reasons:

          1. Regularization: When we fit models without including prior information, we can get unstable computations and noisy estimates that don’t make sense.

          2. Information: We can use prior information to get better estimates, better predictions, and, we hope, better decisions.

          There is selection, of course: if someone working in industry doesn’t want to use Bayesian methods, he or she won’t come to me in the first place. But the key point is that they want useful results.

        • > No error, I think.

          That is a relief. Thank you.

          >I guess it makes sense if you work in an environment with one-shot data analyses done to promote a therapy.

          Now I am less relieved. An N(0,100) prior will lead to a bigger estimated effect size than an N(0,5) prior, given the same observations. My tentative assertion is that the reviewer’s suggestion of an N(0,100) prior was not just misguided, but in fact counterproductive to his goal of avoiding overestimated effect sizes. So no, the suggestion does not make sense, at least to me.
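
          A quick conjugate-normal check of that assertion, with a made-up observed estimate of 3 and standard error 2 (the same toy setup as the simulation further up the thread):

```python
import numpy as np

y_hat, se = 3.0, 2.0   # one fixed "observed" estimate and its standard error (made up)

# Posterior mean of the effect under each zero-centered prior, given the same data.
for prior_sd in (100.0, 5.0):
    post_var = 1.0 / (1.0 / prior_sd**2 + 1.0 / se**2)
    print(f"prior N(0, {prior_sd:g}): posterior mean = {post_var * y_hat / se**2:.2f}")
```

          The diffuse prior essentially hands back the raw estimate, while the tighter zero-centered prior shrinks it toward zero, so the diffuse prior is indeed the one more prone to overestimation.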

        • It seems to me like what’s needed here is an actual decision analysis from the point of view of utility to the patient. The N(0,100) prior will likely lead to a higher posterior expected value and a wider posterior distribution of effect, but the posterior expected value is not necessarily the right point estimate from the perspective of the patient. In fact, if the actual effect is less than the point estimate, the patient’s “losses” can grow very large, whereas if the actual effects are larger than estimated, the patient “gains”. So a loss function that maybe looks like -(x-xest)^3 might make sense, in which case to minimize expected loss you’d shift your point estimate towards zero.

          That shifting of the point estimate is a separate issue from shifting the inference.

        • Hmm… there’s something wrong with the math there: losses need to be not just a function of the error but also a function of the absolute effect size. Otherwise, with unbounded "gains" for positive errors, negative infinity is the estimate you’ll choose :-)

          Anyway, the basic principle is that large uncertainty in the prior leads to larger uncertainty in the posterior distribution. People potentially "lose" for a variety of reasons here: they take a drug that harms them (negative effect) because they believe the effect is positive; they under-prescribe a drug that actually has a large positive effect because they don’t really know this; or they over-prescribe a drug thinking it has a large positive effect when it actually has only a small one.

          So, with all that in mind, a little thinking about constructing a loss function in terms of both absolute effect size and error in estimation of effect size would lead to a good point estimate for decision purposes even if a broad prior was used.

          I know Andrew likes working with real dollar values etc, but in the absence of information to construct a “real” utility function, at least an abstract utility function that reproduces some of the important features would be useful.
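
          For what it’s worth, here is a minimal sketch of that decision-analysis idea. The posterior draws and the loss function are both invented for illustration, and the loss is even simpler than the one described above (asymmetric in the estimation error only); the point is just that a loss penalizing overestimation more heavily than underestimation pulls the decision-relevant point estimate below the posterior mean, even when the prior was broad:

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(7)

# Stand-in posterior draws for the treatment effect (positive = benefit).
# In practice these would come from the fitted model.
effect_draws = rng.normal(0.7, 0.4, size=50_000)

def loss(true_effect, decision):
    """Made-up asymmetric loss: acting as if the effect is larger than it really
    is costs three times as much, per squared unit of error, as being too
    conservative. A term tied to the absolute effect size could be added here."""
    err = decision - true_effect
    return np.where(err > 0, 3.0 * err**2, err**2)

def expected_loss(decision):
    return loss(effect_draws, decision).mean()

best = optimize.minimize_scalar(expected_loss, bounds=(-2.0, 3.0), method="bounded")
print(f"posterior mean:                 {effect_draws.mean():.2f}")
print(f"loss-minimizing point estimate: {best.x:.2f}")
```

          Swapping in a loss that also depends on the absolute effect size, as suggested above, is just a change to the loss function; the rest of the machinery stays the same.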

  2. I’d like to discuss the original investigators’ question of how many extra subjects would be needed to achieve ‘significance’. Most discussions, like this one, ignore entirely the fact that the addition of extra observations to a dataset that looks promising can be entirely sensible.

    The conventional take on the question is that the increase in false positive error rates that would come with such a procedure means that it is entirely inappropriate. However, that is a badly incomplete accounting of inferences because the consequent decrease in false negative error rate (increase in power) may more than offset the increase in false positive error rate for many reasonable valuations of outcomes and costs of the experiment.

    It is quite likely that an optimal response to the investigators’ questions is that they should increase the sample size and re-analyse the results with an eye on the evidential meaning of the results rather than a dichotomous significant/not significant outcome. (The latter part of that advice would be nearly universally applicable, and fits with the Bayesian preference of most of the comments above.) However, it should be noted that advice to add extra observations and perform a dichotomous re-analysis without ‘correction’ might be a good option if the costs of observations and false negative errors are real.
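
    To put rough numbers on that tradeoff, here is a toy simulation; all of its ingredients are arbitrary choices for illustration (50 subjects per arm at the first look, 50 more per arm on extension, unit-variance outcomes, a true effect of 0.4 in the power scenario, and a ‘promising’ window of 0.05 ≤ p < 0.15). It compares stopping at the first analysis against adding subjects and re-testing without any correction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def rejection_rates(true_effect, n1=50, n_extra=50, n_sims=10_000):
    """Toy two-arm trial. Test after n1 subjects per arm; if the result is
    'promising' (0.05 <= p < 0.15), add n_extra per arm and re-test the pooled
    data with no correction. Returns (rate stopping at n1, rate with extension)."""
    fixed = extended = 0
    for _ in range(n_sims):
        ctrl = rng.normal(0.0, 1.0, n1 + n_extra)
        trt = rng.normal(true_effect, 1.0, n1 + n_extra)
        p1 = stats.ttest_ind(trt[:n1], ctrl[:n1]).pvalue
        fixed += p1 < 0.05
        if p1 < 0.05:
            extended += 1
        elif p1 < 0.15:  # promising but not significant: recruit more, re-test
            extended += stats.ttest_ind(trt, ctrl).pvalue < 0.05
    return fixed / n_sims, extended / n_sims

for label, eff in [("no true effect   ", 0.0), ("true effect = 0.4", 0.4)]:
    f, e = rejection_rates(eff)
    print(f"{label}: reject {f:.3f} at first look vs {e:.3f} with extension")
```

    In runs of this sketch the false positive rate rises only modestly above 0.05, while the power gain from extending the promising trials is considerably larger, which is exactly the kind of tradeoff described above.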
