Weakly informative priors

Bayesians traditionally consider prior distributions that (a) represent the actual state of subject-matter knowledge, or (b) are completely or essentially noninformative. We consider an alternative strategy: choosing priors that convey some generally useful information but clearly less than we actually have for the particular problem under study. We give some examples, including the Cauchy(0, 2.5) prior distribution for logistic regression coefficients, and then briefly discuss the major unsolved problem in Bayesian inference: the construction of models that are structured enough to learn from data but weak enough to learn from data.

I’m speaking on this Monday at Jun Liu’s workshop on Monte Carlo methods at Harvard (my talk is 9:45-10:30am in 104 Harvard Hall).

Here’s the presentation. I think this is potentially a huge advance in how we think about Bayesian models.
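
To make the Cauchy(0, 2.5) idea concrete, here is a minimal sketch (mine, not code from the talk) that finds the posterior mode of a logistic regression with independent Cauchy(0, 2.5) priors on the coefficients. The simulated data are an assumption purely for illustration, and as I understand it the full recommendation also involves rescaling the inputs first.

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_posterior(beta, X, y, scale=2.5):
        """Negative log posterior: Bernoulli-logit likelihood plus
        independent Cauchy(0, scale) priors on the coefficients."""
        eta = X @ beta
        loglik = np.sum(y * eta - np.logaddexp(0.0, eta))   # logistic log-likelihood
        logprior = -np.sum(np.log1p((beta / scale) ** 2))   # Cauchy, up to a constant
        return -(loglik + logprior)

    # Simulated data (hypothetical, for illustration only)
    rng = np.random.default_rng(0)
    n, p = 100, 3
    X = rng.normal(size=(n, p))
    y = rng.binomial(1, 1 / (1 + np.exp(-X @ np.array([0.5, -1.0, 2.0]))))

    fit = minimize(neg_log_posterior, np.zeros(p), args=(X, y))
    print(fit.x)   # posterior mode: shrunk toward zero, but not forced there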

11 thoughts on “Weakly informative priors”

  1. I really appreciate seeing work being done in this area. I think this is the way to go for historically-controlled trials: historical controls give information, but they are generally unreliable because of changing conditions (especially in biostatistics). Being able to have the analysis essentially say "this is what we think it might be, based on historical information, but let the data speak differently if need be" is, I think, a huge thing (a toy illustration of this follows below). Interesting point about "noninformative priors" being "weakly informative."
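
    A toy numerical version of "let the data speak differently if need be" (my sketch, with made-up numbers): center the prior at a hypothetical historical estimate, and compare a normal prior with a heavy-tailed Cauchy prior when the new data disagree.

        import numpy as np

        theta0, prior_scale = 0.0, 1.0   # hypothetical historical estimate and its scale
        ybar, se = 4.0, 0.5              # current data that conflict with history

        theta = np.linspace(-10, 10, 20001)          # grid for numerical integration
        loglik = -0.5 * ((ybar - theta) / se) ** 2   # normal log-likelihood

        for name, logprior in [
            ("normal prior", -0.5 * ((theta - theta0) / prior_scale) ** 2),
            ("Cauchy prior", -np.log1p(((theta - theta0) / prior_scale) ** 2)),
        ]:
            post = np.exp(loglik + logprior - (loglik + logprior).max())
            post /= post.sum()
            print(name, "posterior mean:", round(float((theta * post).sum()), 2))
        # The normal prior compromises between history and data; the heavy-tailed
        # prior largely defers to the data once the conflict is big enough.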

  2. "Weakly informative" is a great term, and you have nice examples in the presentation.

    It took me some time to spot the difference in the table on page 46 and in the figures on page 56. For the web audience it might be useful to add, e.g., some coloring to attract attention.

  3. If you actually have valid prior information, why on earth wouldn't you use it?

    The prior begins to affect the posterior when its uncertainty (width or standard deviation) is similar to the uncertainty in the data. Otherwise, the likelihood dominates. (A quick numerical check of this appears at the end of this comment.)

    Also

    An uninformative prior does not have to be flat. A Gaussian prior with a large standard deviation is uninformative as well, and it better represents a large-sample distribution.

    You should read Jaynes's book.
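
    A quick numerical check of the precision-weighting point above (my sketch, assuming the conjugate normal-normal model):

        import numpy as np

        # Data mean ybar with standard error se; prior N(mu0, tau^2).  The posterior
        # mean is a precision-weighted average, so a prior with tau >> se barely moves it.
        def posterior_mean_sd(ybar, se, mu0, tau):
            w_data, w_prior = 1 / se**2, 1 / tau**2
            mean = (w_data * ybar + w_prior * mu0) / (w_data + w_prior)
            return mean, (w_data + w_prior) ** -0.5

        ybar, se = 2.0, 0.5
        for tau in [0.5, 2.0, 10.0]:   # prior sd comparable to, then much wider than, se
            print(tau, posterior_mean_sd(ybar, se, 0.0, tau))
        # tau = se pulls the estimate halfway to mu0; tau = 10 leaves it essentially at ybar.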

  4. Regarding "uninformative", it is not clear to me what this means.

    I'm involved in an argument in climate science, with a bunch of people who claim that a uniform prior such as U[0,20] is "uninformative" about a parameter S (in this case, the sensitivity of equilibrium temperature change to a doubling of CO2).

    My claim is that such a prior, far from saying nothing about S, is actually asserting P(S>6)=70% (inter alia) which seems a remarkably exaggerated and alarmist claim. For context, real opinions of climate scientists reported over the past 100 years suggest S is probably close to 3. When the uniform prior is updated with recent observations, P(S>6) is greatly reduced – perhaps to 5% – but this is still high enough to be alarming. I don't think it is a plausible belief though.

    On a somewhat related topic, a climate scientist who advocates maximum entropy methods (and publishes on the topic) recently claimed via email that a proper uniform prior U[0,n] (which he insists is the natural ME prior when we only have the bounds 0 and n) has no mean whatsoever, i.e., that int_0^n x (1/n) dx does not exist, cannot be calculated, or has no meaning. This seems rather bizarre to me, even if one overlooks the need to choose an invariant measure (in order to get a prior that is uniform in x rather than in x^2 or 1/x or something else).

    I'd be grateful if anyone could shed any light on these matters.
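
    Both numbers can be checked directly; here is a small sketch (mine). A proper U[0, n] prior certainly has a mean, namely int_0^n x (1/n) dx = n/2; it is the improper uniform on the whole half-line that lacks one.

        from scipy.stats import uniform

        # Under a U[0, 20] prior, P(S > 6) is just an interval length: 14/20.
        prior = uniform(loc=0, scale=20)
        print(prior.sf(6))     # 0.7

        # A proper U[0, n] prior does have a mean: int_0^n x * (1/n) dx = n/2.
        n = 20
        print(uniform(loc=0, scale=n).mean())   # 10.0, i.e., n/2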

  5. A related issue is prior distributions vs. parameter identification. That is, starting with an unidentified model, how informative must the corresponding prior be before the model parameters are identified? I've always wondered about this but haven't come across much information on the issue.
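
    One concrete way to pose the question (a toy sketch of my own, not from any reference): in a model where only a + b is identified, the marginal posterior of a stays essentially as wide as the prior allows, no matter how much data you have, so the prior alone controls whether a is pinned down.

        import numpy as np

        # Toy unidentified model: ybar ~ Normal(a + b, 1/n).  The likelihood is flat
        # along every line a + b = const, so the data cannot separate a from b.
        ybar, n = 1.0, 25
        grid = np.linspace(-8, 8, 321)
        A, B = np.meshgrid(grid, grid)
        loglik = -0.5 * n * (ybar - (A + B)) ** 2

        for tau in [3.0, 1.0, 0.3]:   # sd of independent N(0, tau^2) priors on a and b
            logpost = loglik - 0.5 * (A**2 + B**2) / tau**2
            post = np.exp(logpost - logpost.max())
            post /= post.sum()
            sd_a = np.sqrt((post * A**2).sum() - (post * A).sum() ** 2)
            print(f"prior sd {tau}: posterior sd of a = {sd_a:.2f}")
        # The posterior sd of a tracks tau/sqrt(2): the prior, not the data, identifies a.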

  6. Aki,

    Thanks for the suggestion. I don't know how to put a red circle around things (or, more generally, "draw") on a PostScript file. There's probably some utility out there that does this?

    Timothy,

    It's a LaTeX document class that Jouni taught me. I'm using the default settings.

    Bill,

    I agree that it makes sense to use prior information; at the same time, there is often a demand (within the field of statistics, at least) for inferences that are not sensitive to the prior distribution. My point is that the "weakly informative prior" is a more general, and useful, idea than the "noninformative prior."

    James,

    Yes, when inferences are sensitive to the prior distribution, ultimately you have to go to the scientific debate–in your case, questions about the plausibility of high values of S. (This is an issue for classical inference also: learning that a particular value of S is consistent with your data does not mean that it is true, if you have prior information saying that it is unlikely.)

    Ed,

    Yes, I agree, this is the big question–the construction of models that are structured enough to learn from data but weak enough to learn from data. When you have more data, the appropriate weakly informative prior distribution changes.

  7. First of all, I am very sorry I missed the workshop (I am taking too many classes this term).

    I have just read the presentation (and, to Aki: it seems to me to be a Beamer product, isn't it? You can find a lot of references on the web, and if you are really into LaTeX I am sure you can customize one of the many Beamer themes).

    about the "waekly informative priors" I do understand the general aim of this attempt.. but I have some questions that can maybe open up a more theorerical debate.

    1) Is there any information-theoretic argument for choosing weakly informative priors? I mean, reference priors, Jeffreys priors, g-priors, and max-ent rules can all be seen through the same lens of "distance" and "entropy"; is it possible to derive the weakly informative ones within the same framework?

    2) I think I understood the implications of weakly informative priors for multilevel models; is it possible to think about extensions to other kinds of models, such as time series or spatial statistics, where regularization of estimates is usually associated with reduction of dimensionality?

  8. @ Texer, 2)

    "Reduction of dimensionality" is achieved when a parameter is set to exactly zero. Priors with a peak in zero can achieve this usually only if inference is not based on the posterior means (which will never be exactly zero) but on the posterior mode (which might be exactly zero for some elements of the parameter vector). There are some suggestions for priors with a marked peak in zero and simultaneously fairly heavy tails, e.g. Griffin/Brown (2005).
