My best thoughts on priors

My best thoughts on priors (also the thoughts of some other contributors) are at the Prior Choice Recommendations wiki.

And this more theoretical paper should be helpful too.

I sent these links in response to a question from Zach Branson about priors for Gaussian processes. Jim Savage also pointed to our paper on simulation-based calibration of Bayesian models.

I don’t have much experience with GPs, but I agreed that you have to be careful about the nonidentifiable region, as I became aware when doing this mini-project, which I never wrote up except on the blog. The discussion thread at that link is interesting.

16 thoughts on “My best thoughts on priors”

  1. At the wiki, under the heading “Prior for the regression coefficients in logistic regression (non-sparse case)” it says, “Normal distribution is not recommended as a weakly informative prior, because it is not robust (see, O’Hagan (1979) On outlier rejection phenomena in Bayes inference.). Normal distribution would be fine as an informative prior.”

    The article is behind a paywall. Could you explain why something like normal(0,5) or even normal(0,2.5) wouldn’t be weakly informative for logistic regression coefficient parameters? Since the coefficients are on the log scale, this would seem like rather weak priors, no? A coefficient parameter of 5 seems rather large (or -5 rather small).

    • Jd:

      I’m not sure, because now we do recommend normal(0, 1) or normal(0, 2.5) as default weakly informative priors. Perhaps the issue here is that we want our models to be on unit scale, so we just would not expect to have coefficients like 5 or 10 or 100 etc. If your problem might not be on unit scale then it would make sense to have another prior on the scale. That prior would be weak on the log scale. Taking a scaled normal(0, 1) prior and then averaging over uncertainty in the scale will give you something like an unscaled t prior. I guess that’s the right way of thinking about it. I’ll add this to the wiki.
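      Here is a minimal simulation sketch of that last point (not part of the original comment; the inverse-gamma hyperprior below is the textbook choice that makes the marginal prior exactly a t, and other scale hyperpriors give similarly heavy tails):

      import numpy as np

      rng = np.random.default_rng(1)
      n = 1_000_000

      # Fixed-scale prior: beta ~ normal(0, 1)
      fixed = rng.normal(0.0, 1.0, size=n)

      # Scale-mixture prior: draw sigma^2 from an inverse-gamma(nu/2, nu/2) hyperprior,
      # then beta ~ normal(0, sigma).  Averaging over the uncertainty in the scale
      # gives a marginal prior that is Student-t with nu degrees of freedom.
      nu = 4.0
      sigma2 = 1.0 / rng.gamma(nu / 2.0, 2.0 / nu, size=n)
      mixed = rng.normal(0.0, 1.0, size=n) * np.sqrt(sigma2)

      # The mixture has much heavier tails than the fixed-scale normal.
      for q in (0.95, 0.99, 0.999):
          print(q, round(float(np.quantile(np.abs(fixed), q)), 2),
                   round(float(np.quantile(np.abs(mixed), q)), 2))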

      • “…so we just would not expect to have coefficients like 5 or 10 or 100 etc. ”

        The data would tell you if the coefficients cannot be like 5 or 10 or 100, no? I’d worry about situations where we think it cannot be 5 for whatever reason, and it turns out it can be 5 but the prior we used disallowed that.

        Justin

        • Justin:

          The problem is that people don’t analyze all their data at once. When you’re only analyzing a small amount of data, your data can be consistent with all sorts of parameter values that make no sense. And if you just let the data speak on their own, you can overfit and end up with ridiculous inferences and wrong scientific conclusions.
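          A toy numerical sketch of that point (completely made-up data, assuming a simple one-predictor logistic regression; the grid search is just to keep it short):

          import numpy as np

          # Tiny, completely separated dataset: y is 0 whenever x < 0 and 1 whenever x > 0,
          # so the maximum-likelihood estimate of the slope runs off to infinity.
          x = np.array([-2.0, -1.0, 1.0, 2.0])
          y = np.array([0.0, 0.0, 1.0, 1.0])

          def log_lik(beta):
              p = 1.0 / (1.0 + np.exp(-beta * x))
              return np.sum(y * np.log(p) + (1 - y) * np.log1p(-p))

          betas = np.linspace(0.0, 30.0, 3001)
          ll = np.array([log_lik(b) for b in betas])
          log_post = ll - 0.5 * (betas / 2.5) ** 2      # add a normal(0, 2.5) log prior

          print(betas[np.argmax(ll)])        # 30.0 (the edge of the grid): the likelihood alone keeps climbing
          print(betas[np.argmax(log_post)])  # about 2: the weak prior keeps the estimate sensible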

        • “but the prior we used disallowed that.” Justin, please, for the love of god stop talking about things you clearly do not have a fundamental understanding of. You clearly lack basic knowledge about the role of priors and likelihood in a Bayesian model; it isn’t this blog’s job to teach you things that you can read in the literature for yourself. I know others have already stopped taking your comments seriously, but I still held out a faint hope that you’d stop being ignorant and at least try to understand another perspective on inference. If you want to be a good stats instructor, I’d think you’d want to engage in good-faith arguments about statistical methods. Guess not.

    • This seems kind of confusing to me. I eyeballed O’Hagan’s article and it seems to be about how outlying _observations_ are taken into account. It’s obvious that if errors are modeled as normally distributed and there are outlying observations, the normal distribution will try to cover the outliers by increasing its variance and maybe by shifting its mean towards the outlier. A common way of dealing with that is to change the model of errors to some fatter-tailed distribution, e.g. a t-distribution.

      However, here we are talking about prior distributions for _parameters_? That’s a different issue, no?

      In logistic regression the model for observations is the binomial distribution. Problems arise if there are e.g. negative responses when the probability for them is essentially zero (for some set of parameters). Accommodating these outliers is not done by giving _the parameters_ more uninformative priors; they are accommodated by e.g. mixing the binomial distribution with a uniform distribution. This bounds the lowest and highest probabilities away from 0 and 1, so that a single outlying response won’t collapse the log probability for that set of parameters to log(0).
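      For concreteness, a minimal sketch of that kind of mixture (the 2% mixing weight here is just an arbitrary illustrative choice):

      import numpy as np

      def bernoulli_log_lik(y, p, eps=0.02):
          # Mix the model's probability with a uniform guess: the predicted probability
          # can then never get closer to 0 or 1 than eps/2, so a single "impossible"
          # observation contributes log(eps/2) instead of log(0) = -infinity.
          p_robust = (1 - eps) * p + eps * 0.5
          return np.where(y == 1, np.log(p_robust), np.log1p(-p_robust))

      print(bernoulli_log_lik(np.array([1]), np.array([0.0])))  # about log(0.01), i.e. finite
      # Without the mixture this observation would contribute log(0) = -inf.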

      • J-Man: There is no difference. In the case of a data-prior conflict we can either reject the data or the prior, and we can select which one is rejected by choosing whether the observation model or the prior has thicker tails. See, e.g., https://twitter.com/avehtari/status/1218896617346162688 (a small numerical sketch of this appears below).

        jd: a quick web search finds a legal free download of O’Hagan (1979). Btw, the Google Scholar button makes it very easy to find downloadable files.

        jd: “Since the coefficients are on the log scale”: I guess there is a typo here, but it is a valid point that O’Hagan (1979) is not directly applicable here. Andrew’s response is what I would have said, too.

        We’re doing more research on priors and you can expect that there will be changes in the recommendations later this year.
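        A tiny grid-approximation sketch of that tail-thickness point (one observation at y = 10 with the prior centred at 0; the t with 3 degrees of freedom is just an illustrative thick-tailed choice):

        import numpy as np
        from scipy import stats

        theta = np.linspace(-5.0, 15.0, 20001)   # grid over the parameter
        y = 10.0                                  # a single observation far from the prior

        def posterior_mean(prior_pdf, obs_pdf):
            w = prior_pdf(theta) * obs_pdf(y - theta)   # unnormalized posterior on the grid
            w /= w.sum()
            return (theta * w).sum()

        thin = stats.norm(0, 1).pdf    # thin-tailed (normal)
        thick = stats.t(df=3).pdf      # thick-tailed (t with 3 df)

        # Thin-tailed prior, thick-tailed observation model: the outlying datum is rejected.
        print(posterior_mean(thin, thick))   # close to 0
        # Thick-tailed prior, thin-tailed observation model: the prior is overridden.
        print(posterior_mean(thick, thin))   # close to 10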

        • I don’t quite get it. Doesn’t that diagram in your Twitter thread _corroborate_ what I said? The posterior changes depending on whether the prior or the likelihood is fat-tailed.

          It’s easy to see the difference in made-up examples, but I’m not sure if they are so made-up that they don’t inform us about real, practical situations… but here we go in any case:

          Consider a sort of “noiseless detection” scenario. Always, when the signal is below a certain value, the detector will report 0; when the signal, S, is above this threshold the detector reports 1. At least that’s our model!

          Now, if we have a set of observations like this…

          S R
          0.1 0
          1.2 0
          3.4 0
          5.7 1
          8.9 1

          …the likelihood will crop the posterior distribution somewhere between 3.4 and 5.7, depending on the prior. However, it is easy to see that if we introduce a single outlying value like this…

          S R
          0.1 0
          1.2 0
          3.4 0
          5.7 1
          8.9 1
          9.2 0 !!!OUTLIER ALERT!!!!

          …the whole thing breaks down, since the likelihood is not able to accommodate the outlier. It does not matter what we do with the prior. Only by modifying the likelihood, by saying e.g. that there’s a 0.02 chance of 1 when the signal is below the threshold and a 0.98 chance of 1 when it is above, can the situation be fixed (a small numerical sketch follows below).

          A more realistic example would be one of logistic regression, as was the case in the OP, but I’m not good enough to work out the maths in this sort of simple form. But in my experience similar problems can arise: the expected probabilities go to 0 or 1 (due to imprecision in doubles, I reckon), but there’s an outlying observation that ruins it for the whole set of theta. But that’s a different story.
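          A quick numerical sketch of the threshold example above (the data are the made-up values from this comment, the 0.02/0.98 numbers are the ones suggested above, and the grid of candidate thresholds is just for illustration):

          import numpy as np

          S = np.array([0.1, 1.2, 3.4, 5.7, 8.9, 9.2])
          R = np.array([0,   0,   0,   1,   1,   0  ])   # the last point is the outlier

          def likelihood(threshold, lapse=0.0):
              # P(R = 1) is lapse below the threshold and 1 - lapse above it.
              p1 = np.where(S > threshold, 1.0 - lapse, lapse)
              return np.prod(np.where(R == 1, p1, 1.0 - p1))

          thresholds = np.linspace(0.0, 10.0, 101)
          print(max(likelihood(t) for t in thresholds))              # 0.0 for every threshold
          print(max(likelihood(t, lapse=0.02) for t in thresholds))  # positive, best between 3.4 and 5.7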

  2. Third paragraph: “Flat and super-vague priors are not usually recommended and some thought should included to have at least weakly informative priors. For example, it is common to expect realistic effect sizes to be of order of magnitude 0.1 on a standardized scale (for example, an educational innovation that might improve test scores by 0.1 standard deviations). In that case, a prior of N(0,1) could be considered very strong, in that it puts most of its mass on parameter values that are unrealistically large in absolute value.”

    Ignoring the ungrammatical first sentence, the second and third sentences don’t make sense to me. Isn’t the N(0,1) prior here too weak, not too strong? What am I missing?

    • AndyM,

      Yes, the terminology is ambiguous. N(0,1) is too weak in that example—but in another sense it’s too strong a prior, in that it leads to strong claims about large values of theta.
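      To put numbers on that “too strong” reading (a quick check using the 0.1 effect-size scale from the quoted wiki passage; the 0.5 cutoff is just an arbitrary marker for “unrealistically large”):

      from scipy import stats

      prior = stats.norm(0, 1)
      print(2 * prior.sf(0.1))  # ~0.92: 92% of the N(0,1) prior mass is on |theta| > 0.1
      print(2 * prior.sf(0.5))  # ~0.62: most of the mass is on effects larger than 0.5

      So the prior is weak in the sense of barely constraining the data, yet strong in the sense of asserting that effects far bigger than 0.1 are the norm.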

      • Yes, and in this sense, the strongest prior of them all is the totally unregularized procedure, e.g. what Justin Smith recommends above and elsewhere on this blog.

  3. OK, so strong means weak and weak means strong!

    Perhaps that sentence would be better phrased “In that case, a prior of N(0,1) could be considered inappropriate, in that it puts most of its mass on parameter values that are unrealistically large in absolute value.”
