You won’t believe these stunning transformations: How to parameterize hyperpriors in hierarchical models?

Isaac Armstrong writes:

I was working through your textbook “Data Analysis Using Regression and Multilevel/Hierarchical Models” but wanted to learn more and started working through your “Bayesian Data Analysis” text. I’ve got a few questions about your rat tumor example that I’d like to ask.

I’ve been trying to understand one of the hierarchical models revolving around rat tumors (Chapter 5). This is the model in which the binomial probability p is assigned a beta distribution; the beta distribution’s parameters $\alpha$ and $\beta$ in turn need a hyperprior to complete the hierarchical model.

To construct a noninformative distribution, the book parametrizes the model in terms of the mean $\frac{\alpha}{\alpha+\beta}$ and an approximation to the standard deviation, $(\alpha+\beta)^{-1/2}$. (Described here too: http://statmodeling.stat.columbia.edu/2009/10/21/some_practical/) I know you mentioned not favoring this approach anymore, but I’d still like to understand the modeling thinking/process that supports this if possible.

I have a few questions about this:

– Why use an approximation here for the parametrization rather than the actual standard deviation of the beta distribution? When is it appropriate to use approximations for reparametrization? Is it for computational reasons?

– How did you arrive at this particular approximation?

– What connection, if any, does this have to a Pareto distribution? I tried parametrizing this model with a Pareto(1.5,1) distribution for $\alpha+\beta$ and a uniform distribution on $\alpha/(\alpha+\beta)$ and ended up with $p(\alpha,\beta)\propto (\alpha+\beta)^{-3/2}$ but the book’s approach seems to yield $p(\alpha,\beta)\propto (\alpha+\beta)^{-5/2}$ which disagrees with the gentleman writing into the blog in the link above.
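One way to see where the book’s $(\alpha+\beta)^{-5/2}$ comes from is to compute the Jacobian of the transformation directly: a uniform prior on $\left(\frac{\alpha}{\alpha+\beta},\ (\alpha+\beta)^{-1/2}\right)$ induces a density on $(\alpha,\beta)$ equal to the absolute Jacobian determinant of that map. Here is a small numerical check of that claim (my own sketch, not from the book or the original post), using central differences:

```python
# Numerical check that a uniform prior on phi = alpha/(alpha+beta) and
# psi = (alpha+beta)^(-1/2) induces p(alpha, beta) proportional to
# (alpha+beta)^(-5/2): the induced density is |det J| for the map
# (alpha, beta) -> (phi, psi).

def jacobian_det(a, b, h=1e-6):
    """Central-difference Jacobian determinant of (phi, psi) w.r.t. (alpha, beta)."""
    phi = lambda a, b: a / (a + b)
    psi = lambda a, b: (a + b) ** -0.5
    dphi_da = (phi(a + h, b) - phi(a - h, b)) / (2 * h)
    dphi_db = (phi(a, b + h) - phi(a, b - h)) / (2 * h)
    dpsi_da = (psi(a + h, b) - psi(a - h, b)) / (2 * h)
    dpsi_db = (psi(a, b + h) - psi(a, b - h)) / (2 * h)
    return dphi_da * dpsi_db - dphi_db * dpsi_da

# Exact determinant works out to -(1/2) * (alpha+beta)^(-5/2);
# check the numerical Jacobian against it at a few points.
for a, b in [(1.0, 2.0), (3.0, 5.0), (0.5, 0.5)]:
    exact = 0.5 * (a + b) ** -2.5
    assert abs(abs(jacobian_det(a, b)) - exact) < 1e-6
```

So the $-5/2$ exponent is the Jacobian of BDA’s particular choice of parameters, which is why placing a prior directly on $\alpha+\beta$ (Pareto or otherwise) gives a different answer unless the Jacobian of that transformation is accounted for.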

My reply: As I’ve said, I’ve changed my views since writing that book in the early 1990s, but not all my newer perspective has been worked into the later editions of the book. In particular, I’m not so happy with noninformative priors, for two reasons:

1. We often have prior information, so let’s use it. Traditionally we pragmatic Bayesians have been hung up on the difficulty of precisely specifying our prior information—but it seems clear to me now that specifying weak prior information is better than specifying nothing at all.

2. With flat priors and a small number of groups, we can get a broad posterior distribution for the group-level variation, which in turn can lead to under-smoothing of estimates. In some contexts this is ok (for example, when the unpooled, separate estimates are taken as a starting point or default), but in other settings it’s asking for trouble, and the use of flat priors is basically a way to gratuitously add noise to the inference.

Anyway, back to the example. It seemed to make sense to put a prior on the center of the beta distribution and the amount of information in the beta distribution. These can be specified using mean and variance, but in this case the “effective sample size” seemed reasonable too. To put it another way: you ask, Why not parameterize in terms of the mean and variance? But in general that won’t work either, for example what would you do if you had a Cauchy prior, which has no mean and no variance?
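To make the “center plus amount of information” view concrete, here is a minimal sketch (my own notation and made-up numbers, not from BDA): parameterize the beta prior by its mean $\phi = \alpha/(\alpha+\beta)$ and its effective sample size $\kappa = \alpha+\beta$, and note how a group’s posterior mean shrinks toward $\phi$ by an amount governed by $\kappa$:

```python
# Parameterize a beta prior by its mean phi and effective sample size
# kappa = alpha + beta, then compute a group's posterior mean for a
# binomial observation (y tumors out of n rats).

def beta_from_mean_ess(phi, kappa):
    """Recover (alpha, beta) from prior mean phi and effective sample size kappa."""
    return phi * kappa, (1 - phi) * kappa

def posterior_mean(y, n, phi, kappa):
    """Posterior mean of p under a Beta(alpha, beta) prior: (alpha + y)/(kappa + n)."""
    alpha, beta = beta_from_mean_ess(phi, kappa)
    return (alpha + y) / (kappa + n)

# Hypothetical numbers: prior mean 0.2, prior "worth" 10 rats.
print(beta_from_mean_ess(0.2, 10))     # (2.0, 8.0)
# A group observing 4/14 is pulled from 4/14 = 0.286 toward the prior mean:
print(posterior_mean(4, 14, 0.2, 10))  # (2 + 4) / (10 + 14) = 0.25
```

The appeal of $(\phi, \kappa)$ over (mean, variance) is that $\kappa$ reads directly as “how many observations the prior is worth,” which generalizes to settings where moments may not exist.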

A rule such as “parameterize using the mean and variance” is nothing but a guideline. So, when introducing this example into the book, I didn’t want to try to overly formalize this point. In retrospect, I actually think this was pretty mature of me! But maybe I should’ve explained a bit more. There’s a tradeoff here too: Not enough explanation and things are mysterious; too much explanation and the practical material gets lost in the verbiage (a point of which readers of this blog are well aware, I’m afraid).

4 thoughts on “You won’t believe these stunning transformations: How to parameterize hyperpriors in hierarchical models?”

  1. > we pragmatic Bayesians
    Not in the statistical lexicon yet…

    I would argue for purposeful rather than practical as a clarification of what being pragmatic should be about.

  2. It’s too bad that’s still the first hierarchical modeling example in BDA. I was utterly baffled by it when I was first trying to learn about Bayesian modeling, because it breaks out moment matching in the non-hierarchical case and then an improper prior based on an implicit Jacobian that I didn’t have the tools to understand. And to make matters worse, when I saw Andrew and asked him about it (this is probably 10 years ago), he said it didn’t reflect his current thinking. I think when I asked him what he’d do instead (that is, 10 years ago), I didn’t understand the response.

    A few years prior to starting to work with Andrew on Stan, I finally “got it” and wound up writing a series of blog posts on hierarchical modeling for batting ability, only to be told by Andrew that he hated batting ability because the denominator was a random variable (at-bats, rather than, say, plate appearances) and that he still didn’t like the prior.

    Daniel and I are circling back to this very problem now, and we’re going the Gelman-and-Hill route and making it a multilevel logistic regression (where we can use our standard default priors), while also paying attention to Andrew’s first criticism and to modern baseball analysis trends (happily in agreement here) by concentrating on on-base percentage rather than batting average.
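    The multilevel-logistic route described above can be sketched generatively as follows (my own rough sketch, not Bob and Daniel’s actual code, with made-up hyperparameters): instead of a beta prior on each group’s probability $p_j$, put the group effects on the logit scale with a shared mean and scale, so the hyperprior lives on $(\mu, \sigma)$ where default priors are easy to state.

```python
# Generative sketch of the hierarchical prior in a multilevel logistic
# regression: p_j = inv_logit(mu + sigma * eta_j), with eta_j ~ N(0, 1)
# (the non-centered parameterization).
import math, random

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def simulate_groups(mu, sigma, n_groups, seed=0):
    """Draw per-group success probabilities from the hierarchical prior."""
    rng = random.Random(seed)
    return [inv_logit(mu + sigma * rng.gauss(0.0, 1.0)) for _ in range(n_groups)]

# Hypothetical hyperparameters: mu centered at logit(0.33), modest spread.
probs = simulate_groups(mu=math.log(0.33 / 0.67), sigma=0.2, n_groups=5)
assert all(0.0 < p < 1.0 for p in probs)
```

    One nicety of this parameterization is that $\mu$ and $\sigma$ are unconstrained (up to $\sigma > 0$), so the awkward Jacobian questions from the beta-parameter hyperprior never arise.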
