# The scaled inverse Wishart prior distribution for a covariance matrix in a hierarchical model

Since we’re talking about the scaled inverse Wishart . . . here’s a recent message from Chris Chatham:

I have been reading your book on Bayesian Hierarchical/Multilevel Modeling but have been struggling a bit with deciding whether to model my multivariate normal distribution using the scaled inverse Wishart approach you advocate given the arguments at this blog post [entitled “Why an inverse-Wishart prior may not be such a good idea”].

My reply: We discuss this in our book. We know the inverse-Wishart has problems; that’s why we recommend the scaled inverse-Wishart, which is a more general class of models. Here’s an old blog post on the topic. And also of course there’s the description in our book.

Chris pointed me to the following comment by Simon Barthelmé:

Using the scaled inverse Wishart doesn’t change anything: the standard deviations of the individual coefficients and their covariance are still dependent. My answer would be to use a prior that models the standard deviations and the correlations separately, so that you can express things like “I don’t expect my coefficients to be too large but I expect them to be correlated.”

Barthelmé mentions the Barnard, McCulloch, and Meng paper (which I just love, and which I cite in at least one of my books) in which the scale parameters and the correlations are modeled independently, and writes, “I don’t see why this isn’t the default in most statistical software, honestly.”

The answer to this last question is that computation is really slow with that model.

Also, it’s not really necessary for scale parameters and correlations to be precisely independent. What you want is for these parameters to be uncoupled or to be de facto independent. To put it another way, what matters in a prior is not what the prior looks like; what matters is what the posterior looks like. We’d like to be able to estimate, from hierarchical data, the scale parameters and also the correlations. The redundant parameterization in the scaled inverse Wishart prior (which, just to remind you, is due to O’Malley and Zaslavsky, not me; all I’ve done is to publicize it) allows scale parameters and correlations to both be estimated from data. It fixes the problem with the unscaled inverse-Wishart.
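To make the redundant parameterization concrete, here is a minimal Python sketch of drawing from a scaled inverse-Wishart prior, written as Sigma = Diag(xi) Q Diag(xi) with Q drawn from an inverse-Wishart. The function names, the lognormal prior on the scale multipliers `xi`, and the choice of `K + 1` degrees of freedom with an identity scale matrix are assumptions made for illustration, not something fixed by the discussion above. The point it shows is mechanical: multiplying by `xi` leaves the correlations of Q untouched while rescaling each standard deviation to `xi_k * sqrt(Q_kk)`, so the scales get their own free parameters.

```python
import numpy as np
from scipy.stats import invwishart


def sample_scaled_inv_wishart(K, n_draws=2000, xi_log_sd=1.0, seed=0):
    """Draw covariance matrices Sigma = Diag(xi) Q Diag(xi).

    Assumptions (illustrative only): Q ~ Inv-Wishart(K + 1, I) and
    log(xi_k) ~ Normal(0, xi_log_sd), independently for each k.
    Use n_draws > 1 so that Q has shape (n_draws, K, K).
    """
    rng = np.random.default_rng(seed)
    # Unscaled inverse-Wishart component.
    Q = invwishart.rvs(df=K + 1, scale=np.eye(K), size=n_draws, random_state=rng)
    # Redundant scale multipliers, one per dimension.
    xi = np.exp(rng.normal(0.0, xi_log_sd, size=(n_draws, K)))
    # Sigma[n, i, j] = xi[n, i] * Q[n, i, j] * xi[n, j]
    Sigma = xi[:, :, None] * Q * xi[:, None, :]
    return xi, Q, Sigma


def sds_and_corr(Sigma):
    """Split a stack of covariance matrices into standard deviations and correlations."""
    sd = np.sqrt(np.einsum("nii->ni", Sigma))  # per-draw diagonal
    R = Sigma / (sd[:, :, None] * sd[:, None, :])
    return sd, R


if __name__ == "__main__":
    xi, Q, Sigma = sample_scaled_inv_wishart(K=3)
    sd_Q, R_Q = sds_and_corr(Q)
    sd_S, R_S = sds_and_corr(Sigma)
    # The correlations are those of Q; the scales are stretched by xi.
    print("correlations unchanged by scaling:", np.allclose(R_Q, R_S))
    print("sd(Sigma) == xi * sd(Q):", np.allclose(sd_S, xi * sd_Q))
```

In this sketch the implied standard deviations still involve `sqrt(Q_kk)`, so scales and correlations are not exactly independent in the prior; the extra `xi` parameters are what let the data pin down the scales separately.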

There’s nothing so wonderful about the Wishart or inverse-Wishart in any given example. These are all just models. What I like about these models is that they are computationally convenient, and the scaled version allows the flexibility we want for a hierarchical model. The Barnard et al. model is fine too (and, as I said, I love their article) but I don’t see any particular reason why these parameters should be independent in the prior. That’s just another choice too. What matters is how things get estimated in the posterior.
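For comparison, here is a rough sketch of the kind of separated prior Barthelmé describes, in which Sigma = Diag(s) R Diag(s) and the standard deviations `s` are drawn independently of the correlation matrix `R`. The specific choices here are assumptions for illustration: lognormal standard deviations, and a correlation matrix obtained by rescaling an inverse-Wishart draw with `K + 1` degrees of freedom (as I understand it, one of the options Barnard et al. consider, chosen because it gives marginally uniform correlations). The function name is hypothetical.

```python
import numpy as np
from scipy.stats import invwishart


def sample_separation_prior(K, n_draws=2000, log_sd_scale=1.0, seed=0):
    """Draw Sigma = Diag(s) R Diag(s) with s independent of R.

    Assumptions (illustrative only): log(s_k) ~ Normal(0, log_sd_scale),
    and R is the correlation matrix of Q ~ Inv-Wishart(K + 1, I).
    Use n_draws > 1 so that Q has shape (n_draws, K, K).
    """
    rng = np.random.default_rng(seed)
    # Standard deviations get their own prior.
    s = np.exp(rng.normal(0.0, log_sd_scale, size=(n_draws, K)))
    # Correlation matrix: normalize an inverse-Wishart draw to unit diagonal.
    Q = invwishart.rvs(df=K + 1, scale=np.eye(K), size=n_draws, random_state=rng)
    q_sd = np.sqrt(np.einsum("nii->ni", Q))
    R = Q / (q_sd[:, :, None] * q_sd[:, None, :])
    # Recombine: Sigma[n, i, j] = s[n, i] * R[n, i, j] * s[n, j]
    Sigma = s[:, :, None] * R * s[:, None, :]
    return s, R, Sigma


if __name__ == "__main__":
    s, R, Sigma = sample_separation_prior(K=3)
    implied_sd = np.sqrt(np.einsum("nii->ni", Sigma))
    # Here the implied standard deviations are exactly s, by construction.
    print("sd(Sigma) == s:", np.allclose(implied_sd, s))
```

Drawing from this prior is cheap; the computational cost mentioned above arises when fitting the model, since the conditional distributions for `s` and `R` are no longer of a convenient conjugate form.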

Unfortunately, even now I think the inverse-Wishart is considered the standard, and people don’t always know about the scaled inverse-Wishart. Another problem is that people often think of these models in terms of how they work with direct multivariate data, but I’m more interested in the hierarchical modeling context where a set of parameters (for example, regression coefficients) vary by group.

## 2 thoughts on “The scaled inverse Wishart prior distribution for a covariance matrix in a hierarchical model”

1. Thanks for the discussion! I’m not sure I agree with you, though. Sure, what matters most is how the prior behaves in the region that contains the bulk of the likelihood mass, and you can use that to fix the problem. It still feels like a hack, so I’d rather start with a clean solution if at all feasible (from the computational point of view). And I’d also argue the clean solution should be the default, because most end users don’t expect that kind of issue to crop up, I’d imagine.

• Simon: I’ll side with you on this, but first, all models are wrong so the question here is whether a less wrong model is enough more useful that benefits exceed the extra costs.

And as I understand Andrew, he does not consider posterior probabilities “as probabilities” to (usually) be useful in and of themselves (I agree) and so we need not even worry that priors and posteriors are really not probabilities (though one cannot depend on them having those properties).

Where I almost got burned was a problem involving three related proportions. I was using a published WinBUGS script with a convenient prior that the likelihood needed to “properly” correct: the prior put a large amount of mass on proportions greater than one. In my sparse data set, the posterior ended up with a 30% probability on an implied (marginal) proportion of interest (something like p1/(p2 + p3)) greater than one. A quick plot of the marginal prior did allow me to see the problem, but I sent the WinBUGS script off to a Bayesian colleague and asked why the posterior ended up being impossible.

They figured it out, but not right away and only after a lot of thought (that’s why I think people should regularly plot marginal priors, likelihoods and posteriors). I then sent the query to the author of the script and they answered that the likelihood from their data got rid of the impossible prior probabilities, so it was not a problem. I followed up with “but you published it for common use, without that warning and strongly emphasized the value of having obtained posterior probabilities [of value in and of themselves] but the joint distribution is definitely misspecified”.

But my main point here is that, if many people are unlikely to understand the limitations of a more wrong model, the benefits of the less wrong model will exceed the extra costs. On the other hand, if the limitations can somehow be made clear and automatically raised in the method for the more wrong model, the benefits of the less wrong model will not exceed the extra costs.