Someone sent me a question regarding the inverse-Wishart prior distribution for covariance matrix, as it is the default in some software he was using. Inverse-Wishart does not make sense for prior distribution; it has problems because the shape and scale are tangled. See this paper, “Visualizing Distributions of Covariance Matrices,” by Tomoki Tokuda, Ben Goodrich, Iven Van Mechelen, Francis Tuerlinckx and myself. Right now I’d use the LKJ family. In Stan there are lots of options. See also our wiki on prior distributions.

On a bit of a tangent, but still correlationy stuff… regarding your wiki on prior choice recommendations, and the case with a single correlation coefficient… I am sincerely wondering if there is something wrong with

atanh(rho) ~ normal

This was my first idea, but it is excluded from the wiki and there doesn’t seem to be much about it on the internet (or at least I couldn’t find much in a couple of tired minutes).

Here’re some R plots, they’re always fun, aren’t they:

prior_density = function(y, par){

dnorm(atanh(y), par[1], par[2]) * abs(1 / (1 – y^2))

}

pars = matrix(c(0.0, 1.0,

0.3, 0.8,

0.0, 0.3,

-0.8, 0.8), ncol = 2, byrow = T)

y = seq(-1, 1, length.out = 400)

par(mfrow = c(2, 2))

for(i in 1:nrow(pars)){

y_density = prior_density(y, pars[i,])

plot(y, y_density, type = “l”, ylab = “Density”, xlab = expression(rho),

ylim = c(0, 2), col = “red”, main = paste(“mu =”, pars[i,1], “sigma =”, pars[i,2]))

hist(tanh(rnorm(50000, pars[i,1], pars[i,2])), prob = T, add =T)

}

You can view the suggested prior this way:

With ggplot,

It gives you a U-shaped distribution, which is probably not what you want as a correlation prior.

A simple uniform(-1, 1) would be close, but uniform. We tend to prefer distributions concentrated around 0 to regularize correlation estimates. You can do that by cutting down on the scale of the normal, say normal(0, 0.5) instead of normal(0, 1).

With Stan, you could just truncate a normal(0, whatever) to [-1, 1] rather than transforming using tanh(). That would still allow the boundaries to have non-zero density.

Using uniform(-1, 1) is what you get in 2 dimensions setting the LKJ parameter to 1. As dimensionality increases, LKJ(1) is still uniform on correlation matrices, but each marginal correlation becomes more peaked around 0 because of the positive-definiteness constraint.

I’m afraid I got images in there by abusing my posting privileges and directly uploading the images to the site then linking in by hand in the comments. I don’t know how to include images if you don’t have posting privilege.

Bob:

> I’m afraid I got images

Anything wrong with doing that?

Nothing wrong, I just didn’t want people wasting time trying to figure out how to include images. I included them as

Anyone can link to externally hosted images this way, too.

Thanks for your answer. Indeed, I tried depicting different prior shapes in my R code (the U-shaped when mu = 0 and sigma = 1 was one of them) but it got all mangled up. Poor mangled code.

I reckon, when using uniform or truncated normal, Stan will still transform the parameter under the hood. Not with tanh though, if I’m reading the manual correctly. In hand-rolled algorithms, without using any transformations at any point, I think using uniform or truncated distribution would save one from adjusting the prior density, which might or might not be practical.

On a somewhat unrelated point, what are your recommendations for hierarchically specified variances? If I expect the spread of an outcome to be similar within a grouping factor, but different across grouping factors, how would you recommend the variance distribution be formulated?

For some applications I’ve used a normal distribution for the log variance, and from there modeled the hierarchy much as one would for the mean structure.

Interesting question- I’m also curious what thoughts are about this! Why normal on log(variance)? My guess is that it would be easier/better to work directly on the SD scale.

The main problem with log normal distributions for variance (or scale) is that they’re zero-avoiding in the sense that the density approaches zero as the variance goes to zero. So if the true hierarchical variance is zero, you won’t get there with this kind of prior. Otherwise, it should be OK.

I’m curious what Andrew recommends here. I don’t recall seeing hierarchical variance structure in any of the books.

I’ve tried to do this a couple of times but I’ve never been entirely satisfied with my solutions. So count me as one more who is curious about this.