## Conjugate prior to Dirichlets?

Mark Johnson writes:

Suppose we want to build a hierarchical model where the lowest level are multinomials and the next level up are Dirichlets (the conjugate prior to multinomials). What distributions should we use for the level above that? I’m not a statistician, but aren’t Dirichlet distributions exponential models? If so, Dirichlets should have conjugate priors, which might be useful for the next level up in hierarchical models. I’ve never heard anyone talk about conjugate priors for Dirichlets, but perhaps I’m not listening to the right people. Do you have any other suggestions for priors for Dirichlets?

My reply: I’m not sure, but I agree that there should be something reasonable here. I’ve personally never had much success with Dirichlets. When modeling positive parameters that are constrained to sum to 1, I prefer to use a redundantly-parameterized normal distribution. For example, if theta_1 + theta_2 + theta_3 + theta_4 = 1, with all thetas constrained to be positive, I’ll use the model,

theta_j = exp(phi_j)/(exp(phi_1)+exp(phi_2)+exp(phi_3)+exp(phi_4)), for j=1,2,3,4.

If you give each of the phi_j’s a normal distribution, this is a more flexible model than the Dirichlet: it has 8 parameters (four means and four variances). Well, actually 7, because the means are only identified up to an arbitrary additive constant.
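As a rough illustration of the transformation above, here is a minimal NumPy sketch. The particular means and standard deviations are made-up values, and the function names `softmax` and `sample_theta` are just labels chosen for this example; nothing here is taken from the toxicology analysis.

```python
import numpy as np

def softmax(phi):
    """Map unconstrained phi to positive weights that sum to 1."""
    # Subtracting the max does not change the result and avoids overflow.
    z = np.exp(phi - np.max(phi, axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def sample_theta(mu, sigma, n_draws, rng):
    """Draw theta by giving each phi_j an independent normal(mu_j, sigma_j)."""
    phi = rng.normal(mu, sigma, size=(n_draws, len(mu)))
    return softmax(phi)

rng = np.random.default_rng(0)
mu = np.array([0.0, 0.5, 1.0, -0.5])    # hypothetical means
sigma = np.array([0.3, 0.3, 1.0, 1.0])  # hypothetical standard deviations

theta = sample_theta(mu, sigma, 5, rng)
print(theta)              # five draws, each a point on the 4-simplex
print(theta.sum(axis=1))  # each row sums to 1 up to rounding

# The means are identified only up to an additive constant:
phi = rng.normal(mu, sigma)
print(np.allclose(softmax(phi), softmax(phi + 3.7)))
```

The last check shows why one of the eight parameters is redundant: adding the same constant to every phi_j leaves theta unchanged.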

Frederic Bois and I used this distribution for a problem in toxicology, modeling blood flows within different compartments of the body; these were constrained to sum to total blood flow.

In this article, I used a fun stochastic approximation trick to compute reasonable values for the mean and variance parameters for the phi’s.

P.S. Dana Kelly points to this article on the topic by Aitchison and Shen from 1980.

1. Xi'an says:

Yes, indeed, Dirichlet distributions are a type of exponential family (see Example 3.3.4 in The Bayesian Choice, 2004) and therefore "enjoy" conjugate priors. It is however a non-standard type of distribution over R^p with (p+1) parameters, as illustrated in the special case of the Beta distribution just before Example 3.3.15 in The Bayesian Choice (2004, p.122). So, despite having to move from (p+1) to (2*p-1) parameters, I think Andrew's solution is more intuitive.

2. Ted Dunning says:

Dirichlet processes are commonly used for this sort of problem, and there are a number of very successful results in hierarchical problems using DPs. This has the virtue of giving you a non-parametric solution in the bargain.

If you don't need a conjugate prior and don't want to go all the way to non-parametric methods then choosing normalized Dirichlet parameters from a Dirichlet distribution and choosing a multiplier from some distribution like Gamma or log-normal works well. This formulation allows you to use a Dirichlet distribution to parametrize the distribution of Dirichlet distributions.
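One way to read the construction just described is sketched below in NumPy. The hyperparameter values and the function name `sample_hierarchy` are assumptions made for illustration, and the Gamma is only one of the multiplier distributions mentioned.

```python
import numpy as np

def sample_hierarchy(base_alpha, shape, scale, n_groups, rng):
    """Draw Dirichlet parameters as (multiplier * normalized mean), then draw thetas."""
    m = rng.dirichlet(base_alpha)   # normalized Dirichlet parameters, sum to 1
    s = rng.gamma(shape, scale)     # positive multiplier (overall concentration)
    alpha = s * m                   # parameters of the lower-level Dirichlet
    theta = rng.dirichlet(alpha, size=n_groups)  # one probability vector per group
    return m, s, theta

rng = np.random.default_rng(0)
base_alpha = np.array([2.0, 2.0, 2.0])  # hypothetical top-level Dirichlet parameters
m, s, theta = sample_hierarchy(base_alpha, shape=5.0, scale=2.0, n_groups=4, rng=rng)
print(m, s)
print(theta)  # rows scatter around m; a larger s pulls them closer to m
```

Separating the mean m from the multiplier s keeps the two roles distinct: m says where the group-level multinomial probabilities are centered, and s says how tightly they cluster.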

One of the most common problems using Dirichlet distributions in MCMC approaches to practical problems is that near the boundaries, you get nasty behaviors from Metropolis samplers. The asymptotic distribution of samples is just fine, but the probability of repeating a single sample near the boundary can be quite high. This leads to "stiff" sampling problems where mixing is very good in some regions and very bad in others.

Sampling the log of the parameters is one of the easiest fixes to the boundary problem. When sampling from the Dirichlet, this is the soft-max basis.
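Here is a minimal sketch of what sampling on the log scale can look like. It is one possible reading, not necessarily the method in the linked post: it runs random-walk Metropolis on y_j = log g_j, where g_j ~ Gamma(alpha_j, 1), and maps each state through the soft-max to get a point on the simplex. The step size, chain length, and function names are assumptions for this example.

```python
import numpy as np

def log_target(y, alpha):
    """Log density (up to a constant) of y_j = log g_j with g_j ~ Gamma(alpha_j, 1)."""
    return np.sum(alpha * y - np.exp(y))

def metropolis_softmax(alpha, n_steps, step, rng):
    """Random-walk Metropolis on unconstrained y; returns theta = softmax(y) per step."""
    y = np.log(alpha)               # start at the mode of the target
    lp = log_target(y, alpha)
    draws = np.empty((n_steps, len(alpha)))
    for t in range(n_steps):
        prop = y + step * rng.normal(size=len(alpha))
        lp_prop = log_target(prop, alpha)
        if np.log(rng.uniform()) < lp_prop - lp:  # Metropolis accept/reject
            y, lp = prop, lp_prop
        e = np.exp(y - y.max())     # soft-max, shifted for numerical stability
        draws[t] = e / e.sum()
    return draws

rng = np.random.default_rng(0)
alpha = np.array([0.5, 1.0, 2.0])   # hypothetical Dirichlet parameters
draws = metropolis_softmax(alpha, n_steps=30000, step=1.0, rng=rng)
print(draws[2000:].mean(axis=0))    # should be roughly alpha / alpha.sum()
print(alpha / alpha.sum())
```

Because y is unconstrained, proposals never fall outside the support, which is the source of the sticking behavior near the boundary when proposing directly on the simplex.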

Depending on what you need to do, it is commonly the case that the prior does not really need to be conjugate. Certainly for numerical work, this really isn't necessary.

See http://tdunning.blogspot.com/2009/04/sampling-dir… for details.

3. bill r says:

Since a Dirichlet is/can be viewed as independent draws from gammas with a fixed scale and normalized, can't the conjugate for a gamma be used here?
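The gamma representation mentioned here is easy to check numerically. The sketch below uses NumPy, with made-up alpha values and a function name chosen for this example; it only illustrates the representation and does not settle the question about the gamma's conjugate prior.

```python
import numpy as np

def dirichlet_via_gammas(alpha, n_draws, rng):
    """Draw Dirichlet(alpha) vectors by normalizing independent Gamma(alpha_j, 1) draws."""
    g = rng.gamma(shape=alpha, scale=1.0, size=(n_draws, len(alpha)))
    return g / g.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical Dirichlet parameters
theta = dirichlet_via_gammas(alpha, 20000, rng)

print(theta.mean(axis=0))               # close to alpha / alpha.sum()
print(rng.dirichlet(alpha, 20000).mean(axis=0))  # direct Dirichlet draws for comparison
```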

4. George Papandreou says:

Indeed, the Dirichlet has a conjugate prior, which, however, is not a "standard" distribution and does not seem to have been explored in the statistics literature (the only related reference I have found is the example in Robert's Bayesian Choice cited by Xi'an above).

Stamatis Lefkimmiatis, Petros Maragos, and I looked into this issue in relation to a problem in photon-limited imaging, where wavelet-like coefficients arising from a multiscale image decomposition are naturally modeled with Dirichlet mixtures. In order to regularize the estimation of the Dirichlet parameters, we imposed on them their conjugate prior. The conjugate prior family (with p+1 parameters), as derived in our work, is non-standard, and its normalization constant is not available in closed form. However, by constraining the natural conjugate prior "eta" parameter to zero, we arrive at a particularly simple and interesting sub-family, where the p Dirichlet alpha_i parameters are given independent exponential priors with possibly distinct scale parameters. We used this special case of the conjugate family as the prior in our penalized maximum likelihood estimator, and it proved particularly effective.
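For readers who want to see where the p+1 parameters and the exponential special case come from, here is a sketch using the generic exponential-family recipe. The symbols \nu, \lambda, and A are notation chosen for this sketch, and the parameterization in the paper may differ.

```latex
% Dirichlet density written as an exponential family in \alpha
p(\theta \mid \alpha)
  = \Big(\prod_{i=1}^{p} \theta_i^{-1}\Big)
    \exp\Big( \sum_{i=1}^{p} \alpha_i \log\theta_i - A(\alpha) \Big),
\qquad
A(\alpha) = \sum_{i=1}^{p} \log\Gamma(\alpha_i) - \log\Gamma\Big(\sum_{i=1}^{p} \alpha_i\Big).

% Natural conjugate prior on \alpha: p parameters \nu_i plus one parameter \eta
\pi(\alpha \mid \nu, \eta)
  \propto \exp\Big( \sum_{i=1}^{p} \nu_i \alpha_i - \eta\, A(\alpha) \Big),
\qquad \alpha_i > 0.

% Update after observing \theta^{(1)}, \dots, \theta^{(n)}
\nu_i \leftarrow \nu_i + \sum_{k=1}^{n} \log\theta_i^{(k)},
\qquad
\eta \leftarrow \eta + n.

% Special case \eta = 0, writing \lambda_i = -\nu_i > 0:
\pi(\alpha \mid \nu, 0)
  \propto \prod_{i=1}^{p} \exp(-\lambda_i \alpha_i),
% i.e. independent exponential priors on the \alpha_i with rates \lambda_i.
```

The term \eta A(\alpha) is what makes the general normalizing constant intractable; dropping it leaves a product of one-dimensional exponential densities.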

To probe further, see our paper "Bayesian Inference on Multiscale Models for Poisson Intensity Estimation: Applications to Photon-Limited Image Denoising", just accepted for publication in the IEEE Transactions on Image Processing. After the paper is published, a reprint will be available from our group's web site. Until then, here is a preprint.

5. Alex says:

Maybe a Bayesian nonparametric approach, the hierarchical Dirichlet process (HDP), can solve the problem. A DP is Dirichlet-distributed on any finite partition, and we can put a second layer of DP on top of the first layer of DPs, which may serve as the prior on the Dirichlet.