Bayesian hierarchical models: Neal Beck responds

We had an interesting discussion on last week's blog entry about Bayesian statistics, where we wrote:

Boris presented the TSCS paper at the Midwest meetings and was accused by Neal Beck of not being a real Bayesian. Beck was making the claim that “we’re not Bayesians” because we’re using uninformative priors. He seems to be under the assumption that Bayesians only use informative priors.

Neal reports that he had more to say and graciously emailed me a longer version of his comment about Bayesian methods. He has some interesting things to say. I’ll present his comments, then my reactions.

Neal Beck writes:

Since my comments (from the audience) on the Shor et al. paper at the Midwest Political Science Association meetings (April 2005) [referred to as SBKP for its co-authors] were mentioned on the blog, I thought I would chime in. Since I wasn’t quoted, I can hardly have been misquoted. But as either a non-Bayesian or at most a fellow traveller (are you now or have you ever been a Bayesian?), I am hardly in a position to accuse people of not being Bayesian. However, I was curious about what was Bayesian in the SBKP paper. I will restrict my comments to that specific example, but I think the issue generalizes.

The SBKP paper does Monte Carlo comparisons of a variety of methods for estimating time-series cross-section models; one method is an MCMC analysis of a multilevel model (where only the intercepts vary randomly across units). My claim is that SBKP is simply using Bayesian methods to explore the likelihood (and the diffuse prior is required to make sure that the posterior density looks like the likelihood). There is no question that the Bayesian methods are better at handling many (some? most?) complicated setups than are the traditional non-Bayesian ml methods. And at this time of the year, that should call for a dayenu. But ….
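
(For concreteness, here is one way to write down the kind of varying-intercept model being described: a sketch with assumed notation and made-up dimensions, not SBKP’s actual specification.)

```python
# Sketch of a varying-intercept TSCS model (assumed notation):
#   y[i, t] = alpha[i] + beta * x[i, t] + eps[i, t]
# with only the unit intercepts alpha[i] varying randomly across units.
import numpy as np

rng = np.random.default_rng(0)
n_units, n_periods = 20, 30                 # hypothetical panel dimensions
beta, sigma_alpha, sigma = 1.0, 0.5, 1.0    # assumed "true" values

alpha = rng.normal(0.0, sigma_alpha, size=n_units)   # random intercepts
x = rng.normal(size=(n_units, n_periods))            # a single covariate
y = alpha[:, None] + beta * x + rng.normal(0.0, sigma, size=(n_units, n_periods))
```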

Let me leave aside that for models like the multilevel one, the Bayesian story is much simpler and less contorted than the classical story (though again, this is not a trivial win for Bayes), and also that the interpretation of the likelihood in terms of hpd’s and credible intervals is much more sane than the frequentist’s contorted confidence intervals (two more dayenus). (I leave it as an open question whether it is kosher to use standard ml methods and then interpret the results as a Bayesian would, but if the credible interval and the confidence interval have the same endpoints, does it matter how I compute those endpoints before I interpret them?)

So the claim is that the Bayesian work in SBKP (and I would claim, without citation, that this holds for much (a lot? most?) of the recent spate of MCMC applications) is simply a better way of exploring the likelihood surface (including finding its mode) than are the traditional ml methods. If perfectly uninformative priors were feasible, SBKP would use them, because that would better enable exploration of the likelihood surface. If SBKP found that the posterior was non-trivially affected by the prior, they would go back and make the prior more diffuse.
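
(A toy illustration of this point, not SBKP's model: with a flat prior, the Metropolis acceptance ratio depends only on the likelihood, so the sampler is literally exploring the likelihood surface.)

```python
# Random-walk Metropolis for the mean of a normal (sd known), flat prior.
# With a flat prior, the log-posterior difference in the accept step
# equals the log-likelihood difference, so the draws trace the likelihood.
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.0, size=50)            # hypothetical data

def log_lik(mu):
    return -0.5 * np.sum((y - mu) ** 2)      # up to an additive constant

mu, draws = 0.0, []
for _ in range(5000):
    prop = mu + rng.normal(0.0, 0.5)         # random-walk proposal
    if np.log(rng.uniform()) < log_lik(prop) - log_lik(mu):
        mu = prop                            # accept
    draws.append(mu)

print(np.mean(draws[1000:]), y.mean())       # posterior mean is near the MLE
```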

Consider the following thought experiment:
Scott McNealy gives us all the latest Sun Fire servers, and through the wonders of the UltraSPARC we can now estimate all our favorite models using standard ml methods. How much interest in MCMC would remain?

Thus the claim is that in applications like SBKP’s (and the claim is that such applications are common), the prior is a nuisance made necessary by the wondrous computational advantages of MCMC. In a world of infinitely fast chips, these applications would happily revert to being estimated via classical methods (though keeping the beauty of Bayesian interpretation). Thus the claim is that “computational Bayesians” (what Jeff Gill tells me he calls “convenient Bayesians”) are, at a minimum, not exactly Bayesians.

Now let us move away from computation. Some of the applications that SBKP and I are interested in are in comparative political economy. This is a vibrant area of research, and I think we have actually learned something about this field in the last two decades. So my closing query is: will we ever see a time when Bayesians (computational or otherwise) actually update their priors? If we do not see this, then I might be content to call such scholars non-Bayesians (but excellent classicists). Of course, they would still at least be able to interpret the intervals they get in a sane manner.

My reply:

1. You have a good point about noninformative priors. If more information is available, it’s a good idea to use it. Noninformative prior distributions can be convenient, but they can often be improved upon.

2. You talk about “exploring the likelihood,” but it’s not so simple. What exactly is the likelihood you’re exploring? In a hierarchical model with dozens of parameters, the joint likelihood of all these parameters is a mess, and you certainly can’t do anything with its joint maximum. Perhaps you want the maximum of the marginal likelihood, averaging over some of the parameters? That’s OK, but that’s also Bayesian–it’s averaging over uncertainty in the distribution. (There’s a toy sketch of this averaging at the end of my reply.)

3. You say that “computational Bayesians” are not exactly Bayesians. We can each use the term as we want, but my preferred definition, to borrow from my earlier blog entry, is that Bayesian inference is characterized by the use of the posterior distribution–the distribution of unknowns, conditional on knowns. Bayesian inference can be done with different sorts of models. In general, more complex models are better, but a simpler model is less effort to set up and can be used as a starting point in a wide range of examples.

4. In many problems, real gains can come from using informative prior distributions (see this paper, for example), but in other cases, hierarchical models–even with noninformative prior distributions–can take you places that classical models don’t usually go. See Chapter 5 of Bayesian Data Analysis, or here and here for recent examples in political science. (The second sketch at the end of my reply shows the partial pooling I have in mind.)

5. If you can get the same results as I do, but using some version of classical statistics and a fast processor, that’s fine with me! I like the hierarchical Bayesian framework because it allows me to throw lots of information into an analysis, and it also provides me with a systematic way to check the fit and improve a model (see Chapter 6 of our book). If you have another way to do this, that’s fine too.

My suspicion is that, at some point, when you want to control for lots of factors, and you want to allow treatment effects to vary geographically, and you want to consider data from many years and many countries, and your models are getting complicated, you’ll have to either restrict your models or else go Bayesian. But I don’t mind if you wait until then.
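
Two toy sketches to make points 2 and 4 concrete. All the numbers are made up; these illustrate the ideas, not any particular analysis.

First, the “averaging over parameters” in point 2. In the model y_j | a_j ~ N(a_j, sigma^2) with a_j ~ N(mu, tau^2), the marginal likelihood integrates the a_j out, and numerical integration reproduces the analytic N(mu, sigma^2 + tau^2) answer:

```python
# Marginal likelihood of one observation by integrating out its latent mean.
import numpy as np

def norm_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

y_j, mu, sigma, tau = 1.3, 0.0, 1.0, 0.5          # hypothetical values
a = np.linspace(-10, 10, 20001)                   # grid over the latent a_j
marginal = np.trapz(norm_pdf(y_j, a, sigma) * norm_pdf(a, mu, tau), a)
analytic = norm_pdf(y_j, mu, np.sqrt(sigma**2 + tau**2))
print(marginal, analytic)                         # the two agree
```

Second, the partial pooling in point 4. Conditional on a between-group sd tau, a hierarchical model pulls each group estimate toward the precision-weighted grand mean, most strongly for the noisiest groups:

```python
# Shrinkage estimates, conditional on tau (grand mean plugged in for mu).
import numpy as np

y = np.array([28.0, 8.0, -3.0, 7.0])      # hypothetical group estimates
se = np.array([15.0, 10.0, 16.0, 11.0])   # their standard errors
tau = 8.0                                 # assumed between-group sd

w = (1 / se**2) / (1 / se**2 + 1 / tau**2)       # weight on each group's data
grand = np.sum(y / se**2) / np.sum(1 / se**2)    # precision-weighted mean
print(w * y + (1 - w) * grand)                   # partially pooled estimates
```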

3 thoughts on “Bayesian hierarchical models: Neal Beck responds”

  1. Our paper argues for a Bayesian multilevel model (BML) estimator for time-series cross-sectional (TSCS) data. We compare the BML estimator to the more commonly employed estimators in Monte Carlo experiments on simulated data.

    TSCS data has become increasingly popular in political science in recent years–due in large part to a series of papers by Neal Beck and Jonathan Katz, beginning a decade ago, on better ways of modeling such data.

    Neal has been very helpful in commenting on our paper, both at the Midwest conference and over email. His comments, both philosophical and practical, are helping us to improve it.

    Neal expressed skepticism at the conference about the Bayesian content of our paper, due to our choice of a diffuse prior. Because we chose such a prior, he argued, the Bayesian content of the paper was reduced to simple computation.

    Our response:

    Using diffuse priors does not disqualify us from being Bayesian. To us, being Bayesian means the explicit employment of priors, combined with sample data, to generate posterior distributions, which we then use to characterize parameters of interest.

    The level of informativeness of the priors is chosen for a particular purpose. We wanted the BML estimator to be essentially “blind” to the true beta we used in generating the various data sets in our Monte Carlo analysis. So it would have been odd to employ an informative prior, giving the BML estimator an unfair advantage over the other estimators. (A skeleton of this logic appears at the end of this comment.)

    We agree with Andrew that the Bayesian multilevel approach makes fitting extremely complex models easier than with classical techniques. TSCS models are definitely in the class of complicated models; consequently, we think our approach is appropriate.

    Moreover, the Bayesian multilevel approach is general enough to allow us to add additional structure and predictors at multiple levels of analysis. That flexibility is a big plus to us, inasmuch as we think that multilevel/hierarchical structure is an important aspect of typically used data (especially TSCS data). Not modeling that structure is common and, we think, mistaken.
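
    To make the “blind” point concrete, here is a skeleton of the Monte Carlo logic (not our actual code; pooled OLS stands in for the estimators we actually compare): the true beta is fixed by the experimenter, and every estimator sees only the simulated data, never beta itself.

    ```python
    # Skeleton of a Monte Carlo experiment for a varying-intercept TSCS model.
    import numpy as np

    rng = np.random.default_rng(2)
    beta_true = 1.0                        # known to the experimenter only

    def simulate_tscs(rng, n_units=20, n_periods=30):
        alpha = rng.normal(0.0, 0.5, n_units)      # varying intercepts
        x = rng.normal(size=(n_units, n_periods))
        y = alpha[:, None] + beta_true * x + rng.normal(size=(n_units, n_periods))
        return x, y

    def fit_pooled_ols(x, y):              # stand-in estimator; BML would go here
        xc, yc = x - x.mean(), y - y.mean()
        return np.sum(xc * yc) / np.sum(xc * xc)

    errors = []
    for _ in range(200):                   # replications
        x, y = simulate_tscs(rng)
        errors.append(fit_pooled_ols(x, y) - beta_true)
    print(np.mean(errors), np.std(errors)) # bias and spread of the estimator
    ```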

  2. By the way, our paper can be found here.

    Another thought on this topic. One advantage of Bayesian inference is that you can choose the appropriate informativeness of your priors. This is a theory-driven choice. So, while uninformative priors are appropriate for us here, more informative priors may be appropriate in other studies (and this choice can be made differently for various predictors).

    To Neal's last point: sure, if there's settled opinion in comparative political economy on the effect of a certain covariate on a given outcome, you can increase the informativeness of the prior on that covariate. This will of course be more consequential the smaller the data set (see the toy example at the end of this comment), but that is something the researcher can choose to do (or not).

    Again, this underlines how flexible and general Bayesian inference is. We feel that this is a major advantage for studying TSCS data sets.

    Andrew has a clear and concise three-page paper on priors here, where he illustrates the choices data analysts face in setting appropriate priors.

    And his paper on priors for variance parameters, a related topic, can be found here.
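
    Here is the toy example mentioned above: a conjugate-normal calculation (made-up numbers) in which the same prior moves the posterior substantially with 5 observations and barely at all with 500.

    ```python
    # Posterior mean for a normal mean (sd known) under a N(m0, s0^2) prior.
    def posterior_mean(ybar, n, sigma, m0, s0):
        prec_data, prec_prior = n / sigma**2, 1 / s0**2
        return (prec_data * ybar + prec_prior * m0) / (prec_data + prec_prior)

    for n in (5, 500):                     # hypothetical sample sizes
        print(n, posterior_mean(ybar=1.0, n=n, sigma=1.0, m0=0.0, s0=0.5))
    # n = 5 gives ~0.56 (prior matters); n = 500 gives ~0.99 (data dominate)
    ```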

  3. Having been the discussant on this paper at MPSA, let me reiterate my earlier comments. TSCS models are truly pooled time series. You cannot use a flat prior / non-informative prior for these models (especially for the kind of comparative political macro-economic data used in most of these analyses). Chris Sims at Princeton has written extensively on this point over the last 15+ years: basically, a flat / non-informative prior does not make sense if you expect dynamics, because the ML approach works with the conditional likelihood while the posterior includes the dynamics. Ergo, the likelihood and the posterior can be far apart in the parameter space. This is a major issue: if you are trying to do conditional prediction or counterfactual inference, the difference between the likelihood and the posterior can be large. (See Brandt and Freeman from MPSA on this point; a small numeric sketch of the conditional-versus-exact likelihood gap follows below.) I guess I need to post my slides from the panel at MPSA….
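
    Here is that sketch: for an AR(1), y_t = rho * y_{t-1} + e_t, the conditional likelihood drops the stationary density of y_1, and for short series the two functions can peak in noticeably different places, especially when rho is near 1. (Generic simulated data, not the comparative-politics case.)

    ```python
    # Conditional vs. exact log-likelihood of an AR(1) with sd-1 innovations.
    import numpy as np

    rng = np.random.default_rng(3)
    rho_true, T = 0.9, 25                  # hypothetical persistence, short series
    y = np.zeros(T)
    y[0] = rng.normal(0, 1 / np.sqrt(1 - rho_true**2))   # stationary start
    for t in range(1, T):
        y[t] = rho_true * y[t - 1] + rng.normal()

    def cond_loglik(rho):                  # conditions on y[0]
        e = y[1:] - rho * y[:-1]
        return -0.5 * np.sum(e**2)

    def exact_loglik(rho):                 # adds the stationary density of y[0]
        return cond_loglik(rho) + 0.5 * np.log(1 - rho**2) - 0.5 * (1 - rho**2) * y[0]**2

    rhos = np.linspace(-0.99, 0.99, 199)
    print(rhos[np.argmax([cond_loglik(r) for r in rhos])],
          rhos[np.argmax([exact_loglik(r) for r in rhos])])
    ```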

Comments are closed.