Expanding concepts of Bayesian inference

Continuing the discussion of Neal Beck’s comment on David Park’s models: the concept of Bayesian inference has been steadily generalized over the decades. Let me steal some words from my 2003 article in the International Statistical Review:

It is an important tradition in Bayesian statistics to formalize potentially vague ideas, starting with the axiomatic treatment of prior information and decision making from the 1920s through the 1950s. For a more recent example, consider hierarchical modeling.

In the 1960s and 1970s, it was recognized that Bayesian inference for a sequence of parameters could have better statistical properties if data-dependent prior distributions were allowed. This developed into the “empirical Bayes” approach. But then, through the work of Hill (1965), Tiao and Tan (1965), Lindley and Smith (1972), Rubin (1981), and others, the hierarchical Bayes approach was developed, obtaining the benefits of data-based prior distributions in a fully Bayesian mathematical framework. Other places where vague statistical ideas have been formalized are in modeling missing data (Rubin, 1976) and model averaging (Draper, 1995; Raftery, 1995).

All of these ideas have the form of mathematical generalizations. Start with the likelihood, p(y|theta). Bayesian inference generalizes to include a prior distribution, p(theta). Over the years, many statisticians have objected to the claim that they need a prior distribution—but the success of Bayesian methods suggests that the gains (in ability to flexibly restrict inferences and to perform exact decision analyses) outweigh the costs that arise from having to defend “subjective” inferences. In fact, classical inferences can often be interpreted as Bayesian under particular prior specifications and loss functions, and so the Bayesian approach can be a tool to understand other statistical methods.
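
To make the likelihood-to-posterior step concrete, here is a minimal sketch (my own toy example, not from the article) using the conjugate normal model. Note how a very diffuse prior essentially recovers the classical estimate, which is one sense in which classical inference appears as a special case:

```python
import numpy as np

def normal_posterior(y, sigma, prior_mean, prior_sd):
    """Posterior for theta in the model y_i ~ N(theta, sigma^2),
    with conjugate prior theta ~ N(prior_mean, prior_sd^2).
    Conjugate update: precisions (inverse variances) add."""
    n = len(y)
    prior_prec = 1 / prior_sd**2
    data_prec = n / sigma**2
    post_var = 1 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * np.mean(y))
    return post_mean, np.sqrt(post_var)

y = np.array([2.1, 1.8, 2.4, 2.0])
print(normal_posterior(y, sigma=1.0, prior_mean=0.0, prior_sd=1.0))    # prior pulls the estimate toward 0
print(normal_posterior(y, sigma=1.0, prior_mean=0.0, prior_sd=100.0))  # ~ classical: mean(y), sigma/sqrt(n)
```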

Similarly, hierarchical modeling elicited a lot of resistance in its time (see, for example, the discussion of Lindley and Smith, 1972), with a key point of contention being the legitimacy of combining information from different sources in a single model, as in a meta-analysis. There was also some free-floating skepticism about the additional assumptions inherent in an empirical Bayes analysis or hyperprior distribution.

(I recall seeing a graduate student presentation a few years ago of a hierarchical regression model that had random effects for the 50 U.S. states. A statistician objected that the 50 states are fixed, and so it does not make sense to treat them as random effects, in the sense of there being a larger population from which they are a sample. This is an interesting point but not relevant to the hierarchical model per se. One could raise the same objection to a non-hierarchical regression model of data from 50 states, since that model also has an error distribution. In either case the model must be interpreted with care, but the fact that there are only 50 states is not a good reason to set the state-level variance parameter to zero or infinity, as would be implied by classical nonhierarchical models.)

Eventually, however, the intermediate formalism of “empirical Bayes,” with its awkward data-dependent prior distributions, was replaced by the richer full-Bayes hierarchical structure. It became clear that the hierarchical analysis is a generalization that includes simpler models as special cases, which allows us to answer various objections at a mathematical level. For example, if a hierarchical model combines highly dissimilar data sources, and these dissimilarities are not corrected for in the model, then the hierarchical variance parameter will be estimated to be very large, and the inferences will display essentially no shrinkage.
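
To see the shrinkage arithmetic explicitly, here is a minimal sketch (my own illustration, using the standard normal hierarchical model with the hyperparameters taken as known for simplicity) of how the hierarchical variance parameter tau controls the pooling, from complete pooling at tau = 0 to no pooling as tau goes to infinity:

```python
import numpy as np

def partial_pool(y, se, mu, tau):
    """Posterior means for group parameters theta_j in the normal hierarchical model:
    y_j ~ N(theta_j, se_j^2), theta_j ~ N(mu, tau^2), with mu and tau taken as known.
    tau -> 0 gives complete pooling (all theta_j = mu); tau -> infinity gives no pooling."""
    weight = (1 / se**2) / (1 / se**2 + 1 / tau**2)  # fraction of weight on the raw estimate
    return weight * y + (1 - weight) * mu

y = np.array([3.0, -1.0, 2.0])   # raw estimates for three hypothetical states
se = np.array([1.0, 1.0, 1.0])   # their standard errors
print(partial_pool(y, se, mu=1.0, tau=0.1))    # near complete pooling toward mu = 1.0
print(partial_pool(y, se, mu=1.0, tau=100.0))  # essentially no shrinkage: close to the raw y
```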

The next generalization, modeling missing data or, more generally, the process of data collection, extends the likelihood from p(y|theta) to p(y,I|theta,phi), where I represents the information of which data points are actually observed, and phi are parameters describing the design of the data-collection and recording process (Rubin, 1976). Including the data structure I in the model allows us to easily model rounded, censored, and truncated data and, as with the previous generalizations, gives insights into the previously standard methods. In the more general framework, the data-collection process is “ignorable” if p(theta|y) = p(theta|y,I); that is, if the data structure can be ignored when performing inference about theta. Understanding ignorability helps us in setting up non-ignorable models (as with dropouts in clinical trials) and in adding covariates to a model so that ignorability becomes a reasonable assumption. As with the previous generalizations, these concepts predated the mathematical formalism, but the formalism made it easier to apply them in new and more complicated settings.
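
As an illustration of modeling the data structure I, here is a minimal sketch (my own, not from the article) of the log likelihood for normal data censored at a known cutoff; the censored observations enter through a tail probability rather than a density:

```python
import numpy as np
from scipy import stats

def censored_normal_loglik(theta, sigma, y_obs, n_censored, c):
    """Log likelihood for y_i ~ N(theta, sigma^2), where values above the
    known censoring point c are recorded only as "censored".
    Observed values contribute densities; each censored value contributes
    the probability Pr(y > c), the normal upper-tail probability."""
    ll_obs = np.sum(stats.norm.logpdf(y_obs, loc=theta, scale=sigma))
    ll_cens = n_censored * stats.norm.logsf(c, loc=theta, scale=sigma)
    return ll_obs + ll_cens

y_obs = np.array([1.2, 0.7, 1.9])  # fully observed measurements
print(censored_normal_loglik(theta=1.0, sigma=1.0, y_obs=y_obs, n_censored=2, c=2.0))
```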

The expansion of the Bayesian formalism to p(y,I|theta,phi), with the data-collection process included in the model, also clarifies some theoretical and practical connections to classical methods (see Gelman et al., 1995, chapter 7). For example, randomized data collection is hard to justify under the usual Bayesian framework, but, in the context of defining a data-collection scheme, randomization is in fact the only way to select a sample without reference to covariates. Similarly, the idea of ignorability corresponds to the classical principle of including in the analysis all information used in the design, which in turn suggests particular Bayesian models. And the traditional Bayesian claim about the irrelevance of data-based stopping rules (see, for example, Berger, 1985) is modified by the understanding that a time variable must be included in the model for the data collection to be ignorable in this scenario.

A very active area of current statistical research is model averaging, generalizing the space of parameters one step further to allow for different choices of models or (in our preferred version) a continuous space spanned by models which had previously been fitted individually. Much progress seems to have been sparked by various formalizations of model combination, which take us beyond the previous vague ideas that no model is perfect and that it should be desirable to combine inferences from several models. Mathematical gaps typically correspond to areas of potential statistical improvement, and one area for improvement here can be seen from the difficulties of computing Bayes factors for models of different dimensionality (see, for example, Raftery, 1995; Spiegelhalter et al., 2002; Denison et al., 2002). The problem here is not with the model combination but rather with the use of flat, or nearly flat, prior distributions on the component models. We suspect that model averaging would be much more effective if the models being averaged were hierarchical.
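
To see the flat-prior difficulty concretely, here is a toy example (my own, in the spirit of Bartlett's paradox, not from the article) with one observation and two models. As the prior scale on the larger model's parameter grows, its marginal likelihood collapses, so the model weights end up driven by the prior scale rather than by the fit to data:

```python
import numpy as np
from scipy import stats

def model_average(y, sigma, tau):
    """Toy model averaging for a single observation y ~ N(theta, sigma^2).
    M0: theta = 0 exactly.  M1: theta ~ N(0, tau^2).  Equal prior probabilities.
    Returns the posterior probability of M1 and the model-averaged posterior mean of theta."""
    m0 = stats.norm.pdf(y, loc=0, scale=sigma)                 # marginal likelihood p(y | M0)
    m1 = stats.norm.pdf(y, loc=0, scale=np.hypot(sigma, tau))  # p(y | M1): y ~ N(0, sigma^2 + tau^2)
    p1 = m1 / (m0 + m1)                                        # p(M1 | y), under equal model priors
    theta_mean_m1 = y * tau**2 / (sigma**2 + tau**2)           # E(theta | y, M1)
    return p1, p1 * theta_mean_m1                              # averaged mean: mixes 0 (from M0) and this

for tau in [1.0, 10.0, 1000.0]:
    print(tau, model_average(y=2.0, sigma=1.0, tau=tau))
# As tau grows, p(y | M1) -> 0, so nearly all posterior weight shifts to M0
# regardless of the data: the flat-prior problem noted in the text.
```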

The paper continues by explaining how model checking, p-values, and exploratory data analysis fit into the Bayesian framework.

Just to be clear: I do not think it is an empty statement to say that a method is Bayesian. For a method to be “Bayesian,” it must use inference from posterior distributions. Taking an existing concept and making it Bayesian, that is, formulating it as a posterior distribution under a specified model, can require real work, and that work can pay off in the form of more generally applicable statistical procedures.