Measuring the information in an empirical prior

Here’s a post describing an informative idea that Erik van Zwet (in collaboration with me and Andrew G.) came up with in response to my post “Is an improper uniform prior informative? It isn’t by any accepted measure of information I know of”:

One feature (or annoyance) of Bayesian methodology over conventional frequentism comes from its ability (or requirement) to incorporate prior information, beyond the prior information that goes into the data model. A Bayesian procedure that does not include informative priors can be thought of as a frequentist procedure, the outputs of which become misleading when (as seems so common in practice) they are interpreted as posterior probability statements. Such interpretation is licensed only by uniform (“noninformative”) priors, which at best leave the resulting posterior as an utterly hypothetical object that should be believed only when data information overwhelms all prior information. That situation may arise in some large experiments in physical sciences, but is far from reality in many fields such as medical research.

Credible posterior probabilities (that is, ones we can take seriously as bets about reality) need to incorporate accepted, established facts about the parameters. For example, for ethical and practical reasons, human clinical trials are only conducted when previous observations have failed to demonstrate effects beyond a reasonable doubt. For medical treatments, that vague requirement (imposed by IRBs and funding agencies) comes down to a signal-to-noise ratio (the true effect divided by the standard error of its estimator) that rarely exceeds 3 and is often much smaller, as discussed here. Adding in more specific information may change this ground state, but even modestly well-informed priors often yield posterior intervals that are appreciably shifted from frequentist confidence intervals (which are better named compatibility or uncertainty intervals) with the posterior mean being closer to the null relative to the maximum likelihood estimate. In that sense, using the uniform prior without actually believing it leads to overestimation in clinical trials, although a more accurate description is that the overestimation arises from the fact that a uniform prior neglects important prior information about these experiments.
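The shrinkage described above can be sketched with a conjugate normal-normal calculation. The numbers below are invented for illustration: an observed log hazard ratio with its standard error, and an empirical prior centered at the null (as one might derive from an RCT database).

```python
import numpy as np

# Illustrative values only -- not from any real trial.
b_hat, se = 0.40, 0.20             # MLE of the log-HR and its standard error
prior_mean, prior_sd = 0.0, 0.35   # hypothetical empirical prior on the true effect

# Precision-weighted (conjugate normal-normal) posterior:
w = (1 / se**2) / (1 / se**2 + 1 / prior_sd**2)   # weight on the data
post_mean = w * b_hat + (1 - w) * prior_mean
post_sd = (1 / se**2 + 1 / prior_sd**2) ** -0.5

print(f"MLE:            {b_hat:.3f}")
print(f"Posterior mean: {post_mean:.3f}")   # pulled toward the null
```

With these made-up numbers the posterior mean sits noticeably closer to the null than the maximum likelihood estimate, which is the shift (relative to the frequentist interval) described in the text.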

“Information” is a complex, multifaceted topic about which much has been written. In standard information theories (e.g. of Shannon, Fisher, Kullback-Leibler), it is formalized as a property of a sample given a probability distribution on a fixed sample space S, or as an expectation of such a property over the distribution (information entropy). As useful as these measures can be in classical applications (in which the information in data is the sole focus and the space of possible samples is fixed), from an informative-Bayes perspective we find there are more dimensions to the concept of information that need to be captured. Here, we want to discuss a different way to think about information that seems to align better with the idea of empirical prior information in Bayesian analyses.

Suppose we want to choose a prior for a treatment effect β in a particular trial. Consider the finite multi-set (allowing that the same value might occur multiple times) S1 of such treatment effects in all clinical trials that meet basic, general validity (or quality) considerations, together with the frequency distribution p1 of effects in S1. We consider subsets Sk of S1 that meet certain further conditions, and their frequency distributions pk. The distributions pk can be obtained by conditioning p1 on Sk. Examples of such reference sets are:

S1. The effects in all RCTs

S2. The effects in all RCTs in intensive care

S3. The effects in all RCTs in intensive care with a parallel design

S4. The effects in all RCTs in intensive care in elderly patients

S5. The effects in all RCTs in intensive care in elderly patients with a parallel design

Prior p1 (with reference set S1) represents the information that we are considering the treatment effect in an RCT that meets the general considerations used to define S1. Prior p2 (with reference set S2) represents the additional information that the trial concerns intensive care. Since the pair (p2,S2) represents more information than (p1,S1), we could say it is more informative. More generally, consider two potential priors pk and pj that are the frequency distributions of reference sets Sk and Sj, respectively. If Sk is a strict subset of Sj, then we call the pair (pk,Sk) more informative than the pair (pj,Sj).

To give another example, we would call (p3,S3) more informative than (p2,S2). We believe that this definition agrees well with the common usage of the term “information” because (p3,S3) incorporates additional information about the design of the trial. But p3 is not necessarily more informative than p2 in the sense of Shannon or Fisher or Kullback-Leibler. To say it even more simply, there is no requirement that the variances in S1, S2, S3 form a non-decreasing sequence. Carlos Ungil gave a clear example here. We have defined only a partial ordering of “informativeness” on pairs (pk,Sk); for example, the pairs (p3,S3) and (p4,S4) would not be comparable because S3 and S4 are not subsets of each other.
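A toy numerical version of that point, with invented effect values: conditioning on a strict subset adds information in our sense, yet the subset's frequency distribution can have a *larger* variance than the parent set's.

```python
import numpy as np

# Invented multiset of treatment effects: a broad reference set and a
# strict subset picked out by a further design restriction.
S_parent = np.array([-0.1, 0.0, 0.0, 0.1, -1.5, 1.5])  # e.g., all ICU RCT effects
S_subset = np.array([-1.5, 1.5])                       # e.g., parallel design only

print(np.var(S_parent))  # variance of the broader reference set
print(np.var(S_subset))  # variance of the subset -- larger here
```

So the pair built from the subset is more informative by the subset ordering, even though its distribution is less concentrated; "informative" here is not a statement about variances or entropies.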

Our usage of the word “information” in relation to reference sets Sk is very similar to how a filtration in stochastic process theory is called “information”. This is very different from information theory where information is (like the mean and variance) a property of p alone, or relative to another distribution on the same sample space S. Both p and S are relevant when we want to think about the information in the prior.

In certain applications it can make sense to start with the set S0 of all logically possible but otherwise unspecified effects at the top of the hierarchy, where p0 is a uniform distribution over S0 or satisfies some criterion for minimal informativeness (such as maximum entropy) within a specified model family or set of constraints. For example, this can be appropriate when the parameter is the angle of rotation of photon polarity (thanks to Daniel Lakeland). However, in most applications in the life sciences (p0,S0) is not a sensible starting point because the context will almost always turn out to supply quite a bit more information than either S0 or p0 does. For example, clinical trials reporting hazard ratios for treatment effects of say HR < 1/20 or HR > 20 are incredibly rare and typically fraudulent or afflicted by severe protocol violations. And an HR of 100 could represent a treatment for which practically all the treated and none of the untreated respond, and thus is far beyond anything that would be uncertain enough to justify an RCT – we do not do randomized trials comparing jumping with and without a parachute from 1000m up. Yet typical “weakly informative” priors assign considerable prior probability to hazard ratios far below 1/20 or far above 20. More sensible yet still “weak” reference priors are available; for log hazard-ratios (and log odds-ratios) the simplest choices are in the conjugate family, which includes the logistic distribution and its log-F generalizations.
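As a rough numerical check of that last point, take a normal(0, 10) prior on the log hazard ratio as a stand-in for a typical "weakly informative" choice (the specific scale is our assumption for illustration), and compare its tail mass beyond HR = 20 or HR < 1/20 with that of a standard logistic prior:

```python
import numpy as np
from scipy import stats

cut = np.log(20)  # |log-HR| beyond this corresponds to HR < 1/20 or HR > 20

normal_tail = 2 * stats.norm.sf(cut, scale=10)  # normal(0, 10) on the log-HR
logistic_tail = 2 * stats.logistic.sf(cut)      # standard logistic on the log-HR

print(f"normal(0,10) mass beyond HR 20:  {normal_tail:.3f}")
print(f"logistic(0,1) mass beyond HR 20: {logistic_tail:.3f}")
```

The wide normal puts most of its mass on hazard ratios that essentially never occur in credible trials, while the logistic prior keeps that mass modest without being strongly informative.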

Is an improper uniform prior informative? It isn’t by any accepted measure of information I know of

Yesterday I was among those copied when a correspondent wrote something I’d seen before and which had always stood out as totally wrong to me in the statistical context I usually work within (mainly regression coefficients for generalized linear models with canonical link functions like the normal-linear, binomial-logistic, and log-linear proportional-hazards models as applied to human health and medical data).

Here’s the offending quote:

“The non-informative prior is the current default in almost all statistical application. It’s used either implicitly when people interpret the usual confidence interval as if it is a credible interval, or explicitly when people do a Bayesian analysis with a ‘non-informative’ prior. This is a really big mistake. The uniform prior is far from non-informative! In fact, it represents the prior belief that effects are likely to be very large, and also that the (actual, achieved) power is likely to be very large. We can see in the Cochrane data that this is not true for RCTs [randomized clinical trials]. Consequently, the uniform prior leads to considerable overestimation in RCTs.”

– As Andrew (who has made the same claim) has said in other contexts, No, no, no!: I think the claim that an improper uniform prior is highly informative is the really big mistake; it is its lack of information that justifies turning to priors derived from other RCTs.

There is no unique measure of information, but among those I’ve seen in common use in both statistics and communications engineering, the improper uniform prior contains zero information. For example, its Fisher information is zero (or to be more technical, zero is the limiting information of any regular proper prior distribution that converges to uniform as its scale is allowed to expand without bound); likewise, another measure of the information in the prior, the Kullback-Leibler information divergence (KLID) from the posterior to the normalized likelihood function, is zero under an improper uniform prior.

Now this pair of facts is almost the same result, since the Fisher information is the coefficient of the first nonvanishing term in an expansion of the KLID; and the same fact comes up with all the variations of information measures for distributions I’ve seen in the literature on the topic: The improper uniform prior is indeed non-informative.

One way to describe the situation in informal betting (“operational”) Bayesian terms is that the claim about an improper uniform prior overlooks how the information content of the prior is a function of its concentration of belief (as measured by expected gain or loss). With an improper uniform prior in binomial logistic or normal linear regression coefficients, the posterior bets depend only on the likelihood function (as is implicit in treating the maximum-likelihood statistics as posterior summaries), which seems as good a definition as any of complete lack of useful a priori guiding information (i.e., a state of total ignorance before seeing the current data).

Generalizing, all the weakly informative and reference-Bayes priors proposed to replace the uniform have very little information compared to the likelihood function in all but the tiniest real studies. That’s because those priors typically have the information content of 1 or 2 observations according to some familiar information measure. What makes the improper uniform prior distasteful to me is that we never, ever have zero uncontested prior information: The very fact that anyone would do a formal study shows there is plenty of information that the effect in question cannot be so huge as to be obvious without such a study (which is why there are no RCTs of having a parachute vs. nothing when jumping off a 1000 meter drop).
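A back-of-the-envelope version of the "1 or 2 observations" comparison, for a normal model with known unit variance (the prior scale and sample size below are invented for illustration): the prior's precision can be read as an equivalent number of unit-variance observations.

```python
# Normal model with known variance 1: a normal(0, sd) prior has precision
# 1/sd**2, i.e., the information content of 1/sd**2 unit-variance observations.
prior_sd = 1.0               # a typical "weakly informative" scale (assumed)
k_prior = 1 / prior_sd**2    # prior worth about 1 observation
n = 100                      # a modest real study (assumed)

# Fraction of the posterior's information contributed by the prior:
print(f"prior / likelihood information ratio: {k_prior / n:.2f}")
```

At n = 100 the prior supplies about 1% of the total information, which is why such priors barely move the answer in all but the tiniest studies.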

Intuitively, I think we could agree that someone must be completely ignorant of the research world as well as of the specific topic if they claim that any huge effect you can name, such as a causal rate ratio of 0.0001 or 10000, is as a priori probable as say 0.1 or 10, which is what is implied by the improper uniform prior for the log rate ratio (i.e., the proportional-hazards coefficient). But replacement of an improper uniform prior with a vague proper prior capturing uncontested prior information only matters when the likelihood (or estimating) function would not swamp that information.

The bottom line is that to blame a uniform or vague prior for overestimation is to evade our responsibility to use the real-world information we have: Namely, that if a treatment needs an RCT to settle debate about whether its effect is large enough to care about, that fact alone should narrow our prior dramatically in comparison to typical “weakly informative” priors, and forms a valid empirical basis for considering recent proposals for shrinkage based on RCT databases.