Here’s a post describing an informative idea that Erik van Zwet (in collaboration with me and Andrew G.) came up with in response to my post Is an improper uniform prior informative? It isn’t by any accepted measure of information I know of:
One feature (or annoyance) of Bayesian methodology over conventional frequentism comes from its ability (or requirement) to incorporate prior information, beyond the prior information that goes into the data model. A Bayesian procedure that does not include informative priors can be thought of as a frequentist procedure, the outputs of which become misleading when (as seems so common in practice) they are interpreted as posterior probability statements. Such interpretation is licensed only by uniform (“noninformative”) priors, which at best leave the resulting posterior as an utterly hypothetical object that should be believed only when data information overwhelms all prior information. That situation may arise in some large experiments in physical sciences, but is far from reality in many fields such as medical research.
Credible posterior probabilities (that is, ones we can take seriously as bets about reality) need to incorporate accepted, established facts about the parameters. For example, for ethical and practical reasons, human clinical trials are only conducted when previous observations have failed to demonstrate effects beyond a reasonable doubt. For medical treatments, that vague requirement (imposed by IRBs and funding agencies) comes down to a signal-to-noise ratio (the true effect divided by the standard error of its estimator) that rarely exceeds 3 and is often much smaller, as discussed here. Adding in more specific information may change this ground state, but even modestly well-informed priors often yield posterior intervals that are appreciably shifted from frequentist confidence intervals (which are better named compatibility or uncertainty intervals), with the posterior mean pulled toward the null relative to the maximum likelihood estimate. In that sense, using the uniform prior without actually believing it leads to overestimation in clinical trials, although a more accurate description is that the overestimation arises because a uniform prior neglects important prior information about these experiments.
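To make the shrinkage point concrete, here is a minimal numerical sketch (in Python) assuming a normal likelihood for the effect estimate and a normal prior centered at the null; the numbers are purely illustrative and not taken from any actual trial.

```python
import numpy as np

# Hypothetical trial result: log hazard-ratio estimate and its standard error.
b_hat, se = 0.60, 0.25            # observed effect and SE (illustrative values)

# Flat (uniform) prior on the effect: the posterior mean equals the MLE.
post_mean_flat = b_hat

# Modestly informative normal(0, tau^2) prior on the true effect (tau assumed):
tau = 0.35
shrink = tau**2 / (tau**2 + se**2)        # normal-normal conjugate shrinkage factor
post_mean_inf = shrink * b_hat            # posterior mean, pulled toward the null
post_sd_inf = np.sqrt(shrink) * se        # posterior SD is also reduced

print(f"posterior mean under flat prior:        {post_mean_flat:.3f}")
print(f"posterior mean under informative prior: {post_mean_inf:.3f} (sd {post_sd_inf:.3f})")
```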
“Information” is a complex, multifaceted topic about which much has been written. In standard information theories (e.g. of Shannon, Fisher, Kullback-Leibler), it is formalized as a property of a sample given a probability distribution on a fixed sample space S, or as an expectation of such a property over the distribution (information entropy). As useful as these measures can be in classical applications (in which the information in data is the sole focus and the space of possible samples is fixed), from an informative-Bayes perspective we find there are more dimensions to the concept of information that need to be captured. Here, we want to discuss a different way to think about information that seems to align better with the idea of empirical prior information in Bayesian analyses.
Suppose we want to choose a prior for a treatment effect β in a particular trial. Consider the finite multi-set (allowing that the same value might occur multiple times) S1 of such treatment effects in all clinical trials that meet basic, general validity (or quality) considerations, together with the frequency distribution p1 of effects in S1. We consider subsets Sk of S1 that meet certain further conditions, and their frequency distributions pk. The distributions pk can be obtained by conditioning p1 on Sk. Examples of such reference sets are:
S1. The effects in all RCTs
S2. The effects in all RCTs in intensive care
S3. The effects in all RCTs in intensive care with a parallel design
S4. The effects in all RCTs in intensive care in elderly patients
S5. The effects in all RCTs in intensive care in elderly patients with a parallel design
Prior p1 (with reference set S1) represents the information that we are considering the treatment effect in an RCT that meets the general considerations used to define S1. Prior p2 (with reference set S2) represents the additional information that the trial concerns intensive care. Since the pair (p2,S2) represents more information than (p1,S1), we could say it is more informative. More generally, consider two potential priors pk and pj that are the frequency distributions of reference sets Sk and Sj, respectively. If Sk is a strict subset of Sj, then we call the pair (pk,Sk) more informative than the pair (pj,Sj).
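To fix ideas, here is a toy sketch of how the reference sets S1–S5 and their frequency distributions pk could be assembled by conditioning; the trial registry, its columns, and all values below are entirely hypothetical.

```python
import pandas as pd

# Hypothetical registry of estimated treatment effects (e.g. log hazard-ratios)
# with trial characteristics; every column name and value here is invented.
trials = pd.DataFrame({
    "effect":  [0.10, -0.25, 0.40, 0.05, -0.10, 0.30],
    "field":   ["intensive_care", "intensive_care", "oncology",
                "intensive_care", "cardiology", "intensive_care"],
    "design":  ["parallel", "crossover", "parallel", "parallel", "parallel", "crossover"],
    "elderly": [True, False, False, True, True, True],
})

S1 = trials                                  # all RCTs meeting the general criteria
S2 = S1[S1["field"] == "intensive_care"]     # ... in intensive care
S3 = S2[S2["design"] == "parallel"]          # ... with a parallel design
S4 = S2[S2["elderly"]]                       # ... in elderly patients
S5 = S4[S4["design"] == "parallel"]          # ... elderly patients, parallel design

# Each prior p_k is just the frequency distribution of the effects in S_k,
# i.e. p1 conditioned on membership in S_k.
p2 = S2["effect"].value_counts(normalize=True)
print(p2)
```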
To give another example, we would call (p3,S3) more informative than (p2,S2). We believe that this definition agrees well with the common usage of the term “information” because (p3,S3) incorporates additional information about the design of the trial. But p3 is not necessarily more informative than p2 in the sense of Shannon or Fisher or Kullback-Leibler. To say it even more simply, there is no requirement that the variance decrease as we move from S1 to S2 to S3. Carlos Ungil gave a clear example here. We have defined only a partial ordering of “informativeness” on pairs (pk,Sk); for example, the pairs (p3,S3) and (p4,S4) would not be comparable because S3 and S4 are not subsets of each other.
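As a toy numerical check of this point (the effect values are invented), a strict subset can easily have larger variance than the set containing it:

```python
import numpy as np

# p2 is the frequency distribution over S2; p3 over its strict subset S3.
S2 = np.array([-1.0, -0.1, 0.0, 0.1, 1.0])
S3 = np.array([-1.0, 1.0])                   # strict subset of S2

print(np.var(S2))   # 0.404 -- variance of p2
print(np.var(S3))   # 1.0   -- variance of p3 is larger, even though (p3,S3)
                    #          is "more informative" in the reference-set sense
```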
Our usage of the word “information” in relation to reference sets Sk is very similar to how a filtration in stochastic process theory is called “information”. This is very different from information theory where information is (like the mean and variance) a property of p alone, or relative to another distribution on the same sample space S. Both p and S are relevant when we want to think about the information in the prior.
In certain applications it can make sense to start with the set S0 of all logically possible but otherwise unspecified effects at the top of the hierarchy, where p0 is a uniform distribution over S0 or satisfies some criterion for minimal informativeness (such as maximum entropy) within a specified model family or set of constraints. For example, this can be appropriate when the parameter is the angle of rotation of photon polarity (thanks to Daniel Lakeland). However, in most applications in the life sciences (p0,S0) is not a sensible starting point because the context will almost always turn out to supply quite a bit more information than either S0 or p0 does. For example, clinical trials reporting hazard ratios for treatment effects of, say, HR < 1/20 or HR > 20 are incredibly rare and typically fraudulent or afflicted by severe protocol violations. An HR of 100 would represent a treatment for which practically all the treated and none of the untreated respond, and thus is far beyond anything that would be uncertain enough to justify an RCT – we do not do randomized trials comparing jumping with and without a parachute from 1000m up. Yet typical “weakly informative” priors assign considerable prior probability to hazard ratios far below 1/20 or far above 20. More sensible yet still “weak” reference priors are available; for log hazard-ratios (and log odds-ratios) the simplest choices are in the conjugate family, which includes the logistic distribution and its log-F generalizations.
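As a rough numerical illustration of these last two points, the following computes how much prior probability falls outside the interval [1/20, 20] for the hazard ratio. The normal prior with standard deviation 10 on the log hazard-ratio is an assumed stand-in for a generic “weakly informative” default (not any particular package’s setting), and the standard logistic prior is one of the “weak” conjugate choices mentioned just above.

```python
import numpy as np
from scipy import stats

log20 = np.log(20.0)   # |log HR| beyond this means HR < 1/20 or HR > 20

# Assumed "weakly informative" default: normal(0, 10) on the log hazard-ratio.
p_normal = 2 * stats.norm(0, 10).sf(log20)

# Standard logistic(0, 1) prior on the log hazard-ratio.
p_logistic = 2 * stats.logistic(0, 1).sf(log20)

print(f"P(HR outside [1/20, 20]) under normal(0, 10):  {p_normal:.2f}")    # about 0.76
print(f"P(HR outside [1/20, 20]) under logistic(0, 1): {p_logistic:.3f}")  # about 0.095
```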