Measuring the information in an empirical prior

Here’s a post describing an informative idea that Erik van Zwet (in collaboration with me and Andrew G.) came up with in response to my post “Is an improper uniform prior informative? It isn’t by any accepted measure of information I know of”:

One feature (or annoyance) of Bayesian methodology over conventional frequentism comes from its ability (or requirement) to incorporate prior information, beyond the prior information that goes into the data model. A Bayesian procedure that does not include informative priors can be thought of as a frequentist procedure, the outputs of which become misleading when (as seems so common in practice) they are interpreted as posterior probability statements. Such interpretation is licensed only by uniform (“noninformative”) priors, which at best leave the resulting posterior as an utterly hypothetical object that should be believed only when data information overwhelms all prior information. That situation may arise in some large experiments in physical sciences, but is far from reality in many fields such as medical research.

Credible posterior probabilities (that is, ones we can take seriously as bets about reality) need to incorporate accepted, established facts about the parameters. For example, for ethical and practical reasons, human clinical trials are only conducted when previous observations have failed to demonstrate effects beyond a reasonable doubt. For medical treatments, that vague requirement (imposed by IRBs and funding agencies) comes down to a signal-to-noise ratio (the true effect divided by the standard error of its estimator) that rarely exceeds 3 and is often much smaller, as discussed here. Adding in more specific information may change this ground state, but even modestly well-informed priors often yield posterior intervals that are appreciably shifted from frequentist confidence intervals (which are better named compatibility or uncertainty intervals), with the posterior mean closer to the null than the maximum likelihood estimate. In that sense, using the uniform prior without actually believing it leads to overestimation in clinical trials, although a more accurate description is that the overestimation arises because a uniform prior neglects important prior information about these experiments.
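As a rough illustration of that shrinkage, here is a sketch using the familiar normal-normal conjugate formulas; the numbers (including the prior scale) are made up for illustration, not taken from any of our analyses:

```python
import math

# Normal-normal conjugate shrinkage: prior beta ~ N(0, tau^2),
# unbiased estimate b with standard error se.
b, se = 0.8, 0.35   # hypothetical log hazard ratio estimate and its SE
tau = 0.5           # hypothetical prior standard deviation for the effect

w = tau**2 / (tau**2 + se**2)   # weight on the data; the rest goes to the null
post_mean = w * b               # posterior mean, pulled toward 0 relative to b
post_sd = math.sqrt(w) * se     # posterior sd, narrower than se

print(f"MLE {b:.2f} shrinks to {post_mean:.2f}; 95% posterior interval "
      f"({post_mean - 1.96 * post_sd:.2f}, {post_mean + 1.96 * post_sd:.2f})")
```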

“Information” is a complex, multifaceted topic about which much has been written. In standard information theories (e.g., those of Shannon, Fisher, and Kullback-Leibler), it is formalized as a property of a sample given a probability distribution on a fixed sample space S, or as an expectation of such a property over the distribution (information entropy). As useful as these measures can be in classical applications (in which the information in data is the sole focus and the space of possible samples is fixed), from an informative-Bayes perspective we find there are more dimensions to the concept of information that need to be captured. Here, we want to discuss a different way to think about information that seems to align better with the idea of empirical prior information in Bayesian analyses.

Suppose we want to choose a prior for a treatment effect β in a particular trial. Consider the finite multi-set (allowing that the same value might occur multiple times) S1 of such treatment effects in all clinical trials that meet basic, general validity (or quality) considerations, together with the frequency distribution p1 of effects in S1. We consider subsets Sk of S1 that meet certain further conditions, and their frequency distributions pk. The distributions pk can be obtained by conditioning p1 on Sk. Examples of such reference sets are:

S1. The effects in all RCTs
S2. The effects in all RCTs in intensive care
S3. The effects in all RCTs in intensive care with a parallel design
S4. The effects in all RCTs in intensive care in elderly patients
S5. The effects in all RCTs in intensive care in elderly patients with a parallel design

Prior p1 (with reference set S1) represents the information that we are considering the treatment effect in an RCT that meets the general considerations used to define S1. Prior p2 (with reference set S2) represents the additional information that the trial concerns intensive care. Since the pair (p2,S2) represents more information than (p1,S1), we could say it is more informative. More generally, consider two potential priors pk and pj that are the frequency distributions of reference sets Sk and Sj, respectively. If Sk is a strict subset of Sj, then we call the pair (pk,Sk) more informative than the pair (pj,Sj).
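To make the construction concrete, here is a minimal Python sketch of how the pk arise by conditioning p1 on successively smaller reference sets; the trial records and attribute names are invented for illustration:

```python
from collections import Counter

# Hypothetical multiset S1 of treatment effects (say, log hazard ratios),
# each tagged with the trial attributes used to define the reference sets.
trials = [
    {"effect": -0.2, "field": "icu", "design": "parallel",  "elderly": True},
    {"effect":  0.1, "field": "icu", "design": "crossover", "elderly": False},
    {"effect":  0.0, "field": "oncology", "design": "parallel", "elderly": True},
    {"effect": -0.5, "field": "icu", "design": "parallel",  "elderly": True},
]

def frequency_dist(subset):
    """Frequency distribution pk of the effects in reference set Sk."""
    counts = Counter(t["effect"] for t in subset)
    n = sum(counts.values())
    return {effect: c / n for effect, c in counts.items()}

S1 = trials                                        # all RCTs
S2 = [t for t in S1 if t["field"] == "icu"]        # ... in intensive care
S3 = [t for t in S2 if t["design"] == "parallel"]  # ... with a parallel design

p1, p2, p3 = map(frequency_dist, (S1, S2, S3))
# p2 is p1 conditioned on S2; p3 is p2 further conditioned on S3.
```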

To give another example, we would call (p3,S3) more informative than (p2,S2). We believe that this definition agrees well with the common usage of the term “information” because (p3,S3) incorporates additional information about the design of the trial. But p3 is not necessarily more informative than p2 in the sense of Shannon or Fisher or Kullback-Leibler. To say it even more simply, there is no requirement that the variances in S1, S2, S3 form a non-increasing sequence. Carlos Ungil gave a clear example here. We have defined only a partial ordering of “informativeness” on pairs (pk,Sk); for example, the pairs (p3,S3) and (p4,S4) would not be comparable because S3 and S4 are not subsets of each other.
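Here is a toy numeric illustration (not Ungil’s example) of how conditioning on a subset can increase the variance:

```python
import statistics

S1 = [0.0, 0.0, 1.0, -1.0]  # hypothetical effects in the larger reference set
S2 = [1.0, -1.0]            # a subset meeting some further condition

print(statistics.pvariance(S1))  # 0.5
print(statistics.pvariance(S2))  # 1.0 -- the "more informative" pair is wider
```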

Our usage of the word “information” in relation to reference sets Sk is very similar to how a filtration in stochastic process theory is called “information”. This is very different from information theory where information is (like the mean and variance) a property of p alone, or relative to another distribution on the same sample space S. Both p and S are relevant when we want to think about the information in the prior.

In certain applications it can make sense to start with the set S0 of all logically possible but otherwise unspecified effects at the top of the hierarchy, where p0 is a uniform distribution over S0 or satisfies some criterion for minimal informativeness (such as maximum entropy) within a specified model family or set of constraints. For example, this can be appropriate when the parameter is the angle of rotation of photon polarity (thanks to Daniel Lakeland). However, in most applications in the life sciences (p0,S0) is not a sensible starting point because the context will almost always turn out to supply quite a bit more information than either S0 or p0 does. For example, clinical trials reporting hazard ratios for treatment effects of say HR < 1/20 or HR > 20 are incredibly rare and typically fraudulent or afflicted by severe protocol violations. And then an HR of 100 could represent a treatment for which practically all the treated and none of the untreated respond, and thus is far beyond anything that would be uncertain enough to justify an RCT – we do not do randomized trials comparing jumping with and without a parachute from 1000m up. Yet typical “weakly informative” priors assign considerable prior probability to hazard ratios far below 1/20 or far above 20. More sensible yet still “weak” reference priors are available; for log hazard-ratios (and log odds-ratios) the simplest choices are in the conjugate family, which includes the logistic distribution and its log-F generalizations.
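As a quick check of that last point, here is a sketch comparing tail probabilities on the log hazard-ratio scale; the Normal(0, 10) prior is our illustrative stand-in for a typical “weakly informative” choice, not a recommendation:

```python
import numpy as np
from scipy import stats

cut = np.log(20)  # |log HR| beyond this means HR < 1/20 or HR > 20

# A typical "weakly informative" prior on log HR: Normal(0, 10).
p_normal = 2 * stats.norm.sf(cut, scale=10)

# Standard logistic prior on log HR (one of the conjugate choices mentioned):
p_logistic = 2 * stats.logistic.sf(cut)

print(f"P(HR outside [1/20, 20]): Normal(0,10) {p_normal:.2f}, "
      f"logistic {p_logistic:.3f}")  # roughly 0.76 versus 0.095
```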

31 thoughts on “Measuring the information in an empirical prior”

        • I do think this is a no-go statement from Judea – “Or, you mean a text that will prepare stat students for modernity? Out of rung-1 of the Ladder?”

          By rung-1 he might have meant those who have suffered through the usual introductory course, or those who have not yet. Not sure which would be preferable, but either way there is a long way people need to be taken. How to do that, and whether it can be done in a single course, is largely unknown.

          Andrew suggested the book Regression and Other Stories, which is likely a good bet for those with some familiarity with regression and enough energy and commitment for a full-term course.

          I have been working on material that might enable and inspire others to undertake books like the one Andrew suggested.

          Likely needs more work, but a start here – https://ww2.amstat.org/meetings/jsm/2021/onlineprogram/AbstractDetails.cfm?abstractid=319100 (for those who attended JSM there is a 15-minute video available).

        • I think Pearl is referring to his ladder of causality, with the first rung “association”, the second “intervention”, and the third “counterfactuals”. According to him, statistics is stuck on rung 1.

        • You are likely correct, but an introductory stats course does need to cover core components of critical thinking about representing reality realistically, such as intervention and counterfactuals.

        • Yes Eric, that is what Judea means. He covers the rungs in his Book of Why, a tour de force. Since I’m not a statistician, I have to read it twice to make sure I understand his perspective. Plus Judea has implied or stated on Twitter that statistics is stuck.

        • Steven Goodman, in one of his articles, pointed out that each field/domain has to work out its own statistical competencies. On the face of it, this sounds logical. It would be useful to hear and read more by Steven Goodman for he is an excellent writer. I appreciated a primer that Steven and John Ioannidis published, some years ago, about reproducibility nomenclature.

          However, I am aware that each expert is often wedded to assumptions that are not as transparent as they should be. I also think it’s important to have a very good critical thinking textbook b/c most that I have skimmed are just OK. Copi of course is very good. But whether it can lead to better statistical competencies is a question that I can’t assess.

  1. WRT this and the related post. By making the claim that an improper uniform prior is not informative, you are merely showing that your mathematical definition of “informative” does not correspond to the normal usage of the word. If you tell someone that you have no information about x, but you know a lot about 1/x, they will likely shake their heads sadly and walk away. Or that you believe that x is exactly 25.36 times more likely to lie in the interval [0, 25.36] than in the interval [99, 100].

    Some mischievous people have sometimes tried to confuse others by equating the/a mathematical definition of “informative” with the common English usage. I’m not accusing you of doing that of course, but you should be aware that it happens.

    • I was trying to find where we made the claim an improper uniform prior is not informative. The closest I can find above is “uniform (“noninformative”) priors” where “noninformative” is in scare quotes to warn that it is not our definition, but rather a common description of a prior that makes the normalized likelihood function into the posterior.

      I know of no precise technical definition named after an everyday concept that captures all uses or meanings of the ordinary-language word; for example, when people talk of resistance to receiving a vaccine they don’t measure it in ohms. Sure, some ordinary labels are worse than others in misleading users about technical concepts; “significance” and “confidence” as used in statistics are the poster children for that problem, capturing almost nothing of the common everyday meaning of the words – they’re more like weasel words used to spin knowledge out of ignorance (apologies to Sir Karl).

      In contrast, I think it reasonable to map the idea “all information in the posterior comes from the likelihood” to saying “the prior is uniform” (improper or not). Now I’d avoid using “noninformative” for that, but I’ve not seen it as especially harmful as long as one is using the parameterization or scale in which prior information would be expressed in the context. Above we assumed that scale was hazard-rate ratio HR (which if the outcome is not too common isn’t very far from the risk ratios and odds ratios that some external sources might have supplied). Using HR, pairs like 1/20 and 20 would be considered equivalent in size; following that consideration leads to taking logs as it makes such pair members equal in absolute value. It also means switching to 1/HR translates to sign reversal. All that might be taken as a basis for claiming the improper uniform prior on log HR is a sensible representation of “indifference” or “zero information”, and indeed it makes the posterior proportional to the likelihood for log HR. But that is not a compelling argument for such usage.

      Most importantly, that improper prior becomes irrelevant once we put the math in real context because we always have some information to help both the calibration and credibility of our statistics. Consider that, if instead we used a uniform prior on f = HR/(1+HR), that would force a logistic prior on log HR. In the simplest balanced (1:1 allocation) trials that corresponds to the Laplace “indifference” prior forced by adding one treated and one untreated case to the case series. This is arguably a better prior in the settings discussed precisely because it forces 95% probability on the HR being between 1/39 and 39, which may be little information (2 bits worth by one measure) about HR but not zero. Even better I think would be a log-F prior with a 95% probability of HR being between 1/20 and 20 (still less than 5 bits of information by the same measure).
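      A quick numerical check of those claims (a sketch; the seed and sample size are arbitrary):

      ```python
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(0)
      f = rng.uniform(size=100_000)  # uniform prior on f = HR/(1+HR)
      log_hr = np.log(f / (1 - f))   # implied prior on log HR

      # The implied quantiles match the standard logistic distribution:
      print(np.quantile(log_hr, [0.025, 0.5, 0.975]))  # ~[-3.66, 0.00, 3.66]
      print(stats.logistic.ppf([0.025, 0.5, 0.975]))   # [-log 39, 0, log 39]

      # So the central 95% prior interval for HR is [1/39, 39]:
      print(np.exp(stats.logistic.ppf([0.025, 0.975])))  # ~[0.0256, 39.0]
      ```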

      So for me the practical question is not whether a uniform prior represents “zero information” – sometimes it does in some sense in some parameterizations, sometimes it doesn’t. Instead that question is whether the prior percentiles it imposes on the contextually designated target parameter are reasonable or not. We (Erik, Andrew, Sander) can agree that the improper uniform prior does not do that at all in the topics we encounter (nor does the Jeffreys prior, even though under certain models it removes 2nd-order bias in MLEs); hence we focused on characterizing priors derived from external data, not on defining “noninformative”.

    • In the absence of other information, wouldn’t you expect a location parameter to be 25.36 times as likely to lie in a region which is 25.36 times as large? If not, would it be more or less likely?

      Are you talking about improper priors exclusively or do you have an issue with non-informative priors in general?

    • > If you tell someone that you have no information about x, but you know a lot about 1/x, they will likely shake their heads sadly and walk away.

      If by having no information about x you mean that x=0 is not different from x=42 or x=-1000, then you know the same about 1/x as you know about 1/(x-42) or about 1/(x+1000). You know that they are likely to be small because x is unlikely to be close to 0, 42 or -1000.

      If you want a prior which is the same for x and for 1/x, because zero is a special number and there is a reason to have invariance under that transformation, then do not use a uniform prior on x. A uniform prior on log(x) will make more sense, I think.
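      For what it’s worth, a quick simulation of that invariance, using a proper Uniform(-10, 10) stand-in for the improper uniform on log(x):

      ```python
      import numpy as np

      rng = np.random.default_rng(1)
      # Stand-in for "uniform on log(x)": log(x) ~ Uniform(-10, 10).
      x = np.exp(rng.uniform(-10, 10, size=100_000))

      # Under x -> 1/x we have log(x) -> -log(x), and Uniform(-10, 10) is
      # symmetric about 0, so x and 1/x share the same distribution:
      q = [0.1, 0.25, 0.5, 0.75, 0.9]
      print(np.quantile(x, q))
      print(np.quantile(1 / x, q))  # approximately the same quantiles
      ```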

  2. I’m curious, do you think there’s any potential value (in any application) of trying to quantify this sense of informative, so that cases like (S3, p3) and (S4, p4) could be compared? This idea makes me recall how back when I was doing research on eliciting priors from people, we used to struggle with the possibility that a weak-seeming prior we elicited could actually be based on more “experience” than a more informative prior elicited from someone else by the typical definition of informative, but we didn’t have ways to capture those differences.

    • I’m not sure, seems worth thinking about. I kind of doubt it would help to compress the pair into one number, but maybe having a number for each pair member would help. Here’s a starting proposal for our example, not to be taken too seriously as it’s just a quick guess: The measure m(Sk) could be the negative log of the Sk proportion of S1, so m(S1) = 0 and it goes up from there; then the measure for pk could be its Shannon entropy.
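      In code, that quick guess might look like this (toy multisets, natural-log units):

      ```python
      import numpy as np
      from collections import Counter

      def m(S_k, S_1):
          """Set measure: negative log of the proportion of S1 lying in Sk."""
          return -np.log(len(S_k) / len(S_1))

      def shannon_entropy(S_k):
          """Shannon entropy of the frequency distribution pk of effects in Sk."""
          counts = np.array(list(Counter(S_k).values()), dtype=float)
          p = counts / counts.sum()
          return -(p * np.log(p)).sum()

      S1 = [0.0, 0.0, 0.1, -0.2, 0.3, 0.3]   # toy multiset of effects
      S2 = [0.1, -0.2, 0.3]                  # a subset of S1
      print(m(S1, S1), shannon_entropy(S1))  # 0.0 and H(p1)
      print(m(S2, S1), shannon_entropy(S2))  # log(2) and H(p2)
      ```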

      I’m no fan of “expert elicitation” though: I’ve found expert opinions too often are very biased by wishful thinking, highly selective reading, and a myriad of fallacies and prejudices that are presented as fact yet have no basis in actual data (including claiming “X has no effect” because one prestigious study reported that on the basis of getting p>0.05). And opinions often conflict in ways that are beyond sensible merging. As an example try contrasting expert opinions about covid-19 infection-fatality rates.

      I argue that anyone wanting to do sensible statistics must build their models (both for parameters and for data) from the hard work of getting immersed in research reports in the topic area, not by relying on expert opinions (although experts can be queried and then contrasted to the literature to see what prejudices are afoot). To the extent that work must be limited, I think one had better rely on more flexible models – and that means less informative models, avoiding models with constraints that have no contextual justification unless the constraints can be shown to be inconsequential for the targets of inference (i.e., inferences need to be robust to violation of uncertain constraints).

      • My question wasn’t very well stated, but what you describe (capturing proportion within a defined hierarchy) is along the lines of what I was thinking. The post also got me thinking about something like finding minimum spanning trees given some set of nodes representing non-overlapping features of a target subset, but I’m not sure that evaluating priors this way makes sense given that there isn’t necessarily a sequential process by which people decide on the evidence on which to base a prior.

        I tend to agree on elicitation. I was eliciting priors from non-experts to see how well doing so explained individual differences in reactions to data that weren’t captured by the more standard measures of how well people perceive/comprehend information. We also used them to generate ‘uncertainty analogies’ that used a person’s elicited prior for some parameter as a reference against which to describe the amount of information in an observed dataset (to see if it helped people take very large datasets more seriously). But we were never really confident that even with a well defined body of evidence the average person would be able to articulate any reasonable prior.

        • Jessica: The main purpose of those reference sets was to make it clear that there’s more to the “amount of prior information” than can be measured from the prior distribution alone. Prior elicitation provides another nice example. You could say that asking 10 experts provides more information than asking 1, but that does not mean the elicited prior will also be more informative in any information-theoretical sense.

        • Yes, my question was confusing. There’s something like relevance of the set defining the prior, though that’s still not a great word for it, that this post got me thinking about trying to quantify, but I’m not sure there’s any other information (outside of traditional informativeness) that can be gained beyond what’s described in the post. How many people’s opinions are captured in a prior is a nice example from elicitation; I’ve seen that come up in some econometric lit.

  3. The “third way” is to use a likelihood prior – 1:1 odds to represent equipoise. This means the likelihood ratio (calculated from the data) equals the posterior odds, since posterior odds = prior odds × likelihood ratio.

    • In our setting that posterior-proportional-to-likelihood approach yields the same posterior odds as the improper uniform prior, which we find contextually unacceptable because it ignores important prior information and thus does not produce credible (contextually well-informed) posterior odds. It also does not produce the most well-calibrated frequentist statistics. See our context description and my response to James Annan above.

  4. If the “standard information theories” aren’t working for you, consider Kolmogorov’s complexity theory. Kolmogorov was motivated to construct the theory specifically to get away from the notion that we need to reference probability distributions (or unobservable parameters of distributions) to define information. That probably sounds completely incompatible with Bayesian priors, but bear with me. Suppose that, for each member of S3, we generate a sequence (S3.1.Beta1, S3.2.Beta2, S3.3.Beta3…) listing every possible effect estimate, in order of probability according to p3. (Assume that effect sizes are always discrete variables, or can be discretized without destroying significant information, though the sample space can still be infinite.) And the same for the members of S4: (S4.1.Beta1, S4.2.Beta2, S4.3.Beta3…).

    Now, the algorithmic information in a particular sequence, K(Si.j), is the length of the shortest possible algorithm that can output that sequence – this is an uncomputable quantity, so don’t worry about it. What you care about is the conditional algorithmic information in one such sequence given another, K(Si.j | Si.k). This is defined as the smallest number of string operations of a specified type or types (e.g., character substitutions, transpositions, insertions, etc.) a hypothetical algorithm must perform on one sequence to change it into the other. This quantity is entirely computable – the particular operation(s) your algorithm is permitted to use depends on the particular metric, and different metrics are useful for different purposes.

    Most metrics don’t require an equal number of elements in each sequence. This is important because what you REALLY want to know is the conditional algorithmic information of the set of sequences for all members of S4 given the set of sequences for all members of S3, where the sequences of each S have some sort of sensible ordering. Essentially, you’re computing 2D conditional complexity between matrices instead of 1D conditional complexity between strings.

    Finally, you may be asking any or all of the following very good questions: How do I do this for continuous effect sizes? What constitutes a meaningful ordering of sequences? What complexity metric should I use? How do I interpret the amount of conditional complexity so that it is meaningful in a Bayesian context? Answers: Dunno. :) But it sounds like it could be worth your looking into!
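    For what it’s worth, here is a minimal sketch of the edit-distance proxy described above; the sequences of discretized effects are invented for illustration:

    ```python
    def levenshtein(a, b):
        """Edit distance (substitutions, insertions, deletions) between two
        sequences: a computable stand-in for conditional complexity K(b | a)."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    # Discretized effects ordered by probability under p3 and p4 (made up):
    seq3 = [0.0, 0.1, -0.1, 0.2, -0.2]
    seq4 = [0.1, 0.0, 0.2, -0.1]
    print(levenshtein(seq3, seq4))  # number of edits turning seq3 into seq4
    ```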

  5. It seems to me that what you’re looking for is what programming language people usually refer to as syntactic sugar: a useful function that streamlines and simplifies a more tedious task. Here’s what I mean:

    If you augment the S space with a binary flag indicating whether or not the participant is admissible to the trial, then all the old information theory stuff works out just fine. All the patients that get screened out in S2 and beyond are simply delta spikes at the flag=0 value in your p2 and beyond. At this point what you’re left with is just some annoying measure theory bookkeeping.

    The entropy will now decrease as you decrease the set of admissible patients. But it will be the entropy over this ugly (flag, effect) space. I totally agree it would be nice to have a convenient way to discuss this but I don’t think we need to throw out Shannon.
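    A toy sketch of that bookkeeping (invented effects; the screened-out members collapse to a single flag=0 spike, which coarsens the distribution and so cannot increase the entropy):

    ```python
    import numpy as np
    from collections import Counter

    def entropy(outcomes):
        """Shannon entropy of the empirical distribution of the outcomes."""
        counts = np.array(list(Counter(outcomes).values()), dtype=float)
        p = counts / counts.sum()
        return -(p * np.log(p)).sum()

    effects    = [0.0, 0.0, 0.1, -0.2, 0.3, 0.3]          # toy effects
    admissible = [True, True, False, True, False, False]  # flag per member

    # Base space: every member keeps its own effect value.
    full = [(1, e) for e in effects]
    # Restricted space: screened-out members become a delta spike at flag=0.
    screened = [(1, e) if a else (0, None) for e, a in zip(effects, admissible)]

    print(entropy(full), entropy(screened))  # entropy drops after screening
    ```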

    • Hmm, you might be right about that! And I guess the entropy over the pair (flag,effect) would induce a total ordering so that, for example, (S3,p3) and (S4,p4) are comparable too.

    • I didn’t see where we advised throwing out Shannon information or its extensions, but rather we saw a need to supplement them with explication of the distribution range space. Many thanks though for pointing out how our idea could be subsumed within the existing measures (I had mentioned to Erik and Andrew that I thought someone better informed on the theory might do just that!).

      I wonder how your proposal relates to measuring m(Sk) as the negative log of the Sk proportion of the starting (i.e., base or reference) space S0 and measuring pk by its Shannon entropy H(pk) (as I suggested above to Jessica Hullman, using S1=S0), then adding these two to get a total-entropy measure H(Sk,pk). I think but haven’t checked that this H(Sk,pk) would reduce to the expectation over pk of the information -log(p0(A&Sk)) in events in S0 of the form “A & Sk”, or equivalently the entropy restricted to such events and divided by p0(Sk). This measure will also decrease as restrictions increase.

      • “This measure will also decrease as restrictions increase.” Actually now I’m not at all sure of that. The idea is clearly not ready for prime time!

  6. > One feature (or annoyance) of Bayesian methodology over conventional
    > frequentism comes from its ability (or requirement) to incorporate prior
    > information, beyond the prior information that goes into the data model.

    But it is clear (or should be clear) that you need a prior to do statistics, as the lady tasting tea experiment demonstrates.
