## “Pointwise mutual information as test statistics”

Christian Bartels writes:

Most of us will probably agree that making good decisions under uncertainty based on limited data is highly important but remains challenging.

We have decision theory, which provides a framework to reduce the risks of decisions under uncertainty, with typical frequentist test statistics being examples for controlling errors in the absence of prior knowledge. This strong theoretical framework is mainly applicable to comparatively simple problems. For non-trivial models and/or if there is only limited data, it is often not clear how to use the decision theory framework.

In practice, careful iterative model building and checking seems to be the best that can be done – be it using Bayesian methods or applying “frequentist” approaches (here, in this particular context, “frequentist” often seems to be used as implying “based on minimization”).

As a hobby, I tried to expand the armory for decision making under uncertainty with complex models, focusing on trying to expand the reach of decision theoretic, frequentist methods. Perhaps at some point in the future, it will become possible to bridge the existing, good pragmatic approaches into the decision theoretical framework.

So far:

– I evaluated an efficient integration method for repeated evaluation of statistical integrals (e.g., p-values) for a set of hypotheses. Key to the method was the use of importance sampling. See here.

– I proposed pointwise mutual information as an efficient test statistic that is optimal under certain considerations. The commonly used alternative is the likelihood ratio test, which, in the regime where asymptotics are not valid, is annoyingly inefficient since it requires repeated minimizations on randomly generated data.
Bartels, Christian (2015): Generic and consistent confidence and credible regions.
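To make the idea concrete, here is a minimal sketch of pointwise mutual information, log[p(x|θ)/p(x)], used as an ordering statistic in a textbook Neyman-style construction. It is not the implementation referenced above: the binomial model, the parameter grid, the uniform reference weights over θ, and all function names are assumptions chosen for illustration.

```python
import math

def binom_pmf(k, n, theta):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * theta**k * (1 - theta) ** (n - k)

def pmi_table(n, thetas, weights):
    """Return likelihoods p(x|theta) and PMI log[p(x|theta)/p(x)] on the grid."""
    lik = [[binom_pmf(k, n, t) for k in range(n + 1)] for t in thetas]
    # Marginal p(x) under the (assumed) reference weights over theta.
    marginal = [sum(w * row[k] for w, row in zip(weights, lik))
                for k in range(n + 1)]
    pmi = [[math.log(row[k] / marginal[k]) for k in range(n + 1)]
           for row in lik]
    return lik, pmi

def acceptance_region(lik_row, pmi_row, alpha):
    """Add outcomes in order of decreasing PMI until coverage >= 1 - alpha."""
    order = sorted(range(len(pmi_row)), key=lambda k: pmi_row[k], reverse=True)
    region, coverage = set(), 0.0
    for k in order:
        region.add(k)
        coverage += lik_row[k]
        if coverage >= 1 - alpha:
            break
    return region

def confidence_set(x_obs, n, thetas, weights, alpha=0.05):
    """All grid values of theta whose acceptance region contains x_obs."""
    lik, pmi = pmi_table(n, thetas, weights)
    return [t for i, t in enumerate(thetas)
            if x_obs in acceptance_region(lik[i], pmi[i], alpha)]

# Hypothetical example: 3 successes in 10 trials, coarse grid for theta.
thetas = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
weights = [1 / len(thetas)] * len(thetas)
print(confidence_set(3, 10, thetas, weights))
```

The PMI table is computed once and then reused for every hypothesis on the grid, so no optimization over parameters is needed per simulated data set. That is one way to read the efficiency argument made above.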

More work is required, in particular:

– Dealing with nuisance parameters

– Including prior information.

Working on these aspects, I would appreciate feedback on what exists so far, in general, and on the proposal of using the pointwise mutual information as test statistics, in particular.

I have nothing to add here. The topic is important so I thought this was worth sharing.

1. Olly Johnson says:

I’m not sure exactly what is going on here, but I wonder if there is any link between this “pointwise mutual information” idea and the “information density” as used for example in Polyanskiy-Poor-Verdu – see Eq (3), (4) http://www.princeton.edu/~verdu/reprints/PolPooVerMay2010.pdf?q=tilde/verdu/reprints/PolPooVerMay2010.pdf

They make an analogy between (essentially optimal) channel coding and hypothesis testing, so I’m curious to know if there’s any relation.

2. Nick Menzies says:

I find this strange, and a very different conception of decision theory to the one I am familiar with.

Specifically…
“…typical frequentist test statistics being examples…”
Decision theory generally takes VNM expected utility theory as its starting point, and thus the optimal strategy is the one that maximises a given utility function on expectation (p(event)*u(event) summed over all possible events). The analyses needed for this are inherently Bayesian — frequentist statistics are heuristics that will only give the right answer through luck, and I would not think of them as an example of decision theory (unless we are being descriptive).
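The expected-utility rule described here (p(event)*u(event) summed over all possible events, then pick the best action) can be sketched in a few lines. The event probabilities, action names, and utility values below are made-up numbers for illustration only.

```python
def expected_utility(probs, utilities):
    """Sum of p(event) * u(event) over all events."""
    return sum(p * u for p, u in zip(probs, utilities))

def best_action(probs, utility_table):
    """Return the action with the highest expected utility, plus all scores."""
    scores = {action: expected_utility(probs, utils)
              for action, utils in utility_table.items()}
    return max(scores, key=scores.get), scores

# Hypothetical problem: two events with probabilities 0.9 and 0.1.
probs = [0.9, 0.1]

# Same probabilities, two different valuations of the bad outcome.
severe = {"act": [1.0, -20.0], "wait": [0.0, 0.0]}
mild = {"act": [1.0, -2.0], "wait": [0.0, 0.0]}

print(best_action(probs, severe))
print(best_action(probs, mild))
```

With identical probabilities, the preferred action flips once the cost of the bad outcome changes, which is why the valuation step cannot be left out of the rule.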

“This strong theoretical framework is mainly applicable to comparatively simple problems.”
I would disagree, but maybe it depends how much work you want to do. Any decision analysis will commonly involve (i) estimating the possible distribution of outcomes based on current information, and (ii) valuing those outcomes. The second part (valuing outcomes) is usually analytically simple, though involves tough judgements (i.e. the problem is deciding on the utility function, but when you have one it is generally easy to apply). The first part can be arbitrarily complicated.

I do not know what mutual information is, but any test statistic that does not attempt to value the outcomes of different decisions would seem to be doomed from the outset. Having said all that, I would love to learn more; if and when there is a write-up of this approach I would be grateful if you could post on it.

• Christian Bartels says:

Thanks for the comments. It seems to me that we have a similar conception of decision theory. Maybe I’m using a somewhat narrower definition than you (see also references below)?

Specifically, the write-up referenced in the blog post above aims at set selection by minimizing risk, defined as the size of the decision set. This is the same as in Schafer, Chad M, and Philip B Stark. “Constructing confidence regions of optimal expected size.” Journal of the American Statistical Association 104.487 (2009): 1080-1089.

Frequentist statistics such as interval estimation or hypothesis testing can be positioned within decision theory (e.g., Mathematische Statistik by L. Rüschendorf, or “Lecture Notes Mathematical Statistics – MIT OpenCourseWare.” 2012. 8 Sep. 2015).

As to your other point, let me try to rephrase “This strong theoretical framework is mainly applicable to comparatively simple problems”: Frequentist test statistics aim at controlling error rates (i.e., risks or benefits), and they achieve this for a subset of problems. However, if one has non-trivial models (e.g., generalized linear mixed effects models) with limited data, it is clear neither to what extent the test statistics achieve the goal of controlling errors, nor how to use these test statistics to iteratively improve the model. That said, it remains important to define potential losses and potential outcomes and to evaluate the risks of decisions for such problems. It is just not clear how to use existing frequentist tests for this, or how to derive optimal (minimax) decision rules for such problems. One option is to be pragmatic and to give up aiming at optimal, minimax rules.

• Daniel Lakeland says:

The other option is to just realize THIS and go ahead and do actual decision analysis using Bayesian methods and real-world approximations to utilities.

It seems like “minimizing the size of the decision set” is equivalent to a loss function which is equal to the inverse probability of the outcome, so that p(x)Loss(x) = 1 and the expected value of the loss is equal to the size of the decision set. This is perverse because you can’t say how much you like a given outcome independently from what you currently know about how likely it is to happen, and the decision is only defined for bounded parameters. Furthermore, even if you start out with a bounded parameter, unless your likelihood can assign exactly zero probability to some of the possible outcomes (and you accept that 0 * 1/0 = 0 by convention), no amount of information changes your expected loss.
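One reading of this argument can be checked numerically on a finite grid, where the "size" of the set is just the number of points. The sketch below assumes a discrete parameter with five possible values and hypothetical probabilities; the function name is illustrative.

```python
import math

def expected_inverse_probability_loss(probs):
    """Expected loss when Loss(x) = 1/p(x), taking 0 * (1/0) = 0 by convention."""
    return sum(p * (1.0 / p) for p in probs if p > 0)

# Five possible values; each list is a hypothetical state of knowledge.
uniform = [0.2, 0.2, 0.2, 0.2, 0.2]         # nothing learned yet
peaked = [0.96, 0.01, 0.01, 0.01, 0.01]     # a lot learned, nothing ruled out
truncated = [0.5, 0.5, 0.0, 0.0, 0.0]       # three values ruled out exactly

for name, probs in [("uniform", uniform), ("peaked", peaked),
                    ("truncated", truncated)]:
    print(name, expected_inverse_probability_loss(probs))

# The expected loss only counts points with nonzero probability.
assert math.isclose(expected_inverse_probability_loss(uniform), 5.0)
assert math.isclose(expected_inverse_probability_loss(peaked), 5.0)
assert math.isclose(expected_inverse_probability_loss(truncated), 2.0)
```

Under this loss the sharply peaked distribution scores the same as the uniform one. Only assigning exactly zero probability to some values reduces the expected loss.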

I think this is actually a great way to think about how wrong Frequentist theory is. People treat values outside the confidence interval “as if” they had zero probability, and treat values within the confidence interval “as if” it didn’t matter to them which values were the real ones.

• Daniel Lakeland says:

On the other hand, you can see how this reliance on a truncation of the probability to zero actually works correctly for something like discriminating between random and non-random sequences, à la Per Martin-Löf.

Suppose you have an RNG that can generate a sequence of up to 100 terabits (10^14 bits) before cycling. Is it a good one? Well, there are 2^(10^14) possible sequences. Suppose that in some sense “most” of them are “as if generated by coin flips”, and that we have a test that rejects the “as if coin flip” sequences only, say, 10% of the time and rejects the “non coin flip” sequences 99.9999995% of the time. We are still going to have something like perhaps 2^(10^12) sequences that are good enough for our purposes, and we explicitly DON’T CARE which of those sequences we get, just whether the RNG gives us one of them, which is exactly the case for the “in vs out of the rejection set” concept.

• Christian Bartels says:

Not sure. The proposal aims at minimizing loss (size) for a fixed coverage. I.e., in your simplified notation, minimize Loss(x) for p(x)=const.

Also, not clear how the above proposal can be instrumentalized in a discussion of frequentist vs. Bayesian. The proposal uses the same criterion for frequentist confidence set selection as for Bayesian credible set selection. And the two sets that get selected are consistent, i.e., very similar up to differences that are due to differences in the question being asked.

• Keith O'Rourke says:

> same criterion for frequentist confidence set selection as for Bayesian credible set selection
Interesting.

You might be interested in contrasting that choice with the same criterion for Bayesian credible set selection as for likelihood based frequentist confidence set selection.

Optimal properties of some Bayesian inferences M. Evans and M. Shakhatreh http://projecteuclid.org/euclid.ejs/1229975382

Abstract

Relative surprise regions [same shape as likelihood based regions] are shown to minimize, among Bayesian credible regions, the prior probability of covering a false value from the prior. Such regions are also shown to be unbiased in the sense that the prior probability of covering a false value is bounded above by the prior probability of covering the true value. Relative surprise regions are shown to maximize both the Bayes factor in favor of the region containing the true value and the relative belief ratio, among all credible regions with the same posterior content. Relative surprise regions emerge naturally when we consider equivalence classes of credible regions generated via reparameterizations.

• Christian Bartels says:

Indeed very much related and relevant. Thanks!

After a first reading: Evans et al. (2008) propose observed relative surprise (ORS) as an optimal criterion for Bayesian confidence interval selection, and show that the Bayesian confidence interval has good frequentist properties. Observed relative surprise is defined as the size of the confidence interval selected using pointwise mutual information as criterion (or its marginalized version)!

In summary, we have Evans et al. (2008), who show that pointwise mutual information is an optimal criterion for Bayesian credible interval selection, and that the resulting credible intervals have good frequentist properties. The proposal introduced above overlaps in that it suggests using pointwise mutual information as an optimal test for Bayesian credible set selection. Frequentist properties of the credible intervals are not discussed. Instead, it focuses on showing that pointwise mutual information is an optimal test for frequentist confidence set selection. In addition, an implementation of the set selection procedure is provided.

Seems to move in a similar, promising direction.

P.S. implementation is available at
Bartels, Christian (2015): Code – Generic and consistent confidence and credible regions. figshare.
https://dx.doi.org/10.6084/m9.figshare.1528187.v1

• Keith O'Rourke says:

Thanks, apparently you do get the same shaped intervals and regions so I will be very interested in your software.

There was a difference in motivations, as you seemed to just be optimizing on a particular criterion, which to me (and Fisher, by the way) is always questionable (why I made this previous comment http://statmodeling.stat.columbia.edu/2014/06/18/judicious-bayesian-analysis-get-frequentist-confidence-intervals/#comment-173386 )

On the other hand Evans has a theory of evidence that I find convincing that dictates those shaped intervals and regions should be used. He then works out various optimality properties to assess support for the theory (sort of a severe testing of it).

Reminds me of the difference between Legendre and Gauss in their different motivations for developing least squares.

• Nick Menzies says:

Thanks for the follow-up. This clarifies a bit, I think — from the linked paper, it seems that the objective function being invoked is related to the statistical properties of the test in some way (e.g. minimize the width of an interval, covering the truth a certain amount of the time, minimizing the distance of one extent of the interval to the truth). If true, these seem like good ways to think about desirable properties for a confidence interval (without having thought hard about it), if we think of the interval as a way of communicating the precision of an estimate. However, if we are interested in decision rules (i.e. fall one side of a threshold do A, else do B), I don’t see how there is major progress to be made without an objective function that is sensitive to what you are actually going to do with the information.

Scenario 1: if your interval doesn’t cover the truth everyone dies.
Scenario 2: if your interval doesn’t cover the truth everyone gets mild sunburn.

If we don’t admit information about which scenario we are in, how can a statistical test function as a decision rule?

• Daniel Lakeland says:

Exactly. And even in scenario 1 vs 2 it can’t just be “if the interval doesn’t cover the truth” because the interval (-inf, inf) always covers the truth.

In the usual case, the loss/gain/utility function shouldn’t depend on the probability of the outcome. I can maybe imagine that if you’re doing channel encoding you’d be interested in minimizing uncertainty or something, but outside of that, it makes sense for the loss/utility function to be independent of the estimated probability. Usually what you know about something and how good it would be for that thing to happen in a certain way are totally separate concepts.

• Christian Bartels says:

Thanks. Yes makes sense. For most actual decision problems, one probably should take into account prior information – be it prior probabilities and/or perhaps more importantly as your example suggests available knowledge on relative losses of different outcomes.

This is a clear limitation of the current proposal on using pointwise mutual information as a test statistic.