# An apparent paradox regarding hypothesis tests and rejection regions

Ron Bloom wrote in with a question:

The following pseudo-conundrum is “classical” and “frequentist” — no priors involved; only two PDFs (completely specified) and a “likelihood” inference. The conundrum however may be interesting to you in its simple scope; and perhaps you can see the resolution. I cannot; and it is causing me to experience something along the lines of what Kendall says somewhere (about something else entirely) about “… the problem has that aspect of certain optical illusions; giving different appearances depending upon how one looks at it…”

Suppose I have p(x|mu0) and p(x|mu1), both weighted Gaussian sums with stipulated standard deviations and stipulated weights; for definiteness, say both are three-term sums; moreover, all three constituent Gaussians have the common mean named in the expression p(x|mu). So they look like “heavy-tailed” Gaussians, at least from a distance.

Suppose mu0 < mu1 are both stipulated too; in fact everything is stipulated, so this is *not* an estimation problem; nothing to do with "EM" or maximum likelihood. It is just a classical test between two simple alternatives. A single datum x is acquired. The classical procedure for deciding between "H0" and "H1" is to choose the test "size": put down the threshold cut T on the right tail of p(x|mu0) so that the area above that cut is the test size; the power of that test against the stipulated alternative H1 is of course the area above T under p(x|mu1).

When the PDFs are Gaussian, or in an exponential family, or when "a sufficient statistic is available," this procedure is identical to what one does if he uses the Neyman-Pearson likelihood criterion, which amounts to putting a cut with the same "size" on the more complicated random variable L(x) = p(x|mu0)/p(x|mu1). When the PDFs are nicely behaved, or more generally when the ratio is *monotonic*, the probability statement about a rejection test on the variate L(x) translates into a statement about a rejection test on the variate x simpliciter.

But in the case of this "nice" Gaussian mixture I discover that for mu1 sufficiently close to mu0 (and certain combinations of weights and standard deviations) the likelihood ratio L(x) is *not* monotonic, and so I am suddenly faced with an unexpected perplexity: it seems (to the eye anyway) that there's only one way to set up a right-tailed rejection test for such a pair of simple hypotheses, and yet the Neyman-Pearson argument seems to say that making that cut using the PDF of L(x) and making that cut using p(x|mu0) itself will not yield the same "x", for the same test size. Can you see the resolution of this (pseudo-)conundrum?
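Bloom's discovery that the likelihood ratio fails to be monotone is easy to confirm numerically. The sketch below uses a hypothetical three-term common-mean mixture of my own devising (the weights, standard deviations, and the two means are illustrative assumptions, not taken from Bloom's email); it evaluates the ratio p(x|mu1)/p(x|mu0), the reciprocal of Bloom's L(x), on a grid and checks monotonicity:

```python
import numpy as np

def mix_pdf(x, mu, weights, sds):
    """Density of a Gaussian mixture whose components all share the mean mu."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    for w, s in zip(weights, sds):
        out += w * np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
    return out

# Illustrative (hypothetical) mixture: mostly narrow, with a small heavy-tail part.
weights = [0.90, 0.09, 0.01]
sds = [1.0, 3.0, 10.0]
mu0, mu1 = 0.0, 0.5   # the two stipulated common means, mu0 < mu1

xs = np.linspace(-2.0, 8.0, 2001)
lr = mix_pdf(xs, mu1, weights, sds) / mix_pdf(xs, mu0, weights, sds)

# The ratio climbs while the narrow component dominates, then falls back once
# the wide component takes over in both numerator and denominator.
monotone = bool(np.all(np.diff(lr) >= 0))
print("likelihood ratio monotone increasing in x?", monotone)
```

With these numbers the ratio rises to a peak (near x ≈ 2) and then falls back, so no single right-tail cut on x reproduces a cut on the likelihood ratio.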

I replied: Yes, I can see how this would happen. Whether Neyman-Pearson or Bayes, if you believe the model, the relevant information is the likelihood ratio, which I can well believe in this example is not a monotonically increasing function of x. That’s just the way it is! It doesn’t seem like a paradox to me, as there’s no theoretical result that would imply that the ratio of two unimodal functions is itself unimodal.

Bloom responded:

I finally was able to see what is obvious: that indeed there are many alternative “rejection regions of the same size,” and if the PDF of the “alternative” is bumpy (as in this example), or more generally if the likelihood ratio is not monotone (and this is *not* “easy to see” for ratios of “simple” Gaussian mixtures all of whose kernels have a common mean), then indeed the best (most powerful) test is not necessarily the upper-tail rejection test. See my badly drawn diagram. This, by the way, can be filed under your topic of how the Gaussianity ansatz, sufficiently well learned, can really impede insights that would otherwise be patently obvious (to the unlearned).
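Bloom's resolution, that the most powerful size-alpha test need not be the upper-tail test, can also be illustrated numerically. The sketch below again uses a hypothetical three-term common-mean mixture of my own devising (weights, standard deviations, means, and alpha are all illustrative assumptions). It builds the Neyman-Pearson region {x : p(x|mu1)/p(x|mu0) > k}, tuning k by bisection so the region has the requested size, and compares its power against the upper-tail test of the same size:

```python
import numpy as np

def mix_pdf(x, mu, weights, sds):
    """Gaussian mixture density; all components share the mean mu."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    for w, s in zip(weights, sds):
        out += w * np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
    return out

weights = [0.90, 0.09, 0.01]      # illustrative, not from the post
sds = [1.0, 3.0, 10.0]
mu0, mu1 = 0.0, 0.5
alpha = 0.05

xs = np.linspace(-60.0, 60.0, 240001)   # fine grid; mass beyond +/-60 is negligible
dx = xs[1] - xs[0]
p0 = mix_pdf(xs, mu0, weights, sds)
p1 = mix_pdf(xs, mu1, weights, sds)
lr = p1 / p0

# Upper-tail test: choose T with P0(X > T) = alpha, then read off its power.
cdf0 = np.cumsum(p0) * dx
T = xs[np.searchsorted(cdf0, 1.0 - alpha)]
power_tail = p1[xs > T].sum() * dx

# Neyman-Pearson test: reject where lr > k, with k tuned by bisection to size alpha.
lo, hi = float(lr.min()), float(lr.max())
for _ in range(60):
    k = 0.5 * (lo + hi)
    if p0[lr > k].sum() * dx > alpha:
        lo = k          # region too big: raise the threshold
    else:
        hi = k
region = lr > k
size_np = p0[region].sum() * dx
power_np = p1[region].sum() * dx

print(f"upper-tail power: {power_tail:.3f}, NP power: {power_np:.3f}")
print("NP region includes the far right tail?", bool(region[-1]))
```

For these particular numbers the Neyman-Pearson region turns out to be roughly an interval around x ≈ 2 rather than an upper tail, and its power exceeds the tail test's power, exactly as the non-monotone likelihood ratio predicts.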

## 6 thoughts on “An apparent paradox regarding hypothesis tests and rejection regions”

1. Isn’t this why the gods invented the monotone likelihood ratio condition, to protect us from this sort of thing?

2. A related non-paradox is that inverting likelihood ratio tests in discrete distributions (e.g. a two-sided test for a binomial proportion) can (and does) lead to confidence regions with holes. See, for example: Blaker, H. (2000). Confidence curves and improved exact confidence intervals for discrete distributions. Canadian Journal of Statistics 28(4), 783–798.
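The mechanics of that inversion can be sketched in a few lines. The code below implements a minimum-likelihood ("Sterne-type") exact two-sided p-value for a binomial and inverts it over a grid of p values; the n, x, alpha, and grid are my own illustrative choices, and the code simply reports whether the resulting set is contiguous on that grid. Blaker's paper gives configurations where the inverted set genuinely has holes.

```python
import numpy as np
from math import comb

def binom_pmf(n, p):
    """Binomial(n, p) pmf over k = 0..n."""
    ks = np.arange(n + 1)
    coefs = np.array([comb(n, k) for k in ks], dtype=float)
    return coefs * p**ks * (1.0 - p)**(n - ks)

def minlike_pvalue(x, n, p):
    """Exact two-sided p-value: total probability of outcomes no more likely than x."""
    pmf = binom_pmf(n, p)
    return pmf[pmf <= pmf[x] + 1e-12].sum()

def confidence_set(x, n, alpha, grid):
    """Invert the test: keep every p whose p-value at the observed x exceeds alpha."""
    return np.array([p for p in grid if minlike_pvalue(x, n, p) > alpha])

n, x, alpha = 10, 3, 0.05            # illustrative choices
grid = np.linspace(0.001, 0.999, 999)
cs = confidence_set(x, n, alpha, grid)

# A contiguous set steps through consecutive grid points; a hole shows as a bigger jump.
step = grid[1] - grid[0]
has_holes = bool(np.any(np.diff(cs) > 1.5 * step))
print("accepted p values run from", round(float(cs.min()), 3),
      "to", round(float(cs.max()), 3), "| holes on this grid?", has_holes)
```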

• Probabilities calculated from discrete distributions must be rational numbers. Simultaneously assuming they can be irrational leads to weird behavior and all sorts of ad hoc adjustments.

• I don’t think that’s right.

In theory you could have a binomial random variable where the probability of success was, say, sqrt(2)/2. Of course the estimate of the probability you get from a finite sample would be a rational number, but the actual infinite-sample frequency could be irrational.

At some practical level, irrationals are the most impractical thing ever. Sending someone a message containing the true value of a single draw from Uniform(0,1) would take until past the end of the universe, since the value would be irrational almost surely and would have no finite symbolic representation (like pi/4) almost surely. So you’d just have to send all the binary digits, and there are infinitely many of them.

• “actual infinite-sample frequency”

Isn’t this an oxymoron? Relatedly, I came across this nice set of quotes the other day:

Cardano 1564:

So there is one general rule, namely, that we should consider the whole circuit, and the number of those casts which represents in how many ways the favorable result can occur

Leibniz 1710:

If a situation can lead to different advantageous results ruling out each other, the estimation of the expectation will be the sum of the possible advantages for the set of all these results, divided into the total number of results.

Bernoulli 1713:
…if the integral and absolute certainty, which we designate by letter a or by unity 1, will be thought to consist, for example, of five probabilities, as though of five parts, three of which favor the existence or realization of some events, with the other ones, however, being against it, we will say that this event has 3/5a, or 3/5, of certainty.

De Moivre 1711:

If p is the number of chances by which a certain event may happen, and q is the number of chances by which it may fail, the happenings as much as the failings have their degree of probability; but if all the chances by which the event may happen or fail were equally easy, the probability of happening will be to the probability of failing as p to q.

Laplace 1774:

The probability of an event is the ratio of the number of cases favorable to it, to the number of possible cases, when there is nothing to make us believe that one case should occur rather than any other, so that these cases are, for us, equally possible.