Ken Rice presents a unifying approach to statistical inference and hypothesis testing

Ken Rice writes:

In the recent discussion on stopping rules I saw a comment that I wanted to chip in on but thought it might get a bit lost in the already long thread. Apologies in advance if I misinterpreted what you wrote, or am trying to tell you things you already know.

The comment was: “In Bayesian decision making, there is a utility function and you choose the decision with highest expected utility. Making a decision based on statistical significance does not correspond to any utility function.”

… which immediately suggests this little 2010 paper: “A Decision-Theoretic Formulation of Fisher’s Approach to Testing,” The American Statistician, 64(4), 345–349. It contains utilities that lead to decisions that very closely mimic classical Wald tests, and provides a rationale for why this utility is not totally unconnected from how some scientists think. Some (old) slides discussing it are here.

A few notes, on things not in the paper:

* I know you don’t like squared-error loss on its own – and I think this is fair – but (based on prior work by others) it’s highly plausible the paper’s specific result extends to give something very similar for whole classes of bowl-shaped loss functions – that describe much the same utility in a less mathematically-tractable way. Also, I’m not claiming the utilities given are the *only* way to interpret such decisions.

* Even if one doesn’t like either squared-error loss or its close relatives, the framework at least provides a way of saying what classical tests and p-values might mean, in the Bayesian paradigm. That they mean something rather different to Bayes factors & posterior probabilities of the null is surprising to many people, particularly those keen to dismiss all use of p-values. I really wrote the paper because I was fed up with unrealistic point-mass priors being the only Bayesian way to get tests; like you, I work in areas where exactly null associations are really hard to defend. [Yup—ed.]

Here’s the abstract of Rice’s 2010 paper:

In Fisher’s interpretation of statistical testing, a test is seen as a ‘screening’ procedure; one either reports some scientific findings, or alternatively gives no firm conclusions. These choices differ fundamentally from hypothesis testing, in the style of Neyman and Pearson, which does not consider a non-committal response; tests are developed as choices between two complementary hypotheses, typically labeled ‘null’ and ‘alternative’. The same choices are presented in typical Bayesian tests, where Bayes Factors are used to judge the relative support for a null or alternative model. In this paper, we use decision theory to show that Bayesian tests can also describe Fisher-style ‘screening’ procedures, and that such approaches lead directly to Bayesian analogs of the Wald test and two-sided p-value, and to Bayesian tests with frequentist properties that can be determined easily and accurately. In contrast to hypothesis testing, these ‘screening’ decisions do not exhibit the Lindley/Jeffreys paradox, which divides frequentists and Bayesians.

This could represent an important way to look at statistical decision making.
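To make the “screening” rule concrete, here is a minimal sketch, in Python, of the decision rule as it is described in the abstract and worked through in the comments below: form the signal-to-noise statistic E[theta|data]^2 / Var[theta|data] and report a finding only if it clears a Wald-style cutoff. The function names, the alpha = 0.05 cutoff, and the normal-posterior example are illustrative choices, not taken from the paper.

```python
from scipy import stats

def bayesian_wald_statistic(post_mean, post_var):
    """Signal-to-noise statistic built from the posterior mean and variance."""
    return post_mean**2 / post_var

def screening_decision(post_mean, post_var, alpha=0.05):
    """Report a finding if the statistic exceeds the chi-square(1) cutoff that a
    two-sided Wald test at level alpha would use; otherwise conclude nothing."""
    cutoff = stats.chi2.ppf(1 - alpha, df=1)   # 3.84 for alpha = 0.05
    if bayesian_wald_statistic(post_mean, post_var) > cutoff:
        return "report a finding"
    return "no firm conclusion"

# With a flat prior and y ~ N(theta, se^2), the posterior is N(y, se^2),
# so the statistic is (y/se)^2, i.e. the classical Wald statistic.
print(screening_decision(post_mean=2.1, post_var=1.0))   # report a finding
print(screening_decision(post_mean=1.0, post_var=1.0))   # no firm conclusion
```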

6 thoughts on “Ken Rice presents a unifying approach to statistical inference and hypothesis testing”

  1. Very interesting! Will definitely look into using this test.

    While much is made of the problem of poorly defended point nulls, do the conclusions change if my null is far more defensible? From the argument within the paper, I don’t think they do. Unsure whether that should be disconcerting or not.

  2. West: thanks for your interest.

    If one uses the same loss function but with a prior that has a “spike” at the null, i.e. what one might use when a point null is reasonable, then the Bayes rule still relies only on the posterior mean and variance. So that’s not disconcerting, I think, unless for some reason one wants to insist that Bayesian tests only use the posterior probability of the null or the Bayes Factor.

    But the test’s general large-sample agreement with default frequentist methods does go away when the prior has a “spike”. This won’t be disconcerting if interest lies only in the test’s Bayesian properties, and it’s also not disconcerting if only the test’s frequentist properties are key – though it’ll take more work than usual to figure out what those properties are.

    However, getting disagreement between Bayesian and default frequentist methods is (by definition) disconcerting if one thinks these analyses should agree – and I think the fact that we all still call the Jeffreys-Lindley paradox a “paradox” suggests that lots of people do expect this agreement. So if you’re disconcerted by it, you’re not alone, but thinking carefully about what the various methods assume and the conclusions they draw should help unravel the issue.
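    As a rough illustration of this disagreement (a sketch under assumed choices, not something from the paper): for a single normal observation, the same mean-squared-over-variance statistic reproduces the classical Wald statistic under a flat prior, but is pulled toward zero under a spike-and-slab prior. The particular prior weights pi0 and slab scale tau below are arbitrary.

```python
import numpy as np
from scipy import stats

def posterior_mean_var_spike_slab(y, se, pi0=0.5, tau=1.0):
    """Posterior mean and variance of theta when y ~ N(theta, se^2) and the
    prior is a 'spike and slab': theta = 0 with probability pi0, otherwise
    theta ~ N(0, tau^2).  (An illustrative prior choice, not from the paper.)"""
    # Marginal densities of y under the spike and the slab components
    m_spike = stats.norm.pdf(y, loc=0.0, scale=se)
    m_slab = stats.norm.pdf(y, loc=0.0, scale=np.sqrt(se**2 + tau**2))
    w_slab = (1 - pi0) * m_slab / (pi0 * m_spike + (1 - pi0) * m_slab)
    # Conditional posterior of theta within the slab component
    shrink = tau**2 / (tau**2 + se**2)
    m1, v1 = shrink * y, shrink * se**2
    post_mean = w_slab * m1
    post_var = w_slab * (v1 + m1**2) - post_mean**2
    return post_mean, post_var

y, se = 2.5, 1.0
stat_flat = (y / se)**2                   # flat prior: posterior N(y, se^2), statistic 6.25
m, v = posterior_mean_var_spike_slab(y, se)
stat_spike = m**2 / v                     # roughly 1.4: the spike pulls the statistic down
print(stat_flat, stat_spike)
```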

    • After some thought, I believe my consternation comes from applying this Bayes rule to a Poisson counting problem I am working on and getting a nonsensical result.

      * I have two Poisson processes whose total rate is r=s+b, where b is known to be very small while s remains unknown and could possibly be zero.
      * Now I analyze a large amount of data (of length t) and obtain a count of n=1
      * The likelihood of getting n=1 if r=b (i.e., s=0) is very small, say a p-value of 1e-6.
      * The Bayes rule is then E[r-b|n]^2/Var[r|n] = E[s|n]^2/Var[s+b|n]
      * Because b is so tiny, Var[s+b|n] ~> Var[s|n] (this is where I think I am floundering)
      * If p(s|n) is a gamma distribution, the Bayes rule statistic equals the shape parameter of p(s|n). If I use an improper uniform prior, it is n+1 = 2. With a Jeffreys prior it’s n+1/2 = 3/2.

      If I stick with the improper uniform prior for p(s), the corresponding alpha from the Wald test is 0.157. So despite the fact that getting 1 count from b alone is incredibly unlikely, this test recommends I “conclude nothing” rather than reject the null (r=b). This seems bizarre to me. Now the most likely reason this result makes no intuitive sense is that I am not applying the test correctly. A quick numerical check of this arithmetic is sketched below.
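      The following sketch just spells out the same calculation (gamma posterior shape as the statistic, then the two-sided Wald p-value); the variable names are illustrative.

```python
from scipy import stats

n = 1                                          # the single observed count
for prior, shape in [("improper uniform", n + 1), ("Jeffreys", n + 0.5)]:
    # Gamma(shape, rate) posterior for s: mean = shape/rate, var = shape/rate^2,
    # so mean^2 / var = shape, whatever the rate (i.e. whatever the exposure t).
    wald_stat = shape
    alpha = 2 * (1 - stats.norm.cdf(wald_stat**0.5))   # two-sided Wald p-value
    print(prior, wald_stat, round(alpha, 3))
# improper uniform: statistic 2,   alpha ~ 0.157
# Jeffreys:         statistic 1.5, alpha ~ 0.221
```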

      • West: if the null is that s=0, the Bayes rule is based on E[s|n]^2/Var[s|n], equivalently E[r-b|n]^2/Var[r-b|n]. If, a priori, we are absolutely certain that b is really tiny and our prior has no strong dependence between b and r, this is approximately E[r|n]^2/Var[r|n]. If we just set b=0, you want to test r=s=0, and have one observation with n=1. Then, under the uniform and Jeffreys priors for s, the posterior mean equals the posterior variance equals the Bayesian Wald statistic, with values 2 and 1.5, respectively. In other words, I think your algebra is right.

        Under both priors the bulk of the support in the posterior is well within a couple of posterior SDs of the null – so no, the signal:noise ratio is not strong enough to trigger a “reject” signal. (Values of n=3 would trigger a signal, however.) For n=1 the posterior supports values near s=1 (sensibly) but on this scale the noise overwhelms that.

        If you don’t like this answer because you deem the available precision to be irrelevant to your testing decision, the loss function we’re using doesn’t express your utility; you want to answer a question that’s different to the one being addressed. (Perhaps your loss only depends on whether s=0 vs s>0?)

        NB you might prefer looking at the problem on the Log[s] scale, with the null being some very small value of s.
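        To see where the “reject” signal kicks in under the improper uniform prior, here is a small sketch of the arithmetic; the 5% two-sided cutoff is an assumed convention, not something specified above.

```python
from scipy import stats

cutoff = stats.chi2.ppf(0.95, df=1)    # 3.84, the usual 5% two-sided Wald cutoff
for n in range(6):
    stat = n + 1                       # posterior shape (= the statistic) under the improper uniform prior
    print(n, stat, "reject s=0" if stat > cutoff else "no firm conclusion")
# n = 0, 1, 2 give "no firm conclusion"; n >= 3 gives "reject s=0"
```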

  3. Ken: This is thoughtful and well written.

    I would have preferred an explicit qualification of “either reports some scientific findings [now, given current awareness/evaluation of all relevant studies], or gives no firm conclusions [for now]”.

    Now your later statement “if the inferential goal is a comprehensive summary of what is known, then reporting ‘nothing’ is inappropriate” makes perfect sense to me, but do you mean to separate the evaluation of evidence (comprehensive summary of what is known) completely from any “underlying decision [is] whether a report on some scientific quantity is merited, or not”?

    Also, I would have _predefined_ the loss function most appropriate for statistical testing (as I think Tukey argued, we just want an indication of the direction of effect) as one that only involves getting the sign correct or incorrect (you refer to that loss function in your slides).

    Interesting that the much more ambitious goal of getting an accurate estimate specifies the loss function that leads to Bayesian analogs of the Wald test…

    • Hi Keith, thanks for your nice comments. Here are some clarifications – happy to follow up more by email.

      On my “later statement”; yes, I do want to make that distinction. If one wants a comprehensive summary, giving the full posterior is fine – much in line with lots of Andrew’s advice on this blog, I think. But sometimes that’s way too much detail to be practical, and a criterion is needed – a loss function – that describes how good or bad different cruder choices would be (e.g. the value of a point estimate, the yes/no of a test).

      Is this completely separating the comprehensive summary from the cruder one? No, if we’re permitting subjunctive use of loss functions; i.e. a rational person who held THIS utility would do THIS, but with THAT utility would do THAT, etc etc. It’s okay, I think, to view comprehensive and crude as answers to different questions. But let’s say what those questions are, explicitly.

      On using sign: it’s a bit opaque in the posted slides, but the other losses use signed decisions (only) with continuously-valued measures of effect size, albeit measures that don’t accelerate as wildly as the quadratic ones in the paper. I do think effect size should matter to some extent; if the signal we missed is a modest improvement over a sugar pill, that’s bad, but not as bad as missing the next penicillin.
