Statistical methods that only work if you don’t use them (more precisely, they only work well if you avoid using them in the cases where they will fail)

Here are a couple examples.

1. Bayesian inference

You conduct an experiment to estimate a parameter theta. Your experiment produces an unbiased estimate theta_hat with standard error 1.0 (on some scale). Assume the experiment is clean enough that you’re ok with the data model, theta_hat ~ normal(theta, 1.0). Now suppose that theta_hat happens to equal 1, and further suppose you are doing Bayesian inference with a uniform prior on theta (or, equivalently, a very weak prior such as normal(0, 10)). Then your posterior distribution is theta ~ normal(1, 1), and the posterior probability is 84% that theta is greater than zero. (See section 3 here.) You then should be willing to bet that theta is greater than zero with 5-1 odds (assuming you’re not planning to bet against anyone with private information about theta). OK, maybe not 5-1 because you’re concerned about the jack of spades, etc. So, 4-1.
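
If you want to check the arithmetic, here’s a quick Python sketch (using scipy) that reproduces the 84% figure and the implied betting odds:

```python
from scipy.stats import norm

theta_hat, se = 1.0, 1.0

# Flat prior on theta, so the posterior is normal(theta_hat, se).
post = norm(loc=theta_hat, scale=se)

p_positive = 1 - post.cdf(0)           # Pr(theta > 0 | theta_hat = 1), about 0.84
odds = p_positive / (1 - p_positive)   # implied betting odds, about 5.3 to 1

print(f"Pr(theta > 0) = {p_positive:.2f}, implied odds = {odds:.1f}-1")
```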

Still, it’s a problem, if you’re willing to routinely offer 4-1 bets on data that are consistent with noise (which is one way to put it when you observe an estimate that’s 1 standard error away from zero). Go around offering those bets to people and you’ll soon lose all your money.

OK, yeah, the problem is with the flat prior. But that’s something that people do! And the flat-prior analysis isn’t terrible; it can summarize the data in useful ways, even if you shouldn’t “believe” or lay bets on all its implications.

In practice, what do we do? We use the Bayesian inference selectively, carefully. We report the 95% posterior interval for theta, (-1, 3), but we don’t talk about the posterior probability that theta is positive, and we don’t publicize those 5-1 odds. Similarly, when an estimate is two standard errors away from zero, we consider it as representing some evidence for a positive effect, but we wouldn’t bet at 39-1 odds. In a sense, we’re acting as if we have a high prior probability that theta is close to zero—but not quite, as we’re still giving out that (-1, 3) interval. We’re being incoherent, which is fine—you know what they say about foolish consistency—but in any case we should be aware of this incoherence.
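
To make that concrete, here’s a small Python sketch comparing the flat-prior answers with what you’d get from a prior that actually encodes “theta is probably close to zero” (I’m using normal(0, 0.5) purely for illustration):

```python
from scipy.stats import norm

def pr_positive(theta_hat, se=1.0, prior_sd=None):
    """Pr(theta > 0 | theta_hat) under a flat prior (prior_sd=None)
    or a normal(0, prior_sd) prior, via the conjugate normal update."""
    if prior_sd is None:
        post_mean, post_sd = theta_hat, se
    else:
        post_var = 1.0 / (1.0 / prior_sd**2 + 1.0 / se**2)
        post_mean, post_sd = post_var * theta_hat / se**2, post_var**0.5
    return 1 - norm.cdf(0, loc=post_mean, scale=post_sd)

for z in (1.0, 2.0):                 # estimate 1 or 2 standard errors away from zero
    for prior_sd in (None, 0.5):
        p = pr_positive(z, prior_sd=prior_sd)
        label = "flat prior" if prior_sd is None else f"normal(0, {prior_sd}) prior"
        print(f"theta_hat = {z}, {label}: Pr(theta > 0) = {p:.2f}, odds {p / (1 - p):.1f}-1")
```

With the informative prior, the 1-standard-error estimate gives roughly 2-1 odds rather than 5-1, and the 2-standard-error estimate gives roughly 4-1 rather than 40-ish, which is closer to how we actually behave.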

2. Null hypothesis significance testing

You have a hypothesis that beauty is related to the sex ratio of babies, so you find some data and compare the proportion of girl births from attractive and unattractive parents. You’ve heard about this thing called forking paths so you preregister your study. Fortunately, you still find something statistically significant! OK, it’s not quite what you preregistered, and it goes in the wrong direction, and it’s only significant at the 10% level, but that’s still enough to get the paper published in a top journal and have it receive two awards.

OK, yeah, the problem is the famous incoherence of classical statistical practice. What exactly is that p-value? Is it a measure of evidence (p less than 0.01 is strong evidence, p less than 0.05 is ok evidence, p less than 0.1 is weak evidence, anything else counts as no evidence at all), or is it a hard rule (p less than 0.05 gets converted into Yes, p more than 0.05 becomes No)? It’s not clear. The procedure described in the paragraph immediately above corresponds to treating the p-value as evidence, and this leads to obvious problems so maybe you should just use the hard rule, but that puts you in the uncomfortable position of making strong statements based on noisy data.
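
Just to spell out the two readings side by side, here’s a toy Python sketch that runs the same z statistic through both:

```python
from scipy.stats import norm

def p_value(z):
    """Two-sided p-value for a z statistic."""
    return 2 * (1 - norm.cdf(abs(z)))

def as_evidence(p):
    """Reading 1: the p-value as a graded measure of evidence."""
    if p < 0.01: return "strong evidence"
    if p < 0.05: return "ok evidence"
    if p < 0.10: return "weak evidence"
    return "no evidence"

def as_hard_rule(p, alpha=0.05):
    """Reading 2: the p-value as a hard yes/no rule."""
    return "Yes" if p < alpha else "No"

for z in (1.0, 1.7, 2.0, 2.8):
    p = p_value(z)
    print(f"z = {z}: p = {p:.3f} -> {as_evidence(p)} | hard rule: {as_hard_rule(p)}")
```

An estimate 1.7 standard errors from zero counts as “weak evidence” under the first reading and a flat No under the second, which is exactly the kind of case where the two readings pull in different directions.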

In practice, what do we do? It’s a mix. We use that hard 5% rule on the preregistered hypothesis—ok, not always, as indicated by the above link, but often—and then we also report p-values as evidence. Again, incoherence is not in itself a problem, but it can lead to a worst-of-both-worlds situation, as we’ve seen in Psychological Science, PNAS, etc., of literatures that drift based on some mixture of speculation and noise.

3. Where are we, then?

I’m not saying Bayes is wrong or even that null hypothesis significance testing is wrong. These methods have their place. What I’m saying is that they depend on assumptions, and we don’t always check these assumptions.

To put it another way, Bayesian methods and null hypothesis significance testing methods work—really, they work for solving engineering problems and increasing our scientific understanding, I’m not just saying they “work” to get papers published—but the way to get them to work is to use them judiciously, to walk around all the land mines. The good news is that you can use fake-data simulation to find out where those land mines are.
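
Here’s a minimal example of what such a simulation might look like, assuming purely for illustration that true effects are drawn from normal(0, 0.5). It checks how often the flat-prior “84%” claim actually holds, and what happens if you lay those 4-1 bets:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Fake world, assumed for illustration: true effects are usually close to zero.
theta = rng.normal(0.0, 0.5, size=n)
theta_hat = rng.normal(theta, 1.0)     # unbiased estimates with standard error 1

# Cases where flat-prior Bayes claims Pr(theta > 0) is about 84%.
sel = (theta_hat > 0.9) & (theta_hat < 1.1)
print("claimed Pr(theta > 0):", 0.84)
print("actual  Pr(theta > 0):", (theta[sel] > 0).mean().round(2))

# Offer 4-1 bets that theta > 0 in those cases: win 1 if right, lose 4 if wrong.
returns = np.where(theta[sel] > 0, 1.0, -4.0)
print("average return per bet:", returns.mean().round(2))   # negative: you lose money
```

In this fake world the flat-prior “84%” cases come out positive only about two-thirds of the time, so the 4-1 bets lose money on average; change the assumed distribution of true effects and you can see where the method does and doesn’t get into trouble.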

27 thoughts on “Statistical methods that only work if you don’t use them (more precisely, they only work well if you avoid using them in the cases where they will fail)”

  1. > Still, it’s a problem, if you’re willing to routinely offer 4-1 bets on data that are consistent with noise (which is one way to put it when you observe an estimate that’s 1 standard error away from zero). Go around offering those bets to people and you’ll soon lose all your money.

    > OK, yeah, the problem is with the flat prior.

    I’d say the bigger problem is sources of systematic error not included in the likelihood.

    Like the all-cause mortality in the Pfizer/Moderna trials. There were 37 deaths in the vaccinated group vs. 33 in the placebo group, which tells us any mortality benefit has something like an 80-90% chance of being less than 10%.

    But they excluded the elderly and frail, and the mortality rates were about half those found in the life tables for their median age. This is a much bigger problem, and it is hardly the only source of systematic error.
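
    A rough check of that 80-90% figure, as a sketch assuming equal person-time in the two arms, a uniform prior on the split of deaths, and ignoring the systematic-error issues just raised:

```python
from scipy.stats import beta

deaths_vaccine, deaths_placebo = 37, 33

# With (assumed) equal person-time in the two arms, the split of total deaths is
# binomial with p = rate_vaccine / (rate_vaccine + rate_placebo); uniform prior on p.
post = beta(1 + deaths_vaccine, 1 + deaths_placebo)

# "Mortality benefit less than 10%" means rate ratio > 0.9, i.e. p > 0.9 / 1.9.
threshold = 0.9 / (0.9 + 1.0)
print("Pr(benefit < 10%):", round(1 - post.cdf(threshold), 2))   # roughly 0.8
```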

  2. > In practice, what do we do? We use the Bayesian inference selectively, carefully. We report the 95% posterior interval for theta, (-1, 3), but we don’t talk about the posterior probability that theta is positive

    Maybe you should report a 95% confidence interval for theta instead… Then if someone thinks you’re talking about probabilities, it’s really their fault.

    • “…it’s really their fault.”

      That depends on the someone. Even a very carefully written description of the findings could easily lead someone to believe they should be 95% confident in the result. That sounds like a probability.

      (I’m a fan of the CI and think it solves a lot of problems in science by forcing discussion of actual values. But it also has issues and is imperfect.)

  3. What would happen if you routinely took one data point from N(0,1), and made the kind of bet indicated? Seems like you’d break even (given unlimited bankroll and all that).

    So the lesson learned here is that Bayes theorem won’t magically give you a betting edge unless your informational inputs actually contain a betting edge.

    Doesn’t seem like a failing though. If there were some statistical method that could turn cluelessness about theta into a betting edge, we’d all be billionaires and doing something other than reading this blog.

      • If they know it’s a good or bad bet, then they know more than someone who knows nothing about theta.

        That people with more relevant information tend to win more bets is a curious criticism of Bayes theorem.

        • Jbayes:

          Yes, that’s the point! In real life, we do know more than “nothing” about theta. If theta is the effect of some new treatment, then on average we know that theta is likely to be near zero and not likely to be huge. I am not criticizing Bayes’ theorem; my criticism of Bayesian methods is that they are used with models that contradict our prior information.

  4. This reminds me of that point you made somewhere, Andrew, about pluralism: most statistical methods can produce usable results in some sense when wielded by those who fully understand their strengths and weaknesses, and how they relate to increasing scientific understanding or technological progress. (After googling I see it was in your book chapter “How do we choose our default methods?”)

    • The examples in the OP fail before we even get to the statistics step though:

      > You have a hypothesis that beauty is related to the sex ratio of babies

      It is. The end. It may be negligible, but we know everything is correlated with everything else without any data at all.

      Rather than testing the strawman of no difference, they need to deduce what range of results is consistent with their actual hypothesis. Ideally there would be at least one other competing hypothesis that predicts something different. Most likely they would collect a different type of data altogether in that case (e.g., predict the shape of a time series, or a dose-response curve).

      Anyway, all statistical methods are equally worthless when used to test a strawman.

      • Thinking more about “pluralism”, I’d say it is more of a hierarchy.

        From left to right we trade accuracy/information for convenience/efficiency:

        A discrete posterior is approximated by the continuous posterior which is summarized by the credible interval which is approximated by the confidence interval which is inverted to the p-value which is dichotomized into significance.

      • I was thinking that even this is related to using, say, Neyman-Pearson decision-theoretic inference in a principled and informed way (your example beginning “Rather…” potentially being an example of how a scientist with this expertise might apply this method).

        • We aren’t making a decision though. We are concluding something like “explanation A fits the data twice as well as explanation B,” or “the parameter is about 95% likely to be within about a factor of two of x.”

          This info can later be used to make some decision, in which case there will be consequences for being mistaken that can be used along with these probabilities.

  5. “What I’m saying is that they depend on assumptions, and we don’t always check these assumptions.”

    Would it be possible to formulate a description of a given test result (in either the Bayesian or NHST scenarios) so that the *collective* set of assumptions would be highlighted (as consistent or inconsistent with the data and model), rather than what is typically a *single* parameter assumption?

    I am looking at a bookshelf full of undergraduate methods textbooks in psychology, and I cannot find one that does. Few of them are recent, however, so this set might not be an up-to-date sample.

      • Yes, I used that section (11.1) in class (an introductory methods course for undergraduate psychology), but it was a little disconnected from Sections 4.2 and 8.4 where intervals initially came up. I am thinking of something to be more directly stated when we report an interval.

        Also, it was tricky to line up that discussion with the approach of the psych methods text we used, because that text introduced inference quite early, whereas ROS has that section somewhat later, after regression and interactions are discussed. But that might have been a problem with my choice of methods text.

  6. Regarding the 84% result being too high: a likelihood approach (after A.W.F. Edwards) gives a probability of about 65%.
    So I guess that confirms the problem is the prior(?)

  7. On p-Values and Bayes Factors – Leonhard Held and Manuela Ott (https://www.annualreviews.org/doi/pdf/10.1146/annurev-statistics-031017-100307)

    Edwards derived the likelihood ratio of two point hypotheses as exp((−t^2)/2) where t is the observed effect size measured in standard errors. This LR can be generalised to composite hypotheses (Zhang, Z. (2009) “A Law of Likelihood for Composite Hypotheses,” arXiv:0901.0463 [math.ST] and Bickel, D. (2012) “The Strength of Evidence for Composite Hypotheses: Inference to the Best Explanation,” Statistica Sinica, 22, 1147-1198).
    The LR in this case is 0.61, and with prior odds of 1, the posterior probability is 62%ish.

    • Thanks for the references! If I understand correctly the generalization is to consider the maximum likelihood for each composite hypothesis so the LR for positive vs negative is just the LR for 1 vs 0.
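
      Here’s a quick numerical version of that reading, as a sketch that takes the composite LR to be the ratio of maximized normal likelihoods on each side of zero:

```python
import numpy as np
from scipy.stats import norm

theta_hat, se = 1.0, 1.0
loglik = lambda theta: norm.logpdf(theta_hat, loc=theta, scale=se)

# Maximize the likelihood separately over each composite hypothesis.
best_pos = loglik(max(theta_hat, 0.0))  # over theta > 0: at theta_hat when theta_hat > 0
best_neg = loglik(min(theta_hat, 0.0))  # over theta < 0: at the boundary theta = 0 here

lr = np.exp(best_pos - best_neg)        # exp(t^2 / 2) = 1.65 for t = 1
print("LR (positive vs negative):", round(lr, 2))
print("'probability' positive at prior odds 1:", round(lr / (1 + lr), 2))   # ~0.62
```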

      • Hmm, maybe there are at least three ways to quantify the evidence in this y = 1, y ~ normal(theta, 1) example:
        1) Bayesian inference on theta: we get Pr(theta>0) = 84%.
        2) LR test of theta>0 vs theta<0: based on your discussion above it should be 62%.
        3) “Stacking”? Because 1) is just the BMA solution over two discrete models (theta<0 and theta>0), which could be further replaced by stacking. But then stacking is even worse: it will assign 100% weight to theta>0, which is even more over-confident.

        Both 2) and 3) do not directly depend on the theta prior. But such independence can also lead to worse results…

        • I think that the problem with 2) is not that it doesn’t depend on the prior – one does actually use the fact that prior(theta positive)=prior(theta negative) to get that “probability”.

          The problem is that – as far as I understand – it doesn’t depend on the likelihood function beyond two points. We could move likelihood mass from one side to the other and the “probability” wouldn’t change.

          If instead of theta_hat ~ normal(theta, 1) we had theta_hat ~ uniform(theta-1.1, theta+1.1), then observing theta_hat=1 we would conclude that the “probability” that theta is positive or negative is the same. There would be only three possible answers for a uniform distribution and a symmetric prior: 50/50, 100/0, or 0/100.
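
          Here’s a quick numerical check of that point, as a sketch using a flat prior over a grid of theta values:

```python
import numpy as np

theta_hat, half_width = 1.0, 1.1
grid = np.linspace(-5, 5, 1_000_001)                   # fine grid of theta values

# uniform(theta - 1.1, theta + 1.1) likelihood, evaluated at theta_hat = 1
lik = (np.abs(theta_hat - grid) < half_width) / (2 * half_width)

# Maximized-likelihood reading: compare the best theta on each side of zero.
print("composite LR:", lik[grid > 0].max() / lik[grid < 0].max())       # 1.0, i.e. "50/50"

# Flat-prior Bayesian reading: integrate the posterior (proportional to lik) on each side.
print("Pr(theta > 0 | theta_hat = 1):", round(lik[grid > 0].sum() / lik.sum(), 2))   # ~0.95
```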

        • @Carlos,
          good example of the uniform likelihood.
          Think about an even more counterintuitive example: if the likelihood is p(y|theta) = 0.05 · 1(theta-1.1 < y < theta+1.1) + 0.95 · 1(theta+2 < y < theta+2.000001), then the likelihood ratio test will say the p-value of (theta>0) given y=1 is 0.05, while Bayes will say Pr(theta>0 | y=1) = 1: both will reject with high confidence but in different directions!

          *edit: change uniform to indicator function

        • I imagine that when you write unif(a,b) you don’t mean the usual normalized definition 1/(b-a) if x is in [a, b] and 0 otherwise. (Using that definition the posterior probability that theta is positive conditional on y=1 would be less than 5%.)
