Guttman points out another problem with null hypothesis significance testing: It falls apart when considering replications.

Michael Nelson writes:

Re-reading a classic from Louis Guttman, “What is not what in statistics,” I saw his “Problem 2” with new eyes given the modern replication debate:

Both estimation and the testing of hypotheses have usually been restricted as if to one-time experiments, both in theory and in practice. But the essence of science is replication: a scientist should always be concerned about what will happen when he or another scientist repeats his experiment. For example, suppose a confidence interval for the population mean is established on the basis of a single experiment: what is the probability that the sample mean of the next experiment will fall in this interval? The level of confidence of the first experiment does not tell this. … The same kind of issue, with a different twist, holds for the testing of hypotheses. Suppose a scientist rejects a null hypothesis in favour of a given alternative: what is the probability that the next scientist’s experiment will do the same? Merely knowing probabilities for type I and type II errors of the first experiment is not sufficient for answering this question. … Here are some of the most realistic problems of inference, awaiting an answer. The matter is not purely mathematical, for the actual behaviour of scientists must be taken into account. (p.84)

This statement, literally as old as me [Nelson], both having been “issued” in 1977, is more succinct and more authoritative than most summaries of the current controversy. Guttman is also remarkably prescient in his intro as to the community’s reaction to this and other problems he highlights with conventional approaches:

An initial reaction of some readers may be that this paper is intended to be contentious. That is not at all the purpose. Pointing out that the emperor is not wearing any clothes is in the nature of the case somewhat upsetting. … Practitioners…would like to continue to believe that “since everybody is doing it, it can’t be wrong”. Experience has shown that contentiousness may come more from the opposite direction, from firm believers in unfounded practices. Such devotees often serve as scientific referees and judges, and do not refrain from heaping irrelevant criticisms and negative decisions on new developments which are free of their favourite misconceptions. (p. 84)

Guttman also makes a point I hadn’t really considered, nor seen made (or refuted) in contemporary arguments:

Furthermore, the next scientist’s experiment will generally not be independent of the first’s since the repetition would not ordinarily have been undertaken had the first retained the null hypothesis. Logically, should not the original alternative hypothesis become the null hypothesis for the second experiment?

He also makes the following, almost parenthetical statement, cryptic to me perhaps because of my own unfamiliarity with the historical arguments against Bayes:

Facing such real problems of replication may lead to doubts about the so-called Bayesian approach to statistical inference.

No one is perfect!

My reaction: Before receiving this email, I’d never known anything about Guttman; I’d just heard of Guttman scaling, that’s all. The above-linked article is interesting, and I guess I should read more by him.

Regarding the Bayes stuff: yes, there’s a tradition of anti-Bayesianism (see my discussions with X here and here), and I don’t know where Guttman fits into that. The specific issue he raises may have to do with problems with the coherence of Bayesian inference in practice. If science works forward from prior_1 to posterior_1, which becomes prior_2, which is then combined with data to yield posterior_2, which becomes prior_3, and so forth, then this could create problems for the analysis of an individual study, as we’d have to be very careful about what we’re including in the prior. I think these problems can be directly resolved using hierarchical models for meta-analysis, but perhaps Guttman wasn’t aware of then-recent work in that area by Lindley, Novick, and others.
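To give a sense of what I mean by a hierarchical model here, consider a toy normal-normal meta-analysis (made-up numbers, just a sketch): each study’s estimate gets partially pooled toward an overall mean, with the between-study variation estimated from all the studies together rather than folded informally into any one study’s prior.

```python
import numpy as np

# Toy normal-normal hierarchical meta-analysis (hypothetical numbers).
# Each study j reports an estimate y[j] with standard error s[j];
# study effects theta_j ~ Normal(mu, tau^2), with a flat prior on mu.
y = np.array([0.28, 0.10, 0.41, 0.05])   # hypothetical study estimates
s = np.array([0.15, 0.12, 0.20, 0.10])   # hypothetical standard errors

# Marginal likelihood of the data on a grid of between-study sd tau,
# with mu integrated out under a flat prior.
taus = np.linspace(0, 1, 201)
log_marg = np.empty_like(taus)
for i, tau in enumerate(taus):
    v = s**2 + tau**2                      # total variance of y[j] given tau
    w = 1 / v
    mu_hat = np.sum(w * y) / np.sum(w)     # precision-weighted mean
    log_marg[i] = (-0.5 * np.sum(np.log(2 * np.pi * v))
                   - 0.5 * np.sum(w * (y - mu_hat)**2)
                   + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(np.sum(w)))

# Posterior over tau (uniform prior on the grid), then partial pooling:
# each study estimate is shrunk toward the overall mean.
p_tau = np.exp(log_marg - log_marg.max())
p_tau /= p_tau.sum()
tau_hat = np.sum(p_tau * taus)             # point summary of tau
v = s**2 + tau_hat**2
w = 1 / v
mu_hat = np.sum(w * y) / np.sum(w)
theta_pooled = mu_hat + (tau_hat**2 / v) * (y - mu_hat)
print(f"overall mean {mu_hat:.3f}, between-study sd {tau_hat:.3f}")
print("partially pooled effects:", np.round(theta_pooled, 3))
```

The point of the sketch is that the replication question gets handled by modeling all the experiments jointly, not by stacking posteriors into priors one study at a time.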

Regarding the problems with significance testing: I think Guttman got it right, but he didn’t go far enough, in my opinion. In particular, he wrote, “Logically, should not the original alternative hypothesis become the null hypothesis for the second experiment?”, but this wouldn’t really work, as null hypotheses tend to be specific and alternatives tend to be general. I think the whole hypothesis-testing framework is pointless, and the practical problems where it’s used can be addressed using other methods based on estimation and decision analysis.
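And to make Guttman’s first question concrete, here’s a quick simulation (arbitrary settings, not from his paper) of how often the next experiment’s sample mean lands inside the first experiment’s 95% interval:

```python
import numpy as np

rng = np.random.default_rng(1)

# How often does a second experiment's sample mean fall inside the first
# experiment's 95% confidence interval? (Arbitrary settings: normal data,
# equal sample sizes.)
n, mu, sigma, reps = 30, 0.0, 1.0, 100_000
x1 = rng.normal(mu, sigma, size=(reps, n))   # first experiments
x2 = rng.normal(mu, sigma, size=(reps, n))   # replications
half_width = 1.96 * x1.std(axis=1, ddof=1) / np.sqrt(n)   # normal-approx 95% CI
captured = np.abs(x2.mean(axis=1) - x1.mean(axis=1)) < half_width
print(captured.mean())   # roughly 0.83, not 0.95
```

Under these assumptions the capture rate comes out around 0.83, not 0.95, which is Guttman’s point: the confidence level attached to the first experiment is not a statement about what a replication will do.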

18 thoughts on “Guttman points out another problem with null hypothesis significance testing: It falls apart when considering replications.”

    • Hi Andrew,

      What are the other available methods based on decision analysis that you would suggest? I have always seen NHST as a flawed but necessary evil rooted in decision theory, and I’m desperate to read about (or invent, because why not?) better methods. Also, in your opinion, why have these other methods not caught on the way NHST has?

  1. Hyperbole follows:

    Statisticians would do far, far more good (however defined) by getting researchers to abandon significance testing than by resolving the Bayesian/Frequentist, or any other, controversy.

    • In high-energy physics we still do significance testing, but we publish our null results. For example, search arxiv.org for “Search for” and “Higgs” and you’ll get (tens of) thousands of hits. If we get a null result, it usually gets published with a 95% CL upper limit. HEP results are designed to be replicable, and a discovery isn’t considered confirmed until it reaches 5-sigma significance and has been replicated (which is why there are multiple independent detectors at the LHC), so null results are considered highly informative.

      Recently we’ve also started publishing the likelihood distributions of the parameters of interest and all nuisance parameters, sometimes along with other information that allows for the results to be more comprehensively reinterpreted to incorporate new results or test new theories. There’s an accessible overview of this in arXiv:2009.06864, “Reproducibility and Replication of Experimental Particle Physics Results” by Junk and Lyons.

      • Thanks! I will look at this paper.

        However, the “researchers” I referenced don’t work in the environment that you describe. The researchers I’m describing constitute many, if not most, biomedical researchers and many, if not most, social scientists…and probably a terrifyingly large number of other people spending public money and teaching new students.

  2. I think the quoted sections have a problem analogous to the issues with NHST itself. I can absolutely agree with the statements that the confidence interval (or any other measure) from the first experiment cannot be relied on for the outcome of a replication. However, rejecting that first confidence interval is much like rejecting a null hypothesis – there are an infinite number of possible alternatives. It is too easy to use the stated words to mean that the first experiment tells you nothing about any replication. There are statistical issues with any replication, just due to random variation. And then there are the myriad issues associated with researcher degrees of freedom. But just as it would be bad practice to use the first experiment’s outcomes as a basis for the replication, I think it is also bad practice to ignore the first experiment entirely. He does not say to ignore it, but the quoted sections don’t help with knowing what value there is in the first experiment for any attempted replication.

  3. Was not aware of this paper.

    When I started to work on meta-analysis in the mid-1980s, it seemed that most in the statistics discipline thought it was too dangerous to analyze more than one study at a time. There were some exceptions, who coined derogatory terms for that attitude like “the cult of a single study.”

    I was a graduate student at the time, and some in the department thought I should be strongly encouraged to withdraw from the program, given my naïve decision to publish methods for meta-analysis. That is what motivated me to write this paper – https://statmodeling.stat.columbia.edu/2012/02/12/meta-analysis-game-theory-and-incentives-to-do-replicable-research/#comment-73427

    The world does change.

  4. I think a quote from earlier in Guttman’s remarks (p. 82) captures the wider problem of inference, and is perhaps even more apt today in the age of “data science”:

    > Perhaps worse, many—if not most—practitioners do not do the scientific thinking that must precede statistical inference. They do not make the choice of null versus alternative hypothesis that is properly tailor-made to their specific substantive problem. They behave as if under the delusion that the choice is not in their hands, that the null hypothesis is pre-determined either by the mathematicians who created modern statistical inference or by some immutable and contentless principles of parsimony.

    Sure, we could complain about invoking null and alternative hypotheses here, but it is clear that Guttman is referring to the fact that the major problems with hypothesis testing arise less from the “tests” than from the “hypotheses”. In practice, the “hypotheses” are meaningless defaults divorced from the actual domain of study.

  5. > the repetition would not ordinarily have been undertaken had the first retained the null hypothesis. Logically, should not the original alternative hypothesis become the null hypothesis for the second experiment?

    NO NO NO NO NO NO NO NO NO NO NO!

    This is perhaps the purest form of an incredible fallacy. I wouldn’t have expected to see someone admitting that they thought it was a good idea.

    But I’ve been complaining for a while about papers that take this stylized form:

    1. There is a phenomenon of interest.

    2. We consider two explanations for the phenomenon, (A) and (B).

    3. Our evidence tends to reject (A) as an explanation for the phenomenon.

    4. Therefore, we conclude that explanation (B) is correct.

    And that’s exactly what’s being *proposed* here! If you reject the null hypothesis, that won’t justify an assumption that some other hypothesis is correct! You have to support the alternative hypothesis first!

    • I think what’s being proposed is something else:

      1) Phenomenon of interest
      2) Two explanations A, B
      3) Evidence from experiment 1 tends to reject explanation A
      4) Collect evidence from experiment 2 and see if it tends to reject explanation B

      This is not at all the same as what you suggested.

      • Is it different? I’d say that your null hypothesis in the last step consists in considering that explanation B is correct.

        The problem is that the exact same reasoning applies when you consider

        2’) Two explanations A, C

        2’’) Two explanations A, D

        Etc.

        So it’s not clear why step 4) should be relative to any particular explanation B, C, D, or Z.

        • I think it’s different because, rather than just accepting that B is correct, the proposal is to try to test B against some predictions it makes and see if it can be shown to be wrong.

          Of course I don’t hold with this whole “hypothesis testing” framework, but it seems to me that 4 as I understood it is vastly better than Michael’s.

  6. Mr Guttman appears to be complaining about epistemic uncertainty.
    Of course we do not know the probability of replication, because this depends on whether the hypothesis is true or not, which is exactly what we don’t know; hence the need for an experiment. The problem has nothing to do with NHST; the problem is with trying to estimate the ‘probability’ that a hypothesis is true in the real world (as opposed to artificial situations like card or dice games) in the absence of a known prior probability. This problem can be avoided by concerning ourselves with strength of evidence (that is, likelihood), rather than probability.
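
    For instance, a toy likelihood-ratio calculation (made-up numbers, purely illustrative) reports relative support from the data without needing a prior:

    ```python
    from scipy import stats

    # Toy illustration: strength of evidence as a likelihood ratio.
    # Hypothetical data: 14 successes in 20 trials; compare p = 0.7 vs p = 0.5.
    k, n = 14, 20
    lr = stats.binom.pmf(k, n, 0.7) / stats.binom.pmf(k, n, 0.5)
    print(f"likelihood ratio for p=0.7 vs p=0.5: {lr:.2f}")
    # The ratio says how strongly the data favor one hypothesis over the other;
    # turning it into a probability that either is true would require a prior.
    ```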

  7. > this wouldn’t really work, as null hypotheses tend to be specific and alternatives tend to be general.

    I think there is a reasonable variation on what he said, given his first point. In a replication study you could set the null hypothesis to $H_0$: $\beta$ is in the original confidence interval. Then, if you reject that null, you have a real conflict.

    That’s not what he actually said, and I’m not sure it’s really the best way to go about things, but I don’t think it’s unreasonable.
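
    A minimal sketch of how such a test might look (my reading of the proposal, with made-up numbers): since the null is the composite claim that $\beta$ lies somewhere in the original interval, test against the endpoint nearest the replication estimate:

    ```python
    from scipy import stats

    # Sketch of a replication test with H0: beta lies in the original study's CI.
    # All numbers below are hypothetical.
    orig_lo, orig_hi = 0.05, 0.45        # original 95% CI for beta
    beta_rep, se_rep = 0.60, 0.06        # replication estimate and standard error

    # For the composite null "beta is somewhere in [orig_lo, orig_hi]", the largest
    # p-value over the interval is attained at the endpoint nearest the replication
    # estimate, so testing against that endpoint gives a valid (conservative) test.
    nearest = min(max(beta_rep, orig_lo), orig_hi)
    z = (beta_rep - nearest) / se_rep
    p = 2 * stats.norm.sf(abs(z))
    print(f"z = {z:.2f}, p = {p:.4f}")   # small p: the replication conflicts with the original interval
    ```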

  8. Here’s a recent piece of NHST in PNAS that may amuse readers of this blog:
    Louys et al., 2021. No evidence for widespread island extinctions after Pleistocene hominin arrival. PNAS, https://doi.org/gjvpn2

    “Most species considered here have no direct dates associated with their remains, let alone their last appearance … Use of commonly applied data quality criteria or auditing of dating methods would require us to reject most islands from our examination and almost all species. However, the null hypothesis we test … is that there is evidence of hominin-driven extinctions following first Pleistocene arrival on an island. This can be achieved using available datasets … as for any given island, a lack of reliable dates does not support the null hypothesis.”

    Thus, the weaker the evidence, the stronger the argument. QED!
