“We continuously increased the number of animals until statistical significance was reached to support our conclusions” . . . I think this is not so bad, actually!

For some reason, people have recently been asking me what I think of this journal article which I wrote about months ago . . . so I’ll just repeat my post here:

Jordan Anaya pointed me to this post, in which Casper Albers shared this snippet from a recently published article in Nature Communications:

The subsequent twitter discussion is all about “false discovery rate” and statistical significance, which I think completely misses the point.

The problems

Before I get to why I think the quoted statement is not so bad, let me review various things that these researchers seem to be doing wrong:

1. “Until statistical significance was reached”: This is a mistake. Statistical significance does not make sense as an inferential or decision rule.

2. “To support our conclusions”: This is a mistake. The point of an experiment should be to learn, not to support a conclusion. Or, to put it another way, if they want support for their conclusion, that’s fine, but that has nothing to do with statistical significance.

3. “Based on [a preliminary data set] we predicted that about 20 units are sufficient to statistically support our conclusions”: This is a mistake. The purpose of a pilot study is to demonstrate the feasibility of an experiment, not to estimate the treatment effect.

OK, so, yes, based on the evidence of the above snippet, I think this paper has serious problems.

Sequential data collection is ok

That all said, I don’t have a problem, in principle, with the general strategy of continuing data collection until the data look good.

I’ve thought a lot about this one. Let me try to explain here.

First, the Bayesian argument, discussed for example in chapter 8 of BDA3 (chapter 7 in earlier editions). As long as your model includes the factors that predict data inclusion, you should be ok. In this case, the relevant variable is time: If there’s any possibility of time trends in your underlying process, you want to allow for that in your model. A sequential design can make your inferences less robust to model assumptions, and a sequential design changes how you’ll do model checking (see chapter 6 of BDA), but from a Bayesian standpoint, you can handle these issues. Gathering data until they look good is not, from a Bayesian perspective, a “questionable research practice.”
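
To make that last point concrete, here is a minimal simulation sketch (mine, not from BDA or from the paper under discussion) of the Bayesian claim: if the parameter really is drawn from the prior used in the analysis, posterior probabilities stay calibrated even when the decision to keep collecting data depends on the data so far. The normal-normal model, the N(0,1) prior, and the “stop once Pr(theta > 0) exceeds 0.95” rule are all made-up choices for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n_reps, n_max, sigma = 10_000, 50, 1.0   # illustrative settings, not from the paper

reported, truth, early = [], [], []
for _ in range(n_reps):
    theta = rng.normal(0.0, 1.0)         # true effect drawn from the N(0, 1) analysis prior
    total, n, p_pos, stopped = 0.0, 0, 0.0, False
    while n < n_max:
        n += 1
        total += rng.normal(theta, sigma)
        post_prec = 1.0 + n / sigma**2              # conjugate normal-normal update
        post_mean = (total / sigma**2) / post_prec
        post_sd = post_prec ** -0.5
        p_pos = norm.cdf(post_mean / post_sd)       # Pr(theta > 0 | data so far)
        if p_pos > 0.95:                            # stop as soon as the data "look good"
            stopped = True
            break
    reported.append(p_pos)
    truth.append(theta > 0)
    early.append(stopped)

reported, truth, early = map(np.array, (reported, truth, early))
print("among runs that stopped early:")
print("  mean reported Pr(theta > 0):   ", reported[early].mean().round(3))
print("  actual fraction with theta > 0:", truth[early].mean().round(3))
```

The two printed numbers agree up to Monte Carlo error, stopping rule and all. The caveats above about time trends and model checking still apply; this only speaks to the stopping rule itself.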

Next, the frequentist argument, which can be summarized as, “What sorts of things might happen (more formally, what is the probability distribution of your results) if you as a researcher follow a sequential data collection rule?”

Here’s what will happen. If you collect data until you attain statistical significance, then you will attain statistical significance, unless you have to give up first because you run out of time or resources. But . . . so what? Statistical significance by itself doesn’t tell you anything at all. For one thing, your result might be statistically significant in the unexpected direction, so it won’t actually confirm your scientific hypothesis. For another thing, we already know the null hypothesis of zero effect and zero systematic error is false, so we know that with enough data you’ll find significance.
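
Here’s a quick simulation sketch of that first point (mine, with made-up settings): with a true effect of exactly zero, testing after every new observation and stopping at the first |z| > 1.96 “succeeds” in far more than the nominal 5% of runs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_reps, n_max, z_crit = 2_000, 5_000, 1.96   # made-up settings for illustration

crossed = 0
for _ in range(n_reps):
    y = rng.normal(0.0, 1.0, n_max)                      # the true effect is exactly zero
    z = np.cumsum(y) / np.sqrt(np.arange(1, n_max + 1))  # running z statistic (sd known to be 1)
    if np.any(np.abs(z[9:]) > z_crit):                   # start peeking at n = 10
        crossed += 1

print(f"runs reaching |z| > {z_crit} at some point before n = {n_max}: {crossed / n_reps:.0%}")
```

With no cap on n at all, the law of the iterated logarithm says the threshold gets crossed eventually with probability 1; the point here is that this guaranteed “success” tells you nothing of scientific interest.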

Now, suppose you run your experiment a really long time and you end up with an estimated effect size of 0.002 with a standard error of 0.001 (on some scale in which an effect of 0.1 is reasonably large). Then (a) you’d have to say whatever you’ve discovered is trivial, (b) it could easily be explained by some sort of measurement bias that’s crept into the experiment, and (c) in any case, if it’s 0.002 on this group of people, it could well be -0.001 or -0.003 on another group. So in that case you’ve learned nothing useful, except that the effect almost certainly isn’t large—and that thing you’ve learned has nothing to do with the statistical significance you’ve obtained.

Or, suppose you run an experiment a short time (which seems to be what happened here) and get an estimate of 0.4 with a standard error of 0.2. Big news, right! No. Enter the statistical significance filter and type M errors (see for example section 2.1 here). That’s a concern. But, again, it has nothing to do with sequential data collection. The problem would still be there with a fixed sample size (as we’ve seen in zillions of published papers).
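
For readers who want numbers, here is a rough simulation in the spirit of the Gelman and Carlin retrodesign calculation (my own sketch; the standard error of 0.2 is taken from the example above, and the true effect of 0.05 is a made-up value chosen to be small):

```python
import numpy as np

rng = np.random.default_rng(2)
true_effect, se = 0.05, 0.2                    # 0.05 is hypothetical; 0.2 matches the example above
est = rng.normal(true_effect, se, 1_000_000)   # sampling distribution of the estimate
signif = np.abs(est) > 1.96 * se               # the statistical significance filter

print("power (share of estimates that are 'significant'):", signif.mean().round(3))
print("mean |estimate| among significant results:", np.abs(est[signif]).mean().round(3))
print("exaggeration ratio (type M):", (np.abs(est[signif]).mean() / true_effect).round(1))
print("share of significant results with the wrong sign (type S):", (est[signif] < 0).mean().round(3))
```

With these particular inputs, the estimates that survive the significance filter average roughly nine to ten times the assumed true effect, and a nontrivial share of them have the wrong sign; none of that depends on whether the sample size was fixed in advance.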

Summary

Based on the snippet we’ve seen, there are lots of reasons to be skeptical of the paper under discussion. But I think the criticism based on sequential data collection misses the point. Yes, sequential data collection gives the researchers one more forking path. But I think the proposal to correct for this with some sort of type 1 error or false discovery rate adjustment is essentially impossible to carry out and would be pointless even if it could be done, as such corrections are all about the uninteresting null hypothesis of zero effect and zero systematic error. Better to just report and analyze the data and go from there—and recognize that, in a world of noise, you need some combination of good theory and good measurement. Statistical significance isn’t gonna save your ass, no matter how it’s computed.

P.S. Clicking through, I found this amusing article by Casper Albers, “Valid Reasons not to participate in open science practices.” As they say on the internet: Read the whole thing.

The internet has no memory. Also, I’d be happy if the terms “false discovery,” “statistical significance,” “false positive,” and “false negative” were never to be heard again.

And here’s my post from a few years ago on stopping rules and Bayesian analysis.

29 thoughts on ““We continuously increased the number of animals until statistical significance was reached to support our conclusions” . . . I think this is not so bad, actually!”

  1. To be fair, sometimes “false positive/negative” distinctions make sense, such as when you have a test for a disease or the presence of some chemical in something. I know that’s not what you meant, but in the case of medical testing in particular, I think it’s good to know and understand false positives/negatives… in that case, the phrases don’t pretend towards certainty or homogeneity that doesn’t exist; they are a recognition of frequencies of errors and thus reinforce the importance of humility in the scientific endeavor (or, in Gelman speak, they reinforce the reality of uncertainty).

    … now I’m wondering if I already made that comment a month ago… meh, still good.

    • Jrc:

      Yes, I agree. The terminology of false positives and negatives makes sense in those situations you discuss, just not in the setting of null hypothesis significance testing of statistical models.

      • But why do they make sense in such situations? Is it because tens of thousands of women have mammograms annually so that the N-P false positive / false negative rates are “known”? Is tens of thousands out of 3.5 billion enough? I suspect not. And what of the fact that the definition of breast cancer keeps changing? If ductal carcinoma in situ was cancer the day before yesterday and now falls into a new category called “non-cancerous,” what should we make of it? N-P widgets didn’t much change as they rolled off the assembly line; but women do. If my Mom calls and asks if she should be on tamoxifen because her doctor says so, is her doctor’s primitive Bayes approach to deciding who ought be on it and who ought not the sort of bet you’d make with your own Mom? I go back to Meehl and wonder if the river of published knowledge isn’t so polluted as to be unworthy of a dive, deep or shallow.

        /Blowin’ off steam after a good day (i.e. in which I learned something).

        • Some people kinda imagine that you either have Tuberculosis or you don’t. This is probably down to the detection threshold of the Tuberculosis bacterium rather than anything else. I’d be surprised if you couldn’t find a single Tuberculosis bacterium on a randomly chosen person. In any case, imagining that the basic concept is binary allows people to imagine that there’s a reasonable way to define “false positive” or “false negative” as a real logical thing. For the most part there is, but don’t look too carefully at the details. My surveying professor published a CAD drawing of a traverse we did giving the location of a nail head to a number of significant figures implying precision of something like the wavelength of green light. If you sneezed on the other side of campus you’d make the nail head vibrate more than that.

        • Two aspects I’ll quickly comment on.

          First, the populations are fairly well understood as not being too heterogeneous for cancer risk; these prior risks have been empirically estimated (or, as Mayo would say, severely tested), and these conceptualized probabilities can be interpreted as being due to underlying variation in exposure/genetic risk. So none of the usual objections to using Bayes’ Theorem apply.

          Second, the only true assurance scientific induction provides us with is that although it often will mislead us, if we persist in it adequately that will eventually be corrected. Persisting adequately likely requires being able to do randomized experiments.

        • Having pondered this for a good portion of a ponderous day I’m going to push back. I don’t think that p(E) is fairly well understood at all in the case of cancer. It certainly hasn’t been deduced from any theory but instead seems to create some unexplicated theory about the production of E which then hides undetected behind the curtains as scientists search for p(H1|E). The whole thing seems circular.

          Take normal human body temperature. It isn’t 98.6 F. So what does that mean for p(disease|body temp not equal to 98.6)? The insurance company writing key man insurance on me caused intense aggravation with serial medical testing until it was determined that my ECG was a “normal variant”. When I got over being annoyed (and radioactive, I think) I looked into this business of normal variants. There are normal variants of elbows and knees, of hearts and lungs, of ears and eyes. There are so many normal variants that I suspect most people are abnormal – whatever that might mean in a world in which people and barley never appear in the same analogy question on the SAT/ACT. Or take non-Hodgkin’s lymphoma and/or CLL – the hematologists have been making and un-making their minds up about diagnostic criteria and how to classify these diseases since before I was born. I’m beginning to wonder whether science should just stop trying to discover p(H|E) and instead take a big step back and try to define categories and better estimate p(E), in hopes of finally taking that giant leap we’ve been waiting for since 1969.

    • Even in some of those cases it’s problematic, like lead in drinking water. It seems likely that there is at least one lead atom in every liter of drinking water, so the relevant question becomes estimating the concentration rather than binary presence or absence.

      • In practical testing typically the one atom limit never arises because analytical procedures have a (statistically determined) Limit of Detection.

        So, that becomes the basis of a presence / absence criterion for things like lead.

        • Yes, but what happens is that as technology improves, that analytical limit gets pushed down, becoming a moving target. If you just go with presence/absence, it can look like everything was better in the 1960s before we started polluting… but if you go by concentration it’s clear that in the 1960s it was less than 1e-6 molar and now it’s less than 1e-9 or whatever (all numbers made up for illustration).

        • Of course improving technology is a good thing. Reporting the binary “detectable/undetectable” instead of “less than x” for a continuously changing x is unambiguously a bad thing. It’s always best, in terms of measurement, to report the maximum amount of information that you actually have. For example, if after people made detection devices capable of detecting 1e-9 molar solutions they continued to report 2.41e-7 as “undetectable by 1960’s technology,” you’d have to agree that’s a bad thing, right?

  2. The study came to attention again on twitter recently. Not sure why that happened, but there was a lot of discussion and outrage about the methods. Your name came up ….

  3. “As long as your model includes the factors that predict data inclusion are also included in the model, you should be ok.”

    Just what on earth does that mean? Some kind of typo?

  4. My colleagues seem to agree this is not so bad. As part of a conference dedicated to registered reports, my co-authors and I conducted a survey of published authors in the field. We asked them whether various forms of author discretion made research better or worse. Using discretion to determine when to stop or continue data collection was widely seen as the most helpful and least harmful behavior. I think this is because more data is generally better, and, as others have already mentioned, if the predictions are wrong more data will probably show that.

    Other forms of discretion, which fared worse, were discretion to remove entire subsamples (the worst) or to remove unusual observations, and discretion over which analyses to report. Interestingly, people didn’t worry much about changing hypotheses late in the writing process (HARKing), primarily because those were viewed as expositional devices rather than actual statements of predictions.

    If you are interested, the paper is No System is Perfect. The title reflects our conclusion that while the usual editorial process has its problems, pre-registration has its problems as well–primarily that editors give up the leverage to demand additional work before publication.

  5. In my mind if a person continues a study until ‘significance’ is achieved, then they have strong prior beliefs that the probability of a non-significant result is 0.00. They could save a lot of time and money if they just admit that.

    • Garnett:

      That’s the point. All these null hypotheses are false, and you can reject any null hypothesis by just gathering enough data. So it makes no statistical sense to design a study with the goal of rejecting a hypothesis. Indeed, if the only goal is to demonstrate that the hypothesis is false, you could save a lot of time and money and not bother doing the experiment at all. A better goal is to estimate the direction, magnitude, and variation of effects, and, for that, the p-value isn’t particularly relevant.

      • Here is a passage from Berry (1985, top of p. 525) on this topic.
        (I’d appreciate any insights into the martingale argument….)

        “Classical statisticians have two main objections to a Bayesian analysis. One is the difficulty and arbitrariness in picking a prior distribution. The other is the possibility of sampling to a foregone conclusion. The first is a legitimate concern; the second is not. Sampling to a foregone conclusion is possible in the classical set-up when one considers type I and type II errors. But posterior probabilities do not behave like error probabilities. For example, suppose one has observed X1, . . . , Xn. The probability of delta > 0 given X1, . . . , Xn, Xn+1 is a random variable when conditioning on X1, . . . , Xn. Its expected value is precisely the current probability of delta > 0; unlike P-values, for example, the probability of delta > 0 is a martingale. So if the current probability is 0.94, it can increase, perhaps to above 0.95, with the next observation or it may decrease. In the case of normal sampling described above, the expected number of observations required to convert a current probability of 0.94 into one greater than 0.95 is infinite.”
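
        A small numerical check of the martingale claim in this quote (my sketch, using a conjugate normal model with made-up numbers, not Berry’s own example): averaging the next-step Pr(delta > 0) over the current posterior predictive distribution gives back the current Pr(delta > 0).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Current posterior after n observations: delta ~ N(mu_n, tau_n^2); data sd sigma is known.
mu_n, tau_n, sigma = 0.31, 0.20, 1.0   # made-up values, chosen so Pr(delta > 0) is near the 0.94 in the quote
p_now = norm.cdf(mu_n / tau_n)

# Draw the next observation from the posterior predictive N(mu_n, tau_n^2 + sigma^2),
# do the conjugate update, and average the updated Pr(delta > 0) over those draws.
x_next = rng.normal(mu_n, np.sqrt(tau_n**2 + sigma**2), 1_000_000)
prec_new = 1 / tau_n**2 + 1 / sigma**2
mu_new = (mu_n / tau_n**2 + x_next / sigma**2) / prec_new
p_next = norm.cdf(mu_new * np.sqrt(prec_new))   # Pr(delta > 0) after seeing x_next

print("Pr(delta > 0) now:                  ", round(float(p_now), 3))
print("E[Pr(delta > 0) after one more obs]:", round(float(p_next.mean()), 3))
```

        Any single new observation can push the probability above 0.95 or pull it down, but its expected value stays put, which is the martingale property the quote appeals to.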

        • Garnett:

          I agree with Berry that it does not make sense to talk about sampling to a foregone conclusion. You can sample until you get a statistically significant p-value, but that’s not a “conclusion” in any sense relating to science or decision making. As I wrote in the above post, if “sampling to a foregone conclusion” in that sense is a problem with Bayesian methods, it’s a problem with non-Bayesian methods too, because you can design a non-Bayesian study with N=1000000000, in which case you’re also guaranteed a rejection in any real problem.

        • Wow, just wow. The argument seems to be

          ‘You can’t use Bayes to make true/false decisions dichotomizing on a particular posterior probability threshold without giving up all the “good properties” of p values!’

          That’s pretty funny.

          Ah, actually I misread the quote! In particular I missed this sentence: “The first [about the prior] is a legitimate concern; the second [about sampling to a foregone conclusion] is not.” And so I thought they were arguing that the way Bayes behaves is no good because it doesn’t replicate the way p values behave! That would be pretty funny, because the way p values are used is basically the cause of the vast majority of what’s wrong in applied stats in science today.

          I do, however, think that the martingale properties etc. are not that important. That’s an argument that says essentially “the expected amount of time you need to sample to a foregone conclusion using Bayes and a threshold for posterior probability is infinite.” But *using a threshold and dichotomizing into “true” and “false” is already a bad idea in the first place*, so you shouldn’t ever do it, and the fact that it’s hard to do isn’t even an issue that comes up if you’re using Bayes effectively.

          When you want to make a decision about something like ‘should I take aspirin for my heart health?’ or some such thing, you use *Bayesian Decision Theory* and a real-world utility, not a threshold for the posterior probability that the effect is positive.

        • Thanks, Daniel

          Your original interpretation would be especially funny since Don Berry wrote the article!!

          As everyone knows, Don Berry has done a ton of work on Bayesian clinical trials, and I haven’t seen him use decision theory in his work (citations to the contrary would be appreciated). I think that much of his work is in the context of predominantly Frequentist methods, so he is primarily reacting to that environment. Even so, your point is well taken.

          The quoted paragraph is the only time I’ve seen martingales used as an argument against the critique of “sampling to a foregone conclusion” in Bayesian sequential trials. The reason I brought it up is that Frequentists appear unconvinced by other arguments (many in this thread) that Bayesian sequential analysis is a legitimate effort.

        • In a regulatory context you use whatever decision criterion the regulatory body hands you. I’ve argued here before that we should use decision theory for drug trials, but I have exactly zero influence on the FDA ;-)

          I suspect Frequentists simply see Bayes as suspect. The sequential analysis falls out of the math if you accept a Bayesian analysis in the first place, and once you accept Bayes you can’t really argue against whatever the math implies.

          The one case where Bayes results do depend on stopping is where there’s a so called informative stopping rule. An example is if you are simultaneously doing inference on a parameter about a physical process, and a parameter related to the stopping rule in use. In other words reverse engineering someone else’s stopping rule. In that case, the fact that the other person stopped is information for your model of the stopping rule.

        • This frames “sampling to a foregone conclusion” in the wrong way. It is quite simple: people are testing models that they have deliberately (perhaps ignorantly) invalidated.

          If the model being tested is derived while making certain assumptions about the sampling scheme, you need to run the study in accordance with those assumptions.

          It is difficult to overstate the malice/ignorance required to assume one thing about your own behavior, act some other way, and then draw a conclusion about a totally different aspect of your model.

      • you can reject any null hypothesis by just gathering enough data

        I wouldn’t say this using “null” in the (imo, correct) “hypothesis to be nullified” sense. It is possible to make a theoretical prediction more precise than it’s possible to collect data to test.

        If you use “null” in the usual “default, something other than what the researcher believes” sense, then yea there is probably never any point to rejecting it.

        The vast majority of the time you want to be comparing the skill of various well thought out plausible explanations for some observations though.

  6. In my experience in the bio world, a large percentage of the examined effects were “near nil” (at least 50 percent?), and the researchers tended to have strong priors that the effects were strong (sunk cost bias?). I think in that scenario we do run into a lot of problems with this type of approach, especially since we are most likely to stop for “significance” due to underestimation of the error term (a big issue with n = 3!). And since oftentimes researchers present things as effect size / error term, this type of error can also inflate our understanding of the magnitude of the effect.

    On the other hand, if we start with large enough a sample that the error term is fairly stable, yeah, it’s not much of an issue.
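
    A quick simulation sketch of that concern (mine, with made-up settings: true effect zero, true sd 1, start at n = 3, re-test after every added observation, stop at p < 0.05 or n = 30): the runs that stop “significant” tend to have underestimated the error term and report large effect/SE ratios even though the true effect is nil.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
true_sd, n_reps, n_max = 1.0, 5_000, 30          # made-up settings for illustration

sd_at_stop, t_at_stop = [], []
for _ in range(n_reps):
    y = list(rng.normal(0.0, true_sd, 3))        # true effect is zero; start with n = 3
    while True:
        t, p = stats.ttest_1samp(y, 0.0)
        if p < 0.05 or len(y) == n_max:
            break
        y.append(rng.normal(0.0, true_sd))
    if p < 0.05:                                 # keep only the runs that stopped "significant"
        sd_at_stop.append(np.std(y, ddof=1))
        t_at_stop.append(abs(t))

print("share of runs stopping at p < 0.05:", len(sd_at_stop) / n_reps)
print("mean estimated sd at stop (truth = 1.0):", round(float(np.mean(sd_at_stop)), 2))
print("mean |t| (effect / SE) at stop:", round(float(np.mean(t_at_stop)), 2))
```

    The inflation comes from selecting on a noisy error estimate at tiny n; as noted above, starting with a sample large enough for the error term to be stable mostly removes the problem.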

  7. You say
    “I’d be happy if the terms ‘false discovery,’ ‘statistical significance,’ ‘false positive,’ and ‘false negative’ were never to be heard again.”

    I certainly agree about the term “statistical significance”. But I disagree about the others. I realise that you don’t like the point null, but that point of view seems to me to be unrealistic from the point of view of experimentalists. The use of a point null is not an assertion that the true effect is zero, but rather that the likelihood of seeing your data if it were zero is relatively large compared with the likelihood of seeing your data if the effect size were some alternative non-zero value.
    As an experimenter I’m certainly interested to know that quantity.

    Far from wanting to ban the term “false positive”, it seems to me to be important to give (as well as the p-value and CI) an estimate of the risk that your p value is a false positive. It’s a problem that there are many different ways of calculating the false positive risk, but luckily it turns out that several different methods give values of the false positive risk between 20% and 30% if you observe p = 0.05 (that’s for a prior of 0.5; much higher for smaller priors). See https://arxiv.org/abs/1802.04888
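
    For what it’s worth, here is a back-of-the-envelope version of that kind of calculation (my sketch, using a two-sided z test with 80% power as a simple stand-in for the methods in the linked paper), which lands in the quoted 20% to 30% range:

```python
from scipy.stats import norm

alpha, power, prior_h1 = 0.05, 0.80, 0.5   # illustrative inputs
z_obs = norm.ppf(1 - alpha / 2)            # the |z| that corresponds to p = 0.05 exactly
delta = z_obs + norm.ppf(power)            # noncentrality giving 80% power at this alpha

# Likelihoods of observing |z| = z_obs exactly, under H0 and under H1 (the "p-equals" case)
lik_h0 = 2 * norm.pdf(z_obs)
lik_h1 = norm.pdf(z_obs - delta) + norm.pdf(z_obs + delta)

fpr = (1 - prior_h1) * lik_h0 / ((1 - prior_h1) * lik_h0 + prior_h1 * lik_h1)
print(f"false positive risk at p = 0.05 with prior P(H1) = 0.5: {fpr:.0%}")
```

    Conditioning on p exactly equal to 0.05, rather than p ≤ 0.05, is what pushes the number well above the naive 5%.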
