Stopping rules and Bayesian analysis

I happened to receive two questions about stopping rules on the same day.

First, from Tom Cunningham:

I’ve been arguing with my colleagues about whether the stopping rule is relevant (a presenter disclosed that he went out to collect more data because the first experiment didn’t get significant results) — and I believe you have some qualifications to the Bayesian irrelevance argument but I don’t properly understand them.

Then, from Benjamin Kay:

I have a question that may be of interest for your blog. I was reading about the early history of AIDS and learned that the trial of AZT was ended early because it was so effective:

The trial, reported in the New England Journal of Medicine, had produced a dramatic result. Before the planned 24 week duration of the study, after a mean period of participation of about 120 days, nineteen participants receiving placebo had died while there was only a single death among those receiving AZT. This appeared to be a momentous breakthrough and accordingly there was no restraint at all in reporting the result; prominent researchers triumphantly proclaimed the drug to be “a ray of hope” and “a light at the end of the tunnel”. Because of this dramatic effect, the placebo arm of the study was discontinued and all participants offered 1500mg of AZT daily.

It is my understanding that this is reasonably common when they do drug studies on humans. If the treatment is much, much better than the control it is considered unethical to continue the planned study and they end it early.

I certainly understand the sentiment behind that. However, I know that it isn’t kosher to keep adding time or sample to an experiment until you find a result, and isn’t this a bit like that? Shouldn’t we expect regression to the mean and all that?

When two people come to me with a question, I get the impression it’s worth answering. So here goes:

First, we discuss stopping rules in section 6.3 (the example on pages 147-148), section 8.5, and exercise 8.15 of BDA3. The short answer is that the stopping rule enters Bayesian data analysis in two places: inference and model checking:

1. For inference, the key is that the stopping rule is ignorable only if time is included in the model. To put it another way, treatment effects (or whatever it is that you’re measuring) can vary over time, and that possibility should be allowed for in your model if you’re using a data-dependent stopping rule. To put it yet another way, if you use a data-dependent stopping rule and don’t allow for possible time trends in your outcome, then your analysis will not be robust to violations of that assumption.

2. For model checking, the key is that if you’re comparing observed data to hypothetical replications under the model (for example, using a p-value), these hypothetical replications depend on the design of your data collection. If you use a data-dependent stopping rule, this should be included in your data model; otherwise your p-value isn’t what it claims to be. (A small simulation illustrating this point appears below.)
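
To make point 2 concrete, here is a minimal simulation sketch (all settings assumed for illustration: standard normal data under the null, a peek after every 10 observations up to n = 500, stopping at |z| > 1.96). The nominal 5% test rejects far more often than 5% once the stopping rule is part of the design:

```r
# Sketch: optional stopping inflates the frequentist type I error rate.
# Assumed setup: null effect, unit-variance normal data, peeks every 10
# observations up to n = 500, stop as soon as |z| > 1.96.
set.seed(123)
n_max <- 500
peek  <- seq(10, n_max, by = 10)
reject <- replicate(2000, {
  y <- rnorm(n_max)                  # data generated under the null
  z <- cumsum(y)[peek] / sqrt(peek)  # z statistic at each interim look
  any(abs(z) > 1.96)                 # did any look "find a result"?
})
mean(reject)  # roughly 0.3-0.4, far above the nominal 0.05
```

The hypothetical replications that define the p-value must include those interim looks. A Bayesian posterior for the mean, by contrast, is unchanged by the stopping rule as long as the model (here, no time trend) is correct.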

Next, my response to Benjamin Kay’s question about AZT:

For the Bayesian analysis, it is actually kosher “to keep adding time or sample to an experiment until you find a result.” As noted above, you do lose some robustness but, hey, there are tradeoffs in life, and robustness isn’t the only important thing out there. Beyond that, I do think there should be ways to monitor treatments that have already been approved, so that if problems show up, somebody becomes aware of them as soon as possible.

P.S. I know that some people are bothered by the idea that you can keep adding time or sample to an experiment until you find a result. But, really, it doesn’t bother me one bit. Let me illustrate with a simple example. Suppose you’re studying some treatment that has a tiny effect, say 0.01 on some scale in which an effect of 1.0 would be large. And suppose there’s a lot of variability, so if you do a preregistered study you’re unlikely to get anything approaching certainty. But if you do a very careful study (so as to minimize variation) or a very large study (to get that magic 1/sqrt(n)), you’ll get a small enough confidence interval to have high certainty about the sign of the effect. So, by going from high sigma and low n to low sigma and high n, you’ve “added time or sample to an experiment” and you’ve “found a result.” See what I did there? OK, this particular plan (measure carefully and get a huge sample size) is chosen ahead of time; it doesn’t involve waiting until the confidence interval excludes zero. But so what? The point is that by manipulating my experimental conditions I can change the probability of getting a conclusive result. That doesn’t bother me. In any case, when it comes to decision making, I wouldn’t use “Does the 95% interval exclude zero?” as a decision rule. That’s not Bayesian at all.
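
The arithmetic behind that P.S. is just the 1/sqrt(n) scaling; a quick sketch (numbers assumed for illustration):

```r
# The 95% CI half-width is about 1.96 * sigma / sqrt(n); an interval
# centered near the true effect of 0.01 excludes zero once the
# half-width drops below 0.01.
ci_half_width <- function(sigma, n) 1.96 * sigma / sqrt(n)
ci_half_width(sigma = 1,    n = 100)  # 0.196: hopeless for an effect of 0.01
ci_half_width(sigma = 1,    n = 1e5)  # ~0.006: the very large study
ci_half_width(sigma = 0.05, n = 250)  # ~0.006: the very careful study
```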

It seems to me that problems with data-based stopping and Bayesian analysis (other than the two issues I noted above) arise only because people are mixing Bayesian inference with non-Bayesian decision making. Which is fair enough—people apply these sorts of mixed methods all the time—but in that case I prefer to see the problem as arising from the non-Bayesian decision rule, not from the stopping rule or the Bayesian inference.

51 thoughts on “Stopping rules and Bayesian analysis”

  1. By my understanding, the main problem with outcome-based stopping for inference isn’t so much about the difference between the Bayesian and Frequentist positions. Rather, the problem stems from the combination of two factors: first, by their most common, binary/qualitative interpretation, many Frequentist tests only allow one outcome, rejection of H0; second, alpha isn’t zero. Combine these two, and optional stopping all but guarantees results. However, a stopping rule not biased towards one outcome does not suffer from this problem; for example, the width of a CI, or basically any method that could lead to a stop based either on rejecting H1 or on rejecting H0. The Bayesian equivalent to stopping when p < alpha would be to stop if and only if, e.g., the Bayes Factor favours the hypothesis favoured by the investigator (or your “Does the … interval exclude 0?”), and that is hardly any better. Is this correct?

  2. Many thanks for the post.

    I don’t think I understand point 2: suppose we interpret p as “under H0 the probability of this event occurring within N observations is less than p”; then wouldn’t we calculate the same p-value however N was chosen (whether predetermined or by a stopping rule)?

    (& I came up with another qualification in a Bayesian world: we infer different things from “I expected an effect size of 4 and found an effect size of 4”, vs “I expected an effect size of 16 and found an effect size of 4”. It is genuinely informative to know the experimenter’s expectations. And their choice of sample size tells us something about their priors. If they use a stopping rule then it can be potentially misleading in that direction, and we learn something when we find out the experimenter had to recruit 4 times as many subjects as he or she originally intended.)

  3. Suppose you’re at a basketball game. Think of the game as an experiment to determine which team is better at playing basketball.

    If the game is called unexpectedly because the electricity went out to the gym, it seems fair to take the score at the time of the power failure as good data. But if the referee calls the game as soon as his favored team pulls into the lead, that hardly seems fair.

    How is the basketball story different from the scientist collecting data until he gets the result he wants?

    • You say that you’re trying to determine which team is better, but then you say “once the favored team pulls into the lead”. But if you only care about quality of the team, then you care about the difference in score given the time, & there’s not a discontinuity when someone is in the lead. And so if you’re deciding whether to let them play an extra minute, it’s just as likely that your favored team will get better (relative to their history) as it is they will get worse. (Or rather, in expectation there will be no systematic movement)

      • Suppose X_i and Y_i are sequences of iid Bernoulli random variables with probability of success p_X and p_Y respectively, with p_X > p_Y. Then the sum of the X’s will be less than the sum of the Y’s infinitely often with probability 1. So with the unfair ref stopping rule, you can almost certainly conduct an experiment inferring wrongly that p_Y > p_X. Admittedly this is a frequentist argument, but I don’t see how a Bayesian perspective could salvage this, unless you account for the informative stopping rule in the likelihood.

        • I think your claim is false. Let Z_i = sum(X_(1:i)) – sum(Y_(1:i)). Z is a Markov chain with transition probabilities: decrease by 1 with probability (1-p_X)(p_Y), increase by 1 with probability (1-p_Y)p_X, otherwise stay constant. With p_X > p_Y this is essentially an asymmetric random walk in one dimension, which is known to be transient, not recurrent. Hence, “the sum of the X’s will be less than the sum of Y’s i.o.” actually has probability 0.
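
        A quick simulation sketch of the transience point (my numbers assumed: p_X = 0.6, p_Y = 0.4, horizon of 10,000 steps):

        ```r
        # Does sum(X) ever fall behind sum(Y)? With positive drift the walk
        # is transient, so this happens with probability well below 1 (and
        # "infinitely often" has probability 0).
        set.seed(1)
        ever_behind <- replicate(1000, {
          z <- cumsum(rbinom(10000, 1, 0.6) - rbinom(10000, 1, 0.4))
          any(z < 0)
        })
        mean(ever_behind)  # around 0.4-0.5 here, not 1
        ```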

    • John:

      There are some differences between the basketball story and the scientist story. In basketball you have a winner, in science you are doing inference. For example, if this were a science example and you had 2 drugs and you kept sampling until drug A wins . . . this isn’t such a realistic rule, because if A is much worse, it’s quite possible you’ll never (in finite time) get to a point where A wins. Especially if you have a rule that N has to be greater than some minimum value such as 40. Also, if A wins and the difference is clearly noise (e.g., 8/20 successes for A and 7/20 successes for B), that won’t be taken as strong evidence.

      So to apply your story to science, you’d need to have a minimum sample size, a maximum sample size, and some rule that you only stop if A is statistically significantly better than B. Even so, yes, you will sometimes see that happen, and a data-dependent stopping rule can increase the probability of stopping at that point—but, yes, this is a frequentist argument and indeed I don’t think it will hurt a full Bayesian analysis if there is no underlying time trend in the probabilities of success.

      As I said above, though, a data-dependent stopping rule could cause damage if someone is mixing Bayesian inference with non-Bayesian decision rules. And indeed people do this all the time, I’m sure (for example, performing a Bayesian analysis and then making a decision based on whether the 95% posterior interval excludes zero). So in that sense it could create a problem.

      To go back to the basketball example: in a Bayesian analysis, your posterior probability of which team is better is changing a bit with each score. But sports is about winning, not about inference: a team wins the game if they scored more points, not because there is an inference (Bayesian or otherwise) that they are the better team.

      Perhaps this last point will be clear if I return to the sample size analogy. Suppose two players are competing: now consider an individual sport, in this case taking shots 30 feet from the basket, and the ref gives 1 shot to player A and 1000 shots to player B. The prize goes to the player with a greater success record.

      It’s really hard to make the shot from 30 feet, so player A will almost certainly get 0 successes. But player B gets so many tries, he’ll probably have some success, maybe 10% or 5% or whatever. The point is, Player B will almost certainly win. So you get unfairness, but with no data-dependent stopping rule. The problem is with the decision rule. Having a decision rule that satisfies certain fairness properties is a hard problem. It’s true that by restricting the stopping rules in certain problems, you can get the fairness properties that seem so intuitive, but you lose something too (in the medical example, you might give a less preferred treatment to someone). Is it worth it, this tradeoff? It depends on how much you care about the fairness property. It’s hard for me to see the justification of it, really; I think it’s an Arrow’s-theorem-like situation where there are certain properties that intuitively seem desirable but, on second thought, aren’t worth the effort.
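
      A back-of-the-envelope check of this story (success probability assumed to be 5% per shot):

      ```r
      # Player A takes 1 shot, player B takes 1000; higher success rate wins.
      p <- 0.05
      p_A_miss <- 1 - p             # A's rate is 0 unless the single shot drops
      p_B_some <- 1 - (1 - p)^1000  # B's rate exceeds 0 with at least 1 success
      p_A_miss * p_B_some           # ~0.95: B almost certainly wins
      ```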

    • John:

      Just to add a bit more about this “fairness” thing (maybe we need to do a joint blog post on it…): It seems reasonable for a basketball game to have a symmetry principle, that any stopping or scoring rule has to be symmetric relative to the team labeling, for example if you stop after team A is up by 20 points, then you have to stop after team B is up by 20 points. For a medical trial, though, I don’t see this, as I’d think it would be rare that an analysis is symmetric in any case. (For example, the existing treatment and the new treatment are typically not treated symmetrically.)

    • Conversely, in some casual and children’s leagues they have a ‘mercy rule’ where if one team is ahead by a certain amount regardless of how much of the game has elapsed, the game is declared over.

  4. Sorry, lost my connection or something. I’ll try again.

    Discussions of stopping rules and inference usually begin with the frequentist position that the repeated sampling principle is more important than any other consideration, so I am very pleased to read this post.

    It is commonplace to view data-dependent stopping rules as problematical because of the increased risk of type I errors, but at the same time as that false positive risk increases, the risk of false negative errors declines. I’ve played around with simulations and in almost all situations the false negative rate declines much faster with increasing sample size than the false positive rate increases. Thus even within the inferentially depleted world of frequentists who use dichotomous outcomes there are inferential advantages to data-dependent stopping rules. Does anyone know why such stopping rules are nearly universally assumed to be deleterious?
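
    A rough version of the kind of simulation described above (all settings assumed: effect 0 vs. 0.3, peeks every 10 observations up to n = 200):

    ```r
    # Optional stopping inflates the false positive rate, but it drives
    # the false negative rate down much faster.
    set.seed(42)
    stops_significant <- function(effect) {
      peek <- seq(10, 200, by = 10)
      y <- rnorm(max(peek), mean = effect)
      z <- cumsum(y)[peek] / sqrt(peek)
      any(abs(z) > 1.96)
    }
    mean(replicate(2000, stops_significant(0)))    # type I rate: above 0.05
    mean(replicate(2000, stops_significant(0.3)))  # power near 1: few misses
    ```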

  5. Despite my reservations above, I’ve done a lot of work with early stopping rules in the context of clinical trials. There you have an ethical obligation to stop assigning people to a treatment once you have sufficient reason to believe it’s inferior.

    • John: OK, but the real loss of life/health clock starts ticking at the rate the treatment is adopted – early stopping can seriously delay that even if in a perfect world it shouldn’t.

  6. I’m one of those people who fit bayesian models and then make a decision based on whether the 95% HPD contains zero, or whether theta > 0 or theta < 0 (e.g., via the posterior odds P(theta>0)/(1-P(theta>0))). I have never made a different decision based on whether I used the Bayes factor or the HPD interval. In the kinds of studies I do, and for the amount of data I have per experiment, my decision never differs regardless of whether I use linear mixed models in R (lme4), or Stan or JAGS. Of course, the bayesian approach allows me to flexibly fit models that I simply cannot fit in the frequentist setting (or don’t yet know how to), so that is a huge advantage. But the decision is the same.
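
    For concreteness, the kind of decision rule described here might look like this on posterior draws (a sketch with simulated stand-in draws, not output from any real model):

    ```r
    # Hypothetical posterior draws for theta (in practice these would come
    # from Stan, JAGS, or similar); the decision summaries mentioned above.
    set.seed(2024)
    draws <- rnorm(4000, mean = 0.3, sd = 0.2)  # stand-in posterior sample
    quantile(draws, c(0.025, 0.975))            # central 95% interval: contains 0?
    p_pos <- mean(draws > 0)                    # P(theta > 0)
    p_pos / (1 - p_pos)                         # posterior odds P(theta>0)/(1-P(theta>0))
    ```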

    • Shravan:

      Indeed, there are a lot of people like you, which is one reason that Bayesians probably should be worrying more about the impact of stopping rules on inferences.

        • I don’t yet understand why the inference is non-bayesian. I just got done submitting a homework assignment for my statistics course where I did a bayesian analysis to come to a decision on whether to give treatment A or B based on willingness to pay on part of the government for a unit increase in net benefit. The decision I made was pretty much based on the probability of a net benefit (i.e., P(theta>0)). I understand Andrew’s general objection that there may be no theta to estimate out there in nature. But what’s non-bayesian about such an inference? A frequentist analysis is not going to give me a posterior distribution to estimate such probabilities from; I can only do this because I fit a bayesian model.

          And using an HPD yields pretty similar decisions. I wouldn’t even know what the alternative criterion for a decision would be in this very practical setting.

          [Of course, I don’t yet know whether I did the homework right!]

        • Shravan:

          In Bayesian decision making, there is a utility function and you choose the decision with highest expected utility. Making a decision based on statistical significance does not correspond to any utility function.

        • OK; I know how to use decision theory to do this. What I don’t know how to do is to define the loss function. What should the cost be for misses and false alarms in a classical 2×2 factorial design in a psychological experiment?

          Maybe this is in your new book, but I haven’t finished reading it yet.

        • Hey, I didn’t say it would be easy, I just said that if you do Bayesian inference with non-Bayesian decision making, you can end up with challenges that would not arise in a pure Bayesian setting.

          I do think that formalizing costs and benefits can be a good idea—there’s no general way that I know to do this, I think that at our current stage of understanding it just needs to be done anew for each problem. One advantage of formalizing costs and benefits is that it can make you think harder about what you’re really concerned about in your estimation problem. That said, I don’t usually do this sort of formal decision analysis in my own work.

        • If my utility were the step function that is -95 when th < 0 and +5 otherwise, then my Bayesian decision rule would be to use the treatment iff P(th>0) > 0.95, which is thus affected by the stopping rule. Even if it’s an unrealistic utility, isn’t that cause for concern?

        • Cedric:

          No, that doesn’t work. You need a utility that is a joint function of your decision and the unknown theta. The utility function you gave just depends on theta and thus does not imply any decision recommendation at all.

        • So, what has changed since you posted this (“for now let me just reiterate my current understanding that there is no such thing as a utility function”) and this (“I’m down on the decision-theoretic concept of “utility” because it doesn’t really exist.”)?

          Do you have a different view of utility functions in general now? Or is it that you don’t like working backward from, say, choice data to infer something about unobserved utility functions, whereas in full Bayesian decision making, you get to make your own utility function and then use it prescriptively?

          If it’s the latter, what’s the problem with inferring utility functions from data?

        • Noah:

          We have a whole chapter on decision analysis in BDA, so I certainly don’t mind the idea of utility analysis. I don’t think there is a true utility function but I think utility functions are useful for clarifying tradeoffs in decision problems. We discuss further in Section 9.5 of BDA. Also, I don’t mind inferring utility functions from data, I just think you have to be clear that preferences involve a lot more than utilities.

        • Andrew: right, I got it mixed up. But isn’t it easy to construct that utility function? Let C be the decision to take the medicine, and ¬C the decision not to.

          U(th, ¬C) = 0 for all th

          U(th, C) = -95 if th < 0

          U(th, C) = +5 if th > 0

          Then the expected utility EU[¬C] = 0, while EU[C] = P(th<0) * -95 + (1-P(th<0)) * 5, which will be greater than EU[¬C] IFF P(th>0) > 0.95
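
          A two-line check of that algebra (sketch):

          ```r
          # EU[C] = -95 * P(th<0) + 5 * (1 - P(th<0)); positive exactly
          # when P(th<0) < 0.05, i.e. P(th>0) > 0.95.
          p <- seq(0, 1, by = 0.01)   # grid of values for P(th < 0)
          eu_c <- -95 * p + 5 * (1 - p)
          max(p[eu_c > 0])            # 0.04: treat iff P(th<0) < 0.05
          ```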

        • Cedric:

          Sure, but that utility makes no sense. Ultimately, the utility should make sense in terms of dollars and lives, or quality of life, or whatever. There is no logic to the utility function that you gave.

        • AG: I don’t think it would be that hard to come up with a situation in which Cedric’s utility function maps to concrete outcomes in terms of dollars or something. The trickiest part would be coming up with a “hard threshold” situation in which all theta ≤ 0 are equally bad and all theta > 0 are equally good. Add in some cost to deciding to effect the change that theta represents, and Bob’s your uncle.

  7. Apart from the case of medicine, where lives are at stake, in planned designs what’s the problem with fixing a sample size in advance and strictly stopping when the sample size is reached? If people are doing this kind of thing, they should just stop, period. No need to do research on the topic :)

    • Shravan:

      It’s all about cost. No point in spending lots of $ on extra data if they’re not needed. Conversely, if there is a lot of uncertainty it can make sense to gather more data. It would be pretty foolish to just sit there with your prechosen N, if you think you can learn something useful by increasing your sample.

  8. ” It would be pretty foolish to just sit there with your prechosen N, if you think you can learn something useful by increasing your sample.”

    I assume you are talking about the case where one is doing a bayesian analysis. In a frequentist setting, that *would* be foolish, no? I just want to have it out there so that I don’t start getting people telling me Andrew Gelman says it’s OK to keep running an experiment till you hit significance ;)

    What I started doing very recently in such a situation (where more data would help) is to re-run the experiment and use the previous data as a prior. I hope that’s not too crazy.

  9. @Shravan Vasishth. But what is actually the problem? The usual frequentist approach asks “what is the probability that, given H0 and sample size N, you see a difference like the one observed or larger?” and now you have to ask “what is the probability that, given H0, the observed difference can be seen in N trials or faster?”

    • It’s rather more complicated than that, actually. What you have to do is define a test procedure T( , ), a function taking two arguments: a null hypothesis and a Type I error rate. A good test procedure needs to be consistent with Egon Pearson’s “Step 2”:

      We then divide this set [of possible results] by a system of ordered boundaries…such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined on the Information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts.

      In the case of optional stopping, the set of possible results is any observed difference & N combo that stops the experiment, so the “system of ordered boundaries” needs to be set up for all possible N. Once you’ve defined such a test procedure T, you can observe the experimental result and then back out the Type I error rate that puts the result on the boundary of the rejection region: that’s your p-value.
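
      As a sketch of what “backing out” such a p-value could look like by simulation (the ordering is my assumption: an earlier stop, then a larger |z|, counts as more extreme):

      ```r
      # Simulate the full optional-stopping design under H0 and compute
      # the probability of a result at least as extreme as observed.
      set.seed(7)
      peek <- seq(10, 200, by = 10)
      run_design <- function() {
        y <- rnorm(max(peek))
        z <- cumsum(y)[peek] / sqrt(peek)
        k <- which(abs(z) > 2)[1]  # first look crossing the boundary, if any
        if (is.na(k)) c(n = max(peek), z = z[length(z)])
        else c(n = peek[k], z = z[k])
      }
      null_runs <- as.data.frame(t(replicate(5000, run_design())))
      p_value <- function(n_obs, z_obs)
        mean(null_runs$n < n_obs |
             (null_runs$n == n_obs & abs(null_runs$z) >= abs(z_obs)))
      p_value(50, 2.3)  # a p-value that respects the stopping rule
      ```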

      …Or you could always just chuck the observed difference, pretend you only observed N, and base your test and p-value on that. That’s pretty much what frequentists who wanted actual results had to do back in the Stone Age (that is, back when computations were chiselled on stone tablets, or written in notebooks in pen, or whatever it is people did back then).

      • Yes, sure. But it’s moderated in reality by the fact that you will probably stop after a really good streak, especially if no stopping rule was agreed ahead of time.

  10. I’d like to point to two papers on optional stopping and Bayesian inference that may be of interest:

    Sanborn, A. N. & Hills, T. T. (in press). The frequentist implications of optional stopping on Bayesian hypothesis tests. Psychonomic Bulletin & Review. http://www2.warwick.ac.uk/fac/sci/psych/people/asanborn/asanborn/frequentist_implications.pdf

    Rouder, J. N. (in press). Optional Stopping: No Problem For Bayesians. Psychonomic Bulletin & Review.
    http://pcl.missouri.edu/sites/default/files/r_0.pdf

    Cheers,
    E.J.

    • Table 1 in the Sanborn paper is seriously misleading because (i) it does not attempt to account for the reduction in false negative errors that accompanies the increase in the number of tests, and (ii) it does not take into account the fact that many of the false positive results will be effect sizes so small as to lead any competent experimenter to say that they are trivial. Ignoring those factors leads to a distorted view of the problems of statistical inference.

  11. These simulations by John Kruschke are very relevant: http://doingbayesiandataanalysis.blogspot.nl/2013/11/optional-stopping-in-data-collection-p.html

    He compares three bayesian stopping rules in addition to an NHST based one. Stopping based on bayes factors and accepting/excluding the ROPE introduces bias in the parameter estimate. Stopping based on precision seems to be the way to go. A great quote is: “A stopping rule based on getting extreme values will automatically bias the sample toward extreme estimates”
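
    A compressed version of the bias that quote describes (my settings assumed: true effect 0.2, minimum n of 20, stop when |z| > 2, versus a fixed n of 100):

    ```r
    # Stopping on extreme values biases the estimate; fixed-n does not.
    set.seed(99)
    est_extreme <- replicate(2000, {
      y <- rnorm(500, mean = 0.2)
      m <- cumsum(y) / seq_along(y)              # running mean
      n <- 20:500
      k <- n[which(abs(m[n]) * sqrt(n) > 2)[1]]  # first n with |z| > 2
      if (is.na(k)) m[500] else m[k]
    })
    est_fixed <- replicate(2000, mean(rnorm(100, mean = 0.2)))
    mean(est_extreme)  # noticeably above 0.2: biased toward extremes
    mean(est_fixed)    # close to 0.2
    ```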

    Cheers,
    Jonas

  12. The linked AZT story is interesting:
    http://aidsperspective.net/blog/?p=749

    Sounds like the affirming-the-consequent fallacy I’ve noticed rampant throughout medical research (if the drug works, people who get it will survive longer; people who got the drug survived longer; therefore the drug works). I’ve seen comments by Kary Mullis (who invented PCR) regarding the early days of HIV testing, saying the method at the time was not capable of detecting virus at the levels claimed. It’s not my area of expertise so I will stop there, but I would not doubt that fields could continue along the wrong path for decades under the current environment of mass confusion over how to interpret evidence along with publication bias.

    Also this paper contains a nice discussion of stopping rules:
    “It is an interesting sub-paradox that the seemingly hard-headed and objective p-value approach leads to something as subjective as the conclusion that the meaning of the data depends not only on the data but also on the number of times the investigator looked at them before he stopped, while the seemingly fuzzy subjective formulation leads to the hard headed conclusion “data are data”.”

    Cornfield, Jerome (1976). “Recent Methodological Contributions to Clinical Trials”. American Journal of Epidemiology 104 (4): 408–421.
    http://www.epidemiology.ch/history/PDF%20bg/Cornfield%20J%201976%20recent%20methodological%20contributions.pdf

  17. “To put it yet another way, if you use a data-dependent stopping rule and don’t allow for possible time trends in your outcome, then your analysis will not be robust to failures with that assumption”

    The data and your statistical model can’t tell you whether there are time trends, or, if there are, why they are occurring. You need to go outside statistics, to thinking and hypothesizing.

    “But if you do a very careful study (so as to minimize variation) or a very large study (to get that magic 1/sqrt(n)), you’ll get a small enough confidence interval to have high certainty about the sign of the effect. So, by going from high sigma and low n to low sigma and high n, you’ve “added time or sample to an experiment” and you’ve “found a result.””

    Doing a “very careful study (so as to minimize variation),” again, involves thinking about the problem in a qualitative way and introducing controls based on theory. The “very careful” part is theory-driven, not data-driven. This is a VERY different and more effective way to reduce uncertainty than increasing sample size, which I don’t believe will generally reduce uncertainty to acceptable levels in dirty data. It is the epidemiological approach vs the experimental approach. Stopping rules tell you where to stop in the former case; that’s not good enough.

  18. It seems to me that problems with data-based stopping and Bayesian analysis (other than the two issues I noted above) arise only because people are mixing Bayesian inference with non-Bayesian decision making. Which is fair enough—people apply these sorts of mixed methods all the time—but in that case I prefer to see the problem as arising from the non-Bayesian decision rule, not from the stopping rule or the Bayesian inference.

    Just wanted to comment on this. I think this is true in practice but not in principle, in the sense that a Bayesian can end up doing this in conformance with Bayesian methods. The scenario: the Bayesian is handed the data without having had any input in the collection mechanism, and the setting is Bayesian statistical decision theory with the action space being intervals and the loss function being one that picks out equal-tailed credible intervals as the posterior expected loss minimizer (at least one version of this loss function does exist).

  19. There was some suggestion that some preset width of a CI would be a reasonable stopping rule.

    There are two things that are problems with a stopping rule combined with a frequentist test. The first is that alpha just isn’t alpha anymore. This is what is being discussed most here. The other is that the estimated effect size will be dependent (ENTIRELY, in extreme cases!!) on sample size and become unrelated to the actual effect size.

    Using a CI width as a stopping rule will eliminate both of those effects. Of course, when a CI is used properly there isn’t really an alpha; the CI’s alpha is only incidentally related to the alpha of testing, and you don’t really get to use the interval to do a test, because then the interval disappears. But if someone were to use the CI as a test, stopping at a particular width is unbiased. Further, the observed mean effect size is also unbiased with respect to N.

    However, all is not rosy. What does occur is that the variance of the effect estimate now becomes associated with N and decoupled from the actual effect variance. So effect variances, while still correct in the long run, will be (negatively) correlated with N, which they should not be. (A small simulation of the unbiasedness claim appears below.)
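
    A sketch of the unbiasedness claim, with the caveats above (assumed: normal data with mean 0.5 and sd 1, batches of 5, stop when the 95% CI half-width falls below 0.2):

    ```r
    # Width-based stopping: the stopped estimate stays unbiased because
    # the stopping time depends on the sample SD, not the sample mean.
    set.seed(3)
    est <- replicate(2000, {
      y <- rnorm(5, mean = 0.5)
      while (qt(0.975, length(y) - 1) * sd(y) / sqrt(length(y)) >= 0.2)
        y <- c(y, rnorm(5, mean = 0.5))
      mean(y)
    })
    mean(est)  # close to 0.5
    ```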
