“We continuously increased the number of animals until statistical significance was reached to support our conclusions” . . . I think this is not so bad, actually!

Jordan Anaya pointed me to this post, in which Casper Albers shared this snippet from a recently published article in Nature Communications:

The subsequent twitter discussion is all about “false discovery rate” and statistical significance, which I think completely misses the point.

The problems

Before I get to why I think the quoted statement is not so bad, let me review various things that these researchers seem to be doing wrong:

1. “Until statistical significance was reached”: This is a mistake. Statistical significance does not make sense as an inferential or decision rule.

2. “To support our conclusions”: This is a mistake. The point of an experiment should be to learn, not to support a conclusion. Or, to put it another way, if they want support for their conclusion, that’s fine, but that has nothing to do with statistical significance.

3. “Based on [a preliminary data set] we predicted that about 20 unites are sufficient to statistically support our conclusions”: This is a mistake. The purpose of a pilot study is to demonstrate the feasibility of an experiment, not to estimate the treatment effect.

OK, so, yes, based on the evidence of the above snippet, I think this paper has serious problems.

Sequential data collection is ok

That all said, I don’t have a problem, in principle, with the general strategy of continuing data collection until the data look good.

I’ve thought a lot about this one. Let me try to explain here.

First, the Bayesian argument, discussed for example in chapter 8 of BDA3 (chapter 7 in earlier editions). As long as the factors that predict data inclusion are also included in your model, you should be ok. In this case, the relevant variable is time: If there’s any possibility of time trends in your underlying process, you want to allow for that in your model. A sequential design can yield a dataset that is less robust to model assumptions, and a sequential design changes how you’ll do model checking (see chapter 6 of BDA), but from a Bayesian standpoint, you can handle these issues. Gathering data until they look good is not, from a Bayesian perspective, a “questionable research practice.”

Next, the frequentist argument, which can be summarized as, “What sorts of things might happen (more formally, what is the probability distribution of your results) if you as a researcher follow a sequential data collection rule?”

Here’s what will happen. If you collect data until you attain statistical significance, then you will attain statistical significance, unless you have to give up first because you run out of time or resources. But . . . so what? Statistical significance by itself doesn’t tell you anything at all. For one thing, your result might be statistically significant in the unexpected direction, so it won’t actually confirm your scientific hypothesis. For another thing, we already know the null hypothesis of zero effect and zero systematic error is false, so we know that with enough data you’ll find significance.

Now, suppose you run your experiment a really long time and you end up with an estimated effect size of 0.002 with a standard error of 0.001 (on some scale in which an effect of 0.1 is reasonably large). Then (a) you’d have to say whatever you’ve discovered is trivial, (b) it could easily be explained by some sort of measurement bias that’s crept into the experiment, and (c) in any case, if it’s 0.002 on this group of people, it could well be -0.001 or -0.003 on another group. So in that case you’ve learned nothing useful, except that the effect almost certainly isn’t large—and that thing you’ve learned has nothing to do with the statistical significance you’ve obtained.

Or, suppose you run an experiment a short time (which seems to be what happened here) and get an estimate of 0.4 with a standard error of 0.2. Big news, right! No. Enter the statistical significance filter and type M errors (see for example section 2.1 here). That’s a concern. But, again, it has nothing to do with sequential data collection. The problem would still be there with a fixed sample size (as we’ve seen in zillions of published papers).
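To make the significance filter concrete, here is a minimal R sketch. It assumes a hypothetical true effect of 0.1 (the "reasonably large" value on the scale used above), a standard error of 0.2, and normally distributed estimates; none of these numbers come from the paper under discussion.

```r
# Sketch of the statistical significance filter (type M and type S errors).
# Assumptions (hypothetical, for illustration only): the true effect is 0.1,
# the standard error is 0.2, and estimates are normally distributed.
set.seed(123)
true_effect <- 0.1
se <- 0.2
n_sims <- 100000

estimates <- rnorm(n_sims, mean = true_effect, sd = se)
significant <- abs(estimates) > qnorm(0.975) * se   # two-sided test at the 5% level

# Share of replications that reach significance (the power)
power <- mean(significant)

# Type M error: average exaggeration of the magnitude, given significance
exaggeration <- mean(abs(estimates[significant])) / true_effect

# Type S error: share of significant estimates with the wrong sign
wrong_sign <- mean(estimates[significant] < 0)

cat("power:", round(power, 3), "\n")
cat("exaggeration ratio:", round(exaggeration, 2), "\n")
cat("wrong-sign share:", round(wrong_sign, 3), "\n")
```

Under these assumptions any significant estimate must exceed about 0.39 in absolute value, so every published "discovery" overstates the assumed effect at least roughly fourfold. Nothing in the sketch involves a stopping rule: the sample size is fixed.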


Based on the snippet we’ve seen, there are lots of reasons to be skeptical of the paper under discussion. But I think the criticism based on sequential data collection misses the point. Yes, sequential data collection gives the researchers one more forking path. But I think the proposal to correct for this with some sort of type 1 or false discovery adjustment rule is essentially impossible and would be pointless even if it could be done, as such corrections are all about the uninteresting null hypothesis of zero effect and zero systematic error. Better to just report and analyze the data and go from there—and recognize that, in a world of noise, you need some combination of good theory and good measurement. Statistical significance isn’t gonna save your ass, no matter how it’s computed.

P.S. Clicking through, I found this amusing article by Casper Albers, “Valid Reasons not to participate in open science practices.” As they say on the internet: Read the whole thing.

P.P.S. Next open slot is 6 Nov but I thought I’d post this right away since the discussion is happening online right now.

57 thoughts on ““We continuously increased the number of animals until statistical significance was reached to support our conclusions” . . . I think this is not so bad, actually!”

  1. At least they were honest about the practice and what their intentions were, rather than some post-hoc sample size “justification” to give the appearance that sample size was determined a priori. This sort of ad-hoc sequential design is unfortunately fairly common in biology, and usually rooted in the NHST framework. Since sample size is often more flexible (compared with human subjects), it can be a lot more resource-efficient to use sequential designs until the data “look good,” however that may be defined. But it has to be done properly and with purpose. I don’t see that changing for the typical biologist any time soon – even if they are aware of the problem, there aren’t (to my knowledge) any easy-to-use tools to help model sequential data collection.

  2. I have to admit that I don’t entirely get the argument that sequential testing is not a problem – strategic sequential testing that (I believe) goes on seems obviously problematic to me in that it leverages chance to produce statistically significant results. I’ll think on this more, and will watch this space tomorrow to see if anyone else is also having trouble with this line of argument.

    For now I’ll just challenge the idea that a pilot test is not designed to inform a priori power analyses. I think this statement is overly strong. It seems perfectly acceptable to me to run a small study and use that study to inform guesses about the likely range of effect sizes that might be observed. I do think it would be a mistake to run a small study and use the mean effect to inform power analyses – because the mean effect is likely to be estimated imprecisely, ignoring the dispersion of scores is really risky. However, in the absence of other information it seems perfectly reasonable to use, for example, the lower limit of an 80% confidence interval as a relatively conservative estimate to inform a priori power analysis (though assuming the pilot test is small, even an 80% CI will be quite wide).

    So I’ve come around a little bit – maybe we can agree that while, in theory, a pilot test might be a good way to generate an estimate for a priori power analysis, in practice that estimate is likely to be too imprecise to be of much help.

    • It’s more like, it’s not dramatically more problematic than the root problem which is standard NHST is full of shit in the first place.

      As far as I’m concerned, a pilot analysis is “for” showing that the experiment is feasible, but after collecting the data, there’s absolutely no reason not to get a Bayesian Posterior Distribution from the pilot data and use it to form an informed prior for the full analysis.

      NHST + power analysis is just again almost always going in the wrong direction. In a Bayesian analysis a “power” analysis is all about the question of “how much data would I have to collect to make a sufficiently low-risk decision” instead of “false positive” or “false negative” dichotomization.

      • Would it be fair to say in response “Look, if you’re hunting snipe or some magical p-value feel free to increase your sample size all you want but remember that as sample size approaches a big number the effect size quickly falls into ‘nobody gives a damn territory'”?

        • Yes, it’s more or less true, but there’s also the issue of “getting lucky” and type M error. When you have a noisy measure, you might “accidentally” obtain statistical significance early in the process, but not actually find out a good estimate for the parameter. That seems to be in part the case with these small N noisy studies.

          In the end I really think just *stop using statistical significance* in any way, unless your whole goal is actually to test computer random number generators.

          No one is going to start out to do a test saying “I’m going to sample until I get statistical significance” and actually do it if it takes more than a very moderate amount of resources. So the concept of “sampling to significance” is a purely theoretical one for almost all researchers. And if you’re doing research and you decide “our sample doesn’t give us significance, let’s collect a little more data and see if it does” then *you’re just doing it wrong*, NOT because you are sampling wrong, but because you’re ANALYZING wrong.

    • Jeff:

      1. Yes, with some sequential designs you can increase the probability of getting statistical significance. So what? Statistical significance, by itself, tells us nothing.

      2. I disagree with your statement that, “in theory, a pilot test might be a good way to generate an estimate for a priori power analysis.” Even in theory the pilot study is a bad way to generate this estimate. It will be too noisy. The lower limit of an 80% interval from a pilot study is not a conservative estimate of anything; it’s just a random number!
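The "random number" point can be checked by simulation. The sketch below is illustrative only: it assumes a hypothetical pilot of 4 matched pairs whose differences are normal with true mean 0.3 and sd 1, and looks at how the lower end of the 80% interval varies from one pilot to the next.

```r
# Sketch: how variable is the lower limit of an 80% interval from a tiny pilot?
# Assumptions (hypothetical): 4 matched pairs, paired differences drawn from
# a normal distribution with true mean 0.3 and sd 1.
set.seed(42)
n_pairs <- 4
true_mean <- 0.3
true_sd <- 1
n_sims <- 10000

lower_limits <- replicate(n_sims, {
  d <- rnorm(n_pairs, mean = true_mean, sd = true_sd)
  # Lower end of the two-sided 80% t interval for the mean difference
  mean(d) - qt(0.90, df = n_pairs - 1) * sd(d) / sqrt(n_pairs)
})

# Spread of the "conservative" estimate across hypothetical pilot studies
limit_quantiles <- quantile(lower_limits, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
share_negative <- mean(lower_limits < 0)   # pilots whose lower limit is below zero
print(round(limit_quantiles, 2))
cat("share of pilots with a negative lower limit:", round(share_negative, 2), "\n")
```

The interval has the advertised coverage, but the lower limit itself ranges widely across replications, which is the sense in which plugging it into a power calculation amounts to plugging in noise.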

        • Daniel:

          Of course don’t use a flat prior for the design. Use prior information. There’s always prior information, otherwise why are they doing the experiment in the first place?

        • Even most hard core frequentists won’t complain about using informative priors for _design_.

        • Here’s an example of one of the things you could do with your pilot posterior:

          draw N posterior samples, for each i = 1…N generate a fake dataset of Q data points according to the generating process, then run a Bayesian inference on this fake dataset, and determine some posterior samples for k, an important parameter.

          Vary Q

          Using a utility function that encodes how much you really care about knowing the best value for k, choose a Q that maximizes your expected utility across the N possible parameter vectors.

          Now carry out your “real” study using sample size Q

          And *that* is what “Bayesian Power Analysis” should look like, and it *doesn’t* suffer from the “noisy point estimate” problem.
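The procedure described above can be sketched in a few lines of R. Everything specific here is an assumption made for illustration: normal data with known sd, a normal pilot posterior for k, a squared-error utility with a linear cost per data point, and the names `value_of_precision` and `cost_per_point`.

```r
# Sketch of simulation-based sample size choice from a pilot posterior.
# Assumptions (hypothetical): normal data with known sd `sigma`, a pilot
# posterior for k that is normal(pilot_mean, pilot_sd), squared-error loss
# scaled by `value_of_precision`, and a fixed cost per data point.
set.seed(2017)
sigma <- 1
pilot_mean <- 0.2
pilot_sd <- 0.5
value_of_precision <- 1000   # utility lost per unit of squared error in k
cost_per_point <- 1          # utility lost per data point collected
N <- 2000                    # number of posterior draws
candidate_Q <- c(5, 10, 25, 60, 150)

expected_utility <- sapply(candidate_Q, function(Q) {
  k_draws <- rnorm(N, pilot_mean, pilot_sd)          # draws from the pilot posterior
  utilities <- sapply(k_draws, function(k) {
    y <- rnorm(Q, mean = k, sd = sigma)              # fake dataset of Q points
    # Conjugate normal update, using the pilot posterior as the prior
    post_prec <- 1 / pilot_sd^2 + Q / sigma^2
    post_mean <- (pilot_mean / pilot_sd^2 + sum(y) / sigma^2) / post_prec
    -value_of_precision * (post_mean - k)^2 - cost_per_point * Q
  })
  mean(utilities)
})

best_Q <- candidate_Q[which.max(expected_utility)]
print(data.frame(Q = candidate_Q, expected_utility = round(expected_utility, 1)))
cat("chosen Q:", best_Q, "\n")
```

With a conjugate model like this one the expected utility could be computed without simulation; the simulation form is shown because it carries over to models where the inference step has to be run numerically.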

        • Daniel:

          Sure. But another way of putting it is that, if you have any kind of reasonable prior, the likelihood from your pilot study will be so weak as to have virtually no effect on your posterior. And, in addition to that, a pilot study will typically have lots of bias: you’re doing the pilot to make sure the treatment can be implemented as planned, and there’s no real reason to put lots of effort into controlling biases.

        • “But another way of putting it is that, if you have any kind of reasonable prior, the likelihood from your pilot study will be so weak as to have virtually no effect on your posterior.”

          Based on my experiences, this is not true. I’ve seen lots of pilot studies that are something like matched pairs with 3 or 4 pairs. It was definitely not the case that the researchers’ prior sd for the effect size was considerably less than half of the sd of the difference of a randomly selected pair.

          To be clear, I’m not talking about ideal worlds here.

        • A:

          I’d need to see the example. In any example I’ve ever seen, the interval based on 3 or 4 pairs is so wide as to include huge swathes of completely unrealistic parameter values.

        • “there’s no real reason to put lots of effort into controlling biases”

          unless you’re doing it explicitly to make a decision about the size of your follow-up study, in which case you should try to get the best data you can so you can feed it all into a decision analysis.

          I think normally this kind of decision analysis based followup isn’t done, and that’s why people put less effort into their pilot studies.

        • Daniel:

          it’s not just that. Given that variance will be so high with small N, there’s not really any point in working to control bias at this stage. That’s one reason that pilot studies are often not randomized. Or, if they are, the point is to check that the randomization is feasible, not to worry about balance in some group of 4 patients or whatever.

        • I guess what you call a pilot study and what I call a pilot study may differ significantly, or maybe just it differs by field… In forensic engineering for example, I’ve definitely done things like sent an inspector into the field to take observations of say 8 randomly selected windows to figure out what measurements need to be taken, and get a handle on visually discernible failure rates. Rates could be anything from 0 to 1… and the “real” study needs to quantify failure across 1500 windows installed on a 15 story building, using measurement techniques appropriate to the types of failures likely to occur.

          The pilot study might cost $800, and collect very basic info about failure types and so forth, but the final study will need to look at N windows, with pressure tests or chipping away stucco to reveal installation techniques, or whatever. Maybe $5000 per window by the time scaffolding is set up, and 4 or 5 simultaneous workers per window.

          You definitely don’t want to do some kind of industrial process control textbook formula for sample size and tell the inspection team to completely strip 400 windows out of the building at a total cost exceeding twice the quantity requested in the settlement discussions.

          A real-world cost based decision analysis is a real thing here, and even low grade biased measurements from a pilot of 8 windows are a damn sight better than any other technique for determining the sample size.

        • Daniel:

          Yes, I agree, your scenario is different than the sort of pilot study we see in statistics where the range of the data are pretty much known ahead of time (a simple example being binary data with a roughly known frequency).

        • > pilot studies are often not randomized. Or, if they are, the point is to check that the randomization is feasible
          This certainly would make sense in clinical research, for instance when trying to carefully balance the control and intervention groups. But again, the primary motivation is to evaluate feasibility, safety, compliance, timing, costs, etc.

  3. I am worried that misinterpreted blog posts (and journal articles) like this one lead to problematic behaviors by practitioners. Sequential data collection combined with non-group-sequential frequentist methods (looking for p-value <= 0.05 in a test not adjusted for the data collection) or looking at whether Bayesian credible intervals (obtained using vague/improper uninformative priors) exclude zero is a widespread practice (combined with a "this is significant and thus proves xyz" interpretation). Some people who do that are vaguely aware that, if you care about the type I error rate or 95% CI coverage, frequentist methods require some adjustment. They then tend to justify what they are doing instead by claiming that "Bayesian analyses don't require an adjustment and with vague priors have great frequentist operating characteristics, too! And in any case, I can do a frequentist analysis and then just interpret it as a Bayesian analysis with uninformative priors. See, the whole problem goes away with the wave of a hand".

    As long as it's done this way, this approach is a double-whammy, you get the chance to "win big" with a huge effect size early on based on a type SM error and late on based on showing an irrelevant effect.

    In my experience, when I tried to discuss the potential issues (which admittedly have more to do with the significant-or-not interpretation than with the sequential data collection), I just got told that I am too stupid to understand the likelihood principle (and that I should read some article by Berry and Berry that explains it even for people like me). So these kinds of posts do really worry me in terms of the practical effects they have, although here at least the "this is fine in a sense, if you don't care about the type 1 error rate or other frequentist operating characteristics" disclaimer is clear. However, I would have wished for it to be even bigger and more clearly spelled out. One can never be too clear.

    • Bjorn:

      Just to be clear:

      – Bayesian methods don’t require an adjustment for sequential design. They do, however, require a model for the outcome that includes all variables used in the design (in the case of a sequential design, the key variable is time).

      – I think decision rules based on null hypothesis significance testing (these include p-values, Bayes factors, and decisions based on whether a 95% confidence or posterior interval includes zero) make no sense and will in general have bad statistical properties.

      – I think it’s a mistake to think you’ve “won big” if you get a huge effect size estimate along with a large standard error. I discussed this problem in section 2.1 of the paper linked to above.

      – I disagree completely with the claim that Bayesian analyses with vague priors have great frequency properties. I’ve talked and written about this a lot: Bayesian analysis with vague priors leads to the following sort of statement: If you have an estimate that’s 1 se from zero, you end up with 5:1 odds that the true effect is positive. Go around giving 5:1 odds based on pure noise and you’re gonna lose a lot of bets.

      – The likelihood principle is what it is. In any case, you can do most of the above reasoning without worrying about the likelihood principle, just looking at frequency properties.

      – The practical effect I’m hoping for from this post is for people to focus on important statistical issues. To criticize the above-linked study based on its sequential design is, to me, ridiculous, as it would have almost all the same problems had the sample size been fixed. The sequential design is a minor part of the study, and to pick on that seems to me like a distraction. For influential people (including the editor of a leading psychology journal) to focus on this seems to me to miss the point, and it’s perhaps one clue as to how so many crappy papers get published in top journals: there’s an attitude that if various arbitrary rules are followed (no sequential design, p less than 0.05, etc.), then a paper gets to be published. That led to the Bem ESP debacle.
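The 5:1 figure in the list above follows from a short calculation. This sketch assumes a normal likelihood, so that under a flat prior the posterior for the effect is normal, centered at the estimate with sd equal to the standard error.

```r
# Sketch: posterior odds of a positive effect under a flat prior.
# Assumption: a normal likelihood, so the posterior for the effect is
# normal(estimate, se) when the prior is flat.
estimate <- 1   # estimate measured in units of its standard error
se <- 1
p_positive <- pnorm(estimate / se)          # posterior Pr(effect > 0)
odds_positive <- p_positive / (1 - p_positive)
cat("Pr(effect > 0):", round(p_positive, 3), "\n")   # 0.841
cat("odds:", round(odds_positive, 1), "to 1\n")       # 5.3 to 1
```

An estimate 1 se from zero is entirely unremarkable under pure noise, which is why offering those odds on it would lose a lot of bets.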

      • Maybe we need more examples of how to do such trials properly and interpret it in an appropriate manner. For example, what would a possible Bayesian model including time look like? Picking some sensible priors, what would be a good analysis of this experiment and what would be an appropriately cautious interpretation?

        • You need to model the process you think generated the data. Ie, if there is sequential sampling you need to model sequential sampling.

          “We continuously increased the number of animals until statistical significance was reached to support our conclusions”

          “We assumed the data was iid but then made collection of new data dependent on the outcome of the previous data so it wasn’t iid. Then we rejected the iid model and concluded that we know the cure for cancer (or whatever).”

          That is how stupid this is. And do not be mistaken, it is widespread and has been for decades. The main problem at this point is that the human mind recoils at the thought of the consequences.

        • Here is a model of sequentially collecting data from two equivalent groups, performing a t-test, and collecting more until you either get a “significant” result or run out of money:

          Here are the results of the above for 100 simulations. It is the distribution of p-values you get by taking either the final or lowest p-value. About 35% were less than 0.05 in that case:

          If you didn’t do sequential sampling then those histograms would look like uniform distributions and ~5% of p-values would be below 0.05.
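The code and figures referred to in the comment above are not reproduced here. As a stand-in, here is a minimal sketch of that kind of simulation. The starting sample size, the step of one observation per group, and the cap of 50 per group are assumptions, so the resulting rate should not be expected to match the 35% quoted above.

```r
# Sketch: sample to significance from two identical groups.
# Assumptions (hypothetical): start with 5 per group, add 1 per group after
# each non-significant t-test, and give up at 50 per group.
set.seed(1)
n_start <- 5
n_max <- 50
n_sims <- 1000

run_once <- function() {
  a <- rnorm(n_start)
  b <- rnorm(n_start)          # same distribution as a: the null is true
  repeat {
    p <- t.test(a, b)$p.value
    if (p < 0.05 || length(a) >= n_max) return(p)
    a <- c(a, rnorm(1))        # collect one more observation per group
    b <- c(b, rnorm(1))
  }
}

final_p <- replicate(n_sims, run_once())
fixed_p <- replicate(n_sims, t.test(rnorm(n_max), rnorm(n_max))$p.value)

cat("share of final p < 0.05, sequential:", mean(final_p < 0.05), "\n")
cat("share of p < 0.05, fixed n:", mean(fixed_p < 0.05), "\n")
```

Calling `hist(final_p)` and `hist(fixed_p)` shows the contrast described in the comment: a pile-up just below 0.05 under the stopping rule, and a roughly uniform shape with a fixed sample size.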

      • While I agree with most of what you’ve said, I do think that it is important to criticize (or warn against, educate about, etc.) these kinds of sequential designs, as they just provide noise. Yes, other practices *also* lead to noise, but that means we should criticize those practices too, not ignore some practices because others are also bad or worse.

        You do make a great point that had this study used an a-priori fixed sampling procedure instead of a post-hoc sequential one, it would not have been much better. While that is true in this case, in many other cases this does not hold. As such, I do think that it is good to focus (to some degree) on this particular bad approach, without losing sight of other problematic practices.

        • I agree that sequential designs are just fine, and should actually be encouraged; however, I do feel that Andrew is engaging in a bit of a dodge there. If you don’t use sequentially valid frequentist inference procedures, you are basically guaranteed to reject the null hypothesis eventually.

          Andrew’s argument, if I understand correctly, is: “Whatever, NHST (frequentist and bayesian) is useless and broken so who cares if you do the sequential analysis wrong.”

          I think you can make the argument that NHST has serious problems, as Andrew often does. Whatever your bottom line decision rule should be, which has always been less clear to me in Andrew’s writing, you’ve got to correctly account for your sequential design. Sometimes you get this for free by virtue of being in a Bayesian paradigm and sometimes you don’t.

        • I think you may mean more precisely that a point null hypothesis is false, as it is obviously not the case that any null hypothesis is false. It is quite common to have basic business practices in place and then have the research question “does doing X improve Y”, where Y is revenue or some such thing. X does not always improve Y, so the null hypothesis that we should keep with the status quo is sometimes true.

        • One last thought on this. Most people interpret point null hypotheses in a way that I think you’d approve of. Namely, a point null hypothesis can be thought of as two tests:

          1. HA: ‘treatment’ is better than baseline
          2. HA: ‘treatment’ is worse than baseline

          A point null hypothesis test tests both of these controlling for the family wise error rate. This actually corresponds to how people actually interpret the results. People don’t say: “The treatment was shown to have a non-zero effect on the condition,” rather they interpret it directionally: “the treatment was shown to improve the condition.”

          So yes, the null hypothesis of no effect is often a priori false, however, the one sided nulls are not false. So here is a counterintuitive response to your statement that “the null hypothesis is false.”:

          The null hypothesis is true. The question the test is addressing is which one.

        • Ian:

          That’s all fine, except (a) effects can be highly variable, hence an effect size of +0.002 in a particular experiment, even if several standard errors from zero, doesn’t tell us much about what might happen next time (I’m assuming a scale in which effect sizes on the order of 0.1 are interesting); and (b) all this type 1 error rate control stuff is not really relevant to questions of distinguishing positive from negative effects.

    • Björn:

      I had very much the same experience with a statistical colleague a couple months ago. Before and afterwards, I sent them some material on how bad this actually is. Have no idea what the impact was/will be.

      Largely, I think bad meta-physics or meta-statistics is at the root of this, and that is why it is so hard to get folks to take criticism seriously. For instance, the likelihood principle, to some, means that frequency properties are irrelevant, so they will just dismiss looking at frequency properties.

      If you can get someone’s attention and time, this simulation based exposition of the issue by Andrew may be a good bet http://statmodeling.stat.columbia.edu/2016/08/22/bayesian-inference-completely-solves-the-multiple-comparisons-problem/

      I discussed it in a wider context here (where it is Case study 1) http://statmodeling.stat.columbia.edu/2016/08/22/bayesian-inference-completely-solves-the-multiple-comparisons-problem/

  4. In psychophysical studies sequential testing is fairly common. This is usually paired with adaptive stimulus placement, using some heuristic to choose the “best” stimulus on each trial, based on the current prior. After Kontsevich & Tyler (1999), the heuristic usually has been to minimize the posterior entropy. Some implementations, simulations and other stuff (equations! Curly lines in figures!) can be found in e.g. DiMattina (2015), Kujala & Lukka (2006), and Shen & Richards (2013).

    The motivation for this is, at least it used to be, rather practical: if we are interested in, e.g., the faintest stimulus the subject can detect, it doesn’t really make sense to present them with stimuli they always are able to notice. This resulted in different sorts of “non-parametric” sequential tests, in which some simple rule would be used to determine the next stimulus. Later, as was said in the beginning of this post, more mathematical methods for stimulus selection were developed, since in the more complex models the stimulus placement is dependent on more things than just the psychophysical threshold.

    To make everyone more bored, I’ve attached some quickly put together R code for a simple adaptive psychophysical task. I scripted it while on a tea break, so it lacks a certain robustness in the programming sense… but still, I thought that maybe people could find it fun to play around with it. It uses sequential importance sampling, at this point, so particle degeneracy can become a problem if one wants to run longer simulations. In these cases I’d recommend adding a “resample-move” step, as in Chopin (2002).

    Also, since it is all in native R, and I was too lazy to figure out some vectorizations, it is also really slow, so be aware. The model in itself is quite simple. There’s an observer making binary decisions, basing their decision on the “internal” strength of the signal (which depends on where the signal-to-noise ratio is 1 and on the non-linearity of the internal scale) and a decisional bound, pretty much like in basic probit models. The probability is “padded” a bit by mixing in some non-cognitive factors (like in Zeigenfuse and Lee 2010, if I recall correctly). So there it is.


    Chopin, N. (2002). A sequential particle filter method for static models. Biometrika.
    DiMattina, C. (2015). Fast Adaptive Estimation of Multidimensional Psychometric Functions. Journal of Vision.
    Kontsevich, L.L. and Tyler, C.W. (1999). Bayesian Adaptive Estimation of Psychometric Slope and Threshold. Vision Research.
    Kujala, J.V. and Lukka, T.J. (2006). Bayesian Adaptive Estimation: The Next Dimension. Journal of Mathematical Psychology.
    Shen, Y. and Richards, V.M. (2013). Bayesian Adaptive Estimation of the Auditory Filter. Journal of the Acoustical Society of America.
    Zeigenfuse, M.D. and Lee, M.D. (2010). A General Latent Assignment Approach for Modeling Psychological Contaminants. Journal of Mathematical Psychology.


    # Some functions

    # Probability of a "yes" response to a stimulus of strength x.
    # par[1] = log threshold, par[2] = log exponent, par[3] = decision criterion.
    # 2% of responses are unbiased guesses (probability 0.5).
    pYes = function(x, par) {
      0.98 * pnorm(-par[3] + (x / exp(par[1])) ^ exp(par[2])) + 0.02 * 0.5
    }

    # Negative expected information gain of a stimulus (negated so that
    # optimise(), which minimises, picks the most informative stimulus).
    informationGain = function(stimulus, particles, weights) {
      pyes = rep(0.5, length(weights))
      sum1 = 0
      sum2 = 0
      for(i in 1:length(weights)){
        pyes[i] = pYes(stimulus, particles[i,])
        # Weighted average of the response probability
        sum1 = sum1 + pyes[i] * weights[i]
        # Weighted average of the per-particle Bernoulli entropy
        sum2 = sum2 + (-(pyes[i] * log(pyes[i]) + (1 - pyes[i]) * log(1 - pyes[i]))) * weights[i]
      }
      # Entropy of the averaged (predictive) response distribution
      sum1 = (-(sum1 * log(sum1) + (1 - sum1) * log(1 - sum1)))
      return(-(sum1 - sum2))
    }

    # Particle set
    priorMeans = c(log(2), log(1), 1.2)
    priorSd = c(1, 1, 1)
    nParticles = 1000
    particles = matrix(NaN, ncol = 3, nrow = nParticles)
    particles[,1] = rnorm(nParticles, priorMeans[1], priorSd[1])
    particles[,2] = rnorm(nParticles, priorMeans[2], priorSd[2])
    particles[,3] = rnorm(nParticles, priorMeans[3], priorSd[3])
    weights = rep(1 / nParticles, nParticles)

    # Parameters for the simulation
    nTrials = 100
    answers = c()
    stimuli = c()
    generatingValues = c(1, 0.5, 1)

    # Run simulation:
    for(t in 1:nTrials) {
      # Choose stimulus:
      stimuli[t] = optimise(informationGain, lower = 0, upper = 10, particles = particles, weights = weights)$minimum
      # Simulate the observer's response
      answers[t] = rbinom(1, 1, pYes(stimuli[t], generatingValues))
      # Update prior: reweight each particle by the likelihood of the response
      for(i in 1:length(weights)) {
        weights[i] = weights[i] * (answers[t] * pYes(stimuli[t], particles[i,]) +
                                     (1 - answers[t]) * (1 - pYes(stimuli[t], particles[i,])))
      }
      weights = weights / sum(weights)
    }

    • Why are there so many logs and exps and arbitrary constants and weird equations in this code? If it’s meant to demonstrate something about sequential testing, that makes it unnecessarily difficult to understand.

      • :(

        I give you that it is quite esoteric. I will try to unwind some peculiarities.

        The exps in the function “pYes” are there so that the threshold and exponent, exp(par[1]) and exp(par[2]), are always positive. Conversely, in the vector priorMeans the means are log’d so that they would be more easily understood–but maybe they aren’t. The core equation in itself–(x/alpha)^beta–is widely used in psychophysics to relate physical signal level to the internal signal-to-noise ratio (cf. e.g. Kontsevich and Tyler from the previous post). The intercept in the model corresponds to a “decision criterion”–if you think of it as a latent variable model (https://en.wikipedia.org/wiki/Logistic_regression#As_a_latent-variable_model).

        The logs in the function informationGain are related to the calculation of the entropy of a bernoulli distribution. As I said, the heuristic is to minimize the entropy of the posterior distribution. Here, instead, it is the probability distribution of the responses in which the entropy is minimized; the algorithm (during the optimisation step inside the main loop) chooses the stimulus that reduces the entropy of the bernoulli distribution the most. Kujala and Lukka have more information about this in their article.

        The arbitrary constants in pYes are indeed arbitrary. The constants 0.98 and 0.02 are mixing coefficients; they denote how much of the response is dictated by the core equation and how much of it is due to unbiased noise (the coefficient 0.5). This principle is elaborated on in the Zeigenfuse reference. Also I think Kruschke wrote about this sort of mixture modeling in his book, calling it “robust regression”, but I can’t put my finger on it.

  5. Andrew writes, “So in that case you’ve learned nothing useful, except that the effect almost certainly isn’t large—and that thing you’ve learned has nothing to do with the statistical significance you’ve obtained”

    I have a small disagreement with this statement.
    (A) It IS useful to learn about an effect size being small
    (B) The usefulness of (A) is predicated on having a large enough sample. And one way that will occur is if your ‘true’ effect size is very small and you have a statistical significance based stopping rule.

    So while I agree that p-value based stopping rules are not a generally coherent framework, a side effect of implementing them is that a ‘precisely estimated zero’ obtained from doing so is quite useful. Think of this as the inverse to the type-M problem.
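
    A rough simulation sketch of points (A) and (B), under assumptions that are not in the comment itself: normal data with known sd 1, a hypothetical true effect of 0.02, a look at the data after every observation from n = 10 on, and a cap of 20,000 observations. At the stopping point the estimate sits just past 1.96 standard errors from zero, so a run that stops late necessarily reports a small estimate with a small standard error, while a run that stops early reports a large, noisy one.

    ```r
    set.seed(1)
    true_effect <- 0.02   # hypothetical small effect, in sd units
    n_min <- 10           # first look at the data
    n_max <- 20000        # give up after this many observations
    n_sims <- 200

    one_run <- function() {
      y <- rnorm(n_max, mean = true_effect, sd = 1)
      n <- seq_len(n_max)
      running_mean <- cumsum(y) / n
      z <- running_mean * sqrt(n)                 # z-statistic with known sd = 1
      hits <- which(abs(z) > 1.96 & n >= n_min)   # looks where p < 0.05
      stopped <- length(hits) > 0
      stop_n <- if (stopped) hits[1] else n_max
      c(stop_n = stop_n, estimate = running_mean[stop_n], stopped = stopped)
    }

    runs <- as.data.frame(t(replicate(n_sims, one_run())))
    sig <- runs[runs$stopped == 1, ]
    early <- sig$stop_n < 100

    cat("runs reaching significance:", nrow(sig), "of", n_sims, "\n")
    cat("median |estimate|, stopped before n = 100:", median(abs(sig$estimate[early])), "\n")
    cat("median |estimate|, stopped at n >= 100:   ", median(abs(sig$estimate[!early])), "\n")
    ```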

  6. Thanks for writing on this, Andrew. (And thanks for liking my open science joke :))

    I wrote a blog post outlining the consequence of sticking to NHST and not adjusting for sequential data collection. I hope it can help as an eye opener to some, as it clearly shows how large the bias in the p-values *and* the effect size estimates is when applying this approach:

    • Casper:

      I think the thing with the p-values is irrelevant to good practice in that we should not be using p-values to make inferences or decisions. I disagree entirely with your “false discovery rate” attitude in that I do not think the purpose of a study is, or should be, the “discovery” of nonzero differences. All differences are nonzero. Just get N=10^6 and you can get as many discoveries as you want.

      Regarding the point estimates: yes, any selection on statistical significance will bias your point estimates. This arises with sequential or non-sequential designs. However, if you perform a sequential design and report all your data, there should not be a problem.
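
      A quick sketch of the bias from selecting on significance in a fixed-n design. The true effect of 0.1, n = 50, and known sd of 1 are hypothetical values chosen for illustration:

      ```r
      set.seed(2)
      true_effect <- 0.1    # hypothetical, in sd units
      n <- 50
      n_sims <- 10000
      se <- 1 / sqrt(n)     # standard error of the mean with known sd = 1

      # Draw the sample mean directly from its sampling distribution.
      estimates <- rnorm(n_sims, mean = true_effect, sd = se)
      significant <- abs(estimates) > 1.96 * se

      mean_all <- mean(estimates)
      mean_sig <- mean(abs(estimates[significant]))

      cat("true effect:                       ", true_effect, "\n")
      cat("mean estimate, all replications:   ", round(mean_all, 3), "\n")
      cat("mean |estimate|, significant only: ", round(mean_sig, 3), "\n")
      ```

      Averaged over all replications the estimate is centered on the true effect; the exaggeration appears only after conditioning on significance, because every significant estimate must exceed 1.96 standard errors (about 0.28 here) in absolute value.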

      In addition, I disagree completely with your conclusion that a researcher should “increasing your sample size in small bits until you meet some threshold.” It’s always better to get more data. The reason for not getting more data is some combination of cost, convenience, and urgency—not a statistical significance threshold. Again, the null hypothesis of exactly zero effect and zero systematic error will never be true, so I have no interest in rejecting it 5% of the time or whatever. This is a game that I have no interest in playing, and which I don’t think researchers should be playing. And, for that matter, I don’t think Alan Turing used statistical significance thresholds when cracking codes (or, at least, I haven’t heard of him doing so).

      • Cool.

        If I had the time to waste I could redo this with a rejected study with frequency-based analyses, where the reviewers stated that the study was too noisy (underpowered) and had not adjusted properly for multiple analyses. The resubmission would do a Bayesian analysis with a flat prior and prattle on about the advantages of now knowing the posterior probabilities, highlighting credible intervals that are almost identical to the previous confidence intervals.

  7. Ian Fellows wrote:

    1. HA: ‘treatment’ is better than baseline
    2. HA: ‘treatment’ is worse than baseline
    So yes, the null hypothesis of no effect is often a priori false, however, the one sided nulls are not false.

    Not at all. That is not the null hypothesis, alternative hypothesis, or any hypothesis being tested. Your null hypothesis is whatever model you actually calculated a p-value (or whatever) based on. This will include other assumptions besides mean1 = mean2, such as normality, iid data, etc. There is no reason to privilege the mean1 = mean2 assumption.

    In this sequential sampling case, the iid assumption that a lot of these default statistical models make is violated. I think people don’t even understand the first thing about what they are testing, which leads to all these problems. I know a lot of readers probably think I am hyperbolic but it really is idiotic if you understand what is going on:


    • One reason normality assumptions are so popular is that the normal distribution is a mathematical attractor. So the assumption can hold approximately without much assumption. Of course, a thing that mathematically has to be true isn’t of much interest to test scientifically.

      In the end, hypotheses about which rng generated your data are stupid things to test. What we want is mechanistic predictive models with bounds on the imprecision. That’s what Bayesian gives you.

      • Yes, normal distribution assumptions often make sense in the absence of anything better because they involve the minimum of assumptions. That doesn’t mean testing whether such a model is correct makes any sense.

        My ever-present position is to test a model you have derived from a theory/explanation/whatever and then work from there.

  8. “Statistical significance isn’t gonna save your ass, no matter how it’s computed.”

    It won’t, and it was never, ever claimed to, but statistical significance is a good ‘first pass’ at the problem, a filter/gatekeeper, what have you. But hey, same with Bayes factors and everything else.


    • Justin:

      Two things. First, there’s a reason people use different statistical methods. They’re not all equivalent, and there’s no reason to think that every statistical method is a good first pass at a problem. Second, a lot of the problems with statistical significance are exactly that it is used as a “filter/gatekeeper.” We discuss general problems with filter/gatekeeper/etc. here.
