The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time

Kevin Lewis points us to this article by Joachim Vosgerau, Uri Simonsohn, Leif Nelson, and Joseph Simmons, which begins:

Several researchers have relied on, or advocated for, internal meta-analysis, which involves statistically aggregating multiple studies in a paper . . . Here we show that the validity of internal meta-analysis rests on the assumption that no studies or analyses were selectively reported. That is, the technique is only valid if (a) all conducted studies were included (i.e., an empty file drawer), and (b) for each included study, exactly one analysis was attempted (i.e., there was no p-hacking).

This is all fine, and it’s consistent with the general principle that statistical analysis must take into account data collection, in particular that you should condition on all information involved in measurement and selection of observed data (see chapter 8 of BDA3, or chapter 7 of the earlier editions, for derivation and explanation from a Bayesian perspective).

I just want to point out one little thing.

This bit is wrong:

“exactly one analysis was attempted (i.e., there was no p-hacking)”

There is still a problem even if only one analysis was performed on the given data. What is required is that the analysis would have been done the same way, had the data been different (i.e., there were no forking paths). As Eric Loken and I put it, multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time.
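
Here is a minimal R sketch of what I mean, under the assumption (mine, for illustration) that a researcher facing null data would look at the treatment effect separately for men and for women and write up only the comparison that looks stronger:

```r
# A minimal simulation sketch (hypothetical setup): each simulated researcher has
# null data, eyeballs the treatment effect separately for men and for women, and
# reports only the subgroup comparison with the smaller p-value. Both p-values are
# computed here only to determine which path would have been taken.
set.seed(1)
one_study <- function(n = 100) {
  treat  <- rbinom(n, 1, 0.5)
  female <- rbinom(n, 1, 0.5)
  y      <- rnorm(n)   # no true effect anywhere
  p_f <- t.test(y[treat == 1 & female == 1], y[treat == 0 & female == 1])$p.value
  p_m <- t.test(y[treat == 1 & female == 0], y[treat == 0 & female == 0])$p.value
  min(p_f, p_m)        # the single analysis that would have been reported
}
mean(replicate(5000, one_study()) < 0.05)   # roughly 0.10, not the nominal 0.05
```

Each simulated dataset yields exactly one reported test, but because the choice of test depends on the data, the nominal 5% false-positive rate roughly doubles.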

Vosgerau et al. clarify this point in the last sentence of their abstract, where they emphasize that “preregistrations would have to be followed in all essential aspects”—so I know they understand the above point about forking paths. I just wouldn’t want people to read only the first part and mistakenly think that, because they did only one analysis on their data, they’re not “p-hacking” and so they have nothing to worry about.

20 thoughts on “The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time”

  1. There is a term for this in computer science: it’s an oblivious algorithm. A classic example is that Batcher sort is oblivious but Quicksort is not. Oblivious algorithms have the advantage that their runtimes can be known in advance, and they are often easier to parallelize.
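
    For concreteness, here is a minimal R sketch of an oblivious algorithm, using the simpler odd-even transposition network rather than Batcher’s: the sequence of positions compared depends only on n, never on the values, whereas Quicksort decides what to compare based on the pivots it encounters in the data.

    ```r
    # Odd-even transposition sort: the schedule of compare-exchange operations is
    # fixed by n alone, so the "path" through the computation is set before the
    # data arrive (the same property Batcher's networks have).
    oblivious_sort <- function(x) {
      n <- length(x)
      if (n < 2) return(x)
      for (round in seq_len(n)) {
        start <- if (round %% 2 == 1) 1 else 2      # alternate odd and even phases
        if (start > n - 1) next
        for (i in seq(start, n - 1, by = 2)) {      # fixed pairs, independent of the values
          if (x[i] > x[i + 1]) {
            tmp <- x[i]; x[i] <- x[i + 1]; x[i + 1] <- tmp
          }
        }
      }
      x
    }

    oblivious_sort(c(5, 3, 8, 1, 9, 2))   # 1 2 3 5 8 9
    ```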

  2. And conversely, folks stuck in NHST thinking sometimes take it as a badge of honor that they did only one ANOVA or whatever and didn’t take the time to plot their data in a variety of ways and check different models and their fit, because that superficially seems like “p-hacking.” We desperately need a revolution in practice across all applied fields, toward workflow, transparency, and an emphasis on all inference as conditional on clearly specified assumptions!

      • Doesn’t this require that you specify the full model in advance including all things that you may possibly want to compare? If you check the model against the data and change it then, or if the data give you other ideas what to model and look for, then you’re in for the same forking paths issue, aren’t you?

        • Hi, Christian. Yes, when you expand the possibilities you’ll want to include that in the multilevel model. There’s no way to completely avoid this Cantor’s corner issue, but I think that including the possibilities in the multilevel model is better than the usual approach of going with just one particular possibility. My suggestion is to include a family of forking paths rather than just choosing one path. There will always be other paths not chosen, but they can be incorporated into the model too. The existence of possible future developments should not stop us from trying to model what we have so far.
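
          As a rough sketch of what I mean (simulated data, with lme4 standing in for what could equally be a fully Bayesian fit), instead of picking one subgroup comparison after seeing the results, you can estimate the treatment effect in every subgroup at once with partial pooling:

          ```r
          library(lme4)
          set.seed(1)
          d <- data.frame(
            x        = rbinom(400, 1, 0.5),                      # treatment
            subgroup = factor(sample(1:8, 400, replace = TRUE))  # the forks one might have chosen among
          )
          d$y <- 0.2 * d$x + rnorm(400)

          # varying intercepts and varying treatment effects across subgroups,
          # shrunk toward the overall effect rather than reported one at a time
          fit <- lmer(y ~ x + (1 + x | subgroup), data = d)
          coef(fit)$subgroup   # all the subgroup-specific effects, not just one chosen path
          ```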

        • What’s the practical utility of the whole cult of forking paths? Aren’t we overdoing it, bringing it up all the time?

          Sure, it’s a catchy phrase but beyond that? Does it carry any practical prescription?

          Or is it a generic plug for “Go Bayes!”?

          To be unaware of researcher degrees of freedom one would have to be living under a rock. And for the ones who are aware what does the forking paths paradigm add? Does it help them in any practical sense?

        • Rahul:

          1. It’s a concept, not a cult.

          2. I do not bring it up all the time. I brought it up here because of its direct relevance to the article under discussion. P-hacking and researcher degrees of freedom are not cults either; they too are concepts.

          3. The practical prescription is that researchers should not take p-values at face value without accounting for the dependence of their choice of statistical analysis on their data, a dependence that is a concern even if they only did one particular analysis on the data at hand.

          4. Forking paths can be a problem with Bayesian analyses too. See my article, The problems with p-values are not just with p-values.

          5. Unfortunately, people continue to make elementary statistical mistakes. Perhaps they’re living under a rock, perhaps they’re following the advice of bad textbooks or the examples of erroneous published articles, perhaps they don’t want to look too hard at their apparently successful conclusions . . . or perhaps they’re misled by the term “p-hacking” into thinking that multiple testing is only a problem when the multiple tests are performed on the data at hand. It was the last concern that motivated this post.

    • Framing forking paths as an issue relevant only to one statistical methodology (or even a suite of them) misses the crux of why forking paths are such a concern. Put simply: the problem is not inherent to any particular methodology (though it can likely be swept under the rug by methodologies that make things less explicit); it is inherent to the scientific enterprise.

      We have choices everywhere, some or many of which we are even unaware that we are making, and those choices have a profound effect on how we analyze, view, and conduct investigations.

  3. But if you don’t explore your data and try out different things, you may not be getting the most out of it. You may even miss important things you hadn’t picked up on before.

    The point is not that you shouldn’t try more than one approach or technique, it’s that what you end up with has to be rigorously “oblivious”, to use J Storrs Hall’s expression.

    I see the use of fake data simulation as an important element of approaching the nirvana of obliviousness.
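
    Something like this minimal R sketch, for instance: fix the analysis first, then check on simulated data with a known effect that the planned analysis is calibrated, before any real data are touched.

    ```r
    set.seed(2)
    true_effect <- 0.3
    covered <- replicate(1000, {
      x <- rbinom(100, 1, 0.5)
      y <- true_effect * x + rnorm(100)
      ci <- confint(lm(y ~ x))["x", ]   # the pre-specified analysis, run blindly
      ci[1] < true_effect && true_effect < ci[2]
    })
    mean(covered)   # should be near 0.95 if the planned analysis is calibrated
    ```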

  4. Like any statistical method, internal meta-analysis can be abused by the “garbage in, garbage out” problem.

    But the quoted criticism seems inept: a main purpose of internal meta-analysis is to reduce any perceived need for selective reporting. Internal meta-analysis helps shift us away from a world where every study or analysis must “work” (be statistically significant) in order to be reported, and toward a world where the focus is instead on meta-analytic estimates of effect sizes and their variation across studies, so that the statistical significance of any single study or analysis is just not relevant.

    Internal meta-analysis thus aligns with both points made in your post from the other day. It helps move away from the silly false-positive/false-negative framework you criticize, and it is a way to facilitate your comment:
    “I think it’s good to analyze the whole portfolio of trials, not to skim the statistically significant results and then try to use statistical methods to try to estimate the rest of the iceberg.”
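
    As a minimal sketch of that shift in focus (made-up numbers), the meta-analytic estimate is just an inverse-variance-weighted average of the per-study estimates; no individual study needs to “work”:

    ```r
    est <- c(0.21, 0.08, 0.35, -0.02)   # per-study effect estimates (hypothetical)
    se  <- c(0.10, 0.12, 0.15,  0.11)   # their standard errors (hypothetical)
    w   <- 1 / se^2                     # inverse-variance weights
    c(estimate = sum(w * est) / sum(w), se = sqrt(1 / sum(w)))
    ```

    A random-effects version would additionally estimate the between-study variation, which is the “variation across studies” part.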

    • Sort of agree, but one of the problems with internal meta-analysis is that the study authors have a conflict of interest in their own particular study.

      But not as bad as ignoring everything else to be an island on one’s own in order to make “conclusions and recommendations”.

      Better would be to just report what you did and what you think happened, along with access to the data. Then external groups would be better able to make sensible conclusions and recommendations based on more relevant studies.

      Here, with access to raw data, the analyses could be redone according to a protocol less influenced by a particular data set.

      These are old ideas, e.g. see https://en.wikipedia.org/wiki/Meta-analysis#cite_note-1 .

      P.S. That reference has been the number-one reference on the wiki entry for about 10 years, yet the topics are continually re-invented.

    • Anon:

      This is where preregistration comes in. You can also look at the statistician’s track record. If he or she has always done the same analysis on every problem that comes in, then it makes sense to believe that this would be the case for this problem too. If the statistician is like me and does something different for each new problem, then I’d be skeptical that the analysis for this particular dataset was decided before seeing the data. I’m not always a big fan of preregistration, but if you really have decided your analysis ahead of time, you might as well preregister it.

  5. This was a slightly confusing post–I had to look at the paper to see that “internal” meta-analysis is “statistically aggregating multiple [original] studies in a paper.” My initial impression was that this was a new term for a regular meta-analysis, which made me wonder why the authors didn’t discuss funnel plots, the typical method for assessing the file drawer problem in meta-analysis.

    Now that I realize it’s a meta-analysis on a set of original studies, I still wonder if reporting a funnel plot of those studies might provide evidence of the extent to which reporting was selective. Obviously, it still wouldn’t address Andrew’s concern about the entire universe of possible analyses. Interestingly, an article about funnel plots is in the paper’s references but not cited in its actual text.
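
    Something as simple as this R sketch (hypothetical numbers) would do: plot each study’s estimate against its standard error and look for asymmetry around the pooled estimate.

    ```r
    est <- c(0.21, 0.08, 0.35, -0.02, 0.40)   # hypothetical per-study estimates
    se  <- c(0.10, 0.12, 0.15,  0.11, 0.20)   # hypothetical standard errors
    plot(est, se, ylim = rev(range(se)),      # invert the y-axis, funnel-plot style
         xlab = "Effect estimate", ylab = "Standard error")
    abline(v = sum(est / se^2) / sum(1 / se^2), lty = 2)   # inverse-variance pooled estimate
    ```

    With only a handful of studies, though, such a plot can at best hint at selective reporting.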

  6. Andrew suggested I post here a note I sent him. Here it is (edited & expanded):

    I don’t think the distinction you have made, a few times, between forking and p-hacking is as meaningful as you think it is. The reason is that either forking is the same thing as p-hacking, or it is inconsequential behavior that readers are likely to misconstrue as consequential given your framing.

    Specifically, the distinction you make is that with forking an analyst may run just one analysis, but that one analysis is data-dependent, and that data-dependency may nevertheless inflate the false-positive rate (which of course also generates bias, leads to wrong posteriors, leads to wrong Bayes factors, leads to type-M errors, leads to type-S errors, etc.).

    But I think that this idea of forking being a problem rests on a big implicit assumption, and once it is made explicit, we see that (1) forking = p-hacking, and that (2) the term ‘forking’ may mislead readers.

    The implicit assumption is that data-based decisions (forking) are made in a way that systematically increases the odds of finding something.

    If, contrary to that implicit assumption, we fork in ways that are blind to the expected results, nothing bad happens.

    For example, if a researcher controls for gender when the number of observations is even, but does not when it is odd, this will not generate any problems. The reported results can be interpreted at face value. This will be forking, but it will be inconsequential forking.
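
    A quick simulation sketch bears this out (hypothetical setup, no true effect): the rule “control for gender if and only if n is even” is forking, but because it is blind to the results, the false-positive rate stays at its nominal level.

    ```r
    set.seed(3)
    one_run <- function() {
      n <- sample(50:51, 1)              # number of observations, even or odd at random
      x <- rbinom(n, 1, 0.5)             # treatment
      g <- rbinom(n, 1, 0.5)             # gender
      y <- rnorm(n)                      # null outcome
      fit <- if (n %% 2 == 0) lm(y ~ x + g) else lm(y ~ x)   # results-blind fork
      summary(fit)$coefficients["x", 4]                      # p-value for the treatment
    }
    mean(replicate(5000, one_run()) < 0.05)   # stays close to the nominal 0.05
    ```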

    If, in contrast, we fork in a way that is not blind to expected results, for example, controlling for gender only after eyeballing the data and thinking it may help us, or dropping an outlier in the control condition after noticing it is in the ‘wrong’ direction, then we are really just p-hacking. Instead of p-hacking by typing lm(y~x), we are p-hacking by thinking. It seems unnecessary to specify the locus of p-hacking (computer vs. brain) to understand its consequences or solutions.

    You have mentioned that one thing you don’t like about “p-hacking” is that it sounds intentional. You are not alone in that; it’s the main drawback of the term. We have said ad nauseam that p-hacking does not require the intention to mislead, but to some it still sounds like an accusation. That’s a shortcoming. But, while p-hacking has this shortcoming as a term, I don’t know of a better one.

    What I don’t like about forking, as a term, is that it gives the wrong impression that *any* uncertainty in how data should be analyzed that is resolved after the data are obtained is problematic or consequential. That is not true. Only decisions that are made based on the results they would generate produce problems. Only forking that is equivalent to p-hacking is bad; blind forking is fine.

    Which is why I have long thought that either forking=p-hacking, or forking=inconsequential behavior.

    • So would it be fair to summarize as, “forking is bad when it is done for the purpose of p-hacking?”

      I think this position is reasonable, since forking per se is actually a good thing. By that, I mean that one’s analyses should always be suited to the particulars of a situation and that we should be free to use things discovered in the data to direct us to new ways of looking at the data. As Andrew says, this is his approach too.

      I was trying to think of inherent problems with forking, but couldn’t really come up with any. Instead, the problem is when someone presents their final analysis as if that was the only possible choice, but that’s a problem because a) the person is being dishonest; and b) it doesn’t help future researchers who will have to face similar choices. Issue (a) is a flaw of personality or expected communication style. Issue (b) might be resolved with “open lab notes”, but I don’t think so; presenting every false step might give insight into someone’s moment-by-moment thinking, but would be difficult to slog through.

      Instead, I think a reasonable way to avoid those problems with forking is to say, “this is the analysis that we think best conveys the aspects of the data that are relevant for the questions we had…here are some examples of where we would have made different choices had certain aspects of the data been different or if we had been trying to address different questions…” Of course, this is essentially a text representation of an encompassing hierarchical model as Andrew describes—the model should be constructed to make clear the paths that you think are reasonable, while still conveying why one path turned out to be best.

  7. Why is preregistering hypotheses important for Bayesians? Let’s look at four simplified examples:

    1. Suppose John wins a seemingly fair lottery with a million tickets. My conclusion, on the basis of this alone, is that the lottery was rigged in favor of John because P(John won|Rigged lottery in favor of John, I) = 1, but P(John won|fair lottery, I) = 10^-6, where “I” is my background knowledge about the lottery. Nobody should take this seriously. The prior probability P(Rigged lottery|I)*P(Rigged lottery in favor of John|Rigged lottery, I) is very low. John winning has a practically negligible effect on the posterior in absolute terms. (Frequentists have their reasons to reject this “test” too, of course.)

    2. But now suppose I had “preregistered” my hypothesis. Even better, I had a plausible theory: John’s best friend is organizing the lottery, they were both involved in corruption before, John did buy a ticket for this lottery, and leaked text messages indicate that they were planning to rig the lottery. John wins the lottery and my prediction is confirmed. My hypothesis and my test are no longer nonsensical. It seems, then, that preregistration is very important. But all the effect here comes from the higher prior probability P(Rigged lottery in favor of John|J, I), where I is the same background information we had in example 1 and J is the extra background information favoring the rigged lottery hypothesis.

    3. What if my “preregistration” is made entirely at random? I can simulate a fair lottery using a random number generator and then preregister the result as my prediction. No forking paths! If the prediction is confirmed, we have an extraordinary coincidence, but my conclusion is as plausible as it was in the first example (let’s suppose, for the sake of argument, that there’s no way the lottery could be influenced by my simulation). The fact that my analysis was data-independent is irrelevant; the plausibility of the conclusion would be the same if I had “p-hacked” (case 1): P(Rigged lottery in favor of John|I selected the hypothesis at random, I) = P(Rigged lottery in favor of John|I)

    So, in this simplified example, the only potential benefit of preregistration is that it makes us think harder about our hypotheses, so that the hypotheses we’re testing are not crazy but somewhat plausible (rough numbers for examples 1 and 2 are sketched below). But if the preregistered hypotheses *are* generated entirely at random, they will mostly be false. Similarly, there’s nothing intrinsically wrong with data-dependent analyses.

    From a Bayesian perspective, how is real preregistration different from this toy example? Can preregistration of hypotheses be useful for reasons other than “on average, we expect preregistered hypotheses to be more plausible”?
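
    To put rough numbers on examples 1 and 2 (the priors below are made up for illustration; only the 10^-6 likelihood comes from the million-ticket lottery):

    ```r
    # P(rigged | John won) by Bayes' rule, with P(John won | rigged) = 1
    # and P(John won | fair) = 1e-6; the priors passed in are illustrative guesses.
    posterior_rigged <- function(prior) prior / (prior + (1 - prior) * 1e-6)
    posterior_rigged(1e-10)   # example 1: skeptical prior -> posterior ~ 1e-4, still negligible
    posterior_rigged(0.05)    # example 2: corruption evidence -> posterior ~ 0.99998
    ```

    John’s win multiplies the prior odds by a factor of a million in both cases; the difference between the examples is entirely in the prior.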

    • Oops, I accidentally deleted example 4:

      “4. Finally, just like in the first example, imagine I don’t preregister my hypothesis. Instead, I see that John won and decide to do some “data exploration”. In my investigation, I find out that John’s best friend is organizing the lottery, that they were both involved in corruption before, and that leaked text messages indicate they were planning to rig the lottery. I therefore conclude post hoc that the lottery was rigged in favor of John. This hypothesis seems to be just as plausible as the preregistered one in the second example. The “posterior” (i.e. probability of the hypothesis conditioned on total evidence) is the same, preregistration makes no difference.”
