
A more formal take on the multiverse

You’ve heard of multiverse analysis, which is an attempt to map out the garden of forking paths. Others are interested in this topic too. Carol Nickerson pointed me to this paper by Jan Wacker with a more formal version of the multiverse idea.


  1. Dale Lehman says:

    The “cooperative forking path analysis” that the paper describes is a welcome step in the right direction. It will work better in some fields than others. However, in most social science research, it is likely to be severely limited in applicability. It would seem to require that the subject to be studied is known far enough in advance to permit the cooperative effort to be designed. Much research in economics is prompted by evolving events – there is insufficient lead time to plan this approach, and studies are likely to be conducted “on the fly” to address current issues. Add to this the pernicious effects of poor incentives – often the work will benefit an interested party, so there is a payoff to publishing work that might not stand up to the more rigorous standards proposed by cFPA. As a result, I don’t see studies like the impact of gun control laws on gun violence, the impacts of eliminating net neutrality rules, the impacts of removing the individual mandate from the ACA, etc., as being good candidates for cFPA. If this approach is to be successful, it would require agreement on the particular issue to be studied and arrangement of the cooperative effort. I just can’t see this applying widely in the social sciences (though I would be happy if it did).

    • Jerrod says:

      I think a potentially better idea (mainly because of its lower burden on the researcher) is just to have built-in replication via splitting a large data set into multiple data sets (akin to train and test sets). The data sets many economists are using these days are enormous. Chetty uses millions of linked social security and tax records in some of his more recent analyses. There is no reason why one couldn’t build the model (in which the researcher is likely to be looking for anything significant) on a few million observations and then run the same model on another few million observations to see if the trained model holds. Pretty much anything involving health data, tax data, employment data, social security data, etc. (basically most administrative micro data an economist would come by) would be amenable to this process. Why this is not done is beyond me.
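      The splitting idea can be sketched with simulated data (a toy illustration, not any particular study or dataset): fit one specification on half of a large sample, then re-estimate the same specification on the untouched half and see whether the estimate holds up.

```python
# Toy illustration (simulated data, not any particular study): fit one
# specification on an "exploration" half, then re-estimate the same
# specification on the untouched half and compare the two estimates.
import random
import statistics

random.seed(0)

# Simulated "administrative" data: outcome depends linearly on one covariate.
n = 20_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 + 0.5 * xi + random.gauss(0, 1) for xi in x]

def ols_slope(xs, ys):
    """Slope coefficient from a one-regressor OLS fit."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den

half = n // 2
slope_train = ols_slope(x[:half], y[:half])  # exploration half
slope_test = ols_slope(x[half:], y[half:])   # confirmation half
print(round(slope_train, 2), round(slope_test, 2))
```

      With a few million observations per half, as in the administrative data sets mentioned above, the two estimates would agree (or fail to) with very little sampling noise.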

      • Anoneuoid says:

        built-in replication via splitting a large data set into multiple data sets (akin to train and test sets)

        The purpose of a replication is to check that you understand the circumstances/etc (Fisher called them “experimental conditions”) well enough for someone else to repeat your work and get similar results. The same people just splitting up a dataset does not help us there. Checking the predictive skill of a model on future data is also an important thing, but that is not the purpose of a replication.

        I don’t know where this desire comes from to not perform such a simple and basic part of the scientific method, but it is very, very bad. It seems to be widespread, too: I’ve seen entire review articles on all the “top replications” from some fields that somehow don’t mention a single actual replication. Then there were the “prediction market” people who outright said that researchers “don’t like replications”. People who feel that way shouldn’t be doing science…

        • Anoneuoid says:

          I was feeling lazy, but decided to fix that. I was referring to these papers. Here is the number one replication:

          Significant and substantial genetic influence on individual differences in psychological traits is so widespread that we are unable to name an exception. The challenge now is to find any reliably measured behavioral trait for which genetic influence is not significantly different from zero in more than one adequately powered study.

          You see, this has nothing to do with any specific method, and therefore doesn’t demonstrate any replication has ever been performed. All it shows is that “everything is correlated with everything else”, something we already know.

          Then here is the admission that researchers don’t like doing replications:

          Apart from rigorous replication of published studies, which is often perceived as unattractive and therefore rarely done, there are no formal mechanisms to identify irreproducible findings.

          • Jerrod says:

            “The same people just splitting up a dataset does not help us there.” I disagree. One of the things the whole “replication crisis” has brought to the fore is that researchers get a data set and p-hack/fork paths/whatever the data to death, such that when someone runs the analysis on a different data set, the results fail to replicate. But it doesn’t matter if it is a different researcher or the same researcher. It’s not that they can’t replicate the experiment or analysis. It’s that when they do (if they do), they can’t replicate the results.

            Right now you have researchers reporting the results of one (maybe two) models built on one data set. It would be a large improvement over current practice if researchers with a large enough data set were required to report the results of that model over two or more data sets, instead of the results of one model on one data set. Whether or not you want to call that replication, I really don’t care.

  2. Frank says:

    This is definitely something that people do. Here’s one recent example:

    Of course even with a hold-out sample, there is the potential for forking paths. What if all the models I fit perform poorly on the hold-out data? Clearly I would want to consider alternative models, but if I use the same hold-out sample to evaluate these new models, then I’m just data snooping at one remove.
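    The snooping risk is easy to demonstrate with simulated pure-noise predictors (a toy sketch; nothing here refers to a real dataset): selecting the “best” of many candidates on the training half inflates the in-sample fit, one honest look at the hold-out deflates it, and re-selecting on the hold-out itself reintroduces the inflation.

```python
# Toy sketch of snooping "at one remove": with pure-noise predictors, the
# best-looking candidate on the training half shines in-sample, one honest
# look at the hold-out deflates it, and re-selecting on the hold-out
# itself brings the inflation back.
import random

random.seed(1)
n, k = 400, 50  # observations, candidate pure-noise predictors
y = [random.gauss(0, 1) for _ in range(n)]
X = [[random.gauss(0, 1) for _ in range(n)] for _ in range(k)]

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    da = sum((ai - ma) ** 2 for ai in a) ** 0.5
    db = sum((bi - mb) ** 2 for bi in b) ** 0.5
    return num / (da * db)

half = n // 2
train_corrs = [abs(corr(xj[:half], y[:half])) for xj in X]
best = max(range(k), key=lambda j: train_corrs[j])

in_sample = train_corrs[best]                     # inflated by selection
out_sample = abs(corr(X[best][half:], y[half:]))  # one honest look
snooped = max(abs(corr(xj[half:], y[half:])) for xj in X)  # reuse = re-snoop
print(round(in_sample, 2), round(out_sample, 2), round(snooped, 2))
```

    By construction the re-selected (“snooped”) hold-out correlation can never be smaller than the honest one: it is a maximum over all candidates, including the honestly chosen one.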

    • Frank says:

      I was replying to Jerrod above, but my comment didn’t appear in the right place. I must have clicked the wrong link…

      • Jerrod says:

        Thanks for the reference. I know that *some* researchers do this, but my impression is that it is far from standard practice (e.g., it was not taught in PhD econometrics courses as of 2016). “What if all the models I fit perform poorly on the hold-out data?” If you are fitting multiple models on the hold-out data rather than just the one you favored from the training set, then I think you’re doing it wrong. (Though maybe that’s the way to go: split the data into multiple data sets, apply a few of the same models to all those data sets, and report the results. Not perfect by any stretch of the imagination, but certainly no worse than the status quo, and most likely better.) The idea is that researchers have lots of degrees of freedom in building their models because they fit several models to the same data. You know that will happen, so let the researcher do that on the training set and then apply their “preferred model” to the hold-out data. I’m not saying that this will solve all the methodological problems, and we’d probably still only see papers that had significant results, but at least they would have significant results over more than one sample. Perfect solution? No. Better than the status quo at very little burden on the researcher? For sure.
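        The “same models on several splits” variant could look like this (a toy simulation; the specification and data are made up): estimate one fixed model separately on disjoint subsamples and report every estimate, not just a favored one.

```python
# Toy simulation of the "same model on several splits" idea: estimate one
# fixed specification separately on disjoint subsamples and report every
# estimate, not just a favored one. Data and model are made up.
import random
import statistics

random.seed(2)
n = 30_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + 0.3 * xi + random.gauss(0, 1) for xi in x]

def ols_slope(xs, ys):
    """Slope coefficient from a one-regressor OLS fit."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return num / sum((a - mx) ** 2 for a in xs)

k = 3  # number of disjoint subsamples
size = n // k
estimates = [
    ols_slope(x[i * size:(i + 1) * size], y[i * size:(i + 1) * size])
    for i in range(k)
]
print([round(e, 2) for e in estimates])  # one estimate per subsample
```

        Reporting all three estimates side by side makes it obvious when a result is fragile across subsamples, at essentially no extra cost to the researcher.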

        • Frank says:

          I’m not sure if it’s standard practice or not: I’ve taught it in my PhD econometrics course for a while, but I don’t know how widespread this is.

          Regarding using the hold-out data twice, all I was trying to say is that “let them do that on their training sample” still requires us to trust researchers not to cheat. It’s just that in this case, the cheating would be at one remove. (We are totally in agreement that if you use the hold-out sample twice, you’re doing it wrong.)

          When you have a huge dataset, there may be very little loss from using a hold-out sample. But in more modest datasets, you’re certainly paying a cost in terms of precision. Of course one may be willing to do this to get valid inference, but there’s some kind of trade-off here. There’s some interesting recent research that explores this issue:


          Item (1) talks about a setting in which a Bayesian could find it optimal to use a hold-out sample, even though “traditional” Bayesian reasoning would suggest that one should always condition on the full set of available data. Item (2) makes precise the sense in which “data splitting” or using a hold-out sample is sub-optimal, and proposes a method called “selective inference” that allows one to adjust for the double use of a single dataset for model selection and subsequent inference in exponential families.
