The garden of forking paths

Bert Gunter points us to this editorial:

So, researchers using these data to answer questions about the effects of technology [screen time on adolescents] need to make several decisions. Depending on the complexity of the data set, variables can be statistically analysed in trillions of ways. This makes almost any pattern of results possible. As a result, studies have suggested both the existence of and the lack of an association between screen time and well-being, even when analysing the same data set. Naturally, it’s the research that highlights possible dangers that receives the most public attention and helps to set the policy agenda.

It’s the multiverse. Good to see people recognizing this. As always, I think the right way to go is not to apply some sort of multiple comparison correction or screen for statistical significance or preregister or otherwise choose some narrow subset of results to report. Instead, I recommend studying all comparisons of interest using a multilevel model and displaying all these inferences together, accepting that there will be uncertainty in conclusions.
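As a rough illustration of what partial pooling does to a set of comparisons, here is a minimal sketch. It is not necessarily the model the post has in mind. It assumes a simple normal-normal setup with known, positive standard errors and a plug-in method-of-moments estimate of the between-comparison variance in place of a full posterior. The function name `partial_pool` and the example numbers are hypothetical.

```python
from statistics import mean, variance

def partial_pool(estimates, std_errors):
    """Shrink raw estimates toward their common mean (normal-normal model).

    Simplifying assumptions: standard errors are known and positive, and
    the between-comparison variance tau^2 is a plug-in method-of-moments
    estimate rather than a full posterior.
    """
    mu = mean(estimates)
    # Variation in the estimates beyond what sampling noise alone explains.
    tau2 = max(0.0, variance(estimates) - mean(se ** 2 for se in std_errors))
    pooled = []
    for y, se in zip(estimates, std_errors):
        # weight = 0 means complete pooling, weight = 1 means no pooling.
        weight = tau2 / (tau2 + se ** 2)
        pooled.append(mu + weight * (y - mu))
    return mu, tau2, pooled

# Hypothetical example: two noisy estimates, each with standard error 1.
mu, tau2, pooled = partial_pool([1.0, 3.0], [1.0, 1.0])
print(mu, tau2, pooled)  # 2.0 1.0 [1.5, 2.5]
```

All comparisons are reported, each pulled toward the common mean by an amount that depends on how noisy it is. A full multilevel model would also carry the uncertainty in the mean and in tau^2 through to the final intervals.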


  1. Anoneuoid says:

    This study looks like mathemagical thinking to me:

    The authors examine three key large-scale data sets, two from the United States and one from the United Kingdom, that include information about teenager well-being, digital-technology use and a host of other variables. Instead of running one or a handful of statistical analyses, they run all theoretically plausible analyses (combinations of dependent and independent variables, with or without co-variates) — in the case of one data set, more than 40,000. This allows the authors to map how the association between digital-technology use and well-being can vary — from negative to non-significant to positive — depending on how the same data set is used.

    Three hundred and seventy-two justifiable specifications for the YRBS, 40,966 plausible specifications for the MTF and a total of 603,979,752 defensible specifications for the MCS were identified.
    After noting all specifications, the result of every possible combination of these specifications was computed for each dataset. The standardized β-coefficient for the association of technology use with well-being was then plotted for each specification.

    The coefficients of the arbitrary models they are talking about are meaningless. The coefficients only mean something if the model is specified correctly, which they must admit is not the case or else they wouldn’t be trying out all these different specifications. I wouldn’t even think it is likely that any single one of these hundreds of millions of models is the “correct specification”, since the variables are limited to whatever data was available.

    How can taking the average of 600 million meaningless values result in a meaningful value?
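To make the quoted procedure concrete, here is a minimal sketch of how a set of specifications can be enumerated as combinations of outcome, predictor, and covariate choices. The variable names are hypothetical and are not taken from the YRBS, MTF, or MCS data sets; a real analysis would fit one model per specification and record its standardized coefficient.

```python
from itertools import product

# Hypothetical variable names, for illustration only.
outcomes = ["wellbeing_scale_a", "wellbeing_scale_b", "self_esteem"]
predictors = ["tv_hours", "social_media_hours", "gaming_hours"]
# Two options: no covariates, or a fixed set of covariates.
covariate_sets = [(), ("age", "sex", "household_income")]

# One specification per combination of the three choices.
specifications = [
    {"outcome": y, "predictor": x, "covariates": c}
    for y, x, c in product(outcomes, predictors, covariate_sets)
]

print(len(specifications))  # 18
```

The count is the product of the number of options at each decision point, which is why a handful of seemingly modest choices can reach the tens of thousands or hundreds of millions.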

  2. Andrew,

    I tried to locate the thread in which you commented on Samuel Huntington. I did not have all that much contact with him really. I recall him from when I was a child, and I recognized several of his colleagues in his book Clash of Civilizations. Nor had I known that Huntington even knew Serge Lang. Lang’s book Challenges chronicles Huntington’s assignment of probabilities to political events in South Africa. It’s worth a read.

  3. Stuart Buck says:

    What exactly does it mean to put all comparisons in the same model if there are actually “trillions” of possibilities?

    • Andrew says:


      It would depend on the example, but the short answer is that it would make sense to build these options into the model rather than to consider them as discrete alternatives. I think of the multiverse as more of a conceptual demonstration than a recommendation for any particular applied analysis.

    • Choices based on noisy data are risky – so reduce the number of data-based choices as much as you can.

      So place your bet, make some choices and hope to win (well, not lose too badly) and partially pool all back towards a safe? bet.

      But you can’t doubt everything at once and should not expect to learn anything reliable from a single study*.

      * Perhaps an exception – Intravenous Immunoglobulin Therapy for Streptococcal Toxic Shock Syndrome—A Comparative Observational Study. Kaul et al.

      I was originally pushed off the project by statisticians who argued they could find the best out of 2^k possible adjusted effect models for k possible confounders and investigators should only report that. Reviewers stepped on that and I was invited back on and they left.

      I reported the minimum effect estimate. “Because of uncertainty about optimal model selection given the small sample size, we looked at the estimated IVIG effect for all possible models. The minimum estimated odds ratio for survival associated with an IVIG effect survived for was 4.3.” So finding the best adjustment was not even necessary given low cost and fairly well understood side-effects – almost any effect was worth adopting. Of course with one observational study – no one really knows…

    • Suppose you have 40 different things you can in theory estimate. Then you have 2^40 possible different subsets of these 40 things you could choose to include in your estimates. That’s about 1 trillion different subsets you could choose from…

      Or you could just estimate all of them.

      Now, comparing two things, you could make choose(40,2) = 780 different comparisons between two things… Second order comparisons (is a-b dramatically different from c-d) could be done in choose(780,2) = 303,810, or around 304,000, different ways.

      All of these things can be handled, correctly, by computing the posterior distribution over all 40 items.
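The counts in this comment can be checked directly. A short sketch using only the Python standard library (the variable names are illustrative):

```python
from math import comb

n_items = 40

# Number of subsets of the 40 estimable quantities (including the empty set).
n_subsets = 2 ** n_items

# Pairwise comparisons between two of the 40 quantities.
n_pairwise = comb(n_items, 2)

# Second-order comparisons: pairs of pairwise differences.
n_second_order = comb(n_pairwise, 2)

print(n_subsets)       # 1099511627776
print(n_pairwise)      # 780
print(n_second_order)  # 303810
```

So "about 1 trillion" subsets is roughly 1.1 trillion, and the second-order count is a little under 304,000.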

  4. Michael Nelson says:

    Maybe this has already been thought of (and possibly dismissed for good reason), but with access to what amounts to a “population” of comparisons, seems like you could report the probability of obtaining the observed number of effects greater than size d under the assumption that all effects are actually less than d, where d is the minimum meaningful/practical effect (based on the literature or practical consequences, for example). Ideally, after computing all the comparisons, you could conclude something like, “We observed 90 comparisons with effects greater than d=.4, which should only occur 2% of the time if none of the effects of the 10000 possible comparisons were truly greater than .4 in the population.” Crucially, you’re not reporting the results of a NHST, you’re reporting a global p-value under an assumption of no effects being greater than a meaningful threshold.

    In fact, you’re not limited to reporting a single p-value. If you define D as a matrix of the effect sizes actually observed along with the number of comparisons that yielded an effect at least as large, then you can report a semi-continuous probability distribution across the range of D reflecting the probability of observing that number of effects at each level, under the assumption that none of the effects are greater than d among all of the comparisons. The drawback to this analytical approach would be that you could never say anything more certain than “Hey, guys, something important is very likely in this data set!” The benefit is that you’re using all of the data simultaneously and reporting all of your results concisely, while drawing conclusions that depend on only a small number of assumptions (i.e., you defined the minimum meaningful effect correctly, you’ve correctly specified the distribution of observed effects given none of them are real, you’ve accounted for the plausible ranges/ceilings of effect sizes, you’ve accounted for multicollinearity that renders some comparisons redundant through removal, composites, or weighting/pooling).
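One way to sketch the "global p-value" idea is as a binomial tail probability. This is only a simplified special case, and it leans on strong assumptions that the comment itself flags as needing care: the comparisons are treated as independent, every estimate is assumed to have the same known standard error, and every true effect is assumed to equal a single value `delta_null` below d. The function names and the numbers in the usage example are hypothetical.

```python
from math import erfc, exp, lgamma, log, sqrt

def normal_upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * erfc(z / sqrt(2.0))

def binomial_upper_tail(k, m, p):
    """P(X >= k) for X ~ Binomial(m, p), summed term by term in log space."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for x in range(k, m + 1):
        log_pmf = (lgamma(m + 1) - lgamma(x + 1) - lgamma(m - x + 1)
                   + x * log(p) + (m - x) * log(1.0 - p))
        total += exp(log_pmf)
    return min(total, 1.0)

def global_p_value(n_exceed, n_comparisons, d, std_error, delta_null=0.0):
    """Probability of at least n_exceed estimates above d.

    Assumes independent comparisons, a common known standard error, and a
    common true effect delta_null (an assumption, not part of the comment).
    """
    p_single = normal_upper_tail((d - delta_null) / std_error)
    return binomial_upper_tail(n_exceed, n_comparisons, p_single)

# Hypothetical inputs: 90 of 10000 estimates exceed d = 0.4.
print(global_p_value(90, 10000, d=0.4, std_error=0.15))
```

The result is very sensitive to the assumed standard error and to `delta_null`, and correlated comparisons would violate the binomial assumption, so this should be read as an illustration of the calculation rather than a recommended procedure.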

    • Michael Nelson says:

      I should say that this is very similar to reporting the false discovery rate, except that you’re describing the probability that any of the observed effects is truly meaningful, whereas I believe the FDR describes the probability that a particular effect observed to be statistically significant is truly not zero. So maybe what I’m proposing is really just telling people who are already reporting the FDR to stop drawing conclusions that are meaningless (there are no null effects) and arbitrary (there’s nothing magical about any alpha level) and selective (don’t just report significant comparisons).

    • Andrew says:


      This all seems a bit tricky and indirect to me. I’d rather just attack the decision problem directly, or give inferences for everything, than try to make hard decisions based on whether an effect size exceeds some threshold, as that does not seem to correspond to real losses and gains.

    • Anoneuoid says:

      It will seem to work great until you add a new feature or specification that switches d from positive to negative. Then how will you explain such a sudden reversal to the layperson who has been listening to your model and making decisions based on the idea d is greater than 0.4? Progress will become your enemy.
