The garden of 603,979,752 forking paths

Amy Orben and Andrew Przybylski write:

The widespread use of digital technologies by young people has spurred speculation that their regular use negatively impacts psychological well-being. Current empirical evidence supporting this idea is largely based on secondary analyses of large-scale social datasets. Though these datasets provide a valuable resource for highly powered investigations, their many variables and observations are often explored with an analytical flexibility that marks small effects as statistically significant . . . we address these methodological challenges by applying specification curve analysis (SCA) across three large-scale social datasets . . . to rigorously examine correlational evidence for the effects of digital technology on adolescents. The association we find between digital technology use and adolescent well-being is negative but small, explaining at most 0.4% of the variation in well-being. Taking the broader context of the data into account suggests that these effects are too small to warrant policy change.

They continue:

SCA is a tool for mapping the sum of theory-driven analytical decisions that could justifiably have been taken when analysing quantitative data. Researchers demarcate every possible analytical pathway and then calculate the results of each. Rather than reporting a handful of analyses in their paper, they report all results of all theoretically defensible analyses . . .
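To make the idea concrete, here is a minimal sketch of a specification curve in Python, run on simulated noise. Everything in it is an assumption for illustration: the variable names, the two outcomes, the two predictors and the single covariate are invented, and this is not Orben and Przybylski's code or the procedure from the methods paper. It only shows the basic move of enumerating every combination of analytic choices and computing the estimate for each.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated data (hypothetical names): one optional covariate,
# two technology-use measures, two well-being measures.
covariate = rng.normal(size=n)
predictors = {
    "tv_hours": rng.normal(size=n),
    "phone_hours": rng.normal(size=n),
}
outcomes = {
    "life_satisfaction": rng.normal(size=n),
    "self_esteem": rng.normal(size=n),
}

def slope(y, x, controls):
    """Least-squares coefficient on x when regressing y on [1, x, controls]."""
    X = np.column_stack([np.ones(len(x)), x] + controls)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

# One specification = (outcome, predictor, include covariate or not).
specs = []
for (y_name, y), (x_name, x), use_cov in itertools.product(
    outcomes.items(), predictors.items(), [False, True]
):
    controls = [covariate] if use_cov else []
    specs.append((y_name, x_name, use_cov, slope(y, x, controls)))

# The "curve": all specifications, sorted by estimated slope.
specs.sort(key=lambda s: s[3])
for y_name, x_name, use_cov, b in specs:
    print(f"{y_name:>18} ~ {x_name:<12} covariate={use_cov!s:<5} slope={b:+.3f}")
```

With 2 outcomes, 2 predictors and 2 covariate choices this gives 8 specifications; the counts quoted below come from the same kind of product, just with far more options per choice.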

Here’s the relevant methods paper on specification curve analysis, by Uri Simonsohn, Joseph Simmons, and Leif Nelson, which seems similar to what Sara Steegen, Francis Tuerlinckx, Wolf Vanpaemel and I called multiverse analysis.

It makes sense that a good idea will come up in different settings with some differences in details. Forking paths in methodology as well as data coding and analysis, one might say.

Anyway, here’s what Orben and Przybylski report:

Three hundred and seventy-two justifiable specifications for the YRBS, 40,966 plausible specifications for the MTF and a total of 603,979,752 defensible specifications for the MCS were identified. Although more than 600 million specifications might seem high, this number is best understood in relation to the total possible iterations of dependent (six analysis options) and independent variables (2^24 + 2^25 – 2 analysis options) and whether co-variates are included (two analysis options). . . . The number rises even higher, to 2.5 trillion specifications, for the MCS if any combination of co-variates (2^12 analysis options) is included.

Given this, and to reduce computational time, we selected 20,004 specifications for the MCS.

I love it that their multiverse was so huge they needed to drastically prune it by only including 20,000 analyses.
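The quoted count for the MCS can be checked by hand, taking the independent-variable options to be 2^24 + 2^25 - 2 and multiplying by the six dependent-variable options and the two covariate options. The variable names below are mine, and the last line is only my reading of where the 2.5 trillion figure comes from, not something the paper spells out in the passage quoted here.

```python
dependent_options = 6                    # well-being measures
independent_options = 2**24 + 2**25 - 2  # combinations of technology-use items
covariate_options = 2                    # covariates in or out

total = dependent_options * independent_options * covariate_options
print(f"{total:,}")  # 603,979,752

# Assumption: allowing any subset of 12 covariates multiplies the count by 2**12.
with_covariate_subsets = total * 2**12
print(f"{with_covariate_subsets:,}")  # about 2.47 trillion
```

The first product matches the 603,979,752 in the title exactly; the second lands close to the quoted 2.5 trillion.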

How did they choose this particular subset?

We included specifications of all used measures per se, and any combinations of measures found in the previous literature, and then supplemented these with other randomly selected combinations. . . . After noting all specifications, the result of every possible combination of these specifications was computed for each dataset.
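As a rough sketch of that selection rule, and not the authors' actual procedure, one could take every measure on its own, add the combinations used in earlier papers, and then top up with random combinations until a target count is reached. The item names, the "literature" combinations and the target of 50 are all invented for illustration.

```python
import random

random.seed(1)

# Hypothetical pool of ten questionnaire items.
measures = [f"item_{i}" for i in range(1, 11)]

# 1. Every measure used on its own.
selected = {frozenset([m]) for m in measures}

# 2. Combinations found in previous literature (invented here).
literature = [("item_1", "item_2"), ("item_3", "item_4", "item_5")]
selected |= {frozenset(combo) for combo in literature}

# 3. Supplement with randomly chosen combinations up to a target size.
#    The set silently ignores duplicates, so the loop keeps drawing
#    until exactly `target` distinct combinations are held.
target = 50
while len(selected) < target:
    k = random.randint(2, len(measures))
    selected.add(frozenset(random.sample(measures, k)))

print(len(selected))  # 50
```

The point of the sketch is that the random top-up is the only part that does not depend on earlier researchers' choices, so the resulting subset still leans toward specifications that have already appeared in print.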

I wonder if they could’ve found even more researcher degrees of freedom by considering rules for data coding and exclusion, which is what we focused on in our multiverse paper. (I’m also thinking of the article discussed the other day that excluded all but 687 out of 5342 observations.)

Ultimately I think the right way to analyze this sort of data is through a multilevel model, not a series of separate estimates and p-values.

But I do appreciate that they went to the trouble to count up 603,979,752 paths. This is important, because I think a lot of people don’t realize the weakness of many published claims based on p-values (an issue we discussed in a recent comment thread here, when Ethan wrote: “I think lots of what’s discussed on this blog and a cause of common lay errors in probability comes down to, ‘It’s tempting to believe that you can’t get all of this just by chance, but you can.'”).


  1. Ethan Bolker says:


    I’d be prouder to be quoted by you if you weren’t quoting me quoting you…


  2. Blake says:

    If I’m fitting a Bayesian multi-level model with an optional quantity of data (1 year, 5 years, 10 years, etc. of observations), and I have the option to include all of the data or less, does it ever make sense to include less data, say, 1 year instead of 5?

  3. Bobbie says:

    Reminds me of a paper called: 636,120 Ways to Have Post-traumatic Stress Disorder. (Galatzer-Levy & Bryant, 2013) Because that’s what clinical psych diagnoses allow.

  4. jim says:

    ‘It’s tempting to believe that you can’t get all of this just by chance, but you can.’

    with 603,979,752 separate statistical analyses, indeed you must.

  5. Anonymous says:

    Quote from above: “Ultimately I think the right way to analyze this sort of data is through a multilevel model, not a series of separate estimates and p-values.”

    My guess is that the results of this sort of data will be (attempted to be) “crowdsourced”.

    If I understood things correctly, “crowdsourcing” is currently being used for situations where a small group of people get all the power, and benefits, and the large group of people “help” them achieve this all because “it’s good for science” and because “we have to change the incentives”.

    I fear some folks are very close, if not already there, to proposing that a “crowdsourced” group will vote which analysis to put forward as “the best one”. Science!

    Perhaps this is one of the last steps in the possible politicization of (parts of) (social) science…

  6. John says:

    “Ultimately I think the right way to analyze this sort of data is through a multilevel model, not a series of separate estimates and p-values.”

    I’m curious if anyone has seen a paper that deals with analytic flexibility (including covariates coded in different ways, or not at all, etc) using the multi-level approach Andrew favors. This framework makes a lot of sense to me when you have a bunch of similar things you want to apply the same model to (e.g., states in a nation) but not when you have a lot of models you want to apply the same data to.
