Cristobal Young and Sheridan Stewart write:

Social scientists face a dual problem of model uncertainty and methodological abundance. . . . This ‘uncertainty among abundance’ offers spiraling opportunities to discover a statistically significant result. The problem is acute when models with significant results are published, while those with non-significant results go unmentioned. Multiverse analysis addresses this by recognizing ‘many worlds’ of modeling assumptions, using computational tools to show the full set of plausible estimates. . . . Our empirical cases examine racial disparity in mortgage lending, the role of education in voting for Donald Trump, and the effect of unemployment on subjective wellbeing. Estimating over 4,300 unique model specifications, we find that OLS, logit, and probit are close substitutes, but matching is much more unstable. . . .

My quick thought is that the multiverse is more conceptual than precise. Or, to put it another way, I don’t think the multiverse can ever really be defined. For example, in our multiverse paper we considered 168 possible analyses, but there were many other researcher degrees of freedom that we did not even consider. One guiding idea we had in defining the multiverse for any particular analysis was to consider other papers in the same subfield. Quite often, if you look at different papers in a subfield, or different papers by a single author, or even different studies in a single paper, you’ll see alternative analytical choices. So these represent a sort of minimal multiverse. This has some similarities to research in diplomatic history, where historians use documentary evidence to consider what alternative courses of action might have been considered by policymakers.
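To make the combinatorial flavor concrete, here is a minimal sketch of enumerating a multiverse as the Cartesian product of analytic choices. The choice dimensions below are hypothetical, not the 168 analyses from our paper:

```python
from itertools import product

# Hypothetical researcher degrees of freedom; each dimension lists
# alternative analytic choices one might plausibly defend.
choices = {
    "outcome": ["raw", "log"],
    "covariates": ["minimal", "full"],
    "estimator": ["ols", "logit", "probit"],
    "exclusions": ["none", "drop_outliers"],
}

# The multiverse is the Cartesian product of the choice dimensions.
multiverse = [dict(zip(choices, combo)) for combo in product(*choices.values())]
print(len(multiverse))  # 24 = 2 * 2 * 3 * 2 specifications
```

Each element of `multiverse` is one fully specified analysis; fitting all of them and plotting the resulting estimates is the “full set of plausible estimates” idea in miniature. The point above stands, though: any such enumeration is only a minimal multiverse, since the dimensions you did not think to list are missing by construction.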

Also, regarding that last bit above on matching estimation, let me emphasize, following Rubin (1970), that it’s not matching or regression, it’s matching and regression (see also here).

Unless the model is correctly specified, the parameters/coefficients are arbitrary.

That’s why it needs to be derived from a set of agreed-upon assumptions if you want to interpret it, rather than just use the model to predict something.

There’s not only the combinatorics; some of these decisions, like the scale of a hyperprior, are continuous.

@Anoneuoid: I think a logistic regression intercept has an interpretation even if the model’s not well specified. But it’s a different interpretation than in the same model with a new predictor. If we standardize predictors, then the scale on which the coefficients are interpreted changes, even if we change the prior to compensate. I’ve been discussing just this issue with one of the Columbia grad students, but didn’t want to put them on the spot.

I’ve been thinking about this recently. It seems like the ideal practice would be for authors of statistical analyses to publish the set of models that they considered, rather than just the model they settled on. Could anyone recommend literature with a discussion of that idea?

There was a series of discussion pieces not long ago in Computational Brain & Behavior that addressed the topic of “robust modeling”:

https://www.springer.com/journal/42113

Transparency of the kind you describe is one of the many ideas that is brought up in that discussion. Another point that comes up in those discussions is that there are many different types of models developed for different purposes, and the same suggestions don’t necessarily apply to all of them.

For example, a statistical model meant to describe the data might be a variant of an “off-the-shelf” approach, like logistic regression. In that case, it would be straightforward to describe the different members of that model family, because the choices you can make within that framework are somewhat more circumscribed.

On the other hand, we also build models as representations of scientific theories, and here it is almost never obvious what *all* the choice points are. Of course, there still might be several “leading contender” models that the authors would do well to describe. But in general, with such a wide-open field, I think it’s better for the authors to make clear what they did rather than what they might have done. A big problem is that often a choice needs to be made and there is little compelling reason to pick a single option.

For example, I might say something like, “my theory is that a certain decision gets made by accumulating evidence to a threshold,” and then say, “I will model evidence arrivals as a Poisson process, but this is just an assumption to be able to derive predictions and other models of evidence arrival are possible.” I can justify the choice of Poisson if it actually fits the data, but that’s it. Nonetheless, I’ve also made clear to a reader that it was a choice point that others might amend if additional constraints are found.

This way, I’m still being transparent, but I don’t need to describe every possible variant one might pick.
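As a toy sketch of that Poisson choice point (hypothetical numbers, not any published model): a Poisson process has exponential inter-arrival times, so the time to accumulate a fixed count of evidence is straightforward to simulate.

```python
import random

rng = random.Random(1)

def decision_time(rate=5.0, threshold=10):
    """Time to accumulate `threshold` evidence arrivals when evidence
    arrives as a Poisson process with the given rate; inter-arrival
    times of a Poisson process are exponential(rate)."""
    t = 0.0
    for _ in range(threshold):
        t += rng.expovariate(rate)
    return t

times = [decision_time() for _ in range(10_000)]
mean_t = sum(times) / len(times)
# mean decision time should land near threshold / rate = 2.0
```

Swapping the arrival model (say, to a renewal process with non-exponential gaps) changes only `decision_time`, which is exactly the kind of amendable choice point described above.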

Very interesting explanation, and thanks for the reference.

Ryan—What you’re looking for in the traditional p-value literature is “post-selection inference”. This tries to adjust p-values for the effect of things like variable selection in a regression model.

There’s also a literature on exploratory data analysis (EDA). That isn’t necessarily model-specific in a formal way, but it plays much the same role as fitting simple models—you can see basic patterns in the data. If you do EDA before selecting a model, it introduces the same kind of data-dependent model-selection bias.

So the real question is whether the kind of continuous model expansion and prior enrichment we propose in the Bayesian workflow paper is going to bias our inferences, and if so, how? Our working assumption is that as long as we work with posterior predictions and only introduce as much model as the data supports (something like a penalized complexity prior as a way of life; it’s the theory behind Andrew’s Fantabulous Unfolding Flower), then we should be OK. Just moving to a cross-validation viewpoint from a posterior predictive check viewpoint is helpful here, as is moving away from null hypothesis significance testing toward measuring more practical decision-theoretic utility. I seriously believe that the re-orientation from hypothesis testing to prediction is the main reason why machine learning has been eating statistics’ lunch for the past couple of decades.

Yet again, it’s worth emphasizing that the “many analysts” study did not specify a clear estimand: some teams interpreted it as causal and some did not, which makes it much less useful on its own and also in comparison with this Young & Stewart paper.

Also, I think these comparisons are a bit weird to frame in terms of, e.g., OLS vs. matching since common advice now is to center covariates and interact them with treatment. With categorical covariates this is equivalent to post-stratification — aka many-to-many (potentially coarsened) exact matching.
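That equivalence is easy to check numerically. The sketch below uses my own toy simulated data (one binary covariate, randomized treatment) and a hand-rolled least-squares solver to stay dependency-free; the coefficient on treatment from OLS with a centered covariate-by-treatment interaction matches the post-stratified difference in means exactly:

```python
import random

def ols(X, y):
    """Least squares via normal equations and Gaussian elimination."""
    n, k = len(X), len(X[0])
    A = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)] for p in range(k)]
    b = [sum(X[i][p] * y[i] for i in range(n)) for p in range(k)]
    for col in range(k):                      # forward elimination with pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * k
    for r in reversed(range(k)):              # back substitution
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, k))) / A[r][r]
    return beta

rng = random.Random(0)
n = 2000
x = [rng.random() < 0.3 for _ in range(n)]    # binary covariate
t = [rng.random() < 0.5 for _ in range(n)]    # randomized treatment
# heterogeneous effect: 1.0 in the x=0 stratum, 3.0 in the x=1 stratum
y = [2.0 * xi + (3.0 if xi else 1.0) * ti + rng.gauss(0, 1) for xi, ti in zip(x, t)]

xbar = sum(x) / n
design = [[1.0, float(ti), xi - xbar, ti * (xi - xbar)] for xi, ti in zip(x, t)]
beta_t = ols(design, y)[1]                    # coefficient on treatment

def stratum_diff(s):
    """Treated-minus-control difference in mean outcomes within stratum s."""
    y1 = [yi for yi, xi, ti in zip(y, x, t) if xi == s and ti]
    y0 = [yi for yi, xi, ti in zip(y, x, t) if xi == s and not ti]
    return sum(y1) / len(y1) - sum(y0) / len(y0)

# post-stratified estimate: stratum-size-weighted within-stratum differences
ps = (1 - xbar) * stratum_diff(False) + xbar * stratum_diff(True)
print(abs(beta_t - ps) < 1e-6)  # True: the two estimators coincide
```

With a binary covariate the interacted regression is saturated in (treatment, covariate), so its fitted cell means equal the sample cell means and the identity is exact, not approximate. Lin (2013, Annals of Applied Statistics) is the standard reference for the interacted estimator.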

I’m a bit surprised by the advice to interact the treatment indicator with covariates. Is there somewhere I can read more about this? And also about the equivalence to matching?

Parameter combinatorics is only one possibility for the multiverse, and a fairly limiting one at that. Without going completely down the garden-of-forking-paths rabbit hole, model ensembles, divide-and-conquer algorithms, and any iterative, approximating routines should also qualify for inclusion under the multiverse umbrella. To this one should add models with differing assumptions, whether that be OLS, maximum likelihood, generalized method of moments, RFs, SVD, splines, quantile regression, gradient descent, neural nets, Rissanen’s information-theoretic approach, whatever. This approach casts a wider net than simple combinatorics of existing information, since such widely differing methods have the potential to explore and expose unexpected features of the information.

That said, too many modelers (everyone?) rely on a single loss-minimization metric as the sole criterion for model evaluation and selection. While this approach has a long history in statistics and has the advantage of being easily implemented, breaking out of this box with respect to exclusive reliance on a single metric seems like a good idea. For instance, why wouldn’t an analyst want to know whether the model which minimizes error is also, at the same time, the model which maximizes dependence? Expanding the evaluation from one metric into a cloud of metrics which includes not just alternative measures of error (robust, asymmetric, etc.) but also metrics of linear and nonlinear dependence should only be insightful and salubrious.
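A minimal sketch of that idea, with toy predictions and metric choices that are purely illustrative: score each candidate model on a small cloud of metrics, error-based and dependence-based, side by side.

```python
def rmse(y, p):
    """Root mean squared error."""
    return (sum((a - b) ** 2 for a, b in zip(y, p)) / len(y)) ** 0.5

def mae(y, p):
    """Mean absolute error (a more robust error measure)."""
    return sum(abs(a - b) for a, b in zip(y, p)) / len(y)

def spearman(y, p):
    """Rank-based dependence: Pearson correlation of the ranks
    (naive ranking, no tie-averaging; fine for this toy example)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    ry, rp = ranks(y), ranks(p)
    my, mp = sum(ry) / len(ry), sum(rp) / len(rp)
    num = sum((a - my) * (b - mp) for a, b in zip(ry, rp))
    den = (sum((a - my) ** 2 for a in ry) * sum((b - mp) ** 2 for b in rp)) ** 0.5
    return num / den

y_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
preds = {"model_a": [1.1, 2.2, 2.9, 3.8, 5.3],
         "model_b": [0.5, 2.5, 3.5, 3.5, 4.5]}
scores = {name: {"rmse": rmse(y_obs, p), "mae": mae(y_obs, p),
                 "spearman": spearman(y_obs, p)} for name, p in preds.items()}
```

The result is the kind of models-by-metrics summary matrix that could then feed a dimension-reduction step, rather than collapsing everything to a single loss number up front.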

If the goal is to choose a single ‘best’ model from the set, depending on the breadth and depth of the models developed and the metrics employed, it’s easy to imagine a summary matrix of models by metrics which would lend itself to some variant of multivariate dimension reduction, e.g., scored results from PCA, KMapper, correspondence analysis, whatever. Pick the top-scoring model.

Note, however, that any form of multiverse modeling does not lend itself to automated machine learning algorithms, the bread and butter of all applied environments. These are custom, ad hoc suggestions which few are likely to explore, since they are time- and CPU-intensive, a deal breaker for most.

Worth noting that Hadley did some work on this for his dissertation (http://had.co.nz/model-vis/2007-jsm.pdf, which evolved into https://vita.had.co.nz/papers/model-vis.pdf), and many of those multiverse-analysis ideas went on to form the basis of the tidyverse-style approach to working with multiple models (i.e., list columns + broom; see also https://r4ds.had.co.nz/many-models.html).