Renato Frey writes:

I’m curious about your opinion on combining multi-model inference techniques with rstanarm:

On the one hand, screening all (theoretically meaningful) model specifications and fully reporting them seems to make a lot of sense to me — in line with the idea of transparent reporting, your idea of the multiverse analysis, or akin to Simonsohn’s Specification Curve Analysis (which seem to be quite closely related to each other anyway). So in the past, I’ve been using tools such as glmulti or regressionBF (from the BayesFactor package) to obtain model-averaged coefficients and, more interestingly, the “importances” of the various predictors, which are determined across various model specifications (i.e., in the case of glmulti). However, these tools are only available for “traditional” OLS and related estimation methods.

On the other hand, I’ve recently started to use rstanarm, which I now clearly prefer to the traditional estimation methods, not least because of the possibility to specify weakly informative priors and the resulting regularization.

As different model specifications would make sense from a theoretical point of view in some of my current projects, I’ve now wondered if it would be reasonable to write a wrapper that automatically implements different model specifications and runs them with rstanarm, which would permit an (automatic) comparison of coefficients across different model specifications (and would possibly also permit extracting a measure of “importance” for each of the predictors). Of course, this would need to be highly parallelizable to make it computationally feasible.

Do you have any thoughts on this, and / or do you plan to implement something related in rstanarm in the future?

My reply:

Yes, definitely use rstanarm! Or go straight to Stan (that would be rstan if you’re running it from R) and program more general models. If you want to fit several models and average their posterior distributions, I recommend stacking, as described in this recent paper. Also, sure, someone could write a wrapper to automatically fit a large number of models in rstanarm in parallel and then average over them—this would not be hard to do—but I think I’d prefer fitting a single model with all these predictors and interactions, using strong priors to regularize all the coefficients.
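To make the stacking idea concrete, here is a small numerical sketch (in Python, with made-up predictive densities standing in for what would, in practice, come from leave-one-out cross-validation): stacking chooses weights on the simplex that maximize the log score of the weighted mixture of the models’ predictive distributions. Since the objective is concave in the weights, the standard EM update for mixture proportions finds the optimum.

```python
import numpy as np

# p[i, k]: predictive density of model k at held-out point i.
# In practice these would come from LOO; here they are toy numbers.
rng = np.random.default_rng(1)
K, n = 3, 200
p = np.exp(rng.normal(loc=[-1.2, -0.9, -1.5], scale=0.3, size=(n, K)))

# Stacking: find simplex weights w maximizing the log score
#   sum_i log( sum_k w_k * p[i, k] )
# The mixture-weight EM update below monotonically increases this
# objective, which is concave in w.
w = np.full(K, 1.0 / K)
for _ in range(500):
    r = p * w
    r /= r.sum(axis=1, keepdims=True)   # per-point responsibilities
    w = r.mean(axis=0)                  # re-estimate mixture weights
```

The resulting `w` averages the posterior *predictive distributions*, not the coefficients, which is the key difference from information-criterion weighting.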

Implementing Bayesian variable selection might be the way forward. But writing a wrapper function following the Burnham and Anderson (2002) approach is also possible. I did this using JAGS, and I hope it will not be difficult to do in rstan as well. The wrapper function should fit all models, extract DIC (for instance), compute the DIC weights, and then obtain the model-averaged estimate by weighting each parameter estimate by its corresponding DIC weight.
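The DIC-weight arithmetic described above can be sketched in a few lines (Python rather than R/JAGS, and with entirely made-up DIC values and coefficient estimates); note that the replies below argue against using this kind of weighting:

```python
import numpy as np

# Toy DIC values for three hypothetical fitted models (made-up numbers)
dic = np.array([102.3, 100.1, 105.8])

# Akaike-style weights from DIC differences (Burnham & Anderson style):
# w_k proportional to exp(-0.5 * (DIC_k - min DIC))
delta = dic - dic.min()
w = np.exp(-0.5 * delta)
w /= w.sum()

# Model-averaged estimate of one coefficient: weight each model's
# estimate by its DIC weight (per-model estimates are made up here)
coef = np.array([0.40, 0.55, 0.10])
avg = float(w @ coef)
```

The model with the smallest DIC gets the largest weight, and the averaged estimate is pulled toward it.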

Belay:

1. DIC is not a good idea; better to use LOO; see here and here.

2. When model averaging, don’t use weights based on any information criterion, use stacking; see here.

For predictor relevance and selection, see my StanCon talks and notebooks at https://github.com/avehtari/modelselection_tutorial. The main points are:

– in the case of predictors, there is no need to do model combination; just include all potentially relevant predictors

– the regularized horseshoe prior is good if you have a lot of predictors

– projection predictive variable selection is better than other approaches for choosing a smaller set of predictors

– the rstanarm + projpred packages make all this easy (projpred is at https://cran.r-project.org/package=projpred)

– you can also use projpred with rstan

– all of this is computationally feasible on a laptop, at least up to 10,000 predictors

I’m sure this is an obvious caveat, but let’s be careful about encouraging people to dump all predictors into a common model. Colliders do exist, for example.

Sure. Question about terminology: If you know a variable is a collider, would you still also call it a predictor?

I would say yes. A variable / node can be a collider on one path but not a collider on another path. So, it is possible there is an unblocked path from a variable / node to the outcome, in which case it is a predictor in the usual statistical sense, but at the same time, it can be a collider on another path in the DAG sense if you condition on it.
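A quick simulation makes the collider issue concrete (Python, toy data): x and y are generated independently, and c = x + y + noise is their common effect. Regressing y on x alone recovers the true null relationship; adding the collider c to the regression induces a strong spurious negative coefficient on x (around -0.8 with the variances chosen here).

```python
import numpy as np

# x and y are independent by construction; c is their common effect
# (a collider). Conditioning on c induces a spurious x-y association.
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
y = rng.normal(size=n)
c = x + y + 0.5 * rng.normal(size=n)

# Regression of y on x alone: coefficient on x is near 0 (the truth)
X1 = np.column_stack([np.ones(n), x])
b1 = np.linalg.lstsq(X1, y, rcond=None)[0]

# Regression of y on x and the collider c: coefficient on x is
# strongly negative even though x and y are independent
X2 = np.column_stack([np.ones(n), x, c])
b2 = np.linalg.lstsq(X2, y, rcond=None)[0]
```

For prediction this is harmless (c genuinely carries information about y), but the coefficient on x no longer has a causal reading.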

On the other hand, the utility function for projpred or stacking is defined solely in terms of predictive ability, rather than the ability to interpret an estimate in causal terms. So, if you decide to go that route, I think you have already set aside your DAG.

Yeah, it’s a predictor. The right caveat for Richard’s concern is something like: “Be aware of the difference between predictive inference and causal inference. If no intervention into the system is planned or possible, and the aim is simply to predict as accurately as possible, then it’s appropriate to use all available predictors; but if the investigation aims to tease out causal connections, then you need to avoid conditioning on colliders (for Neyman-Rubinites, post-treatment variables).”

Corey:

Or, to be even more careful: It’s ok to include post-treatment variables in your regression, but then you can’t directly interpret your regression coefficient as a causal effect. You need to do more modeling.

Is there an example of interpreting a stacked model, for some explanation of, say, how changes in inputs can be anticipated to change predictions?

You can have pre-treatment colliders as well, but the same considerations about causal inference vs. predictive performance apply.

Colliders are not only “post-treatment” variables; they can be pre-treatment as well.

Of course it’s a predictor — predictor is what predicts.