Bayesian post-selection inference

Richard Artner, Francis Tuerlinckx, and Wolf Vanpaemel write:

We are currently doing research on model selection, averaging, and misspecification, and on post-selection inference. As far as we understand, your approach to Bayesian statistical analysis looks (drastically simplified) like this:

1. A series of models is fitted sequentially (with increasing model complexity), whereby the types of model misfit motivate how the model is extended at each step. This process stops when additional complexity can no longer be supported by the amount of data at hand (i.e., when parameter uncertainty due to estimation surpasses a certain point), or potentially earlier in the (lucky!) case that a model is found for which no discrepancies between the observed data pattern and the model assumptions can be detected.

2. The final model is then, once again, put to the acid test. That means residual plots, posterior predictive checks, and the like.

3. Inference for the model parameters of interest, as well as functions of them (e.g., expected mean, quantiles of the response variable, etc.), is then conducted in the chosen model.

An example of this process is, for instance, given in BDA (Chapter 22.2 “Using regression predictions: incentives for telephone surveys”). [That example is in section 9.2 of the third edition of BDA. — ed.]

We are wondering to what extent the inferences achieved by such a process can be problematic and potentially misleading since the data were used twice (first to end up with the final model and second to fit the likelihood to conduct the inferences). You do not mention any broadening of credible intervals, nor data splitting where the third step is conducted on an unused test sample. Maybe you do not mention it because it does not matter so much theoretically and in practice. Or perhaps because it is too difficult to deal with the issue in a Bayesian sense.

As far as we understand it, in such a process the dataset influences the form of the likelihood, the prior distributions, and the parameter fits (e.g., via ML), thereby violating the internal consistency of Bayesian inference (i.e., given an a priori specified likelihood and the “correct” prior distribution, the posterior distribution is correct, where in the M-open case correctness is defined by the best-approximating model).

– Yes, that’s a reasonable summary of our model-building approach. A more elaborate version is in this paper with Jonah Gabry, Dan Simpson, Aki Vehtari, and Mike Betancourt.

– I don’t think it will ever make sense to put all of Bayesian inference in a coherent framework, even for a single application. For one thing, as Dan, Mike, and I wrote a couple of years ago, the prior can often only be understood in the context of the likelihood. And that’s just a special case of the general principle that there’s always something more we could throw into our models. Whatever we have is at best a temporary solution.

– That said, much depends on how the model is set up. We might add features to a model in a haphazard way but then go back and restructure it. For example, the survey-incentives model in section 9.2 of BDA is pretty ugly, and recently Lauren Kennedy and I have gone back to this problem and set up a model that makes more sense. So I wouldn’t consider the BDA version of this model (which in turn comes from our 2003 paper) an ideal example.

– To put it another way, we shouldn’t think of the model-building process as a blind data-fitting exercise. It’s more like we’re working toward building a larger model that makes sense, and each step in the process is a way of incorporating more information.

1. This query does seem a bit of a quest for certainty about uncertainty …

I tried to provide a helpful intro to Bayesian workflow here: https://drive.google.com/file/d/1L74bST2KaI5bwAZRA_PhchkSK_sXW2Nw/view

Unfortunately, the red in the plots does not always show up, but they are here, along with the programs: https://github.com/KeithORourke/BayesinWorkflowLecture

Anticipating comments about what repeatedly happens not being Bayesianly relevant, I’ll do a future post on that.

2. Richard says:

I do not quite see how Andrew’s response answers the question of “to what extent the inferences achieved by such a process can be problematic and potentially misleading” and what follows. Could anyone elaborate?

• Anoneuoid says:

The process looks guaranteed to overfit to me. They need to check the predictions of the model against new data at the end to see how good it actually is.
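The check suggested here can be sketched in a few lines. This is only an illustration on simulated data, using least-squares polynomial fits as a stand-in for the sequence of increasingly complex models; the function name `holdout_check`, the simulated data, and the 30/30 split are all assumptions made for the example, not anything from the post.

```python
import numpy as np

def rmse(y, yhat):
    """Root mean squared error between observations and predictions."""
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def holdout_check(x, y, degrees, n_train, seed=0):
    """Fit polynomials of each degree on a training split and report
    (in-sample RMSE, held-out RMSE) for each degree.

    Hypothetical helper: polynomials stand in for a sequence of
    increasingly complex models.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    train, test = idx[:n_train], idx[n_train:]
    results = {}
    for d in degrees:
        coefs = np.polyfit(x[train], y[train], deg=d)
        results[d] = (
            rmse(y[train], np.polyval(coefs, x[train])),
            rmse(y[test], np.polyval(coefs, x[test])),
        )
    return results

# Simulated data (an assumption for illustration): the truth is linear.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=60)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3, size=60)

results = holdout_check(x, y, degrees=range(1, 9), n_train=30)
for d, (train_err, test_err) in results.items():
    print(f"degree {d}: in-sample RMSE {train_err:.3f}, held-out RMSE {test_err:.3f}")

# The degree that looks best in-sample is not necessarily best on new data.
best_in_sample = min(results, key=lambda d: results[d][0])
best_held_out = min(results, key=lambda d: results[d][1])
print("best in-sample:", best_in_sample, "| best held-out:", best_held_out)
```

In-sample error can only go down as nested models get more flexible, so it cannot by itself reveal overfitting; the held-out column is the quantity the comment is asking for. Note that a model-building process guided by misfit checks and prior information is not the same as picking whatever minimizes in-sample error, so this sketch shows the worst case rather than the workflow described in the post.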

3. Aki Vehtari says:

Vehtari and Ojanen (2012) https://projecteuclid.org/euclid.ssu/1356628931 discuss a decision-theoretic approach to inference after selection in section 3.3. Bayesian theory says we need to integrate over all uncertainties, which in the case of sequential fitting would mean integrating over the model space, including the sequential paths we didn’t take, to obtain the actual belief model M_* (defined at the end of section 2). Given M_*, we can use decision theory to find out how to make optimal inferences after selecting some individual model M_k (section 3.3). This approach is not commonly used, except in limited settings such as projection predictive variable selection, where forming M_* is easier (section 5.4 in Vehtari and Ojanen (2012); see also Piironen and Vehtari (2017) http://link.springer.com/article/10.1007/s11222-016-9649-y).

An example of how this could work in the case of sequential fitting: Let’s say the residual plots suggest that a quadratic function would be a good fit. To take the uncertainty in model space into account, we also need to consider other possible functions. We could integrate over a large set of parametric functions, or we could do a more continuous model expansion by going nonparametric and using, for example, a Gaussian process prior in function space. Naturally this still leaves the question of which GP covariance function to use, considering, e.g., different smoothness and stationarity assumptions, but with finite data we know that eventually we reach a point where the data are not informative about higher-level hyperparameters (Goel and DeGroot, 1981), and adding more layers of uncertainty doesn’t change what we learn. Thus carefully done sequential fitting can produce a good M_*.
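As a rough illustration of the “integrate over a large set of parametric functions” option, here is a minimal sketch that averages over polynomial degrees instead of committing to the quadratic. It is not the method of Vehtari and Ojanen (2012): it is plain Bayesian model averaging in a conjugate setting, and it assumes a known noise standard deviation `sigma`, an independent normal(0, `tau`) prior on every coefficient, and equal prior probability for each degree. All function names and the simulated data are hypothetical.

```python
import numpy as np

def design(x, degree):
    """Polynomial design matrix with columns 1, x, ..., x**degree."""
    return np.vander(x, degree + 1, increasing=True)

def log_marginal(x, y, degree, sigma, tau):
    """Log marginal likelihood of y under a polynomial model.

    Assumes y = Phi @ beta + noise, beta ~ N(0, tau^2 I),
    noise ~ N(0, sigma^2 I), so y ~ N(0, sigma^2 I + tau^2 Phi Phi^T).
    """
    Phi = design(x, degree)
    cov = sigma**2 * np.eye(len(x)) + tau**2 * (Phi @ Phi.T)
    _, logdet = np.linalg.slogdet(cov)
    quad = y @ np.linalg.solve(cov, y)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + quad)

def posterior_mean_coefs(x, y, degree, sigma, tau):
    """Posterior mean of the coefficients under the same conjugate model."""
    Phi = design(x, degree)
    A = Phi.T @ Phi / sigma**2 + np.eye(degree + 1) / tau**2
    return np.linalg.solve(A, Phi.T @ y / sigma**2)

def model_average(x, y, x_new, degrees, sigma, tau):
    """Posterior model weights (uniform prior over degrees) and the
    model-averaged predictive mean at x_new."""
    logml = np.array([log_marginal(x, y, d, sigma, tau) for d in degrees])
    w = np.exp(logml - logml.max())  # subtract the max for numerical stability
    w /= w.sum()
    preds = np.array([
        design(x_new, d) @ posterior_mean_coefs(x, y, d, sigma, tau)
        for d in degrees
    ])
    return w, w @ preds

# Simulated data (an assumption for illustration): the truth is quadratic.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 1.0 + 2.0 * x**2 + rng.normal(scale=0.3, size=50)

degrees = [1, 2, 3, 4]
x_new = np.linspace(-1, 1, 5)
weights, avg_pred = model_average(x, y, x_new, degrees, sigma=0.3, tau=2.0)
for d, w in zip(degrees, weights):
    print(f"degree {d}: posterior weight {w:.3f}")
print("model-averaged predictive mean:", np.round(avg_pred, 2))
```

The averaged prediction carries the uncertainty about which functional form is right, which is the part that is lost when one fits only the quadratic that the residual plot suggested. A Gaussian process prior would be the continuous analogue of this discrete set of candidates.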

4. This could well be the title for my next book:

we shouldn’t think of the model-building process as a blind data-fitting exercise. It’s more like we’re working toward building a larger model that makes sense, and each step in the process is a way of incorporating more information.