Modular statistics, multiple imputation, and bootstrapping: theta-hat is the elephant in the room, and so forth

Stef writes,

In preparing my inaugural lecture I am searching for literature on what I would, for the moment, call “modular statistics”. It is widely accepted that everything that varies in the model should be part of the loss function/likelihood, etc. For some reasonably complex models, this may lead to estimation errors and, worse, to problems of interpretation or to opportunities to tamper with the model.

It is often possible to analyse the data sequentially, as a kind of poor man’s data analysis. For example: first FA, then regression (instead of LISREL); first impute, then complete-data analysis (instead of EM/Gibbs); first quantify, then anything (instead of Gifi techniques); first match on propensity, then t-test (instead of correcting for confounding by introducing covariates); and so on. This is simpler and sometimes conceptually more defensible, but of course at the expense of fit to the data.

It depends on the situation whether the latter is a real problem. There must be statisticians out there who have written on the factors to take into account when deciding between these two strategies, but so far I have been unable to locate them. Any idea where to look?

My reply:

One paper you can look at is Xiao-Li Meng’s 1994 Statistical Science article on the “congeniality” of imputations, which addresses some of these issues.
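To make Stef’s “first impute, then complete-data analysis” pattern concrete, here is a minimal sketch in Python (toy data and a deliberately crude imputation model, not anything from Meng’s paper or from Stef’s software): draw m stochastic imputations of a partially missing covariate, run the ordinary complete-data regression on each imputed data set, and pool the results with Rubin’s rules.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y depends on x, and about 30% of the x values are missing at random.
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
missing = rng.random(n) < 0.3
x_obs = np.where(missing, np.nan, x)

def ols(X, y):
    """Least-squares coefficients and standard errors, computed by hand."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se

# Step 1: build m imputed data sets, drawing each missing x from a normal
# distribution fit to the observed x's.  (This imputation model ignores y
# entirely -- exactly the kind of possibly "uncongenial" shortcut at issue.)
m = 20
mu, sd = np.nanmean(x_obs), np.nanstd(x_obs)
estimates, variances = [], []
for _ in range(m):
    x_imp = np.where(missing, rng.normal(mu, sd, size=n), x_obs)
    X = np.column_stack([np.ones(n), x_imp])
    # Step 2: ordinary complete-data analysis on each imputed data set.
    beta, se = ols(X, y)
    estimates.append(beta)
    variances.append(se ** 2)

# Step 3: pool with Rubin's rules: total variance = within + (1 + 1/m) * between.
estimates = np.array(estimates)
variances = np.array(variances)
qbar = estimates.mean(axis=0)
total_var = variances.mean(axis=0) + (1 + 1 / m) * estimates.var(axis=0, ddof=1)
print("pooled slope:", qbar[1], "+/-", np.sqrt(total_var)[1])
```

The point of the sketch is only that the two modules communicate through nothing but the imputed data sets; the complete-data regression never sees the missingness.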

More generally, it is a common feature of applied Bayesian statistics that different pieces of information come from different sources and then are combined. For example, you get the “data” from one study and the “prior” from a literature review. Full Bayes would imply analyzing all the data at once, but in practice we don’t always do that.

Even more generally, I’ve long thought that the bootstrap and similar methods have this modular feature. (I’ve called it the two-model approach, but I prefer your term “modular statistics”.) The bootstrap literature is all about which bootstrap replication to use, what the real sampling distribution is, whether to do a parametric or nonparametric bootstrap, how to bootstrap with time series and spatial data, and so on. But the elephant in the room that never gets mentioned is the theta-hat, the estimate that’s getting “bootstrapped.” I’ve always thought of the bootstrap as being “modular” (to use your term) because the model (or implicit model) used to construct theta-hat is not necessarily the model used for the bootstrapping.
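Here is a small illustration of what I mean by that modularity (a toy sketch, not taken from the bootstrap literature): the theta-hat below comes from one model, a least-squares fit, while the uncertainty comes from a nonparametric resampling of rows that treats the estimator as a black box.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data with heavy-tailed errors, plus an estimator theta_hat.
# The bootstrap loop below never looks inside theta_hat.
n = 100
x = rng.normal(size=n)
y = 0.5 + 1.5 * x + rng.standard_t(df=3, size=n)

def theta_hat(x, y):
    """The estimate being "bootstrapped": here a least-squares slope."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

# Nonparametric bootstrap: resample rows, re-apply the estimator as-is.
B = 2000
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)
    boot[b] = theta_hat(x[idx], y[idx])

print("theta-hat:", theta_hat(x, y))
print("bootstrap se:", boot.std(ddof=1))
```

Swapping in any other estimator for theta_hat leaves the resampling module untouched, which is exactly the modular split.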

Sometimes bootstrapping (or similar ideas) can give the wrong answer while other two-level models work well. For example, in our 1990 JASA paper, Gary King and I were estimating counterfactuals about elections for Congress. We needed to set up a model for what could have happened had the election gone differently. A natural approach would have been to “bootstrap” the 400 or so elections in a year, to get different versions of what would have happened. But that would have been wrong, because the districts were fixed, not a sample from a larger population. We did something better, which was to use a hierarchical model. It was challenging because we had only one observation per district (it was important for our analysis to do a separate analysis for each year), so we estimated the crucial hierarchical variance parameter using a separate analysis of data from several years. Thus, a modular model. (We also did some missing-data imputation.) So this is an example where we really used this approach, and we really needed to. Another reference on it is here.
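For flavor only, here is a toy sketch of that kind of modular hierarchical estimation (made-up numbers, not the actual model from the 1990 paper): a group-level standard deviation tau is treated as if it had been estimated in a separate multi-year analysis, then plugged in to partially pool the single-year district estimates.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy single-year data: one noisy estimate per district, with known sampling sd.
n_districts = 400
sampling_sd = 0.10
true_effects = rng.normal(0.0, 0.15, size=n_districts)
y = true_effects + rng.normal(0.0, sampling_sd, size=n_districts)

# Module 1 (assumed done elsewhere): the hierarchical sd tau, estimated from a
# separate analysis of several years of data.  Here it is simply plugged in.
tau = 0.15

# Module 2: partial pooling of each district's single-year estimate toward the
# grand mean, with the shrinkage weight determined by the plugged-in tau.
mu_hat = y.mean()
weight = tau**2 / (tau**2 + sampling_sd**2)
partially_pooled = mu_hat + weight * (y - mu_hat)

print("raw estimate, district 0:     ", round(y[0], 3))
print("partially pooled, district 0: ", round(partially_pooled[0], 3))
```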

Even more generally, I’ve started to use the phrase “secret weapon” to describe the policy of fitting a separate model to each of several data sets (e.g., data from surveys at different time points) and then plotting the sequence of estimates. This is a form of modular inference that has worked well for me. See this paper (almost all of the graphs, also the footnote on page 27) and also this blog entry. In these cases, the second analysis (combining all the inferences) is implicit, but it’s still there. Sometimes I also call this “poor man’s Gibbs”.
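And here is a schematic of the “secret weapon” (simulated “surveys,” not real data): fit the same regression separately to each year’s data set and plot the sequence of estimates with error bars, which is where the implicit second analysis happens.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)

# Simulated "surveys": one data set per year, the same regression fit to each.
years = np.arange(2000, 2011)
est, se = [], []
for t in range(len(years)):
    n = 500
    x = rng.normal(size=n)
    y = (0.5 + 0.05 * t) * x + rng.normal(size=n)   # coefficient drifts over time
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)
    se_b = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    est.append(beta[1])
    se.append(se_b[1])

# The "second analysis" is the plot: eyeballing how the estimate moves over time.
plt.errorbar(years, est, yerr=2 * np.array(se), fmt="o-", capsize=3)
plt.xlabel("year")
plt.ylabel("estimated coefficient")
plt.title("separate model per year (the secret weapon)")
plt.show()
```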

Stef replied,

Meng’s congeniality is surely relevant. If I remember correctly, he introduced it as a principle for specifying imputation models. It would be interesting to see whether the principle carries over to other types of modular scenarios. In a simple chained data analysis (e.g., first impute, then quantify, then regress), the updated data sets form the only link between the steps. The advantage here is that there is no missing-data problem anymore after step 1. We ‘know’ that this is suboptimal relative to solving it all at once, but we do not know by how much. I interpret congeniality loosely as the requirement to include in the current step all data that are used at later steps. It might be that if we adhere to that, then we will be on reasonably safe ground.

I like your bootstrap example. Indeed, bootstrapping works (or should work) irrespective of the theta-hat elephant in the room. So yes, I think that is another example of a modular method, though slightly more complex than the simple chained scenario because it “packs” the estimation of theta with parallel replications. The same holds for multiple imputation, which also packs theta. So perhaps it is useful to think of parallel versions of each step, packing whatever comes after it.

You remark that there are cases where the bootstrap does not work but other two-step models do. Would these, incidentally, be the cases where theta-hat and the bootstrap cannot be modularised for some reason? It would be useful to have an idea of when that will happen, and why the other methods get it right.

The other two examples you provide use different models on different data, and combine the results in some way. I didn’t think of that as modularity, so perhaps I should allocate some more brain tissue to the concept. In any case, the footnote on page 27 is right on target. We do it to make life simple. If only we knew when we can…