John Spivack writes:

I am contacting you on behalf of the biostatistics journal club at our institution, the Mount Sinai School of Medicine. We are working Ph.D. biostatisticians and would like the opinion of a true expert on several questions having to do with observational studies—questions that we have not found to be well addressed in the literature.

And here are their questions:

(1) Among the popular implementations of propensity-score-based methods for analyzing observational data, namely matching, stratification (based on quintiles, for instance), and weighting (by inverse probability of assigned treatment, say), is there a clear preference? Does the answer depend on data type?

I personally like stratification by quintiles of propensity score (followed by an analysis pooled over the quintile groups) because it is simple and robust, with no complicated matching algorithms and no difficult choices over different types of weights. Is this method acceptable for high-quality publications?

Also, given that the main threats to the validity of observational studies are elsewhere (unmeasured confounders, treatment heterogeneity, etc.), is the choice of which implementation of propensity score to use really as important as the volume of the literature devoted to the subject would suggest?

(2) Let’s say we’re estimating a treatment effect using an estimated-propensity-score-based reweighting of the sample (weighting by the inverse of the estimated probability of the assigned treatment, say). The authoritative references (e.g., Lunceford and Davidian) seem to say that one must take account of the fact that the weights are themselves estimated in order to produce a valid standard error for any final estimate of treatment effect. Complex formulas are sometimes provided for the standard errors of particular estimators, expressed, for instance, as a sandwich variance.

In practice, however, whether for weighted analyses or matched ones, we seldom make this kind of adjustment to the standard error and just proceed to do a ‘standard’ analysis of the weighted sample.

If you conceptualize the ‘experiment’ being performed as follows:

(a) Draw a sample from the (observational) population, including information on subjects’ covariates and treatment assignments

(b) Re-weight the sample by means of an estimated propensity score (estimated only using that sample).

(c) Observe the outcomes and perform the weighted analysis (for instance, using inverse-probability-of-assigned-treatment weights) to calculate an estimate of treatment effect.

Then, yes, over a large number of iterations of this experiment, the sampling distribution of the estimate of treatment effect will be affected by the variation, over multiple iterations, of the covariate balance between treatment groups (and the resulting variation of the weights), and this will contribute to the variance of the sampling distribution of the estimator.

However, there is another point of view. If a clinical colleague showed up for a consultation with a ‘messy’ dataset from an imperfectly designed or imperfectly randomized study, we would often accept the dataset as-is, using remedies for covariate imbalances, outliers, etc. as needed, in hopes of producing a valid result. In effect, we would be conditioning on the given treatment assignments rather than attempting an unconditional analysis over the (unknowable) distribution of multiple bad study designs. It seems to me that there is nothing very wrong with this method.

Applied to a reweighted sample (or a matched sample), why would such a conditional analysis be invalid, provided we apply the same standards of model checking, covariate balance, etc. that we use in other possibly messy observational datasets? In fact, wouldn’t conditioning on all available information, including treatment assignments and sample covariate distributions, lead to greater efficiency, and be closer in spirit to established statistical principles (like the extended form of the ancillarity principle)? To a non-expert, wouldn’t this seem like a strong enough argument in favor of our usual way of doing things?

My reply:

(1) The concern in causal inference is mismatch between the treatment and control groups. I have found it helpful to distinguish between two forms of mismatch:

– lack of complete overlap on observed pre-treatment predictors

– imbalance on observed pre-treatment predictors

From my book with Jennifer, here are a couple pictures to distinguish the two concepts.

Lack of complete overlap:

Imbalance:

The point of *matching*, as I see it, is to restrict the range of inference to the zone of *overlap* (that is, to define the average treatment effect in a region where the data can directly answer such a question). You match so as to get a subset of the data with complete overlap (in some sense) and then discard the points that did not match: the treated units for which there was no similar control unit, and the control units for which there was no treated unit.
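To make the overlap/imbalance distinction concrete, here is a minimal sketch for a single pre-treatment predictor. The function names and the toy ages are made up for illustration, and defining the zone of overlap as the common range of the two groups is only one crude choice (a stand-in for a real matching procedure), not a recommendation.

```python
from statistics import mean, stdev

def overlap_zone(x_treated, x_control):
    """Range of x covered by both groups, or None if the ranges are disjoint."""
    lo = max(min(x_treated), min(x_control))
    hi = min(max(x_treated), max(x_control))
    return (lo, hi) if lo <= hi else None

def standardized_difference(x_treated, x_control):
    """Difference in group means divided by the pooled standard deviation."""
    pooled_sd = ((stdev(x_treated) ** 2 + stdev(x_control) ** 2) / 2) ** 0.5
    return (mean(x_treated) - mean(x_control)) / pooled_sd

def restrict_to_overlap(x, zone):
    """Keep only the values that fall inside the zone of overlap."""
    lo, hi = zone
    return [v for v in x if lo <= v <= hi]

# Hypothetical ages: no controls above 62, no treated units below 52.
age_treated = [52, 55, 58, 61, 64, 70, 75]
age_control = [30, 35, 41, 50, 53, 57, 62]

zone = overlap_zone(age_treated, age_control)
kept_treated = restrict_to_overlap(age_treated, zone)
kept_control = restrict_to_overlap(age_control, zone)

imbalance_before = standardized_difference(age_treated, age_control)
imbalance_after = standardized_difference(kept_treated, kept_control)
print(zone, kept_treated, kept_control)
print(round(imbalance_before, 2), round(imbalance_after, 2))
```

In this toy example, discarding the units outside the common range also shrinks the imbalance, but some imbalance remains inside the zone of overlap, and that remainder still has to be adjusted for.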

*Stratification and weighting* are ways of handling *imbalance*. More generally, we can think of stratification and weighting as special cases of regression modeling, with the goal being to adjust for known differences between sample and population. But you can really only adjust for these differences where there is overlap. Outside the zone of overlap, your inferences for treatment effects are entirely assumption-based.
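As a rough illustration of stratification as an adjustment for imbalance, here is a minimal sketch of the quintile approach raised in question (1). It assumes the propensity scores are already given. The function names and the toy data (outcome equal to a stratum-level baseline plus a constant effect of 2) are hypothetical, chosen so that the answer is known in advance.

```python
from statistics import mean

def quintile_strata(scores):
    """Assign each unit to one of five strata by the rank of its propensity score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    strata = [0] * len(scores)
    for rank, i in enumerate(order):
        strata[i] = rank * 5 // len(scores)
    return strata

def stratified_effect(scores, treated, outcome):
    """Size-weighted average of within-stratum differences in mean outcome.

    Strata with no treated or no control units are skipped, since the data
    say nothing direct about the comparison there (no overlap).
    """
    strata = quintile_strata(scores)
    total, used = 0.0, 0
    for s in range(5):
        idx = [i for i in range(len(scores)) if strata[i] == s]
        y1 = [outcome[i] for i in idx if treated[i] == 1]
        y0 = [outcome[i] for i in idx if treated[i] == 0]
        if y1 and y0:
            total += len(idx) * (mean(y1) - mean(y0))
            used += len(idx)
    if used == 0:
        raise ValueError("no stratum contains both treated and control units")
    return total / used

# Toy data: 5 strata of 4 units; treatment is more common at high scores,
# and the baseline outcome also rises with the score (confounding).
treated_counts = [1, 1, 2, 3, 3]
scores, treated, outcome = [], [], []
for s, n_treated in enumerate(treated_counts):
    for j in range(4):
        z = 1 if j < n_treated else 0
        scores.append(0.1 + 0.2 * s + 0.01 * j)
        treated.append(z)
        outcome.append(10 * s + 2 * z)

naive = (mean(y for y, z in zip(outcome, treated) if z == 1)
         - mean(y for y, z in zip(outcome, treated) if z == 0))
adjusted = stratified_effect(scores, treated, outcome)
print(naive, adjusted)
```

The raw difference in means is far from the built-in effect of 2, while the stratified estimate recovers it. This only works because every stratum here contains both treated and control units.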

To put it another way, matching (and throwing away the non-matches) is about identification, or robustness. Stratification/weighting/regression are for bias correction.

Propensity scores are a low-dimensional approximation, a particular way of regularizing your regression adjustment. Use propensity scores if you want.

(2) If someone were to give me the propensity score, I’d consider using it as a regression predictor, not alone but together with other key predictors such as age, sex, smoking history, etc. I would *not* do inverse-propensity-score weighting, as this just seems like a way to get a lot of noise and get some mysterious estimate that I wouldn’t trust anyway.
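One way to see where the noise in inverse-propensity weighting can come from is to look at how unevenly the weights are spread. The sketch below is an illustration added here, not a diagnostic proposed in the discussion: it uses Kish's effective-sample-size approximation, (sum of weights)^2 / (sum of squared weights), on made-up propensity scores, with hypothetical function names.

```python
def inverse_propensity_weights(propensity, treated):
    """Weight each unit by 1 / P(the treatment it actually received)."""
    return [1 / p if z == 1 else 1 / (1 - p)
            for p, z in zip(propensity, treated)]

def effective_sample_size(weights):
    """Kish's approximation: (sum w)^2 / sum(w^2)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

# Hypothetical data: seven units with propensity 0.5, plus one treated unit
# that had only a 2% chance of being treated.
propensity = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.02]
treated = [1, 0, 1, 0, 1, 0, 1, 1]

weights = inverse_propensity_weights(propensity, treated)
ess = effective_sample_size(weights)
ess_without_outlier = effective_sample_size(weights[:-1])
print(weights, ess, ess_without_outlier)
```

With equal weights the effective sample size equals the number of units. Here a single unit with a tiny estimated propensity takes a weight of about 50 and pulls the effective sample size of the eight units below 2, so the weighted estimate rests almost entirely on that one observation.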

You write, “we would be conditioning on the given treatment assignments.” You always have to be conditioning on the given treatment assignments: that’s what you have! That said, you’ll want to know the distribution of what the treatment assignments could’ve been, had the experiment gone differently. That’s relevant to your interpretation of the data. If you do everything with a regression model, your implicit assumption is that treatment assignment depends only on the predictors in your model. That’s a well known principle in statistics and is discussed in various places including chapter 8 of BDA3 (chapter 7 in the earlier editions).

(3) One more thing. If you have lack of complete overlap and you do matching, you should also do regression. Matching to restrict your data to a region of overlap, followed by regression to adjust for imbalance. Don’t try to make matching do it all. This is an old, old point, discussed back in 1970 in Donald Rubin’s Ph.D. thesis.
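A minimal sketch of the two-step idea, under loudly stated assumptions: one covariate, greedy 1:1 nearest-neighbor matching with a caliper (just one of many possible matching schemes), and toy data in which the outcome is linear in x inside the zone of overlap but not outside it. All names, the caliper of 5, and the data are hypothetical. The true treatment effect is set to 3.

```python
def match_within_caliper(x, z, caliper):
    """Greedy 1:1 nearest-neighbor matching on x, without replacement.

    Returns the indices of matched units; everything else is discarded.
    """
    controls = [i for i in range(len(x)) if z[i] == 0]
    kept = []
    for t in (i for i in range(len(x)) if z[i] == 1):
        if not controls:
            break
        c = min(controls, key=lambda j: abs(x[j] - x[t]))
        if abs(x[c] - x[t]) <= caliper:
            kept += [t, c]
            controls.remove(c)
    return kept

def ols(rows, y):
    """Least-squares coefficients via the normal equations (Gauss-Jordan)."""
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        a[col] = [v / a[col][col] for v in a[col]]
        for r in range(k):
            if r != col:
                a[r] = [vr - a[r][col] * vc for vr, vc in zip(a[r], a[col])]
    return [a[i][k] for i in range(k)]

def baseline(xi):
    """Toy outcome without treatment: linear for x >= 45, bending upward below."""
    return 0.5 * xi if xi >= 45 else 22.5 + 1.5 * (45 - xi)

x = [50, 55, 60, 65, 80, 90, 20, 30, 48, 56, 61, 66]
z = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y = [baseline(xi) + 3 * zi for xi, zi in zip(x, z)]

def treatment_coefficient(idx):
    """Coefficient on z from regressing y on (1, z, x) over the units in idx."""
    rows = [[1.0, z[i], x[i]] for i in idx]
    return ols(rows, [y[i] for i in idx])[1]

matched = match_within_caliper(x, z, caliper=5)
effect_all_data = treatment_coefficient(range(len(x)))
effect_matched = treatment_coefficient(matched)
print(sorted(matched), effect_all_data, effect_matched)
```

In this toy setup, the regression on all the data leans on the linear model to extrapolate over units that have no counterpart in the other group, and its treatment coefficient misses the true value. After matching, the same regression only has to adjust for the remaining imbalance within the region of overlap, where the linear model holds.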

I’m a bit confused about the last point. Surely matching changes the range and also the distribution of both groups to be approximately the same already? So matching deals with both overlap and balance?

This is an extremely helpful discussion. Thank you!

What about Gary King’s 2016 paper, “Why propensity scores should not be used for matching”? https://gking.harvard.edu/files/gking/files/psnot.pdf

Joshua:

As I see it, the purpose of matching is to achieve overlap, not balance, so I don’t really see this linked article as being so relevant to what I would do.