Quantifying uncertainty in identification assumptions—this is important!

Luis Guirola writes:

I’m a poli sci student currently working on methods. I’ve seen you sometimes address questions in your blog, so here is one in case you want to take it up.

I recently read some of Chuck Manski’s book “Identification for Prediction and Decision”. I take his main message to be “The only way to get identification is using assumptions which are untestable.” This makes a lot of sense to me. In fact, most applied causal work in the Rubin identification tradition that is now popular in poli sci proceeds that way: first consider a research design (IV, quasi-experiment, RDD, whatever), then a) justify that the conditions for identification are met and b) proceed to run the design conditional on the assumptions being true. My problem here is that the decision about a) is totally binary, and the uncertainty that I feel is associated with it is taken out of the final result.

Chuck Manski’s idea here is something like “let’s see how far we can get without making any assumptions” (or as few as possible), which leads him to set identification. But as someone educated in the Bayesian tradition, I tend to feel that there must be a way of quantifying, if only subjectively or a priori, how sure we are about the identification assumptions by putting a probability distribution on them. Intuitively, that’s how I assess the state of knowledge in a certain area: if it relies on strong/implausible identification assumptions, I give less credit to its results; if I feel the assumptions are generalizable and hard to dispute, I give them more credit. But obviously, this is a very sloppy way of assessing it… I feel I must be missing something here, for otherwise I should have found more stuff on this.

My response:

Yes, I think it would be a good idea to quantify uncertainty in identification assumptions. The basic idea would be to express your model with an additional parameter, call it phi, which equals 0 if the identification assumption holds, and is positive or negative if the assumption fails, with the magnitude of phi indexing how far off the assumption is from reality. For example, if you have a model of ignorable treatment assignment, phi could be the coefficient on an unobserved latent characteristic U in a logistic regression predicting treatment assignment; say, Pr(T=1) = invlogit(X*beta + U*phi), where X represents observed pre-treatment predictors. The coefficient phi could never actually be estimated from data, as you don’t know U, but one could put a model on U and a prior on phi based on some idea of how selection could occur. One could then look at sensitivity of inferences to assumed values of phi.
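
To make this concrete, here is a rough, untested sketch of what such a model could look like in Stan. I’m assuming a continuous outcome, a standard-normal model for the unobserved U, and that U enters both the treatment-assignment and outcome equations; the names and prior scales (tau, lambda, and so on) are purely illustrative.

data {
  int<lower=1> N;                      // units
  int<lower=1> K;                      // number of observed pre-treatment predictors
  matrix[N, K] X;                      // observed pre-treatment predictors
  array[N] int<lower=0, upper=1> T;    // treatment indicator
  vector[N] y;                         // outcome
  real phi;                            // assumed departure from ignorability; phi = 0 recovers the usual model
}
parameters {
  vector[K] beta;                      // treatment-assignment coefficients
  real alpha;                          // outcome intercept
  vector[K] gamma;                     // outcome coefficients on X
  real tau;                            // treatment effect of interest
  real lambda;                         // effect of U on the outcome
  real<lower=0> sigma;                 // outcome sd
  vector[N] U;                         // unobserved characteristic, one per unit
}
model {
  U ~ std_normal();                    // the assumed model for the unobserved U
  beta ~ normal(0, 2.5);
  gamma ~ normal(0, 2.5);
  alpha ~ normal(0, 2.5);
  tau ~ normal(0, 2.5);
  lambda ~ normal(0, 1);
  sigma ~ normal(0, 2.5);
  for (n in 1:N) {
    // Pr(T=1) = invlogit(X*beta + U*phi), as above
    T[n] ~ bernoulli_logit(X[n] * beta + phi * U[n]);
    // outcome model in which U also acts as a confounder
    y[n] ~ normal(alpha + X[n] * gamma + tau * T[n] + lambda * U[n], sigma);
  }
}

Here phi is passed in as data, so the sensitivity analysis amounts to re-fitting over a grid of assumed values of phi (and perhaps lambda) and seeing how the estimate of tau moves. Alternatively, one could declare phi as a parameter and give it an informative prior centered at zero; the usual phi=0 assumption is then just the most restrictive special case.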

I’m sure a lot of work has been done on such models—I assume they’re related to the selection models of James Heckman from the 1970s—and I think they’re worthy of more attention. My impression is that people don’t work with such models because they make life more complicated and require additional assumptions.

It’s funny: Putting a model on U and a prior on phi is a lot less restrictive—a lot less of an “assumption”—than simply setting phi to 0, which is what we always do. But the model on U and phi is explicit, whereas the phi=0 assumption is hidden, so somehow it doesn’t seem so bad.

Regression models with latent variables and measurement error can be difficult to fit using standard statistical software, but they’re easy to fit in Stan: you just add each new equation and distribution to the model, no problem at all. So I’m hoping that, now that Stan is widely available, people will start fitting these sorts of models. And maybe at some point this will be routine for causal inference.

At the time of this writing, I haven’t worked through any such example myself, but I think it’s potentially a very useful idea in many application areas.

22 thoughts on “Quantifying uncertainty in identification assumptions—this is important!”

  1. > The coefficient phi could never actually be estimated from data, as you don’t know U, but one could put a model on U and a prior on phi based on some idea of how selection could occur. One could then look at sensitivity of inferences to assumed values of phi.

    I’ve suggested in the past that you could collect pairs of randomized & observational studies measuring the same variable, and fit a hierarchical mixture model for whether the correlation/causation estimates come from the same distribution, giving you a generic prior for use elsewhere.
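
    Here’s a rough, unfitted sketch of how that might look in Stan, assuming each pair j reports an observational estimate b_obs[j] and a randomized estimate b_rct[j] with known standard errors, on a roughly common scale; the mixture weight p is the probability that a pair’s two estimates target the same quantity, and all names and prior scales are illustrative.

    data {
      int<lower=1> J;                  // number of randomized/observational study pairs
      vector[J] b_rct;                 // randomized estimates
      vector<lower=0>[J] se_rct;       // their standard errors
      vector[J] b_obs;                 // observational estimates of the same quantities
      vector<lower=0>[J] se_obs;       // their standard errors
    }
    parameters {
      vector[J] theta;                 // true effects
      real mu_theta;                   // mean of the true effects
      real<lower=0> tau_theta;         // spread of the true effects
      real mu_bias;                    // average observational bias
      real<lower=0> tau_bias;          // spread of the bias across pairs
      real<lower=0, upper=1> p;        // probability a pair is effectively unbiased
    }
    model {
      mu_theta ~ normal(0, 1);
      tau_theta ~ normal(0, 1);
      mu_bias ~ normal(0, 1);
      tau_bias ~ normal(0, 1);
      p ~ beta(2, 2);
      theta ~ normal(mu_theta, tau_theta);
      b_rct ~ normal(theta, se_rct);
      for (j in 1:J) {
        // the observational estimate either targets theta[j] directly or is shifted
        // by a pair-specific bias drawn from normal(mu_bias, tau_bias), marginalized out here
        target += log_mix(p,
                          normal_lpdf(b_obs[j] | theta[j], se_obs[j]),
                          normal_lpdf(b_obs[j] | theta[j] + mu_bias,
                                      sqrt(square(se_obs[j]) + square(tau_bias))));
      }
    }

    The posterior for p, mu_bias, and tau_bias is the sort of generic prior on observational bias that could then be carried over to a new study.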

    • That’s a cool idea. But it would essentially require them to accept the view that their research design isn’t necessarily more clever than past designs, and that their posterior should be restricted by the prior from previous attempts at uncovering an effect. That could work well on stuff like nootropics/supplements, but would be harder in PolSci research, which rewards ‘clever research design’ and doesn’t have as many meta-studies.

      • I don’t think that using previous research to build in some constraints on effect sizes (or put priors over them) is conceding that one’s own research design is no better than previous work. That is the whole point of updating priors, yeah? We had some mediocre old information, have some better new information, and we want to update. I mean, I’m no Bayesian statistician, but that seems like the idea.

        In general, I think good (not necessarily “clever”) research design is necessary for advancement of knowledge, just not sufficient. And I don’t think that position is compromised just because I begin my research where others left off, even if I think those other investigations were less good than the one I want to conduct.

  2. Guido Imbens has done some work on sensitivity analysis that is something of a compromise between Manski’s approach (seeing how far we can get without making any assumptions) and the standard approach (positing the truth of an untestable identification assumption). This approach is similar in spirit to Andrew’s suggestion: posit the existence of a confounder, then see how strong the relationships between (a) the confounder and the treatment and (b) the confounder and the outcome must be to change your conclusion. An overview is in this paper: http://scholar.harvard.edu/imbens/files/sensitivity_to_exogeneity_assumptions_in_program_evaluation.pdf

  3. Unlike most other posts on this blog, here I don’t even know what you are talking about.

    Do you have any references I could read to become familiar with this subject?

    Thank you.

      • I think, if it means anything, it means “studying how well different research methods work across multiple areas of a particular discipline” instead of, say, “studying the answer to political science question X by designing studies to determine whether or not X is true,” etc.

      • One of the important developments in “work on methods” was Campbell and Fiske’s multi-trait, multi-method matrix for establishing construct validity. Isn’t that what we’re talking about here? As this Wiki article on MTMM states, “Multiple traits are used in this approach to examine (a) similar or (b) dissimilar traits (constructs), as to establish convergent and discriminant validity between traits. Similarly, multiple methods are used in this approach to examine the differential effects (or lack thereof) caused by method specific variance,” (https://en.wikipedia.org/wiki/Multitrait-multimethod_matrix).

        • Thanks Thomas B,

          After looking over the Wikipedia page, I am still not sure why I would want to use this procedure. I think I understand the goal of assessing how well different supposed measures of the same thing correlate with each other, but I don’t see how the real problem is addressed.

          Say you want to measure “motor ability”. So you count how many bananas a monkey successfully grabs through a hole, and also record how long it takes them to navigate an obstacle course with a food reward at the end. Both of these should be related to “motor ability”, but they will also be affected by how hungry, and thus motivated/stressed, the monkey is. So the two different measures may be very correlated, but also have little to do with the construct we attempted to measure.

          So I do not see how construct validity can be established via the method at that page. Maybe I misunderstood the goal.

  4. A lot of work on this issue has also been done by Lash, Fox, and Greenland, under the name “quantitative bias analysis”. They wrote a book about it too. The idea is that any particular assumption (e.g., no unmeasured confounders, no selection bias, no measurement error) may be violated, and one then estimates how the effect of interest would change if any (or all) of the assumptions were violated. They provide computer code (I think SAS) that estimates these types of multiple-bias models.

    https://sites.google.com/site/biasanalysis/

    • What was being discussed seemed very much like what Greenland was addressing with his multiple bias analyses (which go back to examples from DM Eddy’s confidence profile method, 198?, and conceptually to DB Rubin earlier).

      This can be fully or partially Bayesian, and the Lash, Fox, and Greenland book is a general introduction to the topic.

  5. There’s a slight paradox in the general idea of quantifying uncertainty in identifiability assumptions using probability distributions: by requiring the possibilities to sum to one, probability itself imposes a type of identifiability assumption.

  6. We’ve done some methodological work on exactly this issue, combining Bayesian nonparametric regression (via BART) with informative priors and sensitivity analysis in cases where the parameter of interest is partially identified:

    http://www.tandfonline.com/doi/abs/10.1080/01621459.2015.1084307

    The main application in the paper is bolstered by some sensible priors on how many firms are engaged in misconduct and on the attributes that make them more likely to be scrutinized. The supplemental material includes a case where identifying restrictions fail in an instrumental variable model, and we need thoughtful priors over the plausible size of the direct effect of the instrument on the outcome.

    • P.S.:
      Unlike all the other models mentioned here, this one makes life not just easier, but *significantly* easier, not harder, which is why I find it rather odd that the Full Bayesian Significance Test (FBST), which the book describes, has so far failed to gather more attention from the Stan community.


      • no_identd:

        When I have a week off to actually read, rather than skim, this long, possibly interesting, but to me somewhat vague book with its many quotes from Peirce …

        Or maybe not – the argument that there are singularities in any reality (i.e., given a specific context at a moment in time, all unknowns are actually set to discrete values), and that therefore our models that try to represent those realities need to have, and formally test, discrete values – seems a stretch. Models need not be literal representations, the parameters in the model need not correspond to exactly those settings, specific contexts and points in time need to be generalized over context and time, etc., etc.

        The author does _seem_ to be using philosophy to justify their FBST rather than trying to discern what are justified means of enabling inquiry with statistics.

    • >”nobody bothers to read his damn book on it”

      I tried to read it, but lost interest. Maybe it is just me, but I need to see examples of FBST in action. Ideally this would be on a few real datasets, and some interesting insight or engineering feat would be produced. Either that or the method and reasoning needs to be summarized much, much more succinctly (which may not be possible).

      I am probably not alone in that I just can’t bring myself to spend all that time figuring out what he is getting at, when it may be irrelevant to the problems I face.
