Two spans of the bridge of inference

This is Jessica. Larry Hedges relayed a quote to me recently that I thought others here might appreciate. It appears in an old Annals of Mathematical Statistics paper by Tukey and Cornfield:

In almost any practical situation where analytical statistics is applied, the inference from the observations to the conclusion has two parts, only the first of which is statistical. A genetic experiment on Drosophila will usually involve flies of a certain race of a certain species. The statistically based conclusions cannot extend beyond this race, yet the geneticist will usually, and often wisely, extend the conclusion to (a) the whole species, (b) all Drosophila, or (c) a larger group of insects. This wider extension may be implicit or explicit, but it is almost always present. If we take the simile of the bridge crossing a river by way of an island, there is a statistical span from the near bank to the island, and a subject-matter span from the island to the far bank. Both are important. By modifying the observation program and the corresponding analysis of the data, the island may be moved nearer to or farther from the distant bank, and the statistical span may be made stronger or weaker. In doing this it is easy to forget the second span, which usually can only be strengthened by improving the science or art on which it depends. Yet a balanced understanding of, and choice among, the statistical possibilities requires constant attention to the second span. It may often be worth while to move the island nearer to the distant bank, at the cost of weakening the statistical span, particularly when the subject-matter span is weak.

The example is about generalization from experimental evidence, where we often fixate on removing threats to the statistical inference while leaving the additional “work” required to apply the results for policy informal and outside the bounds of the research. But more broadly, the quote is a nice metaphor for how statistical methods inevitably fall short of getting us all the way to solving real-world problems. The most interesting part of statistics for me has often been trying to make sense of what is happening at the edges of the technical solutions, where some as-yet-unformalized judgment has to come in and make it work.

Sometimes it’s about understanding how people are using estimates or predictions from statistical models, where, for example, the interfaces we provide to model outputs can shape what researchers conclude or what policies are enacted. Sometimes it’s about how we conceive of our goals going into modeling, like when we’re designing studies and have to pull effect size estimates from some foggy internal model of plausible effects. Often it’s about assessing the extent to which assumptions hold in real world settings where models are applied. My interest in theoretical work on calibration for decision-making, for example, is partly kept up by my hesitance to accept certain assumptions as realistic descriptions of how predictions are used in practice. 

Sometimes it’s about what happens in between the deployments of some algorithmic solution. I’m reminded of a talk Susan Murphy gave at a workshop on individualized prediction that Ben Recht organized last summer. She presented a number of reflections on her work in reinforcement learning for health care, where they are deploying online learning algorithms in apps to learn personalized policies for nudging people at risk of coronary disease to exercise, or people at risk of dental disease to brush their teeth. Comments she made that stuck with me were about the iterative learning that happens between experiments, where there’s an inevitable “discovery” phase that involves pooling data across individuals to identify themes that can help them tweak the algorithm or better initialize it for the start of the next round. In other words, a lot of the important learning remains outside of the formalized algorithmic loop. 

When researchers identify the stuff happening at the edges, and start taking it seriously, the results can be big. Where would research or development on topics like data visualization and interactive data analysis be without Tukey’s vision of exploratory data analysis to signal their significance? Andrew’s philosophy of model checking and his and others’ work on workflow also come to mind, as well as work by Beth Tipton and others on generalizability and taking heterogeneity seriously in behavioral research. There are many great examples. Sometimes entire new fields spring up in acknowledgement of the overlooked second span. A more recent example is research on algorithmic fairness and bias arising largely from an observation that the traditional machine learning pipeline, where performance is optimized in aggregate over a population, leaves a big gap when it comes to applying models confidently in practice in fields like medicine or law, where we care about doing right by individuals. 

Still it seems that there’s often hesitance to “break the frame” implied by conventions in highly technical fields. Maybe researchers sense that the same tools won’t necessarily solve the problems of the second span. Or that they won’t garner the same respect. Or they’re so busy following the technical train that they don’t get around to looking beyond its trajectory. But that’s ok. If nothing else, less competition for researchers like me! 

4 thoughts on “Two spans of the bridge of inference”

  1. Jessica:

    I agree that there are many interesting angles to this problem. I usually frame these external-validity issues in terms of poststratification: we study how some treatment effect or other outcome of interest depends on factors such as person, scenario, and time, and then we extrapolate from available data to a larger population of interest (where, again, “population” refers to some larger set of possible scenarios and time points, not just to people (or animals, or stores, or whatever) that aren’t in the sample).

    One motivation for me to study this topic came several years ago when, encountering the replication crisis in psychology, I noticed the two-step by which researchers would make broad generalizations from their little experiments and then, when these experiments failed to replicate, emphasize the narrow conditions of their studies.

    A pair of particularly resonant examples came from Harvard psychology professor Daniel Gilbert.

    At one point I criticized a psychology study on the grounds of representativeness. I wrote that “participants in an Internet survey and University of British Columbia students aren’t particularly representative of much more than … participants in an Internet survey and University of British Columbia students.”

    Gilbert’s response was: “Complaining that subjects in an experiment were not randomly sampled is what freshmen do before they take their first psychology class. I really *hope* you know why that is an absurd criticism – especially of authors who never claimed that their study generalized to all humans.”

    This was an interesting response: On one hand, he was accepting the idea that a research study can only generalize to people and scenarios that are similar to the people and scenarios in the study. On the other hand, the paper in question was called “Women Are More Likely to Wear Red or Pink at Peak Fertility,” with no mention of who those women were, and the abstract of that paper concluded, “Our results thus suggest that red and pink adornment in women is reliably associated with fertility and that female ovulation, long assumed to be hidden, is associated with a salient visual cue,” again with no implication of restricted generalization.

    In a later discussion, this time with Brian Nosek, Gilbert defended a study of his that had failed to replicate on the grounds that the new experiment differed from his original experiment in certain ways.

    The point is that (a) generalization is one of the major goals of just about any experiment, (b) it’s understood that additional modeling is needed to generalize, but (c) this is typically hidden, with challenges of generalization often only coming out after a study fails to replicate or is revealed to have some serious problem. Researchers (and policymakers, and journal editors) should be thinking about generalization from the beginning.

    I conjecture that part of the problem is that generalization, or external validity, is often seen as outside of the standard laboratory science paradigm—as illustrated in your “break the frame” comment in the final paragraph of your post. One reason I like the poststratification framework is that it brings the generalization problem “inside the tent,” as it were. So, instead of framing the problem as, “Does this result from study X generalize to situation Y, or not?”, it becomes, “What are the assumptions involved in the generalization from X to Y?” And, at a technical level, we switch from an all-or-nothing, generalize-or-not approach to a partial-pooling approach to generalization (as we did in this paper with Weber et al.).

    We had a discussion of this general topic—using hierarchical modeling for partial generalization—with Judea Pearl here a few years ago. I got the sense that Pearl prefers to frame generalization, or transportability, as an all-or-nothing question where there is some mathematical condition for transportability, whereas I would rather set up the problem using a continuous model.

  2. I’m having trouble envisioning the first span as entirely a matter of statistics, devoid of the second span. Most statistical analyses seem intertwined with subject-matter knowledge, at least in my eyes. If that is the case, then seeing two spans is part of the problem: it invites the belief that if you do the first span “correctly” then perhaps you can ignore the second span. While Tukey and Cornfield are highlighting the importance of the second span, they seem to say you can do the statistical analysis without it, but shouldn’t generalize the results. I’m not sure I agree.

  3. A similar effect applies in computer benchmarking, especially in the early days, when people would over-generalize from a few benchmarks, some of which bore little resemblance to real applications. And of course, the numbers were of great commercial interest to computer vendors and their customers.
    It took much work to collect and improve collections of benchmarks, summarize results, and also display variability.
    This one started in 1988 and has been going since:
    https://www.spec.org/benchmarks.html
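
For readers who have not seen it written down, the poststratification idea in the first comment can be sketched in a few lines. This is only an illustration under strong assumptions, not the method from any paper mentioned above: the cell names and all numbers are hypothetical, the per-cell estimates are taken as given (in practice they would come from a model, ideally with partial pooling), and the `poststratify` helper is made up for this example. The point is that the same per-cell estimates yield different answers depending on which population's cell shares are used as weights, and choosing those weights is part of the second span.

```python
def poststratify(cell_estimates, cell_counts):
    """Average per-cell estimates, weighting each cell by its share of cell_counts."""
    missing = set(cell_counts) - set(cell_estimates)
    if missing:
        raise ValueError(f"no estimate for cells: {sorted(missing)}")
    total = sum(cell_counts.values())
    return sum(cell_estimates[cell] * n for cell, n in cell_counts.items()) / total


# Hypothetical per-cell treatment effects (assumed already estimated).
cell_estimates = {"students": 0.30, "non_students": 0.10}

# Hypothetical cell sizes: the sample is mostly students, the target population is not.
sample_counts = {"students": 90, "non_students": 10}
population_counts = {"students": 10, "non_students": 90}

in_sample = poststratify(cell_estimates, sample_counts)
in_population = poststratify(cell_estimates, population_counts)

print(f"effect weighted by sample composition:     {in_sample:.2f}")
print(f"effect weighted by population composition: {in_population:.2f}")
```

The sketch makes the generalization assumptions explicit: the effect is assumed to vary only across the listed cells, and the population counts are assumed known. Relaxing either assumption is where the continuous, partial-pooling framing comes in.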
