This is Jessica. Zach Lipton gave a talk at an event on human-centered AI at the University of Chicago the other day that resonated with me, in which he commented on the adoption of causal inference to solve machine learning problems. The premise was that there’s been considerable reflection lately on methods in machine learning, as it has become painfully obvious that accuracy on held-out IID data is often not a good predictor of model performance in a real-world deployment. So one computer scientist who reads the Book of Why at a time, researchers are adapting causal inference methods to make progress on problems that arise in predictive modeling.
For example, Northwestern CS now regularly offers a causal machine learning course for undergrads. Estimating counterfactuals is common in approaches to fairness and algorithmic recourse (recommendations of the minimal intervention someone can take to change their predicted label), and in “explainable AI.” Work on feedback loops (e.g., performative prediction) is essentially about how to deal with causal effects of the predictions themselves on the outcomes.
Jake Hofman et al. have used the term integrative modeling to refer to activities that attempt to predict as-yet unseen outcomes in terms of causal relationships. I have generally been a fan of research happening in this bucket, because I think there is value in making and attempting to test assertions about how we think data are generated. Often doing so lends some conceptual clarity, even if all you get is a better sense of what’s hard about the problem you’re trying to solve. However, it’s not necessarily easy to find great examples yet of integrative modeling. Lipton’s critique was that despite the conceptual elegance gained in bringing causal methods to bear on machine learning problems, their promise for actually solving the hard problems that come up in ML is somewhat illusory, because they inevitably require us to make assumptions that we can’t really back up in the kinds of high dimensional prediction problems on observational data that ML deals with. Hence the title of this post, that ultimately we’re often still left with some really hard social science problem.
There is an example that this brings to mind which I’d meant to post on over a year ago, involving causal approaches to ML fairness. Counterfactuals are often used to estimate the causal effects of protected attributes like race in algorithmic auditing. However, some applications have been met with criticism for not reflecting common sense expectations about the effects of race on a person’s life. For example, consider the well known 2004 AER paper by Bertrand and Mullainathan, “Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination,” which attempts to measure race-based discrimination in callbacks on fake resumes by manipulating applicant names on the same resumes to imply different races. Lily Hu uses this example to critique approaches to algorithmic auditing based on direct effects estimation. Hu argues that assuming you can identify racial discrimination by imagining flipping race differently while holding all other qualifications or personal attributes of people constant is incoherent, because the idea that race can be switched on and off without impacting other covariates is incompatible with modern understanding of the effects of race. In this view, Pearl’s statement in Causality that “[t]he central question in any employment discrimination case is whether the employer would have taken the same action had the employee been of a different race… and everything else had been the same” exhibits a conceptual error, previously pointed out by Kohler-Hausmann, where race is treated as phenotype or skin type alone, misrepresenting the actual socially constructed nature of race. Similar ideas have been discussed before on the blog around detecting racial bias in police behavior, such as use of force, e.g., here.
Path-specific counterfactual fairness methods instead assume the causal graph is known, and hinge on identifying fair versus unfair pathways affecting the outcome of interest. For example, if you’re using matching to check for discrimination, you should be matching units only on path-specific effects of race that are considered fair. To judge if a decision to not call back a black junior in high school with a 3.7 GPA was fair, we need methods that allow us to ask whether he would have gotten the callback if he were his white counterpart. If both knowledge and race are expected to affect GPA, but only one of these is fair, we should adjust our matching procedure to eliminate what we expect the unfair effect of race on GPA to be, while leaving the fair pathway. If we do this we are likely to arrive at a white counterpart with a higher GPA than 3.7, assuming we think being black leads to a lower GPA because of disadvantages relative to the white counterpart, who may, for example, get boosts in grades due to preferential treatment.
One of Hu’s conclusions is that while this all makes sense in theory, it becomes a very slippery thing to try to define in practice:
To determine whether an employment callback decision process was fair, causal approaches ask us to determine the white counterpart to Jamal, a Black male who is a junior with a 3.7 GPA at the predominantly Black Pomona High School. When we toggle Jamal’s race attribute from black to white and cascade the effect to all of his “downstream” attributes, he becomes white Greg. Who is this Greg? Is it Greg of the original audit study, a white male who is a junior at Pomona High School with a 3.7 GPA? Is it Greg1, a white male who is a junior at Pomona High School with a 3.9 GPA (adjusted for the average Black-White GPA gap at Pomona High School)? Or is it Greg2, a white male who is a junior at nearby Diamond Ranch High School—the predominantly white school in the area—with a 3.82 GPA (accounting for nationwide Black-White GPA gap)? Which counterfactual determines whether Jamal has been treated fairly? Will the real white Greg please stand up?
And so we’re left with the non-trivial task of getting experts to agree on the normative interpretation of which pathways are fair, and what the relevant populations are for estimating effects along the unfair pathways.
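To make the dependence on the reference population concrete, here is a minimal sketch of the arithmetic behind the three Gregs. The gaps are simply the differences implied by the GPAs in the quote (3.9 and 3.82 versus 3.7); the dictionary and its labels are my own illustrative names, and nothing here says which adjustment is the right one. Note that Greg2 also changes schools, which the sketch does not model.

```python
# Jamal's observed GPA, taken from the quoted example.
jamal_gpa = 3.7

# GPA gaps implied by the three Gregs in the quote. Which gap (if any)
# is the appropriate one is the normative question, not a computation.
assumed_gap = {
    "Greg (no adjustment)": 0.0,
    "Greg1 (school-level gap)": 0.2,
    "Greg2 (nationwide gap)": 0.12,
}

# Each assumption about the unfair pathway yields a different counterpart.
counterparts = {
    name: round(jamal_gpa + gap, 2) for name, gap in assumed_gap.items()
}

for name, gpa in counterparts.items():
    print(f"{name}: {gpa}")
```

The code is trivial on purpose: all of the difficulty sits in choosing the entries of the dictionary, which is the part that requires agreement among experts.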
This reminds me a bit of the motivation behind writing this paper comparing concerns about ML reproducibility and generalizability to perceived causes of the replication crisis in social science, and of my grad course on explanation and reproducibility in data-driven science. It’s easy to think that one can take methods from explanatory modeling to solve problems related to distribution shift, and on some level you can make some progress, but you better be ready to embrace some unresolvable uncertainty due to not knowing if your model specification was a good approximation. At any rate, there’s something kind of reassuring about listening to ML talks and being reminded of the crud factor.
Pearl’s statement does not sound like a conceptual error to me. …It sounds like different questions are being asked. In Pearl’s example, the goal seems to be to determine intent of the employer. The examples that follow seem to be aimed at determining disparity.
Suppose the employer’s a little racist so they filter out resumes with black sounding names.
Designing an idealized experiment for “everything is the same except for a subject’s race,” you might have a white guy named Jamal and a black guy named Jamal, with the same educational background and experience and stuff, apply for this job, as well as a white guy named Jake and a black guy named Jake. You would find both Jakes get through resume screening and both Jamals get rejected. Is the conclusion then that “the employer is not racist, they are discriminating not based on race, but based on names”? In some sense, you’re conditioning on a mediator here.
Pearl would tell you that conditioning on mediators biases your estimate of the average treatment effect, but gives you an estimate of the direct effect. Of course, if you think about it too hard, what even is a mediator vs a direct effect? Can’t you always decompose everything down into more and more mediators until all direct effects disappear?
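A small simulation can make the total-versus-direct distinction concrete. This is only a sketch under assumptions that are not in the discussion above: a linear model, a randomized binary treatment, one mediator, and made-up coefficients `a`, `b`, `c`. Regressing the outcome on the treatment alone recovers roughly the total effect; adding the mediator as a regressor recovers roughly the direct effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical linear model (coefficients are made up for illustration):
# treatment -> mediator -> outcome, plus a direct treatment -> outcome path.
a, b, c = 0.8, 0.5, 0.3  # T->M, M->Y, direct T->Y

t = rng.integers(0, 2, size=n).astype(float)  # randomized binary treatment
m = a * t + rng.normal(size=n)                # mediator
y = c * t + b * m + rng.normal(size=n)        # outcome


def ols(columns, target):
    """Least-squares coefficients, with an intercept as the first entry."""
    X = np.column_stack([np.ones(len(target)), *columns])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef


total_effect = ols([t], y)[1]      # not conditioning on the mediator
direct_effect = ols([t, m], y)[1]  # conditioning on the mediator

print(f"total  ~ {total_effect:.2f} (true {c + a * b:.2f})")
print(f"direct ~ {direct_effect:.2f} (true {c:.2f})")
```

Setting `c = 0` gives a version of the Jamal/Jake situation: everything flows through the mediator, so the estimate that conditions on it is near zero even though the total effect is not.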
What’s the causal effect of me spending money on Amazon on the level in my checking account? Well, it’s not a direct effect of me clicking on the buy page, I can click on the buy page but control for the value submitted to the payment processor by Amazon, my checking account doesn’t go down. Well, if I control for the value submitted by the payment processor to my bank, my checking account doesn’t go down. Well, if I control for the value dispatched to the bank’s database, my checking account doesn’t go down. etc.
So the decomposition of effects in structural causal models is highly subjective; the direct effects, especially in social science, are effectively up to the choice of parts of the process to not model.
I like this post a lot, especially coming from an economics causal inference perspective. Pearl’s approach seems to dominate the causal ML literature from my limited understanding, and Jessica seemed to jab at that in her post today.
The time series aspect does have roots in economics (Granger causality, which was really popular in neuroscience for a bit), though it seems more hand-wavy in ML and more similar to structural equation modeling (SEM) in psychology than in structural econometrics with theory-driven models.
“Pearl would tell you that conditioning on mediators biases your estimate of the average treatment effect, but gives you an estimate of the direct effect. Of course, if you think about it too hard, what even is a mediator vs a direct effect? Can’t you always decompose everything down into more and more mediators until all direct effects disappear?”
This is a great point about average treatment effects, and something that is often overlooked and misunderstood in industry.
somebody – I can’t (nor do I care to try to) defend Pearl’s methods in general, as I don’t use them or know too much about them. However, your examples seem incorrect to me, because names are an identifier of race for the employer, just as is skin color. The assumption in your Jamal/Jake example is that the employer was unable to see the employee and thus the *only* identifier of race was the name. Because of this assumption, the name of the person *is* the race in this experiment, because it is the only means by which the employer can identify the race of the candidate. So I would suggest that in this case, the experimenter has not thought carefully enough about the experiment, and no amount of dags or ML or whatever will get it right. I don’t see that an example like this invalidates Pearl’s statement, because the action of the employer to discriminate can only happen in the context of the information that the employer has. In the Jamal/Jake example, race of candidate = name of candidate.
Let’s put it a different way. Pearl said
“the central question in any employment discrimination case is whether the employer would have taken the same action had the employee been of a different race… and everything else had been the same”
What variables constitute “the employee having been of a different race” and what variables constitute “everything else?” If you toggle the race, should that also toggle the name?
And yes, I agree that in this example, “race of candidate = name of candidate”. The issue is, why? Pearl’s statement suggests that the more “all else” you hold equal, the better. If it’s not incorrect, it’s nonetheless unhelpful and not-at-all revealing.
Sorry for the triple post, on my phone, habit.
Suppose you have a traffic cop’s record. During traffic stops, the “whiteness” of name of the person being pulled over, defined as the percent of people with that name who are white over the percent of the population who are white, is highly correlated with whether or not they are let off with a warning. However, if you control for the name, the warning rate is uncorrelated with the racial identification of the person pulled over. Remember that the officer is interacting with these people in person. Do you control for the name whiteness, or do you not?
somebody – It seems like you are essentially asking the same questions in all three posts. Again, I can’t speak for Pearl or his methods, but I can only imagine that when he makes that statement, that he assumes people will define race in the context of the problem at hand. In the examples you have posed, race is defined by the means by which it is detectable. The “why?” is because race is a quality that must be inferred by some detection process, and thus in the problems you pose, race would be defined by whatever detection process it is inferred from.
I’m afraid my objection runs a little deeper than that.
Suppose an employer genuinely has no racist intent, but has a policy against afros and other high-volume hairstyles, genuinely because they find it unprofessional. If a white person had an afro, they would be disciplined just the same as a black person with an afro. If a black person has straight hair, they are genuinely treated the same as a white person with straight hair. The employer knows their hairstyle and also knows their race.
Is this policy non-discriminatory? According to Pearl’s quote, the answer would be yes. I would disagree; the policy has a greater burden on black (and Jewish) people for whom low volume hair requires more maintenance.
somebody – See my original comment to this blog post. We’ve come full circle. You’re asking a different question than what Pearl is asking in his quote.
I think maybe you’re not remembering what the quote says?
“the central question in any employment discrimination case”
Well, first of all, it seems like you’re implying that the law only cares about intent, which is flatly just not true.
If you ignore the part where it makes a flatly incorrect legal claim, if you grant that the question one is trying to answer is discriminatory intent, the statement still provides an infinite rhetorical loophole. Until you define the essential characteristics of race, which are subjective from person to person, you can always argue that literally no policy carries discriminatory intent. You can always control for more stuff. I’m not discriminating based on race, I’m discriminating based on education level, income, height, weight, hair, dialect, dress style, national origin, parental birthplace, skin tone, cheekbone shape; the more stuff you hold equal the less discrimination there can be. Where that line is constitutes the “really hard social science problem”
Jessica:
Relevant to your discussion is this discussion from a few years ago of a statistical controversy on estimating racial bias in the criminal justice system.
I already linked to it in the post!
The big crime is murder/homicide, which has been an obsession of Anglo-Saxon jurisprudence since medieval times. Other crimes may have a lot of noise in them depending upon how much the police care about them, but the Big Sleep is the big one. A body with a hole in it or a hole with a body in it demands bureaucratic attention.
The FBI summarizes murders from police department statistics and the CDC summarizes homicides from cause-of-death reports.
Both statistics tend to track together closely. For example, the FBI reported that murders rose 29% from 2019 to 2020, the year of “the racial reckoning.” The CDC reported that homicide deaths were up 30% in 2020 over 2019.
Similarly, the FBI reported that blacks peaked as 60.4% of known murder offenders in 2021, while the CDC reported that blacks peaked as 55.0% of homicide victims in 2021. (The CDC reported that the black share declined slightly in 2022, while we are still waiting on the FBI to announce its 2022 results. It normally announces in late September, but this year it is late.)
If you look only at killings with guns, the racial imbalance is even worse.
All this data suggests that it’s improbable that America does not have what it obviously looks like it has: America has a big black gun violence problem.
Not surprising, given your history. But your view at least deserves some pushback. I agree that guns are endemic in black areas of the country, but labeling it as a “black gun violence problem” carries a lot of baggage I don’t agree with. It is our (all races) problem. I think it is unwise to ignore race when thinking about what to do about gun violence but that is not the same thing as labeling it as “their” problem. Perhaps you don’t intend it that way, but sloppy use of language here is not helpful.
In my view we frequently see this labeled as a “black” problem, and in a sense it is, but it’s also “a problem of a minority group that has been marginalized for centuries” and “a problem of a group that was preferentially targeted in the 1980s – 90s for participation in violent turf wars for crack cocaine distribution and sales” and “a group forced into tight living quarters with criminals through government housing projects that have produced some of the bleakest living conditions in the US” and etc etc.
If we want to address the problem of black males committing homicide we should probably address it at antecedent points along the chain of causality before the black male gang member with a gun is pointing it at another person…
Unfortunately even the white liberal groups you might expect to want to help black youth don’t seem to understand the problem. Segregation into housing projects, a dependence on a perverse set of “welfare” policies which form a poverty trap that can’t be escaped without jumping from $0 income to $50k and not going through any intermediate income level, and the continued role of illegal drug markets all need to be addressed. Also the family structure of these families.
Daniel
Your comments are very much in line with the Hu paper Jessica cites. My understanding, put in loose, imprecise terms, is that the construction of the counterfactual in such studies is fraught with normative judgements that cannot be avoided. The classic discrimination studies (Jamal vs Greg, identical in every respect except their name-implied race) seem well designed at first, but on closer inspection, avoid all of the socially constructed aspects of race that may be associated with discrimination. I find this mirrors Andrew’s prior post on research in psychology. These discrimination experiments are narrowly defined attempts to say something about a more general issue, but there is a leap required that is questionable. If we find that Jamal and Greg are treated similarly or differently in the discrimination experiment, what, if anything, does that tell us about any treatment of real Jamals or Gregs?
A third of a century ago, I would have called homicide largely a “black and Latino problem,” but there has been some change for the better since then.
Unfortunately, it’s not well known that homicides became significantly less of a problem for Latinos between the early 1990s and 2019 relative to both blacks and whites. (Unfortunately, Latino homicide rates have gone back up a lot in the 2020s relative to whites during the ongoing “racial reckoning.”)
That Latinos improved their behavior quite a bit relative to other groups in recent decades suggests to me that the current conventional wisdom that blacks are absolutely condemned by all history since 1619 (but not at all by, say, the 70,000 years of history before 1619) to behave as they behave now is overly pessimistic. If Hispanics can do better, why not blacks? Perhaps we should ask blacks to not shoot each other as much.
Unfortunately, since the emergence of Black Lives Matter at Ferguson in 2014, we’ve been increasingly telling them the opposite: that their problems are not their fault. But the Ferguson Effect and the Floyd Effect have not saved black lives on the whole. According to the CDC, black deaths by homicide increased from 7,767 in 2014 to 14,313 in 2021. And, presumably also related to BLM’s campaign to get the cops to pull over blacks less often for bad driving and then search them less for illegal guns, black traffic fatalities grew from 4,776 in 2014 to 8,583 in 2021.
Few are aware of these trends, even though they may be the most spectacular finding in American social science since Case and Deaton’s “deaths of despair” paper in 2015.
I have always been a little perplexed by many of the claims that ML can a priori be a causal inference tool – and I don’t really think the claim that it can be one is remotely verifiable either. Even if you use approaches like g-computation or double ML, you are still assuming that the completely unknowable approach used by an overly flexible algorithm in some way mirrors reality, or that the data alone are capable of identifying causal relationships without an underlying heuristic or framework required to justify that assumption. All k variables are being used regardless of their actual relevance to the task at hand, making them noise at worst and proxies at best in many cases.
To that end you might be led to nonsensical inferences about race. It seems increasingly as if we have a shiny new toy and we want to apply it to every social problem we can find regardless of its suitability.
This is probably irrelevant, but in case the authors feel like correcting some factual mistakes:
“predominantly Black Pomona High School”
I went to a debate tournament at Pomona HS c. 1974, so I was surprised to see that it had become “predominantly Black” since then. Looking up its current demographics in a US News database, I see, however, that it is predominantly Hispanic instead and is only 7% black.
98.7% Minority Enrollment
88.3% Hispanic
6.8% Black
2.0% Asian
1.3% White
1.1% Two or More Races
0.3% American Indian/Alaska Native
0.2% Native Hawaiian/Pacific Islander
It’s also not true that nearby Diamond Ranch HS is “predominantly white.” It’s only 4% white. Granted, that’s about 3 times as white as Pomona HS, but 4% white is not “predominantly white.”
95.9% Minority Enrollment
74.8% Hispanic
11.0% Asian
6.2% Black
4.1% White
3.5% Two or More Races
0.3% Native Hawaiian/Pacific Islander
0.1% American Indian/Alaska Native
https://www.usnews.com/education/best-high-schools/california/districts/pomona-unified-school-district/diamond-ranch-high-school-3062
In general, very few public high schools in Southern California are majority non-Hispanic white these days. I recently created a database of all public high schools in giant Los Angeles County, and not many were majority white. That seems like a baseline fact that would be useful to know when discussing the problem in the example.
For example, ultra-rich Beverly Hills HS, as seen in numerous movies such as “Clueless,” is merely 70% white these days (and that’s counting as white a lot of Persians — NPR reported in 2006 that Iranians made up 40% of the Beverly Hills school district’s students — and other Middle Easterners and North Africans whom the Biden Administration proposes to stop counting as white and put them in their own MENA race on the next Census).
https://www.npr.org/2006/06/08/5459468/living-in-tehrangeles-l-a-s-iranian-community#:~:text=Community%20estimates%20put%20the%20Iranian,the%20students%20in%20the%20schools.
Typing in high schools at random that seem like they might be majority white, I next tried Thousand Oaks HS, which is the public high school in a really, really nice exurb of Los Angeles in lovely Ventura County. And yes, it is majority white! Well, it’s 50.5% white, but that’s a majority (until a year or two from now).
I’m sorry if this seems off-topic, but I am concerned that theorists these days seem to be out of touch with reality.
While I appreciate you bringing data into the discussion, I think the quoted text was intended purely as an example, and the school names are just there to take the place of “School A” and “School B”. At the same time, people should use different wording when they’re just throwing out names for flavor rather than as factual examples. For example, in “suppose Diamond Bar is a neighboring school that is predominantly white,” the “suppose” is there to tell you that we aren’t talking about the real world; we are talking about a theoretical situation.
Or use generic sounding high school names like East HS vs. West HS or Adams HS vs. Franklin HS.
We possess a lot of data on the relationship of high school GPA, standardized test scores, and college GPA. Why not make use of all that rather than hypothesize in this data-free fashion?
You didn’t understand even the basic premise of the discussion. Why not get started on figuring out what the question is before you start “making use of” data?
In the example, the discussion of Jamal’s and Greg’s GPA is rife with hand-waving assumptions poorly informed by the numerous empirical studies done since the 1960s on precisely the example’s question of how best to compare black and white college applicants. A lot of researchers have thought hard about the theory and practice of college admissions over the last two generations. Why not look into this sizable literature rather than wing it in an “Assume we have a can opener” mode?
The discussion is not “assuming we have a can opener”. It does not make an actual conclusion or hypothesis about actual college admissions’ processes. It is responding to a purported general definition of fairness and discrimination. What you’re doing now is like looking at Euler’s presentation of the Seven Bridges of Königsberg problem and saying “well you could swim across a river. People have actually done it. Why not bring data to bear on the problem?”
Why bother talking about the Trolley Car Problem? No one rides in trolleys anymore.
This discussion centers around counterfactuals. Why not look into the actual post rather than making assumptions about what it’s supposed to be about?
> If both knowledge and race are expected to affect GPA, but only one of these is fair […] assuming we think being black leads to a lower GPA due to obstacles not faced by the white counterpart, like gaining access to certain study opportunities
Wouldn’t the main effect of having access to those study opportunities be an increase in knowledge?
Good point, that was confusing. Removed.
I think I’m missing something here. Sure one doesn’t have access to all variables that “generate race,” but ultimately when you apply causal inference approaches no one does. We do find however, that you can reliably predict race with high accuracy even with somewhat unintuitive proxies. One example is from radiology: https://www.thelancet.com/journals/landig/article/PIIS2589-7500(22)00063-2/fulltext where they got .95 AUC prediction of race from X-rays…
So getting back to your example with Greg and Jamal. How about you just train a model that embeds the names in high dimensional space, and all other data you have that predicts race with high enough accuracy, then balance on that information to get the average effect of race using some IPTW estimator?
Doesn’t seem like this “ridiculously hard” social problem if you can get a model to predict it with high accuracy and you specify what it generalizes to, but maybe I’m missing something.
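For readers who haven’t seen one, here is a minimal sketch of what an IPTW (inverse probability of treatment weighting) estimate looks like mechanically. Everything in it is an assumption made for illustration: the data are simulated, the “treatment” is a generic binary variable rather than race, a single discrete covariate `x` stands in for the high-dimensional proxies, and the propensity score is estimated as a simple within-stratum frequency rather than with an embedding model. It shows how the weighting works and says nothing about what the resulting number would mean when the treatment is race.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
true_effect = 1.0  # made-up effect of the binary treatment

# One discrete covariate standing in for the high-dimensional proxies.
x = rng.integers(0, 3, size=n)
p_treat = np.array([0.2, 0.5, 0.8])[x]  # treatment probability depends on x
t = (rng.random(n) < p_treat).astype(float)
y = true_effect * t + 2.0 * x + rng.normal(size=n)

# The naive difference in means is confounded by x.
naive = y[t == 1].mean() - y[t == 0].mean()

# Estimate the propensity score as the treated share within each x stratum.
e_hat = np.array([t[x == k].mean() for k in range(3)])[x]

# Normalized (Hajek) inverse-probability-of-treatment weighting.
w1 = t / e_hat
w0 = (1 - t) / (1 - e_hat)
iptw = (w1 * y).sum() / w1.sum() - (w0 * y).sum() / w0.sum()

print(f"naive ~ {naive:.2f}, IPTW ~ {iptw:.2f}, truth = {true_effect:.2f}")
```

The estimate lands near the truth here only because the simulation guarantees that `x` captures all confounding and that the treatment is well defined, which are the assumptions in dispute in this thread.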
What exactly do you think this would accomplish? Or, to put it another way. You do this and you get T = x. What is it that you think x is?
Generally speaking, artificial intelligence pattern recognition systems have shown an unpleasant tendency to discover esoteric ways to stereotype blacks in empirically accurate but ethically dubious ways.
For example, a consistent finding of American social science is that a black college applicant with identical objective traits as a white applicant will tend to wind up on average with a worse GPA in college. But, most Americans of all partisan views are highly uncomfortable with downgrading the application of a hard-working black applicant just because the data shows that building in a bias against blacks would increase the predictive validity of the selection system.
You can somewhat blind human evaluators to the race of applicants (e.g., replacing the name of the applicant with a number on his file). But machine learning black boxes have shown a disconcerting ability to accurately tease out the race of the applicant from obscure details in the file (analogous to this famous radiology mystery where radiologists can’t figure out how the machine learning program can determine the race of the patient from the scan) and then apply the basic empirical finding that black applicants tend to underperform their objective measurements in college to reject some black applicants just for being black.
Very few Americans, including me, feel ethically comfortable with this kind of racist robot that punishes high-achieving blacks just for being black on the grounds that big data shows that high-achieving blacks tend to regress toward a lower mean at the next stage of their education.
Steve –
> But, most Americans of all partisan views are highly uncomfortable with downgrading the application of a hard-working black applicant just because the data shows that building in a bias against blacks would increase the predictive validity of the selection system
It’s almost like people might think the goal of education is to advance the potential of all students, rather than to advance the “predictive validity” of the selection system.
Crazy!
Recall Rubin’s motto: “No causation without manipulation.” If you can’t, at least in theory, change a treatment, then you can’t know the causal effect of that treatment. Race and sex were always his favorite examples. It makes no sense to talk about the effect of race on employment if you can’t manipulate race.