“The problem of infra-marginality in outcome tests for discrimination”

Camelia Simoiu, Sam Corbett-Davies, and Sharad Goel write:

Outcome tests are a popular method for detecting bias in lending, hiring, and policing decisions. These tests operate by comparing the success rate of decisions across groups. For example, if loans made to minority applicants are observed to be repaid more often than loans made to whites, it suggests that only exceptionally qualified minorities are granted loans, indicating discrimination. Outcome tests, however, are known to suffer from the problem of infra-marginality: even absent discrimination, the repayment rates for minority and white loan recipients might differ if the two groups have different risk distributions. Thus, at least in theory, outcome tests can fail to accurately detect discrimination. We develop a new statistical test of discrimination—the threshold test—that mitigates the problem of infra-marginality by jointly estimating decision thresholds and risk distributions. Applying our test to a dataset of 4.5 million police stops in North Carolina, we find that the problem of infra-marginality is more than a theoretical possibility, and can cause the outcome test to yield misleading results in practice.

It’s an interesting combination of economics and statistics. Also, they do posterior predictive checks and use Stan! I only wish that on Figure 8 they’d’ve labeled the lines directly. Or at least put the codes of the legend in the same order as the lines in the graph. Figure 9, too. Also, I think Figure 7 would’ve worked better as a 2 x 4 grid of graphs. All those dots with different colors are just too hard to visually process.

10 thoughts on ““The problem of infra-marginality in outcome tests for discrimination””

  1. > Outcome tests, however, are known to suffer from the problem of infra-marginality: even absent discrimination, the repayment rates for minority and white loan recipients might differ if the two groups have different risk distributions. Thus, at least in theory, outcome tests can fail to accurately detect discrimination.

    I’m having trouble understanding the point of this observation.

    Suppose you have two groups who have radically different risk distributions. Greeks are unlikely to pay back loans no matter how creditworthy they may appear to someone who doesn’t realize they’re Greek. Turks scrupulously pay back whatever they’ve borrowed.

    We could have a bank that issues loans based strictly on the wealth and cosigners the borrower can demonstrate. This bank will issue a lot of bad loans, and Greeks will have much higher default rates than Turks, suggesting that the bank should be lending less money to Greeks and/or more money to Turks.

    Another bank could ask for the borrower’s nationality, and apply very different standards to Turks than it does to Greeks. This bank would have similar default rates for Greeks and for Turks, suggesting that its processes are serving their intended purpose.

    Which bank is “biased”? One has a different standard of creditworthiness for Turks than for Greeks, exactly as you’d hope it would have when, by hypothesis, Greeks and Turks are very different. This is a bias (in the colloquial sense).

    One has a race-blind standard of creditworthiness and applies it equally to Turks and to Greeks, meaning that when measured by actual creditworthiness, Greeks can easily get loans a Turk would never qualify for. This is also a bias. Unlike the other one, it’s a bias that loses the bank money.

    Outcome-testing the Bank of Equal Outcomes will show “no bias”, and outcome-testing the Bank of Equal Treatment will show a large bias. Who benefits from the new test that shows large bias at the Bank of Equal Outcomes because Greeks and Turks have equal default rates when Greeks should be defaulting a lot more, and “no bias” at the Bank of Equal Treatment because its incredible Greek default rate matches what the Greek default rate “should be”?

    Why would we want a bank to target different loan default rates for different groups of applicants (for the same loan types)? Why would the bank want this? As far as I can see, we do outcome testing because it measures what we care about, not because it is closer to some platonic truth.
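    The Greek/Turk thought experiment is easy to check numerically. A minimal sketch, assuming (hypothetically) that both groups present the bank with the same observable score but Greeks repay less at every score level, shows the race-blind bank ending up with very different default rates — which is exactly what an outcome test reads as bias:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical setup (not from the post): both groups present the same
# observable score s ~ N(0, 1), but the true repayment probability is
# systematically lower for Greeks at every score.
def repay_prob(s, group):
    shift = 1.5 if group == "Greek" else 0.0
    return 1 / (1 + np.exp(-(s - shift)))

threshold = 0.5  # one race-blind creditworthiness bar for everyone

for group in ("Greek", "Turk"):
    s = rng.normal(0.0, 1.0, n)
    approved = s > threshold              # identical treatment for both
    p = repay_prob(s[approved], group)
    repaid = rng.random(p.size) < p
    print(group, "default rate among loans made:", round(1 - repaid.mean(), 3))
```

    With identical treatment, Greek loans default far more often than Turkish ones, so an outcome test would infer that Turks face a higher bar even though no such bar exists.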

    • > Why would the bank want this?

      For regulatory compliance, if they cannot legally “discriminate”. “Reasonable business practices” can stop being treated as such. For example, differential pricing of insurance by gender was banned in the European Union five years ago.

      > During routine traffic stops, officers have latitude to search both driver and vehicle for drugs, weapons, and other contraband when they suspect more serious criminal activity. These decisions are based on a myriad of contextual factors visible to officers during stops, including a driver’s age and gender, criminal record, and behavioral indicators of nervousness or evasiveness. We assume that officers use this information to estimate the probability a driver is carrying contraband, and then conduct a search when that probability exceeds a fixed, race-specific search threshold.

      A more natural model would include the race among the myriad of contextual factors visible to officers, but of course we shouldn’t do that because it’s against the law. As the footnote explains:

      > Taste-based discrimination stands in contrast to statistical discrimination [Arrow (1973), Phelps (1972)], in which officers might use a driver’s race to improve their estimate that he is carrying contraband. Regardless of whether such information increases the efficiency of searches, officers are legally barred from using race to inform search decisions outside of circumscribed situations (e.g., when acting on specific and reliable suspect descriptions that include race among other factors). As is standard in the empirical literature on racial bias, we test only for taste-based discrimination.

      So the model is that officers make a prediction for each individual (a function of a myriad of factors, excluding race) and we assume that the prediction is an unbiased estimator of the actual probability of finding contraband during a search. These predictions will be distributed differently for different populations (as the mix of the myriad of factors won’t be the same), but we make the assumption, without any justification, that these distributions will always be beta distributions. At that point it is relatively easy to find the implied thresholds and determine whether the officers are applying different thresholds to their unbiased, beta-distributed, race-independent predictions.
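      That model is simple to simulate. In the sketch below (parameters are illustrative, not the paper’s estimates), both groups face the same search threshold, yet the observed hit rates differ because the beta risk distributions differ — the infra-marginality problem the threshold test is meant to fix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (not from the paper): each group's contraband
# risk follows its own beta distribution; officers search whenever the
# estimated risk exceeds a threshold -- here the SAME one for both groups.
groups = {"A": (2.0, 18.0), "B": (2.0, 10.0)}
threshold = 0.15

for name, (a, b) in groups.items():
    risk = rng.beta(a, b, size=100_000)       # per-driver risk estimates
    searched = risk > threshold               # the search decision
    found = rng.random(searched.sum()) < risk[searched]
    print(name, "search rate:", round(searched.mean(), 3),
          "hit rate:", round(found.mean(), 3))
```

      Equal thresholds, unequal hit rates: an outcome test that compares hit rates across groups would wrongly read this as discrimination, which is the gap the joint estimation of thresholds and risk distributions is supposed to close.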

    • That’s not really a decision a bank is allowed to make – “we need to accept loss rate X for race A and loss rate Y for race B”. Legally there are lots of factors they cannot use (race, location, etc.) in underwriting for good reasons. They also need to make sure that they are not using proxies for off-limit factors, hence prohibitions on redlining, for example.

      I am curious though to look for data on banking decisions along racial lines based on type of loan. For example, credit card lending is usually robotically decided by scorecard models and seldom requires face-to-face contact. Small business loans, however, are larger and less cookie-cutter, affording underwriters more discretion, and more often involve face-to-face meetings.

      • The point of the OP is that the test shows that the decision that is blind to ethnicity is the one flagged as discriminatory. If the distributions differ between groups, it seems the regulations require that the lender find the functional equivalent of a proxy for the groups in order to comply. The regulators and lenders may not be conscious of what they are doing, but they will at least take a random walk tending to that result.

        A very controversial example in recent years was the CFPB’s prosecution of Ally (formerly GMAC) over discrimination in auto loans. The complaints against the CFPB vary from forcing equal racial outcomes to using arbitrary criteria to extract political rent. It is understandably hard to find apolitical information on an allegedly politically motivated prosecution, but this article at least focuses on an innocent Bayesian.

        http://www.latimes.com/business/la-fi-rand-elliott-20160824-snap-story.html

  2. “if loans made to minority applicants are observed to be repaid more often than loans made to whites, it suggests that only exceptionally qualified minorities are granted loans”

    Is it just me or does this sound like the authors think it is impossible for minorities to be more reliable at paying back loans? Why would they think this?

    From the paper we get more context:
    “Becker argued that even if minorities are less creditworthy than whites, minorities who are granted loans, absent discrimination, should still be found to repay their loans at the same rate as whites who are granted loans. If loans to minorities have a higher repayment rate than loans to whites, it suggests that lenders are applying a double standard, granting loans only to exceptionally qualified minorities.”

    There it is, they are just assuming that “minorities are less creditworthy than whites”, apparently because some other guy assumed it in 1957.

    • >There it is, they are just assuming that “minorities are less creditworthy than whites”, apparently because some other guy assumed it in 1957.

      well, maybe that and the observable fact that in the US most minority groups are less wealthy than whites on average, and less educated on average, and have lower income on average, and have lots of other things… *on average* which would usually indicate a higher default rate. At the individual level of course, if you’re allowed to select on whatever you want, it’s easy to find a group of minority citizens and a group of white citizens where every single minority citizen is more credit worthy than every white citizen in the sub-sample. So the question comes down to something like: “Does the minority/white group that is served share the statistical properties of the entire minority/white population it’s drawn from? And, if not, were the factors used to sub-sample the population legally allowable factors?”

    • I find it improbable that “authors think it is impossible for minorities to be more reliable” or that “some other guy [Becker] assumed it”. Instead, I guess that in this very specific situation they operate under a very simplified model which makes the claim correct.

      Suppose there are two groups, “Minority” and “Majority”. Say 30% of people in “Minority” have a probability of repayment equal to 0.1 and 70% of people in “Minority” have a probability of repayment equal to 0.9. For the “Majority” the proportions are reversed. Suppose further that the lending company can estimate the probability of repayment almost perfectly and that it only makes sense to grant a loan if the probability of repayment is bigger than 1/2. In this example there is no discrimination, and the “Minority” is more reliable in terms of loan repayment, but observed rates of default will be approximately identical because in both groups the unreliable borrowers are screened out. Of course, there will be a big difference in the number of credit applications accepted!

      Sure, in real life there are two important problems that could invalidate the claim in the quoted passage:
      1. Conditional on passing the acceptance threshold, the distribution of default probability for the “Minority” can be more concentrated in the neighborhood of zero (as contrasted with the “Majority” case).
      2. In the presence of noise in estimating the probability of default, it matters what the underlying proportions in each group are. For instance, in my simplistic example such noise should increase the observed default rate more for the “Majority” than for the “Minority”.
      But it seems that this is exactly the problem the authors are trying to learn how to resolve, isn’t it?
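      The two-point example can be verified with a few lines of arithmetic (all numbers are the comment’s hypothetical ones):

```python
# Hypothetical two-point example: repayment probability is either 0.1
# or 0.9, and the lender (estimating it perfectly) approves only
# applicants whose repayment probability exceeds 1/2.
groups = {
    "Minority": {0.1: 0.30, 0.9: 0.70},   # p_repay -> population share
    "Majority": {0.1: 0.70, 0.9: 0.30},
}

for name, dist in groups.items():
    approved = {p: w for p, w in dist.items() if p > 0.5}  # only p = 0.9
    approval_rate = sum(approved.values())
    default_rate = sum((1 - p) * w for p, w in approved.items()) / approval_rate
    print(f"{name}: approval rate {approval_rate:.0%}, "
          f"default rate {default_rate:.0%}")
```

      Both groups show a 10% observed default rate among loans made, despite very different approval rates (70% vs. 30%) and very different underlying reliability.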

    • “Is it just me or does this sound like the authors think it is impossible for minorities to be more reliable at paying back loans?”

      It’s just you . . . it sounds to me like that’s just the setup hypothetical for a proof by contradiction. And it’s intuitively sensible. Suppose that at point A (credit score 800 and assets of $70,000), Whites and Blacks both have a 0% likelihood of default. At point B (700 and assets $50,000) Whites and Blacks both have a 10% likelihood of default. And at point C (credit score 650 and assets $40,000), Whites and Blacks both have a 25% likelihood of default. But the bank is giving loans to Whites at 650/$40K, and denying loans to Blacks at that same point. Then all things being equal, you’d expect a lower rate of default from Blacks and a higher rate of default from Whites.

      “All things being equal” covers a lot, though, because if 50% of Whites are at point A, 25% at point B, and 25% at point C, but 0% of Blacks are at point A, 25% at point B, and 75% at point C, the White default rate could actually be lower. You’d get a White default rate of 8.75%, and a Black default rate of 10% (because only Blacks at point B, with a default rate of 10%, would be getting loans). So it would look like Blacks are actually being loaned more money than they should.

      I had to fiddle with those percentages before the numbers said what I wanted them to say (at, say, A:25%, B:25%, C:50% for Whites, their blended default rate would be greater than 10%), but given the magnitude of the racial disparities in household wealth mentioned below, it’s not inconceivable that the distributions for Blacks and Whites could be this different.
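      The arithmetic in that example is easy to reproduce (the numbers are the comment’s hypothetical ones; the Black share at point C never enters the observed rate, since those applicants are denied loans):

```python
# Default probability at each hypothetical credit profile.
default_prob = {"A": 0.00, "B": 0.10, "C": 0.25}

# The bank lends to Whites at all three points.
white_share = {"A": 0.50, "B": 0.25, "C": 0.25}
white_rate = sum(default_prob[k] * white_share[k] for k in white_share)

# Blacks at point C are denied, so only the A and B shares matter.
black_loans = {"A": 0.00, "B": 0.25}
black_rate = (sum(default_prob[k] * s for k, s in black_loans.items())
              / sum(black_loans.values()))

print(f"White default rate: {white_rate:.2%}")   # blended over A, B, C
print(f"Black default rate: {black_rate:.2%}")   # loans only at A and B
```

      The White blended rate comes out to 8.75% and the Black rate to 10%, matching the comment: the group facing the stricter cutoff can still show the worse observed default rate.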

  3. It’s easy to understate what a difficult problem this is. Assume a bank with a totally fair model makes some loans in 2005. When you fast forward to 2010, after the financial crisis hits minorities especially hard, the model probably doesn’t look so good in retrospect. Creating a model that’s theoretically fair is hard enough; creating one that’s still fair given imperfect information and forecasting error just seems insanely hard.

    For context, I used to build financial models for a totally different context where we didn’t have this set of concerns, and even there most models discussed in academia and the media didn’t pass the laugh test.
