Measuring Fraud and Fairness (Sharad Goel’s two talks at Columbia next week)

MONDAY DSI TALK

One Person, One Vote

Abstract: About a quarter of Americans report believing that double voting is a relatively common occurrence, casting doubt on the integrity of elections. But, despite a dearth of documented instances of double voting, it’s hard to know how often such fraud really occurs (people might just be good at covering it up!). I’ll describe a simple statistical trick to directly estimate the rate of double voting — one that builds off the classic “birthday problem” — and show that such behavior is exceedingly rare. I’ll further argue that current efforts to prevent double voting can in fact disenfranchise many legitimate voters.

Paper: https://5harad.com/papers/1p1v.pdf
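
(For intuition, here is a toy version of the birthday-problem arithmetic, with made-up numbers; it is only a sketch, not the analysis in the paper: among registered voters who share an exact name, count how many pairs would be expected to share a date of birth purely by chance, and compare that baseline to the number of apparent duplicates actually observed.)

# Toy birthday-problem baseline (NOT the method in the paper): expected number
# of same-name voter pairs that share a date of birth purely by chance.
from math import comb

def expected_chance_collisions(name_group_sizes, n_birthdates=30_000):
    # Assumes, unrealistically, that birthdates are uniform over n_birthdates
    # possible dates and independent of name.
    return sum(comb(n, 2) for n in name_group_sizes) / n_birthdates

# Hypothetical numbers: 2,000 distinct names, each shared by 50 registered voters.
print(expected_chance_collisions([50] * 2_000))   # about 82 coincidental pairs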

TUESDAY PRC TALK

The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning

Abstract: The nascent field of fair machine learning aims to ensure that decisions guided by algorithms are equitable. Over the last several years, three formal definitions of fairness have gained prominence: (1) anti-classification, meaning that protected attributes — like race, gender, and their proxies — are not explicitly used to make decisions; (2) classification parity, meaning that common measures of predictive performance (e.g., false positive and false negative rates) are equal across groups defined by the protected attributes; and (3) calibration, meaning that conditional on risk estimates, outcomes are independent of protected attributes. In this talk, I’ll show that all three of these fairness definitions suffer from significant statistical limitations. Requiring anti-classification or classification parity can, perversely, harm the very groups they were designed to protect; and calibration, though generally desirable, provides little guarantee that decisions are equitable. In contrast to these formal fairness criteria, I’ll argue that it is often preferable to treat similarly risky people similarly, based on the most statistically accurate estimates of risk that one can produce. Such a strategy, while not universally applicable, often aligns well with policy objectives; notably, this strategy will typically violate both anti-classification and classification parity. In practice, it requires significant effort to construct suitable risk estimates. One must carefully define and measure the targets of prediction to avoid retrenching biases in the data. But, importantly, one cannot generally address these difficulties by requiring that algorithms satisfy popular mathematical formalizations of fairness. By highlighting these challenges in the foundation of fair machine learning, we hope to help researchers and practitioners productively advance the area.

Paper: https://5harad.com/papers/fair-ml.pdf
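
(A toy simulation, with entirely made-up data, of the tension described above: risk scores can be calibrated for both groups while false positive rates still differ, simply because the groups’ risk distributions differ.)

# Made-up illustration: calibrated risk scores and a single decision threshold,
# yet different false positive rates because the groups' risk distributions differ.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
group = rng.integers(0, 2, n)                        # protected attribute, 0 or 1
risk = rng.beta(2, np.where(group == 1, 3, 5), n)    # group 1 skews riskier
outcome = rng.random(n) < risk                       # outcomes drawn from the true risk
decision = risk > 0.5                                # same threshold for everyone

for g in (0, 1):
    m = group == g
    fpr = decision[m & ~outcome].mean()                        # classification parity check
    calib = outcome[m & (risk > 0.5) & (risk < 0.6)].mean()    # calibration in one bin
    print(f"group {g}: FPR = {fpr:.3f}, P(outcome | 0.5 < risk < 0.6) = {calib:.3f}")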

50 thoughts on “Measuring Fraud and Fairness (Sharad Goel’s two talks at Columbia next week)”

  1. The notion of fairness, that similarly risky people should be treated similarly, is absolutely the right notion of fairness, except that we also want equal accuracy in some sense. It’s no good to say “all black people are a bad risk” and then show that you treat black people just as badly as you treat high-risk white people… and conclude that you’re being fair. It’s just not true that all black people are high risk, so you need to show that whatever your risk assessment is, it is justified by data and a model that doesn’t build in strong biases.

    On the other hand, it absolutely can’t be the case that “the same fraction of both black people and white people are classified as high risk”. These populations are different, here in the US. Median net worth of a black family is like a few thousand dollars, and a few hundred thousand for a white family. So, default risk just *isn’t distributed the same* in those populations.

    These topics are very important, and policy makers tend to be way behind the curve on technology, math, science, etc. We need to communicate these issues clearly.

    • Daniel:

      What I find problematic is what I would term “lazy modeling”. That is, throwing as much information as possible into a model, regardless of the quality of the measurement and without appropriate consideration of multifinality and equifinality, and using such a model to make important life-altering decisions about people. It is lazy both as a modeling process and intellectually, as it pretends the hard work of studying and understanding the causal processes is unimportant, when it quite clearly isn’t. Lazy models treat all members of a categorized group homogeneously and often pretend that a model with a lot of variables must contain all of the important causal processes, and therefore assume strong inferences at all levels of analysis are perfectly reasonable. They are not. If a single causal variable is missing from the model that changes the results at any level, then the fact that it is missing is a failure of imagination, effort, time, or treasure.

      You mentioned the importance of these topics, but when many pretend that failures of these kinds are not real or are not important and that the reification of the classification is “fair” because it was not intentional, I think we need to think seriously about the uncertainty in these results, especially when there may be many convinced that they are in fact certain. Labels applied to people that are then reified is a pernicious problem for which there is no simple solution, but we who attempt to build models of reality should not allow lazy modeling to substantially contribute to the problem.

      • I am in 100% agreement with you. That we can’t mechanistically model something in order to decide whether we should mechanistically model it shouldn’t be carte blanche to just throw stuff at a deep learning algorithm and claim the result is good because after all we didn’t bias it on purpose… or stuff like that.

        When I was working with people doing forensics on construction, people would often claim to have “randomly sampled” locations around a site… this *always* meant *haphazardly sampled* with whatever biases their own particular brain brought to bear on the problem. I had to explain to people “just because you don’t know why you picked those spots doesn’t mean that they are as representative as if you’d used a random number generator to uniformly sample the region”. Eventually they’d get it… but they’d mostly still want to slip back into that idea of “just because I don’t know about the bias, it must be unbiased”… a sort of Mind Projection Fallacy.

        • One of my character flaws is that I can’t seem to really absorb certain facts to the extent that they fail to surprise me. Andrew says I’m easily surprised, and he’s right. To draw an analogy: many people find that sunsets are often strikingly beautiful. You’ve seen 1000 beautiful sunsets, or 10,000, and yet somehow the next one is still striking.

          That ability to be put in a state of wonder again and again is something like my ability to be surprised by the same fact again and again, at least with certain facts. One of those facts is that people will say something was ‘random’, indeed will insist on it, and then it turns out it wasn’t random at all… and when I say ‘people’ I include mathematically inclined people in scientific fields. I recently had this experience, in spades, in a consulting project: we asked the client for a ‘representative sample’ of data and they gave us a subset of data that were filtered based on an important parameter. We did eventually discover this but… wtf. What concept of ‘representative’ could they possibly have? I have no idea.

        • It’s an interesting fact about randomness that almost every possible long sequence of bits is random, and yet they are extremely hard to make. It comes down to Kolmogorov complexity. Most long random sequences can only be output by a computer program equivalent to “print(my_sequence)” where obviously my_sequence is a variable that is explicitly initialized to be the enormous sequence itself. The work that goes into designing RNGs that can output “pretty good random” is significant.

          I think the duality between Bayesian and frequency-based probability is a category built into our brains, because it seems so natural to people to mix these up. I propose that when we say a thing is random we should mean it’s part of one of those high-complexity incompressible sequences. When we mean the thing is unknown (and so we have some Bayesian probability distribution over it) we should say unknown. Never say Random when you mean Unknown. Bayes isn’t about random variables necessarily. Frequentist inference *is* about random variables necessarily.

          This would solve a lot of confusion. It should be somewhere near the front of every stats textbook: at least a half chapter on computational complexity, random sequences, computational RNGs, and so forth.

        • I’m in the midst of writing a stats textbook and a characterization of the epistemic nature of Bayesian notions of probability is literally the first thing I cover. It takes a while before I get to random sequences, typicality, and computational RNGs. Kolmogorov complexity, as cool as the idea is (arguably one of the biggest ideas of twentieth-century mathematics), would be a stretch for my intended audience (unless someone knows an elementary description that’d work). Also, I don’t feel like I understand Kolmogorov complexity well enough to do it justice.

        • Bob, it sounds like a cool project. I keep thinking I should do something similar. Actually one of the reasons I comment so much here is because it helps me gel the ideas in my mind for a book like this… I should probably start writing it soon ;-)

          Kolmogorov complexity is not that complicated an idea; it’s got a lot of complicated consequences, but you don’t really need to get into those.

          Kolmogorov complexity is a function of two things: one is the language you use, but this basically just sets the units of measurement… the other is the problem you’re solving (outputting a certain dataset). If you assume we just pick a convenient language that isn’t too stupid, then for any solvable problem (i.e., outputting a certain image) there exists some shortest program that solves the problem. The length of that program is the algorithmic complexity of that problem.

          Unfortunately, you can’t compute the Kolmogorov complexity, because finding the shortest program that solves the problem requires essentially trying all the different programs, and some of them don’t terminate; as soon as you hit something like “while(1){}”, your search gets stuck.

          But you can exhibit a program that solves the problem, and this immediately shows that the algorithmic complexity is less than or equal to the length of that program.

          One program that solves the problem is always the program print(“…my_dataset_here…”) where you just type out the dataset. So for a given language, you can easily show that the complexity of *every* sequence of length k is less than or equal to k+c for some small constant c.

          In a language where print(“…”) is possible, you need about c=9 characters (please don’t ask about escaping quotes; I’m pretty sure the problem is solvable but I don’t know off the top of my head!).

          It’s easy to show from a counting argument that if there are N symbols in your language, then there are fewer than N^(k+1) programs of length less than or equal to k (just think of programs as numbers with at most k digits in base N).

          On the other hand, suppose l is a number substantially bigger than k; there are N^l different length-l datasets.

          Even if all of the programs fewer than k symbols long terminate, they can’t produce all the length-l datasets, especially because a bunch of them are just print(“hello world”) and print(“hello hello world”) and so forth.

          As l gets bigger than k, very rapidly (due to the exponential function) essentially all the length-l datasets have complexity bigger than k, just by counting.

          So, if you want to produce “random numbers” of very long length… let’s say a billion 32-bit integers, and you want to do it with a program you can write in a couple hundred lines ~ 10,000 characters of a normal programming language… only a *very small* fraction of the possible billion-integer sequences can be produced by programs of 10,000 characters, and the outputs of essentially all short programs will fail tests of randomness, so you really have to work hard to find the very special programs called RNGs.
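
          To put rough numbers on that counting argument (just a sketch, with arbitrary but representative choices of alphabet size and program length):

          # Rough numbers for the counting argument: programs of length <= 10,000
          # characters versus datasets of a billion 32-bit integers.
          from math import log10
          ALPHABET = 128                    # assumed number of symbols in the language
          PROG_LEN = 10_000                 # characters allowed in the program
          DATA_BITS = 1_000_000_000 * 32    # a billion 32-bit integers
          log10_programs = (PROG_LEN + 1) * log10(ALPHABET)  # fewer than 128^10001 programs
          log10_datasets = DATA_BITS * log10(2)              # 2^(32 billion) datasets
          print(f"programs: fewer than 10^{log10_programs:.0f}")   # ~10^21,000
          print(f"datasets: about 10^{log10_datasets:.0f}")        # ~10^9,600,000,000

          So the fraction of those datasets that any program of at most 10,000 characters could possibly produce is no more than about 10^(21,000 - 9,600,000,000), which is zero for all practical purposes.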

          Per Martin-Löf showed that sequences that pass big complicated batteries of statistical tests are basically the same set of sequences as the high Kolmogorov complexity sequences. (He actually shows that there’s a universal “most powerful” test, even though the test itself isn’t computable.)

          If you are studying a thing in the world that has some regularity to its behavior, like say pharmacokinetics measurements, you can imagine writing a computer program that simulates it with ODEs. Imagine you collect data from a person every couple of minutes for 12 hours after giving them a pill, and you do this for millions of people. You get a massive dataset. That dataset can be simulated by your computer program by searching for a couple of parameters that describe each person’s physiology. Sure, it won’t perfectly compute the measurement at each point in time, but the errors will be small enough that they can be encoded in a few extra “error make-up bits”… So you can write a program which takes in the dataset and the pharmacokinetics model, and outputs a program that describes the parameters to use for each person and the make-up bits to add to the simulations, and this output program will perfectly reproduce the hundreds of millions of numbers you collected. That program will be, let’s say, a million bits long (most of which is make-up error bits) to output a gigabyte (8 billion bits) of data.

          This argument shows that no, really, your dataset isn’t an IID random sample. Even the errors from the model probably aren’t IID random. Some person will have measurements that are not entirely consistent with any parameter setting for the ODE, and so there will be correlations between the errors, etc. You could further compress things using, say, Lempel-Ziv on the “make-up bits”.
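
          Kolmogorov complexity itself isn’t computable, but an off-the-shelf compressor gives a crude upper bound, which is one computable way to see the same point (a toy sketch; the smooth curve below is a made-up stand-in for model-like data, not real pharmacokinetics):

          # zlib (an LZ77-family coder) upper-bounds how many bits a byte string
          # "really" needs: smooth, model-like data compresses a lot, while the
          # output of a good RNG barely shrinks at all.
          import os, zlib
          import numpy as np
          t = np.arange(100_000)
          smooth = np.round(1000 * np.exp(-t / 20_000)).astype(np.int16).tobytes()
          noise = os.urandom(len(smooth))
          print(len(smooth), "->", len(zlib.compress(smooth, 9)), "bytes (model-like)")
          print(len(noise), "->", len(zlib.compress(noise, 9)), "bytes (RNG output)")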

        • I like to post to get feedback. It’s the one venue Andrew’s guaranteed to read and comment on in a timely fashion, among other good properties of the blog. Same with the Stan forums. Luckily, I’m not afraid to launder my ignorance or stupidity in public if it’ll let someone else correct my misconceptions.

          I understand the basic definition and ramifications of Kolmogorov complexity. I have a pretty strong background in computability and ZF set theory—I used to teach logic and computability theory as part of our computational linguistics program at Carnegie Mellon. I have just never used Kolmogorov complexity for anything, so I’d be reluctant to start talking about it for fear of getting important details and implications wrong. I think the problem for the book I’m writing is that I won’t be able to expect any background in computability and I don’t want to have to detour through the halting problem (equivalently Gödel’s incompleteness with a little coding). But if you write it up, I’ll definitely read it.

          Martin-Löf showed that sequences that pass big complicated batteries of statistical tests are basically the same set of sequences as the high Kolmogorov complexity sequences.

          Now that’s cool. I never did understand all the stuff Martin-Löf was doing when everyone was talking about it all the time in computation theory seminars in grad school (I went to Edinburgh Uni, a hotbed of programming language theory). Matthijs Vákár, one of the Stan devs, is very up on it as he did his thesis with Samson Abramsky on related topics. Matthijs is a real programming language theorist—I say that because I can’t even understand the point of his papers, much less the details. In the past, I’ve used a bit of domain theory (computability meets topology to provide a model of untyped lambda calculus and also arbitrary-precision floating point convergence) to prove some theorems about logic programming back when I dabbled in programming language theory. Turns out everything I worked on as a grad student and professor completely fell out of fashion. That’s why I moved into machine learning for natural language. Then I moved into stats because I couldn’t understand the stats discussions in the ML papers (which I now realize was only partly my fault—the natural language literature in particular is a mess).

        • Breck and I had that experience with almost every one of our clients at Alias-i. We’d ask for a random sample, and they’d do things like give us a week of data from December, saying that’s the most important month. Or they’d send us a sample of “high value” customer data or something like that. They were genuinely trying to help.

        • Phil said,
          “… people will say something was ‘random’, indeed will insist on it, and then it turns out it wasn’t random at all.. .we asked the client for a ‘representative sample’ of data and they gave us a subset of data that were filtered based on an important parameter.”

          And Bob said,
          “Breck and I had that experience with almost every one of our clients at Alias-i. We’d ask for a random sample, and they’d do things like give us a week of data from December, saying that’s the most important month. Or they’d send us a sample of “high value” customer data or something like that. They were genuinely trying to help.”

          These both sound to me like criticizing someone who does not know your language, when it is not reasonable to expect them to know your language. If you ask a client to give you a representative or random sample, it is part of your job to tell them precisely what you mean by a representative or random sample (i.e., to define your terms) — or better yet, tell them how to obtain one. Your bad, not theirs — they can’t read minds. A basic tenet of communication skills is that you need to know your audience, and speak to them, not to people like yourself.

        • While I agree with you about communications, I think it’s interesting how easily even people trained in some of these issues slip into “I don’t know, so it must be random”. Of course, “I don’t know” corresponds to a Bayesian distribution that is wide… but a wide Bayesian distribution doesn’t mean the thing itself is random, it just means you don’t know!

          The concept of the Mind Projection Fallacy is widespread enough that I think this is some fairly fundamental issue with human cognition.

          But, yes, when you say “give me a random sample”, instead you should probably say “use a computer random number generator that gives numbers between 0 and 1 to select some of your data by choosing all the data points that have their random number less than 0.02; that will give me about 2% of your data”.
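
          Spelled out in code, that recipe is just something like this (a minimal sketch; the file names and the 2% threshold are placeholders):

          # Keep each row if its uniform(0,1) random number is below 0.02,
          # i.e., an expected 2% simple random subsample of the file.
          import csv, random
          random.seed(12345)   # record the seed so the same sample can be regenerated
          with open("all_data.csv", newline="") as src, \
               open("sample_2pct.csv", "w", newline="") as dst:
              reader, writer = csv.reader(src), csv.writer(dst)
              writer.writerow(next(reader))      # copy the header row
              for row in reader:
                  if random.random() < 0.02:     # keep each row with probability 0.02
                      writer.writerow(row)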

        • Daniel said,
          “I think it’s interesting how easily even people trained in some of these issues slip into the “I don’t know, so it must be random””

          The question here is: Have they just “been told” or have they really been “trained”? One of my mantras as a teacher is, “Telling is not teaching” — yet too often teachers just tell, and don’t really teach or train. So, for example, in training or teaching about what we mean by “random” in the context of statistics or probability, do we just give the definition (and maybe an example or two that fit the definition), or do we give them exercises that ask “Is this random or not?”, and ask them to give reasons for their answers? The latter is needed for really teaching or training.

        • Martha (and Andrew),
          Sure, you’re right, it’s clear from my experience (and Bob’s, and Daniel’s) that a lot of people with degrees in fields like engineering, math, physics, etc., think that a ‘random’ selection is one that they have chosen for a particular purpose. That’s exactly what surprises me, no matter how many times it occurs. As I said in my comment, I agree that this is a flaw in my character.

          If someone is asked for data from a random selection of dates and they haphazardly choose, say, January 11, February 2, March 8, March 17, April 4, etc. (I just made up those dates off the top of my head), I agree, I should have been more clear that I wanted genuine randomness and not just a selection that doesn’t have a deliberate pattern. “Chosen without method or conscious decision” is the first definition that I get when I search for ‘random’ on Google, and I agree that the difference between that plain-English definition of ‘random’ and the statistical one is up to me to explain. But if I ask for data from a random selection of dates, and someone gives me the dates when Parameter A was highest, I don’t see how that conforms to _either_ definition of random. If someone thinks that is ‘random’, it makes me wonder what they think ‘random’ means. They certainly don’t think it means ‘chosen without method or conscious decision’!

          So, on the one hand: mea culpa. But: I am never going to stop being surprised by this behavior. I’ll never stop being amazed by beautiful sunsets either. These are not feelings I can turn off.

        • These both sound to me like criticizing someone who does not know your language, when it is not reasonable to expect them to know your language.

          Absolutely—communication is always a huge problem when collaborating, especially in a consulting framework where there’s very asymmetric knowledge (techniques known by consultants, domain known by clients). And it’s especially challenging in short blog comments!

          We certainly tried to explain what we meant by a random sample as best we could right from the get-go. We weren’t so dysfunctional as to just tell them “give us a random sample”. We did want these projects to succeed, after all, and we were always on a time budget. The real problem that I didn’t manage to convey in the original comment was that even after we managed to explain to clients that we really wanted them to use a random number generator to select a subsample of data, which they understood as computer scientists, they wanted to argue that a non-random sample would be better. They seriously thought that if we trained on just December, we’d get a better system because that was the “important” month. This was also an ongoing problem for my dad as an attorney—clients would just ignore his advice because they thought they knew better.

          Martha goes on to say

          yet too often teachers just tell, and don’t really teach or train. So, for example, in training or teaching about what we mean by “random” in the context of statistics or probability, do we just give the definition (and maybe an example or two that fit the definition), or do we give them exercises that ask “Is this random or not?”

          It would’ve been great if the clients gave us the time to teach them how to do stats. We really did try the best we could to educate them. But it’s challenging. They’re not college students sitting down for a class, they’re busy professionals who just want to get something done. They’d say things like they hired us to do the job, then they’d fight against our recommendations. It was really frustrating.

          As another example, one of the ongoing struggles we had is that clients thought human agents (at a call center, for example) were perfect. They’d tell us things like they wanted our systems to be 95% accurate. It took a huge amount of educational effort (usually on Breck’s part) to bring them to the point of evaluating a group of humans on tasks like spelling correction or intent classification in text messages. Invariably, these were complicated problems on which human performance varied with typically less than 80% agreement among human agents. Then the customers would complain the evaluation was biased, and on it went, because they didn’t want to give up the position that their current system was perfect and a computerized version was just for cost savings.

          One of these cases was with an academic spin-out in epidemiology for the CDC—we classified chief complaints in emergency room admissions into major syndromes. Despite the clients insisting epidemiologists could do this flawlessly, we realized the classification task made no sense because it insisted on a single syndrome per chief complaint. If someone was hit in the head, they’d often have skeletal issues, hemorrhaging issues, and neural issues—there was no way to select just one reliably, as there were no guidelines for what to do in cases of ambiguity. When we finally got them to get 10 epidemiologists to each classify 1000 chief complaints, agreement on the major syndrome reported (out of something like seven or eight) was around 70%!

        • And if you think people are prone to confuse random and haphazard, that’s nothing compared to the way people misunderstand the term “missing at random.” They usually take it to mean haphazardly missing. If they are more sophisticated, they are still likely to get it wrong in the other direction, assuming that it means what is properly called “missing completely at random.”

          I think these patterns of misuse of “random” and its derivatives are so deeply entrenched that the only solution is to retire the word random and invent a new one.

        • Virtually any dictionary of the English language will give something like “haphazard” as the first definition of “random”.

          My desk dictionary gives the etymology of “random” as “ME randoun < OFr. random < randir, to run, of Germanic origin."

          The Online Etymology Dictionary at https://www.etymonline.com/word/random elaborates as
          ""having no definite aim or purpose," 1650s, from at random (1560s), "at great speed" (thus, "carelessly, haphazardly"), alteration of Middle English noun randon "impetuosity, speed" (c. 1300), from Old French randon "rush, disorder, force, impetuosity," from randir "to run fast," from Frankish *rant "a running" or some other Germanic source, from Proto-Germanic *randa (source also of Old High German rennen "to run," Old English rinnan "to flow, to run;" see run (v.))."

          In other words, the "lay" meaning came first; so it is our responsibility to define the technical meaning of our usage. In teaching statistics, I have always tried to do that — I don't always succeed, but I can't rationally blame the students if I haven't defined the technical meanings of the terms and pointed out that the technical meanings are different from the lay meanings.

      • It’s not incumbent upon financial risk modelers to understand and account for every potential bias that’s incorporated into financial risk. One man’s bias is another man’s risk factor. This discipline of taking the bias out of machine learning by accounting for the most ridiculous minutiae in potential bias is mostly a recipe for lawyer employment.

        There’s another potential approach to all this that seems to have been forgotten. It used to be common for people from “underserved” communities to start businesses to serve those communities.

        Why not just incentivize that?

        • Do you have any evidence that the bias in machine learning is “the most ridiculous minutiae”… because I’m pretty sure lots of ML researchers have uncovered lots of pretty serious bias in the outputs of ML projects. It wouldn’t surprise me to find out, for example, that black people are much more often mistaken for lamp-posts in self-driving car programs because lamp posts tend to be painted dark colors, and that when white people are mistaken for things the researchers find it out much faster because they test their algorithms all around Palo Alto or San Francisco, where only the richest white people can afford to live anymore… etc.

        • “It wouldn’t surprise me to find out things like where black people are much more often mistaken for lamp-posts in self-driving car programs because lamp posts tend to be painted dark colors,”

          Maybe. But SDCs don’t want to run into lamp posts any more than they want to run into people, so until the algorithm has to choose between a “lamp post” and a “person” it’s not clear it’s such a big deal.

          “because I’m pretty sure lots of ML researchers have uncovered lots of pretty serious bias in the outputs of ML projects. ”

          What’s “serious”? What’s “bias”? If the algorithm isn’t using explicit groups or some intentionally engineered substitute as a feature, is every difference in the result “bias”? Or is it a natural outgrowth of the behavior of the subgroup? Would it have been “bias” to charge higher life insurance rates to gay men in the 1980s? Or to heroin addicts in the 2010s?

          Maybe the “bias” is actually the result of previously unrecognized group behavior. Remember the man who was irate because Target advertised pregnancy-related products to his daughter? Only to find out later that his daughter really was pregnant? Was this “bias” on Target’s part?

          Perhaps machine learning will tell us a lot of things we couldn’t have otherwise admitted to socially.

  2. I haven’t given a politically unpopular opinion for some time, so here goes. Anti-classification is all that is required. The issue with prejudice in these matters is that the characteristics over which decisions are made are deemed to be (more or less) immutable, so it is clearly unfair to disadvantage someone for a statistical grouping over which he or she had no control. But things correlated with race are not immutable — zip codes, names, education, etc. So as long as I can demonstrate that my machine decider didn’t have direct access to the protected class classifier, any conclusions it reaches which turn out to be correlated with race are not unfair in the sense that they are beyond the choice set of the classified. I certainly agree with Goel that there can be adverse impacts to protected classes in anti-classified data… but prejudice and fairness are about intention, not result, and leaving protected class status out of the data that the machine uses to grind its decisions is fair with respect to those classifications.

    This is an old problem. Thirty years ago I was involved with a case in which a mortgage originator was accused of adverse impact against blacks in mortgage lending even though they took applications over the phone and made no notations about the race of the applicant. (Needless to say, the person who took the information over the phone was not in the group making the decision.) There are few cases in which I was involved in my career in which I thought that my client was right as a matter of principle, no matter what the law was. This was one of them. Their mortgage algorithms (which I never saw) had no way of taking race into account, so how could they be guilty? This was true even though blacks got substantially fewer loans than whites and paid higher rates for those mortgages. While the models I ran suggested that, once adjusted for only a few fairly standard indicia of creditworthiness, differentials between blacks and whites were very minor, I don’t think that should have mattered. I should add that it took a lot of effort even to try to get the race of the mortgage applicants, since the client had no clue about it.

    • Jonathan:

      A question from an interested non-statistician: Why couldn’t race become highly correlated with some other, irrelevant piece of data which could then serve in the algorithm as a proxy for race, creating bias in the machine learning algorithm?

      Your claim that “prejudice and fairness is about intention” is simply false as a matter of law and equity (not to mention broader moral concerns). In the law, we often care about disparate effects regardless of the intention of the parties. If a building doesn’t have ramps for wheelchair access, it does not matter what the architect’s intention was. It is a clear violation of the Americans with Disabilities Act. The same can be said for other areas of discrimination law. But even more broadly, in law and equity in general, intent rarely matters. If you violate a contract, you will be liable regardless of intent. If you rear-end another car, you’re liable for damages, and the fact that you didn’t mean to do it is no defense. Actually, intent matters almost exclusively in the area of criminal law. I never quite understand why people think that intent is necessary in matters of discrimination.

      • Fair questions, Steve. I agree that my position is emphatically not the position of the law, which punishes intent when it finds it, but substitutes statistical tests when it cannot. There is obviously a tension here, in that intent is often impossible to prove while disparate impact is comparatively easy to measure, though I would argue that disparate-impact-controlling-for-relevant-but-unprotected-characteristics is really hard to measure… probably as hard as intent.

        But there are lots of laws that require intent and lots that do not, and I don’t think there’s really a clear philosophical divide between those that do and those that don’t. I’m not convinced by your ADA example at all… the law compels you to take the circumstances of the disabled into account. No law requires you to take the circumstances of protected classes into account in your decisions which impact them. Indeed, though morality might require it, people in arm’s-length transactions don’t have to take anything into account they don’t want to. Discrimination protections are about things *not* to take into account, not what *must* be taken into account. The ADA requires you to spend money you (presumably) wouldn’t have otherwise spent to make buildings better utilizable by the disabled. There is no law which requires a mortgage lender to subsidize mortgages for protected classes. They simply cannot explicitly disfavor those protected classes. If they do so without meaning to and have good business reasons for doing so, they are in exactly the position that Griggs talks about (see the article). Disparate impact is legal if you can demonstrate that it improves the bottom line. This shifting burden is rarely adjudicated because demonstrating it is hard… just about as hard as adjudicating intent.

        • Another Jonathan
          I have Steve’s concern about your second paragraph as well, regarding factors correlated with race. It is not so easy to determine whether the algorithms “had no way of taking race into account.” But I also have a concern about your first paragraph. I don’t find the immutability characteristic at all helpful. Your example seems to suggest that zip code cannot be the basis for prejudice because it is not immutable – meaning that a zip code that is predominantly nonwhite and has higher accident rates should have higher insurance premiums – and that result cannot be deemed prejudice because those nonwhite households had a choice of where to live. I may agree with the conclusion (I’m not sure), but I don’t agree with the reasoning. How much choice does a household have regarding where to live? How would we measure that ability to choose in any meaningful way? You could also say that skin color is not immutable either – your parents could have married someone of a different race, or had more or fewer children. Or your age – you could be said to “choose” to continue living or not.

          I think the “choice” dimension is a red herring and not at all helpful. Choices are a continuum – some people in some circumstances have more ability to choose than others. I do think that is relevant to a philosophical statement about prejudice, but I don’t think you can operationalize it in the way you have stated. If homeless people have higher accident rates than others, then you would automatically state that higher premiums for homeless people cannot be prejudice because they choose to be homeless. From this example, you can see that I think it is better to separate the issue of whether or not the premiums are prejudiced from whether or not the homeless person “chooses” to be homeless. I think the latter is a red herring, while I can easily be persuaded that charging higher insurance premiums to homeless people is not prejudiced.

        • Blog comments are rarely suitable for a full exploration of difficult questions, so I confess incompleteness in my answer. There may well be circumstances where the protected characteristic is so closely tied to some other variable that choice is lacking, but in general these things are on a continuum, as you state. But the real question (to me if no one else) is not whether we are trying to achieve some Olympian level of Justice, in which case everyone’s circumstances matter to what they get in life, but whether some particular irrational animus is being rooted out at the margins. There may be dozens of reasons Joe can’t move from his zip code and may pay higher insurance rates because of it. One of those might be that he isn’t allowed to buy in other neighborhoods simply because of his race. If so, then zip code is *actually* just a recoding of race and should not be used. My supposition, though, is that is a very rare case. In fact, Joe doesn’t move because (a) he doesn’t want to, even though he pays higher insurance rates; or (b) he can’t afford to, for whatever reason. (a) is just the way the world works (that’s the controversial part, IMO) and (b) is a general problem that needs to be addressed… it’s not a discrimination issue. This covers your homeless guy problem.

          In your hypothetical, the *cause* of high accident rates isn’t race at all. It’s bad drivers. Bad drivers might well have reasons to cluster by zip code. Some of these reasons might be race-related, and a lot of them might be dependent on some sort of race-related history. As a practical matter, regulating the machine learning algorithms of firms who have excluded explicit racial coding in those algorithms ought to be 50th on our list of priorities to improve racial problems.

          Although you didn’t ask, I also don’t see any obligation by companies to have “non-lazy” models once they have fulfilled their obligation to use anti-classification methods.

        • Jonathan:

          You teach your children (presumably) that when their carelessness leads to some problem, it is no excuse that “I didn’t mean to.” This is not some strange principle of morality or law that has been introduced only for the purpose of dealing with discrimination. It is a principle that is thousands of years old (probably older). When your actions harm someone and reasonable care could have prevented it, you are liable for the damage. You aren’t a criminal. You are just liable to the person who you hurt. For some deeply weird psychological reason, people want to treat the allegation that they have discriminated as if they are being accused of murder and the burden should be just as high. I am not arguing about what the current law requires with respect to home lending. I am just making the point that intent is rarely required in civil or regulatory contexts, nor should it be.

        • In fact negligence is more or less the opposite of intent, and yet it makes up a large part of civil disputes. You might say that when you harm someone with intent, it’s a crime, and without intent, it’s a tort. I mean, it’s not far off.

        • I agree with both of you, but this not a universal principle. Negligence induces liability only when we define a standard of care. This debate is entirely about what standard of care ought to apply in algorithmic design. One cannot answer this question efficiently without a full discussion of costs and benefits. If the only intent by the algorithm designer is to maximize profits, and they do not use data on protected characteristics to do so, then I would make a strong presumption that the incidental effects of those algorithms on protected classes are not compensable as tort damage. How much profit would you have them sacrifice to monitor and reduce such incidental effects? It seems to me (as I fully grant that it may not to others) that this is a bridge too far.

          To take a simple example: I price my product at $10 and, owing to some local market power, earn monopoly rents of $1 per unit. We carefully measure consumers surplus in this market and we discover that women are disparately affected by this pricing decision. I have harmed them, and enriched myself. Under your theory, should women be able to file suit against me for my pricing algorithm? If so, what should I do? Charge women lower prices? Charge everyone lower prices until I come against my profitability constraint? Why?

        • This is anything but a simple example. The issue of price discrimination, algorithmic bias, and their interrelationships entail myriad economic, philosophical, and public policy issues. I’m not sure what my answer to your hypothetical question would be – but I do object to characterizing it as a “simple” example. Apparently you think the answer is obvious, but I do not.

        • Take a more realistic example: someone designs an algorithm to assess credit risk, which does not explicitly have any data about race, but does have data highly correlated with race which is irrelevant to credit risk. Let’s say that in the real world people don’t want to extend credit to African Americans because people are bigots. Let’s also say that in the real world a lot of people who are good investors are also bigots. Now, isn’t it possible that the machine learning algorithm will learn to mimic the behavior of the best investors? After all, the data will teach it to make decisions the way those investors do. So isn’t it entirely possible that it will mimic both the good investing strategies and the racism? The fact that it does not have an explicit piece of code that says “deny credit to African Americans” surely doesn’t matter. If we start with a data set that is biased in some way, the algorithm that learns from that data will likely have similar biases, and shouldn’t we want to eliminate those biases for efficiency reasons alone, not even considering our moral responsibility?

        • Excellent Jonathan, I support your position.

          The problem is further exacerbated by the fact that racial or ethnic groups actually do have cultural practices that affect their creditworthiness or other factors. In our state, a recent ballot measure opposing preferential hiring practices toward minorities was sponsored by a group of Asians and passed.

          I see first hand every day that of two common ethnic groups in my region, one has a strong emphasis on education, the other has a strong tendency to starting full time employment as young as possible. These two paths will clearly lead to divergence in creditworthiness, since educated people have higher incomes. And in this case even the uneducated parents of the group that pursues education commonly own homes; while parents in the group that tends toward labor employment rent.

          I believe – I can’t say for sure but I think – that the “education” group uses their own capital and distributes it through religious institutions to help its members buy homes, rather than using capital from the banking system. So the idea that banks and their credit decisions prevent people from succeeding is just wrong.

        • I’m not sure I understand your example. First, if there were any nonbigoted investors, they would do even better than the bigoted investors, because they wouldn’t be blinded by their bigotry, which by assumption must lower their returns. (Otherwise, it isn’t bigotry.) The algorithm should pick up that difference. A good machine algorithm should be able to discover the irrelevance of the race-correlated variables. (To be honest, I’m not quite sure what the model is doing in your example.)

          But I grant that it is possible for anti-classified data to have disparate impact. But correcting that disparate impact will then, by definition, lower the profit optimum. Unless you are certain that the disparate impact is caused by racism, rather than merely associated with race, what is the point of reducing returns? And how could you possibly discern causality? Why is this the obligation of the algorithm designer?

        • I don’t see how if you make money you are automatically not a bigot. If you and all your colleagues agree to charge piano players an arm and a leg for airplane tickets, does this automatically make you “not a piano bigot”?

          I think your argument requires a vibrant competitive market, in which case if you’re a bigot and charge more, some non-bigot will come in and eat your lunch… but there are lots and lots of markets that are not like that.

        • There is a long history in economics of how discrimination contains the seeds of its own demise (a phrase coined by Gary Becker). In a competitive market, discrimination should only be temporary – those who don’t discriminate should be able to out-compete those who do, leading the market to end up without discrimination. Needless to say, virtually all other disciplines have found that unsatisfactory. To be fair to economists, these other disciplines rarely offer a coherent critique – they simply ignore it or claim it is not the real world. Rather than take sides in that debate (which is far from settled), I will simply say that I wouldn’t want to hang my hat on the assumption that markets are sufficiently competitive to erase discrimination to the point that algorithmic bias should be expected to not exist.

        • Daniel Lakeland and Dale Lehman:

          You guys are making a much stronger claim than I am. First, if “If you and all your colleagues all agree to charge piano players an arm and a leg for airplane tickets” (I am assuming by colleagues you mean competitors) then you are violating antitrust laws, not antidiscrimination laws. And Dale: the Chicago School refutation of the existence (or at least stability) of discrimination is way too strong for me. Markets clearly aren’t competitive enough to root out all vestiges of taste-based discrimination. But surely algorithms which do not actually know the race of customers won’t do any *worse* than explicit animus. How? So the replacement of biased human judgment by machine AI black boxes ought to do at least marginally better, right? Otherwise, you aren’t giving the bigots nearly enough credit in their ability to fine tune their distinctions!

        • I kinda agree with Jonathan that it seems like the profit motive would be enough to mess up the fairness of any of these mortgage operations. If the goal is to make dollarydoos, then it’s unsurprising that something like fairness would suffer as a result. Gotta get rid of that profit motive and focus on the fairness. Once fairness is the goal, let’s bring in some regressions and try to understand the situation and get everything working, cause yeah, this doesn’t seem simple at all.

        • This is a great discussion.

          But today there are more factors at play than just algorithms. The biggest one is that, while discrimination has been common in the past with things like loans, that past was one of relatively small banks with local control and much less competition in general.

          Today we have a world where most of these algorithms will be controlled at some distant HQ where local biases can’t be incorporated and the explicit aim is to maximize returns in a very competitive market. The goal of algorithm development is to exclude any biases that might generate lower returns.

          OTOH, a small local bank that has a local quasi-monopoly might be doing well enough, with only modest competition, to indulge its biases.

        • Another response to Another Jonathan
          We are partially on the same page – I also don’t believe competition is strong enough to erase discrimination, although I do believe the effects of competition are not sufficiently appreciated in the discussion of discrimination and its demise/continuation. I have also made the case myself that algorithms are quite likely to embody less bias than humans. But that is a very low bar to set. I am not comfortable with overlooking potential algorithmic bias because it is less than humans exhibit. To cite but one example: we’ve seen studies about how judges hand out different sentences just before lunch compared with other times of the day (and there is a growing slew of such studies, most of which I’ve become aware of through this blog). We already know that algorithmic sentencing guidelines are capable of bias (although the evidence is, and should be, debated and further analyzed). But if we use that as a justification for replacing judges with algorithms, then I wonder what need we have for humans at all. We are getting close to the point where algorithms outperform most humans (not all, but most) on most tasks that require judgement. I do believe the world would be more efficient with extensive replacement of human judgement by algorithmic decision making – and we are well on the way to doing that. I find the idea that we needn’t worry about algorithmic bias because it is less than what humans do disturbing, to say the least.

          Perhaps I am overstating your position again. But you seem too complacent about algorithmic bias for my tastes. Humans should be accountable for their creations – if we create a model to decide about giving out loans, then we are responsible for what that model does. If it is biased against piano players, then we should own up to that rather than dismissing it by saying the person that was making those decisions was even more biased.

        • Humans should be accountable for their creations. But that doesn’t mean that every human decision is put through an anti-discrimination filter. To invoke our host here, no effects are equal to zero. Every decision every human makes has effects which may have effects which affect one protected class or another. The original claim I made, and I still believe, is that with rare exception only intentional discriminatory impact ought to matter and that anti-classification algorithms should get a safe harbor absent extraordinary factors or blatant intent. That’s a statement about where I’d place the standard of care on algorithmic designers and it would be a really high bar for those adversely affected to surmount. That’s not because I have no sympathy for those adversely affected… It’s because I believe that in a world where you don’t try to use protected classifications in your algorithms, the vast majority of disparate effects you actually will observe will be random and associational, not causal. I have no proof of that proposition, but I think there is a lot of evidence for it which the margin of this blog comment is too narrow to contain.

          I also think that not only has the vast majority of conscious active animus in commerce been eradicated over the last 50 years, but that when confronted with a demonstration of unconscious bias, the zeitgeist is such that large companies will do what they can to reduce that bias, so long as doing so has only an incidental effect on profitability. Maybe small companies as well, but I’m less certain there. But if we make the requirements (or risks) of proving nondiscriminatory effect too high on algorithms, we raise their costs and we get less algorithmic decisionmaking, which in turn, since we agree that anti-classification algorithms which lack explicit intent ought to be less biased, means we get more bias.

        • Jonathan (another one)
          Perhaps we will pick up this thread on today’s post by Andrew. I am not willing to give anti-classification algorithms a “safe harbor absent extraordinary factors or blatant intent.” I don’t disagree with your reasoning, only your conclusion. We employ technologies (I mean the term broadly here, to include machines, software, and even economic systems) far too quickly, and your suggestion will only contribute to this. There are plenty of science fiction scenarios, both harmless and dangerous, showing potential paths ahead. Contrast the I, Robot vision with Robot and Frank for two extremes. My own favorite is WALL-E: the humans have ceased to serve much purpose any more. I’d like to see humans make choices about our futures – we may be fallible, even likely to be so. But once we abandon our discretion, what are we?

          You cite the “vast majority of conscious active animus in commerce” as having been eradicated over the last 50 years. You must live in a different world than me. An increasing amount of commerce I encounter is so anonymized that I have no idea of who is actually providing a product/service to me. And, when I complain, I rarely get any response, if I can even figure out who to complain to. The animus may well be unconscious, but I don’t think it is any less serious. In the “olden days” I shopped at a local merchant – and that merchant may well have exhibited conscious animus. But they were accountable and identifiable. In some ways, that is uglier than my dealings with my cable provider, robocallers, etc. But it was more personal. And, when we lose these personal interactions, I believe we have lost something.

        • One more comment on this thread for continuity…
          I completely agree that we have lost something as we move from my local store to Amazon in terms of customer service. Lots of companies today give lip service to customer service without operationalizing that through empowering humans to take reasonable ameliorative actions. I think that sucks. But I also think it has nothing to do with discrimination against protected classes… (which is what I meant by animus) nothing at all. Indeed, compare an airline that gives upgrades based on some unknowable formula with one in which upgrades go to people who talk pleasantly to the pleasant people at the check-in desk: the latter is a world that favored me… a pleasant white guy. But I recognize that ruthlessly being a dick to people, in many cases, makes things cheaper and just creates a tradeoff. I have had horrible experiences with Amazon that I would never have had from a local retailer twenty years ago just because you can’t talk to anybody. Overall though, I pay that price quite willingly.

          I think it would be a great world in which everyone behaved courteously… I believe in Oscar Wilde’s definition of a gentleman: someone who never insults someone unintentionally. But I wouldn’t try to use anything but moral suasion to enforce courtesy. And there are 4,532 ways in which consumers are screwed over today… Many of them are traceable to algorithms and treating people as abstract consuming units rather than people. And I would think that if I were in the retail business I’d try to fix that because I would really like it if my customers were satisfied and had no bad experiences… at least if I could do so without costing myself too many customers who didn’t care and didn’t mind being mistreated as long as they got what they wanted cheaply enough. But that’s not a racial problem, or a gender problem, or a national origin problem tied causally to the identity of my customers.

          One last hypothetical (though you didn’t like my last one). I’m a retailer. Should I be required to have Spanish-speaking staff to interact with customers who don’t speak English? Clearly, my failure to do so has differential impacts on non-English-speaking customers. Also (presumably) my decision is based on my estimation of the cost of Spanish-speaking support staff versus my lost revenues. (And if it’s just because I hate Hispanics, that’s a different problem, perhaps hard to suss out, but I grant actionable.)

        • I think you have confused some issues that are better considered separately. The original post was about issues related to examining bias in machine learning. Most of your comments appear to concern regulation of machine learning applications. Use of algorithms and regulation are not the same thing. I believe algorithmic bias is a complex subject and not easy to detect in a meaningful way – and that appears to be what the talk and paper are about. I believe analysts should care about bias and should be aware of how to measure it and pitfalls in such an analysis. That does not mean it should be regulated. The public policy issues are quite different. As your last example suggests, should there be a public policy forcing retailers to have Spanish-speaking staff? That is certainly different from an analysis determining the extent of benefits of Spanish-speaking staff.

          Your example, though, is not so straightforward. You clearly were trained as an economist, so you are well aware that the public policy considerations depend, to some extent, on the market conditions. We do require broadcasters to have some types of programming to meet community needs. That requirement stems from a market imperfection – spectrum scarcity (at least a reality in past times). If your store was located in an area with particular zoning requirements that limited the possibility of opening competing stores, then a requirement of Spanish-speaking staff might make sense – although in a more perfect market setting I would not want to see such a legal requirement on businesses. Only a tunnel-vision economist would reject all public non-discrimination policy regardless of market conditions.

          I am far from recommending that we regulate users of algorithms to prove their models are unbiased. But I would like to see a voluntary code of ethics to do so – and that is not a simple matter as the paper suggests. A recent survey (Kaggle users, clearly not a random sample, but with a large number of responses from active data scientists) revealed that a vast majority of data scientists consider algorithmic bias an important thing to worry about (on the order of 90% of respondents). When asked how much time they spend analyzing such bias, the answer was very little. Given the issues associated with appropriate measurement of bias, we are far from a simplistic statement of the appropriate “standard of care” for examining bias. I would argue that we are far from establishing either an ethical or a legal standard, and that it is premature to simply declare that algorithms be given a free pass aside from extenuating circumstances.

        • I plead guilty to jumping from “what should be the fairness criteria for machine algorithms” to “what should we do about algorithms that violate whatever criteria we establish.” But I think I disagree with you that those issues are “better considered separately.” I think it’s essential that they be considered jointly. Interestingly, I just received this morning the latest copy of Significance magazine, which has an article on German plans to regulate machine algorithms. It sounds to me like a nightmare, but YMMV. See bit.ly/32XgkXH (Chapter 3) for the report (article link at https://rss.onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2019.01329.x).

  3. Nothing engenders more hand-waving than fairness. And in the domain of machine learning, fairness is simple. Whether motivated by altruism or compulsion to comply with regulation, it amounts to taking a decision to ignore the facts.

  4. Their argument against classification parity does not hold water in my eyes. Their argument is premised on the idea that since a subpopulation’s false positive rate depends on how much of that subpopulation’s risk distribution sits above the classification threshold, properly calibrated conditional probability estimates can still have differential false positive rates. In other words, they argue that so long as conditional probability estimates are unbiased it is fair, and discrepancies in false positive rates can arise from differences in residual variance between subpopulations. I would argue that a difference in residual variance is still unfair. The risk distribution is not an object determined by an inherent random process; it’s a result of how you model the process and what data is included or excluded. You can inadvertently create models which produce unbiased estimates for all subpopulations but have higher performance and lower variance for one subpopulation than another.

    Suppose we have red people and blue people, who both live in a world with two towns, A and B, and two cars, X and Y. Red people from town A always become doctors, and red people from town B always become lawyers. Blue people with car X always become doctors, and blue people with car Y always become lawyers. The population is 50% red, 50% of cars are X, and 50% of people live in town A, and all three of these variables are uncorrelated. I can write a program that predicts profession correctly for reds every time but predicts a probability of 50% for blue people, by classifying based on town of residence. I can flip that around between red and blue if I classify based on car ownership. Both of these systems have perfectly calibrated probability outputs. But it’s obvious here that the risk distribution, and hence the false positive rate, is not some inherent property of the subpopulations; it’s my modeling choice.
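
    That setup is easy to simulate (a sketch of the example just described, with arbitrary sample size): the town-only model comes out calibrated for both colors, yet its false positive rate at a 0.5 threshold is about 0 for reds and about 1 for blues.

    # Simulating the red/blue example: a "predict from town only" model is
    # perfectly calibrated for both colors, but its error rates differ by color.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000
    red = rng.random(n) < 0.5
    town_a = rng.random(n) < 0.5
    car_x = rng.random(n) < 0.5                           # independent of town

    doctor = np.where(red, town_a, car_x)                 # ground truth from the example
    p_doctor = np.where(red, town_a.astype(float), 0.5)   # town-only model

    for p in (0.0, 0.5, 1.0):                             # calibration check
        m = p_doctor == p
        print(f"score {p}: observed doctor rate {doctor[m].mean():.2f}")

    flagged = p_doctor >= 0.5                             # classify "doctor" at 0.5
    for name, m in [("red", red), ("blue", ~red)]:
        print(f"{name}: false positive rate {flagged[m & ~doctor].mean():.2f}")

    Classifying on car ownership instead flips which color gets the zero error rate, so the disparity in false positive rates is a consequence of the modeling choice, not a property of the two populations.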

    • This example misses an essential component, which is that here the quantity you’re trying to predict is deterministic. When the process you’re trying to model is nondeterministic, even a model with perfect information will have different false positive and false negative rates on different subpopulations. The tricky part is that the best model we have is often the one whose fairness is in question, but we need to keep in mind that our lack of a perfect model is not an argument about the intrinsic properties of the population.
