Anybody want a drink before the war?

Your lallies look like darts, and you’ve got nanti carts, but I love your bona eke – Lee Sutton (A near miss)

I’ve been thinking about gayface again. I guess this is for a bunch of reasons, but one of the lesser ones is that this breathless article by JD Schramm popped up in the Washington Post the other day. This is not just because it starts with a story I relate to very deeply about accidentally going on a number of dates with someone. (In my case, it was a woman, I was 19, and we bonded over musical theatre. Apparently this wasn’t a giveaway. Long story short, she tried to kiss me, I ran away.)

There is no other Troy for me to burn

The main gist of Schramm’s article is that the whole world is going to end and all the gays are going to be rounded up into concentration camps or some such things due to the implications of Wang and Kosinski’s gayface article.

I think we can safely not worry about that.

Success has made a failure of our home

For those of you who need a refresher, the Wang and Kosinski paper had some problems. They basically scraped a bunch of data from an undisclosed dating site that caters to men and women of all persuasions and fed it to a deep neural network face recognition program to find facial features that were predictive of being gay or predictive of being straight. They then did a sparse logistic regression to build a classifier.

There were some problems.

  1. The website didn’t provide sexual orientation information, but only a “looking for” feature. Activity and identity overlap but are not actually the same thing, so we’re already off to an awkward start.

To go beyond that, we need to understand what these machine learning algorithms can actually do. The key thing is that they do not extrapolate well. They can find deep, non-intuitive links between elements of a sample (which is part of why they can be so successful for certain tasks), but they can’t imagine unobserved data.

For example, if we were trying to classify four legged creatures and we fed the algorithm photos of horses and donkeys, you’d expect it to generalize well to photos of mules, but less well to photos of kangaroos.

To some extent, this is what we talk about when we talk about “generalization error”. If a classifier does a great job on the data it was trained on (and holdout sets thereof), but a terrible job on a new data set, one explanation would be that the training data is in some material way different from the new data set. This would turn classification on the new data set into an extrapolation task, which is exactly the sort of task these algorithms are bad at. (There’s a small simulated sketch of this after the list below.)

  2. The training data set is a terrible representative of the population. The testing set is even worse.
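
A minimal simulated sketch of the interpolation-versus-extrapolation point (the uniform data, the x1 > x2 rule, and the random forest are all illustrative assumptions, nothing to do with W&K’s data or model): a flexible classifier that is nearly perfect on new draws from its training distribution falls to roughly chance on a region it has never seen, even though the underlying rule hasn’t changed.

```python
# Toy sketch: interpolation vs extrapolation under a fixed rule.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def make_data(low, high, n=5000):
    X = rng.uniform(low, high, size=(n, 2))
    y = (X[:, 0] > X[:, 1]).astype(int)   # the true rule never changes
    return X, y

X_train, y_train = make_data(0, 1)        # "horses and donkeys"
X_iid,   y_iid   = make_data(0, 1)        # "mules": same region, new draws
X_shift, y_shift = make_data(5, 6)        # "kangaroos": same rule, new region

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("accuracy, same region:   ", clf.score(X_iid, y_iid))     # high, roughly 0.98-0.99
print("accuracy, shifted region:", clf.score(X_shift, y_shift)) # roughly 0.5, i.e. chance
```

The kangaroo problem in miniature: every split the forest learned sits inside the training range, so each tree routes every shifted point to the same leaf and gives essentially the same answer regardless of the true label.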

There are other problems with the paper. My favourite is that they find that facial brightness is positively correlated with the probability of being gay and posit that a possible reason is that an overabundance of testosterone darkens skin. Essentially, they argue that straight people are a bit dull because they’ve got too much testosterone.

As much as I enjoy the idea that they’ve proposed some sort of faggy celestial navigation (you’re never lost if there’s a gay on the horizon to light your way to safety), it’s not that likely. More likely, gay men use more filters in their dating profile shots and we really should sound the correlation is not causation klaxon.

How I be me (and you be you)?

But the howling, doom-laden tone of the Schramm piece did make me think about whether building the sort of AI he’s warning against would even be possible.

Really, we’re talking about passive gaydar here, where people pick up on whether you’re gay based solely on information that isn’t actively broadcast. Active gaydar is a very different beast: it requires a person to actively signal their orientation. Active gaydar is known to confuse whales and cause them to beach themselves, so please avoid using it near large bodies of water.

To train an AI system to be a passive gaydar, you would need to feed it with data that covered the broad spectrum of presentation of homosexuality. This is hard. It differs from place to place, among cultures, races, socioeconomic groups. More than this, it’s not stable in time. (A whole lot more straight people know drag terminology now than a decade ago, even if they do occasionally say “cast shade” like it’s a D&D spell.)

On top of this, the LGB population is only a small fraction of the whole population. This means that even a classifier that very accurately identifies known gay people or known straight people will be quite inaccurate when applied to the whole population. This is the problem with conditional probabilities!
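
To put rough numbers on that (these are illustrative assumptions, not figures from W&K: a classifier equally accurate on known-gay and known-straight samples, and a base rate of about 4%), here is the conditional-probability arithmetic:

```python
# Back-of-the-envelope base-rate calculation; the accuracies and the 4%
# prevalence are assumptions for illustration, not anything from W&K.
def positive_predictive_value(sensitivity, specificity, prevalence):
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

for acc in (0.90, 0.95, 0.99):
    ppv = positive_predictive_value(acc, acc, 0.04)
    print(f"{acc:.0%} accurate classifier: {ppv:.0%} of flagged people are actually gay")
# roughly: 90% -> 27%, 95% -> 44%, 99% -> 80%
```

Even at accuracies that would be remarkable for this task, most of the people the classifier flags are straight, simply because the straight population is so much larger.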

I think we need to be comfortable with the idea that among all of the other reasons we shouldn’t try to build an AI gaydar, we probably just can’t. Building an AI gaydar would be at least as hard as building a self-driving car. Probably much harder.

51 thoughts on “Anybody want a drink before the war?”

  1. > It differs from place to place, among cultures, races, socioeconomic groups. More than this, it’s not stable in time.
    Sounds almost as tricky as modelling (appraised) study quality “appears that _quality_ (whatever leads to more valid results) is of fairly high dimension and possibly non-additive and nonlinear, and that quality dimension are highly application specific and hard to measure from published information” …

    OK even harder.

    Unlikely to stop many people or prevent misunderstanding.

    • There are so many papers where people just throw a neural net over some scraped data. Maybe tensorflow needs an automatic warning that just says “guys this is hard” every time someone tries to run it.

  2. All of this you say could be for nothing, since once a system is built, those who want to will find it easy to believe its results. They probably won’t care if it’s accurate or has a large false negative rate, etc.

    They might care more if it was the autonomous driving system in their own car, though …

    • Right this is the part Dan seems to miss. Suppose the Nazis wanted a “jewface” system, and suppose they got it, and suppose it had lots of false positives. Would they care? So they round up 8 million Jews and 4 million gentiles with facial features that maybe look Jewish to a computer… I doubt Hitler would have had a hard time sleeping over this.

      If you’re willing to do harm to other people who *look* a certain way, you’ve already flushed your humanity down the toilet. You aren’t going to care about a little fallout or data quality issues.

      • I rest in comfort that if a classifier has 90% accuracy when identifying a gay man as gay (the in sample accuracy after viewing 5 pictures from W&K) and a similar accuracy for correctly identifying a straight man (which is not reported because heteronormativity is real) and every man identified as being gay by the algorithm was summarily executed, then around 75% of the dead would be straight.

      • In your analogy that would be ~22 million gentiles (I mean you can do your own calculation to get the exact number with a 4% rare population, but you get the point)
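
For what it’s worth, both ballpark figures in this thread (roughly three quarters of those flagged being straight, and twenty-odd million false positives alongside 8 million true ones) fall out of the same back-of-the-envelope calculation, under the assumed numbers of 90% accuracy on both groups and a 4% base rate:

```python
# Assumed numbers only, chosen to match the figures quoted in this thread.
sens = spec = 0.90
prevalence = 0.04
true_pos = 8e6                        # the 8 million targeted people who get flagged
target_pop = true_pos / sens          # roughly 8.9 million in the targeted group overall
total_pop = target_pop / prevalence   # roughly 222 million people screened
false_pos = (1 - spec) * (total_pop - target_pop)
print(f"false positives: {false_pos / 1e6:.0f} million")      # roughly 21 million
print(f"flagged people not in the targeted group: "
      f"{false_pos / (false_pos + true_pos):.0%}")            # roughly 73%
```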

        • “That seems careless” doesn’t really address Lakeland’s “Would they care?” argument.

          We’re prepared to ban people from many Muslim nations from visiting because of the much tinier fraction who pose a security risk.

          I grant you a travel ban is a long way from murder, but, again, “Would they care?”

      • Germany had quite complicated rules to determine who were “full Jews” or “mixed blood” and, on the other extreme of the blood purity spectrum, who were part of the master race and could get an Aryan certificate. It’s true that they were much less careful in occupied countries; Slavs were not regarded much more highly than Jews anyway.

        But I don’t follow the logic in your comment: the same reasoning could be used to say that those who eat lambs, horses, whales, dogs or any other fellow animal being wouldn’t lose sleep over sending millions of children to the slaughterhouse.

        • So it seems my point wasn’t well made.

          To someone who plans to do evil to others, a tool like this is just used to help justify doing what they already planned to do.

          If an evil person plans to round up gay people, they will do it. In every case where they need a reason and this software can provide the air of impartial technology they will use it.

          They won’t bother using the technology on people they don’t want to round up…

          The only probability of relevance is probability that the software gives the desired answer when justification is needed for doing harm to someone that the evil actor wants to harm.

          It’s a lot like how NHST just exists to stamp p less than 0.05 next to whatever you want to publish… Cynical maybe but I think in many cases realistic

        • I thought in your example the evil person wanted to do evil things to 8 million people who were the real target and ended up doing evil things as well to 4 million people who were misidentified by the computer.

          Of course if they are only going to use the detector on the people they want to harm the optimal detector is the one which always says “yes”, there is no need to bother calculating false positives, etc.

        • Politics isn’t always so easy for poor little evil people (sarcasm). Sometimes they have to externally justify things, or put an air of some form of legitimacy to win allies etc.

          One point is that the base rate of people type X in the population is less relevant to the question than the base rate of people of type X in the population of people on whom the technology will be used.

          If an evil group rounds up a bunch of “gay seeming” people who offend their evil sensibilities, such that the actual rate of gayness is say 50% in this population instead of say 4%, and then runs a 95%+ accurate screening procedure and ships off everyone the procedure flags… the result is very different. And if they can tweak a few pixels here and there to get the desired result, while still seemingly having 95%+ accuracy in testing that they can use to justify their “impartiality”, so much the better…

          We can’t treat the politics of evil as a pure math textbook problem.

        • But given that a tool would just be used to help justify what they were already planning to do, why would it matter whether or not you had a classifier that even sort of worked? In fact, why would you take the substantial risk that the classifier would return a negative when you really needed a pretext to execute someone? Wouldn’t you just say you had some kind of objective classifier, and then execute the people you wanted to execute? Or why not take a page from criminal prosecutors and call some “expert” to the stand so that you could find some tiny strand of camp in their personal belongings on which to hang them?

          I guess I just don’t see how, if some people have the desire and political ability to lead a pogrom on queer people, this classifier lets them do anything they couldn’t do before.

          If we were going to worry about pernicious police-state applications of deep learning, I’d be a lot more worried about a classifier designed to automatically scrape networks of surveillance cameras for evidence of homosexual activity and match faces to social media profiles or driver’s licenses or whatever. Behavior is a lot less ambiguous than some kind of gay-face physiognomy that may or may not even exist.

        • “But given that a tool would just be used to help justify what they were already planning to do, why would it matter whether or not you had a classifier that even sort of worked?”

          Presumably for political reasons. Just as NHST p values help convince reviewers to accept bogus pseudo-scientific publications.

          “If we were going to worry about pernicious police-state applications of deep learning, I’d be a lot more worried about a classifier designed to automatically scrape networks of surveillance cameras for evidence of homosexual activity and match faces to social media profiles or driver’s licenses or whatever.”

          Yes, that worries me a lot, and not just regarding gayness, lots of other bad applications as well, everything from persecuting people of color, to cracking down on undocumented immigrants, to the well documented evils of traffic cameras (they cause lots of injury accidents in the process of generating local extortion revenue).

        • Dan:

          The paper may be lovely but I was annoyed by this from the abstract:

          The results show that 70.97% of the natural images can be perturbed to at least one target class by modifying just one pixel with 97.47% confidence on average.

          Rounding to the nearest hundredth of a percentage point? Whassup with that? If you ask one of these authors their weight, do they say, “170.97 pounds”?

        • For what it’s worth, at least one of the data sets appears to have 10k images, so presumably they’re just writing 7097/10000 as a percent. (Not going to comment on the 97.47% number tho)

        • Honestly, in an abstract I see no real reason to go past the 5s (or maybe 10s). So 70% and >90% would be my choices.

          Doesn’t change the coolness of generating one pixel adversarial image modifications!

        • Sometimes I think people don’t like to round if it changes two of the digits. I.e. rounding 70.97% up carries over, so the 0 becomes a 1 and you end up with 71%. I know that’s not really any different from rounding 71.03% down to 71%, but it feels different.

  3. I thought “gayface” was the homophobic equivalent of “blackface”—a straight actor caricaturing a gay man. I was confused for quite a while trying to understand the post.

  4. > For example, if we were trying to classify four legged creatures and we fed the algorithm photos of horses and donkeys, you’d expect it to generalize well to photos of mules, but less well to photos of kangaroos.

    How many legs does a kangaroo have?

    • Maybe a better example was the neural net that classified cows well but failed when the cows were on a beach (as green pixels were always available in the training data – OK, just a garbage-in-garbage-out problem – but how do you tell what’s garbage?)

  5. I think the question of whether we can build such a system misses the point. We can – and likely will build it – at least someone will. It will not be 100% accurate, but it might well be accurate enough to be useful to someone. Especially if that someone is not benevolent. It seems to me that there are two crucial questions:
    1. Is there a threshold accuracy that is required for an AI system?
    2. Who (and how) decides what the threshold is?
    3. (ok, I guess I have 3 questions) Are there some uses of AI that should not be developed at all?
    I’m assuming the answer to #3 is that there are some uses that should not be developed, since there are few absolutes in this world. But, do we have the sociopolitical mechanisms that can enforce such a prohibition?

    • Well, my point is that we can’t build this sort of system using current technology, data, and the current state of our understanding of the science.

      As for how accurate, I think for a 3-5% subpopulation we’d be shooting for the moon to have the chance of someone classified as gay actually being gay as high as 50%.

      AI is like any other technology: not immune to ethics.

      • Well, my point is that the ethical question is far more important than the feasibility question. I’m not sure I agree with your assessment of the feasibility, but I don’t think that really matters compared with the ethical questions regarding AI. Waiting for the technology to develop before asking these questions is a poor way to proceed and debating the current state of the technology can be a distraction from these (in my opinion, more important) questions.

        • > assessment of the feasibility
          Was unable to find the cell-phone screening for melanoma talk I recently heard, but even with very high sensitivity and specificity, the positives that would require biopsy would wipe out the Canadian health care system, while the majority would turn out to be false positives and unnecessary. The lifetime incidence is about 2%.

        • This is merely an example (your statement) of good decision making. Even accurate models will flunk some cost-benefit analysis. But at some point, they won’t – and that does not mean the ethical questions have been dealt with. I believe there are (and will be) many instances where AI models will pass a cost-benefit test and are sufficiently accurate to be useful. But not all of those “useful” cases are ethical. We are so far from being able to determine “ethical use” that I believe many people disregard ethics as simply subjective opinion. My complaint in this post is that claiming that these models are insufficiently accurate becomes a screen behind which we delay asking ethical questions.

        • Well, I don’t want to delay ethical questions; I just want to defer the howling hyperbole. And really, that was all I addressed.

          I guess the end point is that the problem isn’t the construction of the classifier, it’s the small population of homosexuals. Because of this small population, any classifier that could solve this task would need to have unprecedented accuracy. For example, if we required 3 of 4 people classified as gay to actually be gay, the classifier would need 99% accuracy when turned on known straight or known gay populations (there’s a quick numerical check of this below). It is unlikely that that accuracy is achievable.

          So short of increasing the rate of homosexuals in the population, there’s no way of making this scheme work. Ethical considerations need to be grounded in reality. That Washington Post article was pure fantasy.
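
A quick numerical check of the “3 of 4 would need 99% accuracy” figure above, again assuming equal accuracy on the known-gay and known-straight groups and a base rate of a few percent:

```python
# Solve Bayes' rule for the accuracy a (sensitivity = specificity = a) needed
# so that a target fraction t of people flagged as gay actually are gay,
# given base rate p. Assumed numbers, just to sanity-check the comment above.
def required_accuracy(t, p):
    # t = a*p / (a*p + (1-a)*(1-p)), solved for a
    return t * (1 - p) / (p * (1 - t) + t * (1 - p))

for p in (0.03, 0.04, 0.05):
    print(f"base rate {p:.0%}: need {required_accuracy(0.75, p):.1%} accuracy "
          f"for 3 of 4 flagged people to actually be gay")
# roughly 98.3% to 99.0%, so the 99% figure is about right
```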

        • Carlos,

          Actually, no. If you read the article linked to by the Economist, it explicitly states that its results are contradictory to various other findings in that area. For each contradiction, a theoretical reason is given; it’s clear that any finding would be theoretically possible in the authors’ general framework. In addition, even setting aside the possibility that all their patterns could be obtained from pure noise (just from forking paths), the results could be explained by confounding. Just for example, maybe the traders with wider faces are older, on average, and maybe older traders are more likely to buy health-care stocks, and maybe health-care stocks happened to do poorly in the years under study. Or whatever; there are endless such explanations. Finally, part of “this sort of thing” is the authors of the manuscript, and the news article, continually referring to testosterone when they’re actually talking about face shape. This is a bait-and-switch that we commonly see in hyped pop-science.

        • I don’t see them saying that their results “contradict” anything (maybe the neoclassical view that manager facial structure should not matter for fund performance).

          They say they “contrast” with studies regarding CEOs and the financial results of the companies they manage. But CEOs are not investment managers. They don’t just have to make decisions, they have to be persuasive and many other things that are not relevant for fund managers.

          They also mention that the observed relationship in high-frequency traders goes in the other direction, but again trading is not investing. I find unsurprising that excessive risk taking results in underperformance when we are looking at long-term investment management.

          You give one example of an alternative explanation due to differences in age, but they say the result survives after controlling for age. I’m sure you could find other examples, there is no need to come up with another one.

        • Carlos:

          My points are:

          1. Any pattern they find in these data, they can find something in the literature that would seem to be in agreement with it. They find a pattern in one direction, they can say this is consistent with a paper on high-frequency traders. (You say, “I find unsurprising that excessive risk taking results in underperformance when we are looking at long-term investment management,” but I’d also find it unsurprising that excessive risk taking results in short-term underperformance as well.) They find a pattern in another direction, they can say this is consistent with a paper on hormones and risk taking.

          2. Any claim based on stock performance is contingent on the time period being studied. To put it another way, they don’t have N=hundreds, they have N=1, because, whatever period they’re studying, there are some sorts of assets that go up in value more than others, so any behavior that’s correlated with investment in different sorts of assets can show apparently statistically significant patterns.

        • 1) I’m not sure what your point is. They did a study, motivated by the fact that “despite the importance of the asset management industry and the preponderance of testosterone-charged behavior amongst financial traders, still little is known about the impact of testosterone on investment management.” They report what they found and then they discuss the results in the context of previous studies.

          Should they refrain from discussing the results? If the results had been different they would have written a different paper; that’s obvious. And they acknowledge that their analysis may be missing something:
          “In our work, we carefully consider several alternative explanations (…) but find that they are unlikely to drive our findings. Still, it is not possible to fully rule out all other stories or mechanisms. One caveat, therefore, is that our findings may be driven by some omitted variable that we have not controlled for or that we have not adequately adjusted for in our tests. The findings in this paper should be considered in light of this limitation.”

          2) “Fig. 1 indicate that the high-testosterone fund portfolio consistently under-performs the low-testosterone fund portfolio over the entire sample period and suggest that the underperformance of funds managed by high-testosterone managers is not peculiar to a particular year.”

        • Carlos:

          Here are some things I object to.

          From the Economist article:

          “Are alpha males worse investors?” No. There’s nothing in the research saying anything about alpha males.

          “A paper recently published by researchers at the University of Central Florida and Singapore Management University looks at the relationship between testosterone (a hormone associated with competitiveness and risk-taking) and investment performance.” No. There’s nothing in that research article about testosterone.

          From the research article:

          “Do Alpha Males Deliver Alpha? Testosterone and Hedge Funds.” No. Nothing in the article about alpha males or testosterone.

          “Hedge funds managed by managers with high fWHR underperform those managed by managers with low fWHR by an economically and statistically significant 5.80% per year (t-statistic = 3.16) after adjusting for risk.” No. To call this “statistically significant” is misleading, given the forking paths in the analysis and given autocorrelation in the data.

          “Moreover, masculine managers are more likely to engage in suboptimal trading behavior such as purchasing lottery-like stocks and holding on to loser stocks.” This does not represent additional information. Conditional on the stocks performing relatively poorly in the dataset, this will show up as “holding on to loser stocks.”

          “Since alpha males do not deliver alpha…” No. Again, there is nothing in the paper about “alpha males”; they’re looking at face shape.

          I have no problem with people going through the data and looking at patterns. It’s fine to discuss the results, but not so fine to hype them. Their caveat is fine for what it is, but I still think their conclusions are way too strong, and I am disappointed in the Economist for taking the research so seriously. It’s possible to present such work in a more skeptical way; indeed, the Economist is well known for its tone of lighthearted skepticism, which I think would’ve worked very well here.

        • I agree that we should not forget that a proxy is being used. But unless this proxy is completely useless (I have no idea), saying that the paper says *nothing* about testosterone seems a bit too strong. I wouldn’t say that papers on climate change say nothing about climate because they are about the growth of trees and the composition of ice.

          The reference to alpha males is just to be able to conclude the paper with a bad pun. And it’s easy to understand as a shorthand for high testosterone (and the associated traits). But I agree it doesn’t really mean anything.

          > “Hedge funds managed by managers with high fWHR underperform those managed by managers with low fWHR by an economically and statistically significant 5.80% per year (t-statistic = 3.16) after adjusting for risk.” No. To call this “statistically significant” is misleading, given the forking paths in the analysis and given autocorrelation in the data.

          What forking paths? They use their only variable and the most obvious outcome (risk-adjusted performance). Using raw performance doesn’t change anything. Forming portfolios using deciles is also a standard procedure. Maybe you refer to publication bias, but then every time you read anything you’re being misled.

          > “Moreover, masculine managers are more likely to engage in suboptimal trading behavior such as purchasing lottery-like stocks and holding on to loser stocks.” This does not represent additional information. Conditional on the stocks performing relatively poorly in the dataset, this will show up as “holding on to loser stocks.”

          There are many ways to lose money. Holding on to losers (according to the quarterly filings) is one of them. Purchasing lottery-like stocks is another. The results of those analyses could have been different; how is this not additional information?

        • Carlos:

          Saying “alpha males” and “testosterone” all over the place, when neither is being measured, is not just meaningless; it’s actively misleading. How hard would it be for them to say “face shape” everywhere? Not hard at all—but then it would just make the research article and the news report less appealing.

          Just for example, from the abstract: “high-testosterone managers are more likely to terminate their funds…”. They provide no evidence for this claim. Why not be descriptive and say, “High width-to-height ratio managers . . .”? It doesn’t sound so good, huh? Well, too bad. You present what data you have.

          Regarding the forking paths: There are many many choices in data collection and analysis in that paper. For a sense of how forking paths can arise without researchers even being aware of it, see this story.

        • Ok, I don’t know how many unspeakable things they did to that poor data.

          I forgot to mention one mechanism I imagined that could result in a negative correlation between testosterone (I’ll keep it simple, all these correlations could be weak) and investment performance without implying any causal link. Let’s say there are two ways to get a job as a fund manager: by being good at investing or by being good at selling. And let’s say more testosterone somehow results in better salesmanship. Then high-testosterone managers may or may not be good investors, while low-testosterone managers got there because of their investing talent.

    • https://www.economist.com/blogs/graphicdetail/2018/02/daily-chart-13

      You don’t need to look any further than the first graph in this link to know that this is complete and utter BS.

      The graph shows 600% higher returns over a 20-year period using their strategy. Thousands of very intelligent and highly-paid researchers with the best resources in the world at their disposal spend millions of hours each year trying to improve investment returns by a single percentage point a year. Very few (if any) find anything at all.

      There is no way in the ever-loving $#&*@$^ world that these bozos found what they claim to have found.

  6. I haven’t seen anyone discuss or mention this, but I think it is very interesting, so I’ll link it:

    https://medium.com/@blaisea/do-algorithms-reveal-sexual-orientation-or-just-expose-our-stereotypes-d998fafdf477

    This team (Google and Princeton) basically argues that the differences uncovered by Wang and Kosinski are NOT genetic but due to grooming and cultural differences (the same point Dan Simpson makes in this blog post about filters). Examples: gay men are equally likely to have eyesight problems, but more likely to wear glasses; lesbian women are less likely to wear makeup in general, and eyeshadow in particular. Wang and Kosinski, however, do not consider this possibility at all and make a lot of rather naive assumptions in those “average gayface” images: that all differences are innate and genetic rather than cultural!

    These assumptions, for instance, lead them to conclude that straight men have wider chins than gay men (the explanation is, of course, some hormonal differences – big jaw from more testosterone or whatever). But these differences are due to the fact that gay people are more likely to take selfies from above (making their chins look narrower)!

    In sum, Wang and Kosinski have to answer for a lot of flaws, all around the same point: it is not innate, but cultural differences that produce the “difference in face shape” that they find. Their results are spurious. It is a pity – as someone else on Twitter also observed – that this response is only on Medium and not at JPSP like the original article.

    • This is a very good article!

      And I completely agree that the results are spurious. They badly over-interpret the results of a classifier for finding a rare population that was trained (and validated) on inherently biased samples. We need to do a lot of work to train scientists (social and otherwise) about both the limits of AI and the way that we can force our own implicit biases on the inference.

      • Kosinski is often in the media because of his AI research. To the best of my knowledge, he has not responded to this methodological criticism, only to “social” criticism such as “oh no, what are the new Nazis gonna do with this”. As if the old Nazis needed AI to kill millions. I wish he would reply to this point; I wonder whether he is a “brand defender” like many or whether he is going to consider these criticisms.

        • W&K wrote a lengthy response to their critics (including GLAAD and the HRC) that’s linked on the same page as their pre-print (they call it “author notes”). Based on my memory of reading that back in November (when I wrote my first post on this), they look much more like “brand defenders”. But I’ve been wrong before.
