God, goons, and gays: 3 quick takes

Next open blog spots are in April but all these are topical so I thought I’d throw them down right now for ya.

1. Alex Durante writes:

I noticed that this study on how Trump supporters respond to racial cues is getting some media play, notably over at Vox. I was wondering if you have any thoughts on it. At first glance, it seems to me that its results are being way overhyped. Thanks for your time.

Here’s a table showing one of their analyses:

My reaction to this sort of thing is: (a) I won’t believe this particular claim until I see the preregistered replication. Too many forking paths. And (b) of course it’s true that “Supporters and Opponents of Donald Trump Respond Differently to Racial Cues” (that’s the title of the paper). How could that not be true, given that Trump and Clinton represent different political parties with way different positions on racial issues? So I don’t really know what’s gained by this sort of study that attempts to scientifically demonstrate a general claim that we already know, by making a very specific claim that I have no reason to think will replicate. Unfortunately, a lot of social science seems to work this way.

Just to clarify: I think the topic is important and I’m not opposed to this sort of experimental study. Indeed, it may well be that interesting things can be learned from the data from this experiment, and I hope the authors make their raw data available immediately. I’m just having trouble seeing what to do with these specific findings. Again, if the only point is that “Supporters and Opponents of Donald Trump Respond Differently to Racial Cues,” we didn’t need this sort of study in the first place. So the interest has to be in the details, and that’s where I’m having problems with the motivation and the analysis.

2. A couple people pointed me to this paper from 2006 by John “Mary Rosh” Lott, “Evidence of Voter Fraud and the Impact that Regulations to Reduce Fraud Have on Voter Participation Rates,” which is newsworthy because Lott has some connection to this voter commission that’s been in the news. Lott’s empirical analysis is essentially worthless because he’s trying to estimate causal effects from a small dataset by performing unregularized least squares with a zillion predictors. It’s the same problem as this notorious paper (not by Lott) on gun control that appeared in the Lancet last year. I think that if you were to take Lott’s dataset you could with little effort obtain just about any conclusion you wanted by just fiddling around with which variables go into the regression.
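To illustrate that last point with a toy simulation (made-up data, nothing to do with Lott's actual dataset): when the sample is small and the candidate predictors are many and correlated, the unregularized least-squares coefficient on the variable you care about can be pushed around just by changing which controls you include.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 50, 30                              # small sample, many candidate controls

    # Correlated controls; the "policy" variable is itself correlated with a few of
    # them, and its true effect on the outcome is exactly zero.
    sigma = 0.5 * np.eye(p) + 0.5 * np.ones((p, p))
    controls = rng.normal(size=(n, p)) @ np.linalg.cholesky(sigma).T
    policy = controls[:, :5].sum(axis=1) + rng.normal(size=n)
    outcome = controls[:, :5].sum(axis=1) + rng.normal(size=n)

    def policy_coef(which_controls):
        """Least-squares coefficient on `policy`, given a chosen set of controls."""
        X = np.column_stack([np.ones(n), policy, controls[:, list(which_controls)]])
        beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
        return beta[1]

    print(policy_coef([]))            # no controls: a sizable "effect" appears
    print(policy_coef(range(5)))      # the relevant controls: effect shrinks toward zero
    print(policy_coef(range(5, 30)))  # a different set of controls: yet another answer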

3. Andrew Jeon-Lee points us to this post by Philip Cohen regarding a recent paper by Yilun Wang and Michal Kosinski that uses a machine learning algorithm and reports, “Given a single facial image, a classifier could correctly distinguish between gay and heterosexual men in 81% of cases, and in 71% of cases for women. Human judges achieved much lower accuracy: 61% for men and 54% for women.”

Hey, whassup with that? I can get 97% accuracy by just guessing Straight for everybody.

Oh, it must depend on the population they’re studying! Let’s read the paper . . . they got data on 37,000 men and 39,000 women, approx 50/50 gay/straight. So I guess my classification rule won’t work.

More to the point, I’m guessing that the classification rule that will work will depend a lot on what population you’re using.

I had some deja vu on this one because last year there was a similar online discussion regarding a paper by Xiaolin Wu and Xi Zhang demonstrating an algorithmic classification of faces of people labeled as “criminals” and “noncriminals” (which I think makes even less sense than labeling everybody as straight or gay, but that’s another story). I could’ve sworn I blogged something on that paper but it didn’t show up in any search so I guess I didn’t bother (or maybe I did write something and it’s somewhere in the damn queue).

Anyway, I had the same problem with that paper from last year as I have with this recent effort: it’s fine as a classification exercise, and it can be interesting to see what happens to show up in the data (lesbians wear baseball caps!), but the interpretation is way over the top. It’s no surprise at all that two groups of people selected from different populations will differ from each other. That will be the case if you compare a group of people from a database of criminals to a group of people from a different database, or if you compare a group of people from a gay dating website to a group of people from a straight dating website. And if you have samples from two different populations and a large number of cases, then you should be able to train an algorithm to distinguish them at some level of accuracy. Actually doing this is impressive (not necessarily an impressive job by these researchers, but it’s an impressive job by whoever wrote the algorithms that these people ran). It’s an interesting exercise, and the fact that the algorithms outperform unaided humans, that’s interesting too. But then there’s this kind of thing: “The phenomenon is, clearly, troubling to those who hold privacy dear—especially if the technology is used by authoritarian regimes where even a suggestion of homosexuality or criminal intent may be viewed harshly.” That’s just silly, as it completely misses the point that the success of these algorithms depends entirely on the data used to train them.

Also Cohen in his post picks out this quote from the article in question:

[The results] provide strong support for the PHT [prenatal hormone theory], which argues that same-gender sexual orientation stems from the underexposure of male fetuses and overexposure of female fetuses to prenatal androgens responsible for the sexual differentiation of faces, preferences, and behavior.

Huh? That’s just nuts. I agree with Cohen that it would be better to say that the results are “not inconsistent” with the theory, just as they’re not inconsistent with other theories such as the idea that gay people are vampires (or, to be less heteronormative, the idea that straight people lack the vampirical gene).

Also some goofy stuff about the fact that gay men in this particular sample are less likely to have beards.

In all seriousness, I think the best next step here, for anyone who wants to continue research in this area, is to do a set of “placebo control” studies, as they say in econ, each time using the same computer program to classify people chosen from two different samples, for example college graduates and non-college graduates, or English people and French people, or driver’s license photos in state X and driver’s license photos in state Y, or students from college A and students from college B, or baseball players and football players, or people on straight dating site U and people on straight dating site V, or whatever. Do enough of these different groups and you might get some idea of what’s going on.
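Here’s a minimal sketch of why I expect those placebo comparisons to “work” (simulated data, nothing to do with the actual study): take two large samples whose underlying populations differ only slightly on some generic features, and a garden-variety classifier will separate them at better-than-chance rates. That tells you something about the sampling, not about deep facial signatures.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(1)
    n = 20000                                      # cases per "population"

    # Two hypothetical populations that differ by a small shift on 20 generic
    # features (stand-ins for whatever image features an algorithm extracts).
    X_a = rng.normal(0.0, 1.0, size=(n, 20))
    X_b = rng.normal(0.15, 1.0, size=(n, 20))
    X = np.vstack([X_a, X_b])
    y = np.repeat([0, 1], n)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))   # noticeably above 0.5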

51 thoughts on “God, goons, and gays: 3 quick takes”

      • Though for real, I wouldn’t be surprised if (self-labelled) straight people on that site are just slightly less ‘conventionally attractive’ than gay people.

        Which is easily explained by different ‘dating scenes’. Or, you know, noise.

        • (And now I feel bad for going down the rabbit hole of even bothering to interpret this study, potentially in offensive ways. I can only imagine what life as an evolutionary psychologist would be like. Gross)

        • There’s a very long post tomorrow about this, but basically you’re completely correct. The most likely problem with this study is that the training data gathered from a single dating site is unlikely to be representative. (I’m pretty sure I used the words “the Balkanisation of gay desire” because it makes me laugh.)

        • Yup, I don’t think you even need to postulate differences in what makes a female/male face attractive/unattractive (though this obviously limits generalisability even more); it’s just that you’re drawing from different pools of people than you would expect in the full population (of ‘white’ people at least…)

        • It’s not entirely that. The paper spends a long time arguing that its classifier doesn’t focus on transient grooming features. Even if that’s true (hint: it’s probably not), if the training sample isn’t representative of the population at large (look at Table 1), then population-level inferences will likely be false.

        • Your point is also valid, but my point is that a photograph you post to a dating site is not necessarily a representative photograph of what you look like every day, and the ways in which it is non-representative may be more exaggerated in one population than in the other.

          Suppose you took, say, DMV photos for several thousand self-identified straight people, and similarly for self-identified gay people. Under the assumption that people don’t primp for their DMV photo, does the classifier still work? Or better yet, how well does the dating-site-trained classifier work? Does it generalize to photos not intended for dating?

        • I don’t disagree with you. It’s just that that point is covered in the paper (with the argument I gave). If the thing the authors claim is true, then context doesn’t matter (because my cheekbones are the same regardless of why I took a photo). If the authors’ claim is not true, then the problem you pointed out is very real.

          But the solution you pointed out (get a better validation sample) would fix literally every problem in this paper (and likely render the results false). The validation set they actually use is hilariously bad. But a real problem that they have in constructing a validation set is “how do you get information about sexual orientation to go with the photo?” And obviously, if you don’t have a good solution to that problem, you shouldn’t be doing the study in the first place.

        • But your cheekbones don’t necessarily *look* the same regardless of why you took a photo. Are gays more likely to use photoshop (or any other photographic techniques) to improve the look of their facial features? Plausible. That’s another confound. It means they can arbitrarily change facial features, including those that are context-independent in the real world.

        • That’s very true. I guess I’m willing to concede they may be correct on this point mainly because I don’t want to check the details. And also because even if they are 100% correct that their classifier does not use any transient or styling features, the study is still fundamentally flawed due to selection issues.

          Possibly the only good thing I can say about this study is that there are so many things wrong with it, we don’t need to worry about our “hot takes” overlapping!

  1. a paper by Xiaolin Wu and Xi Zhang demonstrating an algorithmic classification of faces of people labeled as “criminals” and “noncriminals” (which I think makes even less sense than labeling everybody as straight or gay, but that’s another story).

    The main take-home from that study is that the Chinese system tends to criminalise anyone whose facial proportions deviate too much from the norm.

  2. Note that in the “Racialization” study, Clinton supporters also have positive coefficients on the “Favorability x Cue” variable, but they are half as large as the Trump coefficients, so not statistically significant.

    Note further that the difference between the Trump and Clinton coefficients doesn’t appear to be statistically significant. (Compare the coefficients on “Trump Favorability x Racial Cue” versus “Clinton Favorability x Racial Cue.”)

    So, it looks like this is the old problem that “the difference between a statistically significant coefficient and a non-statistically significant coefficient is not necessarily statistically significant.”

    • Here is the relevant discussion from the paper:

      By contrast, attitudes toward Hillary Clinton failed to moderate the impact of the racial cues for any of the dependent variables (i.e., the interaction between Clinton favorability and cue condition fails to reach significance in any of the models). Furthermore, tests constraining the two coefficients to equality indicated that the coefficient for the interaction between Trump favorability and the racial cue is significantly stronger than the coefficient for the interaction between Clinton favorability and the racial cue for anger (p<0.05) and individual blame (p<0.10), and marginally stronger for opposing mortgage assistance (p<0.10, one-tailed). These findings indicate that responses to the racial cue varied as a function of feelings about Donald Trump—but not feelings about Hillary Clinton—during the 2016 presidential election.

      First, they conflate a lack of statistical significance with proof of no effect, and then they claim a significant difference between the Trump and Clinton coefficients based on a p < 0.10.

      • Terry:

        Yeah, good catch. Too bad about the pressure to make deterministic statements from noisy data, and too bad about the incentives that academic journals and news organizations create for researchers to behave in this way.

        • Even worse, they resort to a one-tailed test to get the third p-value less than 0.10.

          Note also that the second and third Clinton interaction terms are “significant” at the 10% level in a one-tailed test as well (t = 1.43 & 1.57). But they don’t apply that standard to the Clinton interaction terms — they only apply it to the test they want to be significant.
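          For anyone who wants the arithmetic behind the “difference between significant and not significant” point, here’s a quick sketch with made-up numbers (not the paper’s estimates): the standard error of a difference between two roughly independent coefficients is the root sum of squares of their standard errors, so one coefficient can clear the 5% bar while the gap between the two is nowhere close.

              import math

              # Made-up numbers for illustration, not the paper's estimates.
              b_trump, se_trump = 0.40, 0.18        # z = 2.22, "significant" at 5%
              b_clinton, se_clinton = 0.20, 0.18    # z = 1.11, "not significant"

              se_diff = math.sqrt(se_trump**2 + se_clinton**2)   # SE of the difference, assuming independence
              z_diff = (b_trump - b_clinton) / se_diff
              print(round(z_diff, 2))                            # 0.79 -- nowhere near 1.96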

  3. Even if everything about study #3 was ok, the predictive ability would still be very poor. With an accuracy of 0.8 = sensitivity = specificity (I didn’t read the paper to see if they differentiate the two) and a prevalence of 0.03, the probability of actually gay given predicted gay is 11%.
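    Spelling out that arithmetic (using the same assumed sensitivity = specificity = 0.8 and 3% prevalence):

        # Assumed numbers from the comment above: sensitivity = specificity = 0.8, prevalence 3%.
        sens, spec, prev = 0.8, 0.8, 0.03

        # Bayes' rule: P(actually gay | predicted gay)
        ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
        print(round(ppv, 3))   # 0.11 -- about an 11% chance of being gay given a "gay" prediction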

    • Hey, whassup with that? I can get 97% accuracy by just guessing Straight for everybody.

      Even if everything about study #3 was ok, the predictive ability would still be very poor. With an accuracy of 0.8 = sensitivity = specificity (I didn’t read the paper to see if they differentiate the two) and a prevalence of 0.03, the probability of actually gay given predicted gay is 11%.

      They meant AUC rather than accuracy:

      Among men, the classification accuracy equaled AUC = .81 when provided with one image per person. This means that in 81% of randomly selected pairs—composed of one gay and one heterosexual man—gay men were correctly ranked as more likely to be gay. The accuracy grew significantly with the number of images available per person, reaching 91% for five images. The accuracy was somewhat lower for women, ranging from 71% (one image) to 83% (five images per person).

      As I noticed earlier, AUC is probably going to confuse people just as much as p-values. I see no legitimate reason they should define “classification accuracy” as “AUC” when “classification accuracy” already means “% correct predictions”.
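      To make the pairwise-ranking reading of AUC concrete, here’s a small sketch with simulated scores (not the paper’s data): sklearn’s roc_auc_score matches the fraction of (positive, negative) pairs in which the positive case gets the higher score.

          import numpy as np
          from sklearn.metrics import roc_auc_score

          rng = np.random.default_rng(2)
          scores_pos = rng.normal(1.0, 1.0, size=2000)   # classifier scores for true positives
          scores_neg = rng.normal(0.0, 1.0, size=2000)   # classifier scores for true negatives

          labels = np.concatenate([np.ones(2000), np.zeros(2000)])
          scores = np.concatenate([scores_pos, scores_neg])

          # AUC vs. the brute-force fraction of correctly ranked positive/negative pairs.
          pairs_correct = (scores_pos[:, None] > scores_neg[None, :]).mean()
          print(roc_auc_score(labels, scores), pairs_correct)   # the two numbers agree (about 0.76)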

      • Thanks.

        If you construct an ROC curve with 3 points — anchored at (0,0) and (1,1) with only a single other point located at (0.2,0.8) corresponding to sensitivity = specificity = 0.8, then the AUC is 0.8. Of course, their AUC is the output of a logistic regression and will be smoother than this toy example. However, I don’t believe there is a combination of sensitivity and specificity that can be pulled from an ROC curve with AUC of 0.8 such that the predictive quality in the presence of a prevalence of 3% (or 5% or whatever) is worth a darn.
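        For what it’s worth, here’s that toy ROC in a few lines of Python, just to confirm the trapezoid arithmetic:

            import numpy as np

            # Three-point ROC through (0, 0), (0.2, 0.8), (1, 1), i.e. sensitivity = specificity = 0.8.
            fpr = np.array([0.0, 0.2, 1.0])   # 1 - specificity
            tpr = np.array([0.0, 0.8, 1.0])   # sensitivity

            auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)   # trapezoid rule
            print(auc)   # 0.8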

      • Dan:

        That’s hilarious. After pages and pages of misinterpreting and overinterpreting their findings, they write, “Importantly, we would like to warn our readers against misinterpreting or overinterpreting this study’s findings.”

        I guess cos on page 31 they say “Importantly,” they must really mean it this time, and we can ignore much of pages 1-30. . . .

        • Less funny is when they basically say (starting around line 154): Hey, it would be really really terrible if, like, despotic governments that criminalize homosexuality were to come up with a technology like this. From this they conclude: “It is thus critical to inform policymakers, technology companies and, most importantly, the gay community, of how accurate face-based predictions might be.”

          Yes, I am definitely thinking to myself “I really hope Stanford develops a way for governments and employers to identify gay people, so that when they give that algorithm to evil people in the form of a publicly available working paper, I’ll know precisely how likely those evil people will be to correctly classify the sexuality of people for the purposes of potentially incarcerating or murdering them.” And for the record, I am not the one who went there first – the authors actually list four of the eight (¿why just four?) countries where homosexuality is punishable by death.

          Just like – ¿why? I’m not so naive as to think that, absent this research, no one with bad motives would try to do this. But what is the value of helping them? And what is the value of such a study if its point is NOT to help them? If it is just about evolutionary biology, then why this whole bullshit paragraph about social justice concerns that are like the evil cousin of actual social justice concerns?

          Ugh. The only things I miss about the Bay Area are public transportation and those Jamaican beef patties from around the corner Andrew got me hooked on.

        • – Things don’t have to be funny to be hilarious.
          – In the Authors’ Notes they conveniently provide a map of countries where homosexuality is illegal/punishable by death. They want this to be the narrative. The second author has an attachment to controversial research.
          – The burlesque of hand-wringing ethics considerations they put in the paper is both funny and hilarious.

        • Oh I get it now… you are totally right. For instance, this isn’t funny at all, but it is hilarious:

          “And quite frankly, we would be delighted if our results were wrong. Humanity would have one less problem, and we could get back to writing self-help bestsellers about how power-posing makes you bolder, smiling makes you happier, and seeing pictures of eyes makes you more honest.”

          Which they then footnote with “None of these findings seem to replicate.”

          So really, they are the heroes of academia, because they are out there fighting to protect marginalized people, when what they really want to do is bullshit garbage research that would make them rich and famous. I mean, as a joke it isn’t funny, but in the context it is, indeed, hilarious.

          https://docs.google.com/document/d/11oGZ1Ke3wK9E3BtOFfGfUQuuaSMR8AO2WfWH3aVke6U/edit#

        • Sometimes you have to make your own fun jrc. (Also I’m glad someone else had the joy of reading that document. I liked the bit where they talk about the second author’s previous work as an agent of change…)

        • Embrace variation, Andrew. Just because you have a home that isn’t a cave doesn’t mean your lived experience is more important than that of Patties Georg, who eats 10,000 patties a day. He counts in the average too.

  4. Thanks a lot for this post and the one by Dan Simpson. I looked at this paper a few days ago and immediately thought “I wish there was some post about it on Andrew’s blog”. (Well, I’m just a math student with no real experience in statistical analysis and can’t find the problems you see immediately.)

    Greetings from Poland!

  5. “In all seriousness, I think the best next step here, for anyone who wants to continue research in this area, is to do a set of “placebo control” studies, as they say in econ, each time using the same computer program to classify people chosen from two different samples, for example college graduates and non-college graduates, or English people and French people, or driver’s license photos in state X and driver’s license photos in state Y, or students from college A and students from college B, or baseball players and football players, or people on straight dating site U and people on straight dating site V, or whatever. Do enough of these different groups and you might get some idea of what’s going on.”

    I want to see how well a neural classifier works on distinguishing OKcupid vs. match.com photos. My hypothesis, based on experience using the sites (years ago, I guess tinder has taken over now): OKcupid is free, so attracts basically students and other people with limited income. And they are flaky. Match.com is expensive, and the people on there are in professional careers…and if you’re a poor student you’re wasting your time.

    It’s not hard to see how these differences might lead to statistical regularities in the profile photos that a classifier could pick up, even within, say, hetero whites of a particular sex.

  6. “Given a single facial image, a classifier could correctly distinguish between gay and heterosexual men in 81% of cases, and in 71% of cases for women. Human judges achieved much lower accuracy: 61% for men and 54% for women.”

    I’d be interested in finding groups of humans who were better than the computer: perhaps Hollywood casting agents, prostitutes, or gossip columnists.
