Controversy over “Facial recognition technology can expose political orientation from naturalistic facial images”

A couple people pointed me to this research article, “Facial recognition technology can expose political orientation from naturalistic facial images,” by Michal Kosinski, which reports:

A facial recognition algorithm was applied to naturalistic images of 1,085,795 individuals to predict their political orientation by comparing their similarity to faces of liberal and conservative others. Political orientation was correctly classified in 72% of liberal–conservative face pairs, remarkably better than chance (50%), human accuracy (55%), or one afforded by a 100-item personality questionnaire (66%).

The result seemed plausible to me. As Kosinski writes, a face contains lots of information; it’s not just bone structure. It seems that people of different social classes have different-looking faces, so I could imagine this to be true of political orientation as well.

I wasn’t quite sure how to think about the 72% result—is that high or low?—so I asked a colleague who prefers to remain anonymous, who told me this:

Regarding the paper, I’m not exactly sure what to make of the controversy. He released the underlying data, so I ran a couple of models [see script at end of this post] and here are the results:

AUC [area under the curve, a measure of classification accuracy] of simple demographic model: 0.63
AUC of model with more observable traits: 0.65
AUC of model with observable traits + personality: 0.72

It looks like ethnicity doesn’t include “Hispanic” for some reason, which is probably hurting the performance of the demographic model. If you use demographics + directly observable traits in the dataset (e.g., smiling, facial hair, glasses, etc.), you get 65% AUC. When you throw in personality (based on a survey), you get to 72% AUC, which is the same as what is reported in the paper based on the face recognition algorithm alone.

Personality is not directly visible, so the face algorithm is picking up on something more than the obviously observable traits. But I think the gap between 65% and 72% is not that big, and likely can be closed if you have more fine-grained ethnicity (e.g., Hispanic) and account for things like wearing make-up. Perhaps throwing in a few interactions and using a non-linear model would also close the gap a bit more.

As another point of comparison, I computed performance of a simple demographic model for 2012 presidential vote choice (based on exit poll data that included Hispanic ethnicity), and got an AUC of 68%.

So (unsurprisingly) there is real information about people’s political leanings that is discernible from how they look and choose to present themselves. But I think the paper itself over-claims (e.g., it only hints at the comparisons above) and strongly suggests something more subtle is going on.

That’s a somewhat critical take, but I’d like to thank Kosinski for posting his data.
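If you want to play with my colleague’s suggestion of adding interactions or a non-linear model, here is a minimal sketch of what that might look like, reusing the faces data frame and the auc() helper from the script at the end of this post. I haven’t run this, and the mgcv package and the particular terms are my choices rather than my colleague’s, so treat it as a starting point rather than a result:

# A rough sketch of the "interactions + non-linear model" idea; the column
# names come from the script below and I have not run this against the data.
library(mgcv)

# logistic regression with a couple of demographic interactions
model_inter <- glm(pol == 'liberal' ~
                     gender * ethnicity.value + gender * age_bin +
                     facial_hair + smile.value,
                   data = faces, family = 'binomial')

# additive model with smooth terms for a few continuous facial measurements
model_gam <- gam(pol == 'liberal' ~
                   gender + ethnicity.value + age_bin +
                   s(smile.value) + s(headpose.yaw_angle) + s(headpose.pitch_angle),
                 data = faces, family = binomial)

cat('AUC with interactions: ', auc(model_inter), '\n')
cat('AUC with smooth terms: ', auc(model_gam), '\n')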

I got some more feedback on the paper from political scientist Brian Sala, who sent comments to Jeff Lax who passed them on to me. Here’s Sala:

This is really interesting. Limited, because the model is about picking the liberal (conservative) in a liberal/conservative pair. But the model should be useful for classifying individuals. I would have liked to have seen it applied via an ordered probit/logit to model self-assessed ideology (the typical 5-pt scale from very liberal to very conservative) instead of this binary classification.

Good point! A more precise measure should allow us to learn more.

Sala continues:

I [Sala] am a little unclear on the out-of-sample classifications here from my quick skim (the main model, I think, classifies, within sample, which member of a pair is the liberal or conservative, by comparing facial attributes of the individuals drawn from the sample to the average lib/conservative attributes in the sample). Still, cool stuff: “In other words, a single facial image reveals more about a person’s political orientation than their responses to a fairly long personality questionnaire, including many items ostensibly related to political orientation (e.g., ‘I treat all people equally’ or ‘I believe that too much tax money goes to support artists’).”

What I don’t like is the reliance on central tendencies for the collapsed “liberal” and “conservative” groups. Because one would think the tool could be used to compare all pairs from the sample to choose the “more liberal” of the pair to get a rank ordering of the whole sample, a la Groseclose and Milyo’s approach to ordering media outlets or the power rankings of sports teams.

It seems to me (again, based on a quick skim) that this paper’s approach, by comparing individuals in a lib/conservative pair, is asking which in the pair is most like a “centrist” lib or “centrist” conservative. If the recovered geometry of facial features is multidimensional, “very liberal” and “very conservative” individuals could be closer to each other than they are to their respective ideological centroids, even as each is closer to the “right” centroid than the “wrong” one (or rather, one in the pair is closer to the reference centroid, leading to the classifications in the pair). Or worse, the “very liberal” person in the pair could be closer to the conservative centroid than the “very conservative” person AND closer to the liberal centroid (or vice-versa). Implicit here is two pairs of points in the facial characteristics space: the two individuals and the two centroids.

I’m doubtful about this claim from Sala. I’d guess there’d be more of a linear trend going on, with moderate liberals and moderate conservatives closer to the center of the distribution. But I don’t really know; this is just my guess.
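Regarding Sala’s earlier suggestion of comparing all pairs from the sample to get a rank ordering, a la Groseclose and Milyo: that is essentially a Bradley-Terry setup. Here is a toy sketch of fitting one from hypothetical “who is more liberal?” pair judgments, using the BradleyTerry2 package; the data and the whole exercise are my construction, not anything from the paper:

# Toy Bradley-Terry fit from hypothetical pairwise "who is more liberal?"
# judgments; my construction, not the paper's method.
library(BradleyTerry2)

people <- c('A', 'B', 'C', 'D')
pairs <- data.frame(
  person1 = factor(c('A', 'B', 'C', 'D', 'A', 'B'), levels = people),
  person2 = factor(c('B', 'C', 'D', 'A', 'C', 'D'), levels = people),
  win1    = c(1, 1, 1, 1, 1, 1),  # 1 if person1 was judged more liberal
  win2    = c(0, 0, 0, 0, 0, 0)
)

fit <- BTm(cbind(win1, win2), person1, person2, data = pairs)
BTabilities(fit)  # estimated "liberalness" scores give a rank ordering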

Sala continues:

I [Sala] presume that the model is projecting the two individuals’ locations on to the line through the two centroids. If the model is “right”, the two individuals will be ordered the same way as the two centroids (liberal projected to the left of the conservative on the line through the centroids). The two individuals could be “outside” their respective centroids, inside, both to one side of a centroid, or straddling a centroid. If the left/right order is “correct” but both are projected to one side, you get classification errors (the liberal is further from the liberal centroid than is the conservative), whereas you get correct classifications if both project in between. If they straddle one centroid, it depends on the reference category (are you comparing each to the liberal centroid or the conservative centroid? If comparing to the liberal centroid and they straddle the liberal, I think the model will classify the “closer” projection as liberal, leading to some classification errors. If comparing to the conservative centroid, no classification errors in this case.) Again, my quick read.

Again, I’m skeptical of Sala’s conjectures but who knows? and so I thought I’d share this with you.
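To make Sala’s geometric picture a bit more concrete, here is a toy version of the projection argument in R. The two-dimensional “feature space,” the centroids, and the two individuals are all made up for illustration; this is not the paper’s actual algorithm:

# Toy illustration of Sala's projection argument; the points and the
# two-dimensional "feature space" are made up.
lib_centroid <- c(0, 0)
con_centroid <- c(4, 0)

person_a <- c(-1, 3)  # suppose this is the liberal in the pair
person_b <- c(5, 3)   # suppose this is the conservative

# signed position of x along the line from c1 to c2 (0 at c1, 1 at c2)
project_score <- function(x, c1, c2) {
  v <- c2 - c1
  sum((x - c1) * v) / sum(v * v)
}

score_a <- project_score(person_a, lib_centroid, con_centroid)
score_b <- project_score(person_b, lib_centroid, con_centroid)

# the pair is ordered correctly if the liberal projects to the left
# (smaller score) of the conservative, even though both individuals here
# project "outside" their respective centroids
cat('liberal projection: ', score_a, '; conservative projection: ', score_b, '\n')
cat('pair ordered correctly: ', score_a < score_b, '\n')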

Finally, someone forwarded me a copy of this email that was floating around Stanford (where Kosinski works):

Dear colleagues,

Several members of the ACM US TPC on AI and Algorithms are always concerned on the unethical use of AI, even more when it comes from top universities like Stanford, given its reputation and influence. The last example, but not the only one, is more modern phrenology:

Facial recognition technology can expose political orientation from naturalistic facial images, M. Kosinski, Scientific Reports, 2021. (and is not facial recognition, is facial biometrics).

The argument that this is to warn about the bad use of ML, is not a valid one for publishing this type of research as this is pseudo-science (the backslash was worse when the same approach was used for sexual orientation). Surprisingly, Stanford’s IRB approved this research, so they are not taken care of the ethical aspect. Of course Nature also shares the responsibility. For this reason we have already contacted all the stakeholders involved to avoid more of this in the future.

However, I believe that we as computer scientists also should be concerned about unethical science and bad press in AI (and for you, the same for Stanford, especially with the HAI initiative). So this personal action is just to make sure you are aware of this, in case you were not.

I disagree with this “Dear colleagues” letter on many levels. OK, let me be clear. I fully support the freedom of the author of this letter to send it around; I’m not saying such letters should not be allowed; I’m just saying I disagree with the substance.

First, I don’t think it’s helpful to refer to this as “phrenology.” It’s a statistical analysis of photos. Similarly, I don’t see why they call it “pseudo-science.” The work of that ESP guy at Cornell, or the Pizzagate guy, or the beauty-and-sex-ratio research that we’ve discussed on this blog . . . I could see calling all of that pseudoscience, or, at least, really really bad science. But this facial recognition paper seems legit. I don’t like the idea of calling a paper “pseudoscience” just because you find it annoying.

Second is the ethical question. I have mixed feelings on this one. Maybe the analysis shouldn’t be done, as there’s something Big Brotherish about it—but as Kosinski points out, companies and governments are already doing such things, so it’s not clear that academics shouldn’t be doing it too. On the other hand, if it’s really something that shouldn’t be done, then no point in academics leading the way. Overall I don’t think I’d consider this to be unethical research, but that’s a matter of opinion; I can’t really say that the letter writer is right or wrong on this one.

Third is the statement, “Surprisingly, Stanford’s IRB approved this research.” I’ll agree that Stanford researchers can have ethical problems; for example there was this mailer they sent to voters in Montana, and Stanford also employs the business school professor who notoriously told 240 different restaurants that “Our special romantic evening became reduced to my wife watching me curl up in a fetal position on the tiled floor of our bathroom between rounds of throwing up.” So, sure, there are some studies that never should be approved. For this facial recognition study, though, on what ground would Stanford not approve it? Because someone thinks it’s evil? I really don’t like the idea of the IRB being used as some sort of political correctness filter, and I’m bothered that these people think the job of the IRB is to stop research from being done, just because somebody finds it politically objectionable.

P.S. Here’s my colleague’s R script:

library(tidyverse)
library(ROCR)

# AUC of a fitted binary-response model, computed with ROCR from the
# model's in-sample predictions
auc <- function(model) {
  pred_ROCR <- prediction(predict(model), model$y)
  auc_ROCR <- performance(pred_ROCR, measure = "auc")@y.values[[1]]
  auc_ROCR
}

# round x to the nearest multiple of accuracy (same as plyr::round_any)
round_any <- function(x, accuracy, f = round) { f(x / accuracy) * accuracy }

# load the original face data and restrict to americans
# downloaded from https://drive.google.com/file/d/1I3QMFzb12-i6Mu9lSD1xxymm5nmQyqm9/view?usp=sharing
# note that race/ethnicity is classified as 'asian', 'black', 'india', 'white'
# in particular, hispanic ethnicity is not included
load('faces.RData')  # loads a data frame named 'd'
faces <- tibble(d) %>% 
  filter(country == 'united states') %>%
  select(-userid, -starts_with('pol_'), -database, -age) %>% 
  mutate(age_bin = factor(round_any(age.value, 10))) %>%
  drop_na()

# exit poll data from 2012 presidential election
survey <- read_tsv('https://5harad.com/mse125/assets/hw6/survey.tsv')

# fit a simple logistic regression based on sex, race, and age bin
model_demo <- glm(pol == 'liberal' ~ 
                      gender + ethnicity.value + age_bin, 
                    data = faces, family = 'binomial')

# fit a model with more observables
model_obs <- glm(pol == 'liberal' ~ 
                    gender + ethnicity.value + age_bin +
                    facial_hair +
                    emotion.sadness + emotion.neutral + emotion.disgust + 
                    emotion.anger + emotion.surprise + emotion.fear + emotion.happiness +
                    headpose.yaw_angle + headpose.pitch_angle + headpose.roll_angle + 
                    smile.value +
                    left_eye_status.normal_glass_eye_open + left_eye_status.no_glass_eye_close +
                    left_eye_status.occlusion + left_eye_status.no_glass_eye_open +    
                    left_eye_status.normal_glass_eye_close + right_eye_status.dark_glasses +
                    right_eye_status.normal_glass_eye_open + right_eye_status.no_glass_eye_close +
                    right_eye_status.occlusion + right_eye_status.no_glass_eye_open +    
                    right_eye_status.normal_glass_eye_close + right_eye_status.dark_glasses,
                  data = faces, family = 'binomial')

# fit a model with observables and personality (openness)
model_personality <- glm(pol == 'liberal' ~ 
                           gender + ethnicity.value + age_bin +
                           facial_hair +
                           emotion.sadness + emotion.neutral + emotion.disgust + 
                           emotion.anger + emotion.surprise + emotion.fear + emotion.happiness +
                           headpose.yaw_angle + headpose.pitch_angle + headpose.roll_angle + 
                           smile.value +
                           left_eye_status.normal_glass_eye_open + left_eye_status.no_glass_eye_close +
                           left_eye_status.occlusion + left_eye_status.no_glass_eye_open +    
                           left_eye_status.normal_glass_eye_close + right_eye_status.dark_glasses +
                           right_eye_status.normal_glass_eye_open + right_eye_status.no_glass_eye_close +
                           right_eye_status.occlusion + right_eye_status.no_glass_eye_open +    
                           right_eye_status.normal_glass_eye_close + right_eye_status.dark_glasses +
                           ext + neu + ope + agr + con,
                         data = faces, family = 'binomial')

# fit a simple logistic regression based on sex, race, and age
model_exit_poll <- glm(vote == 'A' ~ sex + race + age, data = survey, family = 'binomial')

# compute AUC of the various models
cat('AUC of simple demographic model: ', auc(model_demo), '\n')
cat('AUC of model with more observable traits: ', auc(model_obs), '\n')
cat('AUC of model with observable traits + personality: ', auc(model_personality), '\n')
cat('AUC of demographic model based on survey data: ', auc(model_exit_poll), '\n')
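P.P.S. One note on comparing these AUCs with the paper’s headline number: for a binary outcome, the AUC is the probability that a randomly chosen liberal gets a higher predicted score than a randomly chosen conservative, so the AUCs above are directly comparable to the paper’s accuracy on liberal–conservative face pairs. Here is a quick Monte Carlo check of that equivalence (my addition, reusing model_demo from the script above; I haven’t run it on the actual data):

# Sanity check (my addition): AUC equals the probability that a randomly
# drawn liberal is scored above a randomly drawn conservative, i.e. the
# accuracy on random liberal-conservative pairs.
set.seed(1)
scores <- predict(model_demo)
is_lib <- model_demo$y == 1
lib_scores <- sample(scores[is_lib], 10000, replace = TRUE)
con_scores <- sample(scores[!is_lib], 10000, replace = TRUE)
pair_acc <- mean(lib_scores > con_scores) + 0.5 * mean(lib_scores == con_scores)
cat('pairwise accuracy (should be close to the demographic AUC): ', pair_acc, '\n')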

20 thoughts on “Controversy over ‘Facial recognition technology can expose political orientation from naturalistic facial images’”

  1. ‘I disagree with this “Dear colleagues” letter on many levels. OK, let me be clear. I fully support the freedom of the author of this letter to send it around; I’m not saying such letters should not be allowed’

    If they are not willing to attach their names to it publicly, but are still going around to ‘stakeholders’ to ‘avoid more of this in the future’, perhaps such letters should not be allowed. Sunlight is the best disinfectant, is it not?

  2. Did I miss this, or is there a limitations section in this paper that says that because only a very biased population was studied (US, UK, Canada), your mileage may vary? I’m trying to imagine guessing political orientation through images of Indian villagers, who constitute some 1.3 times the population of US, Canada, and the UK combined.

    This would be the equivalent of the “in mice” caveat in this line of work.

    There is a rule in linguistics that I believe is called the (Emily) Bender rule, which stipulates that one has to say which language the theoretical/empirical claim is about. The reason for this is the same as the “in mice” stuff; in linguistics, too often we look at English and just assume that whatever holds in English must hold in every other language.

    • I imagine the combinations of views deemed “liberal” and “conservative” differ considerably across cultures. Even just between the US and UK, I’ve long found it weird that it’s the Republicans here who have more conspicuous pockets of anti-Semitism, while Labour has the same problem over there. So maybe the sample is both too diverse and not diverse enough?

      • My impression (mostly from reading the LRB and Guardian) is that what is called “anti-semitism” in UK politics is more connected with pro-Palestinian sentiment in Labour, while in the USA it is connected more with conspiracy/cabal (~ Elders of Zion) aspects, more common among fringe “Republicans”. That is, anti-Semitism is not a constant, as your statement seems to imply.

  3. I don’t want to dismiss the idea out of hand, but I wonder if the comparisons they are doing between their method and human beings are really all that revealing. At first, I thought they compared human classifications of the pictures in their dataset with those of their algorithm, but as far as I can tell, the quoted 55% for human accuracy is based on a variety of different studies (including one showing silent videos of politicians). I should think that if you want to claim that you have a method better than human accuracy, the tasks should be identical, not radically different ones (I’m curious whether human beings would do better than 55% if they were looking at pictures from Facebook and dating sites, where I might expect some kind of self-presentation consistent with self-described political leanings).

    About the ethics: I guess if they were able to identify the political ideology of random people in a crowd I’d be a bit concerned. But I’m less concerned about a forced choice between pictures that users themselves selected for Facebook and dating sites. I guess that there’s the slippery slope issue, but still . . .

  4. Andrew –

    I’m surprised that your overall take (before considering the details of the work) is that it’s plausible.

    I’m dubious that there would be any viable mechanism other than that they’ve got an algorithm to assess race/ethnicity and sex. Maybe class/SES, but I’m skeptical. But even if they had some way of discerning a combo of those attributes, doesn’t that mean you could just go with those predictive attributes directly, in which case using facial recognition would only lower your skill while adding a larger downside?

    As for the 100-item questionnaire, why not just go with a question about the death penalty, or views on Muslims, or views on racial issues?

    As a not too terribly bright, total non-expert who isn’t statistically literate, I’m not buying that this idea crosses the plausible mechanism bar.

    • Along those lines…

      > It seems that people of different social classes have different-looking faces, so I could imagine this to be true of political orientation as well.

      What, about a face, other than predictors of race/ethnicity/sex/SES could plausibly predict political identity?

        • I thought about facial hair or hairstyle.

          I mean, if you look for ZZ Top goatees or mullets, you might have an overall predictor of a biker who likes Trump. On the female side, do they have tall hair?

          But it seems to me that all you’d be doing is something like: are they black or white, then are they male or female, then are they non-Hispanic white, and you’d get 90% of your prediction. Then you might add in do they look botoxed or have a weathered face and get some marginal improvement.

          But is that really predicting political identity? Doesn’t really seem that way to me.

        • With the mullet guy with the ZZ top goatee, you’re going to pin the Trump supporter with like 95% accuracy, which will then mean (if you aggregate results) that outliers basically explain your predictive skill and for the other vast majority you have little skill.

          But maybe they controlled for that influence of outliers?

        • The algorithm is a black box: no one knows what features led to the classification. The authors looked at some features like head orientation (did they look directly at the camera) and expressions of emotion (joy, disgust). They argued that liberals were more likely to look at the camera and express surprise but less likely to express disgust. I don’t know why that is: remember, these are photos where people are presenting an image of themselves, so I wouldn’t be surprised if one could use these features, even though I’m pretty skeptical about explanations for why this is the case: I’d want to see more of the context. It might be plausible: I imagine on dating sites particularly, people might signal their political orientation. (Facial hair, incidentally, wasn’t really helpful as a feature.)

          I just really wonder how human beings would perform on this particular task. As far as I can tell, they didn’t ask anyone to “pick the conservative”: the 55% is drawn from different studies.

        • Joe –

          >(Facial hair, incidentally, wasn’t really helpful as a feature)

          Thanks.

          > They argued that liberals were more likely to look at the camera and express surprise but less likely to express disgust.

          My bullshit meter just pinned at 11.

          I mean first, the whole [difference between groups] vs. [diversity within groups] factor never worked for me with things like “disgust” vs. “openness” as a way to differentiate between libs and cons. But then on top of that you add in the uncertainty coming from trying to discern those attributes from photos?

          Maybe even at 12.

        • I’ll respond here because of the embedding.

          In fairness to the authors, the algorithm had a higher predictive score than the sum total of the features they looked at, but, sure, I agree that I’m generally skeptical, unless there are other contextual reasons governing the differences. As I said, I really want to see the pictures and the larger contexts. These aren’t yearbook pictures: these are pictures that people chose to put on a social media site, sometimes for a very specific reason (as in the case of dating sites). I don’t find it implausible that liberals and conservatives might post different kinds of pictures of themselves on particular social media websites, and one of these differences might be facial expressions. (Actually, I think lots of different social groups might differ in this way: would you be surprised if, say, members of a fraternity had different facial features than members of the school chess team in photos they posted on social media sites?)

  5. Related to the idea that some people want the IRB to filter research that might one day harm society (which I also disagree with), Stanford recently instituted a separate Ethics and Society review board that is a required process to receive AI funding from the university: https://arxiv.org/abs/2106.11521

    It’s similar to some motivation for the Broader Impacts statement requirement at conferences, in that it’s meant to redirect the research while there’s still time, and apparently has been appreciated by researchers who have gone through it.

  6. I think it’s fair for the email’s author to compare the paper to phrenology, but not fair to make that out to be a bad thing. The AI research here *is* phrenology, methodologically speaking, but there’s nothing wrong with phrenological methodology. It’s just measuring physical traits and identifying correlations with psychological traits. The problem with phrenology is its theory of causation: it tries to attribute psychological traits and physical traits to a common genetic cause. If the AI work is only predictive, not causal, then it doesn’t share the causal problems.

    I guess another objection is that, by giving the predictions an algorithmic veneer, we may give people the impression that the AI’s predictions are less biased than people’s, so relying on them can’t be discriminatory or dangerous. That’s always been a negative consequence of formalizing diagnostics, whether actuarial (“the judgment comes from unbiased quantitative methods!”) or clinical (“the judgment comes from an objective expert!”). We gonna stop creating measures that maximize accuracy for real-world applications? Of course not. We gotta invest more in finding ways to increase awareness/visibility of measures’ fallibility and unintended consequences–in AI and elsewhere.

    Finally, yeah, it’s problematic when published AI research could facilitate social evils. I don’t want China using AI to sort Hong Kong citizens into socialists and non-socialists, sending the latter off for “re-education.” I don’t even want the GOP using the same tech for gerrymandering. But we all know somebody’s already working on it, somebody with an incentive to not publish. It’s better for AI researchers to study and (potentially) counter these systems, like the CDC does with dangerous viruses.

    • > by giving the predictions an algorithmic veneer, we may give people the impression that the AI’s predictions are less biased than people’s, so relying on them can’t be discriminatory or dangerous

      +1

    • Indeed, security by obscurity is never a good idea. The CCP or other malefactors *will* eventually get this tech, if they don’t already. The best course of action is for smart people on the side of good to fully investigate the tech so we can detect and respond to it more effectively, e.g. using GANs to detect GAN-generated deep fake images.

  7. Thanks for sharing the script (thanks extended to your colleague, too).

    Are you aware of code that also includes the VGGFace2 part? We have a similar (but a lot smaller) dataset of images of people with a known political orientation. It would be interesting to see how the performance of the model would be on our dataset.
