A psychologist who would like to remain anonymous writes:
I am writing to share serious concerns about a 2018 article in the Journal of Personality and Social Psychology. This paper claims that it is possible to detect people’s sexual orientation with very high accuracy from their photographs.
There are a few red flags in the paper. The authors only analyzed about 50% of the images that they had scraped. The procedure used to select this 50% of the images seems to have been arranged in a manner that maximizes the chances of favorable results. The attached document outlines some problems that my collaborator had uncovered while poring over multiple versions of the data and code uploaded on OSF. Overall, it seems that the data has been fudged.
Unfortunately, I am not in a position to openly pursue these concerns. . . .
My correspondent shared this document that has details on replication problems in the data. They report:
The authors obtained images of gay and heterosexual women and men of varying ages from a dating website. . . . Within each gender, the authors implemented a complex trimming process to ensure that they have an approximately equal number of images of gay and heterosexual individuals across varying numbers of images and with the same median age and interquartile age range . . . The procedure described in the manuscript indicates room for researcher degree of freedom in this process: “Finally, we randomly removed some users to balance the age distribution of the sexual orientation subsamples and their size—separately for each gender” (Wang & Kosinski, 2018, p. 248). . . .
We tested the internal replicability of the authors’ results by reshuffling the master dataset 5 million times and applying their trimming procedure after each reshuffle (see the code highlighted in red added to the original script [in the above-linked document]). This analysis was conducted in March 2022 based on the data uploaded on OSF at the time. The summary statistics of the frequency counts of unique gay men and lesbian individuals across the 5 million trimmed datasets are presented below:
Men:
Range: 3994 to 4110
. . . Women:
Range: 3760 to 3846
. . . Table 1 of the manuscript indicates that the trimming procedure yielded 3947 unique gay men and 3441 unique lesbians. . . . Thus, the number of unique gay men and lesbian individuals reported in Table 1 is virtually impossible to have been obtained by applying the trimming procedure to the master data if the master data were randomly ordered. The only viable conclusion is that the master data was manually ordered to yield the trimming data reported in the manuscript, not randomly ordered.
I don’t know if this is “the only viable conclusion”! There’s always something you haven’t thought of yet (see this story, for example). But, yeah, it seems that whatever happened to put together the dataset wasn’t quite as stated.
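To make the logic of that reshuffling check concrete, here is a minimal Python sketch on synthetic data. It is not the authors' R code or my correspondent's script: the trimming rule (`trimmed_unique_count`, which just keeps the first `keep_per_group` rows of each group), the synthetic `master` data, and the `reported` count are all hypothetical stand-ins. The only point is that when a trimming step depends on row order, you can reshuffle many times, record the range of unique-user counts, and ask whether a published count falls inside that range.

```python
import random


def trimmed_unique_count(rows, keep_per_group):
    """Toy order-dependent trim (hypothetical, not the paper's procedure):
    keep the first `keep_per_group` rows of each group in the current row
    order, then count the unique users among the kept rows."""
    rows_kept = {}   # group -> number of rows kept so far
    users_kept = {}  # group -> set of user ids among kept rows
    for user_id, group in rows:
        if rows_kept.get(group, 0) < keep_per_group:
            rows_kept[group] = rows_kept.get(group, 0) + 1
            users_kept.setdefault(group, set()).add(user_id)
    return {g: len(users) for g, users in users_kept.items()}


def reshuffle_range(rows, keep_per_group, group, n_shuffles, seed=0):
    """Reshuffle the rows `n_shuffles` times, trim after each reshuffle,
    and return the (min, max) unique-user count observed for `group`."""
    rng = random.Random(seed)
    rows = list(rows)
    counts = []
    for _ in range(n_shuffles):
        rng.shuffle(rows)
        counts.append(trimmed_unique_count(rows, keep_per_group)[group])
    return min(counts), max(counts)


# Synthetic master data: 300 users in one group, each with 1 to 3 images.
data_rng = random.Random(1)
master = [(uid, "a") for uid in range(300) for _ in range(data_rng.randint(1, 3))]

low, high = reshuffle_range(master, keep_per_group=200, group="a", n_shuffles=1000)
reported = 100  # hypothetical "published" count, chosen for illustration
print("range under random ordering:", low, "to", high)
print("reported count inside range:", low <= reported <= high)
```

In this toy setup the made-up reported count is attainable in principle (some row ordering produces it) but sits outside the range seen under random ordering. That is the pattern the document describes, and it tells you that the stated procedure and the stated numbers don't match, not why they don't.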
The authors continue:
We next repeated the above analysis on the data uploaded on OSF in February 2023 (the current v29 of the data). This time round, reshuffling the data yields the same number of unique individuals in the trimmed data. Thus, between February 2022 and February 2023, the authors edited their data and/or code to eliminate the randomness in their February 2022 data and code. Note that the paper was published in 2018, so the data and code uploaded at all times after publication should yield identical stimuli sets and results. Despite the elimination of randomness, the descriptive statistics reported in the manuscript are still not replicable. The trimming procedure applied to the reshuffled master data always yields . . . very different numbers than the corresponding Table 1 reported in the manuscript.
Again, the details are in the above-linked pdf.
And more:
Next, we tested whether the classification AUC of the authors’ LASSO regression replicates once the master data are reshuffled. This analysis is based on the current v29 of the data and code. We reshuffled the master data once (see code in red highlighted below) and applied their trimming procedure. We found that all AUC scores were less than 0.57 (see the figure below), much lower than the AUCs of 0.81 and 0.71 mentioned in the paper’s abstract. The impressive results presented by the authors are thus not replicable even within their own dataset.
That last bit is a particular concern because it’s not just a matter of different counts in the dataset but appears to be a large systematic error.
The authors of this document seem to have been very careful, but I have not looked at the raw data or tried to run their code myself.
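As background for reading those AUC numbers: AUC is the probability that a randomly chosen positive case gets a higher classifier score than a randomly chosen negative case, so 0.5 corresponds to guessing and 1.0 to perfect separation. Here is a small Python sketch of that definition. The function name `auc` and the example scores are mine, made up for illustration; this is not the LASSO pipeline from the paper or from the reanalysis.

```python
def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs in which the positive case
    has the higher score, counting ties as half a win."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Made-up scores: perfectly separated classes, then identical score sets.
print(auc([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))  # 1.0
print(auc([0.2, 0.6], [0.2, 0.6]))            # 0.5
```

On that scale, values just under 0.57 are close to chance, while 0.81 and 0.71 are not.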
From the copied R code:
#!/usr/bin/Rscript
#### Data and code accompanying
#### Deep neural networks are more accurate than humans at detecting sexual orientation from
#### facial images
#### https://doi.org/10.1037/pspa0000098
##
## This code and data are shared only for the purposes of replicating the results of Studies 1 and 2.
## No new research should be carried out using this data.
##
## The following code allows estimating the core results presented in the paper.
##
## To preserve participants' anonymity, some of the variables used in the study were removed,
## including the location of the facial landmarks. Additionally, a small amount of random noise
## has been added to all continuous variables, including age and face vectors. The face vectors
## were also reordered. Consequently, the results and the distribution of variables
## may be minimally different from those presented in the paper.
##
## Please let me know if you run into any issues: [email protected]
The above results seem more than “minimally different.” Again, I don’t know the full story here. It’s good that the authors shared the anonymized data. I’m not such a fan of the statement, “No new research should be carried out using this data”—once the data are collected, you’d want others to follow up on your research, no?—but, hey, nobody’s perfect.
Larger concerns about the gayface research
My connection to that gayface study is a sociology paper with Greggor Mattson and Dan Simpson published in 2018, Gaydar and the fallacy of decontextualized measurement, which begins:
Recent media coverage of studies about “gaydar,” the supposed ability to detect another’s sexual orientation through visual cues, reveal problems in which the ideals of scientific precision strip the context from intrinsically social phenomena. This fallacy of objective measurement, as we term it, leads to nonsensical claims based on the predictive accuracy of statistical significance. We interrogate these gaydar studies’ assumption that there is some sort of pure biological measure of perception of sexual orientation. Instead, we argue that the concept of gaydar inherently exists within a social context and that this should be recognized when studying it. We use this case as an example of a more general concern about illusory precision in the measurement of social phenomena and suggest statistical strategies to address common problems.
and concludes:
The point of this article is not to pick on a small area of psychology research that happened to catch the fancy of the press or even to criticize larger trends of sensationalism in science and the news media. Rather, we seek to draw attention to the general problem (one that is all too frequent in this era of genetics, machine learning algorithms, and MRI studies) in which the ideals of scientific precision end up stripping all context from a social phenomenon, leading to nonsensical claims based on predictive accuracy or statistical significance. . . . A social interaction cannot always be measured in a test tube or even in a psych lab. In the case of the research under discussion here, the steps taken ostensibly to ensure objectivity obliterated much of the objects of research.
I like that paper of ours—I recommend you read the whole thing! In that work we did not look into the data underlying the two studies we had cited, and I have not carefully studied or tried to replicate the reanalysis in the pdf document that my correspondent sent to me.
I’m not sure whether I am more bothered by the likely inaccuracy of such categorizations based on visual images or by the idea that it is worth exploring such categorizations at all. There is certainly research interest in the accuracy of machine learning classification models based on images, which have plenty of potential uses and misuses. To some extent, testing these models on gaydar images might tell us something about the accuracy of using such models for things such as identifying potential military targets or finding potential survivors of a natural disaster. So, it can be important research. But it isn’t clear to me how much the results would generalize: if a model were extremely accurate for identifying images of gay people, does that mean it could be extremely accurate at distinguishing between valid military targets and civilians?
That seems like a stretch. So, without a compelling case that this research would generalize (never mind whether it is replicable), I’m left with this particular application to distinguishing gay people from straight people. I can see where such models could be used for evil purposes, but it is harder for me to see what good uses would be (especially because the binary classification seems inappropriate to me – a continuum seems more likely). I’m left wondering why this is worth studying. I liked Andrew’s article linked above – divorcing such studies from their social context seems to be throwing away the most interesting and important parts of the issue.
Dale:
Yes, we see this a lot in the social and health sciences, that the efforts taken to study something in a way that seems to be “scientific” take researchers in a direction away from the underlying phenomenon of interest. This is related to the tyranny of measurement or quantification fallacy (the habit of taking more seriously those observations that have been presented as numbers), but, even beyond that, I think there’s a problem with this push-a-button, take-a-pill model of science, for example the idea that gaydar is a thing that can be studied in isolation from its social context.
Wouldn’t pushing a button and taking a pill be an intervention, requiring some model of causality? This research appears to just be descriptive, without that angle. In that sense it seems closer to the “exploratory” model of research you’ve discussed.
I raised the concern about the dangers of this research, and the author of the original paper replied that he and his collaborator were merely sounding the alarm to protect LGBTQ+ people:
https://greggormattson.com/2017/09/09/artificial-intelligence-discovers-gayface/
I found this rich, given that this was something queer artists and orgs, and the Electronic Frontier Foundation, had already been studying. One of my objections, then, is to researchers studying a minoritized community that they are not part of, and then being patronizing about it.