“In 1997 Latanya Sweeney dramatically demonstrated that supposedly anonymized data was not anonymous,” but “Over 20 journals turned down her paper . . . and nobody wanted to fund privacy research that might reach uncomfortable conclusions.”

Tom Daula writes:

I think this story from John Cook is a different perspective on replication and how scientists respond to errors.

In particular the final paragraph:

There’s a perennial debate over whether it is best to make security and privacy flaws public or to suppress them. The consensus, as much as there is a consensus, is that one should reveal flaws discreetly at first and then err on the side of openness. For example, a security researcher finding a vulnerability in Windows would notify Microsoft first and give the company a chance to fix the problem before announcing the vulnerability publicly. In [Latanya] Sweeney’s case, however, there was no single responsible party who could quietly fix the world’s privacy vulnerabilities. Calling attention to the problem was the only way to make things better.

I think most of your scientific error stories follow this pattern. The error is pointed out privately and then publicized. Of course in most of your posts a private email is met with hostility, the error is publicized, and then the scientist digs in. The good stories are when the authors admit and publicize the error themselves.

Replication, especially in psychology, fits into this because there is no “single responsible party” so “calling attention to the problem [is] the only way to make things better.”

I imagine Latanya Sweeney and you share similar frustrations.

It’s an interesting story. I was thinking about this recently when reading one of Edward Winter’s chess notes collections. These notes are full of stories of sloppy writers copying things without citation, reproducing errors that have appeared elsewhere, introducing new errors (see an example here with follow-up here). Anyway, what’s striking to me is that so many people just don’t seem to care about getting their facts wrong. Or, maybe they do care, but not enough to fix their errors or apologize or even thank the people who point out the mistakes that they’ve made. I mean, why bother writing a chess book if you’re gonna put mistakes in it? It’s not like you can make a lot of money from these things.

Sweeney’s example is of course much more important, but sometimes when thinking about a general topic (in this case, authors getting angry when their errors are revealed to the world) it can be helpful to think about minor cases too.

13 thoughts on ““In 1997 Latanya Sweeney dramatically demonstrated that supposedly anonymized data was not anonymous,” but “Over 20 journals turned down her paper . . . and nobody wanted to fund privacy research that might reach uncomfortable conclusions.””

  1. I think discovering a “single responsible party” was what got this group to finally make this correction –

    https://www.researchgate.net/publication/246604897_Erratum_to_Correspondence_analysis_is_a_useful_tool_to_uncover_the_relationships_among_categorical_variables_J_Clin_Epidemiol_201063638-646

    Before that discovery it was like pulling teeth and before we found out it could be blamed on SAS, the primary investigator actually asked everyone in the group to stop communicating with me altogether!

    Yes, SAS did make the error, but the math was wrong, so it should have been obvious to anyone who knew the math – e.g. other statisticians.

    Just now I have discovered the erratum is behind a paywall :-(

    • the erratum is behind a paywall

      Here you go:

      The authors would like to report a coding error in the SAS procedure PROC CORRESP option for the Greenacre adjustment. Now corrected, this error has an impact on some results reported in our paper. In Table 6, only two dimensions should be retained for adjustment. The Greenacre-adjusted percentages of explained inertia for Dimension 1 should be 86.1% instead of 45.8%; for dimension 2, the value should be 0.56% instead of 17.9%. These revised results also affect the scaling of coordinates on the graph, showing a very strong single dimension in the data.

      This SAS coding error would also affect all other papers using SAS to perform multiple correspondence analyses using the Greenacre adjustment method and authors are encouraged to review their findings.

      A SAS hot fix is now available for download at: http://ftp.sas.com/techsup/download/hotfix/HF2/A52.html

      The authors are grateful to Dr. Greenacre and Dr. O’Rourke for identifying this error.

  2. I don’t think the analogy with software really works. I think the reason why you would inform Microsoft first is because you don’t want to publicize the vulnerability before it’s fixed (not out of a desire for a software company to save face). As a last resort, you may have to go public because that’s the only way the problem can get fixed. With published research, I would say the situation is almost reversed. The only way the problem can get fixed is by making it known that the results shouldn’t be trusted and adopted by others. The original authors don’t have much control over that process (unlike Microsoft, who controls Windows), so there’s no real reason to delay publicizing errors.

      • Maybe I’m naive (actually, I know I’m naive), but I think it’s fair to inform the original authors first.

        Certainly that’s what I tried to do in an applied context (within a company). Let people quietly fix their errors.

        And I got rewarded for it: one of the women I trained came into my office one day, and pointed out an error in an algorithm that I’d made roughly 5 years earlier that had been put into production then. (I could have made an adjustment either at point A or at point B, but my algorithm made the adjustment in BOTH places, therefore overadjusting.)

  3. Here’s a 2017 paper from Johndrow, Lum, and Dunson titled “Theoretical Limits of Record Linkage and Microclustering.” The abstract suggests exercising considerable caution with respect to record linkage: “There has been substantial recent interest in record linkage, attempting to group the records pertaining to the same entities from a large database lacking unique identifiers. This can be viewed as a type of “microclustering,” with few observations per cluster and a very large number of clusters. A variety of methods have been proposed, but there is a lack of literature providing theoretical guarantees on performance. We show that the problem is fundamentally hard from a theoretical perspective, and even in idealized cases, accurate entity resolution is effectively impossible unless the number of entities is small relative to the number of records and/or the separation among records from different entities is extremely large. To characterize the fundamental difficulty, we focus on entity resolution based on multivariate Gaussian mixture models, but our conclusions apply broadly and are supported by simulation studies inspired by human rights applications.”

    https://arxiv.org/abs/1703.04955

    • Apologies for giving the same link twice. Yes, the link you gave was the one I intended for my second link. Thanks for pointing out the fumble.
