An urgent puzzle

This is Jessica. I’ll be posting something more substantial this week as a continuation of my differential-privacy-at-the-Census kick, but first, in quiz fashion, I invite any readers who haven’t been following the news of the new disclosure avoidance system to consider what’s different between the two scenarios below. Both use Census data to infer attributes of individuals whose names and addresses have been obtained from some external source. (Recall that the Census is bound by Title 13, which includes preventing them from “Mak[ing] any publication whereby the data furnished by any particular establishment or individual under this title can be identified”).

I’m posting this because I’m genuinely curious to hear readers’ impressions of how these seem different, technically but also qualitatively or on a personal level, as possible risks of published Census data. 

Scenario 1: An attacker uses a selection of Census 2010 tables to set up a system of equations over sex, age (in years), race, Hispanic/Latino ethnicity, and Census block, then solves it using linear programming. They first reconstruct over 300 million records for block location and voting age (18+), then use data on race (63 categories), Hispanic/Latino origin, sex, and age (in years) from the 2010 tables to reconstruct individual-level records containing those variables. Next they obtain commercial data available in 2010 containing name, address, sex, and birthdate, and match its addresses to Census blocks. They then loop through the reconstructed data twice: on a first pass, they take the first record in the commercial data that matches exactly on block, sex, and age; on a second pass, they try to match the remaining unmatched reconstructed records, taking the first exact match on block and sex with age matching +/- 1 year. It turns out you can correctly link the reconstructed data on block, sex, age, race, and ethnicity to a name and address in the third-party data for 17% of the 2010 Census resident population. If the attacker had higher-quality name and address data (i.e., matching the Census confidential files), they could correctly link the reconstructed data for 58% of the 2010 population.
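For intuition, here is a toy, hypothetical sketch of the reconstruction step: given only a few published statistics for a single tiny block, a brute-force search recovers the candidate microdata consistent with all of them. The block and its “published” values are made up for illustration; the actual attack encodes the same consistency constraints as a very large integer/linear program over all the published tables, rather than enumerating candidates.

```python
# Toy "database reconstruction" sketch (not the Bureau's actual attack):
# given a few published statistics for one tiny hypothetical block,
# enumerate candidate microdata and keep the sets consistent with every
# statistic. The real attack expresses the same constraints as a huge
# integer/linear program over the full set of published tables.
from itertools import product

# Made-up "published" tables for a 3-person block:
TOTAL_POP = 3          # total population of the block
MEDIAN_AGE = 30        # published median age
FEMALE_COUNT = 2       # published count of females
VOTING_AGE_COUNT = 2   # published count of persons 18+

ages = range(100)
sexes = ("F", "M")

consistent = set()
for people in product(product(ages, sexes), repeat=TOTAL_POP):
    recs = tuple(sorted(people))           # records are unordered
    a = sorted(age for age, _ in recs)
    if a[1] != MEDIAN_AGE:                 # median of three ages
        continue
    if sum(s == "F" for _, s in recs) != FEMALE_COUNT:
        continue
    if sum(age >= 18 for age, _ in recs) != VOTING_AGE_COUNT:
        continue
    consistent.add(recs)

# With only four statistics, many candidate blocks survive; each
# additional published table (race, ethnicity, single-year-of-age)
# eliminates candidates until the microdata are pinned down exactly.
print(len(consistent), "candidate reconstructions remain")
```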

Scenario 2: An attacker uses Census 2010 tables to learn a prior on the probability of race given location, then combines it with the probability of name (first, middle, last) given race and ethnicity, estimated from Census data on the racial distribution of 151,671 surnames occurring at least 100 times in the 2000 Census, a list of 12,500 common Latino surnames, and voter files from six southern states (AL, FL, GA, LA, NC, SC) where race and ethnicity are self-reported. Given a name and address as input, the model assigns a probability to each race. If you assign each individual their highest-probability race according to the model, it turns out you can correctly predict race for 90% of 5.8 million voters in North Carolina in 2021.
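For intuition, here is a minimal sketch of the Bayes’-rule combination this kind of model rests on. All the numbers below are made-up placeholders, not actual Census surname-list or block figures, and the surname is just an example:

```python
# Minimal sketch of the Bayesian name/geography combination behind
# Scenario 2 (all numbers are made-up placeholders):
#   P(race | name, block) ∝ P(name | race) * P(race | block)

RACES = ("white", "black", "hispanic", "asian", "other")

# Hypothetical prior from Census race-by-block tables
p_race_given_block = {
    "white": 0.60, "black": 0.10, "hispanic": 0.20,
    "asian": 0.05, "other": 0.05,
}

# Hypothetical likelihood of the surname "Martinez" under each race,
# as would be estimated from the surname list and voter files
p_name_given_race = {
    "white": 0.0002, "black": 0.0001, "hispanic": 0.0150,
    "asian": 0.0001, "other": 0.0005,
}

# Bayes' rule: multiply prior by likelihood and renormalize
unnorm = {r: p_name_given_race[r] * p_race_given_block[r] for r in RACES}
z = sum(unnorm.values())
posterior = {r: v / z for r, v in unnorm.items()}

# "Assign each individual their highest-probability race"
prediction = max(posterior, key=posterior.get)
print(prediction, {r: round(p, 3) for r, p in posterior.items()})
```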

What do you see as the implications of each scenario? 

P.S. The post title comes from a paper where the authors call the question of how to balance accuracy and privacy ‘an urgent puzzle’ requiring perspectives from many disciplines. Given how reliant social science is on Census data, I agree that we should be concerned!

12 thoughts on “An urgent puzzle”

  1. In the first scenario, does the attacker know if they have correctly linked the data for any given individual? That’s something that the Census can know, given their access to the data (that the attacker linked correctly for a specific record), but if the attacker just knows the probability their guess was correct, then the scenarios start to feel much more similar from a privacy perspective…unless I’m missing something?

    • I would assume that in both scenarios the attacker can’t know when they’ve correctly inferred or linked attributes, though they may have estimated how well the approach does on some smaller labeled dataset, or may get probabilities from the method as in #2.

  2. Scenario 2 has some obvious dangers if used for illegal purposes – such as screening job applicants or potential tenants on the basis of race. Of course, plenty of this discrimination occurs in person, and I know that the use of names to screen for race can have fairly high success rates without needing to use any Census data. So, while Scenario 2 troubles me a bit, it isn’t much of a step beyond many other ways that race could be inferred and used for nefarious purposes.

    Scenario 1 strikes me as more dangerous since it potentially identifies the individual, thereby allowing all the associated data to be tied to that individual. If the match is only probabilistic, that does not change much: knowing that your match had only an 80% chance of being correct just introduces some uncertainty, and it certainly wouldn’t prevent all sorts of terrible uses the data could be put to.

  3. The first scenario highlights how ridiculous this alleged threat is. The real privacy violation is buried in the middle: you can just buy a dataset with my name, age, address, and exact birthdate!! Then if you solve the world’s hardest linear algebra problem you can make an educated guess about my race, which is already obvious from my surname.

    The true motivation here has got to be one of the following (or some of both):
    1) the census has been overrun by weird nerds who are obsessed with DP in and of itself, like the string theorists in physics
    2) the actual social scientists at census / with connections are using this to wall off the real data so they can hoard all the top publications for themselves

    • If you just pay a little more, Google will tell someone how often you cheat on your spouse and what your diet primarily consists of, using info on how often you frequent different restaurants and where you drive throughout the week. Also whether you have diverticulitis, based on how often you surf your favorite blogs for extended periods while on the toilet.

    • Ding ding ding! This is what the entire social science research community keeps shouting at the Bureau, to no avail. You can already buy this data, so why would you bother trying to get it from the Census?

      • A few thoughts … the Census collects data under a promise that it’s confidential and is bound by Title 13, which means they are not supposed to publish data that enables identifying individuals. To me it seems the Census is interpreting the Title 13 requirement similarly to how privacy experts might, through the lens of database reconstruction attacks. The privacy people produced new knowledge a while back about how to “solve for the data” given a bunch of published stats, and produced a framework that addresses this problem efficiently and transparently compared to what came before (differential privacy; see the sketch below for the core idea).

        My guess is that once the Census learned this stuff, it was hard to see the problem the way they used to. I totally get the question of why one would worry about Scenario #1 over other possibly easier ways to obtain personal information. Though perhaps from the standpoint of fulfilling the Title 13 obligation, the question of why someone would try to get this data from the Census rather than some other source is not relevant?

        There are, however, interesting questions about why individual-level privacy is so privileged in our view of what protection is required, when some of the legitimately harmful disclosures of Census data in the past have been about groups of people having their race/ethnicity and location disclosed.
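        A minimal sketch of the core differential privacy mechanism referenced above, assuming a simple Laplace mechanism on a single count (the count and epsilon values are purely illustrative; the Bureau’s 2020 production system uses a more elaborate discrete mechanism inside its TopDown algorithm, not this exact one):

```python
# Minimal differential privacy sketch: publish a count plus Laplace
# noise with scale sensitivity/epsilon, so that no single person's
# record moves any published statistic by much. Illustrative only.
import random

def noisy_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release true_count + Laplace(0, sensitivity/epsilon) noise."""
    scale = sensitivity / epsilon
    # A Laplace draw is the difference of two i.i.d. exponentials
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

# Smaller epsilon = more noise = stronger privacy, lower accuracy
for eps in (0.1, 1.0, 10.0):
    print(eps, round(noisy_count(42, eps), 2))
```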

  4. So is the point in both scenarios that you are obtaining individual-level race data for a certain proportion of the population, on a probabilistic basis? With Scenario 1 you effectively have a noisy lookup table, while in Scenario 2 you have a formal statistical model?

  5. Scenario 2 has as input “voter files from six southern states (AL, FL, GA, LA, NC, SC) where race and ethnicity are self-reported,” and that includes NC. Being able to “correctly predict race for 90% of 5.8 million voters in North Carolina” is easy if that data is already in the files!

    How much improvement over this data set is derived from using the census files?

    Scenario 1 is built on census data alone, and individuals don’t have a privacy option as they do with the voter records, so that worries me more.

  6. I’m not sure how the Census data actually helps anyone in Scenario 2. Like, yes, in theory you can build a model that will do really well at guessing race from name and location…but people are very good at doing that by hand. I’m not sure that, like, Li Chen from Flushing, NY, and Carter Cabot, Jr., from Concord, MA, and Sonia Martinez from San Antonio, TX, are all really that annoyed that anyone with access to their names and addresses can now guess what race they are.

    If you look at the literature on racial discrimination during job interviews, everyone recognizes this and hides the name and address of the applicant to begin with!

    I’m with anonymous coward above — the whole database reconstruction situation seems to me to be thin gruel and certainly not worth destroying the future of social science research over. Differential privacy is neat math but, at least as it’s being deployed for the 2020 census, not appropriate.
