More discussion of differential privacy at the Census

This is Jessica. As I’ve blogged about before, the U.S. Census Bureau’s adoption of differential privacy (DP) for the 2020 Census has sparked debates among demographers, computer scientists, redistricters, and other users of census data. In a nutshell, mechanisms that satisfy DP guarantee that the output of a function computed on the data is stable under small changes to the input, so that no single respondent’s record has much influence on what is released. This is typically achieved by adding a calibrated amount of noise to computed statistics, controlled by one or more privacy budget parameters. DP replaces the more ad hoc disclosure avoidance approaches the Census Bureau previously used to preserve anonymity when releasing statistics, such as swapping selected household records so that households that stand out in their block (e.g., in terms of racial or ethnic makeup) are less likely to be identified through inference on the released statistics. 
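To make the noise-addition idea concrete, here is a minimal sketch of the classic Laplace mechanism, a standard DP building block (not the Bureau's actual TopDown algorithm): a counting query changes by at most 1 when one person's record changes, so adding Laplace noise with scale 1/epsilon satisfies epsilon-DP, with a smaller epsilon (a tighter privacy budget) meaning more noise.

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(true_count, epsilon, rng):
    """Release a count under epsilon-DP: a count has sensitivity 1, so
    Laplace noise with scale 1/epsilon suffices; smaller epsilon = noisier."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)
print(noisy_count(1000, epsilon=1.0, rng=rng))  # a value near 1000
```

At epsilon = 1 the released count is almost always within a few people of the truth; at epsilon = 0.01 the noise scale is 100, which is the accuracy side of the privacy-accuracy trade-off discussed below.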

A couple weeks ago the Center for Discrete Mathematics and Theoretical Computer Science held a workshop on the Analysis of Census Noisy Measurement Files and Differential Privacy, organized by Cynthia Dwork, Ruobin Gong, Weijie Su, and Linjun Zhang, where computer scientists, demographers and others paying attention to the new disclosure avoidance system (DAS) discussed some of the implications. I wasn’t able to make it, but Priyanka Nanayakkara filled me in and it sounded like some of the dimensions of the new DAS that we’ve considered on the blog came up, so I asked her to summarize.  

I (Priyanka) will offer a brief summary of some notable themes from the first day of the workshop, with the caveat that this is certainly not a comprehensive account of everything discussed.

Perceptions of census data, and demographic use cases involving them, often omit or downplay uncertainty measures—the debates about DP are surfacing this and calling it into question. The topic of uncertainty quantification came up early. Data users such as demographers have been concerned that DP noise makes the data too inaccurate for their work, particularly because census data are critical inputs to methods for making population estimates. Computer scientists and statisticians have suggested uncertainty measures as a solution: quantifying uncertainty would help contextualize error owing to the DP-based disclosure avoidance system (DAS) relative to other forms of error already affecting the data (e.g., non-response error). Demographers argued, however, that population estimates are not a form of inferential statistics—a comment met with confusion from some members of the audience. As far as I could tell, they were drawing a distinction between statistical inference and evaluating estimates against the decennial census, which is considered ground truth. The issue seems to be that demographic estimates are treated primarily as point values, and that available methods for producing uncertainty measures are either not applicable, considered unwieldy, or simply not used. Jessica’s previous posts have pointed out how much of the pushback to DP seems related to “conventional certitude,” in which census population figures are treated as ground-truth point estimates with negligible error; this theme came up at the workshop as well, with computer scientists and statisticians suggesting that DP offers a chance to challenge this norm and normalize uncertainty quantification. In other words, census data function as a “statistical imaginary,” treated as if they were perfect. 
(For an in-depth characterization of this phenomenon, see boyd and Sarathy’s paper—it does an excellent job of pinpointing the epistemological rupture brought on by the Census Bureau’s adoption of DP.) 

Developing methods for effectively using noisy data is difficult when use cases are not predetermined. In keeping with the workshop’s aims, computer scientists and statisticians seemed eager for concrete research tasks that would make using noisy census data more viable, yet the resulting discussion ran into challenges similar to those the Census Bureau faces with stakeholders more broadly when trying to elicit critical use cases for census data. One demographer emphasized the variability in their work, describing the wide range of questions they receive and must attempt to answer. As an example, they cited a question they’d previously received: How many 18-year-olds are in the Upper Peninsula of Michigan? Although they immediately undercut the question by noting that it came from a militant group engaged in strategic planning for defending against a possible Canadian invasion of the Upper Peninsula, the point seemed to be that demographers are sometimes put in the position of answering bizarre queries, which makes it difficult to precisely predict the use cases for which they’ll need methods for adapting noisy counts.

The relationship between privacy and legitimacy is complicated. Separate from the Census Bureau’s Title 13 mandate to maintain confidentiality, what is the role of privacy in the Census? At least one privacy expert noted that a severe privacy event (imagine most of the population being re-identified from published census statistics and linked individual-level records)—which they termed a “privacy Chernobyl”—could reduce trust in the Census Bureau and lead to lower rates of participation in future censuses. The point of DP is to help prevent outcomes like this. But what if this isn’t the type of privacy threat people are actually concerned about? Participants debated whether people actually worry about the confidentiality of their responses, citing various survey reports on how much of the population (and which parts of the population) cites confidentiality of census responses as a concern. Some pointed out that historically, people have been more concerned about the Census Bureau misusing or inappropriately sharing confidential records than about an outside party attacking published census statistics. Clearly, there are merits to both privacy concerns, though they approach the matter from different angles—namely, who is the “attacker” of concern? And in light of an answer to that question, where does DP fit in?

Courts may weigh the Census Bureau’s mandates differently and interpret DP noise as conceptually distinct from other forms of error. A primary use of census data is for redistricting and upholding the Voting Rights Act. While acknowledging the existing published analyses on the extent to which DP will impact redistricting, there was consensus that it would be good to further investigate this, given the complexity and immensity of redistricting. In these discussions, workshop participants also discussed how courts might interpret DP. One legal expert noted that the enumeration of the population is constitutionally mandated, whereas the Census Bureau’s mandate to keep responses confidential does not appear in the Constitution. From my understanding, it seems that courts may weigh these two mandates differently considering the importance courts place on the Constitution generally. This point complicates the trade-off between privacy and accuracy as it implies that perhaps from a legal standpoint, accuracy is more important.

Second, workshop discussions noted that while we can all acknowledge several sources of error in census data, courts may consider DP noise to be of a different “flavor,” since there is something conceptually different about intentionally injected noise compared to sources of error that are unintentional or not widely known (e.g., error introduced by previous disclosure avoidance methods like swapping). The new DAS may also be viewed differently because it adds noise to census block counts, which previous systems held invariant. Relatedly, Jessica and I are currently working with Abie Flaxman and a law colleague to contextualize, for a law audience, what exactly is different about DP (spoiler: a lot less than some of the pushback has suggested). 

The trade-off is not just between privacy and accuracy—there are other dimensions, too. When it comes to DP, at least for the Census, the trade-off is not solely between privacy and accuracy. One presenter suggested that there is a third dimension, related to legitimacy and trust in census data, that the Census Bureau is taking into account when weighing the trade-off. Noisy counts showing nonsensical values (e.g., negative population counts) harm this third dimension, perhaps explaining why the Census Bureau did not settle on an “optimal” balance between privacy and accuracy. One participant who works closely with residents noted how disastrous it would be to show people illogical census counts, since people would be alarmed at what they would perceive as low-quality data, given the amount of tax money that goes into producing high-quality censuses. Computer scientists and statisticians have suggested, and continued to suggest at the workshop, that accounting for DP noise in analyses would be much easier if the Census Bureau released statistics without post-processing (the process by which the Bureau converted DP-noised data into “logical” [e.g., non-negative] values for publication). My takeaway here is that reasoning about trade-offs around disclosure avoidance and the Census requires accounting for human factors in how census data are perceived, and that doing so will be crucial in future uses of DP, especially for the Census.


As I’ve said before, I (Jessica) like the idea of noisy counts becoming normalized. It would be nice if those pushing the argument that releasing noisy counts would lead to chaos could provide more concrete examples of what that might look like; will courts halt completely when it comes to Voting Rights Act violations? It’s not clear to me how much we would have to explicitly change legal practices that involve rules like one-person-one-vote, which are already recognized as somewhat absurd. 

It would also be nice to see more direct attempts to get at the proposed relationship between perceptions of census data as private and the willingness of populations to be included in data collection, related to the kind of indirect costs of not using DP that Priyanka alludes to. Even without a “privacy Chernobyl,” if the Census Bureau switched to DP out of fear that its liability under the old methods was too high and would threaten its credibility as an organization or lead to higher non-response error, could we try to quantify that trade-off? This would be exploratory, of course, but if the bureau thinks certain hard-to-reach populations will become harder to reach unless it makes a dramatic change, then it makes sense to ask how much harder to reach they would need to become (relative to current estimated non-response) for this trend to threaten data quality more than DP does. This could involve identifying how much evidence there is in past data for a link between greater privacy awareness in a population and higher non-response error. It’s hard to know whether this would be hopelessly confounded without seeing any attempts.

Much of the discussion so far has been about PL 94-171 data, which are used for redistricting and voting legislation. However, attention is now turning to the Demographic and Housing Characteristics (DHC) files, which contain many more variables. Next month the bureau is holding a meeting to collect feedback from those who have evaluated the demonstration products it released, which applied the new DAS to 2010 DHC data (so we should expect the same messy comparison of noised data to noised data). It’s unclear to me whether there are any groups of computer scientists or others trying to understand the implications of the DHC demo data for privacy. My sense from talking to others who have followed the discussions very closely is that what happens at that meeting could be pivotal for the strategy going forward.  

Thanks to Abie Flaxman, who also caught parts of the workshop, for reading a draft of this post.

11 thoughts on “More discussion of differential privacy at the Census”

  1. Jessica/Priyanka:

    Interesting discussion. You’re bringing in a third concern, so it’s now privacy, accuracy, and trust. Once you have opened Pandora’s box in this way, we can start considering other issues. One concern that was raised about census adjustment, years ago, was moral hazard: once you get away from simply using the direct, raw numbers, you’re allowing the possibility of forking paths. Longer term, the moral-hazard argument was that the existence of trusted adjustment methods would motivate future censuses to try less hard to get good data. I didn’t find this a very compelling argument, as ultimately there’s no pure data—even the so-called raw numbers exist only after various steps of checking and local adjustment—but it is an argument that was made.

    My other point is statistical and relates to the issue raised by Jessica regarding communication of uncertainty, and it reminds me of something that goes on with weighting in sample surveys. Practitioners and statisticians often have the incorrect intuition that for every survey there is a set of sampling weights so that you can just take weighted averages of the data and estimate whatever you want. A moment’s reflection regarding small-area estimation reveals that this cannot possibly work, but people continue to act as if it does. Similar issues arise with imputed census data.

    • Yeah, I guess there are many stories one could tell oneself about what sort of dynamics *might* arise. The survey weights thing is interesting – I could see how knowing that post-stratification exists could make one assume all weighting problems have been solved. Maybe the ease with which people can grasp the intuition behind a method is not always a good thing.

  2. The DP extremism is bizarre. I don’t understand the attraction. To say “noisy counts” should be “normalized” is a ridiculous euphemism for saying “I would like the census to be more inaccurate”.

    It is NOT a bizarre question to ask how many 18 year olds live in Michigan’s UP. I don’t care who asked the question. Counting up the number of people is literally the entire premise of the census.

    The DP fanatics are the ones attempting to shoehorn epistemological debates about the ‘truth’ quotient of census counts in order to muddy the water. I am sorry, but “The census numbers are in reality already noisy point estimates already, so it is fine to make them noisier” is not an argument I am buying.

    Yes, DP is an interesting methodological toy. Please play with it somewhere else.

    • It’s fair to say that questions like how many 18-year-olds live somewhere shouldn’t be called bizarre. To call me a DP fanatic, though, is inaccurate. See my previous posts for my critiques of both sides of this debate.

      > I am sorry, but “The census numbers are in reality already noisy point estimates already, so it is fine to make them noisier” is not an argument I am buying.

      Do you have some proof that over all the different applications of census data, the calibrated error introduced as a result of using DP is greater than the error contributed by the previous noising techniques, like swapping and table suppression? Or do you just have a few examples in mind?

      Keep in mind that we cannot currently evaluate the error contributed by DP without comparing it to census data to which swapping has already been applied. However, after a certain amount of time goes by the Census Bureau can release the unadjusted counts, and applying the new DAS (the TopDown algorithm) to 1940 census data suggests that even with a much more stringent privacy budget than the one the Census used, the error is on par with releasing a 90% sample of the data: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7216402/

      So with the high epsilon the census used, we are likely talking about negligible error, which *might* have a smaller impact on many analysis applications than the previous techniques did. I’m not saying this is definitively the case. I just don’t see how, given the huge diversity of uses of census data, we can all immediately assume it must be worse for accuracy overall, especially if there’s an option of not post-processing (which adds further bias). One thing that would be useful to explore, for example, before we jump to conclusions is how the error associated with swapping to different degrees (since we can’t know how much the census was actually doing) affects various applications, including redistricting.
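      In code, a toy version of that exploration might look like the following (entirely my own construction: real swapping targeted at-risk households and the real DAS is the TopDown algorithm, so treat this only as a sketch of the comparison's shape). It measures per-block error in a small-group count under random record swapping at rate p versus Laplace noise at privacy budget epsilon.

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def simulate(p_swap, epsilon, n_blocks=500, households=40,
             p_minority=0.1, seed=0):
    """Mean absolute error of per-block minority counts under (a) swapping
    each household with prob p_swap and (b) epsilon-DP Laplace noise."""
    rng = random.Random(seed)
    swap_err = dp_err = 0.0
    for _ in range(n_blocks):
        block = [rng.random() < p_minority for _ in range(households)]
        true = sum(block)
        # Swapping: each household exchanged (with prob p_swap) for one
        # drawn from the overall population distribution.
        swapped = sum(
            (rng.random() < p_minority) if rng.random() < p_swap else h
            for h in block
        )
        noised = true + laplace_noise(1.0 / epsilon, rng)
        swap_err += abs(swapped - true)
        dp_err += abs(noised - true)
    return swap_err / n_blocks, dp_err / n_blocks

print(simulate(p_swap=0.05, epsilon=1.0))
print(simulate(p_swap=0.20, epsilon=1.0))
```

Even in this crude setup, the interesting quantity is the crossover: the swapping rate at which swapping distorts block-level counts more than the DP noise does at a given epsilon, which is the kind of comparison the redistricting analyses would need, per application.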

      >To say “noisy counts” should be “normalized”, is a ridiculous euphemism for saying “I would like the census to be more inaccurate”.

      Again it seems maybe you’re using “inaccurate” to refer to some specific examples you have in mind. We know that post-processing the data to remove negatives and make things sum up nicely over geographies contributes additional bias over the calibrated noise that the mechanism adds. We also know that once you start doing post-processing it’s much harder or even impossible to account for the amount of error added in your uncertainty estimates (which is possible under DP without post-processing, but not under previous noising techniques). So why are we post-processing?
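      To illustrate that last point with a toy simulation (my own sketch, not the Bureau's actual post-processing, which solves a constrained optimization over the whole geographic hierarchy): the raw Laplace-noised count is unbiased, but even the simplest post-processing step, clamping negatives to zero, pushes small counts upward on average.

```python
import math
import random
import statistics

def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

rng = random.Random(0)
true_count = 1  # a tiny block-level count, where clamping bites hardest
scale = 2.0     # Laplace scale, e.g. sensitivity 1 / (epsilon = 0.5)
raw = [true_count + laplace_noise(scale, rng) for _ in range(20000)]
clamped = [max(0.0, x) for x in raw]  # naive non-negativity post-processing

print(statistics.mean(raw))      # hovers near the true count of 1
print(statistics.mean(clamped))  # noticeably above 1: clamping adds bias
```

And unlike the calibrated Laplace noise, whose distribution is public and can be propagated into uncertainty estimates, the size of this clamping bias depends on the unknown true count, which is exactly why accounting for post-processed error is so much harder.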

      • I think there are two main things that demographers are upset about (as well as many minor things):

        1) The way the need for DP was sold to the public and the research community was at best misleading and at worst fraudulent. The supposed demonstration of the risks of reconstructing the database actually showed that it’s very hard to reconstruct anything meaningful at all, and impossible to know which records you got right unless you have access to confidential data (and if anyone got access to this confidential data, they wouldn’t need to use the public data at all).
        2) Giving up on holding block counts invariant. There is no other data at this resolution. The Decennial Census is it. Lots of important things happen at scales that are not captured in block groups or tracts. The Bureau’s attitude about this issue in particular has essentially been a giant middle finger to all of us who need the block data for our work. Last I heard they now don’t want to give the public block or block group data at all for most of the tables, which would leave us at the tract level, utterly useless for a lot of things people are used to being able to do with these data.

        • On 1, there is definitely an element of trust in the Census Bureau required, given that they have the ground truth. I think for some of us in CS it’s hard to see how it was misinterpreted as so misleading, because what they did followed results from privacy research pretty directly: that’s how you do a reconstruction attack, and that’s how you talk about the risk of de-identification. So I tend to think the Census communication strategy was not deliberately misleading at all; they were trying to mimic the computer scientists, and the unfortunate fact is that the type of privacy protection you get from DP is narrow relative to the broader landscape of ways someone might try to infer sensitive information about individuals. The Bureau just didn’t anticipate this mismatch very well. At the same time, they are trying to get people to take what they perceive as a big privacy liability seriously, but demographers are obviously much more concerned with accuracy given their profession, so the Bureau is in a tough position. I’m more sympathetic to the Bureau’s side than I am to the many privacy researchers who continue to imply that DP will solve all the privacy issues despite what we’ve seen with the “gentle roll-out” at the Census, or to those who seem okay with intentionally spreading false information about how differential privacy compares to past techniques (beyond the big change you mention of not holding block counts invariant).

        • I appreciate Jessica replying here, but the comments are a great example of DP defenders goalpost moving and obfuscation. Point out the very plain fact that DP introduces inaccuracy? Well, you have to provide proof “over all the different applications of census data” that DP is more inaccurate. Why? Because DP is cool and we want to use it.

          The legal issue is a red herring–the standard the court uses for one particular application of census data, and whether they even follow that standard, has no bearing on whether DP is doing anything meaningful enough that is worth wiping out census block data.

          DP does not appear to solve any pressing problem, and it will remove an entire stratum of information. Especially if you are trying to figure out basic stuff about rural areas, DP just made that much more difficult. Some people care about the UP in Michigan; it’s not that hard to understand.

  3. “DP does not appear to solve any pressing problem”

    This is precisely the point. When we introduce new policies or practices that have costs, we ought to be quite clear what problem these are intended to solve, and at least a reasonable estimate of how effective they are at solving them. As best I can infer from what I read, DP was introduced to increase privacy of the census data, but nobody has made a convincing case that there is any problem with previous levels of privacy.

    Although I have no dog in this fight, as I do not use fine-grained census data in my work, I see this as part of a larger problem: a “data privacy” growth industry that is running amok, indiscriminately imposing obstacles to research without any clear indication that there is a serious countervailing problem to solve. Some of these policies, such as the rapid expansion of data-use agreements that include non-disclosure clauses, will provide considerable cover to those who would fabricate or falsify data. Sure, in some contexts there are legitimate concerns about privacy of sensitive data that must trump the research issue–this is common, for example, in health care research. Many internet websites abuse sensitive data they gather on their users. There are genuine abuses like these that need to be abated.

    But census data? The information gathered in the basic census form is utterly banal. I have yet to hear an explanation of how anybody would be harmed if the census bureau were to post the unobscured household-level data on a publicly available website.

    • I very much get this sentiment. I’m not personally concerned about someone figuring out my race or ethnicity or gender from the Census, so it’s been hard for me to relate to arguments that improving privacy protection is simply essential. However, from listening to various discussions on all this, here are a few reasons that I think might play into the bureau’s decision to improve privacy protection:
      -The way surveys pose questions about sexual orientation and gender identity is morphing over time, allowing for more possible responses. Similar to ethnicity questions now allowing people to be pretty specific, some people may worry about getting targeted based on how they answer these, and stop responding truthfully. When a lot of people do this, we can’t trust the collected data on gender or ethnicity even at higher levels of aggregation.
      -Even if a government agency refuses to hand over data to a leader that wants to do certain things with it (e.g., target people they think are illegal aliens for deportation), the database reconstruction theorem proves that someone can recover individual-level records from published statistics. (The unfortunate caveat/elephant in the room being that DP won’t necessarily guard against association-based attacks like using BISG.)
      -The amount of swapping the Census had to do to feel like they were doing their job on Title 13 was increasing over time, but not necessarily helping much with privacy protection in the DB reconstruction attack sense.

  4. I find the phenomenon of public agencies ‘preemptively managing the public’s trust’ incredibly frustrating – and comically counter-productive, even on the agency’s own terms. It’s _much_ harder to trust someone or some organization that’s deliberately trying to maintain a pre-defined level of trust, independent of whether that trust is actually deserved. The CDC and FDA have been notably terrible in this regard of late.

    I also found my own interaction with Census employees/volunteers incredibly frustrating. I was contacted right after the initial COVID-19 pandemic ‘lockdowns,’ when my own personal ‘household’ situation was incredibly uncertain, e.g., people I would have expected to be living with me weren’t there and couldn’t credibly say whether they ever would be again. Given the crucial ‘voter counting’ function of the census, and all of the derivative purposes of that info, I really, really wanted some kind of reasonable exception, or even just guidance, on how to answer their questions. I realize that the specific people with whom I was in contact couldn’t realistically give me any better answer than they did (“Do the best you can”), but it was so discouraging that there was no apparent consideration of any of these issues that I declined to answer at all.
