Show me the noisy numbers! (or not)

This is Jessica. I haven’t blogged about privacy preservation at the Census in a while, but my prior posts noted that one of the unsatisfying (at least to computer scientists) aspects of the bureau’s revision of the Disclosure Avoidance System for 2020 to adopt differential privacy was that the noisy counts file that gets generated was not released along with the post-processed Census 2020 estimates. This is the intermediate file produced when calibrated noise is added to the non-private estimates to achieve differential privacy guarantees, but before post-processing operations massage the counts into realistic-looking numbers (including preventing negative counts and ensuring that counts for smaller geographies sum correctly to those for larger ones, e.g., the state level). In this case the Census used zero-concentrated differential privacy as the definition and added calibrated Gaussian noise to all estimates except predetermined “invariants”: the total population of each state, the count of housing units in each block, and the group quarters counts and types in each block.
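
To make the pipeline concrete, here is a minimal sketch of how a noisy measurement gets produced under zero-concentrated DP. This is my own illustration, not the bureau’s TopDown implementation: I use a continuous Gaussian for simplicity (the 2020 DAS actually used a discrete Gaussian over a much more elaborate query workload), and the block counts and rho value below are made up.

```python
import numpy as np

rng = np.random.default_rng(2020)

def noisy_count(true_count, rho, sensitivity=1.0):
    """Gaussian mechanism calibrated to rho-zCDP: a query with L2
    sensitivity s satisfies rho-zCDP under N(0, sigma^2) noise with
    sigma = s / sqrt(2 * rho)."""
    sigma = sensitivity / np.sqrt(2 * rho)
    return true_count + rng.normal(0.0, sigma)

# Hypothetical block-level person counts (not real Census data).
blocks = np.array([3, 0, 12, 1])
noisy = np.array([noisy_count(c, rho=0.1) for c in blocks])

# The raw noisy measurements can go negative and won't sum exactly
# to the true total -- exactly the artifacts post-processing removes.
print(noisy)
print(noisy.sum(), "vs true total", blocks.sum())
```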

Why is the non-release of the noisy measurements file problematic? Recall that privacy experts warn against approaches that rely on “security through obscurity,” i.e., where parameters of the approach used to noise the data must be kept secret to avoid leaking information. This applied to the techniques the bureau previously used to protect Census data, like swapping households in blocks where they were too unique. Under differential privacy it’s fine to release the budget parameter epsilon, along with other parameters if an alternative parameterization is used, like the concentrated differential privacy definition adopted by the Census, which involves a parameter rho whose budget is allocated across queries and a parameter delta capturing how likely it is that the actual privacy loss exceeds the bound set by epsilon. Anyway, the point is that using differential privacy as the definition renders security threats from leaked parameters obsolete. Of more interest to data users, it also opens up the possibility of accounting for the added noise when doing inference with Census data. See the appendix of this recent PNAS paper by Hotz et al. for a discussion of the conditions under which inference is possible on data noised to achieve differential privacy versus where identification issues arise.
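
For readers who want to see what “releasing the parameters is fine” buys you: given a public rho, anyone can compute an implied (epsilon, delta) guarantee via the standard zCDP-to-DP conversion of Bun and Steinke. A small sketch with arbitrary rho values (the bureau’s published accounting uses tighter conversions than this simple bound):

```python
import numpy as np

def zcdp_to_dp(rho, delta):
    """Bun & Steinke (2016): rho-zCDP implies (eps, delta)-DP for any
    delta > 0, with eps = rho + 2 * sqrt(rho * ln(1 / delta))."""
    return rho + 2 * np.sqrt(rho * np.log(1.0 / delta))

# Illustrative budgets only; with rho and delta public, any data user
# can audit the implied worst-case epsilon -- no obscurity required.
for rho in (0.1, 1.0, 2.5):
    print(f"rho = {rho}: eps = {zcdp_to_dp(rho, delta=1e-10):.2f}")
```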

But these inference benefits are conditional on the bureau actually releasing that file. Cynthia Dwork, Gary King, and others sent a letter calling for the release of the noisy measurements file a while back. More recently, Ruth Greenwood of Harvard’s Election Law Clinic and others filed a Freedom of Information Act (FOIA) request for 1) the noisy measurements file for the Census 2010 demonstration data (provided by the bureau to demonstrate what the new disclosure avoidance system under differential privacy produces, for comparison with the published 2010 estimates that used swapping), and 2) the noisy measurements file for Census 2020. The reasoning is that users of Census data, particularly for redistricting, need this file in order to better assess the extent to which the new system adds bias through post-processing. Presumably, once the file is released it could become the default for reapportionment, sidestepping any identified biases.

The Census responded to the request for the noisy measurements file for the 2010 demonstration data by saying that “After conducting a reasonable search, we have determined that we have no records responsive to item 1 of your request.” They cite the storage overhead of roughly 700 files of 950 gigabytes each as the reason the files were deleted.

Their response to the request for the 2020 noisy measurements file is essentially that releasing the file would compromise the privacy of individuals represented in the 2020 Census estimates. They say that “FOIA Exemption 3 exempts from disclosure records or portions of records that are made confidential by statute, and Title 13 strictly prohibits publication whereby the data furnished by any particular establishment or individual can be identified.” They refer to “Fair Lines American Foundation Inc. v. U.S. Department of Commerce and U.S. Census Bureau, Memorandum Opinion at No. 21-cv-1361 (D.D.C. August 02, 2022) (holding that 13 U.S.C. § 9(a)(2) permits some level of attenuation in the chain of causation, and thus supports the withholding of information that could plausibly allow data furnished by a particular establishment or individual to be more easily reconstructed).” They encourage the requesters to apply for approved access to the files for their specific research project, since this kind of authorized use is still possible.

I find the claim that releasing the 2020 noisy measurements file would somehow compromise individual privacy interesting and unexpected. I don’t really have reason to believe that the Bureau is lying when they claim that leakage would result from releasing the files, but how exactly would the noisy measurements file aid reconstruction attacks? My first thought was that maybe the post-processing steps were parameterized partly based on observing the realized error between the original and noised estimates, but this would contradict the goals of post-processing as they’ve been described, which are removing artifacts that make the data seem fake (namely negative counts) and making things add up. A more skeptical view is that they just don’t want two contradictory sets of 2020 estimates out there, given the confusion and complications it could cause legally, for instance if redistricting cases that relied on the post-processed estimates are now challenged by the existence of more informative data. Aloni Cohen and Christian Cianfarini, who have followed the legal arguments made in Alabama’s lawsuit against the Department of Commerce and Census over the switch to differential privacy, tell me that there is some historical precedent for redistricting maps being revisited after the discovery of data errors, including examples where the ruling favored redrawing the maps and examples where it did not.
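
To see why the post-processing goals, as described, shouldn’t require peeking at the realized error, here is a toy stand-in for post-processing (nothing like the actual TopDown algorithm, which solves large constrained optimization problems): it consumes only the noisy measurements and a public invariant total, never the confidential counts. If the real pipeline has this structure, the post-processing property of differential privacy says releasing the noisy file costs no additional privacy.

```python
import numpy as np

def toy_postprocess(noisy, invariant_total):
    """Turn noisy measurements into non-negative integers that sum to
    a known invariant total. Inputs: the noisy counts and the public
    invariant -- the true confidential counts never enter."""
    x = np.clip(np.asarray(noisy, dtype=float), 0.0, None)  # no negatives
    if x.sum() == 0:
        x = np.full(len(x), invariant_total / len(x))
    else:
        x = x * invariant_total / x.sum()  # enforce the invariant sum
    # Round while preserving the total (largest-remainder method).
    out = np.floor(x).astype(int)
    shortfall = int(invariant_total) - int(out.sum())
    out[np.argsort(out - x)[:shortfall]] += 1
    return out

print(toy_postprocess([4.3, -1.1, 12.8, 0.2], invariant_total=16))
# -> [ 4  0 12  0]
```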

If the reasoning is primarily to avoid contradictory numbers, then it’s yet another example of the same fears about losing the (false) air of precision in Census estimates, fears that have been called “incredible certitude” and “the statistical imaginary,” and it goes hand in hand with the (at least to me) bizarre restrictions Title 13 places on the use of statistical methods, prohibiting any “statistical procedure … to add or subtract counts to or from the enumeration of the population as a result of statistical inference.” (This came up in the Alabama case but was dismissed because noise addition under differential privacy is not a method of inference.)

Finally, in other Census data privacy news, Priyanka Nanayakkara informs me that the bureau recently announced that the ACS files will not be subject to a formal disclosure avoidance approach by 2025 as hoped, because “the science does not yet exist to comprehensively implement a formally private solution for the ACS.” It sounds like fully synthetic data is more likely than differential privacy, which could be good for inference (see for instance the same Hotz et al. appendix mentioned above, which contrasts inference under synthetic data generation and differential privacy). We need more computer scientists doing research on this.

1 thought on “Show me the noisy numbers! (or not)”

  1. Some fifteen years ago I was working on a system for a state agency. I needed sample data to test database queries, so I asked for some. The data would include names, addresses, and SSNs since that is what the system needed to query. I received a password-protected zip file of data that included names, SSNs, and addresses, apparently all real data. I was horrified to be getting real data, and I had never imagined that they would send any to me as a mere subcontractor.

    I tried to anonymize the data by mapping the SSNs into the illegal range (numbers that will never be assigned), and randomly swapping the street addresses among the cities/zip codes. That might not have been perfect because some street addresses might have been unique to one city. I also randomly swapped last names; I forget what else I did with the names. Then I deleted the original zip file.

    This was the best I could come up with on short notice, but I didn’t want that real data residing on my system any longer than necessary.
