Journalists and social scientists team up to discover that Census data is not perfect

This is Jessica. Just a day or so after we’d had what I thought was a productive, relatively nonpartisan discussion about the new Census disclosure avoidance system and the evidence Kenny et al. provide of possible implications, Mike Schneider, a journalist writing for the Associated Press, published an article demonstrating some flawed but all-too-common argumentation about the new Census DAS.

It should be clear by now that I do think there are valid points for debate around the new DAS, and to give Schneider a little credit, the article does suggest there is debate and some uncertainty. But the rhetorical strategy is worth calling out because it keeps popping up. The idea seems to be to first point out that there are errors in Census data, in this case through anecdotes about how some reported counts do not accurately describe our real-world experiences (gasp!), and then to imply that because—brace yourself—these errors were added by design, they must be worse than whatever (conveniently left undescribed) thing was going on before.

For example, the article reports that “Forty-eight of the residents living in the block are Black, according to the census, though it’s difficult to know for sure, given the whimsy of differential privacy” and elsewhere describes the new data as “unreliable” despite having noted earlier that we can’t know the error rates for sure, because we don’t have access to the raw data.

In case it is still not clear to anyone out there who is either writing on this topic or consuming these news stories and “expert” reports made in court (and at the risk of repeating myself): there have always been errors in Census data, the largest stemming from data-collection problems like non-response (which, by the way, some claim is worse this year than ever for certain groups). But the Census has also used techniques to limit disclosure for years. Just because swapping attributes between selected individuals and households (which is what the Census’s previous system did to protect information deemed sensitive) did not involve setting a single parameter value to control the amount of injected error does NOT mean that the data were perfectly accurate.

Put another way, it’s like these journalists (not to mention certain “experts” who have come down hard on the new DAS without necessarily publishing any careful analyses demonstrating their points) are implying that a planned murder can only occur if the criminal first said “I want you to be [gesturing with their fingers to designate some epsilon value] this much dead”, and choosing to ignore any past planned murders which they committed that were not preceded by such a statement, but instead involved shooting as many times as it took to kill the victim, at least as far as the criminal could tell on a foggy night. 

The real distinction is that previously we didn’t have all the information to talk about error contributed to certain tabulations by disclosure limitation techniques, because the whole process needed to be kept secret (“security through obscurity”). With differential privacy, we can discuss the magnitude of error added. So let’s discuss that, and its implications! Shooting the messenger is not getting us anywhere.    
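To make the “we can discuss the magnitude of error added” point concrete, here is a toy sketch of the classic Laplace mechanism, where a single privacy parameter epsilon publicly determines the noise scale. (This is an illustration of the general idea only, and an assumption on my part as a simplification: the Census’s actual TopDown algorithm uses discrete noise and extensive post-processing, not this plain mechanism.)

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_count(true_count, epsilon):
    """Release a count with Laplace noise calibrated to epsilon.

    A count query has sensitivity 1 (adding or removing one person
    changes it by at most 1), so the noise scale is 1 / epsilon.
    """
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Smaller epsilon means more privacy and more noise, and crucially the
# expected error (1/epsilon for this mechanism) is public knowledge.
true_count = 54
for eps in (0.1, 1.0, 10.0):
    noisy = laplace_count(true_count, eps)
    print(f"epsilon={eps}: noisy count {noisy:.1f}, expected |error| {1/eps:.1f}")
```

That publishability of the error distribution is exactly what swapping lacked: the parameters were secret, so no one outside the Bureau could quantify the added error.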

One way that swapping was legitimately different from the new DAS is that it did not affect certain tabulations (for instance, block-level population counts not faceted by any sensitive attributes) that are noised under the new DAS. Is that what this article is trying to get at? I can’t really tell. Whatever the point is, it gets lost in all the mystification of “vanishing homes” and “magic” and of course “the whimsy” of a mathematical guarantee.

6 thoughts on “Journalists and social scientists team up to discover that Census data is not perfect”

  1. Far be it from me to criticize Census’s implementation of privacy protections (they have *way* more experience with this than I do, and I recognize that they had a daunting task), but examples like “the official 2020 census results say 54 people live in Stephenson’s census block in midtown Milwaukee, but also that there are no occupied homes” make it super easy to criticize. In my work, that would be like generating death data that suggested more people died of heart attacks than died of all forms of heart disease or that a county had more people die from heart disease than it had residents. My impression is that Census generated population sizes and the number of occupied homes independently of each other, but perhaps a conditional approach — e.g., sample/allocate the number of occupied homes to various groups subject to a constraint at a higher level of geography on the total number of occupied homes, and then sample/allocate people to homes subject to the constraint that each home needs at least one resident and a constraint based on the total number of people at a higher level of geography — would have been more appropriate.

    In any case, Census was the first to do something like this at this scale, and other agencies/organizations can learn from their experience.
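The conditional approach this comment proposes can be sketched in a few lines of Python. This is a hypothetical toy with made-up totals, not anything the Census actually does: allocate occupied homes to blocks subject to a higher-geography total, then allocate people subject to every occupied home having at least one resident.

```python
import numpy as np

rng = np.random.default_rng(1)

def allocate_consistent(total_people, total_homes, n_blocks):
    """Toy conditional allocation: homes first, then people.

    Enforces the two internal-consistency constraints the comment asks
    for: a block with residents must have occupied homes, and every
    occupied home has at least one resident.
    """
    assert total_people >= total_homes >= 0
    assert total_homes > 0 or total_people == 0
    # Step 1: split homes across blocks; multinomial preserves the
    # higher-geography total exactly.
    homes = rng.multinomial(total_homes, np.ones(n_blocks) / n_blocks)
    # Step 2: seed each occupied home with one resident, then spread the
    # remaining people only across blocks that actually have homes.
    people = homes.copy()
    occupied = homes > 0
    extra = total_people - total_homes
    if extra > 0:
        probs = np.where(occupied, 1.0, 0.0)
        people += rng.multinomial(extra, probs / probs.sum())
    return homes, people

homes, people = allocate_consistent(total_people=200, total_homes=80, n_blocks=10)
```

By construction this can never produce a block like the one in the article, with 54 residents and zero occupied homes; whether such a constrained scheme is compatible with the DAS’s privacy accounting is a separate question.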

  2. Hmmm, this does seem problematic. Of course, without comparing this to the past census, it’s difficult to say for sure, but really 15,000 neighborhood blocks with residents but no occupied homes? Did the old swapping method lead to such wacky results?

    I appreciate your point on quantifiable uncertainty but I think there also needs to be some weight to usability.

    • I know I’ve made clear on other posts what my bias is regarding the DAS rollout, but no, it didn’t. The pro-DAS crowd refuses to acknowledge that data that are not even internally consistent at a basic level are useless for many purposes for which public and private entities have historically relied on them.

    • I’m not arguing that the differences don’t matter, just that the article does a terrible job of conveying what exactly has changed or how bad it is. Reporting counts of blocks where there’s a discrepancy between population and occupied housing gives me no sense of potential implications. E.g., it says 1,200 of Florida’s 484,000 blocks list occupied homes but not population, so 0.25%… does that matter? I have no idea. How are occupied-home and population data used together by city planners? The article reads like an attempt to get people’s attention using scare tactics without having taken the time to build a coherent argument. Given that it was written by someone whose beat is apparently the Census (and whose previous reporting, at least as far as I could tell from the few articles I looked at, seemed more balanced), I have to wonder what they were after with this one.

  3. “How are occupied home and population data used together by city planners?”

    Well, I don’t do city planning, so maybe my perspective on this is off base, but it seems to me the issue arises when you have two variables that should be strongly related to each other but the data values are not even compatible: one might worry about _which one to use_, or about what to do in an analysis requiring both kinds of information.

    Do 0.25% blatant errors matter? For many purposes that is less than other sources of error in most data sets. But there are applications where that kind of error would, indeed, make a difference. How often those come up in the use of census data, I have no idea; it’s something I rarely do.

    For me, though, the larger question is the whole context of privacy of census data. In many areas over the last few decades we have seen increasingly stringent or disruptive measures taken to protect data privacy. Where truly sensitive information is involved, that makes sense. But it seems to me that, at least for the regular census form, the information is not sensitive, and simply removing names and exact addresses would be sufficient, at least sufficient to protect information that is not readily available in other publicly or commercially available data sets. Maybe I don’t remember what’s on the form and I’m missing something. But I’m just not convinced that data requires much in the way of privacy protection. Who would be harmed in what ways by breaches?

    In many other contexts I often feel that nowadays we are paying a large price to protect the privacy of data that doesn’t really need protection. Is the basic census information in that category?

    • ‘Who would be harmed in what ways by breaches?’ … I haven’t necessarily reached a satisfactory answer for myself. What is most obvious to me is the value of differential privacy once we make the assumption that Census data shouldn’t leak individual-level data.

      My guess is that some would argue that controlling how much individual information is disclosed is a necessary responsibility for any government that wants to collect data on its residents, a kind of contract of trust that must be in place if you expect people to keep honestly reporting to the state. I’ve also recently been pointed to philosophical arguments that even in a perfectly moral world (e.g., one where sensitive information would never be misused even if made public) there would still be value in privacy, and to notions of privacy that stress the importance of expectations (e.g., Helen Nissenbaum’s notion of privacy as contextual integrity): if I report my personal data to the Census I might expect it to be used in certain ways to direct resources, delegates, etc., but feel violated if it were used by some attacker in an identity theft.

      There are some interesting historical takes (e.g., by the historian Margo Anderson) that give context on how the notion of individual privacy protection as described in Title 13, which strikes me as a very ‘American’ thing, became the predominant one, even though the most serious known harms from Census data are due to the Census handing over information on small tabulations of certain minority groups (e.g., Arab Americans in the 1990s and Japanese Americans during WWII). I’m not sure what I personally think about how to weigh the harms of breaches, other than that how people interpret ‘harm’ as it relates to reidentification seems to be part of why it is hard to find consensus on this topic.
