(One of) the hardest things about differential privacy could also be seen as an opportunity

This is Jessica. During a recent discussion, a collaborator stated a fact about differential privacy that I think captures what is fundamentally difficult about getting data analysts to accept it. The fact is:

If you can describe a small set of queries that capture what you fundamentally care about, designing a differentially private mechanism for that set will be much more efficient than any mechanism that tries to release data supporting a wider set of queries.

Given that every query leaks information, and you can avoid some leakage by designing the privacy mechanism to account for the structure of the queries (e.g., where outputs can be expected to correlate), it will always be to your advantage to have information about the analyst’s priorities. This creates a design problem: how do you get analysts to tell you their preferences over what’s important, so you can give them the most efficient privacy mechanism?
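To make the efficiency intuition concrete, here’s a minimal sketch (mine, not the collaborator’s) using the simplest possible setup: counting queries answered with the Laplace mechanism under basic composition, where a fixed budget ε is split evenly across the queries. The dataset, predicates, and function names are hypothetical, and a real deployment would use a workload-aware mechanism rather than this naive split, but it shows why committing to a small query set buys you less noise per answer.

```python
import numpy as np

def laplace_counts(data, predicates, epsilon, sensitivity=1.0):
    """Answer a set of counting queries under one shared privacy budget.

    Basic (sequential) composition: each of the k queries gets epsilon / k,
    so the Laplace noise scale is sensitivity * k / epsilon. The fewer
    queries you commit to, the less noise each answer carries.
    """
    k = len(predicates)
    scale = sensitivity * k / epsilon  # noise scale grows linearly in k
    rng = np.random.default_rng()
    return [sum(pred(row) for row in data) + rng.laplace(0.0, scale)
            for pred in predicates]

# Hypothetical sensitive data: ages of eight individuals.
ages = [23, 37, 41, 58, 62, 71, 29, 45]

# Two queries the analyst says they fundamentally care about ...
focused = [lambda a: a >= 65, lambda a: a < 30]

# ... versus a broad "just in case" workload of ten age buckets.
broad = [(lambda lo: (lambda a: lo <= a < lo + 10))(lo) for lo in range(0, 100, 10)]

eps = 1.0
print(laplace_counts(ages, focused, eps))  # noise scale 2/eps per answer
print(laplace_counts(ages, broad, eps))    # noise scale 10/eps per answer
```

And this even split is the pessimistic baseline: workload-aware mechanisms (e.g., the matrix mechanism) exploit correlations among the queries to do better, which is exactly why knowing the analyst’s priorities up front pays off.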

This aspect of differential privacy can seem at odds with all the wisdom about the value of exploring the data to identify how best to analyze it. This older slide deck from Wasserman describes a cultural divide that crops up: many statisticians would much prefer a sanitized database against which they can issue whatever queries they want, but that’s not what the computer scientists have in mind.

I’ve generally leaned a bit more toward the side of the statisticians than the privacy researchers, in the sense of agreeing that it’s not trivial to expect people to be able to tell you their highest-priority queries before they see the data. But, taking the above fact as inherent to privacy protection, maybe we could instead see this as a feature of differential privacy, in that it could inspire new approaches to getting analysts to more efficiently arrive at and specify their interests or fundamental questions when doing interactive data analysis. In visualization research, for instance, we’ve mostly shied away from thinking like mechanism designers who want to elicit analysts’ prior domain knowledge in order to recommend useful views to them (or maybe we equate “seeing data” with elicitation). The design mantras focus on simple defaults for how to show the observed data, like “provide an overview first, but offer details on demand,” or “give them the ability to focus on certain data, but don’t let them lose the context.” I don’t think these ideas are incompatible with getting analysts to provide information about what they are most interested in, but there’s a tendency to shoot for domain-general, knowledge-agnostic tools where possible. Ideas like trying to elicit analysts’ priors or expectations can seem controversial (see, e.g., our discussion article on model checking as a theory for visualization).

Perhaps more importantly, there haven’t really been popular use cases where you couldn’t just see all the data, outside of the researcher-degrees-of-freedom examples discussed in science reform more generally. There have been some suggestions that popular visual analysis tools like Tableau might be p-hacking machines of a sort, letting people make all manner of comparisons, but it’s hard to find compelling evidence that this is a major risk. So maybe the awkward constraints differential privacy imposes are a blessing in disguise, and we’ll see more sustained attempts to elicit domain knowledge as an alternative to the “just let them click around until they find it” approach.

2 thoughts on “(One of) the hardest things about differential privacy could also be seen as an opportunity”

  1. Jessica:

    I think one problem with the whole privacy discussion is that there are different standards for what is acceptable, so in some settings there are strong restrictions based on theoretical possibilities of data leakage, while in other settings lots of identifying data is sitting out there for anyone to grab. I’m reminded of the Institutional Review Board, which can forbid all sorts of things in a research setting that are perfectly legal if you just want to do them as a private citizen. Having a patchwork of rules and restrictions is not necessarily a bad thing; it just can lead to confusing discussions, especially with math and CS people who are often used to dealing with absolutes.

    • Yes, and theorists and data users / the general public have conflicting expectations about what privacy is supposed to mean; e.g., people get confused by the fact that predicting sensitive attributes is not considered privacy loss if you are learning purely from aggregate statistical relationships, like those between race and name or geography. And the different sides can talk past each other when one side is convinced by theoretical evidence and the other only cares about what’s probable. Actually, my student and I recently summarized some of these sources of conflict as they seem to be affecting the US Census in a magazine article: https://ieeexplore.ieee.org/abstract/document/9904294

      This post is just saying that, of the various ways DP challenges intuitions, needing to know up front which analyses matter most for some data is one that seems particularly hard to overcome. But, at least in exploratory data analysis settings, we haven’t necessarily directed much attention to trying to elicit what is critical to an analyst, so maybe there is some innovation that could come out of trying.
