This is Jessica. In a recent discussion, a collaborator stated a fact about differential privacy that I think captures what is fundamentally difficult about getting data analysts to accept it. The fact is:
If you can describe a small set of queries that capture what you fundamentally care about, designing the differentially private mechanism for that set will be much more efficient than any mechanism that tries to give data that allows you to do a wider set of queries.
Given that every query leaks information, and you can avoid some leakage by designing the privacy mechanism to account for the structure of the queries (e.g. where outputs can be expected to correlate), it will always be to your advantage to have information on the analyst’s priorities. This creates a design problem: how do we get the analyst to give you their preferences over what’s important so you can give them the most efficient privacy mechanism?
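To make the efficiency gap concrete, here is a minimal sketch (not from the discussion above) of the simplest version of this tradeoff: under the standard Laplace mechanism with sequential composition, an analyst who commits to one query can spend the whole privacy budget on it, while an analyst who wants 100 "just in case" queries must split the budget, inflating the noise on every answer. The numbers and query count are illustrative assumptions, and real workload-aware mechanisms (e.g., ones exploiting correlations among queries) do better than naive splitting, which is exactly the point of eliciting priorities.

```python
import numpy as np

def laplace_release(true_value, sensitivity, epsilon, rng):
    """Release true_value with Laplace noise scaled to sensitivity/epsilon."""
    return true_value + rng.laplace(scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
epsilon_total = 1.0  # overall privacy budget (illustrative)
sensitivity = 1.0    # e.g., a counting query

# Case 1: the analyst names the one query they fundamentally care about,
# so the full budget goes to it.
focused_scale = sensitivity / epsilon_total

# Case 2: the analyst wants freedom to ask k = 100 queries; naive
# sequential composition gives each query epsilon_total / k.
k = 100
broad_scale = sensitivity / (epsilon_total / k)

print(focused_scale, broad_scale)  # noise scale is 100x larger per answer
```

The per-answer noise scale grows linearly with the number of queries under this naive accounting, which is why a mechanism designed around a small, declared set of queries can be so much more efficient than one supporting an open-ended workload.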
This aspect of differential privacy can seem at odds with all the wisdom around the value of exploring the data to identify how to best analyze it. This older slide deck from Wasserman describes a cultural divide that crops up, where many statisticians would much prefer a sanitized database where they can issue whatever queries they want, but that’s not what the computer scientists have in mind.
I’ve generally leaned a bit more toward the side of statisticians than privacy researchers, in the sense of agreeing that it’s not trivial to expect people to be able to tell you about their highest priority queries before they see the data. But, taking the above fact as inherent to privacy protection, maybe we could instead see this as a feature of differential privacy, in that it could inspire new approaches to getting analysts to more efficiently arrive at and specify their interests or fundamental questions when doing interactive data analysis. In visualization research, for instance, we’ve mostly shied away from thinking like mechanism designers who want to elicit analysts’ prior domain knowledge in order to recommend useful views to them (or maybe we equate “seeing data” with elicitation). The design mantras focus on simple defaults for how to show the observed data, like Provide an overview first, but offer details on demand, or Give them the ability to focus on certain data, but don’t let them lose the context. I don’t think these ideas are incongruent with getting analysts to provide information about what they are most interested in, but there’s a tendency to shoot for domain-general, knowledge-agnostic tools where possible. Ideas like trying to elicit analysts’ priors or expectations can seem controversial (see, e.g., our discussion article on model checking as a theory for visualization).
Perhaps more importantly, there haven’t really been popular use cases where you couldn’t just see all the data, outside of the researcher-degrees-of-freedom type examples discussed in science reform more generally. There have been some suggestions that popular visual analysis tools like Tableau might be p-hacking machines in a sense, by letting people make all sorts of comparisons, but it’s hard to find really compelling evidence that this is a major risk. So, maybe the awkward constraints differential privacy imposes are a blessing in disguise, and we’ll get more sustained attempts to elicit domain knowledge as an alternative to the “just let them click around until they find it” approach.