Could you say that again less clearly, please? A general-purpose data garbler for applications requiring confidentiality

Ariel Rokem pointed me to this Python program by Bill Howe, Julia Stoyanovich, Haoyue Ping, Bernease Herman, and Matt Gee that will take your data matrix and produce a new data matrix that has the same size, shape, and general statistical properties but with none of the same actual numbers.

The use case is when you want to give your data to someone to play around with some analyses, but the data themselves are proprietary or confidential.

This is different from the Census scenario, where the synthetic dataset is created by adding a little bit of noise or taking out some observations to preserve confidentiality but is otherwise intended to give the right answers. Here, there’s no aim to use the fake data to perform real applied analyses; it’s all about creating something similar that people can work with, for example to prototype their data analysis plans.

Howe et al. call their program DataSynthesizer but if it were up to me it would just be called Garbler.

Here’s the paper describing what the program actually does. The statistical procedures used in the garbling are nothing fancy, but that’s fine: just as well to keep it simple, given the simplicity of the goals. The title of their paper is “Synthetic Data for Social Good,” but I don’t really understand that last part: it seems to me that Garbler, like other statistical tools, could be used for good or bad.
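
To make the idea concrete, here is a minimal sketch of what a garbler could look like. To be clear, this is not the algorithm from the paper and not the DataSynthesizer API: the function name `garble`, its arguments, and the choice to treat every column independently are all assumptions made for illustration. Numeric columns are redrawn from a normal distribution matching the column's mean and standard deviation, and everything else is resampled according to its observed frequencies.

```python
import random
import statistics


def garble(rows, seed=None):
    """Return a fake table with the same shape as `rows`.

    Hypothetical helper for illustration only (not part of DataSynthesizer).
    Each column is handled on its own, so relationships between columns
    are deliberately not preserved.
    """
    rng = random.Random(seed)
    n = len(rows)
    columns = list(zip(*rows))  # transpose: one tuple per column
    fake_columns = []
    for col in columns:
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in col
        )
        if is_numeric:
            # Redraw from a normal with the column's mean and spread.
            mu = statistics.mean(col)
            sigma = statistics.stdev(col) if n > 1 else 0.0
            fake = [rng.gauss(mu, sigma) for _ in range(n)]
        else:
            # Resample categories in proportion to how often they occur.
            fake = rng.choices(col, k=n)
        fake_columns.append(fake)
    return [list(r) for r in zip(*fake_columns)]  # transpose back to rows


# Made-up example data: age, income, and a yes/no answer.
data = [
    [34, 52000.0, "yes"],
    [51, 61000.0, "no"],
    [29, 39000.0, "no"],
    [45, 75000.0, "yes"],
]
fake = garble(data, seed=1)
```

The output has the same number of rows and columns as the input, and the categorical column only ever contains labels that appeared in the original. Because the columns are garbled separately, any correlation between, say, age and income is lost, which is acceptable for prototyping an analysis plan but would be the first thing to improve in a more serious version.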

6 thoughts on “Could you say that again less clearly, please? A general-purpose data garbler for applications requiring confidentiality”

    • working link:

      https://cran.r-project.org/web/packages/synthpop/

      From their link: The key objective of generating synthetic data is to replace sensitive original values with synthetic ones causing minimal distortion of the statistical information contained in the data set.

      Sounds like synthpop maybe makes more effort to “get the right answer” compared to the program Andrew is referring to.

      Thanks for this link, I definitely could imagine using it.

  1. Very cool. This has great applications for saving time when teaching statistics. For example, reusing assignments that analyze data, but garbling the dataset each semester. I know I can just simulate data, but I find they often turn out too idealized when compared to real data.

  2. I work mostly with medical data, and this should be quite useful when turning encountered errors into reproducible examples that I can post online, and turning real applications into examples/vignettes.
