Rajesh Venkatachalapathy writes:

Recently, I had a conversation with a colleague of mine about the virtues of synthetic data and their role in data analysis. I think I’ve heard a sermon/talk or two where you mention this and also in your blog entries. But having convinced my colleague of this point, I am struggling to find good references on this topic.

I was hoping to get some leads from you.

My reply:

Hi, here are some refs: from 2009, 2011, 2013, also this and this and this from 2017, and this from 2018. I think I’ve missed a few, too.

If you want something in dead-tree style, see Section 8.1 of my book with Jennifer Hill, which came out in 2007.

Or, for some classic examples, there’s Bush and Mosteller with the “stat-dogs” in 1954, and Ripley with his simulated spatial processes from, ummmm, 1987 I think it was? Good stuff, all. We should be doing more of it.

Shravan writes:

Fake-data simulation should have been the first thing I was taught when I started out. It has a life-changing, almost magical quality and has completely decluttered my modeling life. Nowadays, before we even run a study, we do some fake-data simulation to get a handle on what we are getting into (e.g., for writing a pre-registration). Even after I read Andrew's work, it didn't sink in how important it was until Stan came into the picture.
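A minimal sketch of what that pre-study simulation can look like, in Python with numpy (the two-group design, effect size, and sample size here are made-up assumptions for illustration, not anything from the discussion above):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical design: a two-group comparison with an assumed effect
# of 0.3 SD and n = 50 per group. Simulating many fake datasets
# before collecting any real data shows what estimates and what
# variability to expect from the planned study.
n, effect, n_sims = 50, 0.3, 1000
estimates = np.array([
    rng.normal(effect, 1.0, n).mean() - rng.normal(0.0, 1.0, n).mean()
    for _ in range(n_sims)
])

# Standard error of a difference of two means, each with sd 1:
theory_se = np.sqrt(2 / n)
print(f"mean estimate: {estimates.mean():.2f} (true effect: {effect})")
print(f"sd of estimates: {estimates.std():.2f} (theory: {theory_se:.2f})")
```

If the spread of estimates is as large as the assumed effect itself, that is worth knowing before the data ever arrive.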

My reply:

We have lots of fake-data simulation in Regression and Other Stories. Not enough, but at least we’re going in the right direction.

Shravan:

Yes, I discovered the GitHub repo for that on Aki's GitHub page only today, via Twitter. I'm surprised I missed that page. Looking forward to going through the examples. Fake-data simulation has really changed the way I approach data analysis; I don't even touch the target data in the beginning. Related: Betancourt's prior predictive checks in his Bayesian workflow have also been an eye-opening experience.
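For readers unfamiliar with the idea, a prior predictive check can be sketched in a few lines: draw parameters from the priors, simulate data from each draw, and look at whether the simulated outcomes are even plausible before touching the real data. Everything below (the log-normal reading-time model, the priors, the numbers) is a hypothetical illustration, not Betancourt's actual example:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical model: reading times (ms) as log-normal, with priors
# on the intercept (log-ms scale), a word-frequency effect, and the
# residual sd. Each draw from the priors yields one fake dataset.
n_obs = 10
freq = rng.normal(0, 1, n_obs)  # standardized word-frequency predictor
for draw in range(4):
    alpha = rng.normal(6, 0.5)        # prior: intercept on log-ms scale
    beta = rng.normal(0, 0.1)         # prior: frequency effect
    sigma = abs(rng.normal(0, 0.5))   # prior: residual sd
    rt = np.exp(rng.normal(alpha + beta * freq, sigma))
    print(f"draw {draw}: reading times {rt.min():.0f}-{rt.max():.0f} ms")
```

If the priors routinely produce reading times of a few milliseconds or several hours, they need rethinking, and you learn that before fitting anything.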

I think the biggest impediment is implementation. I'm unaware of any references that really walk a practitioner through all the steps of a complex project. I know lots of sources on simulations for a t-test or something similarly simple, but that is often not complex enough for the problem at hand. On the other end, I know lots of sources that try to solve problems encountered by people already very familiar with simulated data sets, such as whether one should use Bayesian methods or sandwich estimators, or whether multivariate normal distributions are good enough. What I'm not aware of is a sort of cookbook that lays out a couple of use cases and describes each step, including which priors to plug in under conditions A, B, and C, and what to do if some variables are not multivariate normal. Sure, this can be worked out, but (a) it's a slog and (b) it has too many forking paths to explain to laypeople. An authoritative guide would be useful in that one could then discuss the simulation based on it and refine from there.
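In lieu of that cookbook, the skeleton that most walkthroughs share can at least be written down: pick "true" parameters, simulate data from the assumed generative model, fit the model you intend to use, and check that it recovers the truth. A minimal sketch in Python (the linear model and all numbers are made-up assumptions; a real project would substitute its own model and priors):

```python
import numpy as np

rng = np.random.default_rng(42)

# Step 1: pick "true" parameters for a hypothetical linear model.
true_alpha, true_beta, true_sigma = 2.0, 0.5, 1.0
n = 500

# Step 2: simulate fake data from the assumed generative model.
x = rng.normal(0, 1, n)
y = true_alpha + true_beta * x + rng.normal(0, true_sigma, n)

# Step 3: fit the model you plan to use on the real data.
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid_sd = np.std(y - X @ coef)

# Step 4: check that the fit recovers the known truth (up to
# sampling noise). If it doesn't, the model or the code is wrong,
# and you found out before touching the real data.
print(f"alpha: {coef[0]:.2f} (true {true_alpha})")
print(f"beta:  {coef[1]:.2f} (true {true_beta})")
print(f"sigma: {resid_sd:.2f} (true {true_sigma})")
```

The forking paths the comment worries about all live inside steps 1 and 2 (which priors, which distributions, which dependence structure), which is exactly why a guide covering a few worked cases would help.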