(What’s So Funny ‘Bout) Fake Data Simulation

Someone sent me a long question about what to do in a multilevel model that he was fitting to data from a memory experiment. It was a fun email (it even included a graph of the data!), and his question involved a result from an analysis he had done in which an estimated difference was implausibly large.

I didn’t have the time or energy to figure out exactly what the model was doing but I still wanted to be helpful, so I sent him some generic advice, consistent with my earlier posts on fake-data simulation.

This advice might be relevant for some of you. Here it is:

I don’t have the energy right now to follow all the details, so let me make a generic recommendation, which is to set up a reasonable scenario representing “reality” as it might be, then simulate fake data from this scenario, then fit your model to the fake data to see whether it does a good job of recovering the assumed truth. This approach of fake-data experimentation has three virtues:

1. If the result doesn’t line up with the assumed parameter values, this tells you that something’s wrong, and you can do further experimentation to track down the source of the problem, which might be a bug in your fitting code, a bug in your simulation code, a conceptual error in your model, or a lack of identifiability in your model.

2. If the result is consistent with the assumed parameter values, then it’s time to do some experimentation to figure out when this doesn’t happen. Or maybe your original take, that the inferences didn’t make sense, was itself mistaken.

3. In any case, the effort required to simulate the fake data won’t be wasted, because doing the constructive work of building a generative model should help your understanding. If you can’t simulate from your model, you don’t really know it. Ideally the simulation should be of the raw data, with all the steps of normalization, etc., coming afterward.
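To make this concrete, here’s a minimal sketch of the workflow in Python. It is not the memory-experiment model from the question (whose details I don’t have): it’s a simple balanced varying-intercept model with made-up parameter values, and the moment-based estimates below are just a stand-in for whatever fitting procedure you actually use (Stan, lmer, etc.).

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed "truth": these particular values are made up for illustration.
J, n = 30, 20          # number of groups, observations per group
mu_true = 2.0          # grand mean
sigma_a_true = 1.5     # between-group sd
sigma_y_true = 3.0     # within-group sd

# Step 1: simulate fake data from the assumed scenario.
alpha = rng.normal(0.0, sigma_a_true, size=J)                  # group effects
y = mu_true + alpha[:, None] + rng.normal(0.0, sigma_y_true, size=(J, n))

# Step 2: "fit the model": here, simple moment-based estimates for a
# balanced one-way design, standing in for your real fitting procedure.
group_means = y.mean(axis=1)
mu_hat = group_means.mean()
sigma_y_hat = np.sqrt(y.var(axis=1, ddof=1).mean())            # pooled within-group sd
between_var = group_means.var(ddof=1) - sigma_y_hat**2 / n
sigma_a_hat = np.sqrt(max(between_var, 0.0))                   # can come out negative by chance

# Step 3: compare the estimates to the assumed truth.
print(f"mu:      true {mu_true:.2f}, estimated {mu_hat:.2f}")
print(f"sigma_a: true {sigma_a_true:.2f}, estimated {sigma_a_hat:.2f}")
print(f"sigma_y: true {sigma_y_true:.2f}, estimated {sigma_y_hat:.2f}")
```

In practice you’d swap in your actual model and fitting code and repeat the simulation many times, checking whether the estimates and intervals reliably land near the assumed truth.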

7 thoughts on “(What’s So Funny ‘Bout) Fake Data Simulation”

  1. I recall a case from decades ago. I haven’t been able to find references to it with internet searches so far, and I’m going by memory. So details may be hazy, but that won’t matter.

    After the asteroid theory of the extinction event associated with the end of the dinosaurs became more accepted, some researchers published a paper claiming that there was a 55-million-year cycle to major extinction events. They had a complicated methodology for analyzing the paleontological record, and it spit out this period.

    Later, other people published a paper in which they showed that the methodology *always* came up with a 55-million-year cycle for any data that somewhat resembled the actual data, even when the simulated data had very different periods baked in. (For a toy version of that kind of check, see the sketch after these comments.)

  2. Tom wrote:

    “the methodology *always* came up with a 55-million-year cycle for any data”

    Ah yes, a Cheeburger-Pepsi Model, which always returns the same output no matter the input. When these are constructed as an honest attempt to model data, they always embody a useful cautionary tale about what not to do.

  3. It seems quite a few questions on the Stan forum could be answered by fake-data simulations.

    Regarding point 3, yes! I find it easier to learn a model if I have to create a generative version of it and simulate data from it.
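As a toy illustration of the check described in the first comment above: bake different cycle lengths into simulated records and see whether the period-finding procedure tracks them. The periodogram peak-picker below is a made-up stand-in, not the methodology from the paper in question; the point is only the shape of the test. If the detected period barely moves as the true period changes, the problem is in the procedure, not the data.

```python
import numpy as np

rng = np.random.default_rng(1)

def detect_period(t, y):
    # Crude stand-in for a period-finding methodology: report the period
    # corresponding to the largest periodogram peak.
    y = y - y.mean()
    freqs = np.fft.rfftfreq(len(t), d=t[1] - t[0])[1:]   # drop the zero frequency
    power = np.abs(np.fft.rfft(y))[1:] ** 2
    return 1.0 / freqs[np.argmax(power)]

# Feed the procedure fake records with *different* known cycles baked in.
t = np.arange(0.0, 550.0, 1.0)   # time grid, e.g. millions of years
for true_period in (25.0, 55.0, 110.0):
    y = np.sin(2 * np.pi * t / true_period) + rng.normal(0.0, 0.5, t.size)
    print(f"true period {true_period:6.1f} -> detected {detect_period(t, y):6.1f}")
```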
