Simulated-data experimentation: Why does it work so well?

Someone sent me a long question about a complicated social science problem involving intermediate outcomes, errors in predictors, latent class analysis, path analysis, and unobserved confounders. I got the gist of the question but it didn’t quite seem worth chasing down all the details involving certain conclusions to be made if certain effects disappeared in the statistical analysis . . .

So I responded as follows:

I think you have to be careful about such conclusions. For one thing, a statement that effects “disappear” . . . they’re not really 0, they’re just statistically indistinguishable from 0, which is a different thing.
One way to advance on this could be to simulate fake data under different underlying models, then apply your statistical procedures and see what happens.

That’s the real point. No matter how tangled your inferential model is, you can always just step back and simulate your system from scratch. Construct several simulated datasets, apply your statistical procedure to each, and see what comes up. Then you can see what your method is doing.
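
For example, here’s a minimal sketch of the idea in R (the regression model, sample size, and parameter values below are just made up for illustration): simulate data from a model whose parameters you control, fit it, and check whether the known values come back.

```r
# Minimal sketch of simulated-data experimentation (illustrative values only):
# simulate from a model whose parameters we control, then fit it and compare
# the estimates to the known truth.
set.seed(123)
n     <- 200
a     <- 1.0   # true intercept (chosen arbitrarily)
b     <- 0.5   # true slope
sigma <- 2.0   # true residual sd
x <- runif(n, 0, 10)
fake <- data.frame(x = x, y = a + b * x + rnorm(n, 0, sigma))

fit <- lm(y ~ x, data = fake)
summary(fit)   # are a, b, and sigma roughly recovered?
confint(fit)   # do the intervals cover the values we simulated from?
```

You can then change the simulation (a smaller sample, a lurking confounder you don’t adjust for, measurement error in x) and watch what your procedure does when the world isn’t the one it assumes.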

Why does fake-data simulation work so well?

The key point is that simulating from a model requires different information from fitting a model. Fitting a model can be hard work, or it can just require pressing a button, but in any case it can be difficult to interpret the inferences. But when you simulate fake data, you kinda have to have some sense of what’s going on. You’re starting from a position of understanding.

P.S. The first three words of the above title were originally “Fake-data simulation,” but I changed them to “Simulated-data experimentation” after various blog discussions.

31 thoughts on “Simulated-data experimentation: Why does it work so well?”

  1. I learned about the value of testing using simulated data decades ago. I think it was in the 1970s. I’m going by memory, and some details will be wrong. A paper had been published that claimed to be able to explain certain major peaks of species extinction in the fossil records going back many millions of years. In particular, IIRC, the interest was in some episode about 50 million years ago.

    The paper showed that when you plugged in the known data, the model reproduced this extinction episode. Very impressive.

    Someone else then published a critical paper – which I read at the time – in which random, made up data were plugged into the model, and the result still produced the 50 MY peak. In fact, nearly any data input produced the peak.

    Not much was heard after this about the model. Score one for fake data simulation!

  2. >But when you simulate fake data, you kinda have to have some sense of what’s going on. You’re starting from a position of understanding.

    Yes! Simulating data has really helped me with my understanding of the data analysis / modeling process in general. As someone only a few years into the game, I really recommend that any newbie like me learn how to do this. I wish my education had included a class entirely on data simulation.

  3. I’m curious whether anyone has a link to an online / publicly available example like the one Andrew details in the blog post. I get the gist (hopefully) of what Andrew and others have been saying about the value of simulated-data experimentation for understanding a real-world problem, but it’d be helpful to see a worked-through example, preferably in R.

  4. Andrew wrote:
    P.S. The first three words of the above title were originally “Fake-data simulation,” but I changed them to “Simulated-data experimentation” after various blog discussions.

    I think using “simulated-data experimentation” is a vast improvement.
    Bob76

    • I think the term “synthetic data” is typically used to describe datasets that use simulation to approximate real data in order to avoid disclosing the identity of the individuals in the dataset (usually people, but also firms and other entities). For instance, the US Census Bureau publishes a synthetic version of its Longitudinal Business Database. The idea is that researchers can use the synthetic data to get approximately correct results without seeing the real and highly sensitive data. In some cases, the data provider will rerun researchers’ analyses on the real data to compare results, which also helps improve future iterations of synthetic data.
      See, for example,
      https://www.theatlantic.com/technology/archive/2015/07/fake-data-privacy-census/399974.

      In contrast, I think “simulated data” as used in the original post is more like Monte Carlo simulation, with the primary goal of testing statistical methods rather than producing substantive results. But I would be happy to be corrected if I misunderstood.

  5. While I miss the punchiness and non-jargony nature of “fake data simulation”, I prefer the new phrase “simulated-data experimentation”.

    In particular, I like adding “experimentation” because it emphasizes that the goal of the simulation is to establish the causal relations between the putative data-generating process and the analysis. We establish causal relations in reality using experiments too. And just like with “real data experimentation”, we need to care about things like replication and variability in our simulated-data experiments.

  6. On the other hand, I like the pairing of fake data and fake world.

    The fake data is being drawn from a fake world (the simulation) to _just_ learn about that fake world as we created it to be.

    Unfortunately that gets blurred more than it should be, even by statisticians.

    Not sure if the nicer vocabulary suggested here is worth the switch, but swimming against the vocabulary school is seldom a good idea…

  7. While I was getting used to “fake data” and had used “synthetic data” as its formal synonym (thanks, David, for your comment!), I like “simulated-data experimentation.” This is all related to forward and inverse modeling, correct? The first is the creation of simulated data and the second is the fitting of a model to real data.

    Coming from a bit of a simulationist background (system dynamics), I sense that both simulationists and statisticians do both, but they may mentally start from their own corner. I like your encouragement.

  8. We’re using “fake data simulation” in a systematic manner for example here:
    https://link.springer.com/article/10.1007/s11222-015-9566-5

    The reason why it works so well is, in my opinion, that modelling is really our best tool for having a formalised idea of how the world comes up with data. Of course all our models are wrong because they are models, but still they’re the best we have. So we can set up a model that seems “realistic enough” to us, where we can *control the truth* (this is the major advantage), and then see how what our method gives us relates to the truth.

    This is really not very different from how people have analysed the quality of estimators and other inferential methods theoretically since the old days, based on model assumptions. The difference is that we should not take for granted that reality is so simple that it allows theory to work out nicely, so our models for simulating fake data can be properly guided by our knowledge of the situation rather than by mathematical convenience, and in the same manner we can analyse methods that are not just the outcomes of simplistic optimisation problems.
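
    For instance, here is a toy illustration (my own invented numbers, not the setup from the linked paper) of controlling the truth and then seeing how a standard method relates to it across many replications:

    ```r
    # Toy illustration (invented numbers, not the setup from the linked paper):
    # we control the true effect, then check how often a standard 95% interval
    # actually covers it.
    set.seed(1)
    true_effect <- 0.3
    n    <- 50
    reps <- 2000
    covered <- replicate(reps, {
      y  <- rnorm(n, mean = true_effect, sd = 1)
      ci <- t.test(y)$conf.int
      ci[1] <= true_effect && true_effect <= ci[2]
    })
    mean(covered)  # close to the nominal 0.95, because this simulation satisfies the assumptions
    ```

    Swap the normal draws for something that violates the assumptions (dependent observations, say) and the same check shows how far the nominal 95% can drift, which is the point of being able to control the truth.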

  9. Not that I oppose synthesized simulated fake artificial experimental data. But how can you create realistic data simulations without a model to generate the data? In the end don’t you have to understand the data in order to fake it?

    On the one hand there’s a sensible side to it – the idea of assessing exactly how the model works appeals to me, especially in situations where data are limited. On the other hand, there’s the sense that you’re creating a euphemism for a euphemism for a euphemism: i.e., just burying the source of the actual model that controls the outcome deeper, so that whatever biases and assumptions are built into that model are yet more difficult to see.

    All of which brings me back to the one check on reality that we actually have: all models that are deployed for anything valuable must be verified against reality. It’s cool to use fake data or whatever kind of intellectual circus can be dreamed up, as long as at the end of the day the model is validated – repeatedly – against fresh data over the range of values or concepts that it purports to address.

    • Jim,

      That’s my thought as well. If you want to create a dataset with known properties (I think the term might be “known data generating processes”) and see how certain aspects of the data play out in a model, then simulation is the obvious strategy. It’s a good way of doing methodological investigations.

      But simulating a dataset that seems (to the analyst) similar to some real dataset and treating model results from the simulated data as though they are informative about the real data? Maybe it’s my limited mathematical background but I can’t see how such results can possibly be anything other than a reflection of what the analyst programmed into the simulated data generating process.

      • That’s right, it doesn’t tell you anything about the real data or the real world. What it does elucidate is the behavior of your proposed model and your estimating procedure: whether or not your Stan model is coded correctly, whether or not a model of the form you specified is identifiable, and what data generated from the process you imagine would actually look like.

        Ex (I’m making this up on the spot, don’t check my math)

        I believe some process produces observations which are distributed Cauchy around some location parameter determined by some wide linear model. I do a simulation from some realistic-looking parameters and I look at the observations. Even if I get the locations and spreads to match up, they don’t look anything like the real observations—way too many extreme mismatches. So without doing any fitting at all I already know my proposed model needs some modification. I try fitting the model and I don’t recover anything like the coefficients I simulated from—too much spread. This also tells me that even if I were exactly right about the structure of the data-generating process, I wouldn’t really be able to learn anything meaningful from this model and data anyway.
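
        Sketching that in R, with invented numbers and the same don’t-check-my-math caveat, it might look something like this:

        ```r
        # Invented numbers: simulate Cauchy-distributed observations around a
        # linear predictor and see what that assumption would imply.
        set.seed(42)
        n <- 100
        x <- rnorm(n)
        beta0 <- 2; beta1 <- 5                  # "realistic looking" guesses
        y_sim <- beta0 + beta1 * x + rcauchy(n, location = 0, scale = 1)
        summary(y_sim)   # heavy tails: a few simulated points are absurdly extreme

        # Fit the assumed Cauchy regression by maximum likelihood and check
        # whether the coefficients we simulated from come back.
        negloglik <- function(par) {
          mu <- par[1] + par[2] * x
          -sum(dcauchy(y_sim, location = mu, scale = exp(par[3]), log = TRUE))
        }
        fit <- optim(c(0, 0, 0), negloglik)
        fit$par[1:2]     # compare to beta0 = 2 and beta1 = 5
        ```

        Whether the recovery looks fine or hopeless is exactly what this exercise is meant to reveal before any real data get involved.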

    • The thing is that if you want to use your methodology to reveal something underlying the actual real data that is not directly observable, you can’t check this in any straightforward manner against the real data. You can check what the methodology gives against what you have put in your fake data model though, because there you can control the “model truth”.

      Of course one can discuss the model and whether what’s implied by it is realistic or may reflect a certain bias, and of course one should check the method’s fit against the real data. But still, fake-data simulation gives you something valuable that you cannot see otherwise.

  10. In my last position before retirement I became interested in randomization statistics, which I think similarly focuses one’s attention on the model, especially when there are multiple factors. Of course, the usual model is ‘degenerate’ – the data is real, but the associations between data points are purely random. But having set up the random model, one can’t help but spend more time thinking about the models corresponding to alternative hypotheses.
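
    As a minimal sketch of that “degenerate” random model, with hypothetical two-group data: keep the observed values, shuffle the group labels, and see where the observed statistic falls.

    ```r
    # Hypothetical two-group data; the randomization model keeps the observed
    # values but treats the group assignment as purely random.
    set.seed(7)
    y <- c(rnorm(20, mean = 0), rnorm(20, mean = 0.8))
    group <- rep(c("A", "B"), each = 20)
    obs_diff <- mean(y[group == "B"]) - mean(y[group == "A"])

    perm_diff <- replicate(5000, {
      g <- sample(group)                    # break the y-group association
      mean(y[g == "B"]) - mean(y[g == "A"])
    })
    mean(abs(perm_diff) >= abs(obs_diff))   # permutation p-value
    ```

    Having set up that null model, thinking about which alternative data-generating processes to simulate under is where the extra understanding comes from.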

  11. I like simulating data a lot.

    One part is model debugging: if your model is supposed to catch a particular behavior in the data, then running it against simulated data that do or do not have that behaviour is the simplest way to check whether the model is working as intended.
    Moreover, it gives you a sense of how to read the model’s results.

    The other part is that simulating a dataset is difficult, and that is exactly why it is rewarding in the end.
    Some time ago I had a stochastic process which seemed quite easy to reproduce. That wasn’t the case. My first assumptions about the signal (a Brownian motion with a trend) turned out to be insufficient to reproduce the main characteristics of the data. Working hard on the details taught me a lot about the specific kinds of noise in play. Then, using that knowledge, I was able to reason more deeply about the signal.
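
    For concreteness, here is my guess at that kind of first pass, with invented parameters; a Brownian motion with a trend is just a cumulative sum of drift plus Gaussian noise.

    ```r
    # My guess at that kind of first-pass simulation, with invented parameters:
    # Brownian motion with a trend is a cumulative sum of drift plus Gaussian noise.
    set.seed(99)
    n_steps <- 500
    drift   <- 0.05
    vol     <- 1
    signal  <- cumsum(drift + vol * rnorm(n_steps))

    plot(signal, type = "l",
         main = "Simulated Brownian motion with trend",
         xlab = "time step", ylab = "value")
    # Comparing this (and summaries such as the distribution of increments)
    # against the real series is what shows whether drift plus Gaussian noise
    # is enough, or whether other kinds of noise are in play.
    ```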
