Following our recent post on the latest Dishonestygate scandal, we got into a discussion of the challenges of simulating fake data and performing a pre-analysis before conducting an experiment.
You can see it all in the comments to that post—but not everybody reads the comments, so I wanted to repeat our discussion here. Especially the last line, which I’ve used as the title of this post.
Raphael pointed out that it can take some work to create a realistic simulation of fake data:
Do you mean to create a dummy dataset and then run the preregistered analysis? I like the idea, and I do it myself, but I don’t see how this would help me see if the endeavour is doomed from the start? I remember your post on the beauty-and-sex ratio, which proved that the sample size was far too small to find an effect of such small magnitude (or was it in the Type S/Type M paper?). I can see how this would work in an experimental setting – simulate a bunch of data sets, do your analysis, compare it to the true effect of the data generation process. But how do I apply this to observational data, especially with a large number of variables (number of interactions scales in O(p²))?
I elaborated:
Yes, that’s what I’m suggesting: create a dummy dataset and then run the preregistered analysis. Not the preregistered analysis that was used for this particular study, as that plan is so flawed that the authors themselves don’t seem to have followed it, but a reasonable plan. And that’s kind of the point: if your pre-analysis plan isn’t just a bunch of words but also some actual computation, then you might see the problems.
In answer to your second question, you say, “I can see how this would work in an experimental setting,” and we’re talking about an experiment here, so, yes, it would’ve been better to have simulated data and performed an analysis on the simulated data. This would require the effort of hypothesizing effect sizes, but that’s a bit of effort that should always be done when planning a study.
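To make that concrete, here is a minimal sketch of a fake-data design analysis for a simple two-group experiment. Everything specific in it is an assumption for illustration: the hypothesized effect of 0.1, the standard deviation of 1, the 50 participants per group, the function names, and the use of a normal approximation (|estimate| > 1.96 standard errors) in place of a t test. It simulates many datasets, runs the same analysis on each, and compares the "statistically significant" estimates to the true effect built into the simulation.

```python
import math
import random
import statistics


def simulate_experiment(true_effect, sd, n_per_group, rng):
    """Simulate one two-group experiment; return (estimate, standard error)."""
    control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
    treated = [rng.gauss(true_effect, sd) for _ in range(n_per_group)]
    estimate = statistics.mean(treated) - statistics.mean(control)
    se = math.sqrt(
        statistics.variance(treated) / n_per_group
        + statistics.variance(control) / n_per_group
    )
    return estimate, se


def design_analysis(true_effect, sd, n_per_group, n_sims=2000, seed=1):
    """Estimate power, Type S rate, and Type M ratio by simulation.

    Assumes true_effect is nonzero (the Type M ratio divides by it).
    """
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate, se = simulate_experiment(true_effect, sd, n_per_group, rng)
        # Normal approximation to the usual 5% two-sided test.
        if abs(estimate) > 1.96 * se:
            significant.append(estimate)
    power = len(significant) / n_sims
    if not significant:
        return {"power": 0.0, "type_s": float("nan"), "type_m": float("nan")}
    # Type S: share of significant estimates with the wrong sign.
    type_s = sum(1 for e in significant if e * true_effect < 0) / len(significant)
    # Type M: average exaggeration of significant estimates.
    type_m = statistics.mean(abs(e) for e in significant) / abs(true_effect)
    return {"power": power, "type_s": type_s, "type_m": type_m}


# Hypothetical design: small effect, modest sample.
result = design_analysis(true_effect=0.1, sd=1.0, n_per_group=50)
print(result)
```

If the power comes out low and the Type M ratio comes out well above 1, that is the simulation telling you, before any data are collected, that a significant result from this design would badly overstate the effect.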
For an observational study, you can still simulate data; it just takes more work! One approach I’ve used, if I’m planning to fit data predicting some variable y from a bunch of predictors x, is to get the values of x from some pre-existing dataset, for example an old survey, and then just do the simulation part for y given x.
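A rough sketch of that approach, under loudly stated assumptions: the list `existing_x` is a made-up stand-in for a predictor column pulled from an old survey, there is only one predictor rather than a bunch, and the intercept, slope, and noise level are hypothetical values you would have to argue for in a real pre-analysis plan. The x values are held fixed; only y given x is simulated, and the planned regression is refit on each fake dataset.

```python
import random
import statistics

# Hypothetical stand-in for predictor values taken from a pre-existing
# dataset (say, respondent age in an old survey). In practice you would
# load the real column instead of typing numbers in.
existing_x = [23, 31, 35, 38, 42, 44, 47, 51, 55, 58, 62, 67, 70, 74, 79]


def ols_slope(x, y):
    """Least-squares slope for a single predictor."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx


def simulate_y(x, intercept, slope, sigma, rng):
    """Simulate y given the fixed x values under an assumed linear model."""
    return [intercept + slope * xi + rng.gauss(0.0, sigma) for xi in x]


# Assumed data-generating values (all hypothetical).
assumed_intercept, assumed_slope, assumed_sigma = 1.0, 0.05, 2.0

rng = random.Random(2024)
slopes = []
for _ in range(1000):
    fake_y = simulate_y(existing_x, assumed_intercept, assumed_slope,
                        assumed_sigma, rng)
    slopes.append(ols_slope(existing_x, fake_y))

mean_slope = statistics.mean(slopes)
sd_slope = statistics.stdev(slopes)
print(f"assumed slope: {assumed_slope}")
print(f"mean estimated slope: {mean_slope:.3f}")
print(f"sd of estimated slopes: {sd_slope:.3f}")
```

Comparing the spread of the estimated slopes to the assumed slope gives a sense of how noisy the planned analysis would be with the predictors you actually have. With many predictors and interactions the same idea applies, but the simulation and the fitting step both take more work.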
Raphael replied:
Maybe not the silver bullet I had hoped for, but now I believe I understand what you mean.
To which I responded:
There is no silver bullet; there is no golden path to discovery. One of my problems with all the focus on p-hacking, preregistration, harking, etc. is that I fear that it is giving the impression that all will be fine if researchers just avoid “questionable research practices.” And that ain’t the case.
Again, this is not a diss on preregistration. Preregistration does one thing; it’s not intended to fix bad aspects of the culture of science such as the idea that you can gather a pile of data, grab some results, declare victory, and go on the TED talk circuit based only on the very slender bit of evidence that you seem to have been able to reject the hypothesis that the data came from a specific random number generator. That line of reasoning, where rejection of straw-man null hypothesis A is taken as evidence in favor of preferred alternative B, is wrong, but it’s not preregistration’s fault that people think that way!
P-hacking can be bad (but the problem here, in my view, is not in performing multiple analyses but rather in reporting only one of them rather than analyzing them all together); various questionable research practices are, well, questionable; and preregistration can help with that, either directly (by motivating researchers to follow a clear plan) or indirectly (by allowing outsiders to see problems in post-publication review, as here).
I am, however, bothered by the focus on procedural/statistical “rigor-enhancing practices” of “confirmatory tests, large sample sizes, preregistration, and methodological transparency.” Again, the problem is if researchers mistakenly think that following such advice will place them back on that nonexistent golden path to discovery.
So, again, I recommend making assumptions, simulating fake data, and analyzing these data as a way of constructing a pre-analysis plan, before collecting any data. That won’t put you on the golden path to discovery either!
All I can offer you here is blood, toil, tears and sweat, along with the possibility that a careful process of assumptions/simulation/pre-analysis will allow you to avoid disasters such as this ahead of time, thus avoiding the consequences of: (a) fooling yourself into thinking you’ve made a discovery, (b) wasting the time and effort of participants, coauthors, reviewers, and postpublication reviewers (that’s me!), and (c) filling the literature with junk that will later be collected in a GIGO meta-analysis and promoted by the usual array of science celebrities, podcasters, and NPR reporters.
Aaaaand . . . the time you’ve saved from all of that could be repurposed into designing more careful experiments with clearer connections between theory and measurement. Not a glide along the golden path to a discovery; more of a hacking through the jungle of reality to obtain some occasional glimpses of the sky.
I’d like to put in a word about the challenges of comprehensive reporting of results. Yes, yes, yes: report all the model permutations you tried out, not as “robustness” checks but as a composite. But that said, how to structure such a report is complicated, I think. A giant table, presumably sequestered in an online supplement, is not very useful.
What interests me are ways to probe the mass of results a bit. For instance, maybe a particular covariate or set of covariates comes out differently under different specifications, and those differences correspond to changes in the outcome variable(s) of interest. Or maybe there are questions about how a particular variable is measured (its correspondence to the construct of interest), and different specifications suggest different interactions between measurement error and the fate of other covariates. And so on.
Perhaps there’s a literature out there that offers guidance on how to structure a comprehensive reporting of results, but if there is I haven’t seen it. Or maybe the structural strategies are too case-dependent to discuss in a general way.
Regarding the random number generator idea, I think of it as something like a random seed in many programming languages.
https://www.w3schools.com/python/ref_random_seed.asp
Also, the whole simulated data / fake data thing seems just like test data to me. I create a program to do something, and then I create data to test that it works correctly, normal cases, edge cases, bad data, etc.
Also, preregistration to me seems like vacation planning. I make a plan, a route where to go, where to eat, where to sleep. I don’t have to stick to the plan, particularly if something unexpected happens, but I have a plan to compare against to make sure I have enough time to get from A to B and so on.
Andrew W.:
Yes, exactly. Simulating data and pre-analysis plans are not just for psychology experiments; they’re general ideas that apply in many areas of life.
One other option to avoid spurious results is to think about what variables you’re testing, what variables you might not be able to control (most of them) and, when you make claims about the relationship between variables, give yourself a few showers to think about what assumptions are implicit in that claim, and if there’s a hope in hell that they will hold true under any reasonable set of conditions.
I’m strongly in favor of Andrew’s presumably fantastic statistical advice. Nonetheless, you can save a lot of trouble by thinking through the research design with a hyper-critical mindset before you do any statistics at all. The worst part about it is that if you don’t find your erroneous assumptions, you could get great results that are obviously wrong to any sane person and you won’t know it until some rodent comes on a blog and blasts your conclusions to bits, despite your amazing fake data, jaw-dropping bootstrapped Monte Carlo lassoed parameterization, and even your socially laudable pre-registration.
You might find this interesting, Andrew. It does a deep dive into Millikan’s old experiment of the electron and contrasts his measurement and estimation techniques to those in social science.
https://philarchive.org/rec/WILRAM-7
Kj:
Yes, I agree with the general points made in this article. Indeed, Eric Loken and I have repeatedly argued for the importance of within-person comparisons in psychology experiments that are designed to estimate treatment effects on individuals. I have this plan sometime to write a paper with a title such as, “The before-after study as a paradigm for causal inference.” Also, I try to avoid using the term “randomized experiment,” instead saying “controlled experiment.” We tried to make this point clear in Regression and Other Stories but I’m sure we could do better.
The most frequent question I get about doing a priori fake data simulation is: how do I work out what parameter values to put into the simulation? The frequentists would just plug in point values, but people don’t realize the importance of existing data (even data from unrelated studies). I tried to write up a chapter in our Bayes book, The Art and Science of Prior Elicitation, to unpack for the reader how one could extract information from prior work, using several different approaches. A lot of this is inspired by the book Uncertain Judgements, which is about prior elicitation. Really, doing fake data simulation properly amounts to being able to obtain some priors, whether by eliciting them from oneself, by deriving them from data, or both.
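One way to read the contrast between plugging in a point value and using a prior, sketched in code. This is an illustration under assumptions, not anyone's actual procedure: the prior (normal with mean 0.5 and standard deviation 0.3), the outcome standard deviation of 1, the 100 participants per group, and the function names are all hypothetical, and the outcome standard deviation is treated as known so the estimate can be drawn directly from its sampling distribution.

```python
import math
import random


def is_significant(true_effect, sd, n_per_group, rng):
    """Draw one estimate of a two-group difference and test it at 5%."""
    se = sd * math.sqrt(2.0 / n_per_group)  # sd treated as known
    estimate = rng.gauss(true_effect, se)
    return abs(estimate) > 1.96 * se


def power_at_point(effect, sd, n_per_group, n_sims, rng):
    """Power when the effect is fixed at a single plugged-in value."""
    hits = sum(is_significant(effect, sd, n_per_group, rng)
               for _ in range(n_sims))
    return hits / n_sims


def power_over_prior(prior_mean, prior_sd, sd, n_per_group, n_sims, rng):
    """Power averaged over effect sizes drawn from a normal prior."""
    hits = 0
    for _ in range(n_sims):
        effect = rng.gauss(prior_mean, prior_sd)  # a new effect each time
        hits += is_significant(effect, sd, n_per_group, rng)
    return hits / n_sims


rng = random.Random(7)
point = power_at_point(0.5, 1.0, 100, 5000, rng)
averaged = power_over_prior(0.5, 0.3, 1.0, 100, 5000, rng)
print(f"power at the point value: {point:.2f}")
print(f"power averaged over the prior: {averaged:.2f}")
```

With these made-up numbers, the averaged figure comes out noticeably lower than the point-value figure, because the prior puts real weight on effects much smaller than 0.5. Where the prior itself should come from is the elicitation question raised in the comment above.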
The difference is that when frequentists plug in point values it is definitely objective, whereas all this ‘prior elicitation’ is subjective funny business ;)