An anonymous correspondent writes:
I’m a PhD student in psychology with a background in computer science. I have struggled with the morality of the statistical approaches for a while, until I discovered Bayesian statistics. It doesn’t solve everything, but at least I don’t have to bend my mind in so many weird ways.
I would like to ask a question. In the last few years, you seem to embrace preregistration, as can be seen for example in this blogpost. However, I haven’t found a way to convince my co-authors of this. The reason is that my PhD project is part of an outside collaboration. We have automated large parts of the data collection from questionnaires and wearables. This way, we gather lots and lots of data. However, given that we want to steer our data collection procedures as early as possible and don’t have much literature to build our hypotheses on, the project managers push for analyzing the data continuously (exploring the data). To me, this is a big red flag. However, I do see their points as well. As another argument, since we have so much data, we can save a lot of time on being meticulous before doing anything. So, I came up with a research protocol
1. Explore data and find result
2. Report preliminary result to client
3. Create preregistration
4. Verify the result on new data
5. If the result doesn’t hold anymore, go back to step 1
6. Report results to client
7. Write paper
Do you think this is still worthy to be called a “preregistration”? If not, how would you do it?
My reply: Sure, this sounds reasonable. Preregistration is a set of steps, it’s not anything precise. See for example my preregistered analysis here. It’s good to do this in a way that gives space for data exploration, because Cantor’s corner.
I agree with Andrew. To put it another way, a preregistration only means that this particular dataset has not been analyzed yet. Preliminary/pilot* data is a different dataset, so whether it’s been analyzed is irrelevant. In fact, the clearest preregistrations include the full analysis of pilot data.
* It’s important that the data that’s analyzed in the end does NOT include data from the preliminary dataset. They should have no overlap.
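One way to keep the two datasets from overlapping, when data arrive continuously, is to split on the time the preregistration was filed. The sketch below is only an illustration: the record fields (`record_id`, `collected_at`) and the function name are hypothetical, not anything from the correspondent's project.

```python
from datetime import datetime

def split_by_preregistration(records, preregistered_at):
    """Split records into exploratory (collected before the preregistration)
    and confirmatory (collected at or after it).

    Each record is assumed to be a dict with a 'collected_at' datetime;
    this layout is a hypothetical choice for the example.
    """
    exploratory = [r for r in records if r["collected_at"] < preregistered_at]
    confirmatory = [r for r in records if r["collected_at"] >= preregistered_at]
    return exploratory, confirmatory

# Made-up records for illustration only
records = [
    {"record_id": 1, "collected_at": datetime(2020, 1, 10)},
    {"record_id": 2, "collected_at": datetime(2020, 2, 5)},
    {"record_id": 3, "collected_at": datetime(2020, 3, 20)},
]
preregistered_at = datetime(2020, 3, 1)

exploratory, confirmatory = split_by_preregistration(records, preregistered_at)

# Sanity check: no record may appear in both sets
shared_ids = {r["record_id"] for r in exploratory} & {r["record_id"] for r in confirmatory}
assert not shared_ids
```

Splitting on time means every record lands in exactly one set, so the final analysis only touches data that did not exist when the hypothesis was written down.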
Instead of “find result”, it should say “guess explanation”, “generate hypothesis”, “devise theory”, “derive model”, or similar. Those all mean the same thing.
How “the result” has become conflated with “the hypothesis” in your thinking would be interesting to figure out.
But yea, the first step of sciencing is
1) Abduce an explanation for some observation(s). Better if you can come up with multiple alternatives. And it doesn’t matter how you do this. Exploring the data, praying, hallucinogens in the woods. Anything goes.
2) Now deduce a model/prediction from your explanation that tells what you expect to see in new data. More precise and surprising is better since this lets us distinguish between different explanations. The point of Bayes rule is to compare the relative fits of different models.
3) Then you collect new data and see how well the various models fit. This is where preregistration can benefit us.
If you get a bad fit, then there is a wrong assumption somewhere, but you can’t tell where, since ~(A & B) = ~A | ~B. I.e., the Duhem-Quine thesis.
If you get a good fit, you can say the model is consistent with the data, but you can’t rule out that some other model would be even better. I.e., affirming the consequent. This is why we want to populate the denominator of Bayes rule with as many explanations as possible.
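To make the point about the denominator concrete, here is a minimal sketch of Bayes rule over a set of candidate models, P(M_i | D) = P(D | M_i) P(M_i) / sum_j P(D | M_j) P(M_j). Everything in it is an assumption made for illustration: the data (14 successes in 20 trials), the three models (each a fixed success probability), and the equal priors.

```python
from math import comb

def binomial_likelihood(k, n, p):
    """Probability of k successes in n trials if the success rate is p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def posterior_over_models(likelihoods, priors):
    """Bayes rule: normalize likelihood * prior over all models considered."""
    joint = {name: likelihoods[name] * priors[name] for name in likelihoods}
    evidence = sum(joint.values())  # the denominator of Bayes rule
    return {name: value / evidence for name, value in joint.items()}

def compare(models, k, n):
    """Posterior over models with equal priors (an assumption of this sketch)."""
    likelihoods = {name: binomial_likelihood(k, n, p) for name, p in models.items()}
    priors = {name: 1 / len(models) for name in models}
    return posterior_over_models(likelihoods, priors)

k, n = 14, 20  # hypothetical new data

# With only two explanations on the table, A looks well supported.
post_two = compare({"A": 0.5, "B": 0.9}, k, n)

# Adding a third explanation to the denominator changes the picture.
post_three = compare({"A": 0.5, "B": 0.9, "C": 0.7}, k, n)

print(post_two)
print(post_three)
```

Model A fits the data equally well in both runs. Its posterior drops in the second run only because a better-fitting alternative was added to the comparison, which is why a good fit alone cannot establish a model.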
Would the steps mentioned above leave one more exposed to selection bias? My thought process is if you keep going back to the same pool of people (e.g. individuals who wear an Apple watch, MTurk users, etc.) then the procedure above is more likely to uncover something that is true for this subset in particular. There’s nothing wrong with this, but I’d be worried about generalizing to a larger population/external validity.
Theodosius Dobzhansky, the population geneticist, put what Andrew has called “IOTT” this way: “Heaven is, when the experiment’s over, you don’t need statistics to know what the answer is.”