“The distinction between exploratory and confirmatory research cannot be important per se, because it implies that the time at which things are said is important”

This is Jessica. Andrew recently blogged in response to an article by McDermott arguing that pre-registration has costs, like being unfair to junior scholars. I agree with his view that pre-registration can be a pre-condition for good science but not a panacea, and I was not convinced by many of the reasons presented in the McDermott article for being skeptical about pre-registration. For example, maybe it’s true that requiring pre-registration would favor those with more resources, but the argument given seemed quite speculative. I kind of doubt the hypothesis that many researchers are trying out a whole bunch of studies and then pre-registering and publishing on the ones where things work out as expected. If anything, I suspect pre-pre-registration experimentation looks more like researchers starting with some idea of what they want to see and then tweaking their study design or definition of the problem until they get data they can frame as consistent with some preferred interpretation (a.k.a. design freedoms). Whether this is resource-intensive in a way that disadvantages some researchers seems hard to comment on without more context. Anyway, my point in this post is not to further pile on the arguments in the McDermott critique, but to bring up some more nuanced critiques of pre-registration that I have found useful for getting a wider perspective, and which all this reminded me of.

In particular, arguments that Chris Donkin gave in a 2020 talk about his work with Aba Szollosi on pre-registration (related papers here and here) caught my interest when I first saw the talk and have stuck with me. Among the several points the talk makes, one is that pre-registration doesn’t deserve privileged status among proposed reforms, because there’s no single strong argument for what problem it solves. The argument he makes is NOT that pre-registration isn’t often useful, both for transparency and for encouraging thinking. Instead, it’s that bundling up a bunch of problems that pre-registration supposedly addresses (e.g., p-hacking, HARKing, the blurred boundary between exploratory and confirmatory data analysis, i.e., EDA and CDA) misdiagnoses the issues in some cases, and risks losing the nuance in the various ways that pre-registration can help.

Donkin starts by pointing out how common arguments for pre-registration don’t establish privileged status. For example, if we buy the “nudge” argument that pre-registration encourages more thinking which ultimately leads to better research, then we have to assume that researchers by and large have all the important knowledge or wisdom they need to do good research inside of them, it’s just that they are somehow too rushed to make use of it. Another example: the argument that we need controlled error rates in confirmatory data analysis, and thus a clear distinction between exploratory and confirmatory research, implies that the time at which things are said is important. But if we take that seriously, we’re implying there’s somehow a causal effect of saying what we will find ahead of time that makes it more true later. In other domains, however, like criminal law, it would seem silly to argue that because an explanation was proposed after the evidence came in, it can’t be taken seriously.

The problem, Donkin argues, is that the role of theory is often overlooked in strong arguments for pre-registration. In particular, the idea that we need a sharp contrast between exploratory versus confirmatory data analysis doesn’t really make sense when it comes to testing theory. 

For instance, Donkin argues that we regularly pretend that we have a random sample in CDA, because that’s what gives it its validity, and the barebones statistical argument for pre-registration is that with EDA we no longer have a random sample, invalidating our inferences. However, given how much rides on this assumption that we have a random sample in CDA, post-hoc analysis is critical to confirming that we do. We should be poking the data in whatever ways we can think up to see if we can find any evidence that the assumptions required of CDA don’t hold. If they don’t hold, we shouldn’t trust any tests we run anyway. (Of course, one could pre-register a bunch of preliminary randomization checks. But the point seems to be that there are activities that are essentially EDA-ish that can be done only once the data come in, challenging the default.)
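To make the flavor of those post-hoc checks concrete, here is a minimal sketch (not from Donkin’s talk; all column names and numbers are invented) of the kind of EDA-ish poking one might do once the data arrive, e.g., checking covariate balance across arms before trusting any confirmatory test:

```python
# Minimal sketch of a post-hoc "did randomization give us comparable groups?" check.
# All column names and numbers are hypothetical.
import numpy as np
import pandas as pd

def balance_check(df, treatment, covariates):
    """Standardized mean differences of pre-treatment covariates across arms."""
    treated = df[df[treatment] == 1]
    control = df[df[treatment] == 0]
    rows = []
    for cov in covariates:
        pooled_sd = np.sqrt((treated[cov].var() + control[cov].var()) / 2)
        rows.append({"covariate": cov,
                     "smd": (treated[cov].mean() - control[cov].mean()) / pooled_sd})
    return pd.DataFrame(rows)

# Fake data for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "treat": rng.integers(0, 2, 500),
    "age": rng.normal(40, 10, 500),
    "income": rng.lognormal(10, 1, 500),
})
print(balance_check(df, "treat", ["age", "income"]))
# Large standardized differences would be evidence against the "random sample /
# random assignment" assumption that the confirmatory test leans on.
```

The point isn’t this particular check; it’s that this kind of poking can only happen once the data come in, which is the EDA-ish activity the parenthetical above is pointing to.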

When we see pre-registration as “solving” the problem of EDA/CDA overlap, we invert an important distinction related to why we expect something that happened before to happen again. The reason it’s okay for us to rely on inductive reasoning like this is that we embed the inference in theory: the explanation motivates why we expect the thing to happen again. Strong arguments for pre-registration as a fix for “bad” overlap imply that this inductive reasoning is the fundamental first principle, rather than a tool embedded in our pursuit of better theory. In other words, taking pre-registration too seriously as a solution implies we should put our faith in the general principle that the past repeats itself. But we don’t use statistics because they create valid inferences; we use them because they are a tool for creating good theories.

Overall, what Donkin seems to be emphasizing is that there’s a rhetorical risk in too easily accepting that pre-registration is the solution to a clear problem (namely, that EDA and CDA aren’t well separated). Despite the obvious p-hacking examples that come to mind when we think about the value of pre-registration, buying too heavily into this characterization isn’t necessarily doing pre-registration a favor, because it hides a lot of nuance in the ways that pre-registration can help. For example, if you ask people why pre-registration is useful, different people may stress different reasons. If you give pre-registration an elevated status for the supposed reason that it “solves” the problem of EDA and CDA not being well distinguished, then, similar to how any nuance in the intended usage of NHST has been lost, you may lose the nuance of pre-registration as an approach that can improve science, and increase preoccupation with a certain way of (mis)diagnosing the problems. Devezer et al. (and perhaps others I’m missing) have also pointed out the slipperiness of placing too much faith in the EDA/CDA distinction. Ultimately, we need to be a lot more careful in stating what problems we’re solving with reforms like pre-registration.

Again, none of this is to take away from the value of pre-registration in many practical settings, but to point out some of the interesting philosophical questions thinking about it critically can bring up.

26 thoughts on "“The distinction between exploratory and confirmatory research cannot be important per se, because it implies that the time at which things are said is important”"

  1. Along these lines, see my recent post on exploratory and confirmatory data analysis, where I write:

    So-called exploratory and confirmatory methods are not in opposition (as is commonly assumed) but rather go together. The history on this is that “confirmatory data analysis” refers to p-values, while “exploratory data analysis” is all about graphs, but both these approaches are ways of checking models.

    This is a point that came up in my papers from 2003 and 2004 on EDA and Bayesian model checking.

  2. > we have to assume that researchers by and large have all the important knowledge or wisdom they need to do good research inside of them, it’s just that they are somehow too rushed to make use of it.

    In my review experience, the latter assertion is indeed a huge problem. Lots of papers with poorly constructed research designs. Maybe it is unique to my field (Comp Sci) with a strong culture of publishing fast and often.

    I don’t know if I can answer the first part (they have the wisdom inside) but the way we use pre-registration at least forces study designs to pass peer review and feedback first.

  3. “In other domains, however, like criminal law, it would seem silly to argue that because an explanation was proposed after the evidence came in, it can’t be taken seriously.”
    If the warrant was signed after the premises were searched, whatever evidence was found would not be admissible in court?

    I agree with Neil Ernst that pre-registration incentivizes us to think about study design if there is a review of the submitted design. It is a sad reality that many need additional incentives to do this. The worst example I can think of is when I was invited to comment on a protocol for an experiment that was a repeat of an experiment the same researchers had done before. It was so vague I couldn’t help but ask whether it was a protocol at all. Some mixture of “we know what we are doing” and “we want to keep our options open”, which together lead to “it is just a formality and we don’t have time for this.” I believe they knew what they were going to do, but that belief did not follow from reading the protocol…

  4. This is interesting. I can see how it could be problematic if interpreted poorly. The discussion seems to all be about methods of analysis and the importance of good theory.

    However, I could see someone taking this to mean that a hypothesis derived beforehand from theory and then confirmed is just as valuable as a serendipitous finding that modifies a theory post hoc using similar statistical methods. Or am I missing something, and is this really saying that those are equal in the context of a single experiment?

    • Thank you Psyoskeptic. I had a very similar thought…

      Yes, it would be absurd to argue a “causal effect of saying what we will find ahead of time that makes it more true later”, but that’s not why confirmatory evidence holds more sway. For one, confirmatory evidence implies the existence of whatever additional evidence was used to generate the theory that is being confirmed. In contrast, exploratory findings are basically always the initial evidence for a theory.

      I too would be keen to hear if I’ve got the wrong end of the stick.

      Cheers
      Andrew

      • +1

        Exploratory evidence: “Wow! Look at the surprising correlation I found when I compared all the columns in my data set yesterday!”

        Confirmatory evidence: “Wow! Look what I found after spending a (day, week, month, year) researching the potential actual causal relationship between the two variables and developing the relevant data that bears on the question!”

        The fact that some apparent relationship emerges from “exploratory” analysis of some random and necessarily infinitesimally small subset of all the data in the universe doesn’t imply that there is an important relationship in the real world beyond the dataset. The fact that you have **some** data doesn’t mean it’s the **right** data to solve the problem – or even the right data to give the correct indication of which way the relationship works.

        Returning to the question of whether heavy traffic is “safer”, you can’t solve that question with traffic fatality or accident data. You need to know the risk profile of the driving public, since there is a good chance it varies with time of day; and how humans’ diurnal habits affect accident risk, since that’s a pretty obvious part of the equation. If you don’t have that, your “exploratory” data from traffic fatalities is still exploratory – you’re missing critical data that bears on the question.
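        A toy sketch of the “compared all the columns” scenario (the setup and numbers are invented): even when every column is pure noise, screening all pairwise correlations will reliably turn up something that looks striking.

        ```python
        # Screen all column pairs of pure noise and report the most "exciting" one.
        import numpy as np

        rng = np.random.default_rng(1)
        n_rows, n_cols = 100, 40
        data = rng.normal(size=(n_rows, n_cols))   # no real relationships anywhere

        corr = np.corrcoef(data, rowvar=False)     # 40 x 40 correlation matrix
        np.fill_diagonal(corr, 0.0)
        i, j = np.unravel_index(np.abs(corr).argmax(), corr.shape)
        n_pairs = n_cols * (n_cols - 1) // 2
        print(f"largest |r| among {n_pairs} pairs: {abs(corr[i, j]):.2f}")
        # With 780 unrelated pairs, a "largest" correlation around 0.3 is routine
        # despite there being zero signal in the data.
        ```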

        • Chipmunk said: “developing the relevant data that bears on the question!”

          I should add to that: developing the relevant **knowledge** that bears on the question, since not all knowledge is available in convenient dataset form.

      • It’s not absurd in the least. Think about expectancy effects (subject and experimenter), placebo effects, reporting & framing & anchoring effects, social desirability bias, biases like healthy user bias, or the effects of pervasive systematic biases. (Why, indeed, bother with blinding if it’s absurd to imagine that knowing the hypothesis before could affect anything…? “We don’t need to blind because if we take that seriously we’re implying there’s somehow a causal effect of saying what we will find ahead of time that makes it more true later.”)

        And like Simons, I am particularly baffled by the claim that criminal law is a great example of the absurdity of caring about time – when criminal law cares intensely about when things were known, who knew what when, what a ‘reasonable’ person would think, what an officer could see or know prior to a search, arrest, or other action, what incentives are at play for all entities, and does expensive things like the doctrine of fruit of the poisonous tree (which frees a lot of obviously guilty people).

        • Further to the criminal point, there is a difference between:

          – Jack is suspected of the crime. They compare Jack’s DNA test result with DNA from hair from the scene. It matches Jack.
          – They DNA test everyone in the city. They compare the results with DNA from hair from the scene. It matches Jack. Jack is suspected of the crime.

          (Admittedly this example works better with older DNA testing technology.)
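          A rough back-of-the-envelope version of why the order matters (all numbers invented for illustration): with a small per-comparison false-match probability, testing one pre-identified suspect is very different from trawling a whole city.

          ```python
          # Hypothetical numbers for illustration only.
          p_false_match = 1e-6     # chance an unrelated person's DNA "matches" the scene sample
          n_database = 1_000_000   # everyone tested in the city-wide sweep

          # Test one pre-identified suspect: probability of an innocent match.
          p_single = p_false_match                                 # 0.000001

          # Trawl the database: expected innocent matches, and chance of at least one.
          expected_matches = n_database * p_false_match            # ~1.0
          p_at_least_one = 1 - (1 - p_false_match) ** n_database   # ~0.63

          print(p_single, expected_matches, p_at_least_one)
          ```

          So “it matches Jack” carries very different weight in the two scenarios, even though the matching evidence itself looks identical.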

  5. I see no problem with pre-registration and support it, but IMO causal inference from statistics has far bigger problems that don’t require pre-registration to fix. They just require people to follow sound basic scientific principles:

    1) obviously non-representative populations. The most common one is university students (at least hundreds if not thousands of studies)

    2) very small samples (items in “Why We Sleep”);

    3) using survey data with low reliability (items in “Why We Sleep”);

    4) having no theoretical relationship between the variables other than “Wouldn’t it be (neat/bad) if….” (power pose)

    5) obviously not-realistic study conditions (I gave people a pretend million dollars to invest in the stock market for one week, here’s what it means for stock markets)

    I’m sure there are more. I can only imagine how much more reliable causal inference would be if everyone did sound fundamental science first, then worried about statistics.

    Really if someone wants to work out ways to improve causal inference, they should put together a list of common design mistakes, then have a regular contest to score papers on how well they meet these criteria (citing specific reasons for failure), then give an award every week or every month to the paper with the most fundamental scientific failings. Kind of like the Darwin awards.

  6. Lurker, first time commenter
    I would describe it as a useful applied tool to improve science for those already interested in doing it better, by improving both self-awareness of the design and analytical degrees of freedom involved, and accountability towards the research community.
    A useful analogy is to think of it as calorie counting as an aid to fighting excess weight: first of all, it makes you more conscious of your eating; then, if shown to somebody else, it can help explain why a diet has led to some given results and can be used to improve the future design of your diet.

  7. Thought provoking post, much of which I agree with. However, in the spirit of thinking about this with philosophical clarity, I want to make a small nitpick about the following line of reasoning:

    “If we buy the “nudge” argument that pre-registration encourages more thinking which ultimately leads to better research, then we have to assume that researchers by and large have all the important knowledge or wisdom they need to do good research inside of them, it’s just that they are somehow too rushed to make use of it.”

    This nudge argument does not entail such a strong implication. It can be true that following some ritual procedure (pre-registration) leads to a reliably directed change in a downstream variable (better research) without the downstream variable thereby satisfying a criterion for attributing some property (being good research) to it. As an analogy, weak chess players make better moves in classical chess (in which they have more time to think) than in blitz, but that doesn’t mean that their performance in classical chess satisfies some criterion for being good, nor that they have all the important chess knowledge or wisdom.

  8. Picking or fitting a definite model is in effect saying the evidence rules out all other models. Similarly, accepting a hypothesis in effect says the other hypotheses can’t be true. In both cases, you’re acting as though there’s no chance when there is some.

    Call this practice “shedding probabilities.” There seems to be a general phenomenon in statistics that could be stated:

    If your practical methods include “shedding probability”, your statistical philosophy will claim the import of evidence strongly depends on when you looked at it.
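    A hypothetical numerical illustration of what “shedding” amounts to (numbers invented, not from the comment): keeping posterior model uncertainty and conditioning on a single selected model give different answers about the next observation.

    ```python
    # Invented posterior probabilities and per-model predictions.
    p_model = {"A": 0.7, "B": 0.3}           # what the evidence leaves on each model
    pred_given_model = {"A": 0.9, "B": 0.2}  # P(next observation is a "success" | model)

    # Keeping the uncertainty: average the predictions over the models.
    pred_averaged = sum(p_model[m] * pred_given_model[m] for m in p_model)  # 0.69

    # "Shedding": act as if the single best model were certainly true.
    best = max(p_model, key=p_model.get)
    pred_shed = pred_given_model[best]                                      # 0.90

    print(pred_averaged, pred_shed)
    ```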

  9. ” we have to assume that researchers by and large have all the important knowledge or wisdom they need to do good research inside of them”

    I’m not sure this is a valid criticism. It’s one thing to (think you) know in your mind what you are doing. It’s another thing to show and explain it to others. Writing it out clearly for others to see:

    a) May expose hidden logical flaws to the designer (which can then be fixed);

    b) May make it easier for others to find flaws and critique them;

    c) May make people less inclined to put forward weak designs whose flaws would otherwise be less evident.

  10. “In other domains, however, like criminal law, it would seem silly to argue that because an explanation was proposed after the evidence came in, it can’t be taken seriously.”

    The problem is in what claims/explanations the evidence counts towards. In court cases, you typically have only two sides, the prosecution vs. the defense, and new evidence coming in counts towards one or the other’s case. Also, the major claims aren’t conditional on the evidence: the defendant either is or isn’t guilty, and while the relative probabilities of those claims may change as new evidence comes in, the claims themselves don’t. However, in science, you have an infinite (or at least very large) number of possible hypotheses, and in the case of HARKing, new evidence may lead to entirely new claims.

    HARKing would be like having a court case where each time you find a new piece of evidence, you can swap the defendant for a new suspect, and do this indefinitely.

    • And here’s another: Preregistered, result went in the opposite direction from expected, p-value was less than 0.1, published in PNAS and received two awards. All the preregistration in the world won’t stop you if you’re (a) determined to declare victory and (b) in possession of a message that influential people want to hear.

      • Yeah, also on this one, review the experiment. It’s hardly a test-tube chemistry experiment with purified reagents. The poor people that survey targets are exposed to are actors. They might or might not look the part, but how are their teeth? How do they smell? Lots of street people have that permanent smell of a damp smoky Levi’s jacket.

        Even in the photo that identifies the petitioner, actor and target, it’s hardly obvious that the target even saw the poor actor. I mean, if this were Jane Goodall studying chimps, the film would be rolling full time and it would be reviewed carefully to prove that the target identified the actor and thus could be influenced by the actor’s presence.

        It’s better than most setups I’ve seen in social science papers, but it’s still substantially short of what needs to be done to demonstrate that the assumptions that underlie the experiment have any validity at all. I read a book on dog psychology that has a better foundation than this. The researcher filmed dogs for many hours at dog parks and studied every minute of the film very carefully, over and over, to verify and compare the different types of behavior.

    • > As the authors themselves say, there isn’t any of the original substance left in homeopathy – it’s all been diluted away.

      Not if, after shaking at each step, you only take what is in/on the bubbles at the top or what is stuck to the side of the container after dumping it out. It could be they accidentally came across a way to concentrate certain contaminants and/or microdoses of the original substance based on hydrophobicity, etc.

      When I looked into it a few years ago, the vast majority of papers did not actually check how much of the substance remained. There was at least one that did and saw that it plateaued, but I can’t find it at the moment. Others saw concentration of silicates from the glass vials.

      Anyway, you have to distinguish between the theory for what is going on vs what was actually done. In this case the standard debunking is just as silly as the proposed theories for how it may work.

      • Anon:

        This is a common problem. Rejection of the null hypothesis (for example, zero difference between treated and control outcomes) is taken as positive evidence in favor of a preferred alternative, without any seeming recognition that rejection of A gives you not-A, it doesn’t give you B. As Lakeland put it, the null hypothesis is “a particular random number generator.”

        • Indeed, rejecting the “null” hypothesis typically tells us essentially nothing because there are too many possible explanations. And failing to reject it just means you didn’t have enough funding and/or your measurements are too imprecise.

          So it just measures the funding-weighted collective prior. I.e., how rich and well connected you are.
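          A quick simulation of that point (effect sizes and sample sizes invented): with the same test, a sizeable real effect can fail to reject at small n while a trivial effect rejects easily at huge n, so the verdict tracks resources and measurement precision as much as the science.

          ```python
          # Toy illustration: what a t-test "says" depends heavily on sample size.
          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(2)

          def two_sample_p(effect, n, sd=1.0):
              a = rng.normal(0.0, sd, n)
              b = rng.normal(effect, sd, n)
              return stats.ttest_ind(a, b).pvalue

          print(two_sample_p(effect=0.5, n=15))       # sizeable effect, tiny study: often p > 0.05
          print(two_sample_p(effect=0.02, n=200_000)) # trivial effect, huge study: routinely p < 0.05
          ```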
