Preregistration: what’s in it for you?

Chris Chambers pointed me to a blog by someone called Neuroskeptic who suggested that I preregister my political science studies:

So when Andrew Gelman (let’s say) is going to start using a new approach, he goes on Twitter, or on his blog, and posts a bare-bones summary of what he’s going to do. Then he does it. If he finds something interesting, he writes it up as a paper, citing that tweet or post as his preregistration. . . .

I think this approach has some benefits but doesn’t really address the issues of preregistration that concern me—but I’d like to spend an entire blog post explaining why. I have two key points:

1. If your study is crap, preregistration won’t fix it. Preregistration is fine—indeed, the wide acceptance of preregistration might well motivate researchers to not do so many crap studies—but it doesn’t solve fundamental problems of experimental design.

2. “Preregistration” seems to mean different things in different scenarios:

A. When the concern is multiple comparisons (as most notoriously in that Bem ESP study), “preregistration” would require completely specifying—before the data are collected—the experimental design, data collection protocols, data exclusion rules, and selecting ahead of time the comparisons to be performed and the details of the statistical analysis.

B. When the concern is the file drawer (the idea that Gelman or some other researcher performs lots of studies, and we want to hear about the failures as well as successes), “preregistration” is a “bare bones summary” that can be tweeted.

I think we can all agree that Preregistration A and Preregistration B are two different things. There’s no way I could specify all my data processing and analysis decisions for a study in only 140 characters!

Neither my point 1 nor my point 2 is an argument against preregistration! Sure, preregistration doesn’t solve problems of experimental design. It doesn’t cure cancer either, but that doesn’t mean it’s useless. Still, I think it’s important to be clear what sort of preregistration we’re talking about in any particular case, and what we expect preregistration to be doing.

1. Preregistration is fine but it doesn’t solve your problem if studies are sloppy, variation is high, and effects are small—especially if you try to study within-person effects using between-person designs

Neuroskeptic has a more recent post expressing a noncommittal view of the claim that hormonal changes during the menstrual cycle impact political and religious beliefs. Given that political science is (I assume) not among Neuroskeptic’s areas of expertise, I respect his or her decision to hold off judgment on the issue. However, given that this is an area of my expertise, I can assure Neuroskeptic that the original published claim on this topic is indeed (a) unsupported by the researchers’ data, for reasons of multiple comparisons (see here for much discussion), and (b) extremely implausible given what we know about the stability of public opinion, especially in recent elections. Of this study and a recent nonreplication, Neuroskeptic writes:

I know of only one way to put a stop to all this uncertainty: preregistration of studies of all kinds. It won’t quell existing worries, but it will help to prevent new ones, and eventually the truth will out.

I think this is overstated and way too optimistic. In the case of the ovulation-and-voting study, the authors have large measurement error, high levels of variation, and they’re studying small effects. And all this is made even worse because they are studying within-person effects using a between-person design. So any statistically significant difference they find is likely to be in the wrong direction and is essentially certain to be a huge overestimate. That is, the design has a high Type S error rate and a high Type M error rate.
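
To make that concrete, here is a minimal simulation sketch (the effect size and standard error below are made-up numbers, not taken from the ovulation-and-voting data) of the Type S and Type M error rates that arise when a small effect is estimated with a lot of noise:

import numpy as np

# Hypothetical numbers: a small true effect measured with a large standard error.
rng = np.random.default_rng(0)
true_effect = 0.5
standard_error = 4.0
n_sims = 100_000

estimates = rng.normal(true_effect, standard_error, n_sims)
significant = np.abs(estimates) > 1.96 * standard_error  # "statistically significant" estimates

type_s = np.mean(np.sign(estimates[significant]) != np.sign(true_effect))
exaggeration = np.mean(np.abs(estimates[significant])) / true_effect

print(f"P(significant)        = {significant.mean():.3f}")
print(f"Type S rate | signif. = {type_s:.2f}")
print(f"Type M (exaggeration) = {exaggeration:.1f}x")

With these made-up numbers, roughly a third of the statistically significant estimates have the wrong sign, and on average they overstate the true effect by more than an order of magnitude.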

2. Should I preregister my research?

As I wrote above, the question is not just whether I should preregister but also how I should do it.

I do think that blogging my research directions ahead of time (or archiving time-stamped statements, if I want to let my ideas develop in secret for a while) would be a good idea. It would give people a sense of what worked, what didn’t work, and what I’m still working on. In many ways I do a lot of this already—blogging my half-formed thoughts which, years later, turn into papers—but I can see the argument for doing this more formally.

But this sort of blogging would not address multiple comparisons problems. For example, if Bem had tweeted ahead of time that he was doing an ESP study and focusing on prediction of sexual images, or if Beall and Tracy had tweeted ahead of time that they were investigating the hypothesis that fertile women were more likely to wear red, no, in neither case would that have been enough. In both cases, there were just too many data-processing and data-analysis choices that would not fit in that research plan.

So, what about the full preregistration? Not what goes in a tweet but a complete step-by-step plan of how I would choose, code, and analyze the data? That might have been a good idea for Bem, Beall, and Tracy, or maybe not, but it would certainly not work for me in my political science research (or, for that matter, in my research in statistical methods). Many of my most important applied results were interactions that my colleagues and I noticed only after spending a lot of time with our data.

Just to be clear: you can see that I’m not disagreeing with Neuroskeptic. He or she suggested that I do tweet-like announcements of my research plans, and I agree that’s a good idea. It’s just not the same as preregistration, at least not as commonly understood.

P.S. This discussion published last year is relevant too.

46 thoughts on “Preregistration: what’s in it for you?”

  1. “Many of my most important applied results were interactions that my colleagues and I noticed only after spending a lot of time with our data.”

    I don’t really see how this is a reason that you shouldn’t pre-register, as opposed to a reason that you should. Isn’t that exactly the critique you have made of those other studies? That you believe they made data-analytic and reporting decisions after they looked at the data, and that doing so biased the results?

    • Sanjay:

      Recall the above terminology, with two types of preregistration:

      A. When the concern is multiple comparisons (as most notoriously in that Bem ESP study), “preregistration” would require completely specifying—before the data are collected—the experimental design, data collection protocols, data exclusion rules, and selecting ahead of time the comparisons to be performed and the details of the statistical analysis.

      B. When the concern is the file drawer (the idea that Gelman or some other researcher performs lots of studies, and we want to hear about the failures as well as successes), “preregistration” is a “bare bones summary” that can be tweeted.

      I could (and possibly should) do B, but there’s no way I could do A, because I don’t know what analysis I want to do until I see the data! Given that “Neuroskeptic” wrote about preregistration that could be tweeted, I think he or she was referring to B as well. But the literature on multiple comparisons is all about A (see this paper, for example).

      Another issue is that I should (and, in fact, do) discuss my modeling choices in the papers I write. That might be called “postregistration.”

      • I understand. I was talking about real pre-registration (your option A). Motyl & Nosek gave themselves analytic flexibility in their first study; in the “real” story, when they collected replication data they discovered that the results obtained that way were unreliable. I’m asking, why would the results you obtain when you leave yourself analytic flexibility be different?

        • Sanjay:

          No, I could not do option A for my studies. I was only able to do these studies because I had complete flexibility in my analyses. If you take a look at the linked papers, you’ll see that preregistration of type A would’ve been impossible or meaningless.

        • In such cases, pre-registration could be valuable if it merely consists of someone saying, “Here’s my plan, and it isn’t very specific — any details I come up with will be post hoc, after seeing the data.”

          You, of course, are admirably straightforward in admitting that you had “complete flexibility” in your analysis, and that much of it arose post hoc. Many other people, however, analyze data post hoc but then write it up as if they had a clear theory and model in advance, as if to hide all of the flexibility that they had. Pre-registration as noted above would at least make it clearer how such models and analyses arose.

        • What differentiates this kind of post-hoc analysis flexibility from the “researcher degrees of freedom” / multiple comparisons problem that often gets criticized? Is it the magnitude of the effect? Are multiple comparisons primarily a criticism where small effects and marginal, underpowered tests are concerned?

  2. Many thanks for the post! I will reply in full on my blog, but some quick thoughts:

    You draw a useful distinction between preregistration to avoid multiple comparisons and preregistration to avoid the file drawer. These are indeed two of the main benefits of preregistration, and they’re two separate issues. However in the case of a researcher who is planning to conduct an analysis on some data, it might not be easy to separate them in practice.

    The example I gave in my post was someone who tweets “I am going to conduct a factor analysis on X Y and Z”. Now I think that the intended audience of that tweet would most likely interpret that as meaning that s/he might end up trying a number of different approaches that all fall under the heading of factor analysis, and that s/he might try various different parameters (e.g., how to decide on the number of factors, whether to use rotation).

    In other words – the hypothetical tweet was (implicitly) preregistering an intent to operate within a certain space of possibilities, with boundaries but also with some freedom. How much freedom would hopefully be understood by the researcher’s peers. This form of preregistration would not eliminate the problem of multiple analyses/comparisons, but it would reduce it: it’s not all or none. As to whether it would reduce it enough, that’s for readers to judge for themselves. There is certainly a balance to strike between ‘too flexible’ and ‘not flexible enough’.

    On the hormones paper, perhaps I was unclear when I said that “I know of only one way to put a stop to all this uncertainty: preregistration…” The uncertainty I was referring to was specifically the concerns about researcher degrees of freedom and the file drawer. My argument was that we cannot rely on post-hoc approaches that seek to detect and correct for these biases – the only way to be sure is to prevent the problem from arising in the first place.

    I didn’t mean to imply that preregistration would have solved all of the problems with that paper or with any other paper. On the contrary I agree with you: a study can be bad and if you preregister it, it’s still bad. And good preregistered protocols can be poorly executed. But the validity of the hormones claims was not the focus of my post.

  3. Let me come at this at right angles. I find that you poli sci/medical people are very concerned about statistics, confounding factors, Bonferroni, etc. But what about just research efficiency? And there are many fields where statistical fishing is not per se a problem (e.g., explorations of chemistry, physics, etc.).

    Perhaps writing down the planned study helps you think about it better, be more efficient, etc., rather than just plunging in (realize a lot of researchers are grad students, too). Obviously, you should still be open to finding new things in the data, etc.

  4. it would certainly not work for me in my political science research (or, for that matter, in my research in statistical methods). Many of my most important applied results were interactions that my colleagues and I noticed only after spending a lot of time with our data.

    What’s the dividing line between this & fishing for a hypothesis? Sincere question; not trying to be snarky.

    Isn’t it a common refrain about BigData that if you trawl long enough you sure will notice some pattern? How does one distinguish a true, persistent robust effect from an artifact?

    Or how is what Andrew described different from trying various clothes attributes (color, type, length, brand, price etc.) till you just achieve borderline significance in a correlation for female fertility?

    • Rahul:

      One key point of our “garden of forking paths” paper is that multiple comparisons can be a problem even if researchers aren’t trying out a whole bunch of analyses. Even if researchers do only one analysis, it can be contingent on the data (see the little simulation sketch at the end of this comment).

      I think my own work is different, in part because we do discuss various other analyses that we’ve tried. To explain more would be a lot of effort, so let me right now just refer you to the linked papers. Take a look and then let me know if you have specific questions.
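
      A quick toy simulation of the forking-paths point (my own made-up example, with two pure-noise outcomes): each simulated researcher runs exactly one test, but which test gets run depends on the data, so the nominal 5% error rate no longer holds.

      import numpy as np

      rng = np.random.default_rng(1)
      n, n_sims = 50, 20_000
      false_positives = 0

      for _ in range(n_sims):
          treat = rng.normal(size=(n, 2))    # two outcomes, treatment group, no true effect
          control = rng.normal(size=(n, 2))  # two outcomes, control group
          diff = treat.mean(axis=0) - control.mean(axis=0)
          se = np.sqrt(treat.var(axis=0, ddof=1) / n + control.var(axis=0, ddof=1) / n)
          z = diff / se
          chosen = np.argmax(np.abs(z))      # analyze whichever outcome looks more interesting
          false_positives += abs(z[chosen]) > 1.96

      print(f"False positive rate: {false_positives / n_sims:.3f} (nominal 0.05)")

      The false positive rate comes out around 10% rather than 5%, even though each simulated researcher honestly reports the single analysis that was run.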

      • I think Rahul might be asking a similar question to what I was asking above. I have read the Gelman & Loken forking paths paper; in fact, I began having these kinds of questions after reading it.

        G&L was framed in terms of classical statistics and p-values, but it seems like a more general problem. Any time analytic decisions are being made contingent on the data, there is an opportunity for the researcher to introduce bias. I think that’s why Rahul and I both picked up on that quote.

        The main counter-argument you raise in Gelman & Loken is a pragmatic one — “we cannot easily gather data on additional wars, or additional financial crises, or additional countries…” Fair enough. But that doesn’t seem to address the core problem, it just rules out one of the solutions (replication).

        You also make the point that you disclose what analyses you’ve tried. That’s good if it allows a reader to detect the potential for bias in them, but that may be hard to do. And isn’t it possible that there are even more modeling decisions — potentially biasing ones — that you could have made without being aware of them, ala the forking paths paper?

        • I think good exploratory research would be rehabilitated through pre-registration of non-exploratory studies.
          Gelman can admit to exploring data and still is believed, because he’s known to be a good statistician, so he knows about the ins and outs of cross-validation, about multiple comparisons etc.
          Many people choose not to admit to exploratory research, because they did not do good exploratory research, but were driven by a desire to find a p < .05, stopping as soon as necessary.

          Now, some may do good exploratory research and they may "show the work" (discuss other analyses etc.) but they might not be believed because they don't have Gelman's reputation. Or, their story might not look as "clean" when all work is shown, which might work against them in publication.
          So they have an incentive to let their exploratory research pose as confirmatory, which gives confirmatory research a worse name.

          But nowadays they can also replicate their experiment after pre-registering their exact analysis plan to gain trust.
          And if they use existing data, or data on singular events (like Gelman), they could split their data, do exploratory analyses on one half, and confirm in the other (http://osc.centerforopenscience.org/2014/02/27/data-trawling/). That's what I'll aim to do, because I have neither time nor aptitude to become well-known as an eminent and trustworthy statistician, but this way trust can be fostered anyway.
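
          A minimal sketch of that split-half workflow (the file and column names are just placeholders): fish as much as you like on the exploratory half, write down the one analysis you settle on, then run exactly that analysis once on the held-out half.

          import numpy as np
          import pandas as pd

          df = pd.read_csv("survey.csv")  # placeholder dataset
          rng = np.random.default_rng(42)

          mask = np.zeros(len(df), dtype=bool)
          mask[rng.choice(len(df), size=len(df) // 2, replace=False)] = True
          explore, confirm = df[mask], df[~mask]

          # Exploratory analyses go here, using `explore` only; the single chosen
          # analysis (here, a simple group comparison on placeholder columns) is
          # then run once on `confirm`.
          print(confirm.groupby("treatment")["outcome"].mean())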

        • Ruben:

          Splitting the data can work but sometimes in political science the datasets are small and we can’t afford to split. Also often the data are public so you can’t really hide any of the data. In such settings (and more generally) I think a better approach is to analyze all the data (as discussed in my paper with Jennifer and Masanao) and do partial pooling of all the estimates. I much prefer this to performing a single estimate and then giving it a flat prior that I don’t believe.

          That said, in my own analyses I don’t always do hierarchical modeling for the parameters of interest. But I’d like to think that the analyses I’ve published could be viewed as approximations to fully Bayesian analyses.
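
          Here is a toy version of the partial-pooling idea (the estimates and standard errors are invented, and a crude moment-based estimate of the between-group variance stands in for a full Bayesian or hierarchical fit): every comparison gets reported, and each raw estimate is shrunk toward the common mean according to how noisy it is.

          import numpy as np

          est = np.array([22.0, 8.0, -3.0, 15.0, -1.0, 6.0, 18.0, 2.0])  # raw estimated comparisons
          se  = np.array([ 5.0, 6.0,  7.0,  5.0,  6.0, 7.0,  6.0, 7.0])  # their standard errors

          mu = np.average(est, weights=1 / se**2)                # common mean
          tau2 = max(0.0, np.var(est, ddof=1) - np.mean(se**2))  # crude between-group variance
          shrink = tau2 / (tau2 + se**2)                         # 0 = complete pooling, 1 = none

          pooled = mu + shrink * (est - mu)
          for raw, p in zip(est, pooled):
              print(f"raw {raw:6.1f} -> partially pooled {p:6.1f}")

          The noisiest estimates get pulled furthest toward the common mean; nothing gets dropped, and no single comparison gets a flat prior.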

        • Yes, I agree – this is the better approach in that case (and of course starting out with a good experimental or study set up and within a sound statistical framework are not at odds with embedding those in a properly incentivised science environment).
          Maybe the point is that Andrew Gelman doesn’t have a believability crisis, but psychology does.
          You may find it harder to empathise with why anyone would want the small increment in believability that would come from announcing something as general as “I’m going to do science to this” or a more detailed but still open-to-interpretation analysis plan (as is common in clinical trials, where pharma giants want to convince readers of their drug’s utility and have to overcome a large amount of distrust).

  5. This is about like passing a law saying “stealing” isn’t a crime if you tell everyone you’re going to do it beforehand.

    I’m going to go ahead and predict that this will have zero effect on science. No doubt it will boost somebody or other’s worthless career, but 20 years from now Economists, Political Scientists, Sociologists and Psychologists won’t be able to predict anything more than they can today.

    • Entosophy,

      I’m not really convinced that predicting behavior is the end-all, be-all of improved social science methods. At least two other things seem important:

      1 – Measuring the impact of some change or program that has occurred;

      2 – Narrowing down classes of models and sets of reasonable parameter values that provide insight into human decision making and behavior.

      As for (1), I think there have been a lot of advancements recently. I think we are much better at disentangling, ex post, the effects of various changes in the world. In many cases, these effects will differ when a similar program/policy is enacted elsewhere (because we don’t understand all the structural elements of the underlying human decision making), but we’re at least able to identify good and bad policies faster and more convincingly.

      As for (2), I think that, in the medium run, we are in fact getting better at understanding the kinds of forces that shape people’s decision making. The models we test and calibrate may not be great at predicting individual decisions in the future, but they are useful for understanding which kinds of trade-offs tend to dominate others on aggregate. That is a useful contribution to policy and program design as well.

      I also think there is some improvement in prediction, mostly the kind that comes from doing (1). The minimum wage literature is a good example of this, I think. But that one is gonna be tested for us soon – we’ll see if the new increase in minimum wage (should it occur) really does have the essentially 0 impact on employment we (many of us) think it will have. Those priors were formed thanks to advances in causal econometrics, and if they are right, that’s some confirmation that direct improvements in statistical methodology can improve predictive power (older statistical models regularly estimated meaningful negative effects).

      Now, I agree we’ll never be able to nail down things like “What is that person’s reservation wage” or “How important is spite in shaping people’s decision making,” but we might get some idea of whether reservation wages or spite are going to influence how people respond to some thing happening in the world – whether the effect is likely to be of a meaningful magnitude.

      I’m not sure… I give it a fair chance I’m over-optimistic on both (1) and (2), or that my analysis here lacks an important piece of nuance, and that the distinctions I’m making are nonsense. That said, I do think the improvements in causal inference and applied econometrics (for instance) are bearing real fruit (even if that fruit is more like those perfectly yellow store bananas in the USA, which are 1,000 times prettier and 100,000 times less tasty than the small, weirdly shaped, incredibly sweet bananas you get in, say, South Asia).

      • jrc,

        You can nuance around this all you want, but people didn’t want Economists’ self-proclaimed “advances” or “deeper insight” or whatever euphemism for “failure” is used; they wanted an accurate 4-year prediction of the unemployment rate after Obama’s stimulus passed. They want Psychologists to be able to predict when a stressed soldier will commit suicide; they want political scientists to predict when governments will topple.

        I guess you can claim these are impossible, but Physicists can deliver predictions at least as impressive as these, and if you do claim they’re impossible then you’re going to look pretty silly if at some point in the future someone can predict these things.

        And I really do mean failure. Currently, the US R&D budget alone funds the equivalent of somewhere between 15-20+ Manhattan projects EVERY YEAR (according to the NSF and using an inflation-adjusted cost for the entire multi-year Manhattan project).

        What exactly do we get for that? If your answer is the “minimum wage literature” then I say we got ripped off.

      • I’ve been trying to make a succinct statement of what bugs me about all this stuff. Here is my latest attempt:

        Scientists whose livelihood depends on the current status quo make the very strong assumption that modern science only needs tweaking because it’s been phenomenally successful.

        But it HASN’T been phenomenally successful. People in the past (before some point between 1900 and 1950) achieved vastly more, with dramatically less, and with surprisingly few of the accoutrements of modern science (p-values, peer review, preregistration, tenure, 100-author research papers and whatever else).

      • And please, for the love of God, don’t say Physicists got lucky and had all the easy questions. Splitting the atom isn’t easy compared to predicting unemployment.

        • I wouldn’t say that. I would say that atoms don’t have a culturally and historically determined sense of self, don’t make decisions based on personal desires forged through that history and their own idiosyncratic tastes, and can’t just be weird if they want to be weird. Human beings, though – we get to choose our own adventure.

          I agree totally that economics is far from “close” to right. I don’t even know what that would mean in economics – there isn’t some one, right way in which people are. But I don’t think that means economics is less valuable as a practice. And I don’t think it means that no insights can be gained from the quantitative evaluation of responses in the aggregate to economic incentives or policies.

          We just can’t hope to ever predict what one individual person will do in any given situation. And that is something that I actually find very reassuring about humanity. You know, because we’re all weird and unpredictable and, well, all too human.

          That said – you’re right of course that there have been long periods of stagnation, where tons and tons of money has been wasted on people basically writing difficult mathematical models and complex statistical procedures that sound good on paper but are really useless in understanding the world. And that is probably related to a lot of the incentives and structures of modern academia (along with the ambition and hubris of many a researcher) that you have been criticizing.

        • Splitting atoms is probably hard; I haven’t a clue how to do it. But predicting unemployment (I think you have some hard, very precise, unconditional form of prediction, far into the future, in mind) must be very hard as well. Smart dudes have tried it (also physicists, mathematicians, etc.) and they have failed to do so (I am referring to the implicit definition of prediction – my guess of what exactly you meant).

          I would not refer to luck, but I believe that it is indeed hard, if not nearly impossible, to make predictions in the sense of “on March 20, 202x, the unemployment rate in Michigan will be x.xx%, with a standard error of 0.0x%.”
          There is in fact some good economic work on that, arguing why this is the case…

          There is more to “social” science than prediction alone. I thought the same applied to physics, physiology and some of the other serious disciplines as well.

        • Well Louis, I’ve got three predictions for you:

          (1) It is possible to predict unemployment for the US years into the future.

          (2) This feat won’t be achieved by anyone with a Ph.D in Econ from a top 15 Econ school.

          (3) 100 years after someone does achieve this, they will still be teaching it to Econ students. All those nuanced papers with their breakthrough “insights” will be forgotten – not even read by historians of science.

        • And not only will we be able to near-magically predict people’s preferences for labor and leisure, when new things will get invented and their effect on economic growth, global agricultural production and resource extraction patterns, and the political will of the citizenry of all countries, it will all be done without this future Newton of Econ even needing any of the insights that came before – she won’t even have bothered reading those papers to see what dead ends people had already tried. She’ll just be able to see the future, through math! Standing on shoulders of giants is for suckers.

        • Predicting aggregates isn’t that hard. Physicists can predict all kinds of global properties of the Sun without even coming close to predicting the future state of any one single particle therein.

          “she won’t even have bothered reading those papers to see what dead ends people had already tried”

          I’m sure she will be well aware that her professor’s papers are dead ends.

        • I think unemployment is kind of a suckers measure, let’s go with labor force participation. I predict 2020 we’ll see a labor force participation between 58 and 70 %. I leave it as an exercise for the reader to refine this prediction further. ;-)

        • I note with amusement that Marcial Losada’s Ph.D dissertation was a model of U.S. seasonal employment variations using Fourier transforms. (No, me neither.)

  6. In fact I’ll go further and state Wilson’s First Axiom:

    Axiom I: If your quest for scientific truth can be derailed by taking more looks at the same data than you had originally planned, then your quest is irredeemably flawed in some way.

    Rather than avoiding looking at the data too hard, your time would be better spent fixing the flaw.

    • So what saves Bayes from the problem of multiple looks at the data?

      I have a vague notion/conjecture that one can’t be misled by moving from an initial Bayesian analysis to one in which the first analysis is “nested” in some sense. The idea is basically that the nested model can be viewed as a constrained version of the encompassing model. Relaxing the constraint (i.e., relaxing a modelling assumption) can only weaken one’s ability to draw strong conclusions from the available information, for basically the same reason that removing assumptions in a logical argument can never increase the set of reachable conclusions.

      I haven’t really thought deeply about if/how to make this notion crisp (or about looking for a counter-example).

      • Corey:

        A bad Bayesian analysis (for example, picking a single comparison and then using a flat prior) will be as bad as a bad classical analysis. A more correct Bayesian analysis would use strong priors to account for the fact that there aren’t so many huge effects floating around. A hierarchical model can make sense too; that’s the point of my paper with Jennifer and Masanao.

        • AG:

          I had the move from the two simpler no-pooling and complete-pooling models to exchangeable parameters (= multilevel modelling) in mind as an example of an encompassing model that nests (in this case) two simpler models within it (okay, okay, on the edge of the parameter space, but close enough).

          I feel safe making that move because the math is so clear; my vague conjecture is about how safe such model expansion moves will prove to be in general.

        • I’d be surprised if they were safe.
          Not sure whether your reasoning is meant to apply to this, but if you initially model your data to be exchangeable (i.i.d. given a parameter), this means that you put prior probability zero on the situation being non-exchangeable. If then after you’ve seen the data you decide that you should introduce an autoregression parameter to allow for some time series structure, you raise the prior probability from zero to something larger than zero (it may actually be one if you don’t put point mass on the parameter being exactly zero). I always thought that I’d be happy if somebody worked out what this implies but my intuition is that this is a big move and by no means harmless. And how can a “prior” be called a prior if it was only introduced after the data were seen?
          (Actually I have a toy example involving binary data illustrating how de Finetti’s bettor can incur sure loss if priors are changed after having observed data in order to allow for dependence/non-exchangeability. But I’m not sure whether you Jaynesians may accept this as a problem.)

        • So let me try to spell out the situation you’re proposing in more detail: I have grouped data; there are T groups indexed by t, and each group contains, let’s say, normal data with mean mu_t, variance 1. I start with an exchangeable prior for the mu_t, but after I look at the data I notice a hint of autocorrelation in the estimates of mu_t, just by eye. So I expand the model in such a way that the exchangeable model corresponds to an autoregression model with a delta prior at zero for the autoregression parameter. In the expanded model, the new prior is, say, flat in the region of stable processes, which in the usual parameterization is (-1,1) IIRC. Is this about right?

          If so, then yes, my intuition clashes with yours. But I should spell out that the kind of safety I’m talking about is safety relative to apophenia-induced model expansion, that is, I’m imagining safety w.r.t. the case where there is no autocorrelation, but I think I’m seeing some. I’m not talking about protection against cases where the data look autocorrelated just by (mis)chance to the point where, say, Spanos’s model misspecification tests would reject the null hypothesis of no autocorrelation.

        • It’s not exactly what I had in mind but it is close enough I guess.
          Your intuition may be right but it may depend quite a bit on how exactly “safe” is defined (I guess you wouldn’t bother much about de Finetti’s bettor) or what is meant by “protection”, and how exactly the situation is set up.
          Somebody with time should look at these things in detail…

        • It seems obvious to me that the move is safe in an asymptotic consistency sense (to the extent that the expanded model posterior inherits asymptotic consistency from the likelihood). I guess finite sample safety would have to be defined relative to some Mayo-style error probability — for preference, Type M error.

      • Corey, suppose you’re looking at just a simple difference in means between two groups. You start with a model in which your prior for the mean is normal with mean 0 and sd = 1000, and then you move to a model where the prior is normal with mean 1 and sd = 4.

        Is the second model “nested” in the first, because all of its possibilities are contained within the high probability region of the previous model? If so, couldn’t this lead you astray if in fact your more specific prior is wrong?

        just trying to see if I understand what you’re talking about.

        • No, the move from sd=4 to sd=1000 would be an expansion — the normal prior is a kind of soft constraint, and the larger sd is, the weaker the “force” with which the constraint is imposed.

        • Oh wait, you asked if the second model is nested in the first. The way I’m using that phrase, yes it is. But I’m postulating that the move from the second to the first is the safe one, not vice versa.

      • Corey,

        I do think something like that is basically true (with some poetic license perhaps) if the initial model is “correct” in my very own special Bayesian sense.

        For example if mu is the mass of one of the neutrinos and a 99% Credibility Interval computed from the prior contains the true mass, then after taking some data, the 99% Credibility Interval computed from the posterior should be smaller but still contain the true mass.

      • I was going to expand on Andrew’s comment (I really love the method he describes). Perhaps a more everyday solution is to follow Laplace and Fisher(?) and only use significance as a cue to further research. That plus switching to interval estimates rather than point estimates are easily the two most effective and practical solutions, although not optimal.

        Big picture wise, people have fundamentally misunderstood what’s causing the problem. The usual explanation is that spurious effects should be seen some percentage of the time, so if you look at enough things you see some spurious effects which appear to be significant.

        If taken literally and really thought through, this would bring all statistical inference to halt.

        What’s actually happening is this. Statistical models (as usually conceived) are really a kind of summary of a specific data set. Those tests which Frequentists believe “confirm” the model and supposedly tell you something about the real world actually do nothing more than verify that the model is a good summary of that specific data set.

        So in general, verifying a model by itself tells you absolutely nothing about the underlying physical reality. Frequentists and most Bayesians (in truth), mistakenly thinking they have confirmed a fact about the underlying physical process, make strong predictions about what they’ll see in the future, which turn out to be wildly wrong for the most part.

        There are exceptions to this, the big one being if you’re sampling from a finite population. In that case you really can expect some “significant” features of one sample to appear in another sample under a wide range of circumstances. That happens for the simple reason that many properties are going to hold for almost any sample no matter what. That’s why Frequentists are so desperate to envision all “random variables” as samples from a (mythical) population even though that’s hardly ever the case.

        If I’m right about this, then preregistration will have a negligible effect.

        • I believe this is correct. No study on its own should be taken as evidence for a physical reality; this requires multiple independent replications resulting in similar parameter estimates. However, even that is not enough. There must be consistent data AND the ability to predict something that would otherwise be unlikely (a comet will appear April 6th at 5pm at location X). If a field is not capable of this it should be recognized as such. The purpose of that research is description and exploration; there is no hypothesis worth falsifying or corroborating.

    • Box suggests he learned this from his father-in-law http://www.jstor.org/stable/2286841

      If you are going to be able to do some randomisation (and hence can anticipate well how to do the analysis), it might be best to pre-register to ensure you get the full value from that randomisation. Sort of like an elected official going along with fraud insurance rather than claiming, “I’m honest, no need to waste taxpayers’ money.”

  7. Andrew:

    There is a rich literature on trial protocols (the WHO Recommended Format for a Research Protocol; the CONSORT statement, which can be used for writing the protocol, not just reporting findings; Altman’s “Practical Stats,” Chapter 15; etc.) and, more recently, on observational protocols (e.g., STROBE).

    IMHO the distinction you refer to drives at the difference between exploratory trials and theory testing (though nothing is really black and white). Presumably exploratory trials begin with some overarching hypothesis. It is easy to adapt existing recommendations to both approaches and gain some protection from multiple hypotheses known ex ante and from the file drawer problem.

    I like to eat my own dog food, so I have tried to register my research. In 2012 I tried registering an observational study in Poli Sci. At the time there were no registries, so I ended up emailing the protocol to my advisers. We got a null result, but I can assure you we could have spun a wonderful story if we had done just a little fishing. That was my first job market paper.

    Next we tried to register an experimental protocol. Again, at the time there was only EGAP. We registered there, and, after some convincing, at the International Standard Randomised Controlled Trial Number Register.

    Writing a protocol is a lot of work so we thought we would try to get it published. After all, in a trial situation the most valuable peer review is _before_ you go into the trial. We could find no takers in Poli Sci. In public health it was also hard.

    Two journals refused to consider our manuscript because it was a social experiment, not a (drug) clinical trial (as if health were not a social science!). In the end we got BMC Public Health to consider it. That was in January of 2013. As of today we are still waiting for a decision, which kind of defeats the point.

    In conclusion, I enjoyed the process. Learned a lot, and I think it made me a better researcher. However I did not get any publications out of it. More generally, the culture, incentives, and infrastructure are still precarious in Poli Sci.

    So why do I write protocols? Because for me a protocol is all about creating the conditions for Nature to speak – unperturbed by the researcher. To dramatize it, it’s almost like a pilgrimage to the oracle at Delphi. And now, as we begin analyzing the data according to plan, we wait for Nature to speak, or, more likely, mumble…

  8. Hi, thank you for the information. I have a question about whether a preregistered study can be changed or not. I planned a longitudinal study in which students will use technological devices for 3 weeks (4 days each week, 1 hour each day). But I asked some students (15 of them; I need 54-75 students for the study), and they found 3 weeks very long, so I decided on a 2-week study. My supervisor’s opinion, though, is to register the 3-week version and then start the study; if students will not attend, we can change our preregistration. Could you tell me, is that possible? Should we change our plan before registration? Thank you!

