Replin’ ain’t easy: My very first preregistration

I’m doing my first preregistered replication. And it’s a lot of work!

We’ve been discussing this for a while. Here’s something I published in 2013 in response to proposals by James Monogan and by Macartan Humphreys, Raul Sanchez de la Sierra, and Peter van der Windt for preregistration in political science, and here’s a blog discussion (“Preregistration: what’s in it for you?”) from 2014.

Several months ago I decided I wanted to perform a preregistered replication of my 2013 AJPS paper with Yair on MRP. We found some interesting patterns of voting and turnout, but I was concerned that perhaps we were overinterpreting patterns from a single dataset. So we decided to re-fit our model to data from a different poll. That paper had analyzed the 2008 election using pre-election polls from Pew Research. The 2008 Annenberg pre-election poll was also available, so why not try that too?

Since we were going to do a replication anyway, why not preregister it? This wasn’t as easy as you might think. First step was getting our model to fit with the old data; this was not completely trivial given changes in software, and we needed to tweak the model in some places. Having checked that we could successfully duplicate our old study, we then re-fit our model to two surveys from 2004. We then set up everything to run on Annenberg 2008. At this point we paused, wrote everything up, and submitted to a journal. We wanted to time-stamp the analysis, and it seemed worthwhile to do this in a formal journal setting so that others could see all the steps in one place. The paper (that is, the preregistration plan) was rejected by the AJPS. They suggested we send it to Political Analysis, but they ended up rejecting it too. Then we sent it to Statistics, Politics, and Policy, which agreed to publish the full paper: preregistration plan plus analysis.

But, before doing the analysis, I wanted to time-stamp the preregistration plan. I put the paper up on my website, but that’s not really preregistration. So then I tried arXiv. That took a while too: at first they were thrown off by the paper being incomplete (by necessity, as we wanted to first publish the article with the plan but without the replication results). But they finally posted it.

The arXiv post is our official announcement of preregistration. Now that it’s up, we (Rayleigh, Yair, and I) can run the analysis and write it up!

What have we learned?

Even before performing the replication analysis on the 2008 Annenberg data, this preregistration exercise has taught me some things:

1. The old analysis was not in runnable condition. We and others are now in a position to fit the model to other data much more directly.

2. There do seem to be some problems with our model in how it fits the data. To see this, compare Figure 1 to Figure 2 of our new paper. Figure 1 shows our model fit to the 2008 Pew data (essentially a duplication of Figure 2 of our 2013 paper), and Figure 2 shows this same model fit to the 2004 Annenberg data.

So, two changes: Pew vs. Annenberg, and 2008 vs. 2004. And the fitted models look qualitatively different. The graphs take up a lot of space, so I’ll just show you the results for a few states.

We’re plotting the probability of supporting the Republican candidate for president (among the supporters of one of the two major parties; that is, we’re plotting the estimates of R/(R+D)) as a function of respondent’s family income (divided into five categories). Within each state, we have two lines: the brown line shows estimated Republican support among white voters, and the black line shows estimated Republican support among all voters in the state. The y-axis goes from 0 to 100%.
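
For concreteness, here’s a minimal Python sketch of the raw version of that quantity. The figures themselves show estimates from the fitted multilevel model, not raw cell proportions, and the data frame and column names below are made up for illustration.

```python
# Minimal sketch of the raw version of the quantity being plotted: the two-party
# Republican share R/(R+D) by income category, for one state, among all voters
# and among white voters. Hypothetical columns in `poll`:
#   state    -- state abbreviation
#   income   -- family income, coded 1 (lowest) to 5 (highest)
#   vote_rep -- 1 if the respondent supports the Republican, 0 if the Democrat
#               (already restricted to major-party supporters)
#   white    -- True for white respondents
import pandas as pd
import matplotlib.pyplot as plt

def two_party_share(df: pd.DataFrame) -> pd.Series:
    """Raw R/(R+D) by income category; with vote_rep restricted to
    major-party supporters, the mean is the two-party Republican share."""
    return df.groupby("income")["vote_rep"].mean()

def plot_state(poll: pd.DataFrame, state: str) -> None:
    d = poll[poll["state"] == state]
    all_voters = two_party_share(d)
    white_voters = two_party_share(d[d["white"]])
    plt.plot(all_voters.index, 100 * all_voters, color="black", label="all voters")
    plt.plot(white_voters.index, 100 * white_voters, color="brown", label="white voters")
    plt.ylim(0, 100)  # y-axis from 0 to 100%
    plt.xlabel("family income category (1 = lowest, 5 = highest)")
    plt.ylabel("Republican share of two-party support (%)")
    plt.title(state)
    plt.legend()
    plt.show()
```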

From Figure 1:

[Figure 1: estimated Republican support by income for selected states, model fit to the 2008 Pew data]

From Figure 2:

[Figure 2: the same model fit to the 2004 Annenberg data, selected states]

You see that? The fitted lines are smoother in Figure 2 than in Figure 1, and they seem to be tied more closely to the data points. It appears as if this is coming from the raw data, which in Figure 2 seem closer to clean monotonic patterns.

My first thought was that this was something to do with sample size. OK, that was my third thought. My first thought was that it was a bug in the code, and my second thought was that there was some problem with the coding of the income variable. But I don’t think it was any of these things. Annenberg 2004 had a larger sample than Pew 2008, so we re-fit the model to two random subsets of the Annenberg 2004 data, and the resulting graphs (not shown in the paper) looked similar to Figure 2 above; they were still a lot smoother than Figure 1, which shows the results from Pew 2008.
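
The logic of that check is simple enough to sketch. Here’s a hedged Python version, where `fit_and_plot` is a hypothetical stand-in for re-fitting the actual model:

```python
# Hypothetical sketch of the sample-size check described above: re-run the same
# analysis on random subsets of the larger survey, drawn to match the size of
# the smaller one. `fit_and_plot` is a placeholder; in the real check it would
# re-fit the multilevel model and redraw the income/support graphs.
import numpy as np
import pandas as pd

def fit_and_plot(df: pd.DataFrame):
    """Placeholder for re-fitting the model to `df` and plotting the results."""
    return len(df)  # stand-in return value

def subsample_check(full_data: pd.DataFrame, target_n: int,
                    n_reps: int = 2, seed: int = 1):
    """Fit to n_reps random subsets of size target_n; if the smooth pattern
    persists in the subsets, sample size alone doesn't explain the difference."""
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(n_reps):
        rows = rng.choice(len(full_data), size=target_n, replace=False)
        results.append(fit_and_plot(full_data.iloc[rows]))
    return results
```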

We discuss this at the end of Section 2 of our new paper and don’t come to any firm conclusions. We’ll see what turns up with the replication on Annenberg 2008.

Anyway, the point is:
– Replication is not so easy.
– We can learn even from setting up the replications.
– Published results (even from me!) are always only provisional and it makes sense to replicate on other data.

31 thoughts on “Replin’ ain’t easy: My very first preregistration”

  1. Important lesson!

    Check out OSF for pre-registration. You can upload the paper, code, other materials, etc.
    A nice thing is the embargo: you can pre-reg now but go official with it after a set time. Less stressful than going official right away.

  2. Maybe I’m just being dense, but I don’t see what this has to do with model fitting. The data points (Figures 1 and 2) are measurements of two different things (2004 and 2008 voting preferences, correct?), so why should one expect that they should look similar, either in terms of values or smoothness? More generally, to know what to expect about their similarity or dissimilarity, wouldn’t one need some broader set of measurements, whose variance would tell you something? Or is that your point? (In which case it seems unfair to yourself to call it ‘replication.’) I think I’m missing the lesson here; admittedly I didn’t read the papers, except to briefly look to figure out the mystery of what your ‘y’ axis is. (Labels!)

    • Raghuveer:

      Yes, it’s a different election so the differences could be explained by that. Still, I was surprised. I wasn’t expecting the smoothness of the income-voting pattern to vary so much from 2004 to 2008, and it seems likely to me that the difference comes from methodology rather than a change in how people are voting.

      And I added some description of the plots to help out people who are too busy to click through to read the paper!

      • I, too, have not read either paper beyond trying to understand the figures you are comparing. But I do find it strange to see non-monotonic (with respect to income quintiles) relationships of these voting patterns – especially when the later data exhibits such monotonicity. I think monotonicity is more appropriate than “smoothness” in discussing this comparison. And I do think it strange that the 2004 data shows what it does (Wisconsin and Massachusetts as prime examples). My only thought concerns sample size and I’m not sure your use of a comparable random sample size really tests for that. Since the smaller random samples provide similar results to the full sample, it makes me think that the smaller 2004 sample was in some sense “nonrandom.”

        • Dale:

          Yes, I found the non-monotonic behavior puzzling too. It’s been bothering me for years, and this was one of my main motivations for doing the replication study.

          Just to clarify: Figure 1 above is 2008 and Figure 2 above is 2004.

    • > what to expect about their similarity or dissimilarity, wouldn’t one need some broader set of measurements
      If you know the area (have background knowledge) you _should_ expect some things to be similar (common) and others dissimilar.

      The broader set of measurements is needed to better assess those expectations.

      Then always keep in mind that this assessment of similar/dissimilar is “always only provisional and it makes sense to replicate on other data.”

  3. So, if I was interested in voting behavior such as in the figures, I’d develop a prior on the relevant effects from the first study, and update those estimates with the second dataset.

    Replication is then a non-issue, is it not? Isn’t it really just a pedagogical exercise to prove to people that variation is A BIG DEAL (if they didn’t know it in the first place)?

    • Garnett:

      No. Your Bayesian argument relies on the model being correct and there being no selection bias in the data preparation and analysis. The point of the preregistered replication is that we are concerned about problems with the model and selection bias in the data preparation and analysis.

      • “Your Bayesian argument relies on the model being correct and there being no selection bias in the data preparation and analysis.”

        But isn’t that _always_ a concern? If it’s a barrier, then what is the role of published data, or any evidence-based expertise in accumulating knowledge?

        I would develop a suitable prior that accounts for my best understanding of among-study variation (for whatever reason, including selection bias, etc.) and update my understanding of the effects with the new data. (A toy sketch of this kind of update appears below, after this thread.)

        • Garnett:

          Yes, these are always concerns, that’s why external validation is always a good idea! I agree that it’s good to develop a suitable model etc.; nonetheless your model can be flawed, and realistically it will be developed in the context of the data that you see, etc. We build our models, we fit them to data, then we see what went wrong, we fit to new data, see what goes wrong, etc. See chapter 6 of BDA.

        • OK, I misunderstood. I had thought of ‘REPLICATION’ as a means of proving or disproving (whatever that means) previous work, and not refining a model of the data collected from various sources (sensu BDA).
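
    To make the updating scheme suggested in this thread concrete, here is a toy sketch, with made-up numbers and strong simplifying assumptions: a single scalar effect, normal approximations, and an assumed between-study variance added to the prior so the first study doesn’t transfer with full confidence. It is an illustration of the commenter’s proposal, not the paper’s method.

    ```python
    def normal_update(prior_mean, prior_var, est, est_var):
        """Conjugate normal update: combine a normal prior with a normal estimate."""
        post_var = 1.0 / (1.0 / prior_var + 1.0 / est_var)
        post_mean = post_var * (prior_mean / prior_var + est / est_var)
        return post_mean, post_var

    # All numbers below are made up for illustration.
    tau2 = 0.02 ** 2            # assumed between-study standard deviation of 2 points
    study1 = (0.55, 0.01 ** 2)  # estimate and sampling variance from the first survey
    study2 = (0.50, 0.01 ** 2)  # estimate and sampling variance from the second survey

    # Treat study 1 as the prior, inflated by tau^2 so it isn't overconfident about
    # transferring to a different survey, then update with study 2.
    prior_mean, prior_var = study1[0], study1[1] + tau2
    post_mean, post_var = normal_update(prior_mean, prior_var, *study2)
    print(round(post_mean, 3), round(post_var ** 0.5, 3))  # posterior mean and sd
    ```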

  4. One of my students is about to seal (or embargo or whatever it’s called) her first preregistration on OSF. It wasn’t that hard, as far as I could tell. I tried the OSF interface itself today for the first time with her assistance, and it was surprisingly easy to use.

  5. Two questions: (1) Pre-registration for analysis of observational data already available to the researcher seems like an unhelpful exercise, no? I have seen such replications, but researchers already have the data and could easily pre-register a study they have effectively already completed. (2) Do all journals invite submission of pre-replication plans, i.e. did AJPS reject on grounds of the plan qua plan or for substantive reasons?

    • Brett:

      1. It’s helpful to me, maybe unhelpful to you. I guess helpfulness is in the eye of the beholder. I agree that it would be silly to preregister a study after you’ve done it. But that’s not what we’re doing: we’re preregistering the study before we’ve done it. One point of our paper is to show the amount of effort that was required to set up this preregistration.

      2. I can’t remember; I think AJPS may have rejected it on the grounds that we’re not really proposing to advance any ideas in political science.

      • At least some of the researchers engaging in rampant multiple hypothesis testing without adjustment are actively making the choice to venture into the pursuit of significance. I’d guess that, in the cases where p-hacking occurs, maybe 40% of people don’t know better, 15% are doing it without realizing it, and 45% are intentionally data mining (although I have little certainty in these estimates). Nevertheless, I really don’t see a future where replication of this type is useful if we define useful as below…

        It seems the source of the “usefulness” of pre-registration is signaling to consumers of that research about the validity of a null hypothesis (if we need to reject a null). The signal is costly for everyone, but more costly for would-be-p-hackers and more so when they have not yet seen/analyzed (at least some of) the data. I have little way of verifying that someone is not simultaneously testing or has previously tested a large number of hypotheses as they are about to submit their pre-registration plan.

        I would expect in the long run we will not see pre-registration of analyses of this sort in large numbers, nor published with more frequency than non-registered studies. On the other hand, pre-registration in instances where (at least part of) the data is not yet available will become more prevalent due to adverse selection.

        • Brett:

          I don’t know about those 40% or whatever, but in my case I’m not trying to “signal” anything, nor am I testing any hypotheses. I’m just trying to learn about the world. My colleagues and I published a paper, I was concerned about potential problems with the model and potential selection bias in our analysis, and so we decided to do some replication. Preregistration was a useful constraining step to remove certain possibilities of selection bias. Also, I hear a lot of talk about preregistration so I thought it was worth trying it out myself to see what all the fuss was about.

        • Certainly! I should have explicitly stated that this is not directed toward your work but rather in response to seeing this strategy adopted by people who have very different goals and histories.

        • “I have little way of verifying that someone is not simultaneously testing or has previously tested a large number of hypotheses as they are about to submit their pre-registration plan.”

          Sounds like yet another reason statistical significance (as usually used with a default nil null hypothesis; this qualifier should be taken as assumed below) is a worthless metric.

          1) It doesn’t mean what you* think.
          2) The null hypothesis has nothing to do with anyone’s theory, so whether it is true or false is irrelevant to any meaningful discussion.
          3) The threshold is an arbitrary convention that scales with how much it costs to collect data in a given field.
          4) You can’t trust it because normal scientific behavior renders the null hypothesis false (usually something like an iid assumption as is the case for multiple comparisons, etc).

          *Anyone who still thinks statistically significant deviations have any meaning at all either doesn’t know what a p-value is, or has never gone through the whole process of collecting and analyzing data themselves. I have never met anyone who would rely on statistical significance who doesn’t meet at least one of those two criteria.

        • >”I’m sorry you feel that way.”

          This is pretty representative of the responses I got trying to figure out why I was supposed to test “the null hypothesis” (I hold my current position after years of daily conversations with people doing it, dozens of emails, and hundreds of hours reading papers and searching). The stats guy at my grad school even told me that the only reason he taught that to medical personnel was that he would get fired otherwise, like his friends did.

          In the end my conclusion is that there is no formal logic behind doing it. There are only arguments from authority/consensus, confusion on some point or other that can’t be admitted because it would mean so many people’s careers have been wasted, or plain dismissive snark from those who don’t care that much about actually performing successful science.

    • “Pre-registration for analysis of observational data already available to the researcher seems like an unhelpful exercise, no? I have seen such replications, but researchers already have the data and could easily pre-register a study they have effectively already completed.”

      My view on this is that anyone who wants to game the system could, regardless of most technical safeguards. Instead, what’s important is the development of scientific norms for pre-registered studies. Once those norms are in place, then violation of them is scientific fraud. Many of the data fabricators (Marc Hauser, Hwang Woo-suk) were caught by students or collaborators. Most of us collaborate with numerous students and colleagues, and almost none of them would be willing to jeopardize their careers by participating in fraud.

      Should the norm for pre-registered studies be that the data have not been collected at the time of pre-registration? Or should it be that researchers not have examined the data in any way, shape or form? Or??

      Whatever norm is adopted, if all authors attest in the paper that the norm has been followed, then I think that’s as good as we can get (and probably as good as we need).

      • Let’s reserve the term “preregistration” for studies in which the analysis plan was posted before data were available for at least the outcome variable.

        If researchers want to formally claim that they developed their research design without formally peeking at already-available data, that’s fine. But select a different term to describe that because it’s not the same thing.

        • Lj:

          Our study is preregistered. But if you want to come up with your own special word for studies in which the analysis plan was posted before data were available for at least the outcome variable, feel free to do so!

          If you want to worry about ambiguities in language, you might start with the term “replication,” which means so many different things.

        • Hi Andrew,

          If preregistration refers to any study for which there is a claim that the research design is independent of the data analysis, then:

          * Research design posted before outcome data were available = temporal preregistration

          * Research design posted after outcome data were available but researchers claim they did not peek at the data = testimonial preregistration

        • I agree that the word “pre-registration” is inappropriate when the survey (or experiment, etc.) has already been conducted.

          At the very least, your “pre-registration” would need to be written up before the survey data was made publicly available, although this does allow the possibility that you could have a friend at the survey company who leaked the data to you early.

          I’m certainly not accusing Andrew of doing this, but it seems to me like it would have been entirely possible that he did all the analysis with the “new” data before even thinking about doing a “pre-registration”.

        • Anon:

          You write, “it seems to me like it would have been entirely possible that he did all the analysis with the “new” data before even thinking about doing a “pre-registration.”

          That’s nuts! Our whole point with the preregistration was, when seeing what we could learn from the new data, to remove possibilities of selection bias. If we had done all the analysis already, we’d have no motivation to preregister in the first place.

        • I said that it was _possible_ (but that I wouldn’t accuse you of it). You seem like an honest scientist, but, as you remind us on a regular basis, there are lots of dishonest scientists out there.

          You say that “Our whole point with the preregistration was .. to remove possibilities of selection bias”. What I am saying is that what you did in no way removed those _possibilities_.

          If, at the end of the day, all you can say is, “trust me, I promise I didn’t peek at the data before doing the pre-registration”, then why do the pre-registration at all? Why not just write the paper, and say, “trust me, I promise I didn’t peek at the data before writing the paper”?

        • Anon:

          I didn’t do the preregistration for you, I did it for me. Had we just tried to write the paper without preregistration, we would’ve had the usual situation of putting it all together, realizing we had problems, going back and changing our model, and so forth.

          It’s not about trust, it’s about avoiding selection bias, which is a real problem.

          Take a look at the Nosek et al. “50 shades of gray” paper. They did an experiment and analyzed in a way that made sense to them. But then, just to check, they ran a preregistered replication and they found that, in the absence of selection bias, their effect no longer appeared. They learned something by doing a preregistered replication.

          Similarly, computer scientists evaluate out-of-sample predictive performance.

          I think you’re saying that if I wanted to, I could cheat and analyze data, then report the analysis as if it were preregistered. Sure, I could do that, but why would I want to do that? What a waste of time.

          The point is that, conditional on your trusting me, you can learn from the results of my preregistered replication in a way that you might not learn from a usual analysis that is subject to selection bias. Similarly, I trust Nosek et al., and I learned something from the 50 shades of gray paper that I would not have learned had they merely done a conventional analysis.

        • I think we can both agree on one point: In an ideal world, everyone would do something like what you did here with _every_ study of this type. This should be the “default”, not some exotic option.

          We should be able to read a paper and (correctly) assume that the authors did not follow a garden of forking paths just to find some p < 0.05.

          Unfortunately, because most scientists have never done anything like what you did, we have no reason to assume that they did so (or even to believe them if they tell us they did so). Thus, the purpose of a "pre-registration" is to convince _others_ (not ourselves) that we were honest.

          But kudos to you for at least taking the first step of trying to convince yourself that you are being honest. If people like Cuddy did that, you'd have no material to blog about!

      • As Ed suggests, “anyone who wants to game the system could”.

        Maybe the only convincing evidence of an “honest” pre-registration is one where the new analysis makes the author look like a complete idiot!
