Open data and quality: two orthogonal factors of a study

It’s good for a study to have open data, and it’s good for the study to be high quality.

If for simplicity we dichotomize these variables, we can find lots of examples in all four quadrants:

– Unavailable data, low quality: The notorious ESP paper from 2011 and tons of papers published during that era in Psychological Science.

– Open data, low quality: Junk science based on public data, for example the beauty-and-sex-ratio paper that used data from the Adolescent Health survey.

– Unavailable data, high quality: It happens. For reasons of confidentiality or trade secrets, raw data can’t be shared. An example from our own work was our study of the NYPD’s stop and frisk policy.

– Open data, high quality: We see this sometimes! Ideally, open data provide an incentive for a study to be higher quality and also can enable high-quality analysis by outsiders.

I was thinking about this after reading this blog comment:

There is plenty to criticize about that study, but at least they put their analytic results in a table to make it easy on the reader.

Open data and better communication are good things, but honesty and transparency are not enough. Now I’m thinking the best way to conceptualize this is to treat openness and the quality of a study as two orthogonal factors.

This implies two things:

1. Just because a study is open doesn’t mean it’s of high quality. A study can be open and still be crap.

2. Open data and good communication are a plus, no matter what. Open data make a good study better, and open data make a bad study potentially salvageable, or at least can make it more clear that the bad study is bad. And that’s good.

16 thoughts on “Open data and quality: two orthogonal factors of a study”

  1. This leaves a question hanging: if a study does not have open data, how can we tell if it is a “good” study or a “bad” one?
    Ideas that come to mind (all problematic): reputation, credentials, citations, affiliation,…

    • Dale:

      It depends on the subject area, but sometimes you can tell that a study is of high quality because of performance. For example, whether or not the code for AlphaGo is public, it actually won the match. Or a psychology study could perform well under replications.

      • > This leaves a question hanging: if a study does not have open data, how can we tell if it is a “good” study or a “bad” one?

        I had a related question – has anyone done any kind of meta-level analysis of the effect of the open-data push? For example, whether studies with open data might replicate more reliably – but then how could you test the replicability of studies where the data aren’t open, or wouldn’t attempts to replicate studies where the data aren’t open necessarily be less likely to succeed?

        • This would be an interesting exercise and in principle I would be for it.

          However, the barriers to obtaining anything useful are so vast that I feel it’s a fool’s errand. Put simply: open data/code papers give a much more precise account of what was actually done in terms of experiment / data analysis / coding (by definition, of course), and this affects the meaning of a successful or failed replication.

          In the case of an open-data paper, the replication protocol is likely to be pretty faithful to the original (though certainly not exact), and hence a failed replication is apt to be bad news bears. For closed-data papers, a failed replication need not give the theory a black eye, as it could just as easily be an unknown difference in protocol that mattered in a material way; in that case you have not replicated the original study at all but rather tested an entirely different predictive conjunct and falsified THAT ONE.

          So even if a meta-analysis turns up that open data/code papers perform better in replications, you still don’t know whether that’s because they are somehow better in a scientific way, or because important protocols were (unknowingly) not followed in the replications of the closed-data ones! See Meehl’s lecture 3 (1989 series), around 31 minutes in, where he talks about what gets included in a paper versus what doesn’t.
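          A toy simulation can make this concern concrete. In the sketch below (all numbers are invented), open-data and closed-data papers have identical underlying scientific quality, but replications of closed-data papers are more likely to drift from the original protocol because the protocol is underspecified, and drift can produce spurious failures. The open-data papers then appear to “replicate better” for reasons that have nothing to do with quality:

```python
# Toy simulation of the confounding concern above (all parameters invented).
# Both groups have the same share of real effects; they differ only in how
# often a replication unknowingly deviates from the original protocol.
import random

random.seed(1)

def replication_success_rate(n, p_true_effect=0.5, p_protocol_drift=0.0,
                             p_fail_given_drift=0.6):
    """Fraction of n replication attempts that count as 'successful'."""
    successes = 0
    for _ in range(n):
        effect_is_real = random.random() < p_true_effect       # same in both groups
        protocol_drifted = random.random() < p_protocol_drift  # unknown deviation
        if protocol_drifted and random.random() < p_fail_given_drift:
            continue                                            # spurious failure
        if effect_is_real:
            successes += 1
    return successes / n

open_rate = replication_success_rate(100_000, p_protocol_drift=0.1)    # protocol well specified
closed_rate = replication_success_rate(100_000, p_protocol_drift=0.5)  # protocol underspecified
print(f"open data/code: {open_rate:.2f}")    # ~0.47
print(f"closed data:    {closed_rate:.2f}")  # ~0.35
```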

  2. I think you’re leaving out things like the quality of the methods and the analysis as found in the paper. I agree that there are lots of things you can’t verify without the data (and the code, which is left out in Andrew’s post), but I do think you can get at least some sense of a paper by examining such things as how they operationalized the variables under investigation and how they analyze their results.

  3. AllanC –

    > and this affects the meaning of a successful or failed replication.

    Yah, that’s what I meant by: “wouldn’t attempts to replicate studies where the data aren’t open necessarily be less likely to succeed?”

    Not to mention other possible confounders – like maybe studies with open data/code are more likely to be conducted by better researchers. That wouldn’t be irrelevant information, but it would complicate any attempt to determine if opening data per se improved the quality of research. Similar to what you described in your last paragraph.

    • Absolutely. I believe we are in agreement.

      I would like to see Keith O’Rourke chime in given his background with meta-analysis. IIRC he is rather unsympathetic to it as a means to appraise theories (or even treatments?) without data from original studies + known collection methods. But my memory might be a bit hazy on this.

  4. Off Topic: I’d be interested to see what statistics people think of this NYT analysis of COVID-19 death rate data.

    https://www.nytimes.com/interactive/2021/04/23/us/covid-19-death-toll.html

    Quite a few of the comments on the article point out problems with the article’s analysis. Here’s a summary of one of the complaints:

    The sub-headline of the article states, “The U.S. death rate in 2020 was the highest above normal since the early 1900s — even surpassing the calamity of the 1918 flu pandemic.”

    If one takes the “normal” death rate to be that of the year prior to a pandemic and one assumes that the total population doesn’t change all that much from one year to the next, then this sub-headline seems to be seriously incorrect. If one eyeballs the “Total deaths in the U.S. over time” chart in the article and then compares the jumps due to the 1918 pandemic and the 2020 pandemic, it seems pretty clear that the percentage increase in the number of deaths (and thus the death rate, assuming a roughly constant population) from 1917 to 1918 is much greater than the percentage increase from 2019 to 2020. The jump from 1917 to 1918 looks to be around 40% while the jump from 2019 to 2020 looks to be around 15% (based on measurements of a screenshot of the graph using Photoshop’s ruler tool).

    • Charles25:

      I don’t know. At that link, there’s a time series of deaths per 100,000 and a time series of total deaths. (As an aside, I don’t know why they give deaths per 100,000, which is a scale we have little intuition for. It seems to me that a death rate of 2.6% is more interpretable than a death rate of 2600 per 100,000.) Here’s what they have for 1917 and 1918 (I’m reading roughly off the graphs here):

      1917: 2300 deaths per 100,000 and a total of 1 million deaths
      1918: 2600 deaths per 100,000 and a total of 1.4 million deaths.

      This is an increase of 13% in the rate but an increase of 40% in the total. But I looked up the U.S. population, and it seems to have been roughly constant between 1917 and 1918, so the numbers above can’t all be correct!

      According to Wikipedia, the U.S. population was 103 million in both 1917 and 1918. 1 million deaths divided by 103 million people is 1%, not 2.3%. So I’m not quite sure what is meant by “death rate” in that article.

      The problem also arises in other years. For example, the article says that 3.4 million Americans died in 2020. Our population is 330 million, so, again, that’s a death rate of about 1%. But the 2020 death rate in their “Death rate in the U.S. over time” chart is less than 1%.
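      To make the arithmetic explicit, here is a quick check in Python using the rough numbers above (all eyeballed from the charts, so they are approximations rather than the article’s exact figures):

```python
# Rough arithmetic check of the eyeballed numbers above (approximations only).

def pct_change(old, new):
    """Percent change from old to new."""
    return 100 * (new - old) / old

# Read roughly off the NYT charts: deaths per 100,000 and total deaths.
rate_1917, total_1917 = 2300, 1.0e6
rate_1918, total_1918 = 2600, 1.4e6
pop_1917_1918 = 103e6                 # U.S. population, per Wikipedia

print(f"rate increase 1917-1918:  {pct_change(rate_1917, rate_1918):.0f}%")    # ~13%
print(f"total increase 1917-1918: {pct_change(total_1917, total_1918):.0f}%")  # ~40%

# Crude (unadjusted) death rate implied by totals and population:
print(f"crude rate 1917: {100 * total_1917 / pop_1917_1918:.1f}%")  # ~1.0%, not 2.3%

# Same check for 2020:
total_2020, pop_2020 = 3.4e6, 330e6
print(f"crude rate 2020: {100 * total_2020 / pop_2020:.1f}%")       # ~1.0%
```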

      I’m guessing that their death rate graph is some sort of age-adjusted death rate . . . ummmm, yeah, ok, I see it at the bottom of the page:

      Death rates are age-adjusted by the C.D.C. using the 2000 standard population.

      Compared to 1918, the 2000 population has a lot of old people. So the age-adjusted death rate overweights the olds (compared to 1918) and slightly underweights the olds (compared to 2020). The big picture here is that it makes 1918 look not so bad because the 1918 flu was killing lots of young people.

      Also, one other thing. The note at the bottom of the article says, “Expected rates for each year are calculated using a simple linear regression based on rates from the previous five years.” One reason 1918 is not further “above normal” is that there happens to be an upward trend during the five years preceding it, so the implicit model would predict a further increase even in the absence of the flu. I’m not quite sure how to think about that.
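      The note describes a concrete procedure, so here is a minimal sketch of what that calculation might look like, assuming an ordinary least-squares fit to the five prior years extrapolated one year ahead; the input rates below are invented for illustration and are not the article’s actual data:

```python
# Minimal sketch of the "expected rate" calculation described in the article's
# note: fit a simple linear regression to the previous five years of death
# rates and extrapolate one year ahead. Input rates are hypothetical.
import numpy as np

def expected_rate(prev_rates):
    """Predict next year's rate from a linear trend in the previous years."""
    years = np.arange(len(prev_rates))                  # 0, 1, ..., 4
    slope, intercept = np.polyfit(years, prev_rates, deg=1)
    return slope * len(prev_rates) + intercept          # extrapolate one year

# Hypothetical death rates (per 100,000) for 1913-1917, trending upward:
rates_1913_1917 = [2150, 2180, 2220, 2260, 2300]
print(round(expected_rate(rates_1913_1917)))  # ~2336: the pre-existing upward
                                              # trend raises the "normal"
                                              # baseline against which 1918's
                                              # excess is measured
```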
