Honesty and transparency are not enough

[cat picture]

From a recent article, “Honesty and transparency are not enough”:

This point . . . is important for two reasons.

First, consider the practical consequences for a researcher who eagerly accepts the message of ethical and practical values of sharing and openness, but does not learn about the importance of data quality. He or she could then just be driving very carefully and very efficiently into a brick wall, conducting transparent experiment after transparent experiment and continuing to produce and publish noise. The openness of the work may make it easier for later researchers to attempt—and fail—to replicate the resulting published claims, but little if any useful empirical science will be done by anyone concerned. I don’t think we’re doing anybody any favors by having them work more openly using data that are inadequate to the task.

The second reason that honesty and transparency do not suffice is that, even with the best of intentions and scholarly practices, researchers will fail. And, when transparency is touted as a solution to the replication crisis, I worry that the reasoning will go in both directions, so that replication failure will be taken as a sort of moral failing of the original experimenter. Yes, sometimes unreplicable research can be associated with ethical violations—even if researchers are innocently unaware of statistical errors in their published work, it is poor behavior for them to refuse to admit they could ever have been wrong—but one can also make mistakes in good faith. Making errors is inevitable; learning from them is not. To learn from errors we want a system of science that facilitates and incentivizes such learning; we don’t want an attitude that automatically links error to secrecy or dishonesty.

Some background:

In experimental sciences such as psychology, challenges arise not just in reproducing published work but in replicating via new experiments. Conditions for data collection are often unclear (for example, survey forms and complete descriptions of experimental protocols are often not available, even in supplementary materials), which leads to potentially endless ways in which a replication can be questioned. For example, psychologist Simone Schnall points to a report claiming that “even in studies with mice seemingly irrelevant factors such as the gender of the experimenter can make a difference,” and it is presumably not a requirement of published papers to supply full demographic information on all lab assistants.

As a result, scientific claims can seem to be taken as the personal property of their promoters. For example, psychologist Daniel Kahneman wrote in 2014 that if replicators make no attempts to work with authors of the original work, “this behavior should be prohibited, not only because it is uncollegial but because it is bad science. A good-faith effort to consult with the original author should be viewed as essential to a valid replication.” I believe that Kahneman’s suggestions are offered in good faith, but it seems clear that they give a privileged role to authors of published or publicized papers.

Various reforms have been suggested to resolve the replication crisis. I am in favor of more active post-publication review as part of a larger effort to make the publication process more continuous, so that researchers can release preliminary and exploratory findings without the requirement that published results be presented as being certain. Any published paper is just part of a larger flow of data collection and analysis and should be treated as such. Related proposed institutional reforms include reducing the role of high-profile journals in academic hiring and promotion, and making it easier to publish criticisms and replications. . . .

But honesty and transparency are not enough, and I worry that the push toward various desirable procedural goals can make people neglect the fundamental scientific and statistical problems that, ultimately, have driven the replication crisis. In recent years we have seen many published and publicized papers which were essentially dead on arrival because they were attempting to identify small effects in the presence of noisy and often biased data. . . .

Honest reporting is important, even necessary. But design is important too. Dishonest reporting can mislead, but honest reporting is not enough to save a poor design. It doesn’t matter how honest and careful the researcher is if the data are just too noisy to learn anything, from either an exploratory or a confirmatory perspective. . . .

And I conclude:

All this discussion echoes similar points in ethics more generally. For example, doctors should first do no harm and should care for their patients. But there are times when all the care in the world is no substitute for a good antibiotic. We want medical researchers to appropriately convey their uncertainty, but more important is for them to have useful results to talk about, and this is facilitated not just by openness but by close links between data, measurement, and substantive theory.

38 thoughts on “Honesty and transparency are not enough”

  1. I’m not sure exactly what you’re saying here. If it’s that ethical behavior could still be bad science I don’t think anyone would disagree with you. And everyone would agree that the solution to that is, while maintaining ethical behavior, to improve the science. But ethical behavior and pure scientific expertise are orthogonal. And I’d sure as hell rather start with scrupulously ethical researchers who happen to bed scientists and train them to do the science than start with complete snakes who were science-brilliant, since I have no idea how to reform the snake, and it doesn’t matter how brilliant the science is if you can’t trust the results.

    And regarding your point about “researchers will fail”: do you mean fail to be transparent? How exactly do you fail to be transparent? If you mean instead that they will make errors and do science that fails to replicate, why is that a moral failing *so long as they were transparent in what they did?*

    So, sure, honesty and transparency will not solve the problem of crappy studies getting published (although transparency alone would probably lead to *fewer* bad studies being published.) But honesty and transparency are, essentially, free. (Not quite, since it would make it difficult to use some proprietary data sets.)

    My analogy would be: all papers should be published in a language someone could understand, not some private language known only to the writer. This is an *absolute requirement* to do a study, but it doesn’t make the study good by itself. We are now in a situation where some researchers revel in the ability to produce their studies using (methodologically) a private language that only they understand. That’s bullshit, but fixing it won’t improve the quality of the studies *by itself.* So what?

    • oops… that’s “be bad scientists” not “bed scientists,” though I have a fondness for those who “bed scientists” as well.

    • Jonathan:

      I wrote this article because of two scenarios:

      (a) Researcher does bad study (noisy data, poor connection between theory and measurement, etc.). It’s hard to criticize the work because the researcher takes the criticism as a statement that he or she is dishonest or not transparent.

      (b) Researcher is planning a new study and decides to be fully honest and transparent, maybe even preregisters (not necessary, in my opinion, but it does add to transparency). But the study is still dead on arrival because of noisy data, poor connection between theory and measurement, etc. With all the focus on honesty and transparency, the researcher still didn’t design a useful study.

      So, my point is: Honesty and transparency are important. But a researcher can be scrupulously honest, and the design and analysis can be completely transparent, yet the study can still be completely useless for statistical reasons.

      • Granted. But I come at this from the standpoint of expert witness testimony, where you know how skewed the incentives are to produce results rather than science and the *only* weapon you have is the mutual adherence to transparency. And I think this process works. A nontransparent study (when the system is working correctly) has no credibility no matter what its conclusions. A transparent study is meticulously replicated (in the literal sense, not in the sense of the application of the same methodology to a different dataset) before any critique of the methodology itself begins. You can bet that this squeezes lots of error, both honest and dishonest, out of the process. It still leaves the problems of noisy data, theory and measurement disconnects, etc., on the table for the next stage. But neither the dishonest LaCours of this world nor the honest but sloppy Wansinks (to give him that much credit) can survive the first step. That’s not a panacea, but it’s where to start.

        • Jonathan:

          Honesty and transparency are important, indeed it’s hard to do much without them. And, in the long run, honesty and transparency should motivate higher quality work. But this won’t necessarily happen right away: there’s an intermediate step in which the studies themselves need to be of higher quality.

          Again, imagine completely transparent researchers working on sex ratio, ovulation and clothing, ESP, etc. If they preregister everything, then they’ll only get statistical significance 5% of the time (or maybe, say, 5.5% of the time if there’s a true underlying effect under all that noise), and, sure, that’s a step forward. But, still, big problems remain: (a) it’s a waste of resources to perform dead-on-arrival studies that are basically pure noise, and (b) the 5.5% of apparent successes will just be misleading (huge type M and type S errors) and can make things even worse. (A rough simulation of this point is sketched at the end of this comment.)

          When I say “honesty and transparency are not enough,” I’m not dissing honesty and transparency. I’m just saying you can be as honest and transparent as anybody, but you’re still not gonna move forward if you’re trying to use a bathroom scale to weigh a feather—and the feather is resting loosely in the pouch of a kangaroo that is vigorously jumping up and down.
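          A minimal simulation sketch of the point above, using made-up numbers rather than any actual study: a tiny true effect (0.02 standard deviations), a noisy measurement (standard deviation 1), and 50 subjects per group. Under preregistration, only slightly more than 5% of such studies reach statistical significance, and the significant ones grossly exaggerate the effect (type M error) and often get its sign wrong (type S error).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up parameters: a tiny true effect buried in noise (none of these
# numbers come from any real study).
true_effect = 0.02   # standardized true difference ("the feather")
sigma = 1.0          # measurement noise ("the bathroom scale")
n_per_group = 50
n_studies = 100_000

se = sigma * np.sqrt(2 / n_per_group)               # s.e. of the difference in means
estimates = rng.normal(true_effect, se, n_studies)  # one estimate per simulated study
significant = np.abs(estimates) > 1.96 * se         # two-sided test at alpha = 0.05

sig = estimates[significant]
print(f"share reaching significance:  {significant.mean():.3f}")  # barely above 0.05
print(f"type M (exaggeration) ratio:  {np.abs(sig).mean() / true_effect:.1f}")
print(f"type S (wrong-sign) rate:     {(np.sign(sig) != np.sign(true_effect)).mean():.2f}")
```

          With these made-up numbers, roughly 5% of the studies come out significant, the significant estimates average more than twenty times the true effect, and a substantial minority of them have the wrong sign: the bathroom-scale-and-feather problem in numerical form.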

        • “sure, that’s a step forward” may be selling it a bit short.

          Currently, with ingenuity, you can do 1 study and get 1 publication. If the rate falls to 5%, then you have to do 20 studies for 1 publication. Doing studies business-as-usual would not be sustainable. You would not be allowed to (a) waste resources in this way for long, nor, if publishing your misses were part of ‘honesty’, (b) mislead too many people with your few hits.

        • Bjs12:

          Sure, but I don’t want to wait so long. That’s why, for example, I was careful to avoid recommending that the ovulation-and-clothing researchers do a preregistered replication. I didn’t want them to waste any more of their time!

        • The best way to avoid having them waste more time is to falsify the methods/measurements. Even if this one study isn’t replicated, the same protocol is used again & the literature on “ovulation and….” grows.

        • I think an analogy can be made with how open source software operates. The transparency of having the source code out there is necessary for someone to *eventually* detect a bug, but it doesn’t guarantee that the code is a) of adequate quality to begin with and b) liable to be fixed quickly. Only with the addition of substantial resources* devoted to software quality, such as code reviews and unit testing, can open source projects really shine. When projects aren’t allocated these resources they end up abandoned or issues like Heartbleed arise due to neglect.

          * Usually from a private company (Google, Red Hat) or foundation (Mozilla, Apache).

        • I agree that honesty and transparency are not enough, but I disagree that open data won’t have an immediate benefit. I previously blogged about this: https://medium.com/@OmnesRes/openness-reproducibility-87d2a844f2e3

          Look at pizzagate. The data set contains a bunch of impossible responses, and these were included in the original analyses. If they had decided to post the data set when they published the papers, maybe they would have taken a closer look at it, or more people would have looked over the work and caught the mistakes (not that these mistakes really matter given the problems of p-hacking, salami slicing, poor data quality, poor study design, and small sample sizes and small effects).
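          To make this concrete, here is a minimal sketch with fabricated toy values (hypothetical column names and ranges, not the actual data from the papers in question): once the raw responses are posted, anyone can flag values that are impossible under the stated survey design.

```python
import pandas as pd

# Fabricated toy responses for a hypothetical survey whose ratings are on a
# 1-9 scale and whose respondents must be adults; not the actual data.
df = pd.DataFrame({
    "taste_rating": [7, 9, 12, 3, -1],      # 12 and -1 are impossible on a 1-9 scale
    "age":          [34, 51, 230, 28, 45],  # 230 is an impossible age
})

# Flag any row containing a value outside its allowed range.
impossible = ~df["taste_rating"].between(1, 9) | ~df["age"].between(18, 110)
print(df[impossible])   # rows any reader could catch if the data were public
```

          Running the check on the toy table flags the rows with out-of-range values; the point is simply that posting data turns this kind of screening from an internal courtesy into something any reader can do.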

  2. I worry that you are expanding the scope of scientific and statistical practice to a level that renders improvement impractical. I don’t disagree with your gist – honesty and transparency are not sufficient for good analysis. But if sufficiency is the standard sought, then it is more than a statistical matter. It also involves being familiar with prior literature, having critical thinking skills, understanding appropriate methodology in your discipline, and yes, understanding sample sizes, effect sizes, noise, type S and M errors, etc. You are setting a standard that is admirable but makes me wonder who is supposed to enforce it.

    You use medicine as an example. Medicine has a number of standards – one that is unappreciated is the relevant regional standard of care. Standards vary in different areas of the country (not to mention the world). What is accepted in some areas would not be accepted in others. How do we view this practice? I find it disturbing, but I’m not sure who should be addressing this. Similarly, what is accepted in psychology may differ from economics – and I find that also disturbing, but I’m not sure who should be addressing this.

    If your point is that all our institutions – from training undergraduates, to graduate students, to evaluating instructors and professionals, to peer review, to granting authorities, … – need to think about honesty, transparency, and lots of other things – I can’t disagree. But it seems to me that we have wandered far from addressing the types of problems you have been highlighting in this blog. Without some focus, I fear that the issues become so vague and general that they sound like a complaint that some people just don’t do good work. Undoubtedly true, but where does that leave us? Isn’t it possible to focus on the replication process without considering all the other (real, but unfocused) issues of quality of work?

    Maybe the answer is no, but I’d be interested to hear that.

    • The correct identification of research programs like Cuddy’s and Wansink’s as “impractical” is the point.

      As for different standards in different fields, that’s fine. In medicine, standards of care vary, but they converge, particularly controlling for economic conditions. And there is nothing wrong with saying clinical procedure X or drug regimen Y is the recommendation under considerable uncertainty; that would allow other considerations (adverse reactions, personal preferences, etc.) to be given proper weight by a physician and patient. We could avoid the whipsaw of, for one instance, recommending hormone replacement therapy to all women, then deciding to allow it for almost none, when we know nowhere near enough to justify either extreme.

      I don’t understand your suggestion that this wanders from the theme of this blog. Time and time again, the story is that good people using bad statistics are producing bad science. (Often they turn bad, or engage in unproductive advocacy, because they are wedded to their unjustified conclusions.) Overall, honestly and transparently producing bad science is pretty weak progress.

  3. Andrew,

    When I was a graduate student, I felt like I could spot bad studies: ones where there was some flaw in the design or analysis. Or uninteresting studies: ones where the answer to the question posed might be obvious based on prior research. But I never realized until reading your blog late in my graduate student life that the studies I liked, that I thought were reasonable or well conducted, could still be uninformative from an evidentiary standpoint, or what I ended up calling void-for-vagueness studies. That is, such studies present evidence that is too vague to guide researchers/practitioners/interested policy makers.

    I think you are one of the few (or perhaps just the loudest) who sees this problem. So thanks for pointing it out. Even if it gets repetitive for you.

  4. So you’re saying that a good researcher is not only honest and open, but also competent? :-)

    Another way of looking at this. The standard we should set ourselves is that our work should contain no known flaws according to the state of the art at the time of publication.

    • Boris:

      Unfortunately, “no known flaws according to the state of the art at the time of publication” isn’t good enough. That was the problem with Bem, power pose, etc: these papers passed peer review, thus there were no known flaws according to the state of the art at the time of publication. Remember, the problem with peer review is the peers.

      Without good measurement, all is hopeless, even if a paper happens to follow all of the cargo-cult rules extant at the time of its publication.

        • In particular, I would include attention to quality of measurement as part of “state of the art”.

          Also, if peer review is done according to “common practice” that is less than state of the art, then I would not consider it state of the art peer review.

        • Martha:

          I know what you’re saying, but . . . consider again the Bem paper on ESP. It was accepted at JPSP, which is considered a very serious, top-ranking psychology journal. And the editors were pretty clear they only accepted it because they considered the work to be state of the art (I don’t remember the exact phrase they used, but I think it was something like that). It was a terrible paper, but people were not attuned to those problems at the time.

        • If I’m not mistaken, you are referring to Bem’s 2011 paper. Assuming that is correct, I’m going to push back at your assertion that “people were not attuned to those problems at the time.” In particular, “Which people?” is the issue. Perhaps most people in psychology were not attuned to those problems then. But there were people outside psychology who were aware of at least some of the problems. In particular, I started teaching Analysis of Variance in 2005, using a text published in 1999 (by folks from an industrial engineering perspective) that pointed out problems with multiple testing, data snooping, power, etc., and in 2010 started teaching a continuing education course discussing such topics. So there were cautions already “out there”, but people in psychology (and perhaps a lot of other fields) didn’t go beyond their own backyards to become aware of them. This is the difference I was trying to make between “common practice” and “state of the art”.

        • Martha:

          Even in psychology, people were aware of the problems with data snooping, etc. But readers of Bem’s paper did not catch that these problems were there. It’s the garden of forking paths: readers just saw the comparisons that were presented and didn’t realize that, had the data been different, Bem could’ve shown other comparisons.

          To me, the most convincing example was the 50 Shades of Gray paper by Nosek and his collaborators, because there they did a completely reasonable-seeming analysis that yielded statistical significance, and then they did their own preregistered replication, which failed miserably. Nosek et al. hadn’t been aware of the garden of forking paths in their own research.

        • Andrew: Re “Even in psychology, people were aware of the problems with data snooping etc.”

          I’d agree that *some* (but by no means all) people were aware of these problems.

          “But readers of Bem’s paper did not catch that these problems were there.”

          This gets back to the difference between “state of the art” and “common practice”: It was common practice in 2011 not to report results in a way that detailed all that was done, and symmetrically not common practice to ask for or to expect transparency in describing all that was done.

          (I agree that Nosek et al’s paper was really helpful in making more people aware of how important these problems are and of how they needed to pay more attention to them. “Common practice” still hasn’t gotten to “state of the art,” but has made progress in that direction.)

        • While we’re talking about honesty and transparency, I honestly believe there was something additional going on with the Bem paper. I’m willing to bet that 90% of the people hearing about that paper dismissed it because they already *knew* that ESP or PSI or whatever you wanna call it does not exist. Probably this stance contributed to Bem et al. becoming the watershed paper that it has now become. At the same time, I’m sure those same people did not know the relevant parapsychological literature well enough to make that judgement.

          This is worrisome. It defeats the purpose of science. How are we supposed to ever discover something new if we feel we don’t need to look at the evidence first?

          (To be clear: I’m not arguing for or against ESP here.)

        • Alex, the problem with your argument is that you assume that every result should be evaluated in a vacuum, rather than against the background of centuries of science and settled knowledge. There are certainly scientific revolutions, but they still have to explain why the previous theory was wrong. Until they can, trenchant disbelief is the only logical approach. Although Bem might look like a typical psychology experiment, its conclusions called into doubt much of modern science. For ESP to work would require such a fantastic, unsuspected mechanism. They didn’t even scratch the surface of that problem – obviously light-years from the required standard of proof.

        • Boris:

          “the problem with your argument is that you assume that every result should be evaluated in a vacuum, rather than against the background of centuries of science and settled knowledge.”

          That’s exactly not what I’m assuming. I’m saying that people should check the available evidence before making a judgement. In the case of ESP, there *is* a background of decades of research.

          Beyond that, it seems to me that you yourself are making the very move I criticized: you seem unaware of the large body of ESP research, and you dismiss it before you’ve examined that literature. My argument is that this is not a scientific approach. You wouldn’t do that in any other field of science, so why is it OK to do it here?

          I consider the arguments you offer (that ESP would call into doubt much of modern science and would require a “fantastic, unsuspected mechanism”) as nothing but lazy excuses. Given that we have a gaping hole in the middle of the explanatory edifice of science (consciousness), and that all the quantum weirdness we now know to exist would have perfectly qualified as “fantastic, unsuspected mechanism” just a century ago, such sentiments don’t count for much.

      • Andrew – I certainly wasn’t intending to equate “state-of-the-art” with “can sneak past careless peer review at some journal”. Nor with “cargo-cult rules”. Maybe a slightly clearer description would be “best practice”, in which one could certainly include solid inference, corroborating evidence and a plausible mechanism.

        We are in complete agreement that about a million papers per year are published without coming close to this standard, so peer review is a very poor filter.

        I would see best practice as emerging and evolving from continuous discussion between researchers (disclaimer – I’d love to see this happen on platforms like PubPeer, with which I am involved). If that sounds terribly messy and subject to individual interpretation, it probably will be. But, like democracy, it will be better than any other system we’ve tried. Journals and their referees may or may not follow. Although some journals sometimes have half-reasonable guidelines, they are currently doing an appalling job at enforcing them.

      • The problems of selective reporting, data dredging, post-data subgroups, and a slew of other biasing selection effects were well known for donkey’s years. Perhaps fraudbusters had to rediscover the long-known results about trying and trying again and optional stopping, but it’s hard to believe a fairly large literature on verification biases was just lost. Especially in the land of ESP studies. Bem admitted going on a fishing expedition and recommending that his students slice and dice data until they found something. Ironically, some of his best known critics downplay those problems and instead reanalyze his data in a manner that allowed Bem to disregard the criticism (high prior spike on the null, “default” prior on largish effect size).
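        To spell out the parenthetical, here is a small sketch with invented numbers (a just-significant z of 2.0 and a standard error of 0.1; nothing here is Bem’s actual data): a Bayes factor that puts a spike of prior mass on a point null and a “default” normal prior on the effect size under the alternative can read the very same just-significant result as evidence for or against the null, depending only on how wide that default prior is.

```python
import numpy as np
from scipy.stats import norm

# Invented toy numbers, not Bem's data: a "just significant" result.
se = 0.1      # standard error of the effect estimate
y = 2.0 * se  # observed estimate, i.e. z = 2.0

def bf01(y, se, tau):
    """Bayes factor for H0: theta = 0 (point null) vs H1: theta ~ N(0, tau^2).

    The marginal distribution of y is N(0, se^2) under H0 and
    N(0, se^2 + tau^2) under H1.
    """
    return norm.pdf(y, 0, se) / norm.pdf(y, 0, np.sqrt(se**2 + tau**2))

for tau in (0.1, 0.5, 2.0):   # wider priors treat larger effects as plausible a priori
    print(f"prior sd on effect = {tau:>3}: Bayes factor for the null = {bf01(y, se, tau):.2f}")
```

        With these invented numbers the Bayes factor for the null moves from about 0.5 to almost 3 as the prior on the effect size widens, so the verdict tracks the prior choice rather than anything about how the data were collected or selected, which is presumably why that style of reanalysis was easy for Bem to shrug off.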

        • Deborah:

          1. The problems of selective reporting etc. have indeed been known forever but I don’t think it was well understood how prevalent these problems were. At least for me, it was only gradually that I realized how many bits of prominent research were nothing more than finding patterns in noise. And even now lots of researchers don’t get it.

          2. I agree with you regarding discussion of the Bem paper. There was much too much willingness to ignore the problems with selection etc. and try to analyze Bem’s published data summaries on their own terms, not recognizing that these summaries were not a good representation of the totality of the data.

  5. I think that some individuals, whether credentialed or not in a given field, just are better diagnosticians. They may not even need a quantitative background, a point that is dismissed summarily by some experts. They may have practical experiences even outside the field in which a question has been raised. I guess I was influenced by a Feyerabend talk. I haven’t yet looked at this hypothesis sufficiently. But I think Serge Lang had made a similar point. I know that he would seek out non-expert views, as he looked to those without conflicts of interest.
