Hysteresis corner: “These mistakes and omissions do not change the general conclusion of the paper . . .”

All right, then. The paper might as well be called Attractive Data Sustain Increased B.S. Intake in Journals. Its actual title is Attractive Names Sustain Increased Vegetable Intake in Schools.

Seriously, though, this is just an extreme example of a general phenomenon, which we might call scientific hysteresis or the research incumbency advantage:

When you’re submitting a paper to a journal, it can be really hard to get it accepted, and any possible flaw in your reasoning detected by a reviewer is enough to stop publication. But when a result has already been published, it can be really hard to get it overturned. All of a sudden, the burden is on the reviewer, not just to point out a gaping hole in the study but to demonstrate precisely where that hole led to an erroneous conclusion. Even when it turns out that a paper has several different mistakes (including, in the above example, mislabeling preschoolers as elementary school students, which entirely changes the intervention being studied), the author is allowed to claim, “These mistakes and omissions do not change the general conclusion of the paper.” It’s the research incumbency effect.

As I wrote in the context of a different paper, where t-statistics of 1.8 and 3.3 were reported as 5.03 and 11.14 and the authors wrote that this “does not change the conclusion of the paper”:

This is both ridiculous and all too true. It’s ridiculous because one of the key claims is entirely based on a statistically significant p-value that is no longer there. But the claim is true because the real “conclusion of the paper” doesn’t depend on any of its details—all that matters is that there’s something, somewhere, that has p less than .05, because that’s enough to make publishable, promotable claims about “the pervasiveness and persistence of the elderly stereotype” or whatever else they want to publish that day.

When the authors protest that none of the errors really matter, it makes you realize that, in these projects, the data hardly matter at all.
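
To make the contrast concrete, here’s a minimal sketch of the two-sided p-values implied by the reported versus corrected t-statistics. The degrees of freedom below are a hypothetical stand-in (the paper’s actual values aren’t given here); the basic point survives any reasonable choice.

```python
# Sketch only: the degrees of freedom are a hypothetical stand-in, not taken
# from the paper in question. The point is just that t-statistics of 5.03 and
# 11.14 clear the usual p < .05 bar easily, while the corrected 1.8 does not.
from scipy import stats

df = 30  # hypothetical degrees of freedom, for illustration only

for label, t in [("as reported", 5.03), ("as reported", 11.14),
                 ("as corrected", 1.80), ("as corrected", 3.30)]:
    p = 2 * stats.t.sf(abs(t), df)
    print(f"{label}: t = {t:5.2f}, two-sided p = {p:.4f}")
```

Under most plausible sample sizes, the gap between 1.8 and 5.03 is the difference between “nothing significant to report” and a headline p-value.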

In some sense, maybe that’s fine. If these are the rules that the medical and psychology literatures want to play by, that’s their choice. It could be that the theories that these researchers come up with are so valuable that it doesn’t really matter if they get the details wrong: the data are in some sense just an illustration of their larger points. Perhaps an idea such as “Attractive names sustain increased vegetable intake in schools” is so valuable—such a game-changer—that it should not be held up just because the data in some particular study don’t quite support the claims that were made. Or perhaps the claims in that paper are so robust that they hold up even despite many different errors.

OK, fine, let’s accept that. Let’s accept that, ultimately, what matters is that a paper has a grabby idea that could change people’s lives, a cool theory that could very well be true. Along with a grab bag of data and some p-values. I don’t really see why the data are even necessary, but whatever. Maybe some readers have so little imagination that they can’t process an idea such as “Attractive names sustain increased vegetable intake in schools” without a bit of data, of some sort, to make the point.

Again, OK, fine, let’s go with that. But in that case, I think these journals should accept just about every paper sent to them. That is, they should become Arxiv.

After all, if multiple fatal errors in a paper aren’t enough to sink it in post-publication review, why should they be enough to sink it in pre-publication review?

Consider the following hypothetical Scenario 1:

Author A sends paper to journal B, whose editor C sends it to referee D.

D: Hey, this paper has dozens of errors. The numbers don’t add up, and the descriptions don’t match the data. There’s no way this experiment could’ve been done as described.

C: OK, we’ll reject the paper. Sorry for sending this pile o’ poop to you in the first place!

And now the alternative, Scenario 2:

Author A sends paper to journal B, whose editor C accepts it. Later, the paper is read by outsider D.

D: Hey, this paper has dozens of errors. The numbers don’t add up, and the descriptions don’t match the data. There’s no way this experiment could’ve been done as described.

C: We sent your comments to the author who said that the main conclusions of the paper are unaffected.

D: #^&*$#@

[many months later, if ever]

C: The author published a correction, saying that the main conclusions of the paper are unaffected.

Does that really make sense? If the journal editors are going to behave that way in Scenario 2, why bother with Scenario 1 at all?

29 thoughts on “Hysteresis corner: ‘These mistakes and omissions do not change the general conclusion of the paper . . .’”

  1. I’d like to talk about Scenario 1 a bit more because ASAPbio recently had a meeting about transparency in peer review, http://asapbio.org/peer-review, and concurrently I got into a debate with bioRxiv over their commenting policies, https://medium.com/@OmnesRes/i-knew-biorxiv-wouldnt-post-my-peer-review-and-that-s-the-problem-cca7413e9ff5

    If the goal of journals really is to protect the quality of the scientific literature, then don’t they have an ethical responsibility to make negative reviews public?

    Let’s imagine that this paper was first rejected by a couple journals before finally getting accepted, and at those first journals the reviewers noticed all the problems with the paper. Don’t those journals have a responsibility to make sure Wansink isn’t able to trick the next journal into publishing it?

    Some journals do have open peer review, but I’m pretty sure these reviews only get posted if the paper is accepted. So essentially, we only ever see positive reviews. Of course, scientists know negative reviews exist, we get them all the time. I’m not quite sure why the scientific community goes to such great lengths to sweep these under the rug.

    I literally just discovered this–BMC Nutrition had open peer reviews for one of the pizza papers, and they were kind of hilarious. But now with the retraction of the paper they are gone! https://bmcnutr.biomedcentral.com/articles/10.1186/s40795-015-0030-x/open-peer-review

    How do I know they existed? Because I previously made fun of them a bit: https://medium.com/@OmnesRes/peer-review-is-just-security-theater-145b6dd87db5

    Who is the journal protecting by taking down the open peer reviews of a retracted paper? I guess the reviewers might be (and should be) embarrassed that they accepted a paper where numbers literally just randomly changed from one table to the next.

    I think this is very telling. Journals (and reviewers) are afraid of being wrong in public. If a reviewer gives a positive review to a retracted paper they want it redacted. If a journal identifies fraud in one of their papers they want to be able to say it didn’t change the conclusions. Everyone wants to present an air of infallibility. It’s almost as if they believe that if they show any weakness everyone will smell blood in the water and pounce.

    But we’ve seen the exact opposite–people who readily admit mistakes are applauded while those who don’t are attacked. So I don’t really understand what’s going on, I’ll have to think about this some more.

    • Jordan:

      Along similar lines: I once criticized a published paper for a flagrant error, and later I was contacted by one of the referees of that paper, who said these problems had been pointed out in the referee report, which was then ignored by the editor. Details omitted to prevent any retaliation.

      • The whole thing is a mess. Journals can get away with accepting terrible papers because we only judge journals based on citation counts…and sometimes a paper is cited specifically for being bad! Journals are basically just tabloids trying to get clicks, and accept whatever they think will get them the most clicks (citations).

        And papers are only cited based on what journal they appear in!

        What we have is a system basically comprised of sheep just seeing what other sheep are doing.

        • “we only judge journals based on citation counts”

          So I guess we need to think about how to change this. At the moment, I have no ideas. Anybody have any?

        • Citation counts fall into the category of metrics that are too easy to compute: they measure quantity, not quality. Similar metrics are followers, views, clicks, and so on, which have been held up as stars of digital marketing.
          One solution requires hard work – shoe leather. It goes back to developing a better dataset and pre-processing the data. For example, have real people comb through the citations to note their relevance and importance to the paper. Or, for each paper, list the 5-10 most important citations and only count those, discounting the rest. (A toy sketch of this appears after this thread.)
          This solution will strike some as unrealistically cumbersome, but in data & statistics, no pain, no gain.

        • If I had to evaluate the output of 1000 economists, making a dataset of easy-to-measure quantities and creating models which relate them makes sense. But (although I left academia 35 years ago) my recollection is that tenure candidates are evaluated one at a time by committees formed just for that purpose by their actual peers in the department, plus someone from outside just to give the process a patina of legitimacy. It sort of seems like *more* work to keep up with the citation literature well enough to apply this sort of citation filter than to just read the papers (maybe as they come out — does anyone have any interest in their colleagues’ output?) and make, y’know, *judgments.* Although one thing that does occur to me (and wasn’t really an issue 35 years ago) is that you want to insulate yourself from charges of bias, and objective-but-bad standards are, for that purpose, way better than any subjective standards of quality.

        • The tradeoff between quantity and quality is everywhere observed, and sadly, most people prefer quantity because it is easy to collect and run models on. However, the models created are only as good as the data and many models fall short. Another good example in education is teacher evaluations. It is more “objective” and “easy” to run surveys, from which data we can derive models, while shoving issues of response rate, sampling and response bias, grade inflation, vindictive behavior, etc. under the rug. Having real people observe class provides a much richer dataset but can only be done on a smaller scale and is much more costly.

        • Kaiser and Jonathan both make good points. I am a little out of what goes on in academic promotions now (having retired several years ago, so don’t have to be on committees any more), but my impression is that review of a candidate’s papers is still part of within-department hiring and promotion decisions, but the “metrics” are wanted by the administration.
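
As a toy version of Kaiser’s suggestion above (credit only each paper’s handful of most relevant references), here is a minimal sketch. The journal names, importance scores, and cutoff are all made up, and the hard part, the human curation that produces the scores, is assumed away.

```python
# Toy sketch of counting only each paper's most relevant citations.
# The importance scores are assumed to come from human curation; everything
# below (journals, scores, cutoff k) is made up for illustration.
from collections import Counter

def filtered_citation_counts(citing_papers, k=5):
    """citing_papers: list of reference lists, each a list of (journal, importance)."""
    counts = Counter()
    for refs in citing_papers:
        top_k = sorted(refs, key=lambda r: r[1], reverse=True)[:k]
        counts.update(journal for journal, _ in top_k)
    return counts

# Two hypothetical citing papers, each with hand-assigned relevance scores.
papers = [
    [("Journal A", 0.9), ("Journal B", 0.2), ("Journal C", 0.8), ("Journal D", 0.1)],
    [("Journal A", 0.7), ("Journal C", 0.3), ("Journal D", 0.95)],
]
print(filtered_citation_counts(papers, k=2))  # only the 2 most relevant refs per paper count
```

The code is the trivial part; as Kaiser says, the real cost is the shoe leather needed to produce the relevance scores in the first place.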

  2. There is an unfortunate tendency of mathematical scientists like statisticians to think that data answers questions. But for many researchers, particularly in the social sciences, data exists as an independent rhetorical device to compel beliefs already held by the researcher, but perhaps not held by the reader. The philistine reader who elevates the data portion above the clever thinking that went into hypothesis creation and justification is tolerated, but not really encouraged. It is churlish to require that, after surmounting the pygmy reviewers who nitpick about computer and mathematical trivia, this obviously corroborative rhetoric be submitted to the further ignominy of carping jealous nobodies on the Internet.

    • Jonathan:

      I’d really have no problems if these journals didn’t present data at all, or made it clear that proof-by-data is not necessary. In that case they should remove their implicit requirement for “statistical significance.” I think all these papers we’re discussing would be just as strong if they were to present their data as anecdotes and scatterplots, with no p-values at all. To put it another way: when those authors reported t-statistics of 5.03 and 11.14, when they were really 1.8 and 3.3 . . . I never asked them to present t-statistics at all! If journals are going to publish this sort of thing, I’d prefer they move to a pre-statistical mode of operation in which theories are justified based on plausibility, anecdote, and perhaps some aggregate data. Indeed, if it were explicitly recognized that the data represent an ornament rather than a proof, this would put more burden on the theoretical development, and perhaps researchers wouldn’t be able to get away with such sloppy theories. As it is, the sloppy theories lean on the data analysis, and the sloppy data analyses lean on the theories. Bad news all around.

      • It is not enough to have a clever idea. You have to have a clever idea and belong to (or satisfy) the social group that owns the role of creating such a clever idea. The data, stats, language, citations, and peer review signal membership in certain groups.

        If there weren’t any statistics, the Comp Lit department could do it! (Just teasing, folks! Don’t send the mythological beasts after me. I estimate you could do it better, too!)

        Similarly, astrologers work in a tradition; they don’t just make stuff up.

    • All you have to do to post on bioRxiv is take out the journal name. This is a “problem” created entirely by your own obstinacy and is not remotely illegal. All you want is attention for your needless drama.

  3. Andrew: You are on to something with the hysteresis. It seems to be more general than research papers. It’s easy to make all kinds of data findings (essentially insufficient generalizations), but then once the initial findings are spread around, it is very hard to dispute those findings, even if they are contradicted. We’re seeing similar things with “fake news,” and I think we’ll find that efforts to counter fake news will run up against this hysteresis effect. (Even mass media are known to bury their corrections in the inside pages, which makes it worse.)

  4. Oh, this is probably a good time to post this:
    http://cornellsun.com/2018/02/08/cornell-professors-continuous-retractions-stained-lab-reputation-students-say/

    And this:
    https://twitter.com/psforscher/status/962010342959194112

    Basically, as we thought, it seems Wansink already knows what story he wants to tell before he collects the data. I guess the data collection isn’t for him, but to convince the rest of us that his ideas are correct. I mean, I assume the government wouldn’t have given him a bunch of money simply for ideas.

    P.S. Obviously having a hypothesis is fine, but if the data doesn’t agree with your hypothesis you don’t change the data to fit your hypothesis. You consider the possibility that you are wrong and try to see what the data is telling you (assuming you trust the data–given how sloppy the data collection is with this lab I don’t think we could draw any conclusions from any of their data).

    • In the “real world”, the thinking goes like this: I have a hypothesis based on my gut feeling – I am going to run this experiment in order to prove my hypothesis. If the data justify the hypothesis, I am so happy (it may mean I get budget). If the data contradict the hypothesis, first I think the data must be wrong, then I think an external factor must have affected my experiment, next I think the hypothesis must have been too general, so now I start slicing and dicing the data to find the segment for which the hypothesis is justified. I won’t give up because I have invested time and energy in the experiment, and it’s sad if I have “nothing” to show at the end. Given the number of variables available now, it is almost surely the case that we can find something – as shown on this blog all the time. One wishes this weren’t so in academia/science. (A small simulation of this slicing-and-dicing appears after this thread.)

      • Kaiser: “sad if I have “nothing” to show at the end”

        Is it not sadder, if at the end you hold on to that “nothing” for something?

        Is there something in the “real world” that mitigates that?

        • I don’t mind it so much if people try to make nothing sound like something, as long as in their heart of hearts they know that’s what they did and they make use of the resources that flow from it to actually do something in the future. It gets out of hand when people make nothing sound like something, forget they did this on purpose, and go and spend more of their resources on the same line of work.

        • Just to clarify: I don’t agree with that line of thinking. I always tell managers that learning NOT to waste time and money doing something useless is also valuable.
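
Kaiser’s “slicing and dicing” is easy to simulate. Here is a minimal sketch assuming pure noise (no true treatment effect at all) and an arbitrary 20-way segmentation; the sample size and number of subgroups are made-up illustration values.

```python
# Sketch: the outcome has no relationship to the treatment, but with enough
# arbitrary subgroups, some slice will usually show a "significant" difference.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_subgroups = 2000, 20

treatment = rng.integers(0, 2, size=n)           # hypothetical 0/1 treatment
outcome = rng.normal(size=n)                     # pure noise, unrelated to treatment
subgroup = rng.integers(0, n_subgroups, size=n)  # arbitrary segmentation

hits = []
for g in range(n_subgroups):
    in_g = subgroup == g
    _, p = stats.ttest_ind(outcome[in_g & (treatment == 1)],
                           outcome[in_g & (treatment == 0)])
    if p < 0.05:
        hits.append((g, round(p, 3)))

print(f"{len(hits)} of {n_subgroups} subgroups are 'significant' despite zero true effect: {hits}")
```

With twenty subgroups tested at the .05 level, there is usually at least one “significant” slice to report, even though nothing at all is going on.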

  5. You really should have mentioned that at least the carrots add up now!

    Sure, they didn’t add up before, and Wansink gave bs excuses for why they didn’t add up, and I received a version of the correction where they still didn’t add up. But they do now! Progress.

    • Jordan:

      I can only imagine that the reaction of Wansink and his collaborators to all this hullabaloo is as expressed in Jonathan’s comment above. They’re doing all this important work that’s changing people’s lives, and a bunch of bitter losers are trying to bring them down.

      Just to be clear: I think Jonathan’s comment is offered in a spirit that is some mix of parody and empathy. That is, I don’t think Jonathan really believes his Daniel-Gilbert-level rhetoric of “pygmy reviewers . . . carping jealous nobodies,” but I do think he’s trying to capture something of the attitudes of the people who do this sort of work.

  6. I know I’ve made this point before and it’s a bit off-topic for your post, but it goes to the heart of the problem with this and so many other studies we’ve discussed such as the power pose. If you start with an effect that is directionally intuitive, even obvious (appealing presentation increases consumption, posture affects attitude), then claim the magnitude of that effect is many times what previous research, experience, and common sense would indicate, the important part of the claim is magnitude, not direction.

    Claims of implausible magnitude should be held to a higher threshold of proof just as claims of implausible direction should be, and defenses of those claims (see the recent articles on the power pose in the New York Times and the Boston Globe) should also be required to focus on magnitude. (A toy simulation of the magnitude point appears after this thread.)

    • Mark: It’s worse than that. An implausible magnitude might be something like 60% paid when there is both a sign and a box asking for a donation vs. 40% paid when there is just a box and no sign, a huge 20-percentage-point gap. Then the result will be generalized to “we proved that people don’t tend to pay unless prodded, and putting a sign up will change the behavior.” However, even if the experimental result is replicated, 40% will still not pay when there is a sign and a box.

    • Mark:

      Sure, but I’d put it slightly differently. Effects might well be large, but we’d expect them to vary by person and by context.

      Also, it’s not so obvious to me that average effects of these interventions would be positive, even if they sound reasonable and represent common sense. Yes, it sounds like a good idea to describe vegetables appealingly or to have good posture. But we should remember that these interventions are not occurring in a vacuum. Public health authorities are already trying to figure out ways to get kids to eat healthier, and job candidates are already trying to improve their chances in interviews. So, sure, the snappy veggie names might help—or they might antagonize the kids (or the food service workers) and make things worse. It depends on context. Similarly, the advice to sit up straight, or stand in some pose or whatever, might help—or it might hurt, if it distracts the job candidate from everything he or she was trying to remember about the content of the interview. Again, it depends on context.

      So I want to avoid a sort of unidirectional thinking (the fallacy of the one-sided bet) where we accept without question that ideas such as Wansink’s are correct, even in their direction. I really don’t know. In a world where people are already trying to lose weight, will a switch to smaller plates move them in that direction? Maybe. I don’t know. In a world where people are already trying to present themselves in a positive light, will a focus on posture move them in that direction? Maybe. I don’t know. Answering those questions is one reason we do systematic research, and in that case strategies such as clear hypotheses, high-quality measurement, and transparent reporting can be helpful.

      I do agree that one reason that researchers such as Wansink are so quick to make declarations such as “These mistakes and omissions do not change the general conclusion of the paper” is that they’re convinced ahead of time that their ideas are correct. But in that case I’d rather them flat-out admit this and forget the statistical evidence entirely. If Wansink’s ideas on eating behavior and how to measure it are good, he can just publish them, and let others actually run the experiments.

      • Andrew,

        Yeah, to borrow a line from the movie Parenthood, “there’s so much here that offends me, it’s difficult to pick just one thing.” That said, I think the questions of generalization and magnitude both require serious attention. Both are big problems.

        Most of these studies take a small, highly homogeneous group, expose the subjects to a very specific treatment under incredibly artificial circumstances, then draw sweeping conclusions about the effect holding everywhere. Even if everything about the study was done properly and the specific results are valid, that doesn’t justify the sweeping conclusions that are inevitably made in the Ted talks and best-selling self-help books.

        But the point I’m hammering here (which is, I apologize, somewhat off-topic from your post) is that if the notable part of a claim is the magnitude, any defense of that claim has to cover both direction and magnitude. One interesting aspect of the appealing-name study is that we have a tremendous amount of research (most of it proprietary, but the findings are still easy to infer) showing that product names, packaging, and advertising have an effect on purchasing (consumption is a thornier question) and that these effects tend to generalize fairly well.

        Wansink has made a career out of publishing results that are consistent with existing research and personal experience directionally but are wildly implausible in terms of magnitude. If he is going to claim that problems with his numbers do not affect his conclusions, he explicitly has to address both direction and magnitude.
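
To put a rough number on the magnitude point running through this thread: when a true effect is small and the study is noisy, the estimates that happen to clear the usual significance filter are exactly the ones that exaggerate the effect. A minimal sketch, with a made-up true effect and standard error (none of these numbers come from the studies discussed):

```python
# Sketch with made-up numbers: a small true effect measured with a noisy design.
# Conditioning on "statistical significance" selects the exaggerated estimates,
# so the published magnitude can be wildly off even when the direction is right.
import numpy as np

rng = np.random.default_rng(1)
true_effect, se, n_sims = 0.1, 0.5, 100_000   # hypothetical effect size and standard error

estimates = rng.normal(true_effect, se, size=n_sims)
significant = np.abs(estimates / se) > 1.96   # the usual two-sided p < .05 filter

print("true effect:                        ", true_effect)
print("mean |estimate| among 'significant':", round(float(np.abs(estimates[significant]).mean()), 2))
```

With these made-up inputs, the filtered estimates run about an order of magnitude above the true effect, which is the sense in which a plausible direction says little about a published magnitude.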

  7. Great post! Of the many depressing / funny pieces of the “Attractive Names Sustain Increased Vegetable Intake in Schools” correction, my favorite is that “Study 1 was planned in 2007, but it was conducted in the Spring of 2008 shortly after the first author was asked to take a 15-month leave-of-absence to be the Executive Director for USDA’s Center for Nutrition Policy and Promotion in Washington DC.” I’m sure that the policies being promoted are rigorously justified…

  8. Republicans also have the advantage of incumbency, which, on average, allows members to run about seven percentage points ahead of the national party. But Republicans have gradually been losing the advantages of incumbency as well, most obviously because of 34 recent retirements in Republican-held congressional districts.
