Alon Honig points to a post from psychology researcher Joe Hilgard:
This is the story of how I [Hilgard] found what I believe to be scientific misconduct and what happened when I reported it.
Science is supposed to be self-correcting. To test whether science is indeed self-correcting, I tried reporting this misconduct via several mechanisms of scientific self-correction. The results have shown me that psychological science is largely defenseless against unreliable data. . . .
You should understand that there are probably a few people in your field producing work that is either fraudulent or so erroneous it may as well be fraudulent. You should understand that their work is cited in policy statements and included in meta-analyses. You should understand that, if you want to see the data or to report concerns, those things happen according to the inclinations of the editor-in-chief at the journal. You should understand that if the editor-in-chief is not inclined to help you, they are generally not accountable to anyone and they can always ignore you until the statute of limitations runs out.
Basically, it is very easy to generate unreliable data, and it is very difficult to get it retracted.
It’s a story about a published article “that appeared to have gibberish for all its statistics (Zhang, Espelage, & Zhang, 2018). None of the numbers in the tables added up: the p values didn’t match the F values, the F values didn’t match the means and SDs, and the degrees of freedom didn’t match the sample size. . . .” Then Hilgard read further articles by this author, which “would often report impossible statistics. Many papers had subgroup means that could not be combined to yield the grand mean. For example, one paper reported mean task scores of 8.98ms and 6.01ms for males and females, respectively, but a grand mean task score of 23ms. Other papers had means and SDs that were impossible given the range. . . . More seriously still, tables of statistical output seemed to be recycled from paper to paper. . . .”
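To see why a number like that is impossible: a grand mean is just a weighted average of the subgroup means, so it has to land between them no matter how the sample splits between males and females. Here’s a minimal sketch (the group sizes are made up; only the bound matters):

```python
# Sanity check: a grand mean is a weighted average of subgroup means,
# so it must lie between them for any split of the sample.
# The group sizes below are made up purely for illustration.
n_male, n_female = 50, 50
mean_male, mean_female = 8.98, 6.01   # reported subgroup means (ms)

grand = (n_male * mean_male + n_female * mean_female) / (n_male + n_female)
print(grand)  # 7.495 with an even split; always between 6.01 and 8.98
assert min(mean_male, mean_female) <= grand <= max(mean_male, mean_female)
# No choice of group sizes can push the grand mean to the reported 23ms.
```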
As Hilgard points out, these errors can have consequences because the results are reported as coming from studies with very large sample sizes (3000 is large for a psychology study), and:
If these numbers were wrong, they were going to receive a lot of weight in future meta-analyses. . . . experiments with sample sizes totaling to more than 11,000 participants (8,000 given the Aggressive Behavior correction). This is an amount of data that rivals entire meta-analyses and ManyLabs projects. If this data is flawed, it will have serious consequences for reviews and meta-analyses.
After being notified of the problems, the author made minimal corrections that seemed to just perpetuate the errors: “What was remarkable about these corrections is that they would simply add an integer to the F values so that they would be statistically significant.” Hilgard shares more detail showing how crap these data are. He contacted the author’s university, and they replied that there was “insufficient evidence to prove that data fraud” and that “these discussions belong to academic disputes.” Kind of like the University of California said about the sleep guy.
I can’t imagine what it would be like to work in the same department as one of those people.
Hilgard then contacted the journals that published these articles. He found that “Some journals appeared to make good-faith attempts to investigate and retract. Other journals have been less helpful”:
In the cases that an editor in chief has been willing to act, the process has been very slow, moving only in fits and starts. I have read before that editors and journals have very little time or resources to investigate even a single case of misconduct. It is clear to me that the publishing system is not ready to handle misconduct at scale.
In the cases that an editor in chief has been unwilling to act, there is little room for appeal. Editors can act busy and ignore a complainant, and they can get indignant if one tries to go around them to the rest of the editorial board. It is not clear who would hold the editors accountable, or how. I have little leverage over Craig Anderson or Duncan Lindsey besides my ability to bad-mouth them and their journals in this report. At best, they might retire in another year or two and I could have a fresh editor with whom to plead my case.
Hilgard concludes:
In total, trying to get these papers retracted has been much more difficult, and rather less rewarding, than I had expected. The experience has led me to despair for the quality and integrity of our science. If data this suspicious can’t get a swift retraction, it must be impossible to catch a fraud equipped with skills, funding, or social connections.
Yup. It’s too hard to publish criticisms and obtain data for replication. Sometimes the problem is research misconduct and flat-out fraud, other times it’s just sloppiness or the authors being unaware of some subtle statistical issues. In any case, it’s a problem.
Very interesting story. Sadly, this is consistent with my experience of reporting research misconduct to Johns Hopkins University. See this link
https://personal.rhul.ac.uk/uhte/014/Spagat%20Fabrication%20Conference.pdf
I’d be giddy happy if I could retract my own papers & no more.
Of course, there is no blame to go around, merely a change of mind. A correcting paper is simply not warranted in its own right, and a correction-only mention does not appear to exist [else, I would be happy enough to merely lodge a note on my change of position rather than truly have the original text forgotten]. Co-authors are not of any concern – they either do not exist or are game for my withdrawing [after a few polite years have passed since publication].
Back in the naïve first decade of my career I was a coauthor on a paper for which we belatedly discovered a pervasive but ultimately meaningless data management error. A large portion of the parameter estimates reported had trivial changes in the confidence interval bounds due to the mistake. For instance, a hazard ratio confidence interval of (1.12, 2.53) might have been miscalculated as (1.10, 2.57) or similar.
At the time I was shocked and appalled that the editor of the journal told us basically “you gotta be kidding”. There was no way they were going to publish an erratum or correction over something not materially affecting the conclusions of the analysis. Until then I thought “science” was supposed to be super picky and want the exact details to be as truthful and error-free as possible. I mean, why report all those CI’s to three digits of precision if errors in the third digit don’t even matter, correct or incorrect?
Now I realize the reality of the situation. We could have just made up all those C.I.’s by pulling digits out of a hat and nobody in the world would have cared.
Yeah, I once got asked to provide some additional model details for a meta-analysis, and during the re-runs in R found that the intercepts in the published tables were actually from the pre-R&R model spec. (Didn’t affect substantive interpretation but was off by much more than a rounding error.) I wrote the editor, who essentially said, you don’t really think we need to do a corrigendum for THAT, do you? Sigh.
Off topic, but this might be of interest to you, Andrew:
https://www.econjobrumors.com/topic/wow-just-wow…?replies=17#post-8263664
It’s a critique of an analysis of vaccine efficacy performed by CDC. Long story short, it looks like they took a sample of hospitalized patients, regressed test positivity on vaccination status, adjusted for demographics, and called this vaccine efficacy for preventing hospitalization (presumably intended to be comparable to the clinical trial estimates). Seems kind of dumb but curious if you can make any sense of it.
Can we please, please not have yet another comment thread derailed by Covid?!
Sorry. Not about COVID, about a particular analysis by CDC of vaccine efficacy that seems plausibly problematic. If Andrew thinks it’s interesting enough maybe he can do a post on it.
The methodology is called “test-negative case-control design”. You use a hospitalized population where everyone is getting a test for the same reason, and you compare vaccination status against test results. Here’s a 2019 discussion if you’re worried that this methodology was invented post-COVID.
https://pubmed.ncbi.nlm.nih.gov/31430265/
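For anyone unfamiliar with it, the usual estimate from this design (as I understand it) is one minus the odds ratio of vaccination comparing test-positive to test-negative patients. A toy calculation, with all counts made up for illustration:

```python
# Toy test-negative calculation; all four counts are hypothetical,
# not from the CDC analysis being discussed.
vax_pos, vax_neg = 30, 270        # vaccinated patients: test-positive / test-negative
unvax_pos, unvax_neg = 120, 180   # unvaccinated patients: test-positive / test-negative

odds_ratio = (vax_pos / vax_neg) / (unvax_pos / unvax_neg)
ve_estimate = 1 - odds_ratio      # the usual test-negative VE estimate
print(round(ve_estimate, 3))      # 0.833 with these made-up numbers
```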
Thanks for the pointer. Reading some related papers now, and have to admit I don’t really understand the argument that this is a consistent estimator of VE with respect to hospitalization with COVID, but I will need to chew on it more.
OK, I might be really thick, but the paper that (I think) introduced this design seems to be making a couple of fundamental conceptual errors:
https://www.sciencedirect.com/science/article/pii/S0264410X13002429?casa_token=QeMGqGGOrWoAAAAA:O6rTC0HdU9z861raX5mmASOe6znHY9sigiWwEqBbH6gWSCqpIhtlDIt3FRAR-nRPyUM69BfO0w
Read section 2, where they lay out their conceptual framework and define their estimand. The story is that there are people who would be hospitalized if they got an ARI, and people who would not be. Some of each type are vaccinated and some of each type are infected with flu.
In this framework, what is VE with respect to hospitalization with flu? It is 1 minus the failure rate ratio between vaccinated and unvaccinated people. (Here, failure is being hospitalized with flu.) Using their notation (see table 1), the vaccinated failure rate is A / (N1 + N2), and the unvaccinated failure rate is G / (N3 + N4). But their equation 1 states (without explanation) that these are A / N1 and G / N3, respectively. So that right there is an error in defining the VE correctly.
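Spelling out the two quantities side by side, with A, G, and N1–N4 as in their table 1 (as I read it):

$$\mathrm{VE} \;=\; 1 - \frac{A/(N_1+N_2)}{G/(N_3+N_4)} \quad \text{(failure rates over all vaccinated and unvaccinated people)}, \qquad \mathrm{VE}_{\text{eq.\,1}} \;=\; 1 - \frac{A/N_1}{G/N_3}.$$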
Secondly, they say that they assume no effect modification by patient type (would be versus wouldn’t be hospitalized with an ARI). But this makes no sense—a patient who wouldn’t be hospitalized with an ARI ipso facto wouldn’t be hospitalized with flu, so the failure rate is 0 in both groups in patients of this type, and VE is undefined. So this is a second error.
The rest of the paper seems to just take this estimand for granted, and talk about how you can estimate it using only patients hospitalized with an ARI. But this is the wrong estimand! And the other papers I’ve seen seem to uncritically cite this paper as if it demonstrates the possibility of doing the impossible—estimating the effect of a vaccine on hospitalization without any data on unhospitalized patients.
Am I crazy or is this as silly as it seems to me?
I honestly think you are being a bit crazy about this. You say that the authors assert Equation 1 “without explanation” but Equation 1 is literally preceded by this text:
“if VE against influenza infection is the same in those who do vs. do not seek care for ARI (that is, there is not effect modification by care-seeking), then the VE parameter of interest is”
+1
And I love your 18 elephants story. Basically data augmentation in order to have an easier solution to a problem, used a lot in certain areas of statistics. Do you make that connection when you teach?
Thanks!
I just recently went through something similar with nonsense data tables. The journal cleared the paper of misconduct. However, my issue was not a retraction but an error correction, as I stated at the outset. Once I clarified that I didn’t want a retraction but a correction, the journal stopped responding to me.
Maybe that’s worse?
Psyoskeptic:
I was on the editorial board of a journal where something came up about a paper that everyone agreed was wrong. The other editors didn’t want to retract the paper because they viewed retraction as a punishment, and they didn’t think the authors had done things wrong on purpose. I found some documents saying that retraction is not a punishment etc., but the other editors didn’t listen to me. It was very annoying.
Andrew: “I can’t imagine what it would be like to work in the same department as one of those people.”
Many possibilities come to mind. Maybe their colleagues ignore them. (Even at a well-functioning university, there’s nothing to be gained by pointing out awful work that some people are doing, and one might delude oneself into thinking it’s not actual fraud.) Or maybe they’re unaware. Or maybe it’s a collective, conscious effort to game, by hook or by crook, a system they consider flawed. Or maybe there’s a shared culture of bad work. The last possibility reminds me of a fun paper:
“The LL game: The curious preference for low quality and its norms”
https://journals.sagepub.com/doi/full/10.1177/1470594X11433740
(The PDF is easy to find.)
“… agents in our low-quality worlds are oddly ‘pro-social’: for the advantage of maximizing their raw self-interest, they prefer to receive low-quality goods and services, provided that they too can in exchange deliver low quality without embarrassment. … We argue that high-quality collective outcomes are endangered not only by self-interested individual defectors, but by ‘cartels’ of mutually satisfied mediocrities.”
+1. The LL paper is fascinating, thanks for sharing
“The curious preference for low quality and its norms”
Interesting paper but was this heretofore unknown? Seems like this is common behavior in “mature” industries, or even in individual companies. People suppress anything that might rock the boat or otherwise endanger a predictable/robotic mode of operations, where no one has to think or make significant decisions, they can just keep stamping papers.
It’s exactly this kind of “norm” that makes industries and businesses ripe for disruptive innovation.
Ron Chernow’s book on Rockefeller describes the nascent oil industry as mostly very sloppily run companies seeking to make the most money as quickly as possible with the least amount of work. Rockefeller disrupted this equilibrium with an intense focus on quality, efficiency and productivity – which generated the cash that allowed him to steamroller across the industry, buying one refiner after another, dramatically increasing the efficiency of operations of each and generating more cash to buy the next.
Of course, government operations aren’t subject to disruptive innovation, so mediocrity in government operations likely steadily increases as it is indefinitely perpetuated.
I wonder if there’s sociology of science papers on this phenomenon. The science communication talking point is always that science self-corrects. But, well, we all know it doesn’t.
“Science self-correcting” has two meanings, I think. It might be nice to think that individual papers that are wrong, seriously flawed, methodologically suspect, etc., might be highlighted and withdrawn from the scientific literature. That does happen occasionally, but generally it doesn’t, as in the examples on this thread.
The more general meaning of self-correcting continues to work pretty well IMHO, but perhaps mostly in the biological/physical sciences where there is considered to be a fairly well-defined external reality. In this case “self-correcting” means something more like “the truth will out,” and stuff that turns out to be wrong will be largely ignored, apart perhaps from some interest from science historians. The “truth will out” meaning results from the presence of a community of scientists working on what are considered to be important problems, for which there is a common (some of it competitively personal) interest in finding out what is real and what isn’t (and likely to be a source of funding!). Since the focus of the research effort is some external reality, it’s likely that progress towards that reality will be made even if there are some false starts and blind alleys along the way. It’s possible that there may be long periods of confusion and misdirection before the truth outs – e.g. dark matter might turn out to be bunk, and anyone reading Lee Smolin’s excellent “The Trouble with Physics” might get the feeling that string theory is a massive diversion of effort away from more productive approaches….
But in general people do stuff, find stuff out and publish it. Some of it may be obviously bunk or rubbish and this is simply ignored. If someone publishes something that may be interesting, looks useful or may actually be controversial, it’s very likely that it will be replicated, although the vast majority of replications are not direct replications. For example if someone publishes a structure of a Covid-19 protein and a group are interested in finding drug molecules that might bind to this protein, they may well repeat the structure determination in the presence of their drug molecules of interest. If there is a problem with the original study this is more likely to be found out through this sort of follow up study. This is the sense in which the more general principle of “self-correction” applies.
At least in the past one could assume that research was mostly done in good faith. Researchers were genuinely interested in finding stuff out, planned and performed their experiments carefully, and would attempt to test alternative interpretations for their preferred findings. I think this still does apply, but there may be a growing underbelly of efforts with alternative agendas, and there certainly is a whole lot more rubbish that can largely be ignored. The incentives for publishing are misaligned, unfortunately, which doesn’t help.
I can’t think of good examples of “self-correction” that aren’t biological, biomedical, biophysical. One wellish-known example might be determining the structure of DNA (double helical, antiparallel, etc.). Lots of people worked on this and made various contributions. Linus Pauling published a model for DNA that turned out to be wrong, and the true structure was determined shortly afterwards by Watson and Crick. The structure was bound to be determined at some point. Note that Pauling didn’t retract his wrong paper – why should he have?? But that bit of science self-corrected – Pauling’s paper was not followed up since it was wrong, whereas the Watson and Crick structure had a profound influence. The truth did out.
It’s possible that this post may appear a tad Pollyannaish …
The same thing happens in psycholinguistics. I have seen published papers in a major psych journal that had impossible statistics given the reported means, etc. The editors were famous (I mean really famous) psycholinguists, but I think they didn’t even bother to read the paper. I got hold of the data and reanalyzed it, and the claims in the paper did not hold up. I contacted the editors; one didn’t even bother to reply, and the other was crestfallen. I suggested that the paper should be withdrawn, and the second editor asked the authors what they wanted to do; the authors replied that they “preferred to not retract the paper.” This was all some years ago, and I haven’t followed up on this line of work, but I think I will write about it some day.
I recently reviewed a paper for another major psych journal; same topic and same problem. The reported claims did not hold up when one looks at the data, and several of the key stats analyses were clearly p-hacked. One of the authors of this paper was a famous psychologist (US, white mainstream type person) who should have known better than to allow such garbage to be submitted to a journal. I regretted reviewing that paper, as it was a complete waste of my time.
In many cases, science is just a publication game. It doesn’t really matter what’s in the paper.
I submitted a critique ( https://www.frontiersin.org/articles/10.3389/fnut.2022.896500/full ) of a paper that received some media attention back in March. The managing editor of the original paper was assigned as the managing editor of my critique. He sent my critique out for review, which came back positive, but then ghosted the journal for several weeks. The journal eventually reassigned my critique to another editor, and the piece was accepted. While it was encouraging to have a critique in print, in line with what others have said, my sense from this and other experiences is that nobody cares… just count the “pubs” and move on.