“Evidence-based medicine”: does it lead to people turning off their brains?

Joshua Brooks points us to this post by David Gorski, “The Cochrane mask fiasco: Does EBM predispose to COVID contrarianism?” EBM stands for “evidence-based medicine,” and here’s what Gorski writes:

A week and a half ago, the New York Times published an Opinion piece by Zeynep Tufekci entitled Here’s Why the Science Is Clear That Masks Work. Written in response to a recent Cochrane review, Physical interventions to interrupt or reduce the spread of respiratory viruses, that had over the last month been widely promoted by antimask and antivaccine sources, the article discusses the problems with the review and its lead author Tom Jefferson, as well as why it is not nearly as straightforward as one might assume to measure mask efficacy in the middle of a pandemic due to a novel respiratory virus. Over the month since the review’s publication, its many problems and deficiencies (as well as how it has been unrelentingly misinterpreted) have been discussed widely by a number of writers, academics, and bloggers . . .

My [Gorski’s] purpose in writing about this kerfuffle is not to rehash (much) why the Cochrane review was so problematic. Rather, it’s more to look at what this whole kerfuffle tells us about the Cochrane Collaborative and the evidence-based medicine (EBM) paradigm it champions. . . . I want to ask: What is it about Cochrane and EBM fundamentalists who promote the EBM paradigm as the be-all and end-all of medical evidence, even for questions for which it is ill-suited and for which it can produce misleading results? . . .

Back in the day, we used to call EBM’s failure to consider the low to nonexistent prior probability as assessed by basic science that magic like homeopathy could work its “blind spot.” Jefferson’s review, coupled with the behavior of EBM gurus like John Ioannidis during the pandemic, made me wonder if there’s another blind spot of EBM that we at SBM have neglected, one that leads to Cochrane reviews like Jefferson’s and leads EBM gurus like Ioannidis to make their heel turns so soon after the pandemic hit . . .

[Regarding the mask report,] perusing the triumphant gloating on social media from ideological sources opposed to COVID-19 interventions, including masks and vaccines, I was struck by how often they used the exact phrase “gold standard” to portray Cochrane as an indisputable source, all to bolster their misrepresentation. . . .

Gorski continues:

I’ve noticed over the last three years a tendency for scientists who were known primarily before the pandemic as strong advocates of evidence-based medicine (EBM) to devolve into promoters of COVID-19 denial, antimask, anti-public health, and even antivaccine pseudoscience. Think Dr. John Ioannidis, whom I used to lionize before 2020. Think Dr. Vinay Prasad, of whose work on medical reversals and calls for more rigorous randomized clinical trials of chemotherapy and targeted therapy agents before FDA approval we generally wrote approvingly.

Basically, what Jefferson exhibited in his almost off-the-cuff claim that massive RCTs of masks should have been done while a deadly respiratory virus was flooding UK hospitals was something we like to call “methodolatry,” or the obscene worship of the RCT as the only method of clinical investigation. . . .

But it’s not so simple:

Human trials are messy. It is impossible to make them rigorous in ways that are comparable to laboratory experiments. Compared to laboratory investigations, clinical trials are necessarily less powered and more prone to numerous other sources of error: biases, whether conscious or not, causing or resulting from non-comparable experimental and control groups, cuing of subjects, post-hoc analyses, multiple testing artifacts, unrecognized confounding of data due to subjects’ own motivations, non-publication of results, inappropriate statistical analyses, conclusions that don’t follow from the data, inappropriate pooling of non-significant data from several, small studies to produce an aggregate that appears statistically significant, fraud, and more.

Evidence-based medicine eats itself

For some background on the controversies surrounding “evidence-based medicine,” see this 2017 news article by Aaron Carroll.

Here’s how I summarized things back in 2020, in my post entitled “Evidence-based medicine eats itself”:

There are three commonly stated principles of evidence-based research:

1. Reliance when possible on statistically significant results from randomized trials;

2. Balancing of costs, benefits, and uncertainties in decision making;

3. Treatments targeted to individuals or subsets of the population.

Unfortunately and paradoxically, the use of statistics for hypothesis testing can get in the way of the movement toward an evidence-based framework for policy analysis. This claim may come as a surprise, given that one of the meanings of evidence-based analysis is hypothesis testing based on randomized trials. The problem is that principle (1) above is in some conflict with principles (2) and (3).

The conflict with (2) is that statistical significance or non-significance is typically used at all levels to replace uncertainty with certainty—indeed, researchers are encouraged to do this and it is standard practice.

The conflict with (3) is that estimating effects for individuals or population subsets is difficult. A quick calculation finds that it takes 16 times the sample size to estimate an interaction as it does to estimate a main effect, and given that we are lucky if our studies are powered well enough to estimate main effects of interest, it will typically be hopeless to try to obtain near-certainty regarding interactions. That is fine if we remember principle (2), but not so fine if our experiences with classical statistics have trained us to demand statistical significance as a prerequisite for publication and decision making.
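
To see where the factor of 16 comes from, here is a quick simulation sketch (my own illustration with made-up numbers, not part of the original post): in a balanced 2x2 design the interaction is a difference of differences estimated from quarter-sized groups, so its standard error is twice that of the main effect. Matching that precision requires 4 times the data, and if the interaction is also expected to be about half the size of the main effect, that’s another factor of 4, giving 16.

    # Simulate a balanced 2x2 design with pure noise and compare the
    # standard errors of the main-effect and interaction estimates.
    set.seed(123)
    n_sims <- 2000
    n <- 400  # total sample size, 100 per cell
    sims <- replicate(n_sims, {
      x1 <- rep(c(0, 1), each = n / 2)
      x2 <- rep(c(0, 1), times = n / 2)
      y <- rnorm(n)  # pure noise; we only care about the standard errors
      main <- mean(y[x1 == 1]) - mean(y[x1 == 0])
      inter <- (mean(y[x1 == 1 & x2 == 1]) - mean(y[x1 == 1 & x2 == 0])) -
               (mean(y[x1 == 0 & x2 == 1]) - mean(y[x1 == 0 & x2 == 0]))
      c(main = main, inter = inter)
    })
    apply(sims, 1, sd)  # the interaction's s.e. is about twice the main effect's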

Bridges needed

The above-linked Gorski post was interesting to me because it presents a completely different criticism of the evidence-based-medicine paradigm.

It’s not that controlled trials are bad; rather, the deeper problems seem to be: (a) inferential summaries and decision strategies that don’t respect uncertainty (that was my concern) and (b) research agendas that don’t engage with scientific understanding (that was Gorski’s concern).

Regarding that latter point: a problem with standard “evidence-based medicine,” or what I’ve called the “take a pill, push a button model of science,” is not that it ignores scientific theories, but rather that it features a gap between theory and evidence. On one side there are theory-stories of varying levels of plausibility; on the other side there are statistical summaries from (necessarily) imperfect studies.

What we need are bridges between theory and evidence. This includes sharper theories that make quantitative predictions that can be experimentally studied, empirical studies measuring intermediate outcomes, and lab experiments to go along with the field studies.

The authors of research papers have no obligation to share their data and code, and I have no obligation to believe anything they write.

Michael Stutzer writes:

This study documents substantial variability in different researchers’ results when they use the same financial data set and are supposed to test the same hypotheses. More generally, I think the prospect for reproducibility in finance is worse than in some areas, because there is a publication bias in favor of a paper that uses a unique dataset provided by a firm. Because this is proprietary data, the firm often makes the researcher promise not to share the data with anybody, including the paper’s referees.

Read the leading journals’ statements carefully and you find that they don’t strictly require sharing.

Here is the statement for authors made by the Journal of Financial Econometrics: “Where ethically and legally feasible, JFEC strongly encourages authors to make all data and software code on which the conclusions of the paper rely available to readers. We suggest that data be presented in the main manuscript or additional supporting files, or deposited in a public repository whenever possible.”

In other words, an author wouldn’t have to share a so-called proprietary data set as defined above, even with the paper’s referees. What is worse, the leading journals not only accept these restrictions, but seem to favor such work over what is viewed as more garden-variety work that employs universally available datasets.

Interesting. I think it’s just as bad in medical or public health research, but there the concern is sharing confidential information. Even in settings where it’s hard to imagine that the confidentiality would matter.

As I’ve said in other such settings, the authors of research papers have no obligation to share their data and code, and I have no obligation to believe anything they write.

That is, my preferred solution is not to nag people for their data, it’s just to move on. That said, this strategy works fine for silly examples such as fat arms and voting, or the effects of unionization on stock prices, but you can’t really follow it for research that is directly relevant to policy.

My two courses this fall: “Applied Regression and Causal Inference” and “Communicating Data and Statistics”

POLS 4720, Applied Regression and Causal Inference:

This is a fast-paced one-semester course on applied regression and causal inference based on our book, Regression and Other Stories. The course has an applied and conceptual focus that’s different from other available statistics courses.
Topics covered in POLS 4720 include:
• Applied regression: measurement, data visualization, modeling and inference, transformations, linear regression, and logistic regression.
• Simulation, model fitting, and programming in R.
• Causal inference using regression.
• Key statistical problems include adjusting for differences between sample and population, adjusting for differences between treatment and control groups, extrapolating from past to future, and using observed data to learn about latent constructs of interest.
• We focus on social science applications, including but not limited to: public opinion and voting, economic and social behavior, and policy analysis.
The course is set up using the principles of active learning, with class time devoted to student-participation activities, computer demonstrations, and discussion problems.

The primary audience for this course is Poli Sci Ph.D. students, and it should also be ideal for statistics-using graduate students or advanced undergraduates in other departments and schools, as well as students in fields such as computer science and statistics who’d like to get an understanding of how regression and causal inference work in the real world!

STAT 6106, Communicating Data and Statistics:

This is a one-semester course on communicating data and statistics, covering the following modes of communication:
• Writing (including storytelling, writing technical articles, and writing for general audiences)
• Statistical graphics (including communicating variation and uncertainty)
• Oral communication (including teaching, collaboration, and giving presentations).
The course is set up using the principles of active learning, with class time devoted to discussions, collaborative work, practicing and evaluation of communication skills, and conversations with expert visitors.

The primary audience for this course is Statistics Ph.D. students, and it should also be ideal for Ph.D. students who do quantitative work in other departments and schools. Communication is sometimes thought of as a soft skill, but it is essential to statistics and scientific research more generally!

See you there:

Both courses have lots of space available, so check them out! In-person attendance is required, as class participation is crucial for both. POLS 4720 is offered Tu/Th 8:30-10am; STAT 6106 will be M/W 8:30-10am. These are serious classes, with lots of homework. Enjoy.

Chris Chambers’s radical plan for Psychological Science

Someone pointed me to this Vision Statement by Chris Chambers, a psychology professor who would like to edit Psychological Science, a journal that just a few years ago was notorious for publishing really bad junk science. Not as bad as PNAS at its worst, perhaps, but pretty bad. Especially because they didn’t just publish junk, they actively promoted it. Indeed, as late as 2021 the Association for Psychological Science was promoting the ridiculous “lucky golf ball” paper they’d published back in the bad old days of 2010.

So it does seem that the Association for Psychological Science and its journals are ripe for a new vision.

See here for further background.

Chambers has a 12-point action plan. It’s full of details about Accountable Replications and Exploratory Reports and all sorts of other things that I don’t really know about, so if you’re interested I recommend you just follow the link and take a look for yourself.

My personal recommendation is that authors, when responding to criticism, not be allowed to claim that the discovery of errors “does not change the conclusion of the paper.” Or, if authors want to make that claim, they should be required to make it before publication, a kind of declaration of results independence. Something like this: “The authors attest that they believe their results so strongly that, no matter what errors are found in their data or analysis, they will not change their beliefs about the results.” Just get it out of the way already; this will save everyone lots of time that might otherwise be spent reading the paper.

Differences between boy and girl dinosaurs: Going beyond p-values and dichotomous thinking

John Williams points us to this post by Evan Thomas Saitta on sexual dimorphism in dinosaurs. Saitta’s post begins:

In most animal species, males and females differ. This is true for people and other mammals, as well as many species of birds, fish and reptiles. But what about dinosaurs? In 2015, I proposed that variation found in the iconic back plates of stegosaur dinosaurs was due to sex differences.

I was surprised by how strongly some of my colleagues disagreed, arguing that differences between sexes, called sexual dimorphism, did not exist in dinosaurs.

I am a paleontologist, and the debate sparked by my 2015 paper has made me reconsider how researchers studying ancient animals use statistics.

The limited fossil record makes it hard to declare if a dinosaur was sexually dimorphic. But I and some others in my field are beginning to shift away from traditional black-or-white statistical thinking that relies on p-values and statistical significance to define a true finding. Instead of only looking for yes or no answers, we are beginning to consider the estimated magnitude of sexual variation in a species, the degree of uncertainty in that estimate and how these measures compare to other species. This approach offers a more nuanced analysis to challenging questions in paleontology as well as many other fields of science.

This is a good example of how scientists can think better by moving beyond p-values and statistical significance. Also, dinosaurs are fun.
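
To make the estimate-and-uncertainty point concrete, here is a toy sketch in R (invented numbers, nothing to do with Saitta’s actual data): instead of reporting only whether p < 0.05, report the estimated magnitude of dimorphism along with its uncertainty, which can then be compared across species.

    # Hypothetical plate measurements for two putative sexes
    set.seed(2015)
    plate_a <- rnorm(8, mean = 60, sd = 10)
    plate_b <- rnorm(7, mean = 50, sd = 10)
    fit <- t.test(plate_a, plate_b)
    fit$p.value                         # the dichotomous summary
    fit$estimate[1] - fit$estimate[2]   # estimated difference in means...
    fit$conf.int                        # ...and the uncertainty around it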

I guess that most of the readers of our blog would be happy if we posted more about dinosaurs and less about politicians.

Cureus, its reviewing, and its “Scholarly Impact Quotient”

I came across Cureus during my debunking of COVID-19 articles, as a journal (affiliated with Springer Nature Group) with an “interesting” model. So far, my impression has been that the expedited reviewing it offers is its main weakness, and all of the papers I was sent that were published there were of rather low quality. Given how long publishing a manuscript can take, faster reviewing seems like an advantage, but I have argued in the past that, if anything, expedited reviewing is a problem, particularly if the reviewers’ reports are not available.

Anyway, I guess it is not too much of a problem to provide faster reviewing for case reports, which Cureus seems to publish a lot of. But it is more problematic for papers that are not case reports. For instance, a re-analysis of two studies about ivermectin published in Cureus highlights that “Untreated statistical artefacts and methodological errors alone lead to dramatic apparent risk reduction associated with Ivermectin use in both studies.” More problematic papers on the “miracle drug” ivermectin have been found in Cureus, and they all have quite damning reports on PubPeer (see, e.g., this thread, or this one).

But what I found interesting, and what was brought to my attention only today by Thomas Kesteman, is Cureus’s use of the Scholarly Impact Quotient (SIQ), which the journal describes as follows: “SIQ™ is designed to supplement our pre-publication peer review process and allows authors to receive immediate, valuable reader feedback without the politics inherent in most other metrics determining so-called ‘importance.’ We do not consider impact factor to be a reliable or useful metric of individual article importance. Cureus is a signer of DORA – the San Francisco Declaration on Research Assessment – and does not promote the use of journal impact factors.”

I find that quite interesting, and I am usually in favour of journals and initiatives trying something different. But I am quite afraid that the metric can easily be gamed. For instance, the bad ivermectin studies mentioned above all seem to have good scores. In any case, I am not condemning an initiative such as this one at all, just wondering what it will provide and how it will be gamed.

Editorial processes and ethics approval — Case study of 248 studies with the same ethics approval number

This is Lonni Besançon, a new contributor to this blog. A quick introduction before I dive into the topic of this first post. I am an assistant professor at Linköping University in Sweden. My research focus is on data visualization (which Jessica often talks about here). In particular, I have become more and more interested in using different visual representations to avoid common pitfalls of dichotomous thinking with statistics, an idea that was recently discussed on the blog here. This research and my interest in metascience and methodological work have slowly pushed me to focus more and more on the importance of transparency (see, e.g., my analysis of the lack of transparency of the papers published in the first 6 months of the COVID-19 pandemic, including 200+ papers reviewed in a day or less, with 30% of them having editorial conflicts of interest), the importance of post-publication peer review (e.g., a successful re-analysis of published results, and I say successful because Scientific Reports eventually retracted the paper), and the importance of ethics and sleuthing efforts.
Expect my contributions to the blog to be on these topics; I hope I can spark some conversations and raise awareness around them.

This brings me to today’s post. With collaborators, I have analyzed 456 clinical/microbiology papers from a single institute to look at the ethics approvals they mention. Our findings are published in the journal Research Integrity and Peer Review. We have some concerns: we found 248 heterogeneous studies (different samples, different populations, different countries) sharing the same single ethics approval number. You can find a detailed breakdown of how these studies differ below. We have found similar concerns (e.g., recruitment of participants before ethics approval was obtained) in another set of 200 papers.


What is worrying here? First, there are no real COPE guidelines on what can be done at the editorial or whistleblower level. Second, we have reached out to all the editors who published these papers, and not many have replied yet. This is perhaps unsurprising, since such matters always take time to resolve, but we believe that an expression of concern (EoC) could easily be appended to these articles while investigations take place. In addition, the vast majority of these papers were published in journals with which the authors had some editorial conflicts of interest, as visible below (see this excellent paper by Scanff et al. on nepotism in science, which features some of this institute’s research, or our Open Science during COVID paper again).
We have not yet been able to check what the ethics approval reused 248 times could cover, so for now we only raise concerns and cannot conclude anything beyond the number of reuses and the nature of the studies.


Following these findings, we recommend that submission processes be amended to include ethics information/approval numbers for clinical research, so that analyses like ours are easier to carry out and so that potential abuse of ethics approvals is easier to cross-check.

Report on the large language model meeting at Berkeley

I’m not the only one who thinks GPT-4 is awesome. I just got back from an intense week of meetings at the Large language models and transformers workshop at Berkeley’s Simons Institute for the Theory of Computing. Thanks to Umesh Vazirani for organizing and running it so calmly.

Here are the videos of the talks.

Human feedback models

I gave a short talk Friday afternoon on models of data annotation.

  • Bob Carpenter. Softening human feedback improves classifier calibration

    The step from the language model GPT-4 to the chatbot ChatGPT involves soliciting human feedback to rank potential outputs. This is typically done by converting the human feedback to a “gold” standard and retraining the baseline GPT-4 neural network.

    Chris Manning (who introduced me to statistical natural language processing when we were both professors at CMU) provides a nice high-level overview of how OpenAI uses reinforcement learning with human feedback to try to align ChatGPT to the goals of being helpful, harmless, and truthful.

    Chris Manning. Towards reliable use of large language models: better detection, consistency, and instruction-tuning.

    Humans rank potential ChatGPT output and their feedback is used as input for a Bradley-Terry model of conversational reward. This is then used to retrain the network. Chris suggests a much simpler approach than the one they use.
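
    Here is a minimal sketch in R of the Bradley-Terry idea (my own toy example, not Bob’s code and not OpenAI’s actual pipeline): pairwise human preferences between candidate responses are modeled as a logistic regression on the difference of the two responses’ latent reward scores, and the fitted scores can serve as a reward signal.

        # Toy Bradley-Terry fit on simulated pairwise preferences.
        # Each comparison records which of two hypothetical responses a
        # human preferred; the fitted "ability" of each response stands in
        # for the learned reward.
        set.seed(1)
        n_items <- 10                      # hypothetical candidate responses
        true_reward <- rnorm(n_items)
        n_comp <- 500
        i <- sample(n_items, n_comp, replace = TRUE)
        j <- sample(n_items, n_comp, replace = TRUE)
        keep <- i != j
        i <- i[keep]; j <- j[keep]
        y <- rbinom(length(i), 1, plogis(true_reward[i] - true_reward[j]))
        X <- matrix(0, length(i), n_items)  # +1 for the first item, -1 for the second
        X[cbind(seq_along(i), i)] <- 1
        X[cbind(seq_along(i), j)] <- -1
        fit <- glm(y ~ X[, -1] - 1, family = binomial)    # drop one column to identify
        cor(coef(fit), true_reward[-1] - true_reward[1])  # recovers relative rewards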

    While at the workshop, John Thickstun, a Stanford CS postdoc, pointed me to the following (and also filled me in on a bunch of technical details in between sessions).

    Chen Cheng, Hilal Asi, and John Duchi. 2022. How many labelers do you have? A close look at gold-standard labels. arXiv.

    It makes some simplifying assumptions to prove results including the bias of majority voting. I show similar things through simulation in a case study I posted on the Stan forums a while back,

    Bob Carpenter. For probabilistic prediction, full Bayes is better than point estimators.

    More on that soon when Seong Han and I finish our recent paper on annotation models.
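
    As a concrete illustration of the majority-voting bias mentioned above, here is a toy simulation (mine, not taken from the Cheng et al. paper or my case study): with imperfect annotators, majority-vote “gold” labels pull the estimated prevalence toward 0.5, whereas a model that estimates annotator accuracy (e.g., Dawid-Skene or a hierarchical Bayesian version) can undo the bias.

        # Majority voting over noisy annotators gives a biased "gold standard."
        set.seed(42)
        n_items <- 1e5
        prevalence <- 0.2
        accuracy <- 0.75                     # each annotator is right 75% of the time
        z <- rbinom(n_items, 1, prevalence)  # true labels
        votes <- replicate(3, ifelse(rbinom(n_items, 1, accuracy) == 1, z, 1 - z))
        majority <- rowSums(votes) >= 2
        mean(z)         # about 0.20, the truth
        mean(majority)  # about 0.29, biased toward 0.5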

    LLMs and copyright

    The highlight of the entire event for me was a captivating talk by a brilliant professor of intellectual property law at Berkeley:

    Pamela Samuelson. Large language models meet copyright law.

    If you’re at all interested in copyright and AI, you should watch this. She very clearly explains what copyright is and how the law treats works of artistic expression differently from functional works, and hence how it sees code and (other) artistic works differently. She also covers the basis for the cases currently being litigated. She was also masterful at handling the rather unruly crowd. I’ve never been to an event with so many interruptions by the audience members. It was almost like the audience was practicing to be DARPA program managers (a notoriously interruption-prone group).

    Is ChatGPT just a stochastic parrot?

    The other talk I’d encourage everyone to watch is

    Steven Piantadosi. Meaning in the age of large language models.

    He goes over a lot of the cognitive science and philosophy of language necessary to understand why ChatGPT is not just a “stochastic parrot.” He focuses on the work of, wait for it…Susan Gelman (Andrew’s sister). Susan works in my favorite area of cognitive science—concept development.

    I can’t recommend this one highly enough, and I’ll be curious what people get out of it. This one’s closest to my own background (my Ph.D. is joint in cognitive science/computer science and I taught semantics, philosophy of language, and psycholinguistics as well as NLP at CMU), so I’m curious how understandable it’ll be to people who haven’t studied a lot of cognitive anthropology, philosophy of mind, and cognitive development.

    Sanjeev Arora gave a talk about combining skills, describing a simple combinatorial experiment of combining five “skills” drawn from among thousands (not defining what a skill was drove the audience into a frenzy of interruptions that’s quite something to watch). This is behavior that “emerged” in GPT-4 and isn’t so great in the less powerful models.

    Speaking of parrots, the West Coast Stats Views blog (which Andrew often cites) is parroting mainstream chatbot FUD (fear, uncertainty, and doubt); see, e.g., Thursday tweets. I say “parrot” because the blog’s Thursday posts just point to things we used to call “tweets” before a certain someone decided to throw away a brand name that’s become a verb. The irony, of course, is that they accuse GPT of being a parrot!

    Scaling laws

    There were a couple of nice talks by Yasaman Bahri (DeepMind) on Understanding the origins and taxonomy of neural scaling laws and Sasha Rush (Cornell/Hugging Face) on Scaling data-constrained language models. These are important as they’re what allows you to decide how much data to use and how large a model to build for your compute budget; it’s what gave companies the incentive to invest hundreds of millions of dollars in infrastructure to fit these models. Sasha’s talk also discusses the roles that researchers who don’t have access to big-tech compute power can take.
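
    At its simplest, a scaling law is just a power-law fit of loss against model or data size that you can extrapolate to budget a training run. Here is a toy sketch (synthetic numbers of my own, not the DeepMind or Chinchilla fits, which also include an irreducible-loss term and compute-optimal tradeoffs):

        # Fit a power law, loss ~ a * N^(-b), to synthetic loss-vs-size data.
        set.seed(3)
        n_params <- 10^seq(6, 10, by = 0.5)              # hypothetical model sizes
        loss <- 400 * n_params^(-0.35) * exp(rnorm(length(n_params), 0, 0.02))
        fit <- lm(log(loss) ~ log(n_params))
        coef(fit)                                        # slope recovers the exponent, about -0.35
        exp(predict(fit, newdata = data.frame(n_params = 1e11)))  # extrapolate to a larger model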

    Watermarking LLMs

    Scott Aaronson (UT Austin, on sabbatical at OpenAI) gave a really interesting talk,

    Scott Aaronson. Watermarking of large language models

    The talk’s a masterclass in distilling a problem to math, explaining why it’s difficult, and considering several solutions and their implications. I felt smarter after watching this one.

    You might also want to check out the competition from John Thickstun, Robust distortion-free watermarks for language models, which takes an encryption key-based approach.

    In-context learning

    “In-context learning” is what people call an LLM’s ability to be given zero or more examples and then complete the pattern. For example, if I say “translate to English. oui: ”, we get what’s called “zero-shot learning”. If I give it an example first, it’s called “one-shot learning”, for example, “translate to English. notre: our, oui: ”, and so on. ChatGPT can manage all sorts of nuanced language tasks given only a few examples. It’s so good that it’s competitive with most custom solutions to these problems.

    Everyone kept pointing out that in-context learning did not learn in the sense of updating model weights. Of course it doesn’t. That’s because it’s conditioning, not learning. The whole process is Markovian, returning a draw from Pr[continuation | context]. The issue is whether you can do a good job of this prediction without being AI-complete (i.e., a general artificial intelligence, whatever that means).

    A whole lot of attention was given to ChatGPT’s poor performance on arithmetic problems coded as characters like “123 * 987”, with a couple of different talks taking different approaches. One trained a transformer on the actual digits and showed it could be made to do this, pointing to the problem being the encoding of math in language. Perhaps the most insightful observation was that if you give GPT access to an API (with in-context learning, no less), it can call on that API to do arithmetic and the problem goes away. The final talk on this was during the lightning sessions, where Nayoung Lee (a Ph.D. student from Wisconsin-Madison) showed that if you reverse the digits in the output (so that they are least significant first, the way we usually do arithmetic), transformers can be trained to do arithmetic very well; here’s a link to her arXiv paper, Teaching arithmetic to small transformers.
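
    To illustrate that encoding trick (my toy example, not the paper’s code), writing the answer least-significant-digit first matches the order in which addition actually carries digits:

        # Format a training example with the answer's digits reversed.
        reverse_digits <- function(n) {
          paste(rev(strsplit(as.character(n), "")[[1]]), collapse = "")
        }
        make_example <- function(a, b) paste0(a, "+", b, "=", reverse_digits(a + b))
        make_example(123, 987)  # "123+987=0111", i.e., 1110 written in reverse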

    Sparks of artificial general intelligence

    Yin Tat Lee kicked off the program talking about the Microsoft paper, Sparks of Artificial General Intelligence. If you haven’t read the paper, it’s a fascinating list of things that ChatGPT can do with relatively simple prompting.

    One of the interesting aspects of Yin Tat’s talk was his description of how they treated ChatGPT (4) as an evolving black box. To me, this and a bunch of the other talks probing GPT’s abilities point to the need for much better evaluation methods.

    Federated learning and task specialization

    For those interested in hierarchical modeling and the idea of a foundation model that can be adapted to different tasks, Colin Raffel (UNC/Hugging Face) gave an interesting talk on federated learning and deployment.

    Colin Raffel. Build an ecosystem, not a monolith

    This was not unrelated to Sasha’s talk (perhaps not surprising as they’re both affiliated with Hugging Face). They also talk about the ecosystem of image models sprouting up around Stable Diffusion and the ability to fine-tune them using low rank methods.

    OpenAI is closed

    Ilya Sutskever, chief scientist of OpenAI, gave a talk that I can best describe as adversarial. It was the only talk that filled the room to standing room only. He said he couldn’t talk about anything computational or anything related to LLMs, so he spent an hour talking about the duality between probability and compression and Kolmogorov complexity.

    Bob Carpenter thinks GPT-4 is awesome.

    Bob writes:

    GPT4 is even better at Stan than GPT3 and even better at C++. It’s sooo much better at longer chains of reasoning, writing style, and being able to maintain coherence. Check this two-turn dialogue to get it to write a non-centered nonlinear mixed effects model. I followed it up with a one-prompt C++ question to get it to write a full N-body simulation, which it nailed. I’m sure it could’ve used other integration methods like Euler’s method or Runge-Kutta.

    It’s easy at this point to have it start riffing on the model by adding effects, removing effects, switching to varying slope/varying intercept, etc. It can store up to about 5000 words of context when generating the next 1000 words (and there’s a version you can pay even more for with a roughly 25K word context).

    The fact that it can keep all the indexes coherent and line everything up is still mind-boggling to me. GPT4 is qualitatively better at this than GPT3 was. This thing was never taught to code Stan per se. It was just taught which kinds of randomly generated code look like real code! Nevertheless, it picked up things that look like the weakly informative priors people use in practice for logistic regression.

    Of course, it has trained on everything available about Stan online, which isn’t much. It’s way better at C++, Python, and R. I included a C++ example using physics below that you should appreciate.

    I had GPT3 coach me through writing a speech input, speech output interface to GPT. I built it in Python using the Qt GUI toolkit, the Elevenlabs text to speech API, and the Whisper speech to text API. I now have a (push to talk) speech interface to GPT.

    BC: I want to write a Stan program to define a nonlinear mixed effects model. I want to use a logistic regression to predict response to an intervention in hospital trauma centers. Each binomial measurement is a number of successes out of number of trials. Each hospital has a state in which it is located (50 states), and a trauma center level (1, 2, or 3), both of which get random effects with independent hierarchical priors in the model. Then there is are fixed effect for the average income of the hospital patients and the average severity of the cases it sees. And there should be an overall intercept. The fixed effects get fixed priors.

    GPT4: To write a Stan program that defines a nonlinear mixed-effects model with the described specifications, you can follow the below structure:

    data {
     int<lower=0> N;  // Number of observations
     int<lower=0> J;  // Number of hospitals
     int<lower=0> K;  // Number of states
     int<lower=1, upper=3> level[J];  // Trauma center level
     int<lower=1, upper=K> state[J];  // State in which each hospital is located
     int<lower=0> successes[N];  // Number of successes
     int<lower=0> trials[N];  // Number of trials
     real<lower=0> avg_income[N];  // Average income of hospital patients
     real<lower=0> avg_severity[N];  // Average severity of the cases
    }
    
    parameters {
     real<lower=0> intercept;  // Overall intercept
     real<lower=0> beta_income;  // Fixed effect of income
     real<lower=0> beta_severity;  // Fixed effect of severity
     real<lower=0> sigma_state;  // State random effect standard deviation
      real<lower=0> sigma_level;  // Trauma center level random effect standard deviation
      vector[K] state_effect_raw;  // State random effects (non-centered)
      vector[3] level_effect_raw;  // Trauma center level random effects (non-centered)
    }
    
    transformed parameters {
     vector[K] state_effect = sigma_state * state_effect_raw;  // State random effects
     vector[3] level_effect = sigma_level * level_effect_raw;  // Trauma center level random effects
    }
    
    model {
     // Priors for fixed effects
     intercept ~ normal(0, 10);
     beta_income ~ normal(0, 10);
     beta_severity ~ normal(0, 10);
    
     // Priors for random effects standard deviations
     sigma_state ~ cauchy(0, 2.5);
     sigma_level ~ cauchy(0, 2.5);
    
     // Priors for non-centered random effects
     state_effect_raw ~ normal(0, 1);
     level_effect_raw ~ normal(0, 1);
    
     // Model
     for (i in 1:N) {
        int hospital_idx = ...; // Index of the hospital corresponding to the i-th observation
       successes[i] ~ binomial_logit(trials[i], intercept + beta_income *
    avg_income[i] + beta_severity * avg_severity[i] +
    state_effect[state[hospital_idx]] +
    level_effect[level[hospital_idx]]);
     }
    }
    

    This Stan program defines a logistic regression model with fixed effects for average income and average severity of cases, and random effects for the state and trauma center level. The program takes in the number of observations, number of hospitals, number of states, trauma center level, and state of each hospital, as well as the number of successes, number of trials, average income, and average severity of the cases. The model parameters include the overall intercept, fixed effects for income and severity, standard deviations for the random effects, and non-centered random effects for state and trauma center level.

    The transformed parameters block calculates the actual random effects for the state and trauma center level by scaling the non-centered random effects with their respective standard deviations. In the model block, priors are set for the fixed effects and the standard deviations of the random effects. The non-centered random effects are given standard normal priors.

    Finally, the model block iterates through each observation, and for each observation, it identifies the corresponding hospital index. The binomial_logit distribution is used to model the number of successes out of the number of trials, incorporating the fixed effects, the state random effect, and the trauma center level random effect.

    BC: That’s really good, but there are still two errors. (1) There should not be lower bounds on the random effects because we are going to put the linear predictor through an inverse logit transform (implicit in the binomial_logit distribution you used). (2) The indexing is wrong in the model block. The number of observations and the number of hospitals is the same and the index can just represent the hospital (e.g., in order to index the random effects). Please fix these errors and give me just the corrected Stan program.

    GPT4:

    My apologies for the errors in the previous response. I have corrected the errors in the Stan program below:

    data {
     int<lower=0> N;  // Number of observations (equals to the number of hospitals)
     int<lower=0> K;  // Number of states
     int<lower=1, upper=3> level[N];  // Trauma center level
     int<lower=1, upper=K> state[N];  // State in which each hospital is located
     int<lower=0> successes[N];  // Number of successes
     int<lower=0> trials[N];  // Number of trials
     real<lower=0> avg_income[N];  // Average income of hospital patients
     real<lower=0> avg_severity[N];  // Average severity of the cases
    }
    
    parameters {
     real intercept;  // Overall intercept
     real beta_income;  // Fixed effect of income
     real beta_severity;  // Fixed effect of severity
     real<lower=0> sigma_state;  // State random effect standard deviation
     real<lower=0> sigma_level;  // Trauma center level random effect standard deviation
     vector[K] state_effect_raw;  // State random effects (non-centered)
      vector[3] level_effect_raw;  // Trauma center level random effects (non-centered)
    }
    
    transformed parameters {
     vector[K] state_effect = sigma_state * state_effect_raw;  // State random effects
     vector[3] level_effect = sigma_level * level_effect_raw;  // Trauma center level random effects
    }
    
    model {
     // Priors for fixed effects
     intercept ~ normal(0, 10);
     beta_income ~ normal(0, 10);
     beta_severity ~ normal(0, 10);
    
     // Priors for random effects standard deviations
     sigma_state ~ cauchy(0, 2.5);
     sigma_level ~ cauchy(0, 2.5);
    
     // Priors for non-centered random effects
     state_effect_raw ~ normal(0, 1);
     level_effect_raw ~ normal(0, 1);
    
     // Model
     for (i in 1:N) {
       successes[i] ~ binomial_logit(trials[i], intercept + beta_income * avg_income[i] + beta_severity * avg_severity[i] + state_effect[state[i]] + level_effect[level[i]]);
     }
    }
    

    I have removed the lower bounds from the random effects in the parameters block and corrected the indexing in the model block, as the number of observations and the number of hospitals are the same.

    ==================physics time============================

    BC: Can you write a C++ program to carry out an N-body problem simulation using the leapfrog integrator for the Hamiltonian dynamics derived from Newtonian mechanics. It should randomly initialize positions and masses and velocities.

    GPT4: Here’s a simple C++ program for an N-body simulation using the leapfrog integrator. This program initializes positions, masses, and velocities randomly and then simulates the system using the leapfrog method.

    #include <iostream>
    #include <vector>
    #include <cmath>
    #include <random>
    
    const double G = 6.67430e-11; // Gravitational constant
    
    struct Particle {
       double m;
       double x, y, z;
       double vx, vy, vz;
    };
    
    void compute_forces(const std::vector<Particle>& particles,
    std::vector<Particle>& forces) {
       for (size_t i = 0; i < particles.size(); ++i) {
           double fx = 0, fy = 0, fz = 0;
           for (size_t j = 0; j < particles.size(); ++j) {
               if (i == j) continue;
               double dx = particles[j].x - particles[i].x;
               double dy = particles[j].y - particles[i].y;
               double dz = particles[j].z - particles[i].z;
               double d = std::sqrt(dx*dx + dy*dy + dz*dz);
               double f = G * particles[i].m * particles[j].m / (d*d);
               fx += f * dx / d;
               fy += f * dy / d;
               fz += f * dz / d;
           }
           forces[i].x = fx;
           forces[i].y = fy;
           forces[i].z = fz;
       }
    }
    
    void leapfrog_step(std::vector<Particle>& particles, double dt) {
       std::vector<Particle> forces(particles.size());
       compute_forces(particles, forces);
    
       for (size_t i = 0; i < particles.size(); ++i) {
           particles[i].vx += forces[i].x / particles[i].m * dt / 2;
           particles[i].vy += forces[i].y / particles[i].m * dt / 2;
           particles[i].vz += forces[i].z / particles[i].m * dt / 2;
           particles[i].x += particles[i].vx * dt;
           particles[i].y += particles[i].vy * dt;
           particles[i].z += particles[i].vz * dt;
       }
    
       compute_forces(particles, forces);
       for (size_t i = 0; i < particles.size(); ++i) {
           particles[i].vx += forces[i].x / particles[i].m * dt / 2;
           particles[i].vy += forces[i].y / particles[i].m * dt / 2;
           particles[i].vz += forces[i].z / particles[i].m * dt / 2;
       }
    }
    
    int main() {
       const int n_particles = 100;
       const double t_total = 1000.0;
       const double dt = 0.01;
    
       std::random_device rd;
       std::mt19937 gen(rd());
       std::uniform_real_distribution<> mass_dist(1e20, 1e22);
       std::uniform_real_distribution<> pos_dist(-1e11, 1e11);
       std::uniform_real_distribution<> vel_dist(-1e4, 1e4);
    
       std::vector<Particle> particles(n_particles);
    
       for (int i = 0; i < n_particles; ++i) {
           particles[i].m = mass_dist(gen);
           particles[i].x = pos_dist(gen);
           particles[i].y = pos_dist(gen);
           particles[i].z = pos_dist(gen);
           particles[i].vx = vel_dist(gen);
           particles[i].vy = vel_dist(gen);
           particles[i].vz = vel_dist(gen);
       }
       int n_steps = static_cast<int>(t_total / dt);
       for (int step = 0; step < n_steps; ++step) {
           leapfrog_step(particles, dt);
    
       // Print the state of the system at each step (optional)
       /*
       for (const auto& p : particles) {
           std::cout << "Mass: " << p.m << ", ";
           std::cout << "Position: (" << p.x << ", " << p.y << ", " << p.z << "), ";
           std::cout << "Velocity: (" << p.vx << ", " << p.vy << ", " << p.vz << ")\n";
       }
       std::cout << "-----\n";
       */
     }
    
     // Print final state of the system
     for (const auto& p : particles) {
       std::cout << "Mass: " << p.m << ", ";
       std::cout << "Position: (" << p.x << ", " << p.y << ", " << p.z << "), ";
       std::cout << "Velocity: (" << p.vx << ", " << p.vy << ", " << p.vz << ")\n";
     }
     return 0;
    }
    

    This program initializes the positions, masses, and velocities randomly, and then simulates the system using the leapfrog method for a given total time `t_total` and time step `dt`. The state of the system can be printed at each step by uncommenting the optional print statements in the loop. The final state of the system is printed after the simulation is complete.

    BC: What is the best integration method to use for an N-body problem of the scale you just coded?

    GPT4: For N-body problems, especially when simulating gravitational interactions, the best integration methods are those that are symplectic, time-reversible, and have good long-term energy conservation properties.

    One of the popular choices for such problems is the leapfrog integrator, which was used in the code I provided earlier. Leapfrog is a second-order symplectic integrator with good energy conservation properties and is relatively simple to implement. It is suitable for many N-body simulations and provides accurate results for a wide range of problems.

    However, when dealing with highly eccentric orbits or close encounters between bodies, higher-order symplectic integrators like the Verlet, Forest-Ruth, or Yoshida integrators might be more appropriate. These methods provide better accuracy and stability for such challenging scenarios, albeit at the cost of increased complexity and computational effort.

    I asked Bob if he’d like to blog this, and he said, “Overall, everyone’s fatigued from everyone saying how great GPT is, so I’m not so keen to blog it. But hey, feel free if you’d like to.”

    So I decided to post it with a lag. By the time the post appears, perhaps people won’t be talking about GPT anymore. The only things I added above were the html tags to make the code more readable on the blog.

    My overall reaction is the same as Bob’s: I’m impressed. As we see more and more of these chatbot dialogues, it becomes more clear how they work, in general terms. As Bob and others have explained, they’ve pulled tons of text off the internet and are reconstructing it, monkey-on-typewriter style. Oddly enough, the regurgitating nature of the algorithms becomes clearer when the chatbots work well, as above, than when they fail, as in Gary Smith’s examples we discussed a while ago. So, yeah, it doesn’t understand Stan; it can just find various Stan codes on the web that are connected to various statistical questions, and put these together. Kinda like the way that, if I’m trying to get a widget working in Latex and I’m too lazy to try to figure out what’s going on, I’ll search the web for some relevant code. But . . . damn, yeah, it’s pretty impressive how well this works, how close it comes to a reasonable answer, even while not following the logic that a human would use to answer these questions. Or, maybe it’s not so far from the logic that a human would use to try to get in the right ballpark.

    Also, I don’t like the blank lines between paragraphs of the code. When I’m coding, a key resource is real estate on the screen, and I don’t like to waste space in that way. Every blank line represents a line of code somewhere else that I don’t get to see. But we can’t blame the chatbot—it’s just copying all the code-with-blank-lines that’s already on the web!

    DSGE and GPT3

    Pedro Franco sends in two items.

    The first relates to dynamic stochastic general equilibrium (DSGE) models in economics. Franco writes:

    Daniel McDonald and Cosma Shalizi take standard DSGE models (short-run macro models) and subject them to some fairly simple and clear tests (run the model and see how well it can estimate with centuries of data; take data and shuffle it around so inflation is GDP, GDP is interest rates, etc.) and… the results are not great.

    Noah Smith, who interviewed you a while ago, ended up beating me to the punch with an extensive article about the basics of what happened.

    There were further developments afterwards that I haven’t seen summarized anywhere. In short, a bunch of macroeconomists on Twitter were surprised with what had been found in the paper and tried to replicate the results unsuccessfully. As far as I can tell (I’d need to check carefully), the authors of the paper haven’t yet replied to this, so it’s been an interesting example of “open science” overall, where the criticism is presented, counter-argued and things continue.

    I don’t have anything to say about this one. I know next to nothing about macroeconomics. Shalizi is a reasonable guy but there’s just too much going on here for me to follow without having to learn a lot of stuff first. So I’ll just share it here as an example of scientific discussion.

    Franco’s second item is about the performance of chatbots on standardized tests. He writes:

    The latest iteration of GPT4 was presented together with the results of it going through a series of standardized exams (see here and here for the paper), something that got a lot of attention due to the impressive results it achieved. When I first saw this, I was equally impressed (and still am), but I think there’s a question to be asked here that I haven’t seen so far (caveat: I’m not a researcher in the area, so I could easily have missed it) about a couple of things related to this.

    There’s a smaller issue that I’ll go through first, which is with the way they test for contamination in their sample, that is, when they check whether the questions they use to test GPT’s capacity may have been part of their training data. They basically test to see if 3 substrings of 50 characters from each question they use in each test were part of the training data, and consider a question contaminated if any of the substrings is present; it’s unclear if they manually check once the method detects a positive.

    Having worked with language processing before, I understand that they’re probably doing this (instead of something more complex/thorough) for computational/manpower reasons, but it does mean there’s a chance that a much larger portion of their exam questions is contaminated than they realize/claim; I wish they had shown some sort of validation for their method, which I don’t think would have been hard to do. I’m especially worried given that the types of tests they’re using (GRE, SAT, etc.) are heavily discussed on the internet and it would not be hard for small variations to exist with appropriate answers.

    That said, what I think is the really interesting question is: what exactly do we learn/measure with these tests? I.e., how do we interpret an LLM like ChatGPT doing well in standardized testing, as opposed to something that isn’t? Let me elaborate a little bit.

    For humans (at least one reason), we use these tests because we’re hoping/expecting that being able to do well on these types of questions means those humans will do well with novel questions and situations too, when it comes to applying their knowledge and reasoning. But is this sort of generalization reasonable for an LLM? The issue is that, as far as I understand things, the nature of standardized tests (with limited variation) means that answers to them are likely to be much more predictable to an LLM than to a human. With its massive training set, that sort of predictability seems to make it relatively easier for LLMs to deal with.

    Essentially, we see the machine doing well on a task that we humans find hard but that hopefully measures our capacity to apply what we understood in other contexts/situations, and I think that we may end up anthropomorphising the LLM a bit too much and believing that the same tests are as useful in testing its abilities. To get back to the initial question, I’m curious as to what exactly we are measuring when we do this, a recurrent question on your blog.

    My reply: I’d separate this into two questions. First, what can the chatbot do; second, what are we asking from humans. For the first question: Yes, the chatbot seems to be able to construct strings of words that correspond closely to correct answers on the test. For the second question: This sort of pattern-matching is often what students learn how to do! We can look at this in a couple ways:

    (a) For most students, the best way to learn how to give correct answers on this sort of test is to understand the material—in practice, actually learning the underlying topic is a more effective strategy than trying to pattern-match the answers without understanding.

    (b) A student who is really confused can sometimes still get an OK grade by carefully studying practice problems and figuring out how to solve each kind of problem, without ever really understanding what is going on. But this can still be a good thing, because once you learn how to solve all these individual problems, you can start to figure out the big picture, similar to how if you know enough vocabulary and you immerse yourself in a foreign language, eventually it can start to make sense.

    For computers, it’s different. Statement (a) won’t be the case with a computer, at least not with modern chatbots. As for statement (b), I don’t know what to say. I don’t see pattern-matching as leading to understanding, but maybe that could change in the future.

    So I guess I’m agreeing with Franco, at least for now.

    The connection between junk science and sloppy data handling: Why do they go together?

    Nick Brown pointed me to a new paper, “The Impact of Incidental Environmental Factors on Vote Choice: Wind Speed is Related to More Prevention-Focused Voting,” to which his reaction was, “It makes himmicanes look plausible.” Indeed, one of the authors of this article had come up earlier on this blog as a coauthor of a paper with fatally flawed statistical analysis. So, between the general theme of this new article (“How might irrelevant events infiltrate voting decisions?”), the specific claim that wind speed has large effects, and the track record of one of the authors, I came into this in a skeptical frame of mind.

    That’s fine. Scientific papers are for everyone, not just the true believers. Skeptics are part of the audience too.

    Anyway, I took a look at the article and replied to Nick:

    The paper is a good “exercise for the reader” sort of thing, to figure out how they managed to get all those pleasantly low p-values. It’s not as blatantly obvious as, say, the work of Daryl Bem. The funny thing is, back in 2011, lots of people thought Bem’s statistical analysis was state-of-the-art. It’s only in retrospect that his p-hacking looks about as crude as the fake photographs that fooled Arthur Conan Doyle. Figure 2 of this new paper looks so impressive! I don’t really feel like putting in the effort to figure out exactly how the trick was done in this case . . . Do you have any ideas?

    Nick responded:

    There are some hilarious errors in the paper. For example:
    – On p. 7 of the PDF, they claim that "For Brexit, the "No" option advanced by the Stronger In campaign was seen as clearly prevention-oriented (Mean (M) = 4.5, Standard Error (SE) = 0.17, t(101) = 6.05, p < 0.001) whereas the "Yes" option put forward by the Vote Leave campaign was viewed as promotion-focused (M = 3.05, SE = 0.16, t(101) = 2.87, p = 0.003)." But the question was not "Do you want Brexit, Yes/No". It was "Should the UK Remain in the EU or Leave the EU". Hence why the pro-Brexit campaign was called "Vote Leave", geddit? Both sides agreed before the referendum that this was fairer and clearer than Yes/No. Is "Remain" more prevention-focused than "Leave"?

    – On p. 12 of the PDF, they say "In the case of the Brexit vote, the Conservative Party advanced the campaign for the UK to leave the EU." This is again completely false. The Conservative government, including Prime Minister David Cameron, backed Remain. It's true that a number of Conservative politicians backed Leave, and after the referendum lots of Conservatives who had backed Remain pretended that they either really meant Leave or were now fine with it, but if you put that statement, "In the case of the Brexit vote, the Conservative Party advanced the campaign for the UK to leave the EU," in front of 100 UK political scientists, not one will agree with it.

    If the authors are able to get this sort of thing wrong then I certainly don't think any of their other analyses can be relied upon without extensive external verification.

    If you run the attached code on the data (mutatis mutandis for the directories in which the files live) you will get Figure 2 of the Mo et al. paper. Have a look at the data (the CSV file is an export of the DTA file, if you don't use Stata) and you will see that they collected a ton of other variables. To be fair they mention these in the paper ("Additionally, we collected data on other Election Day weather indicators (i.e., cloud cover, dew point, precipitation, pressure, and temperature), as well as historical wind speeds per council area. The inclusion of other Election Day weather indicators increases our confidence that we are detecting an association between wind speed and election outcomes, and not the effect of other weather indicators that may be correlated with wind speed."). My guess is that they went fishing and found that wind speed, as opposed to the other weather indicators that they mentioned, gave them a good story.

    Looking only at the Swiss data, I note that they also collected "Income", "Unemployment", "Age", "Race" (actually the percentage of foreign-born people; I doubt if Switzerland collects "Race" data; Supplement, Table S3, page 42), "Education", and "Rural", and threw those into their model as well. They also collected latitude and longitude (of the centroid?) for each canton, although those didn't make it into the analyses. Also they include "Turnout", but for any given Swiss referendum it seems that they only had the national turnout, because this number is always the same for every "State" (canton) for any given "Election" (referendum). And the income data looks sketchy (people in Schwyz canton do not make 2.5 times what people in Zürich canton do). I think this whole process shows a degree of naivety about what "kitchen-sink" regression analyses (and more sophisticated versions thereof) can and can't do, especially with noisy measures (such as "Precipitation" coded as 0/1). Voter turnout is positively correlated with precipitation but negatively with cloud cover, whatever that means.

    Another glaring omission is any sort of weighting by population. The most populous canton in Switzerland has a population almost 100 times the least populous, yet every canton counts equally. There is no "population" variable in the dataset, although this would have been very easy to obtain. I guess this means they avoid the ecological fallacy, up to the point where they talk about individual voting behaviour (i.e., pretty much everywhere in the article).

    Nick then came back with more:

    I found another problem, and it’s huge:

    For “Election 50”, the Humidity and Dew Point data are completely borked (“relative humidity” values around 1000 instead of 0.6 etc; dew point 0.4–0.6 instead of a Fahrenheit temperature slightly below the measured temperature in the 50–60 range). When I remove that referendum from the results, I get the attached version of Figure 2. I can’t run their Stata models, but by my interpretation of the model coefficients from the R model that went into making Figure 2, the value for the windspeed * condition interaction goes from 0.545 (SE=0.120, p=0.000006) to 0.266 (SE=0.114, p=0.02).

    So it seems to me that a very big part of the effect, for the Swiss results anyway, is being driven by this data error in the covariates.

    And then he posted a blog with further details, along with a link to some other criticisms from Erik Gahner Larsen.
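
    As an aside, the kind of data audit Nick describes doesn’t require anything fancy. Here’s a minimal sketch in R, run on a tiny fake dataset that I just made up (the real files and column names may well differ):

    set.seed(123)
    fake <- data.frame(
      election = rep(1:3, each = 4),
      canton   = rep(1:4, times = 3),
      turnout  = rep(c(0.45, 0.52, 0.39), each = 4),  # suspicious: same value repeated within each election
      humidity = c(runif(11), 950)                    # one physically impossible value
    )

    # Does turnout actually vary across cantons within an election?
    tapply(fake$turnout, fake$election, function(x) length(unique(x)))

    # Any relative-humidity values outside [0, 1]?
    which(fake$humidity < 0 | fake$humidity > 1)

    Checks like these take a few minutes and would have flagged both the repeated national turnout and the corrupted humidity values before any models got fit.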

    The big question

    Why do junk science and sloppy data handling so often seem to go together? We’ve seen this a lot, for example the ovulation-and-voting and ovulation-and-clothing papers that used the wrong dates for peak fertility, the Excel error paper in economics, the gremlins paper in environmental economics, the analysis of air pollution in China, the collected work of Brian Wansink, . . . .

    What’s going on? My hypothesis is as follows. There are lots of dead ends in science, including some bad ideas and some good ideas that just don’t work out. What makes something junk science is not just that it’s studying an effect that’s too small to be detected with noisy data; it’s that the studies appear to succeed. It’s the misleading apparent success that turns a scientific dead end into junk science.

    As we’ve been aware since the classic Simmons et al. paper from 2011, researchers can and do use researcher degrees of freedom to obtain apparent strong effects from data that could well be pure noise. This effort can be done on purpose (“p-hacking”) or without the researchers realizing it (“forking paths”), or through some mixture of the two.

    The point is that, in this sort of junk science, it’s possible to get very impressive-looking results (such as Figure 2 in the above-linked article) from just about any data at all! What that means is that data quality doesn’t really matter.
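
    To make that concrete, here’s a toy simulation in R. It has nothing to do with the actual data or code from the paper; it’s just pure noise plus the freedom to choose which of a handful of candidate predictors to report:

    set.seed(2023)
    n <- 300; k <- 6   # say, 6 weather-style covariates to fish over
    best_p <- replicate(1000, {
      y <- rnorm(n)                      # outcome: pure noise
      X <- matrix(rnorm(n * k), n, k)    # candidate predictors: also pure noise
      min(apply(X, 2, function(x) summary(lm(y ~ x))$coefficients[2, 4]))
    })
    mean(best_p < 0.05)   # about 0.26: the best-looking predictor is "significant" a quarter of the time

    And that’s with just one forking path (which covariate to report); add choices about interactions, subgroups, and covariate adjustments, and the rate climbs quickly.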

    If you’re studying a real effect, then you want to be really careful with your data: any noise you introduce, whether in measurement or through coding error, can be expected to attenuate your effect, making it harder to discover. When you’re doing real science you have a strong motivation to take accurate measurements and keep your data clean. Errors can still creep in, sometimes destroying a study, so I’m not saying it can’t happen. I’m just saying that the motivation is to get your data right.
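
    The attenuation point takes only a few lines of R to see (made-up numbers again):

    set.seed(1)
    n <- 1000
    x <- rnorm(n)
    y <- 0.5 * x + rnorm(n)            # true slope 0.5
    x_noisy <- x + rnorm(n)            # predictor recorded with error (noise sd equal to signal sd)
    coef(lm(y ~ x))["x"]               # close to 0.5
    coef(lm(y ~ x_noisy))["x_noisy"]   # close to 0.25: the estimated effect is cut roughly in half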

    In contrast, if you’re doing junk science, the data are not so relevant. You’ll get strong results one way or another. Indeed, there’s an advantage to not looking too closely at your data at first; that way if you don’t find the result you want, you can go through and clean things up until you reach success. I’m not saying the authors of the above-linked paper did any of that sort of thing on purpose; rather, what I’m saying is that they have no particular incentive to check their data, so from that standpoint maybe we shouldn’t be so surprised to see gross errors.

    “They got a result they liked, and didn’t want to think about the data.” (A fish story related to Cannery Row)

    John “Jaws” Williams writes:

    Here is something about a century-old study that you may find interesting, and could file under “everything old is new again.”

    In 1919, the California Division of Fish and Game began studying the developing sardine fishery in Monterey. Ten years later, W. L. Scofield published an amazingly thorough description of the fishery, the abstract of which begins as follows:

    The object of this bulletin is to put on record a description of the Monterey sardine fishery which can be used as a basis for judging future changes in the conduct of this industry. Detailed knowledge of changes is essential to an understanding of the significance of total catch figures, or of records of catch per boat or per seine haul. It is particularly necessary when applying any form of catch analysis to a fishery as a means of illustrating the presence or absence of depletion or of natural fluctuations in supply.

    As detailed in this and subsequent reports, the catch was initially limited by the market and the capacity of the fishing fleet, both of which grew rapidly for several decades and provided the background for John Steinbeck’s “Cannery Row.” Later, the sardine population famously collapsed, and never recovered.

    Sure enough, just as Scofield feared, scientists who did not understand the data subsequently misused it as reflecting the sardine population, as I pointed out in this letter (which got the usual kind of response). They got a result they liked, and didn’t want to think about the data.

    The Division of Fisheries was not the only agency to publish detailed descriptive reports. The USGS and other agencies did as well, but generally they have gone out of style; they take a lot of time and field work, are expensive to publish, and don’t get the authors much credit.

    This comes to mind because I am working on a paper about a debris flood on a stream in one of the University of California’s natural reserves, and the length limits for the relevant print journals don’t allow for a reasonable description of the event and a discussion of what it means. However, now I can write a separate and more complete description, and have it go as on-line supplementary material. There is some progress.

    Recently in the sister blog

    Scientific and folk theories of viral transmission: A comparison of COVID-19 and the common cold:

    Disease transmission is a fruitful domain in which to examine how scientific and folk theories interrelate, given laypeople’s access to multiple sources of information to explain events of personal significance. The current paper reports an in-depth survey of U.S. adults’ (N = 238) causal reasoning about two viral illnesses: a novel, deadly disease that has massively disrupted everyone’s lives (COVID-19), and a familiar, innocuous disease that has essentially no serious consequences (the common cold). . . .

    An understanding of viral transmission and viral replication existed alongside folk theories, placeholder beliefs, and lack of differentiation between viral and non-viral disease. For example, roughly 40% of participants who explained illness in terms of the transmission of viruses also endorsed a non-viral folk theory, such as exposure to cold weather or special foods as curative. . . .

    Although comparisons of COVID-19 and the common cold revealed relatively few differences, the latter, more familiar disease elicited consistently lower levels of accuracy and greater reliance on folk theories. Moreover, for COVID-19 in particular, accuracy positively correlated with attitudes (trusting medical scientists and taking the disease more seriously), self-protective behaviors (such as social distancing and mask-wearing), and willingness to be vaccinated.

    For both diseases, self-assessed knowledge about the disease negatively predicted accuracy.

    P.S. Some interesting discussion in comments.

    The above-linked paper characterizes “folk theories” as “deviating from scientific consensus but not invoking supernatural causes” and also refers to these theories as “medically inaccurate,” giving examples such as “cold weather causes colds, that ‘starving’ a fever can speed recovery, and that chicken soup and excess vitamin C cure colds.”

    Commenters argue that some of these beliefs, folk though they may be in a historical sense or in terms of the theoretical frameworks that have traditionally motivated them, do not deviate from scientific consensus, nor are they medically inaccurate. Several commenters discuss evidence that cold weather can increase the risk of getting sick from infectious disease, and another commenter points to a Cochrane review stating that “in certain contexts vitamin C is beneficial against colds” (although the evidence there doesn’t seem so strong).

    These discussions are relevant to the main point of the above-linked paper: to the extent that the folk theories are true or supported by science, that changes the implication of statements such as “self-assessed knowledge about the disease negatively predicted accuracy.” It’s tough to think about all this given how many different folk theories are out there, ranging from general claims (various cold-weather conditions can make it easier for colds to spread and harder for your body to fight them off), to speculation that could be true and can never really be proved false (various claims about positive effects of vitamins), to goofy traditions such as chicken soup (for which it’s still possible to come up with supporting theories), etc.

    In the meantime, the psychological processes discussed in the above-linked article are happening, irrespective of the ultimate effectiveness of various folk cures. Another twist on all of this is that people often think deterministically (for example, trying to figure out what was the one cause of them getting sick, or supposing that a remedy will almost always work). I’m not quite sure how to study all this in an environment such as health and medicine where so much is unknown.

    Deja vu on researching whether people combined with LLMs can do things people can do

    This is Jessica. There has been a lot of attention lately on how we judge whether a generative model like an LLM has achieved human-like intelligence, and what not to do when making claims about this. But I’ve also been watching the programs of some of the conferences I follow fill up with a slightly different rush to document LLMs: papers applying models like GPT-4 to tasks that we once expected humans to do, to see how well they do. For example, can we use ChatGPT to generate user responses to interactive media? Can they simulate demographic backstories we might get if we queried real populations? Can they convince people to be more mindful? Can they generate examples of AI harms? And so on.

    Most of this work is understandably very exploratory. And if LLMs are going to reshape how we program or get medical treatment or write papers, then of course there’s some pragmatic value to starting to map out where they excel versus fail on these tasks, and how far we can rely on them to go. 

    But do we get anything beyond pragmatic details that apply to the current state of LLMs? In many cases, it seems doubtful.

    One problem with papers that “take stock” of how well an LLM can do on some human task is that the technology keeps changing, and even between the big model releases (e.g., moving from GPT-3 to GPT-4) we can’t easily separate out which behaviors are more foundational, resulting from the pre-training, versus which are arising as a result of interactive fine-tuning as the models get used. This presents a challenge to researchers who want something about their results to be applicable for more than a year or two. There needs to be something we learn that is more general than this particular model version applied to this task. But in this kind of exploratory work, that’s hard to guarantee. 

    To be fair, some of these papers can contribute intermediate-level representations that help characterize a domain-specific problem or solution independent of the LLM. For instance, this paper, in applying LLMs to the problem of reframing negative thoughts, developed a taxonomy of the types of cognitive reframing that work. But many don’t.

    I’m reminded of the early 2010s when crowdsourcing was really starting to take off. It was going to magically speed up machine learning by enabling annotation at scale, and let behavioral researchers do high throughput experiments, transforming social science. And it did in many ways, and it was exciting to have a new tool. But if you looked at a lot of the specific research coming out to demonstrate the power of crowdsourcing, the high level research question could be summarized as “Can humans do this task that we know humans can do?” There was little emphasis on the more practical concerns about whether, in some particular workflow, it makes sense to invest effort in crowdsourcing, how much money or effort it took the researchers to get good results from crowds of humans, or what would happen if the primary platform at the time (Amazon Mechanical Turk) stopped being supported. 

    And now here we are again. LLMs are not people, of course, so the research question is more like “By performing high dimensional curve fitting on massive amounts of human-generated content, can we generate human-like content?” Instead of being about performance on some benchmark, this more applied version becomes about whether the AI-generated content is passable in domain X. But since definitions of passable tend to be idiosyncratic and developed specific to each paper, it’s hard to imagine someone synthesizing all this in any kind of concrete way later. 

    Part of my distaste for this type of research is that we still seem to lack an intermediate layer of understanding of what more abstract behaviors we can expect from different types of models and interactions with models. We understand the low-level stuff about how the models work, we can see how well they do on these tasks humans usually do, but we’re missing tools or theories that can relate the two. This is the message of a recent paper by Holtzman, West, and Zettlemoyer, which argues that researchers should invest more in developing a vocabulary of behaviors, or “meta-models” that predict aspects of an LLM’s output, to replace questions like What is the LLM doing? with Why is the LLM doing that?

    I guess one could argue that this kind of practical research is a more worthwhile use of federal funding than the run-of-the-mill behavioral study, which might set out to produce some broadly generalizable result but shoot itself in the foot by using small samples, noisy measurements, an underdefined population, etc. But at least in studies of human behavior there is usually an attempt at identifying some deeper characterization of what’s going on, so the research question might be interesting, even if the evidence doesn’t deliver. 

    “Nobody’s Fool,” by Daniel Simons and Christopher Chabris

    This new book, written by two psychology researchers, is an excellent counterpart to Lying for Money by economist Dan Davies, a book that came out a few years ago but which we happened to have discussed recently here. Both books are about fraud.

    Davies gives an economics perspective, asking what are the conditions under which large frauds will succeed, and he focuses on the motivations of the fraudsters: often they can’t get off the fraud treadmill once they’re on it. In contrast, Simons and Chabris focus on the people who get fooled by frauds; the authors explain how it is that otherwise sensible people can fall for pitches that are, in retrospect, ridiculous. The two books are complementary, one focusing on supply and one on demand.

    My earlier post was titled “Cheating in science, sports, journalism, business, and art: How do they differ?” Nobody’s Fool had examples from all those fields, and when they told stories that I’d heard before, their telling was clear and reasonable. When a book touches on topics where the reader is an expert, it’s a good thing when it gets them right. I only wish that Simons and Chabris had spent some time discussing the similarities and differences of cheating in these different areas. As it is, they mix in stories from different domains, which makes sense from the psychology perspective of the mark (if you’re fooled, you’re fooled) but gives less of a sense of how the different frauds work.

    For the rest of this review I’ll get into some different interesting issues that arose in the book.

    Predictability. On p.48, Simons and Chabris write, “we need to ask ourselves a somewhat paradoxical question: ‘Did I predict this?’ If the answer is ‘Yes, this is exactly what I expected,’ that’s a good sign that you need to check more, not less.” I see what they’re saying here: if a claim is too good to be true, maybe it’s literally too good to be true.

    On the other hand, think of all the junk science that sells itself on how paradoxical it is. There’s the whole Freakonomics contrarianism thing. The whole point of contrarianism is that you’re selling people on things that were not expected. If a claim is incredible, maybe it’s literally incredible. Unicorns are beautiful, but unicorns don’t exist.

    Fixed mindsets. From p.61 and p.88, “editors and reviewers often treat the first published study on a topic as ‘correct’ and ascribe weaker or contradictory results in later studies to methodological flaws or incompetence. . . . Whether an article has been peer-reviewed is often treated as a bright line that divides the preliminary and dubious from the reliable and true.” Yup.

    There’s also something else, which the authors bring up in the book: challenging an existing belief can be costly. It creates motivations for people to attack you directly; also, it seems to me that the standards for criticism of published papers are often much higher than the standards for getting the original work accepted for publication in the first place. Remember what happened to the people who squealed on Lance Armstrong? He attacked them. Or that Holocaust denier who sued his critic? The kind of person who is unethical enough to cheat could also be unethical enough to abuse the legal system.

    This is a big deal. Yes, it’s easy to get fooled. And it’s even easier to get fooled when there are social and legal structures that can make it difficult for frauds to be publicly revealed.

    Ask more questions. This is a good piece of advice, a really important point that I’d never thought about until reading this book. Here it is: “When something seems improbable, that should prompt you to investigate by asking more questions [emphasis in the original]. These can be literal questions . . . or they can be asked implicitly.”

    Such a good point. Like so many statisticians, I obsess on the data in front of me and don’t spend enough time thinking about gathering new data. Even something as simple as a simulation experiment is new data.

    Unfortunately, when it comes to potential scientific misconduct, I don’t usually like asking people direct questions—the interaction is just too socially awkward for me. I will ask open questions, or observe behavior, but that’s not quite the same thing. And asking direct questions would be even more difficult in a setting where I thought that actual fraud was involved. I’m just more comfortable on the outside, working with public information. This is not to disagree with the authors’ advice to ask questions, just a note that doing so can be difficult.

    The fine print. On p.120, they write, “Complacent investors sometimes fail to check whether the fine print in an offering matches the much shorter executive summary.” This happens in science too! Remember the supposedly “long-term” study that actually lasted only three days? Or the paper whose abstract concluded, “That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications,” even though the study itself had no data whatsoever on people “becoming more powerful”? Often the title has things that aren’t in the abstract, and the abstract has things that aren’t in the paper. That’s a big deal considering: (a) presumably many many more people read the title than the abstract, and many many more people read the abstract than the paper, (b) often the paper is paywalled so that all you can easily access are the title and abstract.

    The dog ate my data. From p.123: “Many of the frauds that we have studied involved a mysterious, untimely, or convenient disappearance of evidence.” Mary Rosh! I’m also reminded of Dan Davies’s famous quote, “Good ideas do not need lots of lies told about them in order to gain acceptance.”

    The butterfly effect. I agree with Simons and Chabris to be wary of so-called butterfly effects: “According to the popular science cliché, a butterfly flapping its wings in Brazil can cause a tornado in Texas.” I just want to clarify one thing which we discuss further in our paper on the piranha problem. As John Cook wrote in 2018:

    The butterfly effect is the semi-serious claim that a butterfly flapping its wings can cause a tornado half way around the world. It’s a poetic way of saying that some systems show sensitive dependence on initial conditions, that the slightest change now can make an enormous difference later. . . . The lesson that many people draw from their first exposure to complex systems is that there are high leverage points, if only you can find them and manipulate them. They want to insert a butterfly at just the right time and place to bring about a desired outcome.

    But, Cook explains, that idea is wrong. Actually:

    Instead, we should humbly evaluate to what extent it is possible to steer complex systems at all. . . . The most effective intervention may not come from tweaking the inputs but from changing the structure of the system.

    The point is that, to the extent the butterfly effect is a real thing, small interventions can very occasionally have large and unpredictable results. This is pretty much the opposite of junk social science of the “priming” or “nudge” variety—for example, the claim that flashing a subliminal smiley face on a computer screen will induce large changes in attitudes toward immigration—which posits reliable and consistent effects from such treatments. That is: if you really take the butterfly idea seriously, you should disbelieve studies that purport to demonstrate those sorts of bank-shot claims about the world.
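
    Here’s a minimal sketch of what sensitive dependence actually looks like, using the standard logistic-map toy model (nothing about weather, elections, or priming here; the numbers are arbitrary). A tiny perturbation eventually produces big differences, but with no consistent sign or size:

    logistic_path <- function(x0, r = 3.9, steps = 50) {
      x <- numeric(steps)
      x[1] <- x0
      for (t in 2:steps) x[t] <- r * x[t - 1] * (1 - x[t - 1])
      x
    }
    set.seed(42)
    starts <- runif(20, 0.2, 0.8)
    diffs <- sapply(starts, function(x0)
      logistic_path(x0 + 1e-6)[50] - logistic_path(x0)[50])
    summary(diffs)   # large differences of both signs, not a consistent "effect"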

    Clarke’s Law

    One more thing.

    In his book, Davies talks about fraud in business. There’s not a completely sharp line dividing fraud from generally acceptable sharp business practices; still, business cheating seems like a clear enough topic that it can make sense to write a book about “Lying for Money,” as Davies puts it.

    As discussed above, Simons and Chabris talk about people being fooled by fraud in business but also in science, art, and other domains. In science in particular, it seems to me that being fooled by fraud is a minor issue compared to the much larger problem of people being fooled by bad science. Recall Clarke’s law: Any sufficiently crappy research is indistinguishable from fraud.

    Here’s the point: Simons and Chabris focus on the people being fooled rather than the people running the con. That’s good. It’s my general impression that conmen are kind of boring as people. Their distinguishing feature is a lack of scruple. Kind of like when we talk about findings that are big if true. And once you’re focusing on people being fooled, there’s no reason to restrict yourself to fraud. You can be just as well fooled by research that is not fraudulent, just incompetent. Indeed, it can be easier to be fooled by junk science that isn’t fraudulent, because various checks for fraud won’t find the problem. That’s why I wrote that the real problem of that nudge meta-analysis is not that it includes 12 papers by noted fraudsters; it’s the GIGO of it all. You know that saying, The easiest person to fool is yourself?

    In summary, “How do we get fooled, and how can we avoid getting fooled in the future?” is a worthy topic for a book, and Simons and Chabris did an excellent job. The next step is to recognize that “getting fooled” does not require a conman on the other side. To put it another way, not every mark corresponds to a con. In science, we should be worried about being fooled by honest but bad work, as well as looking out for envelope pushers, shady operators, and out-and-out cheats.

    How does Bayesian inference work when estimating noisy interactions?

    Alicia Arneson writes:

    I am a PhD student at Virginia Tech studying quantitative ecology. This semester, I am taking Deborah Mayo’s Philosophy of Statistics course, so I can’t help but think more critically about statistical methods in some of the papers I read. To admit my current statistical bias – I do work in a lab that is primarily Bayesian (though this is my first year so I am still somewhat new to it), but Dr. Mayo does have me questioning some aspects of Bayesian practice. One of those questions is the topic of this letter!

    Recently, I read a paper that aimed to determine the effect of increased foraging costs on passerine immune function. The experiment seemed really well designed, but I was somewhat frustrated when I got to the statistical analysis section. The authors used Bayesian univariate response models that fit each immune outcome to upwards of 26 parameters that included up to four-way interactions. My initial feeling was that there is no good way to (a) interpret these or (b) to feel at all confident about the results.

    In investigating those thoughts, I came across your blog post entitled “You need 16 times the sample size to estimate an interaction than to estimate a main effect.” I thought this was a very interesting read and, while it applies more to frequentist frameworks, I noticed in the comments that you suggested not that we shouldn’t try to estimate interactions, but rather that it would be better to estimate them using a Bayesian approach. I can somewhat understand this suggestion given the examples you used to demonstrate how standard errors can change so much, but what is less clear to me is how Bayes provides a better (or at least more clear) approach when estimating interaction effects.

    Therein lie my questions. If you have some time, I am curious to know what you think about:

    (a) how a Bayesian approach for estimating interactions is better than doing so under a frequentist methodology, and

    (b) can researchers use Bayesian methods to “go too far,” so to speak, when trying to estimate interaction effects that their design would not have captured well (thinking along the lines of classical experimental design and higher order effects being masked when sample sizes are too small), i.e. should a relatively small experiment ever attempt to quantify complex interactions (like a 4-way interaction), regardless of the framework?

    Lots to chew on! Here are my responses:

    1. As discussed, estimates of interactions tend to be noisy. But interactions are important! Setting them to zero is not always a good solution. The Bayesian approach with zero-centered priors partially pools the interactions toward zero, which can make more sense. (See the little numerical sketch at the end of these points.)

    2. We need to be more willing to live with uncertainty. Partial pooling toward zero reduces the rate of “statistical significance”—estimates that are more than two posterior standard deviations from zero—as Francis Tuerlinckx and I discussed in our article from 2000 on Type M and Type S errors. The point is, if you do a Bayesian (or non-Bayesian) estimate, we don’t recommend acting as if non-statistically-significant parameters are zero.

    3. I think the Bayesian method will “go too far,” in the sense of apparently finding big things that aren’t really there, if it uses weak priors. With strong priors, everything gets pulled toward zero, and the only things that remain far from zero are those where there is strong evidence.

    4. Bayesian or otherwise, design matters! If you’re interested in certain interactions, design your study accordingly, with careful measurement and within-person (or, in your case, within-animal) measurements; see discussion here. There are problems with design and data collection that analysis can’t rescue.

    5. To look at it another way, here’s an article from 2000 where we used frequentist analysis of a Bayesian procedure to recommend a less ambitious design, on the grounds that inferences from the more ambitious design would be too noisy to be useful.
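
    To make points 1 and 3 above concrete, here’s a back-of-the-envelope sketch in R. It uses made-up numbers and the conjugate normal-normal shrinkage formula rather than a full fitted model, but it shows what a zero-centered prior does to a noisy interaction estimate:

    shrink <- function(est, se, prior_sd) {
      post_var <- 1 / (1 / se^2 + 1 / prior_sd^2)
      c(mean = post_var * est / se^2, sd = sqrt(post_var))
    }
    est <- 0.30; se <- 0.20           # a noisy raw interaction estimate
    shrink(est, se, prior_sd = 10)    # weak prior: essentially no pooling, 0.30 +/- 0.20
    shrink(est, se, prior_sd = 0.1)   # informative zero-centered prior: about 0.06 +/- 0.09

    With the weak prior you just get the noisy estimate back; with the informative prior the interaction is mostly pulled toward zero unless the data provide strong evidence otherwise.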

    The vicious circle of corroboration or pseudo-confirmation in science and engineering

    Allan Cousins writes:

    I have recently been thinking about the way in which professionals come to accumulate “knowledge” over their careers and how that process utilizes (read: abuses) the notion of corroboration. I believe this might be of interest to both of you and so I wanted to see if either of you might have any insights or comments.

    In particular, I have been thinking about professional endeavours that have dichotomous outcomes where the range of possibilities is restricted to (or perhaps more accurately, viewed as) it either worked or it did not work. For the purposes of this discussion I will look at structural engineering but I believe the phenomenon I am about to describe is just as applicable to other similarly characterized disciplines. In structural engineering: the structure either stood up or it collapsed, the beam either carried the load or it did not, etc. In my experience there are nearly as many theories of how structures work as there are structural engineers. But this wide range of opinions among structural engineers is certainly not because the underlying concepts are not well understood. That may have been true in 1850 but not today. In fact, structural engineering is quite mature as a field and there are very few concepts (except at the edges of the field) where such a diverse range of thought could be justified.

    This raises the question: how could this unsatisfactory state of affairs have come to pass? I have often pondered this but only recently have come to what I think to be a reasonable explanation. First, let us rule out the idea that structural engineering professionals are of below average intelligence (or rather below some required intelligence threshold for such endeavors known only to Omniscient Jones). Under such an assumption I believe that the likely answer to our question comes down to an interplay between industry dynamics, an abuse of the concept of corroboration, and the nature of the outcomes inherent to the field.

    Even if engineers have never heard of the concept of Philosophy of Science (and most have not) they are apt to act in ways akin to the typical scientist. That is, they go about their enterprise (designing structures) by continuously evaluating their understanding of the underlying structural mechanics by looking at and seeking out corroborating evidence. However, unlike scientists, structural engineers don’t usually have the ability to conduct risky tests (in the Popperian sense) in their day-to-day designs. By definition the predicted outcome of a risky test is likely to be wrong in the absence of the posited theory, and if structural engineers were routinely conducting such field tests, newspaper headlines would be replete with structural engineering failures. But today structural engineering failures are quite rare, and when they happen they are usually small in magnitude (one of the greatest structural engineering failures in US history was the Hyatt Regency walkway collapse, and it caused only 114 deaths; for comparison, that is about the same number of deaths caused by road accidents in a single DAY in the US). Indeed, building codes and governing standards are written in such a way that the probability of failure of any given element in a system is quite low (the probability of global failure lower still). What that means is that even if what a structural engineer believes to be true about the structural systems that they design actually has very little verisimilitude (read: is mostly wrong and to a severe degree), their designs will not fail in practice as long as they follow codified guidelines. It is only when structural engineers move away from the typical (where standard details are the norm and codes contain prescribed modes of analysis / design) that gaps in their understanding become apparent due to observed failures. What this means, then, is that while the successful outcome of each “test” (each new structural design) is likely to be taken by the designer as corroborating their understanding (in the same sense that it does for the scientist), it does not necessarily provide even the most meager of evidence that the designer has a good grasp of their discipline. In fact, it is possible (though admittedly not overly likely) that a designer has everything backwards and yet their designs don’t fail because of the prescribed nature of governing codes.

    The above leaves us with an interesting predicament. It seems clear that structural engineers or others in similarly situated disciplines cannot rely on outcomes to substantiate their understanding. Though in practice that is what they largely do; they are human after all.

    This lack of ability to conduct risky tests interplays with industry dynamics, and not in a particularly promising way. Those who commission structural designs are unlikely to care about the design itself (except to the extent that it doesn’t fail and doesn’t mess with the intended aesthetic), and as a result, structural engineering tends to be treated like a commodity product where the governing force is price. What that means is that there is an overwhelming pressure to get designs out the door as quickly as possible lest a structural engineering firm lose money on its bid. This pressure all but guarantees that even if senior structural engineers have a good understanding of structural principles, the demands for their time leave few hours in the day to be spent on mentorship and review of young engineers’ work product. As a result, young engineers are unlikely to be able to rely on senior engineers to correct their misunderstanding of structural principles. That pretty much leaves only one other avenue for the young engineer to gain true understanding, and that is via self-teaching from the literature and the like. However, given the lack of ability to construct risky tests (see above), the self-learning route is apt to lead young structural engineers to think that they have a good understanding of certain concepts (because they see corroborating evidence in their “successful” designs) where that is not the case. Though to be fair to my brethren, I am assuming that the average young engineer does not have the ability to discern true engineering principles from the literature on their own without aid. However, I believe this assumption to hold, on average.

    This leads to a cycle where young engineers – who have a less than perfect understanding of structural systems that goes unchecked – become senior engineers who in turn are looked up to by a new crop of young engineers. The now-senior engineers mentor the young engineers, to the extent time demands allow, and distill their misknowledge to them. Those young engineers eventually become senior. And in the extreme, the cycle repeats progressively until “knowledge” at the most senior levels of the field is almost devoid of any verisimilitude at all. Naturally there will be counterbalancing forces where some verisimilitude is maintained, but I do think the cycle, as I have described it, is at least a decent caricature of how things unfold in practice. It’s worth remarking that many on the outside will never see this invisible cycle because it is shielded from them by the fact that structures tend to stand up!

    It seems to me that this unfortunate dynamic is likely to play out in any discipline where outcomes are dichotomous in nature and where the unwanted outcome (such as structural failure) is a low probability event by construction (and is unconnected to true understanding of the underlying concepts). It is certainly interesting to think about, and when the above phenomenon is coupled with human tendency to ascribe good outcomes to skill, and poor outcomes to bad luck, the result in terms of knowledge accumulation / dissemination may be quite unsatisfactory.

    I think what I have just argued is that professional activities that become commoditized are likely to be degenerative over time. This would certainly accord with my experience in structural engineering and other fields where I have some substantive knowledge. And I wanted to see if you would agree or not. Do you have any stark counter examples from your professional life that you can recall? Do you think I am being unduly pessimistic?

    There are two things going on here:

    1. Corroboration, and the expectation of corroboration, as a problem. This relates to what I’ve called the confirmationist paradigm of science, where the point of experimentation is to confirm theories. The motivations are then all in the wrong places, just in general. Quantitative analysis under uncertainty (i.e., statistics) adds another twist to the vicious cycle of confirmation, with the statistical significance filter and the 80% power lie, by which effects get overestimated, motivating future studies that overestimate effect sizes, etc., until entire subfields get infested with wild and unrealistic overestimates. (See the little simulation after point 2 below.)

    2. The sociological angle, with students following their advisors, advisors promoting former students, etc. I don’t have so much to say about this one, but I guess that it’s part of the story too.
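
    Here’s a quick simulation of the significance filter mentioned in point 1 (toy numbers, not tied to any particular study):

    set.seed(123)
    true_effect <- 0.1
    se <- 0.1                             # a design with roughly 17% power for this effect
    est <- rnorm(1e5, true_effect, se)    # sampling distribution of the estimate
    sig <- abs(est / se) > 1.96           # replications that reach p < 0.05
    mean(sig)                             # about 0.17
    mean(est[sig & est > 0])              # about 0.25: an exaggeration factor of roughly 2.5

    Publish only the replications that cross the threshold and the literature fills up with estimates a couple of times too large, which then feed the next round of power calculations.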

    Also relevant to this discussion is the recent book, False Feedback in Economics: The Case for Replication, by Andrin Spescha.

    “The Role of Doubt in Conceiving Research.” The capacity to be upset, to recognize anomalies for what they are, and to track them down and figure out what in our understanding is lacking:

    Stanley Presser sends along this article, “The Role of Doubt in Conceiving Research.” Presser has taught for many years at the University of Maryland, but not when I was a student there; also, he teaches sociology, and I’ve never taken a sociology class.

    Presser’s article has lots of interesting discussion and quotes about learning from failure, the problem of researchers believing things that are false, the challenge of recognizing what is an interesting research question, along with some specific issues that arise with survey research.

    I’m reminded of the principle that an important characteristic of a good scientist is the capacity to be upset, to recognize anomalies for what they are, and to track them down and figure out what in our understanding is lacking. This sort of unsettledness—an unwillingness to sweep concerns under the rug, a scrupulousness about acknowledging one’s uncertainty—is, I would argue, particularly important for a statistician.

    P.S. That last link is from 2016, to a post that begins as follows:

    I’ve given many remote talks but this is the first time I’ve spoken at an all-electronic conference. It will be a challenge. In a live talk, everyone’s just sitting in the room staring at you, but in an electronic conference everyone will be reading their email and surfing the web. . . . At the very least, I have to be more lively than my own writing, or people will just tune me out and start reading old blog entries.

    Funny to see this, seven years later, now that electronic conferences are the standard. And I think they really are worse than the in-person variety. It’s hard for a speaker to be more interesting than whatever is in everybody’s inbox, not to mention the world that is accessible from google.

    Incompetent officeholders, retractions of published papers, and norms

    Reading this story of Dianne Feinstein’s apparent dementia reminds me of when she was first running for the Senate, a few decades ago, and advertised herself as the only Democrat favoring both abortion and the death penalty.

    I understand the general principle—most voters want legal abortion, most voters want the death penalty, so if you’re the only candidate supporting both of these positions, it makes sense to announce this—but still, there was something funny about the juxtaposition.

    One of the frustrating aspects of the Feinstein story is not just that she’s been sitting in Congress for who knows how long being propped up by her aides like some old-style Politburo member, but also the idea that, even now, there doesn’t seem to be serious talk of her resigning. Congress has the ability to force her to leave, but if anything happens it seems that it would have to be her stepping down voluntarily, or her staff making that decision for her.

    What to do in a setting where everyone agrees there’s a problem?

    This situation—there’s an obvious problem, there’s an existing mechanism to deal with it, but the people in charge of the mechanism don’t want to use it—I’ve seen this before.

    It came up a few years ago with a scholarly journal. There was an issue with a paper that the journal had published a few years earlier. Problems had been pointed out with that published article: the analysis had been done incorrectly, and that changed the results of the paper. The author of the article was asked to write a response to the criticisms, and the journal editor judged the response to be deceptive and incomplete. In addition, the results could not be replicated with the data that had been provided with the article.

    The question was, what to do about it?

    One of the members of the editorial board wrote:

    While this is unfortunate, I do not see deliberate research misconduct, which is basically the only reason to force a retraction of an article (see COPE [Committee on Publication Ethics] guideline). This just seems like incredibly sloppy work. . . .

    I replied:

    From the COPE guidelines (I just looked up online):

    “Journal editors should consider retracting a publication if:

    • they have clear evidence that the findings are unreliable, either as a result of misconduct (e.g. data fabrication) or honest error (e.g. miscalculation or experimental error) . . .”

    So, no, the COPE guidelines do not require misconduct for a retraction. Honest error is enough. The key is that the findings are unreliable.

    Someone else on the editorial board responded:

    I am not sure a retraction is necessary if this wasn’t academic fraud but an honest mistake or sloppiness. We shouldn’t necessarily jump to the conclusion that this was misconduct.

    I replied again:

    Again, retraction does not require fraud or misconduct. The COPE guidelines explicitly say that journal editors should consider retraction if they have “clear evidence that the findings are unreliable, either as a result of misconduct . . . or honest error.”

    If we don’t want to retract, fine. But if it’s clear that the findings are unreliable, then I think we should retract. Retraction does not imply fraud or misconduct.

    Another board member added:

    Retractions are actually good in that they mean we are policing ourselves for honesty and accuracy and therefore retain credibility.

    Another board member then wrote:

    My opinion is that retraction should only be used in a situation where there is clear evidence of research fraud or misconduct.

    As the journal does not have a policy on this, we should be very careful with discussion of retraction.

    Retraction of an article is a career-altering, or career-damaging, move. Most people in the field will assume that retraction is generated by fraud or misconduct, no matter how much or how loudly those associated with the journal might say to the contrary.

    I don’t see why retraction should be a career-altering, or career-damaging, move—except to the very minor extent that it damages your career by making that one paper no longer count.

    That said, I also don’t really see a difference between “retraction” and “correction” (which I’ve done 4 times for published papers of my own).

    In posting the above discussion, I’m not at all trying to say that I was right and the other board members were wrong. There’s no requirement that a journal retract a paper, and I continue to think that all retractions could be relabeled as corrections with no harm, as long as the correction is clear and unambiguous, with no wiggle room, as for example here.

    My point is that, even in a case where there is “clear evidence that the findings are unreliable,” it was still difficult to get much done. Retraction was taken to be such a big step. It seems like something similar is going on in the Senate: removal of a senator is so rare that nobody wants to do it.