Here’s what the highest salary of any university president in America was in 1983 in 2022 dollars: $342,000.

The above quote comes from Paul Campos, who also reports that “The mean (not the median) compensation of university presidents [in 1983] was $184,000 (again 2022 dollars!).” And, perhaps even more amazingly, that the highest-paid college football coach in 1981 (Oklahoma’s Barry Switzer) “was making $150,000, including benefits, which is $457,000 in 2022 dollars.”

This got me curious. I took my first academic job in 1990 and my salary was $42,000. According to the inflation calculator, that’s worth $90,000 today. That must not be far from what we pay new faculty. But maybe there’s more spread on the high end than there used to be? I don’t know how much the senior faculty were paid back in 1990, or 1983, but I guess it was less than the university president.

Regarding the university presidents and football coaches: in addition to their take-home pay and benefits, they also were given subordinates: secretaries and assistants and coaches and so forth they could boss around. This makes me think that part of the salary change was a delayed transition, stimulated by the tax cuts of the 1980s, toward paying in $ rather than perks?

In any case, lots of faculty nowadays (regular faculty like me, not law/biz/medicine or coaches) get paid more than all the college presidents were in 1983. Amazing. Again, I guess the presidents were also paid in kind in the sense that they were allowed to hire a zillion assistants. But still. Imagine someone offering the president of a major university a salary of $184,000 today. He’d be insulted!

Another way of saying this is that in 1983, the ratio of top to bottom pay for faculty at a university was about 10 (approx $200,000 at the top compared to approx $20,000 at the bottom). But now the ratio is more like 100 (approx $3 million down to approx $30,000). I guess that to find the reasons why this is happening, we’d want to look at society and the economy more generally, as what’s happening at the university has been happening in many other institutions.

Moving cross-validation from a research idea to a routine step in Bayesian data analysis

This post is by Aki.

Andrew has a Twitter bot @StatRetro tweeting old blog posts. A few weeks ago, the bot tweeted a link to a 2004 blog post, Cross-validation for Bayesian multilevel modeling. Here are some quick thoughts now.

Andrew started with a question “What can be done to move cross-validation from a research idea to a routine step in Bayesian data analysis?” and mentions importance sampling as a possibility, but then continues “However, this isn’t a great practical solution since the weights, 1/p(y_i|theta), are unbounded, so the importance-weighted estimate can be unstable.” We now have Pareto smoothed importance sampling leave-one-out (PSIS-LOO) cross-validation (Vehtari, A., Gelman, A., Gabry, J., 2017) implemented, e.g., in the `loo` R package and the `ArviZ` Python/Julia package, and they’ve been downloaded millions of times and seem to be routinely used in Bayesian workflow! The benefit of the approach is that in many cases the user doesn’t need to do anything extra, or only needs to add a few lines to their Stan code; the computation after sampling is really fast; and the method has a diagnostic to tell if some other, computationally more intensive approach is needed.
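To make the quoted concern concrete, here is a minimal sketch of raw (unsmoothed) importance-sampling LOO, computed from a matrix of pointwise log-likelihoods. This is not the PSIS algorithm and not the code in `loo` or `ArviZ`; the function names and the toy numbers are made up for illustration. PSIS additionally stabilizes the largest weights and reports a diagnostic, which this sketch does not do.

```python
import numpy as np

def log_mean_exp(x, axis=0):
    """Numerically stable log(mean(exp(x))) along an axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.mean(np.exp(x - m), axis=axis))

def is_loo_elpd(log_lik):
    """Raw importance-sampling LOO from an (S draws x N observations) matrix
    of pointwise log-likelihoods log p(y_i | theta_s).

    The unnormalized weight for observation i and draw s is 1 / p(y_i | theta_s),
    so p(y_i | y_{-i}) is estimated by the harmonic mean of p(y_i | theta_s).
    """
    log_lik = np.asarray(log_lik, dtype=float)
    return -log_mean_exp(-log_lik, axis=0)  # one log density per observation

# Toy input (hypothetical numbers): 4 posterior draws, 3 observations.
log_lik = np.log(np.array([[0.5, 0.2, 0.1],
                           [0.5, 0.4, 0.1],
                           [0.5, 0.2, 0.4],
                           [0.5, 0.4, 0.001]]))
pointwise = is_loo_elpd(log_lik)
print(np.exp(pointwise))
```

In the third column a single draw with a tiny likelihood gets a weight of 1000 and dominates the estimate, which is the instability that the quote describes.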

Andrew also discussed multilevel models: “When data have a multilevel (hierarchical) structure, it would make sense to cross-validate by leaving out data individually or in clusters, for example, leaving out a student within a school or leaving out an entire school. The two cross-validations test different things.” PSIS-LOO is great for leave-one-student-out, but leaving out an entire school often changes the posterior so much that even PSIS can’t handle it. In such cases it’s still easiest to use K-fold-CV (i.e., do brute force computation K times, with K possibly smaller than the number of schools). It is possible to use PSIS, but then additional quadrature integration over the parameters for the left-out school is needed to get useful results (e.g. Merkel, Furr, and Rabe-Hesketh, 2019). We’re still thinking about how to make cross-validation for multilevel models easier and faster.
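As a rough illustration of the brute-force option, here is one way to build folds for leave-schools-out K-fold-CV so that every student from a school lands in the same fold. The function name, the round-robin assignment, and the toy data are assumptions for this sketch; the model refit inside the loop is left as a comment.

```python
import numpy as np

def grouped_kfold(group_ids, k, seed=0):
    """Assign every observation to one of k folds so that all observations
    sharing a group id (e.g. a school) land in the same fold."""
    group_ids = np.asarray(group_ids)
    groups = np.unique(group_ids)
    if k > len(groups):
        raise ValueError("k cannot exceed the number of groups")
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(groups)
    # Deal the shuffled groups round-robin into k folds.
    fold_of_group = {g: i % k for i, g in enumerate(shuffled.tolist())}
    return np.array([fold_of_group[g] for g in group_ids.tolist()])

# Hypothetical data: 10 students in 5 schools, K = 3 folds.
school = np.array([0, 0, 1, 1, 1, 2, 3, 3, 4, 4])
fold = grouped_kfold(school, k=3)
for f in range(3):
    held_out = fold == f
    # Here one would refit the model without the held-out schools and
    # evaluate the log predictive density for the held-out students.
    print(f, sorted(set(school[held_out].tolist())))
```

With K smaller than the number of schools, several schools are left out together, so this costs K refits rather than one per school.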

Andrew didn’t discuss time series or non-factorized models, but we can use PSIS to compute leave-future-out cross-validation for time series models (Bürkner, P.-C., Gabry, J., and Vehtari, A., 2020a), and for multivariate normal and Student-t models we can do one part analytically and the rest with PSIS (Bürkner, P.-C., Gabry, J., and Vehtari, A., 2020b).

Andrew mentioned DIC, and we later analyzed the properties of DIC, WAIC, and leave-one-out cross-validation (Gelman, A., Hwang, J., and Vehtari, A., 2014), and eventually PSIS-LOO has proved to be the most reliable and has the best self-diagnostic (Vehtari, A., Gelman, A., Gabry, J., 2017).

Andrew also mentioned my 2002 paper on cross-validation, so I knew that he was aware of my work, but it still took several years before I had the courage to contact him and propose a research visit. That research visit was great, and I think we can say we (including all co-authors and people writing software) have been able to make some concrete steps to make cross-validation a more routine step.

Although we are advocating a routine use of cross-validation, I want to remind readers that we are not advocating cross-validation for model selection as hypothesis testing (see, e.g. this talk, and Gelman et al. 2020). Ideally the modeller includes all the uncertainties in the model, integrates over the uncertainties, and checks that the model makes sense. There is then no need to select any model, as the model that best expresses the information available to the modeller and the related uncertainties is all that is needed. However, cross-validation is useful for assessing how good a single model is, for model checking (diagnosing misspecification), for understanding differences between models, and for speeding up the model building workflow (we can quickly ignore really bad models, and focus on more useful models, see e.g. this talk on Bayesian workflow).

You can find more papers and discussion of cross-validation in CV-FAQ, and stay tuned for more!

Stan goes mountain climbing, also a suggestion of 3*Alex

Jarrett Phillips writes:

I came across this recent preprint by Drummond and Popinga (2021) on applying Bayesian modeling to assess climbing ability. Drummond is foremost a computational evolutionary biologist (as am I) who is also an experienced climber. The work looks quite interesting. I was previously unaware of such an application and thought others may also appreciate it.

I’m not a climber at all, but it’s always fun to see new applications of Stan. In their revision, I hope the authors can collaborate with someone who’s experienced in Bayesian data visualization and can help them make some better graphs. I don’t mean they should just take their existing plots and make them prettier; I mean that I’m pretty sure there are some more interesting and informative ways to display their data and fitted models. Maybe Jonah or Yair could help—they live in Colorado so they might be climbers, right? Or Aleks would be perfect: he’s from Slovenia where everyone climbs, and he makes pretty graphs, and then the paper would have 3 authors named some form of Alex. So that’s my recommendation.

The science bezzle

Palko quotes Galbraith from the classic The Great Crash 1929:

Alone among the various forms of larceny [embezzlement] has a time parameter. Weeks, months or years may elapse between the commission of the crime and its discovery. (This is a period, incidentally, when the embezzler has his gain and the man who has been embezzled, oddly enough, feels no loss. There is a net increase in psychic wealth.) At any given time there exists an inventory of undiscovered embezzlement in—or more precisely not in—the country’s business and banks.

. . .

This inventory—it should perhaps be called the bezzle—amounts at any moment to many millions of dollars. It also varies in size with the business cycle. In good times, people are relaxed, trusting, and money is plentiful. But even though money is plentiful, there are always many people who need more. Under these circumstances, the rate of embezzlement grows, the rate of discovery falls off, and the bezzle increases rapidly. In depression, all this is reversed. Money is watched with a narrow, suspicious eye. The man who handles it is assumed to be dishonest until he proves himself otherwise. Audits are penetrating and meticulous. Commercial morality is enormously improved. The bezzle shrinks.

He also quotes John Kay:

From this perspective, the critic who exposes a fake Rembrandt does the world no favor: The owner of the picture suffers a loss, as perhaps do potential viewers, and the owners of genuine Rembrandts gain little. The finance sector did not look kindly on those who pointed out that the New Economy bubble of the late 1990s, or the credit expansion that preceded the 2008 global financial crisis, had created a large febezzle.

Palko continues:

In 2021, the bezzle grew to unimaginable proportions. Imagine a well-to-do family sitting down to calculate their net worth last December. Their house is worth three times what they paid for it. The portfolio’s doing great, particularly those innovation and disruption stocks they heard about on CNBC. And that investment they made in crypto just for fun has turned into some real money.

Now think about this in terms of stimulus. Crypto alone has pumped trillions of dollars of imaginary money into the economy. Analogously, ending the bezzle functions like a massive contractionary tax. . . .

And this reminds me of . . .

Very parochially, this all makes me think of the replication crisis in psychology, medicine, and elsewhere in the human sciences. As Simine Vazire and I have discussed, a large “bezzle” in these fields accumulated over several decades and then was deflated in the past decade or so.

As with financial bezzles, during this waking-up period there was a lot of anger from academic and media thought leaders—three examples are here, here and here. It seems they were reacting to the loss in value: “From this perspective, the critic who exposes a fake Rembrandt does the world no favor: The owner of the picture suffers a loss, as perhaps do potential viewers, and the owners of genuine Rembrandts gain little.”

This also relates to something else I’ve noticed, which is that many of these science leaders are stunningly unbothered by bad science. Consider some notorious “fake Rembrandts” of 2010 vintage such as the misreported monkey experiments, the noise-mining ESP experiments, the beauty-and-sex ratio paper, the pizzagate food studies, the missing shredder, etc etc. You’d think that institutions such as NPR, Gladwell, Freakonomics, Nudge, and the Association for Psychological Science would be angry at the fakers and incompetents who’d put junk science into the mix, but, to the extent they show emotion on this at all, it tends to be anger at the Javerts who point out the problem.

At some level, I understand. As I put it a few years ago, these people own stock in a failing enterprise, so no wonder they want to talk it up. Still, they’re the ones who got conned, so I’d think they might want to divert some of their anger to the incompetents and fraudsters who published and promoted the science bezzle—all this unreplicable research.

P.S. Related, from 2014: The AAA tranche of subprime science.

“Published estimates of group differences in multisensory integration are inflated”

Mike Beauchamp sends in the above picture of Buster (“so-named by my son because we adopted him as a stray kitten run over by a car and ‘all busted up'”) and sends along this article (coauthored with John F. Magnotti) “examining how the usual suspects (small n, forking paths, etc.) had led our little sub-field of psychology/neuroscience, multisensory integration, astray.” The article begins:

A common measure of multisensory integration is the McGurk effect, an illusion in which incongruent auditory and visual speech are integrated to produce an entirely different percept. Published studies report that participants who differ in age, gender, culture, native language, or traits related to neurological or psychiatric disorders also differ in their susceptibility to the McGurk effect. These group-level differences are used as evidence for fundamental alterations in sensory processing between populations. Using empirical data and statistical simulations tested under a range of conditions, we show that published estimates of group differences in the McGurk effect are inflated when only statistically significant (p < 0.05) results are published [emphasis added]. With a sample size typical of published studies, a group difference of 10% would be reported as 31%. As a consequence of this inflation, follow-up studies often fail to replicate published reports of large between-group differences. Inaccurate estimates of effect sizes and replication failures are especially problematic in studies of clinical populations involving expensive and time-consuming interventions, such as training paradigms to improve sensory processing. Reducing effect size inflation and increasing replicability requires increasing the number of participants by an order of magnitude compared with current practice.

Type M error!
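Here is a small simulation sketch of the significance filter behind that inflation. It assumes the estimated group difference is normally distributed around the true difference; the standard error of 0.12 is an assumption chosen for illustration and is not taken from the article, so the output should not be expected to reproduce the article’s 31% figure.

```python
import numpy as np

def exaggeration(true_effect, se, n_sims=200_000, seed=1):
    """Simulate many studies and keep only the 'significant' ones.

    Returns the power, the mean absolute estimate among significant
    results, and the exaggeration ratio (type M error).
    """
    rng = np.random.default_rng(seed)
    est = rng.normal(true_effect, se, size=n_sims)
    significant = np.abs(est) > 1.96 * se  # two-sided p < 0.05
    power = significant.mean()
    mean_sig = np.abs(est[significant]).mean()
    return power, mean_sig, mean_sig / true_effect

# True group difference of 10 percentage points, assumed standard error of 12.
power, mean_sig, ratio = exaggeration(true_effect=0.10, se=0.12)
print(f"power = {power:.2f}, mean significant |estimate| = {mean_sig:.2f}, "
      f"exaggeration ratio = {ratio:.1f}")
```

Any estimate that clears the significance threshold must exceed 1.96 standard errors in absolute value, so when the standard error is larger than the true effect, the published estimates are guaranteed to be too big.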

Academic jobs in Bayesian workflow and decision making

This job post (with two research topics) is by Aki (I promise that next time I post about something else).

I’m looking for postdocs and doctoral students to work with me on Bayesian workflow at Aalto University, Finland. You can apply through a joint call (with many other related topics), using the application forms for postdocs and for doctoral students.

We’re also looking for postdocs and doctoral students to work on Probabilistic modeling for assisting human decision making, with Finnish Center for Artificial Intelligence funding. You can apply through a joint call (with many more probabilistic modeling topics) application form.

To get some idea of how we might approach these topics, you can check what I’ve recently been talking about and working on.

For five years straight, starting in 2018, the World Happiness Report has singled out Finland as the happiest country on the planet

Scientific communication: over the wine-dark sea to the rose-fingered dawn

On the topic of the Homeric epics, Thomas Jones writes:

The illiterate performers who recited or sang epic poems in Ancient Greece did not learn them by rote. (Boris Johnson’s botched renditions of the Iliad are a double failure: failing to learn it by rote and trying to learn it in the first place.) Rather, a poet would improvise his song using formulaic words and phrases. Every performance was in some sense a new composition, but also a seamless continuation of the tradition. . . .

[Athena] is variously ‘Pallas Athena’, ‘grey-eyed Athena’, ‘the goddess grey-eyed Athena’ and so on according to the demands of grammar and metre: as Parry points out, ‘Homer had to hand a particular word for each of ten metrical exigencies that might arise.’ These didn’t always conform to logic. Ships are described as ‘hollow’, ‘swift’, ‘black’, ‘well-decked’, ‘seafaring’, ‘trim’, ‘many-tholed’, ‘curved’, ‘huge’, ‘famed’, ‘well-built’, ‘many-benched’, ‘vermilion-cheeked’, ‘prowed’ or ‘straight-horned’, according to where they appear in the line of verse rather than where, or if, they appear on the ‘wine dark’, ‘grey’ or ‘loud-roaring’ sea: the Greeks’ ‘swift’ and ‘seafaring’ ships are beached throughout the Iliad. ‘Early rose-fingered dawn’ is mentioned so often in Homer for much the same reason a blues singer might tell you he ‘woke up this morning’: in part to buy time while composing the next line.

This reminds me of a discussion we had the other day about improvisation in academic talks, when commenter Gec wrote:

Good improvisers spend a great deal of time preparing! It’s just that their prep time is not spent rehearsing a set performance.

I don’t know anything about acting, but I have a lot of personal experience playing jazz badly. Better players prepare by practicing riffs (roughly, snippets that form a kind of combinatorial repertoire or, at least, something to fall back on when you don’t have any better ideas), technique (boring stuff like scales, chord progressions, etc.), and building up a web of knowledge and references they can rely on to construct a long-form performance and build on what others are doing.

That jibes with my speaking style. I have lots of riffs (examples and ideas that I’m familiar with), technique (statistical methods), and a web of knowledge (decades of experience), and all that allows me to improvise a talk.

When planning a talk I often prepare some written text that I read word for word. Perhaps surprisingly, reading a well-written set of paragraphs word for word can work well in a live talk. I don’t think it would go so well for me to read two pages straight, but a few clean sentences can do wonders.

Most of the time, though, I’m working from a rough outline or sketched set of points, and I’m doing a lot of riffing and transitioning. I guess this is like those Homeric bards and blues singers: I have some phrases that sound good, and I use these as building blocks.

It’s slightly different in that an academic talk is made up not of words and music but of ideas, so it’s not so much that I stick in various phrases as that I stick in various ideas. I have a few hundred examples bouncing around my head at any given time, and when I speak, I can let them spill out, Tetris-style, to fill in the space.

More generally, it’s not just about giving talks; it’s about communication, laying out ideas and seeing how they fit as they come to mind.

It’s not just about me; you can do that too.

How much should we trust assessments in systematic reviews? Let’s look at variation among reviews.

Ozzy Tunalilar writes:

I increasingly notice these “risk of bias” assessment tools (e.g., Cochrane) popping up in “systematic reviews” and “meta-analysis” with the underlying promise that they will somehow guard against unwarranted conclusions depending on, perhaps, the degree of bias. However, I also noticed multiple published systematic reviews referencing, using, and evaluating the same paper (Robinson et al 2013; it could probably have been any other paper). Having noticed that, I compiled the risk of bias assessment by multiple papers on the same paper. My “results” are above – so much variation across studies that perhaps we need to model the assessment of risk of bias in review of systematic reviews. What do you think?

My reply: I don’t know! I guess some amount of variation is expected, but this reminds me of a general issue in meta-analysis that different studies will have different populations, different predictors, different measurement protocols, different outcomes, etc. This seems like even more of a problem, now that thoughtless meta-analysis has become such a commonly-used statistical tool, to the extent that there seem to be default settings and software that can even be used by both sides of a dispute.


Hey, fans of blood sports! Good news! You no longer need to spend your money on bullfight tickets in that “dream vacation in Spain.” You can get your fun right here in the city.

I say this because today we were taking our usual route home, and on 109 St between Amsterdam Ave and Broadway we saw something like 12 dead rats. And not all in one place, either! They were all along the block. I’m not exaggerating here. We’ve seen our share of dead rats on the street, and the occasional live one. But not so many at once. Maybe there was some nest of rats that got displaced and all ran out on the street at once and a bunch of them got run over?

Anyway, no need to go to Spain for this sort of excitement. You can get it all here in the good ol’ U S of A.

Bayes factors measure prior predictive performance

I was having a discussion with a colleague after a talk that was focused on computing the evidence and mentioned that I don’t like Bayes factors because they measure prior predictive performance rather than posterior predictive performance. But even after filling up a board, I couldn’t convince my colleagues that Bayes factors were really measuring prior predictive performance. So let me try in blog form and maybe the discussion can help clarify what’s going on.

Prior predictive densities (aka, evidence)

If we have data y, parameters \theta, sampling density p(y \mid \theta) and prior density p(\theta), the prior predictive density is defined as

p(y) = \int p(y \mid \theta) \, p(\theta) \, \textrm{d}\theta.


The integral computes an average of the sampling density p(y \mid \theta) weighted by the prior p(\theta). That’s why we call it “prior predictive”.
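To see the “average over the prior” reading in action, here is a minimal sketch for a beta-binomial model, where the integral has a closed form that a Monte Carlo average over prior draws can be checked against. The model, the function names, and the numbers (7 successes in 10 trials, a Beta(2, 2) prior) are assumptions picked for illustration.

```python
import math
import numpy as np

def log_choose(n, k):
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def log_beta_fn(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def prior_predictive_exact(k, n, a, b):
    """p(y) for y = k successes in n trials with theta ~ Beta(a, b)."""
    return math.exp(log_choose(n, k)
                    + log_beta_fn(a + k, b + n - k) - log_beta_fn(a, b))

def prior_predictive_mc(k, n, a, b, draws=200_000, seed=0):
    """Average the sampling density p(y | theta) over draws from the prior."""
    rng = np.random.default_rng(seed)
    theta = rng.beta(a, b, size=draws)
    log_lik = log_choose(n, k) + k * np.log(theta) + (n - k) * np.log1p(-theta)
    return float(np.mean(np.exp(log_lik)))

exact = prior_predictive_exact(k=7, n=10, a=2.0, b=2.0)
approx = prior_predictive_mc(k=7, n=10, a=2.0, b=2.0)
print(exact, approx)
```

The Monte Carlo version never looks at a posterior: every draw of theta comes from the prior, which is the whole point of the name.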

Bayes factors compare prior predictive densities

Let’s write p_{\mathcal{M}}(y) to indicate that the prior predictive density depends on the model \mathcal{M}. Then if we have two models, \mathcal{M}_1, \mathcal{M}_2, the Bayes factor for data y is defined to be

\textrm{BF}(y) = \frac{p_{\mathcal{M}_1}(y)}{p_{\mathcal{M}_2}(y)}.

What are Bayes factors measuring? Ratios of prior predictive densities. Usually this isn’t so interesting because the difference between a weakly informative prior and one an order of magnitude wider usually doesn’t make much of a difference for posterior predictive inference. There’s more discussion of this with examples in Gelman et al.’s Bayesian Data Analysis.

Jeffreys set thresholds for Bayes factors of “barely worth mentioning” (below \sqrt{10}) to “decisive” (above 100). But we don’t need to worry about that.
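Here is a sketch of that prior-width sensitivity in a conjugate example where everything is available in closed form. The setup is an assumption for illustration: y_n ~ normal(theta, 1) with theta ~ normal(0, tau), comparing tau = 10 against tau = 100 on ten made-up data points.

```python
import math
import numpy as np

def log_evidence(y, tau):
    """log p(y) for y_n ~ normal(theta, 1), theta ~ normal(0, tau).

    Marginally y is multivariate normal with covariance I + tau^2 * 1 1^T,
    whose determinant is 1 + n tau^2 and whose inverse is
    I - tau^2 / (1 + n tau^2) * 1 1^T.
    """
    y = np.asarray(y, dtype=float)
    n, s, ss = len(y), y.sum(), (y ** 2).sum()
    quad = ss - tau ** 2 * s ** 2 / (1 + n * tau ** 2)
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * math.log(1 + n * tau ** 2) - 0.5 * quad)

def posterior_mean_sd(y, tau):
    """Conjugate posterior for theta under the same model."""
    y = np.asarray(y, dtype=float)
    precision = len(y) + 1 / tau ** 2
    return y.sum() / precision, 1 / math.sqrt(precision)

# Made-up data for illustration.
y = [0.8, 1.3, 0.4, 1.1, 0.9, 1.6, 0.7, 1.2, 1.0, 0.5]
log_bf = log_evidence(y, tau=10.0) - log_evidence(y, tau=100.0)
mean_narrow, sd_narrow = posterior_mean_sd(y, tau=10.0)
mean_wide, sd_wide = posterior_mean_sd(y, tau=100.0)
print(math.exp(log_bf), mean_narrow, mean_wide)
```

Widening the prior by a factor of ten moves the Bayes factor by roughly a factor of ten, while the posterior for theta (and hence posterior predictive inference) barely changes.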

Posterior predictive distribution

Suppose we’ve already observed some data y^{\textrm{obs}}. The posterior predictive distribution is

p(y \mid y^{\textrm{obs}}) = \int p(y \mid \theta) \, p(\theta \mid y^{\textrm{obs}}) \, \textrm{d}\theta.


The key difference from the prior predictive distribution is that we average our sampling density p(y \mid \theta) over the posterior p(\theta \mid y^{\textrm{obs}}) rather than the prior p(\theta).


In the Bayesian workflow paper, we recommend using cross-validation to compare posterior predictive distributions and we don’t even mention Bayes factors. Stan provides an R package, loo, for efficiently computing approximate leave-one-out cross-validation.

The path from prior predictive to posterior predictive

Introductions to Bayesian inference often start with a very simple beta-binomial model which can be solved analytically online. That is, we can update the posterior by simple counting after each observation. Each posterior is also a beta distribution. We can do this in general and consider our data y = y_1, \ldots, y_N arriving sequentially and updating the posterior each time.

p(y_1, \ldots, y_N) = p(y_1) \, p(y_2 \mid y_1) \, \cdots \, p(y_N \mid y_1, \ldots, y_{N-1}).


In this factorization, we predict y_1 based only on the prior, then y_2 based on y_1 and the prior and so on until the last point is modeled in the same way as leave-one-out cross-validation as p(y_N \mid y_1, \ldots, y_{N-1}). We can do this in any order and the result will be the same. As N increases, prior predictive density converges to posterior predictive density on an average (per observation y_n) basis. But for finite amounts of data N \ll \infty, the measures can be very different.
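The sequential factorization can be checked numerically in the beta-binomial case. This sketch assumes a Bernoulli sequence with a Beta(a, b) prior and made-up data; each one-step-ahead predictive probability is computed from the current posterior, and the product is compared with the closed-form prior predictive density of the whole sequence.

```python
import math

def log_beta_fn(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal_sequential(y, a, b):
    """Sum of log p(y_n | y_1, ..., y_{n-1}), updating the beta posterior
    by counting after each observation."""
    terms = []
    for y_n in y:
        p_one = a / (a + b)  # predictive probability that y_n = 1
        terms.append(math.log(p_one if y_n == 1 else 1.0 - p_one))
        a, b = a + y_n, b + (1 - y_n)
    return sum(terms), terms

def log_marginal_closed_form(y, a, b):
    """log p(y_1, ..., y_N) for the ordered sequence, by direct integration."""
    k, n = sum(y), len(y)
    return log_beta_fn(a + k, b + n - k) - log_beta_fn(a, b)

# Made-up data: 6 successes in 8 trials, uniform Beta(1, 1) prior.
y = [1, 0, 1, 1, 0, 1, 1, 1]
seq, terms = log_marginal_sequential(y, a=1.0, b=1.0)
rev, _ = log_marginal_sequential(y[::-1], a=1.0, b=1.0)
closed = log_marginal_closed_form(y, a=1.0, b=1.0)
print(seq, rev, closed)
```

The first term uses only the prior and the last term conditions on everything else, so with small N the early, prior-driven terms make up a large share of the total.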

This blog is like a Masterclass except the production values are close to zero, it’s free, and we don’t claim that reading it will make you better at anything.

“From a biochemical perspective, wakefulness is low-level brain damage”

So, I picked up this week’s New Yorker and saw an article on an online “edutainment” platform called Masterclass. Masterclass . . . that rang a bell . . . I googled and found this post from last year, recounting a breathless ad for a Masterclass from famed sleep expert and data misrepresenter Matthew Walker.

The New Yorker article, by Tad Friend, was fine, a mix of warmth and gentle skepticism. Eventually, though, it reached the point that I was fearing:

The best classes give you a new lens on the world. . . . Matthew Walker, the site’s sleep expert, warns against caffeine, alcohol, and naps—three of my favorite things. “From a biochemical perspective,” he observes, “wakefulness is low-level brain damage.”

“A new lens on the world,” huh? Yeah, anybody could give you a new lens on the world too, if they’re willing to make stuff up and pretend it’s true.

Let’s play that one again:

“From a biochemical perspective,” he observes, “wakefulness is low-level brain damage.”

First, what the hell does that mean? Does it mean anything at all? “From a biochemical perspective,” indeed. From a theological perspective, I think this is B.S.

Second, I’m disappointed that the reporter took off his reporter hat in writing that sentence. The appropriate word is “states” or “pontificates” or maybe “claims,” not “observes.”

OK, this is just a tiny part of a long article. It just happens to be the one part I know something about. So I’m ranting.

This blog is like a Masterclass except the production values are close to zero, it’s free, and we don’t claim that reading it will make you better at anything.

Example of inappropriate use of causal language from observational data

George Dickinson points to this article and writes:

“.. analyses .. show that .. more intelligent individuals were less satisfied with their lives during the COVID-19 global pandemic *because* they were more intelligent” (emphasis in original)

Seems to me that a causal assertion such as the above needs a stronger tool than the multiple ordinal regression analysis they use in the paper. What do you think?

My reply: Oh yeah, this is terrible. The first author of this paper is notorious for publishing articles with bad statistics; see here. I’ll say this for the guy, though: he has an absolute talent for getting this bad research published in real journals.

As often seems to be the case, the problem appears to be a mixture of: (1) genuine confusion about the distinction between descriptive and causal inference, (2) overconfidence about what can be learned from statistical data, (3) journals like to publish big claims, and (4) big claims get noticed.

P.S. Zad sends along the above picture of someone who is probably making inappropriate causal inference all the time—that’s what the executive function of the brain does for a living!—but that’s ok because adorable. Also I guess it all worked out because evolution.

High-intensity exercise, some new news

This post is by Phil Price, not Andrew.

Several months ago I noticed something interesting (to me!) about my heart rate, and I thought about blogging about it…but I didn’t feel like it would be interesting (to you!) so I’ve been hesitant. But then the NYT published something that is kinda related and I thought OK, what the hell, maybe it’s time for an update about this stuff. So here I am.

The story starts way back in 2010, when I wrote a blog article called “Exercise and Weight Loss: Shouldn’t Somebody See if there’s a Relationship?” In that article I pointed out that there had been many claims in the medical / physiology literature that exercise doesn’t lead to weight loss in most people, but that those studies seemed to be overwhelmingly looking at low- and medium-intensity exercise, really not much (or at all) above warmup intensity. When I wrote that article I had just lost about twelve pounds in twelve weeks when I started doing high-intensity exercise again after a gap of years, and I was making the point that before claiming that exercise doesn’t lead to weight loss, maybe someone should test whether the claim is actually true, rather than assuming that just because low-intensity exercise doesn’t lead to weight loss, no other type of exercise would either.

Eight years later, four years ago, I wrote a follow-up post along the same lines. I had gained some weight when an injury stopped me from getting exercise. As I wrote at the time, “Already this experience would seem to contradict the suggestion that exercise doesn’t control weight: if I wasn’t gaining weight due to lack of exercise, why was I gaining it?” And then I resumed exercise, in particular exercise that had some maximum short-term efforts as I tried to get in shape for a bike trip in the Alps, and I quickly lost the weight again. Even though I wasn’t conducting a formal experiment, this is still an example of what one can learn through “self-experimentation,” which has a rich history in medical research.

Well, it’s not like I’ve kept up with research on this in the mean time, but I did just see a New York Times article called “Why Does a Hard Workout Make You Less Hungry” that summarizes a study published in Nature that implicates a newly-discovered “molecule — a mix of lactate and the amino acid phenylalanine — [that] was created apparently in response to the high levels of lactate released during exercise. The scientists named it lac-phe.” As described in the article, the evidence seems pretty convincing that high-intensity exercise helps mice lose weight or keep it off, although the evidence is a lot weaker for humans. That said, the humans they tested do generate the same molecule, and a lot more of it after high-intensity exercise than lower-intensity exercise. So maybe lac-phe does help suppress appetite in humans too.

As for the interesting-to-me (but not to you!) thing that I noticed about my heart rate, that’s only tangentially related but here’s the story anyway. For most of the past dozen years a friend and I have done bike trips in the Alps, Pyrenees, or Dolomites. Not wanting a climb up Mont Ventoux or Stelvio to turn into a death march due to under-training, I always train hard for a few months in the spring, before the trip. That training includes some high-intensity intervals, in which I go all-out for twenty or thirty seconds, repeatedly within a few minutes, and my heart rate gets to within a few beats per minute of my maximum. While I’m doing this training I lose the several pounds I gained during the winter. Unfortunately, as you may recall, we have had a pandemic since early 2020. My friend and I did not do bike trips. With nothing to train for, I didn’t do my high-intensity intervals. I still did plenty of bike riding, but didn’t get my heart rate up to its maximum. I gained a few pounds, not a big deal.

But a few months ago I decided to get back in shape, thinking I might try to do a big ride in the fall if not the summer. My first high-intensity interval, I couldn’t get to within 8 beats per minute of my usual standard, which had been nearly unchanged over the previous 12 years! Prior to 2020, I wouldn’t give myself credit for an interval if my heart rate hadn’t hit at least 180 bpm; now I maxed out at 172. My first thought: blame the equipment. Maybe my heart rate monitor isn’t working right, maybe a software update has changed it to average over a longer time interval, maybe something else is wrong. But trying two monitors, and checking against my self-timed pulse rate, I confirmed that it was working correctly; I really was maxing out at 172 instead of 180. Holy cow. I decided to discuss this with my doctor the next time I have a physical, but in the meantime I kept doing occasional maximum-intensity intervals…and my max heart rate started creeping up. A few days ago I hit 178, so it’s up about 6 bpm in the past four months. And I’ve lost those few extra pounds and now I’m pretty much back to my regular weight for my bike trips.

The whole experience has (1) reinforced my already-strong belief that high-intensity exercise makes me lose weight if I’m carrying a few extra pounds, and (2) made me question the conventional wisdom that everyone’s max heart rate decreases with age: maybe if you keep exercising at or very near your maximum heart rate, your maximum heart rate doesn’t decrease, or at least not much? (Of course at some point your maximum heart rate goes to 0 bpm. Whaddyagonnado.)

So, to summarize: (1) Finally someone is taking seriously the possibility that high-intensity exercise might lead to weight loss, and even looking for a mechanism, and (2) when I stopped high-intensity exercise for a couple years, my maximum heart rate dropped…a lot.

Sorry those are not more closely related, but I was already thinking about item 2 when I encountered item 1, so they seem connected to me.


A garland of retractions for the Ohio State Department of Chutzpah Cancer Biology and Genetics

This is a story that puzzles me: it’s a case where seemingly everyone agrees there was research fraud, but for some reason nobody wants to identify who did it. Just business as usual in the War on Cancer?

Ellie Kincaid reports:

On a Saturday last November, Philip Tsichlis of The Ohio State University received an email no researcher wants to get.

Another scientist had tried to replicate a finding in a recent paper of his, and couldn’t. “We believe that our results should lead to some revision of the model you propose,” stated the email, which was released to us by OSU following a public records request.

It turned out that was an understatement. The email eventually led Tsichlis to discover data fabrication in that paper and a related article. Within a week, he requested the retraction of both papers . . .

OK, so far so good. Here’s the background:

The email that spurred Tsichlis to reevaluate his lab’s papers came from Alexandre Maucuer of INSERM on November 13th.

It referred to an article his group had published in Nature Communications in July, titled “AKT3-mediated IWS1 phosphorylation promotes the proliferation of EGFR-mutant lung adenocarcinomas through cell cycle-regulated U2AF2 RNA splicing,” and was addressed to Tsichlis and the paper’s first author, Georgios I. Laliotis. . . .

On the Friday after he received Maucuer’s email, Tsichlis emailed the editor of a related paper his group had published in October in Communications Biology, requesting to retract the article and explaining what had happened . . .

In the letter, Tsichlis referred to “evidence of data manipulation,” and here was the published retraction note:

The authors are retracting this Article as irregularities were found in the data that indicate the splicing of the U2AF2 exon 2 does not occur as reported in the Article. The irregularities call into question the conclusions and undermine our full confidence in the integrity of the study. The authors therefore wish to retract the Article.

All of the authors agree with the retraction.

My question

OK, here’s my question. If the retraction is happening because someone faked the data (that’s what “data manipulation” means, right? From the description, it doesn’t sound like a mere coding error like this), and all the authors agree with the retraction, then . . . who did the faking?

Here’s the author list:

Georgios I. Laliotis, Adam D. Kenney, Evangelia Chavdoula, Arturo Orlacchio, Abdul Kaba, Alessandro La Ferlita, Vollter Anastas, Christos Tsatsanis, Joal D. Beane, Lalit Sehgal, Vincenzo Coppola, Jacob S. Yount & Philip N. Tsichlis

That’s 13 authors. If 1 of them faked the data, then that person would have 12 very angry collaborators, right??

I just don’t understand what’s going on here. All 13 authors agreed there was data fabrication. There must be a culprit, no? So why isn’t anyone saying who did it?

OK, I get it, nobody wants to point the finger. After all, the sort of person who would fake data on a publication also could be the sort of person who would sue. But, can’t they get around it another way, by each of the non-cheaters releasing a statement saying, I didn’t do it?

If I were one of the 12 non-cheating authors on this paper, I’d be mad as hell that the cheater is getting off the hook here.

But maybe the cheater isn’t any of those people! In his letter, Tsichlis points to manipulation in Figure 1g of this other paper with an overlapping but different author list:

Georgios I. Laliotis, Evangelia Chavdoula, Maria D. Paraskevopoulou, Abdul Kaba, Alessandro La Ferlita, Satishkumar Singh, Vollter Anastas, Keith A. Nair II, Arturo Orlacchio, Vasiliki Taraslia, Ioannis Vlachos, Marina Capece, Artemis Hatzigeorgiou, Dario Palmieri, Christos Tsatsanis, Salvatore Alaimo, Lalit Sehgal, David P. Carbone, Vincenzo Coppola & Philip N. Tsichlis

The real question

Which of these 20 authors is the manipulator? Again, if I was one of the 19 honest people, I’d be furious. On the other hand, it’s possible the cheater has the power to damage co-authors’ careers. That’s the sort of thing that can make people scared to blow the whistle.

This work was funded by the National Institutes of Health:

This work was supported by the National Institute of Health grants R01CA186729 to P.N.T., and R01 CA198117 to P.N.T and V.C, and by the National Institutes of Health/National Cancer Institute P30 Grant CA016058 to the Ohio State University Comprehensive Cancer Center (OSUCCC). G.I.L was supported by a Pelotonia Post-Doctoral fellowship from OSUCCC.

I don’t think we can expect Ohio State University to look into this one too carefully, as it would be bad publicity. But the NIH, they should be really annoyed that millions of dollars have been going to research fraud. They could launch an investigation, no? This is really bad.

More evidence

The article with the problematic Figure 1g has this section:


G.I.L. conceived and performed experiments, analyzed data, prepared figures, and contributed to the writing of the paper. E.C. designed and performed the mouse xenograft experiments, including the characterization of the tumors, performed the IHC experiments on human TMAs, and provided comments contributing to the writing of the paper. M.D.P. performed bioinformatics analyses of RNA-seq data that led to the identification of alternatively spliced targets of the IWS1 pathway. A.D.K. performed experiments, under the supervision of G.I.L. A.L.F. performed bioinformatics analyses of microarray and TCGA data. S.S. contributed to the FACS experiments. V.A. performed and analyzed the proliferation experiments. S.A. performed and supervised bioinfomatics analyses. A.O. prepared extracts of human tumors and provided comments on the paper. K.A.N. performed experiments, under the supervision of G.I.L. V.T. prepared the cells and the mRNA for the RNA-Seq experiment. I.V. Bioinformatics analyses of RNA-seq data. M.C. prepared extracts of human tumors used in this study A.H. Provided supervision for the bioinformatics analyses of the RNA-seq data. D.P. provided technical advice on several experiments in this paper and contributed to the design of these experiments. Provided comments on the paper. C.T. advised on the design of experiments. L.S. advised on the design of experiments. D.P.C. advised on the biology of lung cancer and on the design of experiments, provided cell lines, and reagents. V.C. contributed to the overall experimental design. P.N.T. conceived and initiated the project, contributed to the experimental design, supervised the work and monitored its progress, and wrote the paper, together with G.I.L.

Can this help us figure anything out? G.I.L. is the only author listed as preparing figures, so that might lead us to think that he’s the one who manipulated the data. But we can’t be sure. It could be that the Contributions statement is inaccurate and someone else made the graph. Or maybe something else went on that we don’t know about.

The Retraction Watch article has some interesting comments, including pointers to other papers by this research group that seem to have data problems, and a link to a letter by Tsichlis and the other tenured faculty in his department supporting the notorious Carlo Croce (from Wikipedia, “Croce’s research and publications have been scrutinized by the scientific community for possible scientific misconduct, including image and data manipulation. While working at Jefferson, federal investigators alleged Croce and a colleague had submitted false claims for research never undertaken. The university settled the allegations, paying $2.6 million to the government without admitting any wrongdoing. In 2007, OSU investigated Croce for misconduct after the National Institutes of Health (NIH) returned a funding application that contained major portions identical to an application submitted months earlier by Croce’s junior colleague. OSU later cleared Croce of misconduct after accusations that he had patented a researcher’s work without providing proper credit, that members of his lab had inappropriately used grant money for personal trips abroad, and that Croce improperly pressured colleagues for research attribution. Since 2013, several scientists have claimed research misconduct on the part of Croce, and as of 2020 these allegations remain under investigation by the federal Office of Research Integrity (ORI). . . . In 2013, following accusations from science critic Clare Francis of image manipulation in over 30 research papers, OSU instructed Croce to correct or retract some of his research publications; in 2015, the journal Clinical Cancer Research issued a correction after being contacted on the matter by a newspaper. 
In 2014, the Proceedings of the National Academy of Sciences of the United States of America dismissed a challenge that Croce’s 2005 paper on the WWOX gene contained manipulated western blots, but in 2017 the journal agreed to correct the paper after consulting with experts. In 2016, Croce was found to have plagiarized a paper he published in PLoS One from six separate sources. In 2017, the journal Cell Death and Differentiation retracted a paper Croce had published in 2010 after it learned that images had been copied from a 2008 paper published in another journal. Also in 2017, the Journal of Biological Chemistry retracted a paper Croce had published in 2008 due to image/figure irregularities. . . . In 2018, two cancer researchers at OSU, Samson T. Jacob and Ching-Shih Chen, both colleagues and co-authors with Croce on two papers each, were found to have engaged in scientific misconduct. On May 10, 2017, Croce filed a lawsuit against The New York Times and several of its writers and editors for defamation, invasion of privacy, and intentional infliction of emotional distress. In November 2018, United States District Judge James Graham dismissed virtually all of Croce’s lawsuit. In 2017 Croce also filed a defamation lawsuit against critic David Sanders of Purdue University, who was quoted in The New York Times article. In May 2020 Croce lost the defamation lawsuit against Sanders . . .”).

Amusingly (or horrifyingly, depending on how you feel about fraud conducted at taxpayers’ expense), the letter by the 14 tenured faculty mentions that Croce is a member of the National Academy of Sciences (that’s supposed to be a good thing??? What does it take to get kicked out of that august body?) and that their department “enjoys the highest per-capita NIH funding on the entire campus.”

Also this from the Wikipedia page:

In 1994, Croce joined the Council for Tobacco Research’s scientific advisory board, where he remained until the group closed after the Tobacco Master Settlement Agreement, and during which time tobacco companies used Croce’s research into fragile histidine triad (FHIT) to argue that lung cancer was an inherited condition. . . . In 2016, Croce was paid more than $850,000 by Ohio State. In 2019, Croce was removed as chair of the Department of Cancer Biology and Genetics at OSU, and he subsequently sued OSU to be reinstated, losing his request for a temporary restraining order although retaining his salary of $804,461 per year.

And, here he is being given the full hero treatment by Smithsonian magazine:

The driver is Carlo Croce, a 64-year-old Italian scientist with a big voice, disheveled curly hair and expressive dark eyes. He heads the Human Cancer Genetics Program at Ohio State University, and his silver Scaglietti Ferrari is a fitting symbol of his approach to science: grand, high-powered and, these days especially, sizzling hot.

This was in 2009, 15 years after he did his dirty work for the Council for Tobacco Research, and 2 years after he was investigated for misconduct over an NIH application. Anyway, I guess he needed that $804,461 per year to pay for the gas for his Ferrari.

What’s going on there in Columbus, Ohio, anyway? How could it be that all 14 tenured faculty of the Department of Cancer Biology and Genetics think this is OK? Can they, like, just shut the department down and start over? What does the National Academy of Sciences think about their name being used in this way? What about the National Institutes of Health? The Ohio State University Board of Trustees? Then again, I’m still wondering what the Columbia University Board of Trustees thinks about this whole U.S. News thing.

It’s a dangerous world out there, and there’s a lot worse things going on than scientific corruption, misuse of government funds, and the trashing of the reputation of Ohio State University. Still, this all seems pretty bad.

If I were Philip N. Tsichlis, I’d be pretty angry. People keep fabricating data on papers I’m publishing but nobody ever reveals who the fabricators are, and I’m being paid less than half of the salary of a guy who’s had umpteen retractions.

At some point the dude’s gotta lose his patience, right? In the meantime, I’m kinda stunned that NIH continues to be funding these people. I guess the idea is, they might be faking their data, but they’re still curing cancer?

Kaiser Fung’s review of “Don’t Trust Your Gut: Using Data to Get What You Really Want in Life” (and a connection to “Evidence-based medicine eats itself”)

Kaiser writes:

Seth Stephens-Davidowitz has a new book out early this year, “Don’t Trust Your Gut”, which he kindly sent me for review. The book is Malcolm Gladwell meets Tim Ferriss – part counter intuition, part self help. Seth tackles big questions: how to find love? how to raise kids? how to get rich? how to be happier? He invariantly believes that big data reveal universal truths on such matters. . . .

Seth’s book interests me as a progress report on the state of “big data analytics”. . . .

The data are typically collected by passive observation (e.g. tax records, dating app usage, artist exhibit schedules). Meaningful controls are absent (e.g. no non-app users, no failed artists). The dataset is believed to be complete. The data aren’t specifically collected for the analysis (an important exception is the happiness data collected from apps for that specific purpose). Several datasets are merged to investigate correlations.

Much – though not all – of the analyses use the most rudimentary statistics, such as statistical averages. This can be appropriate, if one insists one has all the data, or “essentially” all. An unstated axiom is that the sheer quantity of data crowds out any bias. This is not a new belief: as long as Google has existed, marketing analysts have always claimed that Google search data are fully representative of all searches since Google dominates the market. . . .

If the analyst incorporates model adjustments, these adjusted models are treated as full cures of all statistical concerns. [For example, the] last few chapters on activities that cause happiness or unhappiness report numerous results from adjusted models of underlying data collected from 60,000 users of specially designed mobile apps. The researchers broke down 3 million logged events by 40 activity types, hour of day, day of week, season of year, location, among other factors. For argument’s sake, let’s say the users came from 100 places, ignore demographic segmentation, and apply zero exclusions. Then, the 3 million points fell into 40*24*7*4*100 = 2.7 million cells… unevenly but if evenly, each cell has an average of 1.1 events. That means many cells contain zero events. . . . The estimates in many cells reflect an underlying model that hasn’t been confirmed with data – and the credibility of these estimates rests with the reader’s trust in the model structure.

I observed a similar phenomenon when reading the well-known observational studies of Covid-19 vaccine effectiveness. Many of these studies adjust for age, an obvious confounder. Having included the age term, which quite a few studies proclaimed to be non-significant, the researchers spoke as if their models are free of any age bias.
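To see the sparsity in the happiness-app example, here is a minimal sketch of the cell-count arithmetic. The factor levels are the ones in the quoted passage (the 100 places are Kaiser's for-argument's-sake number). The Poisson step is an extra assumption added here for illustration, namely that events land in cells uniformly at random, which is if anything generous to the data.

```python
import math

# Factor levels from the quoted example; "place": 100 is Kaiser's
# for-argument's-sake assumption, not a figure from the book.
levels = {"activity": 40, "hour": 24, "weekday": 7, "season": 4, "place": 100}
n_events = 3_000_000

n_cells = math.prod(levels.values())   # 2,688,000 cells, i.e. about 2.7 million
mean_per_cell = n_events / n_cells     # about 1.1 events per cell

# Illustrative assumption: events fall into cells uniformly at random, so each
# cell's count is roughly Poisson with this mean and P(empty) = exp(-mean).
share_empty = math.exp(-mean_per_cell)

print(f"{n_cells:,} cells, {mean_per_cell:.2f} events per cell on average")
print(f"roughly {share_empty:.0%} of cells empty even if events were spread evenly")
```

Under that idealized even spreading, about a third of the cells would still be empty; with the uneven spreading Kaiser mentions, the share can only be expected to be larger, so the estimates for those cells come from the model structure rather than from data.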

Kaiser continues:

A blurred line barely delineates using data as explanation and as prescription.

Take, for example, the revelation that people who own real estate businesses have the highest chance of being a top 0.1% earner in the U.S., relative to other industries. This descriptive statistic is turned into a life hack, that people who want to get rich should start real-estate businesses. Nevertheless, being able to explain past data is different from being able to predict the future. . . .

And, then, Kaiser’s big point:

Most of the featured big-data research aim to discover universal truths that apply to everyone.

For example, an eye-opening chart in the book shows that women who were rated bottom of the barrel in looks have half the chance of getting a response in a dating app when they messaged men in the most attractive bucket… but the absolute response was still about 30%. This produces the advice to send more messages to presumably “unattainable” prospects.

Such a conclusion assumes that the least attractive women are identical to the average women on factors other than attractiveness. It’s possible that such women who approach the most attractive-looking men have other desirable assets that the average woman does not possess.

It’s an irony because with “big data”, it should be possible to slice and dice the data into many more segments, moving away from the world of “universal truths,” which are statistical averages . . .

This reminds me of a post from a couple years ago, Evidence-based medicine eats itself, where I pointed out the contradiction between two strands of what is called “evidence-based medicine”: the goal of treatments targeted to individuals or subsets of the population, and the reliance on statistically significant results from randomized trials. Statistical significance is attained by averaging, which is the opposite of what needs to be done to make individualized or local recommendations.
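A rough sketch of that tension, under simplifying assumptions: a single mean rather than a treatment-versus-control difference, independent observations, and a common standard deviation. The sample size and the subgroup counts are hypothetical. The standard error of a mean is sigma/sqrt(n), so splitting the sample into k equal subgroups multiplies each subgroup's standard error by sqrt(k).

```python
import math

def standard_error(sigma, n):
    """Standard error of a mean of n independent observations with sd sigma."""
    return sigma / math.sqrt(n)

sigma, n = 1.0, 10_000            # hypothetical study: outcome sd 1, 10,000 people
for k in (1, 4, 25, 100):         # number of equal-sized subgroups
    se = standard_error(sigma, n / k)
    print(f"{k:>3} subgroups of {n // k:>6}: s.e. {se:.3f}, "
          f"2 s.e. threshold {2 * se:.2f}")
```

The more finely the population is sliced, the larger an effect has to be within any one slice to clear the conventional two-standard-error bar, which is why significance-driven analysis pulls toward the pooled average.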

Kaiser concludes with a positive recommendation:

As with Gladwell, I recommend reading this genre with a critical eye. Think of these books as offering fodder to exercise your critical thinking. Don’t Trust Your Gut is a light read, with some intriguing results of which I was not previously aware. I enjoyed the book, and have kept pages of notes about the materials. The above comments should give you a guide should you want to go deeper into the analytical issues.

I think there is a lot more that can be done with big data, we are just seeing the tip of the iceberg. So I agree with Seth that the potential is there. Seth is more optimistic about the current state than I am.

Imperfectly Bayesian-like intuitions reifying naive and dangerous views of human nature

Allan Cousins writes:

After reading your post entitled “People are complicated” and the subsequent discussion that ensued, I [Cousins] find it interesting that you and others didn’t relate the phenomenon to human propensity to bound probabilities into 3 buckets (0%, coin toss, 100%), and how that interacts with anchoring bias. It seems natural that if we (meaning people at large) do that across most domains that we would apply the same in our assessment of others. Since we are likely to have more experiences with certain individuals on one side of the spectrum or the other (given we tend to only see people in particular rather than varied circumstances) it’s no wonder we tend to fall into the dichotomous trap of treating people as if they are only good or bad; obviously the same applies if we don’t have personal experiences but only see / hear things from afar. Similarly, even if we come to know other circumstances that would oppose our selection (e.g. someone we’ve classified as a “bad person” performs some heroic act), we are apt to have become anchored on our previous selection (good or bad) and that reduces the reliance we might place on the additional information in our character assessment. Naturally our human tendencies lead us to “forget” about that evidence if ever called upon to make a similar assessment in the future. In a way it’s not dissimilar to why we implement reverse counting in numerical analysis. When we perform these social assessments it is as if we are always adding small numbers (additional circumstances) to large numbers (our previous determination / anchor) and the small numbers, when compared to the large number, are truncated and rounded away; of course possibly leading to the possibility that our determination to be hopelessly incorrect!
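Cousins's numerical-analysis analogy can be made concrete with ordinary double-precision floats. This is only an illustration of the arithmetic, not a model of how people update beliefs; the anchor value and the 1,000 unit-sized pieces of evidence are made-up numbers.

```python
import math

# Hypothetical numbers: one large "anchor" and many small pieces of evidence.
anchor = 1e16
evidence = [1.0] * 1000

# Large number first: near 1e16 adjacent doubles are 2.0 apart, so each
# addition of 1.0 rounds straight back to the anchor and is lost.
total_large_first = anchor
for x in evidence:
    total_large_first += x

# Small numbers first: the evidence accumulates to 1000.0 before it meets
# the anchor, so it survives the final addition.
total_small_first = sum(evidence) + anchor

print(total_large_first == anchor)               # True
print(total_small_first - anchor)                # 1000.0
print(math.fsum([anchor] + evidence) - anchor)   # 1000.0
```

Adding the small terms together before combining them with the large one (or using an exactly rounded sum such as `math.fsum`) keeps the evidence from being rounded away one piece at a time.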

This reminds me of the question that comes up from time to time, of what happens if we use rational or “Bayesian” inference without fully accounting for the biases involved in what information we see.

The simplest example is if someone rolls a die a bunch of times and tells us the results, which we use to estimate the probability that the die will come up 6. If that someone gives us a misleading stream of information (for example, telling us about all the 6’s but only a subset of the 1’s, 2’s, 3’s, 4’s, and 5’s) and we don’t know this, then we’ll be in trouble.
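Here is a minimal simulation of the die example. The reporting rule (every 6 passed along, every other roll passed along with probability 0.5), the number of rolls, and the function names are all assumptions chosen for illustration.

```python
import random

def naive_six_rate(reported):
    """Share of sixes among the rolls we were told about."""
    return sum(r == 6 for r in reported) / len(reported)

def selective_report(rolls, keep_prob, rng):
    """Hypothetical reporter: passes on every 6, other rolls with keep_prob."""
    return [r for r in rolls if r == 6 or rng.random() < keep_prob]

rng = random.Random(0)
keep_prob = 0.5
rolls = [rng.randint(1, 6) for _ in range(60_000)]    # a fair die
reported = selective_report(rolls, keep_prob, rng)

full_estimate = naive_six_rate(rolls)         # close to 1/6
biased_estimate = naive_six_rate(reported)    # close to 1 / (1 + 5 * 0.5) = 2/7

# If we knew the reporting rule, we could undo it: the naive rate is
# q = p / (p + keep_prob * (1 - p)), so p = keep_prob * q / (1 - q * (1 - keep_prob)).
q = biased_estimate
corrected = keep_prob * q / (1 - q * (1 - keep_prob))

print(f"all rolls: {full_estimate:.3f}, reported rolls: {biased_estimate:.3f}, "
      f"corrected: {corrected:.3f}")
```

The inference applied to the reported rolls is perfectly coherent given what we see; the trouble is entirely in not modeling how the rolls were selected, and the correction is only available if we know the selection rule.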

The linked discussion involves the idea that it’s easy for us to think of people as all good or all bad, and my story about a former colleague who had some clear episodes of good and clear episodes of bad is a good reminder that (a) people are complicated, and (b) we don’t always see this complication given the partial information available to us. From a Bayesian perspective, I’d say that Cousins is making the point that the partial information available to us can, if we’re not careful, be interpreted as supporting a naive bimodal view of human nature, thus leading to a vicious cycle or unfortunate feedback mechanism where we become more and more set in this erroneous model of human nature.

“Data Knitualization: An Exploration of Knitting as a Visualization Medium”

Amy Cohen points us to this fun article by Noeska Smit. Here’s the description of the above-pictured fuzzy heart model:

The last sample I [Smit] knit is a simplified 3D anatomical heart (see Figure 4) in a wool-nylon blend (Garnstudio DROPS Big Fabel), based on a free pattern by Kristin Ledgett. She has created a knitting pattern that is knit in a combination of in the round and flat knitting techniques. This allows the entire heart to be knit in one piece, with only minimal sewing where the vessels split, visible in Figure 4b. The heart is filled with soft stuffing material while it is knit.

This sample is a proof of concept for how hand knitting can be used to represent complex 3D structures. While this sample is not anatomically correct, it demonstrates how the softness and flexibility of the stuffed knit allow for complex 3D shapes to be created using only basic knitting techniques. As the vessels are not sewn down, this particular model can be ‘unknotted’ and put back together freely. This sample only requires basic knitting knowledge on how to cast on, knit, purl, increase and decrease stitches, and bind off. The combination of the soft stuffing with the fuzzy knitted material gives an almost cartoon-like impression, in stark contrast to how disembodied human hearts typically appear in the real world. In a way, this makes a medical concept where a realistic representation can elicit a strong negative response more approachable. Perhaps it is similar to how surgical images can be made more palatable by using color manipulation and stylization.

What can I say? I love this sort of thing.

P.S. More here: “‘Knitting Is Coding’ and Yarn Is Programmable in This Physics Lab”

Webinar: On using expert information in Bayesian statistics

This post is by Eric.

On Thursday, 23 June, Duco Veen will stop by to discuss his work on prior elicitation. You can register here.


Duco will discuss how expert knowledge can be captured and formulated as prior information for Bayesian analyses. This process is also called expert elicitation. He will highlight some ways to improve the quality of the expert elicitation and provide case examples of work he did with his colleagues. Additionally, Duco will discuss how the captured expert knowledge can be contrasted with evidence provided by traditional data collection methods or other prior information. This can be done, for instance, for quality control or training purposes where experts reflect on their provided information.

About the speaker

Duco Veen is an Assistant Professor at the Department of Global Health situated at the Julius Center for Health Sciences and Primary Care of the University Medical Center Utrecht. In that capacity, he is involved in the COVID-RED, AI for Health, and [email protected] projects. In addition, he is appointed as Extraordinary Professor at the Optentia Research Programme of North-West University, South Africa. Duco works on the development of ShinyStan and has been elected as a member of the Stan Governing Body.

Don’t believe the “Breaking News” hype (NYT science section version)

Gary Schwitzer explains why we should not believe the above claim, which points to this news article. Here’s Schwitzer:

What does/should breaking news mean? I suggest that it means “Drop everything….this is something you must know right now…it will rock your world” – or something similar.

But here is what the term was applied to:

– a story about one person in an experiment – a person apparently not even being followed by the researchers anymore;

– a finding reported by two researchers who – according to the revelation that begins in the story’s seventh paragraph – were found by the German Research Foundation to have committed scientific misconduct and were sanctioned by that group;

– a finding which one researcher quoted in the story said “should be taken with a massive mountain of salt”;

– a story that was also reported the same day by STAT – perhaps even earlier in the day by STAT. Since I don’t buy into the whole breaking news hype, I don’t get into judgment about who was first. Either way, it makes the whole “Breaking News” BS more evident.

– a story that STAT’s headline handled much more reasonably, stating: “With new ‘brain-reading’ research, a once-tarnished scientist seeks redemption.” That puts the elephant in the room actually in the room much sooner for all readers to see immediately.

– a story that the NYT writer promoted in Twitter in this way: “tadaa! My first article for @nytimes is out.”

Schwitzer concludes:

I emphasize that research looking for ways to help paralyzed people communicate is important. Some impressive, meaningful advances have already been made in this field.

Readers don’t benefit from “Breaking News” hype. Please allow the science, the data, the long-term results replicated by others to speak for itself.

I agree. The good news is that the NYT article has lots of interesting details and discussions. In many ways it’s an excellent science story. Which makes it even more frustrating that they hype it in this way and that they bury the “elephant” of the researchers’ history of scientific fraud, which is discussed in detail in the STAT article by Meghana Keshavan.

I saved the best for last

If you go to that NYT article on the web, you’ll find this at the very end:

Wow. Really going all-in on the classic pattern of finding the most trivial, unimportant errors to correct.

Propagation of responsibility

When studying statistical workflow, or just doing applied statistics, we think a lot about propagation of uncertainty. Today’s post is about something different: it’s about propagation of responsibility in a decision setting with many participants. I’ll briefly return to workflow at the end of this post.

The topic of propagation of responsibility came up in our discussion the other day of fake drug studies. The background was a news article by Johanna Ryan, reporting on medical research fraud:

In some cases the subjects never had the disease being studied or took the new drug to treat it. In others, those subjects didn’t exist at all.

I [Ryan] found out about this crime wave, not from the daily news, but from the law firm of King & Spalding – attorneys for GlaxoSmithKline (GSK) and other major drug companies. K&S focused not so much on stopping the crime wave, as on advising its clients how to “position their companies as favorably as possible to prevent enforcement actions if the government comes knocking.” In other words, to make sure someone else, not GSK, takes the blame. . . .

So how do multi-national companies like GSK find local clinics like Zain Medical Center or Healing Touch C&C to do clinical trials? Most don’t do so directly. Instead, they rely on Contract Research Organizations (CROs): large commercial brokers that recruit and manage the hundreds of local sites and doctors in a “gig economy” of medical research. . . .

The doctors are independent contractors in these arrangements, much like a driver who transports passengers for Uber one day and pizza for DoorDash the next. If the pizza arrives cold or the ride is downright dangerous, both Uber and the pizza parlor will tell you they’re not to blame. The driver doesn’t work for them!

Likewise, when Dr. Bencosme was arrested, the system allowed GSK to position themselves as victims not suspects. . . .

Commenter Jeremiah concurred:

I want folks to be careful in giving the Sponsor (such as GSK) any pass and putting the blame on the CRO. The Good Clinical Practices (GCP) here are pretty strong that the responsibility lies with the Sponsor to do due diligence and have appropriate processes in place (for example see ICH E6(r2) and ICH E8(r1)).

The FDA has been flagging problems with 21CFR 312.50 for years. In fy2021 they identified multiple observations that boil down to a failure to select qualified investigators. The Sponsor owns that and we should never give a pass because of the nature of contract organizations in our industry.

Agreed. Without making any comment on this particular case, which I know nothing about, I agree with your general point about responsibility going up and down the chain. Without such propagation of responsibility, there are huge and at times overwhelming incentives to cheat.

It does seem that when institutions are being set up and maintained, insufficient attention is paid to maintaining the smooth and reliable propagation of responsibility. Sometimes this is challenging, since it’s not always so easy to internalize externalities (“tragedies of the commons”) through side payments, but in a highly regulated area such as medical research, it should be possible, no?

And this does seem to have some connections to ideas such as influence analysis and model validation that arise in statistical workflow. The trail of breadcrumbs.