Ben Recht writes:
I’m a fan of physician-blogger John Mandrola. He had a nice response to your blog, using it as a jumping-off point for a short tutorial on his rather conservative approach to medical evidence assessment.
John is always even-tempered and constructive, and I thought you might enjoy this piece as an “extended blog comment.” I think he does a decent job answering the question at hand, and his approach to medical evidence appraisal is one I more or less endorse.
My post in question was called, How to digest research claims? (1) vitamin D and covid; (2) fish oil and cancer, and I concluded with this bit of helplessness: “I have no idea what to think about any of these papers. The medical literature is so huge that it often seems hopeless to interpret any single article or even subliterature. I don’t know what is currently considered the best way to summarize the state of medical knowledge on any given topic.”
In his response, “Simple Rules to Understand Medical Claims,” Mandrola offers some tips:
The most important priors when it comes to medical claims are simple: most things don’t work. Most simple answers are wrong. Humans are complex. Diseases are complex. Single causes of complex diseases like cancer should be approached with great skepticism.
One of the studies sent to Gelman was a small trial finding that Vitamin D effectively treated COVID-19. The single-center open-label study enrolled 76 patients in early 2020. Even if this were the only study available, the evidence is not strong enough to move our prior beliefs that most simple things (like a Vitamin D tablet) do not work.
The next step is a simple search—which reveals two large randomized controlled trials of Vitamin D treatment for COVID-19, one published in JAMA and the other in the BMJ. Both were null.
You can use the same strategy for evaluating the claim that fish oil supplementation leads to higher rates of prostate cancer.
Start with prior beliefs. How is it possible that one exposure increases the rate of a disease that mostly affects older men? Answer: it’s not very possible. . . .
Now consider the claims linked in Gelman’s email.
– Serum Phospholipid Fatty Acids and Prostate Cancer Risk: Results From the Prostate Cancer Prevention Trial
– Plasma Phospholipid Fatty Acids and Prostate Cancer Risk in the SELECT Trial
While both studies stemmed from randomized trials, neither was a primary analysis. These were association studies using data from the main trial, and therefore, we should be cautious in making causal claims.
Now go to Google. This reveals two large randomized controlled trials of fish oil vs placebo therapy.
– The ASCEND trial of n-3 fatty acids in 15k patients with diabetes found “no significant between-group differences in the incidence of fatal or nonfatal cancer either overall or at any particular body site.” And I would add no difference in all-cause death.
– The VITAL trial included cancer as a primary endpoint. More than 25k patients were randomized. The conclusions: “Supplementation with n−3 fatty acids did not result in a lower incidence of major cardiovascular events or cancer than placebo.”
Mandrola concludes:
I am not arguing that every claim is simple. My case is that the evaluation process is slightly less daunting than Professor Gelman seems to infer.
Of course, medical science can be complicated. Content expertise can be important. . . .
But that does not mean we should take the attitude: “I have no idea what to think about these papers.”
I offer five basic rules of thumb that help in understanding medical claims:
1. Hold pessimistic priors
2. Be super-cautious about causal inferences from nonrandom observational comparisons
3. Look for big randomized controlled trials—and focus on their primary analyses
4. Know that stuff that really works is usually obvious (antibiotics for bacterial infection; AEDs to convert VF)
5. Respect uncertainty. Stay humble about most “positive” claims.
This all makes sense, as long as we recognize that randomized controlled trials are themselves nonrandom observational comparisons: the people in the study won’t in general be representative of the population of interest, and there are also issues such as dropout, selection bias, and realism of treatments, which can be huge in medical trials. Experimentation is great; we just need to avoid the pitfalls of (a) idealizing studies that have randomization (we should avoid making the “chain is as strong as its strongest link” fallacy) and (b) disparaging observational data without assessing its quality.
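Mandrola’s first and third rules (pessimistic priors, big trials) can be read as a Bayes’ rule calculation. The sketch below is only an illustration: the prior, power, and false-positive rate are made-up numbers, not estimates from any of the trials mentioned, and the function name is hypothetical. It also ignores bias (open-label designs, selective reporting), which would pull the numbers down further.

```python
def prob_real_given_positive(prior, power, alpha):
    """P(treatment really works | trial reports a positive result), via Bayes' rule.

    prior: assumed probability that the treatment works before seeing the trial
    power: assumed probability that the trial is positive if the treatment works
    alpha: assumed probability that the trial is positive if it does not work
    """
    true_positive = prior * power
    false_positive = (1.0 - prior) * alpha
    return true_positive / (true_positive + false_positive)


if __name__ == "__main__":
    prior = 0.05  # assumption: a pessimistic prior, "most things don't work"
    small = prob_real_given_positive(prior, power=0.20, alpha=0.05)  # small, underpowered trial
    large = prob_real_given_positive(prior, power=0.90, alpha=0.05)  # large, well-powered trial
    print(f"after a positive small trial: {small:.2f}")
    print(f"after a positive large trial: {large:.2f}")
```

With these assumed inputs, a positive result from the small trial leaves the probability under 20%, while a positive result from the large trial brings it close to even odds. A null result from a large trial, as in the vitamin D examples, would move it the other way.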
For our discussion here, the most relevant bit of Mandrola’s advice was this from the comment thread:
Why are people going to a Political Scientist for medical advice? That is odd.
I hope Prof Gelman’s answer was based on a recognition that he doesn’t have the context and/or the historical background to properly interpret the studies.
The answer is: Yes, I do recognize my ignorance! Here’s what I wrote in the above-linked post:
I’m not saying that the answers to these medical questions are unknowable, or even that nobody knows the answers. I can well believe there are some people who have a clear sense of what’s going on here. I’m just saying that I have no idea what to think about these papers.
Mandrola’s advice given above seems reasonable to me. But it can be hard for me to apply in that he’s assuming a background medical knowledge that I don’t have. On the other hand, when it comes to social science, I know a lot. For example, when I saw that claim that women during a certain time of the month were 20 percentage points more likely to vote for Barack Obama, it was immediately clear this was ridiculous, because public opinion just doesn’t change that much. This had nothing to do with randomized trials or observational comparisons or anything like that; it was just too noisy of a study to learn anything.
“Of course it’s simple, just use all that knowledge you learned in medical school!”
From the replication projects your prior on any given *experimental* claim should be 10-20% that it would replicate.
For observational studies a rule of thumb is they are an order of magnitude worse. So maybe 1% chance of replication.
So that is our baseline. You should be able to find out if a successful direct replication was done with some basic internet skills, since that is a big deal worth mentioning.
Then comes the different, and harder part of science: explaining why we observed a given phenomenon. This part requires more background knowledge and cleverness to do right. But you can use the rule of thumb that another 90-99% of the explanations/theories (for the subset that does replicate) are way off-base. The (repeatable) result is due to some experimental artifact, etc. Here is where conceptual replications could help get a better idea.
So for an experimental study 1-2% of claims are correct, for observational it’ll be more like 0.1-0.01%.
Essentially you can’t trust anything you read, and it is a crisis. A totally unnecessary crisis.
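The arithmetic in this comment is a chain of two probabilities: the chance that a result replicates, times the chance that the offered explanation is right given that it replicates. A small sketch that reproduces it, using only the ranges quoted in the comment (the function and variable names are hypothetical):

```python
def correct_claim_range(replication, explanation_correct):
    """Range for P(result replicates AND its explanation is right).

    Each argument is a (low, high) pair of probabilities; the second is
    conditional on the result having replicated.
    """
    low = replication[0] * explanation_correct[0]
    high = replication[1] * explanation_correct[1]
    return low, high


# "another 90-99% of the explanations/theories ... are way off-base"
EXPLANATION_CORRECT = (0.01, 0.10)

experimental = correct_claim_range((0.10, 0.20), EXPLANATION_CORRECT)   # 10-20% replicate
observational = correct_claim_range((0.01, 0.01), EXPLANATION_CORRECT)  # "maybe 1%" replicate

if __name__ == "__main__":
    print("experimental:  {:.2%} to {:.2%}".format(*experimental))
    print("observational: {:.2%} to {:.2%}".format(*observational))
```

The observational range matches the 0.01-0.1% in the comment. The 1-2% quoted for experimental studies corresponds to the optimistic end of the explanation range (10% right); using the full 1-10% range gives 0.1-2%.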
I’m not sure I agree with the exact numbers, but I agree with this sentiment. The areas I have the most familiarity with are civil engineering and biology. My dissertation demolished the major theory of how soil liquefaction works that was in all the textbooks. Literally thousands of people published thousands of articles describing how they’d done soil sample tests in tabletop “triaxial” testing machines to try to move the ball forward on understanding how soil microstructure affected the risk of liquefaction. My paper proved that the widely verified, textbook-taught Darcy equation for fluid flow implied that the entirety of that research program was studying exclusively the properties of the rubber membrane surrounding the sample.
In biology, I’ve helped my wife and a few friends on multiple papers. For one example, I did modeling of cancer survival duration. We could NOT figure out why my results were dramatically different from the Kaplan-Meier curves everyone else used in hundreds of papers. After working hard to find bugs in my code, I realized that maybe the K-M curves were just wrong! Sure enough, in an afternoon I modeled a generating process which showed that the kind of censoring you’d expect from cancer patients (basically that at some point they get too sick to show up to office visits and then they die shortly after) leads to K-M curves just like what you’d see in the literature, DRAMATICALLY overestimating survival. When this known bias exists, the Bayesian method I was using gets the right answer, while the thousands of people doing K-M curves would get very wrong answers.
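A minimal sketch of this kind of generating process, to show the direction of the bias. It is not the commenter’s model: the exponential survival times, the 50% dropout share, and the gap of up to 3 months between last visit and death are all assumptions chosen for illustration, and the function names are hypothetical. The Bayesian alternative mentioned in the comment is not shown.

```python
import random


def kaplan_meier(times, events, t_eval):
    """Kaplan-Meier survival estimate at time t_eval.

    times:  follow-up time for each patient
    events: True if death was observed at that time, False if censored
    """
    # Sort by time; at tied times, deaths are processed before censorings.
    ordered = sorted(zip(times, events), key=lambda te: (te[0], not te[1]))
    at_risk = len(ordered)
    survival = 1.0
    for t, died in ordered:
        if t > t_eval:
            break
        if died:
            survival *= 1.0 - 1.0 / at_risk
        at_risk -= 1  # the patient leaves the risk set either way
    return survival


def simulate_informative_censoring(n=20000, mean_survival=24.0,
                                   dropout_prob=0.5, max_gap=3.0, seed=0):
    """Simulate patients, some of whom stop attending visits shortly before death."""
    rng = random.Random(seed)
    times, events, true_deaths = [], [], []
    for _ in range(n):
        death = rng.expovariate(1.0 / mean_survival)
        true_deaths.append(death)
        if rng.random() < dropout_prob:
            # Informative censoring: last recorded visit is shortly before death.
            times.append(max(0.0, death - rng.uniform(0.0, max_gap)))
            events.append(False)
        else:
            times.append(death)
            events.append(True)
    return times, events, true_deaths


if __name__ == "__main__":
    times, events, deaths = simulate_informative_censoring()
    t = 24.0
    truth = sum(d > t for d in deaths) / len(deaths)
    print(f"true fraction alive at {t:.0f} months: {truth:.3f}")
    print(f"Kaplan-Meier estimate:               {kaplan_meier(times, events, t):.3f}")
```

Kaplan-Meier treats a censored patient as having the same future risk as everyone still under observation. Here the censored patients are exactly the ones about to die, so their deaths never enter the estimate and the curve sits well above the true survival fraction.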
The point of these two anecdotes is that we apparently don’t teach even physical or biological sciences people enough basic science ideas to get the right answers in fairly simple situations where the mechanisms are clear and widely known. It’s a crisis!
Another failure mode I’ve noticed is that so-called “side effects” get incorporated into the primary outcome.
E.g., use symptoms of insomnia as part of your depression questionnaire, then give a drug with the “side effect” of somnolence.
Or my favorite example of nausea/vomiting/loss-of-appetite after cancer treatments but no accounting for the role of caloric restriction in slowing tumor growth. I have *never* seen a cancer RCT address this issue.
I.e., the drug can “work”, but in a totally different way than assumed. The correct comparison is improving sleep hygiene and restricting your diet vs expensive and dangerous pills. At the very least, accounting for these factors should allow lower dosages (thus safer and cheaper pills).
“So for an experimental study 1-2% of claims are correct, for observational it’ll be more like 0.1-0.01%.”
That’s quite the superpower you have there, to be able to assess the validity of medical papers with 98% accuracy. Without even reading them.
You are posting to a stats blog… the “superpower” is generalizing from sample to population.
In general people are able to accomplish many seemingly super/impossible things just by using thorough scholarship and actual science. That is in contrast to the bizarro NHST-based version of it widely in practice today.
It is similar to the “sufficiently advanced technology is indistinguishable from magic” idea.
My standards:
A finding of an association should always be considered as a first step; it may or may not show something.
Of course, it helps if there are many supporting findings, particularly if they are done with large samples, where potential confounding variables and possible mediators, moderators, and interaction effects are extensively evaluated, where the limitations of the findings are comprehensively discussed.
But most importantly, is there longitudinal evidence not just cross-sectional, and are at least some of the Bradford Hill criteria met?
I don’t see how applying some kind of generic rule of thumb, even on a statistical blog (or actually, maybe particularly on a statistical blog) is of anything more than quite limited utility.
Oh, and I forgot.
The second step is to theorize about mechanistic explanations about the association found, and then interrogate those explanations.
Seems to me that only after multiple investigations of the theorized explanations can you really begin to add some weight to attaching causality to associations, and likewise consider generic statistical probabilities to be of much use.
People seem very impatient, and cognitive science explains why. We are pattern-finding machines. It’s how we make sense of the world. Depending on the context, pattern finding can have a relatively high hit rate. Medical research, I don’t think, is one of those contexts. That’s ok. Even low frequency hit rates are meaningful when it comes to extending life and reducing suffering. Of course, sometimes reducing life and increasing suffering result from the low hit rates.
Anyway, I wonder if most of this will be moot in the not too distant future. AI might dramatically shift the entire landscape. We may be essentially worrying about the best way to evaluate typewriters in the early 2000’s.
I have written this before, but I think any study about vitamin C, vitamin D, zinc and some herbs should always be treated with a fair degree of skepticism (and I know certain people on here will likely flame me). If you look back over the last 50-60 years the claim has been made that they prevent or cure the disease du jour, and then it never pans out. Ask anyone in a community affected by AIDS whether they would trade their PReP (did I spell that right?) regimen for vitamins C, D, and zinc.
I also think analyzing the efficacy of cancer drugs can be very tricky. I am only able to write this because of a drug that only works 20% of the time, but when it does work the results can be very dramatic. These days they are better at identifying when it will work based on the genetics of the cancer, which is also why some drugs now are approved for a variety of cancers, because it is the genetics of the cancer (certain receptors) that matter. But not all of this is known in the initial trials, and at least for my treatment, if you read the initial trial results they are much more disappointing because you generally are looking at the “average effect” and the initial trials were based on the “type” of cancer, not the genetics of the cancer.
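The “average effect” point is simple arithmetic: if only a fraction of patients respond, the trial-wide average is the responders’ benefit scaled down by that fraction. A sketch with made-up numbers (the 30-month benefit and the 80% responder share in a marker-selected trial are hypothetical, as is the function name):

```python
def trial_average_effect(responder_fraction, benefit_in_responders,
                         benefit_in_others=0.0):
    """Average benefit across a trial population with two kinds of patients."""
    return (responder_fraction * benefit_in_responders
            + (1.0 - responder_fraction) * benefit_in_others)


if __name__ == "__main__":
    benefit = 30.0  # assumption: months of added survival when the drug works
    by_cancer_type = trial_average_effect(0.20, benefit)    # enrolled by cancer "type"
    by_genetics = trial_average_effect(0.80, benefit)       # enrolled by tumor genetics
    print(f"average benefit, unselected trial:      {by_cancer_type:.1f} months")
    print(f"average benefit, marker-selected trial: {by_genetics:.1f} months")
```

The drug is the same in both lines; only the mix of patients enrolled changes, which is why early trials defined by cancer type can look much weaker than later, genetically targeted ones.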
Please treat all studies with the same amount of skepticism.
The way these studies are designed we would never figure out to put water on a housefire or that shipping tools, wood, nails, etc into a town hit by a tornado can help them rebuild. I am not exaggerating, that is how ridiculous they are.
The premise is that your body needs to get a certain amount of various molecules (vitamins/minerals), into certain tissues, to function correctly. Just like too much or too little water isn’t going to save the house on fire, nor are too few or unnecessary supplies going to help out the tornado-devastated town.
Further, imagine sending the town hammers, but no nails. The excess of hammers would seem to do nothing.
If you accept this premise, it is obvious that the amount of beneficial vitamin must be related to its rate of metabolism (which is a function of health, genetics, etc). You will benefit if deficient, and possibly be harmed by too much. Just like water on the fire, and supplies to the town.
A generation was wasted on NHST junk studies; meanwhile, we still don’t have good, let alone convenient, ways to assess deficiency until it is so severe your body starts falling apart and the amount in your blood becomes near undetectable.
At my advanced age, I am skeptical of almost every medical evaluation; see the recent neuroscience blotting scandals at Johns Hopkins, Stanford, USC and Ohio State. But this sentence caused me to pause:
—————————————–
“Mandrola concludes:
I am not arguing that every claim is simple. My case is that the evaluation process is slightly less daunting than Professor Gelman seems to infer.”
————————————————————————————-
Should “infer” be replaced by “imply” or augmented by “imply”? From
https://dictionary.cambridge.org/grammar/british-grammar/imply-or-infer
“We imply something by what we say [or write]. We infer something from what somebody else says [or writes]. The main difference between these two words is that a speaker [or writer] can imply, but a listener [or reader] can only infer.
When someone implies something, they put the suggestion into the message.”
—————————————————————————————
The English language is changing rapidly to the point that “fulsome” will soon mean “full” and “fortuitously” will soon be synonymous with “fortunately.” And, “penultimate” will soon take on the meaning of beyond ultimate. “For you and I” is already the preferred grammatical standard on NPR.
> Should “infer” be replaced by “imply” or augmented by “imply”?
Why should “imply” have anything to do with that sentence?
Would you have asked the same question had he written “less daunting than Professor Gelman seems to conclude” or “less daunting than Professor Gelman seems to deduce”?
paul alper:
You are, of course, correct that “imply” and “infer” mean different things. It’s conceivable that Mandrola actually meant that Professor Gelman was inferring something (from some argument Andrew had read or heard), but given what is in the OP here it is much more natural to think that me meant that Professor Gelman is implying something.
Having been raised by an English teacher, I agree strongly with Dr. Alper.
It has been pointed out previously at this site that in most cases of grammatical failure, the message is still clear, so it can be argued that they are a simplification of language. However, the record of this sub-thread shows that we do not know whether the use of “infer” was a mistake (which it often has been) or a synonym of “deduce”. So for clarity, either “imply” or “deduce” should have been used. (My money is on “imply”.)
I acknowledge that Dr. Mandrola nevertheless wrote a good article and would have received less criticism from my mother than I probably would have in an article of that length.
“… natural to think that *he* meant …”
“…the people in the study won’t in general be representative of the population of interest…”
As noted many times by Stephen Senn, “representativeness” isn’t the purpose of clinical trials in medicine. We’re not trying to identify *patients* who will be *”responders,”* but rather *treatments* that have *intrinsic efficacy.*
Since patient “responsiveness” to a treatment doesn’t usually hinge on some occult/”inborn” feature of the patient that will be consistent from one exposure to the next (except in a very small proportion of clinical situations e.g., treatments whose responses are strongly genetically-determined), ensuring patient “representativeness” in clinical trials really isn’t the goal.
This notion that treatment “responsiveness” (or lack thereof) is a “latent” feature of each patient is incorrect in the vast majority of clinical scenarios, yet continues to present a really pernicious barrier to understanding for those outside the biologic sciences. After reading about the econometrics concept of “4 latent categories of people- e.g., always-benefiter, never-benefiter,…” I suspect that this notion, improperly extrapolated to the medical context, lies at the root of the push toward “personalized medicine” that many physicians consider to be completely unrealistic.
For physicians, seeing that a treatment *can* work in a setting with maximally-minimized bias is usually reasonable grounds to bet that it might work for *other* patients with the same disease. We often can’t do better than this when treating patients.
“…we should avoid making the “chain is as strong as its strongest link” fallacy…”
But this is *exactly* the approach we take when assessing therapeutic efficacy, and for good reason (!) Randomized controlled trials demonstrating treatment efficacy are required for approval of new drugs. This requirement was born following multiple high-profile tragedies that litter the history of medical therapeutics.
For non-pharmacologic interventions, physicians and practice guideline writers don’t usually consider observational evidence to be sufficiently reliable to recommend widescale implementation. Rather, the main niches for observational evidence in medicine are 1) to provide descriptive evidence (e.g., disease incidence/prevalence; patient demographics), and 2) to identify safety signals after a treatment has been approved based on randomized evidence of efficacy.
“…and (b) disparaging observational data without assessing its quality.”
A very high proportion (but not all) of observational research in medicine is of poor or very poor quality. “Gold standard” methods (using DAGs in the design-phase etc…) are used in only a small fraction of published studies. They are extremely labour-intensive and journals will publish without them, so where’s the incentive to change practice?… And where they *are* used, their quality and application are often questionable.
The problem with much observational “evidence” in medicine is that design choices are so flexible that any motivated *researcher* can generate a study result that will support his personal biases and any *reader* can, in turn, find reasons to criticize and discount any study that doesn’t conform to *his* biases.
Viewing medical research in the above light, it really does become pretty easy to sort the wheat from the chaff. Every primary care physician is constantly bombarded with questions from patients on dubious treatments they have read about online. Here’s my approach: 1) patient tells me a claim he read about; 2) I rapidly consider how a study would need to be designed in order for me to be convinced of the claim; 3) I quickly refer to a well-regarded physicians’ reference source (e.g., “Uptodate”) and compare the *actual* “evidence” to the evidence that I would need to see in order to feel confident prescribing the treatment. If Uptodate tells me that the “evidence” for a practice consists of an RCT with n=8 or is only observational (rather than supported by robust RCT(s)), I am unlikely to recommend that the patient spend his/her time and money on it.
Es:
Thanks for the comments. To clarify, let me replace “the people in the study won’t in general be representative of the population of interest” with “the people and scenarios in the study won’t in general be representative of the population of interest.” Treatment effects vary by situation as well as person, especially when considering endpoints such as recovery or survival. We discuss this general issue in our paper on causal quartets.
I agree that studies vary widely in quality. I guess that some of my concern regarding the “wheat from chaff” thing is that I’ve seen too many papers in social science and public health that had some aspect of randomized assignment, random sampling, or causal identification that led the producers and consumers of this research to take it way too seriously. From the other direction, a blanket dismissal of observational data is close to meaningless given the observational aspects that arise even in apparently clean studies.
Overall I am sympathetic to guidelines to help people evaluate the quality of studies, and, as I wrote in my above post, I don’t know a lot about medical trials and I completely accept Mandrola’s point that people with subject-matter expertise can often have a good sense of the quality of published papers in the area, in the same way that I can often take a quick look at a social science paper and quickly and reasonably characterize it as reasonable, ridiculous, or something in between.
ES: “Since patient “responsiveness” to a treatment doesn’t usually hinge on some occult/”inborn” feature of the patient that will be consistent from one exposure to the next (except in a very small proportion of clinical situations e.g., treatments whose responses are strongly genetically-determined), ensuring patient “representativeness” in clinical trials really isn’t the goal. ”
That seems like a good rule of thumb for infections, poisoning, and cancer (although sex is a very important variable, and there is a lot of research that the numerous medical trials of only males lead to treatments which don’t work as well on females) but there is also psychiatry. Psychiatric medicine has a lot of illnesses where you try this and it works for some people, then if it does not work you try that. When we don’t know the underlying cause of symptoms, medicine has a lot of trial and error.
Sean:
Let me also add that even in straight-up take-a-pill examples, a key factor affecting efficacy will be the stage of the disease. The stage of a disease is not an “occult/’inborn’ feature of the patient,” but it is something that will need to be considered when generalizing from an experiment to general practice.
I think that is a very good example. I think the original one-patient trial of penicillin failed because the infection was too advanced for the amount of drug they had manufactured.
“Stage of disease” is definitely important, but its impact is usually captured by considering “risk magnification” rather than by including healthier patients in clinical trials. See the link below:
https://www.fharrell.com/post/hteview/
This is the approach we take when estimating the benefit that a patient might see from taking a statin in a primary prevention context (i.e., *before* he/she suffers a cardiac event).
Interestingly, treatments whose efficacy was demonstrated in trials done several decades ago among patients whose prognoses were much worse (because “disease-modifying” treatments didn’t exist), might no longer have meaningful efficacy if tested in “modern” patients (whose prognoses are better due to treatment with cocktails of superior, disease-modifying drugs).
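A sketch of the “risk magnification” idea, assuming the treatment’s relative effect is a constant odds ratio across patients. The odds ratio and the baseline risks below are hypothetical numbers, not values from any statin trial, and the function names are made up for illustration.

```python
def treated_risk(baseline_risk, odds_ratio):
    """Risk under treatment, given untreated risk and a constant odds ratio."""
    odds = baseline_risk / (1.0 - baseline_risk)
    treated_odds = odds * odds_ratio
    return treated_odds / (1.0 + treated_odds)


def absolute_risk_reduction(baseline_risk, odds_ratio):
    """How much the treatment lowers a patient's absolute risk."""
    return baseline_risk - treated_risk(baseline_risk, odds_ratio)


if __name__ == "__main__":
    odds_ratio = 0.7  # assumption: same relative effect for every patient
    for baseline in (0.02, 0.10, 0.30):  # assumption: low-, medium-, high-risk patients
        arr = absolute_risk_reduction(baseline, odds_ratio)
        print(f"baseline risk {baseline:.0%}: absolute risk reduction {arr:.1%}")
```

For risks in the low-to-moderate range, the same relative effect yields a larger absolute benefit for higher-risk patients. The same arithmetic covers the last point in the comment: if modern background treatment lowers baseline risk, the absolute benefit of an older drug shrinks even when its relative effect is unchanged.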
The defensive reaction by experts in one field aimed at critics coming from another discipline (most often econ, stats, and pol sci) is one that occurs often at university-level meetings on policy where data and data analysis are at the center of the decision-making process. Pointing out issues in research design and associated inference problems often precipitates reactions like: “What? You’re going to go all science on us now?” (a common reply by people in the humanities), or “Sorry, biostatistics are quite different than regular statistics. Stick to your knitting,” or “We can add another student without increasing our costs, so the marginal cost of instruction is zero which means it’s impossible to calculate the cost of instruction.” The last statement was made by someone who was at the time a top 25 university provost, soon to become an Ivy League president. It can all get so very discouraging.