Should every correlation be published during the COVID-19 pandemic?

Nans Florens, Gideon Meyerowitz-Katz, Jérôme Barriere, Eric Billy, Fabrice Frank, Véronique Saada, Alexander Samuel, Barbara Seitz-Polski, Kyle Sheldrick, and Lonni Besançon write an article with the above title that begins as follows:

There is a strong correlation between deaths by swimming pool drowning in the USA and Nicholas Cage’s apparition in movies per year. While not establishing causal relationships, this finding raises concerns regarding Nicholas Cage-induced drownings and should call for a halt to the actor’s film career. If you think this paragraph is goofy, then you should be concerned by your interpretation of the too-quickly published correlations during the COVID-19 pandemic. Recently, Sun et al. published an article in Scientific Reports entitled: “Increased emergency cardiovascular events among under-40 population in Israel during vaccine rollout and third COVID-19 wave.” In this paper, the authors highlighted correlations between the Israeli vaccine campaign and an increased number of severe cardiovascular event calls to Emergency Management Services in the under-40 population. Even if the correlation seemed statistically significant, it is, to our opinion, clinically irrelevant.

During the COVID-19 pandemic, rapid publication efforts were made by authors and publishers to improve our knowledge and care toward greater safety and efficiency of the management of this disease. Simultaneously, many publications have been rushed potentially without rigorous peer-review and there has been a non-negligible increase in the duplication and waste of scientific efforts. In 2012, Frank Messerli highlighted in the New England Journal of Medicine, a strong correlation between chocolate consumption by country and the number of Nobel laureates. This study was published to warn scientists about over-interpreting their correlation data. Despite this notorious demonstration of what may be a spurious correlation, many scientists and editors continue to publish articles with correlations that have little clinical relevance. Sun et al.’s article, in our opinion, is the perfect demonstration of this phenomenon.

Some of the conclusions of this article are questionable from a clinical perspective. . . . First, the absolute number of calls for cardiac arrests is low in the under-40 population, peaking at approximately 10 during the vaccine rollout. . . . Second, regarding acute coronary syndrome (ACS), despite a statistically significant correlation, this association suffers from interpretation biases that are too large to be properly supported. . . . Third, the first signals regarding the occurrence of cardiac events in young patients vaccinated with mRNA-containing vaccines date from April 2021 and quickly led to the adaptation of vaccine recommendations in many countries . . . Furthermore, the authors did not consider seasonal or yearly trends in their analysis. . . . Finally, the main criticism of the dataset presented in the manuscript is the absence of confirmation of either the vaccination status, COVID-19 status, or underlying comorbidities among included patients. . . .

There are several additional issues with the statistical analysis as presented in the document which substantially undermine its conclusions and indicate a lack of rigor in the manuscript. . . . While the authors have described their analysis as “not establishing causal relationships,” they have in fact not even established useful correlations.

This last point is an example of my dictum that Correlation does not even imply correlation.

Regarding the question in the title of this post, yes, let’s publish everything. It’s gonna happen in any case. To put it another way, the problem of publishing the vaccine-and-cardiac-arrest correlation isn’t that they published the correlation, it’s that they only published that one correlation and then they overinterpreted it. Publish everything.

To put it another way, when I say “publish,” I mean, “make public.” I don’t mean, “give the seal of approval to” or “claim that something has scientific or policy relevance.” In that sense, the problem is coming not from the publication (in the sense that I use the term) but in the attitude that, because something is published, we should believe it.

Also, I’m saying we should publish every correlation, not that we should publish every foolish claim.

110 thoughts on “Should every correlation be published during the COVID-19 pandemic?”

  1. “There is a strong correlation between deaths by swimming pool drowning in the USA and Nicholas Cage’s apparition in movies per year. ”

    Interesting!

    Presumably this is the relevant plot and data?

    The title says “strong correlation” but when I estimate the values from the chart, plot drownings (y) as a function of Cage flicks (x), and generate R^2, I get:

    R² = 0.41

    Strong correlation?

    Maybe one way to tell bogus correlations from real correlation is to use the measure that was designed to characterize the strength of the correlation?

    Cage Flicks: 2, 2, 2, 3, 1, 1, 2, 3, 4, 1, 4
    Drownings: 110, 102, 102, 98, 84, 96, 96, 98, 122, 95, 102
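
For anyone who wants to reproduce the R² figure from the two rows above, here is a minimal sketch using only the Python standard library. The numbers are the commenter's chart estimates, not official counts, and the function name `r_squared` is just an illustrative choice.

```python
# Values as estimated from the chart by the commenter above.
cage_flicks = [2, 2, 2, 3, 1, 1, 2, 3, 4, 1, 4]
drownings = [110, 102, 102, 98, 84, 96, 96, 98, 122, 95, 102]


def r_squared(x, y):
    """Squared Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)


r2 = r_squared(cage_flicks, drownings)
print(f"R^2 = {r2:.2f}")  # prints "R^2 = 0.41"
```

With only 11 yearly points, a single year (the one with 4 films and 122 drownings) contributes a large share of that value, so the figure is best read as a rough description rather than a stable estimate.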

    • 40% of the variance in annual drownings is explained by the number of Cage flicks and you’re saying that is -not- a large correlation? Jeez. I disagree.

      But I think the more relevant question in this context is whether the error bounds on the slope exclude zero with high confidence.
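
The slope question can be sketched with ordinary least squares on the same estimated values. This is only an illustration under textbook assumptions (independent, roughly normal errors), which yearly counts may well violate. The critical value 2.262 is the tabulated two-sided 95% t quantile for 9 degrees of freedom, hard-coded here to avoid a dependency; `slope_with_ci` is a made-up helper name.

```python
import math

# Chart estimates quoted earlier in this thread (not official counts).
cage_flicks = [2, 2, 2, 3, 1, 1, 2, 3, 4, 1, 4]
drownings = [110, 102, 102, 98, 84, 96, 96, 98, 122, 95, 102]

# Tabulated two-sided 95% t critical value for n - 2 = 9 degrees of freedom.
T_CRIT_95_DF9 = 2.262


def slope_with_ci(x, y, t_crit):
    """Least-squares slope of y on x with a t-based confidence interval."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    return slope, slope - t_crit * std_err, slope + t_crit * std_err


slope, lower, upper = slope_with_ci(cage_flicks, drownings, T_CRIT_95_DF9)
print(f"slope = {slope:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Under those assumptions the slope comes out at roughly 5.6 extra drownings per film, with an interval of about 0.5 to 10.6: it excludes zero, but not by much, and the independence assumption is doing a lot of work.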

      • Just to get it out there, I gotta believe that summer weather = cage flicks and drownings. So it’s not even that surprising at 40%. This is likely a causal relationship.

        • That’s a good point, but it’s still plausible right? I mean, if the weather is warmer, there are more people in pools in that time of year. If the weather is cooler there are fewer people in pools. A Cage flick being held for the end of summer could be released the following year if the summer weather peters out… it’s not entirely unreasonable. We’re talking about either 1, 2, or 3 cage flicks. So instead of releasing one in August they release it next June.

        • But not the sort of causal relationship implied, i.e. rather than cage flicks causing drownings, it would be something else causing both.

  2. I’m actually in agreement with you on the publishing everything (yay e-Life).

    Short update on this matter:

    400+ days ago, we submitted the rebuttal discussed above to Sci. Rep. We argue in the rebuttal that the original paper is flawed beyond repair in ALL of its statistical analyses. Reviewers who read both our rebuttal + authors’ response agreed that the paper could not be saved. Yet, a week ago, the EiC gave the authors of the original paper the possibility to change their paper (although the critical flaws we point to won’t be addressed) and asked us to wait, read the updated version and submit our concerns again. Interestingly enough, apparently the authors were originally invited to submit their paper to Sci. Rep.

  3. With those arguments back in 2020 we wouldn’t have had the covid19 story nor anything related to it. Since the fake-vaxes rollout MSM have been publishing hundreds of ‘correlations’ to cardiac issues (pizza, sex, cold, warm, high bills, eating fast, sleeping too much/little, hot tea, … ) to cover up the actual cause.
    And that’s not just cardiac issues. Check @EthicalSkeptic on tw – here’s caption from latest pinned tweet:

    “Cancer Mortality Update Week 22 2023
    (1 wk remaining in spring lull)

    ➼ Cancer UCoD = 2.8% Excess
    ➼ Cancer MCoD = 6.9% Excess
    ➼ Ages 0-54 Cancers = 20.5% Excess
    ➼ PPI Cancer Treatment = 11.3% Excess

    NOT due to Covid/Long Covid/deferred screenings”

      • There is a small but acknowledged risk associated with the mRNA covid vaccines which could be described as cardiovascular – see e.g. https://www.gov.uk/government/publications/myocarditis-and-pericarditis-after-covid-19-vaccination/myocarditis-and-pericarditis-after-covid-19-vaccination-guidance-for-healthcare-professionals

        In this light, the explanations in the original article about why we should ignore correlations could appear condescending and disingenuous; I am not surprised that they have created some backlash and distrust.

        Another unfortunate cause of distrust is the apparent official lack of interest in excess deaths since covid, and the apparent lack of any investigation using detailed case histories and other medical information available to the state but not to other interested parties. One reason why broad correlations are so popular is that more useful finer grained information is not available.

        • A.G. –

          The idea that communications about the vaxes, as flawed as they’ve been, have “created” backlash on any significant scale stands entirely unproven.

          Yet we see such confident assertions all the time.

          Vaxes have become an identity signal along the lines of many other vectors of cultural cognition – like climate change. We have seen for pretty much decades now, people saying that their views on climate change are based on “backlash” to how climate scientists have communicated about the science of climate change. But the evidence for that direction of causality is non-existent.

          Instead what we see is that opinions on climate change have mostly been made in ignorance of what scientists actually say, and that we can pretty much uniformly divide people’s views on climate change by assessing their ideological orientation.

          There are some exceptions, of course, but they’re a small bit of noise amidst the overall signal.

          > I am not surprised that they have created some backlash and distrust.

          You’re not surprised, I’d say, because what you’re seeing is what you want to see.

        • Daniel –

          > David, they’re substantially higher than the vax, possibly like 10-100 x higher.

          In all fairness, that comparison could be a bit misleading as the risks aren’t mutually exclusive and to some degree are additive. I’d say the more salient point is that the risks from the vax are quite rare (varying by age and sex and other factors) and for the most part (although not always) not severe.

          https://twitter.com/GidMK/status/1665601436930183169?s=19

        • So ~1 per 10k, wasn’t that always the rate being claimed? What they should check is whether there is a correlation with aspiration (checking whether the needle accidentally hit a blood vessel).

          And the comparison is vaccine + covid vs covid, there is no meaningful effect of the vaccine on getting infected (there was a large one on getting *tested* for a time but now that appears to have reversed).

          And there is still no explanation for why there was no apparent vaccine benefit on excess or all cause mortality:

          United States reported 3,353,800 deaths, for the 52 weeks of year 2020 (all years of age). Expected deaths were 2,910,693. That is an increase of +443,107 deaths (+15.2%).

          United States reported 3,457,530 deaths for the 52 weeks of year 2021 (all years of age). Expected deaths were 2,937,434. That is an increase of +520,096 deaths (+17.7%).

          United States reported 3,275,829 deaths for the 52 weeks of year 2022 (all years of age). Expected deaths were 2,979,305. That is an increase of +296,524 deaths (+10.0%).

          Year to date, United States reported 1,324,940 deaths for the 22 weeks of year 2023 (all years of age). Expected deaths were 1,328,149. That is an increase of -3,209 deaths (-0.2%).

          https://www.usmortality.com/deaths/excess-yearly-cumulative

          Then in the RCTs there were 37 total mrna vaccine deaths vs 33 placebo (out of ~10k subjects per group). And 6 deaths in placebo vs zero in vaccine were attributed to covid. If we accept that, then it was 37 vs 27 non-covid deaths.

          If the vaccine is so protective against severe covid, why is it so hard to find a benefit in overall mortality? There must be some side effect just as bad, or worse, than covid. I’d guess there’s millions of people out there with bizarre symptoms due to PEG allergies. If not, we need some explanation for why you can inject so many people with PEG + adjuvant and not trigger immunity vs PEG.

        • Anoneuoid –

          > If the vaccine is so protective against severe covid, why is it so hard to find a benefit in overall mortality?

          Maybe not.

          If you dig into this Twitter feed you will find links to (from what I can judge) a good analysis with relatively good data, which seems pretty conclusive to me. It even has the advantage of considering a person vaxxed on the day they got vaxxed (as opposed to two weeks later, which is an issue that some vaccine skeptics argue, without any actual evidence, misses a significant amount of vaccine harm). You will also find in her links a related sensitivity analysis that shows no impact.

          https://twitter.com/sarahcaul_ons?lang=en


      • ‘vaccine’ definition was changed in 2020 to include this stuff in that class, such that they could take the fast lane for approval and skip most checks for non-vax drugs, and pretend to apply the much biased AER scheme for vaccines.
        By that new definition a wasp/bee sting, a snake bite are ‘vaccine’ too.

        “conspiracy crap”?
        well, see what this stuff really is and why whatever injuries reported are “clinically irrelevant” and “legally irrelevant” anyway: https://bailiwicknews.substack.com/p/public-health-emergencies-are-camouflaged

        A non-goofy version of the example in the paper would be “There is a strong correlation between deaths by swimming pool drowning in the USA and Nicholas Cage’s apparition in movies per year. People were mandated to watch such movies in full. Some 80% yielded out of coercion, blackmail, threats. While not establishing causal relationships, this finding raises concerns regarding such Nicholas Cage-forced-watching-induced drownings and should call for a halt to such mandates and let the actor’s film career depend on people’s free choice to see them.”

        And, as others pointed out, there’s still that excess deaths in US, EU and elsewhere in ’21, ’22 and ’23 to explain.

        • @joshua jun26 (don’t see the ‘reply’ on that comment)

          Thanks but I fail to read any explanation.
          “Non-zero VE against non-COVID mortality” is that a rewording of ‘adverse effects’? Anyway seems she’s educating on how to interpret ONS data and get the right answers. But others question those methods, e.g. https://wherearethenumbers.substack.com/p/the-illusion-of-vaccine-efficacy and ‘long-term’ VE for this stuff looks very ‘problematic’ – drops to 0 in a few months then dives deep into negative – see eg https://cmsindipendente.it/sites/default/files/2023-06/CMS%20-%2020230604%20-%20Letter%20to%20the%20WHO%20Director-General%20%28inviata%29.pdf

          Here’s the all-cause Average Death Rates wrt 2015-2019 for a few countries from Human Mortality Database:

          Country  2020    2021    2022    2023
          FI       2.1%    7.0%    16.7%   (w2-18) 3.5%
          CA       8.0%    7.8%    15.3%   (w2.3) 1.2%
          US       20.3%   20.8%   14.5%   (w2-18) 4.8%
          DE       5.6%    5.0%    12.5%   (w2-22) 5%
          IT       23.0%   11.0%   12.4%   (w2-12) -0.6%
          AT       13.0%   5.0%    10.6%   (w2-22) 0.6%
          FR       9.6%    5.5%    9.8%    (w2-18) -0.7%
          UK       19.0%   5.9%    8.5%    (w2-22) 6%
          ES       22.0%   6.7%    7.5%    (w2-21) 1.1%
          IL       5.2%    5.4%    7.5%    (w2-20) -3.3%
          SE       11.3%   -2.0%   -0.8%   (w2-20) -5%

          JPA Ioannidis et al (PMID:37162934, Apr’23, “Variability in excess deaths across countries with different vulnerability during 2020-2023”) consider country ‘vulnerability’ along 3 economic indicators, and summarize:

          “The USA would have had 1.50 million fewer deaths if it had the performance of Sweden, 1.13 million fewer deaths if it had the performance of Finland, and 0.93 million fewer deaths if it had the performance of France. Excess deaths started deviating in the two groups after the first wave when correlational patterns with the 3 economic indicators also started to emerge.”

          They did not look at the ‘vaccination’ factor here, but it’s been touted as saving humanity from the pandemic since the very beginning, so why don’t we see a clear pattern: (countries with) ‘vaccination’ mandates -> (big) drop in mortality? Is it because we unduly expect positive “VE against non-COVID mortality”?

        • Anonymous and Mark –

          More on excess deaths (with implications related to vaccine status):

          https://twitter.com/astokespop/status/1673423775688462336?s=20

          Anonymous –

          I’m not sure what your point is. The ONS analysis seems to be pretty clear and extremely thorough. If you have questions, I suggest you ask them at the Twitter feed. She’s quite good about answering questions. (FWIW, Fenton, from what I’ve seen, is not a particularly reliable analyst). Also comparing Sweden to the US seems like pretty much a non-starter, IMO; very different rates of persons per household, or intergenerational households, or ability to work from home, and vax rates, stages of the pandemic (and related issues like the virulence of the different strains, improvements of treatment), etc. I fail to see how comparing Sweden to the US is instructive as to excess deaths vax vs. non-vax. There are some sophisticated analyses that do take on the excess deaths vax vs. non-vax issue by cross-country comparisons, but with so many moving variables I’m skeptical of the utility of doing so. Even comparisons like state to state in the US, while probably better, seem dubious to me given the heterogeneity even within given states. At any rate, the link above in this comment might be of interest to you, and one thing discussed is rural versus urban longitudinally across the different stages of the pandemic which seems to me to get closer to being worthwhile.

        • @joshua jun26 (don’t see the ‘reply’ on that comment)

          Thanks but I fail to read any explanation.
          “Non-zero VE against non-COVID mortality indicates that residual confounding may impact the results, …”
          residual confounding or maybe confounding analysis, sampling and definitions, i.e. ‘illusions’ https://wherearethenumbers.substack.com/p/the-illusion-of-vaccine-efficacy
          If your life expectancy is a few months that illusion may hold, but ‘long-term’ VE for this stuff looks very ‘problematic’ – drops to 0 in a few months then dives deep into negative – see eg https://cmsindipendente.it/sites/default/files/2023-06/CMS%20-%2020230604%20-%20Letter%20to%20the%20WHO%20Director-General%20%28inviata%29.pdf

          Here’s the total all-cause Average Death Rates wrt 2015-2019 for a few countries from Human Mortality Database:

          Country  2020    2021    2022    2023
          FI       2.1%    7.0%    16.7%   (w2-18) 3.5%
          CA       8.0%    7.8%    15.3%   (w2.3) 1.2%
          US       20.3%   20.8%   14.5%   (w2-18) 4.8%
          DE       5.6%    5.0%    12.5%   (w2-22) 5%
          IT       23.0%   11.0%   12.4%   (w2-12) -0.6%
          AT       13.0%   5.0%    10.6%   (w2-22) 0.6%
          FR       9.6%    5.5%    9.8%    (w2-18) -0.7%
          UK       19.0%   5.9%    8.5%    (w2-22) 6%
          ES       22.0%   6.7%    7.5%    (w2-21) 1.1%
          IL       5.2%    5.4%    7.5%    (w2-20) -3.3%
          SE       11.3%   -2.0%   -0.8%   (w2-20) -5%

          From the same database, JPA Ioannidis et al (PMID:37162934, Apr’23, “Variability in excess deaths across countries with different vulnerability during 2020-2023”) consider country ‘vulnerability’ along 3 economic indicators, and summarize:

          “The USA would have had 1.50 million fewer deaths if it had the performance of Sweden, 1.13 million fewer deaths if it had the performance of Finland, and 0.93 million fewer deaths if it had the performance of France. Excess deaths started deviating in the two groups after the first wave when correlational patterns with the 3 economic indicators also started to emerge.”

          They did not look at ‘vaccination’ factor here, but it was being touted as saving humanity from the pandemic since the very beginning, so (countries with) ‘vaccination’ mandates should stand out with (big) drop in mortality, but it seems the converse.

          And that’s “just deaths”, what about the figures for severe injuries?
          Active VAERS vs passive shows the latter underestimates by 1000x. Sweden didn’t mandate the ‘vaccine’ and promptly stopped the campaign for age groups at the very first signals, besides no lockdowns, masks and all the rest of the madness.
          So what’s the culprit, the other common factor, if any, among the ‘bad performers’ above beside the ‘vaccination’ the lockdown, mask, isolation, stay-home?

        • Mark –

          Andrew has made it clear he’s not vibing with having a debate on his blog about vaccines, but I’ll make one more comment (with a link in a 2nd comment).

          >… but it was being touted as saving humanity from the pandemic since the very beginning,

          I certainly think it’s fair to say that the vaccines did not live up to some expectations and in some cases could fairly be described as over-hyped. But on the other hand your hyperbole isn’t constructive, imo. For the most part the vaccines were “touted” as offering the potential to save many lives and all the evidence I’ve seen indicated they’ve done so, rather remarkably given the unprecedented time scale of development. Granted their efficacy in preventing infections and forward transmission early on didn’t stand the test of time with the more evasive variants, and as Anoneuoid likes to point out there were solid scientific reasons to suspect they wouldn’t do so. But while lamentable, I can see some non-nefarious reasons why some mistakes were made in that regard.

          > so (countries with) ‘vaccination’ mandates should stand out with (big) drop in mortality, but it seems the converse.

          Just comparing countries is facile. You have to take in obvious factors like accuracy of testing and deaths accountable to covid, and mean age and background health and… well a ton o’ stuff. From what I’ve seen, quality analyses that attempt to account for such factors demonstrate very significant benefit from the vaxes. Again, the ONS analysis of excess deaths is one place to look.

          > And that’s “just deaths”, what about the figures for severe injuries?

          If you’re going to account for serious illness it only enlarges the window for positive benefits from vaccination.

          > Active VARES vs passive shows the latter undersimates by 1000x.

          Using “passive” VAERS for measuring harm comprehensively from these vaccines should be a non-starter for obvious reasons. Beyond that, there’s been a ton o’ just flat out misleading information hyped by grifters on this issue, like with “Died Suddenly” or that retracted article that got a lot of hype by projecting deaths by looking at survey results. It’s become a mis- and dis-information industry. Nothing is certain but the analysts I tend to trust uniformly agree that overall, serious vaccine harm has been rare. It’s clear you think otherwise but in my view the people promoting the idea that there’s been widespread vaccine harms are overwhelmingly shysters and/or driven by political agenda. I don’t want to offend but from what I’ve seen it’s been a conspiracy theory-a-palooza, free-for-all from folks like RFK Jr. and all the associated hangers on.

          > Swden didn’t mandate the ‘vaccine’ and promptly stopped the campaign for age groups at very first signals, besides no lockdowns masks and all the rest of the madness.

          Sweden would seem hardly the country you’d want to point to if you’re critical of the vaxes, given their vax rate. So I’m not sure what your point is there unless it’s just to then move the discussion into one of the costs/benefits of NPIs. Suffice to say I think we’re likely to disagree there also as to whether outcomes in Sweden make a good case against NPIs.

          > So what’s the culprit, the other common factor, if any, among the ‘bad peformers’ above beside the ‘vaccination’ the lockdown, mask, isolation, stay-home?

          I’ll just say this, in my view ALL the people who argue that NPIs “caused” negative outcomes fail to account for the counterfactual assumptions they’re making. I have not seen a single one address what seems to me to be a rather obvious issue in that regard. I’ll put a related link in the next comment down, if you’re interested.

        • Mark, anonymous

          Don’t know if you’ll see this but I just came across it:

          > However, considering the low levels of excess mortality in countries in which COVID-19 transmission, infection and mortality rates were low during some of the analysed period (for example, Malaysia, Mongolia, Uruguay in 2020) or its entirety (for example, Australia, Japan, New Zealand), suggests that in many countries the greater proportion of excess deaths can be attributed to COVID-19 directly.

          https://www.nature.com/articles/s41586-022-05522-2

    • “And that’s [sic] not just cardiac issues.”

      What do you call the opposite of a panacea, anyway? We really need to know what’s in that vaccine, there has never before been a toxin known to man that can kill in so many unrelated ways.

  4. It seems like even if we didn’t want to publish every correlation, this particular example is obviously way more relevant than the Nicholas Cage/Drowning or Chocolate/Nobel Laureates correlations. I mostly agree with the last sentence of the article: “Given the severe methodological weaknesses, there seems to be little option but to retract the article as soon as possible to avoid harming public health”. However it isn’t clear to me that this correlation is not ‘clinically relevant’ or that we should avoid publishing correlations on sensitive topics (though I agree the negative impact is potentially larger there). I don’t know what the exact timeline was but this to me seems to be an example of the system at least somewhat working. The article being criticized was published on Apr 28, 2022, and on May 5, 2022 there was a warning posted that the conclusions were subject to criticisms. The question remains why it’s taken so long since then to resolve things.

    There’s some connection to be made here between this example and the Harvard ‘censorship’ post from a few days ago. That certainly seems like an area where the topic is sufficiently sensitive to warrant concern; I wonder how much of the literature there we’d have to axe. Lastly, not that it really matters, but it’s funny seeing 9 authors listed on such a short paper.

    • See my update above on the matter (I’m one of the co-authors).

      “Short update on this matter:

      400+ days ago, we submitted the rebuttal discussed above to Sci. Rep. We argue in the rebuttal that the original paper is flawed beyond repair in ALL of its statistical analyses. Reviewers who read both our rebuttal + authors’ response agreed that the paper could not be saved. Yet, a week ago, the EiC gave the authors of the original paper the possibility to change their paper (although the critical flaws we point to won’t be addressed) and asked us to wait, read the updated version and submit our concerns again. Interestingly enough, apparently the authors were originally invited to submit their paper to Sci. Rep.”

      • I did read your update, mostly I just don’t understand what they’re waiting for now. Another way to frame it is that up until at least May 5th, 2022, everything seemed to be working fine. If they had retracted this article within 1-2 months of that date, I don’t think there would have been a problem. Though I do also think the framing/rhetoric in your article likely encouraged the original authors to dig their heels in, but that’s a bit beside the point.

        • “If they retracted this article within 1-2 months of that date, I don’t think there would have been a problem.”
          I agree completely with this

          “I do also think the framing/rhetoric in your article likely encouraged the original authors to dig their heels in, but that’s a bit beside the point.”
          I also agree with this actually, but we were quite fed up with the whole “having to debunk bad studies”

  5. I fail to understand why there’s so little attention paid to the Bradford Hill criteria when medical researchers publish on correlations in epidemiological data.

    Personally, I think there ‘aughta be a law’ that requires a theory of causal mechanism attached to any such paper.

    • “I think there ‘aughta be a law’ that requires a theory of causal mechanism attached to any such paper.”

      I disagree. Consider the earliest studies of, say, the association of lung cancer and cigarette smoking. On what basis, back then, would anybody have been able to provide a theory of causal mechanism? Does that mean that the very strong association should be ignored? Of course not. It should be published, with a caveat that we don’t really know what is behind this, and it should stimulate further research that would clarify whether we have spurious correlation or something more substantive, and, eventually, scientists might discover a mechanism.

      Should Isaac Newton not have published his law of gravity? Newton himself wrote that he could not imagine any conceivable mechanism for a force acting at a distance–he characterized it as absurd. It took hundreds of years for Einstein to figure that out (and correct the formula, to boot).

      The law that “aughta be” is a proscription of making more of an association (whether specifically a correlation coefficient or some other measure) than can be justified based on the totality of evidence and understanding.

      • Clyde –

        OK, that’s fair enough.

        But I just think that the conflation of correlation with causation is kind of out of control. And I think that corresponding plausible mechanisms of causality is an important control measure. But OK, maybe it shouldn’t be a law but a (strong) suggestion. Kind of like how drivers in Boston view driving laws: suggestions but not really laws.

        And maybe it oughtn’t even be a law to address Hill’s criteria of causality – but I sure would like to see it strongly suggested, and that causality should be heavily caveated when there’s only cross-sectional (and not longitudinal) data available (kind of like Nick mentions below).

  6. Why is anybody even discussing “correlations” (meaning, unless someone has evidence to the contrary, Pearson product-moment correlations) being performed on time-series data? Just because you can put two columns of numbers into SPSS and click “Correlate” doesn’t mean you’re actually doing anything meaningful.

    I find it disappointing that the critics of people who publish such spurious results rarely seem to explain why Nicolas Cage movies are highly likely to be “correlated” with swimming pool drownings. It’s not that they trawled through 10,000 combinations of 100 actors and 100 causes of death. Large “correlations” will occur anywhere you put two inherently auto-correlated sequences into a Pearson correlation.
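
The autocorrelation point can be illustrated with a small simulation. This is a sketch, not a formal result: it compares pairs of independent white-noise series with pairs of independent random walks (one simple kind of autocorrelated sequence), each 11 points long to mirror the Cage example, and counts how often the absolute Pearson correlation exceeds 0.5. The series length, threshold, seed, and all function names are arbitrary choices made for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def white_noise(n, rng):
    """n independent standard normal draws (no autocorrelation)."""
    return [rng.gauss(0, 1) for _ in range(n)]


def random_walk(n, rng):
    """Running sum of independent normal steps (strongly autocorrelated)."""
    position, walk = 0.0, []
    for _ in range(n):
        position += rng.gauss(0, 1)
        walk.append(position)
    return walk


def share_large(make_series, n_points, n_pairs, threshold, rng):
    """Fraction of independent series pairs with |r| above the threshold."""
    hits = 0
    for _ in range(n_pairs):
        r = pearson_r(make_series(n_points, rng), make_series(n_points, rng))
        if abs(r) > threshold:
            hits += 1
    return hits / n_pairs


rng = random.Random(0)  # fixed seed so the run is repeatable
noise_share = share_large(white_noise, 11, 2000, 0.5, rng)
walk_share = share_large(random_walk, 11, 2000, 0.5, rng)
print(f"white noise pairs with |r| > 0.5:  {noise_share:.2f}")
print(f"random walk pairs with |r| > 0.5:  {walk_share:.2f}")
```

The two series in each pair are generated independently, so any large correlation is spurious by construction; the random-walk pairs should produce such values far more often than the white-noise pairs.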

  7. In addition to all the above, I don’t understand how, after more than 400 days, the editors have not been able to say “you can’t have a post hoc power of 1.00 for an NS test result using the observed effect size” or “that p value is not possible from that rho with that many observations”.

    That’s “get out a calculator and check” stuff, not “have a debate” stuff.
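
As a rough idea of what such a calculator check could look like, the sketch below approximates the two-sided p value implied by a correlation coefficient and a sample size using Fisher's z transform. It is an approximation for a Pearson correlation (a Spearman rho would need its own treatment), and the example numbers are hypothetical, not taken from the paper under discussion.

```python
import math


def approx_p_from_r(r, n):
    """Approximate two-sided p value for H0: correlation = 0.

    Uses Fisher's z transform: atanh(r) * sqrt(n - 3) is treated as
    standard normal under the null. This is an approximation, not the
    exact t-based test.
    """
    if n <= 3:
        raise ValueError("need more than 3 observations")
    if abs(r) >= 1:
        return 0.0
    z = math.atanh(r) * math.sqrt(n - 3)
    # Two-sided tail probability of a standard normal at |z|.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical reported values, for illustration only.
reported_r, reported_n, reported_p = 0.30, 20, 0.001
implied_p = approx_p_from_r(reported_r, reported_n)
print(f"implied p is about {implied_p:.3f}; reported p = {reported_p}")
```

For these made-up inputs the implied p value is around 0.2, so a reported p of 0.001 could not be right; a gap of that size is far larger than any error introduced by the approximation.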

      • Anonymous –

        > So the solution to methodological problems in medical science is to be more like economics?

        Hmmm. I get your joke but that’s maybe a bit unfair. They’re saying that there’s benefit in using some methods commonly found in economics, such as natural experiments, to improve nutrition research. I thought their example of a natural experiment of sugar intake across different families was a pretty good example. Do you disagree?

        • I mean, natural experiments (including mendelian randomization) are fine and are actually used more and more, but it’s quite naive to think that it will solve all of our problems. Not all foods or substances have a corresponding mutation like alcohol, or have been in shortage like sugar. And even so, a lot of other things were happening after the war than sugar shortage, so even that is a bit iffy.

          It’s clear that the state of nutrition epi is bad, it’s been known for years, and maybe econometrics can help, but what can definitely help is the kind of things that are discussed here all the time: better measurements, better study designs and analyses, less dichotomization and focus on p values…

        • Anonymous –

          > but it’s quite naive to think that it will solve all of our problems

          I didn’t see where anyone was thinking that

        • Joshua:
          > I didn’t see where anyone was thinking that

          The authors of this piece clearly imply that natural experiments would create a revolution in the field in their conclusion: “Although medical researchers are increasingly taking advantage of natural experiments (…) these methods remain undertaught and underused, particularly when it comes to diet. This important research needs a credibility revolution of its own.”

  8. From the NYT article:

    Take the case of artificial sweeteners. Randomized studies — in which people are randomly assigned to one treatment or another to ensure that no other factors interfere — are considered the gold standard

    That is the problem, randomized comparisons of groups is most definitely not the gold standard of science. It is a minor initial, even optional, method of collecting information.

    The gold standard is what was done in physics, which is come up with models that describe how a system behaves and compare predictions to future observations. E.g., Edmond Halley didn’t need any randomization to predict when his comet would return.

    How is it physicists can manage such feats without the “gold standard”, but areas that do consider randomized comparisons a gold standard fail to achieve anything close? Seems more like a “lead standard” that weighs down the area of study.

    • “…How is it physicists can manage such feats without the “gold standard”, but areas that do consider randomized comparisons a gold standard fail to achieve anything close?…”

      Because it’s easy. You have natural ‘laws’ that are stable and noise appears only at mind-boggling levels of precision. Those laws are doing the heavy lifting for you, not the Physicists. They don’t even need statistics, or at least not the complicated, fancy kind. Things are much more difficult in bio-sciences, where all you see is noise and for every metabolic pathway you describe, there are ten other ones doing just the opposite.

        • I don’t buy it. I have a PhD in a biomed field and can attest that I was not trained in calculus (besides high school) or programming numerical simulations at all.

          How can you say the gold standard (physics) approach doesn’t work when you don’t even try it?

          And here is an example of what you get when you combine inferring simple laws (arrived at during WWI) with modern computational power:

          Neuronal circuits are composed of a large variety of branched structures – axons and dendrites – forming a highly entangled web, reminiscent of a stochastic fractal [1]. Despite this apparent chaos, more than a century ago Ramón y Cajal was able to extract order from this neuroanatomical complexity, formulating fundamental anatomical principles of nerve cell organization [2]. Cajal described three biological laws of neuronal architecture (Chapter V, p.115–125, in [2]): optimization principles for conservation of space, cytoplasm and conduction time in the neural circuitry. These principles helped him to classify his observations and allowed him to postulate a wide variety of theories of functionality and directionality of signal flow in various brain areas.

          […]

          In summary, we find that a simple growth algorithm which optimizes total cable length and the path length from any point to the root in an iterative fashion can generate synthetic dendritic trees that are indistinguishable from their real counterparts for a wide variety of neurons. This represents a direct validation of the fundamental constraints on neuronal circuit organization described originally by Cajal.

          https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2916857/

          I found many examples like this from the pre-1940s era (SIR models also originated then), simply because people were taking a “physics” approach. Nowadays we are training people to think it is impossible and relying on methods that don’t work.

        • I’m with Anoneuoid. My background is math/physics/engineering, but I have a considerable number of biology publications (I haven’t looked recently, but maybe 8-10?). In most cases someone I know in biology came to me with a question and I started asking for more and more details and in the end the “standard methods” they described made no sense and I convinced them to do something else, with the result that we published stuff well outside the standard for biology and with an emphasis on mechanistic models with Bayesian methods. I’ve published accelerated aging models for cancer survival, agent based models for the growth and differentiation of rib bones, bayesian estimation of muscular forces in biomechanics problems, etc.

          An early dimensional analysis of rib bone post-surgery regeneration suggested that mathematical laws implied that to observe the given results the causative agents should come radially from the surrounding tissue. Sure enough my wife proved through experiment that the essential cell types are resident in the periosteal membrane.

          Funding for Biology research is already extremely competitive, and typical funding rates are like 10% of grants in any given session, but it’s particularly impossible to get funding for anything outside the norm in terms of methodology. Quite honestly the biologists responsible for assessing grant proposals simply don’t have the knowledge to understand the methodology or the importance of these things. Or even, for example in the case of the cancer survival model, the model clearly invalidates almost every survival paper you could find in the literature (which rely heavily on Kaplan-Meier curves that in follow up unpublished research we proved through simulation were vastly and horribly incorrect). If all you have is a hammer, a PCR machine, and RNA-seq machines with NHST p-values to rank “important genes” in pushbutton software like Partek Flow that costs millions of dollars to site-license… no one wants to rock the boat. Step out of line and you’re going to get squashed. There are too many other “normal” grants that could attract the funding instead.

        • Hi Daniel, that early rib-bone regeneration analysis you mentioned sounds interesting, do you happen to have a link to it? Or anything similar you can think of?

        • DJAD

          It wasn’t something we published, just a discussion between me and my wife. I may have written an email or something.

          The idea was to look at the rib as a geometric cylinder and then imagine trying to fill up that cylinder with damage responsive cells emanating from the cut ends of the rib, and coming in radially from the surrounding tissue. The cut ends supply something proportional to pi*D^2*v/4 while the surrounding tissue supplies an influx of pi*D*L*v. The total volume is pi*D^2*L, so you wind up with a dimensionless equation like

          v*t/4L + v*t/D = 1

          The first term comes from the cut ends and the second from the surrounding tissue. When the length of the removed piece is large, the contribution of the first term goes to zero, therefore if you see large repairs the contributing agents must be coming from the surrounding tissue.

          Also, when the length is small, as in a simple fracture, you expect a different mechanism, with the fractured ends providing the responsive cells.

        • whoops, that’ll teach me to do symbolic math on my phone. The total volume is pi*D^2*L/4 so there’s a missing factor of 4:

          (pi*D^2*v/4 + pi*D*L*v)*t/(pi*D^2*L/4) = 1

          v*t/L + 4*v*t/D = 1
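
A small numerical sketch of the corrected balance above, using the same expressions for the two influx terms as in the comment. The diameter, gap lengths, and speed are arbitrary illustrative values in consistent units, and the function names are made up for this example.

```python
from math import pi


def supply_rates(D, L, v):
    """Volume influx rates (from cut ends, from surrounding tissue)."""
    ends = pi * D**2 * v / 4   # through the cut-end cross section
    radial = pi * D * L * v    # through the lateral surface of the gap
    return ends, radial


def fill_time(D, L, v):
    """Time t at which total influx has filled the cylinder of volume pi*D^2*L/4."""
    ends, radial = supply_rates(D, L, v)
    volume = pi * D**2 * L / 4
    return volume / (ends + radial)


def end_share(D, L, v=1.0):
    """Fraction of the total influx that comes from the cut ends."""
    ends, radial = supply_rates(D, L, v)
    return ends / (ends + radial)


D, v = 1.0, 0.1  # hypothetical diameter and cell front speed
for L in (0.1, 1.0, 10.0):
    t = fill_time(D, L, v)
    balance = v * t / L + 4 * v * t / D  # should equal 1
    print(f"L/D = {L / D:5.1f}  end share = {end_share(D, L):.3f}  balance = {balance:.6f}")
```

Algebraically the end share is D/(D + 4L), so it shrinks toward zero as the removed piece gets longer, which is the argument for the radial source in large repairs.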

      • While the underlying physical laws may be simple, working out the consequences is not. Look up analytical ephemerides, it is mind bogglingly complex:

        https://en.wikipedia.org/wiki/Perturbation_(astronomy)

        Similarly, in bio there are simple “laws” of viral transmission like SIR models that no one really expects to fit observations due to various “perturbations”. E.g., rate of testing and accuracy of the tests.

        For some reason a single positive PCR result was treated as the gold standard during covid. When this was finally compared to a real gold standard in the 2022 human challenge study (positive culture, symptoms, and rise of antibodies), we saw that one positive is essentially worthless, but two consecutive positives 12 hrs apart was a very good proxy for infection in the past month. Then two consecutive positive PCRs plus cold/flu symptoms was needed for active infection.

        We can’t expect the SIR model to closely follow the data when we aren’t even sure who was infected when. Then we need to account for perturbations like seasonality, waning immunity, mutations, and so on.

        The SIR model is akin to Newton’s law of gravitation, it gives the general outline but can’t be expected to account for everything.

        • Just saw this on the perturbation page:

          Newton in 1684 wrote: “By reason of the deviation of the Sun from the center of gravity, the centripetal force does not always tend to that immobile center, and hence the planets neither move exactly in ellipses nor revolve twice in the same orbit. Each time a planet revolves it traces a fresh orbit, as in the motion of the Moon, and each orbit depends on the combined motions of all the planets, not to mention the action of all these on each other. But to consider simultaneously all these causes of motion and to define these motions by exact laws admitting of easy calculation exceeds, if I am not mistaken, the force of any human mind.”

          “Mind-boggling” = “Exceeding the force of any human mind”

    • Anoneuoid –

      > That is the problem, randomized comparisons of groups is most definitely not the gold standard of science.

      They didn’t say that they are the “gold standard of science.”

      Maybe “gold standard” isn’t a good descriptor, but the point is that they have significant advantages, relative to other methods of analysis, such as observational analyses. So they are the “gold standard” if you will, relative to other ways to investigate medical interventions.

      Seems that we get into this yet again. Life ain’t perfect, nor is it binary. Because RCTs have problems (and often lead people to mistakenly conflate correlation with causation), doesn’t mean that they’re useless, or that they don’t bring greater value than other forms of analysis.

    • Anon, you say “ The gold standard is what was done in physics, which is come up with models that describe how a system behaves and compare predictions to future observations.” I agree. But that’s what randomized controlled trials are attempting to do.

      For instance, if you think large doses of Vitamin D should reduce the severity of Covid, you can randomly assign people admitted to hospital due to covid to groups to receive either large doses of vitamin D or a placebo (or to receive a small dose, if that’s normal clinical practice). Then you measure outcomes like deaths, days on a ventilator, etc etc.

      “compare predictions to future observations” is what you want to do; a randomized controlled trial is how you do it.

      The reason for the control group is that it hugely reduces the complexity of the predictions. As with your nice Newton example of planetary motions, only more so, it’s effectively impossible to predict precisely how a group of hospitalized covid patients will do in terms of death rate, patient-hours on ventilators, etc; indeed the prediction wouldn’t just depend on details about the patients but also the doctors and nurses and equipment etc. By doing a control group for which all of those are identical to the cases, you allow your prediction to be much simpler: if given vitamin D supplements, fewer patients will die and fewer hours will be spent on ventilators and they’ll be discharged from the hospital faster etc. I can’t think of a persuasive way to test something like the vitamin D hypothesis without an RCT.

      That said, I would agree that in a lot of cases the idea of an RCT seems pretty nutty and the benefits of the approach can be wildly overstated. I’m thinking of studies where 5 towns are selected at random to get some education or health intervention and the researchers implicitly (or sometimes even explicitly) act as though they have a large sample size because the towns collectively have a population of 5 million or whatever. No, you have n=5.

      Putting it all together, I think there are many settings in which a randomized controlled trial is, and should be, the gold standard for testing a prediction. But a lot of important questions simply don’t allow this.

      • RCTs are a useful type of experiment. Unfortunately they have been used as an alternative to real theory in many cases. “Do x vs y and measure z in the two groups and look for a difference” is WAY better than “read the entrails of chickens” but it’s not better than building some theory into the experimental design and analysis.

        For example, in the COVID trials it took forever for them to approve under 12 year old doses, and they had to go back and redo their clinical trials on under 5 year olds a third time I think. Why? Well, in part because we’ve ossified everything into RCT results as the gold standard. The dose response results seem to have been used very little, and worse yet, the entire field doesn’t even seem to be (at the externally visible level) aware of dimensionless dosing or have built a thoughtful model of dosage. I’ll admit that the internal thought processes are obscure, but externally the discussion was all about 10ug vs 20ug vs 30ug and not about the underlying factor that could actually predict the outcome, namely the ratio of the quantity injected to some quantity of “responsiveness” which would drive side effects. And the stories were that Moderna actually ran trials by just guessing “50, 100, and 200 ug” with 200 being such an incredible over-dose that it made people really sick and they canceled that arm if I remember correctly.

        The quantity of responsiveness I suggested in discussions on this blog would be related to the volume of the bone marrow where the immune responsive cells live, a proxy for which could be calculated via height, weight, and age. Essentially the length of the long bones is related to height H, the cross sectional area of the long bones should be related to weight W. So the mass of the long bone marrow should be something like rho_marrow * H * (W/(rho_avg * g))^(2/3), so the dimensionless dose of the drug would be calculated D/(rho_marrow * H * (W/(rho_avg * g))^(2/3)) where D is the 10 or 20 or 30 ug amount but each individual would have totally different H, W, and the rho_avg would probably be age and sex specific.
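
As a sketch of the calculation just described: the code below takes body mass in kg (so W/g is simply the mass), and the two densities are placeholder values, not measured ones. The marrow-mass expression is only a proxy up to an unknown constant factor, so only ratios between people are meaningful. The heights, masses, and function names are hypothetical illustrations.

```python
def marrow_mass_proxy(height_m, body_mass_kg, rho_avg=1000.0, rho_marrow=1000.0):
    """rho_marrow * H * (W / (rho_avg * g))^(2/3), with W / g = body mass.

    Densities are in kg/m^3 and are placeholder assumptions.
    """
    body_volume = body_mass_kg / rho_avg        # m^3
    cross_section = body_volume ** (2.0 / 3.0)  # m^2, proxy for bone cross section
    return rho_marrow * height_m * cross_section  # kg, up to a constant factor


def dimensionless_dose(dose_ug, height_m, body_mass_kg, **densities):
    """Injected mass divided by the marrow-mass proxy (a pure number)."""
    dose_kg = dose_ug * 1e-9  # 1 microgram = 1e-9 kg
    return dose_kg / marrow_mass_proxy(height_m, body_mass_kg, **densities)


# Hypothetical people receiving the same 30 ug injection.
adult = dimensionless_dose(30, height_m=1.83, body_mass_kg=85)
child = dimensionless_dose(30, height_m=1.50, body_mass_kg=45)
print(f"adult: {adult:.3e}  child: {child:.3e}  ratio: {child / adult:.2f}")
```

Under these assumptions the same microgram amount corresponds to a noticeably larger dimensionless dose for the smaller person, and every participant ends up with their own value on a continuous dose axis.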

        If you’d calculated dose like this you’d have found that there weren’t 3 or 4 doses tried, there was a different dose for each participant. You’d have found that the generally shorter and smaller 12 year olds were given a MUCH higher dose than the 6ft tall male adults, and that in fact 12 year olds had a hard time with side effects like vomiting and flu like symptoms and the rare cardiovascular disorders. You’d have also found that the under 5 year olds were given too high a dose when injected with amounts appropriate to 5-12 year olds.

        Instead of saying “what would happen if we gave 8ug… let’s do a separate RCT!” you’d have said “what is the antibody response as a function of dimensionless dose” and “what is the side effect profile as a function of dimensionless dose” and you’d be able to graph these with continuous dosage data and fit a couple of curves. If you look at it in terms of ug injected though you have 2 or 3 discrete values and a lot of confusion about how to understand why some people got much worse side effects than others, or why some people might have had less effectiveness than others.

        The entire thing could have been analyzed in the context of a reasonable model and dosage figured out from rational principles. Also you could test other theories of dosing. They should all be dimensionless, but maybe it’s not the bone marrow that controls but rather the volume of blood times the density of circulating immune cells, and maybe the mass of the muscle into which it’s injected matters… etc

        We are, I think, excessively allergic to mathematical modeling in biomedical sciences. I say this as someone who regularly interacts with biomedical science people, and yes I KNOW there are pharmacokinetics people who do a lot of good stuff, but if they don’t get enough power to affect the decision making in a life-threatening pandemic then I still stand by the statement that we are excessively allergic to this stuff.

        • We’ve previously discussed the fact (and it is a fact) that the clinically optimal dose for a 100-pound person can’t be the same as for a 300-pound person, by which I don’t mean to imply that the dosage should scale proportionally to weight. On the other hand, there are good reasons for having quantized doses. I think that at least ‘small’ and ‘large’ should be used, at least for adults, and presumably ‘very small’ and ‘small’ for children of different sizes.

          But none of that is directly relevant to RCT’s. Even if you decide to do a vaccine dose that is a function of bone marrow mass or whatever, the ‘gold standard’ would still be to test the vaccine in an RCT.

        • Phil, agreed you’d quantize the dose when giving it out, because it’s too hard to ask people to do math every time and measure out to the nearest 1 ug or whatever. But that doesn’t mean you analyze the results of the trial incorrectly!

          The problem I’m pushing back against isn’t the use of randomization or controls, it’s the use of NHST comparisons of average treatment to control in an RCT as a substitute for “real science” which I think means model building and mechanistic analysis. If you build models and test them with RCTs it’s good stuff, if you just try A vs B in an RCT and don’t bother with the modeling it’s not.

          When it came time to design the dosage for children, if I remember correctly, they tried a couple different values, found that 10ug worked well for 5-11 year olds, but was too severe for under 5 year olds, then spent another 8 months or something doing RCTs on under 5 year olds. By then it was probably too late for the vast majority of under 5 year olds to benefit from getting the vaccine before the disease. I’m sure the incidence of “long COVID” among 0-12 year olds is higher than it would have been if the analysis of the adult dosage had been done correctly and the information extrapolated appropriately to the younger age groups resulting in trials that were successful and quick instead of prolonged and repeated several times.

          The point being that model prediction was given 0% credibility and the average result of 10k people being given exactly the proposed dose in RCT was given 100% credibility, to the detriment of everyone under about age 16 (the 12-15 year olds got too high a dose and got quite sick on average, the under 12 year olds had between 12 months and 2 years of additional delay)

          RCTs are great, but a small p value in an RCT comparison is not a substitute for building models and understanding the mechanisms and being able to predict the outcome of experiments. The issue is so bad that the average biologist wouldn’t even understand why you’d analyze these things by plotting say antibody response at 2 weeks vs dimensionless dose instead of antibody response at 2 weeks vs micrograms injected, and they wouldn’t even think to try to normalize grams injected by an estimate of the mass of the bone marrow, or even understand that you should normalize by a mass rather than say by a height or a surface area.

          The issue is so bad that we’ve spent DECADES dealing with the problem of using BMI, which is a dimensional ratio rather than a dimensionless ratio. A big part of its problem is precisely that being a dimensional ratio it provably can’t possibly be directly a determining factor of anything. There must be a couple thousand articles published and hundreds of datasets collected and analyzed and thousands and thousands of hours spent discussing BMI. We’ve finally had the AMA come out and say not to use BMI https://www.msn.com/en-us/health/medical/bmi-alone-is-a-poor-indicator-of-health-the-ama-says-these-metrics-may-be-better/ar-AA1cUhGD but it still isn’t about the fact that it’s fundamentally flawed, it’s on the basis that the dataset used in the original regression was limited/biased to non-Hispanic white people.
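
To illustrate the dimensional point, here is a short sketch. The dimensionless index shown, mass / (density * height^3), is just one hypothetical example of such a ratio and not a proposed replacement for BMI. The reference density and body measurements are placeholder values.

```python
import math

KG_PER_LB = 0.45359237
M_PER_IN = 0.0254


def bmi(mass, height):
    """Mass over height squared: carries units of mass / length^2."""
    return mass / height**2


def dimensionless_index(mass, height, ref_density):
    """Mass over (density * height^3): a pure number in any consistent units."""
    return mass / (ref_density * height**3)


# The same hypothetical person described in two unit systems.
mass_kg, height_m, rho_si = 70.0, 1.75, 1000.0  # kg, m, kg/m^3
mass_lb = mass_kg / KG_PER_LB
height_in = height_m / M_PER_IN
rho_imperial = rho_si * M_PER_IN**3 / KG_PER_LB  # lb/in^3

print("BMI, SI units:      ", bmi(mass_kg, height_m))
print("BMI, lb and inches: ", bmi(mass_lb, height_in))
print("index, SI units:    ", dimensionless_index(mass_kg, height_m, rho_si))
print("index, lb and inches:", dimensionless_index(mass_lb, height_in, rho_imperial))
```

The BMI number changes with the choice of units while the dimensionless index does not, which is why the former needs a hidden conversion constant before it can enter any physical relationship.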

          The vast majority of that could have been avoided if very basic ideas in applied mathematics and physics were more widely known. I tried to fit my own replacement for BMI to every dataset I could come up with online. Most of them were “only available to qualified researchers” and other ones simply **didn’t even measure the relevant variables** (namely, weight, height, waist circumference and estimate of percentage bodyfat).

        • +1 to everything Daniel says here. In my experience this is just one of those things where, once you see it, this way of thinking comes to seem intuitively obvious and it gets more and more baffling that whole communities of PhD holding scientists seem unaware or unable to think this way :)

      • > I can’t think of a persuasive way to test something like the vitamin D hypothesis without an RCT.

        How would you do an RCT for the hypothesis that putting out house fires with water is better than not? You need the same type of information for the vitamin intervention to be effective.

        1) Size of house
        2) Type (materials and airflow) of house
        3) Size of fire
        4) Type of fire (electrical, etc)
        5) Amount of water
        6) Rate of water being applied
        7) Where/how the water is applied (dropped on it vs from a hose)
        8) When the water is applied
        9) Environmental conditions (drought vs currently raining, etc)

        Just applying the same water treatment to every house on fire is going to lead to conflicting results. It will be insufficient for larger fires, wash away smaller houses, and so on. If that is how we solved problems we would never figure out the most obvious and basic solutions to problems. Another would be providing raw materials to rebuild a town after a bad storm hit it. Should we just give the same amount of wood/nails/etc to every town after getting hit by a tornado?

        So the first thing that needs to be done is figure out how to measure deficiency accurately along with rate of metabolism of the vitamin (both vitamins C and D are antioxidants that get consumed quicker in diseased/damaged tissue).

        • You’re doing something of the same thing Daniel did: assuming that a randomized controlled trial must be done very stupidly. “Applying the same water treatment to each house”, where does that come from?

          Suppose I have some additive to water that I believe should improve its ability to put out fires. Maybe it increases the viscosity or the boiling point or something; in any case I have reason to believe it will be useful in fighting fires, but it’s not something that can be proven from first principles.

          First thing to do is test it in a lab: get some standardized piles of wood or something, light them on fire, and apply either plain water or treated water. Measure how much liquid it takes to put them out etc.

          You’d then move on to increasing levels of realism, and then ultimately to real-world tests.

          If the effect is large enough then you don’t necessarily need an RCT — maybe randomization isn’t so important as long as you don’t bias things too badly — but an RCT would still be the gold standard.

        • > You’re doing something of the same thing Daniel did: assuming that a randomized controlled trial must be done very stupidly. “Applying the same water treatment to each house”, where does that come from?

          They are pretty much all “done stupidly”. Here is a reanalysis of one done for vitamin C and covid outpatients:

          https://www.frontiersin.org/articles/10.3389/fimmu.2021.674681/full

          You can see all get the same dose and they do not consider vitamin C levels (not even baseline) nor timing of the treatment. This is like firemen not paying attention to the size of the fire, or even if the house is still standing, when “testing” the water on fire intervention.

          Then they ended the study early “for futility” even though they still saw an effect 20% higher than used for the power analysis.

          The thing is by the time you are collecting good enough data to not generate conflicting results, it is about time you have a model for what is going on anyway (rate of absorption, distribution, metabolism in various tissues, etc).

        • The issue, in my opinion, is that the “script” for using RCTs is written around the idea of “treat the system as a random number generator and see if you can affect the mean value” or similar (maybe you’re not trying to affect the mean value, but you’re always treating it as a random number generator and trying to “detect a significant difference in some parameter”).

          We’ve taught ~ 2 whole generations of scientists that the “gold standard” for thinking about the world is “as if it were a random number generator and design experiments that can distinguish between different RNGs”. If you succeed by identifying a way to shift the parameters of the RNG by detecting a significant difference between different conditions in an RCT and report it with a small p value it will leave you “impervious” to criticism (this imperviousness to criticism is the real “value” since it’s the “gold standard” you can’t be laughed off the stage). Whereas if you make some other claim, it will always be subject to “correlation is not causation” and people who have been taught that it’s not actually feasible to do mechanistic models anyway so anyone claiming to do them should be taken to be possibly a crank.

          And you see this right in this thread, people saying that outside of “physics” it’s just impossible to describe things mechanistically and it’s pointless to even try. Nothing could be farther from the truth. Even within “physics” there are TONS of complex systems that are plenty hard to describe. Global circulation models, scour from turbulence at the base of bridge piers, transport of pollutant chemicals through waterways, even golf swing mechanics.

          There are tons of good mechanistic research results in biology. In fact, biologists intuitively know this, and typically view statistics and RCTs as kind of “bureaucracy” unless they are in a directly patient facing position (ie. drug design / treatment design etc). There are even good results in mechanistic modeling of economics (Schelling’s model of segregation is an example from back in 1978).

          I’m friends with one of the commenters here jrc who works on development economics in third world countries. It’s easy to see how you could describe mechanistic effects that might result in parents choosing to devote different amounts of resources to different children thus resulting in different health outcomes for the children. We even probably know things about bone growth and access to different kinds of calories, and frequency of GI illness and absorption of calories, and the role of clean water in GI illness and the role of minerals like calcium and arsenic and such.

          The thing that’s missing is reasonable attempts to measure some of the things you’d prefer to measure, as well as in general the Frequentist viewpoint being highly antagonistic to mechanistic modeling. How do you “design an estimator” for a rule-based process? What is the “standard error” in that context? These are typically high dimensional descriptions. You could imagine for example the child’s experience as 48 months, where in each month there were 6 or so different determining factors that occurred, and for each child the values of those determining factors were different. So there’s a 288 dimensional space to describe the experience of the child, and there are zero repetitions within your 1000 children.

          It’s straightforward in a Bayesian framework, it’s NOT straightforward in a Frequentist framework. You have zero repetition of anything. Every distribution needs to be estimated from a single data point. The only way forward is to aggregate across all the experiences and describe some aggregate “effect” which means to ignore the actual mechanism.

          Take a given child whose height was measured 4 years in a row. Take a rule that says something about how much food, clean water, and medical care they got based on the rainfall, crop harvest, which birth order they are (first, second, third child etc), their sex, their parents’ money income, and the administrative region they live in (county etc). There is no way in the Frequentist framework to say “we know that people make decisions on the basis of these things, but we don’t know precisely which decisions they would make, but we know within some bounds which regions of decision space are more or less likely to occur” and turn that into a Frequentist estimator of the properties of the ensemble of children.

          If you live in a world where instead of “rules” there is “randomness” and your whole view of science is to “measure the randomness” and look for “the best kinds of randomness” then it’s not going to be a surprise that after 60 years of that you’ll be stalled out and unable to really describe how things work. People will even push back that it’s “not even possible” to describe the world beyond the flight of golf balls.

          And yet, somehow people design concrete bridges and can predict how rainfall miles away would affect levels of water in a dam far downstream, and can figure out how to build fish ladders and how fertilizer and pesticide runoff causes sex changes in frogs, and how forest fires spread under different types of canopy density and fuel distribution… The world isn’t a random number generator and yet our “gold standard” is tied up with the view that it is.

        • You guys are looking at badly conceived or badly designed randomized trials and somehow concluding that randomized trials are bad because a lot of them are badly conceived. You are throwing out the baby with the bath water.

        • Phil as I said elsewhere

          “The problem I’m pushing back against isn’t the use of randomization or controls, it’s the use of NHST comparisons of average treatment to control in an RCT as a substitute for “real science” which I think means model building and mechanistic analysis.”

          RCTs as a type of experiment are great. They simplify the analysis required and allow you to model a phenomenon rather than a phenomenon together with selection effects and other irrelevancies of experimental design.

          But you still have to do the modeling, and the vast vast majority of RCTs are not done in a Bayesian+model based approach, they are done in a “black box RNG based NHST analysis” framework. I don’t think anyone can reasonably argue otherwise.

        • RCTs aren’t “bad”, they’ve just been elevated to a position they aren’t qualified to fill.

          It is a minor, initial method of evidence-gathering. Not a gold standard. Here is a thought experiment:

          A generation is raised without knowledge of modern physics or how it was achieved (actually, maybe this is happening now…), then the brightest of them are trained to approach physics with RCTs as the gold standard.

          What would be the outcome?

        • Daniel –

          > But you still have to do the modeling, and the vast vast majority of RCTs are not done in a Bayesian+model based approach, they are done in a “black box RNG based NHST analysis” framework.

          Not to disagree with that, but to stress that in the real world, RCT’s as a “gold standard” means RCT’s as an alternative to observational analyses that similarly lack a Bayesian+modeling approach. I just think that objections to “RCTs are the gold standard” should be taken in that context.

        • Insects love my wife. When we go for a walk, even in a place with a lot of mosquitoes, I usually don’t bother with bug repellent because the bugs will ignore me and go to her.

          A week or so ago we were traveling in a place with a lot of midges, and we were out of bug stuff, so we stopped in a fly-fishing shop that happened to be nearby and bought some of their bug stuff.

          Today, my wife and I were going out for a walk, and, having access to our usual bug goop, she started to apply it. I said hey, why not apply the old stuff to one arm and the new stuff to the other arm, and see which one works better? I was kinda kidding but kinda not.

          Suppose I am formulating an insect repellent and I want to test my formula against others on the market, or perhaps to test several versions of my formula against each other and against others on the market. How should I do that?

          Anon and Daniel, if I understand you correctly you are both arguing _against_ actually trying these things in the real world. That is not the gold standard. The gold standard, in your opinion, is to use whatever theory of bug-attraction I have, and choose whatever my theory says is best. Is that seriously your argument?

          Tell you what, rather than say an RCT in a situation like this is not the gold standard, you tell me: what IS the gold standard? What is the absolute best way to test which bug repellent would work best in the real world?

        • Anon and Daniel, if I understand you correctly you are both arguing _against_ actually trying these things in the real world.

          I didn’t and don’t see where Daniel did either. Instead I (we?) claim this type of data is a provisional step to figuring things out. It can even be skipped altogether.

        • Phil, I too am mystified where it comes from that you think I advocate using pure logic without any experiments or whatever. Especially immediately after me re-quoting for the second time that I have no problem with randomization or with controls, it’s the analysis methodologies that bother me.

          Suppose you do an experiment, of any sort, in which you try testing mosquito product A and mosquito product B, in any way at all. Suppose you see some differences. This is the *starting point* for doing some science. Let’s explain why those differences occur and how to control them.

          Start with running each formula through a gas chromatograph / mass spectrometer to identify the components. Then let’s look at the species of mosquito. Identify the evaporation rate of the different components at body surface temperatures. Let’s also observe the behavior of the mosquitos: do they hover and never land, do they land but take off shortly after, do they insert their proboscis but pull out immediately? Let’s capture or breed some mosquitos and place them in a box with a cotton ball soaked in each of the major components. Does any one component stand out as particularly repellant? Let’s try washing the subjects’ skin with either simple soaps or isopropanol before applying the products: does the product interact with stuff on your skin that is washed away with the washing? Let’s vary the type of mosquito used and see how long the mosquito lands for…

          Finally, having observed multiple types of experiments in both controlled laboratory conditions and perhaps real-world conditions, let’s build a model of mosquito behavior as a function of mosquito type, humidity, activity level, skin temperature, ambient temperature, and the concentration of each of the main ingredients. Let’s also build a model of side-effects such as skin irritation as a function of the various ingredients including the “inactive” ones. Once we have a model we need data to fit its parameters, so let’s test several different formulations and measure a number of outcomes such as rashes, irritation, systemic absorption, number of bug bites, number of mosquitos that land, duration of time they stay on the skin, and etc. Finally, given this large dataset and our model, let’s estimate the optimal mixture of ingredients which balances the good effect of reduced mosquito bite rate with the bad effects of frequency and severity of skin irritation. Let’s formulate 3 or 4 different formulas based on the predictive model which are in the vicinity of the optimal but with reasonable spread in ingredient mixes, and test them in a trial. Yes, RCT if you like. Let’s compare the predicted outcomes with the actual outcomes and determine in what ways these differed and try to determine which causal factors we left out of the model if there are any meaningful deviations. Finally, when we feel that the model works well enough, we can choose which is the optimal formulation and give recommendations on application and re-applying times etc and market the product.

          This isn’t really so different from the real world way in which mosquito products are formulated. Certainly a lot of these things are done, which is why we have mosquito repellant at all. But there are plenty of situations where this sort of stuff is shortcut and/or where excessive emphasis is placed on RCT outcomes without quality theory. One I can think of recently discussed on this blog was something like treating children with malaria using suppositories if I remember correctly. One group argued that they had a mechanistic model and reasons to believe that the medication worked, the other group had some real world RCT that showed the treatment group wasn’t doing well. The model based group suggested that was to be expected because the treated group was more severely affected even before treatment (or some such thing). The RCT group said that if you don’t get a statistically significant positive improvement in an RCT then you don’t have any real effect, even if in fact there was some non-blinding and some non-randomization, and in fact as measured there were differences in initial conditions. What mattered was the stat-sig p value in the RCT. The difference in world view was pretty stark.

          IMHO the RCT is just one relatively minor component of the whole thing. Usually the reason people think of RCTs as the gold standard is because they observe that in a rush for money fame and fortune people shortcut a lot of that science stuff and just try shit and see if it works. A lot of it doesn’t.

          There are whole “methodologies” in bio-tech for “screening” experiments. Put 10000 chemicals in pipette plates and then have robots pipette them onto cell cultures and see which ones kill the cancer cells. Stuff like that. That’s the “try shit and see if it works” approach. Whole companies worth billions are built on that stuff. They don’t usually get very far.

          If you do the science you can save the world a lot of RCTs, and get to a place where we have knowledge of how mosquitos do their thing and you can accumulate knowledge, but if you try shit and see if it works you may make a ton of money, even if it doesn’t work. Like Tamiflu. So the try shit and see if it works tends to get a lot of traction. Fortunately there are still a few scientists around who spend decades eking out a living trying to perfect mRNA vaccines and such, using “we understand the biology” as the basis, otherwise we’d maybe have seen a couple tens of millions more deaths worldwide in the last 3 years.

          my view on RCTs: one small component of science.

        • Ok, I think I see the issue.

          RCT’s are the gold standard for making certain kinds of comparisons, like comparing the effectiveness of one bug repellent to another.

          But science is not about making comparisons. RCTs are not necessarily very helpful in advancing science, depending on what is being studied.

          If you are trying to develop a better bug repellent, you might identify what biochemical pathways are causing insects to identify where to go and whether to bite, and then look for chemicals that can block those pathways, etc etc. Agreed.

          But once you’ve used that knowledge to develop a new bug repellent, do you just assert that it’s better than what is out there, or do you test it? And, if you test it, how do you do so?

          Daniel, perhaps you would agree that the best approach would use controlled trials, perhaps randomized controlled trials? Anon clearly would not, but he won’t tell us what to do instead.

        • Phil:

          I think the “controlled” part is much more important than the “randomized” part, and it’s too bad that statisticians and others have become so focused on the randomization.

        • Phil, yes you’re getting my drift now!

          When making comparisons RCTs have lots of advantages, I agree with this. But they aren’t the only way to do things well, and they are really just one type of experiment and experiments are just one part of science. Particularly, suppose you’ve done a trial of a drug using dosages of say 0, 5, 10, 20, 50 mg in some relatively homogeneous population. Let’s suppose you find that 5 mg definitely doesn’t have a big enough effect and 50 mg has too many side effects. You have data on main effects and side effects for every patient. Suppose that the main effect is expected to saturate as you bind up all of the receptors, and the side effects do not saturate as they involve a separate pathway.

          There is absolutely nothing wrong with fitting a model of the main and side effect to the available data and choosing the expected optimal dose, which is a dose within the range (5, 50) mg, and just going ahead with using that dose. Let’s say it’s 18 mg.

          Do we need to get some “gold standard” of a new RCT of 18mg vs 15 or 23? No, there is no new information that the RCT really offers us because the existing data and model is sufficient. On the other hand if you decide that there are minimal side effects and calculate 130mg is the optimal, yes you should confirm using additional testing, but depending on what information is available you may well choose to use smaller trials. For example if safety tolerance was already established in an earlier trial up to 250mg and you believe efficacy saturates, then you should run a different trial than you would if efficacy doesn’t saturate etc. The model predictions inform the experimental design.
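
          A minimal sketch of the "choose the dose from the model" step, assuming a saturating (Emax-style) main effect and a side-effect burden that grows linearly with dose. The function names and every parameter value (`emax`, `ed50`, `slope`) are invented for illustration; in practice they would be estimated from the trial data, and this is not a claim about any real drug.

```python
# Hypothetical sketch: trade off a saturating benefit against a
# non-saturating side-effect burden. All numbers are invented.

def main_effect(dose_mg, emax=1.0, ed50=8.0):
    """Saturating benefit: approaches emax as the receptors fill up."""
    return emax * dose_mg / (ed50 + dose_mg)

def side_effect(dose_mg, slope=0.012):
    """Side-effect burden, assumed to grow linearly with no saturation."""
    return slope * dose_mg

def net_benefit(dose_mg):
    """Benefit minus burden, both on the same (assumed) utility scale."""
    return main_effect(dose_mg) - side_effect(dose_mg)

def best_dose(low_mg=5.0, high_mg=50.0, step_mg=0.5):
    """Grid-search only the dose range actually covered by the trial."""
    n_steps = int(round((high_mg - low_mg) / step_mg))
    candidates = [low_mg + i * step_mg for i in range(n_steps + 1)]
    return max(candidates, key=net_benefit)

if __name__ == "__main__":
    dose = best_dose()
    print(f"model-preferred dose: {dose:.1f} mg")
```

          With these made-up parameters the preferred dose lands inside the tested range, which is the interpolation case described above. An optimum far outside the range would be an extrapolation and would call for further testing.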

          Basically the gold standard for science is that you have a predictive model which works well in a wide range of conditions and that when making new predictions within the range of validity of the model the predictions are borne out in tests.

          RCTs are a good kind of experiment but by no means the only or even the universally best kind. I think Anoneuoid agrees with me here. There’s nothing wrong with collecting data from an RCT; it’s valid and useful data, but not magic and not by itself the “gold standard of science”.

        • Anon clearly would not, but he won’t tell us what to do instead.

          You use science. Derive a model that predicts the rate of mosquito bites under various conditions, then check it against observation and refine (or come up with a better model).

          The type of repellant used is only one factor that interacts with all others. Which of two repellants is associated with fewer bites on average is of very limited value. People experience the real world, not an average world.

        • Anoneuoid –

          > The type of repellant used is only one factor that interacts with all others.

          Of course this is true. It can also be largely irrelevant in some conditions.

          You spray one arm with one spray and the other with another, after deriving your different spray formulas from molecular modeling, and the importance of any variety of other factors is vanishingly small at least under those conditions.

          So then, understanding that, you experiment under other conditions to see how outcomes change – again with a similar basic RCT methodology.

          If you’re suggesting to market a spray without any such kind of controlled trial (with maybe some elements of randomization) – and nothing other than a theoretical construct, that seems rather silly to me unless you’re just banking on a killer marketing campaign.

        • Andrew,
          Understandably you did not read the entire comment string in detail. I mentioned in an earlier comment that randomization may not be so important. I still think it’s fair to call it the gold standard for many comparisons, but often the silver standard or bronze standard is good enough.

          Daniel,
          For comparing Bug Repellent A to Bug Repellent B right now — within the next few months — some sort of controlled trial is the best approach. Agree or disagree?

          Anon,
          Don’t perform a controlled experiment, “use science.” Awesome.

      • Daniel,
        I agree with your point about interpolating to find the right amount or whatever.

        I basically agree with what you’re saying but then I’m confused about why you got started down this path. I guess I should go back and read the original comments on this thread but it’s too exhausting. I think when you said you agree with Anoneuoid I assumed you meant that you agree that RCTs are not the gold standard for anything. I think they are the gold standard for a lot of comparisons, but that most science is not about comparisons. Some of it is, though!

        • I guess I interpreted “not the gold standard for anything” in a little bit less literal way than you did.

          Most of science should be about model building, it should involve a lot of different types of experiments, many of which might be laboratory controlled experiments testing just portions of our understanding, and it should involve synthesizing multiple sources of information. When it comes to determining if a theory is a good theory, we need to compute the prediction/consequences and see if the consequences of the model hold approximately in the real world. In that step *sometimes* we should do RCTs. For example bug spray A vs B in mid spring in Alaska then yeah, maybe we randomize left and right arms among 20 volunteers and do a 5 mile hike in the marshy meadows. But when it comes to say does alloying additive A make aluminum more or less ductile than additive B there may be basically no meaningful randomization, you melt down some alloys, machine some specimens stick them in a testing machine and then pull them until they break, measuring stress and strain. It’s controlled in the sense that you’re comparing A to B, but it’s not randomized in any meaningful way.
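
          For the bug spray case, the left/right arm randomization can be sketched in a few lines. This is only an illustration: the function name `assign_arms` and the volunteer labels are made up, and the 20-person size is taken from the example above.

```python
import random

def assign_arms(volunteers, seed=None):
    """For each volunteer, randomly decide which arm gets spray A and which gets B."""
    rng = random.Random(seed)  # a seed makes the allocation reproducible
    plan = {}
    for person in volunteers:
        left, right = rng.sample(["A", "B"], 2)  # random order of the two sprays
        plan[person] = {"left": left, "right": right}
    return plan

if __name__ == "__main__":
    volunteers = [f"volunteer_{i:02d}" for i in range(1, 21)]
    for person, arms in assign_arms(volunteers, seed=1).items():
        print(person, arms)
```

          Each person serves as their own control here, so the randomization only has to guard against a consistent left-versus-right difference.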

          Maybe when it comes to attempting to save a certain unique environment from destruction by some invasive plant we have to go with just experiments in laboratories with the plants and then extrapolate to the unique field conditions and choose the best one on the basis of our lab model because there’s no meaningful way in which we can get a sample of 20 identical environments to try things on. There’s only one Suisun Bay Delta or whatever.

          Sometimes we’re just building a model and then waiting for something to happen to see if it fits the model. The LIGO for example, we build some model of what a gravitational wave moving through it should do to the interferometer beams, then we wait until we see if that happens. The Michelson-Morley experiment was similar. Maybe sometimes we do “accidental experiments” like handing out coal on one side of a river and doing nothing on the other. If we have a good model of the situation, and a bunch of sources of data, we can maybe find out some effects of pollution on health. If we have a crap model and poor data, maybe we just wind up finding out nothing.

          RCTs are just “a kind of experiment that has some advantages”. If those advantages hold in the case you have at hand, then using that kind of experiment is a good idea. In every case though, if you’re doing science you’re coming up with an explanation for the way the world works and then testing to see if its predictions hold in various conditions. That’s what Anoneuoid means when he says “do science” even if he’s being a bit flippant about that.

        • If the assertion is that an RCT is not always (or even often) necessary for scientific advancement, of course I agree.

          But it seems to me like Anon really means it when he says an RCT is not the gold standard for comparing two treatments. Whereas I think it is, at least in many cases. It will be a long, long time before any model of human physiology is so good that I would trust the model over an RCT in determining the safety and effectiveness of a drug or vaccine.

          I guess I don’t really get why there’s so much confusion about randomization. It seems pretty straightforward: if there is a large component of unpredictable variation, then the two arms of a test need to be randomized to (hopefully) distribute that variation evenly across both arms of a test (e.g., testing bug dope and power pose effects); and furthermore, the number of test subjects (n) has to be large enough to accommodate all the uncontrolled variation, and you need to analyze the result as an aggregate (statistically). However, if all the variables can be fully controlled then randomization serves no purpose (e.g., testing the properties of alloys) and the number of test subjects can be reduced to one, and the result is effectively certain within the conditions of the test.
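
          A small simulation can illustrate the "distribute that variation evenly" point. It is a sketch under invented assumptions: a single hidden trait per subject, and a self-selection rule (80% vs 20% chance of treatment) chosen purely for contrast. The name `simulate` is hypothetical.

```python
import random
import statistics

def simulate(n=2000, seed=0):
    """Return the treated-minus-control gap in a hidden trait under two allocation rules."""
    rng = random.Random(seed)
    # Hidden trait (say, how attractive a person is to insects); never observed.
    hidden = [rng.gauss(0.0, 1.0) for _ in range(n)]

    # Randomized allocation: a coin flip per subject.
    randomized = [rng.random() < 0.5 for _ in range(n)]
    # Self-selected allocation: a higher hidden trait makes treatment more likely
    # (assumed mechanism, for illustration only).
    self_selected = [rng.random() < (0.8 if h > 0 else 0.2) for h in hidden]

    def gap(treated_flags):
        treated = [h for h, t in zip(hidden, treated_flags) if t]
        control = [h for h, t in zip(hidden, treated_flags) if not t]
        return statistics.mean(treated) - statistics.mean(control)

    return gap(randomized), gap(self_selected)

if __name__ == "__main__":
    randomized_gap, self_selected_gap = simulate()
    print(f"hidden-trait gap, randomized:    {randomized_gap:+.3f}")
    print(f"hidden-trait gap, self-selected: {self_selected_gap:+.3f}")
```

          Under the coin flip the two arms end up with nearly the same average hidden trait; under self-selection they do not, and any outcome comparison would inherit that imbalance.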

          IMO the big rubs when it comes to randomization are that you have to guess at a) the scope (number of variables); b) magnitude (scale of the variation of each); c) ability to control each of the variables in order to select the sample and sample size. So maybe you find a medication that seems to work better on people with a certain blood type; but just by restricting to that blood type does that mean you’ve controlled all the variation along that dimension?

        • Chipmunk

          Randomization is there to make unknown causes of variation inconsistent so that the outcomes range over the full range of plausible variation. The idea that you need “groups” is wrong; it’s mostly just administrative convenience / cost cutting. You could literally give different treatments to each person, you could do so serially (i.e., see what happens with patient 1 and then choose the treatment for patient 2 on that basis), ideally you should use all the available information to estimate an individual outcome, not just group averages. The NHST standard script for drug trials is not a scientific gold standard.

          Let’s take an example, maybe thyroid replacement. Suppose you have observed 300 individual patients as their thyroid has been damaged by autoimmune issues. You have a history of their thyroid hormone levels weekly for 1 year, and you have covariates like 5 different inflammation markers in their bloodstream, allergies they have, sex, age, weight, height, belly circumference, 20 different hormones in their blood samples every week, and have asked them to take a photo of every plate of food they’ve eaten each day on a random sample of days (to reduce the burden on them), and have estimated quantities and constituents of food eaten.

          So you build a model, it’s a dynamic compartment model involving the ways in which inflammation changes thyroid output, food intake, and hormone and inflammation marker levels and how that changes adipose tissue storage and feeds back to alter hormones.

          Based on the model you design a treatment algorithm. It involves taking blood work and other measurements then choosing a predicted optimal dose of thyroid replacement hormone, and monitoring changes monthly for a year, altering the dose each month after recomputing optimal dose from new measurements. As the data comes in each month you include it in the model and refit the Bayesian model so that the model becomes more informed about what is working.

          There is no meaningful randomization here. The system is a feedback control system, the individual is their own control (compare historical trends to current status). There is partial pooling of results across all participants. Is there anything unscientific about this? Heck no. Is there any randomization of treatment? Any necessity for p values comparing mean weight before and after treatment in groups? No. Is the treatment divided into groups at constant milligram doses? No. Etc etc.

          Doddering on about how RCT is the gold standard harms science. RCT is one kind of experiment. Nothing more or less.

        • Daniel, I simply disagree. For comparing bug repellent A to bug repellent B an RCT really is the gold standard. We are nowhere near having a model that would tell you which works better with more certainty than simply doing a trial.

          But randomization is probably not important in that case; a non-randomized controlled trial would probably be fine. Get experimental subjects to put one repellent on one arm and the other on the other arm; although randomizing which goes where is ideal, it almost certainly doesn’t really matter.

          Daniel, if you would trust a model over an experiment in a case like this, I think you’re nuts.

        • Phil, there’s two parts to science, there’s building understanding, and there’s testing the understanding against data. You aren’t doing science unless you’re doing both.

          RCTs are one kind of data. They’re particularly good at handling situations where we don’t know very much and would be ok with just finding out what the center and width of the distribution of outcomes looks like. They are appropriate where there’s virtually no understanding and you’re trying to get **started** building that understanding.

          This is ironically why they are called the gold standard of science, because they give you answers when you’re too lazy, deluded, or poor to do much science. Lazy like Wansink, deluded like people who say “biology is too hard to build models” or poor like some things really would cost hundreds of millions or billions to study properly.

          Suppose you have a carrier lotion and 10k chemicals you think might be candidates for bug repellent. Are you going to do RCTs on each of the 100M pairings? Of course not. Because RCTs actually are a poor way to extract information. They’re a ROBUST way but poor and inefficient one.
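
          A quick count of the comparisons involved, assuming "pairings" means head-to-head comparisons among the 10k candidates. The 100M figure matches ordered combinations; distinct unordered pairs come to about half that. Either way the number is far too large for pairwise trials, which is the point being made.

```python
import math

n_chemicals = 10_000

# Distinct head-to-head comparisons (A vs B counted once).
unordered_pairs = math.comb(n_chemicals, 2)

# Every (first, second) combination, including a chemical paired with itself.
ordered_pairs = n_chemicals ** 2

print(f"unordered pairs: {unordered_pairs:,}")
print(f"ordered pairs:   {ordered_pairs:,}")
```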

          How could a poor and inefficient but robust method for extracting information from experiment be “the gold standard for science”? Mainly I’d say it’s politically because they’re a cheap and robust way to avoid having to DO science.

          Chemical A vs B wrt mosquitos is actually really easy so long as the difference is considerable. Do you really think you’d need an RCT in the field to detect DEET vs water? DEET vs your favorite skin lotion? DEET vs olive oil? A cotton swab in a box would be fine until you’d found a candidate product that had sort of similar efficacy to DEET. Once you’ve got to the point you might need to run a real world RCT it’s because you’ve done a TON of ordinary science already and the differences are more difficult to discern. If you’re doing good science then along the way you’ve recorded a lot of data and you’ve built a model which tells you something about which chemical structures mosquitos don’t like…

          “RCTs are the gold standard for doing comparisons after you’ve done 30 rounds of optimization and a decade of science to get to the point where it’s no longer easy to see what works better than what” just doesn’t roll off the tongue though right?

        • I kind of left hanging the fact that indeed RCTs are useful when you don’t know much, but they’re also useful when you have optimized something to within a close margin of another thing and need to look at fine distinctions that are within the noise margin. They’re also useful politically to prevent lying/gaming the system. The big reason we use them in drugs is because of the huge political issue with just “trusting big pharma”.

          If drug approval had to be done by handing the drug to a 3rd party and the govt pays the 3rd party to determine whether the drug was good enough to be approved, then that 3rd party could run lots of other kinds of experiments than the ones we think of as RCTs. Depending on the drug, you might do something like the process I outlined above for thyroid treatment. But you could never trust a company that stands to make $100B over 20 years to do that for you. The incentives are too high for cheating.

          In short, as I’ve always said, RCTs are a useful kind of experiment, they offer a particular type of data and are appropriate in certain contexts. Some of those contexts involve political issues well outside science, other times the issue is that not much science has been done yet, other times the issue is that a lot of science has been done and we’re developing a treatment and the best we can do is too close to the performance of other types of treatments to call without something that can ensure the noise is well accounted for.

          But there’s just LOTS of science that isn’t RCTs (of the type we’re familiar with) and most of the actual understanding in the world usually comes from those other kinds of experiments. Most of the time we need at least some kinds of controls. But it’s even possible to learn about the world without what we’d traditionally call controls (for example if you’re extrapolating a lab built model to a unique condition where no control is possible, say the rehabilitation of a particular unique environment invaded by invasive species).

        • Daniel –

          > Suppose you have a carrier lotion and 10k chemicals you think might be candidates for bug repellent. Are you going to do RCTs on each of the 100M pairings? Of course not. Because RCTs actually are a poor way to extract information. They’re a ROBUST way but poor and inefficient one.

          I doubt that there are many analogous scenarios where RCTs are on the table.

          Typically, they aren’t the first step of the process that just pops up out of nowhere, and instead there is a model in some sense that inspires the RCT. In order to get it funded you have to identify a gap in the existing literature and argue a plausible reason why the RCT might close that gap.

          Of course the process is sub-optimal – but I don’t think it helps to argue from such an unrealistic rhetorical framing.

          (not that Phil’s “you’re nuts” is any better.)

        • Joshua,

          I came up with a scenario where you might want to do some science, but RCTs wouldn’t be appropriate (a 10k compound search), you agree they wouldn’t be appropriate there and then chide me for coming up with a scenario that is so far outside where RCTs would be used. But that’s exactly my point, there are vast caverns of science outside where RCTs would be appropriate!

          I just spent a whole couple of posts describing under what conditions RCTs would be the appropriate type of experiment. It’s either where you have a small number of possible treatments and only need to know which of them is better in some specific circumstance without needing to understand anything beyond the outcome… this is pretty much “engineering” and isn’t science at all. The agricultural field trials that a lot of classical stats were designed for were kinda like that. You just need to know for this spot in North Dakota how much fertilizer per acre should you apply this year given the current conditions to maximize yield? It won’t generalize to some spot in Iowa or Illinois, so you’re not doing science, just some engineering.

          Or

          You’ve used up your other forms of scientific “understanding” in producing some thing that does a job and by virtue of it having been optimized through a lot of previous modeling and experiment you get to the point where it’s not obvious whether A or B is better (they’re both “near” optimum compared to the measurement noise and uncontrolled effects), and you need to maximize the entropy of the measurement noise and uncontrolled effects so you can determine that the small remaining differences are due to your treatment rather than biases in uncontrolled effects.

          There’s a HUGE swath of science between those two bookends where RCTs are neither appropriate nor needed, it’s essentially ALL of science where the ultra formalized style of RCTs we think of in drug trials or policy interventions aren’t needed.

          Sometimes you might do some kind of highly informal RCT, i.e. we’ve got 100 things to measure, and 4 instruments to do it on, let’s measure each thing on 2 randomly selected instruments to be sure we don’t have some specific instrument related issue biasing our data… You could call that an RCT, but I’m excluding that kind of highly informal use from what I mean. You could replace that RCT with a specific calibration scheme to ensure all instruments read the same on known samples, but it can be cheaper to randomize to the various instruments and include the calibration calculations for the instruments in the modeling.
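
          That informal instrument randomization is easy to sketch. The names below (`assign_instruments`, the item and instrument labels) are hypothetical; the only thing taken from the text is 100 things, 4 instruments, 2 per thing.

```python
import random

def assign_instruments(items, instruments, per_item=2, seed=None):
    """Pick `per_item` distinct instruments at random for each item to be measured."""
    rng = random.Random(seed)  # seed kept so the allocation can be reproduced
    return {item: rng.sample(instruments, per_item) for item in items}

if __name__ == "__main__":
    items = [f"sample_{i:03d}" for i in range(100)]
    instruments = ["inst_1", "inst_2", "inst_3", "inst_4"]
    schedule = assign_instruments(items, instruments, seed=42)
    for item in items[:5]:
        print(item, schedule[item])
```

          Because every item is read on two instruments, an instrument-specific offset can then be included as a term in the model rather than calibrated away beforehand.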

          I keep saying RCTs are one kind of experiment that are appropriate in a specific set of contexts, but those contexts hardly qualify as “gold standards for science”. It’s like saying “shoelaces are the gold standard for fastening things” ignoring glue, drywall hangers, lag bolts, velcro, tape, sewing, carriage bolts, hex head bolts, sheet metal screws, buttons, snaps, rivets, welding, brazing, epoxy…

          Saying shoelaces are the gold standard for fastening things just means you can’t really conceive of much outside of footwear, it’s a sign of a narrow-minded focus on getting your pharmaceuticals approved so you can get rich, or stuff like that.

          There’s NOTHING WRONG with RCTs; they’re just not even a core component of science, much less “the gold standard for science”. They are, ultimately, a specialized tool for use in certain kinds of conditions.

          In many ways the mantra “RCTs are the gold standard for science” is really a kind of political propaganda, which is why I’m pushing back so hard.

        • Daniel –

          As happens sometimes when these convos get stretched out (and I think maybe it happens a relatively high % of the time with Phil), this could largely be pivoting around a question of semantics.

          > In many ways the mantra “RCTs are the gold standard for science” is really a kind of political propaganda, which is why I’m pushing back so hard.

          I don’t know how you’re viewing it as a political agenda, exactly, but I’m familiar with a particular teaching context where RCTs might well be described as a “gold standard,” but as I’ve said a number of times it’s meant in the sense that there’s a hierarchy of study types basically ranging from case reports, expert opinions, observational, cohort studies (particularly cross sectional) at the one end and RCTs (preferably longitudinal) toward the other (with meta-analyses above them). I’ve never seen it taught to even remotely suggest that RCT interventional science would supersede or stand in for or obviate the need for theoretical modeling or bench work or anything like that. It looks to me like you’re arguing a false dichotomy due to semantics and definitions that alter the direction of approach.

          I don’t think that typically the “science” part of “gold standard of science” – with that latter part affixed by the discussants here – is typically meant to reference all or even most parts of the empirical process.

          In the end, I don’t at all disagree that the utility of RCTs is circumscribed within the entire scientific endeavor.

          Let’s go back to where this all started – when Anoneuoid picked out the following excerpt, from an article I linked, to vent his spleen about how no one else besides him and a few others on his enlightened plane of excellence truly understands “science.”

          Take the case of artificial sweeteners. Randomized studies — in which people are randomly assigned to one treatment or another to ensure that no other factors interfere — are considered the gold standard

          That is a very particular context – ways of understanding how diet affects health outcomes (across a variety of outcomes). The reference is against observational studies where the analytical paradigm is to just observe correlations across non-random populations. Obviously, IN THAT CONTEXT there are clear advantages to an RCT. But even there, in the very next sentence the authors go on to elaborate upon the limitations of many RCTs. And I highly doubt the authors could in any reasonable way be interpreted as arguing that RCTs are a “gold standard” as measured against basically any part of the scientific process, as it seems to me their statement is being interpreted. The article is basically a critique of the quality of most RCT-based nutritional science!

          Indeed. It seems that you and, I might guess, Anoneuoid are arguing about RCTs within a particular political context, but not all references to RCTs as a “gold standard” are delivered in that political context.
          The two contexts can exist in parallel. Affixing “of science” to “RCTs are a gold standard” and leaving out the [as compared to non-randomized cross-sectional, observational studies] is problematic, and it looks to me like it functions as a way to politicize the frame. I get the valid reasons for engaging the associated politics but sometimes a cigar is just a cigar.

        • The article is basically a critique of the quality of most RCT-based nutritional science!

          Yep, I was agreeing with them on that aspect. But I do not agree with the suggestion to do a bunch of crappy observational studies without some kind of theoretical model guiding them. That is just yet another way to generate conflicting results.

        • > But I do not agree with the suggestion to do a bunch of crappy observational studies without some kind of theoretical model guiding them.

          The article suggests a particular observational study that would improve upon previous observational studies where RCTs are inherently very difficult to carry out (mostly because of the control part of the RCT) and which could leverage the passage of time in a way that an RCT would take decades to manifest, in a context where the capacity to model is inherently limited due to a lack of scientific knowledge of how/what to model.

          Perhaps just standing around and doing nothing would be better, but I tend to doubt it.

          If it’s a matter of limited resources and they get disproportionately directed towards RCTs and not towards exploring the associated empirical underpinnings, then that seems like a legit gripe but in itself says nothing about the value of the study they suggest. If there’s no clear modeling analysis that’s being starved of funding due to opportunity cost then I fail to see the problem. Life is sub-optimal and so is research.

        • Here’s an example of one of these long-term observational nutrition studies:
          https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7832857/

          They found high sodium diet is associated with *lower* blood pressure. Does this convince either of you to change your diet?

          It makes perfect sense to me. Since sodium has such a pronounced acute effect on BP, we should expect the opposite effect of chronic high sodium consumption. That is just how complex systems of feedbacks generally work.

          But is such a study capable of changing the mind of anyone who didn’t already think so? I doubt it.

          If we want better nutrition info, we need to make it easy to collect that info. I.e., finding out my serum concentration of whatever vitamin/mineral should be as cheap and simple as checking my blood sugar. This area has essentially been neglected for 70 years.

          We do not need more of these studies that yield conflicting results because we have no idea what factors need to be monitored/controlled.

        • Joshua,

          Brian Wansink ran a lot of RCTs. Bottomless soup bowls vs normal ones, stuff like that. Did we learn a lot from them?

          The political context is this… People can get funding for flashy “gold standard” experiments. Let’s check ivermectin vs standard of care, let’s check hydroxychloroquine vs standard of care, let’s check bottomless soup bowls vs normal, let’s check a different dose of X compared to the one that failed in the last trial… Etc.

          By claiming to be doing “gold standard” research it’s possible to divert a lot of resources that could be utilized to build an accumulating research program utilizing modeling, bench science, animal models, biochemical techniques etc.

          But those animal models and bench science and biochemistry etc. are “not the gold standard” and don’t produce flashy human-applicable “translational” results on a timeline of a few months to a couple of years, and as a result they struggle to get funding.

        • In this podcast, there’s quite a bit of discussion of this very issue:

          https://hubermanlab.com/tim-ferriss-how-to-learn-better-and-create-your-best-future/

          They discuss the limitations of RCTs relative to “anecdata,” and advocate to some degree for the power of small-scale observational data, not least for understanding nutrition (à la “I’m going to talk to people and try stuff and see what happens for me”).

          I have mixed feelings in reaction. On the one hand I get the point how RCTs have an ocean liner-like effect and it takes a long time to turn protocols around. There are inherent weaknesses and it’s a cumbersome process. On a smaller scale you can be more nimble and more responsive. On the other hand, while I think these guys are on to some interesting findings that can push the envelope, I also think there’s a grifter component where sometimes they push some sketchy stuff based on pretty flimsy evidence that’s oversold (Huberman constantly hawks supplements).

          It’s all tricky and like for many things I think a middle path is warranted.

        • Daniel –

          > Brian Wansink ran a lot of RCTs. Bottomless soup bowls vs normal ones, stuff like that. Did we learn a lot from them?

          There’s a potential problem here with logical consistency. There are countless examples of fraud that result from pushing science that doesn’t rely on processes such as those comprising RCTs. Fraud is not inherently a function of RCTs and I don’t think that there’s some kind of differentially positive association. On the other hand, sure, RCTs are not some sort of guarantee that there won’t be fraud. One question is whether in the real world they serve a function of REDUCING the opportunities for fraud. I would guess yes, but with a low confidence. Clearly there are many, many examples of fraudulent science being pushed pretty much as the result of skirting “establishment” scientific processes such as RCTs. In the end, I do think that in many cases they enhance the robustness of many research projects. I’ve seen it happen.

          Again, I get that the use of RCTs overlaps with a “political” argument. But the use of RCTs is not inherently political. Effectively, you’re an advocate in the associated political struggle and I can respect that but I think you’re using a hammer and seeing nails, in a sense. You’re seeing RCTs and you’re seeing politics, but RCTs aren’t a fundamental issue here. For the most part (but not entirely) they’re a symptom of the problem.

          In the examples of IVM and HCQ, the use of RCTs was beneficial for providing some basis for disabusing people of potentially harmful disinformation. But they also became a tool of combatants on both sides. People became locked in bickering about, and not infrequently gaming, RCT methodology as a proxy for another kind of more straightforward partisan left/right ideological struggle – but that’s the fault of the combatants not RCTs. The parallel is blaming “climate science” for the politicization of climate change; again, imo, blaming the symptom for the underlying disease of cultural cognition, motivated reasoning, etc.

        • Joshua, we’ve converged a bit I think.

          Yes, the main thing that RCTs are important for is not science per se but rather their robustness to bad actors. They’re the “gold standard” in finding out if someone has been scamming you. That’s pretty much definitely true. The formalization of it all makes it harder to cook the books.

          But we still don’t necessarily fully understand each other. I’m not advocating more purely observational science. I’m advocating more bench science, more model building, more variety in animal models, more small-scale non-randomized human trials testing actual outcomes against model-predicted outcomes, more looking into large-scale observational data and running analysis of that. All of it put together. The US has crap medical records, but places like Sweden and the UK and maybe Canada, and the VA in the US, have a lot better records.

          Do long-lasting injectable anti-psychotic drugs overall produce better outcomes than pills in people diagnosed with schizophrenia? Do they reduce overall societal burden of disease? We could RCT this with 10000 patients for $50-100M spread across 50 VA sites, or we could sit down and interview the people who treat these people, find out what the common causes for people to have psychotic breaks are, build an agent-based model that models the lives of these people in terms of frequency with which various events happen, fit that model to 100 patient histories, and then investigate the plausible bounds for changes in frequency caused by using the injectables vs pills. Let’s include in the costs the cost of criminal behavior that occurs in the very severe cases (my sister treats these people so I know what kinds of situations typically happen). I think you’d find that the model suggests strongly that injectables are a huge win for everyone. Since the model would have specific predictions, we could treat a small group, say 100, with injectables, and compare the rate of events in the 5 years prior with the rate in the 2 years post… Cost to do this study would be dramatically less than $50-100M. After doing this study we may have good reason to broaden the usage of these drugs, and continue to improve the predictive model, eventually leading to a situation where it’s convincing to shift almost everyone in some category (or not, who knows).

          Under what conditions do blood pressure reduction treatments produce benefit? harm? What is the frequency of incidents like my dentist had where interactions between multiple heart medications led to her being hospitalized and unable to work for 4 months? How do those particular medications interact biochemically? What commonality is there in biochemical action of multiple medications among people with similar emergencies? Can we identify biochemical causal pathways that lead to drug interactions among these drugs? Can we come up with decision-rules that can prevent these interactions which could be incorporated into medical records systems?

          Why do people have chronic sinusitis? Turns out we actually know why; some bench scientists figured it out a decade ago. It is almost unknown among clinicians, and if you look on UpToDate they still recommend as first-line treatment the antibiotics that actually lead to the loss of microbial diversity that causes the problem: https://www.ucsf.edu/news/2012/09/98691/sinusitis-linked-microbial-diversity That insight from bench science changed my life. It wasn’t an RCT; it was building a proper model of the causal mechanism.

          In each of these kinds of cases there’s an integrated holistic approach that I’m advocating… Look at observational data, come up with hypotheses about what might be causing them, try to create a bench model that is relevant to the problem, do biochemistry to figure out what drives the bench model, make predictions about what might make the situation better, or worse, look to see if those predictions are borne out in observational data, then do small scale trials, contact physicians and have them run their BP patients through your interaction detector and re-adjust their medication mixes, or have your sinusitis patients do microbial transplantation from a healthy person in their family… etc etc.

          When we understand what’s going on, often the magnitude of the effect we can get is so large that no RCT is needed. I could convince you in a week that transplantation of sinus effluent after nasal rinsing from a healthy patient into a chronic sinusitis sufferer solves chronic sinusitis in the majority of cases. We could do it with 10 patients with a history of 6 months or more of chronic sinusitis. It wouldn’t be “just a coincidence” that they were suddenly better and stayed completely better within 3 or 4 days of once-a-day transplant. The effect size would be huge. But it’s not an “RCT” so it’s apparently not “the gold standard”.

          People like to say things like “biology is just so complicated we can’t possibly rely on this mechanistic sort of stuff and need to do real world trials” then we do decades of real world trials trying different antibiotics and surgery while tens of millions of people suffer, only to have some bench microbiologist sit down and figure out that sure enough **it was the frikin antibiotic treatments causing microbial diversity loss** and you know what? it wasn’t that hard to figure out if you put your mind to it.

  9. For Covid-19 related research, we can start with something less grandiose than publishing every correlation. We’d have made a lot of progress if all researchers published the details of their models, i.e. model equation, coefficients, standard errors, goodness of fit, the usual stuff. It is beyond frustrating when most papers describe verbally what the model is (“Cox regression model adjusted for X1, X2, and X3” is typical), then present estimated Y values and nothing else. I doubt that any peer reviewers would be able to properly judge any of these models, and all they could do is to rubberstamp them based on whether the estimated Y values “make sense”.
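
To make "the usual stuff" in the comment above concrete, here is a minimal sketch of what a fully reported model could look like. It is an illustration only: it uses a one-predictor ordinary least squares fit as a stand-in (not the Cox regression mentioned in the comment), and the function name `fit_simple_ols` and the toy data are hypothetical, not taken from any paper discussed here.

```python
import math


def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares and return the
    quantities a reader needs to judge the model, not just fitted values.

    Hypothetical helper for illustration; a stand-in for whatever model
    a paper actually uses.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")

    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation, slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual and total sums of squares for the goodness-of-fit summary.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)

    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    sigma2 = sse / (n - 2)
    se_slope = math.sqrt(sigma2 / sxx)
    se_intercept = math.sqrt(sigma2 * (1.0 / n + mean_x ** 2 / sxx))

    return {
        "equation": "y = intercept + slope * x",
        "intercept": intercept,
        "slope": slope,
        "se_intercept": se_intercept,
        "se_slope": se_slope,
        "residual_std_error": math.sqrt(sigma2),
        "r_squared": 1.0 - sse / sst if sst > 0 else float("nan"),
        "n": n,
    }


if __name__ == "__main__":
    # Toy data, made up purely to exercise the function.
    report = fit_simple_ols([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
    for name, value in report.items():
        print(f"{name}: {value}")
```

The point of returning the equation, coefficients, standard errors, and fit statistics together is that a reviewer can check the model itself rather than only whether the estimated Y values look plausible.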

  10. I suspect Phil has never actually worked in a lab and done experiments.

    E.g., test the memory of some rats by seeing how long it takes for them to find a submerged platform (water maze). Then you see that some *like* swimming, others hate it. Some swim by the platform and check it is there with their tail without climbing on (they know you will pull them out after a minute anyway). Some are very anxious and freaking out, possibly because the cagemate bullied them and ate all the food. There is a memory component, but is it 50%, 30%, who knows? And everything is like this. Further, humans are more complicated creatures than rats.

    When you see a difference between two groups, it tells you very little about what is going on. The very first thing to do is model the phenomenon you are measuring, then once you have some understanding of it you can compare groups.

    For mosquito repellent, we should first collect some baseline data. When someone goes camping, golfing, or sitting around a fire in the backyard how many bites per hour do they experience? We want to know the distribution under various conditions over time, then ask why some get bit much more often than others. I’d also like to know the “mosquito density”, which we could estimate by setting up some traps. I’m sure wind, temperature, humidity, time of day, season also matter.

    Perhaps repellent A is better than repellent B under some conditions but not others. Without knowing this, we will generate an endless series of conflicting results. This is exactly what we see in the medical literature.

    Then there are the “side effects”, i.e. all the other stuff putting a substance on your skin may do.

    • I’m also reminded of mosquito netting. These nets are now being used worldwide for fishing (since starvation is a higher priority than mosquitos/malaria):

      However, despite this, concerns have been raised regarding unforeseen impacts of the distribution of billions of insecticide treated MNs. MN ‘misuse’ has been of growing concern from an operational viewpoint for the health community, for example with nets used as crop coverings or protection for granaries [4]. One such concern for both natural resource management and health is the use of MNs within artisanal fisheries [5,6]. With at least 154 million MNs estimated to have been distributed in 2015, and similar numbers in previous years [3], it can be surmised that the incidence of MN fishing is potentially very high, and unlikely to decrease without intervention.

      Fine mesh sizes (usually ≤3mm) are critical for exclusion of mosquitos, but render MNs used in fisheries almost entirely unselective in terms of small fish. Reportedly high juvenile fish capture rates [6] are coupled with reports of MN use in mangroves and seagrass beds—important nursery grounds for fish [7]. Additionally, the broad availability and low cost of the nets may be leading to increased fishing pressure from additional fishers entering the fishery [8].

      https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0191519

    • Anoneuoid, I gotta stand up for Phil here. I definitely think he knows how to do experiments, though you’re right he probably hasn’t done any on rats.

      He can give you more details, but from what little I know, Phil worked for many years for a DOE lab in Berkeley doing building energy efficiency, electrical usage, thermal modeling, air quality, etc etc. I’m guessing he worked on experiments like where people painted roofs different colors and observed energy usage afterwards, or installed different HVAC equipment, or different kinds of thermostats, and I know he worked on Radon pollution, maybe comparing different kinds of mitigation measures, but also predicting the risk of Radon pollution without doing monitoring, etc.

      I’m pretty dang sure he knows how to do experiments within that kind of context, and he also has already said that he agrees that there’s lots of science outside of where RCTs are needed.

      The only thing he and I really seem to be disagreeing on is something about precisely how important RCTs are.

      I say, in essence, that RCTs are great for narrow usages. Phil seems to say that the narrow usage is nevertheless really really important and RCTs do a much better job in that narrow usage, and so it’s appropriate to call them “the gold standard” for that stuff and that stuff is important enough that to say “the gold standard of science” is not too far off the truth for him.

      He and I both seem to agree on the following statement “RCTs are the best way to determine the distribution of outcomes from a small number of treatments applied to a specific group in real world conditions when the outcomes are close enough to similar that it’s not obvious “by eye” which is better”.

      Where we disagree seems to be whether that’s sufficient to call them “the gold standard for science” or whether that’s really just “a small part of science, almost insignificant amongst the vast sea of things science involves”

      Phil seems to agree that there’s a vast sea of things, but apparently thinks the RCT stuff is really really important anyway, whereas I think that’s a politically motivated position designed to cement certain economic actors’ methods of rent extraction (i.e. pharma, econ / policy pundits, etc).

      I think it’s very telling that you’ll virtually never see an RCT outside of a context where the question is hugely political/economic. Want to know whether rats spread disease faster than mice? You’ll see a ton of different ecological research methodologies not involving RCTs and then some papers summarizing it. Want to know whether people should be allowed to sell a drug that supposedly increases your lifespan post cancer diagnosis by some amount but if you gave it to sick people they’d hardly notice how much it changes their outcomes? RCTs are your thing!

      • Daniel –

        > He and I both seem to agree on the following statement “RCTs are the best way to determine the distribution of outcomes from a small number of treatments applied to a specific group in real world conditions when the outcomes are close enough to similar that it’s not obvious “by eye” which is better”.

        Again, this looks like semantics and it looks deliberately framed to diminish the value of RCTs. The specific group might be a (theoretically) representative sample of a large group. And the impact might be obvious by eye but the exact causal relationship isn’t. RCTs aren’t only used to test an intervention but are sometimes meant to experimentally explore mediators or moderators or interaction effects as a step towards follow-on intervention studies. Again, this is where there’s much crossover and interaction between theoretical modeling and RCTs, which as I read it instead seem to be mutually exclusive in how you’re approaching the question. In many cases RCTs are a part of modeling and modeling is a part of RCTs.

        And with that I’ll butt out.

        • Joshua, sure, RCTs are a kind of experiment, and experiments are what make science work. If someone said “experiments are the gold standard of science” I would rubber stamp that statement. An RCT to me means something more than “we used some randomization somewhere during the experiment and we had some experimental units that were used differently from other experimental units”. You could define RCTs so broadly that it just means “experiments” perhaps, but I don’t think that’s what most people talk about when they say RCTs. What I’m talking about is the kind of experiment run by Pfizer on the COVID vaccine, with a lot of formalized prespecified decision making and zero deviation from protocol allowed, etc. Imagine running a pharma-style RCT every time you need to determine the cured strength of concrete or which epoxy additive does what to machinability of epoxy joints or how quickly mice recover from surgery with a given suture or whatever.

        • I always liked this after doing similar experiments:

          For example, there have been many experiments running rats through all kinds of mazes, and so on—with little clear result. But in 1937 a man named Young did a very interesting one. He had a long corridor with doors all along one side where the rats came in, and doors along the other side where the food was. He wanted to see if he could train the rats to go in at the third door down from wherever he started them off. No. The rats went immediately to the door where the food had been the time before.

          The question was, how did the rats know, because the corridor was so beautifully built and so uniform, that this was the same door as before? Obviously there was something about the door that was different from the other doors. So he painted the doors very carefully, arranging the textures on the faces of the doors exactly the same. Still the rats could tell. Then he thought maybe the rats were smelling the food, so he used chemicals to change the smell after each run. Still the rats could tell. Then he realized the rats might be able to tell by seeing the lights and the arrangement in the laboratory like any commonsense person. So he covered the corridor, and, still the rats could tell.

          He finally found that they could tell by the way the floor sounded when they ran over it. And he could only fix that by putting his corridor in sand. So he covered one after another of all possible clues and finally was able to fool the rats so that they had to learn to go in the third door. If he relaxed any of his conditions, the rats could tell.

          Now, from a scientific standpoint, that is an A‑Number‑1 experiment. That is the experiment that makes rat‑running experiments sensible, because it uncovers the clues that the rat is really using—not what you think it’s using. And that is the experiment that tells exactly what conditions you have to use in order to be careful and control everything in an experiment with rat‑running.

          I looked into the subsequent history of this research. The subsequent experiment, and the one after that, never referred to Mr. Young. They never used any of his criteria of putting the corridor on sand, or being very careful. They just went right on running rats in the same old way, and paid no attention to the great discoveries of Mr. Young, and his papers are not referred to, because he didn’t discover anything about the rats. In fact, he discovered all the things you have to do to discover something about rats. But not paying attention to experiments like that is a characteristic of Cargo Cult Science.

          https://calteches.library.caltech.edu/51/2/CargoCult.htm

          Unfortunately no one has ever really established who Feynman was referring to. Possibly this never happened and he made it up, or maybe combined multiple anecdotes he’d heard into a narrative that fit his beliefs.
