John Williams writes:
Here’s a new zombie idea.
Claims that the Wuhan lab is the likely source of SARS CoV2, and that Anthony Fauci squelched inquiries into this to please the Chinese, have reared up lately in places not normally given to right wing hysteria, for example a recent New Yorker article by Daniel Immerwahr (Doctor’s Orders). The clear consensus of qualified scientists, expressed most recently in a new World Health Organization report, is that the available evidence points to a zoonotic spillover at the Wuhan wet market as the most likely source, although the lab leak hypothesis can’t be ruled out. I don’t expect this report to change any minds, however.
This is a great topic to bring up on the blog, given that a few months ago we discussed two articles, one by physicist Michael Weissman and one by economist Andrew Levin, both arguing that there are strong odds in favor of the lab leak hypothesis.
In contrast, the World Health Organization report concludes:
The work to understand the origins of SARS-CoV-2 remains unfinished. . . . while a zoonotic origin with spillover from animals to humans is currently considered the best supported hypothesis by the available scientific data, until requests for further information are met or more scientific data becomes available, the origins of SARS-CoV-2 and how it entered the human population will remain inconclusive.
So there’s real scientific disagreement here! Two groups studying the same data, one group saying the evidence isn’t clear (the best they can say is that one hypothesis is “currently considered the best supported by the available data”) and the other saying the evidence in favor of the other hypothesis is overwhelming.
What’s going on?
When this came up in February, I wrote that I didn’t know what to think:
It’s hard to compute Bayesian probabilities for this problem, not so much because of the priors but because of the likelihood, which is the probability of the data given the model. One problem is the selection of what is considered to be data, the other problem is that the model (“lab leak” or “wet-market leak” or whatever) is not clearly specified–in statistics jargon, these are “composite hypotheses.” This is not a criticism of these particular papers per se; it’s just a general difficulty with this sort of analysis. It’s not clear to me that Bayesian inference is the right way to attack this sort of problem. But I’ve been intimidated by the technical biological details in all these analyses so I haven’t looked at them personally.
For example, there’s this from Weissman’s article:
A natural outbreak could come from any of a diverse collection of pathogens, but this outbreak matched the specific subcategory of virus being studied in the Wuhan labs. Another update will come from a special genetic sequence that codes for the furin cleavage site (FCS) where the UNC-WIV-EHA DEFUSE proposal suggested adding a tiny piece of protein sequence to a natural coronavirus sequence. The tiny extra part of SC2’s spike protein, the FCS that is absent in its wild relatives, has nucleotide coding that is rare for related natural viruses but seems less peculiar for the most relevant known designed sequences, the mRNA vaccines.
I don’t know viruses, and this sort of thing is just impossible for me to judge.
Similarly, there’s this from the WHO report:
One of the viruses currently known to be closest to SARS-CoV-2 (RaTG13) was described along with an original characterization of SARS-CoV-2 by Zhou et al. (Zhou et al., 2020b) An addendum to this paper mentions that the virus material had been available since 2012/2013 at the Wuhan Institute of Virology in the form of an uncultured original animal stored sample (ID4991) with published partial sequence characterization in 2016, amended to near full sequence characterization in 2018, renamed RaTG13 . . . After the first identification and sequencing of SARS-CoV-2, comparison of the partial sequence led researchers to complete sequencing of the full genome of RaTG13. However, the RaTG13 genome has a low similarity with SARS-CoV-2 in the critical RBD of the spike protein (only 11 of 17 receptor contact residues are conserved).
The similarities and differences between these viruses must be very relevant to the probability of a lab virus becoming COVID, but what do I know about these viruses? Absolutely nothing! So right there I run into a wall. Which is kinda scary because I think I’m pretty good at assessing evidence, and if I’m so stuck here, what would that say about the ability of other non-experts to figure this out?
Usually the way to go is to rely on the experts, which in this case is the WHO report, so I guess my inclination is to go with that for now. But then I have to be clear that this is why I’m relying on it, because I find it difficult to otherwise judge. And, indeed, the WHO report itself cites other authorities who are equivocal:
On 2 April 2025, the French National Academy of Medicine published its report entitled ‘From the Origin of SARS-CoV-2 to the Risks of Zoonoses and Dangerous Virus Handling’, which analysed two hypotheses: 1) SARS-CoV-2 is a zoonosis and 2) the origin of SARS-CoV-2 is linked to a laboratory accident. The Academy did not reach any firm conclusions on the subject, which it has taken as a starting point for reflection on possible recommendations and proposals for action to better anticipate and react to emerging diseases.
So, uncertainty!
In theory I guess that some qualified epidemiologist could reconcile Weissman’s report and the WHO report. Such reconciliations are possible, but it can take some work and it might not attract much interest; for example, our paper, Reconciling evaluations of the millennium villages project, has been cited only 2 times since its publication in 2022. Fair enough, as there’s not so much interest any more in that sort of international development project; still, a bit frustrating. Given that Weissman’s paper itself remains unpublished, it’s not clear that someone with the qualifications to evaluate all this evidence would want to put in the time and effort to perform the reconciliation.
The job could be done, I’m sure. Nick Brown and I did something of the sort in our recently published assessment of certain published claims of mind-body healing, and that indeed took a lot of work on our part. You’d just need someone with a similar temperament and reputation to Nick and me, along with the necessary biology expertise.
P.S. I looked up the other John Williams, and . . . he composed lots of soundtracks before he did Jaws. Lots and lots! His first soundtrack was in 1954, he did a couple of the Gidget movies, a bunch of TV movies, the score for Fiddler on the Roof, the Poseidon Adventure, all sorts of things. I had no idea. Quite the career.
It was a remarkable coincidence that COVID started in the same city
as one of the world’s leading coronavirus labs, but coincidences do
happen. But in 2021 (from a leak) it was revealed that in 2018
a grant application had been made (DEFUSE), jointly with the Wuhan
lab and US researchers, to construct a virus very very strongly
similar to COVID-19 *in Wuhan*.
There is basically no chance at all that this is a coincidence.
I am a friend of Michael Weissman and have discussed his analysis,
but in my view the answer here is so obvious that a formal analysis
is not really necessary.
Unfortunately the issue is so politically loaded that organizations
such as the WHO cannot be relied on for an unbiased analysis. If they
can possibly avoid pointing the finger of blame they will. And passions
do run high. I had a potential collaborator pull out of an entirely unrelated
project, because he decided that I must be a right wing bigot since I
thought that COVID almost for sure originated in a Wuhan lab.
+1
>>[B]ut in my view the answer here is so obvious that a formal analysis
>>is not really necessary.
How do you know this isn’t a type 1 error? You don’t. And you can’t without evidence of what caused Covid. So, the conclusion that one should reach is to assign equal probabilities to all hypotheses and allow the evidence to decide which is more likely.
Of course it might be a type I error but in 2018 the Wuhan lab wanted
to build a virus extremely like COVID including inserting a “Furin cleavage site” exactly where it exists in COVID-19. What are the chances this is
unconnected with the pandemic outbreak? Weissman attempted to
quantify this but it is clear that the probability must be vanishingly small.
Really? Can you point to the sentences in the DEFUSE application that state that? From my reading, the very limited and preliminary description of sequence engineering doesn’t support your assertion. Maybe you can point to some additional descriptions, figures, etc., but a search for “furin” in the DEFUSE application finds (apart from the word “furin” in a reference):
p. 13 “We will analyze all SARS-CoV S gene sequences for appropriately conserved proteolytic cleavage sites in S2 and for the presence of potential furin cleavage sites”. And then developing the point they say: “SARS-CoV with mismatches in proteolytic cleavage sites can be activated by exogenous trypsin or cathepsin L. Where clear mismatches occur, we will introduce appropriate human specific cleavage sites and evaluate growth potential in Vero cell and HAE cultures.”
So there’s nothing “exact” about the description of the furin site in the application. The application states that they will introduce “appropriate human specific cleavage sites” but the Cov-19 Spike protein furin site isn’t one of the human specific cleavage sites known at the time of the appearance of Covid-19. Of course, if a furin cleavage site appears in a Spike protein there is only one place it could possibly be (on an accessible mobile loop on the external faces of the spike trimer where cleavage separates the S1 and S2 subunits). The fact that a furin cleavage site is found in the only place a furin site could be is completely unremarkable.
Drilling down, there are lots of things that make me skeptical of the lab origin conspiracy theory that argues on the basis of vague statements about the DEFUSE application. For example, there is an important 675-QTQTN-679 sequence just N-terminal to the furin sequence motif (681-PRRAR/SV-687, where the numbers are the amino acid sequence positions and the “/” is the site of cleavage). Removing this sequence greatly attenuates infectivity, as does mutating the T’s to remove the possibility of glycosylation. Are you suggesting that the Wuhan lab introduced this site too? Where is this in the DEFUSE application? As far as I can see the importance of this site was only known since well after the start of the pandemic (published in 2023). The DEFUSE application does talk about introducing glycosylation sites but in the application these are N-linked sites. The important glycosylation site near the furin site is an O-linked (T = Thr) site….and so on… Are you suggesting that where the application refers to N-glycosylation they actually meant to say O-glycosylation??
Seems to me that as with all conspiracy theories the theory’s success feeds off uncertainty and requires absence/hiding of information to support the chosen certainty (the conspiracy theory).
btw I have no preference for any of the possible origins as described, for example, in the recent WHO report. I’m pretty comfortable with not knowing until we have appropriate evidence. In the meantime it would be derelict to stop investigating zoonotic origins.
Dear Nick, why do you think virologists before 2020 were so interested in furin cleavage sites?
Could it possibly be because they can be determinants of host-specificity and spillover, and therefore would be expected in a pandemic virus?
The scientists literally wrote a grant proposal (citing over a decade of literature that virologists worldwide are familiar with) that happened to note that this is a feature that would be a risk factor in a natural virus. And then a viral epidemic breaks out with a virus that has one such feature. It’s exactly what had been predicted, so actually it is not coincidental at all: it is exactly as expected.
And since then, scientists have indeed determined the furin cleavage site is important for transmission in humans. Therefore, it’s almost as if a pandemic SARS2-related coronavirus almost certainly would have to have one. So this isn’t a coincidence at all: it’s just selection, and it was all predicted by virologists around the world before the pandemic.
By the way, some more facts on the DEFUSE proposal:
1. The grant proposed modifying cleavage sites by mismatches, not inserting them. This is consistent with all past experiments on modifying coronavirus cleavage sites in the literature (none of which have been performed by Dr. Shi’s lab or in Wuhan, by the way).
2. That section of the grant was written by Ralph Baric, to be performed in his lab in UNC, in the US, not in Wuhan. This is his area of expertise, not Dr. Shi’s. So the “Wuhan lab” did not “want to build a virus”. The UNC lab did.
3. The rest of the grant states that they will be inserting spike genes into known existing viral backbones in Shi’s lab – exactly as her lab has done before, and exactly as predicted would be a feature of an engineered virus in 2020 by Andersen et al. 2020 – and exactly what we do not observe in SARS2.
4. The draft of the grant states that they would use a canonical furin cleavage site, not like the non-canonical site observed in SARS2, exactly as predicted by Holmes et al. 2021.
I don’t know why the Chinese built a coronavirus lab in Wuhan. But might it not be because it’s in an area where coronaviruses are endemic? It would not be a strange coincidence for a new species of frog to be discovered in a tropical wilderness preserve; such preserves are established precisely where biodiversity is known or expected to be great. If the lab were sited for its proximity to the viruses to be studied, then there’s nothing very surprising or evidentially salient for a virus to be found in its vicinity.
No, it wasn’t. The Wuhan Institute of Virology used the Wuhan population as a negative control for seropositivity to these types of viruses. There’s a consensus on all sides that the related viruses come from southern Yunnan, Vietnam, and Laos.
But this is true, just on a different scale. SARS also emerged in a large city in China (almost as far from southern Yunnan). And SARS was also detected in 2003-2004 on farms right outside of Wuhan! So it’s absolutely not a coincidence that most labs studying bat coronaviruses are in large cities in China, because SARS was an enormous national issue. For example there are plenty of labs studying novel bat coronaviruses in Beijing, Shanghai, and Hong Kong, and if SARS-CoV-2 had begun at any of those cities, that would look much more like a lab event than Wuhan, because these cities are even further from the reservoir. So within China Wuhan was not an especially likely city.
And any large city in China is even more likely than other cities worldwide. For example, if SARS2 had begun in Raleigh, NC, or Sacramento, CA, that would surely be suspicious, given that there are labs near those cities studying SARS-related bat coronaviruses but certainly no animal reservoir!
So in the grand scheme of things, this poster is right, Wuhan was not an especially unlikely place for a zoonotic outbreak of a bat coronavirus, and the reason why many large cities in China have labs studying coronaviruses is not a geographic coincidence, it’s because the threat is more significant within China.
The scientists used the Wuhan population as negative controls because novel coronaviruses aren’t spreading in large modern cities all of the time. For example, in 2002, you could use the population of Guangzhou or Foshan as negative controls for SARS as well. You could even do so after the SARS epidemic. But SARS still began there. So clearly your point doesn’t really hold any water here.
Wuhan is a big city. A key question for the lab-leak proponents is why the initial cases seem to cluster around the market, and there are no initial cases right around the lab itself. The market is across a river and a significant distance away from the lab, so these are distinct locations. That is, they aren’t right next to each other in terms of people transmitting a new disease. This is a major weakness in that theory.
The DEFUSE stuff takes some work to rebut. But, just on what you say, it’s potentially cherry-picking, in that over the whole set of research ever done at Wuhan, one could likely find something somewhere which is vaguely like Covid, and then proclaim – look, look, a connection.
“Bigot” is a harsh word. How can one politely describe a view that you have been misled by right-wing liars who make up xenophobic conspiracy theories?
The Worobey analysis of the early case locations is all messed up.
You can read the J. Royal Stat Soc A papers by Stoyan and Chiu (https://doi.org/10.1093/jrsssa/qnad139) and by me (https://doi.org/10.1093/jrsssa/qnae021)
outside firewall: https://arxiv.org/abs/2401.08680
on how badly it’s messed up.
As Andrew mentioned, Levin did a more conventional analysis and concluded that there was a source south of the Yangtze.(https://www.nber.org/papers/w33428)
And the Stoyan and Chiu paper made clear, factual data errors that they haven’t corrected. Why do you think they haven’t corrected their paper?
A full response and re-analysis to that paper, which notes their errors, is described here:
https://arxiv.org/pdf/2403.05859
It’s pretty clear that Stoyan and Chiu’s main point is “Well, it could have been another site right next to the market, like a plaza or something”. Their other point is “Well, the market could be at the center of the cases but that could just be a coincidence”. But this is a ridiculous argument when stated plainly, and the WIV is on the exact opposite side of the city.
The question of what other work was being done in Wuhan does not enter into any of the likelihood ratios. The relevant ratios for this particular observation are of
P(Wuhan, sarbecovirus | lab leak) / P(Wuhan, sarbecovirus | zoonosis).
For general zoonosis that factor is roughly 100. For the particular wildlife market version that became a favorite among Western virologists the ratio is more like 1000 because Wuhan has a tiny fraction of the wildlife trade, which is mostly concentrated in Guangdong and Guangxi.
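To make the arithmetic behind such factors concrete, here is a minimal Python sketch of how a likelihood ratio updates prior odds. The prior odds of 1:100 are purely hypothetical, chosen only for illustration, and the function names are invented for this sketch; the factors of 100 and 1000 are the ones quoted above. It shows the mechanics only and says nothing about whether those factors are right.

```python
def posterior_odds(prior_odds, *likelihood_ratios):
    """Multiply prior odds by each likelihood ratio (Bayes' rule in odds form)."""
    odds = prior_odds
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds


def odds_to_probability(odds):
    """Convert odds in favor of a hypothesis into a probability."""
    return odds / (1.0 + odds)


# Hypothetical prior odds of 1:100 for lab leak vs. zoonosis (an assumption).
hypothetical_prior_odds = 1.0 / 100.0

# Likelihood ratios quoted in the comment above.
general_zoonosis_factor = 100.0
wildlife_market_factor = 1000.0

p_general = odds_to_probability(
    posterior_odds(hypothetical_prior_odds, general_zoonosis_factor)
)
p_market = odds_to_probability(
    posterior_odds(hypothetical_prior_odds, wildlife_market_factor)
)
print(p_general, p_market)
```

Each additional piece of evidence would contribute its own factor, and the factors can only be multiplied like this if they are conditionally independent given each hypothesis.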
This “it couldn’t be a coincidence” thinking is an unfalsifiable premise that I have seen many people rely on to dismiss all the weaknesses of theories behind a lab-mediated spillover.
There is no real evidence of an outbreak centered at the lab. There’s no actual evidence of a viable precursor virus being worked on at the lab (and very qualified experts have presented a lot of evidence as to why the viruses identified for previous research would have been very unlikely candidates for CoV-2).
These are huge problems for any certainty about a lab-mediated spillover. But they can just be hand-waved away with a flourish of “it can’t be a coincidence.” If you want to make a scientific argument, do so. The “it can’t be a coincidence” thing is just facile epistemics. It’s not an actual argument. You could just as easily use the exact same heuristic for a market-mediated spillover.
More generally, who is at the root of the belief lab leaks are rare, which is contrary to all the evidence?
No one holding this belief is ever willing/able to share who told them it.
In reality, it is like a casual Friday event. Every month you can expect there were a few leaks of something from BSL 3/4 labs. This is ongoing right now.
People are collecting pathogens from remote locations around the world, concentrating them into buildings near/in major population centers, then leaking them into the population every other week or so.
This seems to be of little concern; it’s nuts.
The arguments around covid origins that you describe seem largely to fall into an inference trap that is common to many fields: as Andrew has termed it, “premature collapse of the wavefunction”. In other words, rather than maintaining multiple hypotheses with different levels of credibility, there seems to be a desire to commit to a single conclusion without sufficient evidence to warrant doing so. The WHO report, as well as the French Academy study it cites, is admirable for not falling into this trap.
It seems to me that the main purpose in identifying the origins of covid is to establish policies that would make it less likely to happen again. Given that different origin stories would suggest different policies, knowing which is more plausible could help us decide how to allocate resources to different policies. In a situation with uncertainty, we would do best to adopt policies that would mitigate against *any* plausible cause of a viral outbreak rather than putting all our eggs into one basket. But even if we knew with absolute certainty how covid arose, it’s not clear to me how much that policy recommendation would change. After all, covid is just one virus and viruses can spread via a lot of mechanisms. So I guess I don’t really understand the motivation to prematurely collapse the wavefunction in this case, since (a) there isn’t enough evidence; and (b) it wouldn’t really change policy recommendations.
P.S.: John “composer” Williams has had a very prolific career! He was a session pianist for a while under the name “Johnny Williams” (the same name he used when scoring the Lost in Space series). He doesn’t have a huge amount of concert music on his CV, but I would recommend his bassoon concerto “Five Sacred Trees”—I suspect it’ll be a surprise to those who are only familiar with his “golden age of Hollywood” scores for which he is more well-known. Also, I got to meet him once and found him to be a really kind and supportive person!
Excellent reply!
In my first post on this topic (https://www.dailykos.com/stories/2021/6/6/2033930/-Zoonosis-or-Lab-Leak), I took just that view. Uncertainty seemed like it was not only called for but also useful since precautions against both types of origin should be taken.
After a while it just seemed like too much effort to maintain artificial agnosticism against the evidence. That sense was reinforced as I realized how seriously erroneous the key papers supporting zoonosis were. Premature certainty, especially motivated certainty, is bad. So is motivated agnosticism if it requires gross scientific errors. Unfortunately for much of the scientific community the obvious motivations are strong.
I had the good fortune to attend the World Premiere of John Williams’ Concerto for Piano and Orchestra last Saturday at Tanglewood. It was very interesting (to me) in that it combined the sort of modern atonal spiky piano part (ably performed by Emanuel Ax) that audiences are said to run from with a very lyrical orchestral accompaniment. Williams is now wheelchair-bound, but came to Massachusetts to acknowledge applause at the end of the piece.
(That said, the World Premiere was overshadowed, IMO, by a brilliant post-intermission Mahler Symphony #1.)
I have a very incomplete understanding of statistics. I understand (sort of) how to use probability prospectively. Filling an inside flush is unlikely, be careful about betting on this. Retrospectively has me confused. The Washington Nationals are unlikely to win a World Series; therefore 2019 did not happen? Is that what people are saying?
Bayesian probability is about quantifying a “state of information”. Retrospectively, that is, not what did happen, but rather how much you know about what did happen.
Obviously whatever happened, happened. If we could ask Omniscient Jones or God or whatever, they’d tell us precisely… But we don’t have access to that information so from what we know, we need to decide which of the possible explanations has the most weight associated with it.
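A small sketch of the retrospective use, with made-up numbers: before the season you might give a team a 5% chance of winning the title; after reading a news report that they won, your probability that the event happened moves close to 1. The prior and the report reliabilities below are hypothetical, as is the function name.

```python
def bayes_update(prior, likelihood):
    """Return posterior P(h | data) from prior P(h) and likelihood P(data | h)."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: weight / total for h, weight in unnormalized.items()}


# Hypothetical state of information before the season.
prior = {"won": 0.05, "lost": 0.95}

# Hypothetical chance of seeing a "they won" news report under each outcome.
report_likelihood = {"won": 0.99, "lost": 0.001}

posterior = bayes_update(prior, report_likelihood)
print(posterior)
```

The event itself was never "unlikely after the fact"; the low prior described what was known beforehand, and the posterior describes what is known once the report is in hand.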
As Andrew points out, this virus stuff is complicated. My wife is a retired virologist, so I know this better than most. I also know which other virologists she takes seriously and which she thinks have done good work on the origin of other viral diseases, and I’ve learned to trust her judgement about this sort of thing, so it weighs heavily on my thinking that they are in the “probably a zoonotic origin camp.”
That said, if people want to take the trouble to read the WHO report, they can see how it deals with the furin cleavage and other points people have mentioned above. I’ve spent the day taking my computer to the repair shop (100 minutes from the remote place I live), so I can’t look up page numbers to guide them.
Dear Andrew,
The COVID origins debate goes much, much deeper than I think you realize. I am just going to include a few links that should be “entry level reading”.
The $100K debate – proposed by someone who leans lab leak, won by a zoonosis proponent:
https://medium.com/microbial-instincts/my-friend-won-the-us-100-000-debate-on-the-origin-of-covid-19-8a9d3f719ce9
The judges ended up at ~300:1 zoonosis. It was reviewed by Scott Alexander on his blog, who also leans zoonosis, 96-4:
https://www.astralcodexten.com/p/practically-a-book-review-rootclaim
Virologists and epidemiologists were anonymously polled and 80% leaned zoonosis and the median probability was 90% zoonosis:
https://gcri.org/publications/research/covid-origin/
The scientific literature is more lopsided and is almost entirely in favor of zoonosis.
Here are some recent statements in journals from 200+ scientists that support zoonosis:
The harms of promoting the lab leak hypothesis for SARS-CoV-2 origins without evidence: https://journals.asm.org/doi/full/10.1128/jvi.01240-24
Virology under the Microscope—a Call for Rational Discourse: https://journals.asm.org/doi/10.1128/jvi.00089-23
Statement in Support of: “Virology under the Microscope—a Call for Rational Discourse: https://journals.asm.org/doi/10.1128/mbio.00815-23
Virology—the path forward: https://journals.asm.org/doi/full/10.1128/jvi.01791-23
A Critical Analysis of the Evidence for the SARS-CoV-2 Origin Hypotheses: https://journals.asm.org/doi/full/10.1128/jvi.00365-23
And here are primary literature and reviews of the evidence, e.g. analyses from top scientists that support a market origin:
https://www.cell.com/cell/fulltext/S0092-8674(21)00991-0
https://www.cell.com/cell/fulltext/S0092-8674(24)00901-2
https://www.science.org/doi/10.1126/science.abm4454
https://www.science.org/doi/10.1126/science.abp8337
https://www.science.org/doi/10.1126/science.abp8715
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011934#sec001
https://www.annualreviews.org/content/journals/10.1146/annurev-virology-093022-013037
And more recent works, like:
https://www.biorxiv.org/content/10.1101/2025.04.05.647275v1
https://academic.oup.com/mbe/article/42/6/msaf109/8158640
You should probably also read the ODNI’s declassified summaries of the topic. These are all really “entry-level” readings on COVID-19 origins – the topic goes much, much deeper (most “aficionados” have spent 100+ hours on the topic, I shudder to think of what my number is).
“Lab leak” proponents will have responses and “gotchas” for each of them, and then there are responses in turn to them. It is like any other classic internet debate: creationism vs evolution of the 2000s, climate change of the 2010s, in this regard.
Some scientists (20% of experts) lean towards a lab origin. It’s a number higher than the percentage of experts who lean towards creationism or climate denial, so therefore we can certainly agree the truth is less certain than on those two issues. However, after reviewing the evidence I lean towards >99.9% zoonosis.
Finally, lab origin hypotheses are not intrinsically conspiracy theories – that’s why scientists discuss them in peer-reviewed journals. But many of the “practicing lab origin proponents”, e.g. Michael Weissman, make epistemic errors characteristic of conspiracy theorists. The difference between those two is important and distinct but often people interpret “conspiracy theory” as simply an insult, when instead it describes a pattern and category of epistemic errors.
These are arguments by authority, by consequences, and by adjectives. I won’t bother to cite counter-authorities and other consequences but will get right to the statistical methods.
As for “epistemic errors”, Scott Alexander and the others who drew Zoo-leaning conclusions from the famous debate made a serious error in applying Bayesian stats. They drew an enormous Bayes factor from a specific unrealistic model of notoriously unreliable data. Basic hierarchical Bayes says that the underlying uncertainties limit how big a Bayes factor such an approach can give. Without that unrealistic factor, even though they omitted other major factors favoring LL, their analyses would tilt toward LL. I discuss this particular issue in a fairly short blog.
https://michaelweissman.substack.com/p/open-letter-to-scott-alexander
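One simple way to see how model uncertainty can cap a Bayes factor is a two-component mixture. This is an illustrative sketch, not the calculation in the linked post: assume the specific model of the data is right with probability p_model_ok, and that otherwise the data are equally likely under both hypotheses. All numbers and names below are hypothetical.

```python
def effective_bayes_factor(lik_h1, lik_h2, p_model_ok, lik_if_model_wrong):
    """Bayes factor after averaging over whether the data model is valid.

    lik_h1, lik_h2: likelihood of the data under each hypothesis if the
        specific data model is valid.
    p_model_ok: assumed probability that the data model is valid.
    lik_if_model_wrong: assumed likelihood of the data under either
        hypothesis when the data model is not valid.
    """
    p_wrong = 1.0 - p_model_ok
    marginal_h1 = p_model_ok * lik_h1 + p_wrong * lik_if_model_wrong
    marginal_h2 = p_model_ok * lik_h2 + p_wrong * lik_if_model_wrong
    return marginal_h1 / marginal_h2


# Nominal factor of a million if the data model is taken at face value.
nominal = 1.0 / 1e-6

# Same likelihoods, but with a hypothetical 10% chance the data model is wrong.
capped = effective_bayes_factor(1.0, 1e-6, 0.9, 1.0)
print(nominal, capped)
```

With these inputs the effective factor stays below 1 / (1 - p_model_ok) = 10 however small lik_h2 becomes, which is the sense in which underlying uncertainty limits the Bayes factor a single model can deliver.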
To be clear, I was not arguing by authority, I was simply noting the existence of various opinions on the topic. It is important to be more accurate in your critiques. If I had said “You should believe in a zoonotic origin for COVID-19 because most experts do” that would be an argument from authority. But I did not say that. Rather, I encouraged Andrew (and others) to read the existing scientific literature and discussions on the topic by linking to it. That’s the data. I also think it is important to be clear about where most experts lie, but at the end of the day, the question is of course all about the data.
For example, in the debate that I cited, neither case was delivered by experts, it was simply an evaluation of the data.
I must admit your statement “They drew an enormous Bayes factor from a specific unrealistic model of notoriously unreliable data” is a tad ironic, given the number of examples of you doing so on your own blog post: https://michaelweissman.substack.com/p/an-inconvenient-probability-v57
Just as one example (my sense is this has been explained to you several times, but I will try again): you say “By far the largest cluster of early reports in this early data set is close to the WIV on the south side of the Yangtze”.
But:
1. That “cluster” is only made of 9 individuals by January 18, 2020. This is too few to be a meaningful sample of COVID-19 patients in Wuhan. There were already 4,126 reported cases in the city by this time, and at least 5 times as many unreported cases. So this “cluster” is just a few people who are self-reporting that they got sick by a time when the virus was everywhere in the city.
2. The noisiness of the “cluster” is clear in the figure itself. On the next day (Jan 19+20, as opposed to prior to Jan 18) the clustering changes completely. If this clustering actually represented a statistically significant signal of viral spread, we would instead have observed the spatial pattern diffusing slowly, as we do with the actual dataset for the Huanan market (see Holmes et al. 2021, Figure 1).
3. The Weibo COVID-19 website was only set up by February 4, 2020; these earliest data points are a few individuals remembering or guessing when they had gotten sick previously, and are therefore prone to mistakes, especially when it’s just 9 individuals.
4. Another academic work also scraped the Weibo dataset for analysis, and ended up with a different dataset, which unlike the dataset in Peng et al., is fully available. When re-analyzed, this version of the Weibo dataset is similar to Peng et al. overall, but differs in some of the earliest cases, and shows no such “WIV” clustering. https://github.com/YuhanJiang415/COVID19_Self-reported_Data_Weibo/blob/master/Self-reported%20Data.csv
This is just one example. My personal sense is that folks have tried to engage with you on all of these points many times and not gotten anywhere, so I’ll probably avoid getting into a further back and forth on it – you are welcome to the last word.
That’s a really peculiar example to give of me drawing large Bayes factors from shaky data. My conclusion of that section is (italicized in the original) “Such maps cannot reliably point to the spillover site.” The Bayes factor I derive from it is 1.0.
I gave that discussion only to point out that the map Worobey featured from the same paper was a cherry-picked later map, not the most relevant early map.
Citing scholarly literature is not “appealing to authority.” This is the same trap that people fall into when they “do their own research” and think that they know as much as people who spend decades being trained and doing primary research. It’s like people who shout “correlation is not causation” and seem to believe that the fact that correlation exists means that there is not a causal relationship. Correlation is not causation and correlation is not not causation.
I don’t have a dog in this fight, though I really think the main implication is that whatever happened, lab security needs to be better. Even if the probability it was an accidental leak is minuscule, we can see that the consequences are too high to risk.
p.s. You include Pekar 2022 as part of the primary evidence. Citation of that deeply, provably invalid paper serves as a kind of marker for who is not trying to get this analysis right.
See Angus McCowan’s deep discussion of the logic errors (https://arxiv.org/abs/2502.20076)
and my more readable account of his work.
(https://michaelweissman.substack.com/p/explanation-of-and-comments-on-mccowans)
Very aware of that preprint, which I personally find entirely unconvincing. The early phylogeny of SARS-CoV-2 still is very unlikely to have been observed given one introduction, and McCowan doesn’t even dispute that! He just introduces his own new model of two introductions under which the data is also unlikely. But clearly that begs the question of what model the observed data actually is likely under. The Pekar paper hints that this was either (a) two non-independent introductions or (b) multiple index cases of each lineage.
Personally, I believe multiple index cases of each lineage is the best model to explain the observed data. That’s really the only way you get two evenly sized polytomies, I think.
Some more responses to that preprint are found on pubpeer:
https://pubpeer.com/publications/3FB983CC74C0A93394568A373167CE#19
https://pubpeer.com/publications/2C73F441513A42FE6D566147E5E76C
For readers who want to know what this dispute is about, here’s the relevant passage from my explanation of McCowan’s arXiv.
“It is obviously essential that the same observations be used for each hypothesis. For example, if one were to update the odds for deciding which of two suspects committed a burglary, it would not be correct to use the ratio P(drives car|suspect 2)/P(drives blue Toyota|suspect 1). The P2022 paper makes an analogous mistake. For the I1 hypothesis the observation used is that there are two polytomies with sequences differing by at least 2 nucleotides with the smaller polytomy having at least 30% of the detected cases. (Lineage A had 35%). For the I2 hypothesis the requirement was only that there be two polytomies with no restriction on the sequence difference or the relative size. (5) This is a fundamental error in logic.”
“The early phylogeny of SARS-CoV-2 still is very unlikely to have been observed given one introduction, and McCowan doesn’t even dispute that!” One of the senior authors (Michael Worobey) takes this argument even further and presents the likelihood of a single introduction failing to reproduce observations as the probability that there were two introductions.
“He just introduces his own new model of two introductions under which the data is also unlikely.” I use Pekar’s model, merely adding a tuned Poisson to generate upstream diversity. I also present an alternative with upstream diversity written in by hand to conform to Pekar’s implicit assumptions.
“But clearly that begs the question of what model the observed data actually is likely under. The Pekar paper hints that this was either (a) two non-independent introductions or (b) multiple index cases of each lineage.” Pekar does not hint at non-independent introductions, and explicitly estimates the number of introductions as that maximising the likelihood of two succeeding. You are proposing a new model with different parameters.
Thanks for the interesting post. I’ve tried a very simple — and therefore not strictly correct — model, but one that is based on publicly available data and, I believe, realistic assumptions.
If the outbreak started in Wuhan’s food service sector (about 400,000 workers, based on Wuhan population figures and employment statistics), and we assume 100 initial infections, using an early estimate of the reproduction number from Lombardy (Rₜ ≈ 2.96) over six transmission cycles (~1 month), we would expect about 67,000 infections.
If it began at the Wuhan Institute of Virology (~500 staff, according to institutional records) with 10 initial infections, the same assumptions give about 6,700 cases.
Serological surveys (e.g., early 2020 Wuhan studies) suggest roughly 484,000 infections after the first month. Using a log-normal model to allow for a fourfold uncertainty in early counts, the data are about 10 times more likely under the market-origin scenario than the lab-origin scenario (Bayes Factor ≈ 10).
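The arithmetic in this toy model can be sketched in a few lines of Python. This is only an illustration, not the commenter's actual calculation: the function names are hypothetical, and reading "fourfold uncertainty" as a log-normal spread of sigma = ln(4) is an assumption made here. The inputs (100 or 10 seed infections, R = 2.96, six cycles, 484,000 observed infections) are the figures quoted above.

```python
import math

def projected_infections(initial_cases, r, cycles):
    """Simple geometric growth: each cycle multiplies the case count by r."""
    return initial_cases * r ** cycles

def log_normal_log_density(observed, predicted, sigma):
    """Log density of `observed` under a log-normal centred on `predicted`."""
    z = (math.log(observed) - math.log(predicted)) / sigma
    return -0.5 * z * z - math.log(observed * sigma * math.sqrt(2 * math.pi))

def bayes_factor(observed, predicted_a, predicted_b, sigma):
    """Likelihood ratio of scenario A to scenario B for the same observation."""
    return math.exp(
        log_normal_log_density(observed, predicted_a, sigma)
        - log_normal_log_density(observed, predicted_b, sigma)
    )

R = 2.96            # early reproduction number quoted in the comment
CYCLES = 6          # roughly one month of transmission cycles
OBSERVED = 484_000  # serology-based estimate quoted in the comment

market = projected_infections(100, R, CYCLES)  # market scenario, 100 seeds
lab = projected_infections(10, R, CYCLES)      # lab scenario, 10 seeds

# ASSUMPTION: "fourfold uncertainty" is read as sigma = ln(4); wider spreads
# are shown as well because the result depends heavily on this choice.
for spread in (4, 6, 10):
    bf = bayes_factor(OBSERVED, market, lab, sigma=math.log(spread))
    print(f"spread x{spread}: market={market:,.0f} lab={lab:,.0f} BF={bf:.1f}")
```

Because both scenarios use the same growth rule, the two projections differ by exactly the ratio of seed infections, and both fall well short of the observed count. The Bayes factor is therefore driven almost entirely by the assumed spread, so this sketch should not be expected to reproduce the quoted value of 10 unless its spread matches whatever the commenter used.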
Tommaso Costa wrote:
“…based on publicly available data and, I believe, realistic assumptions.”
Now run the numbers if patient zero was at the lab and patient 1 was at the market. And be sure to account for the fact that person-to-person transmission inside virology labs is highly attenuated by personal protective equipment.
The general point is that we have no idea who patient zero was or where that person contracted the virus. Everyone goes to food markets or has someone in their household who goes to food markets, so all this downstream analysis (attempting to locate patient zero in physical space using the numbers from epidemiology) is never going to give a conclusive answer.
Thanks, Matt — I agree that we don’t know who “patient zero” was or the exact chain of transmission. My calculation was never meant to identify that; it was a deliberately simple likelihood comparison: given some realistic-seeming starting assumptions and early case-count estimates, how well does each scenario generate the observed infections after the first month?
You’re right that within-lab transmission would be attenuated by PPE. If we included that effect, the lab-origin scenario would produce even fewer cases in the early cycles than my simple model assumes — which would actually increase the Bayes factor in favour of the market scenario.
And you’re also right that everyone in Wuhan had some link to markets. That’s why I framed it as “given the observed data, which starting point is more likely to have produced it?” rather than “where was patient zero?”.
These uncertainties are not unique to my toy model — they are just as present, if not more so, in more elaborate frameworks like Levin’s, which combine multiple evidence streams and conditional probabilities. At that point, Occam’s razor comes in: if two models face the same basic unknowns, I’ll prefer the one whose assumptions are explicit, few, and easy to check.
Tommaso wrote:
“If we included [the effect of PPE], the lab-origin scenario would produce even fewer cases in the early cycles”
You are still assuming that a viral infection that occurred in the lab would be associated with spread of that virus in the lab. I don’t think that is a reasonable assumption at all. The cohort of people least likely to show up for work with symptoms of a viral infection would be those who work in a virology lab. But those same people still need food.
Nor does Occam’s razor help you here. It would dominate the discussion if we knew of an animal at the market that met all the criteria for having been the intermediate host, but that did not happen. As things now stand, the market jump theory requires more significant assumptions than the lab leak theory by any reckoning.
These blog discussions have an element of pointlessness about them, but it’s worth pointing out two things:
1. The virus wouldn’t spread in the lab amongst people with adequate PPE, but once a lab individual is infected (am assuming you think that must have happened) s/he is most likely to infect other lab workers at least as quickly as infecting outsiders. Once your morning or afternoon’s work in the lab is done you leave the lab (taking care to dispose of disposable PPE and undertaking whatever decontam protocols are required) and will probably meet with other lab members/senior staff to discuss progress, maybe meet for lunch or a get together at the end of the day and so on.
2. As for lab workers not showing up to work if they have symptoms of viral infection, that may well be the case. However for Covid-19 some of the highest levels of infectivity are/were in the first couple of days before symptoms appear. So an individual picking up a covid infection in the lab would be highly likely to spread it amongst co-workers for both reasons 1. and 2.
There’s some substantial evidence on this, isn’t there? Increased SARS-CoV-2 positivity was found in swabs from within and around one of the Huanan market wildlife stalls, and these SARS-CoV-2-positive samples also contained wildlife DNA, including from civets, bamboo rats, and raccoon dogs, which are known to be possible intermediate hosts between bats and humans (i.e. they’re known to harbour related viruses). Other animal viruses from wildlife were also found in swabs in and around stalls, indicating that animals were shedding viruses at the time of sampling.
Crits-Christoph et al. (2024) Genetic tracing of market wildlife and viruses at the epicenter of the COVID-19 pandemic, Cell vol 187
Chris wrote:
“There’s some substantial evidence on this isn’t there?”
No.
Once again, you are making claims here on the blog that are not supported by the paper you linked.
There is nothing in that paper – not even a hint of a claim – that would refute the scenario I gave where a sick lab worker stayed home from work but went to the market for food and infected a market worker. The fact that this scenario is itself unlikely is also irrelevant because there are lots of other plausible scenarios that fit the evidence with patient zero at the lab.
Here we are five years out and people still want to talk about civets and raccoon dogs, even after years of DNA testing that has failed to show that any of these animals were the intermediate hosts for the Covid pandemic in humans.
I’m not trying to refute your scenario Matt. In my opinion Tommaso’s argument is interesting and the objections you raise are not very robust for the reasons I suggested in my first post above.
But I’m not saying at all that the Crits-Christoph paper should provide evidence that would refute your notion of a lone sick worker; as you say, an unlikely story. By describing some of what’s in the paper, I’m not making a claim about anything other than that it provides evidence consistent with a wild animal Huanan market origin.
I would take from the paper that if there was a transmission of a bat-derived coronavirus strain through a wild animal host at the Huanan market, the observations of co-localised Covid-19 and DNA from wild animal hosts known to harbour related viruses in swabs from the market in Jan 2020 are what one might expect to find. But that doesn’t prove the market origin scenario any more than it disproves the lab-release scenario, although in my opinion it is consistent with a market wild animal origin.
IIRC raccoon dogs can be infected by Covid-19; it doesn’t make them sick, but infected animals shed virus and can infect other animals. They’re known to harbour related viruses acquired from bats. All those criteria seem to me to be appropriate for a plausible intermediate host that carried the infection to Huanan. It’s not proof though, and as Crits-Christoph et al. point out there were other plausible animal carriers at the market. Whatever one thinks about covid-19 origins it’s good that people continue to investigate likely carriers, transmission scenarios and so on, in advance of future potential epidemics…
You’re right that within-lab transmission would be attenuated by PPE
Sort of. A lab worker would not likely be using PPE when taking the bus to and from the lab, or while sitting around the dinner table with family and friends, or even at the hospital where they would have gone with flu-like symptoms before the pandemic was identified. No one has pointed to a signal of any of that. All we have seen is a signal associated very directly with the market. A lab-mediated spillover is definitely a possibility, but there’s no actual evidence of one. It’s kind of remarkable how wedded some people are to that theory of the origin despite a lack of direct evidence. It certainly reminds me of the epistemics of many other very popular conspiracy theories.
Thanks, Matt — I see your point about not assuming sustained spread inside the lab. My simple model doesn’t require that, though: as Chris noted, early SARS-CoV-2 transmission can occur before symptoms, and in practice lab colleagues interact outside strict PPE contexts (meetings, meals, commuting). That means a lab-origin scenario wouldn’t be immune from early secondary cases among staff.
On Occam’s razor: I don’t think it requires the “perfect” intermediate host to be identified before it applies. The question is whether the simpler model, given the data we do have, explains the observations without layering on extra assumptions. Evidence from environmental swabs at the Huanan market — including SARS-CoV-2 RNA alongside DNA from raccoon dogs, civets, and other plausible hosts — is one such data point.
My point remains that both complex and simple models face the same basic unknowns; in that case, I’ll start with the one whose assumptions are few, explicit, and testable.
I hadn’t seen the Millennium Village paper before, but it’s very neat! It won’t be a third citation, but I am thinking of assigning it. I spent a lot of time last year looking for a good reading on using country level data for them, and this might be it.