“Estimating Covid-19 prevalence using Singapore and Taiwan”

Jacob Steinhardt writes:

I wanted to share some applied statistical modeling that you and your readers might enjoy. I took a break from machine learning research for the past week to work on it, in particular trying to correct for underreporting due to insufficient testing in some countries. My overall conclusion is that in most European countries, backing out the number of cases from the mortality data is reasonably reliable, but there are other countries where it’s less reliable and the reported deaths may substantially underestimate the actual deaths.

Of course, my analysis also relies on assumptions, many of which are obviously incorrect. But it’s a different set of incorrect assumptions than taking the reported deaths as given, so together these can help start to paint a clearer picture. And hopefully more analyses and more data later will continue to improve our understanding.

The full blog post is here, and you can also find the underlying data here or even rawer data on github.

I haven’t read this in detail, but I’m forwarding in case it interests some of you. My only quick comment on the analysis is that I think you should just about never use the Poisson model. Always use overdispersed Poisson. Also, I recommend you fit any model in Stan: it’s flexible, so you can expand it in various ways, include new data, etc., the usual story.
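To illustrate the Poisson comment, here is a minimal sketch (not taken from the linked analysis) comparing plain Poisson counts with an overdispersed alternative. It assumes the overdispersed model is a negative binomial written as a gamma-Poisson mixture with mean `mu` and dispersion `phi`, so that the variance is `mu + mu**2 / phi`. The values of `mu`, `phi`, and `n` are made up for illustration, and the helper names are hypothetical.

```python
import math
import random
import statistics


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) count with Knuth's method (fine for modest lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def negbin_draw(rng, mu, phi):
    """Draw an overdispersed count: rate ~ Gamma(shape=phi, scale=mu/phi), then Poisson(rate)."""
    rate = rng.gammavariate(phi, mu / phi)
    return poisson_draw(rng, rate)


rng = random.Random(0)
mu, phi, n = 10.0, 2.0, 20000  # illustrative values only

pois = [poisson_draw(rng, mu) for _ in range(n)]
nb = [negbin_draw(rng, mu, phi) for _ in range(n)]

pois_mean, pois_var = statistics.fmean(pois), statistics.variance(pois)
nb_mean, nb_var = statistics.fmean(nb), statistics.variance(nb)

print(f"Poisson:           mean={pois_mean:.2f}  variance={pois_var:.2f}")
print(f"Negative binomial: mean={nb_mean:.2f}  variance={nb_var:.2f}")
```

Both samples have roughly the same mean, but the Poisson sample is forced to have variance close to its mean, while the mixture's variance is several times larger. Real case and death counts usually look more like the second, which is why a plain Poisson fit tends to give intervals that are too narrow.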

1. Gowanus says:

“Here I present a method to try to estimate the infection prevalence in different countries. The basic idea is to make use of the data from Singapore and Taiwan, which both perform thorough testing…”


…so a classic, non-representative “Convenience Sample” purporting to make somewhat valid conclusions on a huge world population.

(Look at India for a radically different result)

2. EHG says:

The author writes:

> This is clear just from looking at some of the numbers; the estimated infection prevalence of 0.05% in Italy is clearly too low, while the Egypt prevalence is too high — a prevalence of 1.94% would be substantially higher than the point at which Italy’s medical system was overrun, and we haven’t seen that in Egypt. […] Italy’s low apparent prevalence is likely due to travel restrictions suppressing the number of cases observed in other countries.

But hospitals are not overrun across Italy (and let us hope it stays that way). The outbreak is really only very bad in Lombardy (plus, to a lesser extent, Emilia-Romagna and Veneto); see https://www.statista.com/statistics/1099375/coronavirus-cases-by-region-in-italy/. So a country could have a low infection prevalence nationally and still exceed hospital capacity in a specific region if most cases occurred there. That could be the case in Italy; I don’t know how the situation looks in Egypt.

That’s a quibble; it is a good read! I was wondering why the UK had relatively few cases compared to the countries of the Continent. I suppose it’d make sense that underreporting is a factor.

3. zbicyclist says:

I fully agree with “you should just about never use the Poisson model. Always use overdispersed Poisson”.

4. Calum says:

Andrew, have you seen this paper out of Oxford University? It suggests that the conspiracy theory doing the rounds that Covid-19 was circulating much earlier than we realise might have some truth to it.

https://www.dropbox.com/s/oxmu2rwsnhi9j9c/Draft-COVID-19-Model%20%2813%29.pdf?dl=0

I feel very uncomfortable relying on the tail of a model in this way to support such a conclusion, but I’ve no expertise in fitting MCMC models in general, never mind SIR models.

• Anoneuoid says:

> the conspiracy theory doing the rounds that Covid-19 was circulating much earlier than we realise

Why would this require a “conspiracy”?

• All those graphs talk about the spread starting in late Jan or early Feb. None of that is “conspiracy theory” material; a “conspiracy”-type theory would be that it’s been circulating since Nov or something. Mid Jan to early Feb is absolutely when we expect the first cases to have hit Italy and the US. First known community spread in WA is somewhere in mid Jan, and was confirmed around Jan 30: https://www.scientificamerican.com/article/cdc-confirms-first-known-person-to-person-spread-of-new-coronavirus-in-u-s/

Since confirmed cases are always lagging behind reality, we basically know that it was spreading in the US starting some time in early Jan.

• If I assume the first community spread was Jan 15, then as of Mar 24 (69 days later), using 2^(n/k) with n = number of days and k = 3, 4, or 5 days per doubling, we arrive at a total caseload today of 8.4M, 156k, or 14k, respectively. So with about 50,000 confirmed cases, we can rule out doubling times longer than 4 days… and since most data show doubling about every 3 days, I’m going to say we have around 1-10M infections today.

Or, we could push the start of the spread back to Jan 1, in which case we know it isn’t doubling every 3 days, because 213M people would have it already; doubling every 4 days would give 1.8M, and every 5 days 99k…

With that range of wiggle room, we can say it’s between 1M and 10M cases today in the US, most likely.
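The arithmetic in the comments above can be reproduced with a short script. This is only a sketch of the same back-of-the-envelope calculation: it assumes a single initial infection on the start date and a constant doubling time k, exactly as in the 2^(n/k) formula, and the function name is hypothetical.

```python
from datetime import date


def implied_infections(first_spread, today, doubling_days):
    """Infections implied by 2^(n/k): one initial case, constant doubling time."""
    n_days = (today - first_spread).days
    return 2 ** (n_days / doubling_days)


today = date(2020, 3, 24)
starts = (date(2020, 1, 15), date(2020, 1, 1))

results = {}
for start in starts:
    n = (today - start).days
    for k in (3, 4, 5):
        results[(start, k)] = implied_infections(start, today, k)
        print(f"start={start}  n={n} days  k={k}  infections={results[(start, k)]:,.0f}")
```

Moving the start date by two weeks or the doubling time by one day shifts the answer by orders of magnitude, which is why the comment ends up with a range as wide as 1M to 10M.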

• Anoneuoid says:

Right, the conspiracy is that Fort Detrick was shut down last summer for a safety violation, after which there was a strange summer virus that shut down a few nearby nursing homes. Then there was the “vaping illness” with very similar symptoms, but primarily in the young. Next we got Covid-19 seeming to come out of Wuhan.