“Estimating Covid-19 prevalence using Singapore and Taiwan”

Posted on March 21, 2020 10:49 PM by Andrew

Jacob Steinhardt writes:

I wanted to share some applied statistical modeling that you and your readers might enjoy. I took a break from machine learning research for the past week to do some applied statistical modeling, in particular trying to correct for underreporting due to insufficient testing in some countries. My overall conclusion is that in most European countries, backing out the number of cases from the mortality data is reasonably reliable, but there are other countries where it’s less reliable and the reported deaths may substantially underestimate the actual deaths.

Of course, my analysis also relies on assumptions, many of which are obviously incorrect. But it’s a different set of incorrect assumptions than taking the reported deaths as given, so together these can help start to paint a clearer picture. And hopefully more analyses and more data later will continue to improve our understanding.

The full blog post is here, and you can also find the underlying data here or even rawer data on GitHub.

I haven’t read this in detail, but I’m forwarding it in case it interests some of you. My only quick comment on the analysis is that I think you should just about never use the Poisson model. Always use overdispersed Poisson. Also I recommend you fit any model in Stan, as it’s flexible so you can expand it in various ways, include new data, etc., the usual story.
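
To make the overdispersion point concrete, here is a minimal sketch in plain Python. It is not the model from the linked analysis and not a Stan fit. The counts are made up for illustration, the negative binomial stands in as one common overdispersed Poisson, and the method-of-moments estimate of the dispersion parameter `phi` is a shortcut chosen to keep the example short.

```python
import math

def poisson_loglik(counts, mu):
    """Log-likelihood of the counts under Poisson(mu)."""
    return sum(k * math.log(mu) - mu - math.lgamma(k + 1) for k in counts)

def negbin_loglik(counts, mu, phi):
    """Log-likelihood under a negative binomial with mean mu and
    variance mu + mu**2 / phi (smaller phi means more overdispersion)."""
    return sum(
        math.lgamma(k + phi) - math.lgamma(phi) - math.lgamma(k + 1)
        + phi * math.log(phi / (phi + mu))
        + k * math.log(mu / (phi + mu))
        for k in counts
    )

def moment_estimates(counts):
    """Sample mean, sample variance, and method-of-moments phi."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((k - mean) ** 2 for k in counts) / (n - 1)
    # Solve var = mean + mean**2 / phi; no finite phi if var <= mean.
    phi = mean ** 2 / (var - mean) if var > mean else math.inf
    return mean, var, phi

# Hypothetical daily counts, invented for this illustration only.
counts = [0, 1, 0, 7, 2, 15, 0, 3, 1, 22, 0, 4]

mean, var, phi = moment_estimates(counts)
dispersion = var / mean  # equals 1 in expectation for Poisson data

print(f"mean={mean:.2f}  variance={var:.2f}  variance/mean={dispersion:.2f}")
print(f"Poisson log-likelihood:           {poisson_loglik(counts, mean):.1f}")
print(f"Negative binomial log-likelihood: {negbin_loglik(counts, mean, phi):.1f}")
```

When the variance-to-mean ratio is well above 1, as it is for these invented counts, a plain Poisson model understates the spread of the data, and its intervals come out too narrow. In a full analysis the dispersion parameter would be estimated jointly with everything else rather than by moments.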

11 thoughts on ““Estimating Covid-19 prevalence using Singapore and Taiwan””

Gowanus on March 22, 2020 at 8:27 am said:

“Here I present a method to try to estimate the infection prevalence in different countries. The basic idea is to make use of the data from Singapore and Taiwan, which both perform thorough testing…”


…so a classic, non-representative “Convenience Sample” purporting to make somewhat valid conclusions on a huge world population.

(Look at India for a radically different result)

EHG on March 22, 2020 at 8:28 am said:

The author writes:

> This is clear just from looking at some of the numbers; the estimated infection prevalence of 0.05% in Italy is clearly too low, while the Egypt prevalence is too high — a prevalence of 1.94% would be substantially higher than the point at which Italy’s medical system was overrun, and we haven’t seen that in Egypt. […] Italy’s low apparent prevalence is likely due to travel restrictions suppressing the number of cases observed in other countries.

But hospitals are not overrun across Italy (and let us hope it stays that way) – the outbreak is really only very bad in Lombardy (plus to a lesser extent Emilia-Romagna and Veneto). See here https://www.statista.com/statistics/1099375/coronavirus-cases-by-region-in-italy/ . So a country could have a low infection prevalence nationally and still exceed hospital capacity in a specific region if most cases occurred there. That could be the case in Italy; I don’t know how the situation looks in Egypt.

That’s a quibble; it is a good read! I was wondering why the UK had relatively few cases compared to the countries of the Continent. I suppose it’d make sense that underreporting is a factor.

- Anoneuoid on March 22, 2020 at 9:34 am said:
  
  And in Italy it sounds like the hospitals that are overrun are overrun because they sent a lot of the staff home for testing positive and spreading it amongst the old people.
  
  > The problem is that so many of our staff are at home, as they are testing positive for COVID-19. So that leaves a handful of us to run everything
  
  https://www.democracynow.org/2020/3/20/headlines/worldwide_covid_19_death_toll_tops_10_000_as_italian_nurses_stop_counting_the_dead
  
  - Anoneuoid on March 22, 2020 at 2:15 pm said:
    
    Get ready for the paint shortages. If you want to repaint your house while you’re stuck there get it now.
    
zbicyclist on March 22, 2020 at 4:27 pm said:

I fully agree with “you should just about never use the Poisson model. Always use overdispersed Poisson”.

Calum on March 24, 2020 at 3:22 pm said:

Andrew, have you seen this paper out of Oxford University? It suggests that the conspiracy theory doing the rounds that Covid-19 was circulating much earlier than we realise might have some truth to it.

https://www.dropbox.com/s/oxmu2rwsnhi9j9c/Draft-COVID-19-Model%20%2813%29.pdf?dl=0

I feel very uncomfortable relying on the tail of a model in this way to support such a conclusion, but I’ve no expertise in fitting MCMC models in general, never mind SIR models.

- Anoneuoid on March 24, 2020 at 3:40 pm said:
  
  > the conspiracy theory doing the rounds that Covid-19 was circulating much earlier than we realise
  
  Why would this require a “conspiracy”?
  
- Daniel Lakeland on March 24, 2020 at 3:54 pm said:
  
  All those graphs talk about the spread starting in late Jan or early Feb. None of that is “conspiracy theory” type. “Conspiracy” type theories would be that it’s been circulating since Nov or something. Mid Jan to early Feb is absolutely when we expect the first cases to have hit Italy and the US. First known community spread in WA is somewhere in mid Jan, and confirmed around Jan 30: https://www.scientificamerican.com/article/cdc-confirms-first-known-person-to-person-spread-of-new-coronavirus-in-u-s/
  
  Since confirmed cases are always lagging behind reality, we basically know that it was spreading in the US starting some time in early Jan.
  
  - Daniel Lakeland on March 24, 2020 at 3:55 pm said:
    
    Washington State relevant link:
    
    https://www.washingtonpost.com/health/coronavirus-may-have-spread-undetected-for-weeks-in-washington-state/2020/03/01/0f292336-5bcc-11ea-9055-5fa12981bbbf_story.html
    
  - Daniel Lakeland on March 24, 2020 at 4:03 pm said:
    
    If I assume the first community spread was Jan 15, then as of Mar 24, using 2^(n/k) with n = number of days and k = 3, 4, or 5 days per doubling, we arrive at a total caseload today of 8.4M, 156k, or 14k. So with about 50,000 confirmed cases, we can rule out spread at a longer interval than doubling every 4 days… since most data is doubling ~ 3 days. I’m going to say we have around 1-10M infections today.
    
    Or, we could push the spread back to Jan 1, in which case we know we don’t have doubling every 3 days because 213M people would have it already, but doubling every 4 days it’d be 1.8M, and 99k for every 5 days…
    
    With that range of wiggle room, we can say it’s between 1M and 10M cases today in the US, most likely.
    
  - Anoneuoid on March 24, 2020 at 4:50 pm said:
    
    Right, the conspiracy is that Fort Detrick was shut down last summer for a safety violation after which there was a strange summer virus that shut down a few nearby nursing homes. Then there was the “vaping illness” with very similar symptoms, but primarily in the young. Next we got Covid-19 seeming to come out of Wuhan.
    
    https://old.reddit.com/r/conspiracy/comments/fnyr9o/the_cdc_shut_down_the_army_infectious_diseases/
    
    I don’t buy it but I guess it is possible the virus was around that early. If we assume every country only detects it a few months after it starts circulating like in that model it could work, but that doesn’t mean it is correct.
    
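The doubling-time arithmetic in Daniel Lakeland's comment above, 2^(n/k) with n days since the first community spread and k days per doubling, can be reproduced with a short sketch. It assumes a single seed case and a constant doubling time over the whole period, as the comment does. The function name `implied_infections` is made up for this illustration.

```python
from datetime import date

def implied_infections(first_spread: date, today: date, doubling_days: float) -> float:
    """Infections implied by one seed case doubling every `doubling_days` days."""
    n_days = (today - first_spread).days
    return 2 ** (n_days / doubling_days)

today = date(2020, 3, 24)

# The two assumed start dates and three doubling times from the comment.
for start in (date(2020, 1, 15), date(2020, 1, 1)):
    for k in (3, 4, 5):
        total = implied_infections(start, today, k)
        print(f"start {start}, doubling every {k} days: {total:,.0f}")
```

With n = 69 days from Jan 15 and n = 83 days from Jan 1 (2020 was a leap year), this gives roughly 8.4M, 156k, and 14k for the first start date and 213M, 1.8M, and 99k for the second, matching the figures in the comment. The spread across these numbers shows how sensitive the estimate is to the assumed doubling time.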

Statistical Modeling, Causal Inference, and Social Science
