“Estimating Covid-19 prevalence using Singapore and Taiwan”

Jacob Steinhardt writes:

I wanted to share some applied statistical modeling that you and your readers might enjoy. I took a break from machine learning research for the past week to work on it, in particular trying to correct for underreporting due to insufficient testing in some countries. My overall conclusion is that in most European countries, backing out the number of cases from the mortality data is reasonably reliable, but there are other countries where it’s less reliable and the reported deaths may substantially underestimate the actual deaths.

Of course, my analysis also relies on assumptions, many of which are obviously incorrect. But it’s a different set of incorrect assumptions than taking the reported deaths as given, so together these can help start to paint a clearer picture. And hopefully more analyses and more data later will continue to improve our understanding.

The full blog post is here, and you can also find the underlying data here or even rawer data on github.

I haven’t read this in detail, but I’m forwarding in case it interests some of you. My only quick comment on the analysis is that you should just about never use the Poisson model. Always use overdispersed Poisson. Also, I recommend you fit any model in Stan: it’s flexible, so you can expand the model in various ways, include new data, etc., the usual story.
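
To make the Poisson point concrete, here is a minimal simulation sketch in Python (not Stan, and not taken from the linked analysis). It compares plain Poisson counts with counts from a gamma-Poisson mixture, which is one common form of overdispersed Poisson. The mean of 10, the shape of 2, and all function names are illustrative assumptions.

```python
import math
import random
import statistics


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) variate with Knuth's method (fine for modest rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_overdispersed(mean, shape, rng):
    """Gamma-Poisson mixture: the rate itself varies from draw to draw.

    Theoretical variance is mean + mean**2 / shape, so it exceeds the mean.
    """
    rate = rng.gammavariate(shape, mean / shape)  # Gamma(shape, scale)
    return sample_poisson(rate, rng)


def share_inside_poisson_interval(draws, mean):
    """Share of draws inside the approximate 95% interval a Poisson model implies."""
    half_width = 1.96 * math.sqrt(mean)
    inside = [d for d in draws if mean - half_width <= d <= mean + half_width]
    return len(inside) / len(draws)


rng = random.Random(2020)   # fixed seed so the sketch is reproducible
MEAN, SHAPE, N = 10.0, 2.0, 20000  # illustrative values, not fitted to any data

poisson_draws = [sample_poisson(MEAN, rng) for _ in range(N)]
overdispersed_draws = [sample_overdispersed(MEAN, SHAPE, rng) for _ in range(N)]

# For a true Poisson the variance/mean ratio is about 1; overdispersion pushes it above 1.
poisson_ratio = statistics.variance(poisson_draws) / statistics.mean(poisson_draws)
overdispersed_ratio = (
    statistics.variance(overdispersed_draws) / statistics.mean(overdispersed_draws)
)

poisson_coverage = share_inside_poisson_interval(poisson_draws, MEAN)
overdispersed_coverage = share_inside_poisson_interval(overdispersed_draws, MEAN)

print(f"variance/mean, Poisson:       {poisson_ratio:.2f}")
print(f"variance/mean, overdispersed: {overdispersed_ratio:.2f}")
print(f"share inside Poisson 95% interval, Poisson data:       {poisson_coverage:.2f}")
print(f"share inside Poisson 95% interval, overdispersed data: {overdispersed_coverage:.2f}")
```

If real counts behave more like the second series, a plain Poisson model will report uncertainty intervals that are too narrow, which is the practical reason to default to the overdispersed version.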


  1. Gowanus says:

    “Here I present a method to try to estimate the infection prevalence in different countries. The basic idea is to make use of the data from Singapore and Taiwan, which both perform thorough testing…”


    …so a classic, non-representative “Convenience Sample” purporting to make somewhat valid conclusions on a huge world population.

    (Look at India for a radically different result)

  2. EHG says:

    The author writes:

    > This is clear just from looking at some of the numbers; the estimated infection prevalence of 0.05% in Italy is clearly too low, while the Egypt prevalence is too high — a prevalence of 1.94% would be substantially higher than the point at which Italy’s medical system was overrun, and we haven’t seen that in Egypt. […] Italy’s low apparent prevalence is likely due to travel restrictions suppressing the number of cases observed in other countries.

    But hospitals are not overrun across Italy (and let us hope it stays that way) – the outbreak is really only very bad in Lombardy (plus to a lesser extent Emilia-Romagna and Veneto). See here. So a country could have a low infection prevalence nationally and still exceed hospital capacity in a specific region if most cases occurred there. That could be the case in Italy; I don’t know how the situation looks in Egypt.

    That’s a quibble; it is a good read! I was wondering why the UK had relatively few cases compared to the countries of the Continent – I suppose it’d make sense that underreporting is a factor.

  3. zbicyclist says:

    I fully agree with “you should just about never use the Poisson model. Always use overdispersed Poisson”.

  4. Calum says:

    Andrew, have you seen this paper out of Oxford University? It suggests that the conspiracy theory doing the rounds that Covid-19 was circulating much earlier than we realise might have some truth to it.

    I feel very uncomfortable relying on the tail of a model in this way to support such a conclusion, but I’ve no expertise in fitting MCMC models in general, never mind SIR models.
