Concerns with that Stanford study of coronavirus prevalence

Josh Rushton writes:

I’ve been following your blog for a while and checked in today to see if there was a thread on last week’s big-splash Stanford antibody study (the one with the shocking headline that they got 50 positive results in a “random” sample of 3330 antibody tests, suggesting that nearly 2% of the population has been infected “under the radar”). I didn’t see anything, so I thought I’d ask if you’d consider opening a discussion.

This paper is certainly relevant to the MrP thread on politicization of the covid response, in that the paper risks injecting misinformation into an already-broken policy discussion. But I think it would be better to use it as a case study on poor statistics and questionable study design. I don’t mean to sound harsh, but if scientists are afraid to “police” ourselves, I don’t know how we can ask the public to trust us.

Simply put, I see two potentially fatal flaws with the study (full disclosure: I [Rushton] haven’t read the entire paper — a thousand apologies if I’m jumping the gun — but it’s hard to imagine these getting explained away in the fine print):

  • The authors’ confidence intervals cannot possibly be accounting for false positives correctly (I think they use the term “specificity” to mean “low rate of false-positives”). I say this because the test validation included a total of 30+371 pre-covid blood tests, and only 399 of them came back negative. I know that low-incidence binomial CIs can be tricky, and I don’t know the standard practice these days, but the exact binomial 95% CI for the false-positive rate is (0.0006, 0.0179); this is pretty consistent with the authors’ specificity CI (98.3%, 99.9%). For rates near the high end of this CI, you’d get 50 or more false positives in 3330 tests with about 90% probability. Hard to sort through this with strict frequentist logic (obviously a Bayesian could make short work of it), but the common-sense take-away is clear: It’s perfectly plausible (in the 95% CI sense) that the shocking prevalence rates published in the study are mostly, or even entirely, due to false positives. So the fact that their prevalence CIs don’t go anywhere near zero simply can’t be right.
  • Recruitment was done via Facebook ads with basic demographic targeting. Since we’re looking for a feature that affects something like 2% of the population (or much, much less), we really have to worry about self-selection. They may have discussed this in the portions of the paper I didn’t read, but I can’t imagine how researchers would defeat the desire to get a test if you had reason to believe that you, or someone near you, had the virus (and wouldn’t some people hide those reasons to avoid being disqualified from getting the test?)…
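Rushton’s arithmetic is easy to check. Here’s a quick sketch in Python (mine, not his; scipy required, and the 399-out-of-401 negative count is taken from his email above):

    # Exact (Clopper-Pearson) 95% CI for the false-positive rate, given
    # 2 positives among 401 pre-covid samples, and the chance of seeing
    # 50+ false positives in 3330 tests at the high end of that interval.
    from scipy import stats

    n_controls = 401      # 30 + 371 pre-covid blood samples
    false_pos = 2         # 401 - 399 that came back negative

    lo = stats.beta.ppf(0.025, false_pos, n_controls - false_pos + 1)
    hi = stats.beta.ppf(0.975, false_pos + 1, n_controls - false_pos)
    print(f"exact 95% CI for false-positive rate: ({lo:.4f}, {hi:.4f})")
    # roughly (0.0006, 0.0179), matching the numbers in the email

    print(f"P(50+ false positives in 3330 tests at the upper bound): "
          f"{stats.binom.sf(49, 3330, hi):.2f}")   # roughly 0.9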

Pretty harsh words—but this is just some guy sending me an email. I’ll have to read the paper and judge for myself, which I did with an open mind. (Let me assure you that I did not title this post until after writing most of it.)

It’s been a busy month for Stanford on the blog. First there were these pre-debunked forecasts we heard from a couple of assholes from the Hoover Institution, then some grad students sent us this pretty sane literature review, and now this!

Reading through the preprint

Anyway, after receiving the above email, I clicked through and read the preprint, “COVID-19 Antibody Seroprevalence in Santa Clara County, California,” by Eran Bendavid et al., which reports:

On 4/3-4/4, 2020, we tested county residents for antibodies to SARS-CoV-2 using a lateral flow immunoassay. Participants were recruited using Facebook ads targeting a representative sample of the county by demographic and geographic characteristics. We report the prevalence of antibodies to SARS-CoV-2 in a sample of 3,330 people, adjusting for zip code, sex, and race/ethnicity. . . . The unadjusted prevalence of antibodies to SARS-CoV-2 in Santa Clara County was 1.5% . . . and the population-weighted prevalence was 2.8%.

That’s positive test results. Then you have to adjust for testing errors:

Under the three scenarios for test performance characteristics, the population prevalence of COVID-19 in Santa Clara ranged from 2.5% to 4.2%. [I’ve rounded all numbers to a single decimal place for my own sanity. — AG]

To discuss this paper, I’ll work backward, starting from the conclusion and going through the methods and assumptions.

Let’s take their final estimate, 2.5% to 4.2%, and call it 3%. Is a 3% rate of coronavirus antibodies in Santa Clara county a high or a low number? And does this represent good news or bad news?

First off, 3% does not sound implausible. If they said 30%, I’d be skeptical, given how everyone’s been hiding out for a while, but 3%, sure, maybe so. Bendavid et al. argue that if the number is 3%, that’s good news, because Santa Clara county has 2 million people and only an estimated 100 deaths . . . 0.03*(2 million)/100 = 600, so that implies that 1/600 of exposed people there died. So that’s good news, relatively speaking: we’d still like to avoid 300 million Americans getting the virus and 500,000 dying, but that’s still better than the doomsday scenario.

It’s hard to wrap my head around these numbers because, on one hand, a 1/600 death rate sounds pretty low; on the other, 500,000 deaths is a lot. I guess 500,000 is too high because nobody’s saying that everyone will get exposed.

The study was reported in the news with headlines like “Santa Clara county has had 50 to 85 times more cases than we knew about, Stanford estimates.” It does seem plausible that lots more people have been exposed than have been tested for the disease, as so few tests are being done.

At the time of this writing, NYC has about 9000 recorded coronavirus deaths. Multiply by 600 and you get 5.4 million. OK, I don’t think 5.4 million New Yorkers have been exposed to coronavirus. New York only has 8.4 million people total! I don’t think I know anyone who’s had coronavirus. Sure, you can have it and not have any symptoms—but if it’s as contagious as all that, then if I had it, I guess all my family would get it too, and then I’d guess that somebody would show some symptoms.

That’s fine—for reasons we’ve been discussing for a while—actually, it was just a month and a half ago—it doesn’t make sense to talk about a single “case fatality rate,” as it depends on age and all sorts of other things. The point is that there’ve gotta be lots of coronavirus cases that have not been recorded, given that we have nothing close to universal or random-sample testing. But the 1/600 number doesn’t seem quite right either.

Figuring out where the estimates came from

OK, now let’s see where the Stanford estimate came from. They did a survey and found 1.5% positive tests (that’s 50 out of 3330 in the sample). Then they did three statistical adjustments:

1. They poststratified on zip code, sex, and ethnicity to get an estimate of 2.8%. Poststratification is a standard statistical technique, but some important practical issues arise regarding what to adjust for.

2. They adjusted for test inaccuracy. This is a well-known probability problem—with a rare disease and an imperfect test, you can easily end up with most of your positive test results being false positives. The error rates of the test are the key inputs to this calculation.

3. They got uncertainty intervals based on the sampling in the data. That’s the simplest part of the analysis, and I won’t talk much about it here. It does come up, though, in the implicit decision of the paper to focus on point estimates rather than uncertainty ranges. To the extent that the point estimates are implausible (e.g., my doubts about the 1/600 ratio above), that could point toward a Bayesian analysis that would account for inferential uncertainty. But I’m guessing that the uncertainty due to sampling variation is minor compared to uncertainty arising from the error rate of the test.

I’ll discuss each of these steps in turn, but I also want to mention three other issues:

4. Selection bias. As Rushton wrote, it could be that people who’d had coronavirus symptoms were more likely to avail themselves of a free test.

5. Auxiliary information. In any such study, you’d want to record respondents’ ages and symptoms. And, indeed, these were asked about in the survey. However, these were not used in the analysis and played no role in the conclusion. In particular, one might want to use responses about symptoms to assess possible selection bias.

6. Data availability. The data for this study do not seem to be available. That’s too bad. I can’t see that there’d be a confidentiality issue: just knowing someone’s age, sex, ethnicity, and coronavirus symptoms should not be enough to allow someone to be identified, right? I guess that including zip code could be enough for some categories, maybe? But if that were the only issue, they could just pool some of the less populated zip codes. I’m guessing that the reason they didn’t release the data is simple bureaucracy: it’s easier to get a study approved if you promise you won’t release the data than if you say you will. Backasswards, that is, but that’s the world that academic researchers have to deal with, and my guess is that the turf-protectors in the IRB industry aren’t gonna let go of this one without a fight. Too bad, though: without the data and the code, we just have to guess at what was done. And we can’t do any of the natural alternative analyses.

Assessing the statistical analysis

Now let’s go through each step.

1. Poststratification.

There are 2 sexes and it seems that the researchers used 4 ethnicity categories. I’m not sure how they adjusted for zip code. From their map, it seems that there are about 60 zip codes in the county, so there’s no way they simply poststratified on all of them. They say, “we re-weighted our sample by zip code, sex, and race/ethnicity,” but “re-weighted . . . by zip code” doesn’t really say exactly what they did. Just to be clear, I’m not suggesting malfeasance here; it’s just the usual story that it can be hard for people to describe their calculations in words. Even formulas are not so helpful because they can lack key details.

I’m concerned about the poststratification for three reasons. First, they didn’t poststratify on age, and the age distribution is way off! Only 5% of their sample is 65 and over, as compared to 13% of the population of Santa Clara county. Second, I don’t know what to think about the zip code adjustment, since I don’t know what was actually done there. This is probably not the biggest deal, but given that they bothered to adjust at all, I’m concerned. Third, I really don’t know what they did, because they say they weighted to adjust for zip code, sex, and ethnicity in the general population—but in Table 1 they give their adjusted proportions for sex and ethnicity and they don’t match the general population! They’re close, but not exact. Again, I’d say this is no big deal, but I hate not knowing what was actually done.

And why did they not adjust for age? They write, “We chose these three adjustors because they contributed to the largest imbalance in our sample, and because including additional adjustors would result in small-N bins.” They should’ve called up a survey statistician and asked for help on this one: it’s a standard problem. You can do MRP—that’s what I’d do!—but even some simple raking would be fine here, I think.
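Just to show what I mean by raking: you iterate over the adjustment variables, each time scaling the weights so the weighted sample matches that variable’s population margin. Here’s a minimal sketch on invented data (the categories, sample skews, and population margins below are all made up for illustration; this is not the authors’ data or code):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 3330

    # Invented respondent data with a sample that skews female and young.
    data = {
        "sex": rng.choice(["F", "M"], size=n, p=[0.6, 0.4]),
        "age": rng.choice(["<65", "65+"], size=n, p=[0.95, 0.05]),
        "eth": rng.choice(["A", "B", "C", "D"], size=n, p=[0.5, 0.2, 0.2, 0.1]),
    }
    # Invented population margins to rake toward.
    targets = {
        "sex": {"F": 0.50, "M": 0.50},
        "age": {"<65": 0.87, "65+": 0.13},
        "eth": {"A": 0.40, "B": 0.30, "C": 0.20, "D": 0.10},
    }

    w = np.ones(n)
    for _ in range(50):  # a few dozen passes is plenty for convergence here
        for var, target in targets.items():
            shares = {lev: w[data[var] == lev].sum() / w.sum() for lev in target}
            for lev, goal in target.items():
                w[data[var] == lev] *= goal / shares[lev]

    # Weighted margins now match the targets (up to rounding).
    print({var: {lev: round(w[data[var] == lev].sum() / w.sum(), 3) for lev in t}
           for var, t in targets.items()})

In the real analysis you’d rake to the actual Santa Clara margins (or, better, fit the full MRP model), but the mechanics are no harder than this, and age comes along for free.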

There aren’t a lot of survey statisticians out there, but there are some. They could’ve called me up and asked for advice, or they could’ve stayed on campus and asked Doug Rivers or Jon Krosnick—they’re both experts on sampling and survey adjustments. I guess it’s hard to find experts on short notice. Doug and Jon don’t have M.D.’s and they’re not economists or law professors, so I guess they don’t count as experts by the usual measures.

2. Test inaccuracy.

This is the big one. If X% of the population have the antibodies and the test has an error rate that’s not a lot lower than X%, you’re in big trouble. This doesn’t mean you shouldn’t do testing, but it does mean you need to interpret the results carefully. Bendavid et al. estimate that the sensitivity of the test is somewhere between 84% and 97% and that the specificity is somewhere between 90% and 100%. I can never remember which is sensitivity and which is specificity, so I looked it up on wikipedia: “Sensitivity . . . measures the proportion of actual positives that are correctly identified as such . . . Specificity . . . measures the proportion of actual negatives that are correctly identified as such.” OK, here our concern is actual negatives who are misclassified, so what’s relevant is the specificity. That’s the number between 90% and 100%.

If the specificity is 90%, we’re sunk. With a 90% specificity, you’d expect to see 333 positive tests out of 3330, even if nobody had the antibodies at all. Indeed, they only saw 50 positives, that is, 1.5%, so we can be pretty sure that the specificity is at least 98.5%. If the specificity were 98.5%, the observed data would be consistent with zero, which is one of Rushton’s points above. On the other hand, if the specificity were 100%, then we could take the result at face value.
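To see how sensitive the adjusted estimate is to the specificity, here’s the standard correction, prevalence = (observed rate + specificity − 1) / (sensitivity + specificity − 1), applied to the raw 50/3330. This is a sketch, not their calculation: I’m fixing the sensitivity at 0.90, a value inside their stated 84%–97% range, because the answer barely depends on it.

    # Test-error-corrected prevalence as a function of specificity,
    # using p = (p_obs + spec - 1) / (sens + spec - 1), truncated at zero.
    p_obs = 50 / 3330     # raw positive rate, about 1.5%
    sens = 0.90           # assumed; see note above

    for spec in [1.000, 0.995, 0.990, 0.985, 0.980]:
        p_true = (p_obs + spec - 1) / (sens + spec - 1)
        print(f"specificity {spec:.3f} -> corrected prevalence {max(p_true, 0):.2%}")

Run that and the corrected estimate drops from about 1.7% at a specificity of 100% to essentially zero at 98.5%, which is the point: everything hinges on the third decimal place of the specificity.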

So how do they get their estimates? Again, the key number here is the specificity. Here’s exactly what they say regarding specificity:

A sample of 30 pre-COVID samples from hip surgery patients were also tested, and all 30 were negative. . . . The manufacturer’s test characteristics relied on . . . pre-COVID sera for negative gold standard . . . Among 371 pre-COVID samples, 369 were negative.

This gives two estimates of specificity: 30/30 = 100% and 369/371 = 99.46%. Or you can combine them together to get 399/401 = 99.50%. If you really trust these numbers, you’re cool: with y=399 and n=401, we can do the standard Agresti-Coull 95% interval based on y+2 and n+4, which comes to [98.0%, 100%]. If you go to the lower bound of that interval, you start to get in trouble: remember that if the specificity is less than 98.5%, you’ll expect to see more than 1.5% positive tests in the data no matter what!
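Here’s that calculation, if you want to check it (a sketch in plain Python; the 399/401 counts are from the specificity data quoted above):

    import math

    # Agresti-Coull 95% interval for specificity from 399 negatives out of 401:
    # add 2 successes and 4 trials, then use the normal-approximation interval.
    y, n = 399, 401
    p_tilde = (y + 2) / (n + 4)
    se = math.sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    lo, hi = p_tilde - 1.96 * se, min(p_tilde + 1.96 * se, 1.0)
    print(f"Agresti-Coull 95% CI for specificity: [{lo:.3f}, {hi:.3f}]")
    # about [0.980, 1.000]

    # At the lower end, the expected number of false positives in 3330 tests
    # is already larger than the 50 positives that were observed.
    print(f"expected false positives at specificity {lo:.3f}: {3330 * (1 - lo):.0f}")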

3. Uncertainty intervals. So what’s going on here? If the specificity data in the paper are consistent with all the tests being false positives—not that we believe all the tests are false positives, but this suggests we can’t then estimate the true positive rate with any precision—then how do they get a confidently nonzero estimate of the true positive rate in the population?

It seems that two things are going on. First, they’re focusing on the point estimates of specificity. Their headline is the range from 2.5% to 4.2%, which comes from their point estimates of specificity of 100% (from their 30/30 data) and 99.5% (from the manufacturer’s 369/371). So the range they give is not a confidence interval; it’s two point estimates from different subsets of their testing data. Second, I think they’re doing something wrong, or more than one thing wrong, with their uncertainty estimates, which are “2.5% (95CI 1.8-3.2%)” and “4.2% (2.6-5.7%)” (again, I’ve rounded to one decimal place for clarity). The problem is that we’ve already seen that a 95% interval for the specificity will go below 98.5%, which implies that the 95% interval for the true positive rate should include zero.

Why does their interval not include zero, then? I can’t be sure, but one possibility is that they did the sensitivity-specificity corrections on the poststratified estimate. But, if so, I don’t think that’s right. 50 positive tests is 50 positive tests, and if the specificity is really 98.5%, you could get that with no true cases. Also, I’m baffled because I think the 2.5% is coming from that 30/30=100% specificity estimate, but in that case you’d need a really wide confidence interval, which would again go way below 98.5% so that the confidence interval for the true positive rate would include zero.

Again, the real point here is not whether zero is or “should be” in the 95% interval, but rather that, once the specificity can get in the neighborhood of 98.5% or lower, you can’t use this crude approach to estimate the prevalence; all you can do is bound it from above, which completely destroys the “50-85-fold more than the number of confirmed cases” claim.
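If you want to see that last point numerically, here’s a crude Monte Carlo sketch. It is not the authors’ model and not a full Bayesian analysis: I fix the sensitivity at 0.90 (the specificity is what drives everything here) and give the specificity a flat-prior posterior based on the 399-out-of-401 negative controls.

    import numpy as np

    rng = np.random.default_rng(1)
    draws = 100_000

    spec = rng.beta(399 + 1, 2 + 1, size=draws)   # posterior for specificity
    sens = 0.90                                   # assumed, fixed
    p_obs = 50 / 3330

    # Standard correction, truncated at zero when a draw implies that the
    # observed positives could all be false positives.
    prev = np.clip((p_obs + spec - 1) / (sens + spec - 1), 0, None)

    print(f"share of draws with corrected prevalence exactly 0: {np.mean(prev == 0):.0%}")
    print(f"implied 95% interval for prevalence: "
          f"[{np.percentile(prev, 2.5):.2%}, {np.percentile(prev, 97.5):.2%}]")

With these assumptions, several percent of the draws land exactly at zero, so the implied 95% interval for the (unweighted) prevalence runs from zero up to roughly 1.5%. In other words, all this calculation gives you is an upper bound, which is the point made above.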

They do talk about this a bit: “if new estimates indicate test specificity to be less than 97.9%, our SARS-CoV-2 prevalence estimate would change from 2.8% to less than 1%, and the lower uncertainty bound of our estimate would include zero. On the other hand, lower sensitivity, which has been raised as a concern with point-of-care test kits, would imply that the population prevalence would be even higher.” But I think this misses the point. First, if the specificity were less than 97.9%, you’d expect more than 70 positive cases out of 3330 tests. But they only saw 50 positives, so I don’t think that 1% rate makes sense. Second, the bit about the sensitivity is a red herring here. The uncertainty here is pretty much entirely driven by the uncertainty in the specificity.

This is all pretty much what Rushton said in one paragraph of his email. I just did what was, in retrospect, overkill here because I wanted to understand what the authors were doing.

4. Selection bias. In their article, Bendavid et al. address the possibility: “Other biases, such as bias favoring individuals in good health capable of attending our testing sites, or bias favoring those with prior COVID-like illnesses seeking antibody confirmation are also possible.” That makes sense. Bias could go in either direction. I don’t have a good sense of this, and I think it’s fine to report the results of a self-selected population, as long as (a) you make clear the sampling procedure, and (b) you do your best to adjust.

Regarding (b), I wonder if they could’ve done more. In addition to my concerns expressed above regarding insufficient poststratification (in turn driven by their apparent lack of consultation with a statistics expert), I also wonder if they could’ve done something with the data they collected on “underlying co-morbidities, and prior clinical symptoms.” I don’t see these data anywhere in the report, which is too bad. They could’ve said what percentage of the people in their survey reported any coronavirus-related symptoms.

5. Auxiliary information and 6. Data availability. As noted above, it seems that the researchers collected some information that could have helped us understand their results, but these data are unavailable to us.

Jeez—I just spent 3 hours writing this post. I don’t think it was worth the time. I could’ve just shared Rushton’s email with all of you—that would’ve just taken 5 minutes!

Summary

I think the authors of the above-linked paper owe us all an apology. We wasted time and effort discussing this paper whose main selling point was some numbers that were essentially the product of a statistical error.

I’m serious about the apology. Everyone makes mistakes. I don’t think the authors need to apologize just because they screwed up. I think they need to apologize because these were avoidable screw-ups. They’re the kind of screw-ups that happen if you want to leap out with an exciting finding and you don’t look too carefully at what you might have done wrong.

Look. A couple weeks ago I was involved in a survey regarding coronavirus symptoms and some other things. We took the data and ran some regressions and got some cool results. We were excited. That’s fine. But we didn’t then write up a damn preprint and set the publicity machine into action. We noticed a bunch of weird things with our data, lots of cases were excluded for one reason or another, then we realized there were some issues of imbalance so we couldn’t really trust the regression as is, at the very least we’d want to do some matching first . . . I don’t actually know what’s happening with that project right now. Fine. We better clean up the data if we want to say anything useful. Or we could release the raw data, whatever. The point is, if you’re gonna go to all this trouble collecting your data, be a bit more careful in the analysis! Careful not just in the details but in the process: get some outsiders involved who can have a fresh perspective and aren’t invested in the success of your project.

Also, remember that reputational inference goes both ways. The authors of this article put in a lot of work because they are concerned about public health and want to contribute to useful decision making. The study got attention and credibility in part because of the reputation of Stanford. Fair enough: Stanford’s a great institution. Amazing things are done at Stanford. But Stanford has also paid a small price for publicizing this work, because people will remember that “the Stanford study” was hyped but it had issues. So there is a cost here. The next study out of Stanford will have a little less of that credibility bank to borrow from. If I were a Stanford professor, I’d be kind of annoyed. So I think the authors of the study owe an apology not just to us, but to Stanford. Not to single out Stanford, though. There’s also Cornell, which is known as that place with the ESP professor and that goofy soup-bowl guy who faked his data. And I teach at Columbia; our most famous professor is . . . Dr. Oz.

It’s all about the blood

I’m not saying that the claims in the above-linked paper are wrong. Maybe the test they are using really does have a 100% specificity rate and maybe the prevalence in Santa Clara county really was 4.2%. It’s possible. The problem with the paper is that (a) it doesn’t make this reasoning clear, and (b) their uncertainty statements are not consistent with the information they themselves present.

Let me put it another way. The fact that the authors keep saying that “50-85-fold” thing suggests to me that they sincerely believe that the specificity of their test is between 99.5% and 100%. They’re clinicians and medical testing experts; I’m not. Fine. But then they should make that assumption crystal clear. In the abstract of their paper. Something like this:

We believe that the specificity of the test used in this study is between 99.5% and 100%. Under this assumption, we conclude that the population prevalence in Santa Clara county was between 1.8% and 5.7% . . .

This specificity thing is your key assumption, so place it front and center. Own your modeling decisions.

P.S. Again, I know nothing about blood testing. Perhaps we could convene an expert panel including George Shultz, Henry Kissinger, and David Boies to adjudicate the evidence on this one?

P.P.S. The authors provide some details on their methods here. Here’s what’s up:

– For the poststratification, it turns out they do adjust for every zip code. I’m surprised, as I’d think that could give them some noisy weights, but, given our other concerns with this study, I guess noisy weights are the least of our worries. Also, they don’t quite weight by sex x ethnicity x zip; they actually weight by the two-way margins, sex x zip and ethnicity x zip. Again, not the world’s biggest deal. They should’ve adjusted for age, too, though, as that’s a freebie.

– They have a formula to account for uncertainty in the estimated specificity. But something seems to have gone wrong, as discussed in the above post. It’s hard to know exactly what went wrong since we don’t have the data and code. For example, I don’t know what they are using for var(q).

P.P.P.S. Let me again emphasize that “not statistically significant” is not the same thing as “no effect.” What I’m saying in the above post is that the information in the above-linked article does not provide strong evidence that the rate of people in Santa Clara county exposed by that date was as high as claimed. Indeed, the data as reported are consistent with the null hypothesis of no exposure, and also with alternative hypotheses such as exposure rates of 0.1% or 0.5% or whatever. But we know the null hypothesis isn’t true—people in that county have been infected! The data as reported are also consistent with infection rates of 2% or 4%. Indeed, as I wrote above, 3% seems like a plausible number. As I wrote above, “I’m not saying that the claims in the above-linked paper are wrong,” and I’m certainly not saying we should take our skepticism in their specific claims and use that as evidence in favor of a null hypothesis. I think we just need to accept some uncertainty here. The Bendavid et al. study is problematic if it is taken as strong evidence for those particular estimates, but it’s valuable if it’s considered as one piece of information that’s part of a big picture that remains uncertain. When I wrote that the authors of the article owe us all an apology, I didn’t mean they owed us an apology for doing the study, I meant they owed us an apology for avoidable errors in the statistical analysis that led to overconfident claims. But, again, let’s not make the opposite mistake of using uncertainty as a way to affirm a null hypothesis.

P.P.P.P.S. I’m still concerned about the zip code weighting. Their formula has N^S_zsr in the denominator: that’s the number of people in the sample in each category of zip code x sex x race. But there are enough zip codes in the county that I’m concerned that weighting in this way will be very noisy. This is a particular concern here because even the unweighted estimate of 1.5% is so noisy that, given the data available, it could be explained simply by false positives. Again, this does not make the substantive claims in the paper false (or true), it’s just one more reason these estimates are too noisy to do more than give us an upper bound on the infection rate, unless you want to make additional assumptions. You could say that the analysis as performed in the paper does make additional assumptions, it just does so implicitly via forking paths.
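To get a sense of how small those cells can get, here’s a toy simulation. All the cell probabilities are invented (unequal zip sizes drawn from a Dirichlet, made-up sex and race/ethnicity splits); the only point is what happens when you spread 3,330 people across roughly 60 × 2 × 4 = 480 cells.

    import numpy as np

    rng = np.random.default_rng(2)
    n, n_zip = 3330, 60

    zip_probs = rng.dirichlet(np.ones(n_zip) * 2)          # unequal zip sizes
    sex_probs = np.array([0.6, 0.4])
    race_probs = np.array([0.5, 0.2, 0.2, 0.1])

    cell_probs = (zip_probs[:, None, None] * sex_probs[None, :, None]
                  * race_probs[None, None, :]).ravel()
    counts = rng.multinomial(n, cell_probs)                # N^S_zsr for each cell

    print(f"cells with 0 respondents: {(counts == 0).sum()} of {counts.size}")
    print(f"cells with 1-4 respondents: {((counts >= 1) & (counts <= 4)).sum()}")
    print(f"largest cell: {counts.max()}")

In runs like this, a fair number of cells come out empty or nearly empty, and any weight with N^S_zsr in the denominator will bounce around accordingly.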

P.P.P.P.P.S. A new version of the article has been released; see discussion here and here.

P.P.P.P.P.P.S. See here for our analysis of the data published in the revised report. Our conclusion:

For now, we do not think the data support the claim that the number of infections in Santa Clara County was between 50 and 85 times the count of cases reported at the time, or the implied interval for the IFR of 0.12–0.2%. These numbers are consistent with the data, but the data are also consistent with a near-zero infection rate in the county. The data of Bendavid et al. (2020a,b) do not provide strong evidence about the number of people infected or the infection fatality ratio; the number of positive tests in the data is just too small, given uncertainty in the specificity of the test.

432 thoughts on “Concerns with that Stanford study of coronavirus prevalence”

  1. I believe self selection is a huge problem. People who suspect they may have had the virus will want to know and will volunteer for these studies at many times the rate of the general population. There is just no way to adjust for this because there is no representative sample enabling you to estimate the self-selection tendency. So even if they do have 3% of their sample, it could just mean that 3% of people who suspect they have the virus and want to know really do. Doesn’t say anything about the general population.

    • > no way to adjust for this because there is no representative sample enabling you to estimate the self-selection tendency

      Just make the ad say it’s an antibody study NOT testing for covid19 antibodies. Then in person tell them it’s for covid and tell them the results.

        • I bet if they repeat the study in an isolated place like Comoros they will get the same results. :)

      • Self-selection works the other way as well and may not be an indicator of disease presence but of anxiety about having the disease. As a clinician, the number of people wanting testing for all manner of symptoms might astound you. All so far proved negative.

        • I don’t see that sort of self selection as being very likely to produce infection levels lower than random selection. Anxiety and common symptoms might be a poor indicator of a rare illness, but it’s not like they are anti-correlated with it either.

        • The correlation may exist if people who are more concerned about the possibility of an infection are at the same time a) more likely to get tested because they are more motivated and b) less likely to get infected because they are more careful.

        • I suppose that is possible, but doesn’t the rarity here make that a small factor? You would need a great number of hypochondriacs skilled at disease avoidance to make a difference here, especially if we’ve already adjusted away demographic differences. (Like I can see income as causing such a correlation.)

          I agree; many years of clinical practice has taught me it biases neither way, and that’s really the point I’m trying to make. Statistics can only take you so far, particularly when there are so many variables to account for and not enough data, and plausibility or, dare I say, enlightened instinct can guide your interpretation of what’s actually happening. I’m a big admirer of John Ioannidis and he has taken a brave stand against the self-interested (at least initially) hysteria the media have generated, and I’m sure he started out with the premise that the quoted figures seemed a little ridiculous to be able to generalize. I think attempts at statistical rigour create as many problems as they solve sometimes, and in fact this is where John got himself into a pickle when he claimed antidepressants were no better than placebo, ignoring obvious clinical usefulness. At least he wasn’t so rigid and partly retracted his stance. So I don’t believe he is statistically motivated but is using the study to bolster what he believes is likely.

          The other point I want to make is that John is anything but an ‘asshole’ and although I do admire Andrew’s work I wonder whether there is a touch or more of envy in the uncompromising intellectual stand John has taken and the attention it has received and I believe deserves for not being swept up in the panic. It may be years before we get a true picture of what this pandemic really means for our understanding of infectious diseases, but I wouldn’t be surprised if the mortality figures are not much higher than influenza, or at least nowhere near the fear-invoked levels created by other so-called experts having taken the stage, like Imperial College.

          Finally, the discussion is and should also be focussed on the collateral damage total lockdown can have for economics and the social fabric of society, each of which will lead to its own epidemics of morbidity. For example, many hospitals remain empty preparing for this yet-to-materialize influx, and it’s totally distracting from other pressing concerns for humanity that will be far more critical to our well being in the long run. There are many more issues than just statistics to guide one’s impression of this so-called pandemic.

        • Conversely, I suspect many of the posts are driven more by the pervasive sentiment of an obvious increase in death rates with the current pandemic than by any statistical imperative.

        • Hi Costa,
          Kudos to you for standing up to Dr. Ioannidis in this blog, a brave man indeed, who is using science and facts to try and lead public policy towards the right (hopefully) direction. My husband is a Columbia University Graduate and we admire the scrutiny Dr. Ioannidis was willing to withstand, to honor his commitment on unbiased studies and the service of science to mankind.

          Be well and thank you again!

          Anna

        • Costa:

          1. I never said John Ioannidis was an asshole! In the above post I referred to “a couple of assholes from the Hoover Institution.” So I don’t know where that is coming from. Ioannidis is not to my knowledge associated with the Hoover Institution. I was referring to Richard Epstein and Victor Hanson, as can be seen from the links.

          2. I don’t know why you think I’m envious of Ioannidis. I really can’t think of anything I’ve said that would make someone think that. Actually, in the whole post I never mentioned Ioannidis once.

          3. I don’t think the Imperial College researchers are perfect, but they are experts, no? What is gained by calling them “so-called experts”?

          4. I agree with you that statistics only tells part of the story. The statistics of the Santa Clara study are ambiguous; hence we need to rely on other information to make decisions.

        • Well, Imperial College’s modelling has finally been released. What’s your assessment of their expertise now, “asshole”?

        • Lemi:

          I haven’t looked at the raw code from the Imperial College researchers, but I’ve been talking with them a lot about their statistical model. There’s some blog discussion here.

        • Self-selection is a serious problem when there are access to care issues and sub-populations prone to avoid interaction with agencies, or quasi-agencies of the government. I’m surprised at so many people appearing to be unaware of this problem.

      • >Just make the ad say it’s an antibody study NOT testing for covid19 antibodies.

        I doubt that will work. If I see an online ad for an antibody test, with no mention of the disease tested for, at the time of a well-publicized outbreak, I do believe I’ll make a good guess at what disease is being tracked.

    • Hello I am not a statistician but correct me if I’m wrong. Keep in mind a recent Dutch study came up with a similar number of 3%. Anyway…

      Now, from what I read in the thread, the problem is with the +/- of the test. The poster of this thread is very casual about what specificity vs sensitivity is. Specificities tend to underestimate how many have this thing. The researchers said they tested the “tests” and it correctly identified 100% of the negatives and 68% of the positives. Presumably with a PCR test. A follow up test identified 100% of negatives and 93% of positives. In both cases it shows that people who don’t have it are correctly identified. You are negative, you don’t have it. However there are some positive cases that are being missed. This tells me that the number infected will most likely be greater than 3%, which means we are massively over-inflating the death rates.

      • No test gives a positive always and only when the sample is positive. Even just if there’s contamination you can get a positive for a person who doesn’t have the disease. There are other reasons too.

        The point of this article is that if the specificity (the percentage of true negatives that correctly test negative) is 98.5% or less… then the data is consistent with there not being any true positives in the sample.

        Worse than that, there are a bunch of extra uncertainties that make it such that even if the specificity is a bit higher, maybe 99% or something, then it’s still within the realm of possibility that all of these are false positives. Even if it isn’t “all” of them, the bulk of the positives might well be false.

        Finally, with the recruitment via FB etc, it seems clear that this group was very likely to be enriched for people who thought they had the disease, so even if it’s a few percent, it’s a few percent of people who think they have the disease.

        Basically, we learn from this data ONLY that there probably aren’t 5% or more of the population who have had the disease in Santa Clara county. That’s all we learn. It’s a useful piece of information, but it’s not much. My prior was that it’s at most a few percent even before this study, so there’s nothing surprising, or particularly informative.

        • The more relevant piece of data for these smaller studies is not the false positives, but the false negatives. Of course false positives are important. We can’t just ignore that. Stanford would choose a high specificity test. The nature of these tests is generally (not always) that if a test is geared toward specificity, the sensitivity is lacking. The controversy is over the factory stats of the antibody test. If Stanford ran tests twice and both times got 100% accuracy on negatives, confirmed with PCR, then I have no reason to doubt them. And both times they ran the test, positives were missed up to 30%. Which again, can only mean the actual positives are underestimated. For smaller studies, that’s what you want. I don’t see why people are so dismissive of this, it’s like they want Stanford to be wrong. (No, I don’t have any affiliation).
          The OP is not the first to make this critique, I saw it in a Nature article and elsewhere. Just people repeating the same thing. They are also repeating the same criticisms of case selection, valid points, but little value. If the selection was biased toward people who may have been exposed due to whatever, that would be massively overshadowed by the number of people induced by media panic. It’s inadvertently more beneficial.
          At this point, I feel it’s a moot issue as USC research (April 20) independently puts the number at 4% as is expected. Dutch study at 3%. I think at this point, we should gear policy making with the 1/1000 CFR as a maximum in mind. Still a large number of deaths. If we debate these numbers any further, we will have bigger problems than the virus.

        • Question:

          It’s a math thing. Actually, it’s a famous probability example. If the underlying prevalence rate is 50%, then, yes, both sorts of testing errors are important. But if the underlying prevalence rate is 1% or 2%, then the specificity is what’s important. It’s counterintuitive, but if you work out the probabilities, you’ll see the issue. Or you can look at the numbers in the above post: if the test has a 98.5% specificity, then you’d expect a rate of 1.5% of positive tests, even in the absence of true positives. The study reports specificity estimates of 99.5% and 100%, but a specificity of 98.5% or lower is also consistent with their data.

          Also, the USC study is not independent. It seems to be done by the same research group. The data are new, but they could be using the same flawed statistical methods.

          As I wrote in my above post, the Stanford/USC team could be correct in their substantive claims. What I and other critics are saying is that there’s a lot of uncertainty: their data are consistent with their claims, but their data are also consistent with much lower prevalence levels.

          The researchers should release their raw data so that more people can take a look. The problem is too important to do anything otherwise.

        • Andrew: I’ve posted on the other article but now that I think about it, Question actually brings up a good point.

          If the test here is *specifically selected* from a choice of several by Stanford on the basis of having (apparently) good specificity, i.e. this 2/371 result, then everyone’s calculations would actually be excessively generous, no? I’m not sure how to do the maths here, but the interval for the false-positive rate, if we require merely that this is the best of, say, 9 or so competing identically behaving tests, would be significantly wider, right?

        • I am aware that specificity is more important in low prevalence.

          However, If we really want to assess whether re-infection is possible, or whether there is herd immunity to this thing, we’d ask for more sensitivity in tests, rather than specificity.

          I believe there were some early reports of positive cases that turned negative (second test) and then again turned positive (third test). I just remember the gist of it being uncertainty whether it was a high false negative rate or re-infection.

          Not sure if that was for the antibody test though, but this post reminded me of it.

          Thanks

        • If you want to investigate the potential for reinfection, you should be looking for “neutralizing antibodies”. Those are antibodies which are sufficient to trap and immobilize the virions. The methods require a different sort of substrate than just a viral protein to bind to.

        • I wrote a graphical tool to help visualize the non-intuitive relationships between predictive values (what we care about when interpreting tests) and the underlying sensitivity and specificity. In particular, how “pretty accurate” tests yield very misleading results when applied to low prevalence populations.

          I think it’s clear that the results of the study WAY overestimate the true infection rate. When the false-positive rate of a test (1 − specificity) is similar to or higher than the prevalence of the disease, then most of the Positive tests are False.

          https://sites.google.com/view/tgmteststat/home
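          The core formula is short enough to paste here too (a quick sketch, not code from the linked tool):

            # Positive predictive value for a "pretty accurate" test at low prevalence.
            sens, spec = 0.93, 0.985   # example values, not from any particular kit

            for prev in [0.005, 0.015, 0.05, 0.20]:
                ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
                print(f"prevalence {prev:.1%}: P(true positive | positive test) = {ppv:.0%}")

          At a prevalence of around 1.5%, even this “pretty accurate” test gives you positives that are only about a coin flip’s worth of true positives.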

        • Buddy.

          Firstly, they didn’t get 100% accuracy both times.

          Generate some binomial random variables with size 401 and probability 0.015 and see how often you get 2 or fewer false positives. Getting two errors with 401 samples is not evidence the error probability is less than 0.005.
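          Concretely, that simulation is only a couple of lines (a quick sketch):

            import numpy as np

            # If the true false-positive rate were 1.5%, how often would a batch of
            # 401 negative controls show 2 or fewer false positives?
            rng = np.random.default_rng(3)
            sims = rng.binomial(401, 0.015, size=200_000)
            print(f"P(2 or fewer false positives): {np.mean(sims <= 2):.2f}")   # about 0.06

          So seeing only 2 errors in 401 control samples is compatible with a false-positive rate big enough to explain all 50 of the positives.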

          The false negatives don’t matter because a correct estimate of the prevalence here would be something like (between 0 and 1) x 1.5% x (between 1 and 1.4). The first factor, corresponding to false positives, is *far more important*.

          > If the selection was biased toward people who may have been exposed due to whatever, that would be massively overshadowed by the number of people induced by media panic. It’s inadvertently more beneficial.

          No? Even if you think people induced by media panic are somehow less likely than a randomly selected person to have the virus, it only takes a small number of people from an enriched population of people with known contacts or symptoms to create 50 excess positives.

          The USC research is not independent. Same author. Same test. Probably same method.

          > I don’t see why people are so dismissive of this, it’s like they want Stanford to be wrong. (No, I don’t have any affiliation).

          It’s because these people who have an affiliation actually studied statistics.

        • The sensitivity would be important if the data showed a much higher % of tests positive. In this study it’s the specificity that throws the study completely off.

          > USC research (April 20] independently puts the number at 4% as is expected. Dutch study at 3%.

          That’s irrelevant, the % infected is obviously going to be different in different parts of the world.

          > I think at this point, we should gear policy making with the 1/1000 CFR as a maximum in mind. Still
          > a large number of deaths. If we debate these numbers any further, we will have bigger problems than
          > the virus.

          This sounds like a political agenda rather than an interest in discussing the validity of this study.
          Everyone is entitled to their own opinion, but everybody is not entitled to their own facts.

        • With reference to the Dutch study also estimating a prevalence of 3%, the Covid-death rate in the Netherlands is about 230 deaths per million total population, whereas it is about 40 deaths per million in California. Applying a 1/1000 CFR to the Dutch deaths would give a prevalence of 23%, not 3%. The Dutch study would suggest a CFR closer to 1/100 than 1/1000. Based on the Dutch results, the Santa Clara prevalence would be about 0.4% rather than 4%. Therefore, I think it would be a major mistake to ‘gear policy making with the 1/1000 CFR as a maximum in mind’ if it could be out by a factor of 10.

      • Also, as to the death rates. It comes down to this: the CFR estimate is an estimate of the death rate for people who get the symptomatic form of the disease. We have evidence that the asymptomatic form could be ~ 50% or so, but none of that matters a lot. The symptomatic cases doubled every ~ 3 days, and the CFR was ~ 5% among that group. So if you want to know how many people will die and you can estimate the growth of the symptomatic fraction… you can do ok by taking the symptomatic group and multiplying by 0.05… It’s easy to make that number be in the millions for the US.

        If ~ everyone got the disease, to keep the deaths below 1M you’d have to have about 93% asymptomatic. 330M*(1-.93)*.05 = 1.2M

        I think 93% asymptomatic is just pure wishful thinking, and that 1.2M deaths in the US is HUGE.

        so, while we really need prevalence estimates, they won’t really change the decision making.

        Inevitably, there’s no low-work way out of this. We have to do the South Korea thing, or something more or less like it.

      • Question:

        You write, “A follow up test identified 100% of negatives . . . it shows that people who don’t have it are correctly identified. You are negative, you don’t have it.”

        That’s the assumption of 100% specificity that’s discussed in the above post. As I wrote, I think the paper would’ve been a zillion percent better if they’d prefaced their claims with, “We believe that the specificity of the test used in this study is between 99.5% and 100%. Under this assumption, we conclude . . .”

      • 100% specificity and 93% sensitivity is an index of false negatives and false positives respectively . So if your test is negative, you do not have the disease. But if your test is positive, there is a 7% chance it’s a false positive. Of course the true probabilities change based on prevalence. If none from a population of 100 have disease, 7 tests could still wrongly be positive (false positive). If all 100 had disease, with a 100% specificity, no case would falsely test negative.

        • An assumption of 100% specificity seems unrealistic. I don’t understand how someone could rationally make this assumption (except perhaps for didactic purposes).

        • You’re first confusing specificity with sensitivity, and then both of them with positive/negative predictive value. 100% specificity gives you full confidence in a positive result, not in negative. But even with a negative result, your post-test probability of not having the disease is not 7% – it depends on the disease prevalence.

    • My first thought about this was that people who are more concerned about the epidemic would be more interested in participating in this study. And these people are the ones who take more care about their behavior so as not to get infected. Probably, the people who are getting most infected are the ones who don’t care about this subject.

    • As a study participant, I concur. I was highly motivated to participate as I had Covid-type symptoms in Feb. I know of other participants who also did everything they could to get in for the same reason.

      The way study access was gated via questionnaire probably also led to incorrect demographic information being provided. I suspect people answered falsely in order to get into the test.

      • Worthwhile information. So many things study designers don’t foresee. We need a compendium of such things. And study designs need to be read and critiqued by a variety of people. Takes a village to plan a good study.

  2. People who suspect they may have had the virus will want to know and will volunteer for these studies at many times the rate of the general population.

    I know a lot of people who think they had it, seems like that could actually be a normal cross section of the population. Also, these results don’t exist in a vacuum:

    https://www.boston25news.com/news/cdc-reviewing-stunning-universal-testing-results-boston-homeless-shelter/Z253TFBO6RG4HCUAARBO4YWO64/

    https://www.bloomberg.com/news/articles/2020-04-11/false-negative-coronavirus-test-results-raise-doctors-doubts

    Of course, maybe all these tests just suck. I’d like to know the results in only the non-dyspneic people showing up with psO2 under 90 rather than people with flu-like symptoms.

    • My guess would be that the people most likely to participate in this drive through test are people who are pretty cautious about infections but who are also out and about.

      In contrast, people who are adamantly Holed Up for the Duration would be reluctant to participate in the in-person part of the study because of fear of infection. At the other extreme, the What-Me-Worry-Just-the-Flu-Bro types probably wouldn’t bother with waiting in line to be tested.

      So I wouldn’t be surprised if the survey lucked into a pretty representative sample.

      But I also wouldn’t be surprised if the sample were highly skewed.

  3. I understand concerns about the data. But what about the code? Seeing that would clear up a lot of the issues that you (rightly!) raise. I will ask them for the code and report back.

    • I think the code to me is less exciting than e.g. how exactly they advertised and recruited for this study, whether the advertisement gave an impression of allowing worried people to get tested. Also stats on ad impressions. If everyone who saw the ad went and got tested then that would go some way to assuaging my doubts on the self selection problem.

    • For those who haven’t had to go through an IRB to get human research approved, being able to release data is a problem. When reviewing articles for publication, I almost always ask why the data has not been provided or released and the answer is almost always their IRB didn’t approve it. Here is why…

      When human subjects are involved in a medical trial, there are 18 “identifying” fields that if removed will provide a safe harbor for the researchers. Those 18 are: names, geographic divisions smaller than state (with some rules about ZIP codes), all dates (other than year) or all ages over 89, telephone numbers, vehicle IDs, fax numbers, device IDs, email addresses, URLs, SSN, IP addresses, medical record numbers, biometric identifiers, health plan beneficiary numbers, full-face photographs, account numbers, any other identifying characteristic, or license numbers. So using the whole 5-digit ZIP code is *not* allowed (even if those ZIP codes have lots of people). At a minimum, only the first-3 digits can be used and even then only if that area has at least 20,000 people. Additionally, reporting ages over 89 years old isn’t allowed. Everyone 90 and above needs to be grouped together. For a disease that disproportionately is more severe for the elderly this would be a large source of bias.

      https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html

      My suggestion is to give these researchers a break for not providing their data (like the comments are doing) — getting data released in a form that would allow replication of any useful MRP analysis to be performed would likely involve negotiation with their IRB, probably a change to the consent, and an “expert determination” that releasing some of these 18 types of identifiers isn’t risky (which would produce liability for that expert should the data become identified). It’s unfortunate, but that’s the way things are for now.

  4. Got a form e-mail response from : “Thanks for your message. I am unavailable for inquiries for another 1-2 weeks. I will try to return your message at that time, or try me again at that time.” Anyone know one of the other authors? Who do we think actually wrote the code?

  5. A couple days ago, I tweeted some concerns re: the confidence intervals that mostly just restate the technical issues Prof. Gelman posted above, but if people are interested, I also posted some code to bootstrap the study prevalence and get more accurate confidence intervals. Note that I can’t include any of the reweighting since the study authors didn’t publish that data. Primary takeaway is that it’s difficult to rule out the possibility that nearly all of the positives in the study are false positives given the specificity data they rely upon.

    https://github.com/jjcherian/medrxiv_experiment

    There’s also more code to run a parametric bootstrap from another person who’s given this analysis a shot here: https://gist.github.com/syadlowsky/a23a2f67358ef64a7d80d8c5cc0ad307

    Hope this is helpful!

    • John:

      Thanks. Just to clarify, I think the best way to attack this problem is through a Bayesian analysis. Classical confidence intervals are fine, but they kinda fall apart when you’re trying to combine multiple sources of uncertainty, as indeed can be seen in the Bendavid et al. article. I just did the classical intervals because that’s what they did, and it was the easiest way to see what went wrong with their analysis.

      • Thanks for clarifying! I think that makes sense for the non-parametric bootstrap (which has to treat these sources of uncertainty as essentially independent and seems like it underestimates the final uncertainty as a result), but I don’t think I understand why the parametric bootstrap fails here? It seems very similar conceptually to the Bayesian setup.

        • I don’t think Andrew was saying your stuff fails here. He’s just saying that in his article he stuck to classical intervals because they became comparable to the ones in the original article.

          Your stuff expands on that to handle more kinds of uncertainty. I think the Bayesian method would give the fullest picture, but I don’t know how different it would be from the bootstrap picture. The bootstrap can sometimes be seen as an approximation to the full Bayesian version.

        • Gotcha! Sorry I think “fails” maybe carries an implication that I didn’t intend. I’m just curious about how to think about differences between the parametric bootstrap and the Bayesian modeling approach. Found this B. Efron paper that specifically addresses this subject (https://arxiv.org/pdf/1301.2936.pdf), so I’m hopeful that reading this will resolve my questions. Thanks!

        • > The bootstrap can sometimes be seen as an approximation to the full Bayesian version.
          Well, Brad Efron wrote a paper or two on that 5 or more years ago, arguing that the bootstrap automatically creates its own non-informative prior.

          The right bootstrap that is – which to me would be a lot more tricky to sort out than just defining an appropriate prior…

  6. I think this study is actually a great use case for Stan to really show how the uncertainty propagates from their tests to their final estimate. In particular, it’s important to note how there is a highly non-linear and non-symmetric effect in that a slightly lower specificity really tanks the estimated prevalence. https://colab.research.google.com/drive/110EIVw8dZ7XHpVK8pcvLDHg0CN7yrS_t shows my attempt at modeling this.

    Note that there is a high density in the posterior near 0% for prevalence.

      • Another thing that’s potentially worth thinking about is there might be a selection bias in the survey respondents not only for increased COVID-19 antibodies, but also for similar flu (or other respiratory infection) antibodies. I’m not sure of the exact dynamics of the antibody test, but that would imply that the sample might be enriched for false positives as well if the test has a higher error rate on people who had other viral infections.

        • Lots of false negatives in these tests, even in severe cases:

          Since rRT-PCR tests serve as the gold standard method to confirm the infection of SARS-CoV-2, false-negative results could hinder the prevention and control of the epidemic, particularly when this test plays a key reference role in deciding the necessity for continued isolated medical observation or discharge. Regarding the underlying reasons for false-negative rRT-PCR results, a previous published study suggested that insufficient viral specimens and laboratory error might be responsible (3). We speculated from these two cases that infection routes, disease progression status (specimen collection timing and methods), and coinfection with other viruses might influence the rRT-PCR test accuracy, which should be further studied with more cases.

          False-negative rRT-PCR results were seen in many hospitals. By monitoring data collected at our hospital from January 21 to 31, 2020, two out of ten negative cases shown by the rRT-PCR test were finally confirmed to be positive for COVID-19, yielding an approximately 20% false-negative rate of rRT-PCR. Although the false-negative estimate would not be accurate until we expand the observational time span and number of monitored cases, the drawback of rRT-PCR was revealed.

          https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7082661/

          But also they are using the rt-pcr results as the gold standard despite these issues? This doesn’t make much sense.

        • They are specifically using a serology test for antibodies. One complexity is that the test can look for two antibody types, and they didn’t really talk about the two types or break down the test results by type. However, the RT-PCR tests are different: these test for the presence of genetic material of the virus, rather than prior exposure to the virus.

        • True, sorry for not being clear. I don’t trust *any* of these tests. I think if you look with equal effort at pretty much any of the studies of the general prevalence of either antibodies or viral RNA it will be problematic. Eg, here is that Iceland study (from the supplement):

          Specificity of the WHO recommended assays were assessed against a number of known viruses, including alphacoronaviruses, non-asian strains of betacoronaviruses, influenza and MERS. No cross-reactivity was observed.
          […]
          Validation of the RNA extraction and the qRT-PCR method(s) at deCODE was performed using 124 samples that had previously tested positive (n=104) or negative (n=20) with the qRT-PCR assay at LUH. All of the negative samples tested negative at deCODE and 102 of the 104 positive tested at LUH were also positive at deCODE. Two samples that tested positive at LUH were negative at deCODE. Upon subsequent sequencing (see below) viral genome could not be detected in these two samples, probably because very few viral particles were present. Samples from 643 individuals that tested positive using either the deCODE or the LUH qPCR assays were also submitted for viral genome sequencing (see below). Viral RNA (cDNA) from six samples (0.9 %) yielded no sequence data mapping to the viral reference genome. The success of generating sequencing libraries with good coverage is highly dependent on the amount of viral RNA in the samples as assessed by the Ct values from the qRT-PCR assays. Figure S2 shows the relationship between measured Ct values and the consensus coverage of the sequenced samples. These data show that the qRT-PCR assay is more sensitive in detecting viral RNA than the amplicon sequencing method.

          https://www.ncbi.nlm.nih.gov/pubmed/32289214

          We see 20/20 (100%) negative and 102/104 (98%) positive qPCR tests replicated in different labs. Also, 637/643 (99%) positive tests could be confirmed against the gold standard (genome sequencing).

          So, ok, sensitivity is observed positives over all true positives. The true positives we don’t know, since they only sequenced 643 samples with positive qPCR results and none of the negative ones. Specificity is observed negatives over all true negatives. They didn’t assess true negatives at all and just assume the WHO’s data from cell culture corresponds to their samples from humans.

          What they needed to do was sequence a bunch of samples to determine which were true positives/negatives, then tell us the proportion of each that tested positive/negative on PCR.

        • I don’t think you can sequence a raw sample. You need to do the PCR to amplify before you can sequence. The “gold standard” part is that you sequence the PCR amplicon and determine that it was in fact the viral one and not some other fragment that happened to be similar enough to amplify.

          There’s too little RNA to take just a swab and sequence it. I think.

        • I think it’s fairly likely that real-world false negatives with the PCR test would be dominated not by some methodological issue but by viral RNA just flat out not being present in the sample. It only takes a few hundred to a few thousand virions to infect an individual, and maybe none of them happened to be up the victim’s nose.

        • Yes, the PCR assay is extremely sensitive. The major issues are methodological in sample gathering. For a virus that infects the lower respiratory tract, the gold standard is an expectorated sample; these are much more dangerous to obtain than a nasal swab. False negatives in PCR, where most versions of the test are sensitive down to a few molecules, arise at the sample stage. For the antibody tests, it is discussed that false positives could be due to detection of other antibodies, for example to other coronaviruses. In a population that has seen widespread exposure, it is likely the antibody tests will be decent. I’ve seen a claim in a forwarded email suggesting these authors could have additional control tests that might improve the estimation of the false positive rate. However, the case that these authors would like to make, that we can potentially expect many fewer deaths, has sort of sailed. There will be many deaths, and they will be above some threshold that moots their earlier speculation.

        • > The major issues are methodological in sample gathering.

          Yes, I’d be more worried about stuff like this:
          https://www.seattletimes.com/seattle-news/health/uw-medicine-halts-use-of-coronavirus-testing-kits-airlifted-from-china-after-some-had-contamination/

          Or that the tests aren’t actually very predictive for the thing killing people because there is a coinfection, etc.

          That is why they need to test their actual methodology against a gold standard.

      • How would you do that? What am I missing here? What aspects of the data would relate to the latent selection bias variable so as to help identify it and its effects on serology? The selection here is clearly far from completely random, nor does it seem to be random conditional on observed variables. In fact, other than age, I don’t see anything in the data that even looks slightly informative about this. Other than a strong prior, what would identify the selection latent variable and its effects?

        • Clyde:

          Yeah, when I say “include a latent variable,” I mean that without additional data you’d need to make assumptions about that latent variable. The idea is that you’d specify some plausible range of selection bias, and this would have the effect of adding uncertainty to your estimate.

        • I don’t think it’s possible to usefully specify a “plausible” range of selection bias without data. A plausible range without data would include the possibility that seropositive individuals were 10-100-fold more likely to reply to the ad. That would reduce the prevalence estimate to being essentially meaningless on the lower end (at 100x, it would include the # of confirmed cases of COVID-19 based on swab tests for RNA), just like the uncertainty in the specificity. They have the fraction of participants who had COVID-19 symptoms. If they (or someone else) had data on what fraction of a random sample of the population had COVID-19 symptoms they could properly adjust for bias in recruitment. Without it, I don’t see how it’s possible to do something meaningful with a latent variable.

        • Marm:

          It’s hard for me to believe it would be a factor of 100! But if 100 is possible, then it’s possible, and indeed you’ll get a really wide interval—that’s the way it goes. As you say, they do have some information on symptoms, and that could inform their estimates. Otherwise you just have to make some hypotheses and go with that.

    • This is an excellent teaching example for Stan. I’ve never used it before and it’s really easy to read the code and see how it’s modeling this problem. I have one really stupid question. What does the “b” mean in “prevalence ~ b”? Is that a shorthand for the uniform distribution?

      • Ah sorry, not sure how the code got messed up. Fixed now.

        (I think I accidentally tried opening this on a phone yesterday and accidentally edited the text.)

        • Ah gotcha. I was thinking about this a little more and I think the model is not exactly right. Shouldn’t the final number of tested positives be a sum of the true positives and false positives, both of which are binomials? In other words, right now the code has:

          num_community_positive ~ binomial(num_community, (fp_rate*(1 - prevalence) + (tp_rate*prevalence)));

          but I think the correct computation (this isn’t valid Stan code) would be:

          num_community_positive ~ binomial(num_community*prevalence, tp_rate) + binomial(num_community*(1-prevalence), fp_rate);

          And I don’t *think* those are equivalent? And if they’re not I’m not sure this is possible with Stan. There’s the issue with num_community*prevalence not being an integer, but more importantly it kind of seems like there isn’t a way to do a sum of binomials: https://discourse.mc-stan.org/t/sum-of-binomials-when-only-the-sum-is-observed/8846

        • I think that section is correct, as a mixture of binomials is equivalent to a single binomial. The easiest way to show that is to show that a mixture of bernoulli distributions is equivalent to a single bernoulli where the bernoulli parameter p = first_bernoulli_fraction * first_bernoulli_p + (1 - first_bernoulli_fraction) * second_bernoulli_p. The proof of that equivalence is that P(y = a | combined bernoulli) = P(y = a | mixture) for all a in {0, 1} (the domain of a bernoulli distribution).

          Once you have convinced yourself that a mixture of bernoullis is equivalent to a single bernoulli, you can then extrapolate to the binomial case by considering each independent draw separately. I might have missed something here (and if you have a nice counter-example, please do let me know!)

          It’s not possible to model this in Stan (due to a lack of support for integer parameters), but I think the way you could think about modeling this is the following:

          num_community_antibodies ~ binomial(num_community, prevalence);
          num_community_no_antibodies = num_community - num_community_antibodies;
          num_community_tp ~ binomial(num_community_antibodies, tp_rate);
          num_community_fp = num_community_positive - num_community_tp;
          num_community_fp ~ binomial(num_community_no_antibodies, fp_rate);

          This would require two more parameters, num_community_antibodies and num_community_tp.
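
          As a quick numerical check of the equivalence argued above, here is a short Python sketch (illustrative numbers only, not the study’s) that simulates the two-stage process and the single collapsed binomial and compares them:

          import numpy as np

          rng = np.random.default_rng(1)
          n, prev, tp_rate, fp_rate = 3330, 0.015, 0.80, 0.005   # made-up values

          n_sims = 200000
          infected = rng.binomial(n, prev, size=n_sims)       # latent antibody-positive count
          true_pos = rng.binomial(infected, tp_rate)          # true positives among them
          false_pos = rng.binomial(n - infected, fp_rate)     # false positives among the rest
          two_stage = true_pos + false_pos

          # single binomial with the mixed probability
          one_stage = rng.binomial(n, prev * tp_rate + (1 - prev) * fp_rate, size=n_sims)

          print("two-stage mean/var:", two_stage.mean(), two_stage.var())
          print("one-stage mean/var:", one_stage.mean(), one_stage.var())

          The means and variances (and, if you histogram them, the full distributions) agree, consistent with the mixture-of-bernoullis argument.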

        • Seems like we may have hit the response depth limit. In response to your comment below… doh! I actually thought about that and tried it, but made a dumb coding error. It also makes more sense as a model – the prevalence we care about is not the true # of infected within the study’s sample, but the prevalence of the population from which the sample was taken. Of course this still doesn’t take into account sampling biases.

          Thank you so much!

      • I think that’s just an error, as the line needs a trailing semicolon to be syntactically correct in Stan. I assume the statement was something like “prevalence ~ beta(1,1);” and then the author accidentally deleted a portion of it.

  7. At the time of this writing, NYC has about 9000 recorded coronavirus deaths. Multiply by 600 and you get 5.4 million. OK, I don’t think 5.4 million New Yorkers have been exposed to coronavirus. New York only has 8.4 million people total! I don’t think I know anyone who’s had coronavirus. Sure, you can have it and not have any symptoms—but if it’s as contagious as all that, then if I had it, I guess all my family would get it too, and then I’d guess that somebody would show some symptoms.

    I very much agree with Prof. Gelman’s takeaways in this post, and appreciate the deep dive here into this very high profile and seemingly misunderstood piece. That said – I’m not sure the above follows. The premise of the 50-85x undercounting conclusion the authors draw is that the vast majority of cases are asymptomatic (or sufficiently mildly symptomatic that people don’t identify the symptoms until after they’ve had a positive test result). So I can conceive of 65% of New Yorkers having had this in this scenario, if >90% of cases are functionally asymptomatic.

    Regardless, New York City has identified an additional 3,700 likely COVID deaths (https://www.nytimes.com/2020/04/14/nyregion/new-york-coronavirus-deaths.html), so they’re well above 10K now. The implied IFR in this study is about 0.1%, which as applied to the NYC population would suggest roughly a 100% attack rate in NYC. While there is no single IFR, it’s tough to imagine more than 50% of New Yorkers have been infected, given PCR testing (which was limited to people presenting with symptoms) was only showing about 50% positive results in New York at any given time. (And yes, it’s possible for 80% of people to have had COVID with no more than 50% presenting as positive at any given time, but we’re getting into very unlikely territory.)
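
    To make the arithmetic explicit, here is a back-of-the-envelope check using the death counts quoted above and the roughly 0.1–0.2% IFR range implied by the paper (numbers are as quoted in this thread, and deaths were still rising at the time):

    nyc_population = 8.4e6
    deaths = 9000 + 3700          # confirmed plus "probable" NYC deaths cited above

    for ifr in (0.001, 0.0012, 0.002):
        implied_infections = deaths / ifr
        print(f"IFR {ifr:.2%}: {implied_infections/1e6:.1f}M implied infections "
              f"({implied_infections/nyc_population:.0%} of NYC)")

    An IFR of 0.1% implies more infections than NYC has residents, which is the contradiction being pointed out above.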

    • The obsession with false positive rates never seems to extend to the tests used to diagnose case fatalities in cities and states. The comment above cites the additional 3,700 deaths reported in the NYT; here is the headline:
      The city has added more than 3,700 additional people who were presumed to have died of the coronavirus but had never tested positive.

      So no concern about whether the patients actually had covid-19? Kind of important when using studies like the one discussed here to project state-wide and nation-wide mortality rates based on the numbers provided by the states. At least the authors conducted their own testing and have control of the data and the manufacturer’s error rates.

      • Have you tried looking up the company that was cited? I went to the CDC website and could not find anything there about even testing done under the emergency regulations for Premier Biotech, Minneapolis, MN, much less a full filing.

      • >So no concern about whether the patients actually had covid19?

        This is not true, though the misconception is common. In one of the CDC examples of how to code death certificates, we have an example of an 86-year old female, non-ambulatory for 3 years after a stroke. She is exposed to a family member (a nephew, for the sake of argument) who has covid-19 symptoms and subsequently tests positive for covid-19 infection. She develops covid-19 symptoms herself, but refuses to go to the hospital and is not tested. She dies of “acute respiratory illness” after five days. But given her exposure, the “acute respiratory illness” is listed as being caused by PROBABLE covid-19 infection.

        IMAO, this is quite reasonable. She has no history of respiratory problems in the years since her stroke, and she has no known risk factors for respiratory problems except the nephew whose infection has been confirmed. Could conceivably be something else, and testing would have been desirable, but the idea she just happened by coincidence to develop covid-like symptoms after being exposed to a carrier is hard to credit.

        https://www.cdc.gov/nchs/data/nvss/vsrg/vsrg03-508.pdf, p. 6.

        • +1

          The assumption “deaths that are attributed to Covid-19 without laboratory confirmation are not actually caused by Covid-19” regularly implies that medical examiners have less common sense, medical experience and knowledge about the patient than the person proposing this assumption, or they’re straight-up conspiracy theories.

  8. The supplemental materials have some detail on the re-weighting and the variance estimation:
    https://www.medrxiv.org/content/medrxiv/suppl/2020/04/17/2020.04.14.20062463.DC1/2020.04.14.20062463-1.pdf

    Maybe it’s just missing from the formula, but their description under “Delta method approach to variance measurement” looks like it just doesn’t include any means for incorporating uncertainty in the sensitivity/specificity estimates into the estimate of the standard error. The standard error would be exactly the same regardless of whether they used 10 tests or 1000 for validation. Is that normal?

    • You are correct, they do not take into account any uncertainty on the specificity or sensitivity in their analysis. They only take into account the sample variance. This is one of the unfortunate flaws with their confidence interval estimates.

      Another is that the variance is actually calculated incorrectly, as when they reweight their q values (fraction of positive test results) by a scale factor they do not account for the fact that Var(a*x) = a^2 Var(x), and report it as a*Var(x). Not explicitly, but if you cross check their numbers you’ll find this to be the case.

      As Josh Rushton correctly pointed out, the reweighting of the q values directly (rather than the prevalence, which they call pi) is also highly problematic. They only apply the effect of false positives and negatives *after* doing the demographic reweighting, which I don’t think is sensible in this case. It suggests that the point estimates of the specificity and the sensitivity of the test are only appropriate for the demographic distribution of Santa Clara county.

      There are other issues with the analysis, but I think the picture is clear. It would have been really nice to see more careful work done on this kind of test, which could be very informative. I applaud the authors for putting in the effort to start these much-needed serological studies; hopefully continued efforts will provide more reliable results as we move forward.
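
      To put a rough number on how much the missing term matters, here is a sketch using the standard test-adjusted prevalence formula pi = (q + s - 1) / (r + s - 1) (the Rogan-Gladen correction, which may differ in detail from the appendix’s formula) and the delta method with numerical derivatives, with and without the variance contributions from the sensitivity/specificity validation samples. The counts (50/3330, 78/85, 399/401) are the ones quoted in this thread and are used purely for illustration:

      import numpy as np

      q, n_q = 50 / 3330, 3330      # raw positive rate in the survey
      r, n_r = 78 / 85, 85          # sensitivity estimate and its validation sample size
      s, n_s = 399 / 401, 401       # specificity estimate and its validation sample size

      def pi(q, r, s):              # Rogan-Gladen adjusted prevalence
          return (q + s - 1) / (r + s - 1)

      def num_grad(f, args, i, h=1e-6):
          up, down = list(args), list(args)
          up[i] += h; down[i] -= h
          return (f(*up) - f(*down)) / (2 * h)

      var_q = q * (1 - q) / n_q
      var_r = r * (1 - r) / n_r
      var_s = s * (1 - s) / n_s
      g = [num_grad(pi, (q, r, s), i) for i in range(3)]

      se_naive = np.sqrt(g[0]**2 * var_q)                                  # r, s treated as known
      se_full  = np.sqrt(g[0]**2 * var_q + g[1]**2 * var_r + g[2]**2 * var_s)

      print("adjusted prevalence:", pi(q, r, s))
      print("SE treating sens/spec as known:", se_naive)
      print("SE propagating validation uncertainty:", se_full)

      With these inputs the standard error roughly doubles once the validation-sample uncertainty is included, and the specificity term dominates.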

  9. For this to be credible, there needs to be a behavioral selection model accounting for the utility of the free test offered to people (as you noted). Such a selection model could then account for the value of the test to different populations and backwards adjust the estimates. Simple MRP won’t work here as it will only adjust the selected estimates by demographic categories, i.e., we know the estimate conditional on utility obtained, but not estimate marginal of utility obtained.

  10. Their prevalence CIs do not include zero because of their sampling model. (Some have hypothesized that the delta method is to blame for the seemingly-wrong CIs, but I did some numerical checks, and I don’t think it’s as big of an issue as the following). They reweight their biased dataset to match the demographics of Santa Clara, thereby boosting the number of positives. There are two ways to assess FPR/FNR on the reweighted data:

    1. Increase the FPR/FNR by the maximum coefficient of reweighting. (So, if we upweighted a positive datum by 2, then we would have to multiply FPR by 2.) This is rigorous, but makes inferences difficult. (I agree with this approach.)

    2. Just use the original FPR/FNR on the reweighted data, imagining it was a fresh random sample. It seems this is what they did. There may be an argument for this approach, if the antibody test performance has the appropriate relationship with their reweighting (i.e. if they are not upweighting data prone to FPs.) However, I don’t see that argument in the current version of their paper.

    Another separate issue is: their expressions for variance of the estimates of sensitivity/specificity seemingly do not depend on the sample sizes (of the manufacturer and local validation sets.) So, if they had 100,000 validation samples rather than 400, their confidence intervals seemingly wouldn’t change. Missing factor of n somewhere?

  11. I would like to offer a technical conjecture about what went wrong in the statistical analysis. (I already communicated this concern to the authors.) As already pointed out on this page, the test’s false positive rate is estimated at 2/401 = 0.5%, but with a 95% confidence interval extending upward to 1.9%. By the authors’ own admission on page 2 of their statistical appendix, this means we cannot reject the hypothesis of zero prevalence. Note (this will become important): because the binomial distribution is not well approximated by a normal here, the CI must be constructed as an exact binomial interval, not by normal approximation. The authors do this and correctly report 1.9%. If they had mistakenly used a normal approximation combined with the sample variance, their CI would have extended only to 1.2% and zero prevalence would have been spuriously rejected.

    So here is my concern: The authors subsequently use the delta method to analyze error propagation. This implicitly applies normal approximations. Indeed, the analysis culminates in providing standard errors, and these are only interpretable in the context of a normal approximation. I therefore worry that the spurious rejection alluded to above affects the later part of the analysis. The earlier conclusion that 0 cannot be rejected seems to me to be appropriate. (Of course, should the propagation analysis have happened post reweighting, then all bets are off anyway.)

    At the very least, there is a striking discord between the headline results, including a CI for specificity given in the paper that includes one minus their unadjusted empirical positive rate, and the following passage from the statistical appendix: “There is one important caveat to this formula: it only holds as long as (one minus) the specificity of the test is higher than the sample prevalence. If it is lower, all the observed positives in the sample could be due to false-positive test results, and we cannot exclude zero prevalence as a possibility.”

    Disclaimer: I emphasize that I’d be happy to stand corrected, I appreciate the data collection work, and I also appreciate that the paper was put together under insane pressure.
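
    The exact-versus-normal comparison above is easy to reproduce; here is a short Python sketch (Clopper-Pearson exact interval for 2 false positives out of 401, versus the Wald normal approximation):

    import numpy as np
    from scipy import stats

    x, n = 2, 401
    p_hat = x / n

    # Clopper-Pearson exact 95% interval
    lower = stats.beta.ppf(0.025, x, n - x + 1)
    upper = stats.beta.ppf(0.975, x + 1, n - x)

    # Wald (normal-approximation) 95% interval
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    wald = (p_hat - 1.96 * se, p_hat + 1.96 * se)

    print(f"exact: ({lower:.4f}, {upper:.4f})")     # upper end comes out near 1.8%
    print(f"wald:  ({wald[0]:.4f}, {wald[1]:.4f})") # upper end comes out near 1.2%

    The exact upper limit is close to the roughly 1.9% figure quoted above, while the normal approximation tops out near 1.2%, which is the spurious-rejection concern described in this comment.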

      • Update: I may stand corrected in that I may have been too charitable. A comment further down argues that, in the error propagation calculations, they uncritically took the 2/401 false positives from the validation data as ground truth. I don’t have time to check, but I have a sinking feeling it might be so. Just to clarify things for non-professionals who try to make sense of this thread: That analysis contradicts mine in detail -though not in qualitative conclusion- and would imply that I’ve charitably misread the paper. (Which is possible because I did kind of stop digging after seeing the switch to Delta method.)

  12. As someone who has tried to do survey sampling through Facebook ads before, there’s another – more pernicious – element to the selection bias they faced.

    Depending on how the ads are targeted and with what goal (e.g., minimize cost per impression, click, or conversion), the algorithm will be constantly adjusting who is served the ad based on who has responded. This can be demographic driven, but also based on their “interest” profiles as well.

    This can substantially amplify the impact of self selection or introduce other additional selection biases.

    • Can confirm. Facebook uses machine learning in their ads to maximize responses. And they use thousands of features in the ML. From likes of certain posts, to visiting third-party sites with facebook cookies installed.

      This is great for marketers trying to maximize sales, not so great for getting a randomized sample across a certain demographic for a scientific study.

      Even if the researchers were smart enough to not use interest-based targeting in their ads, the ML algorithm would soon learn that people with an interest in covid-19 testing were more likely to click on the ads, and would start distributing the ads only to similar people in order to maximize clicks.

      The use of FB ads also explains the under-sampling of older people, as fewer people who are 65+ are even on the platform or login regularly.

  13. This is an excellent post! I wish more things out there were this thoughtful.

    On: “But Stanford has also paid a small price for publicizing this work, because people will remember that ‘the Stanford study’ was hyped but it had issues. So there is a cost here. The next study out of Stanford will have a little less of that credibility bank to borrow from.”

    This sounds optimistic! (Compared to the default position of most: if it’s Stanford / Harvard it must be right…)

    • I think my more pessimistic extrapolation might be that the reputational damage might be done more to the field than the institution. “Oh this study says this, that study says that, clearly no one knows anything so I’ll go with whatever my preferred answer is.”

      • Zhou:

        Maybe you’re right. The extreme case is the post by economist Tyler Cowen who was trashing the entire field of epidemiology, in part based on a paper that was written by . . . an economist! See here for the details.

      • I have been talking to some of my friends in medical/bio-related fields. While they all agree that this paper is terrible, they seemed generally less concerned with not controlling for selection bias, which in my opinion is more than sufficient to sink the paper by itself. I’m pretty sure that this is far from the only medical study that doesn’t adequately adjust for selection bias (probably because it is not part of their typical curriculum).

  14. Seems as good a time as any to remind people of this excellent paper: https://projecteuclid.org/euclid.ss/1009213286
    I seem to recall that test inversion has the best frequentist properties for coverage in one-sided cases, which is what we care about here. I think the 95% lower limit would be a little higher than Agresti’s or what is reported in the paper, and the 90% lower limit would be around 99%.

  15. I think a possible way to get a CI that excludes zero is if the positive cases are clustered.

    Imagine the extreme case where all positive tests came from a single zip area with few participants.

    This should also shift the estimate of your specificity.

    • If I understand this correctly, it is wrong.

      Suppose all the people who don’t have the disease are in zip codes 95000-95098, while all the positives are in zip code 95099. By hypothesis, most of the people answering are in areas 00-98. Unless specificity very closely approaches 100%, you will almost always get false positives in those areas. You’d need a specificity in the range of 99.97% to avoid getting false positives in the ‘no infections here’ area.

      But with such high specificity, distribution no longer matters. You’d be extremely unlucky to get as many as five false positives in 3300 tests. So the fifty positive results would correspond to 45-49 real infections out of the 3300 tested. With that accuracy, you could depend on the results no matter how the positives were distributed.

      • I believe you’ve grasped the idea, but are applying it the wrong way around.
        With a low chance for a low specificity and a low chance for a random sampling to be clustered like that, the chance of both occurring together drops, eliminating that combination from the confidence interval.

        A simpler example would be a throw of a D6 die. Throw the die once, and the low end (1) is well within the 95% CI; throw it twice, and the low end (1+1=2, which has probability 1/36) has p<0.05.

        Having a low specificity and a cluster of positives is just too unlikely to occur together by random chance.

  16. Surely a more informative analysis would be, given the data collected and the test performance, what is the posterior probability that at least 10% or even 30% of the population have been infected? Those are numbers on which to base policy. Even for a relatively poor test, that number will be low. This study is really an investigation into the likely order of magnitude of prevalence (1, 3, 10, 30%). Exact estimation is the wrong endpoint for the analysis.

    • Djaustin:

      One could estimate this with a Bayesian analysis. What you’ll find is a lot of uncertainty: the data are consistent with just about any prevalence between 0 and 5%. The problem is that the study as defined is just not very informative, unless you want to make the strong assumption that the specificity of the test is between 99.5% and 100%.

      • I don’t think you necessarily have to make that assumption on the specificity, see my comment above.

        Then again, we do not know what the raw data look like…

        • Did they? In the results section they write:

          “Notably, the uncertainty bounds around each of these population prevalence estimates propagates the uncertainty in each of the three component parameters: sample prevalence, test sensitivity, and test specificity.”

          But maybe they actually used the point estimates, I don’t know whether we can tell.

        • But doesn’t this only mean that they used all the point estimates for the prevalence point estimate, but then they do use the uncertainty in the parameters to estimate the standard error?

        • No, they used the point estimates to estimate the standard error. For example, they calculated a variance for false positives *in their study* based on s_hat(1-s_hat) for binomial outcomes, plugging in 0.995. This implicitly assumes that they *know* the true specificity is 0.995, and their uncertainty estimate is then based only on the sampling variability of that in a sample of size 3000+. They don’t consider that the original estimate itself has uncertainty.

        • Their variance estimate is *loosened* by their omission of dependence on the sample size. (Unless I misinterpreted their notation?) In any case, their estimates of specificity and sensitivity are too high going into this variance estimate, because they don’t account for the initial reweighting.

        • No, they don’t actually have a variance estimate for parameter uncertainty.

          Essentially what they did is this:

          1. They noted that their prevalence Pi is a function of the weighted sample prevalence, the sensitivity, and the specificity s.
          2. They accepted that sensitivity and specificity can vary in their sample of 3,300 due to sampling variability (in *their sample*, not the original ~400-sample validation data).
          3. They plugged the point estimate of 0.995 (or whatever) into the variance of a binomial random variable to calculate the variance of specificity in their sample.
          4. Then they used the delta method (based on derivatives of the Pi function) to compute how this affects their prevalence estimate.

          In other words they assumed the uncertainty in the estimate resulting from the 400 or so original study was zero.

        • That’s correct, as far as their (mis)application of the delta method is concerned. (I am kind of hesitant to pore through their expressions until they fix the apparent issues.)

          But, concretely, consider the numerical value they derive for, e.g., var(\hat{r}). They are using .67(1-.67) = .22, which is enormous for a Bernoulli parameter.

        • Shiva Kaul:

          That aspect confused me as well, until I realised that it doesn’t square with their SE and hence their confidence intervals. What they are presenting as “Var” is not the variance, but the variance prior to an adjustment for the 3000+ sample size. They are actually claiming a variance of 0.00007.

        • I see. It’s laudable that they included a statistical appendix, but it really needs further clarification.

          As for fixing the error: they can use the delta method to estimate an analytic (1-eps) confidence interval in terms of eps and the unknown parameters q, r, and s. They trivially have concentration of \hat{q}, \hat{r}, \hat{s} around their means, so all unknown terms can be bounded, with probability eps, in terms of the estimates, yielding a 1 - 2*eps interval in terms of eps and n. Because of the exponential concentration, I don’t think this will make a massive quantitative difference, unlike the issues earlier in the paper.

      • Yes, I’ve done this now. There is no support for a 10% prevalence on the raw data and almost none for 3%. That in itself is informative. I modelled sensitivity and specificity, and neither are much to shout about! But a poor test can give useable information, just not precision.

  17. Having recently listened to the Bad Blood audiobook, I sincerely appreciate your dig at Theranos (and people who believe they can be domain-independent “experts”, on any and every domain).

  18. We have a facebook group of 2000 physicians and epidemiologists where we review and discuss this type of paper for the frontline docs that are in the group (probably about 700). If you are a physician, nurse, epidemiologist or bench bio-scientist join us!
    Clinical Epidemiology Discussion Group
    https://facebook.com/groups/covidnerds
    Thank you for your analysis of this paper. We will post it to our group.

  19. I came across this critique of the same paper yesterday: https://medium.com/@balajis/peer-review-of-covid-19-antibody-seroprevalence-in-santa-clara-county-california-1f6382258c25 … it mentions the same points: the false-positive rate and the participant selection issues. It also mentions an additional point: that it “would imply faster spread than past pandemics”:

    > In order to generate these thousands of excess deaths [in Lombardy, compared to base rate] in just a few weeks with the very low infection fatality rate of 0.12–0.2% claimed in the paper, the virus would have to be wildly contagious. It would mean all the deaths are coming in the last few weeks as the virus goes vertical, churns out millions of infections per week to get thousands of deaths, and then suddenly disappears as it runs out of bodies.

    • > In order to generate these thousands of excess deaths [in Lombardy, compared to base rate] in just a few weeks with the very low infection fatality rate of 0.12–0.2% claimed in the paper, the virus would have to be wildly contagious.

      Or the virus was just around for much longer. And don’t forget that other countries started testing passengers disembarking from Italy, and for a while it seemed like every single flight had a few positives when there were supposedly only a couple of tens of thousands of cases.

      Also, that high fatality rate is dependent on initially treating this according to the standard ARDS protocol which was apparently a mistake according to the critical care doctors: https://emcrit.org/emcrit/covid-respiratory-management/

    • Good writeup. There’s a disturbing tendency in the comments on that piece to go “well, there are false negatives as well!” and imagine the two issues cancel out. I suppose that’s a manifestation of the “truth is in the middle” type thinking.

      • Zhou:

        Yeah, that’s where math is useful! One thing that’s frustrating here is that the false-positive problem is a well-known issue. Indeed it’s a standard example—perhaps the standard example—that we use when teaching conditional probability. Everybody who’s ever taught probability knows that if you have a rare event, then even moderate levels of false-positive rates destroy you.

        The false positives and false negatives in this example don’t cancel each other out, and one thing that annoyed me in the above-linked paper is when they wrote, “On the other hand, lower sensitivity, which has been raised as a concern with point-of-care test kits, would imply that the population prevalence would be even higher.” This kinda sounds reasonable but it’s not the right way to put it here given the numbers involved. It’s qualitative talk, not quantitative talk.

        • +1

          The best lesson in my grad intro stats class was a lab where we calculated the probability that a real-life case of an Olympic runner being accused of doping was a false positive. A case of my understanding statistics resulting in a change in how I view the world. Then again, it made me give credence to Lance Armstrong far longer than seems wise in retrospect…
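
          For anyone following along, the base-rate calculation behind this point is only a few lines of Python (the numbers are illustrative, roughly in the range discussed for this kind of test):

          def p_infected_given_positive(prevalence, sensitivity, specificity):
              # share of positive results that are true positives (the positive predictive value)
              p_pos = prevalence * sensitivity + (1 - prevalence) * (1 - specificity)
              return prevalence * sensitivity / p_pos

          for prev in (0.005, 0.01, 0.02):
              print(prev, round(p_infected_given_positive(prev, 0.90, 0.995), 2))

          Even with 99.5% specificity, at 0.5% prevalence roughly half of the positives are false.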

    • The use of this approach to try and get a sense of covid-19 prevalence is fine. Both studies give information about the range of possible prevalence of infection. It is the extension of these numbers, without properly including uncertainty, to estimate the infection fatality rate that is concerning. I don’t see that issue in this preprint.

  20. There is another error in this paper that you have not touched on. The authors count a case as positive if either IgG or IgM tests positive. Therefore the sensitivity of their testing criterion is the maximum of the IgG and IgM sensitivities, which is 100%. The specificity is more complicated. From the manufacturer’s data, there were 3/371 false positives for IgM and 2/371 for IgG. There is no information on whether these false positives overlap. For the authors’ testing criterion, the specificity is at most 99.2%. Therefore, the point estimate for the unweighted prevalence is at most 0.7%, and could be as low as 0.15%.
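
    A quick check of that arithmetic, using the standard correction (observed rate minus the false-positive rate, divided by sensitivity minus the false-positive rate) with sensitivity taken as 100%, and bracketing the combined IgG/IgM false-positive count between 3/371 (the two sets of false positives fully overlap) and 5/371 (no overlap):

    raw_rate = 50 / 3330

    for fp in (3, 5):                                # overlapping vs non-overlapping false positives
        fpr = fp / 371
        corrected = (raw_rate - fpr) / (1.0 - fpr)   # sensitivity assumed to be 1
        print(f"{fp}/371 false positives -> corrected prevalence = {corrected:.2%}")

    This reproduces the roughly 0.7% and 0.15% endpoints mentioned above.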

  21. My impression as a clinician is that no point of care test I’ve used is 100% specific, and 99% would be very high.

    I was told on Twitter by someone whose bio seemed appropriate that validating an antibody test precisely enough to do a serosurvey requires using pre-COVID serum samples from patients that had a comparable incidence of other viral URIs in the preceding period, including the non-novel coronaviruses, because those antibodies can be cross-reactive. So serum from last April, at the end of cold-flu seasons. Using inappropriately bland serum would exaggerate specificity.

    The decision to double check the test characteristics with some of their samples presumably has clinical more than statistical grounds. But I find it hard to explain running a whole arm of the analysis with the point estimate of 100% specificity based on a sample of 30.

    Another thing that boosted the prevalence estimate: they chose a low estimate of sensitivity in the arm of the analysis that limited itself strictly to the manufacturer’s characterization of the test. The manufacturer found 75/75 positive IgG antibodies but only 78/85 IgM. The study uses just the IgM sensitivity. But we would expect almost all recovered patients to have both types of antibodies by the time they’re recovered. If anything, IgG more lastingly than IgM.

    • Yeah, publishing a result (the high end, 4.2% prevalence) that relies _exclusively_ on a specificity of 100% based on a mere 30-sample test seems a bit nuts.

      Definitely feels like a bit of results-shopping to include that at all.

      Not a researcher, so no idea what the answer to this is: Do the authors have any reason to believe their specificity study is more reliable than the manufacturer’s? Is there some reason why the manufacturer’s results are perhaps not applicable, say due to some sort of local population characteristics? Because I just can’t see any reason to publish a result (even as the endpoint of a “range”) that involves ignoring certain data.

  22. Of course this discusses a pre-print. This is not the only criticism I’ve seen.
    Yet journals are rushing reports to publication. And a great deal of low quality research is getting published, much of it in high impact journals. To some degree this represents SOP compressed in time; yet I can’t help but suspect a degree of academic and publication ‘profiteering’. All in all, especially on the treatment side, huge opportunities are being missed. Now more than ever is a great time to reform for more reliable research findings.

    • Charles:

      I don’t mind bypassing formal peer review, given that peer review is often just a way for people to give a seal of approval on their friends’ work. But I do wish the authors of this paper had run their ideas past an outsider with some statistical understanding before blasting their conclusions out to the world.

      • ” I do wish the authors of this paper had run their ideas past an outsider with some statistical understanding before blasting their conclusions out to the world.”

        +1

      • No argument whatsoever.
        Indeed, a panoply of factors over the years led to Ioannidis’ estimate of the rate of ‘wrong’ results being published. In this time when actionable research is direly needed, nothing appears to have changed. Not merely a shame, but dangerous.

  23. I wonder what these false positives actually are. It seems likely (but I have no expertise here) that they are antibodies raised against other viruses that bind well enough to the virus that causes COVID-19 to show as a positive on the test. Do these antibodies work to prevent COVID-19, or reduce its severity? It is possible that a followup study of these false positives might lead to important insights. If so, it would be a mistake to think of false positives as a lab mistake.

    • To the best of my knowledge, you are correct that the cross-reactivity problem is real and these “false positives” can be real in the sense of the immune system providing resistance. Since something like 10 billion different antibodies are possible, most humans don’t have identical antibodies to the same antigen (virus), and it would be possible for a fraction of those with antibodies to the common cold (4 different coronaviruses + others) to cross-react with this test kit, giving false positives that are “real” in the sense that the antibodies really are there.

      The only problem is what conclusions you can draw from the data. If most of the positives really reflect immune responses from another source, it tells you nothing about COVID-19’s future path.

      An example of cross-reactivity is TB antibodies (a positive TB test) in fish handlers exposed to marine mycobacteria who have never had TB. More dramatically, being immune to smallpox after having cowpox was the start of the elimination of smallpox from the world.

      • Interesting.
        (Incidentally, my father never bothered to have a smallpox vaccination, since he had had cowpox as a boy — I believe contracted one of the summers that he spent working on his uncle’s farm. However, in later life, he was unable to donate blood because he had not been vaccinated for smallpox.)

    • Funko:

      I thought about this when writing the above post. You can come up with examples where the positive results are concentrated and can’t be explained by testing errors, for example if 20 of the 50 positive tests were happening from people in one small location. But I can’t see how this could possibly be what’s happening here, because in the article in question they’re just taking averages and poststratifying. It seems pretty clear that what’s going on is they’re just assuming the specificity is 99.5% or 100% and going from there.

      • But doesn’t the jump from 1.5% crude prevalence to 2.8% for the reweighted prevalence imply that there is some strong heterogeneity in the prevalences and that some higher-prevalence clusters are underrepresented in the sample?

        • Sampling rate by zip code is tightly coupled to how close the area was to one of the three testing sites (and in particular a large portion of the study came from the Stanford zip code). It seems to me the likely explanation is that the further you were from a test site, the more motivated you would’ve had to be to participate, and so there would be even more bias toward symptomatic volunteers.

  24. I don’t understand why cross-reactivity is ignored in the paper.

    Premier has a package insert for this test. It reports cross-reactivity, in particular for the common cold (CoV229e and CoVOC43).

    What is the cross-reactivity? In other words, how often does it say you have SARS-CoV-2 Abs when you actually only have 1 or more other CoVs?

    Is it always true positive with one but false positive with 2? False positive 10% or 1% of the time? These are knowable answers.

    They should re-run all the sample blood with PCR for common coronaviruses, immediately.

  25. There’s another factor that I haven’t seen anyone mention. If you assume that people are more likely to participate in the study if they think they might have the virus, then post-stratification will make the effect from self-selection considerably worse.

    To make this concrete, consider a zipcode that participates a lot and a zipcode that participates half as much (e.g. because it’s less convenient for them). The people from the second zipcode will be the ones who really, really want to participate since the less-enthusiastic ones are more likely to drop out. Thus, the detected infection rate from the second zipcode will be higher than the first, even if the underlying infection rates are actually the same. If you then post-stratify and increase the weights on the second zipcode because there weren’t as many participants, the overall infection rate will be doubly-raised: first because of the increased self-selection in the second zipcode and then because of the post-stratification.

    The same analysis applies to the race/ethnic post-stratification. Note that they report unadjusted infection rate of 1.5% and population-weighted rate of 2.8%. So the population-weighting resulted in a large increase. My hypothesis explains much of that increase.
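
    Here is a toy expected-count version of that two-zipcode story (all numbers made up), showing the raw estimate already inflated by self-selection and the post-stratified estimate inflated further:

    true_prev = 0.01
    pop = {"A": 100_000, "B": 100_000}
    p_healthy = {"A": 0.04, "B": 0.02}     # background willingness to volunteer; zip B half as likely
    p_infected = 0.08                      # people who suspect infection really want the test

    tested, positive = {}, {}
    for z in pop:
        infected = pop[z] * true_prev
        healthy = pop[z] - infected
        tested[z] = infected * p_infected + healthy * p_healthy[z]
        positive[z] = infected * p_infected           # perfect test assumed, to isolate the selection effect

    raw = sum(positive.values()) / sum(tested.values())
    post_stratified = sum(positive[z] / tested[z] * pop[z] for z in pop) / sum(pop.values())

    print(f"true prevalence : {true_prev:.2%}")
    print(f"raw estimate    : {raw:.2%}")
    print(f"post-stratified : {post_stratified:.2%}")

    With these made-up numbers the true prevalence is 1%, the raw estimate comes out around 2.6%, and the post-stratified estimate around 2.9%; the reweighting moves the estimate further from the truth, which is the mechanism described above.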

    • Yes, this has been my concern from the getgo.

      They discuss the outcomes briefly in their write-up — they tried to get a representative sample, but their volunteer group ended up heavily oversampling white women and heavily undersampling Hispanics and Asians. It seems quite likely that the oversampled population was closer to an unbiased group — people just interested in participating or seeing their result, whereas the undersampled populations were most affected by self-selection by people who think they have current or prior active infections.

      So it seems perverse to take a raw 1.5% prevalence (which likely already includes some level of self-selection bias) and then adjust it _upward_ because of underrepresentation of results by what is likely the most biased populations.

  26. “Willingness to participate” problems are why the only valid analysis so far is of populations that have been fully sampled (note: this necessarily includes the proviso that the administered test actually be good). So far, this has been limited to a few ships (the Princess and the French carrier; I don’t know if the entire crew of the Roosevelt has been tested).

    So far, this bug is behaving like an ordinary highly-contagious cold virus: it’s infecting 25% to 50% of the population, mostly symptomatically, with mortality figures not out of line with typical flu seasons.

    • What do you mean by “mortality figures not out of line with typical flu seasons”? Annual deaths from influenza and pneumonia are typically around 25 per 100,000 in New York City. Lab-confirmed COVID-19 deaths are already over 100 per 100,000.

      What do you mean by “it’s infecting 25% to 50% of the population”? What we observe is hard to reconcile with 25% to 50% of the population being already infected in general (but it could be in some places like NYC).

      If you mean that it’s (on track to) infecting 25% to 50% of the population and global mortality figures are not (yet) out of line with typical flu seasons (but will get out of line as we progress towards that 25%-50% infection rate) you could be right.

    • The Diamond Princess has seen 12 deaths (and counting) among 700ish infections. Are you saying that a nearly 2% IFR is typical for the flu?

  27. The preprint (https://www.medrxiv.org/content/10.1101/2020.04.14.20062463v1.full.pdf is what I have) actually gives an estimate of specificity of 99.5% (95% CI 98.3-99.9%), which makes it all worse. I have a far inferior discussion of this on my blog (https://observationalepidemiology.blogspot.com/2020/04/some-covid-19-study-thoughts.html) where I used it to point out that all 50 could be false positives (at least right now): 1.7 percent of 3330 is 56, which is > 50.

  28. I can’t get my head around their strategy for re-weighting and adjusting for sensitivity/specificity. Surely you should adjust for sensitivity/specificity first, then re-weight. But that would require a sample prevalence that is greater than the false positive rate for all sub-groups. This is impossible, given the number of sub-groups that they use.

    • Nicholas:

      I think the right way to do it is using a hierarchical model. Even beyond any other difficulties with the survey adjustment, the probability estimates start to fall apart as you get near the boundary.

    • Nicholas:

      I think this sentence is only helpful if you actually knew the false positive rate:

      “But that would require a sample prevalence that is greater than the false positive rate for all sub-groups”

      However, as you do not know the actual specificity, significant between-group prevalence differences may provide (weak) evidence for a higher specificity than implied by the validation data alone.

  29. At some point it’s going to become obvious that the severity of disease is directly related to viral load.
    Which explains why you’re seeing anomalously high death rates in certain densely populated locales.
    And also explains why healthcare workers are disproportionately affected.
    And why outside of these places, hardly anybody knows anybody who actually has symptomatic COVID.
    Further supporting universal sensible precautions as the only pragmatic and viable path forward for public policy.

    • Yes, Chris,

      My hunch too has been that the severity of COVID19-related symptoms is directly related to the extent of viral-load exposure, which can also explain why some subsets remain asymptomatic through the entire course of the infection.

    • > And why outside of these places, hardly anybody knows anybody who actually has symptomatic COVID.

      Or asymptomatic, for that matter.

      But I fully agree that healthcare workers are more exposed and it’s not just a binary thing. We saw that in Wuhan as well, and they were better equipped. There are 31,000 reported infections in Spain, mostly in primary care if I remember correctly. The CFR is 0.1%, by the way, but it’s difficult to generalize (they may be exposed to higher viral loads that could make it more severe, but they are also younger and healthier than the broad population).

    • Chris’s conjectures sound highly plausible — I hope researchers are investigating them. In the meantime, they seem reasonable enough to take into account in formulating guidelines for behavior that is likely to reduce the risk of infection.

  30. Beyond being a non-random and perhaps non-representative sample for Santa Clara, Santa Clara is in no way a nationally representative sample.

    Yet these authors did not get in front of their analysis being used in the popular media for extrapolating national infection rates.

    And Ioannidis even participated in similar discussions extrapolating from a sample that is about as non-random and non-representative nationally as a sample can get – passengers on a cruise ship.

    These guys are professional epidemiologists. I’m an Internet schlub. I have to assume they have a good explanation. But from this schlub’s perspective, what they’ve done seems supremely irresponsible.

      • I was thinking the same thing. I have had mad respect for Ioannidis.

        But this paper and his article about the Diamond Princess cruise ship seem to me to violate a first principle of good social science: don’t try to extrapolate from non-random and non-representative sampling.

        Ioannidis must have an explanation, right? I’d sure love to hear what it is.

        • Joshua:

          I think it’s ok to extrapolate from non-random and non-representative samples. If that’s the data you have, you go with it and then you work to get more data. It’s important when doing these extrapolations to state your assumptions clearly (and it’s not enough to just throw in a whole bunch of possible concerns in a grab-bag section near the end of your paper).

        • Andrew –

          The median income in Santa Clara is about 4x the national median income. Race/ethnicity demographics are not like the national demographics (e.g., something like 3% African American). SES and race/ethnicity variables have a huge influence on health outcomes.

          How can you extrapolate nationally from samples that amount to outliers in key variables (at least if you haven’t weighted/adjusted your sample accordingly?)

          Now they didn’t exactly do that with the Santa Clara study, but they put out the preprint and must have had knowledge that in the current context people would take their preprint and run with it to leverage arguments related to infection AND mortality rates and social distancing policies all across the country. And they didn’t get in front of that process to make it clear that their work could not support arguments about social distancing policies and mortality rates in completely different communities.

          The problem is even worse with the Diamond Princess cruise data since that sample is an obvious SES outlier, with the treatment conditions being about as non-representative as they could get. In that case, Ioannidis argued directly from the sample to extrapolate broadly about mortality rates.

          I see no problem with studying a non-random and non-representative sample. But I fail to see how it could be justifiable to broadly extrapolate from such a sample without adjusting it to make it representative, and without clearly explaining why the non-random recruiting method would potentially skew results. This seems to me like a basic epidemiological principle.

        • I agree with you, but it’s pretty much the case that I’m your choir and you’re preaching.

          Playing Devil’s Advocate, Bendavid can argue that he published nothing on national mortality, purely on Santa Clara county mortality. And that anyone who wants to can adjust based on population characteristics to extrapolate national mortality.

          My non-expert estimate, based on the fact that SC County has about 3/4 the prevalence of people over 65 and 1/2 the prevalence of obesity compared to the national average, is that actual US mortality will be somewhere between 1.33x and 2x Santa Clara’s.

        • I plan to reread the study. I doubt that such a claim would have been made.

          I think that redoing the study to the strictures elaborated here would be a good exercise. And it would be suspect if you then flinched from such an effort. As you know, very few articles can be said to be of such high caliber. This is a theme on this blog and in nearly every exposé about the expert world. Appeals to authority are the status quo.

        • Joseph –

          I work with some people who do epidemiological, health outcomes research with large databases.

          They are very careful to work with nationally representative data, or weight/adjust their data w/r/t representativeness and/or clearly caveat the limitations of their work when they don’t.

          Maybe that’s not typical, and surely not everyone does so properly. But people as renowned as Ioannidis should follow such practices when gaining wide public exposure discussing such hugely impactful research.

        • Joshua,
          My point is that the study explicitly _did not_ make any claims to estimating the national fatality rate from this study. So you can’t fault the study itself for making excessive claims.

          The authors? Sure. You can call them out. They’ve already given national fatality rate estimates based on much thinner data, so they should do us the courtesy of updating those estimates based on this additional info. Ioannidis in particular seems to want to use this as justification for a flu-like mortality rate, without even putting a number to it. That’s gross. I’d like to see Bhattacharya and Bendavid acknowledge that the 0.01% they threw out in the WSJ is obviously absurd, but so far all I’ve seen is Bhattacharya saying that this study was probably biased toward _under_counting cases since it didn’t include homeless or nursing homes. (I think it’s much more likely biased toward overcounting due to self selection.)

        • > My point is that the study explicitly _did not_ make any claims to estimating the national fatality rate from this study. So you can’t fault the study itself for making excessive claims.

          Fair enough.

          I stated elsewhere, but I acknowledge I’ve been somewhat inconsistent, that my amateur criticism of the study itself as opposed to their public dissemination campaign is with respect to their extrapolating from their (non-randomized, convenience sampling) recruitment methodology.

          Also, in my brief perusal of their limitations section I didn’t see any mention of extrapolating without adjusting for the SES profile of their participants. But those are criticisms w/r/t extrapolating to the Santa Clara community, not more broadly, and admittedly the SES issue would apply more to mortality rate than to infection rate.

    • Joshua:

      1. You write, “These guys are professional epidemiologists.” are you sure? The first author of the paper is a professor of medicine—I think that means he’s a doctor, not an epidemiologist. His graduate degrees are an MD and a masters in health services. The second author is a medical student with a masters in economics and a masters in public health. The third author has a PhD in policy analysis. The fourth author is a medical student with a masters in health policy. The fifth author is a medical student with a masters in epidemiology. The sixth author is a medical student. The seventh author is a medical student. The eighth author is a medical student. The ninth author . . . I don’t see his training on the web. He works for a nonprofit called Health Education is Power, Inc. I doubt he’s an epidemiologist but I can’t be sure. The tenth author runs a company that does lab tests. The eleventh author is a psychiatrist. The twelfth author is a biologist. The thirteenth author has a PhD in pharmaceutical sciences. The fourteenth author has a PhD in medical science and is an expert on blood doping. The fifteenth author is a masters student in epidemiology and clinical research. The sixteenth author has an MD but he’s been a professor of epidemiology. The seventeenth author has an MD and a PhD in economics.

      So, yes, there’s some epidemiology in that list. But I’d mostly call them a bunch of doctors and med students.

      I wouldn’t call what they did “supremely irresponsible.” Statistics is hard! They were just a little bit sloppy and made the mistake of not running their analysis by any statistical experts or any skeptics more generally.

      2. Regarding Ioannidis: I don’t know how much credit or blame you want to assign to author #16 of a 17-author paper.

      • I stand corrected w/r/t them being professional epidemiologists.

        I get that there is a convention of putting names on papers without accepting responsibility for their quality. I get the practical reasons for doing so. I still think it is supremely irresponsible.

        • And to add – I’m not saying they were irresponsible for making statistical or methodological errors. I am certainly not qualified to make a judgement in that area.

          My beef is with something more basic: generalizing from non-random sampling without making adjustments or clearly outlining the limitations of their sampling methods, and not getting in front of a situation where it was obvious that their work would be used to extrapolate nationally from a non-nationally-representative sample to justify arguments about life-and-death policies in the midst of a pandemic.

      • So I think that the 16th author of a 17 author paper should read the paper and, when there are startling conclusions, ask hard questions of the first and senior authors.

        Similarly, the media statements by the senior author lack nuance (as does the conclusion), and that’s really the big problem. I have a sensitivity test that I like to use: it is a bad sign if the inference changes based on one misclassified observation. Their 30/30 (100% specificity) becomes 29/30 (about 97%) if we misclassify one participant, and the 399/401 becomes 398/401 (about 99.25%), which shrinks the estimate importantly. It is fine to say that the data support higher rates of infection than previously thought. But the actual comparison to the flu is problematic (also consider that some deaths may be misclassified).
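
        As a rough illustration of how much one misclassified validation sample matters, here is a minimal Python sketch of the standard test-error correction. The raw rate (50/3330) and the specificity counts come from this thread; the sensitivity value (~0.80) is an assumption for illustration only, not the study’s number.

        def corrected_prevalence(raw_rate, sens, spec):
            # invert raw_rate = prev*sens + (1 - prev)*(1 - spec); floor at zero
            return max(0.0, (raw_rate + spec - 1.0) / (sens + spec - 1.0))

        raw_rate = 50 / 3330   # unadjusted positive rate in the sample
        sens = 0.80            # assumed sensitivity, for illustration only

        for label, spec in [("399/401 negatives", 399 / 401),
                            ("398/401 negatives, one misclassified", 398 / 401)]:
            print(f"{label}: corrected prevalence {corrected_prevalence(raw_rate, sens, spec):.2%}")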

        The sense of certainty moves it from “interesting data point” to “not very good”. It’d also be nice if they had a public data set available — they spoke to the media about a preprint, so they are clearly not worried about early inference. I think that with some categorization of age it would be impossible to identify individuals, and the other things I am worried about (the weighting methods) could be explored in detail.

        • Joseph:

          I agree that all authors should take responsibility for a paper. But the errors in the paper are subtle. The main statistical issues are clear: you need to account for differences in sample and population, and you need to account for errors in the test. But the solutions are a bit technical. None of the authors of the paper is a statistician, but one of the authors has an MA in epidemiology; maybe that person is the one who did the analysis, and I guess the other authors just said to themselves, “That’s technical stuff,” and didn’t think about it further. In that sense there’s a problem with delegation of authority. But some version of this delegation can’t be avoided. I’ve been a coauthor of lots of applied projects—including some high-profile projects—in which I’ve never touched the data at all. Ultimately it comes down to trust. All this is another reason why open data and code are a good idea—but in this case the authors seem to have provided us with enough information to know not to trust the conclusions. Other examples are more difficult. Remember that influential econ paper with the Excel error?

        • I think mistakes do inevitably get made, and the media outreach on this was overenthusiastic, but at this point the authors *absolutely need to make a public retraction*, because these findings have *significant and dangerous policy implications*.

        • Andrew, so I guess you are not counting JPA Ioannidis as a statistician for this purpose? Is he the one with an MA in epidemiology you’re referring to?

        • His undergraduate degree is in mathematics. In any case, one can argue that not every statistician or epidemiologist has an MD. And certainly, John has been trained by a very able coterie of evidence-based-medicine experts. That is worthy of admiration.

          Besides, those who have criticized him have also made mistakes. We all make mistakes. I guess I think that cliques can be blind to their own mistakes.

        • Thanks Andrew. I think I’m pushing back against the notion that these credentials matter much for explaining what went wrong here. I’ve seen plenty of silly analyses performed or endorsed by card-carrying statisticians and, conversely, much excellent work by people lacking the formal credential but who clearly have put in the time. Maybe Ioannidis’s work is overrated, but I think most people would call him a biostatistician for this purpose – for instance, Stanford has him as courtesy faculty in statistics and medical data science.

        • Doesn’t it feel like you’re being a bit overly charitable here?

          Ioannidis criticized prior estimates as being based on too-thin data, and then publicly gave his own back-of-the-envelope calculation based on even thinner data (one cruise ship).

          Bendavid and Bhattacharya criticized policy decisions being made on too-thin data, and put out the obviously counterfactual conjecture that the case fatality rate could be 0.01%.

          They’ve already laid the groundwork for how their published numbers will be used — they need to be held to a very high standard with regard to getting it right.

          Taking a 1.5% raw infection rate and adjusting it upward to 2.8% seems perverse when you _know_ you have an unaccounted-for self-selection problem. I argue that their post-stratification is heavily exacerbating their self-selection problem. All of those authors, and no one says, “C’mon guys, we’re taking 1.5 to 2.8? That doesn’t pass the laugh test.”

          And I still have no idea why they published a set of results based purely on their own specificity and sensitivity test results. Why would ignoring all that manufacturer data have any value? Is there a question of batch consistency among the tests, or of precision in reproducing testing protocols, where it’s better to validate against the way you did it versus someone else’s slightly different method? I know nothing about serology, but that seems bizarre.

        • I think you have put your finger right on the problem.
          ‘None of the authors is a statistician’
          With the richness of error sources available at this stage of a pandemic, the subtleties are crucial. The risk is that misunderstood results are applied in decision support.

      • Andrew said,
        “Regarding Ioannidis: I don’t know how much credit or blame you want to assign to author #16 of a 17-author paper.”

        Conventions on order of co-author listing vary from field to field. For example, in (at least some areas of) biology, the advisor of a Ph.D. student traditionally is listed as last author in publications resulting from the student’s dissertation. In math, the convention is alphabetical order by last name, and the advisor is typically not listed in papers based on a Ph.D. student’s dissertation.

  31. I’d really like for them to go test all passengers who were on the Diamond Princess cruise ship. Six people on that ship died, so if their proposed mortality rate is correct, at least 3000 of the over 6000 passengers should test positive. This wouldn’t prove their mortality rates to be correct, since of course people could have gotten infected after getting off the ship, but if they get a number under 3000 we would at least have clear contradictory evidence.
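
    A quick back-of-the-envelope version of that arithmetic, purely for illustration: the death counts (6 here, 13 per a reply below) come from this thread, and the 0.1% and 0.2% infection fatality rates are the ballpark figures implied here and quoted later in the thread, not official estimates.

    # implied minimum infections = deaths / infection fatality rate
    for deaths in (6, 13):
        for ifr in (0.001, 0.002):
            print(f"{deaths} deaths at IFR {ifr:.1%}: at least {deaths / ifr:,.0f} passengers infected")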

    • Sorry for the wrong nesting of this comment below…

      The number of passengers who died is now up to 13.

      (Go to “Worldometers” and search for “Diamond.”)

    • Grace,
      There’s a very heavy age bias to the death rate. Diamond Princess passengers were much older than Santa Clara County residents, so SC County should have a much lower mortality rate.

      I believe they did test everyone on the Diamond Princess, and it’s been 13 deaths out of 700 cases. But that much higher death rate than what this study estimates could easily be due to the age of the population.

      • Asymptomatic at the time of testing, but many would develop symptoms later. The fact that testing proceeded from the older to the younger and took weeks could be enough to explain that apparently surprising fact.

    • I was just poking around news sources I don’t usually visit to see where the study was getting traction, and I saw him giving an interview here: https://www.foxnews.com/media/david-katz-coronavirus-vaccine-herd-immunity

      I think it will be difficult for him to own the mistakes in this paper at this point. Also, as long as I’m on the topic of, “is this being used as we’d fear”, the 2nd most-read article on the Wall Street Journal (according to their home page) was this opinion piece: https://www.wsj.com/articles/new-data-suggest-the-coronavirus-isnt-as-deadly-as-we-thought-11587155298?mod=trending_now_pos2

      Too bad… :(

      • Josh:

        I followed the link and it had this:

        Katz told Levin that if he were immune to coronavirus, he would be able to visit his elderly mother without worrying about contracting COVID-19.

        “My mother doesn’t want to get coronavirus and die [but] she also doesn’t want to die of something else before ever again being able to hug her grandchildren because she’s still waiting for a vaccine,” Katz said.

        I guess it would be ok if Katz visits his mother. I mean, he’d be taking a risk, but we take risks every day. The odds are that he’s either unexposed or already has the antibodies, right? It would seem unlikely that he happens to be contagious just on the day that he visits her, right?

        Another option is that Katz could work at a clinic with coronavirus patients—or maybe he’s already doing that?—in which case that should increase his risk of exposure. He could just do intake and not wear a mask. Then, assuming the virus doesn’t hurt him, he can go visit his mother after he recovers with some confidence.

        • That’s a great suggestion. I’m sure the good doctor will be in an NYC ER ward any moment now.

          I should have mentioned that Dr. Ioannidis is on the video clip beginning around the two minute mark. He spells out how the study’s implications support Katz’ proposal that we should start focusing on getting to herd immunity (presumably by getting back to work): “The best data we have now suggests … the number of people that will die is probably in the range of 1:1000”. That, plus the WSJ article confirm our fears that this is feeding into a broken discussion in a way that could have real-world consequences.

          Also, FWIW, I’ve been out of the academic loop for years now and had never heard of any of the researchers before Saturday. No grudges or anything. I saw a headline alluding to the paper and clicked through because of how tantalizing the conclusion was (and I really wanted it to be true).

          Getting back to a point Andrew made at the beginning, this story has apparently gotten some traction. Because of this, it will be personally difficult for the researchers to issue a retraction and an apology. It also means that I, for one, would applaud those actions as especially brave.

      • I’m not going to watch a Fox segment, but here is Ioannidis responding directly:
        https://youtu.be/jGUgrEfSgaU

        He stands behind the infection rate range of estimates, and goes on to posit that Covid-19 has an infection fatality rate “in the same ballpark as seasonal flu”, which he seems to think makes it not too big a deal.

        A lot of things to take issue with there, but I’ll take issue with the most obvious: a novel virus like this, with such a high transmission rate, is going to take upwards of 80% of the population getting it to achieve herd immunity.

      • Says “Our data suggests that” the infection mortality rate is the same as the seasonal flu.

        He says that based on sampling from Santa Clara.

        Can someone please explain to me how data from Santa Clara can be used to infer mortality rates in communities across the country – given that on average the country comprises communities very much unlike Santa Clara in a number of absolutely key variables?

        Please!

        (BTW – He argues that the reason for the higher number of deaths compared to the flu is hospital-acquired (nosocomial) infections.)

        As for his arguments about the impact of mandated social distancing orders – he treats the various outcome pathways as being a binary forking path. In other words, he completely ignores that many of the economic and psychological harms taking place under “lockdowns” will simply not disappear if the economy is opened contemporaneous with high rates of infection. The harmful outcomes are not only a function of the social distancing orders, but from the concerns about being infected irrespective of whether such orders are in place.

        • I’ll also note that he talks about the uncertainty resulting from the “dying with” vs. “dying from” Covid-19 issue.

          Ok, a valid issue.

          However, he fails to discuss the uncertainty about number of deaths given that people are dying without being tested (e.g., dying at home).

          Why does he only talk about the uncertainty on the one side of the issue?

          Seems pretty weird to me.

        • Ok – one last comment then I’ll stop cluttering up the thread:

          He also only talks about deaths as an outcome. He doesn’t discuss much the impact of serious illness, ICU admissions, hospitalizations, etc.

          Seems to me like he’s discussing the data in a manner so as to defend a position rather than address all the uncertainties in all directions.

        • >Seems to me like he’s discussing the data in a manner so as to defend a position rather than address all the uncertainties in all directions.

          exactly. He’s lost 100% of his credibility with me.

        • Ioannidis seems to be getting this crisis mostly wrong, both empirically and from a risk management perspective, which is quite sad. I don’t want to get into too much academic trash talk, but I have often wondered why his contribution ‘Why most research findings are false’ was such a revelation. When I first read it years ago, it seemed like some very straightforward deductions from the whole type-I/type-II paradigm. Of course, later still, I realized that the NHST paradigm itself is an unfortunate straitjacket! So I was disappointed more recently in him leading the counter-charge against moving past it when that ‘Abandon Significance’ paper came out…

        • So like Sweden, John Ioannidis leans to a particular point of view about social distancing, etc. How do you know that you are right, Chris? Our health care system is less well equipped to respond efficaciously to such a public health crisis. It creates its own set of risks as well, risks that preceded this specific one.

        • Sameera, I am not certain. No one is. Uncertainty actually means, from a risk point of view (in view of pandemics being multiplicative, very fat-tailed, etc.), that we should over- rather than under-react. Empirically, I take Andrew’s critique here pretty seriously.

        • Unlike Ioannidis, Chris appears to know the extent of what he doesn’t know. Ioannidis is out there loudly proclaiming his mistaken confidence. This is malpractice.

        • The news report is from 2013 but they simply repeated the 2002 figures. They did not conduct independent research.

          The figures come from here:

          https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1820440/

          “In 2002, the estimated number of HAIs in U.S. hospitals, adjusted to include federal facilities, was approximately 1.7 million: 33,269 HAIs among newborns in high-risk nurseries, 19,059 among newborns in well-baby nurseries, 417,946 among adults and children in ICUs, and 1,266,851 among adults and children outside of ICUs. The estimated deaths associated with HAIs in U.S. hospitals were 98,987: of these, 35,967 were for pneumonia, 30,665 for bloodstream infections, 13,088 for urinary tract infections, 8,205 for surgical site infections, and 11,062 for infections of other sites.”

        • Josh,

          I doubt that he is arguing that Santa Clara is like other communities. He says specifically, ‘Our data suggests that the infection mortality rate is the same as the seasonal flu.’ Not that the infection mortality rate in Santa Clara is like the mortality rate in another community or other communities.

          Insofar as the higher number of deaths now may also be a function of acquiring [nosocomial] infections: this subject has been covered by a host of very knowledgeable experts, including Shannon Brownlee, Elizabeth Rosenthal, Mike Magee, and a rather large number of nurses. You don’t think that is a factor too?

        • Sorry – I hit submit too soon.

          > Not that the infection mortality rate in Santa Clara is like the mortality rate in another community or other communities.

          He is flat out saying that the Santa Clara data “suggest” that the mortality rate is the same as the flu.

          In other words, he continuously extrapolates from the Santa Clara data to justify a mortality rate of national or even global dimensions.

          Do you not think that doing so is problematic?

        • Let me read through the study again. I don’t think John Ioannidis relied simply on that study alone. I think it is a cumulative hypothesis that also comes from his Berlin Metric program. He had been working with the Italian health services also. He pointed to the large population of elderly that contracted COVID19, thus implying that the fatality rate is higher among the elderly and immune-compromised. He continued to suggest that far more testing is needed.

          If I had been in John’s position, I would have worded several of his public articles differently b/c they are subject to misreadings. But misreading and misinterpretations are par for the course, from what I observe. I force myself to read every article 3 times since they are so technical and dense. Even then I miss nuances.

          Whether John is justifying a national and global mortality rate is a question. I highly doubt that is his intention.

          Lastly, his very article that is being blitzed was really about the risks & consequences of a much longer term lockdown.

        • Sameera –

          Please watch the video I linked above.

          I’m not only talking here about the study, where the extrapolation is more caveated. (My issue with the study is different – there I take issue with the non-random recruitment.)

          In the video he says that the Santa Clara data suggest a mortality rate the same as the seasonal flu. He doesn’t caveat that mortality rate with respect to the lack of representativeness. He doesn’t talk about Santa Clara being non-representative w/r/t key variables such as SES and race/ethnicity. He then goes on to talk about the societal implications of a mortality rate which was calculated from a non-representative sampling.

          I don’t think that is proper science.

          He did the same with the cruise passenger data. AFAIC, it would be hard to generate a less random and less representative dataset than collecting cruise passengers as a sample.

          Please get back to me after you watch the video. I’d be curious to hear your response.

          Maybe there is a good explanation. I’d like to find out if my non-expert understanding is wrong. But from where I sit, what he’s doing is irresponsible, with very important implications to a vital public health issue.

        • My understanding is that there are about 1.7 million healthcare-acquired infections annually, of which about 99,000 result in death. I don’t think that rate has changed appreciably since 2013.

          Recall that Dr. Fauci estimated 100,000 to 240,000 deaths by the end of April. So we should also be mindful of claims made from January on, when the precautionary approach was being debated. My point being that this is a fast-evolving numbers situation.

        • The 99k numbers are 2002 figures. I have not seen a more recent set of figures for healthcare infection deaths.

        • Sameera –

          I wasn’t intending to diminish the importance of nosocomial infections. I’m no expert, but it seems to me that logically that would be an important consideration.

          My point is that it is important to interrogate the uncertainty from both sides.

          My antennae go up when I see highly intelligent and knowledgeable people effectively ignore uncertainty on one side of an issue in favor of talking about uncertainty on the other side.

          My antennae also go up when I see people extrapolating broadly from unrepresentative data.

        • Joshua,
          You are right to call out the SC County research. I take up items with which I have tried to complement others in this long thread; I am down below at April 22, about 5 PM.

          One can only infer US-wide mortality from the SC County sample collection if many, many variables are controlled or carefully calibrated. It has been reported just today that the first COVID-19 death in the US was a Santa Clara resident, not, as previously thought, someone from Washington state. That single fact makes the Stanford survey even more dubious, because SC County may have had a far earlier outbreak, by at least 1-2 months, than the rest of the nation. Therefore, whatever actual prevalence exists in SC County, it greatly exceeds that of the rest of the US if only due to that fact.

          If Stanford wanted to prove the claims it released Friday – that the infection mortality rate is on the order of the seasonal flu – there are many things it could have done to validate its own data. One is to have repeated the same test 2-3 weeks after the early-April survey. Even with distancing, the rate of antibody positives in the later sample should have doubled or more from the claimed ~4% infected rate. I believe they fear the results would shoot down their initial April 3-4 sample and findings. They were motivated to conclude on the basis of a single, flawed test.

  32. Hi,

    Just a thought – I confess I did not read the whole thread. But I suppose one good way of doing this kind of test would be to dig up a thousand or more pre-corona-era blood/plasma samples from the freezers of blood banks, hospitals, forensic labs, universities… wherever needed, in the area to be studied. Then run the test in that area, freezing the new samples and mimicking the other sampling conditions as best as possible. And then compare/calculate the INCREASE in the percentage of positives.

    Waltteri Hosia, PhD protein chemistry.

  33. I was thinking about this last night – I’m not convinced that age should be something worth segmenting on. We’re talking about possession of antibodies here as a proxy for the number of people infected, not how the disease affects people, and assuming that everyone under 65 with a similar likelihood of having a day job would have similar infection rates seems like a reasonable enough assumption. One worth testing, but also one that seems fairly safe.

    Over 65 you might get a couple strata as people are more likely to be in a group home at some point, and of course some percentage of people have died or are sick and therefore would not be available for a blood test.

    Just a thought.

    • If we pool all the lateral flow immunoassays (same method as Santa Clara) we get 14 false positives amongst a sample of 653, so Agresti-Coull gives a 95% CI for false positive rate of 1.23%-3.64%. Given their small sample sizes per individual manufacturer there’s no clear evidence any particular model performs differently.
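
      For anyone who wants to check that interval, here is a minimal sketch of the Agresti-Coull calculation for 14 false positives in 653 pooled validation samples, as stated above; it should land close to the quoted 1.2% to 3.6%.

      from math import sqrt

      def agresti_coull(x, n, z=1.96):
          # add z^2 pseudo-observations, then apply the Wald formula to the adjusted counts
          n_adj = n + z**2
          p_adj = (x + z**2 / 2) / n_adj
          half = z * sqrt(p_adj * (1 - p_adj) / n_adj)
          return max(0.0, p_adj - half), min(1.0, p_adj + half)

      lo, hi = agresti_coull(14, 653)
      print(f"95% CI for pooled false-positive rate: {lo:.2%} to {hi:.2%}")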

      • In that study, they operated the lateral flow assays according to manufacturer instructions. In the Santa Clara Study, they either did not, or their maths is even worse than we thought.
        The instructions for the Premier Biotech device say the test is positive if just one of the IgG and IgM test strips is positive. This means we must consider the false-positive rates for both IgG and IgM; the manufacturer has it at 3/371 for IgM, so we have a combined false-positive rate of 3/371 to 5/371 depending on the overlap. The IgM specificity is not documented in the paper at all!
        This leads me to conclude that when they wrote, “the total number of positive cases by either IgG or IgM” they meant something like “positive either way” in the sense of both; this also justifies using the lower value for sensitivity.
        But then, if this assay had been included in that study, it would have been operated differently, and so the result there can’t apply here.
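
        To make the “depending on the overlap” point concrete, here is the trivial bounding calculation, using the 2/371 IgG and 3/371 IgM false-positive counts cited above; the overlap between the two strips is unknown, so only a range can be given.

        n = 371
        fp_igg, fp_igm = 2, 3
        fp_min = max(fp_igg, fp_igm)   # complete overlap: every IgM false positive also trips IgG
        fp_max = fp_igg + fp_igm       # no overlap: the strips never misfire on the same sample
        print(f"either-strip false-positive rate: {fp_min / n:.2%} to {fp_max / n:.2%}")
        # for comparison, the raw positive rate in the survey was about 1.5%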

  34. I am sincerely disappointed in Dr. Gelman, whose work I have eagerly read for years. He laments about a “couple of assholes” regarding politicization of the virus. Rather than name calling, Dr. Gelman ought to deal with substance and substance ONLY. If you don’t like the way things are currently debated, then rise above it and hold the standard high, where it belongs. Be a leader, and others will follow. Be a name caller and you just further degrade science and scientists, whom the public trusts less and less each day.

    It’s time for the community to start policing itself. Name calling, politicized twitter commentary, faculty lounge discussions revealing biased political leanings tied to research areas – these should all be things of the past. They are beneath the noble calling of science. Furthermore, as an aside, it’s time for scientists to stop all affiliations and associating with political figures. No more pictures of Scientist X with Politician Y, no more arm-in-arm pictures at fundraisers or social gatherings. Perception matters, and right now the public perceives that most if not all scientists are incapable of objective work and objective reporting of results. I cannot fathom how awful our society will become if the general public ever feels as jaded about its scientists as it does about its politicians.

    Science is a quest for knowledge. Politics is a quest for power. The latter corrupts the former. Please, let’s protect science from the influence of politics, and purge political language from scientific discussions.

    • Todd:

      Sure, I agree, “assholes” is subjective. I thought that was clear! I don’t think that calling those people “assholes” had anything to do with politics—I think they were just being assholes! But if it bothers you, I understand, and please ignore those words and focus on the rest of the post, which is all about statistics and social science.

      Regarding the rest of your comment: I don’t think it’s possible for scientists to avoid all contact with politicians. Flip it around: this would imply that politicians have no contact with scientists! I’m not a schmoozer myself, but I appreciate that some scientists do schmooze. If no scientists ever schmoozed, I have a feeling that science would have even less influence on policy! Like it or not, we live in a connected world. Doing science in a monastery isn’t really an option. And the monks had to keep their superiors happy too, right? It’s turtles all the way down.

      I don’t really care if a scientist has a political belief that I find objectionable. I care much more about science than about scientists. Scientists are a vehicle for producing science. The vehicle is imperfect but the output can be beautiful and useful. The output can also be flawed, as discussed in the above post, and then we should try to figure out what went wrong.

      • I don’t really care if a scientist has a political belief that I find objectionable.

        With all due respect, Prof. Gelman, I draw your attention to our recent discussion of a blog post where you, in the course of complaining about the politicization of the coronavirus discussion, implied that we should not take the advice of a particular individual, due to a political belief you perceived that individual to hold that you found objectionable.

        I don’t mean to say this individual’s advice was good, or that they were entirely above any kind of criticism… I mean to say that you are perhaps not living up to your own stated standards and you are perhaps not always fighting fair. Statistics > cancel culture.

        You may find it interesting to contrast Bennett’s remark with this (rather tasteless, in my opinion) joke by Daniel Tosh, a liberal comedian, from a recent standup… is it possible that Tosh is given a pass on making this joke because his liberal credentials were firm in the eyes of the audience? Could it be that our notions of what is and is not acceptable to say have more to do with our need to defend our tribe and smash the enemy tribe than anything else?

        • Fan:

          Where did I imply that we should not take the advice of a particular individual, due to a political belief that I found objectionable?

          If you’re talking about William Bennett, I’m implying we shouldn’t take advice on decision making under uncertainty from someone who says he broke even after playing millions of dollars on slot machines, as this implies to me that he’s either delusional or a liar, or both. Also he said something really stupid about a ridiculous counterfactual. If someone makes a habit of lying or saying stupid things on topics involving causal inference and decision making under uncertainty, then, yeah, I don’t recommend we take his advice. This has nothing to do with whether I agree with his political beliefs.

          As to Daniel Tosh . . . I’ve never heard of Daniel Tosh. Maybe I wouldn’t want people taking his advice either, I don’t know. There’s lots of people who I don’t think should be giving out policy advice. It’s a free country; Bennett and Tosh can give out advice as much as they want, but I’m also free to point out when these advice-givers say things that are stupid, trollish, and flat-out lies.

        • Andrew called someone an asshole, that may be mean, but isn’t politicisation. Why are you politicising this issue by pointing to Tosh as a liberal? Tosh is a comedian, and I don’t think this blog is in the business of reviewing comedians, liberal or conservative. Is Bennett a comedian?

          This performative outrage seems in very bad faith. Given the issues in some of this, I think Prof Gelman is, if anything, excessively civil.

        • I’m not outraged, just maybe a little disappointed.

          Andrew called someone an asshole, that may be mean, but isn’t politicisation.

          It can be. It makes things personal. If you call someone an asshole because they’re willing to discuss counterfactuals your political tribe considers taboo, esp. if those counterfactuals aren’t relevant to the subject matter at hand, that seems like a relatively clear-cut case of politicization.

          In any case, seems like I’ve made my point now so I’ll get back to lurking. All the best to you, Zhou.

        • Fan:

          Bennett said something that was both stupid and offensive. The fact that the statement was offensive does not get it off the hook for being stupid. Saying what he was saying may make him politically incorrect, it may make him bold, but it doesn’t make him a bold truth-teller. This is a fallacy that we see on both the left and the right: the idea that a ridiculous claim is validated in some way, just because it’s risky or taboo or offends somebody. I’m not buying it.

      • Hi Andrew,

        > Sure, I agree, “assholes” is subjective. I thought that was clear! I don’t think that calling those people “assholes” had anything to do with politics—I think they were just being assholes! But if it bothers you, I understand, and please ignore those words and focus on the rest of the post, which is all about statistics and social science.

        Why don’t you make the *entire* post about statistics and social science by removing the inflammatory non-contribution at the beginning?

        • Robert:

          I dunno, I guess I want to make things entertaining. This is a blog, not a textbook! And it’s all free, so you can skip the parts that don’t interest you.

          Really, though, I think you should be asking why those people had to be such assholes in the first place. If they hadn’t been assholes, I wouldn’t have had to write that.

    • Re: It’s time for the community to start policing itself. Name-calling, politicized twitter commentary, faculty lounge discussions revealing biased political leanings tied to research areas – these should all be things of the past.

      —————
      One of my favorite topics and one that a former President of Yale, Kingman Brewster, commented on when we were attending a symposium at Yale Divinity. I expressed my own frustration with departmental politics and prestige mongering when Brewster visited Boston. The sociology of expertise is a neglected dimension in assessing the quality of knowledge.

    • Todd, Thank you for your comment. I was about to construct my own comment pointing out the same thing but you captured it perfectly.

      The most disgusting thing I’ve witnessed from self-professed scientists in the past few decades is an abandonment of their objective dispositions which are a prerequisite for scientific inquiry. Despite what post-modernists claim, true knowledge exists and is not the result of hype and bullying and politics. When self-called scientists abandon the objective disposition then they can no longer credibly claim their work is science.

      One of the biggest problems with academia, I have learned, is that most science PhDs do not actually know the math and formulas they use as well as they should. This blog post’s opening paragraph should have been a clear citation and re-calculation of the parts of the Stanford paper that the author objects to.

      I think the author resorts to name-calling because he wants to speak on the subject, wants to make a point, but is not willing to put the effort into a careful construction of his criticism so he resorts to bullying and shock.

      • Bobby:

        I see no hype or bullying in the above post. If people release a public document and share it with the press, it’s completely acceptable scientific behavior to question its assumptions and to point out flaws in the work. It’s also ok to express opinions that are clearly stated as such. Similarly, if people write public op-eds, it’s not “hype” or “bullying” to express disagreement with them, or even to call them assholes. Name-calling might be rude or counterproductive, or it might be entertaining and true, or it could be all of these things!, but it’s not hype and it’s not bullying. These people are in the arena.

  35. “Jeez—I just spent 3 hours writing this post. I don’t think it was worth the time. I could’ve just shared Rushton’s email with all of you—that would’ve just taken 5 minutes!”

    I, for one, greatly appreciate the work you put into walking us through this step by step. I think there’s a lot of value added there.

      • I don’t think the data from the study is sufficient to overwhelm the prior, which for me, would be to take the death toll and divide by the usual death rate estimate (somewhere between 0.2% and 2% I think), to get a prevalence in Santa Clara of between 2.5% and 0.25%.
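
        For what it’s worth, a minimal sketch of that back-of-the-envelope prior. The county population (~1.9 million) is approximate, and the death toll (~100 at the time) is an assumption for illustration, not an official count.

        deaths = 100          # assumed Santa Clara County death toll at the time
        population = 1.9e6    # approximate county population

        for ifr in (0.002, 0.02):   # the 0.2%-2% range mentioned above
            print(f"IFR {ifr:.1%} -> implied prevalence {deaths / (ifr * population):.2%}")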

        • Thanks for the response, Zhou. I was wondering whether it would be worth redoing the study in Santa Clara with the proposals that are identified here on the blog. Otherwise, the prevalence you calculated would not afford much utility.

        • What would be valuable is redoing the specificity tests on the antibody testing methodology. Until you have a methodology with a false positive rate you are certain is much lower than your prevalence, trying to estimate prevalence with that method is a waste of everyone’s time and resources.

        • What’s the source of that particular “usual death rate estimate”? Are you using an estimate of the Case Fatality Ratio?

        • It’s not a very rigorous methodology; I’m just trying anecdotally to recall case fatality ratios across a range of countries and the various estimates people have come up with. The width of the interval hopefully communicates that.

      • The point is that given the uncertainty in the false positives, and the uncertainty in the representativeness of the recruited population, the study was not informative except that if your prior was that tens of percent or more could have had it, it squashed that down…

        so we’re left with basically “any amount less than 5%” or something like that. Since my prior was basically that already, this study didn’t inform me at all.

  36. Andrew –

    Hoping you’ll read this comment.

    There’s something rather specific I’m hoping you’ll address. I know that you indicated above that you feel differently than I do about extrapolating from non-random or non-representative sampling…but I have a more specific question here.

    Some of these authors are now out in public supporting extrapolations from the Santa Clara data to a NATIONAL mortality rate.

    Given that Santa Clara is clearly questionable as being nationally representative (race/ethnicity makeup, income levels, population density, etc.), don’t you think it is clearly irresponsible for them to be doing so – in particular about a topic directly relevant to the outcomes of a pandemic disease? I’ll remove the perhaps hyperbolic “supremely” qualifier.

    Again, I’m not talking here about subtle statistical issues – I’m asking about basic principles of epidemiology (and actually, scientific method).

    I’ve been wanting to ask Ioannidis this question since I first saw him extrapolating from the cruise passenger data (which seems to me is even more of a non-representative sampling)… but obviously I have no way of getting this question to him.

  37. Overkill is right. What are your solutions? How would you get an accurate prevalence estimate? You just wasted my time with a 1st-year grad student journal club hit job and no better approach besides “avoid selection bias” or “get a more sensitive assay”. Use your next 300 hours running a study that does better.

    • Anon:

      As discussed in my above post, my solutions for this particular dataset would be:

      1. Use multilevel regression and poststratification to generalize from sample to population. As discussed in the above post, the simple poststratification used in the article has serious limitations.

      2. Use Bayesian analysis to combine the uncertainty in the sample and variation in the assays that were used to assess error rates. As discussed in the above post, the methods used in the article to summarize uncertainty have serious limitations.

      3. Perform an analysis to assess sensitivity of conclusions to assumptions about selection bias.

      4. Final conclusions should reflect uncertainties. Don’t act as if you’re sure, or nearly sure, if the data are also consistent with alternative results.

      5. Release data and code.

      Finally, if you write a paper that’s so bad that it can be shot down by “a 1st year grad student journal club,” maybe you should try to do better work.
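
      To give a flavor of point 2 (and, in spirit, point 4), here is a minimal Monte Carlo sketch, not the full MRP-plus-measurement-error analysis: propagate Beta uncertainty in specificity, sensitivity, and the raw sample rate into the corrected prevalence. The 399/401 specificity counts and the 50/3330 sample appear in this thread; the sensitivity counts below are placeholders, not the study’s actual validation data.

      import numpy as np

      rng = np.random.default_rng(0)
      draws = 100_000

      spec = rng.beta(399 + 1, 2 + 1, draws)     # 399 of 401 known negatives tested negative
      sens = rng.beta(60 + 1, 15 + 1, draws)     # PLACEHOLDER sensitivity validation counts
      raw = rng.beta(50 + 1, 3280 + 1, draws)    # 50 positives out of 3330 tests

      # classical correction, truncated at zero when false positives could explain everything
      prev = np.clip((raw + spec - 1) / (sens + spec - 1), 0, 1)

      lo, med, hi = np.percentile(prev, [2.5, 50, 97.5])
      print(f"corrected prevalence: median {med:.2%}, 95% interval ({lo:.2%}, {hi:.2%})")
      print(f"share of draws with prevalence below 0.5%: {np.mean(prev < 0.005):.0%}")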

  38. I’ve written a mini-critique from a layman’s point of view as part of my roundup of the day’s COVID-19 news:

    https://medium.com/before-the-apocalypse/the-day-in-good-news-at-least-by-covid-19-era-standards-april-20-47caabf8b392?sk=4d79684b14909a58f1dd1664d992e0d1

    I’d also respectfully suggest that if you’re in New York and don’t know anyone with the virus, it’s worth acknowledging that you’re writing from a place of privilege. I wouldn’t recommend doing this at the moment, but if you were able to walk a few blocks outside Columbia’s campus, I’m sure you’d find all too many people who’ve been afflicted.

    (The rest of my post is considerably funnier.)

  39. Beau Dure, I see your Medium article contains this sentence:
    “Between 50% and 80% of people may have already had the virus pass through their systems”.

    That’s simply wrong.

    Now I suppose if you’re talking about NYC specifically and extrapolating from the Bendavid study’s SC County death rates, you could’ve gotten that NYC is as much as 150% infected, but obviously that would be wrong. Your 50%-80% is wrong too; where did you get it from?

    • Preceding that sentence: “One possible ramification of this study: …”

      The rest of the post takes into account the many caveats to that conclusion.

        • I have no idea what you’re referring to, unless you’ve somehow translated “Santa Clara county has had 50 to 85 times more cases than we knew about” to “Between 50% and 80% of people may have already had the virus pass through their systems”. Those two statements are in no way equivalent.

          If the former is what prompted you to write the latter, then go and take a look at the % of Santa Clara County that is infected that we know about, and then get back to us.

        • Respectfully, you have to read the rest of the post. I write thousands of words a week, and I’m quite happy to go back and clarify if something is truly unclear. This is not one of those times.

          From the post:

          “You may have seen the news that a Stanford study that tested volunteers for the presence of antibodies found many people had antibodies. One possible ramification of this study: Between 50% and 80% of people may have already had the virus pass through their systems, the vast majority without serious impact.

          “Science sites passed along this information with all due caveats — small sample size, potentially unreliable tests, sample that wasn’t really random, etc. Those caveats weren’t enough for people like Andrew Gelman at Columbia, who gave a breakdown of statistical issues with the study.”

          So what I’ve done here is to mention the existence of that study. It’s impossible to write about that study without mentioning that study. Dr. Gelman had to mention it in order to refute it here, hence my comment to Zhou Fang.

          At no point did I say the analysis here was wrong. I did say Dr. Gelman undermined his own credibility with one unnecessary insult (I’ve noticed that this blog as a whole has the snarky tone of an early-2000s entertainment site) and a puzzling anecdote (“I don’t think I know anyone who’s had coronavirus”) that seems like a strange statement for a statistician and frankly comes from a place of privilege — you have to be in a social bubble to be in New York and not know anyone who’s had it.

          Instead, my conclusion is that we don’t have a conclusion. We’re living with uncertainty right now. I don’t think that’s controversial.

          The rest of the post goes on to talk about uncertainty. I don’t think the existence of uncertainty is controversial.

        • My beef is that you wrote up this entire thing as, in your words, a “nerd fight”.

          Some people say X, some people say Y, I guess we don’t know. In the end you presented these “ramifications” and the authors’ presentation, but Gelman’s critique is summarised merely as “statistical issues”, with no information on whether these issues are minor or fatal (as they are in this case). Then thirty percent of your write-up is nitpicking Gelman’s tone and one random sentence Gelman wrote.

          One thing is not uncertain. Stanford did their analysis *wrong*. Perhaps their conclusion is correct, but that would be *by accident*.

          And no, 50-80% of people already having the virus is flat out wrong even if you take Stanford as correct. That conclusion is baseless – it’s not even what Stanford claimed.

        • Beau:

          I agree that the existence of uncertainty should be uncontroversial. Indeed, my big problem with that Santa Clara study was that the researchers understated uncertainty by giving confidence intervals that were too narrow based on their data. This is indeed kind of a nerdy point, but they do pay me to be a statistician, so that’s what I do.

        • Beau:

          Also, what’s this bit about “Those caveats weren’t enough for people like Andrew Gelman at Columbia . . .”—as if I’m somehow demanding some huge thing.

          Of course a bunch of “caveats” are not enough for me, or for any statistician! This is not a legal brief; it’s a quantitative study. In a quantitative study, you should make your data and assumptions clear, and you should clearly lay out the steps going from data and assumptions to conclusions. Those steps had errors. In statistics, you’re not supposed to do things wrong and then issue “caveats”; you’re supposed to do it right the first time—or, if someone points out mistakes, you’re supposed to correct them.

          This is so annoying. These are math questions. If you want to ignore the math because you’re not happy I shared an anecdote, that’s your damn problem. The coronavirus doesn’t care.

  40. Is it a violation of research ethics to write opinion pieces promoting your study without any acknowledgments or disclaimers that you were one of the authors?

    https://www.wsj.com/articles/new-data-suggest-the-coronavirus-isnt-as-deadly-as-we-thought-11587155298 is a piece in the Wall Street Journal that’s picking up a lot of interest (it’s currently the #1 most popular story the website). It uncritically explains and extrapolates from the study without any acknowledgment or disclaimer that the author was also one of the authors of the study.

    I think this has a high risk of making the results of the study sound more certain and trusted than they actually are.

    • Anon:

      As a person who’s written a few op-eds and newspaper articles, I’ll just say that they have space constraints and short deadlines. Usually when people write for the newspaper they’re happy to claim credit for their own work! So I’m guessing this was an oversight.

  41. Technical note to your correspondent (Rushton) — please try to avoid writing “*the* exact binomial 95% CI,” as there is no single confidence interval for this situation (even with the requirement of being “exact”).

    E.g. the Clopper-Pearson 95% CI for 2/401 goes from 0.00060458 to 0.01789969 (this is probably the one you meant). The Blyth-Still-Casella 95% CI goes from 0.00088690 to 0.01737243 (strictly narrower). And beyond those two, there are plenty of other valid, exact, two-sided, 95% CI procedures for the binomial proportion with 401 draws.

    None of them seem like quite the right mathematical tool for this situation (and I don’t think you have to depart from “strict frequentist logic” to say that) so I’ll stop here — there are many more important comments above.
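
    (For completeness, a minimal scipy sketch of the Clopper-Pearson interval quoted above, computed from Beta quantiles; the Blyth-Still-Casella interval needs more specialized software and is not reproduced here.)

    from scipy.stats import beta

    x, n, alpha = 2, 401, 0.05
    lower = beta.ppf(alpha / 2, x, n - x + 1)       # Beta(2, 400) quantile
    upper = beta.ppf(1 - alpha / 2, x + 1, n - x)   # Beta(3, 399) quantile
    print(f"Clopper-Pearson 95% CI for 2/401: ({lower:.8f}, {upper:.8f})")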

  42. Has anyone looked at the Dutch serology study of nationwide blood donors? Little to no discussion in english press. A brief reuters wire. It was noted in the run up, but curiously no analysis or examination once completed. Would love to have someone look and report back in english…

  43. Both RNA and antibody testing (various manufacturers/labs/samples/times) point to 1.5-3.5% infected rates, and about 3% mortality in the US, which is to me amazingly well correlated, matches the model without much adjustment. That’s as objectively accurate as we can be at this instant. No apologies. We have to keep the infection rate low so critically ill patients can get top notch care, not just “the standard protocol”.

  44. To me, the only deal-breaker among the six criticisms is the false positive/specificity, but it looms extremely large. Could it be dealt with simply by doubling the number of blood samples from each participant, or by cutting each sample in half and testing both halves? Even with single-test specificity of only 95% (false positive rate of 5%), wouldn’t a repeated test generate a false positive rate of about only 0.25% (5% x 5%)?

    In turn, wouldn’t a false positive rate of 0.25% be sufficiently low to allow much better inferences to be drawn from future serology studies?

    • False positives can result from cross-reactivity with different antibodies than those you are testing for: for instance antibodies to a different coronavirus that has been prevalent in the target population. If that’s the case, a patient who falsely tests positive once will falsely test positive every time.

    • Depends on the reason for the false positive.

      I’m no epidemiologist, but my understanding is that one reason for false positives is the test picking up a response to related diseases. Presumably every test you propose above will come up (falsely) positive in that case.
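
      A tiny simulation of that point, with made-up rates (0.5% of uninfected people cross-react on every test, plus a 1% independent error per test); it shows that repeat testing only squashes the independent part of the error.

      import numpy as np

      rng = np.random.default_rng(1)
      n = 1_000_000                 # uninfected people
      p_cross = 0.005               # assumed: always test positive (cross-reactivity)
      p_noise = 0.01                # assumed: independent random error per test

      cross = rng.random(n) < p_cross
      test1 = cross | (rng.random(n) < p_noise)
      test2 = cross | (rng.random(n) < p_noise)

      print(f"single-test false-positive rate: {test1.mean():.2%}")
      print(f"positive-on-both-tests rate:     {(test1 & test2).mean():.2%}")
      # with purely independent errors the repeat rate would be ~0.01%;
      # with cross-reactivity it cannot fall below the ~0.5% cross-reactive share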

      • Ok, thanks for the clarification, both Joseph & Pierre-Normand.

        However, if they conducted a second study, the false positive rate should be about the same in both, and any increase in the positive rate could be taken as a bare minimum of the true population infection rate; that holds even under the most pessimistic assumption that all the positive results from the first serology tests were false positives.

        If the serology positive rate increased at the same rate as the county positive counts (from health labs), then that would be further evidence both that false positives are vanishingly small, and thus that the serology portions are a good approximation of the population-wide prevalence.

        For example, if the raw positive rate from serology rose from 1.5% to 2.5%, then even under the most pessimistic assumption that the first 1.5% were all false positives, at least 1% of the county’s population (the change from the first to the second test) had been infected by the time of the second study. If the county counts also rose during that period from 1000 to 1667 or so, then there’s some evidence that even the 1.5% from the first serology test were not merely false positives, because the change in the serology positive rate closely followed the change in the health-lab-identified cases.

        This is very crude, of course, but it seems to me that there’s SOME useful info in the studies, even if false positives void their initial conclusions, and even if further studies are required to jointly “tease out” the useful info from the first studies.

        • The best test out there is undoubtedly RNA-PCR. Korean data is 2% infection rate over 600k samples, and 2% mortality rate of positives. On the face of it, already multiple times worse than Influenza. It is widely believed that most deemed-Influenza deaths are actually Pneumonia, but the exact figures are highly uncertain. I have not heard of widespread cases of proven Pneumonia with Covid-19, rather that mortality was caused directly. Safe to say, Covid-19 is far more deadly than Influenza. The professor owes the public a retraction and an explanation. My reading of his personality / ego suggests he will not do so.

    • Curious:

      I took a look at the thread. Most of it was people shouting at each other or arguing about the scientific method, but I did notice this interesting comment:

      I think the criticism in this thread and elsewhere is a bit too harsh. It’s by no means a perfect study, nor the last word, but hopefully will motivate further studies.

      I volunteered on this study and talked with hundreds of the participants, at least 200 and possibly as many as 400. Two reported previous COVID symptoms, unprompted.

      The bigger problem was socioeconomic bias. Judging from number of Tesla’s, Audi’s, and Lamborghis, we also skewed affluent. Against the study instructions, several participants (driving the nicest cars I might add) registered both adults and tested two children. In general, these zip codes had a lower rate of infection. It’s very hard to understand which way this study is biased, and a recruiting strategy based on grocery stores might be more effective, but difficult to get zip code balance

      There has been additional validation since this preprint was posted and now there’s 118 known-negative samples that have been tested. Specificity remains at 100% for these samples. An updated version will be up soon on medrxiv.

      The remark about previous symptoms reminds me that the commenter said they asked people about symptoms, but these were not reported in the article. It’s good news that they will be releasing an updated report. I hope they include their data and code with this report, as that could allay some of the concerns that have gone around.

      • I hope we will know also what the source of the known negative samples were. If they’re not from the same population, and same time frame, they could be of limited utility to infer false positive rates in the target population. (If they’re just more “pre-COVID-19” samples, those samples may also be pre-WHATEVER-IT-IS-THE-CURRENT-TESTS-CROSS-REACT-WITH samples.)

      • As long as we’re making a wish-list of what we hope they’ll publish: I’m not sure what I’d need in order to be reassured they had defeated major effects of self-selection, but I know I’d need a lot. I guess I’d need assurance that their questionnaires had really probed respondents about (a) prior/current symptoms and (b) potential exposure risks (social contacts, occupation, etc.), *and* it would need to be very clear that, at the time they answered these questions, respondents understood that their answers would not disqualify them from getting the test (perhaps they did all this but felt that this kind of “research hygiene” was so obvious it didn’t need to be mentioned in the report?). Then, of course, I’d need to know how they coded these answers into risk factors (which I suppose falls under the heading of “include data and code”).

        Perhaps I worry about optics too much, but I’m bringing this up now because I don’t want to get into a situation where it looks like they’re complying with all our internet-nerd demands and we’re still not satisfied. Commenters have noted throughout this thread, and I noted at the outset, that specificity and self-selection risk are *each* potentially fatal. The false-positives issue is a bit easier to talk about empirically (at least in principle), but the self-selection issue is just as important.

  45. Andrew
    I just finished reading through this very apposite thread.
    Given that the errors and weaknesses exposed are really rather well known amongst professional statisticians, and given that there are going to be a plethora of seroprevalence studies in the near future, would it not be useful to provide a canonical framework for medical researchers?

    A collaborative effort by statisticians might yield better decisions.

  46. I came looking for this because I have the exact same concern. I think this number may well be a Bayesian artifact which makes us think that we are on the way to herd immunity and also that, if so many people have the antibodies with no symptoms, maybe we should just let nature get on with it.

  47. Hi Andrew,

    You have an error in your post about errors: “I’m serious about the apology. Everyone makes mistakes. I don’t think they authors need to apologize just because they screwed up.”

    No need to apologize, though, for wasting my time in pointing this out to you.

    Be well.
    – Brad

  48. Well, sports fans, you mostly have read the study, pondered the data, and are well-informed on lots of other corona data. Based on all you know, what do you reckon is the already-infected rate in Santa Clara County?

    And Dr. Gelman, please do not resent too much your struggles and time wasted on this study. Your discussion is very illuminating on so many levels; you taught us a lot, and I thank you muchly.

    I’m an economist, so sort of an interloper here, but respectful and curious.

  49. One thing to be mindful of is the exposure-to-mortality delay used in the estimates. Stanford used three weeks, which is quite long; old people die much faster, which could make the true average delay much shorter. Areas with slower growth curves (like the Bay Area) are less sensitive to changes in this delay estimate than areas with rapid growth (like New York).

    —–

    On a different topic….

    I think we can learn something by using elderly care homes as a sort of statistical “canary in the coal mine”.

    April 17th, 70 / 618 (594 of them in NYC) = 11.3% of NY state nursing homes had experienced at least one death.
    April 17th, 228 / 1224 = 21% of CA state nursing homes had experienced at least one death.

    Which means we have not yet seen a majority of care homes, even in the most heavily hit areas, experience COVID infection.

    At what prevalence (or active carrier count) do we think nearly every elderly care home in a region will experience COVID infection (evidenced by at least one death)?

    There is no way to know the precise numbers, because it would be related to number of staff members with spreading contact, prevention measures, etc. However, I do think there is some prevalence number beyond which none of this matters, and we’d expect them to be infected anyway. For example, certainly if 50% of the population is actively shedding it, we’re not keeping it out of most elderly care homes. How about 30%? 20%?

    At 30% of the population exposed on a 4-7-day doubling rate, and carriers being infectious for 7-14 days, we might expect 18-25% of the population to still be actively shedding it asymptomatically. In which case we'd really see most elderly care homes falling ill. Which suggests that, because we are not seeing most elderly care homes experience at least one death, we are not yet at 30% prevalence.
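
    To make that back-of-the-envelope step concrete, here is a minimal sketch (Python) under the assumption of pure exponential growth; the function and numbers are illustrative, not anything from the study:

      # Of everyone ever infected under exponential growth with doubling time d,
      # the share infected within the last tau days is 1 - 2**(-tau/d); those are
      # the ones still plausibly shedding.
      def still_shedding(exposed_fraction, doubling_days, infectious_days):
          recent = 1 - 2 ** (-infectious_days / doubling_days)
          return exposed_fraction * recent

      for d in (4, 7):
          for tau in (7, 14):
              print(d, tau, round(still_shedding(0.30, d, tau), 2))
      # With 30% exposed this gives roughly 0.15-0.27 of the population,
      # bracketing the 18-25% figure above.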

  50. As a Stanford student, I volunteered at the Seroprevalence study.

    I did not help "recruit" participants, but I directed participant intake, and essentially had power over which of the registered individuals were allowed to get tested. (Many, many, many more than 3,000 people received participant IDs, and we had to turn away many who didn't sign up "fast enough".)

    From my experience, the individuals being tested were incredibly motivated to receive the test, as only those that signed up almost instantly were able to get a test. Many “potential participants” expressed anger when they were denied a test.

    It was very clear to me that self-selection had occurred, and that those rushing to our tests were in fact those that felt very strongly that they may have already had the disease.

    Lastly, the sample was not representative of Santa Clara County. People who have time to sit on Facebook are not a representative sample of SCC; we mostly tested wealthy white women.

    • I suspected this was the case. Very, very poor sample design. You would have done much better driving into neighborhoods and knocking on random doors; people are home these days. Why didn't someone suggest a meta-analysis of existing data? That's been his bag for years. What's wrong with the Korean data? Very little as far as I can tell. 600k samples with RNA-PCR, 2% infection rate, and 2% of infected suffer mortality. Most people with Flu actually die from Pneumonia. So far worse than Flu.

      • “Most people with Flu actually die from Pneumonia.”

        This is the “die with” and “die of” thing. It does not account for the question of how many people “with flu” who die “of pneumonia” would have died if they did not catch the flu.

        Especially for old people, dying “of pneumonia” is the most common way for the flu to lead to death.

        • It's worse than that. Most people who die of the flu die of a bacterial superinfection at a time when the flu virus is no longer present in their system. The "died with" critics would not count these deaths as influenza deaths, and then they would have difficulty explaining the excess mortality during flu outbreaks.

          Physicians who file death certificates know what they are doing. When a patient died “with” Covid-19, the virus was contributing to the death. The excess mortality from Covid-19 is now clearly apparent on national levels in the mortality surveillance from the CDC and EuroMOMO, so can we please lay this “they would have died anyway” argument to rest that was really no longer viable after Wuhan and Lombardy?

          We all die anyway. We just hope it’s later rather than sooner. Creating a society that doesn’t care if it wipes out its 60+ population really does make you look forward to retirement, doesn’t it?

        • The excess mortality from Covid-19 is now clearly apparent on national levels in the mortality surveillance from the CDC and EuroMOMO, so can we please lay this “they would have died anyway” argument to rest that was really no longer viable after Wuhan and Lombardy?

          The CDC data is still showing a drop in all-cause mortality last I checked. Where did you see this?

          Also, the “they would have died anyway” argument can’t be put to rest until we see if there is a drop in all cause mortality in the coming months for the at risk groups. But obviously these lockdowns are going to cause their own issues that will affect that too.

        • Anoneuoid:

          Concurrent ‘all cause mortality’ will inevitably drop due to lower rates of accidents caused by the ‘stay at home’ orders and business closings. It is the wrong denominator. The average of the previous 5 years is a much better approach.

        • Anoneuoid:

          Concurrent ‘all cause mortality’ will inevitably drop due to lower rates of accidents caused by the ‘stay at home’ orders and business closings. It is the wrong denominator. The average of the previous 5 years is a much better approach.

          That does look like what is going on in the US, though not in Europe: https://www.euromomo.eu/

          But it seems like a lot of people are spending their money/time on booze, pot, and gambling: https://www.msn.com/en-us/news/us/its-like-new-years-every-day-as-lockdowns-drive-increase-in-booze-and-pot-sales/ar-BB12SEQr

          It will be really interesting to see everything broken down by cause of death at the end of this.

        • 1. I don’t see how you interpret that data as declining overall mortality for the U.S.. The lowest points at best show equivalence and the peaks clearly exceed the epidemic threshold: https://www.cdc.gov/coronavirus/2019-ncov/covid-data/covidview/04172020/nchs-mortality-report.html

          2. I don’t have any idea what point you are trying to make about alcohol, pot, and gambling.

          3. Death counts are very likely under-counted and almost never over-counted. If someone dies due to contracting covid-19 and is not tested, the cause of death will almost inevitably be whatever the comorbid diagnosis was.

        • 1. I don’t see how you interpret that data as declining overall mortality for the U.S.. The lowest points at best show equivalence and the peaks clearly exceed the epidemic threshold: https://www.cdc.gov/coronavirus/2019-ncov/covid-data/covidview/04172020/nchs-mortality-report.html

          Week 14 2020 was 49,770 deaths. That is exceptionally low. Here is similar data that goes back to 2013 (you have to hit download data): https://gis.cdc.gov/grasp/fluview/mortality.html

          You can see it plotted on page 2 here: https://www.docdroid.net/38mNTTt/covidstates-pdf

          2. I don’t have any idea what point you are trying to make about alcohol, pot, and gambling.

          People are not doing the most healthy activities during lockdown.

          3. Death counts are very likely under-counted and almost never over-counted. If someone dies due to contracting covid-19 and is not tested, the cause of death will almost inevitably be whatever the comorbid diagnosis was.

          On the other hand Dr. Birx said they wanted all deaths in a person who tested positive to be attributed to the virus. So there are going to be lots of false positives and negatives in that data.

        • Anoneuoid:

          Yes, I see what you are saying, that number is low. It is in the 5th percentile for weeks between 2015 and 2020. What would you attribute it to, if not reduced accidents?

          name:: TOTAL.DEATHS
          n :: 236
          avg :: 53694.57
          sd :: 3708.93

          min :: 27688
          p01 :: 49316.5
          p05 :: 49707.75
          p10 :: 50107
          p25 :: 51101.75
          p50 :: 53138.5
          p75 :: 55969.5
          p90 :: 58090.5
          p95 :: 59186.5
          p99 :: 63993.55
          max :: 67495

        • > If someone dies due to contracting covid-19 and is not tested, the cause of death will almost inevitably be whatever the comorbid diagnosis was.

          It ain't necessarily so. In NYC, in the last 6 weeks 25'000 deaths have been reported: 10'000 confirmed (lab result), 5'000 probable (listed on the death certificate as a cause of death), and 10'000 not known to be either confirmed or probable. They may still be missing some, because we would expect fewer than 6'500 deaths over the period, but at least most of the excess deaths have been classified as probable COVID-19. If there is the will, a better estimation than lab-confirmed deaths is possible. But usually it requires public outrage, because it's obvious that thousands of deaths are being ignored.

        • Carlos:

          I agree. Especially in a country the size of the U.S., regional and local numbers are going to tell the real story at any given point in time. What is unique about NY is the time and the speed of the spread. This may be creating a false sense of security in other areas where it simply hasn’t spread as much, *yet*. I suspect we are about to see that change in the near future.

        • I agree that, using national-level figures, all-cause mortality has not increased – and I think Mendel is wrong in that we shouldn't really expect it to, given most of the US is still fairly unaffected by the virus. But NY does show a rise in all-cause mortality – ~3200 in week 14 2020 vs ~2000 in week 14 2019. I think it's this contrasting view of declining all-cause mortality in the rest of the country and increasing mortality in heavily hit areas that gives us a fairly good idea that (a) no, the victims would not have died anyway, and (b) absent other factors, lockdowns reduce mortality.

          It might be possible to fit a model by taking the difference between unaffected states and affected states to estimate the impact of the virus separate to lockdown effects.

    • Thanks for confirming what is pretty obvious from the test setup. Given the rationing of tests, there is an effective “payment” to participants in this study. While ostensibly “free”, participants perceive a high value from it. So if one is concerned about paying people to do surveys (which is almost always the #1 thing brought up by students when asked to comment on survey bias), then one should be equally concerned about this aspect of selection bias.
      The other case where they sourced names from a market research firm is not much better. Not knowing how that firm compiled their list is not a reason for assuming no selection bias. There is also non-response bias. The fact that they had to do demographic adjustments is evidence that the sampling does not reflect what was expected.

  51. Hi all. Apologies if this has already been said above. I'll leave the Stanford study to those who understand it better, but the actual number of deaths in NYC (where I live and have remained throughout the pandemic) is closer to 20,000, including deaths at home and unexplained mortality above baseline. Given that the City's ~1/2 million richest people all seem to be in the Hamptons, the Berkshires, etc. right now, I think it's fair to round down and say the mortality figure is out of a population of ~8 million. That's a crude fatality rate of 2.5 per 1,000, or 0.25%. Best estimates I've seen (based on the percentage of pregnant women testing positive, among other things) put seroprevalence at ~20-30% across the five boroughs, though I think no one has any real idea at the moment and anything between 10 and 50% wouldn't surprise me. A 10% prevalence would give a 2.5% IFR, which seems awfully high, while 50% prevalence would give a 0.5% IFR and 25% prevalence would give a 1% IFR, both of which are in the ballpark of the most substantiated estimates I've seen so far.
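
    For readers who want to check the arithmetic, a minimal sketch (Python) using the rough figures quoted above:

      deaths = 20_000           # approximate NYC deaths including at-home/excess
      population = 8_000_000    # rounded-down NYC population
      for prevalence in (0.10, 0.25, 0.50):
          infected = prevalence * population
          print(f"prevalence {prevalence:.0%}: IFR ~ {deaths / infected:.2%}")
      # prevalence 10%: IFR ~ 2.50%
      # prevalence 25%: IFR ~ 1.00%
      # prevalence 50%: IFR ~ 0.50%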

  52. Since cross-reactivity appears to be a "thing" in the context of LFIAs, and the tests were primarily administered to the "I know I had it in January" set, couldn't one presume that there is a significantly higher likelihood of FPs due to non-specific coronavirus cross-reactivity in this survey?

  53. Additional data:
    https://mms.mckesson.com/product/1163497/Premier-Biotech-RT-CV19-20 is the distributor site for the test kit, apparently.
    The package insert is under "more information", and they also did an FDA filing where they tested the kit on 150 PCR-negative patients with symptoms or risk of exposure; these are exactly the people we'd expect to self-select for this study.
    IgM was false positive 4 out of 150, IgG had 1 out of 150 and it overlapped with a IgM false positive.

    It comes down to whether they accepted "either" as positive (i.e. just one band), as the study states, the test instructions state, and LA County states, or whether they used "both" as the criterion, which their math and the omission of the IgM specificity from the paper suggest.

    A reddit user has shared an email advertising this study that was sent “to all of our children’s school lists”. Given the known role of children as transmitters in influenza outbreaks, this seems like another unfortunate dimension in which this study was not representative. The email read:

    “Free COVID-19 Testing — needed to calculate prevalence of disease in Santa Clara County

    What? Get tested. FDA approved antibody test for Coronavirus. To test the prevalence of disease in our county, we need 2500 residents. The antibody test differs from the swab test that measures infectious patients actively carrying the COVID-19 RNA in their nasal passages. The serum antibody test determines whether your immune system has fought off the virus and created antibodies to protect you from future exposure. Since the first cases appeared in November/December in Wuhan, China until the U.S. banned flights from China, an estimated 15,000 residents of Wuhan would have traveled to the U.S. so the disease likely entered the California in December (based on 2019 travel data).

    Who? HEALTHY volunteer who lives in Santa Clara County. ONE PERSON per household. 2500 people to be tested.

    When? APRIL 3rd and 4th. THIS WEEK. 7AM-5PM

    How? https://redcap.stanford.edu/surveys/index.php?s=FPM88N48HE&fbclid=IwAR1RDOJnykJneymdPWOEsCikbQ3nLuUd7Gpw5VZuRZwt

    Why get tested? (1) Knowledge – Peace of mind. You will know if you are immune. If you have antibodies against the virus, you are FREE from the danger of a) getting sick or b) spreading the virus. In China and U.K. they are asking for proof of immunity before returning to work. If you know any small business owners or employees that have been laid off, let them know — they no longer need to quarantine and can return to work without fear. If you don’t want to know the results, we don’t need to send you the results. (2) Research – Contribute to knowledge of the prevalence of virus spread in Santa Clara County. This allows researchers to plan hospital bed needs and forecasting where to allocate public health resources. This will help your neighbors and family members.

    More information here: https://www.wsj.com/articles/is-the-coronavirus-as-deadly-as-they-say-11585088464

    Source: https://old.reddit.com/r/CoronavirusUS/comments/ftxwl7/treatment_news_5_critically_ill_covid_patients_on/fma36su/

  54. Here's some simple math: as of 04/16/2020 there were 657,720 cases in the US. Since we are at the peak of the outbreak, by the end of the outbreak we will have ca. 1,315,440 cases (not a bell-shaped curve, but for ease of calculation). Also, let's say only 20% of cases become symptomatic (which is a gross underestimation; according to the MMWR report 87% became symptomatic); then, at most, we'll have a total of 6,577,200 infected cases in the entire US by the end of the outbreak. With a 2019 US population of 328.2 million, this means a sero-prevalence of 2% (roughly by mid-summer; very generously calculated). Now an ideal IgG kit (for sero-surveys) with a sensitivity of 100% and specificity of 95%, used in a context of a pre-test probability of 2%, would give us a positive predictive value (PPV) of 29% (best-case scenario). Of course, this is for the whole country, with New York State immensely skewing the calculations.
    Now the sero-prevalence for a state like Ohio by the end of the outbreak can be estimated as 0.7% (based on the state government's published data as of 04/16/2020). Therefore, the PPV will be 12%. Again, all of this calculated very generously.
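
    A minimal sketch of the PPV arithmetic above (Python), assuming 100% sensitivity and 95% specificity as stated; the function name is just illustrative:

      def ppv(prevalence, sensitivity=1.0, specificity=0.95):
          # Bayes' rule: true positives over all positives
          true_pos = sensitivity * prevalence
          false_pos = (1 - specificity) * (1 - prevalence)
          return true_pos / (true_pos + false_pos)

      print(round(ppv(0.02), 2))    # ~0.29 for the ~2% nationwide scenario
      print(round(ppv(0.007), 2))   # ~0.12 for the ~0.7% Ohio scenario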

    To mention a recent example from the media: https://www.latimes.com/california/story/2020-04-17/coronavirus-antibodies-study-santa-clara-county, the study suggested a seroprevalence of 2.5-4.2% in Santa Clara County. According to the above calculations and based on the same premise for test performance (sensitivity: 100%, specificity: 95%), the PPV for their study would be 33% to 46%, which translates to a large false-alarm rate. Another way to look at this is to compare what they claim with the peer-reviewed literature. Based on this serosurvey, the Stanford researchers estimated the actual number of cases to be 50-85 times higher than what the county has announced. This roughly means only 1.17% to 2% of infected individuals become symptomatic.
    On a related note, according to a CDC study from California published in MMWR, 87% of infected individuals became symptomatic: https://www.cdc.gov/mmwr/volumes/69/wr/mm6913e1.htm . This finding (87%) is in sharp contrast with the results of the Stanford sero-survey estimates.
    It should also be mentioned that according to an earlier study, also from California, circa 5% of individuals with flu-like illness tested positive for COVID-19 by RNA testing: https://jamanetwork.com/journals/jama/fullarticle/2764137

  55. I write for several reasons; this blog and its participants appear to possess the statistical and medical chops to handle any outcome and any underlying findings. Here are my issues, trying to be as non-redundant as possible with the posts above.

    1. Specificity: This really counts here. One comment near the beginning posits that it could be 98.5%, but the Stanford researchers offer no proof that their test kits reach a specificity level capable of supporting their infection-rate findings. The widely available commercial kits I can find from a major distributor claim specificities between 90% and 95%. Until there is better evidence, I tend to believe that the specificity cannot be assumed to be higher than 95%, and it may be less for this novel virus in the presence of other antibodies in the same sample.

    2. Impact of low specificity on findings: Before accounting for all statistical effects, a 95% specificity permits a 5% false-positive rate. Trying to "see" below that rate in the antibody-test observables is statistically unreliable, and certainly too unreliable to support the sweeping conclusion that the observed positive rate reliably represents the level of infection and a fatality rate on par with seasonal flu, as they claim.

    3. Viable sample and productive antibody levels: If COV-2 follows other viruses, it minimally requires 2 weeks of incubation prior to testing. It's possible that test performance declines for samples drawn before antibodies have had time to mature beyond those 2 weeks. The first possible appearance in Santa Clara County was likely near the end of January. That means only 8-10 weeks of transmission could have been detected, possibly far less. While it is theoretically possible to produce 80,000 infections from just one initial carrier in that time period, the odds are against it. This raises a flag in addition to the other factors and itself demands further scrutiny of the claims made by Stanford's researchers.

    4. No markers coincident with the research outcome to validate findings or policy conclusions. A step-back question: isn't it remarkable that 80,000 infections had no traceable impact, no doctor visits, no OTC medication, no symptoms "felt" by other family members or close partners, etc.? Would this not raise flags that their collected sample figures, extrapolated to a projected infected population of 80,000, might be spurious? Would one not check whether there are other community-level markers to confirm, or at least intelligently support, such sweeping conclusions that such a large population, even if mildly symptomatic, was also so medically silent? Wouldn't you survey further, calling back the positive responses to see if they had acted in some way on their symptoms that could be checked and verified first?

    5. Self-selection. As others have commented, this field survey is prone to promote recruitment of non-invited participants by word of mouth. So someone who has a predisposition to test might tell family and friends to join in the finger prick; there seems to be no control on the entry of participants. Recall that there was little access to testing, and a large family might take advantage of such a free test. However, this would heavily enrich the positives in a way that would not occur in a properly recruited survey sample.

    6. Other late-breaking evidence appears to contaminate the Santa Clara survey sample: a report today confirms a particularly early COVID death in Santa Clara from January, confirmed by autopsy. For reasons peculiar to its Chinese-American communities and holiday schedule (passengers back from the Wuhan area in time for school, for instance), Santa Clara County may be an outlier with respect to extrapolating the Santa Clara community "results" to the entire state or US.

    Finally, an ethical point. This pre-printed study, released last Friday, demonstrates that its conclusions should not have been circulated until the findings themselves were vetted through rigorous peer review. Instead, within 48 hours the study had reached national press attention.

    It would be fine to have said something like “we find the incidence of infection much greater than previously reported but seek insights first from the scientific community before drawing conclusions based on the results obtained here. Therefore we are circulating these findings through this pre-print to members of the medical and scientific community to assess its significance and accuracy in accordance with our institution’s policy”. That at least honors the process as one seeking intellectual honesty and accepts responsibility should its publication ultimately require retraction in some part.

    Instead, these researchers liberally asserted their conclusions without appropriate community consultation and publicly pushed the pre-print as if its findings and conclusions were satisfactorily proven. The researchers have previously been reported as believing the virus is no more fatal than seasonal flu, but there was no disclosure statement in the pre-print qualifying the positions taken by the researchers or showing that their policy or political views did not inform the findings. There appears to have been no safeguard to ensure the veracity of the findings through a careful prior review of methods and practices. It is not clear the researchers had even secured internal peer review through the Medical Center or Stanford University itself before putting the work into the public, without disclaiming that it represents Stanford or its Med Center. Academic freedom is one thing, but employing the institutional name as implying its authority, without first showing the institution had reviewed the finding, is not fair to those relying on Stanford's top-drawer reputation for such authority. Stanford hopefully intervenes.

    • Rich:

      Given that they only observed 1.5% positive tests in their sample, it seems unlikely that the test could have a specificity as low as 95%. Another challenge here is that specificity doesn’t have to be a constant.
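
      A minimal sketch of that arithmetic (Python); the counts are the study's, the specificity values are hypothetical:

        n_tested, observed_positives = 3330, 50
        for specificity in (0.95, 0.985):
            expected_false_pos = n_tested * (1 - specificity)
            print(specificity, round(expected_false_pos))
        # At 95% specificity you'd expect ~166 false positives even with zero true
        # prevalence, far more than the 50 positives observed; the observed 1.5%
        # positive rate is only consistent with a specificity of roughly 98.5%+.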

  56. Andrew,
    I realize that they claimed a high specificity, 2 false positives in 371, and if I recall they separately claimed 100% through a test whose details are unclear. Basically, did they apply all the conditions in their lab that were present in the 3,330 diverse field samples? Unless one calibrates the validation samples with all the factors from the large field sample, I have to assume the true false-positive rate leans toward the higher rates found for the commercial test kits. We need more data and calibration-method details.

    I agree with you that the specificity might not be fixed, nor a rigid all-condition constant value. It may depend on conditions that they could not or did not control. The mechanisms that create false readings seem to require deep involvement from the test labs, who design the kits and have to prove their claims to an independent FDA or similar body. The commercial kit makers probably know that these tests cannot reach more than 95% across a range of test or sample variation conditions.

    • I appreciate your analysis. Extrapolating the number of deaths in NYC by the 1/600 rate derived from the Stanford study, you came up with 5,400,000 infected in NYC. At this point I believe your statements are missing the point of the study. You state "OK, I don't think 5.4 million New Yorkers have been exposed to coronavirus. New York only has 8.4 million people total! I don't think I know anyone who's had coronavirus. Sure, you can have it and not have any symptoms—but if it's as contagious as all that, then if I had it, I guess all my family would get it too, and then I'd guess that somebody would show some symptoms."

      I believe the point of the anti-body testing is to see how many have been exposed over time, in the past before we were even aware of this current infection. Therefore, your community would have been sick with/without symptoms and explained it away as a cold or not even noticed. So depending on the time scale it could very well be possible that 5.4 million NYC residents have been infected.

      • I am NOT saying that 5.4 million (or any other specific number) of people in NYC have been infected. There is still only “suggestive” data to make any sort of guess about that number.

        But I agree with Jeff that there’s a problem when so many people (including the rather stat-geeky types who frequent this blog) are quick to reject out of hand any possibility that half or more of the population of NYC have a current or past COVID-19 infection.

        Just in case we have forgotten…

        NOBODY KNOWS THE PROPORTION OF INFECTED PEOPLE WHO ARE ASYMPTOMATIC

        When we read a number like 20% or 50% or 80% or any other specific rate of asymptomatic infections, that is someone’s speculation. Not a fact. So we can’t use the unknown asymptomatic infection rate to prove or disprove some claim about population prevalence. We can’t disprove something using an unknown.

        • I think we know some things. We know that everyone was tested in an Italian town, and among the positive results most had symptoms. Do you know anything that suggests that 80%-90% of the infections are asymptomatic?

        • I don’t bother to keep track of them but there have been statements by all sorts of people giving rates all over the place. Some of them seem harder to believe than others. For my part, I have yet to see anything to convince me it’s 90% or 10% or some other extreme value. But who knows, maybe I’m wrong about that too.

          There’s just too many things about this whole coronavirus subject that seem perfectly reasonable, well supported and believable then turn out to be all or mostly bullshit upon further review or examination. So I don’t believe we know anything about asymptomatic proportions right now, beyond the obvious that the rate isn’t nearly zero or nearly 100%.

          We don’t even know if it’s constant across demographic groups, geographically, over time, whether it depends on the strain of the virus, whether it depends on viral load. We know virtually nothing that could be counted on as a criterion for confirming or disproving some other number by implication.

          Honestly, I’m just about to the point of saying wake me up in a couple years when there’s been time for actual science to happen.

        • Assume that infection confers immunity. Then a 50% or 80% infection rate slows the spread down considerably without any mitigation measures being applied. We should see that in the data, and if we could, we wouldn’t need serological assays. Different Countries plateau at different levels of per-capita rates of confirmed cases and deaths, which suggests that a) mitigation measures are effective, b) herd immunity is not yet in sight.
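
          A toy sketch of that logic (Python); R0 = 3 is an assumed illustrative value, not an estimate from any of these studies:

            R0 = 3.0
            for immune_fraction in (0.0, 0.5, 0.8):
                # with immunity, the effective reproduction number is roughly R0*(1-p)
                print(immune_fraction, R0 * (1 - immune_fraction))
            # 0.0 -> 3.0, 0.5 -> 1.5, 0.8 -> 0.6; at 0.6 the epidemic would already be
            # shrinking on its own, which is not what the case curves show.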

        • In addition, containment via contact tracing would not work if we had a large number of unrecognizable asymptomatic cases; but it demonstrably does, see the super spreader in South Korea.
          The first German cluster in Munich was contained by contact tracing, and there was a virus mutation in one of the patients that was passed on to those they infected, and that genetic line has not re-appeared outside of that cluster, says Prof. Drosten. Containment works, and it can't work if the proportion of asymptomatic infections is substantial.

        • As mentioned above – do we even know if immunity is in some way related to the level of exposure? Is it possible that immunity is not black and white, as revealed by the presence of antibodies?

          Assuming a perfect test, and assuming that immunity exists, and that it develops relatively quickly after infection, and that it lasts reasonably long, isn’t it possible that one person will have antibodies and be immune and another test positive and not have immunity?

          So then do we need tests that reveal more details about the antibodies present?

          All these arguments about what will or won’t happen with “herd immunity” seem waaaaay premature to me.

        • Yes to all of that. Virologists do neutralization tests. They’re a step up from what they usually do.
          Level 0: swab a surface, sequence the RNA, say “there is virus here”.
          Level 1: take a stool sample, isolate the virus, try to infect lung tissue in the lab (level 4 biosecurity), find it’s never infectious.
          Level 2: take a virus sample, mix it with putative antibodies from A’s blood, then try to infect lung tissue. If that fails, A is immune.
          Level 3: Wait a while, and A could have no detectable antibodies any more, but still have an immune response that works. Here’s where you break out the monkeys and the longitudinal cohort studies.

          We might not get herd immunity, we may have transient immunity and if we fail to eradicate the virus in 2 years, it might be endemic, which is why we need a vaccine. The point is, we’re not anywhere near herd immunity because there is no place on Earth right now where the virus stopped spreading by itself (as far as we know).

        • Frankly I don't understand why they didn't do this antibody test study in New York. Higher numbers make specificity much less of an issue, and lots of deaths reduce uncertainty about mortality rate estimates. And if we really knew that upwards of 50% in NY are immune, that would actually be actionable.

        • I’m with you. But then again since early March I’ve said “Frankly I don’t understand why they didn’t….” so many times I’ve lost count.

        • It just occurred to me that this is an "effect" kind of opposite to the "file drawer effect". Carefully doing real science takes months. The whole pandemic (outside of China at least) hasn't been going on long enough to do even one careful study from data collection through peer review to publication.

        • “One serosurvey is already underway in six metropolitan areas, including New York City, the hardest hit city in the United States.” Science, April 7

          “Gov. Andrew Cuomo said Monday that state Department of Health officials plan to randomly select 3,000 people for tests that will look for indications that their bodies have fought off the virus, even if they were never tested or showed any symptoms.” npr, April 20

          It takes minimal research to discover this.

        • An estimate of 1’750’000 people infected in NYC and 15’000 deaths gives an IFR (with the usual caveats) of 0.85%

  57. An interesting article just out last night on the major commercial labs' progress on COV-2 antibody tests: an improving and promising picture, but still not ready for commercial release.

    It specifically cites a series of problems with COV-2 antibody test kits currently in the field, and cites the lack of the specificity performance that is required.

    Link:
    https://www.reuters.com/article/us-roche-results/a-disaster-roche-ceos-verdict-on-some-covid-19-antibody-tests-idUSKCN2240JS

    An erroneous false-positive result could lead to the mistaken conclusion that someone has immunity. In developing its test, Schwan said, Roche scrutinised some products out today with rather questionable performance.

    Quote:
    “It’s a disaster. These tests are not worth anything, or have very little use,” Schwan told reporters on a conference call. “Some of these companies, I tell you, this is ethically very questionable to get out with this stuff.”

    Schwan said there were about 100 such tests on offer, including finger-prick assays that offer a quick result. The Basel-based company declined to specify which rival tests it had studied, but said it was not referring to tests from established testing companies.”

  58. Thanks for this analysis, with some of the claims being made based on that manuscript over the last few days, I thought people had gone crazy.

    From a meta-perspective, this is what a world without pre-publication peer review looks like. A few days of nonsense claims spread by people who don’t understand what they are reading, followed by robust post-publication peer review noting the errors.

  59. It appears to me that the Stanford study used the wrong specificity on the IgG part of the test. They state, "Similarly, our estimates of specificity are 99.5% (95 CI 98.1-99.9%) and 100% (95 CI 90.5-100%)." I believe they mean that the specificity of the IgM test is 99.5% and the specificity of the IgG test is 100%. However, the manufacturer indicates that the IgG test gave 3 false positives in 371 known-negative samples: http://en.biotests.com.cn/newsitem/278470281 So I believe their methodology should use either 99.2% for IgG or (75+368)/(75+371) = 99.3% for the IgG specificity.

    I’ll leave it to the statisticians to calculate the overall specificity and confidence intervals when either a positive IgG or IgM is counted as a positive. I’m confident it is below the 99.5% used in the report.
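
    One way to bound the either-band-positive specificity from the manufacturer's counts, as a hedged sketch (Python); the independence case is my assumption, not the paper's:

      n = 371
      fp_a, fp_b = 2, 3                         # false positives on the two bands
      best = 1 - max(fp_a, fp_b) / n            # all false positives overlap
      worst = 1 - (fp_a + fp_b) / n             # no overlap at all
      indep = (1 - fp_a / n) * (1 - fp_b / n)   # if the two bands erred independently
      print(round(best, 4), round(indep, 4), round(worst, 4))
      # 0.9919, 0.9866, 0.9865 -- all below the 99.5% used in the report.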

    • “Similarly” refers back to “based on the manufacturer’s and locally tested data, respectively”: 99.5% is manufacturer data, 100% is local data.

      I have been wondering if they got their local data in a double-blind fashion: with these “pregnancy test”-type lateral flow assays, it seems easy to dismiss a bar as too weakly colored if you know the sample is supposed to be negative.

      The mathematics in the Stanford study are appropriate if they required both IgG and IgM to test positive: then the combined specificity can’t be lower than the greater of both, but the sensitivity of the test is lowered. Unfortunately, requiring both is neither what the manufacturer instructions say, nor what I interpret the paper to say about their method.

      Btw, test information directly from the distributor is available at https://mms.mckesson.com/product/1163497/Premier-Biotech-RT-CV19-20 (the download links are at the bottom on the right).

      • Thanks for the link. Am I reading the instructions for the use of the test kits correctly:
        “5. Testing Method:
        For the confirmed COVID-19 infected patients or COVID-19 infection rule-out patients, evaluation on clinical
        sensitivity and specificity of in-vitro diagnostic kit should be carried out comparing to clinical diagnosis.”
        Does that mean that the tests were not blinded – that the test results were read knowing whether or not the subject was actually COVID-positive?

        • I have no idea how the Chinese did it, and don’t think I will find out.
          I hope that Stanford will tell us how they do it; allegedly, they have tested over 100 samples now, with perfect 100% specificity. It's not an implausible result if you require both IgG and IgM to be positive.
          One kit that actually does have FDA emergency use authorization uses an electronic optical reader with clearly defined thresholds to determine the test result, which eliminates any bias.

  60. Jesus Christ, you guys are scientists?!

    Care to share any evidence that suggests this virus is new to humans?

    Maybe the study was accurate?!
    This virus is clearly not new to humans, simply newly identified, and likely spread via the Military World Games that were staged in Wuhan as the "pandemic" began.
    Do they pay you to think or pay you to keep quiet?

    • Care to share any evidence that suggests this virus is new to humans?

      Doctors are saying this is a novel illness with bizarre characteristics that look similar to high-altitude sickness ("happy hypoxic" patients, etc.):

      https://www.medscape.com/viewarticle/928156
      https://twitter.com/AnesDecon/status/1253086770356355072

      Then again, it took quite a while for this similarity to be noticed; according to Kyle-Sidell, he only noticed it because he was in a unique position to see patients in all stages of the illness. So if there wasn't a hysteria, would people have just continued following the standard protocol? Maybe this isn't the first time this has happened. It is possible.

      The tests for the virus/antibody are another matter. I think that has been sloppy this whole time, and the way these tests appeared around the world without any public discussion of how well they worked makes me really suspicious. And I keep hearing about doctors ignoring the tests because the results don't match the symptoms.

      • Have you looked for the “public discussion of how well they work”, or are you considering your personal lack of knowledge as proof of nonexistence?

        • Yes, I have been following this quite closely from the beginning. There were later a few papers on how unreliable the tests are, and anecdotes from doctors saying so too. I can find them if you want.

        • If you could find a paper, that'd be great; I failed to find one.
          Current WHO recommendation is to collect two different specimens for a PCR test.

        • Thank you!

          a) These tests have a chance of producing false negatives for various known reasons. This has been known and communicated for as long as I can remember. I don’t think the lack of discussion on this is surprising, because that fact is not in dispute.
          b) It’s impossible to have developed a better test that quickly.
          c) Having the test is better than not having it.
          d) The Korean paper suggests that a co-infection could cause the test to fail; I surmise this could be because the foreign virus drowns out Sars-cov-2 in the replication stage? That is a point worth noting, but it's an inherent problem of the PCR method, which means it might be hard to fix.
          e) clinical diagnostics, esp. lung CT, are good, but more expensive and time-consuming, and inconclusive for patients who have not yet developed pneumonia. With the PCR test, you can diagnose patients that do not yet suffer from pneumonia, and you don’t need to subject them to a CT.
          f) There are epidemiological consequences of a possible 20% false negative rate when it comes to isolating infected cases. General advice (CDC, RKI) is for people to behave as if they were positive even if they aren’t if they had a high-risk exposure and have symptoms.

          In the light of this, my question to you is: what good would a public discussion do? What positive outcome could it effect? What would change?

    • Joe:

      I don’t really know how the New York State study was done; see last night’s post. In any case, all studies are flawed in some way or another. The challenge is to learn what we can from them. The Santa Clara study seems to give evidence that the rate of infection in that county was less than 4%, so that’s something. More will be learned in future studies, and I hope that these and other researchers learn from the mistakes of the studies that have been done so far.

      • I think that if we can be mindful of the interests of various stakeholders, then we would be in a much better situation to address COVID19 response options. I speculate that some parts of the answers are in understanding the transmission course of those that have no or mild symptoms. I agree with Dr. Birx on this. She sprinkles some quite nuanced insights that are not picked up by most lay audiences to the White House briefing, judging from Twitter responses; Twitter being the most robust social media platform.

      • Why doesn't Stanford just come out and say he is wrong (or a crank)? At my institution we have several internal reviews before a paper can be made public, in order to protect the reputation of the institution. Apparently, Stanford is backwards in this regard.

        • Not necessarily! It was a preprint. Many an institution has adopted a 'cover our asses' policy only to find that it too has been on the chopping block.

          You should thank John Ioannidis for giving you an opportunity to nitpick the study; some have been waiting for this seminal moment.

        • Again, our institution has internal reviews BEFORE anything can be submitted anywhere. Preprints come AFTER submission. There is no covering-asses concept. It is called the scientific method. While I am doing research I discuss it with coworkers not involved in the work. While I am writing the paper I discuss it with coworkers not involved in the work. It is due diligence, checking and double-checking the work. Again, that is part of the scientific method. If you don't do it, you are a fake scientist.

        • That's your interpretation of the internal review process. Obviously you are not going to characterize the process as shoddy or sub-optimal. Nor do you know whether or not the Santa Clara authors also discussed it with others.

          The study does not appear to have been issued under the auspices of Stanford. It doesn't list who the individual donors are.

          My point really was that an internal review does not necessarily translate into the scientific method. So much of the data is proprietary: not shared to begin with.

        • The preprint says the work was funded by the Laura and John Arnold Foundation. And I know for a fact that Stanford medical school is terribly upset about his methods.

        • To be fair, in my experience external “peer review” for publication often does absolutely nothing that I would characterize as “the scientific method”.

        • Ron, where do you see that in the pre-print?
          I’m still waiting for the corrected version that Bendavid said would be forthcoming.

        • Ron,

          To suggest that Stanford Medical School is unhappy with John Ioannidis’ methods seems to me like you are conveying the sentiment on Stanford Medical School’s behalf. Not likely. Did you speak to its leadership? A specific medical school academic? Did the school authorize you to convey it?

          Diversity of perspectives can be a strength or weakness of an institution. And heck the National Academy of Sciences had gone at it for years in debating what the Scientific Method was. Read Challenges by Serge Lang, which is relevant to this day as a critique of the statistics used in the social sciences, in particular political science. Lang led a vigorous campaign against Samuel Huntington; thus preventing Huntington’s membership to the academy. We are probably seeing a wave of the same level of skepticism of much research.

          I didn't see an individual donor listed on the preprint that is under review here. But thanks for clarifying who did fund it. The foundation was perhaps instrumental in contributing to the establishment of the METRICS center at Stanford, of which John Ioannidis and Steven Goodman are co-directors.

          In reading the articles generated by the statistics community, strong disagreements over research methods have constituted the mainstay. The question is why some get a pass and others don't. I attribute this in part to the publication biases which prevail. Neither the Imperial College nor the IHME statistics got anywhere near the vituperation that the Santa Clara study got. That is not only my view but the view of some of my friends, who are watching with bewilderment the conflicting perspectives and the feuding among Twitter circles. We are not professional statisticians. We are making efforts to understand these studies as consumers and patients.

          I have no qualms if you state your specific objection to the study. Many excellent points have been raised.

  61. I believe the antibody test from Premier Biotech was actually purchased from the Chinese company Hangzhou Biotest Biotech and was subsequently banned from export. It is not approved by Chinese or US regulatory agencies.

    See: https://www.nbcnews.com/health/health-news/unapproved-chinese-coronavirus-antibody-tests-being-used-least-2-states-n1185131
    and https://premierbiotech.com/innovation/press/

    Now I'm not trying to call it into question just because it is Chinese, but if this story is accurate and it was the test used in the Santa Clara study (or other studies), then maybe the antibody test should be checked carefully for reliability.

  62. As a layman, I think the authors are trying to find the base of the pyramid, like the one for Flu at https://www.cdc.gov/flu/about/burden/index.html.

    They themselves are doing more studies nationwide.

    Criticism should be directed at what the critic thinks the base is, rather than just finding holes in the process of the present study, which the authors themselves admit is small.

    I believe there is a credible hypothesis and theory there, unless we were to assume that a) there are no asymptomatic or unreported cases and b) Covid-19 is not that infectious.

  63. Wouldn’t you validate the test anyway? Why trust the manufacturer?
    This is why validations should be done by a reputable outside lab. For any test, no matter where it originates.
    You might even need to do random sampling to assess production quality. The big assay kits that are run in machines 96 samples at a time come with controls that are run with every batch for quality control.

    Stanford is rumored to have validated the test on 88 more samples, total 118, with no false positives, pushing the specificity to 100% (95% CI 97.5%-100%).

    Apart from that, we have the package insert asserting 369/371 for IgG and 368/371 for IgM. The study isn't clear whether a test counts as positive if EITHER IgG or IgM is positive, or if BOTH are required. If it's BOTH, total specificity is >= 369/371 (95% CI 98.1%-99.9%).

    The company’s CDC filing reports the test being run on samples of 150 people with symptoms of a respiratory tract infection who tested negative in the PCR, with 146/150 for IgM and 149/150 for IgG, evaluated as EITHER to 146/150. This means specificity for BOTH is 149/150 (95% CI 96.3%-99.98%) under these circumstances.

    All data in a bucket is 636/639 = 99.5% (95% CI 98.6%-99.9%) specificity.
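
    For anyone who wants to reproduce these intervals, a minimal sketch using exact (Clopper-Pearson) binomial intervals via scipy; the counts are the ones quoted above:

      from scipy.stats import beta

      def clopper_pearson(k, n, alpha=0.05):
          # exact binomial CI for k "successes" (here: true negatives) out of n
          lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
          hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
          return lo, hi

      for k, n in [(369, 371), (149, 150), (636, 639)]:
          lo, hi = clopper_pearson(k, n)
          print(f"{k}/{n}: {lo:.4f}-{hi:.4f}")
      # These come out close to the (98.1%, 99.9%), (96.3%, 99.98%), and
      # (98.6%, 99.9%) intervals quoted above.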

    • The appendix (https://www.medrxiv.org/content/medrxiv/suppl/2020/04/17/2020.04.14.20062463.DC1/2020.04.14.20062463-1.pdf) mentions they consider either IgM or IgG a positive test:

      > Note: we consider TEST+ as any band on the test kit indicating the presence of IgG or IgM antibodies or both.

      Additionally, the Covid-19 Testing Project (https://covidtestingproject.org) evaluated this test and found 3 false positives out of 108 samples.

      • Thank you, I overlooked that! And thanks for the link, that’s a great resource!

        From the Covidtestingproject manuscript: “Reader training is key to reliable LFA performance, and can be tailored for survey goals.” This ought to be documented by a good study, unless an electronic optical reader is used, as I’ve seen in one of the FDA-authorized assays. In this project, the two readers were unaware of the status of the sample, and that’s got to be the standard. (They actually accidentally rotated half of their sample plates 180 degrees and ran the tests in the opposite order; and were able to discover and correct the error afterwards!)

        The low end of the specificity range is 84.3%; with a false-positive rate of 15.7% we're approaching NYC levels of uncertainty.

        105/108 = 97.2% (95% CI 92.1-99.4%)

        If we take this data, the 368/371 manufacturer IgM specificity, and the 118/118 Stanford specificity, we have 591/597 = 99.0% (95% CI 97.8-99.6%); at 99%, getting 3/108 false positives or more has p=0.095. The lowest specificity applied in the Santa Clara study, 99.5% specificity, gives p=0.017 for 3/108, which indicates this value is too high.

        But since we do not know whether the manufacturer’s IgG and IgM false positives overlap, the manufacturer specificity could be as low as 366/371=98.7%. In the Covidtestingproject data, the false positives did not overlap; in the CDC filing, they did. With that specificity, seeing no false positive in 118 samples still has p=0.2, and 105/108 has p=0.17.

        Of course, 366/371 specificity reduces the raw prevalence from 1.5% to 0.15%, which equates to only 5 true positive samples in 3330.
        It does look like there’s a fatal flaw after all.
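
        A minimal sketch of the tail probabilities above (Python/scipy), using the same counts; small differences from the rounded figures above are just rounding:

          from scipy.stats import binom

          # P(3 or more false positives among 108 known negatives) at a given specificity
          for specificity in (0.995, 0.99, 366 / 371):
              p_false_pos = 1 - specificity
              print(round(specificity, 4), round(binom.sf(2, 108, p_false_pos), 3))
          # ~0.017 at 99.5%, ~0.095 at 99%, ~0.18 at 366/371

          # P(zero false positives among 118 known negatives) at 366/371 specificity
          print(round((366 / 371) ** 118, 2))   # ~0.2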

        • I find it troubling the degree to which the Stanford/Santa Clara studies appear to have thumbs on the scale. I think they will not be able to shake questions about the test.

  64. The paper quotes the manufacturer as saying there were 369 negative results on IgG tests involving 371 known negative people. That is a false positive rate of 2/371 = 0.5%. Because there were only 2 false positives that FPR has a large error bound, but since there is not much data to go on let’s use that point estimate. Stanford found 70 negative results in 70 known negative patients. If the manufacturer is to be believed it’s no surprise because the manufacturer’s FPR estimate would say the expected number of false positives Stanford would have found in 70 patients is 0.5% x 70 = 0.35.

    At the time of the study there were about 1,000 known infections and 50 deaths. If we estimate that only those who were sick enough to go to the hospital got tested, and use the estimated hospitalization rate, that would mean there were actually about 5,000 infections in a county of 1.9 million people. Rounding that up a little to 2 million, the estimate just made suggests a 5000/2000000 = 0.25% infection rate.

    A Bayesian analysis with a prior of a 0.25% infection rate and a test having a 0.5% false-positive rate suggests the probability of a person being infected given that they test positive is only about 30%. This would mean 70% of Stanford's positives were false. This is not surprising: if one assumes 0.25% = 1 in 400 people have the virus, but 0.5% = 2 in 400 people who do not have the virus would get a false positive, then a measurement with this test would say 3 in 400 people were infected when in reality only 1 in 400 actually was, leading to a factor-of-3 overestimate.
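
    A minimal sketch of that Bayesian calculation (Python), with the prior, false-positive rate, and assumed 100% sensitivity spelled out:

      prior = 0.0025        # assumed prior infection rate (0.25%)
      fpr = 0.005           # false-positive rate (0.5%)
      sensitivity = 1.0     # assumed perfect sensitivity
      p_positive = sensitivity * prior + fpr * (1 - prior)
      posterior = sensitivity * prior / p_positive
      print(round(posterior, 2))   # ~0.33, i.e. roughly the "about 30%" figure above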

    Even if one accepts Stanford's weighting and possible selection-bias adjustments, the factor-of-3 error from false positives means that instead of a 2.5% to 4.2% infection rate (48,000 to 81,000 infections), the true rate is more like 30% of that, or 0.75% to 1.3% (14,000 to 24,000 infections), which is much more realistic considering there were only 50 deaths. Also, please note that does not mean the death rate is just 50/14,000 or 50/24,000, because those 50 people who died caught the disease about 18 days earlier, when the infection rate was much lower.

  65. I think there may be one more issue with the interpretation of this study, which admittedly is not due to the study itself but rather a consequence of how the study is being positioned. It involves the comparison to the widely acknowledged (but also formally estimated by the CDC) 0.1% fatality rate of the seasonal flu (the paper states that its estimates for prevalence imply a COVID-19 fatality rate between 0.12% and 0.2%).

    It looks like when the CDC estimates this rate the denominator is estimated *symptomatic* illnesses, which is greater than confirmed cases (they adjust that number up based on prior patterns of people coming into the doctor’s office), but importantly *less* than the total number of people who actually had the disease insofar as it doesn’t attempt to estimate asymptomatic cases. See one CDC write-up here: https://www.cdc.gov/flu/about/burden/2018-2019.html .

    I believe the COVID-19 antibody tests would capture both symptomatic and asymptomatic cases, no? Please correct me if anyone understands this differently but it seems like that would be one of the primary values of such a test.

    I think the best way to make that comparison more apples-to-apples is to add back in the asymptomatic cases of the seasonal flu. The Discussion section of this paper in The Lancet (https://www.thelancet.com/journals/lanres/article/PIIS2213-2600%2814%2970034-7/fulltext ) states that 3/4 of all normal flu cases are asymptomatic, i.e. the total # of seasonal flu cases is 4x the # of symptomatic cases, and the estimated fatality rate (0.025%) is one quarter of the fatality rate (0.1%) of symptomatic cases.
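
    To spell out that adjustment, a minimal sketch (Python) using the figures cited above:

      symptomatic_cfr = 0.001        # the commonly quoted 0.1% for seasonal flu
      asymptomatic_fraction = 0.75   # per the Lancet discussion cited above
      all_infections_ifr = symptomatic_cfr * (1 - asymptomatic_fraction)
      print(f"{all_infections_ifr:.3%}")   # 0.025%, the apples-to-apples flu IFR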

    Admittedly this is not the fault of the authors, but it does seem to be an important point insofar as the comparison to the seasonal flu is helpful in understanding the impact (or, alternatively, being wielded as an instrument to deceive).

    • That’s an astute observation!

      And the Covid-19 deaths in Santa Clara were underestimated:
      “New data released to this news organization shows that deaths recorded by the Santa Clara County Medical Examiner-Coroner’s Office rose 20% last month, compared with March of 2019 — an increase that includes a 17% rise in the number of people who died at home. Overall, COVID-19 was listed as the cause of death or a significant condition for 32 people who died in the county in March — about half of the overall increase, though county officials acknowledge many more infections likely went undiagnosed.”
      “County Executive Jeff Smith said the uptick in deaths may be even higher — up to 25% compared to March 2019, with a rise in deaths at home between 20% and 21% — and include more COVID-19 deaths than previously known. He said COVID-19 fatalities appear to have accounted for about 41% of the increase in the total number of deaths.”
      https://www.mercurynews.com/2020/04/22/santa-clara-county-death-data-shows-20-increase-in-march-suggesting-more-coronavirus-victims-than-previously-known/

      So, another factor of 2 in the mix.

      “The data collected so far on how many people are infected and how the epidemic is evolving are utterly unreliable.” (John Ioannidis, March 2020)

      • I think John meant to say:

        The data *other people* collected so far on how many people are infected and how the epidemic is evolving are utterly unreliable.

  66. John K above raises an important point that has been missing: the statistical timeline and its huge impact on the infection fatality rate (IFR) and similar statistics. Stanford, and perhaps others, are improperly comparing infection, case, and death levels rather than the rates or causal factors that affect the IFR. The reported deaths should be compared to infections at the time the disease was contracted, not to today's infections. Today's increase in confirmed cases will take 20-30 days of treatment to end either in release or in the deaths associated with those incoming cases. A quick way to gauge this against seasonal flu: COVID deaths are nearing 10K/week, while seasonal flu runs on the order of 1-2K/week (the CDC reports roughly 50K/year as a midpoint for 2017). Both start with infections that incubate for similar lengths of time, but seasonal-flu statistics have had a chance to reach equilibrium; we clearly have not reached a similar equilibrium in the COVID case and death trajectories.
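
    A tiny illustration of this timing point, with purely hypothetical numbers (an assumed 5-day doubling time, an 18-day infection-to-death lag, and made-up infection and death counts), just to show how dividing today's deaths by today's infections understates the fatality rate during rapid growth:

        # Hypothetical illustration of the infection-to-death lag (not study data).
        doubling_time_days = 5         # assumed epidemic doubling time
        lag_days = 18                  # assumed lag from infection to death
        infections_today = 50_000      # assumed current infections
        deaths_today = 50              # assumed current cumulative deaths

        # Infections at the time today's deaths were actually contracted:
        infections_then = infections_today / 2 ** (lag_days / doubling_time_days)

        print(f"naive IFR:  {deaths_today / infections_today:.2%}")  # ~0.10%
        print(f"lagged IFR: {deaths_today / infections_then:.2%}")   # ~1.2% under these assumptions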

    It seems we can completely discount the Stanford paper and its analysis on a number of grounds. It is too bad that its institutional discipline doesn't require a skeptical internal test of findings before any public release, preprint or otherwise. More authoritative work appears as preprints with multiple centers represented, which avoids the capture effect of lead investigators who, in this case, clearly had a bias toward outcomes suiting their non-scientific aims. For an example of good practice, I recommend recent UCSF papers, which incorporate other research centers and private companies. While nothing is perfect, that removes much of the opportunity for the wanton bias we see in this rushed-to-press Santa Clara County piece.

    • Hi Rich,

      Re: Stanford, and perhaps others, are improperly comparing infection, case, and death levels rather than the rates or causal factors that affect the IFR. The reported deaths should be compared to infections at the time the disease was contracted.

      —-

      I don’t see how you can discern the exact time the disease triggered the case. It seems to me that such evaluations would require some exceptional monitoring.

      Nor do I see how any study can be held to be without biases and interests. To characterize the Stanford paper as perhaps reflecting 'wanton' biases seems promiscuous. I mean, really, to imply that multi-center collaborations are therefore justifiable strikes me as an overgeneralization, and binary, despite that little caveat you inserted, i.e., 'while nothing is perfect.' I don't expect you to derogate your past or current associations.

    • For the Premier Biotech test (Stanford) and the BioMedomics test, the data in this database comes from the covidtestingproject.org trials. It's already pretty well documented on their site, but it's nice to have a .csv. It's not test data but trial documentation, as I understand it?
      I’ve already used the 105/108 specificity above.

  67. Why did they use just one test? There are lots of tests, and another with established accuracy could have been used as a robustness check. Also, who checked the manufacturer's self-serving claim of a false-positive rate below 2% (i.e., specificity above 98%)?
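
    For what it's worth, here is a sketch of the exact (Clopper-Pearson) 95% interval implied by the pooled validation counts discussed in this thread, 399 negatives out of 401 pre-covid samples; scipy is assumed to be available:

        # Exact (Clopper-Pearson) 95% CI for the false-positive rate, given
        # 2 positives among 401 known-negative (pre-covid) samples.
        from scipy.stats import beta

        n, false_pos = 401, 2
        lo = beta.ppf(0.025, false_pos, n - false_pos + 1)
        hi = beta.ppf(0.975, false_pos + 1, n - false_pos)
        print(f"false-positive rate: ({lo:.4f}, {hi:.4f})")   # roughly (0.0006, 0.018)
        print(f"specificity:         ({1 - hi:.3f}, {1 - lo:.3f})")  # roughly (0.982, 0.999)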

    About sample bias – the authors use a common “trick” of mentioning a few possible biases, and suggesting that they might cancel each other out. It is a bad trick. It is easy to find other possible biases not mentioned. The sample is skewed towards people who are out-and-about (because they had to go to the testing site), and presumably out-and-about people are more likely to have caught the virus. Facebook users are possibly more social than others, again more likely to have contacts through which they can catch the virus.

  68. So I just took their raw data and did a standard Bayesian analysis for the fraction of positives f1 in the specificity test (2 of 401) and the fraction of positives f2 in the Santa Clara county data (50 of 3,330), with flat priors, to get the joint posterior p(f2, f1 | data) = p(df, f1 | data), where df = f2 - f1 is the difference in positive fractions. Marginalizing over f1 (i.e., averaging over the posterior specificity) gives the posterior p(df | data), which I integrated to get the cdf. I did not try any stratification or reweighting; too problematic.

    So we can ask: what is the probability that df > 0? From their raw data this is 95%, which is reassuring; i.e., there is only a 5% probability that the Santa Clara county sample is all false positives, and we know there were true infections. The probability that df > 1.5% is only about 4%, so the authors' quoted 2.4-4.2% range looks extremely unlikely under this posterior, imho.
    The posterior median of df (the true-positive fraction) is 0.8%, which corresponds to about 15,000 cases, still about 16-fold above confirmed cases but nowhere near the 45- to 80-fold figures in the press releases.
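
    For anyone who wants to reproduce this, here is a minimal Monte Carlo version of the same calculation (flat priors give Beta posteriors; the exact percentages will wobble slightly from run to run):

        # Monte Carlo version of the Bayesian calculation sketched above.
        import numpy as np

        rng = np.random.default_rng(0)
        n = 1_000_000

        f1 = rng.beta(2 + 1, 399 + 1, n)    # false-positive fraction: 2 of 401 pre-covid samples
        f2 = rng.beta(50 + 1, 3280 + 1, n)  # raw positive fraction: 50 of 3,330 participants
        df = f2 - f1                        # draws of the true-positive fraction

        print("P(df > 0)    ~", (df > 0).mean())      # roughly 0.95
        print("P(df > 1.5%) ~", (df > 0.015).mean())  # a few percent
        print("median df    ~", np.median(df))        # roughly 0.008, i.e. ~0.8%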

    • Mendel –

      Thanks for the link:

      In the limitations:

      > other areas are likely to have different seroprevalence estimates based on effective contact rates in the community,

      Why not expand that part of the discussion with some evidence on how representative Santa Clara County's per capita death rate is? They discuss the issue of areas with higher fatality rates, with numbers, but don't attempt to bound the range of uncertainty in that part of their discussion.

      Also, there is no mention in the limitations section of how unrepresentative Santa Clara County is from a national perspective. WTF? They say that their estimates might differ from those for certain communities w/r/t fatality rates (e.g., nursing homes, homeless populations), but they don't even mention Santa Clara's SES or ethnic/racial profile when discussing the limitations of their findings for broader extrapolation? That's really, really hard for me to understand.

      • They do mention the early cases that have been discovered in retrospect, but not that the medical examiner says the death count was likely underestimated. That was in the same press article, though.

    • page 19: 6 of the 13 specificity samples they have collected from various unnamed sources are at 99.2% or lower, which is the lower edge of their 95% CI for the 99.5% specificity they're still going with. The 7 other specificity samples are all perfect 100%. How likely is that (see the rough sketch after this list)? Can we split the data in two along this line and do two alternate analyses, one with the lower average from the non-perfect samples and one with 100%?
      a) We know that the 105/108 sample from the UCSF was read by independent readers blind to the status of the sample, i.e. they read both positive and negative samples without knowing which was which. If the readers aren’t blind, they might ignore a weak red bar?
      b) maybe half were summer and half were winter samples?
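
      One crude way to put a number on "how likely is that?" is below; the per-sample sizes are not given here, so the sizes tried are purely hypothetical:

          # How surprising are 7 perfect specificity samples, if the true
          # specificity is ~99.5%? (Hypothetical, equal sample sizes.)
          true_specificity = 0.995
          for n in (30, 100, 200):
              p_all_perfect = true_specificity ** (n * 7)   # all 7 samples of size n with zero false positives
              print(f"n = {n}: P(all 7 perfect) ~ {p_all_perfect:.2f}")
          # The answer swings from "unremarkable" to "very unlikely" depending on n.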

      The analysis of self-selection bias does not discuss possible exposure as a motivator to participate.
      They’re glossing over the fact that a lot of recruitment was via private sharing of links.
      They have provided some information on the weights.
      2 of the 167 participants in the oldest age group tested positive, for a raw 1.2% prevalence with a huge error margin and unknown weights. The overall fatality rate is mainly determined by this age group.

      They did have 4 positives with symptoms in the past 2 weeks, and 14 positives with symptoms in the past 2 months. This would put the symptomatic:asymptomatic ratio at approximately 1:2.5. If the false positives were mostly asymptomatic, it would be closer to 1:1.3. If positives self-selected based on exposure to a known case, and we'd expect some of those with symptoms to have already been tested, then the asymptomatic share should be higher than expected; since I'm expecting a ratio of 3:2 based on Diamond Princess data, this fits.
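
      A quick sketch of that ratio arithmetic; the 18 false positives below are an assumption chosen to illustrate the "mostly asymptomatic false positives" case, not a measured count:

          # Ratio arithmetic for the paragraph above.
          total_positives = 50
          symptomatic = 14                       # symptoms in the past 2 months
          asymptomatic = total_positives - symptomatic

          print(f"1 : {asymptomatic / symptomatic:.1f}")        # ~1 : 2.6, close to the 1:2.5 above

          assumed_false_positives = 18           # assumption: all asymptomatic
          true_asymptomatic = asymptomatic - assumed_false_positives
          print(f"1 : {true_asymptomatic / symptomatic:.1f}")   # ~1 : 1.3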

      There’s one argument I do not understand, down on page 20: “for 0 true positives to be a possibility, one needs … 0 false negatives”. It’s immaterial, but it seems wrong to me.

      Oh well.

  69. Lame. Revenge of the nerd shit. You don’t need a test to determine whether you are sick.
    This is just Stanford babble to cover up the fact that the hospital is empty and the nurses are on vacation.

    • Anon:

      My understanding is that the purpose of the test is not so much to determine whether you are sick, but (a) if you are sick, to find out what sickness you have, so that doctors can treat you more effectively, and (b) if you are not sick, to find out if you have been infected, to better estimate the spread of the disease throughout the population.

  70. Go back to class. Stanford hospital is empty. Nurses on vacation. Did you do any research for this? Or just drink a warm coke and sharpen your pencil?

    • Curious:

      I don’t understand the focus on Ioannidis here, given that he’s the 16th out of 17 authors.

      In any case, yes, I’ve seen the new article and I discussed it here and here.

      If there’s something specific that you disagree with in these posts, please share in comments. What I’m writing may be “hyperbolic and nonconstructive,” but if you can’t tell me your reasons for saying that, I can’t do much with this.

  71. Greetings – I see I’m late to this discussion, but did want to raise one issue:

    The zip code adjustment could be masking important and causal factors. There is a huge variability in land use, demographics, and population density across the county.

    Density alone could drive transmission, and I don't understand how a valid adjustment can be made between a rural zip code and a dense urban zip code.

    How can statistical adjustment or weighting tease all this out? (ps – obviously, I am not a statistician – this piece was sent to me by my son, who is)

    • Bill:

      I have no confidence at all in the statistical adjustment used in that Santa Clara paper. But, speaking more generally, you adjust for known differences between sample and population. Sex, age, ethnicity, and zip code are just part of the story, but adjusting for them should help—if you do it in a reasonable way.

  72. Disappointed in this article. While there are flaws in the study (there are flaws in all studies; we just decide our views based on bias, no matter how hard we try not to), you claim to approach it scientifically, and that approach should be as unbiased as possible.

    But by your own admission you didn't read the whole paper before writing this, and allowed your own bias to say "I have read enough; it's time to tell the world how this can be pulled apart." It's shameful and flies in the face of the point you were trying to make.

    Put your approach under a microscope before you next critique an article. Fortunately, I tend to look for peer-reviewed studies rather than how an individual feels about how the initial study was written, but this was sent to me by someone doing research who came across it as though it were a genuine, unbiased piece of work. Which it clearly is not.

    • Stephen:

      Ironically you seem to have not read my post carefully before commenting. The statement, “I haven’t read the entire paper” is not by me; it’s by my correspondent, which should be clear given that what I wrote above was, “I [Rushton] haven’t read the entire paper.” I [Gelman, the author of the post] did indeed read the entire paper.

      My goal in reading a paper is not to grade the authors on a curve. I know that, as you put it, “there are flaws in all studies.” There are flaws in all of my studies, that’s for sure! So what? The point of the above post is that the paper by the Stanford team understated the uncertainties in their claims, and then used these understated uncertainties to make inappropriately strong conclusions.

      If you think that it’s “shameful” of me to point this out, then you’re just shooting the messenger. I recommend you direct your angry feelings toward the people who released this study without checking their statistics, rather than toward the people who pointed out the evident errors in the paper.

    • Bob:

      Is there a particular thing that I wrote that you think is wrong?

      As I wrote in response to a similarly rude comment above, if there’s something specific that you disagree with in these posts, please share in comments. What I’m writing may be “drivel,” but if you can’t tell me your reasons for saying that, I can’t do much with this.

    • Bob:

      “Study after study is showing asymptomatics as 70%-95% of the population, the latest being Penn State.”

      Oddly, the paper you cite doesn’t say that. In particular:

      ‘researchers estimated the detection rate of symptomatic COVID-19 cases using the Centers for Disease Control and Prevention’s influenza-like illnesses (ILI) surveillance data over a three week period in March 2020.

      “We analyzed each state’s ILI cases to estimate the number that could not be attributed to influenza and were in excess of seasonal baseline levels,”’

      And later:

      “Their estimates showed rates much higher than initially reported but closer to those found once states began completing antibody testing.

      In New York, for example, the researchers’ model suggested that at least 9% of the state’s entire population was infected by the end of March. After the state conducted antibody testing on 3,000 residents, they found a 13.9% infection rate, or 2.7 million New Yorkers.”

      An earlier start than thought, with a highly underestimated number of cases that converges with the serology test data from NY.

      Nowhere do they suggest that 75% of the population (245 million people) have been infected and asymptomatic.

      As far as Ioannidis and the Stanford/Santa Clara study, they had to rework their paper in part due to the “drivel” published by our host, increasing their lower bound for IFR upwards by about 50%. And that only addressed part of the criticism …

      • That said, there is some circumstantial (for now) evidence that antibody tests may miss some cases, in particular those who exhibit IgA in mucosal tissue (so not in the bloodstream) and / or T-cell responses (which aren’t caught by these tests). There is however just a very small sample study around for now (8 individuals).

        Elsewhere, Bergamo reports a putative 70% infection rate, but the sample isn’t representative (people aren’t willing to participate in the serosurveys because they have to quarantine if positive).

    • Bob, why is your name a hyperlink to the study you posted? Seems a lot of work to go through to promote one news release that does not really support the conclusion one suspects you are gunning for…

    • Well, the flaws in extrapolating the low rates claimed in the Stanford study to the whole nation are still severe. More than 0.1%-0.2% of the entire population of New York City has died.

      I wouldn’t be surprised if asymptomatics are more common than we thought, however, for example if some people recover from COVID without producing antibodies of the type or at the level that are being tested for. (T-cell response, innate immune system, whatever…)

      IFR could be somewhat lower than we thought, but Lombardy and NYC set a lower bound on *how* low it could be, at least in dense urban areas with “first-world” demographics (relatively high median age).
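
      A back-of-the-envelope version of that lower-bound argument; the death count and population are rough assumed figures for NYC in spring 2020, not numbers from the paper:

          # Crude lower bound on IFR from NYC (rough assumed figures, spring 2020).
          nyc_population = 8_400_000
          nyc_covid_deaths = 16_000      # confirmed + probable, approximate

          deaths_per_capita = nyc_covid_deaths / nyc_population
          print(f"deaths / population: {deaths_per_capita:.2%}")   # ~0.19%

          # Even if literally everyone had been infected, the IFR could not be
          # lower than deaths / population; lower infection rates push it higher.
          for infected_share in (1.0, 0.5, 0.25):
              print(f"{infected_share:.0%} infected -> IFR >= {deaths_per_capita / infected_share:.2%}")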

    • Follow:

      I don’t know why people keep talking about Ioannidis. He’s the 16th author of this paper! He’s not even mentioned in the above post.

      Also, as I’ve written many many times, in criticizing that article I’m not saying that I think its substantive claims about prevalence rates, infection fatality rates, are wrong. I have no idea! The claims could be wrong, or they could be right. My point has always been with the statistical methods, which led the authors to an inappropriate claim of certainty based on the data presented in that study.

      Here’s what I wrote in the above post:

      I’m not saying that the claims in the above-linked paper are wrong. Maybe the test they are using really does have a 100% specificity rate and maybe the prevalence in Santa Clara county really was 4.2%. It’s possible. The problem with the paper is that (a) it doesn’t make this reasoning clear, and (b) their uncertainty statements are not consistent with the information they themselves present.

      To put it in general terms, the sentence “Data X imply the statement Y,” can be false, even if statement Y happens to be true, or might be true, or whatever, based on other evidence.

  73. In addition to modelling individual fatality rates, shouldn't we be making policy decisions based on population death rates? And shouldn't models of the latter account for a possible inverse relationship with the former? Ebola has a much higher IFR and yet caused only 11K deaths worldwide. Does a higher individual fear of a disease lead to increased efforts at protection and thus a lower impact on the global population?
