Concerns with that Stanford study of coronavirus prevalence

Josh Rushton writes:

I’ve been following your blog for a while and checked in today to see if there was a thread on last week’s big-splash Stanford antibody study (the one with the shocking headline that they got 50 positive results in a “random” sample of 3330 antibody tests, suggesting that nearly 2% of the population has been infected “under the radar”). I didn’t see anything, so I thought I’d ask if you’d consider opening a discussion.

This paper is certainly relevant to the MrP thread on politicization of the covid response, in that the paper risks injecting misinformation into an already-broken policy discussion. But I think it would be better to use it as a case study on poor statistics and questionable study design. I don’t mean to sound harsh, but if scientists are afraid to “police” ourselves, I don’t know how we can ask the public to trust us.

Simply put, I see two potentially fatal flaws with the study (full disclosure: I [Rushton] haven’t read the entire paper — a thousand apologies if I’m jumping the gun — but it’s hard to imagine these getting explained away in the fine print):

  • The authors’ confidence intervals cannot possibly be accounting for false positives correctly (I think they use the term “specificity” to mean “low rate of false positives”). I say this because the test validation included a total of 30+371 pre-covid blood tests, and only 399 of them came back negative. I know that low-incidence binomial CIs can be tricky, and I don’t know the standard practice these days, but the exact binomial 95% CI for the false-positive rate is (0.0006, 0.0179); this is pretty consistent with the authors’ specificity CI (98.3%, 99.9%). For rates near the high end of this CI, you’d get 50 or more false positives in 3330 tests with about 90% probability. Hard to sort through this with strict frequentist logic (obviously a Bayesian could make short work of it), but the common-sense take-away is clear: It’s perfectly plausible (in the 95% CI sense) that the shocking prevalence rates published in the study are mostly, or even entirely, due to false positives. So the fact that their prevalence CIs don’t go anywhere near zero simply can’t be right.
  • Recruitment was done via facebook ads with basic demographic targeting. Since we’re looking for a feature that affects something like 2% of the population (or much, much less), we really have to worry about self selection. They may have discussed this in the portions of the paper I didn’t read, but I can’t imagine how researchers would defeat the desire to get a test if you had reason to believe that you, or someone near you, had the virus (and wouldn’t some people hide those reasons to avoid being disqualified from getting the test?)…
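Before going on, here's a quick numerical check of Rushton's figures, as a sketch in Python (scipy assumed; the counts are the ones from his email):

```python
# Exact (Clopper-Pearson) 95% CI for the false-positive rate, given 2 positives
# among the 30 + 371 = 401 pre-covid validation samples, and the chance of
# seeing 50 or more false positives in 3330 tests at the upper end of that CI.
from scipy import stats

x, n = 2, 401                               # false positives among pre-covid samples
lower = stats.beta.ppf(0.025, x, n - x + 1)     # exact lower bound
upper = stats.beta.ppf(0.975, x + 1, n - x)     # exact upper bound
print(f"exact 95% CI for the false-positive rate: ({lower:.4f}, {upper:.4f})")
# roughly (0.0006, 0.0179), matching the interval in the email

# probability of 50+ false positives out of 3330 tests at the upper end of the CI
print(f"P(50+ false positives at rate {upper:.4f}): {stats.binom.sf(49, 3330, upper):.2f}")
# comes out around 0.9, the roughly 90% probability mentioned in the email
```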

Pretty harsh words—but this is just some guy sending me an email. I’d have to read the paper and judge for myself, which I did with an open mind. (Let me assure you that I did not title this post until after writing most of it.)

It’s been a busy month for Stanford on the blog. First there were these pre-debunked forecasts we heard from a couple of assholes from the Hoover Institution, then some grad students sent us this pretty sane literature review, and now this!

Reading through the preprint

Anyway, after receiving the above email, I clicked through and read the preprint, “COVID-19 Antibody Seroprevalence in Santa Clara County, California,” by Eran Bendavid et al., which reports:

On 4/3-4/4, 2020, we tested county residents for antibodies to SARS-CoV-2 using a lateral flow immunoassay. Participants were recruited using Facebook ads targeting a representative sample of the county by demographic and geographic characteristics. We report the prevalence of antibodies to SARS-CoV-2 in a sample of 3,330 people, adjusting for zip code, sex, and race/ethnicity. . . . The unadjusted prevalence of antibodies to SARS-CoV-2 in Santa Clara County was 1.5% . . . and the population-weighted prevalence was 2.8%.

That’s positive test results. Then you have to adjust for testing errors:

Under the three scenarios for test performance characteristics, the population prevalence of COVID-19 in Santa Clara ranged from 2.5% to 4.2%. [I’ve rounded all numbers to a single decimal place for my own sanity. — AG]

To discuss this paper, I’ll work backward, starting from the conclusion and going through the methods and assumptions.

Let’s take their final estimate, 2.5% to 4.2%, and call it 3%. Is a 3% rate of coronavirus antibodies in Santa Clara county a high or a low number? And does this represent good news or bad news?

First off, 3% does not sound implausible. If they said 30%, I’d be skeptical, given how everyone’s been hiding out for awhile, but 3%, sure, maybe so. Bendavid et al. argue that if the number is 3%, that’s good news, because Santa Clara county has 2 million people and only an estimated 100 deaths . . . 0.03*(2 million)/100 = 600, so that implies that 1/600 of exposed people there died. So that’s good news, relatively speaking: we’d still like to avoid 300 million Americans getting the virus and 500,000 dying, but that’s still better than the doomsday scenario.

It’s hard to wrap my head around these numbers because, on one hand, a 1/600 death rate sounds pretty low; on the other, 500,000 deaths is a lot. I guess 500,000 is too high because nobody’s saying that everyone will get exposed.

The study was reported in the news with headlines along the lines of “Santa Clara county has had 50 to 85 times more cases than we knew about, Stanford estimates.” It does seem plausible that lots more people have been exposed than have been tested for the disease, as so few tests are being done.

At the time of this writing, NYC has about 9000 recorded coronavirus deaths. Multiply by 600 and you get 5.4 million. OK, I don’t think 5.4 million New Yorkers have been exposed to coronavirus. New York only has 8.4 million people total! I don’t think I know anyone who’s had coronavirus. Sure, you can have it and not have any symptoms—but if it’s as contagious as all that, then if I had it, I guess all my family would get it too, and then I’d guess that somebody would show some symptoms.

That’s fine—for reasons we’ve been discussing for awhile—actually, it was just a month and a half ago—it doesn’t make sense to talk about a single “case fatality rate,” as it depends on age and all sorts of other things. The point is that there’ve gotta be lots of coronavirus cases that have not been recorded, given that we have nothing close to universal or random-sample testing. But the 1/600 number doesn’t seem quite right either.

Figuring out where the estimates came from

OK, now let’s see where the Stanford estimate came from. They did a survey and found 1.5% positive tests (that’s 50 out of 3330 in the sample). Then they did three statistical adjustments:

1. They poststratified on zip code, sex, and ethnicity to get an estimate of 2.8%. Poststratification is a standard statistical technique, but some important practical issues arise regarding what to adjust for.

2. They adjusted for test inaccuracy. This is a well-known probability problem—with a rare disease and an imperfect test, you can easily end up with most of your positive test results being false positives. The error rates of the test are the key inputs to this calculation.

3. They got uncertainty intervals based on the sampling in the data. That’s the simplest part of the analysis, and I won’t talk much about it here. It does come up, though, in the implicit decision of the paper to focus on point estimates rather than uncertainty ranges. To the extent that the point estimates are implausible (e.g., my doubts about the 1/600 ratio above), that could point toward a Bayesian analysis that would account for inferential uncertainty. But I’m guessing that the uncertainty due to sampling variation is minor compared to uncertainty arising from the error rate of the test.

I’ll discuss each of these steps in turn, but I also want to mention three other issues:

4. Selection bias. As Rushton wrote, it could be that people who’d had coronavirus symptoms were more likely to avail themselves of a free test.

5. Auxiliary information. In any such study, you’d want to record respondents’ ages and symptoms. And, indeed, these were asked about in the survey. However, these were not used in the analysis and played no role in the conclusion. In particular, one might want to use responses about symptoms to assess possible selection bias.

6. Data availability. The data for this study do not seem to be available. That’s too bad. I can’t see that there’d be a confidentiality issue: just knowing someone’s age, sex, ethnicity, and coronavirus symptoms should not be enough to allow someone to be identified, right? I guess that including zip code could be enough for some categories, maybe? But if that were the only issue, they could just pool some of the less populated zip codes. I’m guessing that the reason they didn’t release the data is simple bureaucracy: it’s easier to get a study approved if you promise you won’t release the data than if you say you will. Backasswards, that is, but that’s the world that academic researchers have to deal with, and my guess is that the turf-protectors in the IRB industry aren’t gonna let go of this one without a fight. Too bad, though: without the data and the code, we just have to guess at what was done. And we can’t do any of the natural alternative analyses.

Assessing the statistical analysis

Now let’s go through each step.

1. Poststratification.

There are 2 sexes and it seems that the researchers used 4 ethnicity categories. I’m not sure how they adjusted for zip code. From their map, it seems that there are about 60 zip codes in the county, so there’s no way they simply poststratified on all of them. They say, “we re-weighted our sample by zip code, sex, and race/ethnicity,” but “re-weighted . . . by zip code” doesn’t really say exactly what they did. Just to be clear, I’m not suggesting malfeasance here; it’s just the usual story that it can be hard for people to describe their calculations in words. Even formulas are not so helpful because they can lack key details.

I’m concerned about the poststratification for three reasons. First, they didn’t poststratify on age, and the age distribution is way off! Only 5% of their sample is 65 and over, as compared to 13% of the population of Santa Clara county. Second, I don’t know what to think about the zip code adjustment, since I don’t know what was actually done there. This is probably not the biggest deal, but given that they bothered to adjust at all, I’m concerned. Third, I really don’t know what they did, because they say they weighted to adjust for zip code, sex, and ethnicity in the general population—but in Table 1 they give their adjusted proportions for sex and ethnicity and they don’t match the general population! They’re close, but not exact. Again, I’d say this is no big deal, but I hate not knowing what was actually done.

And why did they not adjust for age? They write, “We chose these three adjustors because they contributed to the largest imbalance in our sample, and because including additional adjustors would result in small-N bins.” They should’ve called up a survey statistician and asked for help on this one: it’s a standard problem. You can do MRP—that’s what I’d do!—but even some simple raking would be fine here, I think.
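For readers who haven't seen raking: it just iteratively scales the survey weights so the weighted margins match the population margins, one adjustment variable at a time. Here's a minimal sketch (numpy assumed; the respondent categories and population shares below are invented for illustration, not taken from the study):

```python
# Minimal raking (iterative proportional fitting) sketch on made-up data.
import numpy as np

def rake(weights, categories, margins, n_iter=50):
    """Scale weights so the weighted share of each category matches its
    population margin, cycling over the adjustment variables."""
    w = weights.astype(float).copy()
    for _ in range(n_iter):
        for var, codes in categories.items():
            shares = {lvl: w[codes == lvl].sum() / w.sum() for lvl in margins[var]}
            for lvl, target in margins[var].items():
                if shares[lvl] > 0:
                    w[codes == lvl] *= target / shares[lvl]
    return w / w.mean()

# toy sample: 10 respondents, two adjustment variables (sex, age 65+)
sex   = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])   # sample is 60% category 0
age65 = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])   # sample is 10% age 65+
weights = rake(np.ones(10),
               {"sex": sex, "age65": age65},
               {"sex": {0: 0.49, 1: 0.51}, "age65": {0: 0.87, 1: 0.13}})
print(np.round(weights, 2))   # the 65+ respondent gets upweighted toward the 13% margin
```

Real raking would of course use the survey's actual adjustment variables and the county's actual margins; the point is just that bringing the age margin into line is cheap to do.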

There aren’t a lot of survey statisticians out there, but there are some. They could’ve called me up and asked for advice, or they could’ve stayed on campus and asked Doug Rivers or Jon Krosnick—they’re both experts on sampling and survey adjustments. I guess it’s hard to find experts on short notice. Doug and Jon don’t have M.D.’s and they’re not economists or law professors, so I guess they don’t count as experts by the usual measures.

2. Test inaccuracy.

This is the big one. If X% of the population have the antibodies and the test has an error rate that’s not a lot lower than X%, you’re in big trouble. This doesn’t mean you shouldn’t do testing, but it does mean you need to interpret the results carefully. Bendavid et al. estimate that the sensitivity of the test is somewhere between 84% and 97% and that the specificity is somewhere between 90% and 100%. I can never remember which is sensitivity and which is specificity, so I looked it up on wikipedia: “Sensitivity . . . measures the proportion of actual positives that are correctly identified as such . . . Specificity . . . measures the proportion of actual negatives that are correctly identified as such.” OK, here our concern is actual negatives who are misclassified, so what’s relevant is the specificity. That’s the number between 90% and 100%.

If the specificity is 90%, we’re sunk. With a 90% specificity, you’d expect to see 333 positive tests out of 3330, even if nobody had the antibodies at all. Indeed, they only saw 50 positives, that is, 1.5%, so we can be pretty sure that the specificity is at least 98.5%. If the specificity were 98.5%, the observed data would be consistent with zero, which is one of Rushton’s points above. On the other hand, if the specificity were 100%, then we could take the result at face value.
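Here's the back-of-the-envelope arithmetic, as a tiny sketch that just multiplies out the numbers above:

```python
# Expected number of false positives out of 3330 tests, assuming nobody in the
# sample actually has antibodies, at a few values of the specificity.
n_tests = 3330
for specificity in [0.90, 0.985, 0.995, 1.00]:
    expected_fp = (1 - specificity) * n_tests
    print(f"specificity {specificity:.1%}: about {expected_fp:.0f} false positives")
# 90.0%  -> about 333 (far more than the 50 positives observed)
# 98.5%  -> about 50  (the observed count, even with zero true positives)
# 99.5%  -> about 17
# 100.0% -> 0
```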

So how do they get their estimates? Again, the key number here is the specificity. Here’s exactly what they say regarding specificity:

A sample of 30 pre-COVID samples from hip surgery patients were also tested, and all 30 were negative. . . . The manufacturer’s test characteristics relied on . . . pre-COVID sera for negative gold standard . . . Among 371 pre-COVID samples, 369 were negative.

This gives two estimates of specificity: 30/30 = 100% and 369/371 = 99.46%. Or you can combine them together to get 399/401 = 99.50%. If you really trust these numbers, you’re cool: with y=399 and n=401, we can do the standard Agresti-Coull 95% interval based on y+2 and n+4, which comes to [98.0%, 100%]. If you go to the lower bound of that interval, you start to get in trouble: remember that if the specificity is less than 98.5%, you’ll expect to see more than 1.5% positive tests in the data no matter what!
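Here's that interval calculation spelled out (a sketch of the computation described above, not the authors' code):

```python
# Agresti-Coull 95% interval for the specificity, combining the two validation
# samples: 399 negatives out of 401 pre-covid blood samples.
import math

y, n = 399, 401
y_tilde, n_tilde = y + 2, n + 4          # add two successes and two failures
p_tilde = y_tilde / n_tilde
se = math.sqrt(p_tilde * (1 - p_tilde) / n_tilde)
lower, upper = p_tilde - 1.96 * se, p_tilde + 1.96 * se
print(f"95% interval for specificity: ({lower:.3f}, {upper:.3f})")
# roughly (0.980, 1.000); the lower end is below the 98.5% threshold at which
# all 50 positives out of 3330 tests could be false positives
```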

3. Uncertainty intervals. So what’s going on here? If the specificity data in the paper are consistent with all the tests being false positives—not that we believe all the tests are false positives, but this suggests we can’t then estimate the true positive rate with any precision—then how do they get a confident nonzero estimate of the true positive rate in the population?

It seems that two things are going on. First, they’re focusing on the point estimates of specificity. Their headline is the range from 2.5% to 4.2%, which comes from their point estimates of specificity of 100% (from their 30/30 data) and 99.5% (from the manufacturer’s 369/371). So the range they give is not a confidence interval; it’s two point estimates from different subsets of their testing data. Second, I think they’re doing something wrong, or more than one thing wrong, with their uncertainty estimates, which are “2.5% (95CI 1.8-3.2%)” and “4.2% (2.6-5.7%)” (again, I’ve rounded to one decimal place for clarity). The problem is that we’ve already seen that a 95% interval for the specificity will go below 98.5%, which implies that the 95% interval for the true positive rate should include zero.

Why does their interval not include zero, then? I can’t be sure, but one possibility is that they did the sensitivity-specificity corrections on the poststratified estimate. But, if so, I don’t think that’s right. 50 positive tests is 50 positive tests, and if the specificity is really 98.5%, you could get that with no true cases. Also, I’m baffled because I think the 2.5% is coming from that 30/30=100% specificity estimate, but in that case you’d need a really wide confidence interval, which would again go way below 98.5% so that the confidence interval for the true positive rate would include zero.
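Here's a rough simulation sketch that makes the point numerically. This is my framing, not a reconstruction of the authors' procedure: I draw the specificity from the combined 399/401 validation data, draw the sensitivity uniformly from the 84% to 97% range quoted above, draw the raw positive rate from the 50/3330 result, and push everything through the standard correction.

```python
# Monte Carlo sketch: propagate uncertainty in specificity and sensitivity into
# the prevalence implied by 50 positives out of 3330 tests, using
#   prevalence = (p_pos - (1 - spec)) / (sens + spec - 1).
import numpy as np

rng = np.random.default_rng(2020)
sims = 100_000
spec = rng.beta(399 + 1, 2 + 1, sims)       # from the 399/401 pre-covid negatives
sens = rng.uniform(0.84, 0.97, sims)        # the 84% to 97% range quoted above
p_pos = rng.beta(50 + 1, 3280 + 1, sims)    # raw positive rate, 50 out of 3330
prevalence = np.clip((p_pos - (1 - spec)) / (sens + spec - 1), 0, 1)
print(np.round(np.percentile(prevalence, [2.5, 50, 97.5]), 4))
# the lower end of the interval sits at zero: these data are consistent with
# essentially no true positives
```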

Again, the real point here is not whether zero is or “should be” in the 95% interval, but rather that, once the specificity can get in the neighborhood of 98.5% or lower, you can’t use this crude approach to estimate the prevalence; all you can do is bound it from above, which completely destroys the “50-85-fold more than the number of confirmed cases” claim.

They do talk about this a bit: “if new estimates indicate test specificity to be less than 97.9%, our SARS-CoV-2 prevalence estimate would change from 2.8% to less than 1%, and the lower uncertainty bound of our estimate would include zero. On the other hand, lower sensitivity, which has been raised as a concern with point-of-care test kits, would imply that the population prevalence would be even higher.” But I think this misses the point. First, if the specificity were less than 97.9%, you’d expect more than 70 positive cases out of 3330 tests. But they only saw 50 positives, so I don’t think that 1% rate makes sense. Second, the bit about the sensitivity is a red herring here. The uncertainty here is pretty much entirely driven by the uncertainty in the specificity.

This is all pretty much what Rushton said in one paragraph of his email. I just did what was, in retrospect, overkill here because I wanted to understand what the authors were doing.

4. Selection bias. In their article, Bendavid et al. address the possibility: “Other biases, such as bias favoring individuals in good health capable of attending our testing sites, or bias favoring those with prior COVID-like illnesses seeking antibody confirmation are also possible.” That makes sense. Bias could go in either direction. I don’t have a good sense of this, and I think it’s fine to report the results of a self-selected population, as long as (a) you make clear the sampling procedure, and (b) you do your best to adjust.

Regarding (b), I wonder if they could’ve done more. In addition to my concerns expressed above regarding insufficient poststratification (in turn driven by their apparent lack of consultation with a statistics expert), I also wonder if they could’ve done something with the data they collected on “underlying co-morbidities, and prior clinical symptoms.” I don’t see these data anywhere in the report, which is too bad. They could’ve said what percentage of the people in their survey reported any coronavirus-related symptoms.

5. Auxiliary information and 6. Data availability. As noted above, it seems that the researchers collected some information that could have helped us understand their results, but these data are unavailable to us.

Jeez—I just spent 3 hours writing this post. I don’t think it was worth the time. I could’ve just shared Rushton’s email with all of you—that would’ve just taken 5 minutes!

Summary

I think the authors of the above-linked paper owe us all an apology. We wasted time and effort discussing this paper whose main selling point was some numbers that were essentially the product of a statistical error.

I’m serious about the apology. Everyone makes mistakes. I don’t think the authors need to apologize just because they screwed up. I think they need to apologize because these were avoidable screw-ups. They’re the kind of screw-ups that happen if you want to leap out with an exciting finding and you don’t look too carefully at what you might have done wrong.

Look. A couple weeks ago I was involved in a survey regarding coronavirus symptoms and some other things. We took the data and ran some regressions and got some cool results. We were excited. That’s fine. But we didn’t then write up a damn preprint and set the publicity machine into action. We noticed a bunch of weird things with our data, lots of cases were excluded for one reason or another, then we realized there were some issues of imbalance so we couldn’t really trust the regression as is, at the very least we’d want to do some matching first . . . I don’t actually know what’s happening with that project right now. Fine. We better clean up the data if we want to say anything useful. Or we could release the raw data, whatever. The point is, if you’re gonna go to all this trouble collecting your data, be a bit more careful in the analysis! Careful not just in the details but in the process: get some outsiders involved who can have a fresh perspective and aren’t invested in the success of your project.

Also, remember that reputational inference goes both ways. The authors of this article put in a lot of work because they are concerned about public health and want to contribute to useful decision making. The study got attention and credibility in part because of the reputation of Stanford. Fair enough: Stanford’s a great institution. Amazing things are done at Stanford. But Stanford has also paid a small price for publicizing this work, because people will remember that “the Stanford study” was hyped but it had issues. So there is a cost here. The next study out of Stanford will have a little less of that credibility bank to borrow from. If I were a Stanford professor, I’d be kind of annoyed. So I think the authors of the study owe an apology not just to us, but to Stanford. Not to single out Stanford, though. There’s also Cornell, which is known as that place with the ESP professor and that goofy soup-bowl guy who faked his data. And I teach at Columbia; our most famous professor is . . . Dr. Oz.

It’s all about the blood

I’m not saying that the claims in the above-linked paper are wrong. Maybe the test they are using really does have a 100% specificity rate and maybe the prevalence in Santa Clara county really was 4.2%. It’s possible. The problem with the paper is that (a) it doesn’t make this reasoning clear, and (b) their uncertainty statements are not consistent with the information they themselves present.

Let me put it another way. The fact that the authors keep saying that “50-85-fold” thing suggests to me that they sincerely believe that the specificity of their test is between 99.5% and 100%. They’re clinicians and medical testing experts; I’m not. Fine. But then they should make that assumption crystal clear. In the abstract of their paper. Something like this:

We believe that the specificity of the test used in this study is between 99.5% and 100%. Under this assumption, we conclude that the population prevalence in Santa Clara county was between 1.8% and 5.7% . . .

This specificity thing is your key assumption, so place it front and center. Own your modeling decisions.

P.S. Again, I know nothing about blood testing. Perhaps we could convene an expert panel including George Shultz, Henry Kissinger, and David Boies to adjudicate the evidence on this one?

P.P.S. The authors provide some details on their methods here. Here’s what’s up:

– For the poststratification, it turns out they do adjust for every zip code. I’m surprised, as I’d think that could give them some noisy weights, but, given our other concerns with this study, I guess noisy weights are the least of our worries. Also, they don’t quite weight by sex x ethnicity x zip; they actually weight by the two-way margins, sex x zip and ethnicity x zip. Again, not the world’s biggest deal. They should’ve adjusted for age, too, though, as that’s a freebie.

– They have a formula to account for uncertainty in the estimated specificity. But something seems to have gone wrong, as discussed in the above post. It’s hard to know exactly what went wrong since we don’t have the data and code. For example, I don’t know what they are using for var(q).

P.P.P.S. Let me again emphasize that “not statistically significant” is not the same thing as “no effect.” What I’m saying in the above post is that the information in the above-linked article does not provide strong evidence that the rate of people in Santa Clara county exposed by that date was as high as claimed. Indeed, the data as reported are consistent with the null hypothesis of no exposure, and also with alternative hypotheses such as exposure rates of 0.1% or 0.5% or whatever. But we know the null hypothesis isn’t true—people in that county have been infected! The data as reported are also consistent with infection rates of 2% or 4%. Indeed, as I wrote above, 3% seems like a plausible number. As I wrote above, “I’m not saying that the claims in the above-linked paper are wrong,” and I’m certainly not saying we should take our skepticism in their specific claims and use that as evidence in favor of a null hypothesis. I think we just need to accept some uncertainty here. The Bendavid et al. study is problematic if it is taken as strong evidence for those particular estimates, but it’s valuable if it’s considered as one piece of information that’s part of a big picture that remains uncertain. When I wrote that the authors of the article owe us all an apology, I didn’t mean they owed us an apology for doing the study, I meant they owed us an apology for avoidable errors in the statistical analysis that led to overconfident claims. But, again, let’s not make the opposite mistake of using uncertainty as a way to affirm a null hypothesis.

P.P.P.P.S. I’m still concerned about the zip code weighting. Their formula has N^S_zsr in the denominator: that’s the number of people in the sample in each category of zip code x sex x race. But there are enough zip codes in the county that I’m concerned that weighting in this way will be very noisy. This is a particular concern here because even the unweighted estimate of 1.5% is so noisy that, given the data available, it could be explained simply by false positives. Again, this does not make the substantive claims in the paper false (or true), it’s just one more reason these estimates are too noisy to do more than give us an upper bound on the infection rate, unless you want to make additional assumptions. You could say that the analysis as performed in the paper does make additional assumptions, it just does so implicitly via forking paths.

P.P.P.P.P.S. A new version of the article has been released; see discussion here and here.

P.P.P.P.P.P.S. See here for our analysis of the data published in the revised report. Our conclusion:

For now, we do not think the data support the claim that the number of infections in Santa Clara County was between 50 and 85 times the count of cases reported at the time, or the implied interval for the IFR of 0.12–0.2%. These numbers are consistent with the data, but the data are also consistent with a near-zero infection rate in the county. The data of Bendavid et al. (2020a,b) do not provide strong evidence about the number of people infected or the infection fatality ratio; the number of positive tests in the data is just too small, given uncertainty in the specificity of the test.

432 Comments

  1. Asher says:

    I believe self selection is a huge problem. People who suspect they may have had the virus will want to know and will volunteer for these studies at many times the rate of the general population. There is just no way to adjust for this because there is no representative sample enabling you to estimate the self-selection tendency. So even if they do have 3% of their sample, it could just mean that 3% of people who suspect they have the virus and want to know really do. Doesn’t say anything about the general population.

    • Anonymous says:

      > no way to adjust for this because there is no representative sample enabling you to estimate the self-selection tendency

      Just make the ad say it’s an antibody study NOT testing for covid19 antibodies. Then in person tell them it’s for covid and tell them the results.

      • Costa Vakalopoulos says:

        Self selection works the other way as well and may not be an indicator of disease presence but of anxiety of having the disease. As a clinician the number of people wanting testing for all manner of symptoms might astound you. All so far proved negative.

        • Zhou Fang says:

          I don’t see that sort of self selection as being very likely to produce infection levels lower than random selection. Anxiety and common symptoms might be a poor indicator of a rare illness, but it’s not like they are anti-correlated with it either.

          • Carlos Ungil says:

            The correlation may exist if people who are more concerned about the possibility of an infection are at the same time a) more likely to get tested because they are more motivated and b) less likely to get infected because they are more careful.

            • Zhou Fang says:

              I suppose that is possible, but doesn’t the rarity here make that a small factor? You would need a great number of hypochondriacs skilled at disease avoidance to make a difference here, especially if we’ve already adjusted away demographic differences. (Like I can see income as causing such a correlation.)

              • Carlos Ungil says:

                I do agree with you, it’s not likely to have a large effect. But it’s not completely implausible.

          • Costa Vakalopoulos says:

            I agree: many years of clinical practice has taught me it biases neither way, and that’s really the point I’m trying to make. Statistics can only take you so far, particularly when there are so many variables to account for and not enough data; plausibility or, dare I say, enlightened instinct can guide your interpretation of what’s actually happening. I’m a big admirer of John Ioannidis, and he has taken a brave stand against the self-interested (at least initially) hysteria the media have generated, and I’m sure he started out with the premise that the quoted figures seemed a little too ridiculous to generalize. I think attempts at statistical rigour sometimes create as many problems as they solve, and in fact this is where John got himself into a pickle when he claimed antidepressants were no better than placebo, ignoring obvious clinical usefulness. At least he wasn’t so rigid and partly retracted his stance. So I don’t believe he is statistically motivated but is using the study to bolster what he believes is likely.

            The other point I want to make is that John is anything but an ‘asshole’, and although I do admire Andrew’s work, I wonder whether there is a touch or more of envy in the uncompromising intellectual stand John has taken and the attention it has received, and I believe deserves, for not being swept up in the panic. It may be years before we get a true picture of what this pandemic really means for our understanding of infectious diseases, but I wouldn’t be surprised if the mortality figures are not much higher than influenza, or at least nowhere near the fear-invoked levels created by other so-called experts who have taken the stage, like Imperial College.

            Finally, the discussion is, and should also be, focused on the collateral damage total lockdown can have for economics and the social fabric of society, each of which will lead to its own epidemics of morbidity. For example, many hospitals remain empty preparing for this yet-to-materialize influx, and it’s totally distracting from other pressing concerns for humanity that will be far more critical to our well being in the long run. There are many more issues than just statistics to guide one’s impression of this so-called pandemic.

            • Costa Vakalopoulos says:

              Conversely, I suspect many of the posts are driven more by the pervasive sentiment of an obvious increase in death rates with the current pandemic than by any statistical imperative.

            • Anna Amiradaki says:

              Hi Costa,
              Kudos to you for standing up to Dr. Ioannidis in this blog, a brave man indeed, who is using science and facts to try and lead public policy towards the right (hopefully) direction. My husband is a Columbia University Graduate and we admire the scrutiny Dr. Ioannidis was willing to withstand, to honor his commitment on unbiased studies and the service of science to mankind.

              Be well and thank you again!

              Anna

            • Andrew says:

              Costa:

              1. I never said John Ioannidis was an asshole! In the above post I referred to “a couple of assholes from the Hoover Institution.” So I don’t know where that is coming from. Ioannidis is not to my knowledge associated with the Hoover Institution. I was referring to Richard Epstein and Victor Hanson, as can be seen from the links.

              2. I don’t know why you think I’m envious of Ioannidis. I really can’t think of anything I’ve said that would make someone think that. Actually, in the whole post I never mentioned Ioannidis once.

              3. I don’t think the Imperial College researchers are perfect, but they are experts, no? What is gained by calling them “so-called experts”?

              4. I agree with you that statistics only tells part of the story. The statistics of the Santa Clara study are ambiguous; hence we need to rely on other information to make decisions.

              • lemi says:

                Well, Imperial College’s modelling has finally been released. What’s your assessment of their expertise now, “asshole”?

              • Andrew says:

                Lemi:

                I haven’t looked at the raw code from the Imperial College researchers, but I’ve been talking with them a lot about their statistical model. There’s some blog discussion here.

        • Gilligan says:

          Self-selection is a serious problem when there are access to care issues and sub-populations prone to avoid interaction with agencies, or quasi-agencies of the government. I’m surprised at so many people appearing to be unaware of this problem.

      • Stephen M St Onge says:

        >Just make the ad say it’s an antibody study NOT testing for covid19 antibodies.

        I doubt that will work. If I see an online ad for an antibody test, with no mention of the disease tested for, at the time of a well-publicized outbreak, I do believe I’ll make a good guess at what disease is being tracked.

    • Question says:

      Hello I am not a statistician but correct me if I’m wrong. Keep in mind a recent Dutch study came up with a similar number of 3%. Anyway…

      Now, From what I read in the thread, the problem is with the +/- of the test. The poster of this thread is very casual about what specificity vs sensitivity is. Specificities tend to underestimate how many have this thing. The researchers said they tested the “tests” and it correctly identified 100% of the negatives and 68% of the positives. Presumably with a PCR test. A follow up test identified 100% of negatives and 93% of positives. In both cases it shows that people who don’t have it are correctly identified. You are negative you don’t have it . However there are some positive cases that are being missed. This tells me that the number infected will most likely be greater than 3% which means we are massively over inflating the death rates.

      • No test gives a positive always and only when the sample is positive. Even just if there’s contamination you can get a positive for a person who doesn’t have the disease. There are other reasons too.

        The point of this article is that if the specificity of the test is 98.5% or less… then the data is consistent with there not being any true positives in the sample.

        Worse than that, there are a bunch of extra uncertainties that make it such that even if the specificity is higher, maybe 99% or something, then it’s still within the realm of possibility that all of these are false positives. Even if it isn’t “all” of them, the bulk of the positives might well be false.

        Finally, with the recruitment via FB etc, it seems clear that this group was very likely to be enriched for people who thought they had the disease, so even if it’s a few percent, it’s a few percent of people who think they have the disease.

        Basically, we learn from this data ONLY that there probably aren’t 5% or more of the population who have had the disease in Santa Clara county. That’s all we learn. It’s a useful piece of information, but it’s not much. My prior was that it’s at most a few percent even before this study, so there’s nothing surprising, or particularly informative.

        • Question says:

          The more relevant piece of data for these smaller studies is not the false positives, but the false negatives. Of course false positives are important. We can’t just ignore that. Stanford would choose a high-specificity test. The nature of these tests is generally (not always) that if a test is geared toward specificity, the sensitivity is lacking. The controversy is over the factory stats of the antibody test. If Stanford ran tests twice and both times got 100% accuracy on negatives, confirmed with PCR, then I have no reason to doubt them. And both times they ran the test, positives were missed up to 30%. Which again, can only mean the actual positives are underestimated. For smaller studies, that’s what you want. I don’t see why people are so dismissive of this, it’s like they want Stanford to be wrong. (No, I don’t have any affiliation).
          The OP is not the first to make this critique, I saw it in a Nature article and elsewhere. Just people repeating the same thing. They are also repeating the same criticisms of case selection, valid points, but little value. If the selection was biased toward people who may have been exposed due to whatever, that would be massively overshadowed by the number of people induced by media panic. It’s inadvertently more beneficial.
          At this point, I feel it’s a moot issue as USC research (April 20) independently puts the number at 4%, as is expected. Dutch study at 3%. I think at this point, we should gear policy making with the 1/1000 CFR as a maximum in mind. Still a large number of deaths. If we debate these numbers any further, we will have bigger problems than the virus.

          • Andrew says:

            Question:

            It’s a math thing. Actually, it’s a famous probability example. If the underlying prevalence rate is 50%, then, yes, both sorts of testing errors are important. But if the underlying prevalence rate is 1% or 2%, then the specificity is what’s important. It’s counterintuitive, but if you work out the probabilities, you’ll see the issue. Or you can look at the numbers in the above post: if the test has a 98.5% specificity, then you’d expect a rate of 1.5% of positive tests, even in the absence of true positives. The study reports specificity estimates of 99.5% and 100%, but a specificity of 98.5% or lower is also consistent with their data.

            Also, the USC study is not independent. It seems to be done by the same research group. The data are new, but they could be using the same flawed statistical methods.

            As I wrote in my above post, the Stanford/USC team could be correct in their substantive claims. What I and other critics are saying is that there’s a lot of uncertainty: their data are consistent with their claims, but their data are also consistent with much lower prevalence levels.

            The researchers should release their raw data so that more people can take a look. The problem is too important to do anything otherwise.

            • Zhou Fang says:

              Andrew: I’ve posted on the other article but now that I think about it, Question actually brings up a good point.

              If the test here is *specifically selected* from a choice of several by Stanford on the basis of having (apparently) good specificity, i.e. this 2/371 result, then everyone’s calculations would actually be excessively generous, no? I’m not sure how to do the maths here, but the interval for the false positive rate, if this is merely the best of, say, 9 or so competing identically behaving tests, would be significantly wider, right?

            • Navigator says:

              I am aware that specificity is more important in low prevalence.

              However, If we really want to assess whether re-infection is possible, or whether there is herd immunity to this thing, we’d ask for more sensitivity in tests, rather than specificity.

              I believe there were some early reports of positive cases that turned negative (second test) and then again turned positive (third test). I just remember the gist of it being uncertainty whether it was a high false negative rate or re-infection.

              Not sure if that was for the antibody test though, but this post reminded me of it.

              Thanks

              • David Winsemius says:

                If you want to investigate the potential for reinfection, you should be looking for “neutralizing antibodies”. Those are antibodies which are sufficient to trap and immobilize the virions. The methods require a different sort of substrate than just a viral protein to bind to.

            • tim says:

              I wrote a graphical tool to help visualize the non-intuitive relationships between predictive values (what we care about when interpreting tests) and the underlying sensitivity and specificity. In particular how “pretty accurate” tests yield very misleading results when applied to low prevalence populations.

              I think it’s clear that the results of the study WAY overestimate the true infection rate. When the false-positive rate of a test (one minus the specificity) is similar to, or higher than, the prevalence of the disease, then most of the positive tests are false.

              https://sites.google.com/view/tgmteststat/home

          • Zhou Fang says:

            Buddy.

            Firstly, they didn’t get 100% accuracy both times.

            Generate some binomial random variables with size 401 and probability 0.015 and see how often you get 2 or fewer false positives. Getting two errors with 401 samples is not evidence the error probability is less than 0.005.
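            For example, here is a quick sketch of that simulation (numpy assumed):

```python
# Simulate the validation experiment: 401 pre-covid samples with a true
# false-positive rate of 1.5%; how often do 2 or fewer come back positive?
import numpy as np

rng = np.random.default_rng(0)
false_positives = rng.binomial(n=401, p=0.015, size=100_000)
print((false_positives <= 2).mean())
# comes out around 0.06: even with a 1.5% false-positive rate, a validation run
# this clean happens about 6% of the time, so 2/401 hardly rules it out
```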

            The false negatives don’t matter because a correct estimate of the prevalence here would be something like (between 0 and 1) x 1.5% x (between 1 and 1.4). The first factor, corresponding to false positives, is *far more important*.

            > If the selection was biased toward people who may have been exposed due to whatever, that would be massively overshadowed by the number of people induced by media panic. It’s inadvertently more beneficial.

            No? Even if you think people induced by media panic are somehow less likely than a randomly selected person to have the virus, you require a small number of people from an enriched population of people with known contacts or symptoms to create 50 excess positives.

            The USC research is not independent. Same author. Same test. Probably same method.

            > I don’t see why people are so dismissive of this, it’s like they want Stanford to be wrong. (No, I don’t have any affiliation).

            It’s because these people who have an affiliation actually studied statistics.

          • VM says:

            The sensitivity would be important if the data showed a much higher % of tests positive. In this study it’s the specificity that throws the study completely off.

            > USC research (April 20] independently puts the number at 4% as is expected. Dutch study at 3%.

            That’s irrelevant, the % infected is obviously going to be different in different parts of the world.

            > I think at this point, we should gear policy making with the 1/1000 CFR as a maximum in mind. Still
            > a large number of deaths. If we debate these numbers any further, we will have bigger problems than
            > the virus.

            This sounds like a political agenda rather than an interest in discussing the validity of this study.
            Everyone is entitled to their own opinion, but everybody is not entitled to their own facts.

          • Rod Jackson says:

            With reference to the Dutch study also estimating a prevalence of 3%, the Covid-death rate in the Netherlands is about 230 deaths per million total population, whereas it is about 40 deaths per million in California. Applying a 1/1000 CFR to the Dutch deaths would give a prevalence of 23%, not 3%. The Dutch study would suggest a CFR closer to 1/100 than 1/1000. Based on the Dutch results, the Santa Clara prevalence would be about 0.4% rather than 4%. Therefore, I think it would be a major mistake to ‘gear policy making with the 1/1000 CFR as a maximum in mind’ if it could be out by a factor of 10.

        Also, as to the death rates: it comes down to the fact that the CFR estimate is an estimate of the death rate for people who get the symptomatic form of the disease. We have evidence that the asymptomatic form could be ~50% or so, but none of that matters a lot. The symptomatic cases doubled every ~3 days, and the CFR was ~5% among that group. So if you want to know how many people will die and you can estimate the growth of the symptomatic fraction… you can do ok by taking the symptomatic group and multiplying by 0.05… It’s easy to make that number be in the millions for the US.

        If ~ everyone got the disease, to keep the deaths below 1M you’d have to have about 93% asymptomatic. 330M*(1-.93)*.05 = 1.2M

        I think 93% asymptomatic is just pure wishful thinking, and that 1.2M deaths in the US is HUGE.

        so, while we really need prevalence estimates, they won’t really change the decision making.

        Inevitably, there’s no low-work way out of this. We have to do the South Korea thing, or something more or less like it.

      • Andrew says:

        Question:

        You write, “A follow up test identified 100% of negatives . . . it shows that people who don’t have it are correctly identified. You are negative you don’t have it .”

        That’s the assumption of 100% specificity that’s discussed in the above post. As I wrote, I think the paper would’ve been a zillion percent better if they’d prefaced their claims with, “We believe that the specificity of the test used in this study is between 99.5% and 100%. Under this assumption, we conclude . . .”

      • Prakash Nayak says:

        100% specificity and 93% sensitivity is an index of false negatives and false positives respectively . So if your test is negative, you do not have the disease. But if your test is positive, there is a 7% chance it’s a false positive. Of course the true probabilities change based on prevalence. If none from a population of 100 have disease, 7 tests could still wrongly be positive (false positive). If all 100 had disease, with a 100% specificity, no case would falsely test negative.

        • Martha (Smith) says:

          An assumption of 100% specificity seems unrealistic. I don’t understand how someone could rationally make this assumption (except perhaps for didactic purposes).

        • Michal says:

          You’re first confusing specificity with sensitivity, and then both of them with positive/negative predictive value. 100% specificity gives you full confidence in a positive result, not in a negative one. But even with a negative result, your post-test probability of not having the disease is not 7% – it depends on the disease prevalence.
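          To make the distinction concrete, here is a toy sketch (the prevalence values are made up; the 93% sensitivity and 100% specificity are the figures from the comment above):

```python
# Positive and negative predictive values for a test with 93% sensitivity and
# 100% specificity, at a few hypothetical prevalence levels.
def predictive_values(prevalence, sensitivity, specificity):
    tp = prevalence * sensitivity                 # true positives
    fn = prevalence * (1 - sensitivity)           # false negatives
    fp = (1 - prevalence) * (1 - specificity)     # false positives
    tn = (1 - prevalence) * specificity           # true negatives
    ppv = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    npv = tn / (tn + fn)
    return ppv, npv

for prev in [0.01, 0.10, 0.50]:
    ppv, npv = predictive_values(prev, sensitivity=0.93, specificity=1.00)
    print(f"prevalence {prev:.0%}: PPV = {ppv:.3f}, NPV = {npv:.3f}")
# PPV stays at 1.000 (no false positives), but NPV drops from 0.999 to 0.935
# as prevalence rises: the post-test probability depends on prevalence
```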

      • Mike Patterson says:

        For what it’s worth, that’s also how I’m reading this.

    • Caio Teles says:

      My first thought about this was that people who are more concerned about the epidemic would be more interested in participating in this study. And these people are the ones who take more care to avoid getting infected. Probably, the people who are getting infected the most are the ones who don’t care about this subject.

    • AW says:

      As a study participant, I concur. I was highly motivated to participate as I had Covid-type symptoms in Feb. I know of other participants who also did everything they could to get in for the same reason.

      The way study access was gated via questionnaire probably also led to incorrect demographic information being provided. I suspect people answered falsely in order to get into the test.

      • Martha (Smith) says:

        Worthwhile information. So many things study designers don’t foresee. We need a compendium of such things. And study designs need to be read and critiqued by a variety of people. Takes a village to plan a good study.

      • LemmusLemmus says:

        How would you change your information in order to get into the study? That would presuppose that you know which kinds of people they’re still looking for, right?

  2. Zhou Fang says:

    Is there a good resource to refer to on the Agresti-Coull interval, or should I just dive into the original paper?

  3. Anoneuoid says:

    > People who suspect they may have had the virus will want to know and will volunteer for these studies at many times the rate of the general population.

    I know a lot of people who think they had it, seems like that could actually be a normal cross section of the population. Also, these results don’t exist in a vacuum:

    https://www.boston25news.com/news/cdc-reviewing-stunning-universal-testing-results-boston-homeless-shelter/Z253TFBO6RG4HCUAARBO4YWO64/

    https://www.bloomberg.com/news/articles/2020-04-11/false-negative-coronavirus-test-results-raise-doctors-doubts

    Of course, maybe all these tests just suck. I’d like to know the results in only the nondyspneic people showing up with SpO2 under 90 rather than people with flu-like symptoms.

    • Steve Sailer says:

      My guess would be that the people most likely to participate in this drive through test are people who are pretty cautious about infections but who are also out and about.

      In contrast, people who are adamantly Holed Up for the Duration would be reluctant to participate in the in-person part of the study because of fear of infection. At the other extreme, the What-Me-Worry-Just-the-Flu-Bro types probably wouldn’t bother with waiting in line to be tested.

      So I wouldn’t be surprised if the survey lucked into a pretty representative sample.

      But I also wouldn’t be surprised if the sample were highly skewed.

  4. D Kane says:

    I understand concerns about the data. But what about the code? Seeing that would clear up a lot of issues that you (rightly!) raise. I will ask them for the code and report back.

    • Zhou Fang says:

      I think the code to me is less exciting than e.g. how exactly they advertised and recruited for this study, whether the advertisement gave an impression of allowing worried people to get tested. Also stats on ad impressions. If everyone who saw the ad went and got tested then that would go some way to assuaging my doubts on the self selection problem.

    • Eric says:

      For those who haven’t had to go through an IRB to get human research approved, being able to release data is a problem. When reviewing articles for publication, I almost always ask why the data has not been provided or released and the answer is almost always their IRB didn’t approve it. Here is why…

      When human subjects are involved in a medical trial, there are 18 “identifying” fields that if removed will provide a safe harbor for the researchers. Those 18 are: names, geographic divisions smaller than state (with some rules about ZIP codes), all dates (other than year) or all ages over 89, telephone numbers, vehicle IDs, fax numbers, device IDs, email addresses, URLs, SSN, IP addresses, medical record numbers, biometric identifiers, health plan beneficiary numbers, full-face photographs, account numbers, any other identifying characteristic, or license numbers. So using the whole 5-digit ZIP code is *not* allowed (even if those ZIP codes have lots of people). At a minimum, only the first-3 digits can be used and even then only if that area has at least 20,000 people. Additionally, reporting ages over 89 years old isn’t allowed. Everyone 90 and above needs to be grouped together. For a disease that disproportionately is more severe for the elderly this would be a large source of bias.

      https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html

      My suggestion is to give these researchers a break for not providing their data (like the comments are doing) — getting data released in a form that would allow replication of any useful MRP analysis to be performed would likely involve negotiation with their IRB, probably a change to the consent, and an “expert determination” that releasing some of these 18 types of identifiers isn’t risky (which would produce liability for that expert should the data become identified). It’s unfortunate, but that’s the way things are for now.

  5. D Kane says:

    Got a form e-mail response from : “Thanks for your message. I am unavailable for inquiries for another 1-2 weeks. I will try to return your message at that time, or try me again at that time.” Anyone know one of the other authors? Who do we think actually wrote the code?

  6. John says:

    A couple days ago, I tweeted some concerns re: the confidence intervals that mostly just restate the technical issues Prof. Gelman posted above, but if people are interested, I also posted some code to bootstrap the study prevalence and get more accurate confidence intervals. Note that I can’t include any of the reweighting since the study authors didn’t publish that data. Primary takeaway is that it’s difficult to rule out the possibility that nearly all of the positives in the study are false positives given the specificity data they rely upon.

    https://github.com/jjcherian/medrxiv_experiment

    There’s also more code to run a parametric bootstrap from another person who’s given this analysis a shot here: https://gist.github.com/syadlowsky/a23a2f67358ef64a7d80d8c5cc0ad307

    Hope this is helpful!

    • Andrew says:

      John:

      Thanks. Just to clarify, I think the best way to attack this problem is through a Bayesian analysis. Classical confidence intervals are fine, but they kinda fall apart when you’re trying to combine multiple sources of uncertainty, as indeed can be seen in the Bendavid et al. article. I just did the classical intervals because that’s what they did, and it was the easiest way to see what went wrong with their analysis.

      • John says:

        Thanks for clarifying! I think that makes sense for the non-parametric bootstrap (which has to treat these sources of uncertainty as essentially independent and seems like it underestimates the final uncertainty as a result), but I don’t think I understand why the parametric bootstrap fails here? It seems very similar conceptually to the Bayesian setup.

        • I don’t think Andrew was saying your stuff fails here. He’s just saying that in his article he stuck to classical intervals because they became comparable to the ones in the original article.

          Your stuff expands on that to handle more kinds of uncertainty. I think the Bayesian method would give the fullest picture, but I don’t know how different it would be from the bootstrap picture. The bootstrap can sometimes be seen as an approximation to the full Bayesian version.

          • John says:

            Gotcha! Sorry I think “fails” maybe carries an implication that I didn’t intend. I’m just curious about how to think about differences between the parametric bootstrap and the Bayesian modeling approach. Found this B. Efron paper that specifically addresses this subject (https://arxiv.org/pdf/1301.2936.pdf), so I’m hopeful that reading this will resolve my questions. Thanks!

          • > The bootstrap can sometimes be seen as an approximation to the full Bayesian version.
            Well, Brad Efron wrote a paper or two on that 5 or more years ago, arguing that the bootstrap automatically creates its own non-informative prior.

            The right bootstrap that is – which to me would be a lot more tricky to sort out than just defining an appropriate prior…

  7. Justin Pickett says:

    Excellent write-up.

  8. Ethan Steinberg says:

    I think this study is actually a great use case for Stan to really show how the uncertainty propagates from their tests to their final estimate. In particular, it’s important to note how there is a highly non-linear and non-symmetric effect, in that a slightly lower specificity really tanks the estimated prevalence. https://colab.research.google.com/drive/110EIVw8dZ7XHpVK8pcvLDHg0CN7yrS_t shows my attempt at modeling this.

    Note that there is a high density in the posterior near 0% for prevalence.

    • Andrew says:

      Ethan:

      +1. Also you can include a latent variable to represent selection bias in the survey respondents.

      • Ethan Steinberg says:

        Another thing that’s potentially worth thinking about is that there might be selection bias in the survey respondents not only for increased COVID-19 antibodies, but also for antibodies to the flu (or other respiratory infections). I’m not sure of the exact dynamics of the antibody test, but that would imply that the sample might be enriched for false positives as well, if the test has a higher error rate on people who have had other viral infections.

        • Anoneuoid says:

          Lots of false negatives in these tests, even in severe cases:

          Since rRT-PCR tests serve as the gold standard method to confirm the infection of SARS-CoV-2, false-negative results could hinder the prevention and control of the epidemic, particularly when this test plays a key reference role in deciding the necessity for continued isolated medical observation or discharge. Regarding the underlying reasons for false-negative rRT-PCR results, a previous published study suggested that insufficient viral specimens and laboratory error might be responsible (3). We speculated from these two cases that infection routes, disease progression status (specimen collection timing and methods), and coinfection with other viruses might influence the rRT-PCR test accuracy, which should be further studied with more cases.

          False-negative rRT-PCR results were seen in many hospitals. By monitoring data collected at our hospital from January 21 to 31, 2020, two out of ten negative cases shown by the rRT-PCR test were finally confirmed to be positive for COVID-19, yielding an approximately 20% false-negative rate of rRT-PCR. Although the false-negative estimate would not be accurate until we expand the observational time span and number of monitored cases, the drawback of rRT-PCR was revealed.

          https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7082661/

          But also they are using the rt-pcr results as the gold standard despite these issues? This doesn’t make much sense.

          • Craig Kaplan says:

            They are specifically using a serology test for antibodies. One complication is that the test can look for two antibody types, and they didn’t really talk about the two types or break down the results by type. The RT-PCR tests are different: they test for the presence of genetic material of the virus, rather than for prior exposure to the virus.

            • Anoneuoid says:

              True, sorry for not being clear. I don’t trust *any* of these tests. I think if you look with equal effort at pretty much any of the studies of the general prevalence of either antibodies or viral RNA it will be problematic. Eg, here is that Iceland study (from the supplement):

              Specificity of the WHO recommended assays were assessed against a number of known viruses, including alphacoronaviruses, non-asian strains of betacoronaviruses, influenza and MERS. No cross-reactivity was observed.
              […]
              Validation of the RNA extraction and the qRT-PCR method(s) at deCODE was performed using 124 samples that had previously tested positive (n=104) or negative (n=20) with the qRT-PCR assay at LUH. All of the negative samples tested negative at deCODE and 102 of the 104 positive tested at LUH were also positive at deCODE. Two samples that tested positive at LUH were negative at deCODE. Upon subsequent sequencing (see below) viral genome could not be detected in these two samples, probably because very few viral particles were present. Samples from 643 individuals that tested positive using either the deCODE or the LUH qPCR assays were also submitted for viral genome sequencing (see below). Viral RNA (cDNA) from six samples (0.9 %) yielded no sequence data mapping to the viral reference genome. The success of generating sequencing libraries with good coverage is highly dependent on the amount of viral RNA in the samples as assessed by the Ct values from the qRT-PCR assays. Figure S2 shows the relationship between measured Ct values and the consensus coverage of the sequenced samples. These data show that the qRT-PCR assay is more sensitive in detecting viral RNA than the amplicon sequencing method.

              https://www.ncbi.nlm.nih.gov/pubmed/32289214

              We see 20/20 (100%) negative and 102/104 (98%) positive qPCR tests replicated in different labs. Also, 637/643 (99%) positive tests could be confirmed against the gold standard (genome sequencing).

              So, ok, sensitivity is observed positives over all true positives. True positives we don’t know, since they only sequenced 643 samples with positive qPCR results but none of the negative ones. Specificity is observed negatives over all true negatives. They didn’t assess true negatives at all and just assume the WHO’s data from cell culture corresponds to their samples from humans.

              What they needed to do was sequence a bunch of samples to determine which were true positives/negatives, then tell us the proportion of each that tested positive/negative on PCR.

              • I don’t think you can sequence a raw sample. You need to do the PCR to amplify before you can sequence. The “gold standard” part is that you sequence the PCR amplicon and determine that it was in fact the viral one and not some other fragment that happened to be similar enough to amplify.

                There’s too little RNA to take just a swab and sequence it. I think.

              • Zhou Fang says:

                I think it’s fairly likely that real-world false negatives with the PCR test would be dominated not by some methodological issue but by viral RNA just flat out not being present in the sample. It only takes a few hundred to a few thousand virions to infect an individual, and maybe none of them happened to be up the victim’s nose.

              • Craig Kaplan says:

                Yes, the PCR assay is extremely sensitive. The major issues are methodological in sample gathering. For a virus that infects the lower respiratory tract, the gold standard is an expectorated sample, and these are much more dangerous to obtain than a nasal swab. So false negatives in PCR, where most versions of the test are sensitive down to a few molecules, arise at the sampling stage. For the antibody tests, it is discussed that false positives could be due to detection of other antibodies, for example to other coronaviruses. In a population that has seen widespread exposure, it is likely the antibody tests will be decent. I’ve seen a claim in a forwarded email suggesting these authors could have additional control tests that might improve the estimation of the false-positive rate. However, the case that these authors would like to make, that we can potentially expect many fewer deaths, has sort of sailed. There will be many deaths, and they will be above some threshold that moots their earlier speculation.

              • Anoneuoid says:

                > The major issues are methodological in sample gathering.

                Yes, I’d be more worried about stuff like this:
                https://www.seattletimes.com/seattle-news/health/uw-medicine-halts-use-of-coronavirus-testing-kits-airlifted-from-china-after-some-had-contamination/

                Or that the tests aren’t actually very predictive for the thing killing people because there is a coinfection, etc.

                That is why they need to test their actual methodology against a gold standard.

      • Clyde Schechter says:

        How would you do that? What am I missing here? What aspects of the data would relate to the latent selection bias variable so as to help identify it and its effects on serology? The selection here is clearly far from completely random, nor does it seem to be random conditional on observed variables. In fact, other than age, I don’t see anything in the data that even looks slightly informative about this. Other than a strong prior, what would identify the selection latent variable and its effects?

        • Andrew says:

          Clyde:

          Yeah, when I say “include a latent variable,” I mean that without additional data you’d need to make assumptions about that latent variable. The idea is that you’d specify some plausible range of selection bias, and this would have the effect of adding uncertainty to your estimate.

          • Marm Kilpatrick says:

            I don’t think it’s possible to usefully specify a “plausible” range of selection bias without data. A plausible range without data would include the possibility that seropositive individuals were 10-100-fold more likely to reply to the ad. That would reduce the prevalence estimate to being essentially meaningless on the lower end (at 100x, it would include the # of confirmed cases of COVID-19 based on swab tests for RNA), just like the uncertainty in the specificity. They have the fraction of participants who had COVID-19 symptoms. If they (or someone else) had data on what fraction of a random sample of the population had COVID-19 symptoms they could properly adjust for bias in recruitment. Without it, I don’t see how it’s possible to do something meaningful with a latent variable.

            • Andrew says:

              Marm:

              It’s hard for me to believe it would be a factor of 100! But if 100 is possible, then it’s possible, and indeed you’ll get a really wide interval—that’s the way it goes. As you say, they do have some information on symptoms, and that could inform their estimates. Otherwise you just have to make some hypotheses and go with that.

    • Andrew H. says:

      This is an excellent teaching example for Stan. I’ve never used it before and it’s really easy to read the code and see how it’s modeling this problem. I have one really stupid question. What does the “b” mean in “prevalence ~ b”? Is that a shorthand for the uniform distribution?

      • Ethan Steinberg says:

        Ah sorry, not sure how the code got messed up. Fixed now.

        (I think I accidentally tried opening this on a phone yesterday and accidentally edited the text.)

        • Andrew H. says:

          Ah gotcha. I was thinking about this a little more and I think the model is not exactly right. Shouldn’t the final number of tested positives be a sum of the true positives and false positives, both of which are binomials? In other words, right now the code has:

          num_community_positive ~ binomial(num_community, (fp_rate*(1 - prevalence) + (tp_rate*prevalence)));

          but I think the correct computation (this isn’t valid Stan code) would be:

          num_community_positive ~ binomial(num_community*prevalence, tp_rate) + binomial(num_community*(1-prevalence), fp_rate);

          And I don’t *think* those are equivalent? And if they’re not I’m not sure this is possible with Stan. There’s the issue with num_community*prevalence not being an integer, but more importantly it kind of seems like there isn’t a way to do a sum of binomials: https://discourse.mc-stan.org/t/sum-of-binomials-when-only-the-sum-is-observed/8846

          • Ethan Steinberg says:

            I think that section is correct, as a mixture of binomials is equivalent to a single binomial. The easiest way to show that is to show that a mixture of bernoulli distributions is equivalent to a single bernoulli where the bernoulli parameter p = first_bernoulli_fraction * first_bernoulli_p + (1 - first_bernoulli_fraction) * second_bernoulli_p. The proof of that equivalence is that P(y = a | combined bernoulli) = P(y = a | mixture) for all a in {0, 1} (the domain of a bernoulli distribution).

            Once you have convinced yourself that a mixture of bernoullis is equivalent to a single bernoulli, you can then extrapolate to the binomial case by considering each independent draw separately. I might have missed something here (and if you have a nice counter-example, please do let me know!)

            It’s not possible to model this in Stan (due to a lack of support for integer parameters), but I think the way you could think about modeling this is the following:

            num_community_antibodies ~ binomial(num_community, prevalence)
            num_community_no_antibodies = num_community - num_community_antibodies;
            num_community_tp ~ binomial(num_community_antibodies, tp_rate);
            num_community_fp = num_community_positive - num_community_tp;
            num_community_fp ~ binomial(num_community_no_antibodies, fp_rate);

            This would require two more parameters, num_community_antibodies and num_community_tp.
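
            If it helps, here is a quick simulation check of the equivalence claim (my own sketch, with made-up round numbers for the rates):

            import numpy as np

            rng = np.random.default_rng(0)
            n, prevalence, tp_rate, fp_rate = 3330, 0.015, 0.80, 0.005  # illustrative values only
            B = 200_000

            # Two-stage draw: latent number with antibodies, then each group goes through the test.
            n_anti = rng.binomial(n, prevalence, size=B)
            pos_two_stage = rng.binomial(n_anti, tp_rate) + rng.binomial(n - n_anti, fp_rate)

            # Single binomial with the mixture probability.
            pos_mixture = rng.binomial(n, prevalence * tp_rate + (1 - prevalence) * fp_rate, size=B)

            for x in (pos_two_stage, pos_mixture):
                print(x.mean(), x.std())

            The two agree; a tighter spread only shows up if you fix the number of antibody-positives at exactly n*prevalence instead of drawing it from a binomial.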

            • Andrew H. says:

              I have to think about the theory more, but I did a simulation in Python, picking point values that were close to the ones in the paper, with 1 million trials, and the distributions weren’t identical. The sum of binomials had a tighter 95% CI than the single binomial.

              https://colab.research.google.com/drive/1gkoL-L7X3WD_YO0zPcfj2P-XmwHclwRZ

            • Andrew H. says:

              Seems like we may have hit the response depth limit. In response to your comment below… doh! I actually thought about that and tried it, but made a dumb coding error. It also makes more sense as a model – the prevalence we care about is not the true # of infected within the study’s sample, but the prevalence of the population from which the sample was taken. Of course this still doesn’t take into account sampling biases.

              Thank you so much!

      • Peter says:

        I think that’s just an error, as the line needs a trailing semicolon to be syntactically correct in Stan. I assume the statement was something like “prevalence ~ beta(1,1);” and then the author accidentally deleted a portion of it.

  9. Kostya says:

    > At the time of this writing, NYC has about 9000 recorded coronavirus deaths. Multiply by 600 and you get 5.4 million. OK, I don’t think 5.4 million New Yorkers have been exposed to coronavirus. New York only has 8.4 million people total! I don’t think I know anyone who’s had coronavirus. Sure, you can have it and not have any symptoms—but if it’s as contagious as all that, then if I had it, I guess all my family would get it too, and then I’d guess that somebody would show some symptoms.

    I very much agree with Prof. Gelman’s takeaways in this post, and appreciate the dive here into this very high-profile and seemingly misunderstood piece. That said, I’m not sure the above follows. The premise of the 50-85x undercounting conclusion the authors draw is that the vast majority of cases are asymptomatic (or sufficiently mildly symptomatic that people don’t identify the symptoms until after they’ve had a positive test result). So I can conceive of 65% of New Yorkers having had this in that scenario, if >90% of cases are functionally asymptomatic.

    Regardless, New York City has identified an additional 3,700 likely COVID deaths (https://www.nytimes.com/2020/04/14/nyregion/new-york-coronavirus-deaths.html), so they’re well above 10K now. The implied IFR in this study is about 0.1%, which as applied to the NYC population would suggest roughly a 100% attack rate in NYC. While there is no single IFR, it’s tough to imagine more than 50% of New Yorkers have been infected, given that PCR testing (which was limited to people presenting with symptoms) was only showing about 50% positive results in New York at any given time. (And yes, it’s possible for 80% of people to have had COVID with no more than 50% testing positive at any given time, but we’re getting into very unlikely territory.)

    • Tim says:

      The concern with false-positive rates never seems to carry over to the tests used to diagnose case fatalities in cities and states. The comment above cites the additional 3,700 deaths reported by the NYT; here is the headline:
      The city has added more than 3,700 additional people who were presumed to have died of the coronavirus but had never tested positive.

      So no concern about whether the patients actually had covid-19? Kind of important when using studies like the one discussed here to project state-wide and nation-wide mortality rates based on the numbers provided by the states. At least the authors here conducted their own testing and have control of the data and the manufacturer’s error rates.

      • David Winsemius says:

        Have you tried looking up the company that was cited? I went to the CDC website and could not find anything there about even testing done under the emergency regulations for Premier Biotech, Minneapolis, MN, much less a full filing.

      • Stephen M St Onge says:

        >So no concern about whether the patients actually had covid19?

        This is not true, though the misconception is common. In one of the CDC examples of how to code death certificates, we have an example of an 86-year old female, non-ambulatory for 3 years after a stroke. She is exposed to a family member (a nephew, for the sake of argument) who has covid-19 symptoms and subsequently tests positive for covid-19 infection. She develops covid-19 symptoms herself, but refuses to go to the hospital and is not tested. She dies of “acute respiratory illness” after five days. But given her exposure, the “acute respiratory illness” is listed as being caused by PROBABLE covid-19 infection.

        IMAO, this is quite reasonable. She has no history of respiratory problems in the years since her stroke, and she has no known risk factors for respiratory problems except the nephew whose infection has been confirmed. Could conceivably be something else, and testing would have been desirable, but the idea she just happened by coincidence to develop covid-like symptoms after being exposed to a carrier is hard to credit.

        https://www.cdc.gov/nchs/data/nvss/vsrg/vsrg03-508.pdf, p. 6.

        • Mendel says:

          +1

          The assumption “deaths that are attributed to Covid-19 without laboratory confirmation are not actually caused by Covid-19” regularly implies that medical examiners have less common sense, medical experience, and knowledge of the patient than the person proposing the assumption, or else it shades into straight-up conspiracy theory.

  10. NL says:

    The supplemental materials have some detail on the re-weighting and the variance estimation:
    https://www.medrxiv.org/content/medrxiv/suppl/2020/04/17/2020.04.14.20062463.DC1/2020.04.14.20062463-1.pdf

    Maybe it’s just missing from the formula, but their description under “Delta method approach to variance measurement” looks like it just doesn’t include any means for incorporating uncertainty in the sensitivity/specificity estimates into the estimate of the standard error. The standard error would be exactly the same regardless of whether they used 10 tests or 1000 for validation. Is that normal?

    • kjrc says:

      You are correct, they do not take into account any uncertainty on the specificity or sensitivity in their analysis. They only take into account the sample variance. This is one of the unfortunate flaws with their confidence interval estimates.

      Another is that the variance is actually calculated incorrectly, as when they reweight their q values (fraction of positive test results) by a scale factor they do not account for the fact that Var(a*x) = a^2 Var(x), and report it as a*Var(x). Not explicitly, but if you cross check their numbers you’ll find this to be the case.

      As John Rushton correctly pointed out, the reweighting of the q values directly (rather than the prevalence, which they call pi) is also highly problematic. They only apply the effect of false positives and negatives *after* doing the demographic reweighting. Which I don’t think is sensible in this case. It suggests that the point estimate of the specificity and the sensitivity of the test are only appropriate for the demographic distribution of Santa Clara county.

      There are other issues with the analysis, but I think the picture is clear. It would have been really nice to see more careful work done on this kind of test, which could be very informative. I applaud the authors for putting in the effort to start these much-needed serological studies; hopefully continued efforts will provide more reliable results as we move forward.

  11. Robert Kubinec says:

    For this to be credible, there needs to be a behavioral selection model accounting for the utility of the free test offered to people (as you noted). Such a selection model could then account for the value of the test to different populations and backwards-adjust the estimates. Simple MRP won’t work here, as it will only adjust the selected estimates by demographic categories; i.e., we know the estimate conditional on utility obtained, but not the estimate marginal of utility obtained.

  12. Shiva Kaul says:

    Their prevalence CIs do not include zero because of their sampling model. (Some have hypothesized that the delta method is to blame for the seemingly-wrong CIs, but I did some numerical checks, and I don’t think it’s as big of an issue as the following). They reweight their biased dataset to match the demographics of Santa Clara, thereby boosting the number of positives. There are two ways to assess FPR/FNR on the reweighted data:

    1. Increase the FPR/FNR by the maximum coefficient of reweighting. (So, if we upweighted a positive datum by 2, then we would have to multiply FPR by 2.) This is rigorous, but makes inferences difficult. (I agree with this approach.)

    2. Just use the original FPR/FNR on the reweighted data, imagining it was a fresh random sample. It seems this is what they did. There may be an argument for this approach, if the antibody test performance has the appropriate relationship with their reweighting (i.e. if they are not upweighting data prone to FPs). However, I don’t see that argument in the current version of their paper.

    Another separate issue is: their expressions for variance of the estimates of sensitivity/specificity seemingly do not depend on the sample sizes (of the manufacturer and local validation sets.) So, if they had 100,000 validation samples rather than 400, their confidence intervals seemingly wouldn’t change. Missing factor of n somewhere?

  13. Joerg Stoye says:

    I would like to offer a technical conjecture about what went wrong in the statistical analysis. (I already communicated this concern to the authors.) As already pointed out on this page, the test’s false positive rate is estimated at 2/401 = 0.5%, but with a 95% confidence interval extending upward to 1.9%. By the authors’ own admission on page 2 of their statistical appendix, this means we cannot reject the hypothesis of zero prevalence. Note (this will become important): because the binomial distribution is not well approximated by a normal here, the CI must be constructed as an exact binomial interval, not by normal approximation. The authors do this and correctly report 1.9%. If they had mistakenly used a normal approximation combined with the sample variance, their CI would have extended only to 1.2%, and zero prevalence would have been spuriously rejected.
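
    For anyone who wants to check this, both intervals are easy to reproduce (a quick sketch; the exact interval here is Clopper-Pearson):

    from scipy import stats

    k, n = 2, 401            # false positives among the pre-COVID validation samples
    p_hat = k / n

    # Exact (Clopper-Pearson) 95% interval via beta quantiles
    lo = stats.beta.ppf(0.025, k, n - k + 1)
    hi = stats.beta.ppf(0.975, k + 1, n - k)

    # Normal approximation with the sample variance
    se = (p_hat * (1 - p_hat) / n) ** 0.5
    lo_norm, hi_norm = p_hat - 1.96 * se, p_hat + 1.96 * se

    # The upper limits differ noticeably: roughly 1.8-1.9% exact vs. roughly 1.2% normal.
    print(f"exact:  ({lo:.4f}, {hi:.4f})")
    print(f"normal: ({lo_norm:.4f}, {hi_norm:.4f})")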

    So here is my concern: the authors subsequently use the delta method to analyze error propagation. This implicitly applies normal approximations. Indeed, the analysis culminates in providing standard errors, and these are only interpretable in the context of a normal approximation. I therefore worry that the spurious rejection alluded to above affects the later part of the analysis. The earlier conclusion that 0 cannot be rejected seems to me to be the appropriate one. (Of course, should the propagation analysis have happened post reweighting, then all bets are off anyway.)

    At the very least, there is a striking discord between the headline results, including a CI for specificity given in the paper that includes one minus their unadjusted empirical positive rate, and the following passage from the statistical appendix: “There is one important caveat to this formula: it only holds as long as (one minus) the specificity of the test is higher than the sample prevalence. If it is lower, all the observed positives in the sample could be due to false-positive test results, and we cannot exclude zero prevalence as a possibility.”

    Disclaimer: I emphasize that I’d be happy to stand corrected, I appreciate the data collection work, and I also appreciate that the paper was put together under insane pressure.

    • Ned Loomis says:

      Interesting. I have a feeling you will get an “explanation” rather than a mea culpa.

      • Joerg Stoye says:

        Update: I may stand corrected in that I may have been too charitable. A comment further down argues that, in the error propagation calculations, they uncritically took the 2/401 false positives from the validation data as ground truth. I don’t have time to check, but I have a sinking feeling it might be so. Just to clarify things for non-professionals who try to make sense of this thread: that analysis contradicts mine in detail (though not in qualitative conclusion) and would imply that I’ve charitably misread the paper. (Which is possible, because I did kind of stop digging after seeing the switch to the delta method.)

  14. Colin says:

    As someone who has tried to do survey sampling through Facebook ads before, there’s another – more pernicious – element to the selection bias they faced.

    Depending on how the ads are targeted and with what goal (e.g. minimize cost per impression, click, or conversion), the algorithm will be constantly adjusting who is served the ad based on who has responded. This can be demographic-driven, but also based on their “interest” profiles as well.

    This can substantially amplify the impact of self selection or introduce other additional selection biases.

    • Martha (Smith) says:

      Worthwhile information to pass on. Thanks.

    • BoseQC35 says:

      Fascinating. Important! Thanks for this info.

    • Ed L says:

      Can confirm. Facebook uses machine learning in their ads to maximize responses. And they use thousands of features in the ML. From likes of certain posts, to visiting third-party sites with facebook cookies installed.

      This is great for marketers trying to maximize sales, not so great for getting a randomized sample across a certain demographic for a scientific study.

      Even if the researchers were smart enough to not use interest-based targeting in their ads, the ML algorithm would soon learn that people with an interest in covid-19 testing were more likely to click on the ads, and would start distributing the ads only to similar people in order to maximize clicks.

      The use of FB ads also explains the under-sampling of older people, as fewer people who are 65+ are even on the platform or login regularly.

  15. This is an excellent post! I wish more things out there were this thoughtful.

    On: “But Stanford has also paid a small price for publicizing this work, because people will remember that ‘the Stanford study’ was hyped but it had issues. So there is a cost here. The next study out of Stanford will have a little less of that credibility bank to borrow from.”

    This sounds optimistic! (Compared to the default position of most: if it’s Stanford / Harvard, it must be right…)

    • Zhou Fang says:

      I think my more pessimistic extrapolation might be that the reputational damage might be done more to the field than the institution. “Oh this study says this, that study says that, clearly no one knows anything so I’ll go with whatever my preferred answer is.”

      • Andrew says:

        Zhou:

        Maybe you’re right. The extreme case is the post by economist Tyler Cowen who was trashing the entire field of epidemiology, in part based on a paper that was written by . . . an economist! See here for the details.

      • Tom says:

        I have been talking to some of my friends in medical/bio-related fields. While they all agree that this paper is terrible, they seemed generally less concerned with the failure to control for selection bias, which in my opinion is more than sufficient to sink the paper by itself. I’m pretty sure that this is far from the only medical study that doesn’t adequately adjust for selection bias (probably because it is not part of their typical curriculum).

  16. Adi Wyner says:

    Seems as good a time as any to remind people of this excellent paper: https://projecteuclid.org/euclid.ss/1009213286
    I seem to recall that test inversion has the best frequentist properties for coverage in one-sided cases, which is what we care about here. I think the 95% lower limit would be a little higher than Agresti or what is reported in the paper, and the 90% lower limit would be around 99%.

  17. Funko says:

    I think a possible way to get a CI that excludes zero is if the positive cases are clustered.

    Imagine the extreme case where all positive tests came from a single zip area with few participants.

    This should also shift the estimate of your specificity.

    • Stephen M St Onge says:

      If I understand this correctly, it is wrong.

      Suppose all the people who don’t have the disease are in zip codes 95000-95098, while all the positives are in zip code 95099. By hypothesis, most of the people answering are in areas 00-98. Unless specificity very closely approaches 100%, you will always get false positives in those areas. You’d need a specificity in the range of 99.97% to avoid getting false positives in the ‘no infections here’ area.

      But with such high specificity, distribution no longer matters. You’d be extremely unlucky to get as many as five false positives in 3,330 tests. So the fifty positive results would correspond to 45-49 real infections out of the 3,330 tested. With that accuracy, you could depend on the results no matter how the positives were distributed.

      • Mendel says:

        I believe you’ve grasped the idea, but are applying it the wrong way around.
        With a low chance for a low specificity and a low chance for a random sampling to be clustered like that, the chance of both occurring together drops, eliminating that combination from the confidence interval.

        A simpler example would be a throw of a D6 die. Throw the die once, and the low end (1) is well within the 95% CI; throw it twice, and the low end (1+1=2) has p<0.05.

        Having a low specificity and a cluster of positives is just too unlikely to occur together by random chance.

  18. djaustin says:

    Surely a more informative analysis would be: given the data collected and the test performance, what is the posterior probability that at least 10% or even 30% of the population have been infected? Those are numbers on which to base policy. Even for a relatively poor test, that number will be low. This study is really an investigation into the likely order of magnitude for prevalence – 1, 3, 10, or 30%. Exact estimation is the wrong endpoint for the analysis.

    • Andrew says:

      Djaustin:

      One could estimate this with a Bayesian analysis. What you’ll find is a lot of uncertainty: the data are consistent with just about any prevalence between 0 and 5%. The problem is that the study as defined is just not very informative, unless you want to make the strong assumption that the specificity of the test is between 99.5% and 100%.

      • Funko says:

        I don’t think you necessarily have to make that assumption on the specificity; see my comment above.

        Then again, we do not know what the raw data look like…

        • Zhou Fang says:

          Making the assumption on the specificity is what they did, though, see the statistical appendix.

          • Funko says:

            Did they? In the results section they write:

            “Notably, the uncertainty bounds around each of these population prevalence estimates propagates the uncertainty in each of the three component parameters: sample prevalence, test sensitivity, and test specificity.”

            But maybe they actually used the point estimates, I don’t know whether we can tell.

            • Zhou Fang says:

              Yes, look at the final page of https://www.medrxiv.org/highwire/filestream/76776/field_highwire_adjunct_files/0/2020.04.14.20062463-1.pdf

              They use the study specificity as a plug in estimate.

              • Funko says:

                But doesn’t this only mean that they used all the point estimates for the prevalence point estimate, but then they do use the uncertainty in the parameters to estimate the standard error?

              • Zhou Fang says:

                No, they used the point estimates to estimate the standard error. For example, they calculated a variance for false positives *in their study* from s_hat(1-s_hat), treating the test results as binomial outcomes and plugging in 0.995. This implicitly assumes that they *know* the true specificity is 0.995, and their uncertainty estimate is then based on the sampling variability of that in a sample of size 3000+. They don’t consider that the source of the original estimate itself has uncertainty.

              • Shiva Kaul says:

                Their variance estimate is *loosened* by their omission of dependence on the sample size. (Unless I misinterpreted their notation?) In any case, their estimates of specificity and sensitivity are too high going into this variance estimate, because they don’t account for the initial reweighting.

              • Zhou Fang says:

                No, they don’t actually have a variance estimate for parameter uncertainty.

                Essentially, what they did is this:

                They noted that their prevalence is Pi, a function of weighted prevalence, sensitivity and specificity s
                They accepted that sensitivity and specificity can vary in their sample of 3300 due to sampling variability. In *their sample*, not the original ~400 sample
                They plugged in the point estimate of 0.995 or whatever into the variance of a binomial random variable to calculate the variance of specificity in their sample
                Then they used the delta method (based on derivatives of the Pi function) to compute how this affects their prevalence estimate

                In other words they assumed the uncertainty in the estimate resulting from the 400 or so original study was zero.
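
                To put rough numbers on that, compare the binomial standard deviation they do carry through with the one they drop (my arithmetic, using 401 validation samples and 3330 survey participants at the 0.995 point estimate):

                s = 0.995
                sd_validation = (s * (1 - s) / 401) ** 0.5   # uncertainty in the specificity estimate itself, ~0.0035
                sd_in_survey  = (s * (1 - s) / 3330) ** 0.5  # sampling variability within the survey, ~0.0012
                print(sd_validation, sd_in_survey)

                So the component they set to zero is roughly three times larger than the one they keep.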

              • Shiva Kaul says:

                That’s correct, as far as their (mis)application of the delta method is concerned. (I am kind of hesitant to pore through their expressions until they fix the apparent issues.)

                But, concretely, consider the numerical value they derive for, e.g., var(\hat{r}). They are using .67(1-.67) = .22, which is enormous for a Bernoulli parameter.

              • Zhou Fang says:

                Shiva Kaul:

                That aspect confused me as well, until I realised it doesn’t square with their SE and hence their confidence intervals. What they are presenting as “Var” is not the variance, but the variance prior to an adjustment for the 3000+ sample size. They are actually claiming a variance of 0.00007.

              • Shiva Kaul says:

                I see. It’s laudable that they included a statistical appendix, but it really needs further clarification.

                As for fixing the error: they can use the delta method to estimate an analytic (1-eps) confidence interval in terms of eps and the unknown parameters q, r, and s. They trivially have concentration of \hat{q}, \hat{r}, \hat{s} around their means, so all unknown terms can be bounded, with probability eps, in terms of the estimates, yielding a 1 - 2*eps interval in terms of eps and n. Because of the exponential concentration, I don’t think this will make a massive quantitative difference, unlike the issues earlier in the paper.

      • djaustin says:

        Yes, I’ve done this now. There is no support for a 10% prevalence in the raw data and almost none for 3%. That in itself is informative. I modelled sensitivity and specificity, and neither is much to shout about! But a poor test can give usable information, just not precision.

  19. anon says:

    Having recently listened to the Bad Blood audiobook, I sincerely appreciate your dig at Theranos (and people who believe they can be domain-independent “experts”, on any and every domain).

  20. Lawrence S. Mayer, MD, PhD says:

    We have a facebook group of 2000 physicians and epidemiologists where we review and discuss this type of paper for the frontline docs in the group (probably about 700). If you are a physician, nurse, epidemiologist, or bench bio-scientist, join us!
    Clinical Epidemiology Discussion Group
    https://facebook.com/groups/covidnerds
    Thank you for your analysis of this paper. We will post it to our group.

  21. EHG says:

    I came across this critique of the same paper yesterday: https://medium.com/@balajis/peer-review-of-covid-19-antibody-seroprevalence-in-santa-clara-county-california-1f6382258c25 … it mentions the same points: the false-positive rate and the participant selection issues. It also mentions an additional point: that it “would imply faster spread than past pandemics”:

    > In order to generate these thousands of excess deaths [in Lombardy, compared to base rate] in just a few weeks with the very low infection fatality rate of 0.12–0.2% claimed in the paper, the virus would have to be wildly contagious. It would mean all the deaths are coming in the last few weeks as the virus goes vertical, churns out millions of infections per week to get thousands of deaths, and then suddenly disappears as it runs out of bodies.

    • Anoneuoid says:

      > In order to generate these thousands of excess deaths [in Lombardy, compared to base rate] in just a few weeks with the very low infection fatality rate of 0.12–0.2% claimed in the paper, the virus would have to be wildly contagious.

      Or the virus could just have been around for much longer. And don’t forget that other countries started testing passengers disembarking from Italy, and for a while it seemed like every single flight had a few positives when there were supposedly only a couple of tens of thousands of cases.

      Also, that high fatality rate is dependent on initially treating this according to the standard ARDS protocol which was apparently a mistake according to the critical care doctors: https://emcrit.org/emcrit/covid-respiratory-management/

    • Zhou Fang says:

      Good writeup. There’s a disturbing tendency in the comments on that piece to go “well, there’s false negatives as well!” and imagine the two issues cancel out. I suppose that’s a manifestation of the “truth is in the middle” type of thinking.

      • Andrew says:

        Zhou:

        Yeah, that’s where math is useful! One thing that’s frustrating here is that the false-positive problem is a well-known issue. Indeed it’s a standard example—perhaps the standard example—that we use when teaching conditional probability. Everybody who’s ever taught probability knows that if you have a rare event, then even moderate false-positive rates destroy you.

        The false positives and false negatives in this example don’t cancel each other out, and one thing that annoyed me in the above-linked paper is when they wrote, “On the other hand, lower sensitivity, which has been raised as a concern with point-of-care test kits, would imply that the population prevalence would be even higher.” This kinda sounds reasonable but it’s not the right way to put it here given the numbers involved. It’s qualitative talk, not quantitative talk.
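
        To make the standard teaching example concrete with some made-up round numbers in the ballpark of this study (these are not the paper’s estimates): take 0.5% true prevalence, 80% sensitivity, and a 1% false-positive rate.

        n, prev, sens, fpr = 3330, 0.005, 0.80, 0.01   # illustrative numbers only
        expected_true_pos  = n * prev * sens            # about 13
        expected_false_pos = n * (1 - prev) * fpr       # about 33
        print(expected_true_pos, expected_false_pos)

        Most of the positives are then false, while the 20% false-negative rate only hides about 3 of the roughly 17 true infections. The two errors are nowhere near canceling.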

        • Michael Nelson says:

          +1

          The best lesson in my grad intro stats class was a lab where we calculated the probability that a real-life case of an Olympic runner being accused of doping was a false positive. It was a case where understanding statistics changed how I view the world. Then again, it made me give credence to Lance Armstrong far longer than seems wise in retrospect…

  22. Better stuff? https://www.medrxiv.org/content/10.1101/2020.04.17.20053157v1.full.pdf

    Crude rises to the top faster than cream, but we can hope for cream…

    • Joseph says:

      The use of this approach to try and get a sense of covid-19 prevalence is fine. Both studies give information about the range of possible prevalence of infection. It is the extension of these numbers, without properly including uncertainty, to estimate the infection fatality rate that is concerning. I don’t see that issue in this preprint.

  23. Nicholas says:

    There is another error in this paper that you have not touched on. The authors count the total number of positive cases as positive by either IgG or IgM. Therefore the sensitivity of their testing criterion is the maximum of the IgG and IgM sensitivities, which is 100%. The specificity is more complicated. From the manufacturer’s data, there were 3/371 false positives for IgM, and 2/371 for IgG. There is no information on whether these false positives overlap. For the authors’ testing criterion, the specificity is at most 99.2%. Therefore, the point estimate for the unweighted prevalence is at most 0.7%, and could be as low as 0.15%.
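
    For readers who want to see where those numbers come from, the standard correction is prevalence = (observed rate - (1 - specificity)) / (sensitivity - (1 - specificity)). A quick back-of-the-envelope check using the counts above and 100% sensitivity:

    def adjusted_prevalence(p_obs, sens, spec):
        # Solves p_obs = prev*sens + (1 - prev)*(1 - spec) for prev.
        return (p_obs - (1 - spec)) / (sens - (1 - spec))

    p_obs = 50 / 3330                                   # raw positive rate in the survey
    print(adjusted_prevalence(p_obs, 1.0, 1 - 3/371))   # IgG and IgM false positives overlap: ~0.007, the 0.7% above
    print(adjusted_prevalence(p_obs, 1.0, 1 - 5/371))   # no overlap: matches the ~0.15% figure above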

  24. Ivan says:

    My impression as a clinician is that no point of care test I’ve used is 100% specific, and 99% would be very high.

    I was told on Twitter by someone whose bio seemed appropriate that validating an antibody test precisely enough to do a serosurvey requires using pre-COVID serum samples from patients that had a comparable incidence of other viral URIs in the preceding period, including the non-novel coronaviruses, because those antibodies can be cross-reactive. So serum from last April, at the end of cold-flu seasons. Using inappropriately bland serum would exaggerate specificity.

    The decision to double check the test characteristics with some of their samples presumably has clinical more than statistical grounds. But I find it hard to explain running a whole arm of the analysis with the point estimate of 100% specificity based on a sample of 30.

    Another thing that boosted the prevalence estimate: they chose a low estimate of sensitivity in the arm of the analysis that limited itself strictly to the manufacturer’s characterization of the test. The manufacturer found 75/75 positive for IgG antibodies but only 78/85 for IgM. The study uses just the IgM sensitivity. But we would expect almost all recovered patients to have both types of antibodies by the time they’re recovered. If anything, IgG more lastingly than IgM.

    • Joe Candelora says:

      Yeah, publishing a result (the high end, 4.2% prevalence) that relies _exclusively_ on a specificity of 100% based on a mere 30-sample test seems a bit nuts.

      Definitely feels like a bit of results-shopping to include that at all.

      Not a researcher, so no idea what the answer to this is: do the authors have any reason to believe their specificity study is more reliable than the manufacturer’s? Is there some reason why the manufacturer’s results are perhaps not applicable, say due to some sort of local population characteristics? Because I just can’t see any reason to publish a result (even as the endpoint of a “range”) that involves ignoring certain data.

  25. Charles Carter says:

    Of course this discusses a pre-print. This is not the only criticism I’ve seen.
    Yet journals are rushing reports to publication. And a great deal of low quality research is getting published, much of it in high impact journals. To some degree this represents SOP compressed in time; yet I can’t help but suspect a degree of academic and publication ‘profiteering’. All in all, especially on the treatment side, huge opportunities are being missed. Now more than ever is a great time to reform for more reliable research findings.

    • Andrew says:

      Charles:

      I don’t mind bypassing formal peer review, given that peer review is often just a way for people to give a seal of approval on their friends’ work. But I do wish the authors of this paper had run their ideas past an outsider with some statistical understanding before blasting their conclusions out to the world.

      • Martha (Smith) says:

        ” I do wish the authors of this paper had run their ideas past an outsider with some statistical understanding before blasting their conclusions out to the world.”

        +1

      • Charles Carter says:

        No argument whatsoever.
        Indeed, a panoply of factors operating for years now has led to Ioannidis’ estimate of the rate of ‘wrong’ results being published. In this time when actionable research is direly needed, nothing appears to have changed. Not merely a shame, but dangerous.

  26. I wonder what these false positives actually are. It seems likely (but I have no expertise here) that they are antibodies raised against other viruses that bind well enough to the virus that causes COVID-19 to show as a positive on the test. Do these antibodies work to prevent COVID-19, or reduce its severity? It is possible that a followup study of these false positives might lead to important insights. If so, it would be a mistake to think of false positives as a lab mistake.

    • Dallas Weaver Ph.D. says:

      To the best of my knowledge, you are correct that the cross-reactivity problem is real and these “false positives” can be real in the sense of the immune system providing resistance. Since something like 10 billion different antibodies are possible, most humans don’t have identical antibodies to the same antigen (virus), and it would be possible for a fraction of those with antibodies to the common cold (4 different coronaviruses + others) to cross-react with this test kit, giving “real (the antibody really is there) false positives”.

      The only problem is what conclusions you can draw from the data. If most of the positives are really immune responses from another source, the data tell you nothing about COVID-19’s future paths.

      An example of cross-reactivity is TB antibodies (a positive TB test) from exposure to marine mycobacteria in fish handlers who have never had TB. More dramatically, being immune to smallpox after having cowpox, which was the start of the elimination of smallpox from the world.

      • Martha (Smith) says:

        Interesting.
        (Incidentally, my father never bothered to have a smallpox vaccination, since he had had cowpox as a boy — I believe contracted one of the summers that he spent working on his uncle’s farm. However, in later life, he was unable to donate blood because he had not been vaccinated for smallpox.)

  27. Funko says:

    I have prepared an example contradicting this sentence:

    “50 positive tests is 50 positive tests”

    See https://imgur.com/1ixY563

    • Andrew says:

      Funko:

      I thought about this when writing the above post. You can come up with examples where the positive results are concentrated and can’t be explained by testing errors, for example if 20 of the 50 positive tests were happening from people in one small location. But I can’t see how this could possibly be what’s happening here, because in the article in question they’re just taking averages and poststratifying. It seems pretty clear that what’s going on is they’re just assuming the specificity is 99.5% or 100% and going from there.

      • Funko says:

        But doesn’t the jump from 1.5% crude prevalence to 2.8% for the reweighted prevalence imply that there is some strong heterogeneity in the prevalences and that some higher-prevalence clusters are underrepresented in the sample?

        • Peter says:

          Sampling rate by zip code is tightly coupled to how close the area was to one of the three testing sites (and in particular a large portion of the study came from the Stanford zip code). It seems to me the likely explanation is that the further you were from a test site, the more motivated you would’ve had to be to participate, and so there would be even more bias toward symptomatic volunteers.