Panos Toulis writes:

The debate on the Santa Clara study actually got me to think about the problem from a finite-sample inference perspective. In this case, we can fully write down the density

f(S | θ) in known analytic form, where S = (vector of) test positives, θ = parameters (i.e., sensitivity, specificity and prevalence).

Given observed values s_obs we can invert a test to obtain an exact confidence set for θ. I wrote down one such procedure and its theoretical properties (see Procedure 1). I believe that finite sample validity is a benefit over asymptotic/approximate procedures such as bootstrap or Bayes, which may add robustness. I compare results in Section 4.3.
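To make the idea of test inversion concrete, here is a minimal sketch in Python. It is not Procedure 1 from the paper. It makes the simplifying assumption that sensitivity and specificity are known constants, so that the number of test positives is Binomial(n, p) with p = prevalence × sensitivity + (1 − prevalence) × (1 − specificity), and it keeps every prevalence value on a grid that an exact two-sided binomial test does not reject. The function names, the grid, and the numbers in the usage lines are all made up for illustration.

```python
from math import exp, lgamma, log

def binom_pmf(k, n, p):
    """Binomial(n, p) probability of k successes, computed on the log scale."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_coef = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coef + k * log(p) + (n - k) * log(1.0 - p))

def exact_p_value(s_obs, n, p):
    """Total probability of all counts that are no more likely than s_obs."""
    pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
    cutoff = pmf[s_obs] * (1.0 + 1e-9)  # small tolerance for floating-point ties
    return sum(q for q in pmf if q <= cutoff)

def prevalence_confidence_set(s_obs, n, sens, spec, alpha=0.05, grid_size=1001):
    """Grid values of prevalence that the exact test does not reject at level alpha.

    Assumption (not from the paper): sens and spec are treated as known.
    """
    kept = []
    for i in range(grid_size):
        prev = i / (grid_size - 1)
        # Probability that a randomly sampled person tests positive.
        p_pos = prev * sens + (1.0 - prev) * (1.0 - spec)
        if exact_p_value(s_obs, n, p_pos) > alpha:
            kept.append(prev)
    return kept

# Made-up numbers, for illustration only.
kept = prevalence_confidence_set(s_obs=10, n=200, sens=0.9, spec=0.99)
print(min(kept), max(kept))
```

Because the test at each grid point has size at most alpha under this binomial model, the retained set covers the true prevalence with probability at least 1 − alpha at the grid points, without any large-sample approximation. In the full problem θ also includes sensitivity and specificity, so the inversion would run over all three parameters jointly.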

I recently noticed that in your paper with Bob, you discuss this possibility of test inversion in Section 6. What I propose is just one way to do this style of inference.

From my perspective, I don’t see the point of all these theorems: given that the goal is to generalize to the larger population, I think probability modeling is the best way to go. And I have problems with test inversion for general reasons; see here and, in particular, this comment here. Speaking generally, I am concerned that hypothesis-test inversions will not be robust to assumptions about model error.

But I recognize that other people find the classical hypothesis-testing perspective to be useful, so I’m sharing this paper.

Also relevant is this note by Will Fithian that also uses a hypothesis-testing framework.

You also have to take into account waning and T-cell mediated immunity, especially for all the mild/asymptomatic cases.

Article in press relevant to the Santa Clara study: https://www.clinicalkey.com.au/#!/content/journal/1-s2.0-S0140673620313040.

From the Lancet though!

Ratio of seroprevalence to confirmed clinical case = 12:1

Hi Andrew,

many thanks for initiating the discussion.

I read your general criticisms on the test inversion method in the links above.

Indeed, they illustrate nicely how the method can be misused, but I don’t think they are an indictment of the method itself.

Following your examples, consider two models, M1 and M2. Suppose that M1 fits the data much better than M2.

For some parameter of interest, it could happen that with test inversion in M1 we get [0, 4] as 95% CI, and with M2 we get [2.0, 2.1].

Your criticism, as far as I understand, is that M2 appears to give sharper inference than M1 but, as you correctly point out, this is due to M2 just being a bad fit to the data. So, test inversion appears to mix inference with model fit.

However, this argument presupposes that comparing CI length across different models is a valid method for model selection.

This is obviously wrong, but it is actually a universal problem — we can always create a model (Bayesian or not) that artificially narrows the confidence intervals.

In fact, from a purely frequentist perspective, in the above example we should just report

{ (M1, [0, 4]), (M2, [2.0, 2.1]) } as our 95% confidence set; that is, we just treat “model” as a parameter.

Test inversion works fine here as long as it is interpreted correctly in terms of frequentist coverage and not in terms of model selection. It is also important to note that in the Santa Clara study the model consists mainly of Binomial counts, which is uncontroversial: more or less all studies used the same core model to analyze the data.

> The trouble in your example can be seen by considering three analysts with the same model and slightly different datasets. Analyst 1 has the interval [3.5, 3.6] as in your example above. He proudly publishes his super-precise result, secure in the knowledge that he has a classically valid confidence interval. Analyst 2, with nearly the same data but a slightly better fit to the model, gets the interval [3.0, 4.1]. That’s ok but not so precise. Analyst 1 is getting better results because his model fits worse. Next there’s Analyst 3, whose model fits slightly worse than that of Analyst 2. His interval is empty. So, instead of being able to make a very strong claim, he can say nothing at all about the parameter.

I’ve been thinking a bit about the kind of situation where this may happen. I came up with the following example: a set of measurements with standard normal error and a test based on the minimum and maximum value, which rejects when the more extreme data point is too far from the null hypothesis value.

If the data (range) is distributed as expected, we get some interval inverting the test. If the data (range) happens to be more concentrated, the confidence interval produced by inverting the test will be wider. If the data (range) happens to be more spread out, the confidence interval produced by inverting the test will be narrower.
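Here is a small Python sketch of that example, under one reading of it: n independent measurements x_i ~ N(μ, 1), and a test that rejects μ0 when the largest |x_i − μ0| exceeds a cutoff c chosen so that the test has level alpha. Inverting it gives the interval [max(x) − c, min(x) + c], whose width is 2c minus the data range. The function name and the three toy datasets are made up for illustration.

```python
from statistics import NormalDist

def minmax_interval(xs, alpha=0.05):
    """Invert the min/max test for the mean of N(mu, 1) measurements.

    Returns (lo, hi), or None when the test rejects every value of mu.
    """
    n = len(xs)
    # Choose c so that P(max_i |Z_i| <= c) = 1 - alpha for n standard normals.
    c = NormalDist().inv_cdf((1 + (1 - alpha) ** (1 / n)) / 2)
    lo, hi = max(xs) - c, min(xs) + c
    return (lo, hi) if lo <= hi else None

concentrated = [-0.1, 0.0, 0.1]   # range much smaller than expected
spread_out = [-1.5, 0.0, 1.5]     # range larger than expected
very_spread = [-3.0, 0.0, 3.0]    # range so large that the interval is empty

for xs in (concentrated, spread_out, very_spread):
    print(xs, minmax_interval(xs))
```

The concentrated data give the widest interval, the spread-out data a narrower one, and the very spread-out data no interval at all, which is the pattern described above.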

It would be misleading to say that the interval is narrower because the fit is bad. If the data is bad as in “too concentrated” instead of as in “too spread out” the interval will be wider. A “bad fit”, a data range larger or smaller than its expected size, doesn’t necessarily indicate a problem with the model. The model may still be right, for example if the distribution of measurement errors is well known.

The properties of a confidence interval created by inverting a test depend on the properties of the test. The test discussed before, based on the min/max, is not optimal. The most powerful test would be a likelihood ratio test, based on the mean. It would never reject the model, the intervals would never be empty, and we would have reasonably-behaved likelihood inference. Ad-hoc tests may have the right frequentist properties but those intervals may be hard to interpret on their own.
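For comparison, a sketch of the mean-based interval in the same setting (measurements with known standard normal error), again with a made-up function name. Inverting the test based on the sample mean gives mean ± z/√n, so the width depends only on n and alpha, not on how concentrated or spread out the data happen to be.

```python
from math import sqrt
from statistics import NormalDist, fmean

def mean_interval(xs, alpha=0.05):
    """Invert the two-sided z-test for the mean of N(mu, 1) measurements."""
    n = len(xs)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    center = fmean(xs)
    half_width = z / sqrt(n)  # error variance is assumed known and equal to 1
    return (center - half_width, center + half_width)

# Same width for concentrated and for spread-out data; never empty.
print(mean_interval([-0.1, 0.0, 0.1]))
print(mean_interval([-3.0, 0.0, 3.0]))
```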