Solution to that little problem to test your probability intuitions, and why I think it’s poorly stated

The other day I got this email from Ariel Rubinstein and Michele Piccione asking me to respond to this question which they sent to a bunch of survey respondents:

A very small proportion of the newborns in a certain country have a specific genetic trait.
Two screening tests, A and B, have been introduced for all newborns to identify this trait.
However, the tests are not precise.
A study has found that:
70% of the newborns who are found to be positive according to test A have the genetic trait (and conversely 30% do not).
20% of the newborns who are found to be positive according to test B have the genetic trait (and conversely 80% do not).
The study has also found that when a newborn has the genetic trait, a positive result in one test does not affect the likelihood of a positive result in the other.
Likewise, when a newborn does not have the genetic trait, a positive result in one test does not affect the likelihood of a positive result in the other.
Suppose that a newborn is found to be positive according to both tests.
What is your estimate of the likelihood (in %) that this newborn has the genetic trait?

Here was my response:

OK, let p = Pr(trait) in population, let a1 = Pr(positive test on A | trait), a2 = Pr(positive test on A | no trait), b1 = Pr(positive test on B | trait), b2 = Pr(positive test on B | no trait).
Your first statement is Pr(trait | positive on test A) = 0.7. That is, p*a1/(p*a1 + (1-p)*a2) = 0.7
Your second statement is Pr(trait | positive on test B) = 0.2. That is, p*b1/(p*b1 + (1-p)*b2) = 0.2

What you want is Pr(trait | positive on both tests) = p*a1*b1 / (p*a1*b1 + (1-p)*a2*b2)

It looks at first like there’s no unique solution to this one, as it’s a problem with 5 unknowns and just 2 data points!

But we can do that “likelihood ratio” trick . . .
Your first statement is equivalent to 1 / (1 + ((1-p)/p) * (a2/a1)) = 0.7; therefore (p/(1-p)) * (a1/a2) = 0.7 / 0.3
And your second statement is equivalent to (p/(1-p)) * (b1/b2) = 0.2 / 0.8
Finally, what you want is 1 / (1 + ((1-p)/p) * (a2/a1) * (b2/b1)). OK, this can be written as X / (1 + X), where X is (p/(1-p)) * (a1/a2) * (b1/b2).
Given the information above, X = (0.7 / 0.3) * (0.2 / 0.8) * (1-p)/p

Still not enough information, I think! We don’t know p.

OK, you give one more piece of information, that p is “very small.” I’ll suppose p = 0.001.

Then X = (0.7 / 0.3) * (0.2 / 0.8) * 999, which comes to about 583, so the probability of having the trait given positive on both tests is 583 / 584 = 0.998.

OK, now let me check my math. According to the above calculations,
(1/999) * (a1/a2) = 0.7/0.3, thus a1/a2 = 2331, and
(1/999) * (b1/b2) = 0.2/0.8, thus b1/b2 = 249.75.
And then (p/(1-p))*(a1/a2)*(b1/b2) = (1/999)*2331*249.75 ≈ 583.

So, yeah, I guess that checks out, unless I did something really stupid. The point is that if the trait is very rare, then the tests have to be very precise to give such good predictive power.

But . . . you also said “the tests are not precise.” This seems to contradict your earlier statement that only “a very small proportion” have the trait. So I feel like your puzzle has an embedded contradiction!

I’m just giving you my solution straight, no editing, so you can see how I thought it through.
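
Here’s a quick R check of that calculation, a minimal sketch that assumes p = 0.001 and picks arbitrary sensitivities a1 and b1 (which cancel out of the answer):

p = 0.001
a1 = 0.9; b1 = 0.8                      # arbitrary assumed sensitivities
a2 = p*a1*(1 - 0.7) / (0.7*(1 - p))     # Pr(positive on A | no trait) implied by Pr(trait | positive on A) = 0.7
b2 = p*b1*(1 - 0.2) / (0.2*(1 - p))     # Pr(positive on B | no trait) implied by Pr(trait | positive on B) = 0.2
a1/a2                                   # likelihood ratio of test A: about 2330
p*a1*b1 / (p*a1*b1 + (1 - p)*a2*b2)     # Pr(trait | positive on both tests): about 0.998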

Rubinstein and Piccione confirmed that my solution, that the probability is very close to 1, is correct, and they pointed me to this research article where they share the answers that were given to this question when they posed it to a bunch of survey respondents.

I found the Rubinstein and Piccione article a bit frustrating because . . . they never just give the damn responses! The paper is very much in the “economics” style rather than the “statistics” style in that they’re very focused on the theory, whereas statisticians would start with the data. I’m not saying the economics perspective is wrong here—the experiment was motivated by theory, so it makes sense to compare results to theoretical predictions—I just found it difficult to read because there was never a simple plot of all the data.

My problem with their problem

But my main beef with their example is that I think it’s a trick question. On one hand, it says that only a “very small proportion” of the population have the trait; indeed, I needed that information to solve the problem. On the other hand, it says “the tests are not precise”—but I don’t think that’s right, at least not in the usual way we think about the precision of a test. With this problem description, they’re kinda giving people an Escher box and then asking which side is up!

To put it another way, if you start with “a very small proportion,” and then you take one test and it gets your probability all the way up to 70%, then, yeah, that’s a precise test! It takes a precise test to give you that much information, to take you from 0.001 to 0.7.

So here’s how I think the problem is misleading: The tests are described as “not precise,” and then you see the numbers 0.7 and 0.2, so it’s natural to think that these tests do not provide much information. Actually, though, if you accept the other part of the problem (that only “a very small proportion” have the trait), the tests provide a lot of information. It seems strange to me to describe a test that offers a likelihood ratio of about 2300 as “not precise.”

To put it another way: I think of the precision of a test as a function of the test’s properties alone, not of the base rate. If you have a precise test and then apply it to a population with a very low base rate, you can end up with a posterior probability of close to 50/50. That posterior probability depends on the test’s precision and also on the base rate.

I guess they could try out this problem on a new set of respondents, where instead of describing the tests as “not precise,” they describe them as “very precise,” and see what happens.

One more thing

On page 11 of their article, Rubinstein and Piccione give an example where different referees have independent data in their private signals when trying to determine whether a defendant is guilty of a crime. That independence does not seem plausible in the context of deciding whether a defendant is guilty. I think it would make more sense to say that they have overlapping information. This does not change the math of the problem—you can think of their overlapping information, along with the base rate, as a shared “prior,” with the non-overlapping information corresponding to the two data points in the formulation above—but it would make the story more realistic.

I understand that this model is just based on the literature. I just have political problems with oversimplified models of politics, juries, etc. I’d recommend that the authors either use a different “cover story” or else emphasize that this is just a mathematical story not applicable to real juries. In their paper, they talk about “the assumption that people are Bayesian,” but I’m bothered by the assumption that different referees have independent data in their private signals. That’s a really strong assumption! It’s funny which assumptions people will question and which assumptions they will just accept as representing neutral statements of a problem.

A connection to statistical inference and computing

This problem connects to some of our recent work on the computational challenges of combining posterior distributions. The quick idea is that if theta is your unknown parameter (in this case, the presence or absence of the trait) and you want to combine posteriors p_k(theta|y_k) from independent data sources y_k, k=1,…,K, then you can multiply these posteriors, but you need to divide by the factor p(theta)^(K-1). Dividing by the prior to a power in this way will in general induce computational instability. Here is a short paper on the problem and here is a long paper. We’re still working on this.
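
To make that concrete with the trait example, here’s a minimal R sketch (assuming a base rate of 0.001): each test’s posterior is Pr(trait | that test is positive), the prior is the base rate, and with K = 2 independent tests the combined posterior is proportional to the product of the two single-test posteriors divided by the prior to the power K-1 = 1.

prior  = c(0.001, 0.999)                   # assumed Pr(trait), Pr(no trait)
post_A = c(0.7, 0.3)                       # posterior after a positive test A
post_B = c(0.2, 0.8)                       # posterior after a positive test B
combined = post_A * post_B / prior^(2 - 1) # multiply the posteriors, divide by prior^(K-1)
combined / sum(combined)                   # about (0.998, 0.002)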

55 thoughts on “Solution to that little problem to test your probability intuitions, and why I think it’s poorly stated”

  1. > On the other hand, it says “the tests are not precise”—but I don’t think that’s right, at least not in the usual way we think about the precision of a test.

    I don’t know who « we » are, but that’s in fact a well-defined term in the field of classification: https://en.m.wikipedia.org/wiki/Precision_and_recall

    « Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances »

    • “The tests have low precision” would have been maybe-acceptable, though still arguably misleading, as “the tests” taken together have very high precision. “Each test individually has low precision” would have been acceptable. “The tests are not precise” is not acceptable.

      • «  “The tests are not precise” has low acceptability. » would have had high acceptability.

        « “The tests are not precise” is not acceptable. » has low acceptability.

    • The tests are imprecise — and their “precision” is what is specified by the problem; test A has a precision of 0.7 and test B has a precision of 0.2 — but that’s just telling you why tests aren’t described in terms of precision and recall. Tests are described in terms of sensitivity and specificity, because those properties are properties of the test independent of the base rate of the condition, whereas precision and recall cannot be determined without knowing the base rate of the condition.

      And while the sensitivity of these tests can be anything without affecting the answer to the problem, their specificity is very close to 1. (Under the assumption that the base rate of the condition is close to 0.) That is what makes them so informative; that’s what Andrew is picking up on.
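
      A quick way to see how close to 1 the specificities must be, as a sketch assuming a base rate of 0.001 and sensitivities of at most 1:

      p = 0.001
      p * (0.3/0.7) / (1 - p)   # largest possible false-positive rate for test A: about 0.0004
      p * (0.8/0.2) / (1 - p)   # largest possible false-positive rate for test B: about 0.004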

  2. In case anyone is curious, my solution and discussion are here: https://eighteenthelephant.com/2022/08/02/three-views-of-a-statistics-puzzle/ . In addition to algebra, I tried simulating the scenario, which was trickier than I thought. It perhaps makes it clearer, though, that the false positive and true positive rates aren’t necessary for the problem.

    Both the algebra and the simulation reach the same numerical conclusion as Andrew, but I don’t share the confusion about the “not precise” phrasing — I didn’t pay attention to those words, and it’s not clear to me what “precise” means, or why we should be upset about it!

    • I was unable to solve the problem algebraically, but I was able to simulate it in a spreadsheet. I thought that was the easy approach. I also have the same numerical conclusion as Andrew.

      Note that while the true positive rate is not important to the problem, the false positive rate is very important! The reason these tests are informative is that they have very low false positive rates.

      (If the base rate of the condition is low, that tells us several things:

      1. Our test can only detect a small number of true positives [because only a small number of true positives exist].

      2. A large number of condition-negative people exist.

      3. But our test can only flag a minuscule proportion of those condition-negative people as false positives, because even for test B, the raw number of false positives can be at most four times the number of true positives, and the number of true positives is known to be small.

      4. Therefore, the false positive rate is very, very, very low for both tests.)
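
      Here is that argument as a back-of-the-envelope calculation in R (the numbers are assumptions: a base rate of 0.001 and a population of one million newborns):

      N = 1e6; p = 0.001
      true_pos_max  = N * p              # at most 1,000 newborns have the trait
      false_pos_max = 4 * true_pos_max   # for test B, false positives are at most 4x true positives
      false_pos_max / (N * (1 - p))      # so the false positive rate of B is at most about 0.004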

      • Correction: where I say “[because only a small number of true positives exist]”, that should say “[because only a small number of condition-positive people exist]”.

  3. My concern with the problem is that it encourages modeling as a kind of “assumption laundering”. By giving a mixture of precise information (0.7 and 0.2) and vague information (“small proportion”, “tests are not precise”, “study has found that”), the example invites you to build a model to accommodate the precise parts and then fill in the blanks by making guesses about the vague parts. That’s what you and the people who “solved” the problem did in the discussion thread.

    The danger is forgetting about those assumptions. If you do, you feel like you’ve learned something about the world and can make a useful judgment. But really, you’ve learned about the logical consequences of your assumptions about the world.

    Learning the consequences of assumptions is valuable and necessary. I think that’s what models are for. And there’s nothing wrong with assumptions! We need them for everything. I’m not making the nihilistic claim that because we can’t know everything we know nothing.

    But in order for assumptions to lead to models that are informative/useful, the assumptions need to be justified by domain knowledge (e.g., causal theories, converging operations, prior results). In a toy problem, there ain’t no domain knowledge, just what we are told. So by implying that it is possible to “solve” a toy problem with assumptions, my concern is that these kinds of examples fool people into thinking the same tricks will help in non-toy problems.

  4. Sure there can be “computational instability” but the straightforward answer to the question is that the final odds are the product of the tests’ odds divided by the initial odds to the power (k-1), where here k=2. I’m not sure what all the ruckus is about, although it was interesting seeing the variety of intuitive responses.

    • Michael:

      I like the problem. I just think the statement, “However, the tests are not precise,” is a red herring. Remove that sentence and I think the problem is just fine.

      P.S. The computational instability I was referring to does not arise in this particular math problem. It arises in a related statistics problem of combining inferences from several posterior distributions, each of which is summarized with error. You can see the linked papers by Aki and me for further discussion of this issue.

  5. Andrew is not « making guesses about the vague parts » when he claims that the answer to the question can be written as X / (1 + X) and gives an expression for X that depends explicitly on the only relevant unknown variable p (the prevalence of the trait in the population).

  6. How about this kind of Signal Detection Theory based analysis, which models the bias and signal-to-noise ratio of the tests? This would seemingly produce the reported FA and hit rates BUT also a low probability of the “two-tests positive” person actually having the trait…

    But what I’m not happy about here is that I don’t think the numbers in the original post are the raw FA and Hit rates: they seem to be (maybe) conditioned on the test being positive; kind of normalized, which seems odd to me. If that is what has been done, then all of what I’ve done here is rubbish (maybe it is regardless of that!) and the bias or d prime values cannot be identified. But just so you know: I’ve taken them here at face value. Hadn’t written anything in R for a while, so that was also nice; not sure how much I screwed up.

    (Incoming: a rather long R script. I don’t know how to put preformatted text in this thing, but maybe copy/paste will recover most of it.)

    # R SCRIPT STARTS HERE
    #
    # There are two parts to this:
    #
    # First, for both tests, their biases (are they more biased towards
    # positive/negative results) and their d prime values (basically how
    # good the signal-to-noise ratio the test is) are calculated.
    #
    # Second, a population is simulated. Random proportion of this population
    # has the genetic trait of interest. Each person takes both of the tests.
    # The tests produce random amount of evidence (for the person having
    # the genetic trait) and if this evidence exceeds the previously
    # estimated bias of the test, they are classified as having the
    # genetic trait and otherwise not.

    # Population size and the probability of having the genetic trait:
    N_population = 50000
    prob_has_trait = 0.01

    # We prepare the data matrix:
    # 1st column: indicator for having the genetic trait (0, 1)
    # 2nd column: indicator for results of the first test (0, 1)
    # 3rd column: indicator for results of the second test (0, 1)
    datmat = matrix(NaN, ncol = 3, nrow = N_population)
    colnames(datmat) = c("Has_trait", "Test_1", "Test_2")

    # Sample indicators for having/not having the genetic trait:
    datmat[,1] = sample(x = c(0, 1),
                        size = N_population,
                        replace = TRUE,
                        prob = c(1 - prob_has_trait, prob_has_trait))

    # Here, we calculate the bias of the tests as well as their
    # “d prime” value.
    # For test 1: P(FA) = 0.3 and P(Hit) = 0.7
    # For test 2: P(FA) = 0.8 and P(Hit) = 0.2

    bias_test_1 = -qnorm(0.3)
    bias_test_2 = -qnorm(0.8)
    dprime_test_1 = qnorm(0.7) - qnorm(0.3)
    dprime_test_2 = qnorm(0.2) - qnorm(0.8)

    # Now we run simulated tests for the simulated population.
    # People without the genetic trait produce evidence from a
    # zero-centered normal distribution whereas people with
    # the genetic trait produce evidence from normal distribution
    # centered on the d prime value.
    #
    # If evidence exceeds the bias of the test, the simulated person
    # is classified as having the genetic trait, otherwise not:

    evidence_test_1 = rnorm(N_population, dprime_test_1 * datmat[,1])
    zero_inds_test_1 = which(evidence_test_1 < bias_test_1)
    datmat[,2] = rep(1, N_population)
    datmat[zero_inds_test_1, 2] = 0

    # This is exactly the same thing as for the first test:
    evidence_test_2 = rnorm(N_population, dprime_test_2 * datmat[,1])
    zero_inds_test_2 = which(evidence_test_2 < bias_test_2)
    datmat[,3] = rep(1, N_population)
    datmat[zero_inds_test_2, 3] = 0

    # Lastly, we see who were classified as positive in both tests…
    both_positive = intersect(which(datmat[,2] == 1), which(datmat[,3] == 1))
    # …and see the ratio of those who actually do have the genetic trait:
    p_hit_both_pos = sum(datmat[both_positive, 1]) / length(both_positive)

    cat(paste("Probability that a person who is positive on both tests\n",
              "actually has the genetic trait is",
              p_hit_both_pos, "\n"))

    # TESTS TO SEE IF EVERYTHING WENT AS EXPECTED.
    # Last, we see if the false alarm and hit rates correspond
    # to the reported values. (Note: these can be rather noisy
    # if the proportion of population having the trait is low, but
    # they should converge to the correct values in any case).
    #
    # False alarm rate for test_1 (should be close to 0.3)
    length(intersect(which(datmat[,1] == 0), which(datmat[,2] == 1))) /
    length(which(datmat[,1] == 0))
    # Hit rate for test_1 (should be close to 0.7)
    length(intersect(which(datmat[,1] == 1), which(datmat[,2] == 1))) /
    length(which(datmat[,1] == 1))

    # False alarm rate for test_2 (should be close to 0.8)
    length(intersect(which(datmat[,1] == 0), which(datmat[,3] == 1))) /
    length(which(datmat[,1] == 0))
    # Hit rate for test_2 (should be close to 0.2)
    length(intersect(which(datmat[,1] == 1), which(datmat[,3] == 1))) /
    length(which(datmat[,1] == 1))

      • That was perhaps a silly way to express it… what I mean is that since FA rate is calculated by looking at positive responses to negative cases and Hit rate by looking at positive responses to positive cases, they need not be complements of each other — they can be, but that would seem more like a fluke than anything else.

        If you have R, here are a few Receiver Operating Curves for a few signal strengths. These display hit rates against false alarm rates as the criterion (“bias”) is varied, for some fixed signal strength (“d prime”).

        ### SCRIPT STARTS
        drawROC = function(dprime){
          bias = seq(-5, 5, 0.1)
          pfa = pnorm(bias, lower.tail = F)
          phit = pnorm(bias, dprime, lower.tail = F)

          plot(pfa, phit, type = "l",
               xlab = "P(FA)", ylab = "P(Hit)",
               main = paste("d' = ", dprime))
        }

        par(mfrow = c(2,2))
        drawROC(0)
        drawROC(1)
        drawROC(3)
        drawROC(-1)

        ### SCRIPT ENDS

        As you can see, P(FA) and P(Hit) are not necessarily nor usually complements of each other; I believe there’s exactly one point on each ROC curve* at which they sum to one. This led me to believe that — in the OP — maybe these “raw” rates have been normalized by making them sum to one. If this is what’s been done, then there’s (again: I believe) no way to recover the d prime and bias for those rates. Maybe one could say that for the first test d prime is positive and for the second negative, but I’m not completely sure about that either.

        *CD disk, ATM machine…

        • Ah, morning brain, it’s Receiver Operating Characteristic.

          But ALSO, looking at the comments, I just realized that I missed that the second test is only done to those classified as having the trait in the first! Urkh. When I read that the tests are independent my brain threw away the “screening” part that was said earlier.

        • Sorry for spamming, the screeningness bug is brute-force fixed by including this “filtering” step before the 2nd test:

          #SCRIPT STARTS
          # Include only positive results from the first test, by throwing away
          # negative results, very violent:
          positive_inds = which(datmat[,2] == 1)
          N_population = length(positive_inds)
          datmat = datmat[positive_inds,]
          #SCRIPT ENDS

          This doesn’t seem to affect the results that much, and — I think — shouldn’t break anything.

        • To David Marcus: now we are at the core of these kinds of cleverly constructed problems: it seems that people have a hard time even agreeing on _what’s_ being asked! Are those FA/hit rates, how are the tests administered, what does it mean that the tests aren’t precise, what does it mean that the trait is rare…

  7. Note, the tests are declared “not precise.” Precision is the share of positive results that are true. While it is not clear what level of precision warrants the label, it is clearly context-dependent.

    For simplicity of illustration, suppose a test yields no false negatives. If we only tested positive people, the precision would be 100%. If we tested only negative people, the precision would be 0%. So the test is precise sometimes and very imprecise at other times. Precision is related to, but not fundamental to, the test itself.

    The problem as given states exactly what the precisions of the tests are *but lacks a clear context*. The given answer presumes that these are precisions *when testing the general population*. The precision of test A will be much higher if we first screen the population by test B, and vice versa—because the first test has already screened out almost all of the true negatives.

    And importantly, if we erroneously screen out positives with our first test, we must in near proportion be screening out negatives. If we miss 20% of true positives, then we are reducing by 20% the number of true positives in our screened sample. But the ratio of false positives to true positives is fixed *on the first test* by the given test precision, so the screened pool must have 20% fewer false positives. That is, the false negative rate simply determines the size of the screened pool—not the composition.

    In summary, although neither test is precise when applied to the general population (presumed), test B is very precise when applied to the pool already screened by test A, and vice versa. The “paradox” is that precision is not inherent to the test but depends on who gets tested.
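
    A quick R illustration of that point, as a sketch assuming a base rate of 0.001 and, for simplicity, sensitivity 1 for both tests:

    p  = 0.001
    a2 = p * (0.3/0.7) / (1 - p)   # false-positive rate of test A implied by its 70% precision in the general population
    0.2 / (0.2 + 0.8 * a2)         # precision of test A among newborns already positive on B: about 0.998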

    • David:

      Rather than trying to define what is a precise test, I think I’m on safer ground just saying that they could remove the sentence, “However, the tests are not precise,” from the problem, as I think that it just adds confusion.

        • Re: Other David here

          “However, the tests are not precise.
          A study has found that:
          70% of the newborns who are found to be positive according to test A have the genetic trait (and conversely 30% do not).
          20% of the newborns who are found to be positive according to test B have the genetic trait (and conversely 80% do not).”

          So it *looks* like the author meant “precise” in exactly the same sense of TP/(TP+FP).

          We can argue if it’s useful to call a *test* precise as opposed to a test/data pair. (I’d say no– that’s the point of my original comment.) So yes I agree with Andrew that the first sentence is misleading. I just don’t agree that Rubinstein and Piccione’s meaning of “precise” is ambiguous here.

  8. Andrew:
    I do not think you can combine the two PPVs as likelihood ratios. PPV varies with disease prevalence – the 70% and 20% figures are only valid when applied to the population of babies used in the studies. When combining the two tests, either test A first or test B first, the disease prevalence is altered (to 70% or 20%) and then the PPV of the test applied second is no longer valid.
    The prevalence-dependence thing is why NPV and PPV are seldom useful – sensitivity and specificity are preferred given they are mathematically impervious to prevalence (although IRL there is some effect).

    • I agree kind of. This seems to imply something hidden in all the derivations of the answer that have been posted here:

      The study has also found that when a newborn has the genetic trait, a positive result in one test does not affect the likelihood of a positive result in the other.

      Likewise, when a newborn does not have the genetic trait, a positive result in one test does not affect the likelihood of a positive result in the other.

      I suspect there is a simpler derivation. Doesn’t this mean that if test-A is used to filter the results, p(B|D) is unchanged? Here B means a positive test-B and D means positive for the disease/gene.

      Suppose we are given p(D|B) = 0.2, p(!D|B) = 0.8, and p(D) is small (e.g., p(D) = 0.001). Then:

      p(B|D) = p(B)*p(D|B)/p(D) = p(B)*0.2/0.001 = 200*p(B)

      p(B|!D) = p(B)*p(!D|B)/p(!D) = p(B)*0.8/0.999 = 0.8008*p(B)

      So p(B|D)/p(B|!D) = 200/0.8008 = 249.75/1

      In terms of probabilities, out of the positive test-B’s then 200/(200+0.8008) = 99.6% will have the disease. This should be true whether or not the results were filtered by test-A.

      This qualitatively agrees with Andrew’s answer but is off by 0.2%.

      • > In terms of probabilities, out of the positive test-B’s then 200/(200+0.8008) = 99.6% will have the disease. This should be true whether or not the results were filtered by test-A.

        We’re told that the number is 20% in the latter case. [20% of the newborns who are found to be positive according to test B have the genetic trait]

        Your formula doesn’t include the frequency of the trait in the population and it’s valid only when you start with a 1:1 ratio. After test A the ratio is 7:3 which is close enough to 1:1 for you to get almost the right answer using a completely wrong derivation.

        • After test A the ratio is 7:3

          Sure, I do agree. But doesn’t this disagree with the assumption we are given:

          The study has also found that when a newborn has the genetic trait, a positive result in one test does not affect the likelihood of a positive result in the other.

          Aren’t you saying that a positive test-A does affect the likelihood of a positive result on test-B (via changing the prevalence)?

        • I’m saying that the formula you used p(D|B) = p(B|D)/(p(B|D)+p(B|!D)) is correct only when p(D)=p(!D). In general p(D)!=p(!D) and the missing terms in your formula don’t cancel out.

      • But I assumed p(D|B) = 0.2 and seem to have concluded p(D|B) = 0.996, so that seems wrong. Anyway, I think there is some simpler way to understand it based on the independence assumption.

  9. It seems to me that this can be analyzed easily by considering a simple numerical example. Unfortunately for me, my answer does not match Andrew’s and I cannot find my error.

    We are told that “a very small proportion” of newborns have the trait.

    Let’s assume that “very small” = 0.01. Further, let’s assume that the population of interest consists of 10,000 newborns—100 of whom have the trait and 9900 do not.

    We test them all. Assume that tests A and B always detect the trait but also generate false positives. Test A would then return 143 positive results—100 true positives and 43 false positives (100/143 ≈ 0.7). This number of positive results maximizes the false positive rate at 43/9900. Similarly, B would return 500 positive results with a false positive rate of 400/9900. We can regard the probability of false positives from test A as 43/9900 ≈ 0.004 and from B as 400/9900 ≈ 0.04. We are told that these probabilities are independent. So, the chances of a double false positive are minute, ≈ 0.00016. The probability of a true positive is the complement of this, or ≈ 1.

    If I choose the rate to be 0.001, then of the population of 10,000, 10 have the trait and 9990 do not. So, Test A would find 14 newborns that tested positive, of which 10 were true positives and 4 were false positives. Similarly, Test B would find 50 newborns that test positive, of which 10 were true positives and 40 were false positives. The probability of a double false positive would be (4/9990)*(40/9990) ≈ 10^-6. The probability of having the trait given both tests being positive is about 0.999999. Andrew gets 0.998 for this quantity. Hmm.

    The point at which I am unable to understand his exposition is the statement
    What you want is Pr(trait | positive on both tests) = p*a1*b1 / (p*a1*b1 + (1-p)*a2*b2)

    I calculate the quantity 1 - (1-p)*a2*(1-p)*b2 = 1 - a2*b2*(1-p)^2, which is the complement of the double false positive quantity.

    It’s late and I need to run. I’ll think about this more later.

    Bob76

    • > The probability of a double false positive would be (4/9990)*(40/9990) ≈ 10^-6. The probability of having the trait given both tests being positive is about 0.999999.

      To calculate the probability of having the trait given that both tests are positive you have to consider P(A and B|notG)P(notG) = 0.00017% [≈ 10^-6] and P(A and B|G)P(G) = 0.1% [≈ 10^-3] for a total probability of a double positive of 0.10017% [≈ 10^-3]. The probability of having the trait conditional on that double positive is 0.1%/0.10017% = 99.83% [≈ 1 – 10^-3].

        • Above, I showed that running the tests on 10,000 newborns with p = 0.01 would find 143 positive under A and 500 positive under B.

          So, conditional on the event that a newborn tests positive on A and B the chances that it is truly positive are 1-chance(both tests failed).

          The chance that the first test fails is (43/143)—pick at random one of the 43 false positives on test A in the 143 that tested positive under A.

          The chance that the second test fails is (400/9900)—pick at random one of the 400 false positives in the 9900 lacking the trait.
          1 - (43/143)*(400/9900) = 0.98785.

          If we change the numbers to reflect p = 0.001, this becomes
          1 - (4/14)*(40/9990) = 0.9988.

          That’s reasonable agreement with Andrew’s 0.998, given that 10/14 = 0.714 rather than 0.7.

          So, I was right that a simple numerical calculation would give a number close to the correct number. I just did the wrong calculation.

          Bob76

        • I think it is interesting to see how easy it is to come up with a wrong explanation that gets numerically close to the right answer.

          Then think of all the times people just come up with a vague explanation for a positive/negative difference.

        • Bob76,

          While that’s still the “wrong” calculation, at least it’s not wrong by orders of magnitude now, given the probabilities involved.

          You arrive at the solution 1 - (4/14)*(40/9990) = 1 - 0.00114 = 0.99886, but as you said there is an approximation in the 4/14, which should be 3/10. Your “exact” – but not correct! – solution is 1 - 0.00120 = 0.99880. Andrew’s correct solution is 0.99829 = 1 - 0.00171.

          Depending on how you choose to look at them, those numbers may or may not be in reasonable agreement. Note that if you switched A and B in your calculation you would get 1 - 0.00034 = 0.99966, which is even farther away.

          It’s not clear to me how you arrived at it, but your solution comes down to P(trait|A,B) = 1 - P(notrait|A,B) = 1 - P(notrait|A) P(B|notrait), which is not exactly right. The second term is missing a factor P(A)/P(AB). In your example, that’s approximately 14.28/10.02 = 1.425, and 0.00120 becomes 0.00171 – in agreement with the correct solution. Similarly, for the “switched” calculation the additional factor is approximately 50/10.02 = 4.99, and 0.00034 becomes 0.00171.

          Once we have the full expression, we can see that P(A)/P(AB) = 1/P(B|A), and we can argue that this factor can be ignored because P(B|A) is close to 1. In your example it will be approximately 0.7 (the 70% of positives on the first test who have the trait will also test positive on the second test with probability 1; for the other 30%, the probability of a second positive is very low), so the factor is not far from 1. But when the order is switched P(A|B) is approximately 20% and the approximation is less defensible. Much better to include those approximate values for the missing factor, 10/7 and 5 instead of 1, in the solution and get 0.99828 – which is really close to the correct solution 0.99829.
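
          For what it’s worth, a short R check of that (assuming Bob76’s base rate of 0.001 and sensitivity 1, but with the exact false-positive rates implied by the stated precisions):

          p   = 0.001
          fpA = p * (0.3/0.7) / (1 - p)   # Pr(positive on A | no trait) implied by the 70% figure
          fpB = p * (0.8/0.2) / (1 - p)   # Pr(positive on B | no trait) implied by the 20% figure
          p / (p + (1 - p) * fpA * fpB)   # Pr(trait | both positive): 0.99829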

  10. I wrote a small calculator that reproduces the solution provided and displays all the relevant probabilities. (It was a nice opportunity to play with the brand new Shiny for Python framework that runs in the browser.)

    https://ungil.com/doublepositive/

    [There may be another message with an ugly url in the queue awaiting approval. Please ignore it.]

  11. I might be wrong but these p terms look inconsistent. Is this a typo or am I missing something (absolutely possible!)? The first line defines it as (p/(1-p)) and the next uses (1-p)/p.

    “Finally, what you want is 1 / (1 + ((1-p)/p) * (a2/a1) * (b2/b1)). OK, this can be written as X / (1 + X), where X is (p/(1-p)) * (a1/a2) * (b1/b2).
    Given the information above, X = (0.7 / 0.3) * (0.2 / 0.8) * (1-p)/p”.

  12. > I guess they could try out this problem on a new set of respondents, where instead of describing the tests as “not precise,” they describe them as “very precise,” and see what happens.

    Just as the posterior odds of having a disease upon receiving a positive test result depend on the population in question, so too would the value of changing the language of the question depend on the study population. The article writes:

    > The subjects were mostly current and past students in game theory courses. No monetary incentives were provided other than a few subjects being randomly chosen to receive $40 regardless of their answers.

    Would they be familiar with confusion matrix terminology, or would they think more in terms of e.g. the precision of an estimator?

    Bonus question: a similar intro probability question is asked, but it doesn’t provide information in natural language (e.g. 70% of X have Y given A) or mathematical notation (e.g. Pr(Y|A)), but rather provides all the necessary values exclusively in confusion matrix terminology (e.g. the markedness is this, the informedness is that, the balanced accuracy is something else, etc.). Who here would be able to solve the problem without having to consult a glossary?

    (personally, I can never remember what any of them mean, and even have to look up more common vocab, e.g. sensitivity & specificity, half the time)

  13. My earlier answer to this puzzler was egregiously wrong, but let me try to construct an intuition for the correct one. By intuition I mean words, the sort of words I’d use in the classroom in front of students some of whom faint at the first sight of algebra. To be clear, invoking Bayes’ Theorem does *not* count as intuition, although of course we need to get to it sooner or later.

    Start with test A. Of those who test positive, 70% truly have the trait, a figure derived from the test’s performance on the general population of newborns. If this were the only test, any baby testing positive would have a 70% chance of having the trait.

    But it’s *not* being applied to the general population but to a different one, the population of newborns who test positive on test B. This will change A’s success rate.

    Here I would go off on an example. Suppose you have a test of fluency in French that consists of a few passages that have to be translated from French to English. 70% of those who pass the test in the US are found to actually be fluent when the test is administered randomly. But then take the test to a different population in France. There, even with random administration, nearly all the takers will pass and nearly all will be fluent, yielding a success rate close to 100%. The change in population makes a big difference.

    So what about the change from the general population to those who test positive on B? Well, we know that 20% of the B’s (assuming random administration) really have the trait. Now suppose just one in a thousand, 0.001, have it generally. That means the B-positives have 200 times the likelihood of having the trait compared with the overall population. This is like France! That means, in *this* restricted population, test A is going to have far fewer false positives, fairly close to zero.

    To get more precise, at this point you’d have to bring on the algebra, but the intuition would be in place. Essentially, Bayes’ Theorem is an abstract, precise version of the population story.

    When I attended a workshop in teaching reform calculus 30+ years ago, I learned that a fundamental problem in math education is that many students’ first response to a problem is to try to cram it into a known formula, with often bizarre results. It’s much better to work through a rough understanding of the problem and only then invoke the formula.

    As for me, my only excuse is that I saved the problem to think about during a dental procedure a couple of mornings ago. I was literally a numbskull (OK, a numbjaw, but how different is that?) and came up with a corresponding answer. Invoking the independent binary test rule is like the kind of student error I just described, since the rule applies only to different, independently distributed test criteria, as in the case that A and B test for different traits. Then the complication of different populations doesn’t apply.

  14. I like your description, but “it’s *not* being applied to the general population but a different one, the population of newborns who test positive on test B” should be replaced by something like “imagine that Test A is given only to newborns who test positive on Test B.” This makes it clearer that the properties of the tests given in the problem are for the general population.

    • I see your point. The B-positives are not a separate population in the sense that the French are compared to the Americans; they’re a subpopulation. But it’s effectively the same. My language story could be changed so that population #2 is French citizens currently residing in the US, and the parallel is pristine.

  15. To me it’s a bit odd that they’ve chosen 70% and 20%. I think it slightly obscures the key forces (maybe that is the point, or maybe there are more examples in the paper). Qualitatively, the same answer results if only 20% of those testing positive on each test have the trait, so long as 20% is much larger than the prior likelihood.

    But why stop there? If both tests are such that 1% of positives have the disease, and the population prevalence of the disease is 1/100,000, then I believe both tests coming back positive indicate over a 90% likelihood that the tester has it.
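
    Checking that in R, with the prevalence of 1/100,000 assumed above:

    p = 1e-5                        # prevalence of 1 in 100,000
    X = (0.01/0.99)^2 * (1 - p)/p   # posterior odds when both 1%-precision tests come back positive
    X / (1 + X)                     # about 0.91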

    In many ways the problem works by inverting the normal base-rate neglect setup: In those problems, tests are often described as being “precise” in that Pr(positive | have it) and Pr(negative | don’t have it) are very high, but the base rate means even a positive test doesn’t indicate high likelihood of having it. Here, we get Pr(have it | positive), which seems only middling, but as you say, this implies the tests must be extremely precise in the typical sense!

    The description of test precision is inverted depending on the way the question author is trying to surprise us.

  16. Perhaps another “intuitive” solution:

    “prior” odds: p/(1-p);
    “posterior” odds given test A: 0.7/0.3, hence the “likelihood” of test A = (0.7/0.3) / (p/(1-p));
    likewise, the “likelihood” of test B = (0.2/0.8) / (p/(1-p)).

    Now, we observe both A and B, so we add the log likelihoods and the log prior (i.e., multiply on the odds scale):

    (p/(1-p)) * [(0.7/0.3) / (p/(1-p))] * [(0.2/0.8) / (p/(1-p))] = (0.7/0.3) * (0.2/0.8) / (p/(1-p)).

    That is, X/(1-X) = (0.7/0.3) * (0.2/0.8) / (p/(1-p)), which solves for X.

    P.S.
    1. The terms prior/likelihood here just mean that these are two objects with which you can play Bayes’ rule.
    2. In linear regression, if we have an independent design, we just add up the coefficients. Consider a different story: assuming a linear model in which all features are independent, and the expected income of people who went to college is A, and of females is B, then what is the expected income of females who went to college? The solution will only manipulate A and B linearly.
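
    A one-line numerical check of this in R, using qlogis/plogis for the logit and inverse logit and assuming a prior of p = 0.001:

    plogis(qlogis(0.7) + qlogis(0.2) - qlogis(0.001))   # about 0.998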

    • Yuling:

      Yes, I agree the solution is very direct when you multiply likelihood ratios. To me, the likelihood ratio or “odds” formulation has never been intuitive, but I know that for many people (including many Bayesians) it’s very natural. I think lots of the confusion in this problem arises because the respondents are being given various posterior probabilities, and in the usual formulation of such problems, we’re given sensitivity and specificity and we need to derive the posterior probabilities. In that sense, it’s a trick question. Also, the result is very sensitive to the assumption that the two tests provide independent information, which in reality will probably not be the case.

      • My understanding is that we can almost always solve these 2 by 2 table quizzes by a logistic regression, and we can further convert a logistic regression into a linear model by converting probabilities into log odds. The barrier is that log odds are not intuitive-intuitive, while the advantage is that we can work with more familiar generative models.
