
“Boosting intelligence analysts’ judgment accuracy: What works, what fails?”

Kevin Lewis points us to this research article by David Mandel, Christopher Karvetski, and Mandeep Dhami, which begins:

A routine part of intelligence analysis is judging the probability of alternative hypotheses given available evidence. Intelligence organizations advise analysts to use intelligence-tradecraft methods such as Analysis of Competing Hypotheses (ACH) to improve judgment, but such methods have not been rigorously tested. We compared the evidence evaluation and judgment accuracy of a group of intelligence analysts who were recently trained in ACH and then used it on a probability judgment task to another group of analysts from the same cohort that were neither trained in ACH nor asked to use any specific method. Although the ACH group assessed information usefulness better than the control group, the control group was a little more accurate (and coherent) than the ACH group. Both groups, however, exhibited suboptimal judgment and were susceptible to unpacking effects. Although ACH failed to improve accuracy, we found that recalibration and aggregation methods substantially improved accuracy. Specifically, mean absolute error (MAE) in analysts’ probability judgments decreased by 61% after first coherentizing their judgments (a process that ensures judgments respect the unitarity axiom) and then aggregating their judgments. The findings cast doubt on the efficacy of ACH, and show the promise of statistical methods for boosting judgment quality in intelligence and other organizations that routinely produce expert judgments.

Interesting topic, interesting abstract. I have not tried to assess their evidence. I’d like to see some scatterplots instead of just averages.


  1. Jon Baron says:

    Data are here if anyone wants to make scatterplots:
    (this journal requires publication of data)

    ACH training may be officially approved, but, IMHO, it is inferior to the training used in the Good Judgment Project.
    E.g. (same journal).

    But the studies are not easily compared. The latter used real people rather than members of the Intelligence Community.

    Of course, yes, proper aggregation helps a lot, too.

    • I am a huge fan of Philip Tetlock’s Expert Political Judgment. Having been raised and engaged in the foreign policy academic community, since birth it seems, I welcomed its publication. I lobbied on the Hill when the opportunity arose to push for evidence-based policymaking. I argued that small-sample opinions are not sufficient for large-scale, consequential decision-making. This is not some genius insight. I can find support in the work of Irving Janis. Crucial Decisions and Groupthink, I would guess, laid the foundation for Expert Political Judgment. After all, Irving Janis was Philip Tetlock’s thesis advisor.
      I heard Irving Janis at Yale in the early ’70s. I may have been influenced by Janis myself.

      I think, though, that learning to think well is a lifelong project. And I am not sure there is any one right way to learn. As I occasionally speculate, some have developed better thinking tools through some luck, chance, and opportunity. Such people have a deeper curiosity about the world and how it works. They are hobbyists, perhaps, and have the luxury of not having to publish or perish.

  2. Anoneuoid says:

    Participants read about a fictitious case in which they were required to assess the tribe membership of a randomly selected person from a region called Zuma. They read that there were four tribes (A-D) that constituted 5%, 20%, 30%, and 45% of Zuma, respectively. Each tribe was then described in terms of 12 probabilistic cue attributes. For instance, for the Acanda tribe (i.e., Tribe A) the description read:

    Acanda: 10% of the tribe is under 40 years of age, 75% use social media, 50% speak Zebin (one of two languages spoken in Zuma), 25% are employed, 90% practice a religion, 25% come from a large family (i.e., more than 4 children), 50% have been educated up to the age of 16, 75% have a reasonably high socio-economic status relative to the general population, 75% speak Zimban (one of two languages spoken in Zuma), 75% have a political affiliation, 75% wear traditional clothing, and 25% have fair coloured skin.

    Next, the target’s cue attributes were described as follows:

    The target is under 40 years of age, uses social media, speaks Zebin, is employed, does not practice a religion, does not come from a large family, does not have education up to age 16, does not have a reasonably high socio-economic status, speaks Zimban, is not politically affiliated, wears traditional clothing, and does not have fair coloured skin.

    As described previously, more often than not, individuals produce probability estimates that are incoherent and violate probability axioms
    The primary measure of accuracy we use is mean absolute error (MAE), which in this research computes the mean absolute difference between a human-originated judgment (i.e., raw, transformed, or aggregated), y_i, and the corresponding posterior probabilities derived from Bayes theorem assuming class conditional independence (i.e., a “naïve Bayes” model), x_i.

    So basically they want the analyst to be a classifier and softmax whatever “pseudo-probabilities” (sometimes called “logits”) they come up with. Then they determine accuracy by comparing those results, using MAE, with whatever their Bayesian model outputs from the same info (they probably really wanted cross entropy).
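
A minimal sketch of that setup, under loudly stated assumptions: the priors and the Tribe A cue rates are the ones quoted above, but the cue rates for Tribes B-D are hypothetical placeholders (the excerpt does not give them), and the rescaling step is just one simple way to make judgments sum to 1, not necessarily the coherentization procedure used in the paper.

```python
from math import prod

# Base rates of tribes A-D, as quoted from the paper.
PRIORS = [0.05, 0.20, 0.30, 0.45]

# P(cue present | tribe) for the 12 cues, in the order listed in the excerpt.
CUE_RATES = [
    # Tribe A: values quoted in the excerpt.
    [0.10, 0.75, 0.50, 0.25, 0.90, 0.25, 0.50, 0.75, 0.75, 0.75, 0.75, 0.25],
    # Tribes B-D: HYPOTHETICAL values, for illustration only.
    [0.25, 0.50, 0.75, 0.50, 0.75, 0.50, 0.25, 0.50, 0.50, 0.25, 0.50, 0.50],
    [0.50, 0.25, 0.50, 0.75, 0.50, 0.75, 0.75, 0.25, 0.25, 0.50, 0.25, 0.75],
    [0.75, 0.90, 0.25, 0.75, 0.25, 0.10, 0.50, 0.50, 0.90, 0.50, 0.50, 0.10],
]

# The target's cues from the excerpt: 1 = has the attribute, 0 = does not.
TARGET = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0]


def naive_bayes_posterior(priors, cue_rates, target):
    """Posterior over tribes, treating cues as independent given the tribe."""
    joint = []
    for prior, rates in zip(priors, cue_rates):
        likelihood = prod(r if has else 1.0 - r for r, has in zip(rates, target))
        joint.append(prior * likelihood)
    total = sum(joint)
    return [j / total for j in joint]


def normalize(judgments):
    """Rescale non-negative judgments so they sum to 1 (assumed, simple fix)."""
    total = sum(judgments)
    return [j / total for j in judgments]


def mean_absolute_error(judged, reference):
    """Mean absolute difference between judged and reference probabilities."""
    return sum(abs(y - x) for y, x in zip(judged, reference)) / len(reference)


posterior = naive_bayes_posterior(PRIORS, CUE_RATES, TARGET)
raw_judgment = [0.2, 0.4, 0.6, 0.8]  # hypothetical analyst answer; sums to 2.0
coherent = normalize(raw_judgment)

print("posterior:", [round(p, 3) for p in posterior])
print("MAE raw:", round(mean_absolute_error(raw_judgment, posterior), 3))
print("MAE rescaled:", round(mean_absolute_error(coherent, posterior), 3))
```

Because the B-D rates are made up, the printed numbers say nothing about the study itself; the point is only the shape of the computation.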

    I think the first mistake is using summary statistics for each tribe rather than the raw data those are based on. With only the summary stats, you lose all information about correlations between the different attributes. The analyst should also, of course, get a description of how this data was arrived at.

    The second mistake is not measuring the right thing. They should get some data where each individual is a member of a known tribe (you can simulate, but real data would be far better to capture realistic correlations between the different attributes) and some measure of the costs of guessing wrong. I.e., guessing tribe A when it was actually tribe B may be worse than guessing tribe A when it was tribe C, etc.

    Then have the analyst report their probabilities for the test target(s) and use something like this “bilinear loss function” (cross entropy is a special case where all penalties are the same) to judge the performance when compared to the true membership for each member.
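
As a rough illustration of cost-sensitive scoring, here is a sketch of an expected misclassification cost next to ordinary log loss. It is not claimed to be the exact “bilinear loss function” referred to above, and the cost matrix is entirely hypothetical.

```python
from math import log

# HYPOTHETICAL cost matrix: COST[true][guess] is the penalty for putting
# probability on tribe `guess` when the person belongs to tribe `true`.
COST = [
    [0, 1, 2, 4],
    [1, 0, 2, 4],
    [2, 2, 0, 1],
    [4, 4, 1, 0],
]


def expected_cost(probs, true_idx, cost):
    """Probability-weighted penalty given the true tribe."""
    return sum(cost[true_idx][j] * p for j, p in enumerate(probs))


def log_loss(probs, true_idx, eps=1e-12):
    """Cross entropy against a one-hot truth; ignores which wrong tribe got mass."""
    return -log(max(probs[true_idx], eps))


judgment = [0.1, 0.6, 0.2, 0.1]  # hypothetical analyst probabilities for A-D
true_tribe = 1                   # the person is actually in tribe B

print("expected cost:", expected_cost(judgment, true_tribe, COST))
print("log loss:", log_loss(judgment, true_tribe))
```

With a cost matrix that charges 1 for every wrong tribe, the expected cost reduces to one minus the probability given to the true tribe, so any asymmetry in the scores comes entirely from the penalties chosen.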

    They also later discover that aggregating the predictions of multiple analysts helps, which is them rediscovering ensemble averaging:

    …when aggregated, analysts’ judgments are substantially more accurate than aggregated random judgments. Second, it is evident from the left panel in Figure 1 that aggregation greatly improves accuracy in analysts’ judgments, but to a degree comparable to that observed in the randomly generated response data. This suggests that most of the error reduction observed is due to variance reduction from averaging
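
The variance-reduction point can be sketched with a small simulation. Everything here is assumed for illustration: the reference probabilities, the number of simulated analysts, and the Gaussian noise model are hypothetical and are not taken from the paper's data.

```python
import random
from statistics import mean


def mae(judged, reference):
    """Mean absolute error between judged and reference probabilities."""
    return mean(abs(y - x) for y, x in zip(judged, reference))


def simulate_analysts(reference, n_analysts, noise_sd, rng):
    """Each simulated analyst is the reference plus Gaussian noise, clipped to [0, 1]."""
    return [
        [min(1.0, max(0.0, x + rng.gauss(0.0, noise_sd))) for x in reference]
        for _ in range(n_analysts)
    ]


def average_judgments(analysts):
    """Unweighted mean across analysts, hypothesis by hypothesis."""
    return [mean(column) for column in zip(*analysts)]


rng = random.Random(0)                # fixed seed so the run is repeatable
REFERENCE = [0.05, 0.15, 0.30, 0.50]  # hypothetical "correct" posterior
analysts = simulate_analysts(REFERENCE, n_analysts=25, noise_sd=0.15, rng=rng)

individual_mae = mean(mae(a, REFERENCE) for a in analysts)
aggregated_mae = mae(average_judgments(analysts), REFERENCE)

print("mean individual MAE:", round(individual_mae, 3))
print("MAE of averaged judgment:", round(aggregated_mae, 3))
```

The MAE of the averaged judgment can never exceed the average of the individual MAEs (the triangle inequality), which is why even noisy or random responses look better once averaged; how much better depends on how independent the errors are.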

    I think some pretty basic machine learning background could be very useful for people studying this stuff.
