Kevin Lewis points us to this research article by David Mandel, Christopher Karvetski, and Mandeep Dhami, which begins:
A routine part of intelligence analysis is judging the probability of alternative hypotheses given available evidence. Intelligence organizations advise analysts to use intelligence-tradecraft methods such as Analysis of Competing Hypotheses (ACH) to improve judgment, but such methods have not been rigorously tested. We compared the evidence evaluation and judgment accuracy of a group of intelligence analysts who were recently trained in ACH and then used it on a probability judgment task to another group of analysts from the same cohort that were neither trained in ACH nor asked to use any specific method. Although the ACH group assessed information usefulness better than the control group, the control group was a little more accurate (and coherent) than the ACH group. Both groups, however, exhibited suboptimal judgment and were susceptible to unpacking effects. Although ACH failed to improve accuracy, we found that recalibration and aggregation methods substantially improved accuracy. Specifically, mean absolute error (MAE) in analysts’ probability judgments decreased by 61% after first coherentizing their judgments (a process that ensures judgments respect the unitarity axiom) and then aggregating their judgments. The findings cast doubt on the efficacy of ACH, and show the promise of statistical methods for boosting judgment quality in intelligence and other organizations that routinely produce expert judgments.
Interesting topic, interesting abstract. I have not tried to assess their evidence. I’d like to see some scatterplots instead of just averages.
Data are here if anyone wants to make scatterplots:
http://journal.sjdm.org/18/18803/data.csv
(this journal requires publication of data)
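For readers curious what "coherentizing then aggregating" means in practice, here is a minimal sketch. It assumes each analyst assigns probabilities to a set of mutually exclusive, exhaustive hypotheses, enforces unitarity by simple proportional rescaling (the paper's exact coherentization procedure may differ, e.g. a minimum-distance projection), and then averages across analysts. The numbers are made up, not taken from the paper's data file.

```python
import numpy as np

def coherentize(p):
    """Rescale judged probabilities so they sum to 1 (unitarity axiom).

    Simple proportional rescaling; one of several ways to coherentize.
    """
    p = np.asarray(p, dtype=float)
    return p / p.sum()

# Three analysts judge four mutually exclusive hypotheses (made-up numbers).
judgments = np.array([
    [0.6, 0.3, 0.2, 0.1],  # sums to 1.2 -> incoherent
    [0.5, 0.2, 0.2, 0.2],  # sums to 1.1 -> incoherent
    [0.4, 0.4, 0.1, 0.2],  # sums to 1.1 -> incoherent
])

coherent = np.vstack([coherentize(row) for row in judgments])
aggregate = coherent.mean(axis=0)  # simple unweighted averaging

# The aggregate is coherent by construction: it sums to 1.
print(aggregate, aggregate.sum())
```

An unweighted mean is the simplest aggregation rule; performance-weighted or trimmed means are common refinements.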
ACH training may be officially approved, but, IMHO, it is inferior to the training used in the Good Judgment Project.
E.g. https://www.sas.upenn.edu/~baron/journal/16/16511/jdm16511.pdf (same journal).
But the studies are not easily compared. The latter used ordinary people rather than members of the Intelligence Community.
Of course, yes, proper aggregation helps a lot, too.
I am a huge fan of Philip Tetlock’s Expert Political Judgment. Having been raised and engaged in the foreign policy academic community, since birth it seems, I welcomed its publication. I lobbied on the Hill, when the opportunity arose, to push for evidence-based policymaking. I argued that small-sample opinions are not sufficient for large-scale consequential decision-making. This is not some genius insight; I can find support in the work of Irving Janis. His Crucial Decisions and Groupthink, I would guess, laid the foundation for Expert Political Judgment. After all, Irving Janis was Philip Tetlock’s thesis advisor.
I heard Irving Janis at Yale in the early ’70s. I may have been influenced by Janis myself.
I think, though, that learning to think well is a lifelong project. And I am not sure there is any one right way to learn. As I occasionally speculate, some people have developed better thinking tools through some combination of luck, chance, and opportunity. Such people have a deeper curiosity about the world and how it works. They are hobbyists, perhaps, and have the luxury of not having to publish or perish.
So basically they want the analyst to act as a classifier and softmax whatever raw scores (“pseudo-probabilities,” or what a machine-learning person would call “logits”) they come up with. Then they determine accuracy by comparing those results, via MAE, with whatever their Bayesian model outputs from the same information (they probably really wanted cross entropy).
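To make the comment concrete, here is a small sketch of the two loss functions being contrasted. The scores and the reference distribution are made up for illustration; the point is just that MAE and cross entropy penalize the same probability vector differently.

```python
import numpy as np

def softmax(z):
    """Convert raw scores ("logits") into a probability distribution."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

# Made-up analyst scores over three hypotheses.
scores = np.array([2.0, 1.0, 0.5])
p = softmax(scores)  # sums to 1 by construction

# Suppose hypothesis 1 turned out to be true.
truth = np.array([1.0, 0.0, 0.0])

mae = np.abs(p - truth).mean()          # what the paper used
cross_entropy = -np.sum(truth * np.log(p))  # what a classifier would use
```

Cross entropy punishes confident wrong answers much more severely than MAE does, which is why it is the default for evaluating probabilistic classifiers.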
I think the first mistake is giving the analyst summary statistics for each tribe rather than the raw data those are based on: with only the summary stats, you lose all information about correlations between the different attributes. The analyst should also, of course, get a description of how the data were arrived at.
The second mistake is not measuring the right thing. They should get some data where each individual is a member of a known tribe (you could simulate this, but real data would be far better for capturing realistic correlations between attributes), along with some measure of the costs of guessing wrong. I.e., guessing tribe A when it was actually tribe B may be worse than guessing tribe A when it was actually tribe C, etc.
Then have the analyst report their probabilities for the test target(s) and use something like this “bilinear loss function” (cross entropy is a special case where all penalties are the same) to judge performance against the true membership of each member.
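One way to implement the cost-sensitive scoring the comment describes is an expected-cost loss with an explicit cost matrix. The matrix entries below are made up, and this is only one reading of the “bilinear loss” idea: score a reported probability vector by the costs it assigns to each possible wrong call, given the true tribe.

```python
import numpy as np

# Hypothetical cost matrix: C[i, j] is the cost of calling tribe j
# when the true tribe is i. Diagonal (correct calls) costs nothing.
C = np.array([
    [0.0, 5.0, 1.0],  # true A: mistaking it for B is worse than for C
    [2.0, 0.0, 2.0],
    [1.0, 3.0, 0.0],
])

def expected_cost(p, true_tribe, C):
    """Expected misclassification cost of reported probabilities p,
    given the true tribe index and cost matrix C."""
    p = np.asarray(p, dtype=float)
    return float(np.dot(C[true_tribe], p))

p = np.array([0.2, 0.7, 0.1])      # analyst's reported probabilities
loss = expected_cost(p, 0, C)      # true tribe is A (index 0)
```

When all off-diagonal costs are equal, this collapses to penalizing only the probability mass placed off the true tribe, i.e. all errors are treated the same, which is the “equal penalties” special case the comment mentions.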
They also later discover that aggregating the predictions of multiple analysts helps, which amounts to rediscovering ensemble averaging.
tl;dr
I think some pretty basic machine learning background could be very useful for people studying this stuff.