Probability paradox may be killing thousands

Brian Kinghorn points to this news article by Christian Grothoff and J. M. Porup, “The NSA’s SKYNET program may be killing thousands of innocent people; ‘Ridiculously optimistic’ machine learning algorithm is ‘completely bullshit,’ says expert.” The article begins:

In 2014, the former director of both the CIA and NSA proclaimed that “we kill people based on metadata.” Now, a new examination of previously published Snowden documents suggests that many of those people may have been innocent.

Last year, The Intercept published documents detailing the NSA’s SKYNET programme. According to the documents, SKYNET engages in mass surveillance of Pakistan’s mobile phone network, and then uses a machine learning algorithm on the cellular network metadata of 55 million people to try and rate each person’s likelihood of being a terrorist.

The news article displays some leaked documents labeled Top Secret. I don’t know if it’s legal for me to copy them here, but one of them says, “0.18% False Alarm Rate at 50% Miss Rate.” Grothoff and Porup write:

A false positive rate of 0.18 percent across 55 million people would mean 99,000 innocents mislabelled as “terrorists” . . . The leaked NSA slide decks offer strong evidence that thousands of innocent people are being labelled as terrorists; what happens after that, we don’t know.

Kinghorn writes:

I find this quite disturbing. I’m betting a lot can be chalked up to the Base Rate Fallacy.
If Pr(being terrorist) < Pr(flagged by model), then Pr(terrorist | flagged) < Pr(flagged | terrorist).
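
To make the base-rate point concrete, here is a back-of-the-envelope sketch in Python. The 0.18% false alarm rate and 50% miss rate are the figures from the leaked slide; the number of actual terrorists is a pure guess, used only to illustrate the arithmetic:

    # Base-rate arithmetic with the slide's figures and a guessed prevalence.
    population = 55_000_000     # people whose metadata SKYNET scores (from the article)
    true_terrorists = 2_000     # pure guess, for illustration only
    false_alarm_rate = 0.0018   # Pr(flagged | not terrorist), from the slide
    miss_rate = 0.50            # Pr(not flagged | terrorist), from the slide

    innocents = population - true_terrorists
    false_positives = false_alarm_rate * innocents        # ~99,000 innocents flagged
    true_positives = (1 - miss_rate) * true_terrorists    # ~1,000 terrorists flagged

    p_terrorist_given_flagged = true_positives / (true_positives + false_positives)
    print(f"Pr(terrorist | flagged) = {p_terrorist_given_flagged:.3f}")
    # about 0.010 under this guess, i.e. roughly 99% of the flagged people would be innocent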

P.S. A commenter points to a news article by Martin Robbins that concludes, “Nobody is being killed because of a flaky algorithm.” It’s hard to know either way, given that so much of this is secret.

41 thoughts on “Probability paradox may be killing thousands”

  1. Yes, the false positive rate is way too high relative to the base rate, but that’s only a problem if everyone flagged as positive is automatically targeted. My problem is with the “may be killing” part; that’s pretty unlikely. However evil you think the NSA is, I roll to disbelieve that they will kill 99,000 people to get a few hundred terrorists. There are a million people on the US Terror Watch List, and they’re not all about to be assassinated.

    On the other hand, if the NSA knows when people switch phones and where people go, it’s not hard to believe that they can listen in on the actual phone calls. A couple hundred dedicated analysts could go through the suspicious phone calls of 99,000 people, but not 55 million. If SKYNET is used to generate a list of people to eavesdrop on, you can still think it’s pretty evil but it’s no longer monstrously villainous.

    Also, big ups to the article for introducing the word “terroristiness”.

    • “A couple hundred dedicated analysts could go through the suspicious phone calls of 99,000 people, but not 55 million.”

      Sure, but when I get a voicemail from Comcast it gets converted to text and emailed to me. So, it would seem possible to convert these conversations to text as well (no reason to assume that Urdu is any harder for this than English), and then run algorithms on the text looking for things that are highly suspicious, and then have the analysts go through the suspicious phone calls of maybe 9,000 people.

      • I think there is a reason to believe that Urdu is harder. Mainly, Google bought the company that became “Google Voice” specifically to eavesdrop on several hundred million people in the US and create a database of natural language recordings to train voice recognition systems with. It’s not so clear that a corpus of similar quality is available for Urdu. So, not in principle harder, but in practice possibly quite a bit harder. Possibly not though. I don’t know what is available to the NSA.

    • It’s still monstrously villainous, not to mention ineffective. You’ve simply added another layer of conditional probability.

      Still it would be fun in a terribly morbid kind of way to think about how we could approach estimating the degree of villainy. Maybe we can make a good “Stan to the rescue” thread out of this. Let’s start from the beginning. We have a population, from the article: 192 million people in Pakistan. Some small percentage of that population are terrorists. What’s a good prior here? How many terrorists are in Pakistan? Let’s pretend we’re past the terrible stage of definitions here and we have a good measurement of what exactly constitutes a “terrorist”. (Although, I’d argue this is the first place to add some villainy score to the model: what if the definition of terrorist is wrong? You may have faith that the NSA uses some more robust definition than “brown-skinned male between the ages of 16 and 38”, but I’m not so sure I do.)

      For the sake of argument, let’s say there are 500,000 terrorists in Pakistan. That seems like a lot, that’s like a third of Philadelphia, but it avoids too many decimals in Pr(Terrorist | Resident of Pakistan), which is something like 0.25%.

      Ok, next step. We have a sample of the population. 55 million people with cell phones. So now we have to think about Pr(cell phone|terrorist, Pakistani) and Pr(cell phone|not terrorist, Pakistani). Do all terrorists have cell phones? Or might they be less likely than the average Joe Pakistani to have a cell phone? Put a prior on it. Stan it up.

      I think we’re now at the SKYNET stage, right? What do we need? I think we need Pr(flagged by the model), Pr(flagged | terrorist), which from the article looks like it’s 0.5 (50% miss rate), and Pr(flagged | not terrorist) (are we calling this 0.18%?). We should be able to multiply all of these things out to this point to get our flagged population, which consists of some number of people who satisfy the condition (not terrorist | flagged) and some who satisfy (terrorist | flagged). BUT we should also be able to get an estimate of the number of people who satisfy the condition (terrorist | not flagged), which is at least half of the terrorists in Pakistan, but probably more, since it seems likely at least one terrorist doesn’t use a cell phone.

      But then you’ve pointed out that flagged doesn’t automatically equal a kill. Fine, so let’s add another filter. We need Pr(targeted | flagged terrorist) and Pr(targeted | flagged non-terrorist). Sure, there are a couple of hundred analysts and maybe some of them are really good. But as Andrew has pointed out with the running retraction thread, even the best people make mistakes. It may be hard to convince a journal to retract an article, but it’s a fair sight harder to convince God to retract a Hellfire missile. So some innocent people are still being killed, and some flagged terrorists are also probably being let off the hook.

      At the end of the day we should be able to get a few numbers out of this model: the number of innocent people killed, the number of terrorists killed, and the number of terrorists unkilled. Those numbers are what we should base our villainy and effectiveness estimates on. Lots of subjectivity here, but my gut instinct is that this program probably kills a lot of innocent people without putting all that big a dent in the terrorist population.
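
      Here is a rough sketch of that chain of conditionals in Python. The 192 million, 500,000, 50%, and 0.18% figures are the ones above; the cell-phone and targeting probabilities are made-up placeholders, exactly the quantities a Stan model would put priors on instead of point guesses:

          pakistan_pop = 192_000_000
          terrorists = 500_000                      # the guess above
          phone_users = 55_000_000                  # from the article

          p_phone_given_terrorist = 0.80            # made-up placeholder
          p_flag_given_terrorist = 0.50             # implied by the 50% miss rate on the slide
          p_flag_given_innocent = 0.0018            # the 0.18% false alarm rate on the slide
          p_target_given_flagged_terrorist = 0.50   # made-up analyst filter
          p_target_given_flagged_innocent = 0.01    # made-up analyst filter

          phone_terrorists = terrorists * p_phone_given_terrorist
          phone_innocents = phone_users - phone_terrorists

          flagged_terrorists = phone_terrorists * p_flag_given_terrorist
          flagged_innocents = phone_innocents * p_flag_given_innocent
          unflagged_terrorists = terrorists - flagged_terrorists   # includes terrorists with no phone

          killed_innocents = flagged_innocents * p_target_given_flagged_innocent
          killed_terrorists = flagged_terrorists * p_target_given_flagged_terrorist

          print(round(killed_innocents), round(killed_terrorists), round(unflagged_terrorists))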

  2. It’s not completely clear to me (and I’m assuming the NSA hasn’t clarified) whether “false alarm rate” means “false positive rate”, in which case all the freaking out is justified, or if it means “false discovery rate”, which would make this much ado about nothing. An FDR is a posterior probability [meaning P(terrorist | flagged)] and it is a common metric. People usually use metrics like this when base rates are so low; 99% predictive accuracy doesn’t mean much when 99.9999% of cases are negatives.

    My interpretation of “false alarm rate” would be that it’s an FDR. We could also reasonably assume that the folks at the NSA are more knowledgeable than a machine learning novice, are aware of the extremely low base rate, and reported the most meaningful statistic. That would also explain why the miss rate (assuming that means the false negative rate) is so high: they are probably (rightly) more concerned with false positives than false negatives, given the extremely low base rate.
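
    The two readings give very different pictures. A quick illustration in Python, with an arbitrary assumed base rate just to show the gap:

        n = 55_000_000
        base_rate = 1e-4                    # arbitrary assumption: 1 in 10,000 is a real terrorist
        positives = n * base_rate           # 5,500
        negatives = n - positives

        # Reading 1: 0.18% is the false positive rate, Pr(flagged | not terrorist)
        fp = 0.0018 * negatives             # ~99,000 innocents flagged
        tp = 0.50 * positives               # 50% miss rate from the slide
        fdr_if_fpr = fp / (fp + tp)         # ~0.97: nearly every flag would be wrong

        # Reading 2: 0.18% is the false discovery rate, Pr(not terrorist | flagged)
        flagged_if_fdr = tp / (1 - 0.0018)  # ~2,755 flags total, only ~5 of them innocent

        print(round(fp), round(fdr_if_fpr, 2), round(flagged_if_fdr))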

    • There’s an interesting quote from your article.

      “In the end though they were able to train a model with a false positive rate – the number of people wrongly classed as terrorists – of just 0.008%.”

      That is not even remotely believable. This is going to be a very squishy, fuzzy model and a false positive rate like that is amazingly good.

      Whatever they did to build the models, they are B.S.ing their final results.

      • Suppose all of the people classified as terrorists are wrongly classified, but only 0.008% of people are classified at all. You could then report this statistic if you wanted to make yourself look good, knowing that it sounds like only 0.008% of the classified people are wrongly classified, whereas in fact only 0.008% of people are classified as terrorists, and all of them wrongly :-)
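
        In numbers (the 55 million is from the article; the rest just spells out that worst-case reading):

            n = 55_000_000
            classified = n * 8e-5            # 0.008% of everyone gets the "terrorist" label: ~4,400 people
            wrongly_classified = classified  # worst case: every one of them is innocent
            print(classified, wrongly_classified / n)   # ~4,400 people, yet the reported rate is "just 0.008%"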

  3. Andrew, you’re falling for something you usually rail against here: journalists writing a deeply misleading headline/article that takes a preliminary scientific (in some sense at least) thing out of its context and sensationalizes it beyond recognition for clicks, and then you’re uncritically repeating it, making the problem worse.

    The Guardian article posted by “skyjo” is an incredibly important rejoinder that makes this make a lot more sense and adds back the context. I urge you to read it and maybe add an update at the bottom of your post recognizing that it’s a lot more complicated than “they used this algorithm to kill people.” There’s no statement at all that this was any sort of target list, or any evidence that this was ever meant to be used in that way.

  4. As pointed out by a number of posters above, I agree it is more complicated than the Grothoff & Porup article suggests, and one can reasonably assume the intelligence communities are not filled with idiots. However, what is not being discussed is the ‘3 hop rule’ (http://www.theguardian.com/world/interactive/2013/oct/28/nsa-files-decoded-hops) and what this means for the incorrectly labelled people as well as their friends and family, not to mention the friends and family of the latter.

  5. Leaving aside the morality of such a program for a second, am I correct in understanding that they are using a random forest with 80 features and 7 positive labels (known terrorists)? Only seven? How do you create a model with that?

    • That’s a real problem in their methodology. Moreover, these 7 are not going to be a random sample of terrorists but instead most likely a weird subgroup. Maybe really clumsy terrorists since they got themselves spotted.

      So all in all, this is a classic example of Bad Stats.

  6. Even if the program is 100% accurate and there are absolutely no innocent victims (which we all know is not true), this is absolutely wrong. And let’s not forget there is a “Nobel Peace Prize” winner behind this programme.

  7. The proper analogy here is to screening for very rare diseases. The original article was completely irresponsible. Lots of organizations use statistical methods to prioritize the time of investigators (e.g., for fraud detection). As Martin Robbins ably points out, this is the case here as well, and it is pretty obvious if you look at the rest of the slide deck.

    It is also informative to consider the hype culture within the defense and intelligence communities. In my experience, everyone is claiming to have a new technology that will create a “Revolution in Military Affairs”. Snowden seems to have captured a lot of these hype-filled slide decks. Journalists need to be more skeptical!

  8. 1. Where’s Entosophy? I’d think he’d have something to say about this.

    2. For about 15 years I built measurement devices and developed associated signal processing algorithms for hazardous materials detection. Our primary customer was the DOD. They’d tolerate maybe a couple of false positives per week. That gets me thinking about the false positive rate called out above: if you’ve got a false positive rate of 0.2%, then how often do you get a false positive? The reason our false alarm rate had to be down in the 1-2 per week range is that if it was significantly higher than that, the users would just turn the device off. True positives weren’t frequent enough for them to tolerate many false positives. (The operational consequences of false positives were a pain in the ass for users.)

    3. Automatic Target Recognition (ATR) is nothing new. What is relatively new is its application to ‘fuzzy’ problems. It’s one thing to detect a tank in hide: shape, color, and other physical properties are well-defined. It’s another thing to apply ATR to detecting things which are more ambiguously defined.

    4. How’d they get the ground truth for false alarm rate and detection rate? Think about it… how did they know those people were terrorists? Did they collect on a bunch of people, follow them until they committed a terrorist act, and then develop their detection algorithm using the data they collected? That’s a charming thought. Did they know a priori who the terrorists were and just collected on them until they had enough data to develop their classifier? If they knew the people they were following were terrorists then why did they let them circulate freely with the non-terrorist population? And what was n? Did they follow half a dozen people around until they committed terrorist acts? Using a small sample would seem problematic. Did they collect on a thousand terrorists? How’d that work? How did they obtain their terrorist and non-terrorist samples? I would genuinely like to know that.

    • Isn’t the better analogy to think of this as marketing lead generation using data mining?

      No one’s going to kill a “terrorist” just because a machine learning algo with a 0.2% false alarm rate flags them. But if you want to present (say) analysts or Law Enforcement a list of targets to screen deeper into then the 0.2% false alarm rate isn’t that alarming?

      • > But if you want to present (say) analysts or Law Enforcement a list of targets to screen deeper into then the 0.2% false alarm rate isn’t that alarming?

        Automatic Target Cueing (ATC) as opposed to Automatic Target Recognition (ATR). Rely on human-in-the-loop for confirmation. No, a 0.2% FAR isn’t unreasonable for an ATC algorithm. FAR at Pd=0.8 (rather than Pd=0.5) is a more typical figure in DOD circles though. Again, questions come to mind: Even for ATC would they really be happy with Pd=0.5 or did the FAR completely go to hell in a handbasket between 0.5 and 0.8? And I’m still wanting to know how they got their sample to establish Pd.

        PS I think that marketing lead generation is a good analogy.

        • I think the false alarm rate that is reasonable for any given application is heavily context dependent:

          0.2% is terrible if that’s going to automatically trigger a drone strike. OTOH, if all it’s going to initiate is a closer look at the movements & activities of a Pakistani citizen, 0.2% doesn’t sound too bad at all.

        • > I think the false alarm rate that is reasonable for any given application is heavily context dependent…

          Yes, the objective of an ATC algorithm is to reduce the number of samples you have to inspect in detail. Reducing the number of detailed analyses you have to do by a factor of 500* is a good thing if you’re an analyst.

          * The factor of 500, 1/Pfa, is the ratio of the number of potential “targets” you’d have to screen if you didn’t run an ATC algorithm on your data to the number you need to screen when you do run ATC.
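
          In round numbers (55 million phone users from the article, 0.2% FAR; the true detections are negligible by comparison):

              n = 55_000_000
              pfa = 0.002                      # 0.2% false alarm rate
              without_atc = n                  # screen everybody by hand
              with_atc = n * pfa               # ~110,000 cued for human review
              print(without_atc / with_atc)    # 500.0 -- the 1/Pfa reduction factor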

  9. What about these two questions before throwing around the numbers?
    How is it legitimate to kill a supposed terrorist?
    How many terrorists are created by the drone war?

    • > How is it legitimate to kill a supposed terrorist?

      Killing a citizen of another country in their home country, a country with which we are not at war? Seems pretty sketchy to me but it’s been all the rage for the past decade and a half so what do I know?

      > How many terrorists are created by the drone war?

      When I mentioned Entosophy above I was thinking of a post of his where he set up a rate equation (dterrorists/dtime = …) and demonstrated that there were very plausible scenarios where attempting to kill your way out of the problem would make it worse.
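
      I don’t have that post at hand, but a toy version of that kind of rate equation might look like the sketch below; all the numbers are invented placeholders, not his.

          # Toy model (not Entosophy's actual equation): dT/dt = recruitment - kills + blowback
          T = 10_000              # current number of terrorists (invented)
          recruit_rate = 0.05     # new recruits per existing terrorist per year (invented)
          kills_per_year = 500    # invented
          blowback = 1.5          # new terrorists created per person killed (invented)

          for year in range(10):
              # With blowback > 1, each kill adds more terrorists than it removes,
              # so T grows faster than it would with no kills at all.
              T += recruit_rate * T - kills_per_year + blowback * kills_per_year

          print(round(T))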

      • People have been killing supposed terrorists in other countries since before drones and machine learning were invented. And innocents have been unfairly targeted as well…and maybe even killed as well. The only difference is that before, the process of identifying people to kill was primarily human-based, and now we have a mechanical component as well to identify targets…and this mechanical component reduces the possibility for human error.

        Putting aside the ethics of whether the bombing campaigns are themselves legitimate, it seems that reducing civilian casualties from these bombings is essential. If the drone campaign *reduces* the probability of human error, then fewer civilians will die. This is a good thing. The algorithm will make mistakes, and civilians will die…but humans also make mistakes and kill civilians too. If the machines make fewer mistakes than the humans (even if it is as small as 0.1% fewer mistakes), then I would rather trust the machines to fight our wars.

        • I believe that there’s a strong case that machine-aided targeting decisions are more accurate than human-only decisions. (NB: “machine-aided”, not “machine-only”. Under no circumstances would I trust machines to fight wars.) The downside is that people with command authority will use improved detection statistics in Scenario A to expand the scope of their endeavor to include Scenario B. “Hey, I’ve got a better tool than I had before. What else can I do with it?” This may be an urban legend, but once upon a time I recall hearing that air bags reduced p(death|velocity), but that people started driving faster because cars were safer, so that the death rate didn’t drop. Whether or not that’s a true story, my concern is that better machine learning algorithms lead to that sort of decision making on the DOD side.

        • I fail to see the analogy. Your “citizen of another country” point seemed to rest on sovereignty issues. Which would be valid typically.

          But if the sovereign involved does not object, the sovereignty/jurisdiction arguments become moot, is all I’m saying.

          If your point instead was that the terrorists are being denied due process, I agree. That’s indeed a problem. But orthogonal to the “Killing a citizen of another country” issue.

        • I pretty much concur with your framing of the issues and it’s a worthwhile topic for discussion/debate but in the interest of not getting too far off topic from statistical modeling and causal inference, I’ll desist.

          I’ll toss out another potentially controversial statistical modeling topic: economic forecasting. More specifically, the kerfuffle which has resulted from UMass economist Gerald Friedman’s forecasts derived from Bernie Sanders’ economic proposals. Many professional economists are claiming that Friedman’s predictions are absurd because his presumptions are no good. I tend to agree, but what info I can find suggests that the models in use for forecasting economic growth (and related macroeconomic things) aren’t quantitative anyway (they get the sign right, and that’s about it), so why should I take the details of Friedman’s predictions or the detailed criticisms of his predictions too seriously? “Quantitative economic forecasting” seems an oxymoron. Thoughts?

        • Regardless of the country or time, there is only one reason for an “interventionist foreign policy”: money & power. All the rest is propaganda. Human rights? WMDs? … It’s like an insult to our intelligence sometimes.

        • The distinction between intention & effects is important.

          Would European Jews be happier had the USA not intervened in WW2? Would Rwandans or Sudanese, on average, be happier had the UN not intervened?

        • I should clarify that by “interventionist” I mean a predisposition to intervene. I’m no more an advocate of isolationism than I am interventionism. I advocate restraint and de-escalation. Even in cases where there’s a compelling argument in favor of intervention we should take more seriously the possibility that intervention could do more harm than good.
