Wednesday, March 15, 2006

Data mining is not the way to tackle terrorism

Bruce Schneier again: Why Data Mining Won't Stop Terror

"The promise of data mining is compelling, and convinces many. But it's wrong. We're not going to find terrorist plots through systems like this, and we're going to waste valuable resources chasing down false alarms. To understand why, we have to look at the economics of the system.

Security is always a trade-off, and for a system to be worthwhile, the advantages have to be greater than the disadvantages. A national security data-mining program is going to find some percentage of real attacks and some percentage of false alarms. If the benefits of finding and stopping those attacks outweigh the cost -- in money, liberties, etc. -- then the system is a good one. If not, you'd be better off spending that capital elsewhere.

Data mining works best when you're searching for a well-defined profile, a reasonable number of attacks per year and a low cost of false alarms...

Terrorist plots are different. There is no well-defined profile and attacks are very rare. Taken together, these facts mean that data-mining systems won't uncover any terrorist plots until they are very accurate, and that even very accurate systems will be so flooded with false alarms that they will be useless.

All data-mining systems fail in two different ways: false positives and false negatives. A false positive is when the system identifies a terrorist plot that really isn't one. A false negative is when the system misses an actual terrorist plot. Depending on how you "tune" your detection algorithms, you can err on one side or the other: you can increase the number of false positives to ensure you are less likely to miss an actual terrorist plot, or you can reduce the number of false positives at the expense of missing terrorist plots...

When it comes to terrorism, however, trillions of connections exist between people and events -- things that the data-mining system will have to "look at" -- and very few plots. This rarity makes even accurate identification systems useless.

Let's look at some numbers. We'll be optimistic -- we'll assume the system has a one in 100 false-positive rate (99 percent accurate), and a one in 1,000 false-negative rate (99.9 percent accurate). Assume 1 trillion possible indicators to sift through: that's about 10 events -- e-mails, phone calls, purchases, web destinations, whatever -- per person in the United States per day. Also assume that 10 of them are actually terrorists plotting.

This unrealistically accurate system will generate 1 billion false alarms for every real terrorist plot it uncovers. Every day of every year, the police will have to investigate 27 million potential plots in order to find the one real terrorist plot per month. Raise that false-positive accuracy to an absurd 99.9999 percent and you're still chasing 2,750 false alarms per day -- but that will inevitably raise your false negatives, and you're going to miss some of those 10 real plots.

This isn't anything new. In statistics, it's called the "base rate fallacy," and it applies in other domains as well. For example, even highly accurate medical tests are useless as diagnostic tools if the incidence of the disease is rare in the general population. Terrorist attacks are also rare, so any "test" is going to result in an endless stream of false alarms...

Finding terrorism plots is not a problem that lends itself to data mining. It's a needle-in-a-haystack problem, and throwing more hay on the pile doesn't make that problem any easier. We'd be far better off putting people in charge of investigating potential plots and letting them direct the computers, instead of putting the computers in charge and letting them decide who should be investigated."
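To see where Schneier's numbers come from, here is a minimal sketch that reruns the arithmetic from the quoted passage. The inputs (one trillion indicators a year, ten real plots, the two accuracy figures) are his; the script and the names in it are mine:

```python
# Re-run the arithmetic in the quoted passage. Inputs are Schneier's;
# the code is only an illustrative sketch.

EVENTS_PER_YEAR = 1_000_000_000_000   # ~10 events per US resident per day
REAL_PLOTS_PER_YEAR = 10

def yearly_false_alarms(false_positive_rate):
    """Innocent events wrongly flagged per year at the given rate."""
    innocent_events = EVENTS_PER_YEAR - REAL_PLOTS_PER_YEAR
    return innocent_events * false_positive_rate

# "99 percent accurate" and the absurd "99.9999 percent accurate"
for fp_rate in (0.01, 0.000_001):
    per_year = yearly_false_alarms(fp_rate)
    print(f"false-positive rate {fp_rate:g}: "
          f"{per_year:,.0f} false alarms/year, "
          f"~{per_year / 365:,.0f}/day, "
          f"~{per_year / REAL_PLOTS_PER_YEAR:,.0f} per real plot")
```

The first line of output reproduces the 27 million investigations a day and the billion false alarms per real plot; the second lands within rounding of the 2,750-a-day figure, and that is before counting the real plots the tighter setting would start to miss.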

Bruce puts out a version of this essay at least once a year. Long may he continue to do so, because the people who need to take heed of it are just not doing so. Like another hobby horse of mine, the second law of thermodynamics, the base rate fallacy (and Bayes' theorem, which exposes it) should be a compulsory part of the school curriculum, one that also includes critical thinking more widely. The idea that it is not very bright to base a decision (or, more accurately, a probability judgement) on irrelevant information is not difficult to grasp. So why is it so difficult to apply in the context of complex information systems supposedly deployed to tackle complex security problems?
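For anyone meeting the base rate fallacy for the first time, the medical-test analogy in the quote is easy to work through with Bayes' theorem. The numbers below are illustrative ones of my own choosing, not from the essay:

```python
# Base rate fallacy with a hypothetical medical test: 99% sensitivity,
# 99% specificity, but the disease affects only 1 person in 10,000.
sensitivity = 0.99      # P(positive test | disease)
specificity = 0.99      # P(negative test | no disease)
prevalence = 0.0001     # base rate: 1 in 10,000

# Bayes' theorem:
#   P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive

print(f"P(disease | positive test) = {p_disease_given_positive:.2%}")
# ~0.98% -- a "99 percent accurate" test is wrong about 99 times out of
# 100 when the condition it looks for is this rare.
```

Swap "disease" for "terrorist plot" and the prevalence for something far smaller, and you have the whole argument in five lines of arithmetic.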
