Daniel Lemire's blog


Scam Spam, the death of email, and Machine Learning

Tim Bray has predicted the end of email as we know it:

I don’t know about you, but in recent weeks I’ve been hit with high volumes of spam promoting penny stocks. They are elaborately crafted and go through my spam defenses like a hot knife through butter. (…) This could be the straw that finally breaks the back of email as we know it, the kind that costs nothing to send and something to receive.

Yes, Tim, I’ve been bombarded with spam too, to the point that the fraction of non-spam email has dropped below 10% for the first time in years. Before you think I’m an extreme case, ask your local IT experts how much spam they are receiving. Currently, no spam filter can cope with the volume of spam I’m getting.

The only spam filter that does anything to help is Google Mail’s spam filter, but it still lets more spam through than legitimate email (if I exclude mailing lists).

What is really failing us here is not the Internet per se: it is rather trivial to think of a better way to design email protocols. What is failing us is the blunt application of Machine Learning to a real-world problem.

Many Machine Learning researchers would have you believe, mostly because they really believe it, that Bayesian classifiers or neural networks (add your favorite algorithm here) are ideally suited to solving most classification problems. That they can be tweaked to fit any particular problem. That, in some small way, we have strong AI at our door. But we don’t. The failure of spam filters is symbolic. There really is no free lunch as far as algorithms go.
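
Just to be concrete about what such a filter amounts to: the core of a Bayesian spam filter fits in a few lines of Python. The sketch below is illustrative only; the toy training data, the naive word-splitting, and the 0.5 smoothing constant are my own assumptions, not how any particular filter is implemented.

```python
from collections import Counter
import math

def train(spam_docs, ham_docs):
    # Count word occurrences in each class (toy tokenization: split on spaces).
    spam_counts = Counter(w for d in spam_docs for w in d.split())
    ham_counts = Counter(w for d in ham_docs for w in d.split())
    vocab = set(spam_counts) | set(ham_counts)
    return spam_counts, ham_counts, vocab

def spam_log_odds(message, spam_counts, ham_counts, vocab):
    # Sum log-likelihood ratios with add-0.5 smoothing; a score above 0
    # suggests spam, below 0 suggests legitimate email.
    total_spam = sum(spam_counts.values())
    total_ham = sum(ham_counts.values())
    score = 0.0
    for w in message.split():
        if w not in vocab:
            continue
        p_spam = (spam_counts[w] + 0.5) / (total_spam + 0.5 * len(vocab))
        p_ham = (ham_counts[w] + 0.5) / (total_ham + 0.5 * len(vocab))
        score += math.log(p_spam / p_ham)
    return score

# Hypothetical training data, purely for illustration.
spam_counts, ham_counts, vocab = train(
    ["buy penny stocks now", "hot stocks free money"],
    ["meeting notes attached", "lunch tomorrow ?"],
)
print(spam_log_odds("penny stocks are hot", spam_counts, ham_counts, vocab))
```

Writing this is easy; the hard part is that spammers adapt to whatever statistics the filter relies on, which is precisely why it is not a silver bullet.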

This is not to say that Machine Learning does not work. Recommender systems like those based on collaborative filtering or PageRank work. But in the real world, the best they can do is assist us. And how fancy your algorithm is does not change the equation.

The lesson here is that until we have strong AI, and that could still be a long way off, if it ever comes, we should collectively work on algorithms that assist us better rather than on algorithms meant to replace us.

For example, spam filters should work with the user to define what spam is. And I don’t mean having the user train the algorithm; I mean that the user should be allowed to change and add to the spam filter directly. Naturally, in practice, this is hard work, very hard work, and so it might be simpler and better to replace the email protocols instead.
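
To make concrete what I mean by letting the user change the filter, here is one way user-editable rules could sit on top of a statistical score. The rule format, the regular expressions, and the precedence policy are illustrative assumptions of mine, nothing more:

```python
import re

# Hypothetical user-editable rules: each rule is (action, pattern).
# "keep" means never spam, "drop" means always spam.
user_rules = [
    ("keep", re.compile(r"from:\s*\S+@myuniversity\.ca", re.IGNORECASE)),
    ("drop", re.compile(r"penny stocks?", re.IGNORECASE)),
]

def classify(message, statistical_score, threshold=0.0):
    # Explicit user rules are consulted first and always override the
    # learned model; the statistical score is only a fallback.
    for action, pattern in user_rules:
        if pattern.search(message):
            return "ham" if action == "keep" else "spam"
    return "spam" if statistical_score > threshold else "ham"

# A message the statistical filter scores as ham (-1.0) but a user rule catches.
print(classify("Unbelievable penny stock opportunity!", statistical_score=-1.0))
```

The point of such a design is that the rules belong to the user, readable and editable, while the learned score is merely a fallback.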

We have to move away from black box algorithms and embrace the fact that we lack strong AI. The intelligence is in your users, not in your software.