Sisyphus
For those who do not allow their mails being read by a large cloud provider, fighting junk mail is an endless task. And, many people developed various -- sometimes really clever -- techniques to mitigate it. Though, current spam filters are typically time-consuming to configure, consume immense resources (or deliberately do not use all available information), and are complex in maintenance.
This text is about Sisyphus, a novel filter that automatically learns all (!) your mails' content to effectively fight off junk, is easy to deploy (one binary), and just needs one line of config, the location of your mail directory.
Content is King
All we really care about is our mails' content! In particular, we would like to separate mail we want to read from mail we do not want to read. This is a tricky problem because such mails may indeed come from the same sender. For example, my favorite news site sends me lots of mails via newsletter. Some of those newsletters (even if marketing "spam") are of interest to me, others just fill my mail box and remain unread forever.
When doing "manual" junk mail filtering by eye, I noticed that I was scanning for particular words in the subject and content of a mail. In other words, I was reading each word of a mail and giving it a certain probability of being junk or good. The overall impression then yields a quick decision for a basket.
In economics, we teach a nice Theorem that formalizes this approach, i.e. Bayes' Theorem:
In plain English, this theorem tells us that the probability that a mail is junk, given its content, is equal the probability that its content appears in the junk basket, weighted by the chance to fall into either basket "accidentally" (without any preconditions).
Learn every (!!!) mail you get
Bayes' Theorem has previously been used in junk mail filters, for example by the legendary POPFile and many others. However, all these previous junk filters learn only from their errors and never from correctly classified mails. This is a waste of information!
For example, if the filter classified a mail as good because 30% of the words had an extremely high probability of landing in the good basket, traditional junk filters would not take the opportunity to learn the remaining 70% of the words as being likely good as well.
With Sisyphus however, I developed a junk filter that is capable of really learning all the content of all mails, all the time.
How to store 50000 unique mails in 10MB?
The major challenge was storing the content of thousands of unique mails in an efficient way. Interestingly, Bayes' Theorem helps a lot here as all the necessary ingredients (probabilities) can be derived from a combination of counters on words.
Second, even though I wanted Sisyphus to be as good as possible, I was willing to give up some of the precision for speed and a very small footprint.
Another very nice idea came to help: the HyperLogLog Algorithm developed by Philippe Flajolet and many others. This algorithm essentially allows to count all words in all mails in a very efficient way, allowing Sisyphus to store roughly 100'000 word counters from 50'000 unique mails in 10MB.
Note that this algorithm sometimes produces slightly wrong counting numbers. I am willing to accept those errors, as they typically will not alter the probabilities of a mail classification.
Who classifies mails?
As tastes are individual, there is no way to circumvent the human owner of a mailbox. However, I wanted to make Sisyphus not only easy to setup but also invisible to the users.
This is achieved by running directly on the users' Maildir directories on the server. Sisyphus reads arriving mails in the "new" directory, then calculates the probabilities and physically moves the file to the junk directory, if classified as such. Otherwise, it does not touch any mails at all.
If a user moves a mail from good to junk or vice versa, Sisyphus will learn about that in the next learning cycle and become more and more accurate over time.
Known issues
It would not be just to only mention the positive aspects of Sisyphus without mentioning some known issues.
First of all, it is not so easy to compute nice statistics of its successes or failures. Sisyphus is totally agnostic about whether it has already learned a mail because the HyperLogLog algorithm takes care of duplicates. And we are not really keeping track of wrongly classified mails, just to produce some nice statistics.
Second, related to the first issue, if the user is not "correcting" errors fast enough (faster than the configured learning cycle), i.e. if the user is not moving wrongly classified mails into the other folder in time, Sisyphus is learning a mail eventually for the good and the junk basket, respectively. This does not have any adverse effect on the quality, but slows down speed of learning. Proper tuning of the configuration helps mitigate this issue effectively.
Where to get Sisyphus?
Download pre-compiled binaries or the source code from Github. Any comments, issues, or patches are very welcome.