While visiting relatives for Christmas, I heard a pretty damning account from one of my cousins (who works for a company that develops spam filtering software) about the uselessness of recent Ph.D.s in this area. If I understand the issue correctly, there is a pretty big mismatch between typical machine learning / information retrieval models of the spam filtering problem (a relatively static corpus of spam and ham messages, from which one must learn to filter the spam with the best possible combination of precision and recall) and the actual behavior of spammers (who are actively engaged in seeking out holes in spam filtering software, blasting as much spam as possible through any hole they find until it is patched or the system learns to filter it, and then moving on to the next hole).

In connection with this I found a paper by Tom Fawcett that made very similar points, nearly a decade ago. But it's easy to find recent and highly-cited works that don't take Fawcett's lessons to heart.



The same holds for Information Retrieval: most of the results at SIGIR and other such Information Retrieval conferences have nothing to do with what could ever possibly be implemented in a 4-billion-queries-a-day search engine such as Google.


There is a generic rule saying that 90% of everything is bad and useless. I do not think that SIGIR or any other conference is significantly better or worse than anything else.


There are two camps of people: one says that a PhD in computer science is useless, and the other is trying to prove that it is not. The answer, of course, is "it depends on so many things": in particular, on the breadth of knowledge and the quality of expertise of these former PhD students, on their leadership qualities, and on whether it was an applied PhD or some esoteric theory.

I do not believe that a good PhD cannot handle the dynamic vs static problem. In fact, there are megatons of papers on this topic, so PhDs have an edge in processing this material. Even better if a PhD has good software engineering skills. These kinds of people are rather rare and earn well north of 100K.

It is also true that some companies benefit from PhDs more than others.


I do think adversarial data mining is a big thing that needs more research. Not just spam detection. This occurs anytime companies do data mining not fully in the interests of those who provide the data (i.e. practically every time). Spam detection is just the tip of the iceberg, and it is not surprising that a fresh PhD might not know the current tricks. But I bet they have the background and knowledge to develop the next set of methods.


There's an element here of companies whining that universities aren't doing their research for them at public expense. They do know that if they want more PhD theses that are commercially useful, they can sponsor graduate students, right?


Spam is a nasty business. Back when I was using my own filter, I actually stopped training it several years before I stopped using it. I discovered that my corpus of about 10k spam and 30k ham messages was sufficient. It was actually good enough that I re-used my pre-trained filter as part of some contract work (it was better than Yahoo's and Hotmail's standard spam filters at the time).

To the topic at hand: it's all about pattern recognition. It just so happens that the patterns being matched are pieces from headers + body + attachments, so they are arbitrarily easy to tokenize. The question is: what tokens?

According to an interview question I got from a fellow who worked on Gmail's spam filter (which would be over 4 1/2 years ago now), filtering in the real world is all about coming up with a "signature" for an email so that an individual email can be analyzed alongside other emails. One simple method is:

  1. Determine the 64 most common words in each language
  2. Detect what language an email was written in
  3. Construct a 64-bit number whose set bits correspond to the most common words used in that email
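
Here is a rough sketch of step 3 in Python, just to make the idea concrete. The word list is a made-up, truncated stand-in (a real one would hold the 64 most common words per language, taken from a corpus), the `signature` function name is my own, and language detection from step 2 is assumed to have happened elsewhere and is passed in as `lang`.

```python
import re

# Hypothetical, truncated word list for illustration only. A real table
# would hold the 64 most common words for each supported language.
COMMON_WORDS = {
    "en": ["the", "of", "and", "to", "a", "in", "is", "you"],
}

def signature(text, lang="en"):
    """Return an int whose bit i is set if COMMON_WORDS[lang][i] occurs in text."""
    words = COMMON_WORDS[lang][:64]          # never use more than 64 bits
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    sig = 0
    for i, word in enumerate(words):
        if word in tokens:
            sig |= 1 << i                    # set the bit for this common word
    return sig

print(bin(signature("The cat is in the hat")))
```

With the toy list above, only the bits for "the", "in" and "is" are set for that sentence. Two emails with the same or nearly the same bit pattern can then be grouped without comparing their full text.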

You can even take the obvious next step and compute 'ham' signatures and 'spam' signatures for emails in each language (using pre-screened corpora that are fine at 95+% correct). This can work very well individually, but I suspect that watching the ebb and flow of signatures over time at the mail provider level is another useful signal.
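
One way to use such labeled signatures, and this is my own guess rather than anything the Gmail fellow described, is to label a new email by its nearest known signature under Hamming distance (the number of differing bits). The helper names and the tiny example signatures below are hypothetical.

```python
def hamming(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")

def nearest_label(sig, labeled):
    """Return the label of the closest signature in `labeled`.

    `labeled` is a list of (signature, label) pairs, e.g. built from
    pre-screened ham and spam corpora.
    """
    return min(labeled, key=lambda pair: hamming(sig, pair[0]))[1]

# Made-up reference signatures, for illustration only.
known = [(0b001111, "ham"), (0b110000, "spam")]
print(nearest_label(0b000111, known))
print(nearest_label(0b100000, known))
```

A real system would presumably need many reference signatures per class and some threshold for "too far from everything to say", but the comparison itself stays this cheap.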

Thinking about this for the first time in a half-dozen years, I've got about a dozen ideas for things that might work, but I hope that someone else has already tried them out.


In what way are the Ph.D.s useless? Are people not able to design good spam filters, or is it that published research on this is not useful? It generally annoys me when companies complain about the "uselessness" of academic research but don't do much to facilitate better interactions between the two.