Spam's Plan For Defeating Bayesian Filtering

The first thing that sticks out in my mind about Ed Felten's article is that he begins by saying that Bayesian filters are "trained by the bad guys". That is true, but it is not the whole truth: Bayesian filters learn from both good ("ham") and spam e-mails. The key to Bayesian filtering's success is that everyone's e-mail is different. While the tokens that signify spam don't vary much between users, those that signify useful e-mail do. For example, they may include the names of a user's friends and family members, or technical terms related to a particular profession. To get around a customized Bayesian filter, a spammer must customize a message for every user, and by definition, spam isn't customized.
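To make the per-user point concrete, here is a minimal sketch of a toy naive Bayes filter. It is not the algorithm of any particular product; the class name `TinyBayesFilter`, the add-one smoothing, the whitespace tokenizer, and the two sample users are all assumptions made up for illustration. Both users see the same spam, but each trains on their own ham, so a spam padded with one user's "good" words only gets a discount from that user's filter.

```python
import math
from collections import Counter


class TinyBayesFilter:
    """Toy per-user naive Bayes spam filter (illustrative only)."""

    def __init__(self):
        self.spam_tokens = Counter()  # token -> number of spam messages containing it
        self.ham_tokens = Counter()   # token -> number of ham messages containing it
        self.spam_msgs = 0
        self.ham_msgs = 0

    @staticmethod
    def tokenize(text):
        # Crude tokenizer: lowercase, split on whitespace, one vote per message.
        return set(text.lower().split())

    def train(self, text, is_spam):
        tokens = self.tokenize(text)
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_msgs += 1
        else:
            self.ham_tokens.update(tokens)
            self.ham_msgs += 1

    def token_log_odds(self, token):
        # Add-one smoothing so unseen tokens do not produce log(0).
        p_spam = (self.spam_tokens[token] + 1) / (self.spam_msgs + 2)
        p_ham = (self.ham_tokens[token] + 1) / (self.ham_msgs + 2)
        return math.log(p_spam / p_ham)

    def spam_score(self, text):
        # Positive means "looks like spam", negative means "looks like ham".
        return sum(self.token_log_odds(t) for t in self.tokenize(text))


# Hypothetical users: same spam, different ham.
shared_spam = ["cheap pills online", "buy cheap pills now"]
alice_ham = ["yarn order for knitting club", "knitting pattern from grandma"]
bob_ham = ["kernel patch review tomorrow", "compiler bug in kernel build"]

alice, bob = TinyBayesFilter(), TinyBayesFilter()
for msg in shared_spam:
    alice.train(msg, is_spam=True)
    bob.train(msg, is_spam=True)
for msg in alice_ham:
    alice.train(msg, is_spam=False)
for msg in bob_ham:
    bob.train(msg, is_spam=False)

# A spam padded with words from Alice's mail, sent unchanged to both users.
padded = "cheap pills yarn knitting"
alice_score = alice.spam_score(padded)
bob_score = bob.spam_score(padded)
print(f"alice: {alice_score:.3f}  bob: {bob_score:.3f}")
```

In this toy setup the padding words pull the score down for Alice but are neutral for Bob, whose filter has never seen them in good mail. The spammer would need a different padding vocabulary for every recipient.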
The idea of poisoning a Bayesian filter doesn't work, simply because the filter will adjust its token ratings: it decreases the weight of tokens that occur in both spam and ham while increasing the weight of the tokens that differ. In essence, poisoning makes the filter "tighter" (a smaller set of differing tokens) but no less effective (as those tokens will be weighted all the more heavily).

If the spammers were really smart, they would monitor users' e-mails and build custom spams for each individual. That is not an easy task, given that spams are sent out in the millions. It becomes even more hazardous given the wrath it would provoke among indignant users, companies, and nations. Even if they could do it, spammers would still have to find a way to insert their HTML and/or URL in there without tripping the Bayesian filter.

I guess I'm saying that while it IS possible to fool Bayesian filtering, you can't do so in a practical manner, and especially not in the manner that spammers are currently trying to do it. Spammers are our enemy, and our enemy will show us where we're weak. But they have yet to convincingly beat any Bayesian filter I've used (keeping in mind that I get several hundred spams a day).
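The poisoning argument above can be sketched with a toy model. This is a rough illustration under stated assumptions, not a measurement of any real filter: the helper names (`train`, `log_odds`, `score`), the add-one smoothing, and the sample messages are all invented for the example. A spammer pads a message with words from the user's good mail; once the user reports a few of those padded messages and the filter retrains, the padding words drift toward neutral and stop helping.

```python
import math
from collections import Counter


def train(messages):
    """Return (token -> number of messages containing it, message count)."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts, len(messages)


def log_odds(token, spam, ham):
    """Smoothed log-odds that a token indicates spam rather than ham."""
    spam_counts, n_spam = spam
    ham_counts, n_ham = ham
    p_spam = (spam_counts[token] + 1) / (n_spam + 2)
    p_ham = (ham_counts[token] + 1) / (n_ham + 2)
    return math.log(p_spam / p_ham)


def score(text, spam, ham):
    """Sum of token log-odds; positive leans spam, negative leans ham."""
    return sum(log_odds(t, spam, ham) for t in set(text.lower().split()))


# Hypothetical corpora for one user.
ham_mail = [
    "meeting notes for the project",
    "project report due friday",
    "lunch meeting friday",
]
spam_mail = ["cheap pills online", "buy cheap pills now", "cheap watches online"]

# The poisoning attempt: spam padded with words from the user's good mail.
padded_spam = "cheap pills meeting report"

ham = train(ham_mail)
before = train(spam_mail)                      # filter has not seen padded spam yet
after = train(spam_mail + [padded_spam] * 3)   # user reported three padded copies

print("padded spam, before retraining:", round(score(padded_spam, before, ham), 3))
print("padded spam, after retraining: ", round(score(padded_spam, after, ham), 3))
print("'meeting' weight before/after: ",
      round(log_odds("meeting", before, ham), 3),
      round(log_odds("meeting", after, ham), 3))
print("real ham after retraining:     ",
      round(score("project report due friday", after, ham), 3))
```

In this sketch the padded spam scores higher after retraining than before, the shared word "meeting" ends up with a weight closer to zero, and a genuine message still lands on the ham side because words like "project" and "friday" never show up in spam.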