Bayes spam filters (Linus Torvalds)


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: [OT] Confirmation Spam Blocking was: List 'linux-dvb' closed to
Original-Message-ID: <Pine.LNX.4.58.0401221441500.2998@home.osdl.org>
Date: Thu, 22 Jan 2004 23:00:26 GMT
Message-ID: <fa.j5ccs59.1e2om0n@ifi.uio.no>

On Thu, 22 Jan 2004, jw schultz wrote:
>
> Bayes is the wrong approach for those random words from the
> dictionary blocks.

Bayes is not wrong per se, but doing Bayes on pure word statistics is
wrong. It always was. People knew how it could be broken. The current rash
of spams is just the obvious way to do it.

> Those I've seen seem to be a long string of words all longer
> than 4 characters.  A rule that gave a score based on the
> number of consecutive words longer than some number of
> characters would catch those fairly easily.  If I get
> annoyed enough I may figure out how to write such a rule.

Don't. That's easily broken too, as you realized yourself.

> What we need is a bounty on these scum.  $1000 fine per
> reported recipient with half going to the reporter would be
> nice.

What you should aim for, and which should be much harder to break, is to
realize that random words that make no sense give a really unlikely
score when you build up a Markov chain of them.

So to avoid the random words problem, do Bayes on the _chain_ of words
instead.
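
A rough sketch of why the chain helps, not code from this thread. It
trains a first-order chain on a tiny made-up ham corpus and scores a
message by its average log-probability per word-to-word transition. The
names (train_chain, chain_score), the smoothing constant alpha and the
sample sentences are all assumptions made for illustration. A real
filter would presumably train one chain on spam and one on ham and
compare the two scores.

```python
import math
from collections import Counter

def train_chain(messages):
    """Count adjacent word pairs (a first-order chain) over a list of messages."""
    follows = Counter()  # (prev, cur) -> times cur directly followed prev
    starts = Counter()   # prev -> times prev had any successor
    vocab = set()
    for text in messages:
        tokens = text.lower().split()
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            follows[(prev, cur)] += 1
            starts[prev] += 1
    return follows, starts, len(vocab)

def chain_score(text, model, alpha=0.1):
    """Average log P(cur | prev) over the message, with add-alpha smoothing."""
    follows, starts, vocab_size = model
    tokens = text.lower().split()
    total, steps = 0.0, 0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (follows[(prev, cur)] + alpha) / (starts[prev] + alpha * vocab_size)
        total += math.log(p)
        steps += 1
    return total / steps if steps else 0.0

# Hypothetical ham corpus, for illustration only.
ham = [
    "please review the patch for the scheduler",
    "the patch fixes the scheduler bug",
    "i will review the scheduler patch tomorrow",
]
model = train_chain(ham)

coherent = "please review the scheduler patch"
salad = "scheduler please tomorrow the review"  # same kind of words, no sensible order

print(chain_score(coherent, model))
print(chain_score(salad, model))
```

Every word in the scrambled message also appears in the ham corpus, so
a pure word-count filter has nothing to object to. The chain score
drops anyway, because none of its transitions were ever seen.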

Now, you can try to overcome this by spamming with something that makes
"sense" from the Markov chain standpoint, but by then that spam is going
to be hilarious. Once I start getting spams that are generated by Markov
generators and read like "real" email, I might stop filtering them, just
because they are bound to be a lot of fun to read.

Have you played with Markov chains? What happens is that you don't just
build up a list of words and their likelihood of being spam or ham, you
build up a list of word _combinations_ and the likelihood of one
particular word following another one.

That's how a lot of the "random phrase" generators on the web work.

They can be absolutely hilarious, exactly because the sentences they
generate actually _almost_ make sense. Sometimes you get an almost
readable story, but one that reads like somebody having a bad trip and his
reality just shifted 90 degrees. (Usually the best stories come if the
training material is coherent, which email sadly usually isn't).
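
For anyone who hasn't played with one, a minimal sketch of such a
generator follows. It is an illustration under assumptions, not the
code behind any particular generator: build_chain and generate are
made-up names, and it looks at only one previous word, where real
generators often condition on more than one.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that followed it in the training text."""
    chain = defaultdict(list)
    tokens = text.split()
    for prev, cur in zip(tokens, tokens[1:]):
        chain[prev].append(cur)  # duplicates keep the observed frequencies
    return chain

def generate(chain, start, max_words=20, seed=None):
    """Walk the chain from `start`, picking each next word at random."""
    rng = random.Random(seed)
    word = start
    out = [word]
    for _ in range(max_words - 1):
        followers = chain.get(word)
        if not followers:
            break  # dead end: this word was never followed by anything
        word = rng.choice(followers)
        out.append(word)
    return " ".join(out)

# Hypothetical training text, for illustration only.
training = (
    "the kernel builds the module and the module loads the driver "
    "and the driver talks to the kernel"
)
chain = build_chain(training)
print(generate(chain, "the", max_words=12, seed=1))
```

Each adjacent pair in the output occurred somewhere in the training
text, which is why the result reads as locally plausible even when the
sentence as a whole goes nowhere.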

Do a google search for "Mark V Shaney", and you should get some idea
about this.

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: [OT] Confirmation Spam Blocking was: List 'linux-dvb' closed to
Original-Message-ID: <Pine.LNX.4.58.0401221511210.2998@home.osdl.org>
Date: Thu, 22 Jan 2004 23:18:23 GMT
Message-ID: <fa.j3sgrd9.1dion8n@ifi.uio.no>

On Thu, 22 Jan 2004, Linus Torvalds wrote:
>
> Do a google search for "Mark V Shaney", and you should get some idea
> about this.

Oh, damn. I shouldn't have reminded myself. It's been so long since I did
this, that I had forgotten all about it, and I'm just happy that I'm
working from home, because if I'd been in an office, they'd have come to
take me away already, I was laughing so hysterically.

Some twisted soul put Mark V Shaney to work on a combination of Bible and
UNIX newsgroups. My favourite so far:

	..

	This is supported by Jesus's use of low cost eight bit micros and
	small amounts of RAM.  When you find salvation.

	..

	If God truly loves humankind then why does He create sinners?  If
	human is His creation, then who is the ultimate in all shells?

	I know at one point Jesus said "no one may come to grips with the
	cpio header blown away".

	It speaks of the original ftpd.

	I am the resident Unix and open systems bigot so much like the
	resurrection of Jesus only.

	..

	...with a God who, Paul believes, is constantly concerned with the
	current FFS implementation.

	Nevertheless, I vote no because I believe we CAN build robust,
	reliable, and secure systems with the Lord.

	Mark V. Shaney

Hey, spammers, please start sending me those emails.

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: [OT] Confirmation Spam Blocking was: List 'linux-dvb' closed to
Original-Message-ID: <Pine.LNX.4.58.0401241307490.10144@home.osdl.org>
Date: Sat, 24 Jan 2004 21:14:30 GMT
Message-ID: <fa.ieg5fjm.1j0c8ic@ifi.uio.no>

On Sat, 24 Jan 2004, Kevin O'Connor wrote:
>
> A good Bayesian spam filter isn't nearly as susceptible to random words as
> some people think.  Words that are likely to be spam (along with words that
> are frequently "ham") are given _exponentially_ more weight than other
> words.

Especially if the "random words" in the spam end up being weighted by real
frequency, you just _cannot_ use single-word Bayes filters on it. Or if
you do, you'll eventually have those words either being neutral, or (worst
of all cases) you'll have real mail be marked as spam after having
aggressively trained the filter for the spams.
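
A toy illustration of the dilution, under stated assumptions: a plain
single-word naive Bayes score with add-one smoothing, made-up training
messages, and made-up names (train, log_odds_spam). Real filters weight
and select words differently, so treat this as a sketch of the failure
mode and not as a model of any particular filter.

```python
import math
from collections import Counter

def train(messages):
    """Count word occurrences over a list of messages."""
    counts = Counter()
    for text in messages:
        counts.update(text.lower().split())
    return counts

def log_odds_spam(text, spam_counts, ham_counts, alpha=1.0):
    """Sum of per-word log(P(word|spam) / P(word|ham)); positive leans spam."""
    vocab = len(set(spam_counts) | set(ham_counts)) + 1  # +1 for unseen words
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    score = 0.0
    for word in text.lower().split():
        p_spam = (spam_counts[word] + alpha) / (spam_total + alpha * vocab)
        p_ham = (ham_counts[word] + alpha) / (ham_total + alpha * vocab)
        score += math.log(p_spam / p_ham)
    return score

# Hypothetical training data, for illustration only.
spam_counts = train(["buy cheap pills now", "cheap pills online buy now"])
ham_counts = train([
    "the patch fixes the scheduler",
    "please review the kernel patch",
    "the kernel scheduler patch is ready",
])

pitch = "buy cheap pills"
padded = pitch + " the kernel patch scheduler review"  # ham-like padding

print(log_odds_spam(pitch, spam_counts, ham_counts))
print(log_odds_spam(padded, spam_counts, ham_counts))
```

The padding drags the score toward ham without changing the pitch. If
the padded message is then fed back as spam training data, the counts
for words like "kernel" and "patch" go up on the spam side, which is
how real mail ends up neutral or flagged.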

It might not be that big of a deal especially if you have a fairly narrow
scope of emails in your ham-list, but people who get mail from varied
sources _will_ get screwed by this, one way or the other.

Of course, the spam filters will catch on to other things. I find that the
DNS lookups take care of most of it, to the point where the other rules
don't even much matter.

		Linus
