At 6:01 PM +0000 2/16/04, Matt Schneider wrote:
At 12:28 PM 2/16/2004 -0500, you wrote:
So, I guess a sliding window would catch filter-busting headers and
trailers.
No, they add a bunch of garbage right in the body of the spams too, fake
HTML tags or text that's the same color as the background.
There's no real way to avoid this stuff.
Those are both really quite easy to catch, and can even be caught by
automatic learning filters. For example, the word 'oblivity' inside angle
brackets (i.e. a bogus HTML tag) occurs nowhere at all in any of my
legitimate mail of the past year. It occurs 6 times in my spam of 2004. A
filter that checks for strict HTML compliance in HTML mail would have
caught all of those, and I see in my current set of Bayesian classifiers
that this 'word' (complete with <>) is part of why the later spams
containing it were marked as probable spam. Similarly, text that is the
same color as the background is a programmatically detectable trick, and
there are already filters in use that detect it as spamsign.
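
To make both checks concrete, here is a minimal sketch in Python using the
standard library's html.parser. It is not the code of any filter mentioned
above. The KNOWN_TAGS whitelist is deliberately incomplete, the default white
background is an assumption, and the names SpamSignParser and scan are
hypothetical. A real filter would need a full tag list and proper CSS
handling.

```python
import re
from html.parser import HTMLParser

# Assumption: a small, incomplete whitelist of legitimate HTML tag names.
KNOWN_TAGS = {
    "html", "head", "title", "meta", "body", "p", "br", "a", "b", "i", "u",
    "font", "span", "div", "table", "tr", "td", "th", "img", "center",
    "h1", "h2", "h3", "ul", "ol", "li", "hr", "strong", "em",
}
# Tags whose text color we track (a simplification).
COLOR_TAGS = {"font", "span", "p", "div", "td"}
NAMED_COLORS = {"white": "#ffffff", "black": "#000000"}


def normalize_color(value):
    """Lowercase a color, map a few names, and expand #abc to #aabbcc."""
    v = (value or "").strip().lower()
    v = NAMED_COLORS.get(v, v)
    if re.fullmatch(r"#[0-9a-f]{3}", v):
        v = "#" + "".join(ch * 2 for ch in v[1:])
    return v


class SpamSignParser(HTMLParser):
    """Collects unknown tag names and text colored like the background."""

    def __init__(self, background="#ffffff"):  # assumed default background
        super().__init__(convert_charrefs=True)
        self.background = normalize_color(background)
        self.bogus_tags = []
        self.hidden_text = []
        self._colors = []  # stack of (tag, effective text color or None)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag not in KNOWN_TAGS:
            self.bogus_tags.append(tag)  # e.g. <oblivity>
        if tag == "body" and attrs.get("bgcolor"):
            self.background = normalize_color(attrs["bgcolor"])
        if tag in COLOR_TAGS:
            color = attrs.get("color")
            # Match "color:" in a style attribute, but not "background-color:".
            m = re.search(r"(?<![-\w])color\s*:\s*([^;]+)", attrs.get("style") or "")
            if m:
                color = m.group(1)
            inherited = self._colors[-1][1] if self._colors else None
            effective = normalize_color(color) if color else inherited
            self._colors.append((tag, effective))

    def handle_endtag(self, tag):
        if self._colors and self._colors[-1][0] == tag:
            self._colors.pop()

    def handle_data(self, data):
        current = self._colors[-1][1] if self._colors else None
        if data.strip() and current == self.background:
            self.hidden_text.append(data.strip())


def scan(html):
    """Return (bogus_tags, hidden_text) for one HTML message body."""
    parser = SpamSignParser()
    parser.feed(html)
    parser.close()
    return parser.bogus_tags, parser.hidden_text
```

Calling scan() on a message body returns the list of unrecognized tag names
and the list of text runs whose color matches the background; either list
being non-empty could be scored as spamsign.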
Yes, you can look for bad HTML tags and colored text, but there will
always be a way to sneak garbage in no matter what you do.