RE: [Asrg] Re: 2a. Analysis - Spam filled with words




> -----Original Message-----
> From: Terry Sullivan [mailto:terry@pantos.org] 
> Sent: Wednesday, September 10, 2003 12:51 PM
> To: asrg@ietf.org
> Subject: [Asrg] Re: 2a. Analysis - Spam filled with words
> 
> 
> On Tue, 9 Sep 2003 13:45:51 -0400,
> "Hector Santos" <winserver.support@winserver.com>
> 
> > These are called "tag injections."  It's been around for
> > a while in HTML email.
> 
> Without meaning to seem disagreeable, I must disagree.  Hector's 
> point that so-called tag injections/obfuscating comments have been 
> around a while is well taken.  And he's quite correct that messages 
> where the text is broken up or otherwise obfuscated are indeed 
> intended to bypass simple keyword filters.
> 
> But in these "new" emails, exactly the opposite is true.  The text is 
> *not* broken up; on the contrary, it's perfectly intact, but "hidden" 
> from *human* readers.  
> 
> The thing that makes these "new" messages different is precisely the 
> fact that they do *not* contain the nonsense words/random characters 
> typical of obfuscating comments.  Instead, they contain literally 
> dozens of "high-end" *content-rich* words, deliberately left intact.  
> That's the "tell" (a poker term) that these messages are probably 
> designed to confuse statistical language classifiers.  (Again, they 
> don't work, won't work--and ultimately *can't* work, for reasons that 
> are interesting only to people like me.)  Admittedly based on a 
> manual "training" run, the Bayesian component of my statistical 
> filter started "catching" these after seeing just two of them.


I don't think we can simply say that they don't work. Suppose the spammer
knows which words score as non-spammy and inserts them into the message.
That can shift the statistical score significantly. Filters are countering
this by detecting the invisible text and excluding it from the calculation.
But when a filter lacks that capability, or when the words are placed in
plain view in the body of the message, they can offset the spammy words.
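
To make the point concrete, here is a rough sketch in Python. It is not any
particular filter's algorithm: the per-token probabilities, the neutral
value for unknown tokens, and the function names are all invented for
illustration, and the tokens are combined by a plain naive-Bayes sum of
log-odds with equal priors.

```python
import math

# Hypothetical P(spam | token) values, invented for illustration only.
TOKEN_SPAM_PROB = {
    "viagra": 0.99, "free": 0.90, "click": 0.85,
    "meeting": 0.05, "agenda": 0.04, "thanks": 0.10, "attached": 0.08,
}
UNKNOWN_PROB = 0.5  # assumption: unseen tokens are treated as neutral


def spam_score(tokens):
    """Combine per-token probabilities by summing log-odds (naive Bayes)."""
    log_odds = 0.0
    for token in tokens:
        p = TOKEN_SPAM_PROB.get(token, UNKNOWN_PROB)
        log_odds += math.log(p) - math.log(1.0 - p)
    return 1.0 / (1.0 + math.exp(-log_odds))


def spam_score_visible(tokens, hidden):
    """Score only the tokens a human reader would actually see."""
    return spam_score([t for t in tokens if t not in hidden])


spam = ["viagra", "free", "click"]
padding = ["meeting", "agenda", "thanks", "attached"]

plain = spam_score(spam)                               # spammy words only
padded = spam_score(spam + padding)                    # padding counted
filtered = spam_score_visible(spam + padding, set(padding))  # padding ignored
```

With these made-up numbers the padded message falls below 0.5 while the
unpadded one scores close to 1. Dropping the hidden tokens restores the
original score, but that only helps when the padding really is hidden.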

Beyond that, spammers keep moving their vocabulary closer to that of
legitimate mail, whether by: a) tuning against the output of statistical
filters, b) writing spam that resembles personal notes or c) writing spam
that resembles business dialogue. One question is how long we will keep the
benefit of such a large distinction between the content of spam and
non-spam. At what point along the convergence of these vocabularies do the
effectiveness and accuracy of Bayesian filters start to suffer?

There are two different exercises here:
1. A measurement study of the vocabulary space of actual spam mail and
non-spam mail and the change in these spaces over time.
2. An analysis of how the effectiveness and accuracy of Bayesian filters
would be affected given certain measures of distinction between the two
vocabularies. There is probably some existing work here that gets us close
to an answer.
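
As a starting point for the first exercise, one possible measure of
distinction is the Jensen-Shannon divergence between the word-frequency
distributions of the two corpora. That choice is my assumption, not an
established metric for this problem, and the toy corpora and function names
below are made up. With log base 2 the value runs from 0 (identical
vocabularies) to 1 (no words in common).

```python
import math
from collections import Counter


def word_distribution(docs):
    """Relative frequency of each whitespace-separated word in a corpus."""
    counts = Counter(w for doc in docs for w in doc.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two distributions."""
    vocab = set(p) | set(q)
    # Mixture distribution; nonzero wherever p or q is nonzero.
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl(a):
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Toy corpora, invented for illustration only.
ham = ["please review the attached agenda before the meeting",
       "thanks for the report"]
spam_early = ["buy cheap pills now", "free pills click now"]
spam_late = ["please review the attached offer for cheap pills",
             "thanks click for the free report"]

ham_dist = word_distribution(ham)
early = js_divergence(ham_dist, word_distribution(spam_early))
late = js_divergence(ham_dist, word_distribution(spam_late))
```

Here the early spam shares no words with the legitimate mail, so its
divergence is at the maximum, and the later spam that borrows legitimate
vocabulary scores lower. Tracking such a number over dated samples of real
mail would be the measurement study. The second exercise would then relate
it to the error rates of a filter trained on the same mail.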

_______________________________________________
Asrg mailing list
Asrg@ietf.org
https://www1.ietf.org/mailman/listinfo/asrg