RE: [Asrg] Re: 2a. Analysis - Spam filled with words
> > The thing that makes these "new" messages different is precisely the
> > fact that they do *not* contain the nonsense words/random characters
> > typical of obfuscating comments. Instead, they contain literally
> > dozens of "high-end" *content-rich* words, deliberately left intact.
> > That's the "tell" (a poker term) that these messages are probably
> > designed to confuse statistical language classifiers. (Again, they
> > don't work, won't work--and ultimately *can't* work, for reasons that
> > are interesting only to people like me.) Admittedly based on a
> > manual "training" run, the Bayesian component of my statistical
> > filter started "catching" these after seeing just two of them.
> I don't know that we can simply say that they don't work. For
> example, assume that the spammer knows the non-spammy words and
> inserts those into the message. This can potentially significantly
> affect the statistical scoring. Of course filters are combating
> this by identifying this invisible text and ignoring it in the
> calculation. However in the absence of such intelligence or when
> those words are placed plainly in the text of the message, they
> have the ability to compensate for the spammy words.
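To make the concern in the quoted paragraph concrete, here is a minimal
sketch of how plainly visible "non-spammy" words can pull down a combined
score. It assumes a naive-Bayes-style filter that sums per-word log-odds;
the word list, the probabilities and the function name spam_score are all
made up for illustration and are not taken from any real filter.

```python
import math

# Hypothetical per-word values of P(spam | word), as a trained filter
# might store them. The numbers are invented for this illustration.
WORD_SPAM_PROB = {
    "viagra": 0.99, "free": 0.90, "offer": 0.85,
    "meeting": 0.05, "protocol": 0.03, "draft": 0.04, "agenda": 0.05,
}

def spam_score(words, unknown=0.5):
    """Combine per-word probabilities by summing their log-odds.

    Words the filter has never seen get a neutral probability, so they
    contribute nothing to the sum.
    """
    log_odds = 0.0
    for w in words:
        p = WORD_SPAM_PROB.get(w, unknown)
        log_odds += math.log(p) - math.log(1.0 - p)
    # Convert the summed log-odds back into a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))

plain = ["viagra", "free", "offer"]
# The same message padded with words that look legitimate to this filter.
padded = plain + ["meeting", "protocol", "draft", "agenda"]

print(spam_score(plain))   # close to 1
print(spam_score(padded))  # well below 0.5
```

With these invented numbers, four "hammy" words are enough to outweigh
three strongly spammy ones, which is the compensation effect described
above. A real filter would weight and select words differently.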
Can I suggest a subtly different approach? Rather than trying to
characterise spam, why not try and characterise your legitimate
messages and see if incoming messages match that statistical
profile?
My reasoning is based on the fact that the profile of spam
undergoes sudden shifts as spammers switch to using new tactics
each time their old ones become less effective, whereas, in my
case anyway, the profile of the legitimate mail I receive is
much more stable.
Bayesian classification systems have to undergo training in order
to learn what spam and "ham" look like. But because "spam" keeps
changing, re-training is needed over time. As time passes, the
class of spam will grow and become less clearly defined because
the range of tactics used by spammers seems to increase. As the
definition of "spam" becomes fuzzier, does the accuracy of
filtering decrease?
I'm particularly thinking about false positives here: given a
growing, varied "spam" class and a more static "ham" class,
would it not make sense to match against the more stable
message profile?
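As a rough illustration of what matching against a "ham" profile might
look like, here is a minimal sketch. It assumes the profile is nothing
more than word counts from one user's legitimate mail, and it scores an
incoming message by its average per-word log-probability under that
profile with add-one smoothing. The function names build_profile and
ham_likeness and the sample messages are hypothetical, and choosing a
cut-off for the score is left open.

```python
import math
from collections import Counter

def build_profile(ham_messages):
    """Count word occurrences across one user's legitimate mail."""
    counts = Counter()
    for msg in ham_messages:
        counts.update(msg.lower().split())
    return counts

def ham_likeness(profile, message):
    """Average per-word log-probability of a message under the profile.

    Higher (closer to zero) means the message looks more like this
    user's legitimate mail. Add-one smoothing keeps unseen words from
    producing log(0).
    """
    words = message.lower().split()
    if not words:
        return 0.0
    total = sum(profile.values())
    vocab = len(profile) + 1  # one extra slot shared by all unseen words
    logp = sum(math.log((profile[w] + 1) / (total + vocab)) for w in words)
    return logp / len(words)

# Invented sample of one user's legitimate mail.
ham = [
    "draft agenda for the working group meeting",
    "comments on the protocol draft",
    "meeting notes and agenda attached",
]
profile = build_profile(ham)

familiar = ham_likeness(profile, "agenda for the protocol meeting")
unfamiliar = ham_likeness(profile, "cheap pills free offer now")
print(familiar > unfamiliar)
```

Only legitimate mail is needed to train this, so the model does not have
to chase changes in spam tactics. Whether a single word-frequency metric
like this is discriminating enough in practice is exactly the kind of
question I would want expert input on.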
Obviously each person's profile of received mail would be
different but that approach has the distinct advantage that it
makes it harder for a spammer to know what to put into their
messages to bypass filters because each person's profile of
allowed mail would be statistically unique to them.
This does of course require a smart enough feedback mechanism
(probably integrated with the receiver's MUA) to train the
filtering mechanism. But many of the proposals we have seen
would require changes at the MUA level, so it's no worse than
any of those.
Even if such an idea isn't viable, I would still be interested to
see how the statistical profile of spam differs from that of
legitimate e-mail for a range of users and whether certain types
of metric are more reliable predictors of spam than others.
I have a few ideas for statistical spam characteristics but I must
admit that I lack the in-depth background in statistics to know if
any of them would work in practice. Some expert input here would
be welcome.
Thanks
Andrew
_______________________________________________
Asrg mailing list
Asrg@ietf.org
https://www1.ietf.org/mailman/listinfo/asrg