On Mon, 10 Mar 2003 18:39:57 EST, Kee Hinckley said:
> I currently have a sample database 22,000 confirmed spam messages
> sent to roughly 200 real email accounts.
>
> 40% blocked by the country restriction.
> 4% blocked due to obvious viruses.
> 14% blocked due to system blacklist.
> <1% blocked by user blacklists.
>
> There's less than three percent overlap between those factors. The

Actually, there's a hidden assumption here that means that there's a
lot MORE than 3% overlap. Your 14% system blacklist refers to a
blacklist that was tailored thinking "and this list doesn't include
anything from .XY because we country-restrict them already". What's
*really* there is a system blacklist that accounts for 54% of catches,
where 70% of the rules are country-based and the other 30% are rules
to catch stuff the country rules don't...

Pick a country .XY and analyze it carefully - it's fairly likely that
if you didn't filter the country, you'd blacklist 3-4 spamhauses that
are 95% of the problem in that country. The important question of
course becomes whether or not the *rest* of that country's population
will start using e-mail enough to increase the risk of false positives
and skew your stats... ;)

> rest are blocked solely on problems we saw with the headers. There's
> certainly overlap between that and the other factors, but we don't
> currently log it specifically, so I don't know how much.

It would be interesting and informative to have some other numbers.
What percent of mail was tagged with the country restriction but *NOT*
tagged as spam by users? (For instance, it would be quite easy to flag
all mail from .CN as spam - and although my users would probably tag
back 100% of the spam from .CN, they'd not tag 100% of the mail from
.CN, as many have relatives there.) The fact that 40% of spam fails
the country test is not at all a reliable predictor unless there is a
near-zero rate of non-spam that fails the country test.
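To put rough numbers on that last point, here is a minimal sketch in Python. Only the 40% figure comes from the stats quoted above; the share of all mail that is spam and the rates at which non-spam fails the country test are made-up inputs, since nobody has posted them. The function name is mine, purely for illustration.

```python
def spam_given_flag(p_spam, p_flag_given_spam, p_flag_given_ham):
    """Bayes' rule: probability that a message failing the test is spam.

    p_spam            -- fraction of all mail that is spam (assumed)
    p_flag_given_spam -- fraction of spam that fails the test
    p_flag_given_ham  -- fraction of non-spam that fails the test (assumed)
    """
    flagged_spam = p_spam * p_flag_given_spam
    flagged_ham = (1.0 - p_spam) * p_flag_given_ham
    total = flagged_spam + flagged_ham
    if total == 0:
        raise ValueError("nothing fails the test, so there is nothing to predict")
    return flagged_spam / total


if __name__ == "__main__":
    P_SPAM = 0.5        # assumption: half of all mail is spam
    CAUGHT = 0.40       # from the quoted stats: 40% of spam fails the country test
    # Hypothetical non-spam failure rates, from "near zero" up to "same as spam".
    for ham_rate in (0.0, 0.01, 0.10, 0.40):
        p = spam_given_flag(P_SPAM, CAUGHT, ham_rate)
        print(f"non-spam failing test: {ham_rate:5.0%}  ->  P(spam | fails) = {p:.2f}")
```

If non-spam fails the country test as often as spam does, a failed test says nothing beyond the base rate you started with, which is the whole point of asking for the non-spam number.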

Is the "user blacklist" number the percentage caught by pre-established
user filters, or is that saying that your other checks were 99%
effective in identifying spam and only 1% got through to users for
them to report? Do you have any guesstimates of how much *unreported*
spam got through to the 200 accounts?

Or to turn up the satire, and point out the problem with the analysis:

    40% of spammers drank milk at breakfast the day they spammed

I saw an amusing statistic once that 99.97% of all felonies are
committed while breathing air.... ;)