
RE: [Asrg] 2.a.1 Analysis of Actual Spam Data - Experimental Design



I rejected only those addresses that did not exist in my address database of
users and honeypot addresses.
If the address did not exist, the error is "550 user unknown". I can't go into
any more detail as to what happens to senders that trigger more than x number
of 550s or how that data is compiled and applied.
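
A minimal sketch of the connection-time address check described above, assuming a simple in-memory set of valid addresses. The names `KNOWN_ADDRESSES` and `rcpt_reply` and the example addresses are hypothetical, and the handling of senders that accumulate many 550s is left out because it is not described in this message.

```python
# Hypothetical address database: real users plus honeypot addresses.
# Every address here is made up for illustration.
KNOWN_ADDRESSES = {
    "alice@example.com",      # a real user
    "honeypot1@example.com",  # a honeypot address
}


def rcpt_reply(recipient: str) -> str:
    """Return the SMTP reply line for one RCPT TO address.

    Addresses are lower-cased before lookup, which is a simplification:
    strictly speaking the local part may be case-sensitive.
    """
    if recipient.strip().lower() in KNOWN_ADDRESSES:
        return "250 OK"
    # Unknown recipient: reject during the SMTP dialogue.
    return "550 user unknown"
```

Calling `rcpt_reply("nobody@example.com")` yields the 550 reply, so mail for a non-existent address is refused before any message body is accepted.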

Regards, 
Damon Sauer 



-----Original Message-----
From: Yakov Shafranovich [mailto:research@solidmatrix.com]
Sent: Monday, August 18, 2003 3:47 PM
To: Sauer, Damon; 'Tom Thomson'; 'Terry Sullivan'; 'asrg@ietf.org'
Subject: RE: [Asrg] 2.a.1 Analysis of Actual Spam Data - Experimental Design


Can you provide more detailed information about what kind of blocking you
did? Did you reject all incoming emails with 550 and send a challenge
message?

At 02:32 PM 8/18/2003, Sauer, Damon wrote:
>Just an FYI - My ENTIRE mail volume dropped from an average of 50M a month
>to 20M a month, with the same percentage being blocked at the content
>filters, after implementing address checking (550 given) at the connection.
>
>
>
>Regards,
>Damon Sauer
>
>
>
>-----Original Message-----
>From: Tom Thomson [mailto:tthomson@neosinteractive.com]
>Sent: Monday, August 18, 2003 10:52 AM
>To: Terry Sullivan; asrg@ietf.org
>Subject: RE: [Asrg] 2.a.1 Analysis of Actual Spam Data - Experimental Design
>
>
> > I feel pretty confident that one box can respond to requests sent to
> > multiple IP addresses, and therefore can serve as home to an
> > arbitrarily large number of different domains.  If these email
> > addresses "live" on 60 different machines, then there will be an
> > additional mechanical step of "synching" the data from each machine.
> > Then too, keeping one machine up for the experimental period strikes
> > me as less "overhead" than keeping 60 machines going.  That I can see,
> > using multiple boxes only serves as a potential confound, because
> > server availability affects spam volume in a systematic way. If one
> > machine (or worse, two) go(es) "hard down" for a week or two, the
> > results of the larger experiment are placed at risk.  Addresses
> > served by that/those machine(s) will have a lower spam volume, of
> > course, but not because of the independent variable.  But, as I said,
> > this feature qualifies as nice-to-have, but not required.
>
>I think you misread what I was proposing.  Comparisons have to be between
>mail addresses which are identical except for their 550 behaviour when
>determining whether 550 behaviour affects spam volume.  So you don't compare
>mailboxes on two different machines, or in two different TLDs.  What I am
>saying in effect is that the experiment needs to be carried out in a number
>of TLDs, since it may deliver different conclusions in different TLDs.
>
> > However, to the extent that there is some reasonable basis for
> > believing that spammers respond differentially to 550s from different
> > TLDs, then that imposes an additional requirement: keep the number of
> > TLDs small (say, 3: .com/.org/.net), or use a LOT more addresses.
>
>If the one-TLD experiment uses 60 pairs of addresses, then a multi-TLD
>experiment must use 60 pairs for each TLD.  Simple as that. Going to even a
>small number of TLDs (e.g. 3 TLDs) while keeping the original number of
>addresses as you suggest is going to be a disaster if the TLD does have
>some effect, as it reduces the amount of data which can tell you about the
>effects of the 550 responses where they are the only independent variable by
>a factor of three. It would be helpful not to restrict the TLDs to those
>where English is the prime language, as in the three you list.  Maybe use
>.com, .uk, .fr, .de (plus .org and .net maybe).
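
The "factor of three" argument above can be made concrete with a small sketch. It assumes the address pairs are independent and have equal variance, so that the standard error of a within-TLD mean difference scales as 1/sqrt(n). The function names are hypothetical and the figures are the ones from the paragraph above (60 pairs, 3 TLDs).

```python
import math


def pairs_per_tld(total_pairs: int, n_tlds: int) -> int:
    """Pairs available inside each TLD when a fixed total is split evenly."""
    return total_pairs // n_tlds


def se_inflation(total_pairs: int, n_tlds: int) -> float:
    """Factor by which the standard error of a within-TLD mean difference
    grows after the split, assuming independent pairs of equal variance
    (standard error proportional to 1/sqrt(n))."""
    return math.sqrt(total_pairs / pairs_per_tld(total_pairs, n_tlds))


# Splitting 60 pairs over 3 TLDs leaves 20 pairs in each TLD, and the
# within-TLD standard error grows by a factor of sqrt(3).
print(pairs_per_tld(60, 3), se_inflation(60, 3))
```

Keeping 60 pairs in every TLD avoids this loss of precision, at the cost of three times as many addresses.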
>
>There are four potential gains to using several TLDs, provided that enough
>data is collected to make a valid experiment within each individual TLD.
>First, we can see whether the 550 method has different effects in different
>domains; second, we can get some idea of the effect of TLD on spam volume
>(anecdotal evidence conflicts here, and I've seen no solid numbers); third,
>if the TLD does in fact make no difference we have several times as much
>data to work with; fourth, if the 550 response does indeed have an effect we
>will be able to see if part of that effect is a reduction or increase in the
>unexplained variance.
>
> > >...I think it's perfectly reasonable to measure daily volumes...
> >
> > Knock yerself out.  Devote as much time as you like to analyzing daily
> > volume.  In fact, you can start right now, using Peter's data; it's a
> > large enough sample to permit a reasonably robust estimate of the
> > "true" population variance.  You might find analyzing those data to be
> > a statistically informative exercise; I know I did.
>
>I'm not the least bit interested in trying to do any further analysis on
>daily data.  What bothers me about just collecting (say) 90 day volumes is
>that an appropriate measure might be seven day volumes or 1 month volumes or
>three month volumes or even 90 days (unlikely - 90 days is neither an even
>number of weeks nor an even number of months so it won't properly mask
>periodic effects based on the calendar, which probably will be present). If
>I have 1 day numbers I can use them to produce the numbers for any period
>which is a multiple of 1 day, and see which multiple of 1 day reduces the
>unexplained variance best (provided the experiment runs for long enough,
>that is) and that way I get to see whether the 550 responses (if they have
>any effect at all) produce a flat reduction or a change in shape or both.
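
A rough sketch of the aggregation idea in the paragraph above: daily counts are summed into consecutive k-day windows, and the spread of the window totals is compared across several values of k. The coefficient of variation is used here only as a crude stand-in for "unexplained variance", not as the analysis proposed in the thread. The daily counts are made up, with a deliberate weekly cycle, and the function names are hypothetical.

```python
import statistics


def window_totals(daily, k):
    """Sum daily counts into consecutive, non-overlapping k-day windows.

    Any incomplete window at the end of the series is dropped.
    """
    n_full = len(daily) // k
    return [sum(daily[i * k:(i + 1) * k]) for i in range(n_full)]


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


# Made-up data: 8 weeks with 10 messages on weekdays and 3 at weekends.
daily = ([10] * 5 + [3] * 2) * 8

for k in (1, 7, 14):
    totals = window_totals(daily, k)
    print(k, len(totals), coefficient_of_variation(totals))
```

With this artificial series the 7-day and 14-day totals are constant, because each window covers whole weeks and so masks the weekly cycle, while the 1-day counts still vary. A 90-day window would not line up with weeks in the same way.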
>
>Anyway, you've already seen the comments I made after an initial analysis of
>Peter's data - I think I was the first to point out that there was no
>evidence of a downward trend, not even evidence of the absence of an upward
>trend, and that daily volume is very noisy indeed. And without data for a
>properly organised control address to compare it with, little useful analysis
>can be done except to note that there is a good deal of unexplained variance
>and no visible trend at any reasonable level of significance.
>
>Tom


_______________________________________________
Asrg mailing list
Asrg@ietf.org
https://www1.ietf.org/mailman/listinfo/asrg

