Need HAM:
After we settle on definitions, the main missing ingredient is a good
HAM corpus attached to similarly-sampled SPAM. Multi-language ham is
especially needed (I know SpamAssassin team has issued a call for it,
don't know if it will arrive.)
Just a point. This process works for testing content classification
systems. It's lousy for header analysis. In the first place,
archives almost always elide the recipient's address--which we need
to examine the email's path throught the network. And secondly, the
older the message, the more inaccurate any of the IP address
information is.