On Poison Filtration

I recently found this in a piece of filtered-out spam awaiting its demise in a purge queue. It was one of the ones peddling CDs full of harvested email addresses. I won't go into the merits or truth of their claims, but here's an excerpt:

Remember those 200 million lines of addresses, here's what
we did with them...

1. Cleaned and eliminated all duplicates. This process,
alone, reduced the list into a manageable number.

2. Next, we brought in a filter list of 400+ words/phrases
to clean even more. No addresses with inappropriate or
profane wording survived!

3. Then, a special filter file was used to eliminate the
"Web Poisoned" e-mail addresses from the list.  Our
EXCLUSIVE system reduced these "poison" addresses to near

4. Next we used our private database of thousands of known
"extremists" and kicked off every one we could find.  NOTE:
We maintain the world's largest list of individuals and
groups that are opposed to any kind of commercial
e-marketing... they are gone, nuked!

On the one hand, most of this doesn't deserve the dignity of a response, or indeed a repository anywhere outside /dev/null. However, they do cover, approximately, the normal known methods for sanitizing spam databases. So here are some brief correlated notes:

  1. Duplicate elimination: most of us who do legitimate programming work call this "hashing," mostly because a hash table is one of the quicker ways to render data unique. On a UNIX box it'd be quickest just to pipe the thing through sort -u, but most spammers are subcompetents on Windoze machines with no engineering experience. Be that as it may, it's irrelevant for poisoning purposes.
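The hashing approach above can be sketched in a few lines; a set is just a hash table underneath, so membership tests and inserts are constant-time on average. This is only an illustration of the technique, not anything any address peddler actually runs:

```python
# Minimal sketch of duplicate elimination via hashing.
def dedupe(addresses):
    seen = set()          # hash table of addresses already encountered
    unique = []
    for addr in addresses:
        key = addr.strip().lower()  # normalize: mailbox names are usually
                                    # treated case-insensitively in practice
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

The sort -u pipeline gets the same result by sorting first and dropping adjacent duplicates, at the cost of losing the original ordering.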
  2. There are two obvious counters to munged email addresses, e.g. user@spam.host.com. One is simply to drop any address that matches a possibly-munged word, of which the word 'spam' itself is almost certainly the most common. That works: it kills off a fair number of legitimate addresses while also catching a significant portion of the munged ones. The other is to try to reverse the munging. Simply removing the word spam and adjusting nearby separators (@, .) is easy enough. The number of permutations of munging is pretty large, though, and the more of them one writes algorithms to counter, the more legitimate data gets discarded. This ilk sells by quantity only -- there's no benefit to the addresslist marketers in lowering their numbers in the interests of quality, which can't really be measured in a clandestine fashion anyway. Somewhere in Usenet there's a poster who favored the address die-spammer-choking-on-your-own-feces@ucdavis.edu; this is roughly the sort of thing that gets filtered out, so much the better. :)
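The second counter -- reversing the munge -- might look like the sketch below, assuming only the single most common pattern (the literal word "spam" plus a stranded separator). Every additional munging permutation would need its own rule, which is exactly the combinatorial problem described above:

```python
import re

# Sketch of de-munging: strip the word "spam" and tidy whatever
# separator debris it leaves behind. Handles only the simplest case.
def demunge(addr):
    addr = re.sub(r'spam', '', addr, flags=re.IGNORECASE)
    addr = re.sub(r'\.\.+', '.', addr)    # collapse doubled dots
    addr = re.sub(r'\.@|@\.', '@', addr)  # dot stranded against the @
    return addr.strip('.')                # dot stranded at either end
```

Note that this rewrites legitimate addresses that merely contain the letters s-p-a-m (spamela@host.com, say), which is one of the ways quality erodes as the rule set grows.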
  3. Poison filters. Ah, the important part relative to sugarplum. Annoyingly, this paragraph doesn't drop many hints. The term "special filter file" is meaningless -- "file full of regular expressions" is the closest likely analog. The same principle applies here as with de-munging: the number of permutations is pretty large. Some of it is easy -- addresses where the username is a number (all-numeric usernames aren't valid on UNIX hosts anyway); addresses with a number-to-letter ratio greater than, say, 0.4; invalid TLDs; and so forth. Statistical filtration of addresses (e.g. number-letter-punctuation ratios) will likely achieve some minimal success against poison generated from byte-random output, e.g. j28pa9l4@host.com; it will also have a fairly high loss-rate of real data. Sugarplum's address generation as of 0.8.2 (as distinct from the fraction of its output consisting of known spammers' addresses) uses random dictionary words as hostnames in the US TLDs, with randomly-generated usernames based on a weighted sampling of letters and numbers, in an RFC-valid fashion. The weakest part of that tactic is that a DNS MX- and A-record lookup on each address can generally remove most poisoned addresses whose hostnames are made of random words. A future sugarplum release may have the option to use preselected known-good hostnames (especially those in culpable positions, e.g. legislative bodies and ISP/NSPs who don't properly kill off their spammers) for some fraction of the output. TLD selection is weighted in .com's favor, and almost every dictionary-word .com domain is taken, resulting in a lot of valid MX returns, especially for one-word hostnames, e.g. word.com. A future version of sugarplum will also offer configuration of the distribution of the number of words joined to make the hostname -- weighted strongly in favor of 1, f'rinstance, to increase false-positives on MX checks.
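The statistical heuristics mentioned above might be combined roughly as follows. This is a sketch of the filtering side -- the spammer's side -- not of anything sugarplum ships; the 0.4 threshold is just the figure floated in the text:

```python
# Sketch of a statistical poison filter: reject addresses whose local
# part looks byte-random. Threshold is illustrative, not canonical.
def looks_poisoned(addr, threshold=0.4):
    local = addr.split('@', 1)[0]
    letters = sum(c.isalpha() for c in local)
    digits = sum(c.isdigit() for c in local)
    if letters == 0:
        return True       # all-numeric username: reject outright
    return digits / letters > threshold
```

Against byte-random output like j28pa9l4@host.com this fires reliably, but real addresses with embedded years or phone digits (jsmith1968@host.com) start falling into the net as the threshold drops -- the loss-rate problem noted above. Dictionary-word poison, by contrast, sails straight through, which is why the MX/A-record lookup is the stronger counter against sugarplum-style output.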