zargony.com

#![desc = "Random thoughts of a software engineer"]

Email address scrambling methods compared

A while ago, I wrote about different JavaScript methods for preventing spam harvesters from recognizing an email address. These methods work by placing a scrambled version of the email address into the page source so that a harvester cannot recognize it as an email address. JavaScript then unscrambles the text and displays it as usual to human visitors. Usually, the "scrambling" consists of replacing the characters of the email address with their hex-entities (Rails' mail_to helper does so when using :encode => :hex or :encode => :javascript). My theory was, and still is, that hex-entities are no longer sufficient, since they can easily be reversed with simple search-and-replace operations.
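To illustrate why hex-entities are easy to reverse, here is a minimal sketch. The function names toHexEntities and fromHexEntities are made up for this example, and the output is not meant to match exactly what Rails' mail_to emits; it only shows that a single global search-and-replace is enough to recover the address.

```javascript
// Illustrative only: encode every character as a hexadecimal HTML entity,
// e.g. "a" becomes "&#x61;". (Hypothetical helper, not Rails' implementation.)
function toHexEntities(text) {
  return Array.from(text, (ch) => "&#x" + ch.codePointAt(0).toString(16) + ";").join("");
}

// What a harvester would have to do: one global search-and-replace
// that turns each hex entity back into its character.
function fromHexEntities(html) {
  return html.replace(/&#x([0-9a-f]+);/gi, (match, hex) =>
    String.fromCodePoint(parseInt(hex, 16))
  );
}

const scrambled = toHexEntities("user@example.com");
const recovered = fromHexEntities(scrambled);
```

No JavaScript engine is needed on the harvester's side for this; a regular expression in any scripting language would do the same job.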

So I came up with the idea of using a scrambling method that cannot be easily reversed. I assumed that spam harvesters can probably decode hex-entities but still aren't able to execute JavaScript. However, since this was just an assumption, I ran a simple test over the last 6 months to find out how well or badly the different scrambling methods perform.
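As a rough idea of what a scrambling that needs code to undo might look like, here is a hypothetical sketch. It is not my safe_text code; the functions scramble and unscramble and the key parameter are invented for illustration. The address is stored as a list of shifted character codes, which a plain entity decoder cannot turn back into text without running (or reimplementing) the unscrambling function.

```javascript
// Hypothetical scrambler, NOT the safe_text method from this post.
// Each character code is shifted by a key plus its position, so the page
// source contains only a list of numbers.
function scramble(address, key) {
  return Array.from(address, (ch, i) => ch.charCodeAt(0) + key + i);
}

// Runs in the visitor's browser to restore the readable address.
function unscramble(codes, key) {
  return codes.map((code, i) => String.fromCharCode(code - key - i)).join("");
}

const codes = scramble("user@example.com", 7);
const address = unscramble(codes, 7);
```

This is of course not cryptography: anyone who knows the scheme can reverse it. The point is only that a generic search-and-replace no longer works.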

Test setup

I placed four different email addresses on the main page of this blog and used a different scrambling method to "display" each of them. The username part of each email address was generated from random characters so that dictionary hits are unlikely. Every email address pointed to a server without any spam protection, to be sure of receiving every mail. All addresses were placed in the sidebar of this blog inside an invisible div and weren't used anywhere else, so any mail received at these addresses must be the result of a harvester recognizing it on this blog. The four scrambling methods were:

  1. no scrambling at all (mail_to)
  2. hex-entities (mail_to :encode => :hex)
  3. hex-entities with JavaScript (mail_to :encode => :javascript)
  4. my own JavaScript code (safe_text)

The results

Actually, I expected to get a lot more spam. Altogether, only 36 mails arrived during the 6 months of the test. Compared to the ~500 spam mails per day I get on my old student account, this seems like nothing. I assume that my blog is too unpopular to be harvested more often ;-). On the other hand, the test addresses were not used anywhere except in this test, so they only received spam because of harvesters visiting my blog, not because of other sites, mailing lists or address books being scanned by worms.

Method 1: No scrambling at all

As expected, using no scrambling at all gave the worst results. After just 5 days, the first spam mail arrived. In total, this email address received 31 spam mails.

Method 2: Using hex-entities

Surprisingly, using hex-entities resulted in 87% less spam than using no scrambling at all: only 4 spam mails were received. The first one arrived after 15 days.

Method 3: Using hex-entities with JavaScript

Even less spam was received here. Only a single mail came in, 3 months after the start of the test.

Method 4: My safe_text method

This makes my assumption look right: not a single mail was received at this email address.
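For reference, the figures above can be tallied with a few lines of arithmetic. The variable names are just labels chosen for this sketch; the counts are the ones reported for the four methods.

```javascript
// Spam mails received per method during the 6-month test.
const received = { none: 31, hex: 4, hexJs: 1, safeText: 0 };

// Total across all four test addresses.
const total = Object.values(received).reduce((sum, n) => sum + n, 0);

// Reduction relative to the unscrambled baseline, rounded to whole percent.
function reductionPercent(count, baseline) {
  return Math.round((1 - count / baseline) * 100);
}

const reductions = {
  hex: reductionPercent(received.hex, received.none),
  hexJs: reductionPercent(received.hexJs, received.none),
  safeText: reductionPercent(received.safeText, received.none),
};
```

With such small absolute numbers, the percentages should be read as a rough indication rather than a precise measurement.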

Conclusion

Basically, it seems that I was right to expect that email address harvesters can decode hex-entities nowadays. Unfortunately, when I started the experiment, I didn't know about using CSS to scramble an email address; otherwise I would have included that method as well. Furthermore, I wonder why the overall spam volume stayed so low even without any scrambling at all. Silvian did a similar experiment a while ago and received megabytes of spam.
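Since CSS scrambling wasn't part of the test, I can only sketch what it might look like. One commonly described approach (an assumption here, not something I measured) is to write the address backwards in the page source and let CSS flip the display direction. The helper name cssReversedMarkup is made up for this example.

```javascript
// Hypothetical helper: emits the address reversed in the markup and relies on
// CSS (direction + unicode-bidi) to render it in the readable order.
// Assumes the address contains no characters that need HTML escaping.
function cssReversedMarkup(address) {
  const reversed = Array.from(address).reverse().join("");
  return (
    '<span style="unicode-bidi: bidi-override; direction: rtl;">' +
    reversed +
    "</span>"
  );
}

const markup = cssReversedMarkup("user@example.com");
```

A known drawback of this kind of trick is that copying the displayed text may yield the reversed string, so it trades some usability for obscurity.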

However, even with this low volume of spam, I think it's fair to conclude that hex-entities are not a safe scrambling method and that we need to make things a bit harder for address harvesters. Fortunately, tricking harvesters is still not very hard (for example with my safe_text method).