PHP Email Obfuscator

on 27th August
  • email
  • obfuscator
  • php
  • munge
  • anti-spam
  • counter-spam
  • spambots
  • email harvester
Lately, across our network of sites, we've been getting more spam than usual. The cause of the problem: the spambots are getting better. Spambots are programs that crawl the internet (in much the same way the search engines do) with the sole purpose of gathering e-mail addresses in order to send you completely irrelevant promotional material. I've often wondered why the spammers don't go further: it would be really easy to determine the context in which an e-mail address appears, and even to glean information from multiple sources about the consumer preferences of the subject, which would increase what must be a dismal conversion rate. Anyway, that's not what this article is about. This article is about e-mail obfuscation: making the e-mail addresses on your site readable by humans but unintelligible to robots.

There are a couple of techniques out there. The most basic, which has been in use since the dawn of time (well, the dawn of spambots), is to fire up your favourite graphics editor, create an image of your e-mail address and replace the text of your e-mail address on your site with the newly created graphic. This has its limitations. If you run a large site with a lot of e-mail addresses (hence prone to changes and additions), you would have to have someone on hand to create this plethora of e-mail address graphics: a waste of time and money. Also, unless you use some form of CAPTCHA-style text, the bot (depending on its sophistication) could harvest the e-mail address from the image. The second technique is to write out the e-mail address but replace "legible" characters with "illegible" ones, here by percent-encoding them. So, for example, a mailto link to the e-mail address "me@mysite.com" might look like the following:


<a href='mailto:m%65@%6D%79%73ite%2E%63om'>Mail</a>
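
To see what such a link amounts to, here is a minimal PHP sketch (the variable names are illustrative, not part of any harvester I've seen) that turns the encoded href back into the address with a single standard library call:

```php
<?php
// The href value from the example link.
$href = 'mailto:m%65@%6D%79%73ite%2E%63om';

// Strip the scheme, then undo the percent-encoding with rawurldecode().
$address = rawurldecode(substr($href, strlen('mailto:')));

echo $address, "\n"; // me@mysite.com
```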


The drawback: it's painfully easy to "decode". Recently I came across a solution that works well, from one Tim Williams (University of Arizona), later modified by Andrew Moulden (Site Engineering). Here's the theory and its resulting evolution. Take the e-mail address you want to make unreadable to the spambots, convert it to lowercase, create a cipher key (the string of characters the encryption is based on), encrypt the e-mail address using the key, write out the coded e-mail address to the document, write out the cipher key to the document (on their own, both are basically useless to a harvester) and then wrap this in a piece of JavaScript that rebuilds the address from the key and the coded text and writes out the link. Easy as. The trouble with the original technique was that it used the same cipher key for every e-mail address, so if the technique came into wide use, a spambot author would just need to take the fixed key and write code to decode the obfuscated text back into a useful e-mail address (again, really easy, although a lot harder to do efficiently, what with coding idiosyncrasies). Then Andrew Moulden modified the JavaScript so that a different cipher key would be used every time the script was run. This was a really valuable evolution of the script.
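
The encode and decode steps in that scheme amount to a shift through the key by the length of the address. The following is only a sketch of the idea in PHP: the function names and the fixed demo key are made up for illustration, and a fixed key is exactly the weakness just described.

```php
<?php
// Fixed, unshuffled key used purely for demonstration (an assumption).
const DEMO_KEY = 'abcdefghijklmnopqrstuvwxyz0123456789.@';

// Hypothetical helper: shift each character forward through the key.
function shift_encode(string $address, string $key): string
{
    $address = strtolower($address);
    $n = strlen($key);
    $shift = strlen($address);
    $coded = '';
    for ($i = 0; $i < strlen($address); $i++) {
        $pos = strpos($key, $address[$i]);
        // Characters missing from the key pass through unchanged.
        $coded .= ($pos === false) ? $address[$i] : $key[($pos + $shift) % $n];
    }
    return $coded;
}

// Hypothetical helper: shift each character back by the same amount.
function shift_decode(string $coded, string $key): string
{
    $n = strlen($key);
    $shift = strlen($coded);
    $plain = '';
    for ($i = 0; $i < strlen($coded); $i++) {
        $pos = strpos($key, $coded[$i]);
        // The double modulo keeps the index non-negative for long inputs.
        $plain .= ($pos === false)
            ? $coded[$i]
            : $key[((($pos - $shift) % $n) + $n) % $n];
    }
    return $plain;
}

$coded = shift_encode('me@mysite.com', DEMO_KEY);
echo shift_decode($coded, DEMO_KEY), "\n"; // me@mysite.com
```

Anyone who knows the key can run the second function, which is why a key that never changes gives so little protection.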

As the script stood in that form, it was the best solution out there for small to medium sites, but the problem lay mostly in the distribution. The code used to encrypt the e-mail address was written in JavaScript and executed in the browser, taking the plaintext e-mail address as a parameter. But, of course, you couldn't call the encryption function (munge) from the page itself, because you would have to write the legible address into the document to do so. The alternative was to generate the code offline and paste the result into the document. I wanted to implement and distribute this code on a reasonably large scale across several sites, so that wasn't an option. Here's my variation on the current offering.

The generation code is written in PHP. Every bit of content (where an e-mail address might appear) on every site we've ever built is housed in one or more databases. Before the content is sent to the browser, we pre-parse it looking for plaintext e-mail addresses. I'm not going to reprint the regular expression for detecting standard e-mail addresses here because it's really long and complicated. Conveniently, the addresses on our sites are all of the form w.x@y.z or x@y.z so some of the "deeper" patterns are irrelevant. Once we find an e-mail address in our pre-parse we replace the plaintext with the result of a call to the PHP function munge:


function munge($address)
{
$address = strtolower($address);
$coded = "";
$unmixedkey = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789.@";
$inprogresskey = $unmixedkey;
$mixedkey="";
$unshuffled = strlen($unmixedkey);
// Draw one remaining character at random on each pass until the key is used up.
for ($i = 0; $i < strlen($unmixedkey); $i++)
{
$ranpos = rand(0,$unshuffled-1);
$nextchar = $inprogresskey[$ranpos];
$mixedkey .= $nextchar;
$before = substr($inprogresskey,0,$ranpos);
$after = substr($inprogresskey,$ranpos+1,$unshuffled-($ranpos+1));
$inprogresskey = $before.$after;
$unshuffled -= 1;
}
$cipher = $mixedkey;

$shift = strlen($address);

$txt = "<script type=\"text/javascript\" language=\"javascript\">\n" .
"<!-"."-\n" .
"// Email obfuscator script 2.1 by Tim Williams, University of Arizona\n".
"// Random encryption key feature by Andrew Moulden, Site Engineering Ltd\n".
"// PHP version coded by Ross Killen, Celtic Productions Ltd\n".
"// This code is freeware provided these six comment lines remain intact\n".
"// A wizard to generate this code is at http://www.jottings.com/obfuscator/\n".
"// The PHP code may be obtained from http://www.celticproductions.net/\n\n";

for ($j=0; $j<strlen($address); $j++)
{
// strpos() returns false (not -1) when the character is not in the key.
$pos = strpos($cipher,$address[$j]);
if ($pos === false)
{
$coded .= $address[$j];
}
else
{
$chr = ($pos + $shift) % strlen($cipher);
$coded .= $cipher[$chr];
}
}


$txt .= "\ncoded = \"" . $coded . "\"\n" .
" key = \"".$cipher."\"\n".
" shift=coded.length\n".
" link=\"\"\n".
" for (i=0; i<coded.length; i++) {\n" .
" if (key.indexOf(coded.charAt(i))==-1) {\n" .
" ltr = coded.charAt(i)\n" .
" link += (ltr)\n" .
" }\n" .
" else { \n".
" ltr = (key.indexOf(coded.charAt(i))-shift+key.length) % key.length\n".
" link += (key.charAt(ltr))\n".
" }\n".
" }\n".
"document.write(\"<a href='mailto:\"+link+\"'>\"+link+\"</a>\")\n" .
"\n".
"//-"."->\n" .
"<" . "/script><noscript>N/A" .
"<"."/noscript>";
return $txt;
}




This produces something like the following in the output (the values of coded and key will be different each time):


<script type="text/javascript" language="javascript">
<!--

coded = "fbmfiVpyp@V23@S6lJMxHvpU@bu"
key = "98UJ3q.RmbHOyjDJFknIHNIe7PfuG8td0Fl9Vp5sog2C@hYWv1N"
shift=coded.length
link=""
for (i=0; i<coded.length; i++) {
if (key.indexOf(coded.charAt(i))==-1) {
ltr = coded.charAt(i)
link += (ltr)
}
else {
ltr = (key.indexOf(coded.charAt(i))-shift+key.length) % key.length
link += (key.charAt(ltr))
}
}
document.write("<a href='mailto:"+link+"'>"+link+"</a>")
//-->
</script><noscript>N/A</noscript>
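
The pre-parse step described earlier could be wired up roughly as follows. This is a sketch under stated assumptions: obfuscate_emails is a hypothetical name, and the pattern is a deliberately simplified stand-in for a full e-mail regular expression, covering only the x@y.z and w.x@y.z shapes mentioned earlier. The encoder is passed in as a callable so the example runs on its own; in practice it would be munge.

```php
<?php
// Hypothetical helper: replace every plaintext address in $html with the
// result of calling $encoder on it.
function obfuscate_emails(string $html, callable $encoder): string
{
    // Simplified pattern (an assumption): letters, digits and dots only
    // in the local part, so it handles x@y.z and w.x@y.z.
    $pattern = '/\b[a-z0-9]+(?:\.[a-z0-9]+)*@[a-z0-9-]+(?:\.[a-z0-9-]+)+\b/i';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($encoder) {
            return $encoder($m[0]);
        },
        $html
    );

    // preg_replace_callback() returns null on failure; fall back to the input.
    return $result ?? $html;
}

// Usage with a stand-in encoder that just reverses the address.
echo obfuscate_emails('Write to me@mysite.com today.', function ($a) {
    return '[' . strrev($a) . ']';
}), "\n"; // Write to [moc.etisym@em] today.
```

Note that this sketch makes no attempt to skip addresses that already sit inside an HTML attribute such as an existing mailto link; content like that would need handling before the replacement runs.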



This is fairly unintelligible to even quite sophisticated spambots. I hope you find this function useful; it's a small evolution of really great code by the people mentioned above. I have also developed a version of this code where the number of line breaks is random and the variable names in the generated JavaScript are random. For the e-mail addresses that I've obfuscated and published to live sites, I'm informed that not even a shred of spam has been received since. You have to be really paranoid to use the random variable name technique, but rest assured the bots will get better, so it's best to be prepared.
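
As a rough illustration only (this is not the production code, and the function names are invented for this example), randomising the identifiers in the generated script might look something like this:

```php
<?php
// Invented for illustration: returns a random JavaScript-safe identifier.
function random_js_name(int $length = 8): string
{
    $letters = 'abcdefghijklmnopqrstuvwxyz';
    $name = '';
    for ($i = 0; $i < $length; $i++) {
        $name .= $letters[rand(0, strlen($letters) - 1)];
    }
    // The underscore prefix avoids collisions with reserved words.
    return '_' . $name;
}

// Invented for illustration: one unique random name per role in the script.
function random_js_names(array $roles): array
{
    $names = [];
    foreach ($roles as $role) {
        do {
            $candidate = random_js_name();
        } while (in_array($candidate, $names, true));
        $names[$role] = $candidate;
    }
    return $names;
}

// Build the map once per generated script, then use it in the template.
$n = random_js_names(['coded', 'key', 'shift', 'link', 'i', 'ltr']);
$line = ' ' . $n['shift'] . '=' . $n['coded'] . ".length\n";
echo $line;
```

Each fixed name in the JavaScript template would then be replaced by the matching entry in the map.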

As with all articles on Celtic Productions, this article is protected by international copyright laws. It may be linked to (we are of course most grateful for links to our articles); however, it may never be reproduced without the prior express permission of its owners, Celtic Productions. The code contained herein can of course be used freely (as long as the JavaScript credits remain in place).