Title: PHP Email Obfuscator
Posted: Mon, 27th August 2007
Category: Email
Lately, across our network of sites, we've been getting more spam
than usual. The cause of the problem: the spambots are getting
better. Spambots are programs that crawl the internet (in much
the same way the search engines do) with the sole purpose of
gathering e-mail addresses in order to send you completely
irrelevant promotional material. I've often wondered why they
don't do better: it would be really easy to determine the context
in which an e-mail address is found, and you could even glean
information from multiple sources about the consumer preferences
of your subject, and so increase what must be a dismal conversion
rate for the spammers. Anyway, that's not what this article is
about. This article is about e-mail obfuscation: making the
e-mail addresses on your site readable by humans but
unintelligible to robots.
There are a couple of techniques out there. The most basic, which
has been in use since the dawn of time (well, of spambots), is to
fire up your favourite graphics editor, create an image of your
e-mail address and replace the text of your e-mail address on
your site with the newly created graphic. This has its
limitations. If you run a large site with a lot of e-mail
addresses (hence prone to changes and additions), you would have
to have someone on hand to create this plethora of e-mail address
graphics: a waste of time and money. Also, unless you use some
form of CAPTCHA-style distorted text, a bot (depending on its
sophistication) could still harvest the e-mail address from the
image.
The second technique is to write out the e-mail address but
replace "legible" characters with "illegible" ones, such as their
percent-encoded hexadecimal equivalents. So, for example, a
mailto link to the e-mail address "me@mysite.com" might look like
the following:
<a href='mailto:m%65@%6D%79%73ite%2E%63om'>Mail</a>
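To give a sense of how little protection this encoding offers,
here is a small sketch (my own illustration, not taken from any
of the scripts discussed in this article). It assumes nothing
more than PHP's built-in rawurldecode(), which reverses
percent-encoding in a single call:

```php
<?php
// The href value from the example link above.
$href = 'mailto:m%65@%6D%79%73ite%2E%63om';

// rawurldecode() is a PHP built-in that turns each %XX sequence
// back into the character it stands for.
$decoded = rawurldecode($href);

echo $decoded, "\n"; // prints "mailto:me@mysite.com"
```

Any harvester that applies the equivalent call to every href it
finds recovers the address with no extra effort.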
The drawback: it's painfully easy to "decode". Recently I came
across a solution that works well, written by Tim Williams
(University of Arizona) and later modified by Andrew Moulden
(Site Engineering). Here's the theory and its resulting
evolution. Take the e-mail address you want to make unreadable to
the spambots and convert it to lowercase. Create a cipher key (a
string of characters used to encode the address) and encrypt the
e-mail address with it. Write the coded e-mail address and the
cipher key out to the document (both are basically useless to a
harvester), then wrap them in a piece of javascript that rebuilds
and writes out the link from the cipher key and cipher text.
Easy as. The trouble with the original version was that it used
the same cipher key for every e-mail address, so if the technique
were used widely, a spambot author would just need to take the
fixed cipher key and write code to decode the obfuscated text
into a useful e-mail address (again, really easy, though a lot
harder to do efficiently given everyone's coding idiosyncrasies).
Andrew Moulden then modified the javascript so that a different
cipher key would be used every time the script was run. This was
a really valuable evolution of the script.
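To make the arithmetic concrete, here is a minimal round-trip
sketch of the shift-by-length idea described above. The function
names shift_encode() and shift_decode() and the short, unshuffled
demo key are my own assumptions for illustration; the real
scripts use a longer, shuffled key and do the decoding in
javascript.

```php
<?php
// Hypothetical helpers for illustration only.

// Replace each character found in $key with the character
// strlen($plain) positions further along the key (wrapping round).
function shift_encode(string $plain, string $key): string
{
    $shift = strlen($plain);
    $len = strlen($key);
    $out = '';
    for ($i = 0; $i < strlen($plain); $i++) {
        $pos = strpos($key, $plain[$i]);
        // Characters that are not in the key pass through unchanged.
        $out .= ($pos === false) ? $plain[$i] : $key[($pos + $shift) % $len];
    }
    return $out;
}

// Reverse the shift. The coded text has the same length as the
// plain text, so the shift can be recovered from it.
function shift_decode(string $coded, string $key): string
{
    $shift = strlen($coded);
    $len = strlen($key);
    $out = '';
    for ($i = 0; $i < strlen($coded); $i++) {
        $pos = strpos($key, $coded[$i]);
        // The double modulo keeps the index non-negative.
        $index = ((($pos - $shift) % $len) + $len) % $len;
        $out .= ($pos === false) ? $coded[$i] : $key[$index];
    }
    return $out;
}

$key = 'abcdefghijklmnopqrstuvwxyz.@'; // fixed demo key (assumption)
$coded = shift_encode('me@mysite.com', $key);
echo $coded, "\n";
echo shift_decode($coded, $key), "\n"; // prints "me@mysite.com"
```

With a fixed key like this one, anybody who has seen the key once
can run the decode step against every page that uses it, which is
exactly the weakness the random key addresses.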
As the script stood in that form, it was the best solution out
there for small to medium sites, but the problem lay mostly in
the distribution. The code used to encrypt the e-mail address was
written in javascript and executed in a browser, taking the
plaintext e-mail address as a parameter. But, of course, you
couldn't call the encryption function (munge) from your own
pages, because passing it the plaintext e-mail address would mean
writing the legible address into the document. I wanted to
implement and distribute this code on a reasonably large scale
across several sites and didn't want to generate the code offline
and then paste the result into each document, so here's my
variation on the current offering.
The generation code is written in PHP. Every bit of content
(where an e-mail address might appear) on every site we've ever
built is housed in one or more databases. Before the content is
sent to the browser, we pre-parse it looking for plaintext e-mail
addresses. I'm not going to reprint the regular expression for
detecting standard e-mail addresses here because it's really long
and complicated. Conveniently, the addresses on our sites are all
of the form w.x@y.z or x@y.z so some of the "deeper" patterns are
irrelevant. Once we find an e-mail address in our pre-parse we
replace the plaintext with the result of a call to the PHP
function munge:
function munge($address)
{
    $address = strtolower($address);
    $coded = "";
    // Alphabet of characters that will be enciphered.
    $unmixedkey = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789.@";
    $inprogresskey = $unmixedkey;
    $mixedkey = "";
    $unshuffled = strlen($unmixedkey);
    // Build a random permutation of the alphabet: pick a random
    // remaining character, append it to the key, then remove it.
    for ($i = 0; $i < strlen($unmixedkey); $i++)
    {
        $ranpos = rand(0, $unshuffled - 1);
        $nextchar = $inprogresskey[$ranpos];
        $mixedkey .= $nextchar;
        $before = substr($inprogresskey, 0, $ranpos);
        $after = substr($inprogresskey, $ranpos + 1);
        $inprogresskey = $before . $after;
        $unshuffled -= 1;
    }
    $cipher = $mixedkey;
    // The shift equals the address length, so the javascript can
    // recover it from coded.length.
    $shift = strlen($address);
    $txt = "<script type=\"text/javascript\" language=\"javascript\">\n" .
        "<!-"."-\n" .
        "// Email obfuscator script 2.1 by Tim Williams, University of Arizona\n".
        "// Random encryption key feature by Andrew Moulden, Site Engineering Ltd\n".
        "// PHP version coded by Ross Killen, Celtic Productions Ltd\n".
        "// This code is freeware provided these six comment lines remain intact\n".
        "// A wizard to generate this code is at http://www.jottings.com/obfuscator/\n".
        "// The PHP code may be obtained from http://www.celticproductions.net/\n\n";
    for ($j = 0; $j < strlen($address); $j++)
    {
        // strpos() returns false (not -1) when the character is absent.
        $pos = strpos($cipher, $address[$j]);
        if ($pos === false)
        {
            // Characters outside the key are passed through unchanged.
            $coded .= $address[$j];
        }
        else
        {
            $chr = ($pos + $shift) % strlen($cipher);
            $coded .= $cipher[$chr];
        }
    }
    $txt .= "\ncoded = \"" . $coded . "\"\n" .
        " key = \"".$cipher."\"\n".
        " shift=coded.length\n".
        " link=\"\"\n".
        " for (i=0; i<coded.length; i++) {\n" .
        " if (key.indexOf(coded.charAt(i))==-1) {\n" .
        " ltr = coded.charAt(i)\n" .
        " link += (ltr)\n" .
        " }\n" .
        " else { \n".
        " ltr = (key.indexOf(coded.charAt(i))-shift+key.length) % key.length\n".
        " link += (key.charAt(ltr))\n".
        " }\n".
        " }\n".
        "document.write(\"<a href='mailto:\"+link+\"'>\"+link+\"</a>\")\n" .
        "\n".
        "//-"."->\n" .
        "<" . "/script><noscript>N/A" .
        "<"."/noscript>";
    return $txt;
}
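The pre-parse step described earlier could be wired up roughly as
follows. This is only a sketch under stated assumptions: the
helper name obfuscate_addresses() is hypothetical, and the
pattern is a deliberately simplified one for addresses of the
form w.x@y.z or x@y.z, not the long regular expression we
actually use. The encoder is passed in as a callable so the
example runs on its own.

```php
<?php
// Hypothetical helper, not part of the original code.
// $encoder is any callable that turns an address into replacement
// markup, for example the munge() function from this article.
function obfuscate_addresses(string $content, callable $encoder): string
{
    // Simplified pattern (an assumption): it does not cover every
    // address the mail standards allow.
    $pattern = '/\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/';

    $result = preg_replace_callback(
        $pattern,
        function (array $match) use ($encoder) {
            return $encoder($match[0]);
        },
        $content
    );

    // preg_replace_callback() returns null on a regex error;
    // fall back to the untouched content in that case.
    return $result ?? $content;
}

// Stand-in encoder so the effect is easy to see.
$demo = obfuscate_addresses(
    'Contact jane.doe@example.com or bob@example.org.',
    function (string $address): string {
        return '[' . strtoupper($address) . ']';
    }
);
echo $demo, "\n";
```

In real use you would pass 'munge' as the second argument. Note
that a pattern this simple will also match an address that
already sits inside an href attribute, so content containing
hand-written mailto links would need extra care.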
This produces something like the following in the output (the
variables coded and key will be different):
<script type="text/javascript" language="javascript">
<!--
coded = "fbmfiVpyp@V23@S6lJMxHvpU@bu"
key = "98UJ3q.RmbHOyjDJFknIHNIe7PfuG8td0Fl9Vp5sog2C@hYWv1N"
shift=coded.length
link=""
for (i=0; i<coded.length; i++) {
if (key.indexOf(coded.charAt(i))==-1) {
ltr = coded.charAt(i)
link += (ltr)
}
else {
ltr = (key.indexOf(coded.charAt(i))-shift+key.length) % key.length
link += (key.charAt(ltr))
}
}
document.write("<a href='mailto:"+link+"'>"+link+"</a>")
//-->
</script><noscript>N/A</noscript>
This is fairly unintelligible to even quite sophisticated
spambots. I hope you find this function useful; it's a small
evolution of really great code by the people mentioned above. I
have also developed a version of this code where the number of
line breaks is random and the variable names in the generated
javascript are random. Of the e-mail addresses that I've
obfuscated and published to live sites, I'm informed that not a
shred of spam has been received since. You have to be really
paranoid to use the random variable name technique, but rest
assured the bots will get better, so it's best to be prepared.
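For anyone curious how the random-name idea might look, here is a
rough sketch. It is not the version mentioned above (that code
isn't reproduced in this article); the helper names
random_js_identifier() and random_line_breaks() are hypothetical,
and random_int() assumes PHP 7 or later.

```php
<?php
// Hypothetical helpers for illustration only.

// Build a random javascript variable name.
function random_js_identifier(int $length = 8): string
{
    $letters = 'abcdefghijklmnopqrstuvwxyz';
    // The leading underscore keeps the name from colliding with a
    // javascript reserved word such as "for" or "var".
    $name = '_';
    for ($i = 0; $i < $length; $i++) {
        $name .= $letters[random_int(0, strlen($letters) - 1)];
    }
    return $name;
}

// Return between 1 and $max newline characters.
function random_line_breaks(int $max = 3): string
{
    return str_repeat("\n", random_int(1, $max));
}

// Two names that could stand in for "coded" and "key" in the
// generated script. A production version would also check that
// the two names differ.
$codedVar = random_js_identifier();
$keyVar = random_js_identifier();

echo $codedVar, ' = "(coded text)"', random_line_breaks(),
     $keyVar, ' = "(cipher key)"', "\n";
```

The generated javascript would then be assembled using these
names in place of the fixed ones, so no two pages share an
obvious signature for a harvester to look for.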
As with all articles on Celtic Productions, this article is
protected by international copyright laws. It may be linked to
(we are of course most grateful for links to our articles),
however, it may never be reproduced without the prior express
permission of its owners, Celtic Productions. The code contained
therein of course can be used freely (as long as the javascript
credits remain in place).