Thwarting the Advances of Internet Robots and Other Pests
Erik Schorr <phelix@arpa.org>
July 19, 1998 for www.arpa.org, last revised January 12, 1999
Duplication permitted, provided content is unmodified.
------------------------------------------------------------------------------

1) The purpose of this article is to provide ideas and insight on the
proactive and reactive countermeasures against automated `pests' on
the internet, targeted at UNIX system administrators who run publicly
accessible mail, web, and ftp servers. Internet service providers
and private organizations alike are being frustrated by the reality
of thousands of money-craving individuals unleashing their tools upon
the net to gather and sift through just about any email addresses they
can find, only to then include such email addresses in unsolicited mass
email attacks. This is the most common form of web and mail
abuse, and will be the primary focus of this article.
Another problem many administrators face, though not nearly as
annoying as the one above, is people who, usually innocently, sometimes
intentionally, mirror the administrators' websites and ftp servers to
reproduce the content on a local machine. Not only does this consume
bandwidth that would otherwise be available to legitimate users, but
quite possibly the data being copied is more than the person was
interested in to begin with. Mirroring may also be illegal, as in the
case of copying the content of a pay-per-use website. There will always
be copyright notices, and there will always be people and robots that
ignore them.


2) Prevention. Let us use the following example.
A website administrator is noticing `spiders' and `email extractors'
are spending appreciable time and bandwidth scanning through the
server's content, following every link, and presumably saving the
content for either email addresses or pictures. He sees in the logs
that it's not from a reputable search engine, and in fact, none of the
images were ever retrieved.
Shortly after, perhaps even the same day, he notices that all his users
have been sent harassing unsolicited bulk email, advertising some
shady bulk email software or some hot new product guaranteed to
make his website famous for a small fee of $49.95. Chances are, their
email addresses have been `extracted' and saved to a list, ready to be
sold to internet advertising agencies.
There are several well-known email extraction programs readily
available that let advertisers collect literally millions of email
addresses for `free'. Because it costs next to nothing to send email
these days, this looks like a very enticing way for business owners to
get their spew out to the masses. All it requires is a connection to the
internet, software to scan the net for email addresses, and whatever
similar programs are required to push their ads out to the
victims.

The first and most obvious way to prevent this is just not to have
your email addresses listed anywhere. Certainly not very practical, and
in the long run you end up losing business because the children of the
point-and-click generation just want to be able to click on your email
link, and end up going somewhere else if they can't. A compromise
is to `encode' your email address with HTML tags and not
include it in an anchor. For example, where you would normally use
<A HREF="mailto:me@myplace.com">email me@myplace.com</A>, replace it
with: <u>email me at <b>me</b>@<b>myplace.com</b></u>. This may look
like a link, but won't be `clickable'. Because of the tags within the
email address, it appears munged to anything that doesn't correctly
parse the HTML, yet it is copied as "me@myplace.com" when highlighted
within a real browser.
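
To see why the tags help, here is a rough sketch that runs both forms of
the address past a deliberately naive pattern match. The pattern and the
function name extract_addresses are assumptions made for illustration;
real extractors may be smarter (or dumber) than this.

```shell
# A deliberately naive address pattern, standing in for whatever an
# extractor really uses (an assumption; real tools vary).
extract_addresses() {
    grep -Eo '[A-Za-z0-9._-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
}

plain='<A HREF="mailto:me@myplace.com">email me@myplace.com</A>'
munged='<u>email me at <b>me</b>@<b>myplace.com</b></u>'

# The plain anchor gives up the address; the munged form gives up nothing,
# because the characters on either side of the "@" are tag brackets.
echo "plain:  $(printf '%s\n' "$plain"  | extract_addresses | sort -u | tr '\n' ' ')"
echo "munged: $(printf '%s\n' "$munged" | extract_addresses | sort -u | tr '\n' ' ')"
```

An extractor that strips tags before matching would still find the
address, so treat this as a speed bump rather than a wall.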

Javascript. It's beyond the scope of this article to teach one how
to do it, but you can place a bit of javascript within an anchor which,
when run, produces the true email address from an encoded (rot13? :)
string. A capable web browser would run this correctly and generate the
appropriate link, while a semi-stupid email extractor would just ignore it.
This is a bit more effective than the technique above, but much more
confusing, and not guaranteed to work for all `real' clients.
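
The browser-side script is left to the reader, but preparing the encoded
string is simple. As a minimal sketch, assuming rot13 is the encoding you
settle on, the standard tr command does the job; the function name rot13
is just a label chosen here.

```shell
# rot13 is its own inverse: applying it twice returns the original text.
# Characters outside A-Z and a-z (such as "@" and ".") pass through as-is.
rot13() {
    tr 'A-Za-z' 'N-ZA-Mn-za-m'
}

encoded=$(printf '%s' 'me@myplace.com' | rot13)
decoded=$(printf '%s' "$encoded" | rot13)
echo "$encoded"    # the string you would embed in the page
echo "$decoded"    # what the browser-side script should reconstruct
```

Because the "@" survives the encoding, a pattern-matching extractor may
still pick up the encoded string as an address; it just won't be yours.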

Server-parsed HTML. One of the many beautiful features of modern
web server software (I'll plug Apache, the choice of between 50% and
80% of webserver administrators around the world), is the ability
to change the content of an HTML document as it's being served to the
client. SSI, or Server Side Includes, are similar to CGIs but their
generated output is literally included as part of the web page it was
called from. Headers and Footers can be generated by SSI programs
called from within a server-parsed HTML document, for example. It's
beyond our scope to document all the features of SHTML and SSI, but
what I will mention is the <!--#if ... --> tag and its <!--#endif -->
complement. After you're sure you have SHTML working with your
favorite Apache server, read up on these elements in the mod_include
documentation. By the way, if you don't already have apache-1.3, get it.
The #if element can test the contents of a variable in an easy-to-learn
expression format, including regex string matches. In this example, the
HTTP_USER_AGENT variable, usually passed to the server by the client's
web browser as a description of itself, will be tested to see if it
contains certain strings. You obviously want everyone using a legitimate
browser to see the full content of your HTML pages, and since we all know
about the most common `real' web browsers, we can test to see if the
client is, in fact, one of our known browsers. If it isn't, we can pass
a completely bogus document to them, or a polite message explaining why
they can't see the real document. Of course, you won't put any email
addresses or links in the `bogus' document, but I can't stop you from
putting a link in for say, http://spam.abuse.net :)
Here is an example of a Server-Parsed HTML document I would use to
produce the real content for known clients, and bogus content for the
unknowns. Remember that for most installations, the file must have
the .shtml extension to be parsed by the mod_include module.

<HTML><HEAD><TITLE> My Document </TITLE></HEAD>
<BODY>
<!--#set var="c" value="$HTTP_USER_AGENT" -->
<!--#if expr="$c = /MSIE/ || $c = /Mozilla/ || $c = /Lynx/" -->
Here's the content we'd like to show our friendly audience.<BR>
(Those of you with MSIE, Netscape, or Lynx)<P>
<!--#else -->
Here's what we'd like the email extractors and web-bots to see.<BR>
A link for the spammers to follow:
<A HREF="http://127.0.0.1/">hi.</A>
And another elite link:
<A HREF="http://www.arpa.org/raid/blackflag">Bye.</A>
<!--#endif -->
</BODY>
</HTML>

Basically, the #if statement checks the HTTP_USER_AGENT variable
passed to it and matches if it contains the strings MSIE, Mozilla, or
Lynx. Assuming it finds one of these matches, it will continue to
deliver the text immediately following the statement, up to the #else
element. If there was no match, we'll assume the `client' is unknown,
so we give it meaningless garbage. This technique is limited to
webserver software that is capable of parsing the document before serving
it to the client. It is a bit tricky to set up for beginners, and may seem
a bit strict and BOFHistic, but if you consider that the HTTP_USER_AGENT
a web spider or email extractor would report to the server is almost
always either null, or the real name of the software (see below for
examples I have found), you'll have a very effective way to prevent the
unwanted pests from gathering your info. All we need is to serve
trusted and recognized clients, such as MSIE, Netscape, and Lynx
(To name just a few.)
HTTP_USER_AGENT strings to watch out for, in case you feel like
logging them like I do (I log this sort of thing religiously), include
words like the following:
htdig, Slurp, Wget, Webscanner, Namecrawler, CherryPicker,
Konqueror, Scanner, EchO, LinkWalker, Teleport, EmailSiphon

EmailSiphon is the one you really have to watch for. It _will_
send you email soon after it finds your email address.
This is not my original idea. www.rootshell.com uses a similar
technique to prevent most people from mirroring their archives.
A simple way around this is to have your mirroring software send
an HTTP_USER_AGENT string to make it look like Netscape or Lynx.
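
If you log agent strings, a quick way to check for the names listed above
is to grep for them. This is only a sketch: it assumes the agent string
appears somewhere on each log line (as with Apache's "combined" log
format), and the function name find_pests and the sample log lines are
made up for the example.

```shell
# Agent substrings taken from the list above; extend as you see fit.
pests='htdig|Slurp|Wget|Webscanner|Namecrawler|CherryPicker|Konqueror|Scanner|EchO|LinkWalker|Teleport|EmailSiphon'

# Read log lines on stdin and print only those mentioning a listed agent.
# The match is against the whole line, so a requested path containing one
# of these words would also be reported.
find_pests() {
    grep -E "$pests"
}

# Example with two made-up log lines (addresses and paths are illustrative):
printf '%s\n' \
  '10.0.0.1 - - [19/Jul/1998:13:27:23 -0700] "GET / HTTP/1.0" 200 512 "-" "EmailSiphon"' \
  '10.0.0.2 - - [19/Jul/1998:13:27:24 -0700] "GET / HTTP/1.0" 200 512 "-" "Lynx/2.8"' \
  | find_pests
```

Against a real log you would feed the file on stdin instead of the two
sample lines.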


3) Tracking the ones that get past us.
It may not be totally feasible or effective to employ the above
techniques to prevent people from abusing your web and email services,
but at least we have a highly effective way of tracking them.
I've already tracked down and killed (figuratively. I just got
their dialup accounts revoked) at least 10 spammers in the past few
months with this technique. It includes using the above technique
to hide legitimate email addresses but also involves a program I call
"poison" to generate unique _real_ email addresses that we actually
_want_ the spammers to extract and send mail to. This may sound
strange. Read on.
Imagine an email extracting web bot pulling a web page from our
server. Whether the server recognizes or trusts it is irrelevant at
this point. We still give it a unique email address, embedded in our
HTML somewhere inconspicuous to humans, where the username is a simple
hash based on the location of the page, and the current time as the
page is retrieved. The domain portion is one that we set up in
advance to redirect ALL incoming mail (for just this reserved domain)
to a file on our mailserver. It still looks like a real email address
and the web bot scribbles it down in its notes for a later spamming.
In this example, we'll use the domain poison.arpa.org. When the
HTML is generated, the date/time happens to be 1998, July 19th, at
13:27:23. Our SSI program injects an email address with this information
into the HTML code as such: r980719132723@poison.arpa.org
The email address can easily be broken down into the exact date and
time elements, and since it is generated this way, it's almost
unique for every request assuming you don't get more than 1 HTTP
request per second. The email address starts with the letter "r"
to signify that it was generated for my www.arpa.org web page, as
well as to make it more believable as an `email address'.
Before the SSI program is finished, it logs this request with the
generated email address, as well as the client's HTTP_USER_AGENT string
to a logfile for later reference.
Our perpetrating web bot then notices and stores the email address.
Later that week, perhaps 4 days later, it sends out its spew to all
the email addresses it gathered, including ours. Of course after 4
days, this guy is on a different dialup IP address, and quite
possibly, on a totally different ISP. No matter, we already know
who he is and how he got that address...
We get an email directed to some user at poison.arpa.org. Our
mail configuration for that virtual domain (using virtusertable in
sendmail) directs _all_ mail with any username to a mailspool file
called poison.mail, which is easily readable with `elm -f poison.mail`.
When we look at the headers, we'll notice the email was directed to
r980719132723. We'll just go back and consult our poison.log file,
see where that email address was extracted from, and what kind of
client they were using. We have not only the IP address he sent the
unsolicited mail from, but we now have the IP address he was on when
he extracted the email address. Let's go call both his ISPs, send
them logs of the two events (the extraction and the spam), and have
his accounts suspended for abusing our web and email services.
Here's a _very_ simple shell script that you can use to generate
these poison email addresses, included in your server-parsed html
with the line: <!--#include virtual="/cgi-bin/poison?r" -->
Replace the "r" in the query string with any unique letter.

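#!/bin/bash
# requires GNU date command
d=`date +%y%m%d%H%M%S`
# keep only letters and digits from the query string, since it ends up
# in both the HTML and the log
q=`printf '%s' "$QUERY_STRING" | tr -cd 'A-Za-z0-9'`
# Apache requires the HTTP header because this _is_ a CGI :)
echo "Content-type: text/html"
echo ""
# poison.cwnet.com is the reserved catch-all domain; substitute your own
echo "<A HREF=\"mailto:${q}${d}@poison.cwnet.com\">"
echo "<FONT color=#ffffff>${q}${d}@poison.cwnet.com</FONT></A>"
echo "${d} (${q}) ${REMOTE_ADDR} ${HTTP_USER_AGENT}" >>/www/log/poison.log
exit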

The reason for the white font color is to make the email address
link invisible against the white background on the rest of the page.
Also, make sure the poison.log file already exists, and is writable by
the same user the httpd server runs as. It's trivial to rewrite this
in C, and to even have it hash the date into alphanumeric characters,
but just remember that the output should always be unique, and never
random.
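
As one possible way to turn the date into alphanumeric characters while
keeping it unique, the timestamp can be rewritten in base 36. Note that
this is a reversible encoding rather than a true hash, and the function
names to_base36 and from_base36 are invented for this sketch. It assumes a
shell with 64-bit arithmetic.

```shell
digits=0123456789abcdefghijklmnopqrstuvwxyz

# Encode a decimal string as base 36. Leading zeros are stripped first so
# the shell does not read the number as octal.
to_base36() {
    n=$(printf '%s' "$1" | sed 's/^0*//')
    n=${n:-0}
    out=
    while [ "$n" != 0 ]; do
        r=$((n % 36))
        out=$(printf '%s' "$digits" | cut -c$((r + 1)))$out
        n=$((n / 36))
    done
    printf '%s\n' "${out:-0}"
}

# Decode back to the 12-digit yymmddHHMMSS form, restoring leading zeros.
from_base36() {
    n=0
    rest=$1
    while [ -n "$rest" ]; do
        c=$(printf '%s' "$rest" | cut -c1)
        rest=${rest#?}
        prefix=${digits%%"$c"*}     # text before the digit; its length is the value
        n=$((n * 36 + ${#prefix}))
    done
    printf '%012d\n' "$n"
}

stamp=980719132723
code=$(to_base36 "$stamp")
echo "r${code}@poison.arpa.org"
from_base36 "$code"
```

Since every timestamp maps to exactly one code and back, the address stays
unique per second and can still be traced to the request that produced it.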
To the best of my knowledge, this is an original idea. I've not
come across anything even resembling poison to track web email
extractors.
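
When spam arrives for one of these addresses, the lookup can be done by
hand, but a small helper saves typing. The sketch below assumes the log
format written by the script above, where each line starts with the
timestamp followed by the page letter in parentheses. The function names
and the log path argument are choices made for this example.

```shell
# Split a poison address such as r980719132723@poison.arpa.org into the
# page letter and the yymmddHHMMSS timestamp packed into it.
decode_poison() {
    user=${1%%@*}                                   # drop the domain part
    page=$(printf '%s' "$user" | cut -c1)           # leading page letter
    stamp=$(printf '%s' "$user" | cut -c2-)         # yymmddHHMMSS
    when=$(printf '%s\n' "$stamp" |
        sed 's/^\(..\)\(..\)\(..\)\(..\)\(..\)\(..\)$/date=\1-\2-\3 time=\4:\5:\6/')
    echo "page=$page $when"
}

# Print the log line(s) recorded when the address was handed out.
# $1 is the address, $2 is the path to the poison log.
lookup_poison() {
    user=${1%%@*}
    page=$(printf '%s' "$user" | cut -c1)
    stamp=$(printf '%s' "$user" | cut -c2-)
    grep -F -- "$stamp ($page) " "$2"
}

decode_poison 'r980719132723@poison.arpa.org'
```

With the log from the script above, the lookup would be called as
lookup_poison 'r980719132723@poison.arpa.org' /www/log/poison.log.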


4) You aren't limited to trying to prevent the abuse of your services.
You can also abuse the abuser :)

I invite you to visit the URL shown in the above example,
http://www.arpa.org/raid/blackflag

This is something I came up with one night while watching my logs fly by
as we got spidered. It gave me the idea to _trap_ the spider.
Basically, if a spider or webbot of any sort comes across this `page',
every single email address it finds is completely random and bogus, and
every single hyperlink it finds links to what appears to be a new web
page, but in fact is new output from the same CGI, just referenced by
a different name. The document the spider sees looks like every other
web page. Lots of text, lots of email addresses, and lots of links to
other pages. For all it knows, it found a goldmine. It continues to
follow every link and save every email address, in an endless loop with
no link out. I've watched web bots spend HOURS on this CGI,
loading it tens of thousands of times.
The content of the page is generated in a random fashion every time it
is accessed, with words pulled out of a short dictionary file of about
1000 words. Also, because it is a CGI and not a static HTML document,
the server will make sure it isn't cached (the last-modified date will
always be updated.) 

There are a number of ways to execute this CGI. One way is to have
Apache use it as a 'handler' for everything in a directory which
doesn't actually exist. It would execute the CGI as if it were
passing it a filename within that directory to process, then serve
to the client. Again beyond the scope of this article.
The way I do it is to create a ScriptAlias in the Apache srm.conf
configuration file that maps /raid/ to the real cgi-bin directory.
Most robots won't recursively follow links given by CGIs or that
include "cgi-bin" in the URL. This is likely to look more like real
HTML, albeit the last-modified date will always be `now'.
Every single link generated by the CGI points back to itself with
/raid/blackflag/, followed by some random word (for example,
"http://arpa.org/raid/blackflag/phelix") to appear as different
directories or documents. When a requested URL component is found in a
ScriptAlias'd CGI directory, anything in the URL following the CGI name
is stripped, and inserted into the PATH_INFO variable, which is ignored
in this case. All these tidbits of information may not be very useful,
but when the links look distinct from one another, they
are more likely to be followed by the robot, therefore wasting more of
its time and disk space.
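
To make the PATH_INFO behavior concrete, here is a small illustration of
the split described above. It is not Apache's code, only a sketch of the
rule: whatever follows the CGI name in the requested path is handed to
the CGI as PATH_INFO. The function name split_path_info is invented here.

```shell
# $1 is the requested path, $2 is the path of the CGI itself.
# Prints the two variables the CGI would see, or fails if the request
# does not fall under the CGI at all.
split_path_info() {
    script=$2
    case $1 in
        "$script")
            printf 'SCRIPT_NAME=%s PATH_INFO=\n' "$script" ;;
        "$script"/*)
            rest=${1#"$script"}     # trailing part, including its leading "/"
            printf 'SCRIPT_NAME=%s PATH_INFO=%s\n' "$script" "$rest" ;;
        *)
            return 1 ;;
    esac
}

split_path_info /raid/blackflag/phelix /raid/blackflag
```

Every random word appended to the link therefore reaches the same
program, which is free to ignore it.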
I advise against using this technique if you have limited bandwidth
and/or processor power on your web server. It was only my intention to
trap the robot with this program and waste ITS time and disk space, not
my own resources. The CGI itself is written in perl, and could probably
work on ANY web server, including one running under NT. The source is
available near the end of this article. If you do choose to use this,
it may be a good idea to link to it on all your web pages. That way,
any robot that comes across your web pages will eventually be trapped
by it.
This is NOT one of my original ideas. I came across a very similar
program that was mentioned in the signature of someone whose identity
escapes me. I believe he posts regularly to the comp.os.unix.* groups.
Or maybe the net-abuse groups.


5) FTP Mirroring is the last and most complex problem to prevent.
There are plenty of programs available for UNIX and Windows that enable
one to copy literally everything on a remote ftp server to the local
machine with minimal effort on the part of the user. Examples of
ftp mirroring software are wget, mirror, pmirror, and bpftp (bpftp
is a Windblows ftp client with mirroring capabilities.)
The problem with these programs is that, by default, if something
goes wrong, such as a network timeout or disconnection, they will
immediately reconnect and continue to retrieve the files. They may keep
trying forever until the user stops them, or until some preset timeout
occurs. Have you ever added a mirroring client's address to your
hosts.deny rules and killed their connection? The client may sit there
and try to reconnect forever, thinking it will eventually log in, until
you either block the connections at a lower network level or the user
realizes what is going on and stops it.
The idea that I've come up with recently, while sharing it with my
good friends Dave, Window, and Matt at Burger King one day, is to
have the ftp server itself generate blatantly bogus directory entries
inserted into every directory listing the client requests. The
filename generated should be a pseudo-random string of average filename
length with a displayed filesize of 1 byte, which the server can
remember for as long as the client is connected. A human would most
likely _not_ download a file whose name is full of random characters
or whose size is 1 byte, but a mirroring program wouldn't hesitate to
attempt to grab the file.
Our magically patched ftp server would remember the bogus "files"
it includes in the lists, and when it notices the client attempting to
retrieve one, instead of producing any error or the standard "200
PORT command successful", it would produce a message informing the
user that attempting to GET the bogus file is indicative of mirroring,
and is not allowed. For example:
200-Access to the file OykWN2zw0DcnSt is prohibited.
200 Please do not mirror this directory or ftp site.
We use the 200 code instead of a 4xx or 5xx error code, so the client
doesn't suspect that something has gone wrong. Immediately following
the message, the ftp server just waits in an idle loop until the
client closes the connection, or a specified timeout passes. A
reasonable figure would be an hour before closing the connection.
During this time, there would obviously be no traffic to or from the
offending client for this server. The presence of the open FTP
connection, and the idle ftp process are the only hints of the event.
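
A real implementation belongs inside the ftp server's source, but the
bookkeeping can be sketched in shell. Everything here is illustrative:
the function names, the 14-character length, and the owner, group, and
date shown in the listing line are assumptions, and /dev/urandom is
assumed to be available as the random source.

```shell
# Produce a 14-character alphanumeric name from the system random source.
bogus_name() {
    dd if=/dev/urandom bs=512 count=1 2>/dev/null |
        LC_ALL=C tr -dc 'A-Za-z0-9' | cut -c1-14
}

# Format the name as an "ls -l"-style entry claiming a size of 1 byte.
# Owner, group, and date are placeholders.
bogus_entry() {
    printf '%s   1 ftp      ftp             1 Jul 19  1998 %s\n' '-rw-r--r--' "$1"
}

# The server would keep the name for the life of the session and compare
# each requested filename against it.
is_bogus() {
    [ "$1" = "$session_bogus" ]
}

session_bogus=$(bogus_name)
bogus_entry "$session_bogus"
if is_bogus "$session_bogus"; then
    echo "200-Access to the file $session_bogus is prohibited."
    echo "200 Please do not mirror this directory or ftp site."
fi
```

In the patched server, the point where the two 200 lines are printed is
where the idle wait described above would begin.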
Closing the connection immediately, or rejecting it in the
hosts.allow and hosts.deny files, would cause the client to reconnect to the server
rapidly, which would likely cause inetd to temporarily shut down the
ftp service, thinking it was looping. This is how a mildly annoying
"innocent" person may cause a denial of service attack while trying
to mirror your ftp server. Many of these people seem to ignore any
warnings you may have put in your ftp login banners. Perhaps it is
because their point-and-click ftp program doesn't display this highly
important information to the user at the controls, or displays it for
about 0.0002 seconds in a `debug' window barely large enough to fit a
phone number into.
Since there are just so many ftp servers out there for the various
UNIX variants, it should be up to the reader to determine where to
obtain the source and how to modify it to accomplish our goal here.
I have also considered using this technique for HTTP, but determined
that it would probably be useless, since many web robots can open
multiple connections to the web server, effectively "getting around"
having only one connection locked, and may have a very short
`retrieval timeout'. I suppose we could add an httpd access rule
to deny serving the offending client, but we all know how dangerous
it is for a webserver to be able to modify its own configuration.

Now I get food.

- Erik Schorr 980719

<--- cut here - here's the perl source for blackflag - cut here --->
#!/usr/bin/perl
# We require perl5 here.
# blackflag, Copywrong (d) 1998 Erik Schorr <phelix@arpa.org>

# file to log all accesses to this cgi (with HTTP_USER_AGENT):
$logfile="/usr/local/etc/httpd/logs/blackflag";

# wordlist to pull words from (this file should be 500-1000 lines long
# for best performance. Format: SINGLE word on each line, no delimiters,
# preferably english words :-)
$wordfile="/usr/dict/shortwords";

# list of URLS (on this machine or others) that will always point to this CGI:
# (/raid/ is a ScriptAlias on my machine, to hide the real /cgi-bin/ from
# the email extractors - they usually ignore anything with 'cgi' in the URL.)
$url[1]="poison.cwnet.com/raid/blackflag";
$url[2]="arena.cwnet.com/raid/blackflag";
$url[3]="www.slackware.org/raid/blackflag";
$url[4]="irc.chatting.com/raid/blackflag";
$url[5]="www.arpa.org/raid/blackflag";
# number of URLS in this list:
$numurl=5;

# want a quick wordlist? use this command in unix:
# cat /var/spool/mail/USERNAME | grep -v "[A-Z0-9]" | (continued next line)
# tr -cs "[:alpha:]" "\n" | sort -u > shortwords
# (replace USERNAME with your own username, or use anybody's mailspool)

# print http header...
print "Content-type: text/html\n\n";
print "<HTML><HEAD>\n";

# open the logfile for appending; ">>" creates it if it does not exist,
# but it SHOULD already exist and be writable by us.
open(LOG, ">>$logfile") or warn "blackflag: cannot open $logfile: $!\n";

$time=localtime;
print LOG "$time $ENV{'REMOTE_ADDR'} $ENV{'HTTP_HOST'}$ENV{'REQUEST_URI'}"
. " $ENV{'HTTP_REFERER'} \"$ENV{'HTTP_USER_AGENT'}\"\n";
close LOG;


$punctuation[1]=".";
$punctuation[2]="!";
$punctuation[3]="\?";
$punctuation[4]=":";

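# grab words from wordlist (without it there is nothing to generate)
open(WORDS, "<$wordfile") or die "blackflag: cannot open $wordfile: $!\n";
$wordnum=0;
while (<WORDS>) {
    $wordnum++;
    $line=$_;
    chomp $line;
    $word[$wordnum]=$line;
}
close WORDS;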

$title=$word[int(rand $wordnum)+1] . " " . $word[int(rand $wordnum)+1]
. " " . $word[int(rand $wordnum)+1] . " " . $word[int(rand $wordnum)+1];
print "<TITLE>$title</TITLE></HEAD>\n";
print "<BODY BGCOLOR=#FFFFFF>\n";
$paragraphs=int(rand 10)+3;
$pgn=0;
while($pgn < $paragraphs) {
    $pgn++;
    $wip=int(rand 80)+10;
    $tw=0;
    while($tw<$wip) {
        $tw++;
        $prword=$word[1+int(rand $wordnum)];
        print "$prword";
        if((rand 10)<1) {
            $punc=$punctuation[int(rand 4)+1];
            print "$punc<BR>\n";
        }
        print " ";
    }
    print "<BR>\n";
    $nad=int(rand 10)+3;
    $pad=0;
    while($pad<$nad) {
        $pad++;
        $aaa=$word[1+int(rand $wordnum)];
        $bbb=$word[1+int(rand $wordnum)];
        $ccc=$word[1+int(rand $wordnum)];
        if((rand 4)>1) {
            if((rand 3)>1) {
                $tld="com";
            } else {
                $tld="net";
            }
        } else {
            $tld="org";
        }
        $mailaddr=$aaa . "\@" . $bbb . $ccc . "." . $tld;
        # Don't ask. It just works.
        if((rand 4)>3) {
            $urlhead="http://";
            $urlbody=$url[int(rand $numurl)+1];
            $urltail="/" . $word[int(rand $wordnum)+1];
            $urlp=$urlhead . $urlbody . $urltail;
        } else {
            $urlp="mailto:" . $mailaddr;
        }
        print "<A HREF=\"$urlp\">$mailaddr</A><BR>\n";
    }
}
print "</BODY></HTML>\n";
exit;
<--- cut here - EOF - cut here --->
