In plain English, this means that you can check any property (for instance,
the User-Agent: header, Referer: header, the URL of the request, and many
others) and perform certain actions based on the value of that property. For our
purpose, we rely on the fact that many popular spambot packages are actually
dumb enough to announce themselves as such.
I won't go into the gory details of mod_rewrite, as that would take far more
room than I have in this article. But I will give you an overview of how
mod_rewrite works so that we can check a User-Agent string and redirect the
spambot to a page that lets it know we don't allow scraping on our site.
First, we need to enable the mod_rewrite engine. This is done by including a
simple set of commands in your httpd.conf file, as shown in Example 1. The
RewriteEngine directive is set to "on," enabling mod_rewrite. The
RewriteLog directive turns on logging. In this case, output is written to
/var/log/mod_rewrite.log, but you may wish to put it somewhere else, for
example /usr/local/apache/logs/rewrite.log. Our RewriteLogLevel directive sets
the logging level to zero, which disables rewrite logging.
You may wish to increase the logging level a bit during testing, to ensure
that you're only catching the files you wish to block and that the redirects are
happening appropriately. A setting of nine gives you the most output (far too
much output for most cases) and a level of three gives acceptable output for
debugging purposes. Once you're done debugging, feel free to set it back to zero
or anything below three, depending on your needs. Restart Apache using the
apachectl restart command after you change your configuration settings, to make
sure they take effect.
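As a rough sketch of the settings described above while debugging, the three directives might look like this. The log path is the one mentioned in the text, and the level of three is the debugging value suggested above. This assumes an Apache release that still provides RewriteLog and RewriteLogLevel (the 1.3 through 2.2 series).

```apache
# Turn the rewrite engine on.
RewriteEngine on

# Write rewrite logging to this file (path taken from the text; adjust to taste).
RewriteLog /var/log/mod_rewrite.log

# 0 = no logging, 9 = maximum detail; 3 is a workable level for debugging.
RewriteLogLevel 3
```

Once testing is finished, RewriteLogLevel goes back to 0 and Apache is restarted with apachectl restart.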
With a RewriteLogLevel setting of three or higher, you can watch the mod_rewrite
engine while it's running. Just run tail -f /path/to/log. To test whether things
are working properly, telnet to the server's HTTP port (usually 80) and request
the root document while supplying a User-Agent string that matches the various
spamware agents in the rewrite rules discussed below. See the file session.txt
(available online) to view the output of a test that worked. The output you can
expect from the Apache logs for a successful redirect at a RewriteLogLevel of
three is in log.txt.
Finally, our configuration loads the mod_rewrite rulesets by way of Apache's
Include directive. All of the directives associated with mod_rewrite are wrapped
in a conditional IfModule block, which makes sure that mod_rewrite is
operational before trying to read them.
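A minimal sketch of that wrapper, assuming the rules live in a separate file. The file name and path conf/spambots.conf are hypothetical, chosen only for illustration; Example 1 may use a different location.

```apache
# Only read these directives if mod_rewrite is compiled in or loaded.
<IfModule mod_rewrite.c>
    RewriteEngine on
    RewriteLog /var/log/mod_rewrite.log
    RewriteLogLevel 0

    # Pull in the ruleset itself (hypothetical path, relative to ServerRoot).
    Include conf/spambots.conf
</IfModule>
```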
Laying Down the Law
The ruleset itself, shown in Listing 1, includes several conditionals (RewriteCond),
each of which may take several arguments. Server variables are referenced with
the %{SERVER_VAR} construct.
In the first conditional, we make sure that we're only checking requests for
HTML files. These are the files the spambots will be searching through for email
addresses to scrape. The %{REQUEST_FILENAME} server variable contains the
resolved path to the requested file. We check that the variable matches the
pattern html?$, which covers .html, .htm, .xhtml, and anything else that ends in
htm or html. The ? means that the l is optional, and the $ binds the match to
the end of the string.
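The first conditional might be written as follows. This is a sketch reconstructed from the description above rather than a copy of Listing 1.

```apache
# Continue only if the requested file name ends in "htm" or "html".
# Matches: index.html, about.htm, page.xhtml
# Does not match: logo.gif, style.css
RewriteCond %{REQUEST_FILENAME} html?$
```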
Once we've determined that the request in question is an HTML file, we then
compare the contents of the User-Agent: HTTP header with a list of known spambot
signatures. For example,
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
checks for any User-Agent that begins with EmailSiphon, the name of a common
spambot.
The caret (^) binds the match to the beginning of the User-Agent string. The
[OR] at the end is a mod_rewrite flag that joins this conditional to the next
one with a logical OR, rather than the default AND. The last User-Agent in the
list carries no [OR], which terminates the compound conditional. In English, it
reads something like this:
"If the user agent is requesting an HTML file, and it identifies itself as
EmailSiphon or EmailWolf (or other spambots), then do the next thing."
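Put together, the compound conditional could look like the sketch below. Only the two spambot names mentioned in the text are shown; a real list would carry more entries, each but the last ending in [OR].

```apache
# Condition 1: the request is for an HTML file ...
RewriteCond %{REQUEST_FILENAME} html?$
# ... AND the User-Agent begins with one of these names.
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
# Last entry: no [OR], so the list of alternatives ends here.
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf
```

mod_rewrite groups consecutive [OR] conditions together, so this reads as: HTML file AND (EmailSiphon OR EmailWolf).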
The "next thing," as you might have figured out, is a redirect to a
page containing information about why the requested page wasn't delivered, who
to contact for more information, and so forth. On our page, we also include a
mailto: link to our abuse address, abuse@hesketh.net, for those spammers who are
dumb enough to announce themselves to the folks most unlikely to enjoy being
spammed.
The redirect is expressed using the RewriteRule directive, which matches every
request path (^.*$) and sends it to the URL in the next argument. The [R]
flag tells mod_rewrite to answer with an HTTP redirect to that page. Another option
is to use a "pass through" or [PT] operator instead of issuing an HTTP
redirect. This is most useful for situations in which your configuration
involves many Aliases and the like, as it simply rewrites the guts of the
request record so that subsequent modules (such as mod_alias) can do the right
thing.
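A sketch of the rule in both forms follows. The target addresses are hypothetical placeholders, not the ones used on our site. The redirect target is assumed to live on a host that these conditions do not cover, since a spambot following the redirect to another matching HTML page would simply be redirected again.

```apache
# Redirect every request that passed the preceding RewriteCond tests.
# [R] sends an HTTP redirect back to the client.
RewriteRule ^.*$ http://abuse.example.com/nospam.html [R]

# Alternative: rewrite internally and pass the result through to later
# modules such as mod_alias, without an HTTP redirect.
# RewriteRule ^.*$ /nospam.html [PT]
```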