Vectors & Interfaces
The networking specialist

In plain English, this means that you can check any property (for instance, the User-Agent: header, Referer: header, the URL of the request, and many others) and perform certain actions based on the value of that property. For our purpose, we rely on the fact that many popular spambot packages are actually dumb enough to announce themselves as such.

I won't go into the gory details of mod_rewrite, as that would take far more room than I have in this article. But I will give you an overview of how mod_rewrite works so that we can check a User-Agent string and redirect the spambot to a page that lets it know we don't allow scraping on our site.

First, we need to enable the mod_rewrite engine. This is done by including a simple set of commands in your httpd.conf file, as shown in Example 1. The RewriteEngine directive is set to "on," enabling mod_rewrite. The RewriteLog directive names the file that rewrite activity is logged to. In this case, output is written to /var/log/mod_rewrite.log; you may wish to put it somewhere else, for example /usr/local/apache/logs/rewrite.log. The RewriteLogLevel directive is set to zero, which keeps the log silent.
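As a rough sketch of the three directives just described (not a copy of Example 1), the httpd.conf fragment might look something like this. The log path is the one mentioned above. Note that RewriteLog and RewriteLogLevel belong to Apache 2.2 and earlier; Apache 2.4 dropped them in favor of the LogLevel directive with a rewrite:traceN setting.

```apache
# Switch the rewrite engine on
RewriteEngine on

# Where rewrite activity is logged (path taken from the text above)
RewriteLog /var/log/mod_rewrite.log

# 0 = no rewrite logging; raise to 3 while debugging, 9 for maximum detail
RewriteLogLevel 0
```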

You may wish to increase the logging level a bit during testing, to ensure that you're only catching the files you wish to block and that the redirects are happening appropriately. A setting of nine gives you the most output (far too much output for most cases) and a level of three gives acceptable output for debugging purposes. Once you're done debugging, feel free to set it back to zero or anything below three, depending on your needs. Restart Apache using the apachectl restart command after you change your configuration settings, to make sure they take effect.

With a RewriteLogLevel setting of three or higher, you can supervise the mod_rewrite engine while it's running. Just run tail -f /path/to/log. To test whether things are working properly, telnet to the server's HTTP port (usually 80) and request the root document while supplying a User-Agent string that matches one of the spamware agents in the rewrite rules discussed below. See the file session.txt (available online) to view the output of a test that worked. The output you can expect from the Apache logs for a successful redirect at a RewriteLogLevel of three is in log.txt.

Finally, our configuration loads the mod_rewrite rulesets by way of Apache's Include directive. All of the directives associated with mod_rewrite are wrapped in a conditional IfModule block, which makes sure that mod_rewrite is operational before trying to read them.
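The wrapper described here could be sketched as follows. The included filename, conf/spambots.conf, is a hypothetical placeholder of ours rather than the name used in the original configuration; a relative path like this one is resolved against the ServerRoot.

```apache
# Only read these directives if mod_rewrite is actually available
<IfModule mod_rewrite.c>
    RewriteEngine on

    # The logging directives discussed earlier would also sit in this block.

    # Pull in the spambot ruleset from a separate file
    # (hypothetical filename, chosen for illustration)
    Include conf/spambots.conf
</IfModule>
```

Keeping the rules in their own file means the list of spambot signatures can be updated without touching the rest of httpd.conf.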

Laying Down the Law

The ruleset itself, shown in Listing 1, includes several conditionals (RewriteCond), each of which may take several arguments. Server variables are referenced with the %{SERVER_VAR} construct.

In the first conditional, we make sure that we're only checking requests for HTML files. These are the files the spambots will be searching through for email addresses to scrape. The %{REQUEST_FILENAME} server variable contains the resolved path to the requested file. We check to see that the variable matches the pattern html?$, which covers .html, .htm, .xhtml, and anything else that ends in htm or html. The ? means that the l is optional, and the $ binds the match to the end of the string.

Once we've determined that the request in question is an HTML file, we then compare the contents of the User-Agent: HTTP header with a list of known spambot signatures. For example,

RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]

checks for any User-Agent that begins with EmailSiphon, the name of a common spambot.

The caret (^) binds the match to the beginning of the User-Agent string. The [OR] at the end is a mod_rewrite flag that joins this conditional to the next one with a logical OR; without it, consecutive conditionals are joined with an implicit AND. The last User-Agent in the list carries no [OR], and this terminates the compound conditional. In English, it reads something like this: "If the user agent is requesting an HTML file, and it identifies itself as EmailSiphon or EmailWolf (or other spambots), then do the next thing."
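To make the grouping concrete, here is a minimal sketch of such a compound conditional, using only the two spambot names mentioned above. It is an illustration of the structure, not a reproduction of Listing 1, which checks for more agents.

```apache
# Must match: the request is for a file ending in htm or html
RewriteCond %{REQUEST_FILENAME} html?$

# ...AND the User-Agent begins with one of these names.
# [OR] joins a condition to the one that follows it.
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
# Last entry in the list: no [OR], which closes the group
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf
```

mod_rewrite reads this as "HTML file AND (EmailSiphon OR EmailWolf)". These conditions do nothing on their own; they only guard the RewriteRule that follows them.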

The "next thing," as you might have figured out, is a redirect to a page containing information about why the requested page wasn't delivered, who to contact for more information, and so forth. On our page, we also include a mailto: link to our abuse address, abuse@hesketh.net, for those spammers who are dumb enough to announce themselves to the folks least likely to enjoy being spammed.

The redirect is expressed using the RewriteRule directive, which simply redirects all matching requests (^.*$) to the URL in the next argument. The [R] flag tells mod_rewrite to send the visitor an HTTP redirect to that page. Another option is to use the "pass through" or [PT] flag instead of issuing an HTTP redirect. This is most useful for situations in which your configuration involves many Aliases and the like, as it simply rewrites the guts of the request record so that subsequent modules (such as mod_alias) can do the right thing.
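Putting the pieces together, a complete ruleset along these lines might look like the sketch below. Treat it as an approximation under stated assumptions rather than the actual Listing 1: the target path /nospam.html is a hypothetical name, and the condition that excludes it is our own addition, included so that a spambot following the redirect to an HTML page on the same server is not redirected again.

```apache
# Only consider requests for files ending in htm or html
RewriteCond %{REQUEST_FILENAME} html?$

# Assumption (not described in the text): skip the notice page itself
# to avoid a redirect loop. /nospam.html is a hypothetical path.
RewriteCond %{REQUEST_URI} !^/nospam\.html$

# Known spambot signatures, joined with [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf

# Send every matching request to the notice page with an HTTP redirect
RewriteRule ^.*$ /nospam.html [R]

# Alternative: rewrite internally and let later modules such as
# mod_alias handle the result, instead of redirecting the client
# RewriteRule ^.*$ /nospam.html [PT]
```

With [R], the spambot receives a redirect response and must make a second request to see the notice; with [PT], the notice is served in answer to the original request.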

