Apache and PHP vs. the Spambots
09/22/2000
by Edward Piou
This article is about stopping bad robots - specifically, spambots. A "spambot" is a program that visits a website, collects as many email addresses as it can from the site, and (generally) moves on to another site, where it will collect more email addresses. People run these programs to get lists of addresses which they'll either spam themselves, or sell to other people to spam.
To stop spambots using the methods listed here, you'll need the Apache web server and PHP.
The methods can actually be pretty easily adapted for Perl instead of PHP. If you're using a webserver other than Apache, you'll need to figure out how to adapt the little tricks we use here for your own server software.
There are two steps involved in dealing with any spambots that come to your site:
- detecting the spambot
- neutralizing the spambot
Detecting a Spambot
I first started thinking about detecting spambots when I attended a tutorial given by Lincoln Stein, Perl guru. One of the things he talked about was figuring out which of the "users" visiting his website were actually robots, and which of those robots were "bad" robots - robots that didn't check his robots.txt file to determine which pages they weren't allowed to access.
Stein wrote a Perl program, Robo-Cop, which went through his web server logs to identify robots and bad robots. The source code of the program is available online. Here is some sample output from a modified version of the program I used to run (don't worry, we'll get to PHP soon):
| Client | Robot | Hits | Interval | Hit_Percent | Index |
| --- | --- | --- | --- | --- | --- |
| 216.112.23.11: htdig/3.0.7 (andrew@contigo.com) | yes | 7954 | 0.65 | 9.03 | 13.91 |
| 10.0.0.1: AVSearch-3.0(EoExchange/Liberty) | yes | 2029 | 20.52 | 2.30 | 0.11 |
| 10.0.0.2: PLSpider/V1.0 | no | 1987 | 0.61 | 2.26 | 3.70 |
| 10.0.0.3: Microsoft Internet Explorer/4.40.426 (Windows 95) | no | 1311 | 7.16 | 1.49 | 0.21 |
| 10.0.0.4: Slurp/2.0-BigOwlWeekly (spider@aeneid.com; http://www.inktomi.com/slurp.html) | yes | 985 | 32.28 | 1.12 | 0.03 |
| 10.0.0.5: AVSearch-3.0(liberty/libertycrawl) | yes | 606 | 17.81 | 0.69 | 0.04 |
What the columns mean:
- Client - identifies a user based on what IP address and user agent/browser/robot they used
- Robot - whether or not they looked at the "robots.txt" file (the theory is, good robots look, bad robots don't)
- Hits - how many times that user came to the website
- Interval - the average number of seconds between requests from that client in one session (30 minutes)
- Hit_Percent - the percent of total hits that came from that user
- Index - a "score" giving the likelihood that the user actually is a robot, based on the number of hits and the interval between hits.
The program worked pretty well at first, but as I fed it larger server logs, it started taking too much memory and system resources (I was on a low-powered machine). Plus, if I wanted to actually do something about bad bots, I had to wait for the program's next run (a cron job ran it every six hours), go through the output, look for high Index values, and decide whether I wanted to neutralize the bot. This meant a spambot could be scouring my pages for six hours (or longer, if I didn't bother checking or was asleep) before I did anything about it.
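To make the column definitions above concrete, here is a minimal PHP sketch of how the Robot, Hits, Interval and Hit_Percent values could be computed. It is not Robo-Cop's code (that program is written in Perl). The function name `summarize_clients`, the record layout, and the rule that a gap longer than 30 minutes starts a new session are all assumptions made for illustration. The Index column is left out because its formula isn't given here.

```php
<?php
// Hypothetical helper (not part of Robo-Cop): summarize per-client statistics.
// Each record is assumed to be an array with the keys
// 'ip', 'agent', 'time' (Unix seconds) and 'path', already parsed from the log.
function summarize_clients(array $records, int $sessionGap = 1800): array
{
    $clients = [];
    foreach ($records as $r) {
        // A "client" is an IP address plus a user agent, as in the table.
        $key = $r['ip'] . ': ' . $r['agent'];
        if (!isset($clients[$key])) {
            $clients[$key] = ['robot' => false, 'times' => []];
        }
        $clients[$key]['times'][] = $r['time'];
        // Robot = "yes" if the client ever asked for robots.txt.
        if ($r['path'] === '/robots.txt') {
            $clients[$key]['robot'] = true;
        }
    }

    $total = count($records);
    $summary = [];
    foreach ($clients as $key => $c) {
        sort($c['times']);
        // Assumption: only gaps of at most $sessionGap seconds belong to
        // the same session and count toward the average interval.
        $gaps = [];
        for ($i = 1; $i < count($c['times']); $i++) {
            $gap = $c['times'][$i] - $c['times'][$i - 1];
            if ($gap <= $sessionGap) {
                $gaps[] = $gap;
            }
        }
        $hits = count($c['times']);
        $summary[$key] = [
            'robot'       => $c['robot'],
            'hits'        => $hits,
            'interval'    => $gaps ? array_sum($gaps) / count($gaps) : null,
            'hit_percent' => 100 * $hits / $total,
        ];
    }
    return $summary;
}
```

A client with many hits, a very short interval and no request for robots.txt (like PLSpider in the table) is the kind of entry worth a closer look.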
Using PHP and robots.txt, I decided to be more proactive. To detect bad robots (almost) immediately, I would create web pages which neither people nor "good" robots would visit, and have my webserver send me email whenever anyone (presumably a bad robot) visited the page.
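As a rough sketch of that idea (an illustration under stated assumptions, not a finished implementation), a trap page might look something like the following. The `/trap/` path, the `webmaster@example.com` address and the `build_trap_alert` function name are all hypothetical. `mail()` is PHP's standard mail function and needs a working mail setup on the server.

```php
<?php
// Sketch of a trap page, saved at a hypothetical path such as /trap/index.php.
// robots.txt would tell well-behaved robots to stay away from it:
//
//   User-agent: *
//   Disallow: /trap/
//
// Hypothetical helper: build the text of the alert email from request data.
function build_trap_alert(array $server): string
{
    $ip    = $server['REMOTE_ADDR'] ?? 'unknown';
    $agent = $server['HTTP_USER_AGENT'] ?? 'unknown';
    $uri   = $server['REQUEST_URI'] ?? 'unknown';

    return "Possible bad robot\n"
         . "IP address: $ip\n"
         . "User agent: $agent\n"
         . "Requested:  $uri\n"
         . "Time:       " . date('Y-m-d H:i:s') . "\n";
}

// Send the alert only for real web requests, not when run from the command line.
if (PHP_SAPI !== 'cli') {
    // The recipient address is a placeholder; use your own.
    mail('webmaster@example.com', 'Trap page visited', build_trap_alert($_SERVER));
}
```

Keeping the message-building code separate from the `mail()` call makes it easy to check what would be sent without actually sending anything.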