PLEASE note: These pages are here solely for historic purposes. New articles have not been written since 2001; many links in the index are broken; and most ahref.com email addresses will now bounce. Try visiting ep Productions, Incorporated, the web programming and development company behind this site.

ahref.com: a community space for web developers


Apache and PHP vs. the Spambots

09/22/2000
by Edward Piou

This article is about stopping bad robots - specifically, spambots. A "spambot" is a program that visits a website, collects as many email addresses as it can from the site, and (generally) moves on to another site, where it will collect more email addresses. People run these programs to get lists of addresses which they'll either spam themselves, or sell to other people to spam.

To stop spambots using the methods listed here, you'll need the Apache web server and PHP.

The methods can be adapted fairly easily for Perl instead of PHP. If you're using a web server other than Apache, you'll need to work out how to adapt the tricks we use here for your own server software.

There are two steps involved in dealing with any spambot that comes to your site:

  • detecting the spambot
  • neutralizing the spambot

Detecting a Spambot

I first started thinking about detecting spambots when I attended a tutorial given by Lincoln Stein, Perl guru. One of the things he talked about was figuring out which of the "users" visiting his website were actually robots, and which of those robots were "bad" robots - robots that didn't check his robots.txt file to determine which pages they weren't allowed to access.

Stein wrote a Perl program, Robo-Cop, which went through his web server logs to identify robots and bad robots. The source code of the program is available online. Here is some sample output from a modified version of the program that I used to run (don't worry, we'll get to PHP soon):

Client | Robot | Hits | Interval | Hit_Percent | Index
216.112.23.11: htdig/3.0.7 (andrew@contigo.com) | yes | 7954 | 0.65 | 9.03 | 13.91
10.0.0.1: AVSearch-3.0(EoExchange/Liberty) | yes | 2029 | 20.52 | 2.30 | 0.11
10.0.0.2: PLSpider/V1.0 | no | 1987 | 0.61 | 2.26 | 3.70
10.0.0.3: Microsoft Internet Explorer/4.40.426 (Windows 95) | no | 1311 | 7.16 | 1.49 | 0.21
10.0.0.4: Slurp/2.0-BigOwlWeekly (spider@aeneid.com; http://www.inktomi.com/slurp.html) | yes | 985 | 32.28 | 1.12 | 0.03
10.0.0.5: AVSearch-3.0(liberty/libertycrawl) | yes | 606 | 17.81 | 0.69 | 0.04

What the columns mean:

  • Client - identifies a user based on what IP address and user agent/browser/robot they used
  • Robot - whether or not they looked at the "robots.txt" file (the theory is, good robots look, bad robots don't)
  • Hits - how many times that user came to the website
  • Interval - the average number of seconds between requests from that client in one session (30 minutes)
  • Hit_Percent - the percent of total hits that came from that user
  • Index - a "score" giving the likelihood that the user actually is a robot, based on the number of hits and the interval between hits.
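
Most of these columns can be recomputed from a parsed access log. The sketch below is in Python rather than Perl, and it is not Stein's Robo-Cop code: the `Hit` tuple, the `summarize` function, and the rule that a gap longer than 30 minutes starts a new session are assumptions made for illustration, and the sample requests are made up. The Index score is left out because its formula isn't given here.

```python
from collections import defaultdict
from typing import NamedTuple

SESSION_GAP = 30 * 60  # seconds; assumed reading of "one session (30 minutes)"


class Hit(NamedTuple):
    ip: str
    agent: str
    time: float  # request time in seconds since the epoch
    path: str


def summarize(hits):
    """Return {client: stats}, where client is "ip: agent"."""
    by_client = defaultdict(list)
    for hit in hits:
        by_client[f"{hit.ip}: {hit.agent}"].append(hit)

    total = len(hits)
    report = {}
    for client, client_hits in by_client.items():
        times = sorted(h.time for h in client_hits)
        # Keep only gaps between consecutive requests inside one session.
        gaps = [b - a for a, b in zip(times, times[1:]) if b - a <= SESSION_GAP]
        report[client] = {
            "robot": any(h.path == "/robots.txt" for h in client_hits),
            "hits": len(client_hits),
            "interval": sum(gaps) / len(gaps) if gaps else None,
            "hit_percent": 100.0 * len(client_hits) / total,
        }
    return report


# Made-up sample data, not taken from a real log.
sample = [
    Hit("10.0.0.2", "PLSpider/V1.0", 0.0, "/index.html"),
    Hit("10.0.0.2", "PLSpider/V1.0", 1.0, "/about.html"),
    Hit("10.0.0.2", "PLSpider/V1.0", 2.0, "/contact.html"),
    Hit("10.0.0.4", "Slurp/2.0", 10.0, "/robots.txt"),
]
stats = summarize(sample)
```

In this sample, the first client makes three requests one second apart and never asks for robots.txt, which is the kind of pattern the Robot and Interval columns are meant to expose.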

The program worked pretty well at first, but as I fed it larger server logs, it started taking too much memory and system resources (I was on a low-powered machine). Plus, if I wanted to actually do something about bad bots, I had to wait for the program's next run (a cron job ran it every six hours), go through its output, look for high Index values, and decide whether I wanted to neutralize the bot. This meant a spambot could be scouring my pages for six hours (or longer, if I didn't bother checking or was asleep) before I did anything about it.

Using PHP and robots.txt, I decided to be more proactive. To detect bad robots (almost) immediately, I would create web pages which neither people nor "good" robots would visit, and have my webserver send me email whenever anyone (presumably a bad robot) visited the page.
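
As a rough sketch of that idea (in Python, not the PHP this article goes on to use): list a trap directory in robots.txt, and treat any request for it as a probable bad robot. The `/trap/` path, the `GoodBot` and `BadBot` agent names, and the `is_trap_hit` and `build_alert` helpers are hypothetical. `urllib.robotparser` is Python's standard robots.txt parser and stands in here for a well-behaved robot.

```python
from urllib.robotparser import RobotFileParser

TRAP_PREFIX = "/trap/"  # hypothetical path chosen for this sketch

ROBOTS_TXT = f"""\
User-agent: *
Disallow: {TRAP_PREFIX}
"""


def is_trap_hit(path):
    """True when the requested path is inside the trap directory."""
    return path.startswith(TRAP_PREFIX)


def build_alert(ip, agent, path):
    """Compose the text of the notification email (sending is left out)."""
    return f"Possible bad robot: {ip} ({agent}) requested {path}"


# A well-behaved robot parses robots.txt and skips the trap.
rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())
good_bot_allowed = rules.can_fetch("GoodBot", "/trap/index.html")

# A robot that ignores robots.txt requests the page anyway and gets flagged.
alert = None
if is_trap_hit("/trap/index.html"):
    alert = build_alert("10.0.0.2", "BadBot/1.0", "/trap/index.html")
```

Because the check runs on the request itself, the alert can go out as soon as the trap page is fetched instead of waiting for the next log analysis run.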


This site © 1998-1999 ep Productions, Inc. Text of any articles is copyright of the author. All rights reserved. Terms of use.