
Hack 17 Respecting robots.txt


The robots.txt file is a bastion of fair play, allowing a site to restrict what visiting scrapers are allowed to see and do or, indeed, to keep them out entirely. Play fair by respecting the site's requests.

If you've ever built your own web site, you may have come across something called a robots.txt file (http://www.robotstxt.org)—a magical bit of text that you, as web developer and site owner, can create to control the capabilities of third-party robots, agents, scrapers, spiders, or what have you. Here is an example of a robots.txt file that blocks any robot's access to three specific directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

Applications that understand your robots.txt file will resolutely abstain from indexing those parts of your site, or they'll leave dejectedly if you deny them outright, as per this example:

User-agent: *
Disallow: /
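To see how a well-behaved robot might evaluate rule sets like the two above, here is a minimal sketch using WWW::RobotRules, the parser that LWP::RobotUA relies on. It assumes the WWW::RobotRules module is installed. The bot name ExampleBot/1.0, the host www.example.com, and the check_paths helper are made up for illustration, and nothing is fetched over the network.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Hypothetical location of the robots.txt file; nothing is actually fetched.
my $robots_url = 'http://www.example.com/robots.txt';

# The two rule sets shown above, as they would arrive from a server.
my $partial = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /tmp/\nDisallow: /private/\n";
my $total   = "User-agent: *\nDisallow: /\n";

# Parse one robots.txt body and report whether each path may be visited.
sub check_paths {
    my ($robots_txt, @paths) = @_;
    my $rules = WWW::RobotRules->new('ExampleBot/1.0');
    $rules->parse($robots_url, $robots_txt);
    my %verdict;
    for my $path (@paths) {
        my $ok = $rules->allowed("http://www.example.com$path");
        $verdict{$path} = $ok ? 'allowed' : 'blocked';
    }
    return %verdict;
}

my @paths = ('/index.html', '/tmp/scratch.html', '/private/notes.html');
my %some  = check_paths($partial, @paths);
my %none  = check_paths($total,   @paths);

for my $path (@paths) {
    print "$path: $some{$path} under the first file, $none{$path} under the second\n";
}
```

Under the first file only the paths beneath the listed directories are blocked; under the second, every path on the site is.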

If you're planning on releasing your scraper or spider into the wild, it's important that you make every possible attempt to support robots.txt. Its power comes solely from the number of clients that choose to respect it. Thankfully, with LWP, we can rise to the occasion quite simply.

To make sure that your LWP-based program respects robots.txt, use the LWP::RobotUA class (http://search.cpan.org/author/GAAS/libwww-perl/lib/LWP/RobotUA.pm) instead of LWP::UserAgent. Doing so also keeps your script from hitting the same server too frequently and saturating the site's bandwidth unnecessarily. LWP::RobotUA is a subclass of LWP::UserAgent, so you use it in much the same way:

use LWP::RobotUA;

# Your bot's name and your email address
my $browser = LWP::RobotUA->new('SuperBot/1.34', 'you@site.com');
my $response = $browser->get($url);

If the robots.txt file on $url's server forbids you from accessing $url, then the $browser object (assuming it's of the class LWP::RobotUA) won't actually request it. Instead, it will give you back (in $response) a 403 error with the message "Forbidden by robots.txt". Trap such an eventuality like so:

die "$url -- ", $response->status_line, "\nAborted"
    unless $response->is_success;

Upon encountering such a resource, your script would die with:

http://whatever.site.int/pith/x.html -- 403 Forbidden by robots.txt
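A spider working through many URLs may prefer to skip a forbidden resource rather than die. The sketch below shows one way to tell LWP::RobotUA's refusal apart from a 403 sent by the server itself, by looking at the message text quoted above. It assumes the HTTP::Response module (installed alongside LWP) is available; the blocked_by_robots helper is hypothetical, and the responses are built by hand so that no request is made.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: true when a response looks like LWP::RobotUA's
# local refusal rather than an error sent by the remote server.
sub blocked_by_robots {
    my ($response) = @_;
    return ($response->code == 403
        && $response->message =~ /robots\.txt/i) ? 1 : 0;
}

# Stand-ins for real responses, constructed locally for illustration.
my $refused = HTTP::Response->new(403, 'Forbidden by robots.txt');
my $denied  = HTTP::Response->new(403, 'Forbidden');
my $fetched = HTTP::Response->new(200, 'OK');

for my $response ($refused, $denied, $fetched) {
    if ($response->is_success) {
        print "fetched: ", $response->status_line, "\n";
    }
    elsif (blocked_by_robots($response)) {
        print "skipped at the site's request: ", $response->status_line, "\n";
    }
    else {
        print "failed: ", $response->status_line, "\n";
    }
}
```

Matching on the message text is a convenience, not a guarantee: a server could send the same wording itself, so treat the check as a hint for logging and skipping.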

If this $browser object sees that its last request to $url's server was too recent, it will pause (via sleep) to avoid making too many requests too often. By default, it leaves one minute between requests to the same server, but you can control the length of the pause with the $browser->delay(minutes) method.

For example, $browser->delay(7/60) means that this browser will pause when it needs to avoid talking to any given server more than once every seven seconds.
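Because delay works in minutes, it is easy to pass seconds by mistake. Here is a small sketch that does the conversion explicitly and reads the setting back. It assumes libwww-perl is installed; the seconds_to_minutes helper is hypothetical, and constructing the object does not contact any server.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Hypothetical helper: delay() takes minutes, so convert from seconds.
sub seconds_to_minutes {
    my ($seconds) = @_;
    return $seconds / 60;
}

# Your bot's name and your email address
my $browser = LWP::RobotUA->new('SuperBot/1.34', 'you@site.com');

# Leave at least seven seconds between requests to the same server.
$browser->delay(seconds_to_minutes(7));

# Called with no argument, delay() returns the current setting in minutes.
printf "Waiting %.0f seconds between requests to any one server\n",
    $browser->delay * 60;
```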

—Sean Burke
