Subject: robots.txt User-Agent substring match: inverted logic
As I read the robots.txt spec, a robot should do a case-insensitive substring match by looking for the string it parsed from robots.txt inside its own versionless user-agent string. LWP up to 5.78 seems to do it the other way around: it looks for the robot's name inside the string from robots.txt.
So, IMO, a robot named "FooBarBot" should be matched by "User-Agent: Bar" in robots.txt, whereas a robot named "Bar" should not be matched by "User-Agent: FooBarBot".
The "not-yet-deployed" draft words this slightly better than the original; compare http://www.robotstxt.org/wc/norobots.html with http://www.robotstxt.org/wc/norobots-rfc.html (section 3.2.1).
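To make the two directions concrete, here is a minimal standalone sketch. The helper names ua_line_applies and ua_line_applies_old are hypothetical and are not part of WWW::RobotRules; they only isolate the index() call, and they ignore the separate handling of "*" in the real module.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WWW::RobotRules): proposed direction.
# True when the token from a "User-Agent:" line occurs, case-insensitively,
# inside the robot's own versionless name.
sub ua_line_applies {
    my ($me, $ua_line) = @_;
    return index(lc($me), lc($ua_line)) >= 0 ? 1 : 0;
}

# Hypothetical helper: the direction used before the patch, for comparison.
# True when the robot's name occurs inside the token from robots.txt.
sub ua_line_applies_old {
    my ($me, $ua_line) = @_;
    return index(lc($ua_line), lc($me)) >= 0 ? 1 : 0;
}

# Each row is: robot name, token from the "User-Agent:" line.
for my $case (["FooBarBot", "Bar"], ["MomSpiderJr", "MomSpider"]) {
    my ($me, $ua_line) = @$case;
    printf "%-12s vs %-10s proposed=%d old=%d\n",
        $me, $ua_line,
        ua_line_applies($me, $ua_line),
        ua_line_applies_old($me, $ua_line);
}
```

With the proposed direction both robots are matched by the shorter token; with the old direction neither is.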
The included patch fixes this and adds some test cases (plus a trivial comment typo fix).
Index: lib/WWW/RobotRules.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/WWW/RobotRules.pm,v
retrieving revision 1.29
diff -a -u -r1.29 RobotRules.pm
--- lib/WWW/RobotRules.pm 6 Apr 2004 11:37:32 -0000 1.29
+++ lib/WWW/RobotRules.pm 7 Apr 2004 21:32:21 -0000
@@ -13,7 +13,7 @@
sub new {
my($class, $ua) = @_;
- # This ugly hack is needed to ensure backwards compatability.
+ # This ugly hack is needed to ensure backwards compatibility.
# The "WWW::RobotRules" class is now really abstract.
$class = "WWW::RobotRules::InCore" if $class eq "WWW::RobotRules";
@@ -121,7 +121,7 @@
# See whether my short-name is a substring of the
# "User-Agent: ..." line that we were passed:
- if(index(lc($ua_line), lc($me)) >= 0) {
+ if(index(lc($me), lc($ua_line)) >= 0) {
LWP::Debug::debug("\"$ua_line\" applies to \"$me\"")
if defined &LWP::Debug::debug;
return 1;
Index: t/robot/rules.t
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/t/robot/rules.t,v
retrieving revision 1.5
diff -a -u -r1.5 rules.t
--- t/robot/rules.t 7 Apr 2000 20:23:01 -0000 1.5
+++ t/robot/rules.t 7 Apr 2004 21:32:22 -0000
@@ -15,7 +15,7 @@
use Carp;
use strict;
-print "1..32\n"; # for Test::Harness
+print "1..38\n"; # for Test::Harness
# We test a number of different /robots.txt files,
#
@@ -133,6 +133,18 @@
30 => "http://foo/" => 1,
31 => "http://foo/this" => 1,
32 => "http://bar/" => 1,
+ ],
+
+ [$content4, "MomSpiderJr" => # should match "MomSpider"
+ 33 => 'http://foo/private' => 1,
+ 34 => 'http://foo/also_private' => 1,
+ 35 => 'http://foo/this/' => 0,
+ ],
+
+ [$content4, "SvartEnk" => # should match "*"
+ 36 => "http://foo/" => 1,
+ 37 => "http://foo/private/" => 0,
+ 38 => "http://bar/" => 1,
],
# when adding tests, remember to increase