Subject: patch for "WWW::Mechanize::Polite"
See http://perlmonks.org/index.pl?node_id=330872 for explanation. (jeffa is lazy today)
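For background, everything below leans on WWW::RobotRules, which already knows how to parse a robots.txt file and answer "is this URL allowed?". A rough standalone sketch of that API (the agent name and URLs are just placeholders):

#!/usr/bin/perl
use strict;
use warnings;

use WWW::RobotRules;
use LWP::Simple qw(get);

# new() takes the user-agent string the rules apply to
my $rules = WWW::RobotRules->new('PoliteBot/0.1');

# parse() wants the robots.txt URL plus its content ...
my $robots_url = 'http://www.example.com/robots.txt';
my $robots_txt = get($robots_url);
$rules->parse($robots_url, $robots_txt) if defined $robots_txt;

# ... and allowed() checks a URL against the parsed rules
print "ok to fetch\n"
    if $rules->allowed('http://www.example.com/some/page.html');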
*** Mechanize.pm 2005-05-27 11:29:53.000000000 -0500
--- /usr/lib/perl5/site_perl/5.8.3/WWW/Mechanize.pm 2005-05-27 11:41:10.000000000 -0500
***************
*** 100,105 ****
--- 100,106 ----
use HTML::Form 1.00;
use HTML::TokeParser;
use URI::URL;
+ use WWW::RobotRules;
use base 'LWP::UserAgent';
***************
*** 206,211 ****
--- 207,214 ----
    my $self = $class->SUPER::new( %parent_parms );
    bless $self, $class;
+     $self->{robo_rules} = WWW::RobotRules->new($self->{agent});
+
    # Use the mech parms now that we have a mech object.
    for my $parm ( keys %mech_parms ) {
        $self->{$parm} = $mech_parms{$parm};
***************
*** 287,292 ****
--- 290,309 ----
    return sort keys %known_agents;
}
+ =head2 parse_robots()
+
+ Fetches a site's robots.txt file from the given URL and parses its rules
+
+ =cut
+
+ sub parse_robots {
+     my $self = shift;
+     my $url  = shift;
+
+     $self->get($url);
+     $self->{robo_rules}->parse($url, $self->content);
+ }
+
=head1 PAGE-FETCHING METHODS
=head2 $mech->get( $url )
***************
*** 322,327 ****
--- 339,363 ----
    return $self->SUPER::get( $uri->as_string, @_ );
}
+ =head2 polite_get()
+
+ Calls Mech's get() method, but first consults the site's robots.txt
+ rules (via WWW::RobotRules) before fetching the URI; if the URI is
+ disallowed, nothing is fetched and the content is cleared
+
+ =cut
+
+ sub polite_get {
+     my $self = shift;
+     my $uri  = shift;
+
+     if ($self->{robo_rules}->allowed($uri)) {
+         $self->get($uri);
+     }
+     else {
+         undef $self->{content};
+     }
+ }
+
=head2 $mech->reload()
Acts like the reload button in a browser: repeats the current
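And a quick sketch of how the patched Mech would be driven (URLs and agent name are placeholders; assumes the patch above is applied):

#!/usr/bin/perl
use strict;
use warnings;

use WWW::Mechanize;

my $mech = WWW::Mechanize->new( agent => 'PoliteBot/0.1' );

# fetch and parse the site's robots.txt first ...
$mech->parse_robots('http://www.example.com/robots.txt');

# ... then only fetch pages that robots.txt allows
$mech->polite_get('http://www.example.com/index.html');

if ( defined $mech->content ) {
    print $mech->content;
}
else {
    warn "robots.txt says no\n";
}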