Hello,
Thanks for your WWW::Robot module, which we use quite extensively. I love the hook configuration.
Based on your module, we built a system of distributed robots, which works quite well.
However, in Robot.pm, it looks like Robot::UA simple_request is called far too many times, because
check_mime_type keeps being called repeatedly on the same file.
Each simple_request causes a sleep to be called (and duplicate web traffic).
Suggestions:
* Why not run this line first, before the MIME type check?
    next if $url_seen{ $link };
* Or also call the add-url-test hook before the MIME type check?
Thanks,
Alex
It happens around line 939 of Robot.pm:
    unless ( $self->{ 'ANY_URL' } ||
             # only follow html links (.html or .htm or no extension)
             $link =~ /\.s?html?/ || $link =~ m{/$} )
             # lets assume .s?html or "/" type links really are text/html
    {
        # put in some obvious ones here ...
        next if $link =~ /(?:ftp|gopher|mailto|news|telnet|javascript):/;
        next if $link =~ /\.(?:gif|jpe?g)/;
        if ( $self->{ 'CHECK_MIME_TYPES' } )
        {
            # grab anchor / area / frame links
            $self->verbose( " check mime type ..." );
            next unless
                $self->check_mime_type( $link_url_abs, [ 'text/html' ] );
        }
    }

    # only follow links we haven't seen yet ...
    next if $url_seen{ $link };
    $url_seen{ $link }++;

    next if (
        exists $self->{ 'HOOKS' }->{ 'add-url-test' } and
        not $self->invoke_hook_functions(
            'add-url-test',
            $link_url_abs
        )
    );
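
To illustrate the first suggestion, here is a minimal standalone sketch (not a patch against Robot.pm) of doing the %url_seen test before the MIME type check. The names check_mime_type_stub and select_links are hypothetical stand-ins invented for this example; the stub only counts calls, whereas the real check_mime_type issues a request. It also assumes %url_seen lives for the whole pass over the links.

```perl
use strict;
use warnings;

# Hypothetical stand-in for check_mime_type(): counts how often it is
# called and treats anything ending in .pdf as "not text/html".
my $mime_checks = 0;
sub check_mime_type_stub {
    my ($url) = @_;
    $mime_checks++;
    return $url !~ /\.pdf$/;
}

# Hypothetical helper: returns the links that would be followed.
sub select_links {
    my @links = @_;
    my %url_seen;
    my @followed;
    for my $link (@links) {
        # Suggested change: skip links we have already looked at
        # before doing anything that costs a request.
        next if $url_seen{$link}++;

        # Same shortcut as in the quoted code: .html/.htm/.shtml and
        # "/" links are assumed to be text/html without a check.
        unless ( $link =~ /\.s?html?/ || $link =~ m{/$} ) {
            next unless check_mime_type_stub($link);
        }
        push @followed, $link;
    }
    return @followed;
}

my @followed = select_links(qw(
    http://example.com/doc
    http://example.com/doc
    http://example.com/doc
    http://example.com/a.html
    http://example.com/file.pdf
    http://example.com/file.pdf
));
print "followed: @followed\n";
print "mime checks: $mime_checks\n";
```

With six input links, the stub is called only twice (once for /doc, once for /file.pdf). Note one behavioural difference from the quoted code: a link that fails the MIME check is now also marked as seen, so it is not re-checked later. That seems to be the desired effect here, but it is an assumption about what the module intends.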