[SREZIC - Mon Dec 5 04:40:46 2005]:
[snip patch]
> I don't think excluding URLs with fragments is a good idea: a
> page whose URL contains a fragment should be spidered and validated
> like any other page. My suggestion was only about optimizing the
> spidering: two URLs that differ only in the fragment part could be
> fetched and validated just once, because they point to the same
> page.
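
As a rough illustration (just a sketch with made-up URLs, not code
from the patch), the collapsing you describe boils down to using the
fragment-less form as the cache key:

    use URI;

    # two links that differ only in the fragment part (made-up URLs)
    my @links = (
        'http://www.example.org/doc.html#intro',
        'http://www.example.org/doc.html#usage',
    );

    my %seen;
    for my $link (@links) {
        my $uri = URI->new( $link );
        $uri->fragment( undef );      # drop the "#..." part
        my $key = $uri->as_string;    # identical for both links
        next if $seen{ $key }++;      # the second link is skipped
        # ... fetch and validate $key only once ...
    }
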
Ok, I reverted that bit and changed it to something that looks more like your original
proposal (this is the main hunk of change 446):
@@ -230,11 +233,17 @@
     my $new_base = $mech->uri;
     foreach my $link ( @candidates ) {
-        my $check = URI->new_abs( $link->url, $new_base )->as_string;
+        my $new = URI->new_abs( $link->url, $new_base )->as_string;
+        my $check = $self->strip_uri( $new );
         my $data;
         if ( $data = $cache->has( $check ) ) {
+            my $frag;
+            if ( $new ne $check && ! ($frag = $cache->has( $new )) ) {
+                $frag = [ WCS_TOFOLLOW, undef, $this_page->[2] + 1 ];
+                $cache->set( $new => $frag );
+            }
         } else {
-            if ( $self->uri_ok( $check ) ) {
+            if ( $self->uri_ok( $new ) ) {
                 $stack->push( $check );
                 $data = [ WCS_TOSPIDER, undef, $this_page->[2] + 1 ];
             } else {
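
(strip_uri is not shown in this hunk; assuming it does nothing more
than drop the fragment part, a rough sketch of such a helper would be
something like this, not necessarily the version in change 446:)

    # rough sketch only; the real strip_uri may differ
    sub strip_uri {
        my( $self, $uri ) = @_;
        my $stripped = URI->new( $uri );   # URI is already loaded by the spider
        $stripped->fragment( undef );      # remove "#fragment", keep the rest
        return $stripped->as_string;
    }

That way the stripped URL is the one that gets pushed and spidered,
while a fragment variant that maps to an already-seen page only gets
a WCS_TOFOLLOW entry in the cache instead of being fetched again.
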
HTH,
-- Abe.