
This queue is for tickets about the WWW-CheckSite CPAN distribution.

Report information
The Basics
Id: 16162
Status: resolved
Priority: 0
Queue: WWW-CheckSite

People
Owner: abeltje [...] cpan.org
Requestors: SREZIC [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Unimportant
Broken in: 0.015
Fixed in: (no value)



Subject: Ignore fragments while spidering
Fragments (e.g. http://host/path#fragment) should be ignored while spidering pages. Something like this in WWW::CheckSite::Spider::_update_stack could work:

    my $check_uri = URI->new_abs( $link->url, $new_base );
    $check_uri->fragment(undef) if defined $check_uri->fragment;
    my $check = $check_uri->as_string;

Regards,
Slaven
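For illustration, a minimal self-contained version of this normalization; it assumes only the core URI module, and the base and link values are invented:

    use strict;
    use warnings;
    use URI;

    my $new_base = 'http://host/dir/page.html';
    for my $link ( 'other.html#top', 'other.html', '../index.html#s2' ) {
        my $check_uri = URI->new_abs( $link, $new_base );
        $check_uri->fragment(undef) if defined $check_uri->fragment;
        print $check_uri->as_string, "\n";    # absolute URI, fragment stripped
    }

Both fragment variants of a link then collapse to the same string, so the spider would treat them as a single page.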
[SREZIC - Mon Nov 28 12:25:32 2005]:
> [quoted text of the original report snipped]
Well, it should have been caught by the exclusion pattern, but that is overridable, so I've added fragmented URIs to the list of non-spiderable URIs in uri_ok() with this chunk from change 438:

    @@ -453,12 +453,14 @@
     sub uri_ok {
         my( $self, $uri ) = @_;

    +    my $check_uri = URI->new( $uri );
         $self->{_uri_ok} = '';
         $self->{v} and print " Check '$uri'";

    -    $self->{_uri_ok} = 'scope'   unless $uri =~ /^$self->{_self_base}/;
    -    $self->{_uri_ok} = 'pattern' if $uri =~ m/$self->{exclude}/;
    +    $self->{_uri_ok} = 'scope'    unless $uri =~ m/^$self->{_self_base}/;
    +    $self->{_uri_ok} = 'fragment' if $check_uri->fragment;
    +    $self->{_uri_ok} = 'pattern'  if $uri =~ m/$self->{exclude}/;

    -    $self->{_uri_ok} = 'robots' unless $self->{_norules} ||
    +    $self->{_uri_ok} = 'robots'  unless $self->{_norules} ||
             $self->allowed( $uri );

         $self->{v} and

HTH + Good luck,

-- Abe.
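A standalone sketch of how the patched check classifies URIs. This is not the module code: the robots.txt branch is omitted, the spider object is reduced to a plain hash, and the base and exclude values are invented.

    use strict;
    use warnings;
    use URI;

    sub uri_ok {
        my( $self, $uri ) = @_;
        my $check_uri = URI->new( $uri );
        $self->{_uri_ok} = '';
        $self->{_uri_ok} = 'scope'    unless $uri =~ m/^$self->{_self_base}/;
        $self->{_uri_ok} = 'fragment' if $check_uri->fragment;
        $self->{_uri_ok} = 'pattern'  if $uri =~ m/$self->{exclude}/;
        return ! $self->{_uri_ok};    # empty reason string means "spider it"
    }

    my $spider = { _self_base => 'http://host/', exclude => qr/\.pdf$/ };
    for my $uri ( 'http://host/a', 'http://host/a#top', 'http://host/x.pdf' ) {
        print "$uri: ", uri_ok( $spider, $uri )
            ? "spider\n" : "skip ($spider->{_uri_ok})\n";
    }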
From: srezic [...] cpan.org
[ABELTJE - Sun Dec 4 11:21:13 2005]:
> [quoted text of the previous messages snipped]
I don't think excluding URLs with fragments is a good idea: a page containing a fragment should be spidered and validated like any other page. My suggestion was just about optimizing the spidering. Two different URLs that differ only in the fragment part could be fetched and validated just once, because they point to the same page.

Regards,
Slaven
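The point can be shown in a few lines; here %seen is an invented stand-in for the spider's cache:

    use strict;
    use warnings;
    use URI;

    my %seen;    # keyed on the fragment-less URI, i.e. the page itself
    for my $url ( 'http://host/a#one', 'http://host/a#two', 'http://host/b' ) {
        my $page = URI->new( $url );
        $page->fragment(undef);
        next if $seen{ $page->as_string }++;    # same page, fetch only once
        print "fetch ", $page->as_string, "\n";
    }

This prints "fetch http://host/a" and "fetch http://host/b": the second fragment variant costs no extra request.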
[SREZIC - Mon Dec 5 04:40:46 2005]: [snip patch]
> [quoted text snipped]
OK, I reverted that bit and changed it to something that looks more like your original proposal (this is the main hunk of change 446):

    @@ -230,11 +233,17 @@
         my $new_base = $mech->uri;

         foreach my $link ( @candidates ) {
    -        my $check = URI->new_abs( $link->url, $new_base )->as_string;
    +        my $new   = URI->new_abs( $link->url, $new_base )->as_string;
    +        my $check = $self->strip_uri( $new );

             my $data;
             if ( $data = $cache->has( $check ) ) {
    +            my $frag;
    +            if ( $new ne $check && ! ($frag = $cache->has( $new )) ) {
    +                $frag = [ WCS_TOFOLLOW, undef, $this_page->[2] + 1 ];
    +                $cache->set( $new => $frag );
    +            }
             } else {
    -            if ( $self->uri_ok( $check ) ) {
    +            if ( $self->uri_ok( $new ) ) {
                     $stack->push( $check );
                     $data = [ WCS_TOSPIDER, undef, $this_page->[2] + 1 ];
                 } else {

HTH,

-- Abe.
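The hunk relies on a strip_uri() helper that is not shown in the ticket. Judging from the discussion it presumably just removes the fragment; the following minimal version is an assumption, not the actual module code:

    use URI;

    # Assumed behaviour of strip_uri(): return the URI with its fragment
    # dropped, so the cache and stack key on the page itself.
    sub strip_uri {
        my( $self, $uri ) = @_;
        my $stripped = URI->new( $uri );
        $stripped->fragment(undef) if defined $stripped->fragment;
        return $stripped->as_string;
    }

With that, $check identifies the page while $new keeps the fragment, so a fragment-only variant of an already-cached page is recorded as WCS_TOFOLLOW instead of being pushed onto the spider stack again.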