Bug #3165 for WWW-Mechanize: Follow link based on surrounding text

Wed Aug 06 05:22:29 2003 Guest - Ticket created

Subject: Follow link based on surrounding text

This is a great module, but I run into the following issue every time: websites with electronic versions of (academic) journal articles, e.g., sciencedirect.com, generally present the content you want, followed by links to generic actions. My wish is therefore to be able to follow a link based on the *preceding material*, and not on one of the properties of the link itself. This is comparable to the request someone made for scraping Google News: suppose I want to read all the stories about Iraq; then I won't get far by examining the link text or href on news.google.com.

And now for a concrete example. Suppose we have two entries for journal articles on one page:

On the theory of reference-dependent preferences, Pages 407-428
Alistair Munro and Robert Sugden
Abstract | Full Text + Links | PDF (149 K)

Melioration learning in games with constant and frequency-dependent pay-offs, Pages 429-448
Thomas Brenner and Ulrich Witt
Abstract | Full Text + Links | PDF (114 K)

In this case, doing $agent->follow('PDF') (or having a url_regex matching '.pdf') is not useful: you do not want to follow just any PDF link, but the PDF link that comes right after the correct page numbers are mentioned. I imagine this problem occurs in several other applications. For example, when screen scraping your inbox from webmail, each subject line offers 'reply', 'read', 'delete', etc., but the links to those actions cannot be told apart by their name (or URL) across the different emails.

I think this problem is ultimately solved by having something like a function "follow_context(R1, R2)", which matches R1 on the visible text and matches R2 on the links that follow the match of R1. R2 could also be allowed to be an integer (possibly negative), which gives the link number counted from the match of R1.

I am using my own patched version of WWW::Mechanize that does this, and it works great. I would therefore love to send in a patch, but I need to think of a non-dirty way of getting the text nodes from a page. That is, HTML::TokeParser only accepts one parameter in get_text, while we need it to get the text until it meets an <a>, <iframe>, or <frame> tag. Any ideas on this?
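One possible way around the get_text limitation raised above is to walk the token stream with HTML::TokeParser's get_token and accumulate the text seen between links. The following is a minimal sketch of that idea, not the approach the requester used: the helper name links_with_preceding_text, the sample markup, and the URLs are all hypothetical, and only <a href> tags are handled (no <frame>, <iframe>, or <area>).

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not part of WWW::Mechanize): pair every <a href>
# with the non-link text seen since the previous link.
sub links_with_preceding_text {
    my ($html) = @_;
    my $p = HTML::TokeParser->new( \$html ) or die "cannot parse HTML";

    my @pairs;
    my $pending = '';    # text collected since the previous link
    my $in_link = 0;     # true while inside <a href>...</a>

    while ( my $token = $p->get_token ) {
        my $type = $token->[0];
        if ( $type eq 'S' and $token->[1] eq 'a' and defined $token->[2]{href} ) {
            # Normalize whitespace so the context is easy to match on.
            $pending =~ s/\s+/ /g;
            $pending =~ s/^ | $//g;
            push @pairs, { context => $pending, href => $token->[2]{href} };
            $pending = '';
            $in_link = 1;
        }
        elsif ( $type eq 'E' and $token->[1] eq 'a' ) {
            $in_link = 0;
        }
        elsif ( $type eq 'T' and not $in_link ) {
            $pending .= $token->[1];    # link text itself is skipped
        }
    }
    return @pairs;
}

# Made-up markup loosely modelled on the journal listing in the ticket.
my $html = <<'HTML';
<p>On the theory of reference-dependent preferences, Pages 407-428
<a href="/a1">Abstract</a> | <a href="/p1.pdf">PDF (149 K)</a></p>
<p>Melioration learning in games, Pages 429-448
<a href="/a2">Abstract</a> | <a href="/p2.pdf">PDF (114 K)</a></p>
HTML

my @pairs = links_with_preceding_text($html);
printf "%-8s <= %s\n", $_->{href}, $_->{context} for @pairs;
```

Each entry records the text that precedes a link, so a context regex can be matched against the context field and the link counted forward from the matching entry.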

Wed Aug 06 07:05:38 2003 Guest - Correspondence added

Seems interesting and useful. I think this is a great idea, but we need a real implementation to discuss the benefits of this API further.

Sat Oct 04 09:38:42 2003 Guest - Correspondence added

From: siegmann [...] tinbergen.nl

[guest - Wed Aug 6 07:05:38 2003]:

> Seems interesting and useful. I think this is a great idea, but we need
> a real implementation to discuss the benefits of this API further.

Please find attached a patch, oldnew.patch (unified diff), relative to version 0.58. It extends the implementation of follow_link to include a context_regex, which is matched on the (non-link) text of a page. If a match is found, $parms{n} gives the number of the nth link after the match that should be followed (default 1).

It took a few changes in follow_link to include context_regex. A few extra lines do the match on the non-link text and return the nth link after the match. I've tried to minimize the changes to _extract_links while still getting it to fill a @texts array with the pieces of text between the recorded urltags.

Please let me know what you think!

Arjen Siegmann

--- Mechanize.old Tue Sep 2 09:30:30 2003
+++ Mechanize.new Wed Sep 3 11:20:09 2003
@@ -853,13 +853,14 @@
     my %parms = ( n=>1, @_ );
 
     my @links = @{$self->{links}};
+    my @texts = @{$self->{texts}};
 
     return unless @links ;
 
     my $wantall = ( $parms{n} eq "all" );
 
     for ( keys %parms ) {
-        if ( !/^(n|(text|url)(_regex)?)$/ ) {
+        if ( !/^(n|(text|url|context)(_regex)?)$/ ) {
             $self->_carp( qq{Unknown link-finding parameter "$_"} );
         }
     }
@@ -869,6 +870,7 @@
     push @conditions, q/ $_[0]->[0] =~ $parms{url_regex} / if defined $parms{url_regex};
     push @conditions, q/ defined($_[0]->[1]) and $_[0]->[1] eq $parms{text} / if defined $parms{text};
     push @conditions, q/ defined($_[0]->[1]) and $_[0]->[1] =~ $parms{text_regex} / if defined $parms{text_regex};
+    push @conditions, q/ defined($_[0]) and $_[0] =~ $parms{context_regex} / if defined $parms{context_regex};
 
     my $matchfunc;
     if ( @conditions ) {
@@ -878,6 +880,8 @@
         $matchfunc = sub{1};
     }
 
+    # simple case: search url or text of links
+    if(not defined $parms{context_regex}) {
     my $nmatches = 0;
     my @matches;
     for my $link ( @links ) {
@@ -897,7 +901,21 @@
     }
 
     return;
-} # find_link
+    } # find_link
+
+    # defined $param{context_regex}
+    my $link_count = 0;
+my $i=0;
+    for my $text ( @texts ) {
+        if ( $matchfunc->($text) ) {
+            my $num = $link_count+$parms{n}-1;
+            return $links[$num];
+        }
+        $link_count++;
+    }
+
+    return 0; #nothing found
+}
 
 =head2 C<< $a->find_all_links( ... ) >>
 
@@ -1148,24 +1166,41 @@
     my $p = HTML::TokeParser->new(\$self->{content});
 
     $self->{links} = [];
+    $self->{texts} = [];
+
+    my $inter_text;
+    my $text;
+    my $token;
+
+    # We need to workaround incorrect html, i.e., not starting with a tag
+    while ( $text = $p->get_trimmed_text or $token = $p->get_tag) {
+        $inter_text .= $text;
+        ( $token = $p->get_tag ) unless defined $text;
 
-    while (my $token = $p->get_tag( keys %urltags )) {
-        my $tag = $token->[0];
-        my $url = $token->[1]{$urltags{$tag}};
-        next unless defined $url; # probably just a name link or <AREA NOHREF...>
+        my $tag= $token->[0];
+        next unless( $urltags{$tag} ); # read more text if not an urltag
 
-        my $text;
+        my $url = $token->[1]{$urltags{$tag}};
+        next unless defined $url; # probably just a name link or <AREA NOHREF...>
+
+        my $link_text;
         my $name;
         if ( $tag eq "a" ) {
-            $text = $p->get_trimmed_text("/$tag");
-            $text = "" unless defined $text;
+            $link_text = $p->get_trimmed_text("/$tag");
+            $link_text = "" unless defined $link_text;
         }
         if ( $tag ne "area" ) {
             $name = $token->[1]{name};
         }
 
-        push( @{$self->{links}}, WWW::Mechanize::Link->new( $url, $text, $name, $tag ) );
-    }
+        push( @{$self->{texts}}, $inter_text );
+        $inter_text = "";
+        push( @{$self->{links}}, WWW::Mechanize::Link->new( $url, $link_text, $name, $tag ) );
+    }
+
+    if($inter_text ne "") { #some text after final urltag
+        push( @{$self->{texts}}, $inter_text );
+    }
 
     # Old extract_links() returned a value. Carp if someone expects
     # this version to return something.
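The lookup that the patch adds to find_link can be sketched in isolation from WWW::Mechanize. The snippet below is only an illustration under stated assumptions: link_after_context is a hypothetical function, the two arrays stand in for the patch's @texts and @links (with $texts[$i] being the non-link text just before $links[$i]), and the link values are plain placeholder strings rather than WWW::Mechanize::Link objects. Unlike the patch, which returns 0 when nothing matches and does not check the computed index, this sketch returns nothing in both cases.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the context lookup in the patch.
# $texts->[$i] is the non-link text immediately before $links->[$i].
sub link_after_context {
    my ( $texts, $links, $context_re, $n ) = @_;
    $n = 1 unless defined $n;    # default: first link after the match

    for my $i ( 0 .. $#{$texts} ) {
        next unless defined $texts->[$i] and $texts->[$i] =~ $context_re;
        my $idx = $i + $n - 1;    # same arithmetic as $link_count+$parms{n}-1
        return if $idx < 0 or $idx > $#{$links};    # no such link
        return $links->[$idx];
    }
    return;    # context not found
}

# Placeholder data modelled on the two journal entries in the ticket.
my @texts = (
    'Pages 407-428 Alistair Munro and Robert Sugden', '|', '|',
    'Pages 429-448 Thomas Brenner and Ulrich Witt',   '|', '|',
);
my @links = qw( abstract1 fulltext1 pdf1 abstract2 fulltext2 pdf2 );

# Third link after the text that mentions pages 429-448.
my $pdf = link_after_context( \@texts, \@links, qr/429-448/, 3 );
print "$pdf\n";
```

Only the first matching text segment is considered, which mirrors the early return in the patch's loop.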

Mon Oct 18 21:42:11 2004 MARKSTOS [...] cpan.org - Status changed from 'new' to 'stalled'

Mon Oct 18 21:42:11 2004 MARKSTOS [...] cpan.org - Correspondence added

[guest - Sat Oct 4 09:38:42 2003]:

> [guest - Wed Aug 6 07:05:38 2003]:
>
> > Seems interesting and useful. I think this is a great idea, but we need
> > a real implementation to discuss the benefits of this API further.
>
> Please find a patch attached oldnew.patch (unified diff) relative to
> version 0.58. It extends the implementation of follow_link to include a
> context_regex, which is matched on the (non-link) text of a page. If a
> match is found $parms{n} gives the number of the nth link after the
> match that should be followed (default 1).

Sorry for the massive delay. Andy got way behind on the bug tracking system, and I'm helping him out.

This still looks like an interesting feature. Could you extend your patch to update the test suite and documentation as well? The feature is unique enough that we'll probably want to try it in beta before finally accepting it. You will be given credit in the 'Changes' file for the idea and implementation.

Thanks for the submission, Arjen!

Mark

Sat Jul 15 20:23:59 2006 MARKSTOS [...] cpan.org - Correspondence added

Subject:	Follow link based on surrounding text (can it be a plugin?)
From:	MARKSTOS [...] cpan.org

Hello,

I'm following up on this old "Wish" for Mechanize. It is an interesting idea, but I think it is best to keep the Mechanize module itself lean and extend it through plugins. I encourage you to look at releasing this functionality by writing a plugin for use with WWW::Mechanize::Pluggable. If for some reason that isn't feasible, I'm interested to know the details, so that we can improve the plugin system to enable features like this to be plugged in.

Mark

On Wed Aug 06 05:22:29 2003, guest wrote:

> This is a great module, but I run into the following issue every time:
> Websites with electronic versions of (academic) journal
> articles, e.g., sciencedirect.com, generally present the stuff you
> want, followed by links to possible generic actions. My wish is
> therefore to be able to follow a link based on the *preceding
> material*, and not one of the properties of the link itself. This
> is comparable with the request someone did for scraping Google
> news: suppose I want to read all the stories about Iraq, then I
> won't get far by examining the url-text/href on news.google.com....
>
> And now for a concrete example:
> Suppose we have two entries for journal articles on 1 page:
>
> On the theory of reference-dependent preferences, Pages 407-428
> Alistair Munro and Robert Sugden
> Abstract | Full Text + Links | PDF (149 K)
>
> Melioration learning in games with constant and frequency-dependent
> pay-offs, Pages 429-448
> Thomas Brenner and Ulrich Witt
> Abstract | Full Text + Links | PDF (114 K)
>
> In this case, doing $agent->follow('PDF') (or having an url_regex
> matching '.pdf') is not useful, as you do not
> want to follow a pdf link, but follow the link to the pdf just
> right after the correct pagenumbers are mentioned. This is a problem
> that
> can occur for several other applications, I imagine. For example,
> screen scraping your inbox from webmail: for each subject line you can
> choose 'reply', 'read', 'delete', etc., but the links to those actions
> are not distinguishable by their name(or url) for the different
> emails.
>
> I think this problem is ultimately solved by having something like a
> function "follow_context(R1, R2)", which matches R1 on the visible
> text, and matches R2 on the links that follow
> after the match of R1. Also, R2 could be allowed to be an integer
> (possibly negative), which gives the
> link number starting from the match of R1.
>
> I am using my own patched version of WWW::Mechanize that does this and
> it works great. Therefore, I would love to send in a patch, but I
> need to think of a non-dirty way of getting the text nodes from a
> page. I.e., HTML::TokeParser only accepts one parameter in
> get_text, while we need it to get the text until it meets an <a>,
> <iframe>, or <frame> tag.
> Any ideas on this?

Sat Jul 15 20:23:59 2006 The RT System itself - Status changed from 'stalled' to 'open'

Tue Oct 30 01:09:38 2007 PETDANCE [...] cpan.org - Status changed from 'open' to 'rejected'
