Subject: Follow link based on surrounding text
This is a great module, but I run into the following issue every time: websites with electronic versions of (academic) journal
articles, e.g., sciencedirect.com, generally present the item you
want, followed by links to generic actions. My wish is therefore to be able to follow a link based on the *preceding material* rather than on a property of the link itself. This is comparable to the request someone made for scraping Google News: suppose I want to read all the stories about Iraq; then I won't get far by examining the link text or href on news.google.com.
And now for a concrete example:
Suppose we have two entries for journal articles on one page:
On the theory of reference-dependent preferences, Pages 407-428
Alistair Munro and Robert Sugden
Abstract | Full Text + Links | PDF (149 K)
Melioration learning in games with constant and frequency-dependent
pay-offs, Pages 429-448
Thomas Brenner and Ulrich Witt
Abstract | Full Text + Links | PDF (114 K)
In this case, doing $agent->follow('PDF') (or using a url_regex matching '.pdf') is not useful, because you do not
want to follow just any PDF link; you want the PDF link that comes
right after the correct page numbers are mentioned. I imagine this problem
occurs in several other applications as well. For example, when
screen scraping your inbox from webmail, each subject line offers
'reply', 'read', 'delete', etc., but the links to those actions
cannot be distinguished by their name (or URL) across the different
emails.
I think this problem would ultimately be solved by something like a
function "follow_context(R1, R2)", which matches R1 against the visible text and then matches R2 against the links that follow
the match of R1. R2 could also be allowed to be an integer (possibly negative), giving the
position of the link counted from the match of R1.
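To make the proposed semantics concrete, here is a minimal sketch in plain Perl. It is not the patch mentioned in this post and not part of WWW::Mechanize: the name find_link_by_context and its return value (the href string) are hypothetical, R1 is matched against one text token at a time, and only <a href> links are considered. It walks the page with HTML::TokeParser's get_token rather than get_text.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (an assumption, not a WWW::Mechanize method).
#   $text_re  : regex matched against visible text (R1)
#   $link_sel : regex matched against link text, or an integer N (R2);
#               1 = first link after the R1 match, -1 = last link before it
# Returns the href of the selected link, or nothing if there is none.
sub find_link_by_context {
    my ($html, $text_re, $link_sel) = @_;
    my $p = HTML::TokeParser->new(\$html) or return;

    my @links;       # [href, link text] for every <a href>, in document order
    my $match_at;    # number of links seen before R1 first matched

    while (my $tok = $p->get_token) {
        if ($tok->[0] eq 'T') {
            $match_at = @links if !defined $match_at && $tok->[1] =~ $text_re;
        }
        elsif ($tok->[0] eq 'S' && $tok->[1] eq 'a' && defined $tok->[2]{href}) {
            my $text = $p->get_trimmed_text('/a');
            push @links, [ $tok->[2]{href}, $text ];
            # R1 may match inside a link's own text; only later links follow it.
            $match_at = @links if !defined $match_at && $text =~ $text_re;
        }
    }
    return unless defined $match_at;

    if (!ref $link_sel && $link_sel =~ /^-?\d+$/) {
        my $i = $link_sel > 0 ? $match_at + $link_sel - 1 : $match_at + $link_sel;
        return if $link_sel == 0 || $i < 0 || $i > $#links;
        return $links[$i][0];
    }
    for my $i ($match_at .. $#links) {
        return $links[$i][0] if $links[$i][1] =~ $link_sel;
    }
    return;
}

# Made-up markup imitating the two journal entries above.
my $html = <<'HTML';
<p>On the theory of reference-dependent preferences, Pages 407-428<br>
<a href="/a1">Abstract</a> | <a href="/f1">Full Text + Links</a> | <a href="/p1.pdf">PDF (149 K)</a></p>
<p>Melioration learning in games, Pages 429-448<br>
<a href="/a2">Abstract</a> | <a href="/f2">Full Text + Links</a> | <a href="/p2.pdf">PDF (114 K)</a></p>
HTML

my $href = find_link_by_context($html, qr/Pages 429-448/, qr/^PDF/);
print defined $href ? "$href\n" : "no match\n";
```

With a WWW::Mechanize object, one would presumably pass $agent->content as the first argument and hand the returned href to $agent->get.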
I am using my own patched version of WWW::Mechanize that does this, and it works great. I would therefore love to send in a patch, but I first need to think of a non-dirty way of getting the text nodes from a page. Specifically, HTML::TokeParser only accepts one parameter in get_text, while we need it to collect text until it meets an <a>, <iframe>, or <frame> tag.
Any ideas on this?