Bug #2811 for WWW-Mechanize: incorrect and invalid behaviour of extract_links()

Tue Jun 17 06:11:52 2003 Guest - Ticket created

Subject: incorrect and invalid behaviour of extract_links()

Using WWW-Mechanize-0.44, I have found an issue with extract_links(): its behaviour does not follow the documentation. The documentation claims that element [1] of a frame/iframe link will be set to the text enclosed by the tags, when in fact it is set to the value of the name attribute.

Even if this were the desired behaviour, it causes a problem. When the name attribute of the frame/iframe tag is not set, element [1] of the link is returned as undefined. Consequently, when find_link() is called with "text => 'foo'", an error is produced because Perl attempts to compare 'foo' with the undefined value.

If using the name is the behaviour you want, then extract_links() could check whether the name attribute is undefined and return an empty string instead, and the documentation could be updated to match. However, I believe that Mechanize should work as the documentation describes, so the correct fix is to always use get_trimmed_text() (or similar). I have attached a patch against WWW-Mechanize-0.44 to do this.

The patch fixes the undefined value problem, but it changes the behaviour of extract_links() slightly. Consider the following example:

    <a href="uri1">A</a>
    <iframe src="uri2"><a href="uri3"><img alt="B" src="uri4"></a></iframe>
    <a href="uri5">C</a>

With Mechanize 0.44, this produces the following links (url, text):

    1, A
    2, *undefined*
    3, B
    4, C

With my patched version, we get:

    1, A
    2, B
    4, C

This is because the process of getting the text within the frame/iframe skips over the tags inside it, so they never get added to the list of links. I suppose you could get the raw HTML/text from inside it and do a recursive call on that to search for links before doing get_trimmed_text() on it. However, I think my patched version is the correct behaviour. My reasoning is that by taking the properties of the frame/iframe and making them visible, we are acting as a user agent that understands frames/iframes. Hence, we should ignore the content inside them.

If this change is integrated, then I'd suggest adding a note to the changelog to make people aware of the change in behaviour.

Patch tested using perl version v5.6.1 built for sun4-solaris-thread-multi with patch "ActivePerl Build 631" applied, on a SUNW,Ultra-250 running Solaris 8.
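
For anyone who wants to see the undefined-value problem without installing Mechanize, here is a minimal sketch in plain Perl. It assumes only what the report states, namely that a link is an array reference whose element [1] is the text. The helpers link_text() and count_warnings() are hypothetical names made up for this illustration and are not part of WWW::Mechanize. With warnings enabled, the comparison against an undefined text shows up as Perl's "Use of uninitialized value" diagnostic, and the empty-string fallback suggested above avoids it.

```perl
use strict;
use warnings;

# Links as the report describes them: element [0] is the URL, element [1] the text.
# The second entry models a frame/iframe tag that has no name attribute.
my @links = (
    [ 'uri1', 'A' ],
    [ 'uri2', undef ],
);

# Hypothetical helper: substitute an empty string for an undefined text.
sub link_text {
    my ($link) = @_;
    return defined $link->[1] ? $link->[1] : '';
}

# Hypothetical helper: run a block and return the number of warnings it raised.
sub count_warnings {
    my ($code) = @_;
    my $count = 0;
    local $SIG{__WARN__} = sub { $count++ };
    $code->();
    return $count;
}

# Naive comparison: the undefined text triggers "Use of uninitialized value".
my $naive = count_warnings(sub {
    my @found = grep { $_->[1] eq 'foo' } @links;
});

# Guarded comparison: no warning, and the frame link simply does not match.
my $guarded = count_warnings(sub {
    my @found = grep { link_text($_) eq 'foo' } @links;
});

print "naive warnings: $naive, guarded warnings: $guarded\n";
```

This only covers the "return an empty string" alternative; it does not change which text is reported for the frame, which is what the attached patch addresses.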

820c820
< my $text = $tag_is_a ? $p->get_trimmed_text("/a") : $token->[1]{name};
---
> my $text = $p->get_trimmed_text("/" . $token->[0]);
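
To illustrate why this one-line change drops the link nested inside the iframe, here is a toy model in plain Perl. It is only a sketch: it does not use HTML::TokeParser, the token list is written out by hand from the example HTML in the report, and text_until() and extract() are hypothetical names that merely mimic the two strategies (name attribute versus reading text up to the closing tag).

```perl
use strict;
use warnings;

# Hand-written tokens for the example HTML: S = start tag, T = text, E = end tag.
my @tokens = (
    [ S => 'a',      { href => 'uri1' } ],
    [ T => 'A' ],
    [ E => 'a' ],
    [ S => 'iframe', { src => 'uri2' } ],
    [ S => 'a',      { href => 'uri3' } ],
    [ S => 'img',    { alt => 'B', src => 'uri4' } ],
    [ E => 'a' ],
    [ E => 'iframe' ],
    [ S => 'a',      { href => 'uri5' } ],
    [ T => 'C' ],
    [ E => 'a' ],
);

# Consume tokens up to (not including) the matching end tag and collect text.
# An img contributes its alt attribute, as in the report's example.
sub text_until {
    my ($queue, $end_tag) = @_;
    my $text = '';
    while (@$queue) {
        my $tok = $queue->[0];
        last if $tok->[0] eq 'E' && $tok->[1] eq $end_tag;
        shift @$queue;
        if ($tok->[0] eq 'T') {
            $text .= $tok->[1];
        }
        elsif ($tok->[0] eq 'S' && $tok->[1] eq 'img' && defined $tok->[2]{alt}) {
            $text .= $tok->[2]{alt};
        }
    }
    $text =~ s/^\s+|\s+$//g;
    return $text;
}

# $patched false: frames use the name attribute (the 0.44 behaviour in the report).
# $patched true:  every link reads text up to its own closing tag (the patch).
sub extract {
    my ($patched) = @_;
    my @queue = @tokens;    # shallow copy so each run starts from the beginning
    my @links;
    while (my $tok = shift @queue) {
        next unless $tok->[0] eq 'S';
        my ($tag, $attr) = @{$tok}[1, 2];
        my $is_a = $tag eq 'a';
        next unless $is_a || $tag eq 'iframe' || $tag eq 'frame';
        my $url  = $is_a ? $attr->{href} : $attr->{src};
        my $text = ($patched || $is_a) ? text_until(\@queue, $tag) : $attr->{name};
        push @links, [ $url, $text ];
    }
    return @links;
}

for my $patched (0, 1) {
    print $patched ? "patched:\n" : "unpatched:\n";
    for my $link (extract($patched)) {
        my $text = defined $link->[1] ? $link->[1] : '*undefined*';
        print "  $link->[0], $text\n";
    }
}
```

In the patched run, reading up to the iframe's closing tag consumes the nested a and img tokens, so the nested link is never seen by the outer loop. That matches the three-link result described in the report.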

Sun Jul 20 00:39:25 2003 andy [...] petdance.com - Taken

Sun Jul 20 00:39:45 2003 andy [...] petdance.com - Correspondence added

Fixed in 0.54. Also, extract_links() is now dead.

Sun Jul 20 00:39:45 2003 andy [...] petdance.com - Status changed from 'new' to 'resolved'
