Subject: ->find_link() behaviour depends on internal encoding of strings
Whether ->find_link( text => $text ) works correctly for links that include
non-breaking spaces depends on the internal encoding of $mech->content on
the one hand and that of $text on the other.
If the utf8 flag is not set on the content, non-breaking spaces will not
be removed by the get_trimmed_text method within HTML::TokeParser, because
/\s/ does _not_ match non-breaking spaces in latin1 strings, and so they
have to be included in $text in order to find the matching link.
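The content side of this can be seen with HTML::TokeParser alone; here is a
minimal sketch (link text and URL are made up, and the exact behaviour may
depend on the Perl version):

    use strict;
    use warnings;
    use HTML::TokeParser;

    for my $upgrade ( 0, 1 ) {
        my $html = qq{<a href="/x">\xA0Vermietungen</a>};
        utf8::upgrade($html) if $upgrade;      # flip the internal encoding

        my $p = HTML::TokeParser->new( \$html );
        $p->get_tag("a");
        my $text = $p->get_trimmed_text("/a");

        printf "%s: leading \\xA0 %s\n",
            $upgrade ? "utf8  " : "latin1",
            $text =~ /^\xA0/ ? "kept" : "trimmed";
    }

With the latin1 string the non-breaking space survives get_trimmed_text;
with the utf8-upgraded string it is trimmed away like ordinary whitespace.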
If $text has the utf8 flag set, however, find_link() will complain that
"'...' is space-padded and cannot succeed" and discard this filter
argument, because /\s/ _does_ match non-breaking spaces in that case.
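The filter side comes down to the same regex difference; a minimal sketch:

    my $latin1 = "\xA0Vermietungen";    # byte (latin1) string, utf8 flag off
    my $utf8   = "\xA0Vermietungen";
    utf8::upgrade($utf8);               # same characters, utf8 flag on

    print $latin1 =~ /^\s/ ? "latin1: space-padded\n" : "latin1: not space-padded\n";
    print $utf8   =~ /^\s/ ? "utf8:   space-padded\n" : "utf8:   not space-padded\n";

Only for the utf8-flagged string does /^\s/ match, which is why find_link()
complains and drops the filter.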
Please find a test script attached, which should produce the following
output:
25: (
"\xA0Vermietungen",
"/anzeigen/antz2/index.html?xv[order]=pdat+desc%2Cfirst_mod+desc%2Csort2+desc%2C+sort3+desc&xv[start]=0&xv[vwnum]=10&xv[cart_query]=&qv[categories]=92",
)
27: ' Vermietungen' is space-padded and cannot succeed at
/tmp/test_find_link line 21
(undef, "/mp_styles/mainpost_global.css")
(Disclaimer: For some reason I have not yet been able to track down, it
behaved differently when I tried it on another computer, because there the
content of the page in question was returned with the utf8 flag set.)
I suggest calling utf8::upgrade($content) before feeding the content to
HTML::TokeParser, and also calling utf8::upgrade() on the values in
WWW::Mechanize->_clean_keys(), to avoid this problem.
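For illustration, a rough sketch of what I have in mind (variable names and
the exact place in the WWW::Mechanize source are assumptions on my part;
regex filter values would of course have to be skipped):

    # before the content is handed to HTML::TokeParser:
    my $content = $self->content;
    utf8::upgrade($content);                 # force character (utf8) semantics
    my $parser  = HTML::TokeParser->new( \$content );

    # and on the plain-string filter values in _clean_keys(),
    # so that content and filters agree on what /\s/ means:
    utf8::upgrade( $parms{$key} ) unless ref $parms{$key};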
BTW, I also think it is not a good idea to simply discard arguments that
seem invalid, because that causes _any_ link to be found, which is usually
not what the caller expects. IMHO no link should be returned in this case.
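To make the expectation concrete (illustration only; the link text is taken
from the output above):

    my $text = "\xA0Vermietungen";
    utf8::upgrade($text);                        # as in the failing case above
    my $link = $mech->find_link( text => $text );
    # currently: the filter is warned about ("... is space-padded and cannot
    #            succeed") and discarded, so an arbitrary link is returned
    #            (the stylesheet <link> in the output above)
    # expected:  $link is undef, because no link can satisfy the filter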
Regards,
fany