Subject: ->find_link() behaviour depends on internal encoding of strings
Whether ->find_link( text => $text ) works correctly for links that include
non-breaking spaces depends on the internal encoding of $mech->content on
the one hand and that of $text on the other.
If the utf8 flag is not set on the content, non-breaking spaces will not
be removed by the get_trimmed_text method within HTML::TokeParser, because
/\s/ does _not_ match non-breaking spaces in latin1 strings, and so they
have to be included in $text in order to find the matching link.
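The content side of this can be seen with HTML::TokeParser alone; here is a
minimal sketch (link text and URL are made up, and the exact behaviour may
depend on the Perl version):

    use strict;
    use warnings;
    use HTML::TokeParser;

    for my $upgrade ( 0, 1 ) {
        my $html = qq{<a href="/x">\xA0Vermietungen</a>};
        utf8::upgrade($html) if $upgrade;      # flip the internal encoding

        my $p = HTML::TokeParser->new( \$html );
        $p->get_tag("a");
        my $text = $p->get_trimmed_text("/a");

        printf "%s: leading \\xA0 %s\n",
            $upgrade ? "utf8  " : "latin1",
            $text =~ /^\xA0/ ? "kept" : "trimmed";
    }

With the latin1 string the non-breaking space survives get_trimmed_text;
with the utf8-upgraded string it is trimmed away like ordinary whitespace.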
If $text has the utf8 flag set, however, find_link() will complain that
"'...' is space-padded and cannot succeed" and discard this filter
argument, because /\s/ _does_ match non-breaking spaces in that case.
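The filter side comes down to the same regex difference; a minimal sketch:

    my $latin1 = "\xA0Vermietungen";    # byte (latin1) string, utf8 flag off
    my $utf8   = "\xA0Vermietungen";
    utf8::upgrade($utf8);               # same characters, utf8 flag on

    print $latin1 =~ /^\s/ ? "latin1: space-padded\n" : "latin1: not space-padded\n";
    print $utf8   =~ /^\s/ ? "utf8:   space-padded\n" : "utf8:   not space-padded\n";

Only for the utf8-flagged string does /^\s/ match, which is why find_link()
complains and drops the filter.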
Please find a test script attached, which should produce the following
output:
25: (
"\xA0Vermietungen",
"/anzeigen/antz2/index.html?xv[order]=pdat+desc%2Cfirst_mod+desc%2Csort2+desc%2C+sort3+desc&xv[start]=0&xv[vwnum]=10&xv[cart_query]=&qv[categories]=92",
)
27: ' Vermietungen' is space-padded and cannot succeed at
/tmp/test_find_link line 21
(undef, "/mp_styles/mainpost_global.css")
(Disclaimer: For some reason I have not yet been able to track down, it
behaved differently when I tried it on another computer, because there the
content of the page in question was returned with the utf8 flag set.)
I suggest calling utf8::upgrade($content) before feeding the content to
HTML::TokeParser, and also calling utf8::upgrade() on the values in
WWW::Mechanize->_clean_keys(), to avoid this problem.
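For illustration, a rough sketch of what I have in mind (variable names and
the exact place in the WWW::Mechanize source are assumptions on my part;
regex filter values would of course have to be skipped):

    # before the content is handed to HTML::TokeParser:
    my $content = $self->content;
    utf8::upgrade($content);                 # force character (utf8) semantics
    my $parser  = HTML::TokeParser->new( \$content );

    # and on the plain-string filter values in _clean_keys(),
    # so that content and filters agree on what /\s/ means:
    utf8::upgrade( $parms{$key} ) unless ref $parms{$key};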
BTW, I also think it is not a good idea to simply discard arguments that
seem invalid, because that causes _any_ link to be found, which is usually
not what the caller expects. IMHO no link should be returned in this case.
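To make the expectation concrete (illustration only; the link text is taken
from the output above):

    my $text = "\xA0Vermietungen";
    utf8::upgrade($text);                        # as in the failing case above
    my $link = $mech->find_link( text => $text );
    # currently: the filter is warned about ("... is space-padded and cannot
    #            succeed") and discarded, so an arbitrary link is returned
    #            (the stylesheet <link> in the output above)
    # expected:  $link is undef, because no link can satisfy the filter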
Regards,
fany