Subject: "Parsing of undecoded UTF-8 will give garbage" warnings
In a similar situation to #20274 (and #28815, but see below),
WWW::Mechanize uses HTML parsers (both HTML::TokeParser and
HTML::HeadParser) in its title, _extract_links and _extract_images
functions. These parsers give the "Parsing of undecoded UTF-8 will give
garbage" warning when supplied with UTF-8 data in bytes, which is what
you get when fetching a UTF-8 encoded web page, and there is currently
no way of setting utf8_mode on them. For example:
balti:~$ cat wwwmech.pl
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
my $m = WWW::Mechanize->new();
$m->get('http://www.pt-br.pledgebank.com/');
$m->images();
$m->title();
balti:~$ perl wwwmech.pl
Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/site_perl/5.8/cygwin/HTML/PullParser.pm line 83.
Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/site_perl/5.8/cygwin/HTML/PullParser.pm line 83.
Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/site_perl/5.8/WWW/Mechanize.pm line 509.
balti:~$
The first "Parsing of..." warning comes from the call to
HTML::Form->parse within get(), logged in #28815 and below; the second
is from the call to images(); the third from title().
The fix is as in the other two bugs mentioned - calling utf8_mode(1)
just after the parser is initialised. Given WWW::Mechanize itself has
fetched the content, this is fine for the three functions mentioned at
the start of this bug.
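
As a rough illustration of the fix described above (not the actual
patch), the sketch below builds an HTML::TokeParser over UTF-8 bytes
and switches on utf8_mode straight after construction. The sample
$html string and the variable names are made up for the example; in
WWW::Mechanize the content would be whatever was fetched.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample page: UTF-8 *bytes*, as they arrive off the wire
# (not decoded into Perl characters).
my $html = "<html><head><title>Caf\xc3\xa9 pledges</title></head>"
         . "<body><img src=\"a.png\" alt=\"caf\xc3\xa9 &amp; bar\"></body></html>";

my $parser = HTML::TokeParser->new(\$html)
    or die "Cannot create parser";

# Tell the underlying HTML::Parser that the input is undecoded UTF-8.
# HTML::TokeParser parses lazily, so setting this right after new()
# takes effect before any of the document is parsed.
$parser->utf8_mode(1);

my $title;
if ( $parser->get_tag('title') ) {
    $title = $parser->get_trimmed_text('/title');
}
```

The same call should apply to HTML::HeadParser, since it is also an
HTML::Parser subclass.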
However, as per the comment I've just left on my #28815,
HTML::Form->parse() wants to be passed decoded_content, and I don't see
why that should be changed given it would break things already written.
For this call (in update_html()), it seems we need to pass in
decoded_content. This is available from $self->{res}->decoded_content()
and would replace the $html on line 1952. I'm not sure if this would
have any other effects, though I don't see why it should.
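
A minimal sketch of that second change, assuming a locally constructed
HTTP::Response in place of the one WWW::Mechanize holds in
$self->{res} (the form markup and base URL here are invented for
illustration): the decoded content, rather than the raw bytes, is
handed to HTML::Form->parse().

```perl
use strict;
use warnings;
use HTTP::Response;
use HTML::Form;

# Hypothetical response standing in for $self->{res}; the body is
# UTF-8 bytes and the charset is declared in the Content-Type header.
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    "<form action=\"/search\"><input name=\"q\" value=\"caf\xc3\xa9\"></form>",
);

# decoded_content() returns Perl characters rather than bytes, which
# is what HTML::Form->parse() wants to be given.
my @forms = HTML::Form->parse( $res->decoded_content, 'http://example.com/' );

my $value = $forms[0]->value('q');
```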