Bug #28837 for WWW-Mechanize: "Parsing of undecoded UTF-8 will give garbage" warnings

Tue Aug 14 15:06:24 2007 MYSOCIETY [...] cpan.org - Ticket created

Subject: "Parsing of undecoded UTF-8 will give garbage" warnings

In a similar situation to #20274 (and #28815, but see below), WWW::Mechanize uses HTML parsers (both HTML::TokeParser and HTML::HeadParser) in the title, _extract_links and _extract_images functions. These parsers give the "Parsing of undecoded UTF-8 will give garbage" warning when supplied with UTF-8 data in bytes, which is what you get when fetching a UTF-8 encoded web page, and there is no way of setting utf8_mode on the parsers.

For example:

    balti:~$ cat wwwmech.pl
    #!/usr/bin/perl -w
    use strict;
    use WWW::Mechanize;
    my $m = WWW::Mechanize->new();
    $m->get('http://www.pt-br.pledgebank.com/');
    $m->images();
    $m->title();
    balti:~$ perl wwwmech.pl
    Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/site_perl/5.8/cygwin/HTML/PullParser.pm line 83.
    Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/site_perl/5.8/cygwin/HTML/PullParser.pm line 83.
    Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/site_perl/5.8/WWW/Mechanize.pm line 509.
    balti:~$

The first "Parsing of..." warning comes from the call to HTML::Form->parse within get(), logged in #28815 and discussed below; the second is from the call to images(); the third is from title().

The fix is as in the other two bugs mentioned: call utf8_mode(1) just after the parser is initialised. Given that WWW::Mechanize itself has fetched the content, this is fine for the three functions mentioned at the start of this bug.

However, as per the comment I've just left on my #28815, HTML::Form->parse() wants to be passed decoded content, and I don't see why that should be changed, given that it would break things already written. For this call (in update_html()), it seems we need to pass in the decoded content. This is available from $self->{res}->decoded_content() and would replace the $html on line 1952. I'm not sure whether this would have any other effects, but I don't see why it should.
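To illustrate the two approaches outside of WWW::Mechanize, here is a minimal sketch. It is not the patch itself: the sample markup, the variable names and the http://example.com/ base URI are made up for the example, and it assumes HTML::Parser (3.40 or later, for utf8_mode) and HTML::Form are installed from CPAN. The first half parses raw UTF-8 bytes with utf8_mode(1) set immediately after the parser is created; the second half decodes the bytes to characters before handing them to HTML::Form->parse.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::TokeParser;
use HTML::Form;

# Raw UTF-8 bytes, as a fetched page would arrive (hypothetical sample markup).
my $bytes = qq{<html><head><title>Caf\xc3\xa9 &amp; bar</title></head>}
          . qq{<body><form action="/s"><input name="q" value="Caf\xc3\xa9">}
          . qq{</form></body></html>};

# Approach 1: keep the data as bytes and tell the parser so.
# utf8_mode(1) is set right after construction, before any token is read.
my $p = HTML::TokeParser->new(\$bytes) or die "cannot create parser";
$p->utf8_mode(1);
$p->get_tag('title');
my $title = $p->get_trimmed_text('/title');    # result is still UTF-8 bytes

# Approach 2: decode to characters first, then parse.
# This mirrors passing decoded content to HTML::Form->parse.
my $chars  = decode('UTF-8', $bytes);
my ($form) = HTML::Form->parse($chars, 'http://example.com/');
my $q      = $form->value('q');                # result is a character string

printf "title is %d bytes long, form value is %d characters long\n",
    length($title), length($q);
```

The point of the contrast is that the first result is a byte string and the second a character string, so callers of the two code paths see different kinds of data; that is why changing what HTML::Form->parse() expects would affect existing code.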

Fri Nov 09 22:50:42 2007 PETDANCE [...] cpan.org - Correspondence added

Moved to http://code.google.com/p/www-mechanize/issues/detail?id=34

Fri Nov 09 22:50:43 2007 The RT System itself - Status changed from 'new' to 'open'

Fri Nov 09 22:50:44 2007 PETDANCE [...] cpan.org - Status changed from 'open' to 'resolved'

Mon Nov 12 22:57:36 2007 PETDANCE [...] cpan.org - Correspondence added

Moved to http://code.google.com/p/www-mechanize/issues/detail?id=35

Mon Nov 12 22:57:41 2007 The RT System itself - Status changed from 'resolved' to 'open'

Mon Nov 12 22:58:01 2007 PETDANCE [...] cpan.org - Status changed from 'open' to 'resolved'
