This queue is for tickets about the WWW-Mechanize CPAN distribution.

Report information
The Basics
Id: 28837
Status: resolved
Priority: 0/
Queue: WWW-Mechanize

People
Owner: Nobody in particular
Requestors: MYSOCIETY [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 1.30
Fixed in: (no value)



Subject: "Parsing of undecoded UTF-8 will give garbage" warnings
In a similar situation to #20274 (and #28815, but see below), WWW::Mechanize uses (in the title, _extract_links and _extract_images functions) HTML parsers (both HTML::TokeParser and HTML::HeadParser) that give the "Parsing of undecoded UTF-8 will give garbage" warning when supplied with UTF-8 data in bytes, which is what you get when fetching a UTF-8 encoded web page, with no way of setting utf8_mode on the parsers. For example:

    balti:~$ cat wwwmech.pl
    #!/usr/bin/perl -w
    use strict;
    use WWW::Mechanize;
    my $m = WWW::Mechanize->new();
    $m->get('http://www.pt-br.pledgebank.com/');
    $m->images();
    $m->title();
    balti:~$ perl wwwmech.pl
    Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/site_perl/5.8/cygwin/HTML/PullParser.pm line 83.
    Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/site_perl/5.8/cygwin/HTML/PullParser.pm line 83.
    Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/site_perl/5.8/WWW/Mechanize.pm line 509.
    balti:~$

The first "Parsing of..." warning comes from the call to HTML::Form->parse within get(), logged in #28815 and below; the second is from the call to images(); the third from title().

The fix is as in the other two bugs mentioned: calling utf8_mode(1) just after the parser is initialised. Given WWW::Mechanize itself has fetched the content, this is fine for the three functions mentioned at the start of this bug.

However, as per the comment I've just left on my #28815, HTML::Form->parse() wants to be passed decoded_content, and I don't see why that should be changed given it would break things already written. For this call (in update_html()), it seems we need to pass in decoded_content. This is available as $self->{res}->decoded_content() and would replace the $html on line 1952. I'm not sure whether this would have any other effects; I don't see why it should.
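
To illustrate the two options discussed above, here is a minimal standalone sketch. It is not the actual WWW::Mechanize patch. The sample HTML and all variable names are made up for illustration, and Encode's decode() stands in for what $res->decoded_content() would return for a UTF-8 page. It assumes an HTML::Parser recent enough to provide utf8_mode.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::TokeParser;
use HTML::HeadParser;

# Stand-in for $res->content: a UTF-8 encoded page held as raw bytes.
# (Hypothetical sample markup, not taken from the page in the report.)
my $bytes = encode('UTF-8',
    "<html><head><title>Promessa &amp; a\x{e7}\x{e3}o</title></head>"
  . "<body><img src=\"logo.png\" alt=\"caf\x{e9}\"></body></html>");

# Option 1: keep the bytes, but set utf8_mode(1) right after each
# parser is constructed. Results come back as UTF-8 bytes.
my $head = HTML::HeadParser->new;
$head->utf8_mode(1);
$head->parse($bytes);
my $title_bytes = $head->header('Title');

my $toke = HTML::TokeParser->new(\$bytes);
$toke->utf8_mode(1);
my @alts;
while (my $tag = $toke->get_tag('img')) {
    push @alts, $tag->[1]{alt};    # attribute hash is element 1
}

# Option 2: decode to characters first and parse those, which is
# what passing decoded_content to a parser amounts to.
my $chars = decode('UTF-8', $bytes);
my $head2 = HTML::HeadParser->new;
$head2->parse($chars);
my $title_chars = $head2->header('Title');

# Both routes should agree once the byte result is decoded.
print "titles agree\n"
    if decode('UTF-8', $title_bytes) eq $title_chars;
```

Option 1 matches the proposed change for title(), images() and links(), where Mechanize owns the raw bytes. Option 2 corresponds to the HTML::Form->parse() case, where callers already expect decoded characters.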