
This queue is for tickets about the libwww-perl CPAN distribution.

Report information
The Basics
Id: 28815
Status: rejected
Priority: 0/
Queue: libwww-perl

People
Owner: Nobody in particular
Requestors: MYSOCIETY [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 5.808
Fixed in: (no value)



Subject: HTML::Form warns about "Parsing of undecoded UTF-8"
In a similar situation to #20274, HTML::Form also contains the creation of an HTML parser (HTML::TokeParser in this case) that gives the "Parsing of undecoded UTF-8 will give garbage" warning when supplied with UTF-8 data in bytes, with no way of calling utf8_mode on the parser. For example:

    matthew@balti:~$ cat htmlform.pl
    #!/usr/bin/perl -w
    use strict;
    use HTML::Form;
    HTML::Form->parse("<form>\xc3\xa9</form>", 'http://www.mysociety.org/');
    matthew@balti:~$ perl htmlform.pl
    Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/HTML/PullParser.pm line 83.
    matthew@balti:~$

(WWW::Mechanize calls HTML::Form->parse() after every HTML request, which is where I discovered the issue.)

I've attached an identical style patch to that already applied for #20274; not sure if http://www.nntp.perl.org/group/perl.libwww/2007/06/msg7016.html needs looking at too.

Hope that's helpful.
Subject: Form.pm.patch
--- Form.pm 2007-08-13 17:08:02.000000000 +0100
+++ patched/Form.pm 2007-08-13 17:08:16.000000000 +0100
@@ -115,6 +115,7 @@
     require HTML::TokeParser;
     my $p = HTML::TokeParser->new(ref($html) ? $html->decoded_content(ref => 1) : \$html);
+    $p->utf8_mode(1) if $] >= 5.008 && $HTML::Parser::VERSION >= 3.40;
     eval {
         # optimization
         $p->report_tags(qw(form input textarea select optgroup option keygen label));
From: mysociety [...] cpan.org
On Mon Aug 13 12:11:23 2007, I wrote:
> I've attached an identical style patch to that already applied for
> #20274
This is silly, sorry, given that if an object is passed it uses its decoded_content() method which might well contain proper UTF-8 characters. I guess this should actually be fixed in WWW::Mechanize, by making sure it passes in decoded_content rather than content as it currently does (it doesn't look like it uses that at all, though).
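As a rough illustration of the "pass decoded content" idea, here is a minimal sketch of decoding the raw bytes on the caller's side before handing them to HTML::Form->parse(). The helper name html_chars_from_bytes is hypothetical, and the sketch assumes the input really is UTF-8; it is not what WWW::Mechanize itself does. The HTML::Form call is guarded so the script still runs where that module is not installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of libwww-perl): turn raw UTF-8 bytes
# into a Perl character string. Assumes the input is valid UTF-8.
sub html_chars_from_bytes {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes);
}

my $bytes = "<form>\xc3\xa9</form>";    # same input as in the report
my $chars = html_chars_from_bytes($bytes);

# The two-byte sequence \xc3\xa9 becomes the single character U+00E9.
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

# Parse the decoded string rather than the raw bytes, if HTML::Form is available.
if (eval { require HTML::Form; 1 }) {
    my @forms = HTML::Form->parse($chars, 'http://www.mysociety.org/');
    printf "forms found: %d\n", scalar @forms;
}
```

When an HTTP::Response object is at hand, passing the object itself should have a similar effect, since the patched code above already calls decoded_content() on references.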
On Tue Aug 14 15:00:16 2007, I wrote:
> I guess this should actually be fixed in WWW::Mechanize, by
> making sure it passes in decoded_content rather than content as it
> currently does (it doesn't look like it uses that at all, though).
I've raised Ticket #28837 for all the "Parsing of undecoded UTF-8"s in WWW::Mechanize, including this one.
Assuming this means that this is a WWW::Mechanize issue, then.