Bug #28815 for libwww-perl: HTML::Form warns about "Parsing of undecoded UTF-8"

Mon Aug 13 12:11:23 2007 MYSOCIETY [...] cpan.org - Ticket created

Subject:

HTML::Form warns about "Parsing of undecoded UTF-8"

In a similar situation to #20274, HTML::Form also contains the creation of an HTML parser (HTML::TokeParser in this case) that gives the "Parsing of undecoded UTF-8 will give garbage" warning when supplied with UTF-8 data in bytes, with no way of calling utf8_mode on the parser. For example:

    matthew@balti:~$ cat htmlform.pl
    #!/usr/bin/perl -w
    use strict;
    use HTML::Form;
    HTML::Form->parse("<form>\xc3\xa9</form>", 'http://www.mysociety.org/');
    matthew@balti:~$ perl htmlform.pl
    Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/HTML/PullParser.pm line 83.
    matthew@balti:~$

(WWW::Mechanize calls HTML::Form->parse() after every HTML request, which is where I discovered the issue.)

I've attached an identical style patch to that already applied for #20274; not sure if http://www.nntp.perl.org/group/perl.libwww/2007/06/msg7016.html needs looking at too. Hope that's helpful.

Subject:

Form.pm.patch

--- Form.pm 2007-08-13 17:08:02.000000000 +0100
+++ patched/Form.pm 2007-08-13 17:08:16.000000000 +0100
@@ -115,6 +115,7 @@
     require HTML::TokeParser;
     my $p = HTML::TokeParser->new(ref($html) ? $html->decoded_content(ref => 1) : \$html);
+    $p->utf8_mode(1) if $] >= 5.008 && $HTML::Parser::VERSION >= 3.40;
     eval {
         # optimization
         $p->report_tags(qw(form input textarea select optgroup option keygen label));

Tue Aug 14 15:00:16 2007 MYSOCIETY [...] cpan.org - Correspondence added

From:

mysociety [...] cpan.org

On Mon Aug 13 12:11:23 2007, I wrote:

> I've attached an identical style patch to that already applied for
> #20274

This is silly, sorry, given that if an object is passed it uses its decoded_content() method which might well contain proper UTF-8 characters. I guess this should actually be fixed in WWW::Mechanize, by making sure it passes in decoded_content rather than content as it currently does (it doesn't look like it uses that at all, though).
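As a rough illustration of the "pass decoded content" approach suggested here, the sketch below decodes the UTF-8 bytes into Perl characters before handing them to HTML::Form->parse(). It is a minimal sketch, not the change actually made in WWW::Mechanize: the variable names are illustrative, the input is the sample string from the original report, and the explicit Encode call stands in for what a response object's decoded_content() would normally provide. With character data as input, the "undecoded UTF-8" warning should not be triggered.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as they might arrive off the wire: "e acute" encoded as UTF-8.
# (Same sample string as in the original report.)
my $bytes = "<form>\xc3\xa9</form>";

# Decode to Perl characters first. This is an assumption standing in for
# decoded_content(); real code would take the charset from the response.
my $html = decode('UTF-8', $bytes);

# HTML::Form is not a core module, so only use it if it is installed.
if (eval { require HTML::Form; 1 }) {
    my @forms = HTML::Form->parse($html, 'http://www.mysociety.org/');
    print scalar(@forms), " form(s) parsed\n";
}
```

When a response object is available, passing the object itself to HTML::Form->parse() lets it call decoded_content() internally, as the patched code above shows.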

Tue Aug 14 15:07:55 2007 MYSOCIETY [...] cpan.org - Correspondence added

On Tue Aug 14 15:00:16 2007, I wrote:

> I guess this should actually be fixed in WWW::Mechanize, by
> making sure it passes in decoded_content rather than content as it
> currently does (it doesn't look like it uses that at all, though).

I've raised Ticket #28837 for all the "Parsing of undecoded UTF-8"s in WWW::Mechanize, including this one.

Tue Aug 14 15:08:15 2007 MYSOCIETY [...] cpan.org - Status changed from 'new' to 'open'

Mon Apr 14 04:31:46 2008 GAAS [...] cpan.org - Correspondence added

Assuming this means that this is a WWW::Mechanize issue then.

Mon Apr 14 04:31:48 2008 GAAS [...] cpan.org - Status changed from 'open' to 'rejected'