Subject: HTML::Form warns about "Parsing of undecoded UTF-8"
As with #20274, HTML::Form creates an HTML parser internally
(HTML::TokeParser in this case), and that parser gives the
"Parsing of undecoded UTF-8 will give garbage" warning when supplied
with UTF-8 data as bytes. There is no way of calling utf8_mode on the
parser from outside. For example:
matthew@balti:~$ cat htmlform.pl
#!/usr/bin/perl -w
use strict;
use HTML::Form;
HTML::Form->parse("<form>\xc3\xa9</form>", 'http://www.mysociety.org/');
matthew@balti:~$ perl htmlform.pl
Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/HTML/PullParser.pm line 83.
matthew@balti:~$
(WWW::Mechanize calls HTML::Form->parse() after every HTML request,
which is where I discovered the issue.)
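For callers who control the parse() call themselves, one possible workaround is to decode the bytes into a Perl character string before handing them over. The sketch below assumes the input really is UTF-8, and it captures warnings only to show whether the parser complained; the variable names are illustrative. It may not help when another module, such as WWW::Mechanize, makes the parse() call for you.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Form;

# Raw UTF-8 bytes for an e-acute, as in the report above.
my $bytes = "<form>\xc3\xa9</form>";

# Assumption: the input is UTF-8. Decode it to a character string
# so the parser no longer sees undecoded high-bit bytes.
my $chars = decode('UTF-8', $bytes);

# Collect any warnings instead of printing them to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# In list context parse() returns every form found in the document.
my @forms = HTML::Form->parse($chars, 'http://www.mysociety.org/');

print scalar(@forms), " form(s), ", scalar(@warnings), " warning(s)\n";
```

This only sidesteps the warning on the caller's side; it does not replace setting utf8_mode inside HTML::Form.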
I've attached a patch in the same style as the one already applied for
#20274; I'm not sure whether
http://www.nntp.perl.org/group/perl.libwww/2007/06/msg7016.html
needs looking at too. Hope that's helpful.
Subject: Form.pm.patch
--- Form.pm 2007-08-13 17:08:02.000000000 +0100
+++ patched/Form.pm 2007-08-13 17:08:16.000000000 +0100
@@ -115,6 +115,7 @@
require HTML::TokeParser;
my $p = HTML::TokeParser->new(ref($html) ? $html->decoded_content(ref => 1) : \$html);
+ $p->utf8_mode(1) if $] >= 5.008 && $HTML::Parser::VERSION >= 3.40;
eval {
# optimization
$p->report_tags(qw(form input textarea select optgroup option keygen label));