
This queue is for tickets about the libwww-perl CPAN distribution.

Report information
The Basics
Id: 20274
Status: resolved
Priority: 0/
Queue: libwww-perl

People
Owner: Nobody in particular
Requestors: imacat [...] mail.imacat.idv.tw
Cc: scop [...] cpan.org
AdminCc:

Bug Information
Severity: Normal
Broken in: 5.805
Fixed in: (no value)

Subject: HTML::HeadParser Complaints for Parsing Undecoded UTF-8
Hi. This is imacat from Taiwan.

I got warnings when using LWP::UserAgent on web sites with UTF-8 pages. I have tried to dig into the code. It seems that HTML::HeadParser is not satisfied with undecoded UTF-8 data. I do not know why HTML::HeadParser is not satisfied.

I attempted to make a patch to solve this, and the warnings are gone. But I do not know if this patch (parsing raw undecoded UTF-8) is a good idea. Maybe you can look into this issue.

I have attached my patch. The error log is below. Please tell me if there is any problem. Thank you.

imacat@rinse /tmp % cat /tmp/test.pl
#! /usr/bin/perl -w
use LWP::UserAgent;
use vars qw($UA $url $r);
$UA = new LWP::UserAgent;
$url = "http://zh.wikipedia.org/";
$r = $UA->get($url);
print "$url " . $r->status_line . "\n";
imacat@rinse /tmp % /tmp/test.pl
Parsing of undecoded UTF-8 will give garbage when decoding entities at /home/imacat/lib/perl5/LWP/Protocol.pm line 115.
http://zh.wikipedia.org/ 200 OK
imacat@rinse /tmp %
Sorry I forgot to attach my patch. Here it is. Sorry for the disturbance.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

diff -u -r libwww-perl-5.805.orig/lib/LWP/Protocol.pm libwww-perl-5.805/lib/LWP/Protocol.pm
- --- libwww-perl-5.805.orig/lib/LWP/Protocol.pm 2004-11-12 21:34:10.000000000 +0800
+++ libwww-perl-5.805/lib/LWP/Protocol.pm 2006-07-05 00:45:05.000000000 +0800
@@ -104,6 +104,7 @@
     if ($parse_head && $response->content_type eq 'text/html') {
         require HTML::HeadParser;
         $parser = HTML::HeadParser->new($response->{'_headers'});
+        $parser->utf8_mode(1);
     }
     my $content_size = 0;
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)

iD8DBQFEqptii9gubzC5S1wRAvHqAJ4zsxBTvoVFV+9MVX9cDK1rz0SgRgCeLbOM
LWBd8tpfdtF/yELWyPAsTQo=
=sfar
-----END PGP SIGNATURE-----
Well, I have reviewed the HTML::Parser POD and its code again. I believe my patch on $parser->utf8_mode(1) is the correct answer. Could you please fix it? Thank you.
From: imacat [...] mail.imacat.idv.tw
Hi. This is imacat from Taiwan. Here is a revised patch that also works with older Perl (< 5.8), which does not have UTF-8 mode, and with older HTML::Parser (< 3.40), which does not have utf8_mode. The previous patch does not work in either of those cases. Please use this patch instead of the previous one. Thank you.
Download libwww-perl-5.805-u8parse-2.diff.asc
application/octet-stream 775b

Message body not shown because it is not plain text.
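
The body of the revised patch is not displayed in this ticket, so the following is only a minimal sketch of the kind of guard the message describes, not the patch itself. The helper name enable_utf8_mode is hypothetical and not part of libwww-perl. The two conditions (Perl 5.8 or later, and a parser that provides utf8_mode) are taken from the message above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of libwww-perl): turn on utf8_mode only
# when both the running Perl and the parser object support it.
# Returns 1 if utf8_mode was enabled, 0 otherwise.
sub enable_utf8_mode {
    my ($parser) = @_;
    return 0 if $] < 5.008;                      # Perl older than 5.8
    return 0 unless $parser->can('utf8_mode');   # parser lacks the method
    $parser->utf8_mode(1);
    return 1;
}

# Usage example; skipped silently if HTML::HeadParser is not installed.
if (eval { require HTML::HeadParser; 1 }) {
    my $parser  = HTML::HeadParser->new;
    my $enabled = enable_utf8_mode($parser);
    print "utf8_mode ", ($enabled ? "enabled" : "not available"), "\n";
}
```

Checking with can() instead of comparing $HTML::Parser::VERSION keeps the guard independent of version-string formats, but a version comparison against 3.40 would express the same condition.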

Applied. In 5.806.