
This queue is for tickets about the libwww-perl CPAN distribution.

Report information
The Basics
Id: 48621
Status: resolved
Priority: 0/
Queue: libwww-perl

People
Owner: Nobody in particular
Requestors: 643opk102 [...] sneakemail.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: 5.830
Fixed in: (no value)



Subject: Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/share/perl5/HTTP/Message.pm line 264
$ grep '$VERSION =' /usr/share/perl5/HTTP/Message.pm
$VERSION = "5.828";

$ perl -wMLWP::Simple -e 'get ("http://jobboerse.arbeitsagentur.de/")'
Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/share/perl5/HTTP/Message.pm line 264.

Oops. That's not so good. Line 264 is in content_charset, enlisting HTML::Parser to parse the header.

I have attached the HTTP::Response object so the problem can be demonstrated without needing to connect to the website:

$ perl -wMLWP::UserAgent -MData::Dumper -e '
my $ua = new LWP::UserAgent;
my $r = $ua->get("http://jobboerse.arbeitsagentur.de/");
print Dumper $r;
' > /tmp/HTTP::Response-object

(I've attached the file)

$ perl -wMLWP -e '
my $r = do "/tmp/HTTP::Response-object";
$r->content_charset;
'

I've got the mad idea that stripping/killing all 8-bit-chars for the parser --- along the lines of a "tr [\200-\377] [\000-\177];" --- might work, if we're only looking for headers that are ASCII encoded, but I am convinced that that's not really the right way. I am also not sure I truly understand what HTML::Parser is trying to tell HTTP::Message.
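
For illustration only, here is a minimal sketch of the byte-stripping idea mentioned in the report. It is not taken from HTTP::Message and is not claimed to be the fix that was eventually applied. The helper name strip_high_bit and the sample $html string are hypothetical, and a plain regex stands in for HTML::Parser so that the sketch needs nothing beyond core Perl.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of libwww-perl): map every byte in the
# range 0x80-0xFF onto 0x00-0x7F, so that only 7-bit bytes remain.
sub strip_high_bit {
    my ($bytes) = @_;
    (my $ascii = $bytes) =~ tr/\x80-\xFF/\x00-\x7F/;
    return $ascii;
}

# Made-up sample document: an ASCII <meta> element followed by text that
# contains raw, undecoded UTF-8 bytes (0xC3 0xB6 is "o" with umlaut).
my $html = qq{<meta http-equiv="Content-Type" content="text/html; charset=utf-8"><title>Jobb\xC3\xB6rse</title>};

my $ascii = strip_high_bit($html);

# Assumption: a simple regex is enough to find the charset here; the real
# code uses HTML::Parser for this step.
my ($charset) = $ascii =~ /charset=["']?([\w-]+)/i;

print "$charset\n";    # prints "utf-8"
```

The ASCII-only markup survives the mapping unchanged, so the charset declaration can still be found, while the non-ASCII text is mangled. That is harmless only if the stripped copy is thrown away after the charset lookup, which is part of why this reads as a workaround rather than a proper fix.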
Fixed in http://github.com/gisle/libwww-perl/commit/84a9452eb58eeac7f988f68840e8231566caec45