Subject: Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/share/perl5/HTTP/Message.pm line 264
$ grep '$VERSION =' /usr/share/perl5/HTTP/Message.pm
$VERSION = "5.828";
$ perl -wMLWP::Simple -e 'get ("http://jobboerse.arbeitsagentur.de/")'
Parsing of undecoded UTF-8 will give garbage when decoding entities at
/usr/share/perl5/HTTP/Message.pm line 264.
Oops. That's not so good.
Line 264 is in content_charset, which enlists HTML::Parser to parse the header.
I have attached the HTTP::Response object so the problem can be
demonstrated without needing to connect to the website:
$ perl -wMLWP::UserAgent -MData::Dumper -e '
my $ua = LWP::UserAgent->new;
my $r = $ua->get("http://jobboerse.arbeitsagentur.de/");
print Dumper $r;
' > /tmp/HTTP::Response-object
(I've attached the file)
$ perl -wMLWP -e '
my $r = do "/tmp/HTTP::Response-object";
$r->content_charset;
'
I've got the mad idea that stripping all 8-bit characters before handing
the content to the parser (along the lines of a
"tr [\200-\377] [\000-\177];") might work, since we're only looking for
headers that are ASCII-encoded, but I am convinced that's not really the
right way. I am also not sure I truly understand what HTML::Parser is
trying to tell HTTP::Message.