This queue is for tickets about the libwww-perl CPAN distribution.

Report information
The Basics
Id: 69393
Status: resolved
Priority: 0/
Queue: libwww-perl

People
Owner: Nobody in particular
Requestors: vincent [...] vinc17.net
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 6.02
Fixed in: (no value)



Subject: Incorrect encoding handling for text/html files with LWP::Simple::get
When a file declared as iso-8859-1 and served as text/html is also a valid UTF-8 file, LWP::Simple::get regards it as a UTF-8 file. This is incorrect. Unless the website from which I fetch data recently changed, it seems to be a regression from libwww-perl 5.x.

For instance, with lwp-dump being

    #!/usr/bin/env perl

    use strict;
    use Devel::Peek;
    use LWP::Simple;

    @ARGV == 1 or die "Usage: $0 <URL>\n";
    my $url = shift;
    my $file = LWP::Simple::get($url);
    defined $file or die "$0: can't fetch $url\n";
    Dump $file;

and when running

    for i in 1a 1h 2a 2h
    do
      ./lwp-dump http://www.vinc17.net/test/perl-lwp-test$i.xml \
        2> perl-lwp-test$i.dump
    done

I get:

    ==> perl-lwp-test1a.dump <==
    SV = PV(0x194dac8) at 0x6a02d0
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x1308cd0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\203\302\251... A</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{c3}\x{a9}... A</root>\n"]
      CUR = 71
      LEN = 80

    ==> perl-lwp-test1h.dump <==
    SV = PV(0x194dac8) at 0x6a02d0
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x13097d0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... A</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{e9}... A</root>\n"]
      CUR = 69
      LEN = 80

    ==> perl-lwp-test2a.dump <==
    SV = PV(0x194dac8) at 0x6a02d0
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x1308cd0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\203\302\251... \303\203</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{c3}\x{a9}... \x{c3}</root>\n"]
      CUR = 72
      LEN = 80

    ==> perl-lwp-test2h.dump <==
    SV = PV(0x194dac8) at 0x6a02d0
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x1309850 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\203\302\251... \303\203</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{c3}\x{a9}... \x{c3}</root>\n"]
      CUR = 72
      LEN = 80

Due to the inconsistency between the 1h and 2h files, one cannot easily correct the variable to get the real data.
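
The difference between the 1h and 2h results can be illustrated with the core Encode module alone. The sketch below is not taken from libwww-perl. The two body strings are an assumption about the raw bytes, reconstructed from the dumps, and the helpers is_valid_utf8 and code_points are hypothetical names introduced for illustration. It shows that the byte pair C3 A9 decodes to two characters as iso-8859-1 but to one character as UTF-8, and that only the first test body is well-formed UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bodies reconstructed from the dumps (an assumption about the raw bytes).
my $body1 = "<root>post\xC3\xA9... A</root>\n";     # test1: ends in a plain "A"
my $body2 = "<root>post\xC3\xA9... \xC3</root>\n";  # test2: ends in a lone 0xC3 byte

# Hypothetical helper: returns 1 if the byte string is well-formed UTF-8, else 0.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# Hypothetical helper: lists the code points of a decoded string.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $str;
}

my $pair   = "\xC3\xA9";
my $latin1 = decode('iso-8859-1', $pair);   # two characters
my $utf8   = decode('UTF-8',      $pair);   # one character

print "as iso-8859-1: ", code_points($latin1), "\n";   # U+00C3 U+00A9
print "as UTF-8:      ", code_points($utf8),   "\n";   # U+00E9
print "test1 body valid UTF-8? ", is_valid_utf8($body1), "\n";   # 1
print "test2 body valid UTF-8? ", is_valid_utf8($body2), "\n";   # 0
```

A guesser that falls back to UTF-8 whenever the bytes happen to be well-formed would therefore treat the two files differently, which is consistent with the dumps.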
From: vincent [...] vinc17.org
Note: my examples are not HTML files, but this doesn't matter. I first thought the problem occurred for all text/* files (e.g. text/xml, which is why I just wrote basic XML files), but in fact only text/html seems to be affected.

Moreover, LWP::Simple::get is not sufficiently documented. This means that the other cases are potentially wrong too. Indeed, an old version didn't set the UTF8 flag, i.e. one just gets a sequence of bytes. This is probably what one expects for files without an HTTP charset (e.g. served as application/xml). Also, what happens if a file is sent as text/html with UTF-8 charset, but isn't a valid UTF-8 file?
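
For the last question, the two obvious policies can be sketched with the core Encode module. This only shows what Encode itself does with malformed input; which policy LWP::Simple::get ends up applying is not established in this ticket. The body string is a hypothetical example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical body: announced as UTF-8 but containing a lone 0xC3 byte.
my $bytes = "<root>post\xC3\xA9... \xC3</root>\n";

# Lenient decoding (Encode's default): malformed input is replaced by U+FFFD.
my $lenient = decode('UTF-8', $bytes);

# Strict decoding: croak on malformed input. Work on a copy because
# decode() with a CHECK value may modify its argument.
my $copy   = $bytes;
my $strict = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };

print "lenient result contains U+FFFD: ",
      (index($lenient, "\x{FFFD}") >= 0 ? "yes" : "no"), "\n";
print "strict result: ", (defined $strict ? "decoded" : "rejected"), "\n";
```

With the lenient policy the caller silently loses the original byte; with the strict one the caller at least learns that the declared charset was wrong.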
From: vincent [...] vinc17.org
The problem with the 1h file may come from HTTP::Message, if LWP::Simple::get uses decoded_content from HTTP::Message with a default charset guessed by content_charset(). Charset guessing should strictly follow the explicit rules from http://www.w3.org/TR/REC-html40/charset.html#spec-char-encoding to avoid inconsistencies like here.
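
A possible way to sidestep the guessing, sketched under assumptions: the response below is built locally rather than fetched, its body is reconstructed from the 1h dump, and it requires the HTTP::Message distribution to be installed. It reads the undecoded bytes with content, asks content_charset what it would guess (the result may vary between versions, so none is asserted here), and forces a charset through the charset option of decoded_content.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical response: text/html without a charset parameter;
# body bytes reconstructed from the 1h dump.
my $bytes = qq{<?xml version="1.0" encoding="iso-8859-1"?>\n}
          . "<root>post\xC3\xA9... A</root>\n";
my $res = HTTP::Response->new(200, 'OK', [ 'Content-Type' => 'text/html' ], $bytes);

# Undecoded bytes, exactly as received.
my $raw = $res->content;

# What the library would guess; the value may vary by version.
my $guessed = $res->content_charset;

# Explicit charset: overrides both the header and the guess.
my $forced = $res->decoded_content(charset => 'iso-8859-1');

print "guessed charset: ", (defined $guessed ? $guessed : '(none)'), "\n";
print "raw length:      ", length($raw), "\n";
print "forced length:   ", length($forced), "\n";
```

If the forced result has the same length as the raw bytes, each byte was mapped to one character, which is the iso-8859-1 reading the file declares.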