Bug #69393 for libwww-perl: Incorrect encoding handling for text/html files with LWP::Simple::get

Sun Jul 10 20:15:07 2011 vincent [...] vinc17.org - Ticket created

Subject:

Incorrect encoding handling for text/html files with LWP::Simple::get

When a file declared as iso-8859-1 and served as text/html is also a valid UTF-8 file, LWP::Simple::get regards it as a UTF-8 file. This is incorrect. Unless the website from which I fetch data recently changed, it seems to be a regression from libwww-perl 5.x.

For instance, with lwp-dump being

#!/usr/bin/env perl
use strict;
use Devel::Peek;
use LWP::Simple;
@ARGV == 1 or die "Usage: $0 <URL>\n";
my $url = shift;
my $file = LWP::Simple::get($url);
defined $file or die "$0: can't fetch $url\n";
Dump $file;

and when running

for i in 1a 1h 2a 2h
do
  ./lwp-dump http://www.vinc17.net/test/perl-lwp-test$i.xml \
    2> perl-lwp-test$i.dump
done

I get:

==> perl-lwp-test1a.dump <==
SV = PV(0x194dac8) at 0x6a02d0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1308cd0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\203\302\251... A</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{c3}\x{a9}... A</root>\n"]
  CUR = 71
  LEN = 80

==> perl-lwp-test1h.dump <==
SV = PV(0x194dac8) at 0x6a02d0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x13097d0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... A</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{e9}... A</root>\n"]
  CUR = 69
  LEN = 80

==> perl-lwp-test2a.dump <==
SV = PV(0x194dac8) at 0x6a02d0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1308cd0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\203\302\251... \303\203</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{c3}\x{a9}... \x{c3}</root>\n"]
  CUR = 72
  LEN = 80

==> perl-lwp-test2h.dump <==
SV = PV(0x194dac8) at 0x6a02d0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1309850 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\203\302\251... \303\203</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{c3}\x{a9}... \x{c3}</root>\n"]
  CUR = 72
  LEN = 80

Due to the inconsistency between the 1h and 2h files, one cannot easily correct the variable to get the real data.
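The difference between the 1h and 2h dumps is what one would expect from a "valid UTF-8 wins, otherwise ISO-8859-1" guess. The following is a minimal sketch of such a guess using only the core Encode module. The helper name guess_decode and the shortened sample bodies are illustrative assumptions; this is not the actual LWP or HTTP::Message code.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of LWP): try strict UTF-8 first and
# fall back to ISO-8859-1 when the bytes are not well-formed UTF-8.
sub guess_decode {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return defined $text ? $text : decode('ISO-8859-1', $bytes);
}

# Shortened stand-ins for the two test files; both are meant as ISO-8859-1.
my $body_1h = "<root>post\xC3\xA9... A</root>\n";       # happens to be valid UTF-8
my $body_2h = "<root>post\xC3\xA9... \xC3</root>\n";    # lone 0xC3: malformed UTF-8

# Print the code point of the character that follows "post".
printf "1h: U+%04X\n", ord substr(guess_decode($body_1h), 10, 1);   # 1h: U+00E9
printf "2h: U+%04X\n", ord substr(guess_decode($body_2h), 10, 1);   # 2h: U+00C3
```

With such a guess, the same two bytes after "post" come back as one character in the first case and as two characters in the second, so the caller cannot undo the decoding without knowing which branch was taken.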

Sun Jul 10 20:48:15 2011 vincent [...] vinc17.org - Correspondence added

From:

vincent [...] vinc17.org

Note: my examples are not HTML files, but this doesn't matter. I first thought the problem occurred for all text/* files (e.g. text/xml, which is why I just wrote basic XML files), but in fact only text/html seems to be affected. Moreover, LWP::Simple::get is not sufficiently documented, so the other cases are potentially wrong too. Indeed, an old version didn't set the UTF8 flag, i.e. one just got a sequence of bytes. This is probably what one expects for files without an HTTP charset (e.g. served as application/xml). Also, what happens if a file is sent as text/html with UTF-8 charset, but isn't a valid UTF-8 file?
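As a partial illustration of the last question, the core Encode module offers two outcomes for malformed UTF-8 input. Which one LWP::Simple::get ends up using is not established in this ticket, so the sketch below only shows Encode's own behavior; the sample bytes are made up.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Valid ISO-8859-1, but malformed UTF-8: 0xE9 is not followed by
# continuation bytes.
my $bytes = "post\xE9s";

# Default (non-strict) mode: malformed input is replaced by U+FFFD.
my $lenient = decode('UTF-8', $bytes);

# Strict mode: decode() dies instead, so eval returns undef here.
my $copy   = $bytes;    # FB_CROAK may modify its argument
my $strict = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };

printf "raw bytes: UTF8 flag %s\n", utf8::is_utf8($bytes) ? 'on' : 'off';
printf "lenient:   char 5 is U+%04X\n", ord substr($lenient, 4, 1);
printf "strict:    %s\n", defined $strict ? 'decoded' : 'rejected';
```

In the lenient case the original byte is lost, whereas a plain byte string (UTF8 flag off) still lets the caller decode it with the correct charset later.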

Sun Jul 10 20:48:16 2011 The RT System itself - Status changed from 'new' to 'open'

Sun Jul 10 21:09:32 2011 vincent [...] vinc17.org - Correspondence added

From:

vincent [...] vinc17.org

The problem with the 1h file may come from HTTP::Message, if LWP::Simple::get uses decoded_content with a default charset guessed by content_charset(). Charset guessing should strictly follow the explicit rules from http://www.w3.org/TR/REC-html40/charset.html#spec-char-encoding to avoid inconsistencies like the one shown here.
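A rough sketch of what "following the explicit rules" could look like: take the charset from the HTTP Content-Type header first, then from a META http-equiv declaration, and otherwise report that nothing was declared rather than inspecting the bytes. The helper name declared_charset and the regular expressions are illustrative assumptions, not HTTP::Message code, and the third rule of the specification (a charset attribute on an element designating an external resource) is left out.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of LWP or HTTP::Message).
# Returns the declared charset in lower case, or nothing if none is declared.
sub declared_charset {
    my ($content_type, $body) = @_;

    # 1. "charset" parameter of the HTTP Content-Type header.
    if (defined $content_type
        && $content_type =~ /;\s*charset\s*=\s*"?([^\s";]+)/i) {
        return lc $1;
    }

    # 2. <meta http-equiv="Content-Type" content="...; charset=...">
    #    (simplified pattern; a real implementation would parse the HTML).
    if ($body =~ /<meta\b[^>]*\bhttp-equiv\s*=\s*["']?content-type\b[^>]*\bcharset\s*=\s*["']?([^\s"';>]+)/i) {
        return lc $1;
    }

    # No declaration found: let the caller choose a documented default
    # instead of guessing from the content.
    return;
}

my $meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
print declared_charset('text/html; charset=ISO-8859-1', $meta), "\n";   # iso-8859-1
print declared_charset('text/html', $meta), "\n";                       # utf-8
```

Because the result depends only on declarations, two files with the same headers and the same META element would always be decoded the same way, regardless of whether their bytes happen to form valid UTF-8.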

Fri Mar 31 15:03:39 2017 olaf [...] wundersolutions.com - Correspondence added

Ticket migrated to github as https://github.com/libwww-perl/libwww-perl/issues/226

Fri Mar 31 15:03:50 2017 olaf [...] wundersolutions.com - Status changed from 'open' to 'resolved'