Subject: Incorrect encoding handling for text/html files with LWP::Simple::get
When a file that is declared as iso-8859-1 and served as text/html also
happens to be valid UTF-8, LWP::Simple::get decodes it as UTF-8. This is
incorrect. Unless the website from which I fetch data changed recently,
this appears to be a regression relative to libwww-perl 5.x.
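To illustrate why the choice of charset matters, here is a minimal offline
sketch. It does not use LWP at all; the byte string is an illustrative
assumption modelled on the test files used below. It shows how the same two
bytes decode under each charset:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode);

# Illustrative raw octets (an assumption): "post" followed by 0xC3 0xA9.
my $bytes = "post\xc3\xa9";

# Decoded as declared (iso-8859-1): two characters, U+00C3 and U+00A9.
my $as_latin1 = decode('iso-8859-1', $bytes);

# Decoded as UTF-8: the same two octets collapse into one character, U+00E9.
my $as_utf8 = decode('UTF-8', $bytes);

printf "iso-8859-1: %d chars, last is U+%04X\n",
    length($as_latin1), ord(substr($as_latin1, -1));
printf "UTF-8:      %d chars, last is U+%04X\n",
    length($as_utf8), ord(substr($as_utf8, -1));
```

The iso-8859-1 result has 6 characters and ends in U+00A9, while the UTF-8
result has 5 characters and ends in U+00E9, so the two interpretations
cannot be confused once decoded.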
For instance, with lwp-dump being
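#!/usr/bin/env perl
use strict;
use warnings;
use Devel::Peek;
use LWP::Simple;
@ARGV == 1 or die "Usage: $0 <URL>\n";
my $url = shift;
# get() returns undef on failure.
my $file = LWP::Simple::get($url);
defined $file or die "$0: can't fetch $url\n";
# Dump the internal representation (bytes and UTF8 flag) to stderr.
Dump $file;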
and when running
for i in 1a 1h 2a 2h
do
  ./lwp-dump http://www.vinc17.net/test/perl-lwp-test$i.xml \
    2> perl-lwp-test$i.dump
done
I get:
==> perl-lwp-test1a.dump <==
SV = PV(0x194dac8) at 0x6a02d0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1308cd0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\203\302\251... A</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{c3}\x{a9}... A</root>\n"]
  CUR = 71
  LEN = 80
==> perl-lwp-test1h.dump <==
SV = PV(0x194dac8) at 0x6a02d0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x13097d0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... A</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{e9}... A</root>\n"]
  CUR = 69
  LEN = 80
==> perl-lwp-test2a.dump <==
SV = PV(0x194dac8) at 0x6a02d0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1308cd0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\203\302\251... \303\203</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{c3}\x{a9}... \x{c3}</root>\n"]
  CUR = 72
  LEN = 80
==> perl-lwp-test2h.dump <==
SV = PV(0x194dac8) at 0x6a02d0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1309850 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\203\302\251... \303\203</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{c3}\x{a9}... \x{c3}</root>\n"]
  CUR = 72
  LEN = 80
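The 1h and 2h files are handled inconsistently: the body of 1h, which
happens to be valid UTF-8, was decoded as UTF-8 (\x{e9}), while that of
2h, which is not valid UTF-8, was decoded as iso-8859-1 (\x{c3}\x{a9}).
Because of this, one cannot easily fix up the returned string to recover
the real data.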
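A possible workaround, sketched below under some assumptions, is to bypass
LWP::Simple and ask LWP::UserAgent for the body without charset decoding
(decoded_content with charset => 'none'), then decode it oneself. The
helper decode_as_declared is hypothetical (it is not part of LWP). It
assumes the XML declaration is trustworthy and falls back to UTF-8 when no
encoding is declared. It has not been tested against the URLs above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of LWP): decode raw octets using the
# encoding named in the XML declaration, defaulting to UTF-8.
# Encode::decode dies if the declared encoding name is unknown.
sub decode_as_declared {
    my ($octets) = @_;
    my ($enc) = $octets
        =~ /\A<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z0-9._-]+)["']/;
    return decode($enc // 'UTF-8', $octets);
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    require LWP::UserAgent;
    my $url = shift;
    my $res = LWP::UserAgent->new->get($url);
    $res->is_success
        or die "$0: can't fetch $url: ", $res->status_line, "\n";
    # charset => 'none' undoes any Content-Encoding but leaves the
    # body as undecoded octets.
    my $octets = $res->decoded_content(charset => 'none');
    defined $octets or die "$0: can't decode content of $url\n";
    binmode STDOUT, ':encoding(UTF-8)';
    print decode_as_declared($octets);
}
```

With this approach the result depends only on the declared encoding and no
longer on whether the body happens to be valid UTF-8.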