Subject: Mixing character encodings when using XMLin from a file handle under perl 5.10
Date: Sun, 15 Jun 2008 14:20:16 +0200
To: bug-XML-Simple [...] rt.cpan.org
From: Tomas Ögren <stric [...] acc.umu.se>
Hello.
When XML::Simple::XMLin() is used on a file handle, the first 4k of the
file is treated differently (encoding-wise) from the rest. When it is used
on a string holding the same data, the output is consistent.
In "v5.8.7 built for sun4-solaris" on Solaris 10/SPARC, the example
below reads the ISO-8859-1 input and outputs ISO-8859-1.
With "v5.10.0 built for i86pc-solaris-thread-multi" on Solaris 10/X86,
reading from a file handle outputs UTF-8 for the first 4k of input and
ISO-8859-1 after that. Reading from a string outputs UTF-8 all the way.
Both systems use XML-Simple-2.18.
Example code:
--------------8<-----------------
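use strict;
use warnings;
use XML::Simple;
use Data::Dumper;

my $file = "test.xml";

print "Use XMLin on the file handle:\n";
open(my $fh, '<', $file) or die "Cannot open $file: $!";
my $p = XMLin($fh);
close($fh);
print Dumper($p->{entry});

print "Use XMLin on a string:\n";
open($fh, '<', $file) or die "Cannot open $file: $!";
my $xml = join("", <$fh>);   # slurp the whole file into one string
close($fh);
$p = XMLin($xml);
print Dumper($p->{entry});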
--------------8<-----------------
Example output perl 5.10.0:
--------------8<-----------------
Use XMLin on the file handle:
$VAR1 = [
'Blah åäö',
'Blah åäö',
'Blah åäö',
'Blah åäö',
'Blah åäö',
'Blah åäö',
'Blah åäö',
'Blah åäö',
"Blah \x{e5}\x{e4}\x{f6}",
"Blah \x{e5}\x{e4}\x{f6}",
"Blah \x{e5}\x{e4}\x{f6}"
];
Use XMLin on a string:
$VAR1 = [
'Blah åäö',
'Blah åäö',
'Blah åäö',
'Blah åäö',
'Blah åäö',
'Blah åäö',
'Blah åäö',
'Blah åäö',
'Blah åäö',
'Blah åäö',
'Blah åäö'
];
--------------8<-----------------
Example output perl 5.8.7:
--------------8<-----------------
Use XMLin on the file handle:
$VAR1 = [
"Blah \x{e5}\x{e4}\x{f6}",
... repeated
];
Use XMLin on a string:
$VAR1 = [
"Blah \x{e5}\x{e4}\x{f6}",
... repeated
];
--------------8<-----------------
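The two forms in the Dumper output above can be hard to tell apart by eye, so here is a minimal sketch of a helper that reports how Perl stores a given string. The name describe_string is hypothetical and not part of XML::Simple; it relies only on the core utf8::is_utf8 and utf8::upgrade functions. The two sample strings are hand-built stand-ins for the two forms, not values taken from the parser.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Simple): summarise how Perl
# stores a string internally, with its length and the hex value of
# each element.
sub describe_string {
    my ($s) = @_;
    my $kind = utf8::is_utf8($s) ? 'character string' : 'byte string';
    my $hex  = join ' ', map { sprintf '%02X', ord } split //, $s;
    return sprintf '%s, length %d: %s', $kind, length($s), $hex;
}

# Stand-in for a decoded value: three code points, stored as characters.
my $chars = "\x{e5}\x{e4}\x{f6}";
utf8::upgrade($chars);

# Stand-in for an undecoded value: the UTF-8 encoding of the same text.
my $bytes = "\xC3\xA5\xC3\xA4\xC3\xB6";

print describe_string($chars), "\n";  # character string, length 3: E5 E4 F6
print describe_string($bytes), "\n";  # byte string, length 6: C3 A5 C3 A4 C3 B6
```

Calling it on each element of @{ $p->{entry} } from the example code should show where in the list the representation changes, assuming the entries come back as an array reference as in the output above.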
Input:
--------------8<-----------------
<?xml version="1.0" encoding="ISO-8859-1"?>
<thing>
<entry>Blah åäö</entry>
<junk>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</junk>
... entry and junk repeated 10 times ...
</thing>
--------------8<-----------------
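Since the input above is abbreviated, here is a rough sketch of a script that writes a file of the same shape. The function name write_test_xml is hypothetical, and the junk length (500) and entry count (11) are assumptions rather than the exact figures of my test file; they are only chosen so that the file is larger than 4k.

```perl
use strict;
use warnings;

# Hypothetical generator: writes an ISO-8859-1 file shaped like the
# input above. The count and junk length are assumptions, not the
# exact values of the original test file.
sub write_test_xml {
    my ($path, $count, $junk_len) = @_;
    # :raw so the bytes are written exactly as given, with no layers
    open(my $out, '>:raw', $path) or die "Cannot write $path: $!";
    print {$out} qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n<thing>\n};
    for (1 .. $count) {
        # \xE5 \xE4 \xF6 are the single-byte ISO-8859-1 codes for å ä ö
        print {$out} "<entry>Blah \xE5\xE4\xF6</entry>\n";
        print {$out} '<junk>', 'x' x $junk_len, "</junk>\n";
    }
    print {$out} "</thing>\n";
    close($out) or die "Cannot close $path: $!";
    return -s $path;   # file size in bytes
}

my $bytes = write_test_xml('test.xml', 11, 500);
print "Wrote $bytes bytes\n";
```

Running this in the same directory as the example code should give it a test.xml to read.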
/Tomas
--
Tomas Ögren, stric@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se