
This queue is for tickets about the XML-Simple CPAN distribution.

Report information
The Basics
Id: 36765
Status: open
Priority: 0/
Queue: XML-Simple

People
Owner: Nobody in particular
Requestors: stric [...] acc.umu.se
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Mixing character encodings when using XMLin from a file handle under perl 5.10
Date: Sun, 15 Jun 2008 14:20:16 +0200
To: bug-XML-Simple [...] rt.cpan.org
From: Tomas Ögren <stric [...] acc.umu.se>
Hello. When XML::Simple::XMLin() is used on a file handle, the first 4k of the file is treated differently (encoding-wise) than the rest. When using it on a string, the output is consistent. In "v5.8.7 built for sun4-solaris" on Solaris 10/SPARC, the example below reads the ISO-8859-1 input and outputs ISO-8859-1. With "v5.10.0 built for i86pc-solaris-thread-multi" on Solaris 10/X86, it outputs UTF-8 for the first 4k of input, then ISO-8859-1 when reading from a file. When reading from a string, it outputs UTF-8 all the way. Both using XML-Simple-2.18. Example code: --------------8<----------------- use XML::Simple; use Data::Dumper; print "Use XMLin on the file handle:\n"; open($fh, "test.xml"); $p = XMLin($fh); close($fh); print Dumper($p->{entry}); print "Use XMLin on a string:\n"; open($fh, "test.xml"); $a = join("",<$fh>); $p = XMLin($a); close($fh); print Dumper($p->{entry}); --------------8<----------------- Example output perl 5.10.0: --------------8<----------------- Use XMLin on the file handle: $VAR1 = [ 'Blah åäö', 'Blah åäö', 'Blah åäö', 'Blah åäö', 'Blah åäö', 'Blah åäö', 'Blah åäö', 'Blah åäö', "Blah \x{e5}\x{e4}\x{f6}", "Blah \x{e5}\x{e4}\x{f6}", "Blah \x{e5}\x{e4}\x{f6}" ]; Use XMLin on a string: $VAR1 = [ 'Blah åäö', 'Blah åäö', 'Blah åäö', 'Blah åäö', 'Blah åäö', 'Blah åäö', 'Blah åäö', 'Blah åäö', 'Blah åäö', 'Blah åäö', 'Blah åäö' ]; --------------8<----------------- Example output perl 5.8.7: --------------8<----------------- Use XMLin on the file handle: $VAR1 = [ "Blah \x{e5}\x{e4}\x{f6}", ... repeated ]; Use XMLin on a string: $VAR1 = [ "Blah \x{e5}\x{e4}\x{f6}", ... 
repeated ]; --------------8<----------------- Input: --------------8<----------------- <?xml version="1.0" encoding="ISO-8859-1"?> <thing> <entry>Blah åäö</entry> <junk>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</junk> ... entry and junk repeated 10 times ... </thing> --------------8<----------------- /Tomas -- Tomas Ögren, stric@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se
XML::Simple does not parse XML itself - this task is delegated to a parser module. If you have XML::SAX installed, the default SAX parser will be used; otherwise XML::Parser will be used if it is available.

It seems likely that in your case XML::SAX::PurePerl is your default SAX parser. You can confirm this by running 'make test' for the XML::Simple distribution - the initial test file outputs information about installed module versions. Please post that information back here on RT.

If XML::SAX::PurePerl is your default parser, then you should install either XML::SAX::Expat or XML::SAX::ExpatXS. Either of these modules should fix the problem, and both will also be *much* faster than the PurePerl module.
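
As a quicker check than running 'make test', the default SAX parser can also be queried directly. The following is only a minimal sketch: the helper name default_sax_parser is made up for illustration, and it assumes that XML::SAX::ParserFactory->parser() returns the same parser object XML::Simple would end up using on your system.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Simple): returns the class name of
# the parser that XML::SAX::ParserFactory hands out by default, or undef
# if XML::SAX is not installed or no parser could be created.
sub default_sax_parser {
    return unless eval { require XML::SAX::ParserFactory; 1 };
    my $parser = eval { XML::SAX::ParserFactory->parser() };
    return defined $parser ? ref $parser : undef;
}

my $name = default_sax_parser();
if (defined $name) {
    print "Default SAX parser: $name\n";
    print "Consider installing XML::SAX::Expat or XML::SAX::ExpatXS\n"
        if $name eq 'XML::SAX::PurePerl';
}
else {
    print "XML::SAX not usable here; XML::Simple would try XML::Parser\n";
}
```

If installing another SAX parser is not an option, the XML::Simple documentation also describes setting $XML::Simple::PREFERRED_PARSER (for example to 'XML::Parser') before calling XMLin(); whether that avoids the mixed-encoding output in this case has not been verified here.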