Subject: XMLin() mysteriously recodes UTF-8 input to Latin-1
Date: Mon, 17 Apr 2017 11:04:11 +0200
To: bug-XML-Simple [...] rt.cpan.org
From: Tore Anderson <tore [...] fud.no>
If I feed XMLin() an XML document containing UTF-8 data, the parsed
hash will have mysteriously been recoded to Latin-1. This is of course
not the desired behaviour.
The following test script and UTF-8 XML file (also attached)
demonstrate the problem:
[xmlintest.pl]
#! /usr/bin/perl
use Data::Dumper;
use XML::Simple;
my $xs = XML::Simple->new();
my $ref = $xs->XMLin('test.xml');
print "->" . $ref->{bar} . "<-\n";
print Dumper $ref->{bar};
open(my $fh, 'test.xml');
my $xml = <$fh>;
print $xml;
print Dumper $xml;
[test.xml]
<?xml version="1.0" encoding="utf-8"?><foo><bar>æøå</bar></foo>
The script produces the following output when run:
->���<-
$VAR1 = "\x{e6}\x{f8}\x{e5}";
<?xml version="1.0" encoding="utf-8"?><foo><bar>æøå</bar></foo>
$VAR1 = '<?xml version="1.0" encoding="utf-8"?><foo><bar>æøå</bar></foo>
';
Piping the output through hexdump -C produces the following output:
00000000 2d 3e e6 f8 e5 3c 2d 0a 24 56 41 52 31 20 3d 20 |->...<-.$VAR1 = |
"e6 f8 e5" is "æ ø å" in latin1. So it appears that the utf8 data that
was input to XMLin() has inexplicably been recoded to latin1, resulting
in garbage characters my terminal cannot print.
The last couple of lines show that this does not happen when perl reads
in and prints the XML file contents directly. Hexdump of that part of
the output:
00000050 3e 3c 62 61 72 3e c3 a6 c3 b8 c3 a5 3c 2f 62 61 |><bar>......</ba|
"c3 a6" = "æ", "c3 b8" = "ø", "c3 a5" = "å" - so this is proper and
correct UTF-8, which my terminal displays correctly too.
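The two byte sequences seen in the hexdumps can be reproduced independently of XML::Simple. The following is a minimal sketch using the core Encode module; the helper name hex_bytes is made up for illustration, and the characters are written as code points so the script does not depend on its own source encoding.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# The three characters from test.xml (ae, o-slash, a-ring), as code points.
my $chars = "\x{e6}\x{f8}\x{e5}";

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

# Encode the same characters in both encodings and compare the bytes.
my $as_utf8   = hex_bytes(encode('UTF-8',      $chars));
my $as_latin1 = hex_bytes(encode('ISO-8859-1', $chars));

print "UTF-8:   $as_utf8\n";      # c3 a6 c3 b8 c3 a5
print "Latin-1: $as_latin1\n";    # e6 f8 e5
```

The first line matches the hexdump of the file contents, the second matches the hexdump of the value printed from the XMLin() result.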
For what it's worth, my locale is of course all UTF-8:
$ locale
LANG=nn_NO.utf8
LC_CTYPE="nn_NO.utf8"
LC_NUMERIC="nn_NO.utf8"
LC_TIME="nn_NO.utf8"
LC_COLLATE="nn_NO.utf8"
LC_MONETARY="nn_NO.utf8"
LC_MESSAGES="nn_NO.utf8"
LC_PAPER="nn_NO.utf8"
LC_NAME="nn_NO.utf8"
LC_ADDRESS="nn_NO.utf8"
LC_TELEPHONE="nn_NO.utf8"
LC_MEASUREMENT="nn_NO.utf8"
LC_IDENTIFICATION="nn_NO.utf8"
LC_ALL=
As far as I can tell, there is no mention whatsoever of the latin1
character set anywhere in this environment. Why XML::Simple recodes the
XML data to it anyway is a mystery to me.
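One place such bytes can come from without any Latin-1 setting in the environment, offered as a sketch rather than a diagnosis: when Perl prints a string of characters in the range U+0080 to U+00FF to a filehandle that has no encoding layer, it writes one byte per character. Whether that is what happens to the value returned by XMLin() is an assumption; nothing in the output above confirms it. The helper name written_bytes is made up for illustration, and the sketch assumes perl is run without -C or PERL_UNICODE.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: print $string to an in-memory filehandle opened
# with the given mode, and return the bytes that were written, as hex.
sub written_bytes {
    my ($mode, $string) = @_;
    my $buffer = '';
    open(my $out, $mode, \$buffer)
        or die "cannot open in-memory handle: $!";
    print {$out} $string;
    close($out);
    return join ' ', map { sprintf '%02x', ord } split //, $buffer;
}

# Three characters (ae, o-slash, a-ring), not six UTF-8 bytes.
my $chars = "\x{e6}\x{f8}\x{e5}";

my $plain   = written_bytes('>',                 $chars);
my $layered = written_bytes('>:encoding(UTF-8)', $chars);

print "no layer:    $plain\n";      # e6 f8 e5
print "UTF-8 layer: $layered\n";    # c3 a6 c3 b8 c3 a5
```

If this applies here, calling binmode(STDOUT, ':encoding(UTF-8)') before printing $ref->{bar} in xmlintest.pl would be one way to test it.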
This is XML::Simple version 2.22-3 on Fedora 25. While this is not the
newest version, the changes in v2.23 and v2.24 don't seem relevant.
Tore