
This queue is for tickets about the XML-Simple CPAN distribution.

Report information
The Basics
Id: 121202
Status: rejected
Priority: 0/
Queue: XML-Simple

People
Owner: Nobody in particular
Requestors: tore [...] fud.no
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: XMLin() mysteriously recodes UTF-8 input to Latin-1
Date: Mon, 17 Apr 2017 11:04:11 +0200
To: bug-XML-Simple [...] rt.cpan.org
From: Tore Anderson <tore [...] fud.no>
If I feed XMLin() an XML document containing UTF-8 data, the parsed hash will have mysteriously been recoded to Latin-1. This is of course not the desired behaviour.

The following test script and UTF-8 XML file (also attached) demonstrate the problem:

[xmlintest.pl]
    #! /usr/bin/perl
    use Data::Dumper;
    use XML::Simple;
    my $xs = XML::Simple->new();
    my $ref = $xs->XMLin('test.xml');
    print "->" . $ref->{bar} . "<-\n";
    print Dumper $ref->{bar};
    open(my $fh, 'test.xml');
    my $xml = <$fh>;
    print $xml;
    print Dumper $xml;

[test.xml]
    <?xml version="1.0" encoding="utf-8"?><foo><bar>æøå</bar></foo>

The script produces the following output when run:

    ->���<-
    $VAR1 = "\x{e6}\x{f8}\x{e5}";
    <?xml version="1.0" encoding="utf-8"?><foo><bar>æøå</bar></foo>
    $VAR1 = '<?xml version="1.0" encoding="utf-8"?><foo><bar>æøå</bar></foo>
    ';

Piping the output through hexdump -C produces the following output:

    00000000 2d 3e e6 f8 e5 3c 2d 0a 24 56 41 52 31 20 3d 20 |->...<-.$VAR1 = |

"e6 f8 e5" is "æ ø å" in latin1. So it appears that the utf8 data that was input to XMLin() has inexplicably been recoded to latin1, resulting in garbage characters my terminal cannot print.

The last couple of lines show that this does not happen when perl reads in and prints the XML file contents directly. Hexdump of that part of the output:

    00000050 3e 3c 62 61 72 3e c3 a6 c3 b8 c3 a5 3c 2f 62 61 |><bar>......</ba|

"c3 a6" = "æ", "c3 b8" = "ø", "c3 a5" = "å" - so this is proper and correct UTF-8, which my terminal displays correctly too.
For what it's worth, my locale is of course all UTF-8:

    $ locale
    LANG=nn_NO.utf8
    LC_CTYPE="nn_NO.utf8"
    LC_NUMERIC="nn_NO.utf8"
    LC_TIME="nn_NO.utf8"
    LC_COLLATE="nn_NO.utf8"
    LC_MONETARY="nn_NO.utf8"
    LC_MESSAGES="nn_NO.utf8"
    LC_PAPER="nn_NO.utf8"
    LC_NAME="nn_NO.utf8"
    LC_ADDRESS="nn_NO.utf8"
    LC_TELEPHONE="nn_NO.utf8"
    LC_MEASUREMENT="nn_NO.utf8"
    LC_IDENTIFICATION="nn_NO.utf8"
    LC_ALL=

As far as I can tell, there is no mention whatsoever of the latin1 character set anywhere in this environment. Why XML::Simple recodes the XML data to it anyway is a mystery to me.

This is XML::Simple version 2.22-3 on Fedora 25. While this is not the newest version, the changes in v2.23 and v2.24 don't seem relevant.

Tore
Download test.xml
application/xml 67b


The problem you describe is not a problem with XML::Simple. Your test script is missing a declaration of the output encoding.

When you read data into a script (for example from an XML file), the data will be 'decoded' from the source encoding into Perl's internal representation of characters. When you output from a script, you need to declare an output encoding so that the data can be 'encoded' from the internal representation into whatever encoding you require.

You probably just need to add something like this:

    binmode(STDOUT, ':utf8');

Regards
Grant
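
The decode-on-input, encode-on-output model described above can be illustrated with a minimal sketch. It uses the core Encode module and a hard-coded byte string standing in for the file contents; the variable names are illustrative only and are not taken from the ticket.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes for "æøå", as they would sit in a file on disk.
my $bytes = "\xc3\xa6\xc3\xb8\xc3\xa5";

# Decoding turns the six bytes into three characters.
my $chars = decode('UTF-8', $bytes);
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

# The decoded characters have the code points that Data::Dumper
# displayed in the report: e6 f8 e5.
printf "code points: %s\n", join ' ', map { sprintf '%x', ord } split //, $chars;

# Encoding turns the characters back into bytes suitable for output.
my $out = encode('UTF-8', $chars);
print "round trip ok\n" if $out eq $bytes;
```

Under this reading, the "\x{e6}\x{f8}\x{e5}" seen in the report is the decoded character string, and the Latin-1 bytes appear only because those characters were printed without an output encoding.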
Subject: Re: [rt.cpan.org #121202] XMLin() mysteriously recodes UTF-8 input to Latin-1
Date: Mon, 17 Apr 2017 13:11:24 +0200
To: bug-XML-Simple [...] rt.cpan.org
From: Tore Anderson <tore [...] fud.no>
* Grant McLean via RT
> <URL: https://rt.cpan.org/Ticket/Display.html?id=121202 >
>
> The problem you describe is not a problem with XML::Simple. Your
> test script is missing a declaration of the output encoding.
>
> When you read data into a script (for example from an XML file), the
> data will be 'decoded' from the source encoding into Perl's internal
> representation of characters. When you output from a script, you
> need to declare an output encoding so that the data can be 'encoded'
> from the internal representation into whatever encoding you require.
>
> You probably just need to add something like this:
>
> binmode(STDOUT, ':utf8');

Thank you for clarifying. Considering that my locale is already UTF-8, it seems odd to me that this is not already the default. Oh well.

In any case, adding binmode(STDOUT, ':utf8') fixes XML::Simple-"filtered" output (the «print "->" . $ref->{bar} . "<-\n"» line), but at the same time it breaks the later print statement for data not passed through XML::Simple (the «print $xml» line). These now result in «...<bar>Ã¦Ã¸Ã¥</bar>...», which is typical when you take a string that is already UTF-8 and then feed it through a latin1-to-utf8 conversion.

So it seems something more is needed in order for this to work completely.

In any case, thanks again!

Tore
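
One plausible reading of the remaining breakage is that the raw file read is never decoded, so the output layer treats each byte as a Latin-1 character and encodes it a second time. The sketch below assumes that reading. It uses in-memory filehandles in place of test.xml and STDOUT so that the results can be inspected, and all variable names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for test.xml: the same document as raw UTF-8 bytes.
my $file_bytes = qq{<?xml version="1.0" encoding="utf-8"?>}
               . qq{<foo><bar>\xc3\xa6\xc3\xb8\xc3\xa5</bar></foo>\n};

# Read through a decoding layer so that $xml holds characters, not bytes.
open(my $in, '<:encoding(UTF-8)', \$file_bytes) or die "open: $!";
my $xml = <$in>;
close($in);

# Write through an encoding layer. For a terminal,
# binmode(STDOUT, ':encoding(UTF-8)') would play the same role.
my $written = '';
open(my $out, '>:encoding(UTF-8)', \$written) or die "open: $!";
print {$out} $xml;
close($out);
print "decoded input round-trips cleanly\n" if $written eq $file_bytes;

# For contrast: pushing the undecoded bytes through the same layer
# encodes each high byte a second time.
my $double = '';
open(my $bad, '>:encoding(UTF-8)', \$double) or die "open: $!";
print {$bad} $file_bytes;
close($bad);
printf "extra bytes from double encoding: %d\n",
    length($double) - length($file_bytes);
```

With both layers in place, data that came through XMLin() and data read directly from the file are character strings of the same kind, so a single output encoding serves both print statements.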