
This queue is for tickets about the XML-Twig CPAN distribution.

Report information
The Basics
Id: 25261
Status: rejected
Priority: 0/
Queue: XML-Twig

People
Owner: Nobody in particular
Requestors: ddascalescu [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Parsing UTF-8 XMLs reduces Latin chars to bytes if keep_encoding => 0
Date: Fri, 2 Mar 2007 22:32:46 -0800
To: bug-XML-Twig [...] rt.cpan.org, "Michel Rodriguez" <mirod [...] xmltwig.com>
From: "Dan Dascalescu" <ddascalescu [...] gmail.com>
After parsing UTF8 XMLs, the ->text method of XML::Twig::Elt seems to encode Latin characters in the iso-8859-1 encoding. $twig->print dumps the correct byte sequence. The test I included in the attachment uses the following characters: ß, á, à.

Hope that helps,
Dan Dascalescu

#! perl -w
use strict;
use XML::Twig;

sub hex_dump($) {
    my $input = shift;
    my $result = "Input: <\n$input\n>\nHex dump:\n";
    while ($input =~ /./gs) {
        $result .= "<$&>" . sprintf "%02X ", ord($&);
    }
    return $result;
}

my $filename = shift;
open my $file_out, '>:raw', "$filename.out.xml";

# parse the UTF-8-encoded XML
my $twig= XML::Twig->new(
    keep_encoding => 0  # the default; '1' fixes the issue
);
$twig->parsefile($filename);

# dump element text
print $file_out "Element text dump:\n";
foreach my $elt ($twig->get_xpath('//seg')) {
    print $file_out ($elt->text), "\n";
}

# dump twig
print $file_out "\n\nTwig print:\n";
$twig->print($file_out);

# read the XML file with the UTF-8 discipline, and pass it through to the ':raw' output file
open my $file_in, '<:utf8', $filename or die $!;
undef $/;
print $file_out "\n\nPass-through:\n", <$file_in>;

__END__
XML file:
<?xml version='1.0' encoding='UTF-8' ?>
<tmx>
<seg>Latin chars: Schließen, á, à</seg>
<seg>Thai char: ว</seg>
<seg>Russian stuff: Обновить</seg>
</tmx>
Download UTF8_keep_encoding.zip
application/zip 848b


On Sat Mar 03 01:33:12 2007, ddascalescu@gmail.com wrote:
> After parsing UTF8 XMLs, the ->text method of XML::Twig::Elt seems to
> encode Latin characters in the iso-8859-1 encoding. $twig->print dumps
> the correct byte sequence. The test I included in the attachment uses
> the following characters: ß, á, à.
Hi Dan,

Indeed, it looks like the utf8 flag is not set on the string created by the text method. I have no idea why. I have to write some tests, with and without the keep_encoding option, to figure out exactly in which case the flag needs to be set.

__
mirod
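
For anyone wanting to inspect the utf8 flag mentioned above, here is a minimal sketch using Perl's built-in utf8::is_utf8 and utf8::upgrade. It does not involve XML::Twig at all; the variable names are made up for illustration, and the string is a hand-written stand-in for what ->text might return.

```perl
use strict;
use warnings;

# Two copies of the same one-character string, U+00DF (ß).
my $plain    = "\x{DF}";   # stored internally as a single byte, flag off
my $upgraded = "\x{DF}";
utf8::upgrade($upgraded);  # same character, stored internally as UTF-8, flag on

printf "plain flagged:    %s\n", utf8::is_utf8($plain)    ? 'yes' : 'no';
printf "upgraded flagged: %s\n", utf8::is_utf8($upgraded) ? 'yes' : 'no';
printf "equal as strings: %s\n", $plain eq $upgraded      ? 'yes' : 'no';
```

The two strings compare equal even though only one carries the flag, so the flag describes the internal storage rather than the characters themselves.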
Closing the report, a few years late.

In order to print utf8 characters, you need to specify the encoding when you open the file.

Writing
    open my $file_out, '>:utf8', "$filename.out.xml";
instead of
    open my $file_out, '>:raw', "$filename.out.xml";
does the right thing.

I believe this is normal Perl behaviour.

__
mirod
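
The difference between the two open calls can be sketched without XML::Twig. The example below is only an illustration: written_bytes is a hypothetical helper, the output goes to an in-memory handle rather than a file, and $text is a hand-built character string standing in for what ->text returns (an assumption, not something taken from the XML::Twig source).

```perl
use strict;
use warnings;

# "Schließen" as a character string; U+00DF (ß) fits in one Latin-1 byte.
my $text = "Schlie\x{DF}en";

# Hypothetical helper: print $string through the given I/O layer into an
# in-memory file and return the bytes that were written, as hex.
sub written_bytes {
    my ($layer, $string) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer or die "Cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh or die "Cannot close in-memory handle: $!";
    return join ' ', map { sprintf '%02X', ord($_) } split //, $buffer;
}

my $raw  = written_bytes(':raw',  $text);
my $utf8 = written_bytes(':utf8', $text);

print "raw:  $raw\n";    # raw:  53 63 68 6C 69 DF 65 6E
print "utf8: $utf8\n";   # utf8: 53 63 68 6C 69 C3 9F 65 6E
```

With ':raw' the ß goes out as the single byte DF, which matches the iso-8859-1 output described in the report; with ':utf8' it goes out as the two-byte sequence C3 9F. The ':encoding(UTF-8)' layer is a stricter alternative to ':utf8' and produces the same bytes for this string.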