Subject: Parsing UTF-8 XMLs reduces Latin chars to bytes if keep_encoding => 0
Date: Fri, 2 Mar 2007 22:32:46 -0800
To: bug-XML-Twig [...] rt.cpan.org, "Michel Rodriguez" <mirod [...] xmltwig.com>
From: "Dan Dascalescu" <ddascalescu [...] gmail.com>
After parsing a UTF-8 XML file, the ->text method of XML::Twig::Elt seems
to return Latin characters encoded as ISO-8859-1 (one byte per character),
whereas $twig->print dumps the correct UTF-8 byte sequence. The test
included in the attachment uses the following characters: ß, á, à.
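
One possible explanation, offered as a sketch and not as a statement about XML::Twig internals: if ->text returns a decoded character string (an assumption), then printing it to a ':raw' handle writes one byte per character for code points below 256, which looks like ISO-8859-1. The snippet below reproduces that effect without XML::Twig; the helper hex_bytes and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render each byte of a string as two hex digits.
sub hex_bytes {
    return join ' ', map { sprintf '%02X', ord } split //, shift;
}

# Stand-in for the element text: the characters ß, á, à as code points.
my $text = "\x{DF}\x{E1}\x{E0}";

# Print the character string to an in-memory ':raw' handle, unencoded.
my $raw = '';
open my $raw_fh, '>:raw', \$raw or die "Cannot open in-memory handle: $!";
print {$raw_fh} $text;
close $raw_fh;

# Encode explicitly to UTF-8 before output.
my $utf8 = encode('UTF-8', $text);

print "raw:   ", hex_bytes($raw),  "\n";   # DF E1 E0
print "utf-8: ", hex_bytes($utf8), "\n";   # C3 9F C3 A1 C3 A0
```

If this is what happens in the test script, then encoding the text before printing, or opening the output file with an ':encoding(UTF-8)' layer, should yield the UTF-8 bytes for these characters.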
Hope that helps,
Dan Dascalescu
#! perl -w
use strict;
use XML::Twig;
sub hex_dump($) {
    my $input = shift;
    my $result = "Input: <\n$input\n>\nHex dump:\n";
    while ($input =~ /./gs) {
        $result .= "<$&>" . sprintf "%02X ", ord($&);
    }
    return $result;
}
my $filename = shift;
open my $file_out, '>:raw', "$filename.out.xml" or die $!;
# parse the UTF-8-encoded XML
my $twig = XML::Twig->new(
    keep_encoding => 0 # the default; '1' fixes the issue
);
$twig->parsefile($filename);
# dump element text
print $file_out "Element text dump:\n";
foreach my $elt ($twig->get_xpath('//seg')) {
    print $file_out ($elt->text), "\n";
}
# dump twig
print $file_out "\n\nTwig print:\n";
$twig->print($file_out);
# read the XML file with the UTF-8 discipline, and pass it through to
# the ':raw' output file
open my $file_in, '<:utf8', $filename or die $!;
undef $/; print $file_out "\n\nPass-through:\n", <$file_in>;
__END__
XML file:
<?xml version='1.0' encoding='UTF-8' ?>
<tmx>
<seg>Latin chars: Schließen, á, à</seg>
<seg>Thai char: ว</seg>
<seg>Russian stuff: Обновить</seg>
</tmx>