Subject: XML::Parser screws up non-ascii text (and does so inconsistently)
Noticed on Debian Sarge:
libexpat1 1.95.8-1
libxml-parser-perl 2.34-3
perl 5.8.4-6
and Gentoo:
expat-1.95.8
XML-Parser-2.34
perl-5.8.5
XML::Parser is mangling non-ASCII characters: most of the time, accented characters are converted from UTF-8 down to ISO-8859-1. After much debugging, I determined that it isn't Expat.so doing this but Parser.pm, even though the documentation says that all text is returned as UTF-8.
The attached tar file contains two XML files and a sample Perl script. The two XML files differ by only one character, but Perl handles them differently. The perlunicode manpage says:
If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will be created by
decoding the byte strings as ISO 8859-1 (Latin-1) [...]
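A minimal sketch of the behaviour that passage describes, using only core Perl and the Encode module. The variable names are made up for illustration, and the two bytes below stand in for an accented character; this is not taken from the attached files.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "e acute" (U+00E9) as raw UTF-8 bytes: a byte string, two bytes long.
my $bytes = "\xC3\xA9";

# The same two bytes decoded into a character string: one character long.
my $chars = decode('UTF-8', "\xC3\xA9");

# A string holding a code point above 0xFF, so it must be Unicode character data.
my $wide = "\x{100}";

# Concatenation treats each byte of $bytes as a Latin-1 character,
# so the UTF-8 pair becomes two separate characters (U+00C3, U+00A9).
my $mixed = $bytes . $wide;

printf "bytes=%d chars=%d mixed=%d\n",
    length($bytes), length($chars), length($mixed);
printf "mixed code points: %s\n",
    join ' ', map { sprintf 'U+%04X', ord } split //, $mixed;
```

Here `$mixed` ends up three characters long rather than two, which is the kind of silent reinterpretation the manpage warns about.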
Anyway, putting "use encoding 'utf8';" at the top of XML::Parser made Perl keep the string as UTF-8 instead of munging the accented characters. It also worked when I put it at the top of the script that has the Char handler, but I think it really should be in XML::Parser if it is meant to always return UTF-8 as it claims to.
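As far as I know, one of the things the encoding pragma does is put an encoding layer on STDOUT, which might be why it helps from either file. The sketch below is an assumption about the mechanism, not something confirmed against the attached files: it prints the same character string with and without an explicit UTF-8 output layer. The helper bytes_written and the in-memory filehandles are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: print $text to an in-memory filehandle opened with
# the given mode and return the raw bytes that ended up in the buffer.
sub bytes_written {
    my ($mode, $text) = @_;
    my $buf = '';
    open(my $fh, $mode, \$buf) or die "cannot open in-memory handle: $!";
    print {$fh} $text;
    close($fh) or die "cannot close in-memory handle: $!";
    return $buf;
}

# A character string containing U+00E9, stored internally as UTF-8,
# standing in for text handed to a Char handler.
my $text = "caf\x{E9}";
utf8::upgrade($text);

my $no_layer   = bytes_written('>', $text);                  # no output layer
my $with_layer = bytes_written('>:encoding(UTF-8)', $text);  # explicit UTF-8 layer

for my $pair (['no layer', $no_layer], ['with layer', $with_layer]) {
    my ($label, $out) = @$pair;
    printf "%-10s %s\n", $label,
        join ' ', map { sprintf '%02X', ord } split //, $out;
}
```

Without a layer the accented character comes out as the single Latin-1 byte E9; with the layer it comes out as the UTF-8 pair C3 A9. In a real script the equivalent would be calling binmode(STDOUT, ':encoding(UTF-8)') before parsing.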
John McPherson