This queue is for tickets about the XML-Parser CPAN distribution.

Report information
The Basics
Id: 11899
Status: resolved
Priority: 0/
Queue: XML-Parser

People
Owner: Nobody in particular
Requestors: jrm+bug [...] wlug.org.nz
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 2.34
Fixed in: (no value)



Subject: XML::Parser screws up non-ascii text (and does so inconsistently)
Noticed on Debian Sarge (libexpat1 1.95.8-1, libxml-parser-perl 2.34-3, perl 5.8.4-6) and on Gentoo (expat-1.95.8, XML-Parser-2.34, perl-5.8.5).

XML::Parser is mangling non-ASCII characters: most of the time, accented characters are converted from UTF-8 down to ISO-8859-1. After much debugging, I determined that it isn't Expat.so doing it but Parser.pm, despite the documentation saying that all text is returned as UTF-8.

The attached tar file contains two XML files and a sample Perl script. The two XML files differ by only one character, but Perl handles them differently.

The perlunicode manpage says:

"If strings operating under byte semantics and strings with Unicode character data are concatenated, the new string will be created by decoding the byte strings as ISO 8859-1 (Latin-1) [...]"

Putting "use encoding 'utf8';" at the top of XML::Parser made Perl keep the string as UTF-8 instead of munging the accented characters. It also worked when I put it at the top of the script that defines the Char handler, but I think it really should be in XML::Parser if the module is meant to always return UTF-8, as it claims to do.

John McPherson
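The concatenation rule quoted from the manpage can be reproduced in isolation. The following is only a minimal sketch of that rule, not the code path inside Parser.pm (which this ticket does not show); the helper names `naive_join` and `safe_join` are hypothetical and exist purely for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: concatenate a character string with a raw byte
# string as-is. Perl treats each byte as a Latin-1 character.
sub naive_join {
    my ($chars, $bytes) = @_;
    return $chars . $bytes;
}

# Hypothetical helper: decode the UTF-8 bytes into characters first,
# then concatenate.
sub safe_join {
    my ($chars, $bytes) = @_;
    return $chars . decode('UTF-8', $bytes);
}

# Render a string as a list of code points so the output is plain ASCII.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $str;
}

my $bytes = "\xC3\xA9";                  # "e acute" as two raw UTF-8 bytes
my $chars = decode('UTF-8', "\xC3\xA9"); # the same letter as one character

print codepoints(naive_join($chars, $bytes)), "\n"; # U+00E9 U+00C3 U+00A9
print codepoints(safe_join($chars, $bytes)),  "\n"; # U+00E9 U+00E9
```

In the first line of output the two raw bytes have each become a separate Latin-1 character, which is the kind of mangling described above. Explicit decoding is used here rather than "use encoding 'utf8';" because the `encoding` pragma is deprecated in later Perl releases.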
Download demo.tar
application/x-tar 13.5k

Ticket migrated to github as https://github.com/toddr/XML-Parser/issues/30