Subject: Chinese characters get lost in the XML::Xerces 'characters' callback subroutine
Distro: XML::Xerces 2.3.0-4
Perl: ActivePerl v5.8.4 binary build 810
OS: MS Windows XP
Problem: When the source XML parsed by XML::Xerces contains a text node that includes a Chinese character, that Chinese character somehow turns into an empty string when it is passed to the characters callback subroutine. The parser does not report any errors. The attached code samples demonstrate this. The Chinese character in question is in test.xml, inside the text node of the first project_number element:
utf8 char here:(...)
(where ... is the Chinese character, U+6B63)
Run the test like this:
perl xerces-sax2-counter.pl test.xml
This will produce an output file: xerces-sax2-counter.out.txt.
Currently, the first line is:
[]
when it should be:
[utf8 char here:(...)]
because of this line in the code:
print O "[$str]\n";
I have added Perl 5.8 features such as "use utf8" and binmode(..., ":utf8") to the code, but the Chinese character is still lost.
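As a sanity check that does not involve XML::Xerces at all, a minimal sketch like the one below can show whether the ":utf8" output layer on its own preserves U+6B63. It is not one of the attached files: the scratch file name is hypothetical, and the string is built with chr() rather than passed through the parser. If the round trip succeeds, that would suggest (but not prove) that the character is lost before the string reaches the print statement in the callback.

```perl
use strict;
use warnings;

# Hypothetical scratch file name, not one of the attached files.
my $file = "utf8-layer-check.out.txt";

# Build U+6B63 from its code point so this script's source stays plain ASCII.
my $str = "utf8 char here:(" . chr(0x6B63) . ")";

# Write the string through a :utf8 layer, in the same format the callback uses.
open(my $out, ">:utf8", $file) or die "Cannot write $file: $!";
print $out "[$str]\n";
close($out) or die "Cannot close $file: $!";

# Read it back through the same layer and compare with the original string.
open(my $in, "<:utf8", $file) or die "Cannot read $file: $!";
my $line = <$in>;
close($in);
chomp $line;

my $ok = ($line eq "[$str]") ? "yes" : "no";
print "round trip ok: $ok\n";
```

Run it with "perl" followed by whatever name the script is saved under; it only reads and writes the scratch file named above.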
I don't know if there is something in the XML::Xerces documentation that mentions the correct way of capturing a CJK character.
Thanks!