This queue is for tickets about the XML-Xerces CPAN distribution.

Report information
The Basics
Id: 7104
Status: resolved
Priority: 0/
Queue: XML-Xerces

People
Owner: jasons [...] cpan.org
Requestors: ekliao [...] yahoo.com
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 2.3.0-4
Fixed in: (no value)

Attachments
xml-xerces-problem-with-Chinese.zip



Subject: Chinese characters get lost in the XML::Xerces 'characters' callback subroutine
Distro: XML::Xerces 2.3.0-4
Perl: ActivePerl v5.8.4 binary build 810
OS: MS Windows XP

Problem: When the source XML parsed by XML::Xerces contains a text node which contains a Chinese character, that Chinese character somehow turns into an empty string when it is passed to the characters call-back subroutine. The parsing does not generate errors. Attached code samples demonstrate this.

The Chinese character in question is in test.xml, inside the text node of the first project_number element:

    utf8 char here:(...)

(where ... is the Chinese character, U+6B63)

Run the test like this:

    perl xerces-sax2-counter.pl test.xml

This will produce an output file: xerces-sax2-counter.out.txt. Currently, the first line is:

    []

when it should be:

    [utf8 char here:(...)]

because of this line in the code:

    print O "[$str]\n";

I have added Perl 5.8 features such as use utf8 and binmode(..., ":utf8") in the code, but the Unicode Chinese character still got lost. I don't know if there is something in the XML::Xerces documentation that mentions the correct way of capturing a CJK character.

Thanks!
Download xml-xerces-problem-with-Chinese.zip
application/x-zip-compressed 2k


On Sun Jul 25 20:05:28 2004, guest wrote:
> Distro: XML::Xerces 2.3.0-4
> Perl: ActivePerl v5.8.4 binary build 810
> OS: MS Windows XP
>
> Problem: When the source XML parsed by XML::Xerces contains a text
> node which contains a Chinese character, that Chinese character
> somehow turns into an empty string when it is passed to the
> characters call-back subroutine. The parsing does not generate
> errors. Attached code samples demonstrate this. The Chinese
> character in question is in test.xml, inside the text node of the
> first project_number element:
>
This was an error in the SAX2 callbacks: they were transcoding to the local code page by default, and not to UTF-8, so all Unicode was lost. This is now fixed in 2.7, please try again!
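
To illustrate why transcoding to a local code page loses the character while UTF-8 does not, here is a minimal sketch using only Perl's core Encode module. It is not XML::Xerces code and does not reproduce the binding's internals. cp1252 is only an assumed stand-in for a "local code page" on a Western Windows install, and the exact failure mode in the ticket (an empty string) differed from the substitution shown here.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The text node from the report, with U+6B63 written as an escape.
my $text = "utf8 char here:(\x{6B63})";

# cp1252 (assumed example of a local code page) has no mapping for U+6B63,
# so Encode's default behaviour replaces it with a substitution character.
my $local_bytes = encode('cp1252', $text);

# UTF-8 can represent every Unicode code point, so nothing is lost and the
# bytes decode back to the original character string.
my $utf8_bytes = encode('UTF-8', $text);
my $roundtrip  = decode('UTF-8', $utf8_bytes);

binmode(STDOUT, ':encoding(UTF-8)');
print "cp1252 : [$local_bytes]\n";
print "UTF-8  : [$roundtrip]\n";
printf "lossless round trip: %s\n", ($roundtrip eq $text ? 'yes' : 'no');
```

Once the original character has been dropped or substituted during transcoding, no amount of use utf8 or binmode(..., ":utf8") in the calling script can bring it back, which matches what the reporter saw.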