Bug #42347 for XML-TreePP: Multi-byte XML entities improperly decoded with utf8_flag on

Mon Jan 12 18:00:21 2009 haarg [...] haarg.org - Ticket created

Subject:

Multi-byte XML entities improperly decoded with utf8_flag on

When the utf8_flag option is used to parse a file, the result should be data decoded into Perl characters. The decoding of XML entities doesn't take this into account, and always decodes into a UTF-8 byte string instead. The result then is a mixed decoded/encoded character string, which is invalid.

I've attached a patch that works for me to correct this behavior. When decoding the XML entities, it checks if the source string is marked with Perl's utf8 flag, and if so decodes the XML character entity into a Perl character rather than a UTF-8 byte string. However, this will likely be an invalid solution for earlier Perl versions.
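The mixed-string problem described above can be reproduced with core Perl alone. The following is a minimal sketch, not XML::TreePP code: the sample text and variable names are made up for illustration, and `Encode::encode_utf8( chr ... )` merely stands in for a dereference step that always returns UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode ();

# Text as a parser with utf8_flag on would return it: decoded Perl characters.
my $text = Encode::decode_utf8("caf\xC3\xA9 ");    # "café " is 5 characters

# A reference such as &#233; dereferenced into UTF-8 *bytes* (2 bytes),
# which is what the report says happens regardless of utf8_flag.
my $bytes = Encode::encode_utf8( chr 0xE9 );       # "\xC3\xA9"

# Concatenation treats each byte as a Latin-1 character, so the result
# ends in two wrong characters (U+00C3 U+00A9) instead of one U+00E9.
my $mixed = $text . $bytes;

# The approach in the attached patch: decode when the source is characters.
my $fixed = $text
    . ( utf8::is_utf8($text) ? Encode::decode_utf8($bytes) : $bytes );

printf "mixed: %d characters\n", length $mixed;
printf "fixed: %d characters\n", length $fixed;
```

The mixed string is one character longer than the fixed one, because the two UTF-8 bytes are each counted as a separate character.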

Subject:

xml-treepp-utf8-entities.patch

--- lib/XML/TreePP.pm 2008-10-26 01:17:10.000000000 -0500
+++ lib/XML/TreePP.pm 2009-01-12 16:31:53.000000000 -0600
@@ -1205,12 +1205,10 @@

 sub char_deref {
     my( $str, $dec, $hex ) = @_;
-    if ( defined $dec ) {
-        return &code_to_utf8( $dec ) if ( $dec < 256 );
-    }
-    elsif ( defined $hex ) {
-        my $num = hex($hex);
-        return &code_to_utf8( $num ) if ( $num < 256 );
+    my $num = defined $dec ? $dec : defined $hex ? hex($hex) : undef;
+    if ( defined $num && $num < 256 ) {
+        my $char = &code_to_utf8( $num );
+        return utf8::is_utf8($str) ? Encode::decode_utf8( $char ) : $char;
     }
     return $str;
 }
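To try the patched logic outside the module, it can be wrapped in a small standalone script. This is only a sketch under stated assumptions: `code_to_utf8` here is a stand-in written with Encode (the module's own helper is not shown in the ticket), and `deref_all` with its regular expression is a hypothetical driver, not how XML::TreePP calls `char_deref`.

```perl
use strict;
use warnings;
use Encode ();

# Stand-in for the module's code_to_utf8 helper (assumption): returns the
# UTF-8 bytes for a code point.
sub code_to_utf8 {
    my ($code) = @_;
    return Encode::encode_utf8( chr $code );
}

# The patched logic: dereference code points below 256 and match the
# result to the kind of string the reference was found in.
sub char_deref {
    my ( $str, $dec, $hex ) = @_;
    my $num = defined $dec ? $dec : defined $hex ? hex($hex) : undef;
    if ( defined $num && $num < 256 ) {
        my $char = code_to_utf8($num);
        return utf8::is_utf8($str) ? Encode::decode_utf8($char) : $char;
    }
    return $str;    # anything else is left as written
}

# Hypothetical driver: hand every numeric reference to char_deref.
sub deref_all {
    my ($text) = @_;
    $text =~ s/(&#(\d+);|&#x([0-9a-fA-F]+);)/char_deref( $1, $2, $3 )/ge;
    return $text;
}

# Byte string in, UTF-8 bytes out.
my $bytes = deref_all("na&#239;ve");

# Character string in (the euro sign forces a decoded string), characters out.
my $chars = deref_all( Encode::decode_utf8("na&#xEF;ve \xE2\x82\xAC") );

printf "bytes: %d, characters: %d\n", length $bytes, length $chars;
```

As in the patch, references at or above 256 are returned unchanged, so this sketch only covers the Latin-1 range.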

Thu Jan 15 10:33:28 2009 u-suke [...] kawa.net - Correspondence added

RT-Send-CC:

xml-treepp [...] yahoogroups.com

Thanks for your patch.

XML::TreePP's behavior of decoding character references between &#128; and &#255; is just a bonus trick for usual applications in the Latin-1 world. It would be too complex to support the UTF-8 flag in this function without ignoring Perl 5.005 users. It's better to limit the trick to references up to &#127;, isn't it? Plain US-ASCII would have no problem with the UTF-8 flag.

By the way, the trick could evolve into a new official option like 'decode_xmlref'. What do you think? I'll work on it.

---
Yusuke Kawasaki

On 2009/01/12 18:00:21, haarg wrote:

> When the utf8_flag option is used to parse a file, the result should be
> data decoded into Perl characters. The decoding of XML entities doesn't
> take this into account, and always decodes into a UTF-8 byte string
> instead. The result then is a mixed decoded/encoded character string,
> which is invalid.
>
> I've attached a patch that works for me to correct this behavior. When
> decoding the XML entities, it checks if the source string is marked with
> perl's utf8 flag, and if so decodes the XML character entity into a Perl
> character rather than a UTF-8 byte string. However, this will likely be
> an invalid solution for earlier perl versions.

Thu Jan 15 10:33:29 2009 The RT System itself - Status changed from 'new' to 'open'

Sat Jan 17 10:26:34 2009 u-suke [...] kawa.net - Correspondence added

XML::TreePP version 0.37 fixes the problem.

http://search.cpan.org/dist/XML-TreePP/
http://www.kawa.net/works/perl/treepp/dist/XML-TreePP-0.37.tar.gz
http://xml-treepp.googlecode.com/svn/trunk/XML-TreePP/Changes

2009/01/17 (0.37)
    * new option: xml_deref dereferences the numeric character references,
      like &#xEB;, &#28450; etc. Now UTF-8 flag is correctly treated.
      (thanks to haarg)
      http://rt.cpan.org/Public/Bug/Display.html?id=42347
    * without xml_deref option, the numeric character references between
      U+0080 and U+00FF are not dereferenced any more.
      the numeric character references up to U+007F and the predefined
      character entity references are still dereferenced per default.
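The rules in this changelog entry can be modelled in a few lines of core Perl. The sketch below is a hypothetical model, not XML::TreePP's implementation: `deref_numeric` and `deref_one` are invented names, the `xml_deref` key only mirrors the option name, and the character-versus-bytes decision is assumed to follow Perl's utf8 flag on the input, as in the original patch. Predefined entity references such as &amp;amp; are not handled here.

```perl
use strict;
use warnings;
use Encode ();

# Decide what one numeric reference becomes (hypothetical helper).
sub deref_one {
    my ( $ref, $code, $want_chars, $xml_deref ) = @_;

    # Default behaviour: only references up to U+007F are dereferenced.
    return $ref if $code > 0x7F && !$xml_deref;

    my $char = chr $code;
    # Character string in: return a character. Byte string in: UTF-8 bytes.
    return $want_chars ? $char : Encode::encode_utf8($char);
}

# Apply the rule to every decimal or hexadecimal reference in $text.
sub deref_numeric {
    my ( $text, %opt ) = @_;
    my $want_chars = utf8::is_utf8($text);
    $text =~ s{(&#(?:([0-9]+)|x([0-9A-Fa-f]+));)}{
        deref_one( $1, defined $2 ? $2 : hex $3, $want_chars, $opt{xml_deref} )
    }ge;
    return $text;
}

# Without xml_deref, &#65; is dereferenced but &#xEB; is left alone.
my $default = deref_numeric("&#65;&#xEB;");

# With xml_deref on a byte string, the result is UTF-8 bytes.
my $as_bytes = deref_numeric( "&#xEB;&#28450;", xml_deref => 1 );

# With xml_deref on a decoded string, the result is Perl characters.
my $decoded  = Encode::decode_utf8("\xE2\x82\xAC &#xEB;&#28450;");
my $as_chars = deref_numeric( $decoded, xml_deref => 1 );

printf "default: %s\n", $default;
printf "bytes: %d, characters: %d\n", length $as_bytes, length $as_chars;
```

With the module itself, both options named in this ticket would be passed to the constructor, along the lines of `XML::TreePP->new( utf8_flag => 1, xml_deref => 1 )`.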

Sat Jan 17 10:26:36 2009 u-suke [...] kawa.net - Status changed from 'open' to 'resolved'
