
This queue is for tickets about the XML-TreePP CPAN distribution.

Report information
The Basics
Id: 42347
Status: resolved
Priority: 0/
Queue: XML-TreePP

People
Owner: Nobody in particular
Requestors: graham [...] plainblack.com
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 0.36
Fixed in: (no value)



Subject: Multi-byte XML entities improperly decoded with utf8_flag on
When the utf8_flag option is used to parse a file, the result should be data decoded into Perl characters. The decoding of XML entities doesn't take this into account, and always decodes into a UTF-8 byte string instead. The result then is a mixed decoded/encoded character string, which is invalid.

I've attached a patch that works for me to correct this behavior. When decoding the XML entities, it checks if the source string is marked with perl's utf8 flag, and if so decodes the XML character entity into a Perl character rather than a UTF-8 byte string. However, this will likely be an invalid solution for earlier perl versions.
Subject: xml-treepp-utf8-entities.patch
--- lib/XML/TreePP.pm 2008-10-26 01:17:10.000000000 -0500
+++ lib/XML/TreePP.pm 2009-01-12 16:31:53.000000000 -0600
@@ -1205,12 +1205,10 @@
 
 sub char_deref {
     my( $str, $dec, $hex ) = @_;
-    if ( defined $dec ) {
-        return &code_to_utf8( $dec ) if ( $dec < 256 );
-    }
-    elsif ( defined $hex ) {
-        my $num = hex($hex);
-        return &code_to_utf8( $num ) if ( $num < 256 );
+    my $num = defined $dec ? $dec : defined $hex ? hex($hex) : undef;
+    if ( defined $num && $num < 256 ) {
+        my $char = &code_to_utf8( $num );
+        return utf8::is_utf8($str) ? Encode::decode_utf8( $char ) : $char;
     }
     return $str;
 }
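The mixed decoded/encoded string described in the report can be reproduced with core Perl alone. The sketch below is illustrative only and does not call XML::TreePP: the variable names and the sample text are assumptions, with `$text` standing in for data that utf8_flag has already decoded and `$bytes` standing in for what the unpatched char_deref returns.

```perl
use strict;
use warnings;
use Encode ();

# U+00EB as a single Perl character and as its UTF-8 byte encoding.
my $char  = chr(0xEB);
my $bytes = Encode::encode_utf8($char);    # two bytes: 0xC3 0xAB

# A decoded character string, standing in for utf8_flag output ("cafe " with e-acute).
my $text = Encode::decode_utf8("caf\xC3\xA9 ");

# Appending raw UTF-8 bytes to a character string: Perl treats each byte
# as a Latin-1 character, so the entity turns into two wrong characters.
my $mixed = $text . $bytes;

# Appending the decoded character keeps the string consistent.
my $correct = $text . $char;

printf "mixed:   %d characters\n", length $mixed;
printf "correct: %d characters\n", length $correct;
```

The mixed string ends up one character longer than the correct one, which is the inconsistency the patch avoids by decoding the entity whenever the source string carries the utf8 flag.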
RT-Send-CC: xml-treepp [...] yahoogroups.com
Thanks for your patch.

XML::TreePP's decoding of character references from &#0; to &#255; is just a bonus trick for typical applications in the Latin-1 world. Supporting the UTF-8 flag in this function without dropping Perl 5.005 users would be too complex.

It would be better to limit the trick to &#127; and below, wouldn't it? Plain US-ASCII has no problem with the UTF-8 flag.

By the way, the trick could evolve into a new official option such as 'decode_xmlref'. What do you think? I'll work on it.

---
Yusuke Kawasaki

On 2009/01/12 18:00:21, haarg wrote:
> When the utf8_flag option is used to parse a file, the result should be
> data decoded into Perl characters. The decoding of XML entities doesn't
> take this into account, and always decodes into a UTF-8 byte string
> instead. The result then is a mixed decoded/encoded character string,
> which is invalid.
>
> I've attached a patch that works for me to correct this behavior. When
> decoding the XML entities, it checks if the source string is marked with
> perl's utf8 flag, and if so decodes the XML character entity into a Perl
> character rather than a UTF-8 byte string. However, this will likely be
> an invalid solution for earlier perl versions.
XML::TreePP version 0.37 fixes the problem.

http://search.cpan.org/dist/XML-TreePP/
http://www.kawa.net/works/perl/treepp/dist/XML-TreePP-0.37.tar.gz
http://xml-treepp.googlecode.com/svn/trunk/XML-TreePP/Changes

2009/01/17 (0.37)
    * new option: xml_deref dereferences the numeric character
      references, like &#xEB;, &#28450; etc. Now UTF-8 flag is
      correctly treated. (thanks to haarg)
      http://rt.cpan.org/Public/Bug/Display.html?id=42347
    * without xml_deref option, the numeric character references
      between U+0080 and U+00FF are not dereferenced any more.
      the numeric character references up to U+007F and
      the predefined character entity references are still
      dereferenced per default.
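A minimal sketch of how the new option might be combined with utf8_flag, assuming XML::TreePP 0.37 or later is installed from CPAN. The sample document and variable names are made up for illustration, and the script simply reports that the module is missing when it cannot be loaded.

```perl
use strict;
use warnings;

my $value;

# XML::TreePP is not a core module, so load it defensively and require
# the release that introduced xml_deref.
if ( eval { require XML::TreePP; XML::TreePP->VERSION(0.37); 1 } ) {
    # utf8_flag asks for decoded Perl characters; xml_deref asks for
    # numeric character references to be dereferenced.
    my $tpp  = XML::TreePP->new( utf8_flag => 1, xml_deref => 1 );
    my $tree = $tpp->parse('<root>&#xEB;</root>');
    $value = $tree->{root};

    printf "length=%d utf8 flag %s\n",
        length $value,
        utf8::is_utf8($value) ? 'on' : 'off';
}
else {
    print "XML::TreePP 0.37 or later is not installed\n";
}
```

If the fix behaves as the changelog describes, the parsed value should be a single decoded character rather than a two-byte UTF-8 sequence.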