Subject: | Multi-byte XML entities improperly decoded with utf8_flag on |
When the utf8_flag option is set while parsing a file, the result should be
data decoded into Perl characters. The XML entity decoder doesn't take this
into account and always returns a UTF-8 byte string instead. The result is
a string that mixes decoded characters with still-encoded bytes, which is
invalid.
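To illustrate the symptom without involving the module itself, here is a minimal sketch. It assumes the entity decoder hands back the UTF-8 bytes "\xC3\xA9" for &#233; (the variable names are only for illustration) and shows what happens when those bytes are joined to an already-decoded string:

```perl
use strict;
use warnings;
use Encode ();

# Text as the parser would hold it with utf8_flag on: decoded characters
# ("caf", U+00E9, and a space, so 5 characters).
my $text = Encode::decode_utf8("caf\xC3\xA9 ");

# Assumption: what the entity decoder returns for &#233; before the patch,
# i.e. the UTF-8 *bytes* for U+00E9 rather than the character itself.
my $entity_bytes = "\xC3\xA9";

# Joining them treats each byte as a Latin-1 character, so the entity
# becomes two characters (U+00C3, U+00A9) instead of one.
my $mixed = $text . $entity_bytes;

# What the caller expects: a single character for the entity.
my $expected = $text . chr(0xE9);

printf "mixed length:    %d\n", length $mixed;       # 7
printf "expected length: %d\n", length $expected;    # 6
```

When $mixed is later encoded to UTF-8 for output, the entity shows up double-encoded.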
I've attached a patch that corrects this behavior for me. When decoding an
XML character entity, it checks whether the source string is marked with
Perl's utf8 flag and, if so, decodes the entity into a Perl character
rather than a UTF-8 byte string. However, this will likely not work on
earlier Perl versions, since it relies on utf8::is_utf8 and Encode, which
Perl versions before 5.8 do not provide.
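The check described above can be tried in isolation with a sketch like the following. The names code_to_utf8_sketch and deref_sketch are hypothetical stand-ins, not the module's own functions, and the stand-in encoder only covers code points below 256:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for the module's code_to_utf8, limited to code
# points below 256: returns the UTF-8 byte sequence for the code point.
sub code_to_utf8_sketch {
    my ($code) = @_;
    return pack( 'C', $code ) if $code < 128;
    return pack( 'C2', 0xC0 | ( $code >> 6 ), 0x80 | ( $code & 0x3F ) );
}

# Mirrors the patched logic: decode the bytes into a Perl character only
# when the source string carries the utf8 flag.
sub deref_sketch {
    my ( $str, $num ) = @_;
    my $bytes = code_to_utf8_sketch($num);
    return utf8::is_utf8($str) ? Encode::decode_utf8($bytes) : $bytes;
}

# A string holding a wide character (U+263A) always has the utf8 flag on.
my $flagged   = "\x{263A} &#233;";
# A plain ASCII literal has the flag off.
my $unflagged = '&#233;';

my $as_char  = deref_sketch( $flagged,   233 );
my $as_bytes = deref_sketch( $unflagged, 233 );

printf "flagged source:   %d character(s)\n", length $as_char;    # 1
printf "unflagged source: %d byte(s)\n",      length $as_bytes;   # 2
```

With a flagged source string the entity comes back as the single character U+00E9. Otherwise the two UTF-8 bytes are returned unchanged, which preserves the existing behavior when utf8_flag is off.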
Subject: | xml-treepp-utf8-entities.patch |
--- lib/XML/TreePP.pm 2008-10-26 01:17:10.000000000 -0500
+++ lib/XML/TreePP.pm 2009-01-12 16:31:53.000000000 -0600
@@ -1205,12 +1205,10 @@
 sub char_deref {
     my( $str, $dec, $hex ) = @_;
-    if ( defined $dec ) {
-        return &code_to_utf8( $dec ) if ( $dec < 256 );
-    }
-    elsif ( defined $hex ) {
-        my $num = hex($hex);
-        return &code_to_utf8( $num ) if ( $num < 256 );
+    my $num = defined $dec ? $dec : defined $hex ? hex($hex) : undef;
+    if ( defined $num && $num < 256 ) {
+        my $char = &code_to_utf8( $num );
+        return utf8::is_utf8($str) ? Encode::decode_utf8( $char ) : $char;
     }
     return $str;
 }