
This queue is for tickets about the XML-LibXML CPAN distribution.

Report information
The Basics
Id: 38347
Status: rejected
Priority: 0/
Queue: XML-LibXML

People
Owner: Nobody in particular
Requestors: dsteinbrunner [...] pobox.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: UTF8 not being set properly on toString of Document
I have two platforms I'm dealing with, RHEL4 and RHEL5. RHEL4 uses perl 5.8.5 and has XML::LibXML version 1.58, while RHEL5 uses perl 5.8.8 and has XML::LibXML version 1.66. We are in the early stages of transitioning a perl-based system from 4 to 5 and have run into an issue where Unicode characters are being double encoded on the newer platform. If an XML file gets read and written many times over, it can explode in size due to the exponential nature of the doubling.

From my digging, it appears that $doc->toString() and $doc->documentElement()->ownerDocument()->toString() return different results on the two platforms, which is the source of the issue. From what I am seeing, both platforms return UTF-8 strings, but on the RHEL5 box the UTF8 flag is not set (as viewed via Devel::Peek). Using Encode::decode_utf8 on the resulting string is needed to get things to work comparably. Of course, we would rather not have to pepper the application code with Encode::decode_utf8 and make it backward incompatible in the process.

Would this issue be a regression? Could it be fixed in an upcoming release? Is there a better workaround than what I have found thus far?

The following test code was used on both platforms. It passes on RHEL4, while two of the tests fail on the RHEL5 box. When I tested the code on RHEL4 but against XML::LibXML version 1.66, I found that the same two tests fail there as well.

use strict;
use Test::More;
use XML::LibXML;
use Encode;

# three ok() calls below, so plan for three tests
plan tests => 3;

my $xmldoc = <<EOXML;
<?xml version="1.0"?>
<properties>
  <p name="key"><v>v&#xe4;l</v></p>
</properties>
EOXML

my $parser = XML::LibXML->new();
my $doc = $parser->parse_string( $xmldoc );
#$doc->setEncoding('UTF-8');
#$doc->documentElement()->ownerDocument()->setEncoding('UTF-8');

#diag($doc->documentElement()->ownerDocument());

# documentElement
ok( Encode::is_utf8($doc->documentElement()->toString()), "utf8?" );

#diag($doc->documentElement()->ownerDocument()->actualEncoding());
#diag($doc->documentElement()->ownerDocument()->toString());

# documentElement ownerDocument
ok( Encode::is_utf8($doc->documentElement()->ownerDocument()->toString()), "utf8?" );

diag($doc->documentElement()->toString());
#diag($doc->actualEncoding());
#diag($doc->toString);

# document
ok( Encode::is_utf8($doc->toString()), "utf8?" );

The behavior of toString for document nodes changed in 1.63. The change was intentional and is well documented. Unlike $node->toString, which returns a string of characters, $document->toString returns a string of bytes in the document encoding, even if the document encoding happens to be UTF-8. Please update to the latest version of XML::LibXML on all your platforms.
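
For code that must run against both older and newer XML::LibXML releases during a migration, one possible approach is to decode the document serialization only when it comes back as bytes. This is a minimal sketch, not something proposed in the ticket itself: the helper name doc_to_chars is hypothetical, and the Encode::is_utf8 check is only a heuristic for telling the two behaviors apart (it mirrors the check used in the test script above). The sample document declares encoding="UTF-8" as an assumption so that the output encoding is explicit.

```perl
use strict;
use warnings;
use Encode ();
use XML::LibXML;

# Hypothetical helper: return the serialized document as a character string,
# whether this XML::LibXML release hands back characters or bytes.
sub doc_to_chars {
    my ($doc) = @_;
    my $out = $doc->toString();

    # Heuristic: a string that already carries the UTF8 flag is taken to be
    # a character string (the older behavior described in the report).
    return $out if Encode::is_utf8($out);

    # Otherwise treat it as bytes in the document's output encoding.
    return Encode::decode( $doc->actualEncoding(), $out );
}

# Assumption: sample input with an explicit UTF-8 declaration.
my $xml = qq{<?xml version="1.0" encoding="UTF-8"?>\n}
        . qq{<properties><p name="key"><v>v&#xe4;l</v></p></properties>\n};

my $doc   = XML::LibXML->new->parse_string($xml);
my $chars = doc_to_chars($doc);
```

If the serialized document is only ever written straight to disk, an alternative that avoids decoding altogether is to print the bytes from $doc->toString to a filehandle that has no encoding layer, so nothing re-encodes them on output.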