Subject: UTF8 not being set properly on toString of Document
I am dealing with two platforms, RHEL4 and RHEL5. RHEL4 uses perl 5.8.5 and has XML::LibXML version 1.58, while RHEL5 uses
perl 5.8.8 and has XML::LibXML version 1.66.
We are in the early stages of transitioning a Perl-based system from RHEL4 to RHEL5 and have run into an issue where Unicode characters
are being double-encoded on the newer platform. If an XML file gets read and written many times over, it can explode in size because
each read/write cycle encodes the non-ASCII characters again, so their size doubles every time.
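To illustrate the growth, here is a minimal sketch that uses only the core Encode module. It is not our application code, and the helper name misencode_rounds is made up for this example. It assumes the mechanism is that already-encoded UTF-8 bytes get treated as Latin-1 characters and encoded again on each cycle.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# One character, U+00E4 (the "a with diaeresis" in "väl"), as a character string.
my $chars = "\x{e4}";

# Correct encoding: one character becomes two UTF-8 bytes.
my $bytes = encode_utf8($chars);

# Hypothetical helper simulating the bug: the bytes are mistaken for
# Latin-1 characters and UTF-8 encoded again, once per round.
sub misencode_rounds {
    my ($octets, $rounds) = @_;
    $octets = encode_utf8($octets) for 1 .. $rounds;
    return $octets;
}

# Report how many bytes the single character occupies after each round.
for my $rounds (0 .. 4) {
    printf "%d round(s): %d bytes\n", $rounds, length(misencode_rounds($bytes, $rounds));
}
```

Under that assumption the byte count for each non-ASCII character doubles on every round, while plain ASCII content is unaffected.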
From my digging it appears that $doc->toString() and $doc->documentElement()->ownerDocument()->toString() return different results on
the two platforms, which is the source of the issue. From what I am seeing, both return UTF-8 encoded data, but on the RHEL5 box the
UTF8 flag is not set on the result (as viewed via Devel::Peek). Calling Encode::decode_utf8 on the resulting string is needed to get
things to work comparably. Of course, we would rather not have to pepper the application code with Encode::decode_utf8 and make it
backward incompatible in the process.
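A minimal sketch of how the decode could at least be confined to one place and behave the same on both platforms. The helper name doc_to_chars is made up for this example. It assumes the document is serialized as UTF-8; for a document declaring another encoding the decode step would be wrong.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return the serialized document as a character string,
# whether toString() handed back characters (UTF8 flag set) or raw UTF-8 bytes.
# $doc is any object with a toString() method, e.g. an XML::LibXML::Document.
sub doc_to_chars {
    my ($doc) = @_;
    my $xml = $doc->toString();

    # Flag already set: treat as characters and leave untouched.
    return $xml if Encode::is_utf8($xml);

    # Flag not set: assume UTF-8 bytes (assumption, see above) and decode.
    return Encode::decode_utf8($xml);
}
```

Callers would then use doc_to_chars($doc) instead of $doc->toString(). Checking the flag first is what keeps the helper from decoding twice on the older XML::LibXML.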
Is this a regression? Could it be fixed in an upcoming release? Is there a better workaround than what I have found thus
far?
The following test code was used on both platforms. It passes on RHEL4, while two of the tests fail on the RHEL5 box. When I tested
the code on RHEL4 against XML::LibXML version 1.66, the same two tests failed there as well.
use strict;
use Test::More;
use XML::LibXML;
use Encode;
plan tests => 3;
my $xmldoc = <<EOXML;
<?xml version="1.0"?>
<properties>
<p name="key"><v>väl</v></p>
</properties>
EOXML
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string( $xmldoc );
#$doc->setEncoding('UTF-8');
#$doc->documentElement()->ownerDocument()->setEncoding('UTF-8');
#diag($doc->documentElement()->ownerDocument());
# documentElement
ok( Encode::is_utf8($doc->documentElement()->toString()), "documentElement toString utf8?" );
#diag($doc->documentElement()->ownerDocument()->actualEncoding());
#diag($doc->documentElement()->ownerDocument()->toString());
# documentElement ownerDocument
ok( Encode::is_utf8($doc->documentElement()->ownerDocument()->toString()), "ownerDocument toString utf8?" );
diag($doc->documentElement()->toString());
#diag($doc->actualEncoding());
#diag($doc->toString);
# document
ok( Encode::is_utf8($doc->toString()), "document toString utf8?" );