Subject: | Some CDATA values are corrupted during serialization to non-unicode encodings. |
In serialized XML documents, characters such as '&' inside CDATA sections
keep their literal value (an ampersand). A Unicode character that the
document encoding does not support therefore cannot be expressed inside a
CDATA section, because character-reference syntax does not work there.
However, sometimes we need to write XML documents in legacy non-Unicode
encodings, and the documents can contain characters that these encodings
do not support. For example, such data can be obtained by parsing some
other document and then importing parts of it into the destination document.
Currently, in these cases XML::LibXML silently writes values like
'&#nnnn;' to the destination document. Unfortunately, when such a document
is read again, the value of the CDATA section no longer matches the initial
CDATA value: it holds the literal text '&#nnnn;' instead of the character.
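Until the library reports this itself, the condition can be detected before serializing by trying to encode the CDATA value strictly. Below is a minimal sketch using the core Encode module. Note that representable_in is a hypothetical helper name, not part of XML::LibXML, and the sketch assumes Encode knows the target encoding under the same name that libxml2 uses:

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK LEAVE_SRC);

# Hypothetical helper (not part of XML::LibXML): returns 1 if every
# character of $text can be encoded in $encoding, 0 otherwise.
# An encoding name unknown to Encode also yields 0.
sub representable_in {
    my ( $text, $encoding ) = @_;
    my $ok = eval {
        # FB_CROAK makes encode() die on an unmappable character;
        # LEAVE_SRC keeps encode() from modifying $text.
        encode( $encoding, $text, FB_CROAK | LEAVE_SRC );
        1;
    };
    return $ok ? 1 : 0;
}

my $ya = "\x{42F}";    # Cyrillic capital letter YA
for my $encoding (qw(ascii latin1 windows-1251 utf-8)) {
    printf "%-12s %s\n", $encoding,
        representable_in( $ya, $encoding ) ? 'fits' : 'does not fit';
}
```

For YA this should report that only windows-1251 and utf-8 fit, which matches the encodings that trigger the problem in the attached test.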
I think that when serializing a CDATA section whose value cannot be
expressed in the serialized document's encoding, LibXML should either
warn(), die(), or upgrade the CDATA node to a Text node (where character
references can be used). There could be a global switch or an argument to
tell LibXML which of these to do.
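The third option can already be approximated by the caller before serializing. Below is a rough sketch. Note that upgrade_unencodable_cdata is a hypothetical helper, not an XML::LibXML API, and the sketch assumes the Encode name for the target encoding ('iso-8859-1', i.e. latin1) matches the one libxml2 uses:

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK LEAVE_SRC);
use XML::LibXML;

# Hypothetical helper (not an XML::LibXML API): replaces every CDATA
# section whose value cannot be encoded in $encoding with a Text node
# holding the same value. Returns the number of nodes replaced.
sub upgrade_unencodable_cdata {
    my ( $doc, $encoding ) = @_;
    my $replaced = 0;
    # In XPath, text() matches both Text nodes and CDATA sections.
    for my $node ( $doc->findnodes('//text()') ) {
        next unless $node->isa('XML::LibXML::CDATASection');
        my $value = $node->nodeValue;
        # Skip CDATA sections that encode cleanly.
        next if eval { encode( $encoding, $value, FB_CROAK | LEAVE_SRC ); 1 };
        $node->replaceNode( $doc->createTextNode($value) );
        $replaced++;
    }
    return $replaced;
}

my $doc = XML::LibXML->new->parse_string('<root/>');
$doc->documentElement->appendChild( $doc->createCDATASection("\x{42F}") );
$doc->setEncoding('iso-8859-1');
upgrade_unencodable_cdata( $doc, 'iso-8859-1' );

# Round trip: the reparsed value should again be the single character YA.
my $doc2 = XML::LibXML->new->parse_string( $doc->serialize );
print $doc2->documentElement->textContent eq "\x{42F}"
    ? "round trip OK\n"
    : "round trip FAILED\n";
```

The trade-off is that the node type changes, so any consumer that distinguishes CDATA sections from ordinary text will see a different tree. That is why a switch to choose between warn(), die() and upgrading still seems preferable to doing this silently.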
Subject: | test_cdata_bug.pl |
#!/usr/bin/perl
use strict;
use XML::LibXML;
use utf8;
# Create a CDATA section containing the Cyrillic letter YA (looks like a mirror image of 'R')
my $doc = XML::LibXML->new->parse_string( '<root/>' );
$doc->documentElement->appendChild( XML::LibXML::CDATASection->new( 'Я' ) );
# Serialize in an 8-bit encoding to $buffer
#$doc->setEncoding( 'ascii' ); # Triggers the bug: ASCII has no Cyrillic YA
$doc->setEncoding( 'latin1' ); # Triggers the bug: latin1 has no Cyrillic YA
#$doc->setEncoding( 'windows-1251' ); # Works, because windows-1251 contains Cyrillic YA
my $buffer = $doc->serialize;
# Test contents of original document (OK)
my $orig_value = $doc->documentElement->firstChild->nodeValue;
if ( substr( $orig_value, 0, 1 ) eq '&' )
{
    die "Wrong";
}
elsif ( substr( $orig_value, 0, 1 ) eq 'Я' ) # First char of CDATA value is YA
{
    print "Original document value is OK\n"; # Cyrillic YA
}
else
{
    die "Wrong value";
}
print "Serialization of original document: ", $buffer, "\n";
# Parsing the $buffer which was serialized before
my $doc2 = XML::LibXML->new->parse_string( $buffer );
$doc2->setEncoding( 'utf-8' );
# If the first serialization used an encoding without YA, the CDATA section now contains '&#nnnn;'
print "Serialization of parsed document: ", $doc2->serialize, "\n";
# Show the value of the CDATA section
print "The [corrupted] value of text inside a firstChild of a parsed doc2: ",
    $doc2->documentElement->firstChild->nodeValue, "\n";
# This shows that in LibXML 1.70 the first char of the CDATA section is '&', not Cyrillic YA
if ( substr( $doc2->documentElement->firstChild->nodeValue, 0, 1 ) eq '&' )
{
    die "I did not add text '&' to CDATA!!!"; # OOPS!
}
print "If we come here, the bug may have been fixed! Thank you! :)\n";