Bug #75083 for XML-LibXML: Some CDATA values are corrupted during serialization to non-unicode encodings.

Fri Feb 17 02:19:07 2012 allter [...] gmail.com - Ticket created

Subject:

Some CDATA values are corrupted during serialization to non-unicode encodings.

In a serialized XML document, characters such as '&' inside a CDATA section keep their literal value (an ampersand). This means a CDATA section cannot express a Unicode character that the document's encoding does not support, because character reference syntax does not work there. However, sometimes we need to write XML documents in legacy non-Unicode encodings, and those documents can contain characters that the encoding does not support. For example, such data can be obtained by parsing another document and importing parts of it into the destination document. Currently, in these cases XML::LibXML silently writes values like '&#nnnn;' into the destination document. Unfortunately, when such a document is read again, the value of the CDATA section no longer matches the original value: it contains the literal text '&#nnnn;' instead of the character. I think that when a CDATA value cannot be expressed in the serialized document, LibXML should either warn(), die(), or upgrade the CDATA node to a Text node (where character references can be used). There could be a global switch or an argument to select which of these happens.
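Until something like that exists, the three behaviours proposed above (warn, die, or fall back to a Text node) can be approximated on the caller's side. The following is only a minimal sketch under stated assumptions: representable_in() and cdata_or_text() are hypothetical helper names, not part of XML::LibXML, and the check relies on the core Encode module rather than on libxml2's own converters, so the two may disagree for some encodings. Encode's name 'cp1251' is used for windows-1251.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if every character of $text can be
# encoded in $encoding according to the Encode module.
sub representable_in {
    my ( $text, $encoding ) = @_;
    my $copy = $text;    # encode() with a CHECK value may modify its input
    my $ok = eval {
        Encode::encode( $encoding, $copy,
            Encode::FB_CROAK() | Encode::LEAVE_SRC() );
        1;
    };
    return $ok ? 1 : 0;
}

# Hypothetical policy helper: returns 'cdata' or 'text'.
# $on_unmappable is one of 'die', 'warn' or 'text'.
sub cdata_or_text {
    my ( $text, $encoding, $on_unmappable ) = @_;
    return 'cdata' if representable_in( $text, $encoding );
    die "CDATA value is not representable in $encoding\n"
        if $on_unmappable eq 'die';
    warn "CDATA value is not representable in $encoding; using a Text node\n"
        if $on_unmappable eq 'warn';
    return 'text';
}

my $ya = "\x{042F}";    # cyrillic letter YA
print cdata_or_text( $ya, 'iso-8859-1', 'text' ), "\n";    # latin1 lacks YA
print cdata_or_text( $ya, 'cp1251',     'text' ), "\n";    # cp1251 has YA
```

In a real program the 'cdata' result would map to $doc->createCDATASection( $text ) and the 'text' result to $doc->createTextNode( $text ), so that an unsupported character ends up in a node where a character reference is legal.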

Subject:

test_cdata_bug.pl

#!/usr/bin/perl
use strict;
use XML::LibXML;
use utf8;

# Creating CDATA section with cyrillic letter YA (looks as mirror image of 'R')
my $doc = XML::LibXML->new->parse_string( '<root/>' );
$doc->documentElement->appendChild( XML::LibXML::CDATASection->new( 'Я' ) );

# Serializing in 8-bit encoding to $buffer
#$doc->setEncoding( 'ascii' ); # Will fail
$doc->setEncoding( 'latin1' ); # Will fail
#$doc->setEncoding( 'windows-1251' ); # This works OK because windows-1251 contains cyrillic YA
my $buffer = $doc->serialize, "\n";

# Test contents of original document (OK)
if ( substr( $doc->documentElement->firstChild->nodeValue, 0, 1 ) eq '&' )
{
    die "Wrong";
}
else
{
    if ( substr( $doc->documentElement->firstChild->nodeValue, 0, 1 ) eq 'Я' ) # First char of CDATA value is YA
    {
        print "Original document value is OK\n"; # Cyrillic YA
    }
    else
    {
        die "Wrong value";
    }
}

print "Serialization of original document: ", $buffer, "\n";

# Parsing the $buffer which was serialized before
my $doc2 = XML::LibXML->new->parse_string( $buffer );
$doc2->setEncoding( 'utf-8' );
print "Serialization of parsed document: ", $doc2->serialize, "\n";

# If first serialization was done with wrong encoding, the CDATA section contains &
# Shows the value of CDATA section
print "The [corrupted] value of text inside a firstChild of a parsed doc2: ", $doc2->documentElement->firstChild->nodeValue, "\n";

# This shows that in LibXML 1.70 the first char of CDATA section is '&', not cyrillic YA
if ( substr( $doc2->documentElement->firstChild->nodeValue, 0, 1 ) eq '&' )
{
    die "I did not add text '&' to CDATA!!!"; # OOPS!
}

print "If we come here, the bug may have been fixed! Thank you! :)\n";

Fri Mar 07 17:29:20 2014 NWELLNHOF [...] cpan.org - Correspondence added

Running your test script with the latest version of XML::LibXML, I get the following warning when trying to serialize the document:

encoding error : output conversion failed due to conv error, bytes 0x00 0x3F 0x78 0x6D
I/O error : encoder error

This seems correct since it's impossible to create a CDATA section with characters outside of the encoding's character set. But I think XML::LibXML should throw an exception in this case.

When I use 'windows-1251' as encoding, everything works.
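As an illustration of what an exception-based check could look like from the calling code, here is a rough sketch that reports which characters cannot be encoded and dies if there are any. It is not how XML::LibXML or libxml2 behaves: unmappable_chars() and assert_encodable() are hypothetical names, and the check uses the core Encode module, whose encoding tables and names (for example 'cp1251' for windows-1251) are assumed to match closely enough for a pre-check.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return the distinct characters of $text that
# $encoding cannot represent, formatted as U+XXXX code points.
sub unmappable_chars {
    my ( $text, $encoding ) = @_;
    my $enc = Encode::find_encoding($encoding)
        or die "Unknown encoding '$encoding'\n";
    my ( %seen, @bad );
    for my $char ( split //, $text ) {
        next if $seen{$char}++;
        my $copy = $char;    # encode() with a CHECK value may modify its input
        my $ok = eval { $enc->encode( $copy, Encode::FB_CROAK() ); 1 };
        push @bad, sprintf( 'U+%04X', ord $char ) unless $ok;
    }
    return @bad;
}

# Hypothetical strict pre-check: die with a readable message instead of
# letting a serializer emit a low-level conversion error.
sub assert_encodable {
    my ( $text, $encoding ) = @_;
    my @bad = unmappable_chars( $text, $encoding );
    die "Cannot encode as $encoding: @bad\n" if @bad;
    return 1;
}

# Cyrillic YA followed by ASCII text, checked against latin1.
print join( ' ', unmappable_chars( "\x{042F}abc", 'iso-8859-1' ) ), "\n";
```

Calling assert_encodable() on each CDATA value before $doc->serialize would turn the conversion failure into an ordinary Perl exception that names the offending code points.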

Fri Mar 07 17:29:21 2014 The RT System itself - Status changed from 'new' to 'open'