
This queue is for tickets about the XML-LibXML CPAN distribution.

Report information
The Basics
Id: 75083
Status: open
Priority: 0/
Queue: XML-LibXML

People
Owner: Nobody in particular
Requestors: allter [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in:
  • 1.70
  • 1.90
Fixed in: (no value)



Subject: Some CDATA values are corrupted during serialization to non-unicode encodings.
In serialized XML documents, characters such as '&' inside CDATA sections keep their literal value (an ampersand). A character that the document encoding does not support therefore cannot be expressed inside a CDATA section, because character reference syntax does not work there.

However, sometimes we need to write XML documents in legacy non-Unicode encodings, and the documents can contain characters unsupported by those encodings. For example, such data can be obtained by parsing another document and importing parts of it into the destination document. Currently, in these cases XML::LibXML silently writes values like '&#nnnn;' to the destination document. Unfortunately, when such a document is read again, the value of the CDATA section no longer matches the original CDATA value.

I think that when a CDATA value cannot be expressed in the serialized document, LibXML should either warn(), die(), or upgrade the CDATA node to a Text node (where character references can be used). There could be a global switch or an argument to select the behavior.
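Until something like this exists in the library, the "upgrade to a Text node" option can be approximated in user code before serializing. The following is only a sketch, not XML::LibXML functionality: the helper names can_encode and upgrade_unencodable_cdata are hypothetical, and it assumes the core Encode module knows the target encoding under the same name that is passed to setEncoding.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();
use XML::LibXML;

# Hypothetical helper: true if every character of $text exists in $encoding.
sub can_encode {
    my ($text, $encoding) = @_;
    my $enc = Encode::find_encoding($encoding)
        or die "Unknown encoding: $encoding";
    my $copy = $text;    # encode() may modify its argument when CHECK is set
    return eval { $enc->encode($copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# Hypothetical helper: replace each CDATA section whose value cannot be
# encoded in $encoding with an ordinary Text node, so that the serializer
# is free to emit character references. Returns the number of replacements.
sub upgrade_unencodable_cdata {
    my ($doc, $encoding) = @_;
    my $replaced = 0;
    my @stack = grep { defined } $doc->documentElement;
    while (@stack) {
        my $node = pop @stack;
        if ($node->isa('XML::LibXML::CDATASection')) {
            next if can_encode($node->nodeValue, $encoding);
            my $text = $doc->createTextNode($node->nodeValue);
            $node->parentNode->replaceChild($text, $node);
            $replaced++;
        }
        else {
            push @stack, $node->childNodes;
        }
    }
    return $replaced;
}

# Usage: same document as in the attached test script, with YA written as
# an escape so this file does not depend on its own source encoding.
my $doc = XML::LibXML->new->parse_string('<root/>');
$doc->documentElement->appendChild(XML::LibXML::CDATASection->new("\x{42F}"));
upgrade_unencodable_cdata($doc, 'latin1');
$doc->setEncoding('latin1');
print $doc->serialize, "\n";
```

CDATA sections that the target encoding can represent are left untouched, so only the problematic nodes lose their CDATA form. A warn() or die() variant would replace the replaceChild call with the corresponding statement.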
Subject: test_cdata_bug.pl
#!/usr/bin/perl
use strict;
use XML::LibXML;
use utf8;

# Creating CDATA section with cyrillic letter YA (looks as mirror image of 'R')
my $doc = XML::LibXML->new->parse_string( '<root/>' );
$doc->documentElement->appendChild( XML::LibXML::CDATASection->new( 'Я' ) );

# Serializing in 8-bit encoding to $buffer
#$doc->setEncoding( 'ascii' ); # Will fail
$doc->setEncoding( 'latin1' ); # Will fail
#$doc->setEncoding( 'windows-1251' ); # This works OK because windows-1251 contains cyrillic YA
my $buffer = $doc->serialize, "\n";

# Test contents of original document (OK)
if ( substr( $doc->documentElement->firstChild->nodeValue, 0, 1 ) eq '&' )
{
    die "Wrong";
}
else
{
    if ( substr( $doc->documentElement->firstChild->nodeValue, 0, 1 ) eq 'Я' ) # First char of CDATA value is YA
    {
        print "Original document value is OK\n"; # Cyrillic YA
    }
    else
    {
        die "Wrong value";
    }
}

print "Serialization of original document: ", $buffer, "\n";

# Parsing the $buffer which was serialized before
my $doc2 = XML::LibXML->new->parse_string( $buffer );
$doc2->setEncoding( 'utf-8' );
print "Serialization of parsed document: ", $doc2->serialize, "\n"; # If first serialization was done with wrong encoding, the CDATA section contains &

# Shows the value of CDATA section
print "The [corrupted] value of text inside a firstChild of a parsed doc2: ", $doc2->documentElement->firstChild->nodeValue, "\n";

# This shows that in LibXML 1.70 the first char of CDATA section is '&', not cyrillic YA
if ( substr( $doc2->documentElement->firstChild->nodeValue, 0, 1 ) eq '&' )
{
    die "I did not add text '&' to CDATA!!!"; # OOPS!
}

print "If we come here, the bug may have been fixed! Thank you! :)\n";
Running your test script with the latest version of XML::LibXML, I get the following warning when trying to serialize the document:

encoding error : output conversion failed due to conv error, bytes 0x00 0x3F 0x78 0x6D
I/O error : encoder error

This seems correct since it's impossible to create a CDATA section with characters outside of the encoding's character set. But I think XML::LibXML should throw an exception in this case.

When I use 'windows-1251' as encoding, everything works.