Bug #26318 for XML-LibXML: Improve $doc->toString() to encode document correctly

Fri Apr 13 07:01:45 2007 alankila [...] elma.fi - Ticket created

Subject:

Improve $doc->toString() to encode document correctly

The current implementation of toString() leaves the document in Perl's internal Unicode encoding in many cases. This is difficult to deal with, because people likely do not know how to fix the problem and resort to random hacks that seem to work but could be wrong. For instance, the simple task of printing many documents to STDOUT tends to provoke "Wide character in print" warnings, and the end result that goes to STDOUT might be corrupt XML.

Use of "toFH(\*STDOUT)" somewhat works around this issue, but it is not convenient when you aren't actually writing to a file and need the document as a string, perhaps to pass to other XML-expecting APIs. Or maybe you are implementing XML-DSig and need to calculate SHA-1 hashes of documents. (I know that you most often canonicalize the XML for XML-DSig, and that this fixes the encoding to UTF-8, but this is not always the case.)

My argument, however, is that this must work just like toFH(\*STDOUT) works:

    print STDOUT $xml->toString()

The solution appears to be something along the lines that if the document that comes out has Perl's Unicode flag set, then you must Encode::encode() it to UTF-8. I am not 100 % sure of the correctness of the solution, but it appears to do the right thing. For instance, if ISO-8859-15 is used to describe a document with a euro sign, then the result has the UTF-8 flag off (it looks like ISO-8859-1), and the character a4 is put where the euro symbol should be, making Perl replace it with ? as it attempts to convert ISO-8859-1 \xa4 to its ISO-8859-15 equivalent, which does not exist:

    use Encode;
    use XML::LibXML;
    my $x = XML::LibXML->new->parse_string("<?xml version=\"1.0\" encoding=\"ISO-8859-15\"?><x>\xa4</x>");
    print Encode::encode("ISO-8859-15", $x->toString);

and the output is:

    <?xml version="1.0" encoding="ISO-8859-15"?>
    <x>?</x>

Now, let's try the same with UTF-8:

    my $x = XML::LibXML->new->parse_string("<?xml version=\"1.0\" encoding=\"UTF-8\"?><x>\xe2\x82\xac</x>");
    print Encode::encode("ISO-8859-15", $x->toString);

Outputs:

    <?xml version="1.0" encoding="UTF-8"?>
    <x>€</x>

Ugh! Perl sees the euro symbol as a single character, instead of the original sequence of 3 octets! What this means is that to correctly stringify this document, UTF-8 encoding needs to be performed when the string has Encode::is_utf8() on. Or in other words, turning the Unicode flag off "fixes" it so that we work the same way regardless of what the original encoding of the document is!

Why this works is that Encode::is_utf8 apparently stays off as long as the characters put into the document have char values less than 256. This means that regardless of document content, it all gets written the same way to regular, encoding-unaware filehandles. When higher characters are present in the stream, the flag somehow gets turned on, and chaos ensues, because the fact that the document contains these high characters now requires different treatment!!! Corrupt documents may result. Warnings about prints of wide characters occur. This is no good at all.

There are more small issues: sometimes $doc->getEncoding is not defined, which basically means that the XML version is 1.0 and the encoding is UTF-8 (according to the documentation and the XML specification). However, when outputting, UTF-8 is not assumed.
Without encoding declaration:

    print XML::LibXML->new->parse_string("<x>\xe2\x82\xac</x>")->toString;

    <?xml version="1.0"?>
    <x>€</x>

I do think that getEncoding() should probably still return UTF-8, because formally this is true for a prologless XML file, and also for an XML file that lacks encoding information. I realize the method was probably meant as an accessor, to find out what the value in the "encoding" field is, but if we aren't concerned about the text representation of the XML, we don't really care what was in the original file; we care about what the file's content, interpreted as XML, _means_.

So I do not like the fact that XML::LibXML pretends that there is "no" encoding, because text strings are always in some encoding, and UTF-8 is assumed according to the XML spec. And this is clearly what it is doing. My take on this is that XML::LibXML should either put the prolog there and declare the encoding as UTF-8 honestly, or just return UTF-8 from getEncoding() and omit the prolog. As it is, it says that there is no encoding (this is impossible; the fact that it doesn't appear in the prolog is irrelevant), and it changes the document by adding the prolog when the input did not actually have one! So this is quite possibly the worst possible way to treat it.

I wonder if the piece of documentation saying that "$doc->setEncoding() is unsafe" is true any more. Maybe it depends on the libxml2 version? It would appear that XML::LibXML performs character reference substitutions as appropriate, and everything works just fine as I'm testing it.

Let's just fix this mess so that toString() properly encodes the document to UTF-8 when the Unicode bit is on and hands out octets. Let us remove the mention of setEncoding() being unsafe, because it seems perfectly safe. And getEncoding() should return UTF-8, never undef, and the missing prolog problem could be handled by always adding a prolog to the output and explicitly choosing UTF-8 encoding for the document, which is what the XML standard implies the document's content is.
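
The flag behaviour described above can be reproduced without XML::LibXML. The following is a minimal sketch using only the core Encode module; the variable names are illustrative and not taken from the ticket. It contrasts a character string holding the euro sign with the equivalent UTF-8 octets.

```perl
use strict;
use warnings;
use Encode ();

# One Unicode character: EURO SIGN (U+20AC). Its code point is above 255,
# so Perl stores it with the internal UTF8 flag on.
my $chars = "\x{20ac}";

# The same sign as three UTF-8 octets: a plain byte string, flag off.
my $octets = Encode::encode('UTF-8', $chars);

printf "chars:  length=%d flag=%s\n",
    length($chars), Encode::is_utf8($chars) ? 'on' : 'off';
printf "octets: length=%d flag=%s\n",
    length($octets), Encode::is_utf8($octets) ? 'on' : 'off';
```

Printing $chars to a filehandle that has no encoding layer is what produces the "Wide character in print" warning; printing $octets does not.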

Fri Apr 13 17:31:03 2007 pajas [...] matfyz.cz - Correspondence added

On Friday 13 April 2007, Antti S. Lankila via RT wrote:
...

> The solution appears to be something along the lines that if the
> document that comes out has perl's unicode flag set, then you must
> Encode::encode() it to UTF-8. I am not 100 % sure of the correctness of
> the solution, but it appears to do the right thing. For instance, if
> ISO-8859-15 is used to describe a document with euro, then the result
> has UTF-8 flag off (looks like it is ISO-8859-1), and the character a4
> is put where Euro symbol should be, making Perl replace it with ? as it
> attempts to convert iso-8859-1 \xa4 to iso-8859-15 equivalent, which
> does not exist:
>
> use Encode;
> use XML::LibXML;
> my $x = XML::LibXML->new->parse_string("<?xml version=\"1.0\"
> encoding=\"ISO-8859-15\"?><x>\xa4</x>");
> print Encode::encode("ISO-8859-15", $x->toString);
>
> and the output is:
>
> <?xml version="1.0" encoding="ISO-8859-15"?>
> <x>?</x>

yes, but note that you are in fact doing something quite odd there. $x->toString is a string of octets in the ISO-8859-15 encoding; by passing it to Encode::encode(...) you tell Perl (implicitly) to upgrade it to UTF-8 (as if it were Latin-1) and then (explicitly) to convert the upgraded string back to ISO-8859-15. What you should do here is just print $x->toString (no Encode). You'll get the same bytes in the output as were in the input.
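
The double conversion described here can be shown in isolation. This is a hedged sketch using only the core Encode module, with a literal byte standing in for what toString returns for the ISO-8859-15 document; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Octet as it appears in an ISO-8859-15 document: 0xA4 is the euro sign.
my $latin9_octets = "\xa4";

# Pitfall: encode() expects characters, so the byte is read as U+00A4
# (CURRENCY SIGN), which ISO-8859-15 cannot represent; "?" is substituted.
my $mangled = Encode::encode('ISO-8859-15', $latin9_octets);

# Decoding first yields the intended character, U+20AC (EURO SIGN).
my $chars = Encode::decode('ISO-8859-15', $latin9_octets);

# Re-encoding that character string gives back the original octet.
my $roundtrip = Encode::encode('ISO-8859-15', $chars);

printf "mangled=%s euro=U+%04X roundtrip=0x%02X\n",
    $mangled, ord($chars), ord($roundtrip);
# prints: mangled=? euro=U+20AC roundtrip=0xA4
```

In other words, octets should either be printed untouched or decoded before any further encode() call.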

> Now, let's try the same with UTF-8:
>
> my $x = XML::LibXML->new->parse_string("<?xml version=\"1.0\"
> encoding=\"UTF-8\"?><x>\xe2\x82\xac</x>");
> print Encode::encode("ISO-8859-15", $x->toString);
>
> Outputs:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <x>€</x>
>
> Ugh! Perl sees the euro symbol as a single character, instead of the
> original sequence of 3 octets!

not sure what your intention was here, but yes, the fact that the UTF8 flag is ON in this case is really odd. On the other hand, your code seems to be aware of it (otherwise why would you pass it to Encode::encode(...) without doing Encode::decode('UTF-8',...) first, right? :-)).

> What this means is that to correctly
> stringify this document UTF-8 encoding needs to be performed when the
> string has Encode::is_utf8() on. Or in other words, turning the Unicode
> flag off "fixes" it so that we work the same way regardless what the
> original encoding of the document is!

I agree completely that $doc->toString should behave consistently. The current inconsistency lies in the fact that if the document encoding is UTF-8, the XML is returned with the UTF8 flag on (imposing character semantics), whereas documents in all other encodings are returned (of course) as bytes.

Ideally, $doc->toString would always return characters (a UTF-8 encoded string with the UTF8 flag on), just as $node->toString does, and there would be another API for returning the XML in the document encoding. But now it's too late to go that way. For compatibility with existing applications, $doc->toString must remain in the document encoding. But then the flag should never be ON. The fact that it is set ON for UTF-8 documents is (probably) because XML::LibXML originated before Perl 5.6 appeared, and when it did, people were confused and did not fully understand the semantics of the UTF8 flag.

read on...

> I wonder if the document piece saying that "$doc->setEncoding() is
> unsafe" is true any more.

no, it's not. Removed.

> Maybe it depends on libxml2 version? It would
> appear that XML::LibXML performs character reference substitutions as
> appropriate, and everything works just fine as I'm testing it.
>
> Let's just fix this mess so that toString() properly encodes the
> document to UTF-8 when the unicode bit is on and hands out octets.

If you wish toString() to always return a character string (with UTF8 "on"), then no, I can't do that (it would break too many existing applications). If you wish toString() to consistently return a byte string, i.e. octets (UTF8 flag "off"), then yes, I think this is currently the best way to go (consistent and least invasive). There is only a marginal chance of breaking existing code, and in those cases adding

    $string = $doc->toString;
    Encode::decode_utf8($string,0) if Encode::is_utf8($string);

simulates the old behavior.

So I just changed the code in SVN in this way. This means that it is now ensured that the result is a byte string in the document encoding, and as such it can be safely passed to a byte-oriented (default) I/O layer or re-encoded using the Encode module without warnings or any obscure side effects.
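
For code that has to cope with both the old and the new return value, a small defensive wrapper is one option. The sketch below is an illustration only: as_octets is a hypothetical helper, not part of XML::LibXML, and the two literals merely simulate what toString might return for a UTF-8 document before and after the change. It uses the UTF8 flag as a heuristic, which is an assumption that only holds when a flagged string is known to be character data from a UTF-8 document.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not an XML::LibXML API): return a byte string
# whether the input is a flagged character string (old behaviour for
# UTF-8 documents) or already octets (new behaviour).
sub as_octets {
    my ($xml) = @_;
    return Encode::is_utf8($xml) ? Encode::encode('UTF-8', $xml) : $xml;
}

# Simulated return values of toString for a document containing a euro sign.
my $old_style = "<x>\x{20ac}</x>";        # characters, UTF8 flag on
my $new_style = "<x>\xe2\x82\xac</x>";    # UTF-8 octets, flag off

print as_octets($old_style) eq as_octets($new_style)
    ? "same octets\n"
    : "different\n";
# prints: same octets
```

Either way the caller ends up with octets that can be hashed or written to a plain filehandle.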

> Let
> us remove the mention about setEncoding() being unsafe, because it seems
> perfectly safe.

done

> And getEncoding() should return UTF-8, never undef,

Can't do it this way: I know of some real applications that make use of this distinction. This function is indeed meant as an accessor, like xmlEncoding in DOM Level 3. There is, however, an existing alias for getEncoding named actualEncoding, which seems quite suitable for this. I changed it to return what you suggest.
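
The distinction between the two accessors can be sketched in plain Perl. Both functions below are hypothetical stand-ins written for illustration, not XML::LibXML code: declared_encoding mimics the accessor role of getEncoding with a deliberately simplified regex, and effective_encoding mimics the revised actualEncoding by falling back to UTF-8. The sketch assumes the document starts with its XML declaration and ignores byte order marks, UTF-16 detection and external encoding information.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the accessor: report the encoding named in
# the XML declaration, or undef if none is declared.
sub declared_encoding {
    my ($xml) = @_;
    my ($enc) = $xml =~ /\A<\?xml\b[^>]*?\bencoding\s*=\s*["']([^"']+)["']/;
    return $enc;
}

# Hypothetical stand-in for the revised alias: never undef, defaulting
# to UTF-8 when nothing is declared.
sub effective_encoding {
    my ($xml) = @_;
    my $enc = declared_encoding($xml);
    return defined $enc ? $enc : 'UTF-8';
}

for my $xml ('<?xml version="1.0" encoding="ISO-8859-15"?><x/>', '<x/>') {
    my $declared = declared_encoding($xml);
    printf "declared=%s effective=%s\n",
        defined $declared ? $declared : 'undef', effective_encoding($xml);
}
```

Applications that need to know whether a declaration was present keep using the accessor; code that only needs an encoding name to pass to Encode can use the fallback.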

> the missing prolog problem could be handled by always adding a prolog to
> output and explicitly choosing UTF-8 encoding for the document, which is
> what XML standard implies the document's content is

sorry, but no automatic adding of encoding declarations! That could break things too, and many people require control over the resulting XML.

Please try the current SVN version:

    svn co svn://axkit.org/XML-LibXML/trunk
    perl Makefile.PL
    make
    make docs
    perl Makefile.PL # yes, again
    make
    make test
    make install

Feel free to reopen the bug if needed or contact me (pajas at matfyz dot cz).

-- Petr

Fri Apr 13 17:31:06 2007 The RT System itself - Status changed from 'new' to 'open'

Fri Apr 13 17:31:08 2007 pajas [...] matfyz.cz - Status changed from 'open' to 'resolved'

Fri Apr 13 17:36:42 2007 pajas [...] matfyz.cz - Correspondence added

On Fri Apr 13 2007 17:31:03, PAJAS wrote:

> $string = $doc->toString;
> Encode::decode_utf8($string,0) if Encode::is_utf8($string);

oops, should have been something like

    $string = Encode::decode_utf8($string,0) if $doc->actualEncoding=~/^UTF-?8/;

-- p
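
The compatibility one-liner above can be written as a standalone function for testing. This is a sketch under stated assumptions: old_style_string is a hypothetical helper, the encoding name is passed in as a plain string rather than read from a document object, and only the core Encode module is used.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: given the octets returned by the new toString and
# the document's effective encoding name, reproduce the old result
# (characters for UTF-8 documents, untouched octets otherwise).
sub old_style_string {
    my ($octets, $encoding) = @_;
    return $encoding =~ /^UTF-?8$/i
        ? Encode::decode('UTF-8', $octets)
        : $octets;
}

my $utf8   = old_style_string("<x>\xe2\x82\xac</x>", 'UTF-8');
my $latin9 = old_style_string("<x>\xa4</x>", 'ISO-8859-15');

printf "utf8:   length=%d flag=%s\n",
    length($utf8), Encode::is_utf8($utf8) ? 'on' : 'off';
printf "latin9: length=%d flag=%s\n",
    length($latin9), Encode::is_utf8($latin9) ? 'on' : 'off';
```

Only the UTF-8 case is decoded, which matches the old behaviour where other encodings were already returned as bytes.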

Fri Apr 13 17:36:46 2007 The RT System itself - Status changed from 'resolved' to 'open'

Fri Apr 13 17:41:14 2007 pajas [...] matfyz.cz - Status changed from 'open' to 'resolved'