Subject: Improve $doc->toString() to encode document correctly
The current implementation of toString() leaves the document in Perl's
internal Unicode encoding in many cases. This is difficult for users to
work around, because people likely do not know how to fix the problem and
resort to random hacks that seem to work but could be wrong.
For instance, the simple task of printing many documents to STDOUT tends
to trigger "wide character in print" warnings, and the end result that
goes to STDOUT might be corrupt XML.
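The warning itself can be reproduced without XML::LibXML. The following is a minimal sketch using only core Perl; print_without_layer is a hypothetical helper written for this report, and an in-memory filehandle stands in for STDOUT so the warning can be captured.

```perl
use strict;
use warnings;

# Hypothetical helper: print $string to an in-memory filehandle that has no
# encoding layer, and return the captured octets plus any warnings raised.
sub print_without_layer {
    my ($string) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };

    my $buffer = '';
    open(my $fh, '>', \$buffer) or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close($fh);

    return ($buffer, @warnings);
}

# U+20AC (euro sign) cannot be stored in a single octet, so Perl warns.
my ($octets, @warnings) = print_without_layer("<x>\x{20ac}</x>");
print "warning: $_" for @warnings;
```

A string whose characters are all below 256 goes through the same helper without any warning, which is why the problem only shows up for some documents.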
Use of "toFH(\*STDOUT)" somewhat works around this issue, but it's not
convenient when you aren't actually writing to a file and need the
document as a string, perhaps to pass to other XML-expecting APIs. Or
maybe you are implementing XML-DSig and need to calculate SHA-1 hashes
of documents. (I know that with XML-DSig you most often canonicalize
first, which fixes the encoding to UTF-8, but this is not always the
case.)
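To illustrate why the hashing case needs octets rather than characters, here is a small sketch using the core Digest::SHA and Encode modules. The sample string is made up for this report and does not come from XML::LibXML.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);
use Encode ();

# Character string: the euro sign is one code point above 255.
my $characters = "<x>\x{20ac}</x>";

# Hashing the character string directly fails, because a digest is defined
# over octets and there is no single octet for U+20AC.
my $digest = eval { sha1_hex($characters) };
print "hashing characters failed: $@" unless defined $digest;

# Encoding first yields a well-defined octet sequence to hash.
my $octets = Encode::encode('UTF-8', $characters);
print sha1_hex($octets), "\n";
```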
My argument, however, is that this must work just like toFH(\*STDOUT) works:
print STDOUT $xml->toString()
The solution appears to be something along these lines: if the
document that comes out has Perl's UTF-8 flag set, then you must
Encode::encode() it to UTF-8. I am not 100% sure of the correctness of
the solution, but it appears to do the right thing. For instance, if
ISO-8859-15 is declared for a document containing a euro sign, then the
result has the UTF-8 flag off (it looks like ISO-8859-1), and the
character \xa4 is put where the euro sign should be. Perl then replaces
it with ? as it attempts to convert ISO-8859-1 \xa4 to its ISO-8859-15
equivalent, which does not exist:
use Encode;
use XML::LibXML;
# ISO-8859-15 document whose only content is the octet 0xA4 (euro sign).
my $x = XML::LibXML->new->parse_string(
    "<?xml version=\"1.0\" encoding=\"ISO-8859-15\"?><x>\xa4</x>");
print Encode::encode("ISO-8859-15", $x->toString);
and the output is:
<?xml version="1.0" encoding="ISO-8859-15"?>
<x>?</x>
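The substitution seen above can be reproduced with Encode alone, which suggests the "?" comes from the final encode step rather than from the parser. This is only a sketch of that one step. It assumes, as described above, that the string handed to Encode::encode() contains U+00A4 rather than U+20AC.

```perl
use strict;
use warnings;
use Encode ();

# U+00A4 (currency sign) is what the octet 0xA4 means when read as ISO-8859-1.
# ISO-8859-15 has no currency sign, so Encode substitutes "?" by default.
my $from_currency_sign = Encode::encode('ISO-8859-15', "\x{a4}");

# U+20AC (euro sign) does exist in ISO-8859-15, at octet 0xA4.
my $from_euro_sign = Encode::encode('ISO-8859-15', "\x{20ac}");

printf "%02X\n", ord($from_currency_sign);   # 3F, i.e. "?"
printf "%02X\n", ord($from_euro_sign);       # A4
```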
Now, let's try the same with UTF-8:
# Same document declared as UTF-8; the euro sign is the octets E2 82 AC.
my $x = XML::LibXML->new->parse_string(
    "<?xml version=\"1.0\" encoding=\"UTF-8\"?><x>\xe2\x82\xac</x>");
print Encode::encode("ISO-8859-15", $x->toString);
Outputs:
<?xml version="1.0" encoding="UTF-8"?>
<x>€</x>
Ugh! Perl sees the euro sign as a single character instead of the
original sequence of 3 octets! What this means is that, to stringify
this document correctly, UTF-8 encoding needs to be performed when
Encode::is_utf8() is true for the string. In other words, turning the
UTF-8 flag off "fixes" it, so that we work the same way regardless of
what the original encoding of the document is!
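A minimal sketch of the proposed rule follows. serialized_octets is a hypothetical helper, not an existing XML::LibXML function, and the input strings are hand-written stand-ins for what toString() is described as returning.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return octets for a serialized document.
# Character strings (UTF-8 flag on) are encoded to UTF-8; strings that
# already look like octets are passed through unchanged.
sub serialized_octets {
    my ($serialized) = @_;
    return Encode::is_utf8($serialized)
        ? Encode::encode('UTF-8', $serialized)
        : $serialized;
}

my $octets = serialized_octets("<x>\x{20ac}</x>");
print length($octets), "\n";   # 10: the euro sign becomes three octets
```

Keying on the flag is a heuristic borrowed from the observation above. It is only correct if the prolog of a flagged document declares UTF-8 (or nothing), which I have not verified for every case.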
Why this works is that Encode::is_utf8() apparently stays false as long
as the characters put into the document have values less than 256.
This means that, regardless of document content, it all gets written the
same way to regular, encoding-unaware filehandles.
When higher characters are present in the stream, the flag somehow gets
turned on, and chaos ensues, because the fact that the document contains
these high characters now requires different treatment!
Corrupt documents may result. Warnings about printing wide characters
occur. This is no good at all.
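The flag behaviour described here can be observed on plain Perl strings. The sketch below uses hand-written strings rather than XML::LibXML output. It also shows one caveat I am fairly sure of: the flag can be on for a string that contains only low characters, so it is not a reliable test of content.

```perl
use strict;
use warnings;
use Encode ();

my $low  = "<x>\xa4</x>";       # every character is below 256
my $high = "<x>\x{20ac}</x>";   # contains U+20AC, above 255

printf "low:  flag %s\n", Encode::is_utf8($low)  ? 'on' : 'off';   # off
printf "high: flag %s\n", Encode::is_utf8($high) ? 'on' : 'off';   # on

# Same characters as $low, but stored in Perl's internal UTF-8 form.
my $upgraded = $low;
utf8::upgrade($upgraded);
printf "same text: %s, flag %s\n",
    ($upgraded eq $low ? 'yes' : 'no'),
    (Encode::is_utf8($upgraded) ? 'on' : 'off');                   # yes, on
```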
There are more small issues: sometimes $doc->getEncoding is not defined,
which basically means that the XML version is 1.0 and the encoding is
UTF-8 (according to the documentation and the XML specification).
However, when outputting, UTF-8 is not assumed.
Without encoding declaration:
print XML::LibXML->new->parse_string("<x>\xe2\x82\xac</x>")->toString;
<?xml version="1.0"?>
<x>€</x>
I do think that getEncoding() should probably still return UTF-8,
because formally this is true for an XML file without a prolog, and also
for an XML file that lacks encoding information.
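Until that changes, callers can apply the default themselves. The following is a sketch of that fallback; effective_encoding is a hypothetical helper name, and it assumes the accessor returns undef (or an empty string) when no encoding was declared.

```perl
use strict;
use warnings;

# Hypothetical helper: map whatever the encoding accessor returned
# (possibly undef) to the encoding that should be assumed.
sub effective_encoding {
    my ($declared) = @_;
    return (defined $declared && length $declared) ? $declared : 'UTF-8';
}

print effective_encoding(undef), "\n";           # UTF-8
print effective_encoding('ISO-8859-15'), "\n";   # ISO-8859-15
```

With a parsed document this would be called as effective_encoding($doc->getEncoding). Note that the sketch ignores the UTF-16 with byte order mark case that the XML specification also allows without a declaration.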
I realize the method was probably meant as an accessor, to find out what
the value of the "encoding" field is, but if we aren't concerned about
the text representation of XML, we don't really care what was in the
original file; we care about what the file's content, interpreted as
XML, _means_. So I do not like the fact that XML::LibXML pretends that
there is "no" encoding, because text strings are always in some
encoding, and UTF-8 is assumed according to the XML spec. And this is
clearly what it is doing.
My take on this is that XML::LibXML should either put a prolog there and
declare the encoding as UTF-8 honestly, or just return UTF-8 from
getEncoding() and omit the prolog. As it stands, it says that there is
no encoding (this is impossible; the fact that it doesn't appear in the
prolog is irrelevant), and it changes the document by adding a prolog
when the input did not actually have one! So this is quite possibly the
worst possible way to treat it.
I wonder if the passage in the documentation saying that
"$doc->setEncoding() is unsafe" is still true. Maybe it depends on the
libxml2 version? It would appear that XML::LibXML performs character
reference substitutions as appropriate, and everything works just fine
as I'm testing it.
Let's just fix this mess so that toString() properly encodes the
document to UTF-8 when the UTF-8 flag is on, and hands out octets. Let
us remove the claim that setEncoding() is unsafe, because it seems
perfectly safe. And getEncoding() should return UTF-8, never undef. The
missing prolog problem could be handled by always adding a prolog to the
output and explicitly choosing UTF-8 encoding for the document, which is
what the XML standard implies the document's content is.