Subject: Default document encoding is not utf-8
I found a problem in the XML::LibXML module concerning the encoding of the resulting XML document (as output by toString). The documentation for createDocument says that the default encoding is an implicitly defined utf-8 (as the XML 1.0 standard defines).
The example code below first creates a string containing the single character ä (U+00E4). The document is first output using the default encoding (which according to the documentation should be implicitly utf-8); the code then sets the document encoding explicitly to utf-8 and outputs the document again. In both cases the document is sent to stdout.
The binmode statement makes sure that perl is capable of utf-8 output on stdout.
Here is the output I get from this code:
<?xml version="1.0"?>
<test contents="ä"/>
<?xml version="1.0" encoding="utf-8"?>
<test contents="ä"/>
As you can see, the 'ä' in the first output is iso-8859-1 encoded instead of the expected utf-8. The second output is correct.
Here is my example code to reproduce the bug:
############################################################
binmode(STDOUT, ':utf8');
# the small letter a with diaeresis (ä) as an example
my $in = pack('U', 0x00e4);
use XML::LibXML;
my $doc = XML::LibXML::Document->new();
my $node = XML::LibXML::Element->new('test');
$node->setAttribute(contents => $in);
$doc->setDocumentElement($node);
# First output
print $doc->toString(1);
# Second output
$doc->setEncoding('utf-8');
print $doc->toString(1);
############################################################
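To tell the two encodings apart without relying on how a terminal renders them, it helps to look at the raw bytes. The following is only a minimal sketch using core modules, not part of the original test case: hex_bytes and printed_bytes are hypothetical helper names introduced for illustration, and the in-memory filehandle stands in for stdout. It shows which bytes U+00E4 becomes in each encoding and how the :utf8 layer set by binmode changes what print writes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Same test character as in the report: ä (U+00E4)
my $char = pack('U', 0x00e4);

# Hypothetical helper: render a byte string as space-separated hex
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# Hypothetical helper: print $text through a filehandle with the given
# layer into an in-memory buffer and return the bytes that were written
sub printed_bytes {
    my ($layer, $text) = @_;
    my $buf = '';
    open(my $fh, ">$layer", \$buf) or die "cannot open in-memory handle: $!";
    print {$fh} $text;
    close($fh);
    return $buf;
}

# The character encoded explicitly in both candidate encodings
print 'encoded as iso-8859-1: ', hex_bytes(encode('iso-8859-1', $char)), "\n";
print 'encoded as utf-8:      ', hex_bytes(encode('UTF-8', $char)), "\n";

# The same character printed through a raw handle and a :utf8 handle
print 'printed via :raw:      ', hex_bytes(printed_bytes(':raw', $char)), "\n";
print 'printed via :utf8:     ', hex_bytes(printed_bytes(':utf8', $char)), "\n";
```

A single byte e4 means iso-8859-1, the two bytes c3 a4 mean utf-8. The same check can be applied to the output of the test script, for example by piping it through a hex dump tool.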
Versions of the libraries:
libc6 2.3.2ds1
libxml2 2.6.11
XML::LibXML 1.58
Here is information about perl and its system environment:
$ perl -v
This is perl, v5.8.4 built for i386-linux-thread-multi
...
$ uname -a
Linux myrkr 2.6.7 #1 Sat Sep 4 20:20:27 CEST 2004 i686 GNU/Linux
$ locale
LANG=de_DE.UTF-8
LC_CTYPE=de_DE.UTF-8
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_MESSAGES=POSIX
LC_PAPER="de_DE.UTF-8"
LC_NAME="de_DE.UTF-8"
LC_ADDRESS="de_DE.UTF-8"
LC_TELEPHONE="de_DE.UTF-8"
LC_MEASUREMENT="de_DE.UTF-8"
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=
If you need more information about my system please tell me.
Torsten