
This queue is for tickets about the XML-LibXML CPAN distribution.

Report information
The Basics
Id: 7645
Status: resolved
Worked: 5 hours (300 min)
Priority: 0/
Queue: XML-LibXML

People
Owner: phish [...] cpan.org
Requestors: torsten.hilbrich [...] gmx.net
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 1.58
Fixed in: (no value)



Subject: Default document encoding is not utf-8
I found a problem in the XML::LibXML module concerning the encoding of the resulting XML document (as output by toString). The documentation for createDocument says that the default encoding is an implicitly defined utf-8 (as the XML 1.0 standard defines).

The following code first creates a string containing the single character ä (U+00E4). The first document is output using the default encoding (which according to the documentation should be implicitly utf-8); the second part sets the document encoding explicitly to utf-8 before outputting it. In both cases the document is sent to stdout. The binmode statement makes sure that perl is capable of utf-8 output on stdout.

Here is the output I get from this code:

    <?xml version="1.0"?>
    <test contents="&#xE4;"/>
    <?xml version="1.0" encoding="utf-8"?>
    <test contents="ä"/>

As you can see the 'ä' in the first output is iso-8859-1 encoded instead of the expected utf-8. The second output is correct.

Here is my example code to reproduce the bug:

############################################################
binmode(STDOUT, ':utf8');

# the small letter a with diaeresis (ä) as an example
my $in = pack('U', 0x00e4);

use XML::LibXML;

my $doc = XML::LibXML::Document->new();
my $node = XML::LibXML::Element->new('test');
$node->setAttribute(contents => $in);
$doc->setDocumentElement($node);

# First output
print $doc->toString(1);

# Second output
$doc->setEncoding('utf-8');
print $doc->toString(1);
############################################################

Versions of the libraries:

    libc6        2.3.2ds1
    libxml2      2.6.11
    XML::LibXML  1.58

Here is information about perl and its system environment:

$ perl -v
This is perl, v5.8.4 built for i386-linux-thread-multi
...

$ uname -a
Linux myrkr 2.6.7 #1 Sat Sep 4 20:20:27 CEST 2004 i686 GNU/Linux

$ locale
LANG=de_DE.UTF-8
LC_CTYPE=de_DE.UTF-8
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_MESSAGES=POSIX
LC_PAPER="de_DE.UTF-8"
LC_NAME="de_DE.UTF-8"
LC_ADDRESS="de_DE.UTF-8"
LC_TELEPHONE="de_DE.UTF-8"
LC_MEASUREMENT="de_DE.UTF-8"
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=

If you need more information about my system please tell me.

Torsten

From: Torsten.Hilbrich [...] gmx.net
It seems the generated HTML does not quote the special characters:

> <?xml version="1.0"?>
> <test contents="&#xE4;"/>
> <?xml version="1.0" encoding="utf-8"?>
> <test contents="ä"/>

The output should be read as (quoting the ampersand character):

<?xml version="1.0"?>
<test contents="&amp;#xE4;"/>
<?xml version="1.0" encoding="utf-8"?>
<test contents="ä"/>

Torsten
From: reporter
> As you can see the 'ä' in the first output is iso-8859-1 encoded
> instead of the expected utf-8. The second output is correct.

I have additional information. It seems the character reference in the first output is correct XML syntax and is also correctly transformed to the ä character on parsing. So the only remaining issue is that the output is not utf-8 but rather ASCII, with character references used for all non-ASCII characters. This should possibly be documented but cannot be considered a real bug.
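The equivalence described above can be checked with a short round trip. This is only a sketch, assuming an installed XML::LibXML; the two input strings are the two outputs quoted in the report, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();

# First output from the report: no encoding declaration, ASCII only,
# with the non-ASCII character written as a numeric character reference.
my $ascii_form = qq{<?xml version="1.0"?>\n<test contents="&#xE4;"/>\n};

# Second output: explicit utf-8 declaration, with the character stored
# as the two UTF-8 bytes 0xC3 0xA4.
my $utf8_form = qq{<?xml version="1.0" encoding="utf-8"?>\n<test contents="\xC3\xA4"/>\n};

# Parse both serializations and read the attribute value back.
my $from_ascii = $parser->parse_string($ascii_form)
                        ->documentElement->getAttribute('contents');
my $from_utf8  = $parser->parse_string($utf8_form)
                        ->documentElement->getAttribute('contents');

printf "code point from ASCII form: U+%04X\n", ord($from_ascii);
printf "code point from UTF-8 form: U+%04X\n", ord($from_utf8);
print "both forms parse to the same value\n" if $from_ascii eq $from_utf8;
```

If both forms yield the same attribute value, the two serializations differ only in their byte representation, not in the document they describe.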
The problem is not related to XML::LibXML but to libxml2. It is fixed in libxml2 2.6.15, maybe earlier, but I have not yet tested it against other versions.

Christian
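To see which libxml2 a given installation uses, something like the following may help. It is a sketch that assumes an XML::LibXML release providing the documented LIBXML_DOTTED_VERSION and LIBXML_VERSION functions (whether 1.58 already had them is not confirmed here). The 20615 threshold is simply 2.6.15, the version mentioned above, written in libxml2's integer form.

```perl
use strict;
use warnings;
use XML::LibXML;

# libxml2 version XML::LibXML was compiled against, as a dotted string.
my $dotted = XML::LibXML::LIBXML_DOTTED_VERSION();

# The same version as an integer (major * 10000 + minor * 100 + patch).
my $number = XML::LibXML::LIBXML_VERSION();

print "XML::LibXML was compiled against libxml2 $dotted\n";

if ($number < 20615) {
    # Older than the version reported as fixed in this ticket: setting the
    # encoding explicitly, as in the original report, is the known workaround.
    print "older than 2.6.15: call \$doc->setEncoding('utf-8') before toString\n";
}
```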