
This queue is for tickets about the XML-LibXML CPAN distribution.

Report information
The Basics
Id: 115190
Status: open
Priority: 0/
Queue: XML-LibXML

People
Owner: Nobody in particular
Requestors: IKEGAMI [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: $doc->toString() isn't encoded
The docs say $doc->toString() produces an encoded document, but it doesn't for me.

    $ perl -e'
       use strict;
       use warnings;
       use XML::LibXML qw( );
       my $doc = XML::LibXML::Document->new(1.0, "UTF-8");
       my $root = $doc->createElement("root");
       $root->appendText("\x{C9}ric");
       $doc->setDocumentElement($root);
       print $doc->toString();
    ' | od -c
    0000000   <   ?   x   m   l       v   e   r   s   i   o   n   =   "   1
    0000020   "       e   n   c   o   d   i   n   g   =   "   U   T   F   -
    0000040   8   "   ?   >  \n   <   r   o   o   t   > 311   r   i   c   <
    0000060   /   r   o   o   t   >  \n
    0000067

"É" is encoded as "\xC9" ("\0311"), when its UTF-8 encoding is "\xC3\x89" ("\0303\0211").

I think ->appendText suffers from The Unicode Bug.
Subject: ->appendText suffers from The Unicode Bug
This more clearly demonstrates the bug:

    $ perl -e'
       use strict;
       use warnings;
       use XML::LibXML qw( );
       sub _u { my $s = shift; utf8::upgrade($s); $s }
       sub _d { my $s = shift; utf8::downgrade($s, 1); $s }
       my $doc = XML::LibXML::Document->new(1.0, "UTF-8");
       my $root = $doc->createElement("root");
       $doc->setDocumentElement($root);
       for my $text (_d("\x{C9}ric"), _u("\x{C9}ric")) {
          my $ele = $doc->createElement("ele");
          $root->appendChild($ele);
          $ele->appendText($text);
       }
       print $doc->toString();
    ' | od -c
    0000000   <   ?   x   m   l       v   e   r   s   i   o   n   =   "   1
    0000020   "       e   n   c   o   d   i   n   g   =   "   U   T   F   -
    0000040   8   "   ?   >  \n   <   r   o   o   t   >   <   e   l   e   >
    0000060 311   r   i   c   <   /   e   l   e   >   <   e   l   e   > 303
    0000100 211   r   i   c   <   /   e   l   e   >   <   /   r   o   o   t
    0000120   >  \n
    0000122
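The _u and _d helpers in the reproduction only change how Perl stores the string internally, not its value. A minimal sketch of that distinction, using core Perl only (no XML::LibXML; the variable names are illustrative and not from the ticket):

```perl
use strict;
use warnings;

# Two copies of the same four-character string.
my $down = "\x{C9}ric";
my $up   = "\x{C9}ric";

utf8::downgrade($down);   # stored as one byte per character (UTF8 flag off)
utf8::upgrade($up);       # stored as UTF-8 internally       (UTF8 flag on)

# The strings hold the same characters, so they compare equal...
my $equal    = $down eq $up ? 1 : 0;
my $same_len = length($down) == length($up) ? 1 : 0;

# ...and differ only in the internal storage flag.
my $flag_down = utf8::is_utf8($down) ? 1 : 0;
my $flag_up   = utf8::is_utf8($up)   ? 1 : 0;

print "equal=$equal same_len=$same_len flag_down=$flag_down flag_up=$flag_up\n";
```

Since the two values are indistinguishable with eq and length, code that treats them differently is keying on the storage flag rather than on the string's content.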
Subject: ->appendText suffers from The Unicode Bug
It's not limited to when the document encoding is UTF-8. If the document encoding were changed to "cp437", one gets the following output:

    0000000   <   ?   x   m   l       v   e   r   s   i   o   n   =   "   1
    0000020   "       e   n   c   o   d   i   n   g   =   "   c   p   4   3
    0000040   7   "   ?   >  \n   <   r   o   o   t   >   <   e   l   e   >
    0000060 311   r   i   c   <   /   e   l   e   >   <   e   l   e   > 220
    0000100   r   i   c   <   /   e   l   e   >   <   /   r   o   o   t   >
    0000120  \n
    0000121

Again, the result is correct (0220) when the input string has UTF8=1, but incorrect (0311) when the input string has UTF8=0.

Two strings that are equal should produce the same output. Producing different output based on how they are stored is a bug.

The internal storage format no effect
Subject: ->appendText suffers from The Unicode Bug
On Thu Jun 09 15:14:51 2016, ikegami wrote:
> The internal storage format no effect
Ignore this line. It's just garbage.
UTF-16le:

    0000000   <  \0   ?  \0   x  \0   m  \0   l  \0      \0   v  \0   e  \0
    0000020   r  \0   s  \0   i  \0   o  \0   n  \0   =  \0   "  \0   1  \0
    0000040   "  \0      \0   e  \0   n  \0   c  \0   o  \0   d  \0   i  \0
    0000060   n  \0   g  \0   =  \0   "  \0   U  \0   T  \0   F  \0   -  \0
    0000100   1  \0   6  \0   l  \0   e  \0   "  \0   ?  \0   >  \0  \n  \0
    0000120   <  \0   r  \0   o  \0   o  \0   t  \0   >  \0   <  \0   e  \0
    0000140   l  \0   e  \0   >  \0 311   r   i   c   <  \0   /  \0   e  \0
    0000160   l  \0   e  \0   >  \0   <  \0   e  \0   l  \0   e  \0   >  \0
    0000200 311  \0   r  \0   i  \0   c  \0   <  \0   /  \0   e  \0   l  \0
    0000220   e  \0   >  \0   <  \0   /  \0   r  \0   o  \0   o  \0   t  \0
    0000240   >  \0  \n  \0
    0000244

"311 r i c "!!!
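For comparison, here is a small sketch of the bytes one would expect for "É" in each of the three encodings tried in this ticket. It uses the core Encode module rather than XML::LibXML, so it only illustrates the expected result and is not a statement about how XML::LibXML encodes internally. The octal_bytes helper is hypothetical and exists only for this example.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: encode $text as $encoding and return the
# resulting bytes as space-separated octal, similar to od -c output.
sub octal_bytes {
    my ($encoding, $text) = @_;
    my $bytes = encode($encoding, $text);
    return join ' ', map { sprintf '%03o', ord($_) } split //, $bytes;
}

# The same character in both internal storage formats.
my $up   = "\x{C9}";  utf8::upgrade($up);
my $down = "\x{C9}";  utf8::downgrade($down);

for my $encoding (qw( UTF-8 cp437 UTF-16LE )) {
    printf "%-8s upgraded=%s downgraded=%s\n",
        $encoding,
        octal_bytes($encoding, $up),
        octal_bytes($encoding, $down);
}
```

Encode gives the same bytes for both storage formats: 303 211 for UTF-8, 220 for cp437, and 311 000 for UTF-16LE. Those are the values seen in the dumps for the upgraded string, and what the report argues the downgraded string should also produce.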
Hi IKEGAMI, please provide a test case as an attachment of a perl file - not as a command line -e invocation. If it can be a test file that can be run as prove t/mytest.t that would be best.
Attached.
Subject: rt115190.t
#!/usr/bin/perl

use strict;
use warnings;

use XML::LibXML qw( );
use Test::More;

sub _u($) { my $s = shift; utf8::upgrade($s); $s }
sub _d($) { my $s = shift; utf8::downgrade($s, 1); $s }

sub gen_text_node {
   my ($encoding, $text) = @_;
   my $doc = XML::LibXML::Document->new("1.0", $encoding);
   my $root = $doc->createElement("root");
   $doc->setDocumentElement($root);
   $root->appendText($text);
   return $doc->toString();
}

sub parse_text_node {
   my ($xml) = @_;
   my $parser = XML::LibXML->new();
   my $doc = $parser->parse_string($xml);
   return $doc->documentElement->textContent();
}

sub gen_attr_node {
   my ($encoding, $text) = @_;
   my $doc = XML::LibXML::Document->new("1.0", $encoding);
   my $root = $doc->createElement("root");
   $doc->setDocumentElement($root);
   $root->setAttribute('attr', $text);
   return $doc->toString();
}

sub parse_attr_node {
   my ($xml) = @_;
   my $parser = XML::LibXML->new();
   my $doc = $parser->parse_string($xml);
   return $doc->documentElement->getAttribute('attr');
}

{
   my @encodings = qw( UTF-8 cp437 UTF-16le );

   plan tests => 1+2*@encodings*2;

   my $text = "\xC9\xE9";
   my $text_u = _u $text;
   my $text_d = _d $text;
   is($text_d, $text_u, "assert(\$text_d eq \$text_u)");

   for (
      [ "upgraded" => $text_u ],
      [ "downgraded" => $text_d ],
   ) {
      my ($format, $text) = @$_;
      for my $encoding (@encodings) {
         {
            my $xml = gen_text_node($encoding, $text);
            if (!eval {
               my $got = parse_text_node($xml);
               is(sprintf("%vX", $got), sprintf("%vX", $text), "Round trip $format text node using $encoding");
               return 1;
            }) {
               fail("Round trip $format text node using $encoding")
                  or diag( ( split /^/, $@ )[0] );
            }
         }

         {
            my $xml = gen_attr_node($encoding, $text);
            if (!eval {
               my $got = parse_attr_node($xml);
               is(sprintf("%vX", $got), sprintf("%vX", $text), "Round trip $format attribute using $encoding");
               return 1;
            }) {
               fail("Round trip $format attribute using $encoding")
                  or diag( ( split /^/, $@ )[0] );
            }
         }
      }
   }
}
FWIW, I think this is actually (and historically) WAI, see http://search.cpan.org/dist/XML-LibXML/LibXML.pod#ENCODINGS_SUPPORT_IN_XML::LIBXML To avoid, always make sure the input string to the API has utf-8 flag on.
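The suggested workaround could be sketched roughly as follows. The upgraded helper is hypothetical (it is not part of XML::LibXML), and the sketch assumes the string literal starts out stored downgraded, which is the usual case for a literal containing only code points below 0x100.

```perl
use strict;
use warnings;

# Hypothetical helper (not an XML::LibXML function): return a copy of
# the string with the UTF8 flag on, leaving the caller's variable as is.
sub upgraded {
    my ($s) = @_;          # copy, so the original scalar is not modified
    utf8::upgrade($s);
    return $s;
}

my $text = "\x{C9}ric";    # assumed to be stored downgraded
my $safe = upgraded($text);

my $flag_before = utf8::is_utf8($text) ? 1 : 0;
my $flag_after  = utf8::is_utf8($safe) ? 1 : 0;
my $still_equal = $text eq $safe ? 1 : 0;

print "flag before=$flag_before after=$flag_after equal=$still_equal\n";
```

A call such as $ele->appendText(upgraded($text)) would then pass the flagged copy to the API, at the cost of wrapping every string handed to the library.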
CC: IKEGAMI [...] cpan.org
Subject: Re: [rt.cpan.org #115190] ->appendText suffers from The Unicode Bug
Date: Thu, 9 Jun 2016 19:00:30 -0400
To: bug-XML-LibXML [...] rt.cpan.org
From: Eric Brine <ikegami [...] adaelis.com>
Explicitly upgrading every string to pass to XML::LibXML is frankly ridiculous, and so rarely needed that no one's going to even think of doing it until this bug causes a problem. Like just happened to me.

I appreciate the backwards compatibility requirement, so I propose the addition of a parser option.

On Thu, Jun 9, 2016 at 6:00 PM, Petr Pajas via RT <bug-XML-LibXML@rt.cpan.org> wrote:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=115190 >
>
> FWIW, I think this is actually (and historically) WAI, see
>
> http://search.cpan.org/dist/XML-LibXML/LibXML.pod#ENCODINGS_SUPPORT_IN_XML::LIBXML
>
> To avoid, always make sure the input string to the API has utf-8 flag on.
>