Bug #53532 for XML-LibXML: appendTextChild is sensitive to internal format of text

Fri Jan 08 23:47:13 2010 IKEGAMI [...] cpan.org - Ticket created

appendTextChild is sensitive to the internal format Perl is using to store the string containing its second argument.

---------- BEGIN CODE ----------
use strict;
use warnings;

use XML::LibXML qw( );

my $s = "abcd\x{f6}efgh";

if ($ARGV[0]) {
   # One internal format
   utf8::downgrade($s);
} else {
   # Other internal format
   utf8::upgrade($s);
}

XML::LibXML::Element->new('foo')->appendTextChild('Node', $s);

---------- END CODE ----------

---------- BEGIN OUTPUT ----------

>perl test.pl 0

>perl test.pl 1

error : xmlEncodeEntitiesReentrant : char out of range

---------- END OUTPUT ----------

Interestingly, the resulting XML is identical.
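
For readers unfamiliar with the two "internal formats" mentioned in the report, here is a minimal sketch using only core Perl (no XML::LibXML). It suggests that utf8::upgrade and utf8::downgrade change how the string is stored but not its value as seen from Perl. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $up   = "abcd\x{f6}efgh";
my $down = "abcd\x{f6}efgh";

utf8::upgrade($up);      # stored internally as UTF-8 (UTF8 flag on)
utf8::downgrade($down);  # stored internally as one byte per character (UTF8 flag off)

# From Perl's point of view the two strings are the same text.
printf "equal as strings: %s\n", ($up eq $down ? "yes" : "no");
printf "length          : %d vs %d\n", length($up), length($down);

# Only the internal storage differs, which is what the UTF8 flag reports.
printf "UTF8 flag       : %d vs %d\n",
    (utf8::is_utf8($up)   ? 1 : 0),
    (utf8::is_utf8($down) ? 1 : 0);
```

Code that inspects the internal buffer directly, as an XS module may do, can therefore see different bytes for what Perl considers the same string.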

Fri Jan 08 23:47:45 2010 IKEGAMI [...] cpan.org - Subject changed from (no value) to 'appendTextChild is sensitive to internal format of text'

Tue Feb 02 09:54:23 2010 CJK [...] cpan.org - Correspondence added

More seriously, for other Latin-1 characters, the output gets mangled and no warning is emitted.

# perl test.pl f6 0
<?xml version="1.0" encoding="UTF-8"?>
<foo><Node>abcdöefgh</Node></foo>

# perl test.pl f6 1
error : xmlEncodeEntitiesReentrant : char out of range
<?xml version="1.0" encoding="UTF-8"?>
<foo><Node>abcdöefgh</Node></foo>

# perl test.pl e1 0
<?xml version="1.0" encoding="UTF-8"?>
<foo><Node>abcdáefgh</Node></foo>

# perl test.pl e1 1
<?xml version="1.0" encoding="UTF-8"?>
<foo><Node>abcdᥦgh</Node></foo>

Subject:

test.pl

#!/usr/bin/perl
use strict;
use warnings;

use XML::LibXML qw( );

my $s = "abcd" . chr(hex($ARGV[0])) . "efgh";

if ($ARGV[1]) {
   # One internal format
   utf8::downgrade($s);
} else {
   # Other internal format
   utf8::upgrade($s);
}

my $e = XML::LibXML::Element->new('foo');
$e->appendTextChild('Node', $s);
my $d = XML::LibXML::Document->new("1.0","UTF-8");
$d->setDocumentElement($e);
$e->appendTextChild('Node', $s);
print $d->toString();

Tue Feb 02 09:54:25 2010 The RT System itself - Status changed from 'new' to 'open'

Tue Feb 02 10:05:31 2010 CJK [...] cpan.org - Correspondence added

Please disregard the previous comment and attachment; it confuses two issues that were meant to be reported separately.

test1.pl demonstrates how some Latin-1 input characters get silently corrupted:

# perl test1.pl f6 0
<?xml version="1.0" encoding="UTF-8"?>
<foo><Node>abcdöefgh</Node></foo>

# perl test1.pl f6 1
error : xmlEncodeEntitiesReentrant : char out of range
<?xml version="1.0" encoding="UTF-8"?>
<foo><Node>abcdöefgh</Node></foo>

# perl test1.pl e1 0
<?xml version="1.0" encoding="UTF-8"?>
<foo><Node>abcdáefgh</Node></foo>

# perl test1.pl e1 1
<?xml version="1.0" encoding="UTF-8"?>
<foo><Node>abcdᥦgh</Node></foo>

test2.pl demonstrates how invalid UTF-8 output can result:

#perl /srv/scratch/test2.pl f6 0
<?xml version="1.0" encoding="UTF-8"?>
<foo><Node>abcdöefgh</Node></foo>

#perl /srv/scratch/test2.pl f6 1
<?xml version="1.0" encoding="UTF-8"?>
<foo><Node>abcd?efgh</Node></foo>

#perl /srv/scratch/test2.pl f6 1 | hexdump -Cv
00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version="1|
00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 55 54  |.0" encoding="UT|
00000020  46 2d 38 22 3f 3e 0a 3c  66 6f 6f 3e 3c 4e 6f 64  |F-8"?>.<foo><Nod|
00000030  65 3e 61 62 63 64 f6 65  66 67 68 3c 2f 4e 6f 64  |e>abcd.efgh</Nod|
00000040  65 3e 3c 2f 66 6f 6f 3e  0a                       |e></foo>.|
00000049

Subject:

test2.pl

#!/usr/bin/perl
use strict;
use warnings;

use XML::LibXML qw( );

my $s = "abcd" . chr(hex($ARGV[0])) . "efgh";

if ($ARGV[1]) {
   # One internal format
   utf8::downgrade($s);
} else {
   # Other internal format
   utf8::upgrade($s);
}

my $e = XML::LibXML::Element->new('foo');
my $d = XML::LibXML::Document->new("1.0","UTF-8");
$d->setDocumentElement($e);
$e->appendTextChild('Node', $s);
print $d->toString();

Subject:

test1.pl

#!/usr/bin/perl
use strict;
use warnings;

use XML::LibXML qw( );

my $s = "abcd" . chr(hex($ARGV[0])) . "efgh";

if ($ARGV[1]) {
   # One internal format
   utf8::downgrade($s);
} else {
   # Other internal format
   utf8::upgrade($s);
}

my $e = XML::LibXML::Element->new('foo');
$e->appendTextChild('Node', $s);
my $d = XML::LibXML::Document->new("1.0","UTF-8");
$d->setDocumentElement($e);
print $d->toString();
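
The two symptoms in this comment can be reproduced in miniature with core Perl alone. The sketch below assumes (this is an interpretation, not something confirmed in the ticket) that the raw internal buffer of the downgraded string is handed to libxml2 as if it were UTF-8. It shows that the lone f6 byte from the test2.pl hexdump is not well-formed UTF-8, and that reading e1 65 66 as a three-byte sequence without checking the continuation bytes yields a single code point, which would account for "áef" collapsing into one character in the test1.pl output. The helper name is_valid_utf8 is introduced here for illustration only.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Returns 1 if the byte string is well-formed UTF-8, 0 otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;   # work on a copy; decode() may modify its argument when CHECK is set
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}

my $downgraded = "abcd\x{f6}efgh";
utf8::downgrade($downgraded);                # internal buffer now holds a single f6 byte

my $encoded = encode('UTF-8', $downgraded);  # explicit encoding: the character becomes c3 b6

printf "raw buffer: %s valid=%d\n", unpack('H*', $downgraded), is_valid_utf8($downgraded);
printf "encoded   : %s valid=%d\n", unpack('H*', $encoded),    is_valid_utf8($encoded);

# Naive three-byte decode of e1 65 66, ignoring that 65 and 66 are not
# continuation bytes (plain arithmetic, no library involved).
my $cp = ((0xE1 & 0x0F) << 12) | ((0x65 & 0x3F) << 6) | (0x66 & 0x3F);
printf "naive decode of e1 65 66: U+%04X\n", $cp;
```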

Thu Nov 25 15:09:24 2010 daniel.frett [...] ccci.org - Correspondence added

From:

daniel.frett [...] ccci.org

This is actually documented behavior. The XML::LibXML documentation, under the "ENCODINGS SUPPORT IN XML::LIBXML" section, says:

   3. DOM methods also accept binary strings in the original encoding of the document to which the node belongs (UTF-8 is assumed if the node is not attached to any document). Exploiting this feature is NOT RECOMMENDED since it is considered a bad practice.

      my $doc = XML::LibXML::Document->new('1.0','iso-8859-2');
      my $text = $doc->createTextNode($some_latin2_encoded_byte_string);
      # WORKS, BUT NOT RECOMMENDED!

I personally would prefer it if XML::LibXML treated all strings as character strings by default and converted them automatically, especially since the documented functionality is considered "bad practice". But because it is documented behavior, this will break backwards compatibility.

An alternative that would help move people away from the bad practice of setting byte strings directly, and encourage the use of character strings, would be to add a global flag that turns treating all strings as character strings on or off. Then, following a deprecation schedule over several versions, start with the option disabled by default, then enable the option by default, and (maybe?) at some point remove the byte string support.

On Fri Jan 08 23:47:13 2010, ikegami wrote:
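
As a rough illustration of the character-string approach that the documentation recommends over passing encoded bytes, the sketch below decodes an ISO-8859-2 byte string with the core Encode module before it would be handed to a DOM method. The sample bytes and variable names are made up for this example, and the XML::LibXML call is shown only as a comment so the snippet runs without the module installed.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical input: four characters encoded as ISO-8859-2 bytes
# (a3 = L with stroke, f3 = o with acute, 64 = d, bc = z with acute).
my $latin2_bytes = "\xA3\xF3d\xBC";

# Decode once at the boundary so the rest of the program works with characters.
my $chars = decode('iso-8859-2', $latin2_bytes);

printf "%d bytes in, %d characters out\n", length($latin2_bytes), length($chars);
printf "code points: %s\n", join(' ', map { sprintf 'U+%04X', ord } split //, $chars);

# The character string, not the byte string, is what would then be passed on:
#   my $text = $doc->createTextNode($chars);
```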

> appendTextChild is sensitive to the internal format Perl is using to
> store the string containing its second argument.
>
> ---------- BEGIN CODE ----------
> use strict;
> use warnings;
>
> use XML::LibXML qw( );
>
> my $s = "abcd\x{f6}efgh";
>
> if ($ARGV[0]) {
>    # One internal format
>    utf8::downgrade($s);
> } else {
>    # Other internal format
>    utf8::upgrade($s);
> }
>
> XML::LibXML::Element->new('foo')->appendTextChild('Node', $s);
> ---------- END CODE ----------
>
> ---------- BEGIN OUTPUT ----------
> >perl test.pl 0
>
> >perl test.pl 1
> error : xmlEncodeEntitiesReentrant : char out of range
> ---------- END OUTPUT ----------
>
> Interestingly, the resulting XML is identical.

Thu Nov 25 16:49:11 2010 ikegami [...] adaelis.com - Correspondence added

Subject:	Re: [rt.cpan.org #53532] appendTextChild is sensitive to internal format of text
Date:	Thu, 25 Nov 2010 16:49:03 -0500
To:	bug-XML-LibXML [...] rt.cpan.org
From:	Eric Brine <ikegami [...] adaelis.com>

On Thu, Nov 25, 2010 at 3:09 PM, Daniel Frett via RT <bug-XML-LibXML@rt.cpan.org> wrote:

> especially since the documented functionality is considered
> "bad practice".

It's bad practice because it makes the following two statements not equivalent even though "é" is Unicode character E9:

use utf8; ->createTextNode("abcdéf");
use utf8; ->createTextNode("abcd\x{E9}f");

Workaround for Perl 5.12+:

use utf8; ->createTextNode("abcd\N{U+E9}f");
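
The difference between the two escape forms can be observed without XML::LibXML. This sketch (core Perl only, 5.12 or later assumed) compares the \x{E9} and \N{U+E9} spellings; the literal "é" form is left out because its behavior depends on how the source file itself is encoded. Variable names are illustrative.

```perl
use strict;
use warnings;

my $hex   = "abcd\x{E9}f";    # \x{..} below 0x100: stored one byte per character
my $named = "abcd\N{U+E9}f";  # \N{U+..}: stored internally as UTF-8 on Perl 5.12+

# Both literals denote the same six-character string.
printf "same string: %s\n", ($hex eq $named ? "yes" : "no");

# Their internal storage differs, which is what a byte-oriented consumer sees.
printf "UTF8 flag  : \\x{E9}=%d  \\N{U+E9}=%d\n",
    (utf8::is_utf8($hex)   ? 1 : 0),
    (utf8::is_utf8($named) ? 1 : 0);
```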

> But because it is documented behavior, this will break backwards
> compatibility.

That's unfortunate. Your solution sounds perfectly reasonable, though.

Thanks for having a look,
Eric

Wed Jul 20 16:11:57 2011 SHLOMIF [...] cpan.org - Correspondence added

Resolving as rejected per the discussion.

Regards,

-- Shlomi Fish

Wed Jul 20 16:11:58 2011 SHLOMIF [...] cpan.org - Status changed from 'open' to 'rejected'

Wed Jul 20 17:17:33 2011 ikegami [...] adaelis.com - Correspondence added

CC:	IKEGAMI [...] cpan.org
Subject:	Re: [rt.cpan.org #53532] appendTextChild is sensitive to internal format of text
Date:	Wed, 20 Jul 2011 17:17:22 -0400
To:	bug-XML-LibXML [...] rt.cpan.org
From:	Eric Brine <ikegami [...] adaelis.com>

On Wed, Jul 20, 2011 at 4:11 PM, Shlomi Fish via RT <bug-XML-LibXML@rt.cpan.org> wrote:

> <URL: https://rt.cpan.org/Ticket/Display.html?id=53532 >
>
> Resolving as rejected per the discussion.
>

The *ticket* shouldn't be rejected. The only thing that was rejected was *changing createTextNode*. That does not preclude adding a function that actually works.

Wed Jul 20 17:17:35 2011 The RT System itself - Status changed from 'rejected' to 'open'

Wed Jul 20 17:26:29 2011 ikegami [...] adaelis.com - Correspondence added

CC:	IKEGAMI [...] cpan.org
Subject:	Re: [rt.cpan.org #53532] appendTextChild is sensitive to internal format of text
Date:	Wed, 20 Jul 2011 17:26:15 -0400
To:	bug-XML-LibXML [...] rt.cpan.org
From:	Eric Brine <ikegami [...] adaelis.com>

On Wed, Jul 20, 2011 at 5:17 PM, Eric Brine <ikegami@adaelis.com> wrote:

> On Wed, Jul 20, 2011 at 4:11 PM, Shlomi Fish via RT <
> bug-XML-LibXML@rt.cpan.org> wrote:
>
>> <URL: https://rt.cpan.org/Ticket/Display.html?id=53532 >
>>
>> Resolving as rejected per the discussion.
>>
>
> The *ticket* shouldn't be rejected. The only thing that was rejected was
> *changing createTextNode*. That does not preclude adding a function that
> actually works.
>

Possible ways forward:

- Adding a note in the documentation that utf8::upgrade must be called on text passed to createTextNode.
- Adding a new function that always treats the argument as text.
- Adding a config option that makes createTextNode always treat the argument as text.
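
Until one of these options exists, the first suggestion can be approximated on the caller's side. The sketch below is only that: as_text is a hypothetical helper, not part of XML::LibXML, which returns an upgraded copy of its argument so the caller's own scalar is left as it was. The DOM call is shown as a comment so the snippet runs with core Perl alone.

```perl
use strict;
use warnings;

# Hypothetical helper (not an XML::LibXML function): returns a copy of the
# string that is stored internally as UTF-8, leaving the original untouched.
sub as_text {
    my ($s) = @_;        # copy of the caller's value
    utf8::upgrade($s);   # same characters, UTF-8 internal storage
    return $s;
}

my $s = "abcd\x{f6}efgh";
utf8::downgrade($s);     # the internal format that triggered the error in this ticket

my $text = as_text($s);

printf "same text      : %s\n", ($text eq $s ? "yes" : "no");
printf "flag on copy   : %d\n", (utf8::is_utf8($text) ? 1 : 0);
printf "flag on source : %d\n", (utf8::is_utf8($s)    ? 1 : 0);

# Intended use, assuming XML::LibXML is installed:
#   $element->appendTextChild('Node', as_text($s));
```

Returning a copy rather than upgrading in place is a design choice made for this sketch; calling utf8::upgrade directly on the variable, as the first bullet describes, would work as well.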