This queue is for tickets about the XML-LibXML CPAN distribution.

Report information
The Basics
Id: 53532
Status: open
Priority: 0
Queue: XML-LibXML

People
Owner: Nobody in particular
Requestors: IKEGAMI [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in:
  • 1.66
  • 1.70
Fixed in: (no value)



appendTextChild is sensitive to the internal format Perl is using to store the string containing its second argument.

---------- BEGIN CODE ----------
use strict;
use warnings;

use XML::LibXML qw( );

my $s = "abcd\x{f6}efgh";

if ($ARGV[0]) {
   # One internal format
   utf8::downgrade($s);
} else {
   # Other internal format
   utf8::upgrade($s);
}

XML::LibXML::Element->new('foo')->appendTextChild('Node', $s);
---------- END CODE ----------
---------- BEGIN OUTPUT ----------
>perl test.pl 0
>perl test.pl 1
error : xmlEncodeEntitiesReentrant : char out of range
---------- END OUTPUT ----------

Interestingly, the resulting XML is identical.
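
A minimal sketch, using only core Perl and no XML::LibXML, of what the two branches of the script above change: utf8::upgrade and utf8::downgrade alter how the string is stored internally, not which characters it contains. The variable names are illustrative.

```perl
use strict;
use warnings;

my $up   = "abcd\x{f6}efgh";
my $down = "abcd\x{f6}efgh";

utf8::upgrade($up);      # stored internally as UTF-8
utf8::downgrade($down);  # stored internally as one byte per character

# The two strings are the same text as far as Perl code is concerned.
printf "same characters: %s\n", ($up eq $down ? "yes" : "no");
printf "same length:     %s\n", (length($up) == length($down) ? "yes" : "no");

# Only the internal storage flag differs.
printf "upgraded flag:   %d\n", (utf8::is_utf8($up)   ? 1 : 0);
printf "downgraded flag: %d\n", (utf8::is_utf8($down) ? 1 : 0);
```

Since both strings compare equal, a function that treats its argument as text would be expected to behave the same for either one.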
More seriously, for other Latin-1 characters, the output gets mangled and no warning is emitted.

# perl test.pl f6 0
<?xml version="1.0" encoding="UTF-8"?>
<foo><Node>abcdöefgh</Node></foo>

# perl test.pl f6 1
error : xmlEncodeEntitiesReentrant : char out of range
<?xml version="1.0" encoding="UTF-8"?>
<foo><Node>abcdöefgh</Node></foo>

# perl test.pl e1 0
<?xml version="1.0" encoding="UTF-8"?>
<foo><Node>abcdáefgh</Node></foo>

# perl test.pl e1 1
<?xml version="1.0" encoding="UTF-8"?>
<foo><Node>abcdᥦgh</Node></foo>
Subject: test.pl
#!/usr/bin/perl

use strict;
use warnings;

use XML::LibXML qw( );

my $s = "abcd" . chr(hex($ARGV[0])) . "efgh";

if ($ARGV[1]) {
    # One internal format
    utf8::downgrade($s);
} else {
    # Other internal format
    utf8::upgrade($s);
}

my $e = XML::LibXML::Element->new('foo');
$e->appendTextChild('Node', $s);
my $d = XML::LibXML::Document->new("1.0","UTF-8");
$d->setDocumentElement($e);
$e->appendTextChild('Node', $s);
print $d->toString();
Please disregard the previous comment and attachment; they confuse two issues that were meant to be reported separately.

test1.pl demonstrates how some Latin-1 input characters get silently corrupted:

# perl test1.pl f6 0
<?xml version="1.0" encoding="UTF-8"?>
<foo><Node>abcdöefgh</Node></foo>

# perl test1.pl f6 1
error : xmlEncodeEntitiesReentrant : char out of range
<?xml version="1.0" encoding="UTF-8"?>
<foo><Node>abcdöefgh</Node></foo>

# perl test1.pl e1 0
<?xml version="1.0" encoding="UTF-8"?>
<foo><Node>abcdáefgh</Node></foo>

# perl test1.pl e1 1
<?xml version="1.0" encoding="UTF-8"?>
<foo><Node>abcdᥦgh</Node></foo>

test2.pl demonstrates how invalid UTF-8 output can result:

#perl /srv/scratch/test2.pl f6 0
<?xml version="1.0" encoding="UTF-8"?>
<foo><Node>abcdöefgh</Node></foo>

#perl /srv/scratch/test2.pl f6 1
<?xml version="1.0" encoding="UTF-8"?>
<foo><Node>abcd?efgh</Node></foo>

#perl /srv/scratch/test2.pl f6 1 | hexdump -Cv
00000000  3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31  |<?xml version="1|
00000010  2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 55 54  |.0" encoding="UT|
00000020  46 2d 38 22 3f 3e 0a 3c 66 6f 6f 3e 3c 4e 6f 64  |F-8"?>.<foo><Nod|
00000030  65 3e 61 62 63 64 f6 65 66 67 68 3c 2f 4e 6f 64  |e>abcd.efgh</Nod|
00000040  65 3e 3c 2f 66 6f 6f 3e 0a                       |e></foo>.|
00000049
Subject: test2.pl
#!/usr/bin/perl

use strict;
use warnings;

use XML::LibXML qw( );

my $s = "abcd" . chr(hex($ARGV[0])) . "efgh";

if ($ARGV[1]) {
    # One internal format
    utf8::downgrade($s);
} else {
    # Other internal format
    utf8::upgrade($s);
}

my $e = XML::LibXML::Element->new('foo');
my $d = XML::LibXML::Document->new("1.0","UTF-8");
$d->setDocumentElement($e);
$e->appendTextChild('Node', $s);
print $d->toString();
Subject: test1.pl
#!/usr/bin/perl

use strict;
use warnings;

use XML::LibXML qw( );

my $s = "abcd" . chr(hex($ARGV[0])) . "efgh";

if ($ARGV[1]) {
    # One internal format
    utf8::downgrade($s);
} else {
    # Other internal format
    utf8::upgrade($s);
}

my $e = XML::LibXML::Element->new('foo');
$e->appendTextChild('Node', $s);
my $d = XML::LibXML::Document->new("1.0","UTF-8");
$d->setDocumentElement($e);
print $d->toString();
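
The test1.pl results reported above are consistent with the downgraded string's bytes being read as UTF-8 without the continuation bytes being validated. That cause is an assumption, not something the ticket confirms; the arithmetic below only shows that such a lenient decode reproduces both observations. The helper names are hypothetical and nothing here calls XML::LibXML or libxml2.

```perl
use strict;
use warnings;

# Hypothetical helpers: combine a lead byte and the bytes that follow it
# the way a UTF-8 decoder would, but WITHOUT checking that the trailing
# bytes really are continuation bytes (10xxxxxx).
sub lenient_decode3 {
    my ($b0, $b1, $b2) = @_;
    return (($b0 & 0x0F) << 12) | (($b1 & 0x3F) << 6) | ($b2 & 0x3F);
}

sub lenient_decode4 {
    my ($b0, $b1, $b2, $b3) = @_;
    return (($b0 & 0x07) << 18) | (($b1 & 0x3F) << 12)
         | (($b2 & 0x3F) << 6)  |  ($b3 & 0x3F);
}

# "abcd\x{e1}efgh" downgraded: 0xE1 looks like a 3-byte lead, so it
# would swallow the following "e" and "f".
my $cp3 = lenient_decode3(0xE1, ord('e'), ord('f'));
printf "E1 65 66    -> U+%04X\n", $cp3;

# "abcd\x{f6}efgh" downgraded: 0xF6 looks like a 4-byte lead, so it
# would swallow "e", "f" and "g".
my $cp4 = lenient_decode4(0xF6, ord('e'), ord('f'), ord('g'));
printf "F6 65 66 67 -> 0x%X (%s)\n", $cp4,
    ($cp4 > 0x10FFFF ? "beyond U+10FFFF" : "valid range");
```

The first result, U+1966, is the character whose UTF-8 encoding is E1 A5 A6, which matches the "e1 1" output where "ef" disappears. The second result lies beyond U+10FFFF, which would fit the "char out of range" message printed for "f6 1".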
From: daniel.frett [...] ccci.org
This is actually documented behavior. The XML::LibXML documentation, under the "ENCODINGS SUPPORT IN XML::LIBXML" section, says:

3. DOM methods also accept binary strings in the original encoding of the document to which the node belongs (UTF-8 is assumed if the node is not attached to any document). Exploiting this feature is NOT RECOMMENDED since it is considered a bad practice.

my $doc = XML::LibXML::Document->new('1.0','iso-8859-2');
my $text = $doc->createTextNode($some_latin2_encoded_byte_string);
# WORKS, BUT NOT RECOMMENDED!

I personally would prefer that XML::LibXML treat all strings as character strings by default and convert them automatically, especially since the documented functionality is considered "bad practice". But because it is documented behavior, changing it would break backwards compatibility.

An alternative that would help move people away from setting byte strings directly, and encourage the use of character strings, would be a global flag that turns treating all strings as character strings on or off. Then, through a deprecation schedule over several versions, start with the option disabled by default, then enable the option by default, and (maybe?) at some point remove the byte string support.

On Fri Jan 08 23:47:13 2010, ikegami wrote:
> appendTextChild is sensitive to the internal format Perl is using to
> store the string containing its second argument.
>
> ---------- BEGIN CODE ----------
> use strict;
> use warnings;
>
> use XML::LibXML qw( );
>
> my $s = "abcd\x{f6}efgh";
>
> if ($ARGV[0]) {
>    # One internal format
>    utf8::downgrade($s);
> } else {
>    # Other internal format
>    utf8::upgrade($s);
> }
>
> XML::LibXML::Element->new('foo')->appendTextChild('Node', $s);
> ---------- END CODE ----------
>
> ---------- BEGIN OUTPUT ----------
> >perl test.pl 0
>
> >perl test.pl 1
> error : xmlEncodeEntitiesReentrant : char out of range
> ---------- END OUTPUT ----------
>
> Interestingly, the resulting XML is identical.
Subject: Re: [rt.cpan.org #53532] appendTextChild is sensitive to internal format of text
Date: Thu, 25 Nov 2010 16:49:03 -0500
To: bug-XML-LibXML [...] rt.cpan.org
From: Eric Brine <ikegami [...] adaelis.com>
On Thu, Nov 25, 2010 at 3:09 PM, Daniel Frett via RT <bug-XML-LibXML@rt.cpan.org> wrote:
> especially since the documented functionality is considered
> "bad practice".

It's bad practice because it makes the following two statements not equivalent even though "é" is Unicode character E9:

use utf8; ->createTextNode("abcdéf");
use utf8; ->createTextNode("abcd\x{E9}f");

Workaround for Perl 5.12+:

use utf8; ->createTextNode("abcd\N{U+E9}f");

> But because it is documented behavior, this will break backwards
> compatibility.

That's unfortunate. Your solution sounds perfectly reasonable, though.

Thanks for having a look,
Eric
Resolving as rejected per the discussion.

Regards,

-- Shlomi Fish
CC: IKEGAMI [...] cpan.org
Subject: Re: [rt.cpan.org #53532] appendTextChild is sensitive to internal format of text
Date: Wed, 20 Jul 2011 17:17:22 -0400
To: bug-XML-LibXML [...] rt.cpan.org
From: Eric Brine <ikegami [...] adaelis.com>
On Wed, Jul 20, 2011 at 4:11 PM, Shlomi Fish via RT <bug-XML-LibXML@rt.cpan.org> wrote:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=53532 >
>
> Resolving as rejected per the discussion.
>
The *ticket* shouldn't be rejected. The only thing that was rejected was *changing createTextNode*. That does not preclude adding a function that actually works.
CC: IKEGAMI [...] cpan.org
Subject: Re: [rt.cpan.org #53532] appendTextChild is sensitive to internal format of text
Date: Wed, 20 Jul 2011 17:26:15 -0400
To: bug-XML-LibXML [...] rt.cpan.org
From: Eric Brine <ikegami [...] adaelis.com>
On Wed, Jul 20, 2011 at 5:17 PM, Eric Brine <ikegami@adaelis.com> wrote:
> On Wed, Jul 20, 2011 at 4:11 PM, Shlomi Fish via RT <bug-XML-LibXML@rt.cpan.org> wrote:
>
>> <URL: https://rt.cpan.org/Ticket/Display.html?id=53532 >
>>
>> Resolving as rejected per the discussion.
>>
>
> The *ticket* shouldn't be rejected. The only thing that was rejected was
> *changing createTextNode*. That does not preclude adding a function that
> actually works.
>
Possible ways forward:

- Adding a note in the documentation that utf8::upgrade must be called on text passed to createTextNode.
- Adding a new function that always treats the argument as text.
- Adding a config option that makes createTextNode always treat the argument as text.
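
The first option above (calling utf8::upgrade on text before passing it in) could be wrapped in a small helper on the caller's side. This is only a sketch: as_text is a hypothetical name, not part of XML::LibXML, and the block uses core Perl only so that it runs without the module installed.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::LibXML): return a copy of the
# argument stored in Perl's upgraded (UTF-8) internal format.
sub as_text {
    my ($s) = @_;        # copy, so the caller's string is left untouched
    utf8::upgrade($s);
    return $s;
}

my $s = "abcd\x{e1}efgh";
utf8::downgrade($s);     # the internal format that triggers the problem

my $t = as_text($s);

printf "same text:        %s\n", ($s eq $t ? "yes" : "no");
printf "original flagged: %d\n", (utf8::is_utf8($s) ? 1 : 0);
printf "copy flagged:     %d\n", (utf8::is_utf8($t) ? 1 : 0);

# Intended use, assuming $e is an XML::LibXML::Element:
#   $e->appendTextChild('Node', as_text($s));
```

Working on a copy keeps the helper free of side effects on the caller's variable, at the cost of one extra string copy per call.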