
This queue is for tickets about the XML-LibXML CPAN distribution.

Report information
The Basics
Id: 32152
Status: resolved
Priority: 0/
Queue: XML-LibXML

People
Owner: Nobody in particular
Requestors: gert [...] space.net
MARKOV [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Critical
Broken in: 1.65
Fixed in: (no value)



Subject: charset problems
It is very simple to break XML::LibXML: simply use latin1 characters above 127, as used in most European languages. I suspect that the library is using old interfaces to Perl's Unicode support, or simplifying life a bit too much. The attached file demonstrates the problems (same results for perl5.8.7 and perl5.10.0).

Is libxml2 pure utf8? Then each string parameter passed in by the user should go through encode("utf8", $string). Also, $xml->toString() should temporarily set binmode on the output file to ":encoding(charset)", where charset is the one defined at document creation.

Changing the interface a little in this respect is without consequences: the current behavior is very broken, as the attachment demonstrates.
Subject: utf
Download utf
application/octet-stream 1k


CC: bug-XML-LibXML [...] rt.cpan.org
Subject: Re: [bug-XML-LibXML@rt.cpan.org: [rt.cpan.org #32152] charset problems
Date: Tue, 8 Jan 2008 12:35:23 +0100
To: Gert Doering <gert [...] space.net>
From: Mark Overmeer <solutions [...] overmeer.net>
* Gert Doering (gert@space.net) [080108 10:34]:
> when I download the test program, I get something that cannot work, due
> to confusion in the declarations of "one" and "two" (declared as "no utf",
> but containing wide characters).
>
> Even just plain printing of $one and $two doesn't produce legible output,
> without even invoking XML::LibXML.
>
> Could you re-send the script, zip'ed (or something), so I can verify that
> it wasn't garbled downloading from CPAN?
Oops, I really messed things up! Things got broken during transport, and there were also some mistakes in my script. What I have to conclude from my current script (packaged as zip, this time) is that my report was flawed and should be closed. Sorry for the confusion...

toString() returns the utf8 version of the output, which is probably not a bad idea. toFH() does the right thing.
-- 
Regards,
MarkOv

------------------------------------------------------------------------
Mark Overmeer MSc                                MARKOV Solutions
Mark@Overmeer.net                          solutions@overmeer.net
http://Mark.Overmeer.net              http://solutions.overmeer.net
Download utf.zip
application/zip 737b


OK, that was basically what I was prepared to answer :-)

Thanks,
-- Petr
Subject: Re: [rt.cpan.org #32152] charset problems
Date: Wed, 9 Jan 2008 17:05:43 +0100
To: Petr Pajas via RT <bug-XML-LibXML [...] rt.cpan.org>
From: Mark Overmeer <mark [...] overmeer.net>
* Petr Pajas via RT (bug-XML-LibXML@rt.cpan.org) [080108 12:39]:
> <URL: http://rt.cpan.org/Ticket/Display.html?id=32152 >
>
> ok, that was basically what I was prepared to answer:-)
Sorry, I am still struggling with conversion problems. Maybe you can help me out. Is this designed behavior, or am I making a mistake?

   use XML::LibXML;
   my $doc = XML::LibXML::Document->new('1.0', 'UTF-8');
   my $node = $doc->createElement("aap\x{03bc}");
   $doc->setDocumentElement($node);

   my $x = $doc->toString;
   print $x;
   print utf8::is_utf8($x) ? "yes" : "no";

The string is correct utf8, but does not have the utf8 flag set. Or am I doing something wrong (again)?
-- 
Regards,
MarkOv
Sorry for not responding sooner. This is not a bug: on an XML::LibXML::Document, toString returns a byte string in the document encoding. This is documented (in the XML::LibXML::Document manpage). But since there is no mention of it in the description of XML::LibXML::Node::toString, I rewrote that paragraph to read as follows:

   This method is similar to XML::LibXML::Document::toString but for a single node. It returns a string consisting of XML serialization of the given node and all its descendants. Unlike XML::LibXML::Document::toString, in this case the resulting string is by default a character string (UTF-8 encoded with UTF8 flag on). An optional flag $format controls indentation, as in XML::LibXML::Document::toString. If the second optional $docencoding flag is true, the result will be a byte string in the document encoding (see XML::LibXML::Document::actualEncoding).

This closes the ticket.

-- Petr

On Wed 09 Jan 2008 11:06:38, Mark@Overmeer.net wrote:
> * Petr Pajas via RT (bug-XML-LibXML@rt.cpan.org) [080108 12:39]:
> > <URL: http://rt.cpan.org/Ticket/Display.html?id=32152 >
> >
> > ok, that was basically what I was prepared to answer:-)
>
> Sorry, I am still struggling with conversion problems. Maybe, you
> can help me out. Is this designed behavior, or do I make a mistake:
>
>    use XML::LibXML;
>    my $doc = XML::LibXML::Document->new('1.0', 'UTF-8');
>    my $node = $doc->createElement("aap\x{03bc}");
>    $doc->setDocumentElement($node);
>
>    my $x = $doc->toString;
>    print $x;
>    print utf8::is_utf8($x) ? "yes" : "no";
>
> The string is correct utf8, but does not have the utf8 flag set.
> Or am I doing something wrong (again)?
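
The difference between the two toString methods described above can be checked with a short script. This is only a sketch built from the snippet in the ticket and the documentation paragraph quoted by Petr; it assumes XML::LibXML is installed and behaves as that paragraph says, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Same one-element document as in the report above.
my $doc  = XML::LibXML::Document->new('1.0', 'UTF-8');
my $node = $doc->createElement("aap\x{03bc}");
$doc->setDocumentElement($node);

# Document serialization: bytes in the document encoding (UTF8 flag off).
my $doc_bytes = $doc->toString;

# Node serialization: a Perl character string (UTF8 flag on).
my $node_chars = $node->toString;

printf "Document::toString flag: %s\n", utf8::is_utf8($doc_bytes)  ? 'on' : 'off';
printf "Node::toString flag:     %s\n", utf8::is_utf8($node_chars) ? 'on' : 'off';
```

The flag only tells Perl whether to treat the scalar as characters or as bytes; both results describe the same element.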
Subject: Re: [rt.cpan.org #32152] charset problems
Date: Mon, 28 Jan 2008 13:27:59 +0100
To: Petr Pajas via RT <bug-XML-LibXML [...] rt.cpan.org>
From: Mark Overmeer <solutions [...] overmeer.net>
* Petr Pajas via RT (bug-XML-LibXML@rt.cpan.org) [080128 12:10]:
> <URL: http://rt.cpan.org/Ticket/Display.html?id=32152 >
>
> Sorry for not responding sooner. This is not a bug, on a
> XML::LibXML::Document, the function returns a byte-string in the
> document encoding.
Yes, it does return the byte string in the correct encoding. However, my complaint is that if the resulting encoding is utf8, then the output does not have the utf8 flag on.

What I do is:

   open OUT, '>:utf8', $file;
   print OUT $doc->toString;

(a simplified version of the real code; I know there is a toFile)

The problem is that I get a double encoding of the utf8 data, because toString() does not set the utf8 flag correctly. The output layer is encoding again.

In my module, the problem is extra nasty, because I had someone creating iso-8859-1 xml data, which then had to be transmitted as UTF8 because SOAP requires it. Problems came with ë and friends (yes, he is German) in that data. The right flag could avoid it. Of course, there is no right flag when the encoding is neither latin1 nor utf8...

I do not know whether changing this behavior will break other people's code, but I assume everyone ignores io-layers, so it may not be a problem.
> On Wed 09 Jan 2008 11:06:38, Mark@Overmeer.net wrote:
> > my $x = $doc->toString;
> > print $x;
> > print utf8::is_utf8($x) ? "yes" : "no";
> >
> > The string is correct utf8, but does not have the utf8 flag set.
> > Or am I doing something wrong (again)?
-- 
Regards,
MarkOv
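
The double encoding described in this message can be reproduced with core modules only. The sketch below uses a hand-encoded byte string as a stand-in for the result of $doc->toString on a UTF-8 document; the helper bytes_written is hypothetical and exists only to show what lands on disk under each open mode.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Hypothetical helper: write $data using the given open mode, then read
# the file back as raw bytes to see what was actually stored.
sub bytes_written {
    my ($mode, $data) = @_;
    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close $tmp_fh;

    open my $out, $mode, $path or die "open $path: $!";
    print {$out} $data;
    close $out or die "close $path: $!";

    open my $in, '<:raw', $path or die "open $path: $!";
    local $/;                       # slurp the whole file
    my $bytes = <$in>;
    close $in;
    return $bytes;
}

# Stand-in for $doc->toString: already-encoded UTF-8 bytes, UTF8 flag off.
my $bytes = encode('UTF-8', "<aap\x{03bc}/>");

my $plain   = bytes_written('>:raw',  $bytes);
my $doubled = bytes_written('>:utf8', $bytes);

printf "raw layer:   %d bytes, %s\n",
    length($plain),   $plain   eq $bytes ? 'unchanged' : 'changed';
printf ":utf8 layer: %d bytes, %s\n",
    length($doubled), $doubled eq $bytes ? 'unchanged' : 'changed';
```

With the :utf8 layer, Perl treats each byte above 127 as a Latin-1 character and encodes it again, so the two bytes of the mu become four.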
If it is a "byte" string, it is not a "character" string, so character semantics (which is what the UTF8 flag stands for, in spite of its name) do not apply to it; hence the UTF8 flag is off (and this is deliberate).

What the documentation tries to say is indeed that while for a node the UTF8 flag is on (and you get what appears as characters to Perl), for a document the flag is off (and you get what should appear to Perl as binary data).

So, when dumping a document to a file (like you do below), you don't want to use the :utf8 IO layer. This is because the encoding of the document may be arbitrary (in fact, you should be completely ignorant about it) and the same

   open OUT, '>', $file;
   print OUT $doc->toString;

will work for all documents.

I probably have to further extend the paragraph in the documentation to make this even clearer...

-- Petr

On Mon 28 Jan 2008 07:28:23, solutions@overmeer.net wrote:
> * Petr Pajas via RT (bug-XML-LibXML@rt.cpan.org) [080128 12:10]:
> > <URL: http://rt.cpan.org/Ticket/Display.html?id=32152 >
> >
> > Sorry for not responding sooner. This is not a bug, on a
> > XML::LibXML::Document, the function returns a byte-string in the
> > document encoding.
>
> Yes, it does return the byte-string in the correct encoding. However,
> my complaint is, that if the resulting encoding is utf8, then the
> output does not have the utf8 flag on.
>
> What I do is:
>    open OUT, '>:utf8', $file;
>    print OUT $doc->toString;
>
> (simplified version of the real code, I know there is a toFile)
>
> The problem is that I get a double encoding of the utf8 data,
> because toString() does not set the utf8 flag correctly. The
> output layer is encoding again.
>
> In my module, the problem is extra nasty, because I had someone creating
> iso-8859-1 xml data, which then had to be transmitted as UTF8 because
> SOAP requires it. Problems came with ë and friends (yes, he is German)
> in that data. The right flag could avoid it. Of course, there is no
> right flag when the encoding is neither latin1 nor utf8...
>
> I do not know whether changing this behavior will break other
> people's code, but I assume everyone ignores io-layers, so it
> may not be a problem.
>
> > On Wed 09 Jan 2008 11:06:38, Mark@Overmeer.net wrote:
> > > my $x = $doc->toString;
> > > print $x;
> > > print utf8::is_utf8($x) ? "yes" : "no";
> > >
> > > The string is correct utf8, but does not have the utf8 flag set.
> > > Or am I doing something wrong (again)?
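
For the case mentioned in the thread, where iso-8859-1 data has to go out as UTF-8, one possible approach is to decode the byte string explicitly and encode it again. The sketch below uses core Encode only; the sample bytes and the variable names are illustrative assumptions, with $doc_encoding standing in for what XML::LibXML::Document::actualEncoding would report. It does not touch the encoding named in an XML declaration, which would also need to match the new encoding.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed inputs: a Latin-1 byte string such as a document serialization,
# and the name of its encoding.
my $doc_encoding = 'iso-8859-1';
my $doc_bytes    = "<naam>Zo\xEB</naam>";   # e-diaeresis is the single byte 0xEB

# Bytes -> characters: only now does Perl know which characters are meant.
my $chars = decode($doc_encoding, $doc_bytes);

# Characters -> UTF-8 bytes, ready for a raw (layer-free) filehandle or socket.
my $utf8_bytes = encode('UTF-8', $chars);

printf "%s: %d bytes, UTF-8: %d bytes\n",
    $doc_encoding, length($doc_bytes), length($utf8_bytes);
```

The decoded $chars string is also the form that is safe to print through an :encoding(...) layer, because the layer then encodes characters once instead of re-encoding bytes.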