Bug #127083 for XML-LibXML: Bug in XML::LibXML , findvalue. returns precomposed utf-8 even when the XML document is composed utf-8

Mon Sep 10 08:10:37 2018 r.porth [...] tu-berlin.de - Ticket created

Subject:	Bug in XML::LibXML , findvalue. returns precomposed utf-8 even when the XML document is composed utf-8
Date:	Mon, 10 Sep 2018 12:05:25 +0000
To:	"bug-XML-LibXML [...] rt.cpan.org" <bug-XML-LibXML [...] rt.cpan.org>
From:	"Porth, Robert, Dr." <r.porth [...] tu-berlin.de>

Hello to bug-XML-LibXML@rt.cpan.org

$ uname -a
Linux ubsrvapp01 4.4.0-134-generic #160-Ubuntu SMP Wed Aug 15 14:58:00 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ perl -v
This is perl 5, version 22, subversion 1 (v5.22.1) built for x86_64-linux-gnu-thread-multi

----------------------------------------
The shortest, clearest code you can manage to write which reproduces the bug described would be

$ cat uml_test1.pl
#!/usr/bin/perl -w
use strict;
use XML::LibXML;

open (FH_in, 'str.in' ) || die "Error: Could not open file $! \n";
my $str = <FH_in>;
close FH_in;

printf("XX 01 / %v02X\n", $str);

my $xml =
"<r>
<s>$str</s>
</r>
";

my $dom = XML::LibXML->load_xml( string => $xml );
$dom->setEncoding('UTF-8');
my $s = $dom->findvalue('/r/s');

printf("XX 02 / %v02X\n", $s);

----------------------------------------
Input file is

$ cat str.in
Müller

The ü in Müller is in composed utf-8 coding.

The output of running the program is

$ perl uml_test1.pl
XX 01 / 4D.C3.BC.6C.6C.65.72.0A
XX 02 / 4D.FC.6C.6C.65.72.0A

The original string in the XML is "Müller" with ü coded as C3.BC (composed utf-8).
But findvalue gets "Müller" with ü coded as FC (precomposed utf-8).

Problem: Working with findvalue on a document in composed utf-8 will result in a document with a mixture of composed (original in the document) and precomposed characters. (I only tested that for German umlauts.) This inconsistency causes a lot of trouble with the document. Thus, findvalue should return a string from the XML without any changes in the coding.

As a note: The line in the code

$dom->setEncoding('UTF-8');

actually does not change anything. I just left it there to show that it does not help.

All the best
Robert

--
Robert Porth
Abt. Bibliothekssysteme
Fachreferate Informatik, Geodäsie
Technische Universität Berlin
Universitätsbibliothek
Fasanenstraße 88, 10623 Berlin
+49 (0)30 314-76311
r.porth@tu-berlin.de
www.ub.tu-berlin.de

Mon Sep 10 15:51:00 2018 SREZIC [...] cpan.org - Correspondence added


First the short answer: I think that XML-LibXML's behavior here is right. The longer explanation follows.

Before the long answer, a note about terminology: the usage of "composed utf-8" and "precomposed utf-8" is non-standard in the Perl world and even confusing (there is a (de)composition concept in Unicode, but apparently you don't mean this). It's better to talk about characters (text strings) vs. bytes (octets, binary strings), as is done in the Encode or perlunitut manpages. Also, using 'printf "%v"' isn't adequate to show how a perl string is to be interpreted. It's better to use Devel::Peek::Dump, which shows a mixture of the internal representation and the character semantics.

The task of the parsing component in XML-LibXML is to parse XML documents, which are taken as binary data (bytes, octets). The XML document itself carries the information about the encoding in which this binary data should be interpreted, typically with the "encoding" attribute in the XML declaration, or a BOM; if both are missing, the encoding defaults to utf-8. After parsing, the data within the XML document should be presented to the user as character strings, so the user does not have to worry about handling the encoding themselves.

If Devel::Peek::Dump() calls are added to the test script, then we see this output for the initial string:

PV = 0x236a7a0 "M\303\274ller\n"\0

So this is a utf8-encoded string, very probably to be interpreted as a byte (octet) string (theoretically it's possible to interpret the two non-ASCII bytes as characters, but it does not make sense here).

The findvalue return value looks like this:

PV = 0x2381aa0 "M\303\274ller\n"\0 [UTF8 "M\x{fc}ller\n"]

The internal representation looks the same, but now Perl knows that this has to be interpreted as utf-8, and shows an additional interpretation as a character string (with the Unicode code point U+00FC, which is LATIN SMALL LETTER U WITH DIAERESIS, as expected).

So everything works as expected: input is binary, output is characters.

I am not sure what exactly your use case is. Maybe you do some "editing" of XML data only partially with XML-LibXML, and partially with bare perl functionality. In this case you have to be careful and know whether you are currently dealing with characters or octets. But this is just a guess. Maybe you can describe your use case?

Regards,
Slaven (a TU Berlin alumnus)
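The octets-versus-characters distinction described above can be reproduced without XML-LibXML. The following is a minimal sketch, assuming the input arrives as raw UTF-8 octets; it uses only the core Encode module, and the variable names are illustrative rather than taken from the ticket. It shows that `%vX` prints C3.BC for the octet string and FC for the decoded character string, which matches the "XX 01" and "XX 02" lines of the report (minus the trailing newline).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets as they would arrive from str.in:
# "M", then 0xC3 0xBC (the UTF-8 encoding of U+00FC), then "ller".
my $octets = "M\xC3\xBCller";

# Decoding turns the octets into a character string. An XML parser does
# the equivalent before handing text back to the caller.
my $chars = decode('UTF-8', $octets);

# %vX prints the ordinal of each element of the string:
# octets for the first string, code points for the second.
my $octets_hex = sprintf('%vX', $octets);   # 4D.C3.BC.6C.6C.65.72
my $chars_hex  = sprintf('%vX', $chars);    # 4D.FC.6C.6C.65.72

printf "octets: %s (length %d)\n", $octets_hex, length $octets;
printf "chars:  %s (length %d)\n", $chars_hex,  length $chars;

# Encoding the character string again yields the original octets.
my $roundtrip = encode('UTF-8', $chars);
print "round trip ok\n" if $roundtrip eq $octets;
```

The same text is present in both strings; only the unit that Perl counts differs (7 octets versus 6 characters).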

Mon Sep 10 15:51:01 2018 The RT System itself - Status changed from 'new' to 'open'

Tue Sep 11 05:49:40 2018 r.porth [...] tu-berlin.de - Correspondence added

Subject:	Re: [rt.cpan.org #127083] Bug in XML::LibXML , findvalue. returns precomposed utf-8 even when the XML document is composed utf-8
Date:	Tue, 11 Sep 2018 09:44:37 +0000
To:	"bug-XML-LibXML [...] rt.cpan.org" <bug-XML-LibXML [...] rt.cpan.org>
From:	"Porth, Robert, Dr." <r.porth [...] tu-berlin.de>

Hi Slaven,

thanks for the fast and detailed answer (and even from a TU Berlin alumnus :-)

To give a better understanding of my case: I work at a library in the library system administration group. For about two years we have been live with a new library system (Alma) that has an API that allows getting and putting different kinds of data from / to the system. The data are objects (users, invoices, bibliographic data ...) in an XML schema. Thus I started using the perl XML libraries to work with these data in various scenarios, like creating reports, changing data in the system, or other things. It works fine, but I then ran into a problem with German umlauts. After some debugging I found the root cause to be the issue mentioned in this ticket.

But I think you're right, I was not 100% correct about some terms. So I would suggest using precomposed and decomposed (and not composed) as on the Wikipedia page:

https://en.wikipedia.org/wiki/Precomposed_character

Thus, independently of the coding, the basic idea is:

precomposed: as one single character
--> example Åström (U+00C5 U+0073 U+0074 U+0072 U+00F6 U+006D)

decomposed: as the combination of the base character plus the diacritics
--> example Åström (U+0041 U+030A U+0073 U+0074 U+0072 U+006F U+0308 U+006D)

In my situation here, findvalue has changed the coding, and that unfortunately creates problems. Not in the perl code itself, but in other programs that deal with the objects / documents created with the code when they contain different types of coding of special characters like umlauts. Unfortunately that is a problem that I had in my work and that limits my use of LibXML for my day-to-day work now.

So, wouldn't it be on the safe side for all situations if findvalue returned the string in the given XML tag in exactly the coding in which it is there (as long as it is correct utf-8)?

All the best from the TU Berlin,
Robert

--
Robert Porth
Abt. Bibliothekssysteme
Fachreferate Informatik, Geodäsie
Technische Universität Berlin
Universitätsbibliothek
Fasanenstraße 88, 10623 Berlin
+49 (0)30 314-76311
r.porth@tu-berlin.de
www.ub.tu-berlin.de
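For reference, precomposed and decomposed forms in the Unicode sense can be inspected with the core Unicode::Normalize module. This is a minimal sketch; the helper `codepoints` is a hypothetical name introduced here for illustration. It prints the code points of "Åström" in both forms. Note that normalization operates on character strings and is a separate matter from the octets-versus-characters distinction raised earlier in the ticket: in the reported output the ü is the single code point U+00FC in both cases, once as UTF-8 octets (C3.BC) and once as a character (FC).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper: format a character string as a list of code points,
# e.g. "U+00C5 U+0073".
sub codepoints {
    my ($text) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $text;
}

# "Åström" written with precomposed characters (one code point per letter).
my $precomposed = "\x{C5}str\x{F6}m";

# NFD splits each accented letter into base letter plus combining mark;
# NFC recombines them.
my $decomposed = NFD($precomposed);
my $recomposed = NFC($decomposed);

print 'precomposed: ', codepoints($precomposed), "\n";
print 'decomposed:  ', codepoints($decomposed),  "\n";
print "NFC(NFD(x)) equals x\n" if $recomposed eq $precomposed;
```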


Tue Sep 11 06:15:43 2018 r.porth [...] tu-berlin.de - Correspondence added

Subject:	Re: [rt.cpan.org #127083] Bug in XML::LibXML , findvalue. returns precomposed utf-8 even when the XML document is composed utf-8
Date:	Tue, 11 Sep 2018 10:14:57 +0000
To:	"bug-XML-LibXML [...] rt.cpan.org" <bug-XML-LibXML [...] rt.cpan.org>
From:	"Porth, Robert, Dr." <r.porth [...] tu-berlin.de>

Hi Slaven,

as additional information: I added the line

chomp $str;

after reading the string from the file to get rid of the \n, and at the end

if ( $str eq $s ) {
    print "XX 03 / same: ->$str<->$s<-\n";
} else {
    print "XX 03 / not the same: ->$str<->$s<- \n";
}

to compare the strings. The output is now

XX 01 / 4D.C3.BC.6C.6C.65.72
XX 02 / 4D.FC.6C.6C.65.72
XX 03 / not the same: ->Müller<->M▒ller<-

The string from XX 02 renders strangely on my Linux machine, and eq finds the two strings to be different. This shows the problems mentioned in this ticket.

All the best
Robert

--
Robert Porth
Abt. Bibliothekssysteme
Fachreferate Informatik, Geodäsie
Technische Universität Berlin
Universitätsbibliothek
Fasanenstraße 88, 10623 Berlin
+49 (0)30 314-76311
r.porth@tu-berlin.de
www.ub.tu-berlin.de
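Both symptoms in the "XX 03" line are consistent with mixing an octet string and a character string. The sketch below is one possible way to handle it, not a fix confirmed in the ticket: decode the file input before comparing, and encode explicitly before printing. The character string `$s` is a stand-in for the findvalue result so that the example runs with the core Encode module alone. Printing a character string containing U+00FC without an output encoding emits the single byte 0xFC, which a UTF-8 terminal cannot render; that would explain the "▒".

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets as read from str.in without an encoding layer (after chomp).
my $str = "M\xC3\xBCller";

# Stand-in for the findvalue result: a character string containing U+00FC.
my $s = "M\x{FC}ller";

# Comparing octets with characters fails, as in the "XX 03" output.
my $raw_equal = ($str eq $s) ? 1 : 0;

# Decode the input first, so that both sides are character strings.
my $decoded_equal = (decode('UTF-8', $str) eq $s) ? 1 : 0;

# Encode explicitly before printing so the terminal receives valid UTF-8.
my $out = encode('UTF-8', $s);

print "raw comparison:     ", ($raw_equal     ? "same" : "not the same"), "\n";
print "decoded comparison: ", ($decoded_equal ? "same" : "not the same"), "\n";
print "encoded for output: $out\n";
```

An alternative with the same effect is to open the input file with an `:encoding(UTF-8)` layer and to set the same layer on STDOUT with binmode, so that decoding and encoding happen at the I/O boundaries.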
