Hi Mark,
On Fri Sep 25 15:43:51 2009, solutions@overmeer.net wrote:
> Hi Christian,
>
> Thanks for all the good work you did on this module!
>
> * Christian Glahn via RT (bug-XML-LibXML@rt.cpan.org) [090925 18:36]:
> > For some (cloudy) reason Perl did not make the shift to UTF8 for
> > internal character representation.
>
> The reason is simply: performance. Character processing, for instance
> in regular expressions, is much much slower when the string is UTF8.
> And... they could make it work by having two kinds of strings, so there
> was no need to move over to UTF8.
IMHO performance is a very very poor excuse for limiting character
processing to mainly the North American language zone. But this is not a
discussion we should have here.
> Actually, there should have been three kinds of strings: a different
> type which handles binary data. Perl6 solves this nicely, attaching a
> character-encoding label to each sequence of bytes. At any string operation,
> it will unify the types it finds, hopefully giving better performance.
Well, we all know that perl5 suffers from backward compatibility when it
comes to encodings.
> > AFAIK the only character encoding that all perl versions > 5.6 can
> > actually identify is UTF8. All other character encodings including
> > latin-1 (which has been outdated for almost a decade, btw.) are
> > indistinguishable for the perl internals.
>
> No, the official statement for Perl is: you have utf8-like (not really
> utf-8) strings and Latin-1 strings. The discussion I had with Petr is
> that XML::LibXML decided for: you have utf-8 strings and strings
> which are in an undefined encoding. That differs from Perl's current
> specification.
>
> From "man perlunicode":
>
> By default, there is a fundamental asymmetry in Perl's Unicode
> model: implicit upgrading from byte strings to Unicode strings
> assumes that they were encoded in ISO 8859-1 (Latin-1), but Unicode
> strings are downgraded with UTF-8 encoding. This happens because
> the first 256 codepoints in Unicode happen to agree with Latin-1.
Exactly here lies the problem: the XML specs and libxml2 follow a very
different approach than Perl.
> > The problem with DWIM is that people dump all kinds of octet streams and
> > hope that they will work as strings with the correct encoding. Even with
> > "use encoding" this problem is not solved. Just take an example: let's
> > take an XML document that is _explicitly_ encoded in ISO-8859-6, with perl
> > _assuming_ everything is ISO-8859-2 unless it is marked as UTF8, because
> > some trainee developer decided that latin-2 is a good thing. So if one
> > dumps a string from the DOM in the XML's native encoding (that is
> > ISO-8859-6) and then adds it back to the DOM, what kind of information
> > would people assume to show up? The DWIM metaphor will automatically say
> > the original string (in ISO-8859-6) and NOT in the wrong and almost
> > entirely incompatible ISO-8859-2 encoded version.
>
> Well, no. If you follow Perl's current guidelines, you will need to
> explicitly express the character encoding on all input and output of
> a program if any utf-8 handling is done. Either with the global "use
> encoding", setting a default, and/or with each open() and database
> access. So, the above example does not hold: as long as XML::LibXML returns
> the data in Latin1 or UTF-8 to the Perl program, the user's application
> will force it into the correct encoding when displaying on the screen
> or writing it to file. Inside the perl program, the encoding does not
> matter at all (except for reasons of performance).
Just remind yourself that with XML you cannot know the encoding of a
stream in advance. The encoding is declared separately for each document
in the XML declaration, and only this declaration informs a system about
the REAL encoding of an XML document. libxml2 even goes further: if no
encoding is declared, it performs some analysis of the data before it
assumes that the data is really in UTF8.
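As a rough sketch of what that means in practice (assuming XML::LibXML is installed; the documents and their encodings are illustrative), the same parser code has to cope with whatever encoding each document happens to declare:

```perl
use strict;
use warnings;
use XML::LibXML;

# The same text ("café") delivered as two different byte streams; only
# the XML declaration tells the parser how to decode each one.
my $as_utf8   = qq{<?xml version="1.0" encoding="UTF-8"?><r>caf\xc3\xa9</r>};
my $as_latin1 = qq{<?xml version="1.0" encoding="ISO-8859-1"?><r>caf\xe9</r>};

my $parser = XML::LibXML->new;
for my $bytes ($as_utf8, $as_latin1) {
    my $doc = $parser->parse_string($bytes);
    # Inside the DOM, both documents yield the identical 4-character
    # Perl string, regardless of the on-the-wire encoding.
    print length($doc->documentElement->textContent), "\n";
}
```

The point of the sketch: the caller never told the parser an encoding; it came entirely from each document's own declaration.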
> > The problem with perl's "use encoding" is that it allows people to tell
> > what perl should _assume_ a string is encoded in, even if most parts of
> > the additional logic _know_ that it is not (in the example XML::LibXML
> > knows that everything internal is UTF8 and the external representation
> > is ISO-8859-6).
>
> use encoding "iso-8859-6";
> open F, "<$f";
>
> is an alternative for
>
> open F, "<:encoding(iso-8859-6)", $f;
>
> including the character-set in which to interpret STDIN, STDOUT and
> STDERR, and all other files which are read or written. It tells Perl's
> IO layers a default.
>
> You also need to use :encoding(utf-8) on read and write, because
> Perl's internal utf8 (without dash) is not strict: it could produce
> undefined characters, does not understand markers etc.
The problem with XML-aware code is that you don't know what format you
will end up with in the real world. The same code may load XML documents
in UTF8, other UTF dialects, and all kinds of the ISO-8859-* family. You
simply cannot know what encoding is used with a particular XML document
unless you actually go and read it. Thus, hard-wiring the encoding into
the IO layer is the wrong solution and may break your code.
The actual problem is not so much the output but the input. However,
even with the output you may end up with problems if the actual encoding
of an XML document is not honoured.
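A minimal sketch of the alternative (the ISO-8859-1 literal is illustrative): keep Perl's IO layers out of the input path and hand the parser raw bytes, so libxml2 can honour the XML declaration itself:

```perl
use strict;
use warnings;
use XML::LibXML;

# The handle delivers raw bytes; no :encoding(...) layer is applied,
# because the document itself declares its encoding.
my $bytes = qq{<?xml version="1.0" encoding="ISO-8859-1"?><r>caf\xe9</r>};
open my $fh, '<:raw', \$bytes or die $!;

my $doc = XML::LibXML->new->parse_fh($fh);
# libxml2 decoded the ISO-8859-1 bytes according to the declaration.
print $doc->documentElement->textContent, "\n";
```

Had the handle carried a hard-wired `:encoding(iso-8859-2)` layer instead, the bytes would have been re-interpreted before the parser ever saw the declaration.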
> > Because UTF8 is THE STANDARD encoding for XML and DOM, we decided (after
> > long discussions on the perl-xml list and in the related IRC channels)
> > that in case of doubt we should always opt for what we KNOW while
> > running the code and not for what a programmer might have ASSUMED at the
> > time of writing it.
>
> Perl explicitly defines Latin1, but programmers may not know that and
> assume the wrong thing. The problem is now: XML::LibXML did not punish
> these users while developing their code, therefore changing this may
> break some people's existing code (while helping new developers)
The problem is again KNOWING vs. ASSUMING. Why should someone get
punished just because the system assumes that something is wrong?
> > 1. ALWAYS use encoding UTF8
>
> This is totally incorrect. This sets the STDIN/STDOUT/STDERR and other
> file defaults to utf8. But that is a system setting. (If I interpret
> this point as "use encoding 'utf-8';")
Sorry, but you completely misunderstand the problem.
For the internal string representation you should FORCE all IO of perl
into UTF8 mode. I don't know if this keeps perl from doing character
downgrading, but it should. This *should* tell perl that the programmer
prefers correct character handling over performance.
The programmer should make explicit that with XML data you should
actually use UTF8 for all internal operations.
This has nothing to do with system settings, but with the fact that you
actually don't know the encoding of an XML document until you actually
read it.
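A sketch of what that forcing looks like with core Perl only (the `:std` sub-pragma and the strict `UTF-8` layer are the usual way to do it):

```perl
use strict;
use warnings;

# Apply a strict UTF-8 layer to every handle opened in this lexical
# scope, and (via :std) to STDIN/STDOUT/STDERR, so UTF8 character data
# coming out of the DOM is encoded correctly on output.
use open ':std', ':encoding(UTF-8)';

print "caf\x{e9}\n";   # written out as UTF-8 bytes
```

Note the dash: `:encoding(UTF-8)` is the strict codec, while the bare `utf8` layer is Perl's lax internal format, as the quoted text above points out.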
>
> > 2. disable perl's auto upgrading of strings during all IO operations and
> > leave the tricky bits to libxml2.
>
> ...which does it in a Perl incompatible way. It will break all other
> corners of your application, like database access, which actually do
> interface correctly in this respect.
OK, you are mixing topics here - maybe you are also missing the
differences between the different parts of a system.
Unlike many other systems - particularly databases - XML does not
receive its data from a preconfigurable environment. It is a
standardized approach to dealing with the messy data that is around for
all kinds of nationalist reasons.
Perl 5 lives in the utopia that you can predict your IO - and it lives
in the utopia that you can boil everything down to the North American
lifestyle in one way or another.
XML::LibXML tried to bridge these cultural differences (that is, between
Perl and XML, not between North America and the rest of the world).
In that sense, your statement of correctness reflects an unwillingness
to accept that you cannot predict a messy environment in the same way
you can predict an organized one.
As soon as you use something other than UTF8 as the default encoding in
your entire organized environment, and you use XML for input rather than
only for output, you have to let go of your idea of "correct behaviour".
However, if you stick with UTF8, XML::LibXML actively supports you in
removing the encoding-related issues from messy input.
For some reason I am not confronted with breaking applications, although
I work in completely messy environments using XML::LibXML. Therefore, I
assume that if something breaks at other corners of your system it is
related to the design of an application, and not to incompatible ways of
handling data.
> > 3. assure that no arbitrary octet streams get near the DOM.
>
> byte streams: a pity that BLOB is not a core construct.
> See http://search.cpan.org/~juerd/BLOB-1.01
> Character streams: are Latin1, no problem to put them in the DOM.
Yes, but you have to assure that your character stream is not actually a
downgraded or recoded version of something else. So, in that sense,
latin-1 is also an arbitrary octet stream. Please note that here we
differ in terminology. With latin-1 you HOPE that you got the correct
data, with UTF8 you KNOW that you have the correct data (unless you do
something completely stupid such as downgrading a UTF8 string to latin1
and then promoting it back to UTF8).
> > 4. upgrade all strings that should get into the DOM to UTF8
>
> Gladly, XML::LibXML does that for me.
Then what is your problem?!?
> > 5. never try to extract the XML-document's original encoding from the
> > DOM unless while serializing the entire DOM.
>
> Extracting is into Perl, and as long as the text stays there, you do not
> have a visible encoding. Only when you write it out into a file later.
I mean that you should not force XML::LibXML to return the original
encoding instead of UTF8 characters. The default is UTF8, so this should
usually be transparent to developers. UTF8 data should be handled
correctly by perl's IO layer even if no explicit encoding is given.
For all other encodings, you should strip the document's encoding first
and make it explicit. The problem is that this is nothing XML::LibXML
can be blamed for.
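For illustration (relying on XML::LibXML's documented serialization behaviour; the document literal is an assumption), the distinction between node-level and document-level serialization looks like this:

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->new->parse_string(
    qq{<?xml version="1.0" encoding="ISO-8859-1"?><r>caf\xe9</r>});

# Serializing a node returns UTF8 Perl characters, independent of the
# document's original encoding...
my $chars = $doc->documentElement->toString;

# ...while serializing the whole document returns bytes in the
# encoding named in the XML declaration (ISO-8859-1 here).
my $bytes = $doc->toString;
```

So only the full-document serialization ever resurrects the original encoding; everything extracted node by node stays in UTF8.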
> > The correctness of this decision is related to the XML related
> > specifications. There might be conflicts with other views or discussions
> > related to other parts of perl development.
>
> I disagree fully. Both Perl and XML are very well defined about
> how to handle encodings. Only XML::LibXML has chosen not to follow
> a small part of Perl's official specs (which have changed over time,
> but the last unicode change was a long time ago!)
>
> When users of Perl programs put "use encoding 'iso-8859-1';" explicitly
> in their programs, then XML::LibXML works as Perl prescribes. And all
> other explicit encoding statements work correctly as well. Only when
> you do not state the encoding explicitly the problem becomes clear.
Again, you complain that the world is not ideal, while XML has been
designed precisely for a world that is not ideal. This includes that you
cannot (however hard you might wish or try) predict the unpredictable.
The only correct way of reading external data into XML::LibXML is to
pass it untouched by perl to the library. If you let perl touch the data
first, you choose to be on your own.
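One last sketch of that "untouched" path (File::Temp is used only to make the example self-contained): the simplest way to keep Perl's hands off the bytes is to let the library open the file itself:

```perl
use strict;
use warnings;
use XML::LibXML;
use File::Temp qw(tempfile);

# Write raw bytes to disk; libxml2, not Perl's IO layers, reads them
# back and interprets the XML declaration.
my ($fh, $path) = tempfile(SUFFIX => '.xml');
binmode $fh, ':raw';
print {$fh} qq{<?xml version="1.0" encoding="ISO-8859-1"?><r>caf\xe9</r>};
close $fh or die $!;

my $doc = XML::LibXML->new->parse_file($path);
print $doc->documentElement->textContent, "\n";
```

With parse_file there is no Perl filehandle in between at all, so no `use encoding` or IO-layer default can rewrite the data first.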