Hi Mark,
On Fri Sep 25 15:43:51 2009, solutions@overmeer.net wrote:
> Hi Christian,
>
> Thanks for all the good work you did on this module!
>
> * Christian Glahn via RT (bug-XML-LibXML@rt.cpan.org) [090925 18:36]:
> > For some (cloudy) reason Perl did not make the shift to UTF8 for
> > internal character representation.
>
> The reason is simply: performance. Character processing, for instance
> in regular expressions, is much much slower when the string is UTF8.
> And... they could make it work by having two kinds of strings, so there
> was no need to move over to UTF8.
IMHO performance is a very very poor excuse for limiting character
processing to mainly the North American language zone. But this is not a
discussion we should have here.
> Actually, there should have been three kinds of strings: a different
> type which handles binary data. Perl6 solves this nicely, attaching a
> character-encoding label to each sequence of bytes. At any string operation,
> it will unify the types it finds, hopefully giving better performance.
Well, we all know that perl5 suffers from backward compatibility when it
comes to encodings.
> > AFAIK the only character encoding that all perl versions > 5.6 can
> > actually identify is UTF8. All other character encodings including
> > latin-1 (which has been outdated for almost a decade, btw.) are
> > indistinguishable for the perl internals.
>
> No, the official statement for Perl is: you have utf8-like (not really
> utf-8) strings and Latin-1 strings. The discussion I had with Petr is
> that XML::LibXML decided for: you have utf-8 strings and strings
> which are in an undefined encoding. That differs from Perl's current
> specification.
>
> From "man perlunicode":
>
> By default, there is a fundamental asymmetry in Perl's Unicode
> model: implicit upgrading from byte strings to Unicode strings
> assumes that they were encoded in ISO 8859-1 (Latin-1), but Unicode
> strings are downgraded with UTF-8 encoding. This happens because
> the first 256 codepoints in Unicode happen to agree with Latin-1.
Exactly here lies the problem: the XML specs and libxml2 follow a very
different approach than Perl.
> > The problem with DWIM is that people dump all kinds of octet streams and
> > hope that they will work as strings with the correct encoding. Even with
> > "use encoding" this problem is not solved. Just take an example: let's
> > take an XML document that is _explicitly_ encoded in ISO-8859-6, with perl
> > _assuming_ everything is ISO-8859-2 unless it is marked as UTF8, because
> > some trainee developer decided that latin-2 is a good thing. So if one
> > dumps a string from the DOM in the XML's native encoding (that is
> > ISO-8859-6) and then adds it back to the DOM, what kind of information
> > would people assume to show up? The DWIM metaphor will automatically say
> > the original string (in ISO-8859-6) and NOT in the wrong and almost
> > entirely incompatible ISO-8859-2 encoded version.
>
> Well, no. If you follow Perl's current guidelines, you will need to
> explicitly express the character encoding on all input and output of
> a program if any utf-8 handling is done. Either with the global "use
> encoding", setting a default, and/or with each open() and database
> access. So, the above example does not hold: as long as XML::LibXML returns
> the data in Latin1 or UTF-8 to the Perl program, the user's application
> will force it into the correct encoding when displaying on the screen
> or writing it to file. Inside the perl program, the encoding does not
> matter at all (except for reasons of performance).
Just remind yourself that with XML you cannot know the encoding of a
stream in advance. The encoding is declared separately for each document
in the XML declaration, and only this declaration informs a system about
the REAL encoding of an XML document. libxml2 even goes further: if no
encoding is declared, it performs some analysis of the data before it
assumes that the data is really in UTF8.
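As a rough sketch of what that means in practice (assuming XML::LibXML is installed; the documents and their encodings are illustrative), the same parser code has to cope with whatever encoding each document happens to declare:

```perl
use strict;
use warnings;
use XML::LibXML;

# The same text ("café") delivered as two different byte streams; only
# the XML declaration tells the parser how to decode each one.
my $as_utf8   = qq{<?xml version="1.0" encoding="UTF-8"?><r>caf\xc3\xa9</r>};
my $as_latin1 = qq{<?xml version="1.0" encoding="ISO-8859-1"?><r>caf\xe9</r>};

my $parser = XML::LibXML->new;
for my $bytes ($as_utf8, $as_latin1) {
    my $doc = $parser->parse_string($bytes);
    # Inside the DOM, both documents yield the identical 4-character
    # Perl string, regardless of the on-the-wire encoding.
    print length($doc->documentElement->textContent), "\n";
}
```

The point of the sketch: the caller never told the parser an encoding; it came entirely from each document's own declaration.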
> > The problem with perl's "use encoding" is that it allows people to tell
> > what perl should _assume_ a string is encoded in, even if most parts of
> > the additional logic _know_ that it is not (in the example XML::LibXML
> > knows that everything internal is UTF8 and the external representation
> > is ISO-8859-6).
>
> use encoding "iso-8859-6";
> open F, "<$f";
>
> is an alternative for
>
> open F, "<:encoding(iso-8859-6)", $f;
>
> including the character-set in which to interpret STDIN, STDOUT and
> STDERR, and all other files which are read or written. It tells Perl's
> IO layers a default.
>
> You also need to use :encoding(utf-8) on read and write, because
> Perl's internal utf8 (without dash) is not strict: it could produce
> undefined characters, does not understand markers etc.
The problem with XML-aware code is that you don't know what format you
will end up with in the real world. The same code may load XML documents
in UTF8, other UTF dialects, and all kinds of the ISO-8859-* family. You
simply cannot know what encoding is used with a particular XML document
unless you actually go and read it. Thus, hard-wiring the encoding into
the IO layer is the wrong solution and may break your code.
The actual problem is not so much the output but the input. However,
even with the output you may end up with problems if the actual encoding
of an XML document is not honoured.
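A minimal sketch of the alternative (the ISO-8859-1 literal is illustrative): keep Perl's IO layers out of the input path and hand the parser raw bytes, so libxml2 can honour the XML declaration itself:

```perl
use strict;
use warnings;
use XML::LibXML;

# The handle delivers raw bytes; no :encoding(...) layer is applied,
# because the document itself declares its encoding.
my $bytes = qq{<?xml version="1.0" encoding="ISO-8859-1"?><r>caf\xe9</r>};
open my $fh, '<:raw', \$bytes or die $!;

my $doc = XML::LibXML->new->parse_fh($fh);
# libxml2 decoded the ISO-8859-1 bytes according to the declaration.
print $doc->documentElement->textContent, "\n";
```

Had the handle carried a hard-wired `:encoding(iso-8859-2)` layer instead, the bytes would have been re-interpreted before the parser ever saw the declaration.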
> > Because UTF8 is THE STANDARD encoding for XML and DOM, we decided (after
> > long discussions on the perl-xml list and in the related IRC channels)
> > that in case of doubt we should always opt for what we KNOW while
> > running the code and not for what a programmer might have ASSUMED at the
> > time of writing it.
>
> Perl explicitly defines Latin1, but programmers may not know that and
> assume the wrong thing. The problem is now: XML::LibXML did not punish
> these users while developing their code, therefore changing this may
> break some people's existing code (while helping new developers)
The problem is again KNOWING vs. ASSUMING. Why should someone get
punished just because the system assumes that something is wrong?
> > 1. ALWAYS use encoding UTF8
>
> This is totally incorrect. This sets the STDIN/STDOUT/STDERR and other
> file defaults to utf8. But that is a system setting. (If I interpret
> this point as "use encoding 'utf-8';")
Sorry, but you completely misunderstand the problem.
For the internal string representation you should FORCE all IO of perl
into UTF8 mode. I don't know if this keeps perl from doing character
downgrading, but it should. This *should* tell perl that the programmer
prefers correct character handling over performance.
The programmer should make explicit that with XML data you should
actually use UTF8 for all internal operations.
This has nothing to do with system settings, but with the fact that you
actually don't know the encoding of an XML document until you actually
read it.
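A sketch of what that forcing looks like with core Perl only (the `:std` sub-pragma and the strict `UTF-8` layer are the usual way to do it):

```perl
use strict;
use warnings;

# Apply a strict UTF-8 layer to every handle opened in this lexical
# scope, and (via :std) to STDIN/STDOUT/STDERR, so UTF8 character data
# coming out of the DOM is encoded correctly on output.
use open ':std', ':encoding(UTF-8)';

print "caf\x{e9}\n";   # written out as UTF-8 bytes
```

Note the dash: `:encoding(UTF-8)` is the strict codec, while the bare `utf8` layer is Perl's lax internal format, as the quoted text above points out.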
>
> > 2. disable perl's auto upgrading of strings during all IO operations and
> > leave the tricky bits to libxml2.
>
> ...which does it in a Perl incompatible way. It will break all other
> corners of your application, like database access, which actually do
> interface correctly in this respect.
OK, you are mixing topics here - maybe you are also missing the
differences between the different parts of a system.
Unlike many other systems - particularly databases - XML does not
receive its data from a preconfigurable environment. It is a
standardized approach to dealing with the messy data that is around for
all kinds of nationalist reasons.
Perl 5 lives in the utopia that you can predict your IO - and it lives
in the utopia that you can boil everything down to the North American
lifestyle in one way or another.
XML::LibXML tried to bridge these cultural differences (that is, between
Perl and XML, not between North America and the rest of the world).
In that sense, your statement of correctness reflects an unwillingness
to accept that you cannot predict a messy environment in the same way
you can predict an organized one.
As soon as you use something other than UTF8 as the default encoding in
your entire organized environment, and you use XML for input rather than
only for output, you have to let go of your idea of "correct behaviour".
However, if you stick with UTF8, XML::LibXML actively supports you in
removing the encoding-related issues from messy input.
For some reason I am not confronted with breaking applications, although
I work in completely messy environments using XML::LibXML. Therefore, I
assume that if something breaks at other corners of your system it is
related to the design of an application, and not to incompatible ways of
handling data.
> > 3. assure that no arbitrary octet streams get near the DOM.
>
> byte streams: a pity that BLOB is not a core construct.
> See http://search.cpan.org/~juerd/BLOB-1.01
> Character streams: are Latin1, no problem to put them in the DOM.
Yes, but you have to assure that your character stream is not actually a
downgraded or recoded version of something else. So, in that sense,
latin-1 is also an arbitrary octet stream. Please note that here we
differ in terminology. With latin-1 you HOPE that you got the correct
data, with UTF8 you KNOW that you have the correct data (unless you do
something completely stupid such as downgrading a UTF8 string to latin1
and then promoting it back to UTF8).
> > 4. upgrade all strings that should get into the DOM to UTF8
>
> Gladly, XML::LibXML does that for me.
Then what is your problem?!?
> > 5. never try to extract the XML-document's original encoding from the
> > DOM unless while serializing the entire DOM.
>
> Extracting is into Perl, and as long as the text stays there, you do not
> have a visible encoding. Only when you write it out into a file later.
I mean that you should not force XML::LibXML to return the original
encoding instead of UTF8 characters. The default is UTF8, so this should
usually be transparent to developers. UTF8 data should be handled
correctly by perl's IO layer even if no explicit encoding is given.
For all other encodings, you should strip the document's encoding first
and make it explicit. The problem is that this is nothing XML::LibXML
can be blamed for.
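For illustration (relying on XML::LibXML's documented serialization behaviour; the document literal is an assumption), the distinction between node-level and document-level serialization looks like this:

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->new->parse_string(
    qq{<?xml version="1.0" encoding="ISO-8859-1"?><r>caf\xe9</r>});

# Serializing a node returns UTF8 Perl characters, independent of the
# document's original encoding...
my $chars = $doc->documentElement->toString;

# ...while serializing the whole document returns bytes in the
# encoding named in the XML declaration (ISO-8859-1 here).
my $bytes = $doc->toString;
```

So only the full-document serialization ever resurrects the original encoding; everything extracted node by node stays in UTF8.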
> > The correctness of this decision is related to the XML related
> > specifications. There might be conflicts with other views or discussions
> > related to other parts of perl development.
>
> I disagree fully. Both Perl and XML are very well defined about
> how to handle encodings. Only XML::LibXML has chosen not to follow
> a small part of Perl's official specs (which have changed over time,
> but the last unicode change was a long time ago!)
>
> When users of Perl programs put "use encoding 'iso-8859-1';" explicitly
> in their programs, then XML::LibXML works as Perl prescribes. And all
> other explicit encoding statements work correctly as well. Only when
> you do not state the encoding explicitly the problem becomes clear.
Again, you complain that the world is not ideal, while XML has been
designed precisely for a world that is not ideal. This includes that you
cannot (however hard you might wish or try) predict the unpredictable.
The only correct way of reading external data into XML::LibXML is to
pass it untouched by perl to the library. If you let perl touch the data
first, you choose to be on your own.
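One last sketch of that "untouched" path (File::Temp is used only to make the example self-contained): the simplest way to keep Perl's hands off the bytes is to let the library open the file itself:

```perl
use strict;
use warnings;
use XML::LibXML;
use File::Temp qw(tempfile);

# Write raw bytes to disk; libxml2, not Perl's IO layers, reads them
# back and interprets the XML declaration.
my ($fh, $path) = tempfile(SUFFIX => '.xml');
binmode $fh, ':raw';
print {$fh} qq{<?xml version="1.0" encoding="ISO-8859-1"?><r>caf\xe9</r>};
close $fh or die $!;

my $doc = XML::LibXML->new->parse_file($path);
print $doc->documentElement->textContent, "\n";
```

With parse_file there is no Perl filehandle in between at all, so no `use encoding` or IO-layer default can rewrite the data first.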