If it is a "byte" string, it is not a "character" string, so character
semantics (which is what the UTF8 flag stands for, in spite of its name)
do not apply to it; hence the UTF8 flag is off (and this is
deliberate).
What the documentation tries to say is indeed that while for a node
the UTF8 flag is on (and you get what appears as characters to Perl), for a
document the flag is off (and you get what should appear to Perl as
binary data).
So, when dumping a document to a file (like you do below), you don't
want to use the :utf8 IO layer. This is because the encoding of the
document may be arbitrary (in fact, you should be completely ignorant
about it) and the same
    open OUT, '>', $file or die "Cannot open $file: $!";
    print OUT $doc->toString;
will work for all documents. I probably have to further extend the
paragraph in the documentation to make this even clearer...
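
To illustrate why the :utf8 layer hurts here, a minimal sketch using only
core modules. It assumes a byte string built with Encode::encode as a
stand-in for what $doc->toString returns for a UTF-8 document; the sample
XML and the helper bytes_on_disk are hypothetical and not part of
XML::LibXML.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Stand-in for $doc->toString: a byte string that already holds UTF-8 data.
# (Hypothetical sample document; "\x{eb}" is e with diaeresis.)
my $bytes = encode('UTF-8',
    qq{<?xml version="1.0" encoding="UTF-8"?>\n<a>\x{eb}</a>\n});

# Write $data to a temporary file using the given open mode and
# return the raw bytes that ended up on disk.
sub bytes_on_disk {
    my ($mode, $data) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    close $fh;
    open my $out, $mode, $path or die "Cannot open $path: $!";
    print {$out} $data;
    close $out or die "Cannot close $path: $!";
    open my $in, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                      # slurp the whole file
    my $content = <$in>;
    close $in;
    return $content;
}

my $raw     = bytes_on_disk('>:raw',  $bytes);  # bytes pass through untouched
my $layered = bytes_on_disk('>:utf8', $bytes);  # bytes >= 0x80 get encoded again

printf "flag on byte string: %s\n", utf8::is_utf8($bytes) ? "yes" : "no";
printf "raw matches:         %s\n", $raw     eq $bytes    ? "yes" : "no";
printf "utf8 layer matches:  %s\n", $layered eq $bytes    ? "yes" : "no";
```

With this input the flag should be off, the raw write should reproduce the
bytes exactly, and the :utf8 write should not, because the layer treats
each byte as a Latin-1 character and encodes it a second time.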
-- Petr
On Mon Jan 28 2008 07:28:23, solutions@overmeer.net wrote:
> * Petr Pajas via RT (bug-XML-LibXML@rt.cpan.org) [080128 12:10]:
>
> Yes, it does return the byte-string in the correct encoding. However,
> my complaint is, that if the resulting encoding is utf8, then the
> output does not have the utf8 flag on.
>
> What I do is:
> open OUT, '>:utf8', $file;
> print OUT $doc->toString;
>
> (simplified version of the real code, I know there is a toFile)
>
> The problem is that I get a double encoding of the utf8 data,
> because toString() does not set the utf8 flag correctly. The
> output layer is encoding again.
>
> In my module, the problem is extra nasty, because I had someone creating
> iso-8859-1 xml data, which then had to be transmitted as UTF8 because
> SOAP requires it. Problems came with ë and friends (yes, he is German)
> in that data. The right flag could avoid it. Of course, there is no
> right flag when the encoding is neither latin1 nor utf8...
>
> I do not know whether changing this behavior will break other
> people's code, but I assume everyone ignores io-layers, so it
> may not be a problem.
>
> > On Wed Jan 09 2008 11:06:38, Mark@Overmeer.net wrote:
> > > my $x = $doc->toString;
> > > print $x;
> > > print utf8::is_utf8($x) ? "yes" : "no";
> > >
> > > The string is correct utf8, but does not have the utf8 flag set.
> > > Or am I doing something wrong (again)?
>