This queue is for tickets about the YAML-LibYAML CPAN distribution.

Report information
The Basics
Id: 54683
Status: open
Priority: 0/
Queue: YAML-LibYAML

People
Owner: Nobody in particular
Requestors: mschwern [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 0.32
Fixed in: (no value)



Subject: Load() and utf8.pm don't play nice.
The attached program demonstrates utf8 and YAML::XS don't mix.

--------
    #!/usr/bin/perl

    use utf8;
    use YAML::XS qw(Load);

    my $yaml =q[
    Ævar Arnfjörð Bjarmason:
        - avar@cpan.org
        - avarab@gmail.com
    ];

    Load($yaml);

    print "Done\n";
---------

I get:

    YAML::XS::Load Error: The problem:

        Invalid trailing UTF-8 octet

    was found at document: 0

With a wide character, like so:

    my $yaml =q[
    貞廣知行:
        - bqw10602@nifty.com
    ];

I get:

    Wide character in subroutine entry at yourfile line 11.
On Tue Feb 16 20:34:03 2010, MSCHWERN wrote:
> The attached program demonstrates utf8 and YAML::XS don't mix.
Confirmed on YAML::XS 0.38.
On Mon Feb 27 13:40:41 2012, ETHER wrote:
> On Tue Feb 16 20:34:03 2010, MSCHWERN wrote:
> > The attached program demonstrates utf8 and YAML::XS don't mix.
> Confirmed on YAML::XS 0.38.
More on this:

- All testing done on perl5.14.2
- This worked fine, on a data file created with vim with the configs ":encoding=utf-8" and "tenc=latin-1":

    use strict;
    use warnings;
    use v5.14;
    use YAML::XS 'LoadFile';
    use Data::Dumper;

    my $data = LoadFile("utf8.txt");
    print Dumper($data);
    foreach my $key (keys %$data) {
        print "### key $key\n";
    }
On 2010-02-16 23:34:03, MSCHWERN wrote:
> The attached program demonstrates utf8 and YAML::XS don't mix.
And I think this is fine. YAML content should be treated as a binary blob. Maybe YAML::XS should just warn or even die if it finds a wide character (just like Digest::MD5 is doing, for instance). Regards, Slaven
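
The "warn or even die" idea can be sketched in user code today. This is only an illustration: `load_octets` is a hypothetical wrapper name, not part of YAML::XS, and the check relies on the core `utf8::downgrade` function returning false when a string contains a code point above 0xFF.

```perl
use strict;
use warnings;
use Carp qw(croak);
use YAML::XS ();

# Hypothetical wrapper; load_octets() is not part of YAML::XS.
sub load_octets {
    my ($input) = @_;    # copy, so the caller's string is left untouched

    # With a true second argument, utf8::downgrade() returns false rather
    # than dying when the string holds a code point above 0xFF.
    utf8::downgrade($input, 1)
        or croak "Wide character in YAML input; encode it as UTF-8 first";

    return YAML::XS::Load($input);
}

my $ok = load_octets("name: value\n");    # plain octets pass through
my $rejected = !eval { load_octets("\x{8C9E}: x\n"); 1 };
print "wide input rejected: $@" if $rejected;
```

Note that this guard only catches wide characters. A character string whose code points all fit below 0x100, such as the first example in this ticket, would still pass the check and then fail inside Load() as invalid UTF-8.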
On 2013-04-15 14:48:08, SREZIC wrote:
> On 2010-02-16 23:34:03, MSCHWERN wrote:
> > The attached program demonstrates utf8 and YAML::XS don't mix.
> And I think this is fine. YAML content should be treated as a binary
> blob. Maybe YAML::XS should just warn or even die if it finds a wide
> character (just like Digest::MD5 is doing, for instance).
YAML::XS already documents this behavior, see "USING YAML::XS WITH UNICODE" in the Pod. It explicitly mentions "utf8 octets". I suggest rejecting this ticket, or maybe adding more examples to the documentation. Regards, Slaven
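
One such example, as a minimal sketch rather than official documentation: since Load() expects UTF-8 octets, a character string produced under `use utf8` can be encoded with the core Encode module before it is passed in. The YAML text is the one from the original report; the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                      # string literals below are character strings
use Encode qw(encode);
use YAML::XS qw(Load);

my $yaml = q[
Ævar Arnfjörð Bjarmason:
    - avar@cpan.org
    - avarab@gmail.com
];

# Load() wants UTF-8 octets, so encode the character string first.
my $data = Load(encode('UTF-8', $yaml));

my ($name)    = keys %$data;
my $addresses = $data->{$name};
print scalar(@$addresses), " addresses loaded\n";
```

The same encoding step should apply to the wide-character example, because after encoding the string no longer contains any code point above 0xFF.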
Here is the content of "USING YAML::XS WITH UNICODE".

       Handling unicode properly in Perl can be a pain. YAML::XS only deals
       with streams of utf8 octets. Just remember this:

           $perl = Load($utf8_octets);
           $utf8_octets = Dump($perl);

       There are many, many places where things can go wrong with unicode.  If
       you are having problems, use Devel::Peek on all the possible data
       points.

Doesn't offer a solution to my problem.


> YAML content should be treated as a binary blob.

I'm pretty sure this is wrong.  The YAML 1.2 spec talks extensively about Unicode.  Section 5.2 specifically states "on input, a YAML processor must support the UTF-8 and UTF-16 character encodings".

5.2. Character Encodings

All characters mentioned in this specification are Unicode code points. Each such code point is written as one or more bytes depending on the character encoding used. Note that in UTF-16, characters above #xFFFF are written as four bytes, using a surrogate pair.

The character encoding is a presentation detail and must not be used to convey content information.

On input, a YAML processor must support the UTF-8 and UTF-16 character encodings. For JSON compatibility, the UTF-32 encodings must also be supported.
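
To make the quoted section concrete, here is a small sketch using only the core Encode module. It prints how many bytes a few code points occupy in each encoding the spec names; the sample code points are arbitrary choices for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample code points: ASCII, Latin-1 range, CJK range, beyond U+FFFF.
for my $cp (0x41, 0xC6, 0x8C9E, 0x1F600) {
    my $char = chr $cp;
    printf "U+%04X: UTF-8 %d bytes, UTF-16 %d bytes, UTF-32 %d bytes\n",
        $cp,
        length(encode('UTF-8',    $char)),
        length(encode('UTF-16BE', $char)),    # BE variant adds no BOM
        length(encode('UTF-32BE', $char));
}
```

The last code point lies above #xFFFF, so its UTF-16 form is a surrogate pair of four bytes, as the spec text describes.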