Skip Menu |

This queue is for tickets about the YAML-Tiny CPAN distribution.

Report information
The Basics
Id: 76180
Status: resolved
Priority: 0/
Queue: YAML-Tiny

People
Owner: Nobody in particular
Requestors: antox [...] ml.lv
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: (no value)
Fixed in: (no value)



Subject: patch proposal: utf8::encode in write_string
Date: Sat, 31 Mar 2012 12:19:42 +0300
To: bug-YAML-Tiny [...] rt.cpan.org
From: Antons Suspans <antox [...] ml.lv>
YAML-Tiny-1.51 attempts utf8::decode in read_string, but does not utf8::encode in write_string. Is there a reason for such asymmetric behaviour and serializing not to byte stream? Otherwise I propose the following patch at least for consistency: --- old/Tiny.pm 2012-03-11 04:35:28.000000000 +0200 +++ new/Tiny.pm 2012-03-31 12:07:16.157134064 +0300 @@ -447 +447,6 @@ - join '', map { "$_\n" } @lines; + my $string = join '', map { "$_\n" } @lines; + + # Try to encode as utf8 + utf8::encode($string) if HAVE_UTF8; + + $string; -- Antons Suspans
On Sat Mar 31 05:20:21 2012, antox@ml.lv wrote: Show quoted text
> YAML-Tiny-1.51 attempts utf8::decode in read_string, but does not > utf8::encode in write_string. Is there a reason for such asymmetric > behaviour and serializing not to byte stream?
I believe the decoding is an attempt to "do the right thing" for people who pass a string read raw from a file who forgot to decode it. I don't believe that write_string should encode -- at least not by default -- as tools that expect characters and then do their own UTF-8 encoding would double-encode. The "right" thing would be for read_string not to decode by default, but I'm not sure changing it now is wise. Possibly adding a write_utf8_string function that gives encoded output makes sense so the choice is more obvious to people.
From: antox [...] ml.lv
On Thu Sep 26 21:45:02 2013, DAGOLDEN wrote: Show quoted text
> On Sat Mar 31 05:20:21 2012, antox@ml.lv wrote:
> > YAML-Tiny-1.51 attempts utf8::decode in read_string, but does not > > utf8::encode in write_string. Is there a reason for such asymmetric > > behaviour and serializing not to byte stream?
> > I believe the decoding is an attempt to "do the right thing" for > people who pass a string read raw from a file who forgot to decode it. > > I don't believe that write_string should encode -- at least not by > default -- as tools that expect characters and then do their own UTF-8 > encoding would double-encode. > > The "right" thing would be for read_string not to decode by default, > but I'm not sure changing it now is wise.
As far as I can understand the spec [1], the stream is supposed to be encoded using UTF-8/16. So information appears either as octets or as a completely unfolded data structure (no intermediate representations). This seems like what YAML::XS does. Apparently, when dealing with streams presented as strings, YAML and YAML::Syck implicitly delegate encoding issues to the outside. If I remember correctly, some JSON module(s) also had deviations in that matter. Show quoted text
> Possibly adding a write_utf8_string function that gives encoded output > makes sense so the choice is more obvious to people.
I suspect I was mostly concerned about behavior of Load/Dump (interface common to all YAML* modules). Dump just yields whatever write_string returns. But also the write method (called by DumpFile) prints output of write_string to file as is - this produces "wide char" warning (unless user has set I/O layer in outer scope or tricked :layer into the passed file path). (For comparison, YAML::(Load|Dump)File force :utf8 layer, but YAML::Syck::(Load|Dump)File accept $fh for path). I think this was why I leaned towards altering write_string and moving closer to YAML::XS. If my suggestion appears likely to cause more problems than to solve, then IMHO putting a few explanatory lines into POD would be more beneficious than adding write_utf8_string. [1] http://www.yaml.org/spec/1.2/spec.html#id2771184
On Fri Sep 27 14:31:35 2013, antox@ml.lv wrote: Show quoted text
> I suspect I was mostly concerned about behavior of Load/Dump (interface > common to all YAML* modules). Dump just yields whatever write_string > returns. But also the write method (called by DumpFile) prints output > of write_string to file as is - this produces "wide char" warning > (unless user has set I/O layer in outer scope or tricked :layer into the > passed file path). (For comparison, YAML::(Load|Dump)File force :utf8 > layer, but YAML::Syck::(Load|Dump)File accept $fh for path). I think > this was why I leaned towards altering write_string and moving closer to > YAML::XS. If my suggestion appears likely to cause more problems than to > solve, then IMHO putting a few explanatory lines into POD would be more > beneficious than adding write_utf8_string.
I've looked at all the modules and emailed all the authors to ask about it. I think I'm coming around to your view that matching YAML::XS probably makes more sense. But it breaks backwards compatibility. Still, a major version bump might be justified to get consistency with other modules.
Show quoted text
> I've looked at all the modules and emailed all the authors to ask > about it. I think I'm coming around to your view that matching > YAML::XS probably makes more sense. But it breaks backwards > compatibility. Still, a major version bump might be justified to get > consistency with other modules.
After some additional thought, I think adding methods/functions is going to be better, because code that wants that behavior can depend on a version of YAML::Tiny with support for it. Without addressing whether "write_string" and "Dump" should behave one way or another, I have a pull-request that improves UTF-8 support generally and ensures that files are written with UTF-8 encoding. It adds a "write_utf8_string" method (for use by "write") as I didn't want to presume an outcome of discussions about "write_string" and "Dump", but obviously that could be amended if a decision were reached. https://github.com/Perl-Toolchain-Gang/YAML-Tiny/pull/2
I have applied changes to the master branch that now make it clear that YAML::Tiny's read/write are from/to UTF-8 files. read_string and write_string will continue to be character based for historical consistency. I believe a decision needs to be reached about Dump/Load and I've asked Ingy how he wants that to work across YAML modules. I consider this issue resolved, but will leave it as "patched" for now pending a decision about Dump/Load.
Can we call this fixed now?
The ticket issue itself is resolved in the latest release. The long term future of Dump/Load is still TBD, but that's tracked here: https://github.com/Perl-Toolchain-Gang/YAML-Tiny/issues/11