Bug #76180 for YAML-Tiny: patch proposal: utf8::encode in write

Sat Mar 31 05:20:21 2012 antox [...] ml.lv - Ticket created

Subject:	patch proposal: utf8::encode in write_string
Date:	Sat, 31 Mar 2012 12:19:42 +0300
To:	bug-YAML-Tiny [...] rt.cpan.org
From:	Antons Suspans <antox [...] ml.lv>

YAML-Tiny-1.51 attempts utf8::decode in read_string, but does not utf8::encode in write_string. Is there a reason for such asymmetric behaviour and serializing not to byte stream? Otherwise I propose the following patch at least for consistency: --- old/Tiny.pm 2012-03-11 04:35:28.000000000 +0200 +++ new/Tiny.pm 2012-03-31 12:07:16.157134064 +0300 @@ -447 +447,6 @@ - join '', map { "$_\n" } @lines; + my $string = join '', map { "$_\n" } @lines; + + # Try to encode as utf8 + utf8::encode($string) if HAVE_UTF8; + + $string; -- Antons Suspans

Thu Sep 26 21:45:02 2013 dagolden [...] cpan.org - Correspondence added

On Sat Mar 31 05:20:21 2012, antox@ml.lv wrote: Show quoted text

> YAML-Tiny-1.51 attempts utf8::decode in read_string, but does not > utf8::encode in write_string. Is there a reason for such asymmetric > behaviour and serializing not to byte stream?

I believe the decoding is an attempt to "do the right thing" for people who pass a string read raw from a file who forgot to decode it. I don't believe that write_string should encode -- at least not by default -- as tools that expect characters and then do their own UTF-8 encoding would double-encode. The "right" thing would be for read_string not to decode by default, but I'm not sure changing it now is wise. Possibly adding a write_utf8_string function that gives encoded output makes sense so the choice is more obvious to people.

Thu Sep 26 21:45:03 2013 The RT System itself - Status changed from 'new' to 'open'

Fri Sep 27 14:31:35 2013 antox [...] ml.lv - Correspondence added

From:

antox [...] ml.lv

On Thu Sep 26 21:45:02 2013, DAGOLDEN wrote: Show quoted text

> On Sat Mar 31 05:20:21 2012, antox@ml.lv wrote:

> > YAML-Tiny-1.51 attempts utf8::decode in read_string, but does not > > utf8::encode in write_string. Is there a reason for such asymmetric > > behaviour and serializing not to byte stream?

> > I believe the decoding is an attempt to "do the right thing" for > people who pass a string read raw from a file who forgot to decode it. > > I don't believe that write_string should encode -- at least not by > default -- as tools that expect characters and then do their own UTF-8 > encoding would double-encode. > > The "right" thing would be for read_string not to decode by default, > but I'm not sure changing it now is wise.

As far as I can understand the spec [1], the stream is supposed to be encoded using UTF-8/16. So information appears either as octets or as a completely unfolded data structure (no intermediate representations). This seems like what YAML::XS does. Apparently, when dealing with streams presented as strings, YAML and YAML::Syck implicitly delegate encoding issues to the outside. If I remember correctly, some JSON module(s) also had deviations in that matter. Show quoted text

> Possibly adding a write_utf8_string function that gives encoded output > makes sense so the choice is more obvious to people.

I suspect I was mostly concerned about behavior of Load/Dump (interface common to all YAML* modules). Dump just yields whatever write_string returns. But also the write method (called by DumpFile) prints output of write_string to file as is - this produces "wide char" warning (unless user has set I/O layer in outer scope or tricked :layer into the passed file path). (For comparison, YAML::(Load|Dump)File force :utf8 layer, but YAML::Syck::(Load|Dump)File accept $fh for path). I think this was why I leaned towards altering write_string and moving closer to YAML::XS. If my suggestion appears likely to cause more problems than to solve, then IMHO putting a few explanatory lines into POD would be more beneficious than adding write_utf8_string. [1] http://www.yaml.org/spec/1.2/spec.html#id2771184

Fri Sep 27 23:53:51 2013 dagolden [...] cpan.org - Correspondence added

On Fri Sep 27 14:31:35 2013, antox@ml.lv wrote: Show quoted text

> I suspect I was mostly concerned about behavior of Load/Dump (interface > common to all YAML* modules). Dump just yields whatever write_string > returns. But also the write method (called by DumpFile) prints output > of write_string to file as is - this produces "wide char" warning > (unless user has set I/O layer in outer scope or tricked :layer into the > passed file path). (For comparison, YAML::(Load|Dump)File force :utf8 > layer, but YAML::Syck::(Load|Dump)File accept $fh for path). I think > this was why I leaned towards altering write_string and moving closer to > YAML::XS. If my suggestion appears likely to cause more problems than to > solve, then IMHO putting a few explanatory lines into POD would be more > beneficious than adding write_utf8_string.

I've looked at all the modules and emailed all the authors to ask about it. I think I'm coming around to your view that matching YAML::XS probably makes more sense. But it breaks backwards compatibility. Still, a major version bump might be justified to get consistency with other modules.

Sat Sep 28 23:27:29 2013 dagolden [...] cpan.org - Correspondence added

Show quoted text

> I've looked at all the modules and emailed all the authors to ask > about it. I think I'm coming around to your view that matching > YAML::XS probably makes more sense. But it breaks backwards > compatibility. Still, a major version bump might be justified to get > consistency with other modules.

After some additional thought, I think adding methods/functions is going to be better, because code that wants that behavior can depend on a version of YAML::Tiny with support for it. Without addressing whether "write_string" and "Dump" should behave one way or another, I have a pull-request that improves UTF-8 support generally and ensures that files are written with UTF-8 encoding. It adds a "write_utf8_string" method (for use by "write") as I didn't want to presume an outcome of discussions about "write_string" and "Dump", but obviously that could be amended if a decision were reached. https://github.com/Perl-Toolchain-Gang/YAML-Tiny/pull/2

Sat Sep 28 23:28:24 2013 dagolden [...] cpan.org - Reference to ticket #89017 added

Sat Sep 28 23:33:12 2013 dagolden [...] cpan.org - Reference by ticket #77875 added

Fri Oct 25 05:27:03 2013 dagolden [...] cpan.org - Correspondence added

I have applied changes to the master branch that now make it clear that YAML::Tiny's read/write are from/to UTF-8 files. read_string and write_string will continue to be character based for historical consistency. I believe a decision needs to be reached about Dump/Load and I've asked Ingy how he wants that to work across YAML modules. I consider this issue resolved, but will leave it as "patched" for now pending a decision about Dump/Load.

Fri Oct 25 05:27:04 2013 dagolden [...] cpan.org - Status changed from 'open' to 'patched'

Fri Oct 25 05:56:34 2013 dagolden [...] cpan.org - Severity Normal added

Sun Feb 02 14:54:39 2014 ether [...] cpan.org - Correspondence added

Can we call this fixed now?

Sun Feb 02 20:21:38 2014 dagolden [...] cpan.org - Correspondence added

The ticket issue itself is resolved in the latest release. The long term future of Dump/Load is still TBD, but that's tracked here: https://github.com/Perl-Toolchain-Gang/YAML-Tiny/issues/11

Sun Feb 02 20:21:39 2014 dagolden [...] cpan.org - Status changed from 'patched' to 'resolved'

Bug #76180 for YAML-Tiny: patch proposal: utf8::encode in write_string