
This queue is for tickets about the JSON-Any CPAN distribution.

Report information
The Basics
Id: 29917
Status: resolved
Priority: 0/
Queue: JSON-Any

People
Owner: Nobody in particular
Requestors: bohica [...] ntlworld.com
Cc:
AdminCc: cpan [...] prather.org

Bug Information
Severity: Normal
Broken in: 1.09
Fixed in: (no value)



Subject: unicode discrepancies
A couple of points to do with unicode in a few areas - some minor, some not so minor IMHO.

1. "The one parameter that is universally supported (to the extent that is supported by the underlying JSON modules) is |utf8|. When this parameter is enabled all resulting JSON will be marked as unicode, and all unicode strings in the input data structure will be preserved as such"

I think this statement is misleading as there is no such thing as "marking" a string as unicode in Perl and unicode in Perl is not utf-8.

2. Following on from above "The actual output will vary"

Most JSON modules that can be used by JSON::Any are buggy, especially wrt unicode. Sometimes they do use the internal utf-8 flag as an indicator, sometimes they just fail to set it. I would be careful making claims unicode works if you set the mysterious |utf8| flag. For example, unicode does not work at all with JSON::Syck via JSON::Any unless $JSON::Syck::ImplicitUnicode = 1 is set.

Here is an example:

    use strict;
    use warnings;
    use JSON::Any qw(Syck);
    use Data::Dumper;
    #$JSON::Syck::ImplicitUnicode = 1;
    binmode (STDOUT, ":utf8");
    my $str="\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE";
    print Dumper([$str]);
    my $js = JSON::Syck::Dump([$str]);
    open OUT, ">uni.out";
    binmode(OUT, ":utf8");
    print OUT "$str\n";
    close OUT;
    open IN, "<json.out";
    binmode (IN, ":utf8");
    my $fd="";
    $fd .= $_ while (<IN>);
    print "string from file: " . Dumper($fd),"\n";
    my $obj = JSON::Any->decode($fd);
    print Dumper($obj);

produces:

    $VAR1 = [ "\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE" ];
    string from file: $VAR1 = "[\"\x{e5}\x{8f}\x{b0}\x{e6}\x{89}\x{80}\x{e3}\x{81}\x{ae}\x{e6}\x{b5}\x{81}\x{e3}\x{81}\x{97}JAPANESE\"] ";
    $VAR1 = [ '台所の流しJAPANESE' ];

but if you uncomment the ImplicitUnicode line it works correctly:

    $VAR1 = [ "\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE" ];
    string from file: $VAR1 = "[\"\x{e5}\x{8f}\x{b0}\x{e6}\x{89}\x{80}\x{e3}\x{81}\x{ae}\x{e6}\x{b5}\x{81}\x{e3}\x{81}\x{97}JAPANESE\"] ";
    $VAR1 = [ "\x{e5}\x{8f}\x{b0}\x{e6}\x{89}\x{80}\x{e3}\x{81}\x{ae}\x{e6}\x{b5}\x{81}\x{e3}\x{81}\x{97}JAPANESE" ];

However, this may just be a result of the fact that JSON::Syck does not have a constructor ("creator" key in your %conf) and hence there is no way to set the "utf8" flag as the line "if ( my $creator = $conf{$key}->{create_object} ) " fails. BTW, you don't mention in the pod you cannot do my $f = JSON::Any->new() when using JSON::Syck.

3. I don't really understand this "utf8" flag. What has it got to do with Unicode? It is an encoding and therefore just a way of encoding unicode codepoints and should only get involved when importing or exporting data in to or out of Perl. The following code with JSON::XS works fine without any "utf8" flags because Perl understands unicode and so does JSON::XS:

    use strict;
    use warnings;
    use JSON::XS;
    use Data::Dumper;
    binmode (STDOUT, ":utf8");
    my $str="\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE";
    print Dumper([$str]);
    open OUT, ">uni.out";
    binmode(OUT, ":utf8");
    print OUT "$str\n";
    close OUT;
    my $js = JSON::XS->new->encode([$str]);
    print "json encoded str is $js\n" . Dumper($js);
    open OUT, ">json.out";
    binmode(OUT, ":utf8");
    print OUT "$js\n";
    close OUT;
    open IN, "<json.out";
    binmode (IN, ":utf8");
    my $fd="";
    $fd .= $_ while (<IN>);
    print "string from file: " . Dumper($fd),"\n";
    my $obj = JSON::XS->new->decode($fd);
    print Dumper($obj);

producing:

    $VAR1 = [ "\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE" ];
    json encoded str is ["台所の流しJAPANESE"]
    $VAR1 = "[\"\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE\"]";
    string from file: $VAR1 = "[\"\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE\"] ";
    $VAR1 = [ "\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE" ];

JSON::XS was given a unicode string and gave me back a JSON encoded unicode string. This was only encoded in utf8 when it was written to a file. On reading the file back, Perl is told the file is utf8 encoded and hence translates the utf8 into unicode characters which we pass through JSON::XS to get our object back containing unicode characters.

4. For some reason JSON::Any appears to use the JSON::XS to_json method but that converts any true unicode chrs in the input string to a binary string of utf8 octets on output. As a result, when the |utf8| flag is set JSON::Any has to call decode to get back a unicode string when in fact simply using JSON::XS->new->encode would have done the right thing from the start.

I appreciate JSON::Any must be a real pain to keep consistent when all the JSON modules are so different but I'd at least change the pod to warn about unicode inconsistencies instead of suggesting it just works across the board.

You may be asking why I don't just use JSON::XS directly and that is because I'm actually using POE::Filter::JSON and that uses JSON::Any.

Martin
--
Martin J. Evans
Wetherby, UK
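Point 3 above separates Unicode characters from their UTF-8 encoding. The following is a minimal sketch of that distinction, assuming only the core Encode module; the variable names are illustrative and do not come from the report.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A character string: five Japanese code points followed by eight ASCII letters.
my $chars = "\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE";

# Encoding turns code points into octets; this belongs at the I/O boundary.
my $octets = encode('UTF-8', $chars);

# Decoding reverses the step when data comes back into Perl.
my $round_trip = decode('UTF-8', $octets);

printf "characters: %d\n", length $chars;     # 13
printf "octets:     %d\n", length $octets;    # 23
print  "round trip ok\n" if $round_trip eq $chars;
```

The two lengths match the char_length and octet_length values that Devel::StringInfo reports for the same string later in this ticket.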
From: NUFFIN [...] cpan.org
The utf8 flag passed to JSON::XS was me being a doofus, i got it the opposite way round. This is now fixed in trunk and Chris should release it shortly.

As for Syck - IMHO it should die for now (like it does for JSON::DWIW), and later add proper support by localizing implicitunicode in encode/decode.

On Thu Oct 11 08:04:03 2007, MJEVANS wrote:
> I think this statement is misleading as there is no such thing as "marking" a string as unicode in Perl and unicode in Perl is not utf-8.

My english-fu is weak. Perhaps you can be tempted into improving the docs? I was trying to convey that with utf8 => 1 passed to JSON::Any all data is going to be in wide chars, that is utf8::is_utf8 is true (i used the word marking because of the utf8 flag telling perl to decode the utf8 octets and give back wide chars instead of 0-255).

> 2. Following on from above "The actual output will vary"

Commented on above

> Here is an example:

Just FYI (a bit off topic), Devel::StringInfo was written to ease writing such demonstration code.

Anyway, do you think maybe you could translate this into a test, specifically for JSON-Syck, like the JSON-XS test in JSON::Any tests utf8?

t/10-unicode.t currently skips Syck because it's pretty much broken.

> 3. I don't really understand this "utf8" flag. What has it got to do with Unicode? it is an encoding and therefore just a way of encoding unicode codepoints and should only get involved when importing or exporting data in to or out of Perl. The following code with JSON::XS works fine without any "utf8" flags because Perl understands unicode and so does JSON::XS:

Yeah, that was me getting the api wrong, utf8 => 0 means "use unicode" and utf8 => 1 means "use utf8 octets", whereas for JSON::Converter/Parser it's the opposite. This has been fixed.

> I appreciate JSON::Any must be a real pain to keep consistent when all the JSON modules are so different but I'd at least change the pod to warn about unicode inconsistencies instead of suggesting it just works across the board.

Those are code bugs, not doc bugs =)

Unicode support is possible to get consistently, and should really be supported, otherwise JSON::Any is pretty much useless for any scenarios involving unicode data.

Thanks for taking the time to make such a detailed report,

Regards,
Yuval
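As an illustration of the utf8 => 0 versus utf8 => 1 distinction described above for JSON::XS, here is a minimal sketch. It assumes JSON::PP (a core module with a JSON::XS-compatible utf8 option) as a stand-in so that it runs without extra installs; it does not show JSON::Any's own code.

```perl
use strict;
use warnings;
use JSON::PP;   # assumption: stands in for JSON::XS, same utf8 semantics

my $data = ["\x{53f0}\x{6240}"];

# utf8(0): encode returns a character string, code points above 255 intact.
my $as_chars = JSON::PP->new->utf8(0)->encode($data);

# utf8(1): encode returns UTF-8 octets, ready for a raw filehandle or socket.
my $as_octets = JSON::PP->new->utf8(1)->encode($data);

printf "utf8(0) length: %d\n", length $as_chars;     # 6: [ " two chars " ]
printf "utf8(1) length: %d\n", length $as_octets;    # 10: two chars become six octets
```

Writing the utf8(0) result through a ":utf8" layer, as the report's JSON::XS script does, ends up with the same bytes on disk as writing the utf8(1) result to a raw handle.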
On Thu Oct 11 12:09:43 2007, NUFFIN wrote:
> The utf8 flag passed to JSON::XS was me being a doofus, i got it the opposite way round.
> This is now fixed in trunk and Chris should release it shortly.

Can you point me at the subversion repository please then I can try it out.

> As for Syck - IMHO it should die for now (like it does for JSON::DWIW), and later add proper support by localizing implicitunicode in encode/decode.
>
> On Thu Oct 11 08:04:03 2007, MJEVANS wrote:
> > I think this statement is misleading as there is no such thing as "marking" a string as unicode in Perl and unicode in Perl is not utf-8.
>
> My english-fu is weak. Perhaps you can be tempted into improving the docs? I was trying to convey that with utf8 => 1 passed to JSON::Any all data is going to be in wide chars, that is utf8::is_utf8 is true (i used the word marking because of the utf8 flag telling perl to decode the utf8 octets and give back wide chars instead of 0-255).

I am happy to take a shot at improving the docs if you can point me at the subversion repository I can get a copy of the latest to provide patches against.

> > 2. Following on from above "The actual output will vary"
>
> Commented on above
>
> > Here is an example:
>
> Just FYI (a bit off topic), Devel::StringInfo was written to ease writing such demonstration code.

Just tried that - nice pointer - thanks.

> Anyway, do you think maybe you could translate this into a test, specifically for JSON-Syck, like the JSON-XS test in JSON::Any tests utf8?

Probably given a pointer to subversion repository.

> t/10-unicode.t currently skips Syck because it's pretty much broken.

It can be made to work if ImplicitUnicode is used. I used JSON::Syck for ages (with a few problems) but just changed to JSON::XS.

> > 3. I don't really understand this "utf8" flag. What has it got to do with Unicode? it is an encoding and therefore just a way of encoding unicode codepoints and should only get involved when importing or exporting data in to or out of Perl. The following code with JSON::XS works fine without any "utf8" flags because Perl understands unicode and so does JSON::XS:
>
> Yeah, that was me getting the api wrong, utf8 => 0 means "use unicode" and utf8 => 1 means "use utf8 octets", whereas for JSON::Converter/Parser it's the opposite. This has been fixed.

Love to get a copy of the fixed version.

> > I appreciate JSON::Any must be a real pain to keep consistent when all the JSON modules are so different but I'd at least change the pod to warn about unicode inconsistencies instead of suggesting it just works across the board.
>
> Those are code bugs, not doc bugs =)
>
> Unicode support is possible to get consistently, and should really be supported, otherwise JSON::Any is pretty much useless for any scenarios involving unicode data.

Yes, that was my problem.

> Thanks for taking the time to make such a detailed report,

No problem.

Martin
--
Martin J. Evans
Wetherby, UK
On Thu Oct 11 12:45:48 2007, MJEVANS wrote:
> Can you point me at the subversion repository please then I can try it out.

https://json-any.googlecode.com/svn

> I am happy to take a shot at improving the docs if you can point me at the subversion repository I can get a copy of the latest to provide patches against.

Great!

> It can be made to work if ImplicitUnicode is used. I used JSON::Syck for ages (with a few problems) but just changed to JSON::XS.

I meant JSON::Any's Syck support is not on par with the others due to lack of OO. There are a few other issues IIRC, and I didn't have time to redo it all to support this properly.

I think some sort of proxy object should be written, e.g. JSON::Any::SyckWrapper which will localize the vars according to a config it was instantiated with before encode/decode. Maybe it can be done with just closures, too. However $j->handler should still report that it's Syck somehow.

> Love to get a copy of the fixed version.

perigrin: BOY!?!??! RELEASE!
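The proxy-object idea suggested above (localizing the JSON::Syck variables around each encode/decode call) could look roughly like the sketch below. This is not code from JSON::Any: the class name is taken from the suggestion in this message, the `dump`/`load` constructor arguments are hypothetical, and a stand-in backend is injected so the example runs without JSON::Syck installed.

```perl
use strict;
use warnings;

{
    package JSON::Any::SyckWrapper;   # name from the suggestion above; hypothetical

    sub new {
        my ($class, %conf) = @_;
        my $self = {
            unicode => $conf{utf8} ? 1 : 0,
            # Injected code refs (hypothetical arguments). With JSON::Syck loaded
            # these would be \&JSON::Syck::Dump and \&JSON::Syck::Load.
            dump => $conf{dump},
            load => $conf{load},
        };
        return bless $self, $class;
    }

    # Report the real backend, as requested for $j->handler.
    sub handler { 'JSON::Syck' }

    sub encode {
        my ($self, $data) = @_;
        # The setting applies only for the duration of this call.
        local $JSON::Syck::ImplicitUnicode = $self->{unicode};
        return $self->{dump}->($data);
    }

    sub decode {
        my ($self, $json) = @_;
        local $JSON::Syck::ImplicitUnicode = $self->{unicode};
        return $self->{load}->($json);
    }
}

# Stand-in backend: it only reports the setting it sees while being called.
my $probe = sub { $JSON::Syck::ImplicitUnicode ? 'unicode' : 'octets' };

my $j = JSON::Any::SyckWrapper->new(utf8 => 1, dump => $probe, load => $probe);
print $j->encode([]), "\n";
print defined $JSON::Syck::ImplicitUnicode ? "leaked\n" : "restored\n";
```

Because `local` restores the previous value when the method returns, two wrapper objects with different settings would not interfere with each other or with code that sets the global itself.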
On Thu Oct 11 13:34:36 2007, NUFFIN wrote:

Correction: http://json-any.googlecode.com/svn

Talk to chris about getting commit access to use the https URI.
From: martin.evans [...] easysoft.com
On Thu Oct 11 13:36:24 2007, NUFFIN wrote:
> On Thu Oct 11 13:34:36 2007, NUFFIN wrote:
>
> Correction: http://json-any.googlecode.com/svn
>
> Talk to chris about getting commit access to use the https URI.

Thanks. If I have anything I'll just supply patches - write access to the repository is not required as I wasn't planning on becoming a full JSON::Any team member - just a one-off contributor.

Martin
--
Martin J. Evans
Wetherby, UK
ok, I'm more confused now.
I've downloaded the trunk from subversion and built it. I can see in the diffs between r37 and r38 the sense of utf8 appears to have been changed:

    + local $conf->{utf8} = !$conf->{utf8}; # it means the opposite
    +

If I run the script below with utf8=>1 it works as before but I thought you'd reversed the meaning of utf8 so it should work without setting utf8?

    use strict;
    use warnings;
    use JSON::Any qw(XS);
    use Devel::StringInfo;
    binmode (STDOUT, ":utf8");
    my $str="\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE";
    print "starting string: " . Devel::StringInfo->new->dump_info($str);
    my $js = JSON::Any->new(utf8=>1)->Dump([$str]);
    open OUT, ">uni.out";
    binmode(OUT, ":utf8");
    print OUT "$str";
    close OUT;
    open IN, "<json.out";
    binmode (IN, ":utf8");
    my $fd="";
    $fd .= $_ while (<IN>);
    print "string from file: " . Devel::StringInfo->new->dump_info($fd);
    my $obj = JSON::Any->new(utf8=>1)->decode($fd);
    print "string after decode: " . Devel::StringInfo->new->dump_info($obj->[0]);

produces:

    perl -I /home/martin/svn/json_any/trunk/lib ex1.pl
    starting string: string: ÕÅ░µëÇÒü«µÁüÒüùJAPANESE
    is_utf8: 1
    char_length: 13
    octet_length: 23
    downgradable: 0
    raw = <<ÕÅ░µëÇÒü«µÁüÒüùJAPANESE>>
    string from file: string: "[\"ÕÅ░µëÇÒü«µÁüÒüùJAPANESE\"]\n"
    is_utf8: 1
    char_length: 18
    octet_length: 28
    downgradable: 0
    raw = <<END_OF_STRING
    ["ÕÅ░µëÇÒü«µÁüÒüùJAPANESE"]
    END_OF_STRING
    string after decode: string: ÕÅ░µëÇÒü«µÁüÒüùJAPANESE
    is_utf8: 1
    char_length: 13
    octet_length: 23
    downgradable: 0
    raw = <<ÕÅ░µëÇÒü«µÁüÒüùJAPANESE>>

(forgive the dodgy chrs, I'm at home now and my terminal does not understand unicode).

If you remove the utf8=>1 it fails like this:

    perl -I /home/martin/svn/json_any/trunk/lib ex1_noutf8.pl
    starting string: string: ÕÅ░µëÇÒü«µÁüÒüùJAPANESE
    is_utf8: 1
    char_length: 13
    octet_length: 23
    downgradable: 0
    raw = <<ÕÅ░µëÇÒü«µÁüÒüùJAPANESE>>
    string from file: string: "[\"ÕÅ░µëÇÒü«µÁüÒüùJAPANESE\"]\n"
    is_utf8: 1
    char_length: 18
    octet_length: 28
    downgradable: 0
    raw = <<END_OF_STRING
    ["ÕÅ░µëÇÒü«µÁüÒüùJAPANESE"]
    END_OF_STRING
    Wide character in subroutine entry at /home/martin/svn/json_any/trunk/lib/JSON/Any.pm line 353, <IN> line 1.

Martin
--
Martin J. Evans
Wetherby, UK
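The "Wide character in subroutine entry" failure above is consistent with a decoder that expects UTF-8 octets being handed a character string. The sketch below reproduces that situation under an assumption: it uses the core JSON::PP module as a stand-in for JSON::XS and calls it directly, so it does not exercise JSON::Any itself.

```perl
use strict;
use warnings;
use JSON::PP;   # assumption: stands in for JSON::XS for this case

# A character string, like one read through a ":utf8" layer.
my $json_chars = qq(["\x{53f0}\x{6240}"]);

# With utf8 enabled the decoder expects octets, so wide characters are rejected.
my $ok = eval { JSON::PP->new->utf8(1)->decode($json_chars); 1 };
print "utf8(1) decode of characters: ", ($ok ? "ok" : "died"), "\n";

# With utf8 disabled the same character string decodes.
my $obj = JSON::PP->new->utf8(0)->decode($json_chars);
print "utf8(0) decoded length: ", length($obj->[0]), "\n";
```

In other words, the input has to match the setting: octets for a decoder with utf8 enabled, characters for one with it disabled.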
From: cpan [...] prather.org
> perigrin: BOY!?!??! RELEASE!
I released just a few minutes ago under the premise that I can always release again should we need. :)
From: NUFFIN [...] cpan.org
On Thu Oct 11 14:34:25 2007, MJEVANS wrote:
> ok, I'm more confused now.
> I've downloaded the trunk from subversion and built it. I can see in the diffs between r37 and r38 the sense of utf8 appears to have been changed:
>
> + local $conf->{utf8} = !$conf->{utf8}; # it means the opposite
> +
>
> If I run the script below with utf8=>1 it works as before but I thought you'd reversed the meaning of utf8 so it should work without setting utf8?
No, you need to set utf8 => 1 (maybe it should be renamed to unicode => 1). It only reverses it for JSON::XS. JSON::PC and JSON take utf8 => 1.
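A minimal sketch of the per-backend translation described above, assuming JSON::Any's utf8 => 1 means "give me unicode characters". The helper name and the lookup table are hypothetical and only restate what this thread says; they are not taken from the JSON::Any source.

```perl
use strict;
use warnings;

# Backends whose own utf8 flag means "UTF-8 octets", per this thread.
my %inverted = ('JSON::XS' => 1);

# Hypothetical helper: map JSON::Any's utf8 option to the backend's flag.
sub backend_utf8 {
    my ($backend, $any_utf8) = @_;
    my $flag = $any_utf8 ? 1 : 0;
    return $inverted{$backend} ? ($flag ? 0 : 1) : $flag;
}

printf "%-8s utf8 => %d\n", $_, backend_utf8($_, 1) for qw(JSON::XS JSON::PC JSON);
```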
Closing this since it seems to be resolved. If it is still an issue please re-open against 1.13 or 1.14+