Subject: unicode discrepancies
A couple of points to do with unicode in a few areas - some minor, some
not so minor IMHO.
1. "The one parameter that is universally supported (to the extent that
is supported by the underlying JSON modules) is |utf8|. When this
parameter is enabled all resulting JSON will be marked as unicode, and
all unicode strings in the input data structure will be preserved as such"
I think this statement is misleading, as there is no such thing as
"marking" a string as unicode in Perl, and unicode in Perl is not utf-8.
2. Following on from above "The actual output will vary"
Most JSON modules that can be used by JSON::Any are buggy, especially
with respect to unicode. Sometimes they use the internal utf-8 flag as
an indicator; sometimes they just fail to set it. I would be careful
about claiming that unicode works if you set the mysterious |utf8| flag.
For example, unicode does not work at all with JSON::Syck via JSON::Any
unless $JSON::Syck::ImplicitUnicode = 1 is set.
Here is an example:
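use strict;
use warnings;
use JSON::Any qw(Syck);
use Data::Dumper;
#$JSON::Syck::ImplicitUnicode = 1;
binmode (STDOUT, ":utf8");
# Character string: five Japanese characters followed by ASCII
my $str="\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE";
print Dumper([$str]);
my $js = JSON::Syck::Dump([$str]);
# Write the plain string to uni.out as utf8
open(OUT, ">", "uni.out") or die "Cannot open uni.out: $!";
binmode(OUT, ":utf8");
print OUT "$str\n";
close OUT;
# Read json.out back as utf8 (json.out must already exist; $js is not written to it here)
open(IN, "<", "json.out") or die "Cannot open json.out: $!";
binmode (IN, ":utf8");
my $fd="";
$fd .= $_ while (<IN>);
print "string from file: " . Dumper($fd),"\n";
# Decode the JSON text through JSON::Any
my $obj = JSON::Any->decode($fd);
print Dumper($obj);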
produces:
$VAR1 = [
"\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE"
];
string from file: $VAR1 =
"[\"\x{e5}\x{8f}\x{b0}\x{e6}\x{89}\x{80}\x{e3}\x{81}\x{ae}\x{e6}\x{b5}\x{81}\x{e3}\x{81}\x{97}JAPANESE\"]
";
$VAR1 = [
'å°æÂÂã®æµÂãÂÂJAPANESE'
];
but if you uncomment the ImplicitUnicode line it works correctly:
$VAR1 = [
"\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE"
];
string from file: $VAR1 =
"[\"\x{e5}\x{8f}\x{b0}\x{e6}\x{89}\x{80}\x{e3}\x{81}\x{ae}\x{e6}\x{b5}\x{81}\x{e3}\x{81}\x{97}JAPANESE\"]
";
$VAR1 = [
"\x{e5}\x{8f}\x{b0}\x{e6}\x{89}\x{80}\x{e3}\x{81}\x{ae}\x{e6}\x{b5}\x{81}\x{e3}\x{81}\x{97}JAPANESE"
];
However, this may just be a result of the fact that JSON::Syck does not
have a constructor ("creator" key in your %conf) and hence there is no
way to set the "utf8" flag, as the test "if ( my $creator =
$conf{$key}->{create_object} )" fails. BTW, the pod does not mention
that you cannot do my $f = JSON::Any->new() when using JSON::Syck.
3. I don't really understand this "utf8" flag. What has it got to do
with Unicode? utf8 is an encoding, and therefore just a way of encoding
unicode codepoints, so it should only get involved when importing or
exporting data into or out of Perl. The following code with JSON::XS
works fine without any "utf8" flags because Perl understands unicode and
so does JSON::XS:
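use strict;
use warnings;
use JSON::XS;
use Data::Dumper;
binmode (STDOUT, ":utf8");
# Character string: five Japanese characters followed by ASCII
my $str="\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE";
print Dumper([$str]);
# Write the plain string to uni.out as utf8
open(OUT, ">", "uni.out") or die "Cannot open uni.out: $!";
binmode(OUT, ":utf8");
print OUT "$str\n";
close OUT;
# Encode to JSON without the utf8 option
my $js = JSON::XS->new->encode([$str]);
print "json encoded str is $js\n" . Dumper($js);
# Write the JSON text to json.out as utf8
open(OUT, ">", "json.out") or die "Cannot open json.out: $!";
binmode(OUT, ":utf8");
print OUT "$js\n";
close OUT;
# Read json.out back as utf8 and decode it
open(IN, "<", "json.out") or die "Cannot open json.out: $!";
binmode (IN, ":utf8");
my $fd="";
$fd .= $_ while (<IN>);
print "string from file: " . Dumper($fd),"\n";
my $obj = JSON::XS->new->decode($fd);
print Dumper($obj);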
producing:
$VAR1 = [
"\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE"
];
json encoded str is ["台所の流しJAPANESE"]
$VAR1 = "[\"\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE\"]";
string from file: $VAR1 =
"[\"\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE\"]
";
$VAR1 = [
"\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE"
];
JSON::XS was given a unicode string and gave me back a JSON-encoded
unicode string. This was only encoded as utf8 when it was written to a
file. On reading the file back, Perl is told the file is utf8-encoded
and hence translates the utf8 into unicode characters, which we pass
through JSON::XS to get our object back containing unicode characters.
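To make the same point without depending on JSON::XS being installed, here is a minimal sketch that uses the core JSON::PP module as a stand-in (that substitution is my assumption, as are the temporary file and the variable names). It is only meant to illustrate that the utf8 encoding and decoding can live entirely in the I/O layers:

```perl
use strict;
use warnings;
use JSON::PP;                 # core module, used here as a stand-in for JSON::XS
use File::Temp qw(tempfile);

# Two Japanese characters held as a Perl character string.
my $str = "\x{53f0}\x{6240}";

# No utf8 option: encode() returns a character string, not octets.
my $json = JSON::PP->new->encode([$str]);

# Hypothetical scratch file, removed when the program exits.
my ($fh, $path) = tempfile(UNLINK => 1);
close $fh;

# Encoding to UTF-8 happens only here, in the output layer.
open(my $out, ">:encoding(UTF-8)", $path) or die "Cannot write $path: $!";
print {$out} $json;
close $out;

# Decoding from UTF-8 happens only here, in the input layer.
open(my $in, "<:encoding(UTF-8)", $path) or die "Cannot read $path: $!";
my $text = do { local $/; <$in> };
close $in;

# The JSON text is characters again, so decode() also needs no utf8 option.
my $data = JSON::PP->new->decode($text);
print "round trip ok\n" if $data->[0] eq $str;
```

The JSON text is 6 characters long in memory but occupies 10 bytes on disk, which is the whole difference between unicode characters and their utf8 encoding.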
4. For some reason JSON::Any appears to use the JSON::XS to_json
function, but that converts any true unicode characters in the input
string to a binary string of utf8 octets on output. As a result, when
the |utf8| flag is set, JSON::Any has to call decode to get back a
unicode string, when in fact simply using JSON::XS->new->encode would
have done the right thing from the start.
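A rough sketch of the difference, again using the core JSON::PP module in place of JSON::XS, and assuming that to_json behaves like the encode_json function of current releases (both of those are my assumptions, not something taken from the JSON::Any source):

```perl
use strict;
use warnings;
use JSON::PP;            # core module; stand-in for JSON::XS in this sketch
use Encode qw(decode);

# One Japanese character held as a Perl character string.
my $str = "\x{53f0}";

# Functional interface: returns the JSON text as UTF-8 octets.
my $octets = encode_json([$str]);

# Object interface without the utf8 option: returns a character string.
my $chars = JSON::PP->new->encode([$str]);

printf "encode_json length: %d\n", length $octets;   # 7 (octets)
printf "new->encode length: %d\n", length $chars;    # 5 (characters)

# Decoding the octets is the extra step needed to get characters back.
print "same after decode\n" if decode("UTF-8", $octets) eq $chars;
```

The extra decode at the end is exactly the step that would not be needed if the object interface were used in the first place.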
I appreciate JSON::Any must be a real pain to keep consistent when all
the JSON modules are so different, but I'd at least change the pod to
warn about unicode inconsistencies instead of suggesting it just works
across the board.
You may be asking why I don't just use JSON::XS directly; that is
because I'm actually using POE::Filter::JSON, which uses JSON::Any.
Martin
--
Martin J. Evans
Wetherby, UK