On Thu, Aug 03, 2006 at 09:25:00AM -0400, "miyagawa@gmail.com via RT" <bug-JSON-Syck@rt.cpan.org> wrote:
> > If it is 0, then JSON::Syck does not correctly encode perl strings into
> > json objects. If it is 1, it sometimes returns json objects with "bytes"
> > >255.
>
> This looks confusing. If it is 1, Dump()ed json objects are always
> UTF-8 flagged, which could obviously be > 255 (since Unicode
> characters could be).
No, they are not. As soon as you set the utf-8 _flag_ on the scalar, it no
longer is utf-8, it is now text consisting of unicode characters.
Encode in your example above makes utf-8 out of it.
That's the "wrong mental model" I wrote about in my original mail.
You wrongly assume that setting the _internal_ UTF-8 bit makes a scalar
utf-8. This is logically wrong. Clearing the bit on a scalar that is encoded
in utf-8 internally makes it valid UTF-8.
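To make the distinction concrete, here is a minimal sketch (using the
Encode module that ships with perl):

   use Encode ();

   my $chars = "\x{fc}";                    # one character, U+00FC ("ü")
   my $bytes = Encode::encode_utf8($chars); # two octets, 0xC3 0xBC, flag off
   print length $chars, "\n";               # 1
   print length $bytes, "\n";               # 2 - same text, now utf-8 encoded

$bytes _is_ valid utf-8, and its utf-8 flag is off. Setting the flag would
not "make it utf-8" - it would merely tell perl to reinterpret the same
memory, changing which characters the string contains.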
> > Perl (UTF-8 bytes) => JSON (Unicode flagged)
> >
> > Perl has no such thing as a "unicode flag". Perl has a utf-8 flag, but
> > that doesn't flag a scalar as unicode,
>
> by "Unicode flagged" I mean UTF8 flag in Perl 5. It's true that the
> official term is UTF-8 flag but to me it's totally equivalent to say.
Which is the problem. Perl doesn't work that way.
Let me explain it differently:
Perl can handle binary octet strings and unicode character strings.
The difference is that an octet string contains no character values >
255, while unicode character strings can.
The difference is in the way you treat the scalar - perl itself does
not distinguish between the two.
You can have an octet string encoded as utf-8 internally, or as
latin1/bytes. Regardless of this _internal_ encoding, a byte string
will always be a byte string.
Likewise, you can have a unicode string encoded as utf-8 _internally_,
but also as latin1, _iff_ the string contains only characters < 256.
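For example (a minimal sketch; utf8::upgrade changes only the internal
representation, never the string itself):

   my $x = "\x{fc}";   # "ü", stored as latin1/bytes internally
   my $y = "\x{fc}";
   utf8::upgrade($y);  # same string, now stored as utf-8 internally

   print $x eq $y ? "equal\n" : "different\n"; # equal
   print length $y, "\n";                      # 1 - the internal
                                               # encoding is invisible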
The model you assume is that the utf-8 flag that you can set/clear on scalars
somehow makes a string unicode or not. This is a broken assumption.
See for example the utf8 manpage and the utf8::encode and utf8::decode
functions. You will see that utf8::encode _clears_ the utf-8 bit. Clearing the
utf-8 bit makes a scalar utf-8 (when it actually contains utf-8).
utf8::encode takes a character string and converts it into utf-8.
Likewise, utf8::decode might or might not set the utf-8 _flag_ on the scalar.
It nevertheless converts an utf-8 octet string into a unicode character
string. It will be unicode regardless of whether the resulting string has the
utf-8 bit set or not; the utf-8 bit has nothing to do with unicode-ness.
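A minimal sketch of this behaviour (utf8::is_utf8 merely reports the
internal flag):

   my $s = "\x{20ac}";                      # one character, U+20AC (euro)
   utf8::encode($s);                        # in-place: octets 0xE2 0x82 0xAC
   print length $s, "\n";                   # 3
   print utf8::is_utf8($s) ? "1\n" : "0\n"; # 0 - encode cleared the flag

   utf8::decode($s);                        # in-place: back to one character
   print length $s, "\n";                   # 1 - unicode again, whether or
                                            # not the flag ended up set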
If you deviate from this model into the broken notion that utf-8-bit ==
unicode flag, then you are bound to run into problems.
For example, when I feed JSON::Syck a valid json object/string, encoded in
utf-8 octets, I get a data structure with utf-8 octets in it, not perl strings.
JSON, however, describes unicode strings, which perl can handle.
Worse, the outcome depends on whether the octet string is internally encoded
in UTF-8 or not: JSON::Syck will give different results in this case,
although the input string is identical.
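A sketch of what I mean (assuming ImplicitUnicode is enabled; the two
inputs are identical strings, differing only in internal representation):

   use JSON::Syck ();

   my $json = qq!{"a":"\xc3\xbc"}!; # utf-8 octets for {"a":"ü"}, flag off
   my $copy = $json;
   utf8::upgrade($copy);            # identical string, utf-8 internally

   my $h1 = JSON::Syck::Load($json);
   my $h2 = JSON::Syck::Load($copy);
   # $h1->{a} and $h2->{a} can now differ, although $json eq $copy is true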
The only correct solution is to treat the utf-8 bit in perl as just a way of
representing integer character indices in strings: If it is cleared, the
string only stores character indices <256; if it is set, it might store
indices >255. It has *nothing* whatsoever to do with unicode.
JSON, on the other hand, clearly defines that a json object/string is
*encoded* in unicode, i.e. an octet string (all unicode encodings deliver an
octet string), and that the strings in the structure it represents are
unicode strings.
JSON::Syck breaks this by creating unencoded json objects (which is not
defined by rfc4627) when serialising, or by not correctly decoding strings
stored in a json object to perl strings.
As an explicit example of what goes wrong, look at this:
my $hash = JSON::Syck::Load "some-octet-string-containing-utf-8-encoded-json-object";
The $hash will now contain utf-8 encoded octet strings (all indices <
256), NOT the strings that are actually stored in the json object/string.
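A sketch of the kind of workaround this forces on users (assuming a flat
hash of strings in a hypothetical $octets variable):

   use Encode ();

   my $hash = JSON::Syck::Load $octets;
   # undo the damage: decode the utf-8 octets Load left in the values
   $hash->{$_} = Encode::decode_utf8($hash->{$_}) for keys %$hash;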
Likewise:
print STREAM JSON::Syck::Dump $hash;
Will elicit a warning when STREAM is in binmode, because it might contain
indices > 255, which are not valid. JSON objects/strings, on the other hand,
are always encoded in some unicode encoding, and thus never can have
indices >255.
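Again, the workaround one is forced into looks roughly like this (a
sketch, assuming Dump returned an unencoded character string):

   use Encode ();

   # encode the character string into utf-8 octets before printing
   print STREAM Encode::encode_utf8(JSON::Syck::Dump $hash);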
Obviously, the above examples depend on specific settings of
ImplicitUnicode. However, no setting of ImplicitUnicode works
correctly. Either you get broken JSON objects (indices >255), or your
decoded strings are broken (utf-8 octets, not perl text strings).
This is not helped by the fact that the documentation does not specify at
all what JSON::Syck does, as it talks about a unicode flag that perl does
not have. If the unicode flag is the utf-8 flag, it is simply broken.
I hope this was clearer than my initial mail. If something is unclear
still, do not hesitate to ask for clarification. Encoding issues are not
easy, and I _really_ want the JSON::Syck module to work correctly, as soon
as possible, before too many people have to work around its encoding bugs
(as I have to do).
> You're right, but utf-8 flagged strings are treated as "Unicode
> string"
Not by perl, which is the problem. JSON::Syck indeed treats it incorrectly as
"unicode string", but as this clashes with the perl programming language,
this results in bugs.
> > The correct handling is to always encode the resulting json object
> > correctly (preferably in UTF-8), and always create perl text strings (in
> > either UTF-8 or latin1 encoding), as json strings are defined to be text.
>
> If you really think JSON::Syck is doing something wrong (which we don't
> hope), please file a failing test case.
echo '{"a":"ü"}' | perl -MJSON::Syck -e 'binmode STDIN; $hash = JSON::Syck::Load <>'
(All in UTF-8). This results not in a "ü" character, but in two
characters, \xc3\xbc. RFC4627, however, states (section 3) that the
above json-object is encoded in utf-8 (because there are no 0 bytes in
the initial 4 bytes). JSON::Syck, however, incorrectly interprets it as
latin1, which is not even mentioned as a valid encoding in rfc4627, and is
incapable of transfering characters >255.
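The same thing expressed as a self-contained failing test (a sketch; the
expected values follow directly from rfc4627):

   use strict;
   use warnings;
   use Test::More tests => 2;
   use JSON::Syck ();

   # utf-8 octets for {"a":"ü"}; per rfc4627 section 3 this is utf-8
   my $hash = JSON::Syck::Load qq!{"a":"\xc3\xbc"}!;

   is length($hash->{a}), 1,    'value decoded to one character';
   is ord($hash->{a}),    0xfc, 'and that character is U+00FC ("ü")';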
Similarly, when dumping the perl hash '{ a => "ü" }', we do not get a
correctly encoded json string, but instead a perl string with characters
>255, i.e. unencoded, which again clashes with section 3 of rfc4627.
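And the Dump side as a failing test (a sketch; I use U+20AC here so that
an unencoded result necessarily contains an index >255):

   use strict;
   use warnings;
   use Test::More tests => 1;
   use JSON::Syck ();

   # tr///c counts characters outside \x00-\xff, i.e. indices >255,
   # which must never occur in an *encoded* json object
   my $json = JSON::Syck::Dump({ a => "\x{20ac}" });
   is +($json =~ tr/\x00-\xff//c), 0, 'Dump returned octets only';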
Changing ImplicitUnicode changes the test cases and the outcomes, but doesn't
fix the problem, as no setting of ImplicitUnicode correctly decodes json
objects encoded in utf-8 and correctly encodes json objects in utf-8, nor in
any other unicode encoding.
--
                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __      pcg@goof.com
      --==---/ / _ \/ // /\ \/ /      http://schmorp.de/
      -=====/_/_//_/\_,_/ /_/\_\      XX11-RIPE