Bug #28434 for JSON: UTF-8 handling severely broken

Mon Jul 23 11:17:52 2007 dst [...] heise.de - Ticket created

Subject:

UTF-8 handling severely broken

When $JSON::UTF8 is enabled, string handling fails if a string contains both a non-ASCII character that can be encoded in Latin-1 and a non-ASCII character that cannot be encoded in Latin-1. See the example below. \x{f6} is the German "o" umlaut (ö) and \x{20ac} is the Euro currency sign. After JSON's treatment, the third string below contains a Latin-1-encoded umlaut followed by a UTF-8-encoded Euro sign. As such the output is absolutely unusable.

    dst@host:~$ perl -MJSON -e 'print $JSON::VERSION."\n"'
    1.14
    dst@host:~$ perl -v | grep built
    This is perl, v5.8.4 built for i386-linux-thread-multi
    dst@host:~$ perl -MJSON -MData::Dumper -e '$JSON::UTF8=1; $h = [ "\x{f6}", "\x{20ac}", "\x{f6}\x{20ac}" ]; $i = jsonToObj(objToJson($h)); print Dumper($i)'
    $VAR1 = [
              'ö',
              "\x{20ac}",
              'öâ¬'
            ];

The following patch seems to fix it, but I'm not 100% sure whether there are side effects. Encoding the character string $f into a UTF-8-encoded byte string and appending it to a Latin-1 string is definitely the wrong thing to do here.

    --- /usr/share/perl5/JSON/Parser.pm 2007-05-06 06:51:55.000000000 +0200
    +++ /home/dst/src/JSON-1.14/lib/JSON/Parser.pm 2007-07-23 16:50:20.000000000 +0200
    @@ -122,7 +123,7 @@
     $u .= $ch;
     }
     my $f = chr(hex($u));
    - utf8::encode( $f ) if($USE_UTF8 || $USE_UnicodeString);
     $s .= $f;
     }
     else{
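The mixing described in this report can be reproduced without the JSON module. The following is only a minimal sketch: `unescape` is a hypothetical helper written for illustration, not the code in JSON/Parser.pm. It assumes the JSON text carries ö as a literal character and the Euro sign as a `\u20ac` escape, and it decodes the escape either with or without the `utf8::encode` call that the patch removes.

```perl
use strict;
use warnings;

# Hypothetical helper (not JSON::Parser itself): decode \uXXXX escapes.
# With $encode_each true, every decoded character is UTF-8-encoded before
# it is appended, mimicking the line the patch removes.
sub unescape {
    my ($text, $encode_each) = @_;
    my $s = '';
    while ($text =~ /\G(?:\\u([0-9a-fA-F]{4})|(.))/gs) {
        if (defined $1) {
            my $f = chr(hex($1));
            utf8::encode($f) if $encode_each;   # turns one character into UTF-8 bytes
            $s .= $f;
        }
        else {
            $s .= $2;                           # literal character, appended as-is
        }
    }
    return $s;
}

# Assumed JSON text: a literal U+00F6 followed by the escape \u20ac.
my $json_text = "\x{f6}\\u20ac";

my $broken = unescape($json_text, 1);
my $fixed  = unescape($json_text, 0);

# Show each result as a list of code points.
for my $result ($broken, $fixed) {
    print join(' ', map { sprintf 'U+%04X', ord($_) } split //, $result), "\n";
}
```

With the encode step, the result has four characters: the Latin-1 ö followed by the three UTF-8 bytes of the Euro sign. Without it, the result is the expected two-character string.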

Wed Oct 10 17:09:21 2007 JFARRELL [...] cpan.org - Correspondence added

From:

james.farrell [...] ticketmaster.com

We have encountered this problem as well, and have a suggested patch. The following lines produce invalid UTF-8 for Unicode code points between U+0080 and U+00FF (128 to 255):

    $arg = join('',
        map {
            chr($_) =~ /[\x00-\x07\x0b\x0e-\x1f]/ ? sprintf('\u%04x', $_)
            : $_ <= 255 ? chr($_)
            : $_ <= 65535 ? sprintf('\u%04x', $_)
            : sprintf('\u%04x', $_)
        } unpack('U*', $arg)
    );

As an example, consider the character ö (LATIN SMALL LETTER O WITH DIAERESIS). This character is code point U+00F6, and its UTF-8 encoding is the two bytes C3 B6. Since 0xF6 <= 255, the character is emitted as the single byte F6, which is not the correct UTF-8 encoding.

The suggested patch is to change '$_ <= 255' to '$_ <= 127':

    $arg = join('',
        map {
            chr($_) =~ /[\x00-\x07\x0b\x0e-\x1f]/ ? sprintf('\u%04x', $_)
            : $_ <= 127 ? chr($_)
            : $_ <= 65535 ? sprintf('\u%04x', $_)
            : sprintf('\u%04x', $_)
        } unpack('U*', $arg)
    );
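To compare the two thresholds side by side, here is a small standalone sketch. `escape_string` and its `$max_literal` parameter are hypothetical names introduced for illustration. The helper walks the string with `split` and `ord` rather than `unpack('U*', ...)`, and it collapses the two identical `\u%04x` branches into one, so it is an approximation of the quoted lines, not the module's code.

```perl
use strict;
use warnings;

# Hypothetical helper: escape every code point above $max_literal as \uXXXX.
# $max_literal = 255 mirrors the quoted original, 127 the suggested patch.
sub escape_string {
    my ($arg, $max_literal) = @_;
    return join '', map {
        my $cp = ord($_);
          chr($cp) =~ /[\x00-\x07\x0b\x0e-\x1f]/ ? sprintf('\u%04x', $cp)
        : $cp <= $max_literal                    ? chr($cp)
        :                                          sprintf('\u%04x', $cp);
    } split //, $arg;
}

my $input = "\x{f6}\x{20ac}";               # ö followed by the Euro sign

my $unpatched = escape_string($input, 255); # ö stays a raw character
my $patched   = escape_string($input, 127); # ö becomes an escape

# The patched output is pure ASCII, so it is valid UTF-8 as it stands.
print "patched: $patched\n";
print "unpatched starts with raw U+00F6\n" if ord($unpatched) == 0xF6;
```

With the threshold at 127, everything outside ASCII is escaped, so the serialized text no longer depends on how the output is later encoded.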

Wed Oct 10 17:09:23 2007 The RT System itself - Status changed from 'new' to 'open'

Wed Nov 14 03:05:11 2007 makamaka [...] cpan.org - Correspondence added

Fixed in JSON 1.15. But please try JSON::PP, which can handle UTF-8 properly. Thanks a lot!

Wed Nov 14 03:05:13 2007 makamaka [...] cpan.org - Status changed from 'open' to 'resolved'