
This queue is for tickets about the JSON CPAN distribution.

Report information
The Basics
Id: 86244
Status: open
Priority: 0/
Queue: JSON

People
Owner: Nobody in particular
Requestors: adolf.szabo [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: utf8 flag wrong
Date: Tue, 18 Jun 2013 22:24:57 +0200
To: bug-JSON [...] rt.cpan.org
From: Adolf Szabo <adolf.szabo [...] gmail.com>
Hi,

My problem is that JSON->new()->decode($str) always sets utf8 flag to ON for each string in the hash, no matter what I specify (ascii, latin1, utf8(0) or utf8(1)). This is not only an annoyance, but I think a bug too. Let me give you an example:

Here is a sample json file, with $h->{TITL} containing őa as string. We will focus on the second character, the ascii 'a' for now:

aszabo@mepc:/tmp$ hexdump -C test.txt
00000000  7b 22 54 49 54 4c 22 3a 22 c5 91 61 22 7d 0a  |{"TITL":"..a"}.|
0000000f
aszabo@mepc:/tmp$ cat a.pl
use strict;
use warnings;
use Encode;
use JSON;

local $/=undef;
my $str=<STDIN>;

my $h=JSON->new()->utf8(1)->decode($str);
#my $h=JSON->new()->utf8(0)->decode($str);
my $c=substr($h->{TITL},1,1);
printf("%s [%d], utf8 flag is %s\n",$c,ord($c),Encode::is_utf8($c)?'ON':'OFF');

exit;
aszabo@mepc:/tmp$ cat test.txt | perl a.pl
a [97], utf8 flag is ON

This is as expected so far. Now I enable utf8(0) line, and repeat:

aszabo@mepc:/tmp$ cat test.txt | perl a.pl
� [145], utf8 flag is ON

This is wrong: utf8 flag is set to ON, however $h->{TITL} is not in perl's internal encoding format as second character should return 'a', not second byte of first character. This utf8 flag is a problem later on when I use regexp on the strings of the hash etc.

Please let me know what you think.

Thx, Adolf

This is not a bug.

First, because you set utf8(0), your input data is regarded as bytes.
"\x{c5}\x{91}\x{61}" => dump data is \x{91}
The result is expected.

Second, you shouldn't look at the UTF8 flag. JSON (JSON::XS/PP)'s Unicode handling depends on Perl itself. The second result is Latin-1 characters even if the UTF8 flag is on.

Please see http://perldoc.perl.org/perlunifaq.html#What-is-%22the-UTF8-flag%22%3f
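
The difference between the two settings can be reproduced without the test file. What follows is a minimal sketch, not the reporter's script: it assumes JSON::PP (the pure-Perl backend that the JSON module can load) behaves like JSON->new for this input, and the names $bytes, $strict, $loose and codepoints() are illustrative only.

```perl
use strict;
use warnings;
use JSON::PP;   # assumption: stands in for JSON->new

# Same bytes as test.txt: the UTF-8 encoding of "ő" (c5 91) followed by "a".
my $bytes = qq({"TITL":"\x{c5}\x{91}a"});

# Illustrative helper: list the code points of a string.
sub codepoints {
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $_[0];
}

# utf8(1): the input is treated as UTF-8 encoded bytes and decoded to characters.
my $strict = JSON::PP->new->utf8(1)->decode($bytes)->{TITL};

# utf8(0): the input is treated as characters already, one per input byte.
my $loose = JSON::PP->new->utf8(0)->decode($bytes)->{TITL};

printf "utf8(1): %d chars: %s\n", length $strict, codepoints($strict);
printf "utf8(0): %d chars: %s\n", length $loose,  codepoints($loose);
```

With utf8(1) the second character is 'a' (97); with utf8(0) it is U+0091 (145), which matches the two outputs in the report. The UTF8 flag says nothing about which of the two cases applies.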
Subject: Re: [rt.cpan.org #86244] utf8 flag wrong
Date: Thu, 20 Jun 2013 05:56:18 +0200
To: bug-JSON [...] rt.cpan.org
From: Adolf Szabo <adolf.szabo [...] gmail.com>
Usually I do not mess with perl's internals. Unless I face a problem. Here is my specific problem:

The character in question is the Polish ą (\xC4 \x85). When this is the last character of a string and I execute

$str=~s/\s+\z//;

nothing is removed (as expected). But after using JSON lib the second byte of the char is removed resulting in a broken utf8 char:

my $h=JSON->new()->decode($s);
$h->{TITL}=~s/\s+\z//;

aszabo@mepc:/tmp$ hexdump -C test.txt
00000000  7b 22 54 49 54 4c 22 3a 22 c4 85 22 7d 0a  |{"TITL":".."}.|

Please explain what did I do wrong then.

Thx

I got your point (U+0085 matches \s).

I said that utf8(0) makes the decoder expect bytes, but I was mistaken. As the documentation says, utf8(0) expects Unicode characters.

http://search.cpan.org/~makamaka/JSON-2.59/lib/JSON.pm#utf8

So the resolution is to set utf8(1). Does that answer your question?
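
The U+0085 effect can be shown in isolation. This is a minimal sketch under the same assumption as before (JSON::PP standing in for JSON->new); it also assumes no `use feature 'unicode_strings'` and no /u or /a modifier is in effect, so Perl's default regex rules apply. The variable names are illustrative.

```perl
use strict;
use warnings;
use JSON::PP;   # assumption: stands in for JSON->new

# UTF-8 bytes of the Polish "ą": c4 85.
my $plain = "\x{c4}\x{85}";
my $json  = qq({"TITL":"$plain"});

# 1. Plain byte string: \s only matches ASCII whitespace, nothing is removed.
(my $plain_trimmed = $plain) =~ s/\s+\z//;

# 2. utf8(0): the bytes become the characters U+00C4 U+0085 in a string that
#    gets Unicode regex rules, and U+0085 (NEXT LINE) counts as whitespace.
my $loose = JSON::PP->new->utf8(0)->decode($json)->{TITL};
(my $loose_trimmed = $loose) =~ s/\s+\z//;

# 3. utf8(1): the bytes become the single character U+0105, nothing to trim.
my $strict = JSON::PP->new->utf8(1)->decode($json)->{TITL};
(my $strict_trimmed = $strict) =~ s/\s+\z//;

printf "plain bytes: %d -> %d\n", length $plain,  length $plain_trimmed;
printf "utf8(0):     %d -> %d\n", length $loose,  length $loose_trimmed;
printf "utf8(1):     %d -> %d\n", length $strict, length $strict_trimmed;
```

Only the utf8(0) case loses a character, which is the "second byte removed" symptom described in the ticket.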
Subject: Re: [rt.cpan.org #86244] utf8 flag wrong
Date: Thu, 20 Jun 2013 09:58:27 +0200
To: bug-JSON [...] rt.cpan.org
From: Adolf Szabo <adolf.szabo [...] gmail.com>
Yes, U+0085 does indeed look to be a space char. From perl 5.14 I can use the /a modifier to make it work:

$h->{TITL}=~s/\s+\z//a;

However right now I'm stuck with 5.8.8 and a bunch of legacy code, which was designed before utf8 became widespread.

And I also tried using utf8(1) as you suggest, but then for each string in the hash I need to call

$h->{TITL}=Encode::encode_utf8($h->{TITL})

to let the rest of the code work, or I get 'Wide character in ...' warnings everywhere.

So my question is why the JSON lib does not provide a way to get strings back in the plain old way, something like

$h=JSON->new()->latin1(1)->decode($str);

that would return strings in $h as one-byte==one-char.

Thx, Adolf
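
The per-string encode_utf8 call can be applied to a whole decoded structure in one pass. The following is a sketch of that workaround only, not a feature of the JSON module: to_bytes() is a hypothetical helper, JSON::PP is assumed to stand in for JSON->new, and the sample input adds a trailing space so the trim has something to remove.

```perl
use strict;
use warnings;
use JSON::PP;    # assumption: stands in for JSON->new
use Encode ();

# Hypothetical helper: copy a decoded structure, turning every string
# (hash keys included) into UTF-8 encoded bytes.
# Note: plain numbers are stringified by this copy.
sub to_bytes {
    my ($data) = @_;
    my $type = ref $data;

    if ($type eq 'ARRAY') {
        return [ map { to_bytes($_) } @$data ];
    }
    if ($type eq 'HASH') {
        my %copy;
        $copy{ Encode::encode_utf8($_) } = to_bytes($data->{$_}) for keys %$data;
        return \%copy;
    }
    # JSON booleans (objects) and null (undef) pass through unchanged.
    return $data if $type || !defined $data;

    return Encode::encode_utf8($data);
}

# "ą" as UTF-8 bytes followed by a trailing space.
my $json = qq({"TITL":"\x{c4}\x{85} "});

my $h = to_bytes(JSON::PP->new->utf8(1)->decode($json));
(my $title = $h->{TITL}) =~ s/\s+\z//;

printf "bytes after trim: %s\n", unpack 'H*', $title;
```

Because the strings are plain bytes again, the legacy trim removes only the real trailing space and leaves both bytes of the character intact.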