Bug #94957 for Catalyst-Runtime: unicode double encoding caused by Plugin:Unicode::Encoding

Wed Apr 23 06:38:17 2014 TEEJAY [...] cpan.org - Ticket created

Subject:

unicode double encoding caused by Plugin:Unicode::Encoding

Plugin::Unicode::Encoding doesn't check for already encoded data before encoding it again : from line 70: # Encode expects plain scalars (IV, NV or PV) and segfaults on ref's $c->response->body( $c->encoding->encode( $body, $CHECK ) ) if ref(\$body) eq 'SCALAR'; Naively only checks it's a scalar, not that it isn't already encoded, which will be the case if the body is JSON with unicode Fix is : # Encode expects plain scalars (IV, NV or PV) and segfaults on ref's if (ref(\$body) eq 'SCALAR' && utf8::is_utf8($body)){ $c->response->body( $c->encoding->encode( $body, $CHECK ) ); } In our case we have code in a controller which does : $c->res->body(to_json($json, { utf8 => 1, pretty => defined $c->stash->{pretty_json}?$c->stash->{pretty_json}:1, allow_blessed => 1, convert_blessed => 1, })); This will always double encoded utf8 output without the patch shown

Wed Apr 23 06:56:12 2014 TEEJAY [...] cpan.org - Correspondence added

Show quoted text

> Fix is : > # Encode expects plain scalars (IV, NV or PV) and segfaults on ref's > if (ref(\$body) eq > 'SCALAR' && utf8::is_utf8($body)){ > $c->response->body( $c-

> >encoding->encode( $body, $CHECK ) );

> }

That fix is courtesy of gbjk (Gareth Kirwan) who spotted and patched it at our work, I'm just upstreaming it while he's busy

Thu Jul 17 05:21:17 2014 victor [...] vsespb.ru - Correspondence added

That fix is wrong. One should not use utf8::is_utf8 to detect if data is encoded or no, it would be a bug ( http://perldoc.perl.org/perlunifaq.html#How-can-I-determine-if-a-string-is-a-text-string-or-a-binary-string? ). Actually it's impossible to detect if data is encoded or no. On Wed Apr 23 14:38:17 2014, TEEJAY wrote: Show quoted text

> Plugin::Unicode::Encoding doesn't check for already encoded data > before encoding it again : > > from line 70: > # Encode expects plain scalars (IV, NV or PV) and segfaults on > ref's > $c->response->body( $c->encoding->encode( $body, $CHECK ) ) > if ref(\$body) eq 'SCALAR'; > > Naively only checks it's a scalar, not that it isn't already encoded, > which will be the case if the body is JSON with unicode > > Fix is : > # Encode expects plain scalars (IV, NV or PV) and segfaults on ref's > if (ref(\$body) eq > 'SCALAR' && utf8::is_utf8($body)){ > $c->response->body( $c-

> >encoding->encode( $body, $CHECK ) );

> } > > In our case we have code in a controller which does : > > $c->res->body(to_json($json, { > utf8 => 1, > pretty => defined $c->stash->{pretty_json}?$c->stash-

> >{pretty_json}:1,

> allow_blessed => 1, > convert_blessed => 1, > })); > > This will always double encoded utf8 output without the patch shown

Thu Jul 17 05:21:19 2014 The RT System itself - Status changed from 'new' to 'open'

Thu Jul 17 05:36:36 2014 victor [...] vsespb.ru - Correspondence added

I see fix applied already in Catalyst-Plugin-Unicode I propose reverting it, as it introduce a bug. Latin1 data in utf8::downgraded form will be detected as encoded data. On Thu Jul 17 13:21:17 2014, vsespb wrote: Show quoted text

> That fix is wrong. One should not use utf8::is_utf8 to detect if data > is encoded or no, it would be a bug ( > http://perldoc.perl.org/perlunifaq.html#How-can-I-determine-if-a- > string-is-a-text-string-or-a-binary-string? ). Actually it's > impossible to detect if data is encoded or no. > > On Wed Apr 23 14:38:17 2014, TEEJAY wrote:

> > Plugin::Unicode::Encoding doesn't check for already encoded data > > before encoding it again : > > > > from line 70: > > # Encode expects plain scalars (IV, NV or PV) and segfaults on > > ref's > > $c->response->body( $c->encoding->encode( $body, $CHECK ) ) > > if ref(\$body) eq 'SCALAR'; > > > > Naively only checks it's a scalar, not that it isn't already encoded, > > which will be the case if the body is JSON with unicode > > > > Fix is : > > # Encode expects plain scalars (IV, NV or PV) and segfaults on > > ref's > > if (ref(\$body) eq > > 'SCALAR' && utf8::is_utf8($body)){ > > $c->response->body( $c-

> > > encoding->encode( $body, $CHECK ) );

> > } > > > > In our case we have code in a controller which does : > > > > $c->res->body(to_json($json, { > > utf8 => 1, > > pretty => defined $c->stash->{pretty_json}?$c->stash-

> > > {pretty_json}:1,

> > allow_blessed => 1, > > convert_blessed => 1, > > })); > > > > This will always double encoded utf8 output without the patch shown

Thu Jul 17 11:40:54 2014 ether [...] cpan.org - Correspondence added

On 2014-07-17 02:36:36, vsespb wrote: Show quoted text

> I see fix applied already in Catalyst-Plugin-Unicode > I propose reverting it, as it introduce a bug. > Latin1 data in utf8::downgraded form will be detected as encoded data.

It looks like there are several places, all over the Catalyst-Runtime codebase, that perform is_utf8 checks :( As Victor says, this does NOT do what the author intended, and must be fixed.

Fri Jan 02 15:16:19 2015 jjnapiork [...] cpan.org - Correspondence added

On Thu Jul 17 11:40:54 2014, ETHER wrote: Show quoted text

> On 2014-07-17 02:36:36, vsespb wrote:

> > I see fix applied already in Catalyst-Plugin-Unicode > > I propose reverting it, as it introduce a bug. > > Latin1 data in utf8::downgraded form will be detected as encoded > > data.

> > It looks like there are several places, all over the Catalyst-Runtime > codebase, that perform is_utf8 checks :( As Victor says, this does > NOT do what the author intended, and must be fixed.

The current 5.90079__8 dev releases removes all that stuff and I think tries to clarify and improve how encoding is done. Also we are going to make utf8 encoding default going forward so I hope that shakes out any issues. I;m going to close this, but if you find issue with the latest set of dev releases please shout out.

Fri Jan 02 15:17:56 2015 jjnapiork [...] cpan.org - Status changed from 'open' to 'resolved'