
This queue is for tickets about the Catalyst-Runtime CPAN distribution.

Report information
The Basics
Id: 103063
Status: resolved
Priority: 0
Queue: Catalyst-Runtime

People
Owner: Nobody in particular
Requestors: victor [...] vsespb.ru
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 5.90083
Fixed in: (no value)



Subject: Query parsing broken for non-UTF-8 sites since 5.90083
Hello. Here is the diff: https://metacpan.org/diff/file?target=JJNAPIORK/Catalyst-Runtime-5.90083/&source=JJNAPIORK/Catalyst-Runtime-5.90082/

There is this line:

===
map { defined $_ ? decode_utf8($self->unescape_uri($_)) : $_ }
===

It decodes the data to a Unicode string and assumes it is UTF-8; it ignores the "encoding" option (even encoding => undef).

Before 5.90083 the logic was the same:

===
map { decode_utf8($self->unescape_uri($_)) }
===

However, there was

===
- if(my $query_obj = $env->{'plack.request.query'}) {
-     $c->request->query_parameters(
-       $c->request->_use_hash_multivalue ?
-         $query_obj->clone :
-         $query_obj->as_hashref_mixed);
-     return;
- }
-
===

before the decoding. So our site never reached this decoding, since $env->{'plack.request.query'} was true.

Use case: the site runs with encoding => undef (previously with no encoding option at all). The web page encoding is WINDOWS-1251, so all incoming data, including the query string, is WINDOWS-1251 as well.

Example URL: http://example.com/test?domains=%E4%EE%EC%E5%ED

%E4%EE%EC%E5%ED is a Russian word in WINDOWS-1251 encoding.

So in 5.90082 the octets are passed as-is to the application. After 5.90083 they are decoded to a Unicode string consisting of U+FFFD replacement characters.
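To make the failure concrete (a Python sketch, since the byte-level behaviour is language-independent): the percent-decoded octets from the example URL are a valid WINDOWS-1251 string but are not valid UTF-8, so an unconditional UTF-8 decode can only mangle them.

```python
from urllib.parse import unquote_to_bytes

# The raw octets behind ?domains=%E4%EE%EC%E5%ED
octets = unquote_to_bytes("%E4%EE%EC%E5%ED")

# Interpreted as WINDOWS-1251, these bytes are a real Russian word.
print(octets.decode("cp1251"))  # домен ("domain")

# Interpreted as UTF-8 they are simply invalid: a strict decode fails,
# and a lenient one (what Encode::decode_utf8 effectively does by
# default) substitutes U+FFFD replacement characters for the data.
try:
    octets.decode("utf-8")
except UnicodeDecodeError as e:
    print("not valid UTF-8:", e.reason)
print(octets.decode("utf-8", errors="replace"))
```

This is why the application sees "not-a-characters" after 5.90083: the original text is unrecoverable once the wrong charset has been assumed.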
Any chance you can share with me an entire request/response cycle, including HTTP headers? I need to see what I can do to properly detect this. Thanks!

On Tue Mar 24 06:56:19 2015, vsespb wrote:
Hey, one thing I wonder: what exactly do you think this should do? I was working on the assumption that people would want Catalyst to convert the encoded characters to local Unicode wide characters, but maybe that is not an ideal assumption? Talk to me about the use case and what you'd ideally like to see here.

On Tue Mar 24 06:56:19 2015, vsespb wrote:
Request/Response:
========
GET /misc/mytest?domains=%E4%EE%EC%E5%ED HTTP/1.1
Host: www1.reg.ru
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Cookie: [CUT]
Connection: keep-alive
Cache-Control: max-age=0

HTTP/1.1 200 OK
Server: nginx
Date: Tue, 24 Mar 2015 16:19:45 GMT
Content-Type: text/html; charset=WINDOWS-1251
Transfer-Encoding: chunked
Connection: keep-alive
Content-Language: ru
Set-Cookie: [CUT]
X-Catalyst: 5.90083
x-ua-compatible: IE=edge,chrome=IE8
Content-Encoding: gzip
========

Action code:
========
sub mytest : Local Args(0) {
    my ($self, $c, $r, $p) = getcontvars_noses @_;
    $c->res->content_type( "text/plain" );
    $c->res->body( $p->{domains} );
}
========

Console:
========
[error] Caught exception in engine "Wide character in syswrite at /usr/local/share/perl/5.14.2/Starman/Server.pm line 547."
========
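The console error follows directly from the lenient UTF-8 decode: every byte of the WINDOWS-1251 value is invalid as UTF-8, so the parameter becomes a string of U+FFFD replacement characters, and a code point above 255 has no single-byte representation when written to a byte-oriented socket. A Python sketch of the same failure mode:

```python
# After a lenient UTF-8 decode, the parameter value from the request
# above is a string of U+FFFD replacement characters.
value = b"\xe4\xee\xec\xe5\xed".decode("utf-8", errors="replace")
assert all(ord(ch) == 0xFFFD for ch in value)

# A code point above 255 cannot be serialized into the WINDOWS-1251
# response body: the Python analogue of "Wide character in syswrite".
try:
    value.encode("cp1251")
except UnicodeEncodeError as e:
    print("cannot represent U+FFFD in cp1251:", e.reason)
```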
> one thing I wonder exactly what you think this should do?
Well, with encoding => undef it should do nothing with charsets, i.e. return the octets as-is; so probably $self->unescape_uri($_) instead of decode_utf8($self->unescape_uri($_)).
> I was taking the assumption that people would want Catalyst to convert the encoded characters to local Unicode wide characters, but maybe that is not an ideal assumption?
Yes, right. With encoding NOT undef, Catalyst should convert binary data to Perl strings (Unicode wide characters). But when encoding IS undef, it should pass binary data through as-is. That is exactly what it does with output data, so with encoding undef, input processing should be consistent with output processing.

We currently work with textual data in WINDOWS-1251. That's the pre-Unicode approach. We're migrating to Unicode, but we're just not there yet.

On Tue Mar 24 16:59:39 2015, JJNAPIORK wrote:
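The requested behaviour can be sketched as a small helper (Python for illustration; decode_query_value and its parameters are invented names, not Catalyst API): decode with the configured encoding when one is set, and pass the octets through untouched when encoding is undef.

```python
from urllib.parse import unquote_to_bytes

def decode_query_value(raw, enc):
    """Hypothetical sketch: unescape a query value, then decode it with
    the app's configured encoding, or return the raw octets unchanged
    when no encoding is configured (the encoding => undef case)."""
    octets = unquote_to_bytes(raw)
    return octets.decode(enc) if enc is not None else octets

# encoding => 'UTF-8': the app receives a character string
assert decode_query_value("%D0%B4%D0%BE%D0%BC", "utf-8") == "дом"
# encoding => undef: the app receives the octets untouched
assert decode_query_value("%E4%EE%EC%E5%ED", None) == b"\xe4\xee\xec\xe5\xed"
```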
Sorry, one more question. How are the links generated? Do you have web pages that are Windows-charset encoded? Did you use $c->uri_for, or did you have links hard-coded or manually created? Lastly, are you doing the Windows encoding for the web pages (I assume that is what you are doing) via setting $c->encoding or similar, or did you have to hack Catalyst to make it work the way you needed?

On Tue Mar 24 12:28:37 2015, vsespb wrote:
Yes, all web pages are in the windows-1251 charset. All our Template Toolkit templates are in windows-1251 on disk. All Perl source code is in windows-1251 (thus string constants are windows-1251).

Links with query strings are usually created using the URI module. From the URI module docs:

===
The escaping (percent encoding) of chars in the 128 .. 255 range passed to the URI constructor or when setting URI parts using the accessor methods depend on the state of the internal UTF8 flag (see utf8::is_utf8) of the string passed. If the UTF8 flag is set the UTF-8 encoded version of the character is percent encoded. If the UTF8 flag isn't set the Latin-1 version (byte) of the character is percent encoded. This basically exposes the internal encoding of Perl strings.
===

I.e. it escapes strings without the Perl UTF-8 flag using byte percent-encoding. This works for us.

No, we don't use $c->encoding. We just set encoding to undef.

On Tue Mar 24 19:38:22 2015, JJNAPIORK wrote:
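The quoted URI behaviour, where a string without the UTF8 flag is percent-encoded byte by byte, is exactly what produces the query string from the report. A Python sketch of the equivalent byte escaping:

```python
from urllib.parse import quote

# "домен" held as WINDOWS-1251 octets, the way a non-Unicode app stores
# its strings; percent-encoding the bytes reproduces the query string
# from the bug report exactly.
octets = "домен".encode("cp1251")
assert quote(octets) == "%E4%EE%EC%E5%ED"
```

So the links the site generates carry WINDOWS-1251 octets, and a framework that unconditionally assumes UTF-8 on the way back in cannot round-trip them.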
OK, so I think we can add a configuration setting like 'url-decode-from-encoding' which, instead of defaulting to UTF-8, would use whatever you set encoding to. That way you can disable it until you have time to change your code to match. How does that sound?

Anywhere else you'd need this?

john

On Tue Mar 24 12:47:01 2015, vsespb wrote:
IF there will be a setting url-decode-from-encoding=undef which _disables_ any decoding, this would work.

I.e. url-decode-from-encoding=windows-1251, which decodes windows-1251 to Perl character strings (Unicode), would not work for us. We need the option url-decode-from-encoding=undef to skip any decoding.

However, I am not sure why another option is needed. There is already the "encoding" option; for our case it's encoding => undef. Is there a case where one would use encoding => undef but url-decode-from-encoding not undef (or url-decode-from-encoding=utf8, like now)? I don't think so! With encoding => undef one outputs binary data, which makes sense when one works not in the modern Perl Unicode model but with binary strings in a single-byte encoding (like windows-1251; any single-byte encoding except latin-1, since latin-1 is compatible with Perl Unicode). And if we work with such data, we won't need the input as Perl character strings (Unicode)!

On Tue Mar 24 21:04:05 2015, JJNAPIORK wrote:
Just to be clear: does it sound like adding a new $app configuration like 'decode_URI_query_from_encoding_value' (or similar), which would decode the URL query based on whatever you set $app->encoding to (or not at all if that is undef), would fix this for you? Would this be a good solution until someday (if ever) you are able to change your code so as to be UTF-8 top to bottom? jnap On Tue Mar 24 14:04:05 2015, JJNAPIORK wrote:
> Ok so I think we can add a configuration setting like 'url-decode-
> from-encoding' which, instead of defaulting to UTF-8, would use
> whatever you set the encoding to. That way you can disable it until
> you have time to change your code to match. How does that sound?
>
> Anywhere else you'd need this?
>
> john
> On Tue Mar 24 12:47:01 2015, vsespb wrote:
> > Yes, all webpages are in windows-1251 charset.
> > All our template toolkit templates are in windows-1251 charset on
> > disk.
> > All perl source code is in windows-1251 (thus string constants are
> > windows-1251).
> >
> > Links with query strings are usually created using the URI module.
> >
> > Docs from the URI module:
> > ===
> > The escaping (percent encoding) of chars in the 128 .. 255 range
> > passed to the URI constructor or when setting URI parts using the
> > accessor methods depend on the state of the internal UTF8 flag (see
> > utf8::is_utf8) of the string passed. If the UTF8 flag is set the
> > UTF-8 encoded version of the character is percent encoded. If the
> > UTF8 flag isn't set the Latin-1 version (byte) of the character is
> > percent encoded. This basically exposes the internal encoding of
> > Perl strings.
> > ===
> >
> > i.e. it will escape strings without the perl UTF-8 flag as byte
> > percent encoding. This works for us.
> >
> > No, we don't use $c->encoding. We just set encoding to undef.
> >
> > On Tue Mar 24 19:38:22 2015, JJNAPIORK wrote:
> > > Sorry, one more question. How are the links generated? Do you have
> > > web pages that are windows charset encoded? Did you use $c->uri_for,
> > > or did you have links 'hard coded', or manually created? Last, are
> > > you doing the windows encoding for web pages (I assume that is what
> > > you are doing) via setting $c->encoding or similar, or did you have
> > > to hack Catalyst to make it work the way you needed?
> > >
> > > On Tue Mar 24 12:28:37 2015, vsespb wrote:
> > > > Request/Response: > > > > ======== > > > > GET /misc/mytest?domains=%E4%EE%EC%E5%ED HTTP/1.1 > > > > Host: www1.reg.ru > > > > User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:36.0) > > > > Gecko/20100101 Firefox/36.0 > > > > Accept: > > > > text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 > > > > Accept-Language: en-US,en;q=0.5 > > > > Accept-Encoding: gzip, deflate > > > > Cookie: [CUT] > > > > Connection: keep-alive > > > > Cache-Control: max-age=0 > > > > > > > > HTTP/1.1 200 OK > > > > Server: nginx > > > > Date: Tue, 24 Mar 2015 16:19:45 GMT > > > > Content-Type: text/html; charset=WINDOWS-1251 > > > > Transfer-Encoding: chunked > > > > Connection: keep-alive > > > > Content-Language: ru > > > > Set-Cookie: [CUT] > > > > X-Catalyst: 5.90083 > > > > x-ua-compatible: IE=edge,chrome=IE8 > > > > Content-Encoding: gzip > > > > ======== > > > > > > > > Action code: > > > > ======== > > > > sub mytest : Local Args(0) { > > > > my ($self, $c, $r, $p) = getcontvars_noses @_; > > > > $c->res->content_type( "text/plain" ); > > > > $c->res->body( $p->{domains} ); > > > > } > > > > ======== > > > > > > > > Console: > > > > ======== > > > > [error] Caught exception in engine "Wide character in syswrite at > > > > /usr/local/share/perl/5.14.2/Starman/Server.pm line 547." > > > > ======== > > > > > > > >
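The failure in the console output above can be reproduced outside Catalyst. Here is a minimal sketch, in Python purely as a neutral illustration of the byte-level behaviour (Catalyst itself is Perl): percent-decoding the query value yields WINDOWS-1251 octets, an unconditional UTF-8 decode turns every byte into a replacement character, and decoding with the page's real charset recovers the word.

```python
from urllib.parse import unquote_to_bytes

# The raw query value from the request above, percent-decoded to octets.
raw = unquote_to_bytes("%E4%EE%EC%E5%ED")        # WINDOWS-1251 bytes

# What an unconditional UTF-8 decode (as in 5.90083) produces: the bytes
# are not valid UTF-8, so each one becomes U+FFFD ("not-a-characters").
as_utf8 = raw.decode("utf-8", errors="replace")

# Decoding with the page's actual charset recovers the intended word.
as_cp1251 = raw.decode("cp1251")

print(as_utf8)     # replacement characters only
print(as_cp1251)   # the Russian word from the example URL
```

The resulting wide-character string is then what trips the "Wide character in syswrite" error when the unencoded response is written out.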
> > > > > one thing I wonder exactly what you think this should do?
> > > > Well, with encoding => undef it should do nothing with charsets,
> > > > i.e. return octets as-is; probably $self->unescape_uri($_) instead
> > > > of decode_utf8($self->unescape_uri($_)).
> > > > > > > >
> > > > > I was taking the assumption that people would want Catalyst to
> > > > > convert the encoded characters to local unicode wide characters
> > > > > but maybe that is not an ideal assumption?
> > > > Yes, right. With encoding NOT undef, Catalyst should convert
> > > > binary data to perl strings (unicode wide characters). But when
> > > > encoding IS undef, it should pass binary data through as-is.
> > > > That's exactly what it does with output data, so with encoding
> > > > undef, input processing should be consistent with output
> > > > processing.
> > > >
> > > > We work with textual data in WINDOWS-1251 currently. That's the
> > > > pre-unicode approach. We're migrating to unicode, but we're just
> > > > not there yet.
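The URI-module behaviour quoted earlier (byte percent-encoding for strings without the UTF8 flag) is exactly what produces the query string in the example URL. A quick cross-check, again sketched in Python for illustration only (the site itself builds its links with Perl's URI module):

```python
from urllib.parse import quote

# A WINDOWS-1251 byte string, like the string constants in the site's
# cp1251-encoded Perl source (no UTF8 flag, so raw bytes get escaped).
word_bytes = "домен".encode("cp1251")

query = "domains=" + quote(word_bytes)
print(query)   # domains=%E4%EE%EC%E5%ED -- matches the example URL
```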
> does it sound like adding a new $app configuration like 'decode_URI_query_from_encoding_value' (or similar) which would decode the URL query based on whatever you set $app->encoding to (or nothing if that is undef)
yes! (and when encoding == undef, no decoding happens)
> Would this be a good solution until someday (if ever) you are able to change your code as to be UTF-8 top to bottom?
yes!
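The behaviour agreed on above (decode the unescaped query octets with the application's configured encoding, and pass them through untouched when encoding is undef) can be sketched as follows. This is an illustration in Python, not Catalyst's actual Perl implementation, and decode_query_value is a hypothetical name:

```python
def decode_query_value(octets, encoding):
    # With a configured encoding, convert the percent-unescaped octets
    # to a character string; with encoding undef/None, return the
    # octets unchanged, consistent with how output is handled.
    if encoding is None:
        return octets
    return octets.decode(encoding)

raw = b"\xe4\xee\xec\xe5\xed"                  # WINDOWS-1251 octets
print(decode_query_value(raw, None))           # bytes, passed through
print(decode_query_value(raw, "cp1251"))       # wide characters
```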
There is a new Catalyst on CPAN: https://metacpan.org/release/JJNAPIORK/Catalyst-Runtime-5.90085

Please check that out, and see if any of the three backcompat options described in the upgrading pod and elsewhere do what you need. If so, please mark this ticket resolved and let me know that as well.

-jnap
Thank you! do_not_decode_query resolved the issue.