Bug #32952 for SOAP-Lite: UTF8 Strings Not Marked as UTF8 If Base64 encoded

Tue Feb 05 15:20:36 2008 gwittel [...] proofpoint.com - Ticket created

Subject:	UTF8 Strings Not Marked as UTF8 If Base64 encoded
Date:	Tue, 05 Feb 2008 12:20:02 -0800
To:	bug-SOAP-Lite [...] rt.cpan.org
From:	Greg Wittel <gwittel [...] proofpoint.com>

Tried on SOAP::Lite 0.70_4.

If a UTF8 string is subjected to base64 encoding (see RT Bug# 30271; http://rt.cpan.org/Public/Bug/Display.html?id=30271), the deserialized data does not have its is_utf8 bit set. This means the client gets octets back rather than a string as expected.

Based on Bug# 30271 there are 2 ways to fix this:

1) Fix data type detection so that UTF8 data is not detected as binary and sent to base64 encoding:

In SOAP::Serializer change:
    _typelookup => {
        'base64binary' => [10, sub { $_[0] =~ ...}, ... ]

To (adding the appropriate 'use' statements):
    _typelookup => {
        'base64binary' => [10, sub { ( ! Encode::is_utf8($_[0]) ) && $_[0] =~ .... }, ... ]

This assumes that the transport charset is UTF8. Not sure what happens if it's not.

2) Create a data type 'utf8base64' and properly encode/decode it. The expected behavior should be equivalent to:
    Serialize: encode_base64( Encode::encode(...) )
    De-Serialized: Encode::decode(decode_base64() ... )
This method would be less sensitive to transport charset, but I'm guessing that this would cause interop problems.

-Greg
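The symptom and the encode/decode pairing from option 2 can be reproduced outside SOAP::Lite. The following is only a minimal sketch using the core Encode and MIME::Base64 modules; the sample string and the variable names are assumptions made for illustration, and no SOAP::Lite internals are involved.

```perl
use strict;
use warnings;
use Encode qw(encode decode is_utf8);
use MIME::Base64 qw(encode_base64 decode_base64);

# Illustrative character string (UTF8 flag on): three Japanese characters.
my $text = "\x{65E5}\x{672C}\x{8A9E}";

# Serialize: characters -> UTF-8 octets -> base64 (no line breaks).
my $wire = encode_base64(encode('UTF-8', $text), '');

# Plain base64 decoding returns octets, so the UTF8 flag is not set.
my $octets = decode_base64($wire);

# Option 2 from the report: decode the octets back into characters.
my $roundtrip = decode('UTF-8', $octets);

printf "octets:    is_utf8=%d length=%d\n",
    (is_utf8($octets) ? 1 : 0), length $octets;
printf "roundtrip: is_utf8=%d length=%d\n",
    (is_utf8($roundtrip) ? 1 : 0), length $roundtrip;
```

The receiver can only apply the final decode step if it knows the payload was text in the first place, which is the information a plain base64 type does not carry.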

Tue Feb 05 15:41:01 2008 kutterma [...] users.sourceforge.net - Correspondence added

From:	kutterma [...] users.sourceforge.net

I'd suggest a resolution similar to RT Bug# 30271: perl 5.8 and above should not detect utf-8 as binary, and there's no use fixing it for perls below (to which unicode strings are just octets).

Tue Feb 05 15:41:45 2008 The RT System itself - Status changed from 'new' to 'open'

Tue Feb 05 16:31:52 2008 gwittel [...] proofpoint.com - Correspondence added

Subject:	Re: [rt.cpan.org #32952] UTF8 Strings Not Marked as UTF8 If Base64 encoded
Date:	Tue, 05 Feb 2008 13:31:09 -0800
To:	bug-SOAP-Lite [...] rt.cpan.org
From:	Greg Wittel <gwittel [...] proofpoint.com>

Thanks for the quick response.

To do something similar to Bug# 30271, how would we handle marking data as UTF8 on deserialization? It's just a bunch of base64 encoded octets, so there's no way to know whether it should be marked as such or not. If you mean implementing a new data type in a way similar to #30271, that should work.

The deserialization problem is why I suggested fixing the base64binary type lookup, since it incorrectly detects some UTF8 strings (such as Japanese characters) as binary.

-Greg
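The effect of guarding the type lookup with Encode::is_utf8 can be sketched as below. The pattern in looks_binary is a hypothetical stand-in, since the actual check is elided in the report, and both subroutine names are invented for this illustration; this is not SOAP::Lite's code.

```perl
use strict;
use warnings;
use Encode qw(is_utf8);

# Hypothetical stand-in for the elided check: anything other than
# tab, LF, CR or printable ASCII counts as binary.
sub looks_binary {
    return $_[0] =~ /[^\x09\x0a\x0d\x20-\x7e]/ ? 1 : 0;
}

# Guarded variant in the spirit of the proposed fix: strings that carry
# the UTF8 flag are never classified as binary.
sub looks_binary_guarded {
    return (!is_utf8($_[0]) && looks_binary($_[0])) ? 1 : 0;
}

my $japanese = "\x{65E5}\x{672C}\x{8A9E}";   # character string, UTF8 flag on
my $blob     = "\x00\xff\x10";               # raw octets, UTF8 flag off

for my $case (['japanese', $japanese], ['blob', $blob]) {
    my ($name, $value) = @$case;
    printf "%-8s unguarded=%d guarded=%d\n",
        $name, looks_binary($value), looks_binary_guarded($value);
}
```

Under these assumptions the Japanese string is classified as binary only by the unguarded check, while the raw octets are classified as binary by both. Text that arrives as octets without the UTF8 flag would still be sent to base64 encoding.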

Wed Feb 06 02:43:17 2008 kutterma [...] users.sourceforge.net - Correspondence added

From:	kutterma [...] users.sourceforge.net

Sorry for misleading you: A similar fix would mean that SOAP::Lite should not encode unicode strings as base64binary in perl 5.8 and above. The SOAP 1.2 standard demands the use of utf-8 or utf-16 (at least for HTTP), so there should be no problems (SOAP 1.1 does not demand a specific encoding).

Introducing a "utf8base64" type only helps perls before 5.8, and that's pretty useless, as these don't have unicode handling and there's no way to reliably detect whether a sequence of octets is a utf8 string or not.

The problem is that this may affect existing SOAP clients and servers, since many of them rely on SOAP::Lite's autotyping, so I'd like to discuss it on the SOAP::Lite mailing list first.
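Why octets alone cannot settle the question can be seen with a two-byte example. This is a small sketch using the core Encode module; the sample bytes are an assumption chosen for illustration. The same octets decode without error under two different charsets and yield different text.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two octets with no accompanying charset information.
my $octets = "\xC3\xA9";

# Read as UTF-8 they form a single character, U+00E9.
my $as_utf8 = decode('UTF-8', $octets);

# Read as ISO-8859-1 they form two characters, U+00C3 and U+00A9.
my $as_latin1 = decode('ISO-8859-1', $octets);

printf "as UTF-8:      %d character(s)\n", length $as_utf8;
printf "as ISO-8859-1: %d character(s)\n", length $as_latin1;
```

Both readings are valid, so a receiver has to be told which one was intended rather than guess from the bytes.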

Wed Feb 06 15:52:06 2008 gwittel [...] proofpoint.com - Correspondence added

Subject:	Re: [rt.cpan.org #32952] UTF8 Strings Not Marked as UTF8 If Base64 encoded
Date:	Wed, 06 Feb 2008 12:50:59 -0800
To:	bug-SOAP-Lite [...] rt.cpan.org
From:	Greg Wittel <gwittel [...] proofpoint.com>

Thanks for the clarification. That makes sense. I look forward to seeing what comes of it as I can finally have all client/server code be fully UTF8 transparent.

-Greg

Sat Feb 16 05:25:14 2008 kutterma [...] users.sourceforge.net - Correspondence added

Hi,

after discussion on the mailing list, I'm going to resolve this as follows:

- UTF-8 strings will not be base64 encoded in the future.

To avoid breaking things, this behaviour will not be included in the next stable release (which should be out in a few days), but will be included in the next devel release after the next stable.

Thanks for reporting,

Martin

Tue Feb 19 11:24:50 2008 gwittel [...] proofpoint.com - Correspondence added

Subject:	Re: [rt.cpan.org #32952] UTF8 Strings Not Marked as UTF8 If Base64 encoded
Date:	Tue, 19 Feb 2008 08:24:16 -0800
To:	bug-SOAP-Lite [...] rt.cpan.org
From:	Greg Wittel <gwittel [...] proofpoint.com>

Hi Martin,

Thanks for the update. I look forward to seeing the patch as I have to backport it to 0.60 for our internal uses.

Regards,
-Greg

Mon Aug 15 17:21:03 2011 kutterma [...] users.sourceforge.net - Status changed from 'open' to 'resolved'