
This queue is for tickets about the SOAP-Lite CPAN distribution.

Report information
The Basics
Id: 32952
Status: resolved
Priority: 0/
Queue: SOAP-Lite

People
Owner: Nobody in particular
Requestors: gwittel [...] proofpoint.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: UTF8 Strings Not Marked as UTF8 If Base64 encoded
Date: Tue, 05 Feb 2008 12:20:02 -0800
To: bug-SOAP-Lite [...] rt.cpan.org
From: Greg Wittel <gwittel [...] proofpoint.com>
Tried on SOAP::Lite 0.70_4.

If a UTF8 string is subjected to base64 encoding (see RT Bug# 30271; http://rt.cpan.org/Public/Bug/Display.html?id=30271), the deserialized data does not have its is_utf8 bit set. This means the client gets octets back rather than a string as expected.

Based on Bug# 30271 there are 2 ways to fix this:

1) Fix data type detection so that UTF8 data is not detected as binary and sent to base64 encoding. In SOAP::Serializer change:

    _typelookup => {
        'base64binary' => [10, sub { $_[0] =~ ...}, ... ]

To (adding the appropriate 'use' statements):

    _typelookup => {
        'base64binary' => [10, sub { ( ! Encode::is_utf8($_[0]) ) && $_[0] =~ .... }, ... ]

This assumes that the transport charset is UTF8. Not sure what happens if it's not.

2) Create a data type 'utf8base64' and properly encode/decode it. The expected behavior should be equivalent to:

    Serialize:     encode_base64( Encode::encode(...) )
    De-Serialized: Encode::decode(decode_base64() ... )

This method would be less sensitive to transport charset, but I'm guessing that it would cause interop problems.

-Greg
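For readers following along, the two proposals above can be sketched roughly as follows. This is only an illustration under stated assumptions: looks_binary() is a hypothetical stand-in for the elided base64binary test (it is not SOAP::Lite's actual regex), and the utf8base64_* helper names and the choice of 'UTF-8' as the encoding are invented for the example.

```perl
use strict;
use warnings;
use Encode ();
use MIME::Base64 qw(encode_base64 decode_base64);

# Hypothetical stand-in for the elided base64binary check:
# anything outside tab, LF, CR and printable ASCII counts as "binary".
sub looks_binary {
    my ($value) = @_;
    return $value =~ /[^\x09\x0a\x0d\x20-\x7f]/ ? 1 : 0;
}

# Option 1: never treat a string that perl has flagged as
# character data (is_utf8) as binary.
sub looks_binary_guarded {
    my ($value) = @_;
    return 0 if Encode::is_utf8($value);
    return looks_binary($value);
}

# Option 2: a hypothetical 'utf8base64' type. Characters are encoded
# to UTF-8 octets before base64, and decoded back after it.
sub utf8base64_serialize {
    my ($chars) = @_;
    return encode_base64(Encode::encode('UTF-8', $chars), '');
}

sub utf8base64_deserialize {
    my ($wire) = @_;
    return Encode::decode('UTF-8', decode_base64($wire));
}

my $text   = "\x{65e5}\x{672c}\x{8a9e}";   # three Japanese characters
my $binary = "\x00\xff";                   # genuine octets, no utf8 flag

printf "unguarded: text=%d binary=%d\n",
    looks_binary($text), looks_binary($binary);
printf "guarded:   text=%d binary=%d\n",
    looks_binary_guarded($text), looks_binary_guarded($binary);

my $round_trip = utf8base64_deserialize(utf8base64_serialize($text));
printf "round trip preserved characters: %s\n",
    $round_trip eq $text ? 'yes' : 'no';
```

With the guard in place, the character string is no longer classified as binary, while the genuine octet string still is.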
From: kutterma [...] users.sourceforge.net
I'd suggest a resolution similar to RT Bug# 30271: perl 5.8 and above should not detect utf-8 as binary, and there's no use fixing it for perls below (to which unicode strings are just octets).
Subject: Re: [rt.cpan.org #32952] UTF8 Strings Not Marked as UTF8 If Base64 encoded
Date: Tue, 05 Feb 2008 13:31:09 -0800
To: bug-SOAP-Lite [...] rt.cpan.org
From: Greg Wittel <gwittel [...] proofpoint.com>
Thanks for the quick response.

To do something similar to Bug# 30271, how would we handle marking data as UTF8 on deserialization? It's just a bunch of base64 encoded octets, so there's no way to know whether it should be marked as such or not. If you mean implementing a new data type in a way similar to #30271, that should work.

The deserialization problem is why I suggested fixing the base64binary type lookup, since it incorrectly detects some UTF8 strings (such as Japanese characters) as binary.

-Greg
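The deserialization problem described here can be reproduced outside SOAP::Lite. The following is a minimal sketch using only Encode and MIME::Base64; the variable names are illustrative, and it assumes the sender encoded the string as UTF-8 before base64.

```perl
use strict;
use warnings;
use Encode ();
use MIME::Base64 qw(encode_base64 decode_base64);

my $text = "\x{65e5}\x{672c}\x{8a9e}";   # three Japanese characters

# What ends up on the wire once the string is treated as binary.
my $wire = encode_base64(Encode::encode('UTF-8', $text), '');

# What the receiver gets from plain base64 decoding: octets only.
my $octets = decode_base64($wire);

# Only an explicit decode step turns the octets back into characters,
# and the receiver has no way to know that this step is required.
my $chars = Encode::decode('UTF-8', $octets);

printf "octets: length=%d is_utf8=%d\n",
    length($octets), Encode::is_utf8($octets) ? 1 : 0;
printf "chars:  length=%d is_utf8=%d\n",
    length($chars), Encode::is_utf8($chars) ? 1 : 0;
```

Nothing in the base64 payload itself records that the octets were UTF-8 text, which is why the fix has to happen at type detection on the sending side or through an explicit type.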
From: kutterma [...] users.sourceforge.net
Sorry for misleading you: a similar fix would mean that SOAP::Lite should not encode unicode strings as base64binary in perl 5.8 and above. The SOAP 1.2 standard demands the use of utf-8 or utf-16 (at least for HTTP), so there should be no problems (SOAP 1.1 does not demand a specific encoding).

Introducing a "utf8base64" type only helps perls before 5.8, and that's pretty useless, as these don't have unicode handling and there's no way to reliably detect whether a sequence of octets is a utf8 string or not.

The problem is that this may affect existing SOAP clients and servers, since many of them rely on SOAP::Lite's autotyping, so I'd like to discuss it on the SOAP::Lite mailing list first.
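As a small illustration of why octets alone cannot be classified reliably, the same two bytes below are valid both as UTF-8 and as ISO-8859-1, and they decode to different text. This is only a sketch; the byte values and the two encodings are chosen for the example.

```perl
use strict;
use warnings;
use Encode ();

my $octets = "\xc3\xa9";   # two bytes with no encoding information attached

# Interpreted as UTF-8: a single character, U+00E9.
my $as_utf8 = Encode::decode('UTF-8', $octets);

# Interpreted as ISO-8859-1: two characters, U+00C3 and U+00A9.
my $as_latin1 = Encode::decode('ISO-8859-1', $octets);

printf "as UTF-8:      %d character(s)\n", length($as_utf8);
printf "as ISO-8859-1: %d character(s)\n", length($as_latin1);
```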
Subject: Re: [rt.cpan.org #32952] UTF8 Strings Not Marked as UTF8 If Base64 encoded
Date: Wed, 06 Feb 2008 12:50:59 -0800
To: bug-SOAP-Lite [...] rt.cpan.org
From: Greg Wittel <gwittel [...] proofpoint.com>
Thanks for the clarification. That makes sense. I look forward to seeing what comes of it as I can finally have all client/server code be fully UTF8 transparent.

-Greg
Hi,

after discussion on the mailing list, I'm going to resolve this as follows:

- UTF-8 strings will not be base64 encoded in the future.

To avoid breaking things, this behaviour will not be included in the next stable release (which should be out in a few days), but will be included in the next devel release after the next stable.

Thanks for reporting,

Martin
Subject: Re: [rt.cpan.org #32952] UTF8 Strings Not Marked as UTF8 If Base64 encoded
Date: Tue, 19 Feb 2008 08:24:16 -0800
To: bug-SOAP-Lite [...] rt.cpan.org
From: Greg Wittel <gwittel [...] proofpoint.com>
Hi Martin,

Thanks for the update. I look forward to seeing the patch as I have to backport it to 0.60 for our internal uses.

Regards,
-Greg