Bug #78588 for SOAP-Lite: unicode data not correct encoded

Thu Jul 26 11:17:34 2012 steve.bitcard [...] yewtc.demon.co.uk - Ticket created

Subject:

unicode data not correct encoded

The attached file is based on the example in: http://stackoverflow.com/questions/9365402/how-to-convince-soaplite-to-return-utf-8-data-in-responses-as-utf-8

I had the same problem as the poster. I added an en-dash to the example data, as that was the character that caused me grief.

Basically, strings with is_utf8 set are being picked up by the base64 match (they contain a character with value > 0x7F), whereas if they are unicode strings (perversely, is_utf8($val) = 1) they should be treated as strings. The problem is that the de-serializer doesn't know whether the string should be unicode or not, and leaves it alone.

Uncommenting the $ser->typelookup... line effectively fixes the problem, though it may have unforeseen consequences.
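The base64 match described above can be tried without SOAP::Lite. The following is a minimal sketch in core Perl; it assumes the check is the character class quoted later in this ticket, and the helper name looks_binary is made up for illustration.

```perl
use strict;
use warnings;

# Character class taken from the base64Binary check quoted later in this
# ticket: anything outside tab, LF, CR and the range 0x20-0x7f.
my $non_ascii = qr/[^\x09\x0a\x0d\x20-\x7f]/;

# Hypothetical helper: 1 if the value would be caught by that check.
sub looks_binary { return $_[0] =~ $non_ascii ? 1 : 0 }

my $plain   = "mu-";              # ASCII only
my $unicode = "m\x{fc}\x{2013}";  # u-umlaut plus en dash, a character string

print "plain:   ", looks_binary($plain), "\n";
print "unicode: ", looks_binary($unicode),
      " (is_utf8=", (utf8::is_utf8($unicode) ? 1 : 0), ")\n";
```

The character string is caught by the check even though is_utf8 is true for it, which is the behaviour reported here.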

Subject:

SoapLitEncode.pm

use strictures;
use Test::More;
use SOAP::Lite;
use utf8;
use Data::Dumper;

my $data = "mü\x{2013}";

my $ser = SOAP::Serializer->new;
$ser->typelookup->{trick_into_ignoring} = [9, \&utf8::is_utf8, 'as_utf8_string'];

my $xml = $ser->envelope( freeform => $data );
my ( $cycled ) = values %{ SOAP::Deserializer->deserialize( $xml )->body };

is( length( $data ), length( $cycled ), "UTF-8 string is the same after serializing" );

done_testing;

sub check_utf8 {
    my ($val) = @_;
    return utf8::is_utf8($val);
}

package SOAP::Serializer;

sub as_utf8_string {
    my $self = shift;
    my($value, $name, $type, $attr) = @_;
    return $self->as_string($value, $name, $type, $attr);
}

1;

Thu Jul 26 11:21:50 2012 steve.bitcard [...] yewtc.demon.co.uk - Correspondence added

From:

steve.bitcard [...] yewtc.demon.co.uk

Sorry, forgot to say: v0.714/perl 5.14/Fedora 17 and v0.714/perl 5.8.8/RHEL .57

Thu Jul 26 12:49:51 2012 martin.kutter [...] fen-net.de - Correspondence added

Subject:	Re: [rt.cpan.org #78588] unicode data not correct encoded
Date:	Thu, 26 Jul 2012 18:49:29 +0200
To:	bug-SOAP-Lite [...] rt.cpan.org
From:	Martin Kutter <martin.kutter [...] fen-net.de>

Hi,

unfortunately, the stackoverflow answers are both wrong and unproductive - though the description of XML::Compile is quite accurate.

Even the question is not-so smart: The referenced test makes sure the length is equal after an encoding-decoding cycle - wouldn't you care about the content, too?

The simplest approach to get unicode working as desired is to disable SOAP::Lite's autotyping, thus use the following test (which works fine):

    use Test::More;
    use SOAP::Lite;
    use utf8;

    my $data = "mü\x{2013}";
    my $serializer = SOAP::Serializer->new(autotype => 0);
    my $xml = $serializer->envelope( freeform => $data );
    my ( $cycled ) = values %{ SOAP::Deserializer->deserialize( $xml )->body };
    is( length( $data ), length( $cycled ), "UTF-8 string is the same after serializing" );

The second best choice is to change the base64Binary detection in SOAP::Lite as documented in the SOAP::Serializer pod (and leveraged - unfortunately not for the better - in the attached module):

    my $data = "mü\x{2013}";
    my $serializer = SOAP::Serializer->new();
    $serializer->typelookup()->{ base64Binary } = [ 10, sub { 0 }, undef];
    my $xml = $serializer->envelope( freeform => $data );
    my ( $cycled ) = values %{ SOAP::Deserializer->deserialize( $xml )->body };
    is( $data, $cycled, "UTF-8 string is the same after serializing" );

So the issue effectively boils down to the following documentation issues:

- document that autotype should be disabled if you don't need it
- change "base64" in the listing of types in the SOAP::Serializer pod to "base64Binary" (I had to look into the source for the example above).

Best regards,

Martin

Thu Jul 26 12:49:52 2012 The RT System itself - Status changed from 'new' to 'open'

Thu Jul 26 15:38:58 2012 steve.bitcard [...] yewtc.demon.co.uk - Correspondence added

From:

steve.bitcard [...] yewtc.demon.co.uk

On Thu Jul 26 12:49:51 2012, martin.kutter@fen-net.de wrote:

> The simplest approach to get unicode working as desired is to disable
> SOAP::Lite's autotyping, thus use the following test (which works fine):

... I did something similar (and responded to the stackoverflow post). I added:

    my $ser = SOAP::Serializer->new;
    $ser->typelookup->{trick_into_ignoring} = [9, \&utf8::is_utf8, 'as_utf8_string'];

and

    package SOAP::Serializer;

    sub as_utf8_string {
        my $self = shift;
        my($value, $name, $type, $attr) = @_;
        return $self->as_string($value, $name, $type, $attr);
    }

The 9 means that it gets processed first; it basically says that utf8 data is left alone, and allows non-utf8 data to be base64 encoded as now. My concern with this is that I could quite easily have a string '1234' with is_utf8 set, and this would no longer be treated as an integer.

It seems to me that the base64 check should be at the opposite end of the priority list, so that it only does the base64 encoding if nothing else matches, the value contains non-ascii characters, and is_utf8 is false.

> - document that autotype should be disabled if you don't need it

I need to re-read the section on autotype. I think I need it - or at least would like it.

Steve
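The concern about '1234' raised in this message can be sketched with a toy lookup table. This is a simplified model, not SOAP::Lite's code: it assumes entries are tried in ascending priority order and the first passing test wins. The table contents, the priorities other than 9 and 10, the type labels, and the pick_type function are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical miniature of a typelookup table: name => [priority, test, label].
my %typelookup = (
    trick_into_ignoring => [ 9,   sub { utf8::is_utf8($_[0]) },                'string' ],
    base64Binary        => [ 10,  sub { $_[0] =~ /[^\x09\x0a\x0d\x20-\x7f]/ }, 'base64Binary' ],
    int                 => [ 20,  sub { $_[0] =~ /^[+-]?\d+$/ },               'int' ],
    string              => [ 100, sub { 1 },                                   'string' ],
);

# Walk the entries by ascending priority; return the label of the first
# entry whose test accepts the value.
sub pick_type {
    my ($value) = @_;
    for my $name (sort { $typelookup{$a}[0] <=> $typelookup{$b}[0] } keys %typelookup) {
        return $typelookup{$name}[2] if $typelookup{$name}[1]->($value);
    }
    return;
}

my $number = '1234';
print pick_type($number), "\n";   # int

utf8::upgrade($number);           # same characters, but is_utf8 is now true
print pick_type($number), "\n";   # string
```

In this model the priority-9 entry shadows every later test for any flagged value, which is why moving the base64 check to the other end of the list is suggested instead.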

Wed Mar 04 13:05:04 2015 F.Dreyer [...] telekom.de - Correspondence added

Subject:	[rt.cpan.org #78588]
Date:	Wed, 4 Mar 2015 19:04:49 +0100
To:	<bug-SOAP-Lite [...] rt.cpan.org>
From:	<F.Dreyer [...] telekom.de>

IMHO this is the best solution:

User code:

    $soap_lite->serializer->typelookup->{base64Binary} = [
        10,
        sub { !utf8::is_utf8($_[0]) && $_[0] =~ /[^\x09\x0a\x0d\x20-\x7f]/ },
        'as_base64Binary',
    ];

Change SOAP::Lite module -> see attached patch

This will still correctly encode scalars with non-ascii chars but without the utf-8 flag (= true binary data or latin1 strings) using as_base64Binary - but perl unicode strings with the utf-8 flag will pass through to the other typelookups and will eventually be encoded using either as_anyURI (95) or as_string (100).
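The behaviour of the proposed test can be checked in isolation. A minimal sketch, assuming only core Perl and the Encode module; the closure is copied from the user code above, and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The test proposed above, copied out so it can be tried without SOAP::Lite.
my $is_base64 = sub {
    !utf8::is_utf8($_[0]) && $_[0] =~ /[^\x09\x0a\x0d\x20-\x7f]/;
};

my $chars = "m\x{fc}\x{2013}";        # character string, utf8 flag set
my $bytes = encode('UTF-8', $chars);  # the same text as UTF-8 octets, flag off
my $ascii = "plain text";             # ASCII only

for my $case ( [ chars => $chars ], [ bytes => $bytes ], [ ascii => $ascii ] ) {
    my ($label, $value) = @$case;
    printf "%-5s base64? %s\n", $label, $is_base64->($value) ? 'yes' : 'no';
}
```

Only the unflagged octet string is routed to base64 here. Note that the outcome depends on the flag alone, not on what the data means.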

Message body is not shown because sender requested not to inline it.

Mon Dec 18 23:23:34 2017 ether [...] cpan.org - Correspondence added

On 2015-03-04 10:05:04, F.Dreyer@telekom.de wrote:

> IMHO this is the best solution:
>
> User code:
> $soap_lite->serializer->typelookup->{base64Binary} = [10, sub
> {!utf8::is_utf8($_[0]) && $_[0] =~ /[^\x09\x0a\x0d\x20-\x7f]/ },
> 'as_base64Binary'];
>
> Change SOAP::Lite module -> see attached patch
>
>
> This will still correctly encode scalars with non-ascii chars but
> without utf-8 flag (=true binary data or latin1 strings) using
> as_base64Binary - but perl unicode strings with utf-8 flag will pass
> through to the other typelookups and will eventually be encoded using
> either as_anyURI (95) as_string (100).

It looks like you implemented this in https://metacpan.org/diff/file?target=PHRED/SOAP-Lite-1.23/&source=PHRED%2FSOAP-Lite-1.22#lib/SOAP/Lite.pm.

Unfortunately, that is not a correct solution. is_utf8() does not do what you think it does.

As a quick example, consider the string "Â". What does is_utf8() return? Should it be considered encoded for the purposes of SOAP::Lite? It's certainly not ascii, so it should be encoded. But it encodes to "\x{c3}\x{82}". Is *that* utf8? It's not ascii either! How do we know this is "Â" in utf8 encoding, vs. two characters, "Ã" and an unprintable one?

The sad truth is that *there is no way* of distinguishing characters from encoded text by merely inspecting its bytes. The setting of the *internal only* utf8 flag does not help.

This ticket is long, but it covers the concept in depth very well: https://rt.cpan.org/Ticket/Display.html?id=104433
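The point about the internal flag can be tried in plain Perl. A minimal sketch using only the built-in utf8:: functions; the variable names are illustrative.

```perl
use strict;
use warnings;

my $char = "\x{c2}";   # one character, U+00C2, stored without the utf8 flag
my $copy = $char;
utf8::upgrade($copy);  # changes only the internal representation

# The two strings compare equal, yet is_utf8 gives different answers.
print "equal: ", ($char eq $copy ? 1 : 0), "\n";
print "flags: ", (utf8::is_utf8($char) ? 1 : 0), " vs ",
                 (utf8::is_utf8($copy) ? 1 : 0), "\n";

# Two octets: either the UTF-8 encoding of U+00C2, or the two characters
# U+00C3 and U+0082. The bytes alone do not say which was meant.
my $octets  = "\x{c3}\x{82}";
my $decoded = $octets;
utf8::decode($decoded);  # one possible reading: treat the octets as UTF-8

print "decoded eq U+00C2: ", ($decoded eq $char ? 1 : 0), "\n";
print "lengths: ", length($octets), " vs ", length($decoded), "\n";
```

Both readings of the octets are valid Perl strings, so a check based on is_utf8 ends up classifying equal strings differently depending on how they happen to be stored.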