Bug #86649 for SOAP-Lite: UTF-8 not handled properly

Tue Jul 02 14:22:53 2013 brianjamespugh [...] gmail.com - Ticket created

Subject:

UTF-8 not handled properly

When sending an HTTP message that was constructed with UTF-8 characters, the server sends truncated messages. I found this by testing a server with soapUI. The message payload is a complex XML structure that uses UTF-8 characters and is stored as a string (with the utf8 flag set) to be passed to SOAP::Lite for transmission. The resulting message is truncated by a few bytes.

I traced through the code and found that the Content-Length header is calculated properly, but the message content is re-encoded in Latin-1, which causes the UTF-8 characters to multiply in size. The client therefore stops reading prematurely and truncates the message. The issue is that SOAP::Lite uses the wrong method to retrieve the data from HTTP::Message: it should use decoded_content() instead of plain content(). I created a patch to fix this.

This patch also impacts a test case: SOAP/Transport/HTTP/CGI.t. The test payload in this test case uses a UTF-8 string cut and pasted from somewhere, but it neither uses the 'utf8' pragma nor calls Encode::_utf8_on to tell Perl that the string is UTF-8 or that the test script has a UTF-8 character in its source. I updated the test to replace the UTF-8 character with the code '\x{DC}', which causes the string to be built properly. The test server also needed to flag its STDIN and STDOUT as UTF-8.

The attached patch is for .715 but has been tested with .716 as well.
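To illustrate the test-case change described in this report, here is a minimal sketch of how Perl sees a pasted UTF-8 "Ü" in a source file that has no 'utf8' pragma, compared with the '\x{DC}' escape. The variable names are made up for illustration, and the first string spells out the two UTF-8 bytes explicitly so the script itself stays pure ASCII.

```perl
use strict;
use warnings;
use Encode qw(decode);

# What Perl sees for a pasted UTF-8 "Ü" when the script lacks `use utf8`:
# two separate bytes (0xC3 0x9C) followed by "berall".
my $raw_literal = "\xC3\x9Cberall";

# The escaped form used in the patched test: one character, U+00DC.
my $char_string = "\x{DC}berall";

printf "raw literal length: %d\n", length $raw_literal;
printf "char string length: %d\n", length $char_string;

# Decoding the raw bytes as UTF-8 yields the same character string.
my $decoded = decode('UTF-8', $raw_literal);
printf "decoded matches:    %s\n", $decoded eq $char_string ? 'yes' : 'no';
```

The two strings differ by one element in length, which is why comparing a server result against the unescaped literal depends on how the test file itself is interpreted.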

Subject:

SOAP-Lite-0.715-utf8_correction.patch

diff -uNr SOAP-Lite-0.715.orig/lib/SOAP/Transport/HTTP.pm SOAP-Lite-0.715/lib/SOAP/Transport/HTTP.pm
--- SOAP-Lite-0.715.orig/lib/SOAP/Transport/HTTP.pm  2012-07-15 05:18:44.000000000 -0400
+++ SOAP-Lite-0.715/lib/SOAP/Transport/HTTP.pm  2013-07-02 13:07:38.930105900 -0400
@@ -615,7 +615,7 @@
     print STDOUT "$status $code ", HTTP::Status::status_message($code),
       "\015\012", $self->response->headers_as_string("\015\012"),
       "\015\012",
-      $self->response->content;
+      $self->response->decoded_content;
 }
 
 # ======================================================================
diff -uNr SOAP-Lite-0.715.orig/t/SOAP/Transport/HTTP/CGI/test_server.pl SOAP-Lite-0.715/t/SOAP/Transport/HTTP/CGI/test_server.pl
--- SOAP-Lite-0.715.orig/t/SOAP/Transport/HTTP/CGI/test_server.pl  2010-06-03 11:33:24.000000000 -0400
+++ SOAP-Lite-0.715/t/SOAP/Transport/HTTP/CGI/test_server.pl  2013-07-02 13:15:54.915699500 -0400
@@ -7,11 +7,14 @@
     dispatch_to => 'main'
 );
 
+binmode STDIN, ":utf8";
+binmode STDOUT, ":utf8";
+
 $soap->handle();
 
 sub test {
     my ($self, $envelope) = @_;
-    return SOAP::Data->name('testResult')->value('Überall')->type('string');
+    return SOAP::Data->name('testResult')->value("\x{dc}berall")->type('string');
 }
diff -uNr SOAP-Lite-0.715.orig/t/SOAP/Transport/HTTP/CGI.t SOAP-Lite-0.715/t/SOAP/Transport/HTTP/CGI.t
--- SOAP-Lite-0.715.orig/t/SOAP/Transport/HTTP/CGI.t  2010-06-03 11:33:24.000000000 -0400
+++ SOAP-Lite-0.715/t/SOAP/Transport/HTTP/CGI.t  2013-07-02 13:15:45.540762100 -0400
@@ -56,7 +56,7 @@
 if ($] >= 5.008) {
     ok utf8::is_utf8($result), 'return utf8 string';
     {
-        is $result, 'Überall', 'utf8 content: ' . $result;
+        is $result, "\x{dc}berall", 'utf8 content: ' . $result;
     }
 }
 else {

Thu Nov 27 13:21:38 2014 ether [...] cpan.org - Severity Important added

Tue Feb 10 10:51:59 2015 F.Dreyer [...] telekom.de - Correspondence added

Subject:	[rt.cpan.org #86649]
Date:	Tue, 10 Feb 2015 16:51:42 +0100
To:	<bug-SOAP-Lite [...] rt.cpan.org>
From:	<F.Dreyer [...] telekom.de>

I also encountered a problem where a message generated by a SOAP CGI server gets truncated. Since the original report did not provide example code and is reported against a different version of SOAP::Lite (I'm using 1.13), I can't tell if this is caused by the exact same bug. But I assume it is, since the symptom is the same.

I reproduced the bug with the following code (files saved in latin1 charset - see explanation below):

server:
--------
#!/usr/bin/perl
use SOAP::Lite;
use SOAP::Transport::HTTP;
use SOAP::Constants;

SOAP::Transport::HTTP::CGI->dispatch_to('WebserviceTest')->handle();

package WebserviceTest;
use Encode qw(decode);

sub webservice_test {
    die SOAP::Fault->faultcode($SOAP::Constants::FAULT_SERVER)->faultstring('tüdelü');
};
-------

client:
-------
#!/usr/bin/perl
use SOAP::Lite;

my $soap = SOAP::Lite
    ->uri('http://just-a-test/WebserviceTest')
    ->proxy('https://example.org/soap_rt86649.cgi'); #replace with url to server cgi script

$soap->transport->add_handler("request_send", sub {
    print STDERR "SOAP Request:\n" . shift->dump(maxlength=>2000);
    return;
});
$soap->transport->add_handler("response_done", sub {
    print STDERR "SOAP Response:\n" . shift->dump(maxlength=>2000);
    return;
});

$soap->call('webservice_test');
-------

The debug output shows that the reply from the server has Content-Length: 459 and the following content:

-------
<?xml version="1.0" encoding="UTF-8"?><soap:Envelope soap:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/" xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><soap:Body><soap:Fault><faultcode>soap:Server</faultcode><faultstring>t\xC3\xBCdel\xC3\xBC</faultstring></soap:Fault></soap:Body></soap:Envelop
-------

As you can see, the latin1 scalar was automatically upgraded to a perl unicode string which was then correctly converted to utf-8 byte output (\xC3\xBC). The total length of the correct output is 461 bytes, but only 459 bytes are transmitted, so the "e>" at the end is truncated.

I looked through the code and found the cause of the bug in SOAP::Transport::HTTP::Server->make_response, beginning at line 491.

Now the first thing that is wrong here (although this has no impact on my testcase because it doesn't use compression) is that the Compress::Zlib::compress call is done while $response still contains a unicode string.

Next, the value of the Content-Length header is calculated using SOAP::Utils::bytelength. This function is inherently broken because it just calculates the byte length of the internal data structure that represents the unicode string. The CPAN documentation of the 'bytes' pragma even mentions that you should never use it except for debugging: "[...] use of this module for anything other than debugging purposes is strongly discouraged. If you feel that the functions here within might be useful for your application, this possibly indicates a mismatch between your mental model of Perl Unicode and the current reality. In that case, you may wish to read some of the perl Unicode documentation: perluniintro, perlunitut, perlunifaq and perlunicode". Now because in my testcase the input is a latin1 string which has not been implicitly upgraded to unicode yet at this point, SOAP::Utils::bytelength actually returns the same value as the character length, which is 459 for the entire response.

Finally, the call to Encode::encode does the actual encoding from unicode string (implicitly upgraded from the latin1 string $response) to utf-8 byte output. The size of this byte output is 461 bytes (two ü chars encoded as \xC3\xBC in utf-8).

The attached patch fixes the bug by rearranging the code so that

1) the value of the Content-Length header is calculated by first calling Encode::encode to convert $response to utf-8 bytes and then calling length() on the actual byte data that is passed to HTTP::Response->new.
2) the call to Compress::Zlib::compress is done after Encode::encode so that it compresses the utf-8 bytes (instead of converting the binary zlib compressed data to utf-8)

PS1: the change to HTTP.pm suggested in the attached patchfile of the original report (replace response->content with response->decoded_content) is definitely wrong, because just above that print statement is a binmode(STDOUT); call, which means the output is expected to be binary (bytes). The documentation of HTTP::Message says that content contains the raw binary data while decoded_content converts that data to perl unicode strings.

PS2: I can only reproduce this bug with the "die SOAP::Fault" statement inside the webservice_test function. If I instead return a SOAP::Data object of type string (with latin1 or unicode value), for some reason this works without truncating the message. I assume that in this case the latin1 string is already upgraded to a unicode string at an earlier point so that SOAP::Utils::bytelength happens to return the correct length. But note that this only "accidentally" works - if you for example change the charset on the server side to utf-16:

-------
my $soap_transport = SOAP::Transport::HTTP::CGI->dispatch_to('WebserviceTest');
$soap_transport->serializer->encoding('UTF-16');
$soap_transport->handle();
-------

then the current code _always_ breaks and truncates the second half of the message, but with my patch it always works correctly.

PS3: Both with and without my patch, compression didn't even work if I add a compress_threshold=>50 to my above testcase. First, it currently cannot even theoretically work because SOAP::Transport::HTTP::Client sends $COMPRESS = 'gzip' (HTTP.pm line 147) as Content-Encoding header, but SOAP::Transport::HTTP::Server rejects the request with HTTP 415 Unsupported Media Type if the header isn't $COMPRESS = 'deflate' (HTTP.pm line 342). But even if I set both variables to the same value, the debug output shows that the compression doesn't work and there are perl warnings on both client and server side. So it seems getting compression to work is an even more complex issue and requires further investigation, which would be beyond the scope of this ticket.
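The Content-Length mismatch described in this message can be shown in isolation with a short sketch. This is not the SOAP::Lite code or the attached patch: internal_bytelength and body_and_length are hypothetical helpers written for illustration. The first imitates a length measured on Perl's internal representation, the second follows the encode-first ordering from point 1) above. A shortened fault string stands in for the full 459-character response.

```perl
use strict;
use warnings;
use bytes ();                # loads bytes::length without enabling byte semantics
use Encode qw(encode);

# Hypothetical stand-in: measures Perl's internal representation of the
# scalar, not the bytes that are eventually written to the socket.
sub internal_bytelength { return bytes::length($_[0]) }

# Hypothetical helper: encode to the wire charset first, then measure
# the encoded bytes, so header and body always agree.
sub body_and_length {
    my ($response, $encoding) = @_;
    my $body = encode($encoding, $response);
    return ($body, length $body);
}

# Latin-1 scalar that has not been upgraded (the two \xFC are "ü").
my $latin1 = "<faultstring>t\xFCdel\xFC</faultstring>";

# Same characters, but stored as UTF-8 internally after an upgrade.
my $upgraded = $latin1;
utf8::upgrade($upgraded);

my ($body, $content_length) = body_and_length($latin1, 'UTF-8');

printf "characters:                    %d\n", length $latin1;
printf "internal bytes (not upgraded): %d\n", internal_bytelength($latin1);
printf "internal bytes (upgraded):     %d\n", internal_bytelength($upgraded);
printf "UTF-8 bytes on the wire:       %d\n", $content_length;
```

For the non-upgraded scalar the internal length equals the character count, two less than the encoded body, which corresponds to the 459 versus 461 bytes seen in the debug output. For the upgraded scalar the internal length happens to match the UTF-8 output, which is consistent with the "accidentally works" case in PS2 and would stop matching for a different wire encoding such as UTF-16.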

Message body is not shown because sender requested not to inline it.

Tue Feb 10 10:52:00 2015 The RT System itself - Status changed from 'new' to 'open'