Bug #98372 for IO-Socket-SSL: Should utf8::downgrade data in syswrite

Wed Aug 27 18:10:41 2014 victor [...] vsespb.ru - Ticket created

Subject:

Should utf8::downgrade data in syswrite

There is issue in Net::SSLeay https://rt.cpan.org/Ticket/Display.html?id=98368 Unfortunately it cannot be fixed. IO::Socket::SSL::syswrite/write is affected too. i.e. it behaves different way depending on scalar's UTF-8 flag. I propose utf8::downgrade input string (in _generic_write probably).

Thu Aug 28 12:49:22 2014 ether [...] cpan.org - Cc ETHER added

Thu Aug 28 16:03:16 2014 Steffen_Ullrich [...] genua.de - Correspondence added

Am Mi 27. Aug 2014, 18:10:41, vsespb schrieb: Show quoted text

> There is issue in Net::SSLeay > https://rt.cpan.org/Ticket/Display.html?id=98368 > Unfortunately it cannot be fixed. > > IO::Socket::SSL::syswrite/write is affected too. i.e. it behaves > different way depending on scalar's UTF-8 flag. > I propose utf8::downgrade input string (in _generic_write probably).

Thanks for reporting the problem, but I think a real fix is more complex. We cannot simply assume latin1 for output, but there might be a binmode of :utf8 on the file handle. In this case one would need to do an utf8 upgrade and not a downgrade and one would need to take care of sysread too. And, because a single utf8 character ca

Thu Aug 28 16:03:16 2014 The RT System itself - Status changed from 'new' to 'open'

Thu Aug 28 16:04:18 2014 Steffen_Ullrich [...] genua.de - Correspondence added

Am Mi 27. Aug 2014, 18:10:41, vsespb schrieb: Show quoted text

> There is issue in Net::SSLeay > https://rt.cpan.org/Ticket/Display.html?id=98368 > Unfortunately it cannot be fixed. > > IO::Socket::SSL::syswrite/write is affected too. i.e. it behaves > different way depending on scalar's UTF-8 flag. > I propose utf8::downgrade input string (in _generic_write probably).

Thanks for reporting the problem, but I think a real fix is more complex. We cannot simply assume latin1 for output, but there might be a binmode of :utf8 on the file handle. In this case one would need to do an utf8 upgrade and not a downgrade and one would need to take care of sysread too. And, because a single utf8 character can spread over multiple octets one would need to handle the cases where read or write stop inside an utf8 character (e.g. needs buffering).

Fri Jun 19 18:37:22 2015 leonerd-cpan [...] leonerd.org.uk - Correspondence added

A regular ->syswrite call on a native (non-SSL) socket handle in Perl always takes bytes, not Unicode characters. A ->syswrite call on an IO::Socket::SSL seems to be interpreting its input as Unicode characters if the SvUTF8 flag is set on the PV. This is wrong. I have observed a bug in my client (Net::Async::IRC) that causes double-encoding of UTF-8 over the wire, but only when SSL is enabled. I eventually tracked it down to this problem. It can easily be worked around at the Perl level by applying utf8::downgrade to the byte sequence I pass in to ->syswrite just before I call it. That *ought* never be necessary, but since it causes a demonstrable change in behaviour here, it indicates that IO::Socket::SSL (or maybe a lower dependency still) is doing the wrong thing. -- Paul Evans

Fri Jun 19 18:49:26 2015 leonerd-cpan [...] leonerd.org.uk - Correspondence added

On Fri Jun 19 18:37:22 2015, PEVANS wrote: Show quoted text

> It can easily be worked around at the Perl level by applying > utf8::downgrade to the byte sequence I pass in to ->syswrite just > before I call it.

The attached patch contains a test case demonstrating the failure, and the trivial oneline fix in IO::Async::SSL's case. -- Paul Evans

Subject:

RT98372.patch

=== modified file 'lib/IO/Async/SSL.pm' --- lib/IO/Async/SSL.pm 2015-05-29 18:53:19 +0000 +++ lib/IO/Async/SSL.pm 2015-06-19 22:48:30 +0000 @@ -117,6 +117,10 @@ my $stream = shift; my ( $fh, undef, $len ) = @_; + # Placate RT98372 + utf8::downgrade( $_[1] ) or + carp "Wide character in sslwrite"; + my $ret = $stream->_syswrite( $fh, $_[1], $len ); my $write_wants_read = !defined $ret && === added file 't/05utf8.t' --- t/05utf8.t 1970-01-01 00:00:00 +0000 +++ t/05utf8.t 2015-06-19 22:48:30 +0000 @@ -0,0 +1,83 @@ +#!/usr/bin/perl + +use strict; +use warnings; +use utf8; + +use Test::More; +use Test::Identity; + +use IO::Async::Test; + +use IO::Async::OS; +use IO::Async::Loop; +use IO::Async::SSL; +use IO::Async::Stream; + +use Encode qw( encode_utf8 decode_utf8 ); + +my $loop = IO::Async::Loop->new; + +# A message containing non-8859-1 characters as this tests Perl more interestingly +my $message = "Äu vi Äi tio vidas?"; + +sub chomped { chomp( my $tmp = $_[0] ); return $tmp } + +testing_loop( $loop ); + +{ + my ( $server_sock, $client_sock ) = IO::Async::OS->socketpair or + die "Cannot socketpair - $!"; + + $server_sock->blocking( 0 ); + $client_sock->blocking( 0 ); + + my $server_stream = IO::Async::Stream->new( + handle => $server_sock, + on_read => sub { 0 }, + ); + $loop->add( $server_stream ); + + my $client_stream = IO::Async::Stream->new( + handle => $client_sock, + on_read => sub { 0 }, + ); + $loop->add( $client_stream ); + + my $server_f = $loop->SSL_upgrade( + handle => $server_stream, + SSL_server => 1, + SSL_key_file => "t/privkey.pem", + SSL_cert_file => "t/server.pem", + ); + + my $client_f = $loop->SSL_upgrade( + handle => $client_stream, + SSL_verify_mode => 0, + ); + + wait_for { $server_f->is_ready and $client_f->is_ready }; + + # Check that we can pass UTF-8 bytes unmolested + my $bytes = encode_utf8( $message ); + + $client_stream->write( "$bytes\n" ); + + my $read_f = $server_stream->read_until( "\n" ); + wait_for { $read_f->is_ready }; + is( decode_utf8( chomped $read_f->get ), $message, + 'UTF-8 string unmolested' ); + + # Check further that the bytes remain umolested even if they somehow end + # up with the SvUTF8 flag set + utf8::upgrade( $bytes ); + + $client_stream->write( "$bytes\n" ); + + $read_f = $server_stream->read_until( "\n" ); + wait_for { $read_f->is_ready }; + is( decode_utf8( chomped $read_f->get ), $message, + 'UTF-8 string unmolested even with SvUTF8' ); +} + +done_testing;

Fri Dec 18 14:04:33 2015 Steffen_Ullrich [...] genua.de - Correspondence added

On Fri Jun 19 18:49:26 2015, PEVANS wrote: Show quoted text

> On Fri Jun 19 18:37:22 2015, PEVANS wrote:

> > It can easily be worked around at the Perl level by applying > > utf8::downgrade to the byte sequence I pass in to ->syswrite just > > before I call it.

> > The attached patch contains a test case demonstrating the failure, and > the trivial oneline fix in IO::Async::SSL's case.

Hi, Sorry that it took me such a long time to follow up on your message. My argument is, that writing on a socket should be done only with bytes and never with strings, because the underlying layer for sockets works only at the granularity of bytes. Any attempt to properly supporting strings will thus conflict with this layer and might lead to problems. Take the following example: my $cl = IO::Socket::INET->new('127.0.0.1:1234') or die $!; $cl->blocking(0); my $buf = "\N{U+20AC}"; # EUR - 3 Byte in UTF-8 binmode($cl,':utf8'); while (1) { my $n = syswrite($cl,$buf) or last; print STDERR "<$n>" } In this example syswrite will report each time that 1 character is written, but internally 3 bytes are send into the socket buffer. If the server stops receiving data the socket buffer will fill up and if we have a bad luck the full 3 bytes from the EUR character will not fit into the buffer. Since we are non-blocking syswrite will not block but return. It will return with undef even though maybe 1 of the 2 bytes from the character are written to the socket buffer. Additionally it will complain because "Malformed UTF-8 character (unexpected end of string) in syswrite ...". The information how many bytes are written is not known to the application which only sees that writing the character failed. So if it retries the write later it would start again with writing the full 3 bytes of the character - which completely messes up the data send to the server because there is now an incomplete utf-8 character in the data stream. What you request is that in all cases it should be assumed that the output is latin1, i.e. simply utf8::downgrade should be called on each write and consequently utf8::upgrade must be called on each read. This would be similar to what IO::Socket::INET does, only that it does not automatically upgrade the output to treat it as latin1 and that it does the automatic downgrade only if there is no explicit binmode set. Like I said, I think that sockets and strings do not match but that sockets must be used with bytes. Apart from this it would be costly to call utf8::downgrade all the time for input data which is typically bytes and not strings. But you might ask the maintainers of Net::SSLeay if they would add this functionality to SSL_write since checking inside the XS code for SvUTF8 has probably only negligible impact on the performance, compared to doing the same from inside Perl. And since IO::Socket::SSL just uses Net::SSLeay for all I/O you would get what you want this way. Regards, Steffen

Fri Dec 18 14:04:35 2015 Steffen_Ullrich [...] genua.de - Status changed from 'open' to 'rejected'

Mon Dec 21 11:52:40 2015 victor [...] vsespb.ru - Correspondence added

On Fri Dec 18 22:04:33 2015, SULLR wrote: Show quoted text

> My argument is, that writing on a socket should be done only with > bytes and never with > strings

Exactly! I agree. However I think there is misunderstanding of perl string concept. In short: There is "character string" and "byte string". _both_ can have UTF-8 flag on. Show quoted text

>because the underlying layer for sockets works only at the > granularity of > bytes. Any attempt to properly supporting strings will thus conflict > with this layer and > might lead to problems. Take the following example: > > my $cl = IO::Socket::INET->new('127.0.0.1:1234') or die $!; > $cl->blocking(0); > my $buf = "\N{U+20AC}"; # EUR - 3 Byte in UTF-8 > binmode($cl,':utf8'); > while (1) { > my $n = syswrite($cl,$buf) or last; > print STDERR "<$n>" > } > > In this example syswrite will report each time that 1 character is > written, but internally > 3 bytes are send into the socket buffer. If the server stops > receiving data the socket > buffer will fill up and if we have a bad luck the full 3 bytes from > the EUR character will > not fit into the buffer. Since we are non-blocking syswrite will not > block but return. >

This example is irrelevant here. It's about writing character string to socket with encoding specified. And we talk about writing byte string. Forget characters! Show quoted text

> It will return with undef even though maybe 1 of the 2 bytes from the > character are written > to the socket buffer. Additionally it will complain because > "Malformed UTF-8 character > (unexpected end of string) in syswrite ...". > The information how many bytes are written is not known to the > application which only sees > that writing the character failed. So if it retries the write later it > would start again > with writing the full 3 bytes of the character - which completely > messes up the data send > to the server because there is now an incomplete utf-8 character in > the data stream. > > > What you request is that in all cases it should be assumed that the > output is latin1,

No, no latin1. Forget characters. We talk about byte strings only. Show quoted text

> i.e. simply utf8::downgrade should be called on each write

Yes. Show quoted text

> and > consequently utf8::upgrade > must be called on each read.

No. Show quoted text

> This would be similar to what > IO::Socket::INET does, only > that it does not automatically upgrade the output to treat it as

Yes, it does not utf8::upgrade, because it would be wrong. Show quoted text

> latin1 and that it does > the automatic downgrade only if there is no explicit binmode set. > > > Like I said, I think that sockets and strings do not match but that > sockets must be used > with bytes.

exactly. but utf8::downgrade should be called. Show quoted text

> Apart from this it would be costly to call utf8::downgrade > all the time > for input data which is typically bytes and not strings.

not really probably. for byte string without utf8 flag it will return immediately, with O(1), I belived. for byte string with utf8 flag it will do what it does. Show quoted text

> But you might ask the maintainers of Net::SSLeay if they would add > this functionality > to SSL_write since checking inside the XS code for SvUTF8 has probably > only negligible > impact on the performance, compared to doing the same from inside > Perl. And since > IO::Socket::SSL just uses Net::SSLeay for all I/O you would get what > you want this way. > > Regards, > Steffen

I am original bug reporter, but I agree with leonerd-cpan@leonerd.org.uk, there should be utf8::downgrade for consistency with perl string model. And it what Perl does in similar cases.