On Fri Dec 18 22:04:33 2015, SULLR wrote:
> My argument is, that writing on a socket should be done only with
> bytes and never with
> strings
Exactly! I agree.
However, I think there is a misunderstanding of Perl's string concept.
In short:
There are "character strings" and "byte strings". _Both_ can have the UTF-8 flag on.
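To illustrate, here is a minimal sketch (the variable names are mine, only core utf8:: functions are used). It takes a byte string, switches the UTF-8 flag on with utf8::upgrade, and shows that the string is still the same three bytes from Perl's point of view:

```perl
use strict;
use warnings;

# A byte string: every element is in the range 0..255.
# These happen to be the 3 bytes of UTF-8 encoded U+20AC.
my $bytes = "\xE2\x82\xAC";
my $copy  = $bytes;

# Change only the internal representation; the string value stays the same.
utf8::upgrade($copy);

printf "flag before: %d, flag after: %d\n",
    (utf8::is_utf8($bytes) ? 1 : 0), (utf8::is_utf8($copy) ? 1 : 0);
printf "equal: %s, length: %d\n",
    ($bytes eq $copy ? "yes" : "no"), length $copy;
```

So the flag alone says nothing about whether a scalar holds bytes or characters.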
> because the underlying layer for sockets works only at the
> granularity of
> bytes. Any attempt to properly supporting strings will thus conflict
> with this layer and
> might lead to problems. Take the following example:
>
> my $cl = IO::Socket::INET->new('127.0.0.1:1234') or die $!;
> $cl->blocking(0);
> my $buf = "\N{U+20AC}"; # EUR - 3 Byte in UTF-8
> binmode($cl,':utf8');
> while (1) {
> my $n = syswrite($cl,$buf) or last;
> print STDERR "<$n>"
> }
>
> In this example syswrite will report each time that 1 character is
> written, but internally
> 3 bytes are send into the socket buffer. If the server stops
> receiving data the socket
> buffer will fill up and if we have a bad luck the full 3 bytes from
> the EUR character will
> not fit into the buffer. Since we are non-blocking syswrite will not
> block but return.
>
This example is irrelevant here. It is about writing a character string to a socket with an encoding specified, and we are talking about writing a byte string. Forget characters!
> It will return with undef even though maybe 1 of the 2 bytes from the
> character are written
> to the socket buffer. Additionally it will complain because
> "Malformed UTF-8 character
> (unexpected end of string) in syswrite ...".
> The information how many bytes are written is not known to the
> application which only sees
> that writing the character failed. So if it retries the write later it
> would start again
> with writing the full 3 bytes of the character - which completely
> messes up the data send
> to the server because there is now an incomplete utf-8 character in
> the data stream.
>
>
> What you request is that in all cases it should be assumed that the
> output is latin1,
No, not latin1. Forget characters. We are talking about byte strings only.
> i.e. simply utf8::downgrade should be called on each write
Yes.
> and
> consequently utf8::upgrade
> must be called on each read.
No.
> This would be similar to what
> IO::Socket::INET does, only
> that it does not automatically upgrade the output to treat it as
Yes, it does not call utf8::upgrade, because that would be wrong.
> latin1 and that it does
> the automatic downgrade only if there is no explicit binmode set.
>
>
> Like I said, I think that sockets and strings do not match but that
> sockets must be used
> with bytes.
Exactly. But utf8::downgrade should be called.
> Apart from this it would be costly to call utf8::downgrade
> all the time
> for input data which is typically bytes and not strings.
Probably not really. For a byte string without the UTF-8 flag it will return immediately, in O(1), I believe.
For a byte string with the UTF-8 flag it will do what it does.
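Roughly what I mean, as a sketch only: to_wire_bytes is a hypothetical helper name of mine, not something that exists in IO::Socket::SSL or Net::SSLeay. It downgrades a copy of the data before it would be handed to a write call, and refuses data that contains real wide characters:

```perl
use strict;
use warnings;

# Hypothetical helper: return the data as a byte string without the UTF-8 flag.
sub to_wire_bytes {
    my ($data) = @_;    # a copy, so the caller's scalar is not modified

    # With a true second argument utf8::downgrade returns false instead of
    # dying when the string holds a character above 0xFF.
    utf8::downgrade($data, 1)
        or die "Wide character in data meant for a socket\n";
    return $data;
}

my $plain   = "\xE2\x82\xAC";    # byte string, flag off
my $flagged = $plain;
utf8::upgrade($flagged);         # same byte string, flag on

# Both come out as the same unflagged byte string.
print "same bytes\n" if to_wire_bytes($plain) eq to_wire_bytes($flagged);

# A real character string is rejected instead of being sent half-encoded.
my $ok = eval { to_wire_bytes("\x{20AC}"); 1 };
print "wide character rejected\n" unless $ok;
```

For the unflagged string the downgrade leaves the data unchanged. Only the flagged one needs any conversion.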
> But you might ask the maintainers of Net::SSLeay if they would add
> this functionality
> to SSL_write since checking inside the XS code for SvUTF8 has probably
> only negligible
> impact on the performance, compared to doing the same from inside
> Perl. And since
> IO::Socket::SSL just uses Net::SSLeay for all I/O you would get what
> you want this way.
>
> Regards,
> Steffen
I am the original bug reporter, but I agree with leonerd-cpan@leonerd.org.uk: there should be a utf8::downgrade, for consistency with the Perl string model.
And it is what Perl does in similar cases.