Bug #130192 for Net-Stomp: Encoding/UTF-8 issues

Fri Jul 26 08:10:24 2019 mstock [...] cpan.org - Ticket created

Subject:

Encoding/UTF-8 issues

Hi,

we recently noticed an issue when sending messages whose body contained JSON data with UTF-8 encoded characters (over an SSL connection to ActiveMQ; we did not try it without SSL). These messages usually either did not arrive at the other end at all, or they contained double-encoded UTF-8 characters (or something that looked similar) or invalid JSON.

At first, I assumed that the issue was with the body, but we were actually passing a byte string (as opposed to a character string with the UTF-8 flag set) to Net::Stomp, so that part was fine. As it turns out, the destination header was built from a value that came from a database and had the UTF-8 flag set. So when this got concatenated in Net::Stomp::Frame (in as_string()), the UTF-8 flag got 'propagated' to the $frame, and when the $body is appended, its bytes get appended to this UTF-8 character string, which likely explains the observed double-encoding.

Initially, we did not send the content-length header, but once I added it, I got errors in the ActiveMQ log complaining that it did not receive the trailing null byte after reading content-length bytes. I assume that there are two reasons for this:

1. If the $frame has the UTF-8 flag set, the code in send_frame() works with character counts, while the lower-level code (OpenSSL?) that eventually gets called in syswrite() might be returning byte counts. This makes the substr() remove too many characters when syswrite has written a UTF-8 character (which can be more than one byte long), causing 'data loss' in the body of the frame. I'm not sure if we actually triggered this, or if it really is an issue, since our messages are rather short; I just noticed this potential problem when reading the send_frame() code.

2. Because of the double-encoding mentioned above, the body becomes longer than it was when the length was calculated in the calling code, so the content-length doesn't actually match the length of the body that gets sent.
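The flag propagation and the resulting length mismatch can be reproduced without Net::Stomp. The following is a minimal sketch, not the module's actual code: it assumes that the frame is assembled by plain string concatenation, it uses utf8::upgrade() as a stand-in for a flagged value coming from a database, and it assumes that some lower layer ends up emitting the UTF-8 encoding of the flagged string.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Body as the caller passes it: UTF-8 encoded bytes for U+263A
# (WHITE SMILING FACE), three bytes, UTF-8 flag off.
my $body = encode('UTF-8', "\x{263A}");
my $content_length = length($body);    # what the caller would announce

# Stand-in for a header value fetched from a database: pure ASCII,
# but with the UTF-8 flag switched on.
my $destination = '/foo/bar';
utf8::upgrade($destination);

# Assumed shape of the frame assembly: plain concatenation. The flagged
# header upgrades the whole string, so every body byte becomes one
# character (U+00E2, U+0098, U+00BA).
my $frame = "SEND\ndestination:$destination\n\n" . $body . "\0";

# What ends up on the wire if a lower layer emits the UTF-8 encoding of
# this character string instead of downgrading it to bytes (assumption).
my $wire = encode('UTF-8', $frame);

# The byte sequence the caller intended to send.
my $expected = encode('UTF-8', "SEND\ndestination:/foo/bar\n\n\x{263A}\0");

printf "flag set: %s\n", utf8::is_utf8($frame) ? 'yes' : 'no';
printf "announced body length: %d\n", $content_length;
printf "bytes sent: %d, bytes expected: %d\n", length($wire), length($expected);
```

Under these assumptions the body grows from three to six bytes on the wire, so a content-length of 3 no longer points at the trailing null byte.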
I'm not sure if I fully understand the problem yet, nor do I know how best to solve it. Maybe it's sufficient to document that both headers and body must be byte strings (with the UTF-8 flag off), to give others who run into this a hint about what to look for.

In our case, I currently 'solved' it by wrapping the send method with some code that encodes parameter keys and values as UTF-8 byte strings (using the Encode module) if the UTF-8 flag is set. One has to be careful, though: if the caller provided a content-length header and the body was a character string with the UTF-8 flag set, the actual body length might be different from the one provided by the caller.

The attached .t file tries to demonstrate a part of the issue by showing that the result differs depending on whether a header value does or does not have the UTF-8 flag set.

Kind regards
Manfred
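A workaround of the kind described above could look roughly like the following sketch. The helper names to_bytes() and bytes_frame_args() are hypothetical and not part of Net::Stomp. Recomputing content-length after encoding is a choice made in this sketch, not something the module does.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return a byte-string version of a value.
# Values without the UTF-8 flag are assumed to be bytes already.
sub to_bytes {
    my ($value) = @_;
    return $value unless defined $value && utf8::is_utf8($value);
    return encode('UTF-8', $value);
}

# Hypothetical helper: copy a flat hash of send() arguments, encoding
# every flagged key and value as UTF-8 bytes.
sub bytes_frame_args {
    my ($args) = @_;
    my %out;
    for my $key (keys %$args) {
        $out{ to_bytes($key) } = to_bytes($args->{$key});
    }
    # A caller-supplied content-length may be stale once the body has
    # been encoded, so this sketch recomputes it from the byte string.
    if (exists $out{'content-length'} && defined $out{body}) {
        $out{'content-length'} = length($out{body});
    }
    return \%out;
}

# Possible use, with $stomp being a connected Net::Stomp object:
# $stomp->send(bytes_frame_args({
#     destination => $destination_from_db,
#     body        => $json_bytes,
# }));
```

Note that this only helps if a flagged value really is a character string. A byte string that was upgraded by accident would be double-encoded by to_bytes().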

Subject:

encoding.t

#!perl
use lib 't/lib';
use TestHelp;
use Net::Stomp::Frame;
use Encode;
use File::Spec;
use Devel::Peek;

my $expected_frame_data = encode('UTF-8', "SEND\ndestination:/foo/bar\n\n\N{WHITE SMILING FACE}\0");
my $body = encode('UTF-8', "\N{WHITE SMILING FACE}");

use Data::Dumper;
local $Data::Dumper::Useqq = 1;
local $Data::Dumper::Terse = 1;
local $Data::Dumper::Indent = 0;

subtest 'Non UTF-8 destination' => sub {
    my $f = Net::Stomp::Frame->new({
        command => 'SEND',
        headers => {
            destination => '/foo/bar',
        },
        body => $body,
    });
    my $frame_data = $f->as_string();
    is($frame_data, $expected_frame_data, 'frame content looks as expected');
    is(utf8::is_utf8($frame_data), '', 'content not marked as UTF-8');
    diag 'Frame data: ' . Dumper($frame_data);
    Dump $frame_data;
};

subtest 'UTF-8 destination' => sub {
    my $f = Net::Stomp::Frame->new({
        command => 'SEND',
        headers => {
            destination => decode('UTF-8', '/foo/bar'),
        },
        body => $body,
    });
    my $frame_data = $f->as_string();
    is($frame_data, $expected_frame_data, 'frame content looks as expected');
    is(utf8::is_utf8($frame_data), '', 'content not marked as UTF-8');
    diag 'Frame data: ' . Dumper($frame_data);
    Dump $frame_data;

    # The following 'simulates' code (that might be called from syswrite) which
    # detects the UTF-8 flag and performs encoding to bytes, which results in
    # double-encoded UTF-8 data
    is(encode('UTF-8', $frame_data), $expected_frame_data, 'frame content looks as expected');
    utf8::encode($frame_data);
    is($frame_data, $expected_frame_data, 'frame content looks as expected');
};

done_testing;

Mon Aug 12 08:37:06 2019 dakkar [...] thenautilus.net - Correspondence added

Subject:	Re: [rt.cpan.org #130192] Encoding/UTF-8 issues
Date:	Mon, 12 Aug 2019 13:30:13 +0100
To:	"Manfred Stock via RT" <bug-Net-Stomp [...] rt.cpan.org>
From:	Gianni Ceccarelli <dakkar [...] thenautilus.net>

On Fri, 26 Jul 2019 08:10:25 -0400 "Manfred Stock via RT" <bug-Net-Stomp@rt.cpan.org> wrote:

> we recently noticed an issue when we were sending messages with JSON
> data in the body that contained UTF-8 encoded characters (over a SSL
> connection and using ActiveMQ, we did not try it without SSL).

Thank you for the bug report!

To the best of my knowledge, there is no reliable way to detect whether a string is going to be interpreted as characters or bytes, especially when going through XS. I fear that the best I can do is to document more explicitly that all strings passed to Net::Stomp must be byte strings.

-- 
Dakkar - <Mobilis in mobile>
GPG public key fingerprint = A071 E618 DD2C 5901 9574 6FE2 40EA 9883 7519 3F88
key id = 0x75193F88

No committee could ever come up with anything as revolutionary as a camel -- anything as practical and as perfectly designed to perform effectively under such difficult conditions.
    -- Laurence J. Peter

Mon Aug 12 08:37:07 2019 The RT System itself - Status changed from 'new' to 'open'

Tue Aug 13 03:02:45 2019 mstock [...] cpan.org - Correspondence added

On Mon, 12 Aug 2019, 08:37:06, dakkar@thenautilus.net wrote:

> To the best of my knowledge, there is no reliable way to detect
> whether a string is going to be interpreted as characters or bytes,
> especially when going through XS. I fear that the best I can do is
> to document more explicitly that all strings passed to Net::Stomp must
> be byte strings.

I think that would be fine, especially if you explicitly mention that this applies not only to the body, but also to all other parameters, e.g. header fields and the destination. Doing something 'magic' behind the scenes is often a bit risky, and introducing it now might also break existing code.

One thing to consider is emitting/logging a warning at debug or trace level if some string has the UTF-8 flag set (or only if the string that gets written to a socket has it set, as this would require fewer changes). However, warnings tend to get ignored if things work anyhow, and the flag is probably not even a problem if no multi-byte characters are involved, which might be a rare use-case anyway.

Kind regards
Manfred
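Such a check could be sketched as follows. The function name utf8_flag_warning() is hypothetical and not part of Net::Stomp. It only reports the state of the flag and cannot tell whether the caller meant characters or bytes.

```perl
use strict;
use warnings;

# Hypothetical check: returns a message if $value carries the UTF-8
# flag, and nothing otherwise. The message distinguishes the mostly
# harmless ASCII-only case from strings with non-ASCII characters.
sub utf8_flag_warning {
    my ($name, $value) = @_;
    return unless defined $value && utf8::is_utf8($value);
    my $detail = $value =~ /[^\x00-\x7F]/
        ? 'contains non-ASCII characters'
        : 'is ASCII-only';
    return "$name has the UTF-8 flag set and $detail";
}

# Possible use just before a frame string is written to the socket:
# my $message = utf8_flag_warning('frame', $frame_string);
# warn "$message\n" if defined $message;
```

Checking only the final frame string, as suggested above, keeps the change in one place, at the cost of not naming the header that introduced the flag.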