Subject: | Encoding/UTF-8 issues |
Hi,
we recently noticed an issue when sending messages whose body contained JSON data with UTF-8 encoded characters (over an SSL connection to ActiveMQ; we did not try it without SSL). These messages usually either did not arrive at the other end at all, or they arrived with double-encoded UTF-8 characters (or something that looked similar) or invalid JSON.
At first I assumed that the issue was with the body, but we were actually passing a byte string (as opposed to a character string with the UTF-8 flag set) to Net::Stomp, so that part was fine. As it turns out, the destination header was built from a value that came from a database and had the UTF-8 flag set. When this got concatenated in Net::Stomp::Frame (in as_string()), the UTF-8 flag got 'propagated' to the $frame, and when the $body is appended, its bytes get appended to this UTF-8 character string, which likely explains the observed double-encoding.
Initially, we did not send the content-length header, but once I added it, I got errors in the ActiveMQ log complaining that it did not receive the trailing null byte after reading content-length bytes. I assume that there are two reasons for this:
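The flag propagation described above can be reproduced with core Perl alone. The following is a minimal sketch, not the actual Net::Stomp::Frame code: the frame is built by plain string interpolation, utf8::upgrade() stands in for the database driver that returned a flagged value, and the final encode() only approximates what lower-level code might do when it writes out a flagged string.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Body as the caller passes it: UTF-8 encoded bytes, UTF-8 flag off.
my $body = encode('UTF-8', "\x{263A}");    # WHITE SMILING FACE, bytes E2 98 BA

# Header value with the UTF-8 flag on (stand-in for the database value).
my $destination = '/foo/bar';
utf8::upgrade($destination);

# Concatenation upgrades the whole result: every body byte is treated as a
# Latin-1 character and the frame carries the UTF-8 flag.
my $frame = "SEND\ndestination:$destination\n\n$body\0";

# What should go over the wire.
my $expected = encode('UTF-8', "SEND\ndestination:/foo/bar\n\n\x{263A}\0");

# Approximation of a writer that encodes the flagged string once more.
my $wire = encode('UTF-8', $frame);

printf "frame flagged:     %s\n", utf8::is_utf8($frame) ? 'yes' : 'no';
printf "frame eq expected: %s\n", $frame eq $expected ? 'yes' : 'no';
printf "expected bytes:    %d\n", length $expected;
printf "wire bytes:        %d\n", length $wire;
```

With these inputs the flagged frame still compares equal to the expected string, because eq compares characters, but the re-encoded copy is 34 bytes instead of 31: each of the three body bytes has become two.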
1. If the $frame has the UTF-8 flag set, the code in send_frame() works with character counts, while the lower-level code (OpenSSL?) that eventually gets called from syswrite() might be returning byte counts. In that case substr() removes too many characters whenever syswrite() has written a UTF-8 character that is more than one byte long, causing 'data loss' in the body of the frame. I'm not sure whether we actually triggered this, or whether it really is an issue, since our messages are rather short; I just noticed this potential problem when reading the send_frame() code.
2. Because of the double-encoding mentioned above, the body becomes longer than it was when the length was calculated in the calling code, so the content-length doesn't actually match the length of the body that gets sent.
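Both points come down to character counts and byte counts drifting apart once a string is flagged. The sketch below illustrates that with core Perl only. It does not call send_frame() or syswrite(); the value of $written is a made-up byte count standing in for whatever a lower-level writer might report.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $body = encode('UTF-8', "\x{263A}");    # 3 bytes, UTF-8 flag off
my $content_length = length $body;         # what the caller would announce

# The body as it looks after being appended to a flagged frame.
my $upgraded = $body;
utf8::upgrade($upgraded);

# length() and substr() count characters on the flagged string ...
my $chars = length $upgraded;

# ... while writing it out re-encoded produces more bytes than announced.
my $wire = encode('UTF-8', $upgraded);

# Hypothetical partial write: the writer reports 4 bytes written.
my $written = 4;
my $remaining = $upgraded;
substr($remaining, 0, $written, '');       # removes 4 *characters*, if present

printf "content-length announced: %d\n", $content_length;
printf "characters in buffer:     %d\n", $chars;
printf "bytes on the wire:        %d\n", length $wire;
printf "characters left to send:  %d\n", length $remaining;
printf "bytes actually unsent:    %d\n", length($wire) - $written;
```

Here the announced content-length is 3 while 6 bytes would be written, and trimming by the reported byte count leaves nothing in the buffer although 2 bytes were never sent.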
I'm not sure that I fully understand the problem yet, nor do I know how best to solve it. Maybe it is sufficient to document that both headers and body must be byte strings (with the UTF-8 flag off), to give others who run into this a hint about what to look for. In our case, I currently 'solved' it by wrapping the send method with some code that encodes parameter keys and values as UTF-8 byte strings (using the Encode module) if the UTF-8 flag is set. One has to be careful, though: if the caller provided a content-length header and the body was a character string with the UTF-8 flag set, the actual body length might differ from the one provided by the caller.
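A workaround along these lines could look roughly like the sketch below. The helper name encode_params() is hypothetical and not part of Net::Stomp. It assumes the parameters are a flat hash reference with headers and a 'body' key, as passed to send(). Recomputing content-length when the body had to be re-encoded is only one possible policy; rejecting such a call outright would be another.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns a copy of the parameters in which every key
# and value that carries the UTF-8 flag has been encoded to UTF-8 bytes.
sub encode_params {
    my ($params) = @_;
    my %out;
    for my $key (keys %$params) {
        my $value = $params->{$key};
        my $k = utf8::is_utf8($key) ? encode('UTF-8', $key) : $key;
        my $v = (defined $value && utf8::is_utf8($value))
            ? encode('UTF-8', $value)
            : $value;
        $out{$k} = $v;
    }
    # Assumption: if the body was a character string, a caller-supplied
    # content-length may be wrong, so it is recomputed from the bytes.
    if (exists $out{'content-length'}
        && defined $params->{body}
        && utf8::is_utf8($params->{body})) {
        $out{'content-length'} = length $out{body};
    }
    return \%out;
}

my $destination = '/foo/bar';
utf8::upgrade($destination);               # simulate a flagged database value

my $safe = encode_params({
    destination      => $destination,
    body             => "\x{263A}",        # character string, flag on
    'content-length' => 1,                 # character count, not byte count
});

printf "destination flagged: %s\n", utf8::is_utf8($safe->{destination}) ? 'yes' : 'no';
printf "content-length:      %d\n", $safe->{'content-length'};
```

The returned hash reference could then be handed to the real send method, so that only byte strings reach the frame-building code.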
The attached .t file tries to demonstrate part of the issue by showing that the result differs depending on whether or not a header value has the UTF-8 flag set.
Kind regards
Manfred
Subject: | encoding.t |
#!perl
use lib 't/lib';
use TestHelp;
use Net::Stomp::Frame;
use Encode;
use File::Spec;
use Devel::Peek;
my $expected_frame_data = encode('UTF-8',
"SEND\ndestination:/foo/bar\n\n\N{WHITE SMILING FACE}\0");
my $body = encode('UTF-8', "\N{WHITE SMILING FACE}");
use Data::Dumper;
local $Data::Dumper::Useqq = 1;
local $Data::Dumper::Terse = 1;
local $Data::Dumper::Indent = 0;
subtest 'Non UTF-8 destination' => sub {
    my $f = Net::Stomp::Frame->new({
        command => 'SEND',
        headers => {
            destination => '/foo/bar',
        },
        body => $body,
    });
    my $frame_data = $f->as_string();
    is($frame_data, $expected_frame_data, 'frame content looks as expected');
    is(utf8::is_utf8($frame_data), '', 'content not marked as UTF-8');
    diag 'Frame data: ' . Dumper($frame_data);
    Dump $frame_data;
};
subtest 'UTF-8 destination' => sub {
    my $f = Net::Stomp::Frame->new({
        command => 'SEND',
        headers => {
            # decode() returns a string with the UTF-8 flag set
            destination => decode('UTF-8', '/foo/bar'),
        },
        body => $body,
    });
    my $frame_data = $f->as_string();
    is($frame_data, $expected_frame_data, 'frame content looks as expected');
    is(utf8::is_utf8($frame_data), '', 'content not marked as UTF-8');
    diag 'Frame data: ' . Dumper($frame_data);
    Dump $frame_data;
    # The following 'simulates' code (that might be called from syswrite) which
    # detects the UTF-8 flag and performs encoding to bytes, which results in
    # double-encoded UTF-8 data
    is(encode('UTF-8', $frame_data), $expected_frame_data,
        'frame content after encode() looks as expected');
    utf8::encode($frame_data);
    is($frame_data, $expected_frame_data,
        'frame content after utf8::encode() looks as expected');
};
done_testing;