
This queue is for tickets about the libnet CPAN distribution.

Report information
The Basics
Id: 104433
Status: resolved
Priority: 0
Queue: libnet

People
Owner: Nobody in particular
Requestors: rjbs [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: 3.07



Subject: datasend corrupts input with abuse of is_utf8
A snippet:

    sub datasend {
      my $cmd = shift;
      my $arr = @_ == 1 && ref($_[0]) ? $_[0] : \@_;
      my $line = join("", @$arr);

      # encode to individual utf8 bytes if
      # $line is a string (in internal UTF-8)
      utf8::encode($line) if is_utf8($line);

This is crazy. First of all, is_utf8 is defined in one of four ways, several having distinct meanings! Assuming we get the ideal one, it tests whether the UTF8 flag is turned on. Any string stored in an SvUTF8 scalar is then encoded. It's perfectly reasonable, though, for an SvUTF8 scalar to contain octets forming a UTF-8 encoded string. This mistake is disturbingly common. In reality, you can determine approximately *nothing* about the semantics of a string based on the UTF8 flag, and we should *just stop trying.*

There's a problem, of course. Somebody might be relying on this behavior. It seems hard to fathom, but I guess it's possible. :-)

The fix, here, is to assume all input is bytes and encode nothing. If a string contains a wide character, a warning should be issued.

Effect:

* on strings of [\x00-\x7F]: no change
* on strings of [\x00-\xFF], SvUTF8 on:
  * if these were UTF-8 octets: bugfix
  * if these were Latin-1 octets: bugfix
  * if these were Unicode characters: breaking change
* on strings with [\x100-Inf]:
  * will issue "wide string" warning, otherwise no change

Then, the documentation needs to be reviewed to state that the input to datasend is octets for the wire.

-- rjbs
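[Editor's note: a minimal sketch of the behaviour proposed above -- treat input as octets, never re-encode, warn on wide characters. check_octets() is a hypothetical helper for illustration, not libnet code and not the patch that eventually shipped.]

    use strict;
    use warnings;

    sub check_octets {    # hypothetical, for illustration only
        my ($line) = @_;
        # A codepoint > 0xFF can never be an octet for the wire.
        warn "Wide character in datasend\n" if $line =~ /[^\x00-\xff]/;
        return $line;     # otherwise passed through untouched, UTF8 flag or not
    }

    check_octets("plain ascii\r\n");          # no warning, no change
    check_octets("snowman: \x{2603}\r\n");    # warns: caller must encode first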
On 2015-05-14 11:07:36, RJBS wrote:
> This is crazy. First of all, is_utf8 is defined in one of four ways,
> several having distinct meanings! Assuming we get the ideal one, it
> tests whether the UTF8 flag is turned on. Any string stored in an
> SvUTF8 scalar is then encoded. It's perfectly reasonable, though, for
> an SvUTF8 to contain octets forming a UTF-8 encoded string.
I realize that I did not point out the big problem specifically enough. Because of this bug, if I send UTF-8 data to datasend, and if the scalar containing that UTF-8 data has the SvUTF8 flag on, it will be encoded again, meaning I will send double-encoded data across the wire.

-- rjbs
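[Editor's note: a standalone sketch of that failure mode, not libnet code -- correctly encoded octets that happen to sit in an upgraded scalar get encoded a second time.]

    use strict;
    use warnings;
    use Encode qw(encode_utf8);

    my $text  = "caf\x{e9}";          # text: c, a, f, U+00E9
    my $bytes = encode_utf8($text);   # encoded once: 63 61 66 c3 a9
    utf8::upgrade($bytes);            # same octets, but UTF8 flag now on

    # What datasend() effectively does when it sees the flag:
    utf8::encode($bytes);             # encoded AGAIN: 63 61 66 c3 83 c2 a9

    printf "once:  %s\n", unpack 'H*', encode_utf8($text);  # 636166c3a9
    printf "twice: %s\n", unpack 'H*', $bytes;              # 636166c383c2a9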
In the meantime, I have made a trial release of Email::Sender::Transport::SMTP, which works around this problem. So far, I see nothing but bugfixes in my codebase. 😊

-- rjbs
On Thu May 14 11:09:23 2015, RJBS wrote:
> On 2015-05-14 11:07:36, RJBS wrote:
> > This is crazy. First of all, is_utf8 is defined in one of four ways,
> > several having distinct meanings! Assuming we get the ideal one, it
> > tests whether the UTF8 flag is turned on. Any string stored in an
> > SvUTF8 scalar is then encoded. It's perfectly reasonable, though, for
> > an SvUTF8 to contain octets forming a UTF-8 encoded string.
>
> I realize that I did not point out the big problem specifically enough.
>
> Because of this bug, if I send UTF-8 data to datasend, and if the scalar
> containing that UTF-8 data has the SvUTF8 flag on, it will be encoded
> again, meaning I will send double-encoded data across the wire.
Apologies for the slow response. I'm back from vacation now, and starting to catch up on things :-)

I agree that having four definitions of is_utf8 is crazy. Looks like it was there for supporting older perls not having utf8/Encode, but since I've changed libnet to only support >= 5.8.1 I think we can get rid of all but one of those definitions anyway -- it should just use the Encode version, I think. (I don't have much interest in supporting a perl built without Encode! I can just make that a prerequisite to be safe.)

However, I'm afraid I don't understand what the problem is otherwise. As I understand it, perl's internal format happens to be UTF-8 anyway, so when a string contains "wide characters" they're actually stored as the sequence of bytes that constitute the UTF-8 encoding of those characters; the UTF-8 flag on the string is turned on to tell Perl that it should treat the string as the "wide characters" represented by that UTF-8 encoding, otherwise it interprets each byte as a character in the ISO-8859-1 single-byte encoding.

So if you have a Perl scalar, $line, containing some UTF-8 data with the UTF-8 flag on then all that utf8::encode($line) actually does is turn the UTF-8 flag off. Internally, $line contains the same bytes before and after this encode() call, so I don't see where any double-encoding would come from. [At the Perl level, the effect of turning the flag off is that the string now appears to contain many more characters than before, e.g. two characters with ords 0xc4 and 0x80 where there was previously one character with ord 0x100, but internally the data stored is the same -- in this example, 0xc4 and 0x80 before and after.]

(Double encoding would occur if encode() was called on a $line that contained UTF-8 data *without* the UTF-8 flag on, but the code is careful not to do that by checking the flag first with is_utf8(). I do not see how that is an abuse of is_utf8(); surely that's exactly what the function is for?)

The purpose of the encode() call is to stop treating the wide characters as such, and start processing the (UTF-8) bytes which they are composed of directly, since it is the raw bytes that we ultimately want to send down the wire. That seems perfectly correct and reasonable to me, e.g. when datasend() subsequently calls "my $len = length($line)" it is surely the number of bytes, not the number of characters, that we are interested in.
A string in Perl 5 is a sequence of non-negative integers, and their meaning as UTF-8 octets or Unicode characters is entirely determined by how they are found and used. You can't tell a single thing about their meaning from the UTF8 scalar flag.

&datasend is pretending that it can know that the string's contents are characters just because the UTF8 flag is set, which is TOTAL NONSENSE.

Consider this program:

---

use strict;
use Encode 'encode_utf8';
use Devel::Peek;

my $text = "\x{100}"; # That's "A" with a line over top of it
my $utf8 = encode_utf8($text);
Dump($text);
Dump(substr($text,0,0));
Dump($utf8);

my $mystery = substr($text,0,0) . $utf8;

Dump($mystery);

print unpack 'H*', encode_utf8($mystery);

---

$text is a string of only one element. Its value is above 0xFF, so the string must be stored "wide." $utf8 is its UTF-8 encoded value.

The UTF8 flag, though it *happens* to match the semantics of encoded/decoded here, has *no required relationship* to these facts. For example, consider that if we take a ZERO-LENGTH substring of $text, it still has the UTF8 flag set. Then we append it to $utf8. The UTF8 flag is still on.

The two strings have identical codepoints. They have identical meaning as long as we have correctly accounted for the semantics. If we were to UTF-8 encode $mystery just because the UTF8 flag was turned on, *then we would produce mojibake*.

That is *precisely* what Net::Cmd is doing.

The use of substr($text,0,0) here is only one of a myriad of ways that you can end up with a UTF8-flag-on string that actually contains octets of UTF-8.

Conflating "UTF8 flag on for perl's internal memory representation" with "string is actually a text string that needs to be encoded before use" is one of the most common and insidious problems with Perl 5's strings.

-- rjbs
On Fri Jul 03 22:09:57 2015, RJBS wrote:
> A string in Perl 5 is a sequence of non-negative integers, and their
> meaning as UTF-8 octets or Unicode characters is entirely determined
> by how they are found and used. You can't tell a single thing about
> their meaning from the UTF8 scalar flag.
>
> &datasend is pretending that it can know that the string's contents
> are characters just because the UTF8 flag is set, which is TOTAL
> NONSENSE.
>
> Consider this program:
>
> ---
>
> use strict;
> use Encode 'encode_utf8';
> use Devel::Peek;
>
> my $text = "\x{100}"; # That's "A" with a line over top of it
> my $utf8 = encode_utf8($text);
> Dump($text);
> Dump(substr($text,0,0));
> Dump($utf8);
>
> my $mystery = substr($text,0,0) . $utf8;
>
> Dump($mystery);
>
> print unpack 'H*', encode_utf8($mystery);
>
> ---
>
> $text is a string of only one element. Its value is above 0xFF, so
> the string must be stored "wide." $utf8 is its UTF-8 encoded value.
>
> The UTF8 flag, though it *happens* to match the semantics of
> encoded/decoded here, has *no required relationship* to these facts.
> For example, consider that if we take a ZERO-LENGTH substring of
> $text, it still has the UTF8 flag set. Then we append it to $utf8.
> The UTF8 flag is still on.
>
> The two strings have identical codepoints. They have identical
> meaning as long as we have correctly accounted for the semantics. If
> we were to UTF-8 encode $mystery just because the UTF8 flag was turned
> on, *then we would produce mojibake*.
>
> That is *precisely* what Net::Cmd is doing.
>
> The use of substr($text,0,0) here is only one of a myriad of ways that
> you can end up with a UTF8-flag-on string that actually contains
> octets of UTF-8.
>
> Conflating "UTF8 flag on for perl's internal memory representation"
> with "string is actually a text string that needs to be encoded before
> use" is one of the most common and insidious problems with Perl 5's
> strings.
I disagree that the two strings (I assume you mean $text and $mystery) have identical codepoints or identical meaning -- to me they are quite different, and the difference has arisen precisely because you have not correctly accounted for the different semantics of the two strings from which $mystery was concatenated.

The problem being that, as you say, substr($text,0,0) has the UTF-8 flag on but $utf8 does not since it has been explicitly encoded, i.e. in the language of various perl manpages substr($text,0,0) is a "text string" (consisting of (zero) *characters*) but $utf8 is a "binary string" (consisting of bytes). The result is, as documented in the "Byte and Character Semantics" section of perlunicode and in the "How Do I Know Whether My String Is In Unicode?" section of perluniintro, that the "binary string" gets upgraded to UTF-8 prior to concatenation. The bytes that it is composed of (0xc4 and 0x80) are interpreted as characters in the ISO-8859-1 single-byte encoding, and each gets encoded as UTF-8: 0xc4 becomes 0xc3 0x84, and 0x80 becomes 0xc2 0x80. Those two characters (stored internally in their UTF-8 encoding) are then concatenated with the zero-length string to produce $mystery.

So now we have $text containing the single character U+0100 while $mystery contains the two characters U+00C4 and U+0080. Not identical at all, and probably not what you wanted either. But the mistake is yours in performing that concatenation in the first place. Your program has lost track of what is "text" ($text, and therefore substr($text,0,0) too) and what is "binary" ($utf8) and has foolishly concatenated the two. As the "How can I determine if a string is a text string or a binary string?" section of perlunifaq says, this is something that you, the programmer (in this case, the user of libnet), has to keep track of, and you can't use the UTF-8 flag for doing so because the flag can be off for "text" (when a single-byte encoding is being used to store the string). (*That* would be an abuse of the is_utf8() flag, but it's not what libnet is using that function for.)

That doesn't mean that libnet shouldn't use the is_utf8() function to determine whether to call encode() or not, though. As the "How Do I Know Whether My String Is In Unicode?" section of perluniintro says, it is quite correct to call utf8::is_utf8() to determine if a string is "in Unicode" -- although that doesn't mean that any of the characters in the string are necessarily UTF-8 encoded, or that any of the characters have code points greater than 0xFF (255) or even 0x80 (128), or that the string has any characters at all. (You've constructed a zero-length string with the UTF-8 flag on yourself, and it's equally easy to store a simple ASCII character with the flag forced on too, e.g. $A = pack('U0W*', 65).)

All utf8::is_utf8() means is simply that the UTF-8 flag is on, and in that case I still think it is quite legitimate for a module like libnet to call utf8::encode() on it because all that does (for a correctly flagged string) is turn the flag off to cause byte semantics to be used in what follows, which is just what libnet wants at this late stage in its output routine. Note that in your example, calling utf8::encode() (or Encode::encode_utf8()) on $mystery would not do anything more sinister than exactly that: turn the UTF-8 flag off.
Dump($mystery) outputs:

    PV = 0x280a874 "\303\204\302\200"\0 [UTF8 "\x{c4}\x{80}"]

while Dump(encode_utf8($mystery)) outputs:

    PV = 0x280aac0 "\303\204\302\200"\0

It hasn't double-encoded anything, or turned anything into mojibake as you allege would happen; all it's done is simply turned the UTF-8 flag off. The fact that there is garbage in $mystery after the encode() call is entirely due to it having garbage in it before the encode() call, due to its formation from the unwise concatenation of a text string and a byte string. There's really nothing that libnet, or any other module, can do about such a case of "garbage in, garbage out".

I don't agree that conflating "UTF8 flag on for perl's internal memory representation" with "string is actually a text string that needs to be encoded before use" is any kind of an evil thing. The UTF-8 flag is exactly what tells *perl* how to interpret the contents of a given string, so surely it tells us the same thing, and I think programmers ignore it at their peril. In particular, concatenating one string which is flagged to be interpreted as text (characters) with another that is flagged to be interpreted as binary (bytes) is bound to produce garbage, and the only way to resolve it is to keep careful track of what is text and what is binary, and encode/decode as appropriate. That's all that libnet is doing (i.e. calling encode() on a string that is flagged as text), and that's what the caller should do too, to avoid passing garbage into libnet and then complaining that garbage is coming out.

It may be helpful to enable "use encoding::warnings" in your software to find where the mistaken concatenation of text strings and byte strings is occurring. For example, when added to your example program it would output the warning

    Bytes implicitly upgraded into wide characters as iso-8859-1

on the line that constructs $mystery.
> I disagree that the two strings (I assume you mean $text and $mystery)
No, I mean $utf8 and $mystery.

-- rjbs
Tack this onto the program:

    print "U - $utf8\n";
    print "M - $mystery\n";

    utf8::encode($mystery);
    print "X - $mystery\n";

Then:

    ~$ perl utf8-flag | gcat -A
    c384c280$
    U - M-DM-^@$
    M - M-DM-^@$
    X - M-CM-^DM-BM-^@$
On 2015-07-04 14:54:23, SHAY wrote:
> That doesn't mean that libnet shouldn't use the is_utf8() function to
> determine whether to call encode() or not, though. As the "How Do I
> Know Whether My String Is In Unicode?" section of perluniintro says,
> it is quite correct to call utf8::is_utf8() to determine if a string
> is "in Unicode"
This document is really not all that great at actually explaining what it's talking about, at least in that excerpt.

It's talking about whether perl will apply Unicode semantics to your string. This was a big deal back before the unicode_strings feature, when sometimes $x=~/.../ would match according to Unicode semantics, and sometimes not. If the UTF8 flag was on, though, you'd be okay.

This is no longer relevant at all, with recent perl, but it was *never* relevant to determining whether the string contained, in the Perl programmer's world, UTF-8 octets or Unicode characters.

-- rjbs
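[Editor's note: a minimal sketch of the old semantics split being described, reproducible on a modern perl simply by leaving the unicode_strings feature off.]

    use strict;
    use warnings;    # note: no unicode_strings feature enabled

    my $s1 = my $s2 = "\x{e9}";  # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    utf8::downgrade($s1);        # byte storage: ASCII-only \w semantics
    utf8::upgrade($s2);          # wide storage: Unicode \w semantics

    print 'downgraded: ', ($s1 =~ /\w/ ? "word char\n" : "not a word char\n");
    print 'upgraded:   ', ($s2 =~ /\w/ ? "word char\n" : "not a word char\n");
    # Same codepoint, different match results -- "The Unicode Bug".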
On Sat Jul 04 14:54:23 2015, SHAY wrote:
> I disagree that the two strings (I assume you mean $text and $mystery)
> have identical codepoints or identical meaning
You are disagreeing with perl itself.

    use 5.012;
    use warnings;

    sub yn { $_[0] ? 'yes' : 'no' }

    my $str1 = my $str2 = "Mot\x{f6}rhead";

    utf8::downgrade($str1);
    utf8::upgrade($str2);

    say 'same content? ', yn $str1 eq $str2;
    say 'same length? ', yn length $str1 == length $str2;
    say 'identical elements? ', map { yn substr($str1,$_,1) eq substr($str2,$_,1) } 0 .. length $str1;
    say for $str1, $str2;

Output:

    same content? yes
    same length? yes
    identical elements? yesyesyesyesyesyesyesyesyesyes
    Mot�rhead
    Mot�rhead

Perl considers these strings identical in every respect: they compare equal, they have the same number of elements, they have the same elements, they are indistinguishable during I/O. It says these strings mean the exact same thing. But your code thinks otherwise, and does one thing when given one of these strings and another when given the other. Your code is wrong.

That’s the bug.
On Sat Jul 04 15:14:14 2015, RJBS wrote:
> Tack this onto the program:
>
>     print "U - $utf8\n";
>     print "M - $mystery\n";
>
>     utf8::encode($mystery);
>     print "X - $mystery\n";
>
> Then:
>
>     ~$ perl utf8-flag | gcat -A
>     c384c280$
>     U - M-DM-^@$
>     M - M-DM-^@$
>     X - M-CM-^DM-BM-^@$
Ok, what you wrote makes more sense now, knowing that you meant $utf8 and $mystery, not $text and $mystery, but now I'm confused. I get this output running it through "od -c", which makes more sense to me than "cat -A":

    0000000   U       -   304 200  \r  \n   M       -   304 200  \r  \n
    0000020   X       -   303 204 302 200  \r  \n
    0000032

$utf8 is the UTF-8 encoding of U+0100, being the two bytes 0xc4 and 0x80 (octal 304 and 200), and that's what has been printed.

$mystery is the two characters U+00C4 and U+0080 [i.e. the two bytes from above, but now treated as a character each due to the binary->text conversion that happened when the byte string was concatenated with a text string] stored as their four-byte UTF-8 encoding, and I'm surprised that it has printed those two characters as a single byte each (i.e. the same 0xc4 0x80 / 0304 0200 as above). Dump() shows the four bytes that it is stored as internally and shows that the CUR length of the internal storage is 4, i.e. it is definitely using the UTF-8 encoding internally even though these characters could be stored as single bytes, so I would have expected the four bytes of the UTF-8 encoding to be printed.

The encode()d $mystery is the UTF-8 encoding of U+00C4 and U+0080, i.e. the same four bytes as in $mystery but now with the UTF-8 flag off to indicate byte semantics rather than character semantics, and those four bytes are what has been printed.

So it's actually the printing of $mystery rather than encode($mystery) that confuses me! The "Unicode I/O" section of perluniintro says that, "writing out Unicode data produces raw bytes that Perl happens to use to internally encode the Unicode string," so why hasn't this happened here?

However, the case where encode() is used does what I would expect, outputting the bytes of the binary string, just like what happens in the $utf8 case, so I'm still not seeing the problem in libnet :-(
On Sat Jul 04 15:36:14 2015, RJBS wrote:
> On 2015-07-04 14:54:23, SHAY wrote:
> > That doesn't mean that libnet shouldn't use the is_utf8() function to
> > determine whether to call encode() or not, though. As the "How Do I
> > Know Whether My String Is In Unicode?" section of perluniintro says,
> > it is quite correct to call utf8::is_utf8() to determine if a string
> > is "in Unicode"
>
> This document is really not all that great at actually explaining what
> it's talking about, at least in that excerpt.
>
> It's talking about whether perl will apply Unicode semantics to your
> string. This was a big deal back before the unicode_strings feature,
> when sometimes $x=~/.../ would match according to Unicode semantics,
> and sometimes not. If the UTF8 flag was on, though, you'd be okay.
>
> This is no longer relevant at all, with recent perl, but it was
> *never* relevant to determining whether the string contained, in the
> Perl programmer's world, UTF-8 octets or Unicode characters.
The flag is used internally by perl in deciding whether to use byte or character semantics, and it seems useful for the same purpose in Perl code to me, e.g. Dump()s of $utf8, $mystery and encode($mystery) produce:

    PV = 0x27845b4 "\304\200"\0
    PV = 0x27c5820 "\303\204\302\200"\0 [UTF8 "\x{c4}\x{80}"]
    PV = 0x287b7dc "\303\204\302\200"\0

respectively. The first and last are binary strings which the programmer should be keeping track of the meaning of (e.g. they might be the UTF-8 encoding of some text, or they might be PNG image file data etc); the second one is identified by the UTF-8 flag as being a text string, so the four bytes which it is composed of specifically represent the UTF-8 encoding of two characters and it is valid to call encode() on it to encode it to a byte string for passing to external programs etc.

As I understand it, libnet is calling encode() to explicitly get UTF-8 byte strings from scalars that are identified as Unicode text strings, rather than relying on knowledge of what perl's internal format is and letting perl output the raw bytes of that internal format itself. This is in keeping with the "What is "the UTF8 flag"?" section of perlunifaq, which says, "It's better to pretend that the internal format is some unknown encoding, and that you always have to encode and decode explicitly."

(I don't know if it's necessarily correct for datasend() to be encoding to UTF-8, mind you, but it surely needs to encode to something to avoid "Wide character in print" warnings if it is ever asked to output text strings with codepoints > 0xFF. Or maybe that never happens anyway with the protocols in question? I haven't investigated that.)

(I also concede, however, that the same section of perlunifaq says, "don't think about the UTF8 flag at all. That means that you very probably shouldn't use is_utf8, _utf8_on or _utf8_off at all," which is all very confusing. How is the programmer supposed to know when to do this essential encoding/decoding if the flag indicating the internal UTF-8 state should never be examined?)
On Sat Jul 04 16:53:49 2015, ARISTOTLE wrote:
> On Sat Jul 04 14:54:23 2015, SHAY wrote:
> > I disagree that the two strings (I assume you mean $text and
> > $mystery) have identical codepoints or identical meaning
>
> You are disagreeing with perl itself.
(I was disagreeing that $text and $mystery are identical, but Ricardo has since pointed out that I misinterpreted him. He was actually saying that $utf8 and $mystery are identical.)
>
> use 5.012;
> use warnings;
>
> sub yn { $_[0] ? 'yes' : 'no' }
>
> my $str1 = my $str2 = "Mot\x{f6}rhead";
>
> utf8::downgrade($str1);
> utf8::upgrade($str2);
>
> say 'same content? ', yn $str1 eq $str2;
> say 'same length? ', yn length $str1 == length $str2;
> say 'identical elements? ', map { yn substr($str1,$_,1) eq
> substr($str2,$_,1) } 0 .. length $str1;
> say for $str1, $str2;
>
> Output:
>
> same content? yes
> same length? yes
> identical elements? yesyesyesyesyesyesyesyesyesyes
> Mot�rhead
> Mot�rhead
>
> Perl considers these strings identical in every respect: they
> compare equal, they have the same number of elements, they have the
> same elements, they are indistinguishable during I/O. It says these
> strings mean the exact same thing. But your code thinks otherwise, and
> does one thing when given one of these strings and another when given
> the other. Your code is wrong.
>
> That’s the bug.
[Minor disclaimer: This is not my code. I've just picked up maintenance of it, mainly for the purpose of making new CPAN releases containing simple bug-fixes, rather than leaving them sat on a pile in RT for ever. I'm trying to keep on top of new bug reports, but much of the libnet code is well beyond my understanding of the various protocols involved...]

Anyway, thank you for the example program. I see what the problem is now, and Ricardo's example with $utf8 and $mystery is making more sense too now :-)

However, I'm still confused why writing out the UTF-8 flagged string doesn't cause the raw bytes to be printed. The "Unicode I/O" section of perluniintro seems to promise that the bytes of the internal format (i.e. the UTF-8 encoding) would be written out, just like they are after a call to encode(). Is it just because the character with the double-byte UTF-8 encoding (U+00F6) can be represented in the single-byte encoding (and there are no other characters in the string that don't have a single-byte representation)? I haven't seen anything about this in any man page yet, but I note that if I append "\x{100}" (which has no single-byte representation) to $str2 then the single byte 0xf6 that was previously printed now appears as two bytes.

The other thing that worries me here is that if I do indeed drop the encode() call then we will presumably now get "Wide character in print" warnings from any $line given to datasend() that contains codepoints > 0xFF (if that can ever happen -- which is where my knowledge of libnet's workings starts to seriously fall down).

Also on the back of my mind: Why was the offending line added in the first place, and what bug will be reintroduced by removing it? The encode() call was added by

https://github.com/steve-m-hay/perl-libnet/commit/a0cf376daae1ea8e56fc5d2572e346e0074d465b

with the comment "Fix slow regexp in data when scalar passed has utf8 flag set"; the is_utf8() check was added by

https://github.com/steve-m-hay/perl-libnet/commit/a6dad2861af99ff15840cd3fb276e941dcab07ff

with the comment "Fix bug causing utf8 encoding of 8bit strings" (and tweaked further by 5c2de6eebac9b218dde22cbfa5f39d3c83c7cba4 and 35d28d72ef1f1493ff1dbe949f7a6daeff8fab44).

So it seems likely that removing the encode() could potentially have a performance impact in some cases, although getting the correct output is obviously more important. I wonder if the encode() should stay for strings that contain codepoints > 0xFF though?
On Sat Jul 04 20:55:47 2015, SHAY wrote:
> [Minor disclaimer: This is not my code. …]
I know. It was a rhetorical flourish; pardon my indulgence. (“Your code” in the sense that it’s yours to maintain, not in the sense that its wrongness is necessarily your doing.)
> However, I'm still confused why writing out the UTF-8 flagged string
> doesn't cause the raw bytes to be printed. The "Unicode I/O" section
> of perluniintro seems to promise that the bytes of the internal format
> (i.e. the UTF-8 encoding) would be written out, just like they are
> after a call to encode().
>
> Is it just because the character with the double-byte UTF-8 encoding
> (U+00F6) can be represented in the single-byte encoding (and there are
> no other characters in the string that don't have a single-byte
> representation)?
Basically. You put a U+00F6 in the string; so there is a 0xF6 in the output. There’s nothing complicated going on if you think of it that way. It only gets confusing if you start thinking that the internal representation matters.
> I haven't seen anything about this in any man page yet, but I note
> that if I append "\x{100}" (which has no single-byte representation)
> to $str2 then the single byte 0xf6 that was previously printed now
> appears as two bytes.
Yes. Obviously a U+0100 cannot be output as a byte, but you just asked perl to do that, so what now? Does it throw an exception? Out of the question in Perl. Does it encode just the one character? That would be crazy. For better or for worse, it encodes the entire string and outputs that… while whining at you for asking it to do something that makes no sense.

Unfortunate, in that it makes the string model confusing in a way that it oughtn’t be, but alas, there is really no better option. Because the fundamental fact of the matter is that you asked it to do something silly. If you want to output text data, you have to encode it yourself, because perl cannot know what encoding is the correct one under your circumstances. If forced to make a choice anyway, it falls back to UTF8 as a last resort, which seems about as reasonable as it can pick under fundamentally unreasonable circumstances. But is that choice the right one? Well who knows. And neither can `datasend` know, under the same circumstances.
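[Editor's note: a small sketch of that fallback, using an in-memory filehandle for convenience.]

    use strict;
    use warnings;

    open my $fh, '>', \my $buf or die $!;
    print {$fh} "\x{f6}\x{100}";   # warns: Wide character in print
    close $fh;

    printf "%s\n", unpack 'H*', $buf;  # c3b6c480: the whole string went out as UTF-8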
> The other thing that worries me here is that if I do indeed drop the
> encode() call then we will presumably now get "Wide character in
> print" warnings from any $line given to datasend() that contains
> codepoints > 0xFF
Indeed. Which you *want*! Because the caller gave you data that needs to be encoded, and you can’t know what the right encoding is. There is no correct thing for you to do at that point. That one’s on your caller.
> Also on the back of my mind: Why was the offending line added in the
> first place, and what bug will be reintroduced by removing it? The
> encode() call was added by
>
> https://github.com/steve-m-hay/perl-libnet/commit/a0cf376daae1ea8e56fc5d2572e346e0074d465b
>
> with the comment "Fix slow regexp in data when scalar passed has utf8
> flag set"; the is_utf8() check was added by
>
> https://github.com/steve-m-hay/perl-libnet/commit/a6dad2861af99ff15840cd3fb276e941dcab07ff
>
> with the comment "Fix bug causing utf8 encoding of 8bit strings" (and
> tweaked further by 5c2de6eebac9b218dde22cbfa5f39d3c83c7cba4 and
> 35d28d72ef1f1493ff1dbe949f7a6daeff8fab44).
>
> So it seems likely that removing the encode() could potentially have a
> performance impact in some cases, although getting the correct output
> is obviously more important.
Ah. Well the `encode` added in a0cf376daae1ea was the wrong fix because it *changes the meaning of the string*. All the patches that followed were attempts to duct-tape the fallout of that fundamental misstep.

What Graham really wanted in a0cf376daae1ea was `utf8::downgrade`. That one converts a string to UTF8=off format in-place, if possible. It does that, in terms of the internal representation, by not only turning the flag off if it was on, but also decoding any multibyte characters in the string buffer to single byte. Because it does both at the same time, the meaning of the string ends up not changing. Of course that only works for multibyte characters in the U+0080 … U+00FF range. If there are any above that range in the string, then the downgrade fails. Which implies that the caller asked you to do something silly, so at that point you carp “Wide character” at them.

Note that there has been significant effort to make the regexp engine faster on UTF8=on strings. It’s possible that downgrading the string doesn’t help at all nowadays and you can just drop the whole thing. Without benchmarks I won’t venture an opinion about this.
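[Editor's note: a short sketch of utf8::downgrade's behaviour as described, including the FAIL_OK form.]

    use strict;
    use warnings;

    my $ok = "Mot\x{f6}rhead";   # all codepoints <= 0xFF
    utf8::upgrade($ok);          # stored wide, UTF8 flag on
    utf8::downgrade($ok);        # succeeds: flag off, meaning unchanged

    my $wide = "snowman \x{2603}";        # contains a codepoint > 0xFF
    if (!utf8::downgrade($wide, 1)) {     # FAIL_OK: returns false instead of dying
        warn "Wide character: caller passed unencoded text\n";
    }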
> I wonder if the encode() should stay for strings that contain
> codepoints > 0xFF though?
Possibly. Encoding changes the meaning of the string, so the question is, what regexp matches were supposed to be sped up by it? If they involve patterns that contain characters outside the ASCII range, then encoding the string is a bug under any circumstances. If they do not, then you could encode the string after whining at the caller for passing you nonsense.

But note that this will miss UTF8=on strings that contain no such codepoints; if there is a speed-up that matters here then those are more important, and for them you want to downgrade, not encode. And that will automatically tell you whether the string *could* be downgraded, whereupon you can encode-after-whining, or just leave be.

Note that an encode-after-failed-downgrade strategy means two passes over the string buffer, so it may hurt performance rather than help… although it will also only happen to nonsense data passed by a caller that needs fixing. (Again, no benchmarks, no idea.)

(Just to be clear: personally I would at most do a FAIL_OK downgrade attempt and if that fails, shrug.)
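[Editor's note: sketched out, the encode-after-whining strategy above might look like this; prepare_line() is a hypothetical helper, not proposed libnet API.]

    use strict;
    use warnings;
    use Carp qw(carp);

    sub prepare_line {
        my ($line) = @_;
        if (!utf8::downgrade($line, 1)) {       # cheap FAIL_OK attempt first
            carp 'Wide character in datasend';  # whine at the caller...
            utf8::encode($line);                # ...then encode (a second pass)
        }
        return $line;
    }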
On Sun Jul 05 10:15:00 2015, ARISTOTLE wrote: [...]
> (Just to be clear: personally I would at most do a FAIL_OK downgrade
> attempt and if that fails, shrug.)
Thank you for a most informative and helpful reply. I will investigate what performance problems there were/are and thus decide on the best course of action, likely involving the use of downgrade as you suggest. I will try to get this done this week.
On Thu May 14 11:07:36 2015, RJBS wrote:
> The fix, here, is to assume all input is bytes and encode nothing. If
> a string contains a wide character, a warning should be issued.
>
> Effect:
>
> * on strings of [\x00-\x7F]: no change
> * on strings of [\x00-\xFF], SvUTF8 on:
>   * if these were UTF-8 octets: bugfix
>   * if these were Latin-1 octets: bugfix
>   * if these were Unicode characters: breaking change
> * on strings with [\x100-Inf]:
>   * will issue "wide string" warning, otherwise no change
>
> Then, the documentation needs to be reviewed to state that the input
> to datasend is octets for the wire.
Ricardo: Returning to your original report, I still have two problems in my mind:

1. I don't think the input to datasend() should be assumed or documented to be bytes / octets for the wire. My understanding of text/binary strings and the need to encode/decode is that one decodes input from the OS (e.g. when reading from a filehandle) and encodes output to the OS (e.g. when writing to a filehandle), so the encode (or actually downgrade, as we've agreed in this case) should happen at the point of output -- i.e. inside datasend(). I don't think it is the caller's responsibility to encode/downgrade their text strings into byte streams before passing them to libnet. If the caller has

    PV = 0x26244b8 "\303\204"\0 [UTF8 "\x{c4}"]

in a string then they should be able to pass that into datasend() and datasend() will downgrade it to

    PV = 0x26246c8 "\304"\0

prior to output; the caller shouldn't be having to do that themselves.

2. I'm still confused by your description of the effect of the change. The [\x00-\x7F] and [\x100-Inf] cases make sense, but why have you subdivided the "[\x00-\xFF], SvUTF8 on" case into three sub-cases ("UTF-8 octets", "Latin-1 octets" and "Unicode characters")? If the SvUTF8 flag is on then doesn't that mean that the string consists of (Unicode) characters stored internally in UTF-8? I don't see how there's three separate sub-cases here, two of which you say are bugfixes and the other a breaking change.
Subject: Re: [rt.cpan.org #104433] datasend corrupts input with abuse of is_utf8
Date: Wed, 8 Jul 2015 16:43:00 -0400
To: Steve Hay via RT <bug-libnet [...] rt.cpan.org>
From: Ricardo Signes <rjbs [...] cpan.org>
* Steve Hay via RT <bug-libnet@rt.cpan.org> [2015-07-06T09:27:45]
> Ricardo: Returning to your original report, I still have two problems in my
> mind:
>
> 1. I don't think the input to datasend() should be assumed or documented to
> be bytes / octets for the wire.
I disagree. I -do- agree that the encoding should be done as close to the border as possible, but I submit that in this case, the "as close as possible" is not inside datasend(). The way in which data must be encoded in network protocols is protocol specific, and libnet can't know what to do.

That is, you can't just say "assume it's text and UTF-8 encode it." If it's SMTP, for example, you can't send more than 998 octets on a line without a CRLF. Just UTF-8 encoding will change where the line breaks are needed. Similarly, a user could be using 8BITMIME to send a message with Latin-1 characters in its body. Worse, it could contain two parts, one that's Latin-1 and one that is UTF-8.

The data being sent over the wire has to be binary, and libnet cannot correctly make it binary. It is too potentially complex.
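[Editor's note: a quick sketch of the line-length point -- encoding changes octet counts, so libnet cannot blindly encode and still honour protocol limits.]

    use strict;
    use warnings;
    use Encode qw(encode_utf8);

    my $line = "\x{e9}" x 600;    # 600 characters; fits SMTP's 998-octet limit?
    printf "characters: %d\n", length $line;               # 600
    printf "octets:     %d\n", length encode_utf8($line);  # 1200 -- too long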
> I don't
> think it is the caller's responsibility to encode/downgrade their text
> strings into byte streams before passing them to libnet.
It has to be, because of what I said above.

Beyond that, though, you *must not* confuse the kind of thing that encode and downgrade do. They are *RADICALLY* different.

Given a string with no codepoints above 0xFF, you can upgrade/downgrade it all the live long day and the Perl programmer *should never ever know* unless they break out something naughty that inspects the scalar's flags. Multiply upgrading or downgrading is entirely idempotent after the first time.

*** If upgrade or downgrade changes how your code behaves, there is a bug
*** somewhere.

All upgrade/downgrade do is change the way that the perl runtime engine stores the string contents, **not** how the perl programming language handles the string. (All instances where this is not true are bugs, 99.99% sorted out by v5.14 and the unicode_strings feature, which was only added because a simple bugfix would have broken code where people came to rely on the bugs. This is the entity known as the always-capitalized "The Unicode Bug.")

encode, on the other hand, changes the elements in the string. If you encode something more than once, you get mojibake. If you decode it more than once… well, if that works, it was probably mojibake to begin with. :)
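[Editor's note: a sketch contrasting the two, as described above -- upgrade/downgrade churn storage only, encode changes the elements.]

    use strict;
    use warnings;

    my $s    = "na\x{ef}ve";   # 5 characters, all <= 0xFF
    my $copy = $s;

    utf8::upgrade($copy);      # storage changes...
    utf8::downgrade($copy);
    utf8::upgrade($copy);
    print $s eq $copy ? "still the same string\n" : "BUG\n";

    utf8::encode($copy);       # encode changes the elements themselves
    printf "%d vs %d elements\n", length $s, length $copy;   # 5 vs 6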
> If the caller has "PV = 0x26244b8 "\303\204"\0 [UTF8 "\x{c4}"]" in a string
> then they should be able to pass that into datasend() and datasend() will
> downgrade it to "PV = 0x26246c8 "\304"\0" prior to output; the caller
> shouldn't be having to do that themselves.
The only reason to downgrade is because maybe perl can do some operation faster on a downgraded string. Personally, I am dubious, especially when we count in the overhead of downgrading. We've done a lot of improving the speed of operations on wide-format-storage strings.

So, the caller should never have to upgrade or downgrade. Having to call upgrade or downgrade *anywhere* is the admission that there is a bug and you don't want to deal with it. It makes *just a little* sense to use this before calling XS code that will only work if the string has the SvUTF8 flag set and is too stupid to deal with the problem on its own.
> 2. I'm still confused by your description of the effect of the change. The
> [\x00-\x7F] and [\x100-Inf] cases make sense, but why have you subdivided the
> "[\x00-\xFF], SvUTF8 on" case into three sub-cases ("UTF-8 octets", "Latin-1
> octets" and "Unicode characters")? If the SvUTF8 flag is on then doesn't that
> mean that the string consists of (Unicode) characters stored internally in
> UTF-8? I don't see how there's three separate sub-cases here, two of which
> you say are bugfixes and the other a breaking change.
So, what I was saying was that right now a user is sending in one of these cases of input:

1   strings of [\x00-\x7F]
2   strings of [\x00-\xFF], SvUTF8 on
    2a  if they are UTF-8 octets
    2b  if they are Latin-1 octets
    2c  if they are Unicode characters
3   strings with [\x100-Inf]

You agree with me on cases (1) and (3) and are asking why I have subdivided (2). Let me quote you again:
> If the SvUTF8 flag is on then doesn't that mean that the string consists of
> (Unicode) characters stored internally in UTF-8?
No. A million times no. This is the absolute core of the problem.

A string, even one with SvUTF8, can only be known to be a sequence of integers exposed in Perl space as the length-1 substrings of the string. They might be Unicode codepoints, but they also might be octets that just happen to be stored in the wide internal representation. Example code:

    open my $fh, '<', "file-of-latin-3.txt";
    my $line = <$fh>;
    utf8::upgrade($line);

Calling utf8::upgrade did not turn that text into Unicode characters. It just meant that it took up more memory than was strictly needed. :)

This is not the only way that you'd end up with "bytes in an upgraded string". There are other ways that are not contrived. But it is possible. It is even reasonable. It's also something that absolutely happens in practice.

Then we go back to my three-star line above:

*** If upgrade or downgrade changes how your code behaves, there is a bug
*** somewhere.

By calling encode() only when the string has been upgraded (i.e., is SvUTF8) then that is a bug. Calling downgrade should not introduce bugs, but I advise against it anyway, unless there's still really good evidence that it improves performance. Frankly, I'm dubious, but that's just my gut. I haven't done any measuring on this front. Nobody has any relevant measurements, at this point.

Finally, as to why I gave two different outcomes for the cases 2a, 2b, 2c:

2a -- If a user is passing in an upgraded string containing UTF-8 octets (that is, the integers in the string form a sequence of 8-bit values that are valid UTF-8 and meant to be interpreted as such) then dropping the "encode if upgraded" line fixes a bug. The bug is that right now, libnet would encode those octets, resulting in mojibake.

2b -- Same thing, exactly, goes for Latin-1, except that you get different-looking mojibake.

2c -- If users are passing in Unicode strings *and expecting to get UTF-8 sent out* then this will break stuff.

How do 2b and 2c differ? Well, that's established by the context. Let's pretend it's an SMTP transmission and I roughly do one of these:

CASE 2B:

    $q = "Queensrÿche"; # contains \xFF, is upgraded
    $smtp->datasend("Content-type: text/plain; charset=Latin-1\n\n"); # <--
    $smtp->datasend($q); # contains \xFF, upgraded string

CASE 2C:

    $q = "Queensrÿche"; # contains \xFF, is upgraded
    $smtp->datasend("Content-type: text/plain; charset=UTF-8\n\n"); # <--
    $smtp->datasend($q);

The code snippets differ only on the marked-with-arrow lines.

The current behavior will encode that first send, turning the eleven octets of Latin-1 into 12 octets of UTF-8. The receiving user will then try to display the UTF-8 as Latin-1 (as instructed by the MIME header) and will see mojibake. This is the bugfix for 2b. I have hit this bug in production. (This is also the same bug as in 2a, but you get different mojibake.)

The current behavior will *also* encode that second send as it goes through, so the receiving SMTP server will see 12 octets of UTF-8 come in, even though the string contains 11 integers, one of them being 0xFF. That code would work, although I think it's quite unlikely, given that the rest of the context of the SMTP transaction would need to be at least a little lucky for this to work. (You'd need to put Unicode strings in your body, but mark it as UTF-8, but not in your headers, which can't be marked that way, and, and, and...) This is the breaking change in 2c. I think it's likely to be vanishingly rare.

-- rjbs

Subject: Re: [rt.cpan.org #104433] datasend corrupts input with abuse of is_utf8
Date: Thu, 9 Jul 2015 21:37:13 -0400
To: Steve Hay via RT <bug-libnet [...] rt.cpan.org>
From: Ricardo Signes <rjbs [...] cpan.org>
* Steve Hay via RT <bug-libnet@rt.cpan.org> [2015-07-09T20:57:41]
> All the pennies have dropped now
I am glad we got here! Perl doesn't make this stuff easy. :(
> > The data being sent over the wire has to be binary, and libnet cannot
> > correctly make it binary. It is too potentially complex.
>
> Ok, agreed. I will update documentation accordingly.
Cool.
> I still mean to investigate what performance issues there might have been (by
> using an old libnet prior to the encode() call having been added and old
> perls from around the time that was done) and whether they still exist today.
>
> In theory, libnet works with 5.8.1+ so it may be desirable to switch encode()
> to downgrade() for older perls at least, but it will depend on whether I can
> uncover what performance problem there was and if/when it went away.
Cool. I've gotta say, I'll be surprised to hear it's worth it, but I look forward to hearing what you find!
> There is a subtle point of confusion here, which I think is what kept
> tripping me up: If the SvUTF8 flag is on then *internally* the string surely
> has UTF-8 bytes, otherwise it's in a very messed-up state.
Right. Except when we decode them and have them in Perl space, we throw away their Unicode meanings and consider them like numbers. It's weird, but every time we pretend it's not the case, we end up sorry later.

I have some code that uses >0xFF string elements for non-Unicode purposes, representing codepoints in a non-Unicode character repertoire. Getting that code working was a helpful exercise.
> "Bytes in an upgraded string" is a nice way of phrasing this subtle issue.
*Anything* to avoid saying SVUTF8. I wish that had been named something else!
> This all puts me somewhat in favour of your request for "explicit string
> semantics"
> (http://www.nntp.perl.org/group/perl.perl5.porters/2015/06/msg228670.html)!
> :-)
Exactly! :)

Good luck in your research, thanks for your time, and happy hacking!

-- rjbs

On Thu Jul 09 21:37:30 2015, RJBS wrote:
> * Steve Hay via RT <bug-libnet@rt.cpan.org> [2015-07-09T20:57:41]
> > I still mean to investigate what performance issues there might have
> > been (by using an old libnet prior to the encode() call having been
> > added and old perls from around the time that was done) and whether
> > they still exist today.
> >
> > In theory, libnet works with 5.8.1+ so it may be desirable to switch
> > encode() to downgrade() for older perls at least, but it will depend
> > on whether I can uncover what performance problem there was and
> > if/when it went away.
>
> Cool. I've gotta say, I'll be surprised to hear it's worth it, but I
> look forward to hearing what you find!
Well, I've reproduced a possible example of what the problem might have been, but I don't see any way of knowing for sure what the exact problem actually was, of course.

Consider the following (very contrived) program:

    use strict;
    use warnings;
    use Devel::Peek qw(Dump);
    use Time::HiRes qw(time);

    my $str = join('', ('a' .. 'z')) . "\n";
    utf8::upgrade($str) if $] >= 5.008;
    Dump($str);
    $str .= $str for (1 .. 15);

    my $time = time;
    $str =~ s/\015?\012(\.?)/\015\012$1$1/sg;
    my $diff = time - $time;
    print "Time: $diff\n";

The substitution there is one taken from datasend(), shortly after where the encode() is done.

On my machine, the time output at the end (i.e. the time taken for the substitution to be done on the long UTF8-flagged string) is around 29 seconds for every perl-5.8 from 5.8.1 to 5.8.8 and for every perl-5.9 from 5.9.0 to 5.9.4. Weirdly the time then jumps up to a crazy 90 seconds for 5.9.5, but the problem is then almost fixed in the next release -- 5.10.0 -- with a time of 1.5 seconds. After that the problem goes away: 5.8.9, 5.10.1 and 5.11.0 onwards all have the time down to around 0.1 seconds or less.

Chronologically by release date the picture is:

    5.8.8  (2006 Jan 31)  29s
    5.9.4  (2006 Aug 15)  29s
    5.9.5  (2007 Jul 07)  90s
    5.10.0 (2007 Dec 18)  1.5s
    5.8.9  (2008 Dec 14)  0.1s
    5.10.1 (2009 Aug 22)  0.1s
    5.11.0 (2009 Oct 02)  0.1s

I can't imagine what happened in 5.9.5, and I haven't pin-pointed which change(s) after that eventually fixed the problem; maybe something to do with UTF-8 caching?

Anyway, all versions of perl from 5.8.1 onwards have times of 0.1 seconds or less if the utf8::upgrade() is removed from the program, so I think I will drop the is_utf8() test and change the encode() to a downgrade() as discussed before -- but only for $] < 5.010001 (and maybe making an exception for 5.8.9, which works fine without the downgrade), and obviously dropping the whole thing for $] >= 5.010001.

(Btw, 5.6.2, which doesn't have the UTF8 flag, reports times of more like 0.01s... It's amazing just how fast some old versions are if they're able to do what you want to do!)
On Tue Jul 14 04:07:02 2015, SHAY wrote:
> I think I will drop the is_utf8() test and change the encode() to a
> downgrade() as discussed before -- but only for $] < 5.010001 (and
> maybe making an exception for 5.8.9, which works fine without the
> downgrade), and obviously dropping the whole thing for $] >= 5.010001.
Work-in-progress patch attached (as a diff against my current Github repo). Please shout sooner rather than later if you think I'm heading in the wrong direction with this! :-)
Subject: encode.patch
diff --git a/lib/Net/Cmd.pm b/lib/Net/Cmd.pm
index cec44bf..3bf5ec6 100644
--- a/lib/Net/Cmd.pm
+++ b/lib/Net/Cmd.pm
@@ -2,7 +2,7 @@
 #
 # Versions up to 2.29_1 Copyright (c) 1995-2006 Graham Barr <gbarr@pobox.com>.
 # All rights reserved.
-# Changes in Version 2.29_2 onwards Copyright (C) 2013-2014 Steve Hay. All
+# Changes in Version 2.29_2 onwards Copyright (C) 2013-2015 Steve Hay. All
 # rights reserved.
 # This module is free software; you can redistribute it and/or modify it under
 # the same terms as Perl itself, i.e. under the terms of either the GNU General
@@ -27,21 +27,6 @@ BEGIN {
   }
 }
 
-BEGIN {
-  if (!eval { require utf8 }) {
-    *is_utf8 = sub { 0 };
-  }
-  elsif (eval { utf8::is_utf8(undef); 1 }) {
-    *is_utf8 = \&utf8::is_utf8;
-  }
-  elsif (eval { require Encode; Encode::is_utf8(undef); 1 }) {
-    *is_utf8 = \&Encode::is_utf8;
-  }
-  else {
-    *is_utf8 = sub { $_[0] =~ /[^\x00-\xff]/ };
-  }
-}
-
 our $VERSION = "3.07";
 our @ISA = qw(Exporter);
 our @EXPORT = qw(CMD_INFO CMD_OK CMD_MORE CMD_REJECT CMD_ERROR CMD_PENDING);
@@ -429,9 +414,17 @@ sub datasend {
   my $arr = @_ == 1 && ref($_[0]) ? $_[0] : \@_;
   my $line = join("", @$arr);
 
-  # encode to individual utf8 bytes if
-  # $line is a string (in internal UTF-8)
-  utf8::encode($line) if is_utf8($line);
+  # Perls < 5.10.1 (with the exception of 5.8.9) have a performance problem with
+  # the substitutions below when dealing with strings stored internally in
+  # UTF-8, so downgrade them (if possible).
+  # Data passed to datasend() should be encoded to octets upstream already so
+  # shouldn't even have the UTF-8 flag on to start with, but if it so happens
+  # that the octets are stored in an upgraded string (as can sometimes occur)
+  # then they would still downgrade without fail anyway.
+  # Only Unicode codepoints > 0xFF stored in an upgraded string will fail to
+  # downgrade. We fail silently in that case, and a "Wide character in print"
+  # warning will be emitted later by syswrite().
+  utf8::downgrade($line, 1) if $] < 5.010001 && $] != 5.008009;
 
   return 0 if $cmd->_is_closed;
 
@@ -722,6 +715,8 @@ is pending then C<CMD_PENDING> is returned.
 
 Send data to the remote server, converting LF to CRLF. Any line starting
 with a '.' will be prefixed with another '.'.
 C<DATA> may be an array or a reference to an array.
+The C<DATA> passed in must be encoded by the caller to octets of whatever
+encoding is required, e.g. by using the Encode module's C<encode()> function.
 
 =item dataend ()
 
@@ -794,6 +789,9 @@ Unget a line of text from the server.
 
 Send data to the remote server without performing any conversions. C<DATA>
 is a scalar.
+As with C<datasend()>, the C<DATA> passed in must be encoded by the caller
+to octets of whatever encoding is required, e.g. by using the Encode module's
+C<encode()> function.
 
 =item read_until_dot ()
 
diff --git a/t/datasend.t b/t/datasend.t
index 3a97c4b..3c11cf5 100644
--- a/t/datasend.t
+++ b/t/datasend.t
@@ -158,3 +158,10 @@ check(
   "a\015\012..\015\012.\015\012",
 );
 
+# Test that datasend() plays nicely with bytes in an upgraded string,
+# even though the input should really be encode()d already.
+check(
+  substr("\x{100}", 0, 0) . "\x{e9}",
+
+  "\x{e9}\015\012.\015\012"
+);
diff --git a/t/pod_coverage.t b/t/pod_coverage.t
index 9cb64c2..3d674d4 100644
--- a/t/pod_coverage.t
+++ b/t/pod_coverage.t
@@ -7,7 +7,7 @@
 # Test script to check POD coverage.
 #
 # COPYRIGHT
-# Copyright (C) 2014 Steve Hay. All rights reserved.
+# Copyright (C) 2014, 2015 Steve Hay. All rights reserved.
 #
 # LICENCE
 # This script is free software; you can redistribute it and/or modify it under
@@ -48,7 +48,7 @@ MAIN: {
   my $params = { coverage_class => qw(Pod::Coverage::CountParents) };
   pod_coverage_ok('Net::Cmd', {
     %$params,
-    also_private => [qw(is_utf8 toascii toebcdic set_status)]
+    also_private => [qw(toascii toebcdic set_status)]
   });
   pod_coverage_ok('Net::Config', {
     %$params,
Subject: Re: [rt.cpan.org #104433] datasend corrupts input with abuse of is_utf8
Date: Tue, 14 Jul 2015 09:25:28 -0400
To: Steve Hay via RT <bug-libnet [...] rt.cpan.org>
From: Ricardo Signes <rjbs [...] cpan.org>
* Steve Hay via RT <bug-libnet@rt.cpan.org> [2015-07-14T09:16:56]
> Work-in-progress patch attached (as a diff against my current Github repo).
> Please shout sooner rather than later if you think I'm heading in the wrong
> direction with this! :-)
Looks great!

-- rjbs

On Tue Jul 14 09:25:49 2015, RJBS wrote:
> * Steve Hay via RT <bug-libnet@rt.cpan.org> [2015-07-14T09:16:56]
> > Work-in-progress patch attached (as a diff against my current Github
> > repo). Please shout sooner rather than later if you think I'm heading
> > in the wrong direction with this! :-)
>
> Looks great!
Thanks. Now committed:

https://github.com/steve-m-hay/perl-libnet/commit/20056b26e77c3a0874195d8286538e83ff950004

I will roll a new release very soon -- just going to look at a couple of pending pull requests first...
Fixed in 3.07, now on CPAN.
On Sun Jul 05 10:15:00 2015, ARISTOTLE wrote:
> What Graham really wanted in a0cf376daae1ea was `utf8::downgrade`.
> That one converts a string to UTF8=off format in-place, if possible.
> It does that, in terms of the internal representation, by not only
> turning the flag off if it was on, but also decoding any multibyte
> characters in the string buffer to single byte. Because it does both
> at the same time, the meaning of the string ends up not changing. Of
> course that only works for multibyte characters in the U+0080 … U+00FF
> range. If there are any above that range in the string, then the
> downgrade fails. Which implies that the caller asked you to do
> something silly, so at that point you carp “Wide character” at them.
`utf8::downgrade` is NOT what one wants; using `utf8::encode` instead IS correct. Just read the docs:
> (Since Perl v5.8.0) Converts in-place the internal representation of
> the string from UTF-8 to the equivalent octet sequence in the native
> encoding (Latin-1 or EBCDIC).
vs.
> (Since Perl v5.8.0) Converts in-place the character sequence to the
> corresponding octet sequence in UTF-8.
Both do nearly the same thing; you cannot say `downgrade` is correct and use that as the argument for `encode` being incorrect -- that doesn't make any sense. The only difference between the two is the target charset used to create bytes, and `downgrade` IS unreliable, because it uses some weird "native encoding" which no one knows about in the current environment. `encode` instead is guaranteed to produce UTF-8 BYTES and can therefore deal with arbitrary strings. And that's why you have the restriction of `downgrade` only supporting characters between U+0080 and U+00FF: because the target encoding doesn't support more.
On Tue Jun 12 12:55:54 2018, tschoening@am-soft.de wrote:
> On Sun Jul 05 10:15:00 2015, ARISTOTLE wrote:
> > What Graham really wanted in a0cf376daae1ea was `utf8::downgrade`.
> > That one converts a string to UTF8=off format in-place, if possible.
> > It does that, in terms of the internal representation, by not only
> > turning the flag off if it was on, but also decoding any multibyte
> > characters in the string buffer to single byte. Because it does both
> > at the same time, the meaning of the string ends up not changing. Of
> > course that only works for multibyte characters in the U+0080 …
> > U+00FF range. If there are any above that range in the string, then
> > the downgrade fails. Which implies that the caller asked you to do
> > something silly, so at that point you carp “Wide character” at them.
>
> `utf8::downgrade` is NOT what one wants; using `utf8::encode` instead
> IS correct. Just read the docs:
> > (Since Perl v5.8.0) Converts in-place the internal representation of
> > the string from UTF-8 to the equivalent octet sequence in the native
> > encoding (Latin-1 or EBCDIC).
>
> vs.
>
> > (Since Perl v5.8.0) Converts in-place the character sequence to the
> > corresponding octet sequence in UTF-8.
>
> Both do nearly the same thing
Nearly.

    use v5.10;   # for say()

    $a = $b = chr 0x10FFF;
    say 'before: length $a == ', length $a;
    say 'before: length $b == ', length $b;
    utf8::encode($a);
    utf8::downgrade($b, 1);
    say 'after: length $a == ', length $a;
    say 'after: length $b == ', length $b;

Output:

    before: length $a == 1
    before: length $b == 1
    after: length $a == 4
    after: length $b == 1

One of these outputs is correct. One of them is not. *Which* one is correct depends on what semantics you need. But it’s always the case that one of them is correct and the other is incorrect. In this case, utf8::encode is the wrong one.
> You cannot say `downgrade` is correct and then use that as the
> argument for `encode` being incorrect; that doesn't make any sense.
Of course it makes sense. You yourself say the functions aren’t exactly the same, only nearly. And a near miss is still a miss. The functions may be *nearly* the same, but the correctness of which one you choose hinges precisely on that one bit of difference between them.
> The only difference between the two is the target
> charset used to create the bytes, and `downgrade` IS unreliable,
> because it uses some weird "native encoding" that no one knows about
> in the current environment.
It simply keeps the encoding unchanged. Reliably. Perl code sees no difference before and after downgrading a string (unless it actively tries – which it generally shouldn’t).
> `encode`, instead, is guaranteed to produce UTF-8
> BYTES and can therefore deal with arbitrary strings.
And for that reason it’s guaranteed to double-encode already-encoded strings. So you cannot use it if your API expects already-encoded strings. You can only use downgrade… as a workaround, if you need to call a badly designed API yourself. (If you don’t call badly designed APIs, you won’t need to downgrade the string, because downgrading doesn’t change its meaning on the Perl side.)

If your API expects decoded strings and you need to write bytes, then you *must* use utf8::encode (or equivalent Encode.pm functions) (assuming your wire/file format expects UTF-8).

If your API tries to say “you can give me either encoded or decoded strings and I’ll do the right thing”, but your API doesn’t also require the caller to say which kind the string is, then you lose: you are not asking for enough information from the caller, so you don’t know which output from my code example above would be the correct one. You cannot find that out just by looking at the string; the caller must tell you.

If you do not say which kind of string you expect, and you do not make the caller tell you, then your code will always do the wrong thing in some circumstance, and every attempt to fix the bug will only create a different bug.
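To make the double-encoding concrete, a minimal sketch (the hex values in the comments assume the usual UTF-8 encoding of U+00E9):

    use strict;
    use warnings;
    use Encode qw(encode);

    my $chars = "\x{E9}";                 # a decoded string: one character, U+00E9
    my $bytes = encode('UTF-8', $chars);  # already-encoded octets: 0xC3 0xA9

    utf8::encode($chars);  # correct on a decoded string: now 0xC3 0xA9
    utf8::encode($bytes);  # double-encodes: now 0xC3 0x83 0xC2 0xA9

    print join(' ', map { sprintf '%02X', ord } split //, $chars), "\n";  # C3 A9
    print join(' ', map { sprintf '%02X', ord } split //, $bytes), "\n";  # C3 83 C2 A9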
> And that's why `downgrade` has the restriction of only supporting
> characters between U+0080 and U+00FF: its target encoding doesn't
> support more.
Downgrading doesn’t have a target encoding. It keeps the string in the encoding it was already in. Again: Perl code sees no difference before and after downgrading a string (unless it actively tries – which it generally shouldn’t).
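A minimal sketch of that invariance (assuming any perl >= 5.8; the two copies differ only in internal representation):

    use strict;
    use warnings;

    my $s = "caf\x{E9}";
    my $t = $s;

    utf8::upgrade($s);     # force the UTF8=on representation
    utf8::downgrade($t);   # force the UTF8=off representation

    print $s eq $t ? "equal\n" : "different\n";                           # equal
    print length($s) == length($t) ? "same length\n" : "different\n";     # same length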
On Tue Jun 12 19:34:37 2018, ARISTOTLE wrote:
> utf8::downgrade($b, 1);
Using `1` here hides the fact that what you are doing is simply wrong:
> Fails if the original UTF-8 sequence cannot be represented in the native 8 bit encoding. On failure dies or, if the value of $fail_ok is true, returns false.
https://perldoc.perl.org/utf8.html

Without `1`, the following warning is printed and the call dies, which makes sense, because your Unicode character cannot be represented in the `native` encoding, which is Latin-1 as documented.
> Wide character in subroutine entry[...]
With `1`, `downgrade` simply does nothing on failure: it keeps your character string, including its present UTF-8 flag, as is; you can check that in your case using `is_utf8`. While that might work sometimes, it is wrong, because the result of `downgrade` should be a byte array instead of a character string, as documented. Garbage in, garbage out, and `die`ing is the default for a good reason. It all makes sense if you think about it.
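For reference, a minimal sketch of the failure mode under discussion (the die message is the one quoted above):

    use strict;
    use warnings;

    my $wide = chr 0x10FFF;   # cannot be represented in one octet

    # with $fail_ok true, a failed downgrade returns false and leaves
    # the string untouched
    my $ok = utf8::downgrade($wide, 1);
    print $ok ? "downgraded\n" : "failed; string unchanged\n";
    print length($wide), "\n";   # still 1

    # without $fail_ok, the same call dies:
    #   Wide character in subroutine entry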
> One of these outputs is correct. One of them is not. *Which* one is
> correct depends on what semantics you need.
No. Using `downgrade` on arbitrary Unicode characters is always wrong, and the fact that you need to disable error checks to make it output random garbage shows exactly that. You are violating its documented contract that way.
> In this case, utf8::encode is the wrong one.
You are wrong, of course: `encode` is the correct one, because it is able to encode arbitrary Unicode characters into a UTF-8 encoded byte array without losing any data. And, again, the fact that `encode` works while `downgrade` doesn't by default proves that.
> Of course it makes sense. You yourself say the functions aren’t
> exactly the same, only nearly.
The difference is that `encode` properly works with arbitrary Unicode characters and `downgrade` doesn't, and that is what you have proven yourself.
> It simply keeps the encoding unchanged. Reliably.
Because you are disabling error checks on purpose, which is the wrong thing to do. With the default behaviour, your call to `downgrade` would `die` to tell you that you are doing things wrong. Additionally, as stated before, read the docs of `downgrade` about its contract: the result should be a byte array with the UTF-8 flag off, which is not the case in your example, because you wrongly accept arbitrary errors.
> Can be used to make sure that the UTF-8 flag is off, e.g. when you want to make sure that the substr() or length() function works with the usually faster byte algorithm.
https://perldoc.perl.org/utf8.html
> Perl code sees no difference before and after downgrading a string
> (unless it actively tries – which it generally shouldn’t).
That is completely wrong again, of course, because of the previously quoted sentence. Just read the docs: `downgrade` creates an array of bytes and might even result in loss of data if it's used wrongly, like you did. Just remove your error-check flag, rerun your tests, test again with some ASCII character, and print the output of `is_utf8`, and you can clearly see that what you claim is wrong.
> And for that reason it’s guaranteed to double-encode already-encoded
> strings.
Wrong. `encode`, properly called on character strings, results in UTF-8 encoded byte arrays, as your own test above proves. `encode` called on byte arrays results in arbitrary garbage and is a user error.
> So you cannot use it if your API expects already-encoded strings.
You mean byte arrays of arbitrary encodings.
> You can only use downgrade…
Wrong again, of course. Just read the docs: `encode` and `downgrade` both take the same character strings as input, not byte arrays; only the result is different.
> (Since Perl v5.8.0) Converts in-place the internal representation of the string from UTF-8 to the equivalent octet sequence in the native encoding (Latin-1 or EBCDIC).
vs.
> (Since Perl v5.8.0) Converts in-place the character sequence to the corresponding octet sequence in UTF-8.
The input is always the same: a character string, not bytes. The output contract is always the same regarding data type: a byte array rather than a character string; only the contents differ, UTF-8 vs. Latin-1.
> If your API expects decoded strings and you need to write bytes, then
> you *must* use utf8::encode (or equivalent Encode.pm functions)
> (assuming your wire/file format expects UTF-8).
Which is exactly what I said in the beginning: `encode` instead of `downgrade`, because `encode` doesn't lose data and reliably produces UTF-8.
> If your API tries to say “you can give me either encoded or decoded
> strings and I’ll do the right thing”, but your API doesn’t also
> require the caller to say which kind the string is, then you lose:
And that's exactly where `is_utf8` comes into play, and for that reason it is used internally in Perl as well: to distinguish between byte arrays and character strings. There are (or have been) a few exceptions in which the flag was off for ASCII-only text and the like, but simply for historical reasons, and because there is no actual difference between treating those as a byte array or as a character string. That is nothing to rely on, though.
> you are not asking for enough information from the caller, so you
> don’t know which output from my code example above would be the
> correct one.
Of course I know, and using the default behaviour, Perl would have told you as well: `downgrade` in your example is wrong, as it almost always is these days.
> You cannot find that out just by looking at the string; the caller
> must tell you.
Even in your example, `is_utf8` is able to tell the difference between byte arrays and character strings; just try it.
> Downgrading doesn’t have a target encoding.
Of course it has. Just read the docs and don't rely on your own broken test; what you are doing is wrong and non-default behaviour.
> (Since Perl v5.8.0) Converts in-place the internal representation of the string from UTF-8 to the equivalent octet sequence in the native encoding (Latin-1 or EBCDIC).
The docs clearly say that the target encoding is some weird "native" one, most likely Latin-1.
> Again: Perl code sees no difference before
> and after downgrading a string (unless it actively tries – which it
> generally shouldn’t).
Your own usage of `length` proves you wrong.
I have not replied to this for over two years because every time I have looked at it I have gotten stuck at where to even start arguing with such a flabbergastingly backwards understanding of Perl’s string model. I cannot imagine how it is possible to read the documentation and come to the exact opposite understanding of what it says (and quite plainly at that).

On Wed, 13 Jun 2018 08:47:54 GMT, tschoening@am-soft.de wrote:
> Additionally, as stated before, read the docs of `downgrade` about
> its contract: the result should be a byte array with the UTF-8 flag
> off, which is not the case in your example, because you wrongly
> accept arbitrary errors.
Incorrect. The contract is in this little sentence, found in the documentation of both utf8::upgrade and utf8::downgrade, which says the opposite of what you said: “The logical character sequence itself is unchanged. If $string is already stored <in the respective target representation>, then this is a no-op.”

The second of these sentences makes no sense unless the function is expected to accept strings of both representations. And the first sentence says the string means exactly the same before and after the operation, whether it is being upgraded or downgraded. That is the contract of these functions: they take any string, and they leave the meaning of the string unchanged. The lack of change in the meaning of the string is shown by the fact that length() always returns the same value for the string before and after up- or downgrading it. These two functions simply change the internal representation for the same sequence of characters.

Of course, since there are two representations for strings but only one of them can represent the full range of characters, a function that converts between the two representations cannot possibly always do so successfully. That is why the $fail_ok argument exists: to ask the function not to fail loudly in that case. Why is this fine? Because the meaning of the string does not change. That is the contract of these functions. Passing a true value for $fail_ok is part of the documented interface of the function and does not somehow “violate its contract”. The contract of both functions is that the meaning of the string does not change, and failing the conversion has no effect on that.

You are of course forced to argue that normal usage of the function as documented is a violation of the contract of the function, because you are arguing that the contract of the function is the opposite of its actual contract. To argue that, you cannot concede that passing a true value for $fail_ok is correct usage, even though it is documented to be normal usage of the function. Of course your argument makes no sense anyway, because it would mean the interface of the function is designed and documented in a way that violates its own equally documented contract.

The simple fact of the matter is that you are wrong about the contract of the upgrade and downgrade functions. Performing an operation on a string that is documented as not changing the meaning of the string cannot possibly be wrong. The fact that this operation is even visible to Perl programs is due to the fact that some code wrongly assigns meaning where Perl itself does not, and working around this sometimes forces you to care about the internal representation even when you should not have to.
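A minimal sketch of that contract (peeking at the representation with utf8::is_utf8, which is exactly the kind of “actively trying” that ordinary code should not do):

    use strict;
    use warnings;

    my $s = "caf\x{E9}";                      # four characters

    utf8::downgrade($s);                      # ensure the UTF8=off representation
    print utf8::is_utf8($s) ? 1 : 0, "\n";    # 0
    print length($s), "\n";                   # 4

    utf8::upgrade($s);                        # the representation flips...
    print utf8::is_utf8($s) ? 1 : 0, "\n";    # 1
    print length($s), "\n";                   # 4 -- ...but the meaning does not

    utf8::upgrade($s);                        # already upgraded: a no-op
    utf8::downgrade($s);                      # and back again, still losslessly
    print length($s), "\n";                   # still 4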