On Tue, 12 Jun 2018, 19:34:37, ARISTOTLE wrote:
> utf8::downgrade($b, 1);
Using `1` here hides that what you are doing is simply wrong:
> Fails if the original UTF-8 sequence cannot be represented in the native 8 bit encoding. On failure dies or, if the value of $fail_ok is true, returns false.
https://perldoc.perl.org/utf8.html
Without `1`, the following message is printed and the call dies, which makes sense, because your Unicode character cannot be represented in the `native` encoding, which is LATIN-1 as documented.
> Wide character in subroutine entry[...]
With `1`, `downgrade` simply does nothing: it keeps your character string, including its UTF-8 flag, as is. You can check that in your case using `is_utf8`. While that might work sometimes, it is wrong, because the result of `downgrade` should be a byte array instead of a character string, as documented. Garbage in, garbage out, and `die`ing is the default for good reason. That all makes sense if you think about it.
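A minimal sketch of the two behaviours, in case it helps. The helper name `try_downgrade` and the smiley test character are made up for illustration; only `utf8::downgrade` and `utf8::is_utf8` are the real functions from the docs.

```perl
use strict;
use warnings;

# Hypothetical helper: runs utf8::downgrade on a copy of $str and reports
# whether it died, what it returned, and whether the UTF-8 flag is still on.
sub try_downgrade {
    my ($str, $fail_ok) = @_;
    my $ok  = eval { utf8::downgrade($str, $fail_ok) };
    my $err = $@;
    return {
        died     => $err ne '' ? 1 : 0,
        returned => $ok ? 1 : 0,
        flag_on  => utf8::is_utf8($str) ? 1 : 0,
    };
}

my $smiley = "\x{263A}";    # U+263A, not representable in Latin-1
for my $fail_ok (0, 1) {
    my $r = try_downgrade($smiley, $fail_ok);
    printf "fail_ok=%d died=%d returned=%d flag_on=%d\n",
        $fail_ok, @{$r}{qw(died returned flag_on)};
}
# fail_ok=0 died=1 returned=0 flag_on=1
# fail_ok=1 died=0 returned=0 flag_on=1
```

With `$fail_ok` set, the only sign of the failure is the false return value, which the call in your example never checks.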
> One of these outputs is correct. One of them is not. *Which* one is
> correct depends on what you semantics you need.
No, using `downgrade` on arbitrary Unicode characters is always wrong and the fact that you need to disable error checks to make it output random garbage shows exactly that. You are violating its documented contract that way.
> In this case, utf8::encode is the wrong one.
You are wrong, of course. `encode` is the correct one, because it is able to encode arbitrary Unicode characters into a UTF-8 encoded byte array without losing any data. Again, the fact that `encode` works while `downgrade` doesn't by default proves that.
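As a rough illustration of "without losing any data", here is a round trip through `utf8::encode` and `utf8::decode`. The helper names `utf8_hex` and `round_trips` are hypothetical, and the test characters are my own choice.

```perl
use strict;
use warnings;

# Hypothetical helper: UTF-8-encodes a copy of $chars, returns the octets as hex.
sub utf8_hex {
    my ($chars) = @_;
    utf8::encode($chars);    # in place: characters -> UTF-8 octets
    return unpack 'H*', $chars;
}

# Hypothetical helper: true if encode followed by decode gives back the input.
sub round_trips {
    my ($chars) = @_;
    my $octets = $chars;
    utf8::encode($octets);
    my $back = $octets;
    utf8::decode($back);     # in place: UTF-8 octets -> characters
    return $back eq $chars ? 1 : 0;
}

print utf8_hex("\x{263A}"), "\n";             # e298ba
print round_trips("\x{263A} and \x{E9}"), "\n";    # 1
```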
> Of course it makes sense. You yourself say the functions aren't
> exactly the same, only nearly.
The difference is that `encode` properly works with arbitrary Unicode characters and `downgrade` doesn't, and that is what you have proven yourself.
> It simply keeps the encoding unchanged. Reliably.
Because you are disabling error checks on purpose, which is the wrong thing to do. With default behaviour, your call to `downgrade` would `die` to tell you that you are doing things wrong. Additionally, as stated before, read the docs of `downgrade` about its contract: the result should be a byte array with the UTF-8 flag off, which is not the case in your example because you wrongly accept arbitrary errors.
> Can be used to make sure that the UTF-8 flag is off, e.g. when you want to make sure that the substr() or length() function works with the usually faster byte algorithm.
https://perldoc.perl.org/utf8.html
> Perl code sees no
> difference before and after downgrading a string (unless it actively
> tries – which it generally shouldn't).
That is completely wrong again, of course, because of the previously quoted sentence. Just read the docs: `downgrade` creates an array of bytes and might even result in loss of data if it's used wrongly, like you did. Just remove your error check flag, rerun your tests, test again with some ASCII character, print the output of `is_utf8` each time, and you can clearly see that what you claim is wrong.
> And for that reason it's guaranteed to double-encode already-encoded
> strings.
Wrong. `encode` called on character strings properly results in UTF-8 encoded byte arrays, as your own test above proves. `encode` called on byte arrays results in arbitrary garbage and is a user error.
> So you cannot use it if your API expects already-encoded strings.
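To make the second case concrete, a small sketch of what happens when `utf8::encode` is applied to octets that are already UTF-8. The helper `encode_times` is hypothetical and only exists to show the effect.

```perl
use strict;
use warnings;

# Hypothetical helper: applies utf8::encode $n times to a copy of $str
# and returns the resulting octets as hex.
sub encode_times {
    my ($str, $n) = @_;
    utf8::encode($str) for 1 .. $n;
    return unpack 'H*', $str;
}

my $e_acute = "\x{E9}";                   # one character, U+00E9
print encode_times($e_acute, 1), "\n";    # c3a9
print encode_times($e_acute, 2), "\n";    # c383c2a9
```

The second call treats the octets C3 and A9 as the Latin-1 characters U+00C3 and U+00A9 and encodes each of them again.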
You mean byte arrays of arbitrary encodings.
> You can only use downgrade…
Wrong again, of course. Just read the docs: `encode` and `downgrade` both work on the same character strings as input instead of byte arrays; only the result is different.
> (Since Perl v5.8.0) Converts in-place the internal representation of the string from UTF-8 to the equivalent octet sequence in the native encoding (Latin-1 or EBCDIC).
vs.
> (Since Perl v5.8.0) Converts in-place the character sequence to the corresponding octet sequence in UTF-8.
The input is always the same: a character string instead of bytes. The output contract is always the same regarding data type: a byte array instead of a character string. Only the contents differ, UTF-8 vs. LATIN-1.
> If your API expects decoded strings and you need to write bytes, then
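For a character that does fit into the native encoding, the difference in contents can be sketched like this. The helper `octets_after` is hypothetical, and the sketch assumes a Latin-1 (non-EBCDIC) platform.

```perl
use strict;
use warnings;

# Hypothetical helper: starts from the UTF-8 internal representation of
# $chars, applies the named utf8:: function and returns the octets as hex.
sub octets_after {
    my ($op, $chars) = @_;
    utf8::upgrade($chars);    # same characters, UTF-8 flag on
    if    ($op eq 'downgrade') { utf8::downgrade($chars) }
    elsif ($op eq 'encode')    { utf8::encode($chars) }
    else                       { die "unknown operation '$op'" }
    return unpack 'H*', $chars;
}

print octets_after('downgrade', "\x{E9}"), "\n";    # e9   (Latin-1)
print octets_after('encode',    "\x{E9}"), "\n";    # c3a9 (UTF-8)
```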
> you *must* use utf8::encode (or equivalent Encode.pm functions)
> (assuming your wire/file format expects UTF-8).
Which is exactly what I said in the beginning: `encode` instead of `downgrade`, because `encode` doesn't lose data and is a reliable encoding to UTF-8.
> If your API tries to say “you can give me either encoded or decoded
> strings and I'll do the right thing”, but your API doesn't also
> require the caller to say which kind the string is, then you lose:
And that's exactly where `is_utf8` comes into play, and for that reason it is used internally in Perl as well, to distinguish between byte arrays and character strings. There are, or have been, a few exceptions in which the flag was off for ASCII-only texts etc., but simply for historical reasons and because there's no actual difference in treating those as a byte array or a character string. That is nothing to rely on, though.
> you
> are not asking for enough information from the caller, so you don't
> know which output from my code example above would be the correct one.
Of course I know, and with default behaviour Perl would have told you as well: `downgrade` in your example is wrong, like it almost always is these days.
> You cannot find that out just by looking at the string; the caller
> must tell you.
Even in your example `is_utf8` is able to tell the difference between byte arrays and character strings, just try it.
> Downgrading doesn't have a target encoding.
Of course it has. Just read the docs and don't rely on your own broken test; what you are doing is wrong and non-default behaviour.
> (Since Perl v5.8.0) Converts in-place the internal representation of the string from UTF-8 to the equivalent octet sequence in the native encoding (Latin-1 or EBCDIC).
The docs clearly say that the target encoding is some weird "native" one, most likely LATIN-1.
> Again: Perl code sees no difference before
> and after downgrading a string (unless it actively tries – which it
> generally shouldn't).
Your own usage of `length` proves you wrong.