Subject: decode_utf8 idempotence
Date: Sun, 26 Sep 2010 13:22:39 -0700
To: bug-Encode [...] rt.cpan.org
From: Father Chrysostomos <sprout [...] cpan.org>
The fix for #14559 ‘fix for #8872 introduces new “bug”’ (<https://rt.cpan.org/Public/Bug/Display.html?id=14559>) itself introduces a bug.
If I have a string containing "\xc3\xa9" that just happens to have the UTF8 flag on (e.g., substr "\x{100}\xc3\xa9", 1), decode_utf8 won’t decode it.
The UTF8 flag is something internal to perl, which should not be used in deciding what to do with a given string.
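To make the case concrete, here is a small sketch (the variable names are mine, purely for illustration) showing that the two strings are the same as far as Perl code can tell by value, and differ only in the internal flag:

```perl
use strict;
use warnings;

# Two strings holding the same two characters, U+00C3 and U+00A9.
my $plain   = "\xc3\xa9";                   # plain literal, UTF8 flag off
my $flagged = substr "\x{100}\xc3\xa9", 1;  # same characters, UTF8 flag on

# By value the strings are indistinguishable.
printf "equal:  %s\n", $plain eq $flagged ? "yes" : "no";
printf "length: %d %d\n", length $plain, length $flagged;

# Only the internal representation differs.
printf "flag:   %d %d\n",
    utf8::is_utf8($plain)   ? 1 : 0,
    utf8::is_utf8($flagged) ? 1 : 0;
```

Since eq and length see no difference, a routine that behaves differently for $flagged is reacting to the storage format rather than to the string's contents.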
I think that bug #14559 is not a bug at all:
On Mon Sep 12 16:48:04 2005, RUZ wrote:
> Fix for http://rt.cpan.org/NoAuth/Bug.html?id=8872 doesn't allow to
> use strings with UTF-8 flag as decode_utf8 argument:
>
> $ perl -MEncode -we 'decode_utf8("\x{100}")'
> Cannot decode string with wide characters at
> /usr/lib/perl5/5.8.7/x86_64-linux/Encode.pm line 166.
You can’t decode something other than bytes. Therefore every decode routine must accept only characters in the range 0..255. How perl encodes them internally should be irrelevant.
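A rough sketch of the behaviour I have in mind, written as a wrapper rather than as a patch. The name decode_utf8_by_value is hypothetical and not part of Encode; utf8::downgrade and Encode::decode are the only real calls used:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decide by the characters in the string,
# never by the UTF8 flag.
sub decode_utf8_by_value {
    my ($string) = @_;
    my $octets = $string;    # work on a copy, leave the caller's string alone

    # Fails (returns false) only if some character is above 0xFF.
    utf8::downgrade($octets, 1)
        or die "Cannot decode string with wide characters\n";

    return Encode::decode('UTF-8', $octets);
}

my $flagged = substr "\x{100}\xc3\xa9", 1;   # UTF8 flag on, all chars <= 0xFF
my $decoded = decode_utf8_by_value($flagged);
printf "U+%04X\n", ord $decoded;             # prints U+00E9
```

With this approach "\x{100}" is still rejected, because it really does contain a character outside 0..255, while a flagged "\xc3\xa9" is decoded like any other byte string.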
> This behaviour is not documented and also is not consistent with
> encode_utf8 that doesn't die when string has no UTF-8 flag.
Again, whether the UTF8 flag is on or not should be irrelevant. encode_utf8 doesn’t die because a perl string cannot contain anything it cannot handle.
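For comparison, a short sketch of encode_utf8 on the same character stored both ways (the variable names are again only illustrative). It produces the same bytes either way, which is the flag-independence I would expect from decode_utf8 as well:

```perl
use strict;
use warnings;
use Encode ();

my $plain   = "\xe9";                   # U+00E9, UTF8 flag off
my $flagged = substr "\x{100}\xe9", 1;  # U+00E9, UTF8 flag on

my $from_plain   = Encode::encode_utf8($plain);
my $from_flagged = Encode::encode_utf8($flagged);

# Show the resulting bytes in hex; both lines print "c3 a9".
for my $bytes ($from_plain, $from_flagged) {
    print join(' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes), "\n";
}
```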