Subject: error decoding UTF-16 "noncharacters"
Date: Fri, 14 Jan 2011 16:45:36 -0800
To: bug-Encode [...] rt.cpan.org
From: Andrew Pimlott <andrew [...] pimlott.net>

Below is a forward of perl bug 81454
(http://rt.perl.org/rt3/Public/Bug/Display.html?id=81454), which I was
asked to report here. Since it was originally reported as a perl bug, I
have ported the test case to Encode directly. It's about decoding
Unicode "noncharacters" (which according to the spec are valid Unicode,
but for "internal use only"):

    use Encode ();

    $utf8 = "\xef\xb7\x93";
    # returns "\x{FDD3}"
    $x = Encode::decode('UTF-8', $utf8, Encode::FB_CROAK);

    $utf16le = "\xd3\xfd";
    # dies with 'UTF-16LE:Unicode character fdd3 is illegal'
    $x = Encode::decode('UTF-16LE', $utf16le, Encode::FB_CROAK);

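
To compare the two decoders without the script aborting at the first failure, each call can be wrapped in eval. This is only a sketch: try_decode is a hypothetical helper name, not part of Encode, and what it reports for the noncharacter depends on the installed Encode version.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode $octets as $encoding and describe the outcome
# instead of dying.
sub try_decode {
    my ($encoding, $octets) = @_;

    # FB_CROAK makes decode() die on input it rejects; eval traps that.
    # $octets is a private copy, so decode() may modify it freely.
    my $chars = eval { Encode::decode($encoding, $octets, Encode::FB_CROAK) };
    unless (defined $chars) {
        my $err = $@;
        chomp $err;
        return "died: $err";
    }

    # List the decoded code points rather than printing the characters.
    return 'ok: ' . join ' ', map { sprintf 'U+%04X', ord } split //, $chars;
}

print "UTF-8:    ", try_decode('UTF-8',    "\xef\xb7\x93"), "\n";
print "UTF-16LE: ", try_decode('UTF-16LE', "\xd3\xfd"),     "\n";
```

If the two lines disagree (one "ok", one "died"), that is the inconsistency reported here.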
I'm not so concerned with this behaviour of Encode, per se, because when
you're using Encode, you have lots of options for handling "malformed"
data (even though this is not really malformed). I'm more concerned
with perl IO layers, as in the original report. I think that when
called for an IO layer, Encode should behave consistently with core
perl:
- "illegal" characters cause a warning, not an error (even though
  malformed UTF-16 still throws an error)
- the warning is disabled by no warnings 'utf8' (I don't know if this
  can be detected from Encode; if not, core perl would have to pass in
  this flag)
- the set of "illegal" characters is exactly what it is in core perl
  (maybe it is already)
- the warning message is formatted exactly as in core perl (remove the
  "UTF-16LE:" prefix and put "0x" in front of the code point)
Basically, users think of IO encodings as core perl, so Encode should
make them act that way.
Original bug:
Create UTF-8 and UTF-16LE files containing the character U+FDD3. (For
UTF-8, these are the bytes ef b7 93; for UTF-16LE, d3 fd.) With the
UTF-8 file as STDIN, run

    binmode(STDIN, ':encoding(UTF-8)');
    while (<STDIN>) { }

The program runs without complaint. With the UTF-16LE file as STDIN, run

    binmode(STDIN, ':encoding(UTF-16LE)');
    while (<STDIN>) { }

The program dies with

    UTF-16LE:Unicode character fdd3 is illegal at ./bin/grep_high line 2.

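
For anyone reproducing this, the two input files can be written with the exact bytes quoted above (ef b7 93 and d3 fd). A minimal sketch follows. The helper write_raw and both file names are placeholders invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: write $bytes to $path verbatim, with no encoding layer.
sub write_raw {
    my ($path, $bytes) = @_;
    open my $fh, '>:raw', $path or die "cannot open $path: $!";
    print {$fh} $bytes;
    close $fh or die "cannot close $path: $!";
    return;
}

# File names are placeholders chosen for this sketch.
write_raw('fdd3-utf8.txt',    "\xef\xb7\x93");   # U+FDD3 as UTF-8
write_raw('fdd3-utf16le.txt', "\xd3\xfd");       # U+FDD3 as UTF-16LE
```

Each test program is then run with the matching file redirected to STDIN, for example with "< fdd3-utf16le.txt" on the command line.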
This is a fatal error and I find no way to turn it off except perhaps to
call Encode::decode by hand. I have run across files like this in the real
world, and it would be nice to read them with the standard filehandle
mechanism. Also, the difference between UTF-8 and UTF-16 behavior seems
unjustified.
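
As a stopgap, such files can be read with the ':raw' layer and decoded by hand. The sketch below does not call Encode::decode at all, since I have not verified how it treats these code points for every CHECK value. Instead it unpacks the 16-bit units directly. The name decode_utf16le_lenient is made up for this example, and the function assumes the whole input is already in memory.

```perl
use strict;
use warnings;

# Hypothetical helper: decode UTF-16LE octets by hand, without Encode,
# so that noncharacters such as U+FDD3 pass through untouched.
sub decode_utf16le_lenient {
    my ($octets) = @_;
    die "odd number of octets\n" if length($octets) % 2;

    my @units = unpack 'v*', $octets;    # 16-bit little-endian code units
    my $out   = '';

    no warnings 'utf8';    # some perls warn when building noncharacters

    while (@units) {
        my $u = shift @units;
        if ($u >= 0xD800 && $u <= 0xDBFF) {
            # High surrogate: must be followed by a low surrogate.
            my $lo = shift @units;
            die "unpaired high surrogate\n"
                unless defined $lo && $lo >= 0xDC00 && $lo <= 0xDFFF;
            $u = 0x10000 + (($u - 0xD800) << 10) + ($lo - 0xDC00);
        }
        elsif ($u >= 0xDC00 && $u <= 0xDFFF) {
            die "unpaired low surrogate\n";
        }
        $out .= chr $u;
    }
    return $out;
}
```

Malformed UTF-16 (an odd byte count or an unpaired surrogate) still dies here, while noncharacters are accepted. That matches the split between errors and warnings that I am asking for.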
I suggest that this diagnostic be a warning, just like the "is illegal for
interchange" messages emitted in other contexts, and be disabled by "no
warnings 'utf8'". Also, this form of the diagnostic is not documented in
perldiag, even though it practically comes from the perl core.
Andrew