Bug #64788 for Encode: error decoding UTF-16 "noncharacters"

Fri Jan 14 20:03:12 2011 andrew [...] pimlott.net - Ticket created

Subject:	error decoding UTF-16 "noncharacters"
Date:	Fri, 14 Jan 2011 16:45:36 -0800
To:	bug-Encode [...] rt.cpan.org
From:	Andrew Pimlott <andrew [...] pimlott.net>

Below is a forward of perl bug 81454 (http://rt.perl.org/rt3/Public/Bug/Display.html?id=81454), which I was asked to report here. Since it was originally reported as a perl bug, I have ported the test case to Encode directly. It's about decoding Unicode "noncharacters" (which according to the spec are valid Unicode, but for "internal use only"):

    use Encode ();
    $utf8 = "\xef\xb7\x93";
    # returns "\x{FDD3}"
    $x = Encode::decode('UTF-8', $utf8, Encode::FB_CROAK);
    $utf16le = "\xd3\xfd";
    # dies with 'UTF-16LE:Unicode character fdd3 is illegal'
    $x = Encode::decode('UTF-16LE', $utf16le, Encode::FB_CROAK);

I'm not so concerned with this behaviour of Encode, per se, because when you're using Encode, you have lots of options for handling "malformed" data (even though this is not really malformed). I'm more concerned with perl IO layers, as in the original report. I think that when called for an IO layer, Encode should behave consistently with core perl:

- "illegal" characters cause a warning, not an error (even though malformed UTF-16 still throws an error)
- the warning is disabled by no warnings 'utf8' (I don't know if this can be detected from Encode; if not, core perl would have to pass in this flag)
- the set of "illegal" characters is exactly what it is in core perl (maybe it is already)
- the warning message is formatted exactly as in core perl (remove the "UTF-16LE:" prefix and put "0x" in front of the code point)

Basically, users think of IO encodings as core perl, so Encode should make them act that way.

Original bug:

Create UTF-8 and UTF-16LE files containing the character U+FDD3. (For UTF-8, this is the bytes ef b7 93; for UTF-16LE, it is d3 fd.) With the UTF-8 file as STDIN, run

    binmode(STDIN, ':encoding(UTF-8)');
    while (<STDIN>) { }

The program runs without complaint. With the UTF-16LE file as STDIN, run

    binmode(STDIN, ':encoding(UTF-16LE)');
    while (<STDIN>) { }

The program dies with

    UTF-16LE:Unicode character fdd3 is illegal at ./bin/grep_high line 2.

This is a fatal error and I find no way to turn it off except perhaps to call Encode::decode by hand. I have run across files like this in the real world, and it would be nice to read them with the standard filehandle mechanism. Also, the difference between UTF-8 and UTF-16 behavior seems unjustified.

I suggest that this diagnostic be a warning, just like the "is illegal for interchange" messages emitted in other contexts, and be disabled by "no warnings 'utf8'". Also, this form of the diagnostic is not documented in perldiag, even though it practically comes from the perl core.

Andrew
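The byte sequences quoted in the report can be checked by hand. The following is a minimal sketch that encodes one BMP code point without using Encode at all, so the result does not depend on the installed Encode version. The helper names (`utf8_bytes_3`, `utf16le_bytes_bmp`, `hex_bytes`) are hypothetical and exist only for this illustration. It shows that `ef b7 93` and `d3 fd` both encode U+FDD3, which matches the code point in the error message.

```perl
use strict;
use warnings;

# Encode a code point in U+0800..U+FFFF as a three-byte UTF-8 sequence.
sub utf8_bytes_3 {
    my ($cp) = @_;
    return pack('C3',
        0xE0 | ($cp >> 12),             # 1110xxxx
        0x80 | (($cp >> 6) & 0x3F),     # 10xxxxxx
        0x80 | ($cp & 0x3F));           # 10xxxxxx
}

# Encode a BMP code point as one little-endian 16-bit unit.
sub utf16le_bytes_bmp {
    my ($cp) = @_;
    return pack('v', $cp);
}

# Render a byte string as space-separated lowercase hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
}

my $cp = 0xFDD3;
printf "U+%04X UTF-8:    %s\n", $cp, hex_bytes(utf8_bytes_3($cp));      # ef b7 93
printf "U+%04X UTF-16LE: %s\n", $cp, hex_bytes(utf16le_bytes_bmp($cp)); # d3 fd
```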

Sat May 21 18:23:35 2011 DANKOGAI [...] cpan.org - Correspondence added

That's simply because you give the third argument, Encode::FB_CROAK. Just get rid of it, like

    $x = Encode::decode('UTF-16LE', $utf16le)

and it will work as expected. See perldoc Encode to find out what Encode::FB_CROAK means.

Dan the Maintainer Thereof
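To see what dropping the CHECK argument actually changes, the two calls can be run side by side. This is a sketch only: `try_decode` is a hypothetical helper, and no output is shown because the result depends on the installed Encode version. A later comment in this ticket indicates that the non-croaking path substitutes U+FFFD for the noncharacter, so "work as expected" may mean a silent replacement rather than a round trip.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode UTF-16LE bytes and describe the outcome
# as either a list of code points or the error that was thrown.
sub try_decode {
    my ($bytes, $check) = @_;    # $bytes is a copy; decode may consume it
    my $out = eval {
        defined $check
            ? Encode::decode('UTF-16LE', $bytes, $check)
            : Encode::decode('UTF-16LE', $bytes);
    };
    unless (defined $out) {
        (my $err = $@) =~ s/\s+\z//;
        return "died: $err";
    }
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $out;
}

my $utf16le = "\xd3\xfd";    # the bytes from the report
printf "%-10s -> %s\n", 'no CHECK', try_decode($utf16le);
printf "%-10s -> %s\n", 'FB_CROAK', try_decode($utf16le, Encode::FB_CROAK());
```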

Sat May 21 18:23:36 2011 The RT System itself - Status changed from 'new' to 'open'

Sat May 21 18:23:36 2011 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'

Mon May 23 16:32:11 2011 andrew [...] pimlott.net - Correspondence added

Subject:	Re: [rt.cpan.org #64788] error decoding UTF-16 "noncharacters"
Date:	Mon, 23 May 2011 13:31:01 -0700
To:	bug-Encode <bug-encode [...] rt.cpan.org>
From:	Andrew Pimlott <andrew [...] pimlott.net>

Thanks for the reply. I understand the meaning of Encode::FB_CROAK. I used it because I do want decode to croak on "real" UTF-16 problems. I was taking issue with the error because the input ("\xd3\xfd") is the correct UTF-16 encoding of a valid Unicode character (U+FDD3). There is a lot more context in the original bug I referred to, http://rt.perl.org/rt3/Public/Bug/Display.html?id=81454.

I am content to leave things as they are while a model for "strict" and "lax" handling is worked out.

Andrew
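The distinction drawn here, dying on structurally broken UTF-16 while accepting noncharacters, can be sketched with a hand-written decoder. This is not how Encode works internally and is not proposed as a replacement; `decode_utf16le_strict` is a hypothetical helper written only to make the two categories of input concrete.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Encode: decode UTF-16LE by hand.
# Dies on structural errors (odd length, unpaired surrogates) but
# passes noncharacters such as U+FDD3 through unchanged.
sub decode_utf16le_strict {
    my ($bytes) = @_;
    die "UTF-16LE: odd number of bytes\n" if length($bytes) % 2;

    no warnings 'utf8';    # some perls warn when handling noncharacters
    my @units = unpack 'v*', $bytes;
    my $out   = '';
    while (@units) {
        my $u = shift @units;
        if ($u >= 0xD800 && $u <= 0xDBFF) {        # high surrogate
            my $lo = shift @units;
            die sprintf("UTF-16LE: unpaired high surrogate %04x\n", $u)
                unless defined $lo && $lo >= 0xDC00 && $lo <= 0xDFFF;
            $u = 0x10000 + (($u - 0xD800) << 10) + ($lo - 0xDC00);
        }
        elsif ($u >= 0xDC00 && $u <= 0xDFFF) {     # low surrogate first
            die sprintf("UTF-16LE: unpaired low surrogate %04x\n", $u);
        }
        $out .= chr $u;
    }
    return $out;
}

# The noncharacter from the report decodes; a lone surrogate does not.
printf "U+%04X\n", ord decode_utf16le_strict("\xd3\xfd");    # U+FDD3
my $ok = eval { decode_utf16le_strict("\x00\xd8"); 1 };
print "lone surrogate rejected\n" unless $ok;
```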

Mon May 23 16:32:12 2011 The RT System itself - Status changed from 'resolved' to 'open'

Sat Nov 12 18:51:53 2011 chansen [...] cpan.org - Correspondence added

Noncharacters have a valid representation in all Unicode encoding forms; they are assigned code points but not assigned characters. It would make sense to change the current behavior from croak()'ing to warn'ing using the utf8 warning category (and replace the ordinal with U+FFFD, as is). I volunteer to fix this if Dan agrees.

-- chansen
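For reference, the Unicode Standard designates exactly 66 noncharacters: U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, and so on up to U+10FFFF). The predicate below is a sketch of that definition; `is_noncharacter` is a hypothetical name, and whether Encode and core perl test exactly this set is not established in this ticket.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $cp is one of the 66 Unicode noncharacters.
sub is_noncharacter {
    my ($cp) = @_;
    return 0 if $cp < 0 || $cp > 0x10FFFF;          # outside Unicode range
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;     # contiguous block of 32
    return ($cp & 0xFFFE) == 0xFFFE ? 1 : 0;        # xxFFFE / xxFFFF per plane
}

for my $cp (0xFDD3, 0xFFFE, 0x10FFFF, 0xFFFD, 0x41) {
    printf "U+%04X %s\n", $cp,
        is_noncharacter($cp) ? 'noncharacter' : 'not a noncharacter';
}
```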