Bug #97358 for Encode: Encode Should Accept Noncharacters as Unicode

Sun Jul 20 01:04:20 2014 dwheeler [...] cpan.org - Ticket created

Subject:	Encode Should Accept Noncharacters as Unicode
Date:	Sat, 19 Jul 2014 22:03:59 -0700
To:	bug-Encode [...] rt.cpan.org
From:	"David E. Wheeler" <dwheeler [...] cpan.org>

I think this is a bug:

    perl -MEncode -E 'say Encode::decode("UTF-8", "\xEF\xBF\xBF", Encode::FB_CROAK)'
    utf8 "\xFFFF" does not map to Unicode at /usr/local/lib/perl5/site_perl/5.20.0/darwin-thread-multi-2level/Encode.pm line 175.

\xFFFF is, in fact, a part of UTF-8. It is one of a family of “Noncharacters”, and, according to [Corrigendum 9](http://www.unicode.org/versions/corrigendum9.html), reserved noncharacters now are allowed to appear in UTF-8 strings.

Related discussions:

http://grokbase.com/t/perl/perl5-porters/147gfvrd2n/encode-vs-json
https://rt.perl.org/Public/Bug/Display.html?id=121937.

Thanks,

David
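For readers unfamiliar with the terms, the bytes in the report and the "family" of noncharacters can be worked out by hand. The sketch below is illustrative only: `decode_3byte` and `is_noncharacter` are hypothetical helper names, not part of Encode, and the noncharacter ranges (U+FDD0..U+FDEF plus the last two code points of each of the 17 planes) are taken from the Unicode Standard's definition.

```perl
use strict;
use warnings;

# Decode one three-byte UTF-8 sequence by hand.
# Layout: 1110xxxx 10xxxxxx 10xxxxxx (hypothetical helper, no validation).
sub decode_3byte {
    my ($b1, $b2, $b3) = @_;
    return (($b1 & 0x0F) << 12) | (($b2 & 0x3F) << 6) | ($b3 & 0x3F);
}

# Noncharacters: U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in every plane.
sub is_noncharacter {
    my ($cp) = @_;
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;
    return 1 if $cp <= 0x10FFFF && ($cp & 0xFFFE) == 0xFFFE;
    return 0;
}

# The byte sequence from the report.
my $cp = decode_3byte(0xEF, 0xBF, 0xBF);
printf "EF BF BF -> U+%04X, noncharacter: %d\n", $cp, is_noncharacter($cp);

# Count every noncharacter in the Unicode code space.
my $count = 0;
for my $c (0 .. 0x10FFFF) {
    $count++ if is_noncharacter($c);
}
print "noncharacters in total: $count\n";
```

The count comes to 66: 32 in the U+FDD0..U+FDEF block and two at the end of each of the 17 planes.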

Sun Jul 20 06:58:08 2014 DANKOGAI [...] cpan.org - Correspondence added

If it were a bug, it would belong to perl core, because the strictness of UTF8 is #defined in the value of UTF8_DISALLOW_ILLEGAL_INTERCHANGE, which is defined in perl core: http://perldoc.perl.org/perlapi.html#Unicode-Support

> * utf8n_to_uvchr
> Certain code points are considered problematic.
> These are Unicode surrogates, Unicode non-characters,
> and code points above the Unicode maximum of 0x10FFFF.
> By default these are considered regular code points,
> but certain situations warrant special handling for them.
> If flags contains UTF8_DISALLOW_ILLEGAL_INTERCHANGE,
> all three classes are treated as malformations and handled as such.
> The flags UTF8_DISALLOW_SURROGATE, UTF8_DISALLOW_NONCHAR,
> and UTF8_DISALLOW_SUPER (meaning above the legal Unicode maximum)
> can be set to disallow these categories individually.

In other words, Encode faithfully believes perl core in that respect. And I want to leave Encode that way. If it is to be fixed, it should be fixed by redefining UTF8_DISALLOW_ILLEGAL_INTERCHANGE to exclude UTF8_DISALLOW_NONCHAR in perl core.

Dan the Encode Maintainer

On Sun Jul 20 01:04:20 2014, DWHEELER wrote:

> I think this is a bug:
>
> perl -MEncode -E 'say Encode::decode("UTF-8", "\xEF\xBF\xBF", Encode::FB_CROAK)'
> utf8 "\xFFFF" does not map to Unicode at /usr/local/lib/perl5/site_perl/5.20.0/darwin-thread-multi-2level/Encode.pm line 175.
>
> \xFFFF is, in fact, a part of UTF-8. It is one of a family of
> “Noncharacters”, and, according to [Corrigendum
> 9](http://www.unicode.org/versions/corrigendum9.html), reserved
> noncharacters now are allowed to appear in UTF-8 strings.
>
> Related discussions:
>
> http://grokbase.com/t/perl/perl5-porters/147gfvrd2n/encode-vs-json
> https://rt.perl.org/Public/Bug/Display.html?id=121937.
>
> Thanks,
>
> David

Sun Jul 20 06:58:09 2014 The RT System itself - Status changed from 'new' to 'open'

Sun Jul 20 06:58:22 2014 DANKOGAI [...] cpan.org - Status changed from 'open' to 'rejected'

Mon Jul 21 14:35:03 2014 dwheeler [...] cpan.org - Correspondence added

Subject:	Re: [rt.cpan.org #97358] Encode Should Accept Noncharacters as Unicode
Date:	Mon, 21 Jul 2014 11:34:37 -0700
To:	bug-Encode [...] rt.cpan.org
From:	"David E. Wheeler" <dwheeler [...] cpan.org>

On Jul 20, 2014, at 3:58 AM, Dan Kogai via RT <bug-Encode@rt.cpan.org> wrote:

> In other words, Encode faithfully believes perl core in that respect. And I want to leave Encode that way. If it is to be fixed, it should be fixed by redefining UTF8_DISALLOW_ILLEGAL_INTERCHANGE to exclude UTF8_DISALLOW_NONCHAR in perl core.

Thanks. I have taken this up with Perl 5 Porters.

David

Tue Oct 07 12:30:37 2014 dwheeler [...] cpan.org - Correspondence added

Followup from the p5p thread:

http://grokbase.com/t/perl/perl5-porters/147gfvrd2n/encode-vs-json

I think we need an interface to tell Encode to exclude UTF8_DISALLOW_NONCHAR from UTF8_DISALLOW_ILLEGAL_INTERCHANGE.

Tue Oct 07 12:30:38 2014 dwheeler [...] cpan.org - Status changed from 'rejected' to 'open'

Thu Feb 05 12:24:44 2015 dwheeler [...] cpan.org - Correspondence added

On 2014-10-07 12:30:37, DWHEELER wrote:

> Followup from the p5p thread:
>
> http://grokbase.com/t/perl/perl5-porters/147gfvrd2n/encode-vs-json
>
> I think we need an interface to tell Encode to exclude
> UTF8_DISALLOW_NONCHAR from UTF8_DISALLOW_ILLEGAL_INTERCHANGE.

Hey Dan, any chance we could see this functionality integrated soon?

Thanks,

David

Wed Jul 13 17:49:28 2016 pali [...] cpan.org - Cc PALI added

Wed Jul 13 18:02:06 2016 pali [...] cpan.org - Correspondence added

On Sun Jul 20 01:04:20 2014, DWHEELER wrote:

> I think this is a bug:
>
> perl -MEncode -E 'say Encode::decode("UTF-8", "\xEF\xBF\xBF", Encode::FB_CROAK)'
> utf8 "\xFFFF" does not map to Unicode at /usr/local/lib/perl5/site_perl/5.20.0/darwin-thread-multi-2level/Encode.pm line 175.

I would say this is correct and expected behaviour.

> \xFFFF is, in fact, a part of UTF-8. It is one of a family of
> “Noncharacters”, and, according to [Corrigendum
> 9](http://www.unicode.org/versions/corrigendum9.html), reserved
> noncharacters now are allowed to appear in UTF-8 strings.

The function above converts a UTF-8 octet string to a Perl Unicode string, and "\xFFFF" really is not a valid Unicode character. If you want invalid Unicode characters replaced by the Unicode replacement character "\xFFFD", call decode with the FB_DEFAULT flag (equivalent to calling it with no flags). FB_CROAK is there to croak when the input string cannot be converted correctly.

Anyway, the "UTF-8" encoding implements strict UTF-8, so noncharacters and invalid sequences must not be accepted. I would suggest reading about utf8 vs UTF-8 in Perl: https://metacpan.org/pod/Encode#UTF-8-vs.-utf8-vs.-UTF8
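The difference between the two fallback modes described above can be sketched as follows. This is a minimal illustration, not a statement of how any particular Encode release treats noncharacters: the malformed byte "\xFF" is used for the part whose behaviour is certain, `try_strict` is a hypothetical helper name, and the result for the noncharacter sequence is simply printed, since it depends on the installed Encode (this ticket reports a croak with the 2014 release).

```perl
use strict;
use warnings;
use Encode ();

# Attempt a strict decode with FB_CROAK and report what happened.
# LEAVE_SRC keeps decode() from modifying the caller's buffer.
sub try_strict {
    my ($bytes) = @_;
    my $copy = $bytes;
    my $str  = eval {
        Encode::decode('UTF-8', $copy,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return 'croaked' unless defined $str;
    return 'accepted: '
        . join(' ', map { sprintf 'U+%04X', ord($_) } split //, $str);
}

# Default mode (no CHECK argument): malformed input is replaced by U+FFFD.
my $malformed = "\xFF";
my $default   = Encode::decode('UTF-8', $malformed);
printf "default mode on FF:  %s\n",
    join(' ', map { sprintf 'U+%04X', ord($_) } split //, $default);

# FB_CROAK mode: the same input raises an exception instead.
print "FB_CROAK on FF:      ", try_strict($malformed), "\n";

# The sequence from this ticket; the outcome is version-dependent,
# so no expected result is claimed here.
print "FB_CROAK on EF BF BF: ", try_strict("\xEF\xBF\xBF"), "\n";
```

Wrapping the call in `eval` is one way to turn the croak into a value a caller can branch on.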

Thu Jul 14 07:13:09 2016 dwheeler [...] cpan.org - Correspondence added

CC:	pali [...] cpan.org
Subject:	Re: [rt.cpan.org #97358] Encode Should Accept Noncharacters as Unicode
Date:	Thu, 14 Jul 2016 12:12:55 +0100
To:	bug-Encode [...] rt.cpan.org
From:	"David E. Wheeler" <dwheeler [...] cpan.org>

On Jul 13, 2016, at 11:02 PM, Pali via RT <bug-Encode@rt.cpan.org> wrote:

> The function above converts a UTF-8 octet string to a Perl Unicode string, and "\xFFFF" really is not a valid Unicode character.

Hey, thanks for your comments, pali. You’re correct that it’s not a Unicode character. In fact, it’s specifically identified as a “noncharacter”.

There was a fair bit of discussion of this issue on p5p, which you can see in the link upthread in this ticket. It’s clear that there’s some disagreement over the interpretation of corrigendum 9, which says that noncharacters *should* be allowed in UTF-8 strings.

The upshot is that sometimes one needs an option to tell Encode to exclude UTF8_DISALLOW_NONCHAR from UTF8_DISALLOW_ILLEGAL_INTERCHANGE. This will help when dealing with services that allow noncharacters in their output and they need to be preserved by the Perl code processing that string.

Best,

David

Thu Jul 14 12:08:50 2016 pali [...] cpan.org - Correspondence added

On Thu Jul 14 07:13:09 2016, DWHEELER wrote:

> On Jul 13, 2016, at 11:02 PM, Pali via RT <bug-Encode@rt.cpan.org> wrote:
>
> > The function above converts a UTF-8 octet string to a Perl Unicode
> > string, and "\xFFFF" really is not a valid Unicode character.
>
> Hey, thanks for your comments, pali. You’re correct that it’s not a
> Unicode character. In fact, it’s specifically identified as a
> “noncharacter”.
>
> There was a fair bit of discussion of this issue on p5p, which you can
> see in the link upthread in this ticket. It’s clear that there’s some
> disagreement over the interpretation of corrigendum 9, which says that
> noncharacters *should* be allowed in UTF-8 strings.

I do not know whether the standard/specification allows them in UTF-8 or not, but personally I avoid sending or using such UTF-8 sequences. It is not important, though, whether they are allowed in UTF-8. Here we are dealing with conversion from UTF-8 to Unicode (not with what is allowed in UTF-8). That problematic sequence is not a Unicode character, and therefore throwing an error is expected behaviour... It is similar to converting e.g. UTF-8 to Latin1: some characters can be converted and some cannot...

> The upshot is that sometimes one needs an option to tell Encode to
> exclude UTF8_DISALLOW_NONCHAR from UTF8_DISALLOW_ILLEGAL_INTERCHANGE.
> This will help when dealing with services that allow noncharacters in
> their output and they need to be preserved by the Perl code processing
> that string.

If in some case you really want to have those characters in the output Perl (Unicode) string, then the strict "UTF-8" encoder is not for you. There is the "utf8" one, as mentioned in the previous link about UTF-8 vs utf8.
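The lax "utf8" route suggested here can be sketched as below. The encoding names come from the Encode documentation section linked earlier in this thread; the variable names are only illustrative. The example prints code points rather than the string itself, to avoid non-character warnings on output.

```perl
use strict;
use warnings;
use Encode ();

my $bytes = "\xEF\xBF\xBF";    # UTF-8 encoding of the noncharacter U+FFFF

# The two spellings resolve to different encoding objects.
print Encode::find_encoding('UTF-8')->name, "\n";    # strict variant
print Encode::find_encoding('utf8')->name,  "\n";    # lax, Perl-internal variant

# The lax encoding keeps the noncharacter instead of rejecting it.
my $lax = Encode::decode('utf8', $bytes);
printf "lax decode: %d character(s), first is U+%04X\n",
    length($lax), ord($lax);

# Encoding back with the lax encoding restores the original octets.
my $round_trip = Encode::encode('utf8', $lax);
print "round trip preserved bytes: ",
    ($round_trip eq $bytes ? "yes" : "no"), "\n";
```

Choosing "utf8" trades strictness for preservation, so it is best limited to inputs that are known to carry noncharacters deliberately.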

Tue Feb 07 09:33:01 2017 pali [...] cpan.org - Correspondence added

OK, the Unicode Standard, Version 9.0 at http://www.unicode.org/versions/Unicode9.0.0/ says in chapter 3.9, Unicode Encoding Forms:

* Any UTF-8 byte sequence that does not match the patterns listed in Table 3-7 is ill-formed.
* A conformant encoding form conversion will treat any ill-formed code unit sequence as an error condition.

Table 3-7 contains Well-Formed UTF-8 Byte Sequences *without* noncharacters. So no, noncharacters cannot be allowed in UTF-8.

Because the Encode module also supports the "utf8" encoding (which allows noncharacters) and "UTF-8" cannot allow noncharacters, this bug can be closed as invalid.
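Table 3-7 is a list of byte ranges, so a given sequence can be checked against it mechanically rather than by inspection. The sketch below transcribes the table's rows into a regular expression; the transcription and the helper name `matches_table_3_7` are assumptions of this example, not code from Encode or perl core, and should be verified against the standard's text before being relied on.

```perl
use strict;
use warnings;

# Byte ranges of Table 3-7 (Well-Formed UTF-8 Byte Sequences),
# one alternative per table row. Operates on octet strings.
my $well_formed = qr/
    \A (?:
        [\x00-\x7F]
      | [\xC2-\xDF] [\x80-\xBF]
      | \xE0        [\xA0-\xBF] [\x80-\xBF]
      | [\xE1-\xEC] [\x80-\xBF]{2}
      | \xED        [\x80-\x9F] [\x80-\xBF]
      | [\xEE-\xEF] [\x80-\xBF]{2}
      | \xF0        [\x90-\xBF] [\x80-\xBF]{2}
      | [\xF1-\xF3] [\x80-\xBF]{3}
      | \xF4        [\x80-\x8F] [\x80-\xBF]{2}
    )* \z
/x;

# Hypothetical helper: true if every sequence in $bytes matches a table row.
sub matches_table_3_7 {
    my ($bytes) = @_;
    return $bytes =~ $well_formed ? 1 : 0;
}

# Sequences of interest: the one from this ticket, an encoded surrogate,
# an overlong encoding, and a code point above U+10FFFF.
my @cases = (
    [ 'EF BF BF (U+FFFF)'     => "\xEF\xBF\xBF" ],
    [ 'ED A0 80 (U+D800)'     => "\xED\xA0\x80" ],
    [ 'C0 80 (overlong NUL)'  => "\xC0\x80" ],
    [ 'F4 90 80 80 (>10FFFF)' => "\xF4\x90\x80\x80" ],
);
for my $case (@cases) {
    my ($label, $bytes) = @$case;
    printf "%-24s matches table: %s\n", $label,
        matches_table_3_7($bytes) ? 'yes' : 'no';
}
```

The table constrains byte patterns only; whether noncharacters are acceptable for interchange is the separate question addressed by Corrigendum 9, as discussed upthread.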

Tue Feb 07 10:50:01 2017 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'