Subject: Typo in CGI::Util routine utf8_chr: six-byte encoding should use 0xFC
From the documents and sources I can find, the constant used to begin a six-byte UTF-8 sequence should be 0xFC rather than the 0xFE found in the source. I originally noticed this in CGI::Simple, which borrowed the routine.
Copy of my email to jfreeman follows:
===============================================================
Relaxing by reading code and noticed what looks like a non-sequitur.
Tried to verify against other Perl modules and gave up. Went to
unicode.org and other places and they _seem_ to confirm that a
constant is wrong.
Util.pm line 201 has
0xfe | ($c >> 30),
I believe this should be 0xfc
ftp://www.unicode.org/Public/PROGRAMS/CVTUTF/ConvertUTF.c
has line
static const UTF8 firstByteMark[7] = { 0x00, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
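The non-zero entries in that table follow a simple pattern: the lead byte of an n-byte sequence has its top n bits set and the next bit clear. A minimal sketch of that rule follows; the function name lead_byte_mark is made up for illustration and does not appear in ConvertUTF.c or Util.pm.

```c
#include <stdint.h>

/* Hypothetical helper: lead-byte mark for an n-byte UTF-8 sequence,
 * n = 2..6. The top n bits are set and the bit below them is clear.
 * Shifting 0xFF left and truncating to 8 bits yields that pattern. */
static uint8_t lead_byte_mark(int n)
{
    return (uint8_t)(0xFF << (8 - n));
}
```

For n = 6 this gives 0xFC (binary 11111100). 0xFE (binary 11111110) is what the same rule would give for n = 7, a length that UTF-8 does not define.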
ftp://ftp.rfc-editor.org/in-notes/rfc2279.txt
also says
UCS-4 range (hex.)     UTF-8 octet sequence (binary)
0400 0000-7FFF FFFF    1111110x 10xxxxxx ... 10xxxxxx
So I think 0xfc is the correct value.
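To make the difference concrete, here is a rough C sketch of the six-byte case from the RFC 2279 table above. It is not the Util.pm code; encode_six and its mark parameter are invented for illustration so that both constants can be tried on the same input.

```c
#include <stdint.h>

/* Hypothetical helper: encode c (0x04000000..0x7FFFFFFF) as six octets,
 * using `mark` as the lead-byte constant. Each continuation octet
 * carries six payload bits under the 10xxxxxx pattern. */
static void encode_six(uint32_t c, uint8_t mark, uint8_t out[6])
{
    out[0] = (uint8_t)(mark | (c >> 30));            /* 1111110x */
    out[1] = (uint8_t)(0x80 | ((c >> 24) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((c >> 18) & 0x3F));
    out[3] = (uint8_t)(0x80 | ((c >> 12) & 0x3F));
    out[4] = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
    out[5] = (uint8_t)(0x80 | (c & 0x3F));
}
```

With mark = 0xFC the lead octet comes out as 0xFC or 0xFD, matching 1111110x. With mark = 0xFE it comes out as 0xFE or 0xFF, and RFC 2279 states that those two octet values never appear in UTF-8.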
However, unicode.org is pretty strident in its disclaimer
that Unicode is character-oriented, that it has defined space
for 'only' about a million characters, and that the valid code
space runs only from 000000 -> 10FFFF. Thus they 'define'
only the one-byte to four-byte forms of UTF-8 (even though their
code example will handle up to six bytes).
Me, I'd just fix the constant so the routine can handle the full
31-bit space that six bytes can carry. But there's justification
for throwing out anything more than 21 bits.
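The two policies differ only in the upper limit they accept. The sketch below assumes a caller-supplied limit; utf8_len and max_cp are illustrative names, not anything in CGI::Util, and surrogate values are not checked.

```c
#include <stdint.h>

/* Hypothetical helper: number of octets needed to encode c, or 0 if c
 * exceeds max_cp or cannot be encoded at all. Pass 0x10FFFF for the
 * Unicode-only range or 0x7FFFFFFF for the RFC 2279 range. */
static int utf8_len(uint32_t c, uint32_t max_cp)
{
    if (c > max_cp)       return 0;
    if (c < 0x80)         return 1;
    if (c < 0x800)        return 2;
    if (c < 0x10000)      return 3;
    if (c < 0x200000)     return 4;  /* 21 payload bits */
    if (c < 0x4000000)    return 5;  /* 26 payload bits */
    if (c <= 0x7FFFFFFF)  return 6;  /* 31 payload bits */
    return 0;
}
```

With max_cp = 0x10FFFF nothing longer than four octets is ever produced, so the six-byte constant would never be reached. With max_cp = 0x7FFFFFFF the five- and six-byte branches stay live and the constant matters.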
http://www.unicode.org/reports/tr19/tr19-9.html#10646
has
UTF-32 is restricted in values to the range 0..10FFFF₁₆,
which precisely matches the range of characters defined in
the Unicode Standard (and other standards such as XML),
and those representable by UTF-8 and UTF-16.
also
Resolution M38.6 (Restriction of encoding space) [adopted unanimously]
"WG2 accepts the proposal in document N2175 towards removing
the provision for Private Use Groups and Planes beyond Plane 16
in ISO/IEC 10646, to ensure internal consistency in the standard
between UCS-4, UTF-8 and UTF-16 encoding formats, and instructs
its project editor [to] prepare suitable text for processing as
a future Technical Corrigendum or an Amendment to 10646-1:2000."