

This queue is for tickets about the CGI CPAN distribution.

Report information
The Basics
Id: 3207
Status: resolved
Priority: 0/
Queue: CGI

People
Owner: LDS [...] cpan.org
Requestors: tshinnic [...] io.com
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: (no value)
Fixed in: (no value)



Subject: Typo in CGI::Util routine utf8_chr : six byte encoding should use 0xFC
From the various documents and sources I can find, it would appear that the constant used to begin a six-byte UTF-8 sequence should be 0xFC rather than the 0xFE found in the source. I originally noticed this in CGI::Simple which borrowed the routine.

Copy of my email to jfreeman follows:
===============================================================
Relaxing by reading code and noticed what looks like a non-sequitur. Tried to verify against other Perl modules and gave up. Went to unicode.org and other places and they _seem_ to confirm that a constant is wrong.

Util.pm line 201 has
    0xfe | ($c >> 30),
I believe this should be 0xfc

ftp://www.unicode.org/Public/PROGRAMS/CVTUTF/ConvertUTF.c
has line
    static const UTF8 firstByteMark[7] = { 0x00, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };

ftp://ftp.rfc-editor.org/in-notes/rfc2279.txt
also says
    UCS-4 range (hex.)       UTF-8 octet sequence (binary)
    0400 0000-7FFF FFFF      1111110x 10xxxxxx ... 10xxxxxx

So I think 0xfc is the correct value.

However.... unicode.org seems pretty strident in their disclaimer that Unicode is character-oriented, that they've defined a space for 'only' a million characters, and that the valid space greater than 16 bits is only from 000000 -> 10FFFF. Thus they 'define' only the one byte to four byte space for UTF-8 (even though their code example will handle up to six bytes).

Me, I'd just fix the constant to make it able to handle the full 32 bit space. But there's justification for throwing out anything more than 21 bits.

http://www.unicode.org/reports/tr19/tr19-9.html#10646
has
    UTF-32 is restricted in values to the range 0..10FFFF₁₆, which precisely matches the range of characters defined in the Unicode Standard (and other standards such as XML), and those representable by UTF-8 and UTF-16.
also
    Resolution M38.6 (Restriction of encoding space) [adopted unanimously]
    "WG2 accepts the proposal in document N2175 towards removing the provision for Private Use Groups and Planes beyond Plane 16 in ISO/IEC 10646, to ensure internal consistency in the standard between UCS-4, UTF-8 and UTF-16 encoding formats, and instructs its project editor [to] prepare suitable text for processing as a future Technical Corrigendum or an Amendment to 10646-1:2000."
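
For anyone following the argument, here is a minimal sketch of the RFC 2279 encoding scheme the report refers to. It is written in Python for illustration only and is not the CGI::Util code; the names utf8_encode_rfc2279, sequence_length and six_byte_mark are hypothetical. The lead-byte table is the firstByteMark array quoted above from ConvertUTF.c. With 0xFC the six-byte lead byte comes out as 1111110x; with 0xFE it can only be 0xFE or 0xFF, which does not match that bit pattern.

```python
# Lead-byte marks indexed by sequence length (values quoted from ConvertUTF.c).
FIRST_BYTE_MARK = (0x00, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC)

# Upper limit of the Unicode code space discussed in the report.
MAX_UNICODE = 0x10FFFF


def sequence_length(c):
    """Return how many octets RFC 2279 uses for the code value c."""
    if c < 0 or c > 0x7FFFFFFF:
        raise ValueError("value outside the 31-bit range covered by RFC 2279")
    if c < 0x80:
        return 1
    if c < 0x800:
        return 2
    if c < 0x10000:
        return 3
    if c < 0x200000:
        return 4
    if c < 0x4000000:
        return 5
    return 6


def utf8_encode_rfc2279(c, six_byte_mark=0xFC):
    """Encode c as 1 to 6 octets. six_byte_mark is a hypothetical knob that
    lets the 0xFC and 0xFE constants be compared side by side."""
    n = sequence_length(c)
    if n == 1:
        return bytes([c])
    marks = list(FIRST_BYTE_MARK)
    marks[6] = six_byte_mark
    octets = []
    # Emit continuation octets (10xxxxxx) from the low-order bits upward.
    for _ in range(n - 1):
        octets.append(0x80 | (c & 0x3F))
        c >>= 6
    # Whatever bits remain go into the lead octet under its length mark.
    octets.append(marks[n] | c)
    return bytes(reversed(octets))


if __name__ == "__main__":
    for mark in (0xFC, 0xFE):
        lead = utf8_encode_rfc2279(0x7FFFFFFF, six_byte_mark=mark)[0]
        print(f"mark {mark:#04x}: lead octet {lead:08b}")
```

Values above MAX_UNICODE (0x10FFFF) never occur in Unicode text, so the five- and six-byte branches only matter if the routine is expected to cover the full 31-bit space described in RFC 2279.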
From: Lincoln Stein <lstein [...] cshl.edu>
To: bug-CGI.pm [...] rt.cpan.org, "AdminCc of cpan Ticket #3207": ;;, [...] fontina.cshl.org
Subject: Re: [cpan #3207] Typo in CGI::Util routine utf8_chr : six byte encoding should use 0xFC
Date: Mon, 18 Aug 2003 13:47:00 -0400
RT-Send-Cc:
I've made the change. Hope you're right.

Lincoln

On Monday 11 August 2003 03:49 am, Guest via RT wrote:
> From the various documents and sources I can find, it would appear that the
> constant used to begin a six-byte UTF-8 sequence should be 0xFC rather than
> the 0xFE found in the source. [...]
--
========================================================================
Lincoln D. Stein                           Cold Spring Harbor Laboratory
lstein@cshl.org                                   Cold Spring Harbor, NY
========================================================================
I have accepted the proposed constant change. I trust it is correct, as I don't use (or care about) UTF-8.

[guest - Mon Aug 11 03:49:23 2003]:
> From the various documents and sources I can find, it would appear
> that the constant used to begin a six-byte UTF-8 sequence should be
> 0xFC rather than the 0xFE found in the source. [...]