
This queue is for tickets about the Encode CPAN distribution.

Report information
The Basics
Id: 18105
Status: resolved
Priority: 0/
Queue: Encode

People
Owner: Nobody in particular
Requestors: jgmyers [...] proofpoint.com
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 2.14
Fixed in: (no value)



Subject: UTF-8 decodes illegal (non)character U+FFFE
No input should cause the UTF-8 decoder to produce illegal characters; any such character should be replaced with U+FFFD. The attached script generates the following output and warning:

fffe
Unicode character 0xfffe is illegal at utf8-nonchar.pl line 11.

It should instead produce:

fffd

and no warning.
Subject: utf8-nonchar.pl
use Encode;
use strict;
use warnings;

my $text = "aaa\xef\xbf\xbebbb";
my $utf = Encode::decode('UTF-8', $text, 0);
printf "%x\n", ord(substr($utf, 3, 1));
$utf =~ /\b(?:https?|ftp)/o;
The same thing goes for U+1FFFE (F0 9F BF BE). Presumably all of the U+xFFFE code points up to U+10FFFE are affected, but I haven't tested that.
From: jgmyers [...] proofpoint.com
Also affects U+1FFFF, U+2FFFF, on up to U+10FFFF. Also affects U+FDD0 through U+FDEF. Filed perl #38722 on the underlying bug in Perl_utf8n_to_uvuni().
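The affected set described above (the last two code points of every plane, plus U+FDD0 through U+FDEF) is the set of Unicode noncharacters. As a rough sketch only, the membership test can be written as below; `is_noncharacter` and `count_noncharacters` are hypothetical names for illustration and are not part of Encode or perl.

```c
#include <stdint.h>

/* Hypothetical helper (not part of Encode or perl): returns 1 if cp is a
 * Unicode noncharacter, 0 otherwise. */
static int is_noncharacter(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return 0;                       /* outside the Unicode range */
    if ((cp & 0xFFFE) == 0xFFFE)
        return 1;                       /* U+nFFFE and U+nFFFF in each plane */
    return cp >= 0xFDD0 && cp <= 0xFDEF; /* the contiguous BMP block */
}

/* Counts noncharacters over the whole code space: 17 planes * 2 + 32. */
static int count_noncharacters(void)
{
    int count = 0;
    for (uint32_t cp = 0; cp <= 0x10FFFF; cp++)
        count += is_noncharacter(cp);
    return count;
}
```

Under these assumptions `count_noncharacters()` yields 66, matching the 34 plane-final code points plus the 32 in U+FDD0..U+FDEF.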
From: jgmyers [...] proofpoint.com
Proposed fix.
diff -ru Encode-2.12-0orig/Encode.xs Encode-2.12-1utf8nonchar/Encode.xs
--- Encode-2.12-0orig/Encode.xs 2006-03-13 10:09:45.000000000 -0800
+++ Encode-2.12-1utf8nonchar/Encode.xs 2006-03-13 11:19:59.000000000 -0800
@@ -335,6 +335,10 @@
         if (strict && uv > PERL_UNICODE_MAX)
             ulen = -1;
 #endif
+        /* Work around perl #38722 */
+        if (strict && ((uv & 0xFFFE) == 0xFFFE ||
+                       (uv >= 0xFDD0 && uv <= 0xFDEF)))
+            ulen = -1;
         if (ulen == -1) {
             if (strict) {
                 uv = utf8n_to_uvuni(s, e - s, &ulen,
From: jgmyers [...] proofpoint.com
Updated proposed fix. Needed to adjust a test case to avoid a problematic character.
diff -ru Encode-2.12-0orig/Encode.xs Encode-2.12-1utf8nonchar/Encode.xs
--- Encode-2.12-0orig/Encode.xs 2006-03-13 10:09:45.000000000 -0800
+++ Encode-2.12-1utf8nonchar/Encode.xs 2006-03-13 11:19:59.000000000 -0800
@@ -335,6 +335,10 @@
         if (strict && uv > PERL_UNICODE_MAX)
             ulen = -1;
 #endif
+        /* Work around perl #38722 */
+        if (strict && ((uv & 0xFFFE) == 0xFFFE ||
+                       (uv >= 0xFDD0 && uv <= 0xFDEF)))
+            ulen = -1;
         if (ulen == -1) {
             if (strict) {
                 uv = utf8n_to_uvuni(s, e - s, &ulen,
diff -ru Encode-2.12-0orig/t/utf8strict.t Encode-2.12-1utf8nonchar/t/utf8strict.t
--- Encode-2.12-0orig/t/utf8strict.t 2006-03-13 10:09:43.000000000 -0800
+++ Encode-2.12-1utf8nonchar/t/utf8strict.t 2006-03-13 13:46:54.000000000 -0800
@@ -43,7 +43,7 @@
 %SEQ = (
     qq/ed 9f bf/ => 0, # 2.3.1
     qq/ee 80 80/ => 0, # 2.3.2
-    qq/f4 8f bf bf/ => 0, # 2.3.3
+    qq/f4 8f bf bd/ => 0, # 2.3.3
     qq/f4 90 80 80/ => 1, # 2.3.4 -- out of range so NG
     # "3 Malformed sequences" are checked by perl.
     # "4 Overlong sequences" are checked by perl.
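To see why the test case had to change, it may help to work out which code points the two byte sequences encode. The sketch below is illustration only: `decode_utf8_4` is a hypothetical helper, not Encode's decoder, and it assumes a well-formed 4-byte sequence without validating it.

```c
#include <stdint.h>

/* Hypothetical helper (not Encode's decoder): combines the payload bits of
 * a 4-byte UTF-8 sequence. Performs no validation of the input bytes. */
static uint32_t decode_utf8_4(unsigned char b0, unsigned char b1,
                              unsigned char b2, unsigned char b3)
{
    return ((uint32_t)(b0 & 0x07) << 18) |  /* 3 bits from the lead byte */
           ((uint32_t)(b1 & 0x3F) << 12) |  /* 6 bits per continuation byte */
           ((uint32_t)(b2 & 0x3F) << 6)  |
            (uint32_t)(b3 & 0x3F);
}
```

With this arithmetic, f4 8f bf bf is U+10FFFF, which is a noncharacter and would now be rejected in strict mode, while the replacement f4 8f bf bd is U+10FFFD, the highest code point that is not a noncharacter. The sequence f0 9f bf be from the earlier report comes out as U+1FFFE.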
On Mon Mar 13 16:54:08 2006, guest wrote:
> Updated proposed fix. Needed to adjust a test case to avoid a
> problematic character.
The test in your attachment passes on Encode 2.17, so I consider this one fixed.

Dan the Encode Maintainer