Bug #6279 for Encode: Encode::Guess has some misguesses

Mon May 10 17:48:17 2004 Guest - Ticket created

Subject:

Encode::Guess has some misguesses

Given the following script (warning: a couple of files with names beginning with "utf" are created):

    use Encode::Guess;

    my $bom = chr(0xfeff);
    for my $width (16, 32, 8) {
        for my $endianness ('be', 'le', '') {
            next if ($width == 8) xor ($endianness eq '');
            for my $bomness (0, 1) {
                my $bomsuffix = $bomness ? 'b' : '';
                my $fn = sprintf "utf$width$endianness%s", $bomsuffix;
                open(FH, ">:encoding(utf$width$endianness)", $fn) or die "$fn: $!";
                print FH $bomness ? $bom : '', chr(0x1234), "foobar\n";
                close FH;
                open(FH, "<$fn");
                my $enc = guess_encoding(my $data = <FH>);
                close(FH);
                ref($enc) or die "No idea: $enc\n";
                my $ename = $enc->name;
                my $utf8 = $enc->decode($data);
                printf "%-8s %-8s %d %04x - %s %s\n",
                    $fn, $ename, $utf8 =~ /^$bom/ ? 1 : 0, ord($utf8),
                    $ename =~ /^UTF-?$width($endianness($bomsuffix)?)?/i ? "ok" : "NO",
                    ord($utf8) == 0x1234 ? "ok" : "NO";
            }
        }
    }

I get:

    utf16be  UTF-16BE 0 1234 - ok ok
    utf16beb UTF-16   0 1234 - ok ok
    utf16le  UTF-16LE 0 1234 - ok ok
    utf16leb UTF-16   0 1234 - ok ok
    utf32be  UTF-32BE 0 1234 - ok ok
    utf32beb UTF-32   0 1234 - ok ok
    utf32le  UTF-32LE 0 1234 - ok ok
    utf32leb UTF-16   0 0000 - NO NO
    utf8     utf8     0 1234 - ok ok
    utf8b    utf8     1 feff - ok NO

So UTF-32LE with BOM and UTF-8 with BOM are misguessed. The second is admittedly a "freak", but the first one is a bit worrying.
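The utf32leb result (guessed as UTF-16, first character 0000) is consistent with a BOM check that tests the 2-byte UTF-16 marks before the 4-byte UTF-32 marks, since the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE). The ticket does not confirm that this is the cause, so the following is only a minimal sketch of an ordering that avoids the overlap. `sniff_bom` is a hypothetical helper written for illustration, not part of Encode::Guess.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Encode::Guess): identify a BOM in raw bytes.
# The 4-byte UTF-32 marks are tested before the 2-byte UTF-16 marks because
# the UTF-32LE mark (FF FE 00 00) starts with the UTF-16LE mark (FF FE).
sub sniff_bom {
    my ($bytes) = @_;
    return 'UTF-32LE' if $bytes =~ /^\xFF\xFE\x00\x00/;
    return 'UTF-32BE' if $bytes =~ /^\x00\x00\xFE\xFF/;
    return 'UTF-8'    if $bytes =~ /^\xEF\xBB\xBF/;
    return 'UTF-16LE' if $bytes =~ /^\xFF\xFE/;
    return 'UTF-16BE' if $bytes =~ /^\xFE\xFF/;
    return;    # no BOM recognised
}
```

One trade-off of this ordering: UTF-16LE data whose first character after the BOM is U+0000 would be reported as UTF-32LE. That input is rare in text files, but the ambiguity cannot be removed by looking at the first four bytes alone.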

Sun May 16 16:57:18 2004 DANKOGAI [...] cpan.org - Status changed from 'new' to 'resolved'

Fri Oct 01 16:34:09 2004 DANKOGAI [...] cpan.org - Taken