Subject: Encode::Guess has some misguesses
Given the following script (warning: it creates several files whose names begin with "utf" in the current directory):
use Encode::Guess;

my $bom = chr(0xfeff);
for my $width (16, 32, 8) {
    for my $endianness ('be', 'le', '') {
        # UTF-8 has no endianness; UTF-16 and UTF-32 require one.
        next if ($width == 8) xor ($endianness eq '');
        for my $bomness (0, 1) {
            my $bomsuffix = $bomness ? 'b' : '';
            my $fn = sprintf "utf$width$endianness%s", $bomsuffix;

            # Write a test file: optional BOM, U+1234, then ASCII text.
            open(FH, ">:encoding(utf$width$endianness)", $fn) or die "$fn: $!";
            print FH $bomness ? $bom : '', chr(0x1234), "foobar\n";
            close FH;

            # Read the first line back as raw bytes and let Encode::Guess identify it.
            open(FH, "<$fn") or die "$fn: $!";
            my $enc = guess_encoding(my $data = <FH>);
            close(FH);
            ref($enc) or die "No idea: $enc\n";
            my $ename = $enc->name;
            my $utf8 = $enc->decode($data);

            # Columns: file, guessed name, BOM still present?, first code
            # point, name check, first-character check.
            printf "%-8s %-8s %d %04x - %s %s\n",
                $fn, $ename, $utf8 =~ /^$bom/ ? 1 : 0, ord($utf8),
                $ename =~ /^UTF-?$width($endianness($bomsuffix)?)?/i ? "ok" : "NO",
                ord($utf8) == 0x1234 ? "ok" : "NO";
        }
    }
}
I get:
utf16be  UTF-16BE 0 1234 - ok ok
utf16beb UTF-16   0 1234 - ok ok
utf16le  UTF-16LE 0 1234 - ok ok
utf16leb UTF-16   0 1234 - ok ok
utf32be  UTF-32BE 0 1234 - ok ok
utf32beb UTF-32   0 1234 - ok ok
utf32le  UTF-32LE 0 1234 - ok ok
utf32leb UTF-16   0 0000 - NO NO
utf8     utf8     0 1234 - ok ok
utf8b    utf8     1 feff - ok NO
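So two cases go wrong. UTF-32LE with a BOM is misguessed as UTF-16, and the decoded text starts with U+0000 instead of U+1234. UTF-8 with a BOM is identified correctly by name, but the BOM is left at the start of the decoded text. The second is admittedly a "freak" case, but the first one is a bit worrying.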
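
A possible explanation for the first case, which I have not confirmed against the Encode::Guess source: the UTF-32LE BOM is the bytes FF FE 00 00, and its first two bytes are exactly the UTF-16LE BOM (FF FE). A check that looks for the two-byte signatures first would therefore report UTF-16. Below is a minimal sketch of BOM sniffing that tests the longer signatures first. The names sniff_bom and @BOMS are hypothetical helpers of mine, not part of Encode::Guess; only Encode's decode() is a real library call.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical table, not taken from Encode::Guess: BOM byte signatures and
# the encoding each one implies. The four-byte signatures come first so that
# FF FE 00 00 is not mistaken for the two-byte UTF-16LE BOM FF FE.
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# Return (encoding name, BOM length in bytes), or an empty list if the
# octets do not start with any known BOM.
sub sniff_bom {
    my ($octets) = @_;
    for my $entry (@BOMS) {
        my ($sig, $name) = @$entry;
        return ($name, length $sig) if index($octets, $sig) == 0;
    }
    return;
}

# Usage: BOM followed by U+1234, both in UTF-32LE.
my $data = "\xFF\xFE\x00\x00\x34\x12\x00\x00";
my ($name, $skip) = sniff_bom($data);
my $body = substr($data, $skip);      # drop the BOM before decoding
my $text = decode($name, $body);
printf "%s, BOM is %d bytes, first char U+%04X\n", $name, $skip, ord($text);
# prints "UTF-32LE, BOM is 4 bytes, first char U+1234"
```

Skipping the BOM bytes before decoding also avoids the second problem, where U+FEFF survives at the start of the decoded UTF-8 text. One caveat: UTF-16LE data consisting of a BOM followed by U+0000 has the same first four bytes as the UTF-32LE BOM, so the longest-first order is a heuristic rather than a proof.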