
This queue is for tickets about the Encode CPAN distribution.

Report information
The Basics
Id: 6279
Status: resolved
Priority: 0/
Queue: Encode

People
Owner: DANKOGAI [...] cpan.org
Requestors: jhi [...] iki.fi
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: (no value)
Fixed in: (no value)



Subject: Encode::Guess has some misguesses
Given the following script (warning: a couple of files with names beginning with "utf" are created):

    use Encode::Guess;

    my $bom = chr(0xfeff);

    for my $width (16, 32, 8) {
        for my $endianness ('be', 'le', '') {
            next if ($width == 8) xor ($endianness eq '');
            for my $bomness (0, 1) {
                my $bomsuffix = $bomness ? 'b' : '';
                my $fn = sprintf "utf$width$endianness%s", $bomsuffix;
                open(FH, ">:encoding(utf$width$endianness)", $fn) or die "$fn: $!";
                print FH $bomness ? $bom : '', chr(0x1234), "foobar\n";
                close FH;
                open(FH, "<$fn");
                my $enc = guess_encoding(my $data = <FH>);
                close(FH);
                ref($enc) or die "No idea: $enc\n";
                my $ename = $enc->name;
                my $utf8 = $enc->decode($data);
                printf "%-8s %-8s %d %04x - %s %s\n",
                    $fn, $ename,
                    $utf8 =~ /^$bom/ ? 1 : 0,
                    ord($utf8),
                    $ename =~ /^UTF-?$width($endianness($bomsuffix)?)?/i ? "ok" : "NO",
                    ord($utf8) == 0x1234 ? "ok" : "NO";
            }
        }
    }

I get:

    utf16be  UTF-16BE 0 1234 - ok ok
    utf16beb UTF-16   0 1234 - ok ok
    utf16le  UTF-16LE 0 1234 - ok ok
    utf16leb UTF-16   0 1234 - ok ok
    utf32be  UTF-32BE 0 1234 - ok ok
    utf32beb UTF-32   0 1234 - ok ok
    utf32le  UTF-32LE 0 1234 - ok ok
    utf32leb UTF-16   0 0000 - NO NO
    utf8     utf8     0 1234 - ok ok
    utf8b    utf8     1 feff - ok NO

So UTF-32LE with BOM and UTF-8 with BOM are misguessed: the first is reported as UTF-16 and decodes to U+0000 instead of U+1234, and the second keeps the BOM (U+FEFF) as the first character of the decoded text. The second is admittedly a "freak", but the first one is a bit worrying.
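
The utf32leb result (guessed as UTF-16, first character U+0000) is consistent with the UTF-32LE BOM (FF FE 00 00) being matched by its first two bytes, which are the same as the UTF-16LE BOM (FF FE). The following is only a sketch of a BOM check that tests the longer BOMs first; sniff_bom is a hypothetical helper written for illustration, not part of Encode::Guess, and it does not claim to reflect how the module is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper (NOT part of Encode::Guess): map a leading BOM to an
# encoding name. The 4-byte UTF-32 BOMs are tested before the 2-byte UTF-16
# ones, because the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM
# (FF FE).
sub sniff_bom {
    my ($octets) = @_;
    return 'UTF-32BE' if $octets =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $octets =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-8'    if $octets =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $octets =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $octets =~ /\A\xFF\xFE/;
    return;    # no BOM recognised
}

# U+FEFF followed by U+1234, encoded as UTF-32LE octets
my $utf32le = "\xFF\xFE\x00\x00\x34\x12\x00\x00";
print sniff_bom($utf32le), "\n";    # prints "UTF-32LE"
```

Checking the longer BOM first has a known ambiguity: UTF-16LE data that starts with a BOM followed by U+0000 has the same leading four octets and would be reported as UTF-32LE by this sketch.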