
This queue is for tickets about the Encode-Detect CPAN distribution.

Report information
The Basics
Id: 15399
Status: rejected
Priority: 0
Queue: Encode-Detect

People
Owner: Nobody in particular
Requestors: vskytta [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 0.01
Fixed in: (no value)



Subject: Some UTF-8 detected as EUC-JP/EUC-KR
With 0.01 and Perl 5.8.6 on Fedora Core 4, some UTF-8 data appears to be detected as EUC-*. For example, "München" in UTF-8 gets detected as EUC-JP, and "Skyttä" in UTF-8 as EUC-KR.
Subject: Some UTF-8 detected as EUC-JP/EUC-KR/gb18030
The attached utf-8 file is incorrectly detected as gb18030

$ perl -MEncode::Detect::Detector -E 'say Encode::Detect::Detector::detect(`cat ~/prueba.html`);'
gb18030
$
Subject: prueba.html

Lotería Canción

On Thu Nov 11 14:59:40 2010, DMUEY wrote:
> The attached utf-8 file is incorrectly detected as gb18030
>
> $ perl -MEncode::Detect::Detector -E 'say Encode::Detect::Detector::detect(`cat ~/prueba.html`);'
> gb18030
> $
There are exactly two bytes in the file that are not ASCII, so it's a bit ridiculous to expect a correct guess here. Unless you have evidence that the bytes c3 b3 are not valid GB18030, it's not a mistake.
Misdetections on short files will happen.
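
To illustrate the point about ambiguity, here is a minimal sketch in Python (not Encode::Detect itself, and not how its detector works internally) that simply checks which codecs accept a byte string under strict decoding. The helper name `plausible_encodings` and the candidate list are assumptions made for this example; the sample strings are the ones mentioned in this ticket.

```python
# Candidate codecs to try; this list is an assumption for the example,
# chosen to match the encodings named in this ticket.
CANDIDATES = ["utf-8", "euc_jp", "euc_kr", "gb18030"]


def plausible_encodings(data):
    """Return the candidate codecs that decode `data` without error.

    This is only a validity check, not a statistical detector: it shows
    which guesses cannot be ruled out from the bytes alone.
    """
    accepted = []
    for name in CANDIDATES:
        try:
            data.decode(name)  # strict decoding: raises on invalid sequences
        except UnicodeDecodeError:
            continue
        accepted.append(name)
    return accepted


if __name__ == "__main__":
    samples = {
        "Muenchen with u-umlaut": "M\u00fcnchen".encode("utf-8"),
        "Skytta with a-umlaut": "Skytt\u00e4".encode("utf-8"),
        "bytes c3 b3": b"\xc3\xb3",
    }
    for label, raw in samples.items():
        print(label, raw.hex(), plausible_encodings(raw))
```

When several codecs accept the same input, a detector has to fall back on character frequency, and a handful of non-ASCII bytes gives it very little to work with.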