Bug #15399 for Encode-Detect: Some UTF-8 detected as EUC-JP/EUC-KR

RT for rt.cpan.org

This queue is for tickets about the Encode-Detect CPAN distribution.

Report information

The Basics

Id:	15399
Status:	rejected
Priority:	0/
Queue:	Encode-Detect

People

Owner:	Nobody in particular
Requestors:	vskytta [...] gmail.com
Cc:
AdminCc:

Bug Information

Severity:	Normal
Broken in:	0.01
Fixed in:	(no value)

History

Sun Oct 30 02:44:59 2005 scop [...] cpan.org - Ticket created

Subject:

Some UTF-8 detected as EUC-JP/EUC-KR

With 0.01 and Perl 5.8.6 on Fedora Core 4, some UTF-8 data appears to be detected as EUC-*. For example, "München" in UTF-8 gets detected as EUC-JP, and "Skyttä" in UTF-8 as EUC-KR.
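The ambiguity behind this report can be reproduced outside Perl. The following is a minimal Python sketch, not the detector's algorithm: it only uses Python's standard codecs to check whether the UTF-8 bytes of the two sample strings are also structurally valid EUC-JP and EUC-KR. The helper name `decodes_as` is hypothetical and exists only for this illustration.

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if `data` is a valid byte sequence in `encoding`."""
    try:
        data.decode(encoding)  # strict mode raises on invalid sequences
        return True
    except UnicodeDecodeError:
        return False


# Sample strings from the report, paired with the encoding each was detected as.
samples = {"München": "euc_jp", "Skyttä": "euc_kr"}

results = {}
for text, legacy in samples.items():
    raw = text.encode("utf-8")
    # (valid as UTF-8?, valid as the legacy encoding?)
    results[text] = (decodes_as(raw, "utf-8"), decodes_as(raw, legacy))
    print(text, raw.hex(), results[text])
```

If both flags are true for a sample, validity alone cannot separate the candidates, and a detector has to fall back on statistics, which are weak when only one multi-byte character is present.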

Thu Nov 11 14:59:40 2010 dmuey [...] cpan.org - Correspondence added

Subject:

Some UTF-8 detected as EUC-JP/EUC-KR/gb18030

The attached utf-8 file is incorrectly detected as gb18030

$ perl -MEncode::Detect::Detector -E 'say Encode::Detect::Detector::detect(`cat ~/prueba.html`);'
gb18030
$

Subject:

prueba.html

Lotería Canción

Thu Nov 11 14:59:41 2010 The RT System itself - Status changed from 'new' to 'open'

Wed Jun 13 11:33:10 2012 bkb [...] cpan.org - Correspondence added

On Thu Nov 11 14:59:40 2010, DMUEY wrote:

> The attached utf-8 file is incorrectly detected as gb18030
>
> $ perl -MEncode::Detect::Detector -E 'say Encode::Detect::Detector::detect(`cat ~/prueba.html`);'
> gb18030
> $

There are exactly two bytes in the file which are not ASCII, so it's a bit ridiculous to expect a correct guess here. Unless you have evidence that bytes c3 b3 are not valid GB18030, it's not a mistake.
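The point about bytes c3 b3 can be checked with a short Python sketch. It assumes Python's built-in `gb18030` codec as a stand-in for whatever tables the detector uses, and the helper `non_ascii_bytes` is a hypothetical name introduced here. It decodes the same two bytes under both encodings and lists the non-ASCII bytes in one word from the attachment.

```python
def non_ascii_bytes(data: bytes) -> list:
    """Return the bytes in `data` that fall outside the 7-bit ASCII range."""
    return [b for b in data if b >= 0x80]


pair = bytes([0xC3, 0xB3])

# Strict decoding raises UnicodeDecodeError if the sequence is invalid.
as_utf8 = pair.decode("utf-8")        # a single accented Latin letter
as_gb18030 = pair.decode("gb18030")   # a single two-byte GB18030 character

print(repr(as_utf8), repr(as_gb18030))
print([hex(b) for b in non_ascii_bytes("Canción".encode("utf-8"))])
```

Both decodes succeeding is consistent with the comment above: with so few high bytes, the input carries very little evidence for one encoding over the other.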

Wed Jun 13 12:38:19 2012 JGMYERS [...] cpan.org - Correspondence added

Misdetections on short files will happen.

Wed Jun 13 12:38:20 2012 JGMYERS [...] cpan.org - Status changed from 'open' to 'rejected'