Bug #44523 for Encode: files containing NULL byte reported as UTF-LE by Encode::Guess

Tue Mar 24 12:00:26 2009 jquelin [...] cpan.org - Ticket created

attached file contains "foo<null>bar" where <null> is the null byte (ctrl+v ctrl+0 in vim, or ctrl+q in emacs)

this file is detected as UTF-16LE by Encode::Guess, as demonstrated by snippet:

    $ perl -MEncode::Guess -E '$a=qx{cat null}; say guess_encoding($a,"ascii")->name;'
    UTF-16LE

and of course, using this detected encoding to decode the file yields very strange results:

    $ perl -MEncode -E '$a=qx{cat null}; $b=decode("UTF-16LE",$a); say $b'
    Wide character in print at -e line 1.
    潦o慢ੲ

happens with Encode 2.32, providing Encode::Guess 2.03

Attachment: null (application/octet-stream, 8 bytes)

Tue Mar 24 12:07:39 2009 DANKOGAI [...] cpan.org - Correspondence added

On Tue Mar 24 12:00:26 2009, JQUELIN wrote:

> attached file contains "foo<null>bar" where <null> is the null byte
> (ctrl+v ctrl+0 in vim, or ctrl+q in emacs)
>
> this file is detected as UTF-16LE by Encode::Guess, as demonstrated by
> snippet:
>
> $ perl -MEncode::Guess -E '$a=qx{cat null}; say guess_encoding($a,"ascii")->name;'
> UTF-16LE
>
> and of course, using this detected encoding to decode the file yields
> very strange results:
> $ perl -MEncode -E '$a=qx{cat null}; $b=decode("UTF-16LE",$a); say $b'
> Wide character in print at -e line 1.
> 潦o慢ੲ
>
> happens with Encode 2.32, providing Encode::Guess 2.03

No, that's not a bug. That's what UTF-(16|32)(LE|BE) is all about, i.e. \x20\x00 is VALID and it means \x{0020}.

Dan the Maintainer Thereof
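To make the point about valid byte pairs concrete, here is a small sketch of how the bytes of the attached file pair up when read as UTF-16LE. The byte string "foo\0bar\n" is an assumption about the attachment's exact contents (the report gives "foo<null>bar" and a size of 8 bytes, so a trailing newline is assumed), and the helper name utf16le_code_points is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed contents of the attached file: "foo", a NUL byte, "bar", newline.
my $octets = "foo\0bar\n";

# Hypothetical helper: decode octets as UTF-16LE and list the code points.
sub utf16le_code_points {
    my ($bytes) = @_;
    my $text = decode('UTF-16LE', $bytes);
    return map { sprintf('U+%04X', ord($_)) } split(//, $text);
}

# Each pair of bytes becomes one 16-bit code unit, low byte first:
# "fo" = 66 6F -> U+6F66, "o\0" = 6F 00 -> U+006F,
# "ba" = 62 61 -> U+6162, "r\n" = 72 0A -> U+0A72.
print join(' ', utf16le_code_points($octets)), "\n";
# prints: U+6F66 U+006F U+6162 U+0A72

# The example from the reply: the bytes 20 00 are a valid UTF-16LE space.
print join(' ', utf16le_code_points("\x20\x00")), "\n";
# prints: U+0020
```

Those four code points are exactly the characters 潦, o, 慢 and ੲ seen in the report, so the decode is behaving as specified. The open question is only which encoding to assume.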

Tue Mar 24 12:07:39 2009 The RT System itself - Status changed from 'new' to 'open'

Tue Mar 24 12:07:40 2009 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'

Tue Mar 24 13:28:34 2009 jquelin [...] cpan.org - Correspondence added

i understand that the sequence is valid utf-16. what i'm objecting is that it's not the best guess in this case... what should i do to have a correct guess?

Tue Mar 24 13:28:34 2009 The RT System itself - Status changed from 'resolved' to 'open'

Sun Jul 12 21:47:58 2009 DANKOGAI [...] cpan.org - Correspondence added

On Tue Mar 24 13:28:34 2009, JQUELIN wrote:

> i understand that the sequence is valid utf-16. what i'm objecting is
> that it's not the best guess in this case...
>
> what should i do to have a correct guess?

Of course it is not the best. After all, it is guessing, and so long as the data appears valid, it returns the only valid guess. Read perldoc Encode::Guess one more time.

Dan the Encode Maintainer
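One possible caller-side workaround, sketched here under stated assumptions rather than taken from the thread or from the Encode::Guess documentation, is to check for pure 7-bit data before asking for a guess. The helper guess_name_ascii_first is hypothetical; the only library call used is guess_encoding(), which Encode::Guess exports by default.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding

# Hypothetical helper, not part of Encode::Guess: report "ascii" for data
# made only of 7-bit bytes, and defer to guess_encoding() for anything else.
sub guess_name_ascii_first {
    my ($octets, @suspects) = @_;

    # NUL (0x00) is itself a 7-bit ASCII character, so "foo\0bar" passes.
    return 'ascii' if length($octets) && $octets !~ /[^\x00-\x7F]/;

    my $enc = guess_encoding($octets, @suspects);

    # guess_encoding() returns an encoding object on success and a plain
    # error string otherwise, so check for a reference before calling name.
    return ref($enc) ? $enc->name : undef;
}

print guess_name_ascii_first("foo\0bar\n"), "\n";    # prints: ascii
```

The trade-off is that genuine UTF-16 text without a BOM that happens to contain only ASCII-range characters (for example the bytes "f\0o\0o\0") would also be reported as ascii, so this only suits input where stray NUL bytes are more likely than BOM-less UTF-16.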

Sun Jul 12 21:48:00 2009 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'