Bug #7892 for Encode: Encode::Unicode croaks with malformed data

Wed Oct 06 01:36:08 2004 Guest - Ticket created

Subject: Encode::Unicode croaks with malformed data

I use Encode-2.01 bundled with perl 5.8.5. Encode::Unicode croaks even if CHECK arg is not Encode::FB_CROAK when conversion fails due to malformed unicode data (ex. invalid surrogate, missing a BOM). I hope invalid characters should be replaced with `substitution character' in Encode::Unicode.

% perl -MEncode -e '$a = "\xfe\xff\xd8\xd9\xda\xdb\0\n"; Encode::from_to($a, "utf16", "shift_jis", 0); print ("$a");'
UTF-16:Malformed LO surrogate d8d9 at /usr/lib/perl5/5.8.5/cygwin-thread-multi-64int/Encode.pm line 184.

% perl -MEncode -e '$a = "BOM missing"; Encode::from_to($a, "utf16", "shift_jis", 0); print ("$a");'
UTF-16:Unrecognised BOM 424f at /usr/lib/perl5/5.8.5/cygwin-thread-multi-64int/Encode.pm line 184.

% perl -v
This is perl, v5.8.5 built for cygwin-thread-multi-64int

Copyright 1987-2004, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on this system using `man perl' or `perldoc perl'. If you have access to the Internet, point your browser at http://www.perl.com/, the Perl Home Page.

% uname -a
CYGWIN_NT-5.1 blue-water 1.5.11(0.116/4/2) 2004-09-04 23:17 i686 unknown unknown Cygwin

Tue Oct 19 17:05:43 2004 DANKOGAI [...] cpan.org - Taken

Tue Oct 19 17:36:03 2004 DANKOGAI [...] cpan.org - Correspondence added

[guest - Wed Oct 6 01:36:08 2004]:

> I use Encode-2.01 bundled with perl 5.8.5.
> Encode::Unicode croaks even if CHECK arg is not Encode::FB_CROAK when
> conversion fails due to malformed unicode data (ex. invalid
> surrogate, missing a BOM). I hope invalid characters should be
> replaced with `substitution character' in Encode::Unicode.

Well, in this particular case I believe Encode does the right thing. Unlike other encodings, where mappings are not one-to-one, UTFs are guaranteed to map to one another, so they should be treated more strictly. Consider this the "division by zero" of Encode :)

As for checking the integrity of the source string, you can use Encode::Guess.

Dan the Encode Maintainer

Tue Oct 19 17:36:04 2004 DANKOGAI [...] cpan.org - Status changed from 'new' to 'open'

Fri Oct 22 02:18:47 2004 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'
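
Since the ticket was resolved without changing the croak, a caller who cannot let malformed UTF-16 abort the program could trap the exception. The following is only a minimal sketch: try_decode_utf16 is a hypothetical helper, not part of Encode, and it passes Encode::FB_CROAK by default so that the strict behaviour is requested explicitly. How a CHECK of 0 is treated for malformed data may differ between Encode versions.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Encode): decode UTF-16 octets inside an
# eval so that a croak from Encode::Unicode does not end the program.
# Returns ($string, undef) on success, or (undef, $error) if decoding croaked.
sub try_decode_utf16 {
    my ($octets, $check) = @_;              # $octets is a copy, safe to modify
    $check = Encode::FB_CROAK unless defined $check;

    my $string = eval { Encode::decode('UTF-16', $octets, $check) };
    return (undef, $@) unless defined $string;
    return ($string, undef);
}

# The malformed surrogate sequence from the report above.
my ($text, $err) = try_decode_utf16("\xfe\xff\xd8\xd9\xda\xdb\0\n");
if (defined $text) {
    print "decoded ", length($text), " characters\n";
}
else {
    print "rejected: $err";
}
```

The caller can then decide what to do with rejected input (skip it, log it, or substitute a placeholder) instead of relying on the encoding layer to do so.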