This queue is for tickets about the Encode CPAN distribution.

Report information
The Basics
Id: 7892
Status: resolved
Priority: 0/
Queue: Encode

People
Owner: DANKOGAI [...] cpan.org
Requestors: qbin [...] users.sourceforge.net
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 2.01
Fixed in: (no value)



Subject: Encode::Unicode croaks with malformed data
I use Encode-2.01 bundled with perl 5.8.5.
Encode::Unicode croaks even if CHECK arg is not Encode::FB_CROAK when conversion fails due to malformed unicode data (ex. invalid surrogate, missing a BOM). I hope invalid characters should be replaced with `substitution character' in Encode::Unicode.

% perl -MEncode -e '$a = "\xfe\xff\xd8\xd9\xda\xdb\0\n"; Encode::from_to($a, "utf16", "shift_jis", 0); print ("$a");'
UTF-16:Malformed LO surrogate d8d9 at /usr/lib/perl5/5.8.5/cygwin-thread-multi-64int/Encode.pm line 184.

% perl -MEncode -e '$a = "BOM missing"; Encode::from_to($a, "utf16", "shift_jis", 0); print ("$a");'
UTF-16:Unrecognised BOM 424f at /usr/lib/perl5/5.8.5/cygwin-thread-multi-64int/Encode.pm line 184.

% perl -v

This is perl, v5.8.5 built for cygwin-thread-multi-64int

Copyright 1987-2004, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using `man perl' or `perldoc perl'. If you have access to the
Internet, point your browser at http://www.perl.com/, the Perl Home Page.

% uname -a
CYGWIN_NT-5.1 blue-water 1.5.11(0.116/4/2) 2004-09-04 23:17 i686 unknown unknown Cygwin

[guest - Wed Oct 6 01:36:08 2004]:
> I use Encode-2.01 bundled with perl 5.8.5.
> Encode::Unicode croaks even if CHECK arg is not Encode::FB_CROAK when
> conversion fails due to malformed unicode data (ex. invalid
> surrogate, missing a BOM). I hope invalid characters should be
> replaced with `substitution character' in Encode::Unicode.

Well, in this particular case I believe Encode does the right thing. Unlike other encodings, where mappings are not one-to-one, UTFs are guaranteed to map to one another, so they should be treated more strictly. Consider it the "division by zero" of Encode :)

As for checking the integrity of the source string, you can use Encode::Guess.

Dan the Encode Maintainer
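
For readers who hit the same croak, one possible workaround is to trap it with eval and, as suggested above, ask Encode::Guess about the input first. The following is only a minimal sketch: convert_or_report is a hypothetical helper name, not part of Encode, and whether malformed UTF-16 croaks or is substituted is assumed to depend on the installed Encode version, so no output is claimed for the malformed case.

```perl
use strict;
use warnings;
use Encode ();
use Encode::Guess;    # exports guess_encoding() by default

# Hypothetical helper (not part of Encode): convert $octets from one
# encoding to another without letting a croak escape to the caller.
# Returns (converted_octets, undef) on success, (undef, error_text) on failure.
sub convert_or_report {
    my ($octets, $from, $to) = @_;
    my $copy = $octets;    # from_to() modifies its first argument in place
    my $len  = eval { Encode::from_to($copy, $from, $to, Encode::FB_DEFAULT) };
    if (!defined $len) {
        # Either from_to() croaked ($@ is set) or it returned undef.
        my $err = $@ || "conversion from $from to $to failed";
        chomp $err;
        return (undef, $err);
    }
    return ($copy, undef);
}

# guess_encoding() returns an encoding object on success and a plain
# error string otherwise.  A successful guess does not by itself prove
# that the rest of the data is well-formed, hence the eval above.
my $input = "\xfe\xff\x00A\x00B";    # BOM followed by "AB" in UTF-16BE
my $guess = guess_encoding($input);
print ref($guess) ? "guessed: " . $guess->name . "\n" : "no guess: $guess\n";

my ($out, $err) = convert_or_report($input, "UTF-16", "shift_jis");
print defined $out ? "converted: $out\n" : "failed: $err\n";
```

Passing the malformed strings from the report to the same helper should yield either converted output or an error message rather than a fatal error, which lets the caller decide how to handle bad input.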