
This queue is for tickets about the Encode CPAN distribution.

Report information
The Basics
Id: 48018
Status: resolved
Priority: 0/
Queue: Encode

People
Owner: Nobody in particular
Requestors: lubo.rintel [...] gooddata.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: 2.23
Fixed in: (no value)



Subject: Encode and iconv (etc.) disagree on what's valid UTF-8
This input seems to be correctly marked as invalid by both Encode and iconv:

[lkundrak@trurl ~]$ perl -MEncode -e '$u = "\xef\xbf"; print $u; decode ("UTF-8", $u, 1);'|iconv -f utf8 -t iso8859-1
utf8 "\xEF" does not map to Unicode at /usr/lib/perl5/5.10.0/i386-linux-thread-multi/Encode.pm line 162.
iconv: incomplete character or shift sequence at end of buffer
[lkundrak@trurl ~]$

Most tools I've encountered won't accept the 0xEF 0xBF 0xBD sequence either, though not being an expert on the topic, I can't really say who's wrong here. See:

[lkundrak@trurl ~]$ perl -MEncode -e '$u = "\xef\xbf\xbd"; print $u; decode ("UTF-8", $u, 1);'|iconv -f utf8 -t iso8859-1
iconv: illegal input sequence at position 0

Iconv complains about something that decode() accepts happily.

On Mon Jul 20 07:37:05 2009, lkundrak wrote:
> This input seems to be correctly marked as invalid, both by Encode as
> well as iconv:
>
> [lkundrak@trurl ~]$ perl -MEncode -e '$u = "\xef\xbf"; print $u; decode
> ("UTF-8", $u, 1);'|iconv -f utf8 -t iso8859-1
> utf8 "\xEF" does not map to Unicode at
> /usr/lib/perl5/5.10.0/i386-linux-thread-multi/Encode.pm line 162.
> iconv: incomplete character or shift sequence at end of buffer
> [lkundrak@trurl ~]$
>
> Most tools I've encountered won't accept 0xEF 0xBF 0xBD sequence either,
> though not being an expert on the topic I can't really say who's wrong
> here. See:

That's U+FFFD (REPLACEMENT CHARACTER) encoded in UTF-8.

> [lkundrak@trurl ~]$ perl -MEncode -e '$u = "\xef\xbf\xbd"; print $u;
> decode ("UTF-8", $u, 1);'|iconv -f utf8 -t iso8859-1
> iconv: illegal input sequence at position 0
>
> Iconv complains about something that decode() accepts happily.

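It is a valid Unicode character that has no mapping in iso8859-1. decode() only checks that the input is well-formed UTF-8, while iconv also has to convert it to iso8859-1 and cannot. So both Encode and iconv are behaving correctly.

Dan the Encode Maintainer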
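
The distinction drawn in the reply above (input that is not valid UTF-8 versus valid UTF-8 that has no counterpart in the target charset) can be reproduced outside Perl. The following is a minimal sketch in Python, not Encode's or iconv's implementation; the function name classify and its return strings are made up for illustration, and it assumes Python's strict "utf-8" and "iso-8859-1" codecs behave like the tools in the ticket for these particular inputs.

```python
# Illustrative analogue only: this is neither Encode nor iconv.
# classify() is a hypothetical helper name chosen for this sketch.

def classify(raw: bytes) -> str:
    """Say whether raw is valid UTF-8 and, if so, whether it fits ISO-8859-1."""
    try:
        # Strict decoding: truncated or malformed sequences raise an error.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "invalid UTF-8"
    try:
        # Valid text can still fail here if the target charset lacks the character.
        text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return "valid UTF-8, no ISO-8859-1 mapping"
    return "valid UTF-8, maps to ISO-8859-1"


if __name__ == "__main__":
    samples = {
        "truncated sequence EF BF": b"\xef\xbf",
        "U+FFFD as EF BF BD": b"\xef\xbf\xbd",
        "U+00E9 as C3 A9": b"\xc3\xa9",
    }
    for label, raw in samples.items():
        print(f"{label}: {classify(raw)}")
```

The first sample fails at the decoding step, which corresponds to the error both tools reported for "\xef\xbf". The second decodes cleanly and fails only at the conversion step, which is the case where decode() succeeded and iconv complained.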