Subject: Encode and iconv (etc.) disagree on what's valid UTF-8
This input seems to be correctly rejected as invalid by both Encode and
iconv:
[lkundrak@trurl ~]$ perl -MEncode -e '$u = "\xef\xbf"; print $u; decode
("UTF-8", $u, 1);'|iconv -f utf8 -t iso8859-1
utf8 "\xEF" does not map to Unicode at
/usr/lib/perl5/5.10.0/i386-linux-thread-multi/Encode.pm line 162.
iconv: incomplete character or shift sequence at end of buffer
[lkundrak@trurl ~]$
Most tools I've encountered won't accept the sequence 0xEF 0xBF 0xBD
either, though, not being an expert on the topic, I can't really say
who's wrong here. See:
[lkundrak@trurl ~]$ perl -MEncode -e '$u = "\xef\xbf\xbd"; print $u;
decode ("UTF-8", $u, 1);'|iconv -f utf8 -t iso8859-1
iconv: illegal input sequence at position 0
iconv complains about something that decode() accepts happily.