Bug #77122 for Unicode-CaseFold: Output of fc kills Encode::decode

Thu May 10 23:58:19 2012 RSAVAGE [...] cpan.org - Ticket created

Subject: Output of fc kills Encode::decode

Hi

I'm processing subcountry names in Estonia, from:

http://en.wikipedia.org/wiki/ISO_3166-2:EE

I got to that page from the list of all countries:

http://en.wikipedia.org/wiki/ISO_3166-2

Code:

    for my $element (@$table)
    {
        $i++;

        $self -> log(debug => "code: $$element{code}");
        $self -> log(debug => "name: $$element{name}");
        $self -> log(debug => "decode: " . decode('utf8', $$element{name}));
        $self -> log(debug => "decode fc: " . decode('utf8', fc $$element{name}));

        $sth -> execute($country_id, $$element{code}, decode('utf8', fc $$element{name}), decode('utf8', $$element{name}), $i);
    }

Output:

    debug: code: EE-37.
    debug: name: Harjumaa.
    debug: decode: Harjumaa.
    debug: decode fc: harjumaa.
    debug: code: EE-39.
    debug: name: Hiiumaa.
    debug: decode: Hiiumaa.
    debug: decode fc: hiiumaa.
    debug: code: EE-44.
    debug: name: Ida-Virumaa.
    debug: decode: Ida-Virumaa.
    debug: decode fc: ida-virumaa.
    debug: code: EE-49.
    debug: name: JÃµgevamaa.
    debug: decode: Jõgevamaa.
    Cannot decode string with wide characters at /home/ron/perl5/perlbrew/perls/perl-5.14.2/lib/5.14.2/x86_64-linux-thread-multi/Encode.pm line 176.

So, the call to fc returns something unacceptable to decode when the name is Jõgevamaa.

I rigged the code to skip Estonia, and the code works for all other countries and their subcountries.

I then rigged the code to skip Jõgevamaa, and the next place it dies is:

    debug: code: EE-65.
    debug: name: PÃµlvamaa.
    debug: decode: Põlvamaa.
    Cannot decode string with wide characters at /home/ron/perl5/perlbrew/perls/perl-5.14.2/lib/5.14.2/x86_64-linux-thread-multi/Encode.pm line 176.

I.e., the names corresponding to the codes EE-51, EE-57 and EE-59 are all handled OK.

I rigged it to skip Põlvamaa, and the next place it dies is:

    debug: code: EE-86.
    debug: name: VÃµrumaa.
    debug: decode: Võrumaa.
    Cannot decode string with wide characters at /home/ron/perl5/perlbrew/perls/perl-5.14.2/lib/5.14.2/x86_64-linux-thread-multi/Encode.pm line 176.

So, each problem is 'o' with a tilde above it.

When I rigged the code to skip these 3 cases, everything worked.

This is Debian 6, 64 bit.
Perl V 5.14.2.
Encode V 2.44.
Unicode::CaseFold V 0.02.
Unicode::Normalize V 1.14.

Installing Perl V 5.15.9...

Versions of Encode, Unicode::CaseFold, Unicode::Normalize are the same.

Same problem :-(.

Cheers
Ron

Sun May 13 11:48:50 2012 ARODLAND [...] cpan.org - Correspondence added

On Thu May 10 23:58:19 2012, RSAVAGE wrote:

> $self -> log(debug => "decode: " . decode('utf8', $$element{name}));
> $self -> log(debug => "decode fc: " . decode('utf8', fc $$element{name}));

This isn't a bug in Unicode::CaseFold, except possibly the lack of a better error message (I will see what perl 5.16 does, and try to imitate it). In any case, decode('utf8', fc $bytes) is invalid. You should be writing fc decode('utf8', $bytes) instead, as fc works on character strings, not byte strings.
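A minimal sketch of the two orderings, for illustration only. It assumes Perl 5.16 or later and uses the core fc (via `use feature 'fc'`) rather than Unicode::CaseFold's fc, and the variable names are made up for the example. It also shows a likely reason why only the names containing õ fail: õ is encoded in UTF-8 as the bytes 0xC3 0xB5, and byte 0xB5 read as a character is MICRO SIGN, whose case fold is GREEK SMALL LETTER MU (U+03BC), a code point above 0xFF.

```perl
use strict;
use warnings;
use feature 'fc';      # Assumption: core fc (Perl 5.16+), not Unicode::CaseFold
use Encode qw(decode);

# UTF-8 encoded bytes for "Jõgevamaa"; õ (U+00F5) is the byte pair 0xC3 0xB5.
my $bytes = "J\xC3\xB5gevamaa";

# Wrong order: fold the byte string, then try to decode it.
# utf8::upgrade changes only the internal representation of the copy, so that
# fc applies Unicode rules to the 0x80-0xFF range whatever features are active.
# Each byte is then folded as if it were a Latin-1 character.
my $folded_bytes = $bytes;
utf8::upgrade($folded_bytes);
$folded_bytes = fc($folded_bytes);

# The fold of 0xB5 (MICRO SIGN) is U+03BC, so the result now holds a
# character that does not fit in a byte.
my $has_wide = $folded_bytes =~ /[^\x00-\xFF]/ ? 1 : 0;

# decode expects bytes, so it dies on that string.
my $error = '';
eval { decode('UTF-8', $folded_bytes); 1 } or $error = $@;

# Right order: decode the bytes into characters first, then fold.
my $chars  = decode('UTF-8', $bytes);
my $folded = fc($chars);

binmode(STDOUT, ':encoding(UTF-8)');
printf "fc on bytes produced a wide character: %s\n", $has_wide ? 'yes' : 'no';
print  "decode after fc failed: $error" if $error;
print  "fc after decode: $folded\n";
```

With Unicode::CaseFold the only change should be the import (`use Unicode::CaseFold;` instead of the feature line); the decode-then-fold order stays the same.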

Sun May 13 11:48:50 2012 The RT System itself - Status changed from 'new' to 'open'

Sun May 13 11:48:51 2012 ARODLAND [...] cpan.org - Status changed from 'open' to 'rejected'

Sun May 13 19:54:24 2012 ron [...] savage.net.au - Correspondence added

Subject:	Re: [rt.cpan.org #77122] Output of fc kills Encode::decode
Date:	Mon, 14 May 2012 09:50:35 +1000
To:	bug-Unicode-CaseFold [...] rt.cpan.org
From:	Ron Savage <ron [...] savage.net.au>

Hi Andrew

On 14/05/12 01:48, Andrew Rodland via RT wrote:

> <URL: https://rt.cpan.org/Ticket/Display.html?id=77122>
>
> On Thu May 10 23:58:19 2012, RSAVAGE wrote:
>> $self -> log(debug => "decode: " . decode('utf8', $$element{name}));
>> $self -> log(debug => "decode fc: " . decode('utf8', fc $$element{name}));
>
> This isn't a bug in Unicode::CaseFold, except possibly the lack of a
> better error message (I will see what perl 5.16 does, and try to imitate
> it). In any case, decode('utf8', fc $bytes) is invalid. You should be
> writing fc decode('utf8', $bytes) instead, as fc works on character
> strings, not byte strings.

OK. Thanx for the reply.

-- 
Ron Savage
http://savage.net.au/
Ph: 0421 920 622

Sun May 13 19:54:25 2012 The RT System itself - Status changed from 'rejected' to 'open'

Mon Jun 25 10:21:21 2012 ARODLAND [...] cpan.org - Status changed from 'open' to 'rejected'