Bug #98923 for Lingua-Han-Utils: Unihan_value() always assumes its argument to contain raw bytes

Tue Sep 16 06:12:41 2014 GUGOD [...] cpan.org - Ticket created

Hi,

The subroutine Unihan_value() always assumes its argument contains raw bytes and guesses the character encoding from there. That is a lot of redundant work when it could simply accept character strings from the start, particularly when bulk-processing data loaded from databases that is already decoded into characters.

The patch is as simple as this:

----------
--- Utils.pm.orig 2014-09-16 11:59:25.000000000 +0200
+++ Utils.pm 2014-09-16 12:07:31.000000000 +0200
@@ -20,7 +20,7 @@
 sub Unihan_value {
     my $word = shift;
-    $word = cdecode($word);
+    $word = cdecode($word) unless Encode::is_utf8($word);
     my @unihan = map { uc sprintf("%x",$_) } unpack ("U*", $word);
     return wantarray?@unihan:(join('', @unihan));
 }
----------
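To illustrate the intended behaviour, here is a minimal standalone sketch of the patched logic. It does not use the module itself: decode_bytes() is a hypothetical stand-in for cdecode() that simply assumes UTF-8 input instead of guessing the encoding, and unihan_value_sketch() is an illustrative name, not part of Lingua::Han::Utils.

```perl
use strict;
use warnings;
use Encode qw(decode encode is_utf8);

# Hypothetical stand-in for the module's cdecode(): assumes the bytes are
# UTF-8, whereas the real routine guesses the encoding.
sub decode_bytes {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes);
}

# Sketch of the patched Unihan_value(): decode only when the argument is
# not already flagged as a character string.
sub unihan_value_sketch {
    my ($word) = @_;
    $word = decode_bytes($word) unless is_utf8($word);
    my @unihan = map { uc sprintf('%x', $_) } unpack('U*', $word);
    return wantarray ? @unihan : join('', @unihan);
}

my $chars = "\x{6211}";               # character string (U+6211)
my $bytes = encode('UTF-8', $chars);  # raw UTF-8 bytes for the same character

my @from_chars = unihan_value_sketch($chars);
my @from_bytes = unihan_value_sketch($bytes);
print "@from_chars @from_bytes\n";    # prints "6211 6211"
```

Note that Encode::is_utf8() only reports Perl's internal UTF-8 flag, so a character string that happens to be stored without that flag (for example, one containing only Latin-1 characters) would still go through the decoding step.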

Tue Sep 16 06:50:42 2014 fayland [...] cpan.org - Correspondence added

Hi, a new version has shipped. Thanks for the patch.

Tue Sep 16 06:50:43 2014 The RT System itself - Status changed from 'new' to 'open'

Tue Sep 16 06:50:46 2014 fayland [...] cpan.org - Status changed from 'open' to 'resolved'

Tue Sep 16 12:44:09 2014 ether [...] cpan.org - Subject changed from (no value) to 'Unihan_value() always assumes its argument to contain raw bytes'