Hi,
The subroutine Unihan_value() always assumes that its argument contains raw bytes and guesses the character encoding from there. That is a lot of redundant work when the caller already has character strings, particularly when bulk-processing data loaded from a database that has already been decoded into characters. The subroutine could simply accept character strings as they are and only run the guess on byte strings.
The patch is as simple as this:
----------
--- Utils.pm.orig 2014-09-16 11:59:25.000000000 +0200
+++ Utils.pm 2014-09-16 12:07:31.000000000 +0200
@@ -20,7 +20,7 @@
 sub Unihan_value {
     my $word = shift;
-    $word = cdecode($word);
+    $word = cdecode($word) unless Encode::is_utf8($word);
     my @unihan = map { uc sprintf("%x",$_) } unpack ("U*", $word);
     return wantarray?@unihan:(join('', @unihan));
 }
----------
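
To show the intended behaviour, here is a minimal self-contained sketch. It does not use the module itself: unihan_value_sketch() and decode_bytes_sketch() are hypothetical names, and decode_bytes_sketch() is only a stand-in for cdecode() that assumes the raw bytes are UTF-8 instead of guessing the encoding.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for cdecode(): assumes the input bytes are UTF-8.
# The real cdecode() guesses the encoding; that part is not reproduced here.
sub decode_bytes_sketch {
    my ($bytes) = @_;
    return Encode::decode('UTF-8', $bytes);
}

# Same logic as the patched Unihan_value(), using the stand-in decoder.
sub unihan_value_sketch {
    my $word = shift;
    # Only decode when the string is not already flagged as characters.
    $word = decode_bytes_sketch($word) unless Encode::is_utf8($word);
    my @unihan = map { uc sprintf("%x", $_) } unpack("U*", $word);
    return wantarray ? @unihan : join('', @unihan);
}

my $bytes = "\xE4\xB8\xAD";   # raw UTF-8 bytes of U+4E2D
my $chars = "\x{4E2D}";       # the same character, already decoded

my $from_bytes = unihan_value_sketch($bytes);   # scalar context
my $from_chars = unihan_value_sketch($chars);   # scalar context
```

With the check in place, both calls should yield the same code point, and the already-decoded string skips the decoding step. One caveat: Encode::is_utf8() tests Perl's internal UTF8 flag, which is set for decoded Han characters but may be off for a decoded string containing only ASCII or Latin-1 characters, so such strings would still go through the decoder.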