Subject: Dictionary loading fails with wide character error with 'use encoding "utf8"'
Please see the patch below, which simply swaps the chomp and decode lines.
If 'use encoding "utf8"' has been enabled anywhere in the script that
uses Lingua::ZH::WordSegmenter, then strings are automatically promoted
to utf8 when they are modified. The chomp therefore marks $line as utf8,
and the subsequent decode() call fails with a 'Wide character' error.
Moving the chomp to after the decode resolves the problem.
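
For reference, here is a minimal standalone sketch of the corrected
ordering, not part of the patch itself. The file name "dict.txt" and the
'gb2312' encoding are illustrative placeholders for whatever $self->{dic}
and $self->{dic_encoding} hold, and the encoding pragma (long since
deprecated) merely mirrors the condition in the calling script that
triggers the failure:

  use strict;
  use warnings;
  use Encode qw(decode);
  use encoding "utf8";     # the pragma enabled in the calling script

  open my $FH, '<', 'dict.txt' or die "dict.txt: $!";
  while (my $line = <$FH>) {
      $line = decode('gb2312', $line);   # decode the raw bytes first ...
      chomp $line;                       # ... then chomp the decoded string
      my ($word, $freq) = split(/\s+/, $line);
      # build the frequency table from $word and $freq here
  }
  close $FH;
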
*** lib/Lingua/ZH/WordSegmenter.pm.orig 2007-03-30 16:08:39.000000000 +0930
--- lib/Lingua/ZH/WordSegmenter.pm 2009-08-07 09:58:11.000000000 +0930
***************
*** 68,75 ****
}
while(my $line = <$FH>){
- chomp $line;
$line = decode($self->{dic_encoding},$line);
my ($word,$freq) = split(/\s+/,$line);
my $len=length($word);
--- 68,75 ----
}
while(my $line = <$FH>){
$line = decode($self->{dic_encoding},$line);
+ chomp $line;
my ($word,$freq) = split(/\s+/,$line);
my $len=length($word);
***************
*** 285,291 ****
=cut
! our $VERSION = '0.01';
=head1 SYNOPSIS
--- 285,291 ----
=cut
! our $VERSION = '0.02';
=head1 SYNOPSIS