Subject: Dictionary loading fails with wide character error with 'use encoding "utf8"'
Please see the patch below, which simply swaps the chomp and decode lines.
If 'use encoding "utf8"' has been enabled anywhere in the script that
uses Lingua::ZH::WordSegmenter, then strings are automatically promoted
to utf8 when they are modified. The chomp therefore marks $line as utf8,
and the subsequent decode() call fails with a 'Wide character' error.
Moving the chomp to after the decode resolves the problem.
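
For reference, here is a minimal standalone sketch of the corrected
ordering, not part of the patch itself. The file name "dict.txt" and the
'gb2312' encoding are illustrative placeholders for whatever $self->{dic}
and $self->{dic_encoding} hold, and the encoding pragma (long since
deprecated) merely mirrors the condition in the calling script that
triggers the failure:

  use strict;
  use warnings;
  use Encode qw(decode);
  use encoding "utf8";     # the pragma enabled in the calling script

  open my $FH, '<', 'dict.txt' or die "dict.txt: $!";
  while (my $line = <$FH>) {
      $line = decode('gb2312', $line);   # decode the raw bytes first ...
      chomp $line;                       # ... then chomp the decoded string
      my ($word, $freq) = split(/\s+/, $line);
      # build the frequency table from $word and $freq here
  }
  close $FH;
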
*** lib/Lingua/ZH/WordSegmenter.pm.orig 2007-03-30 16:08:39.000000000 +0930
--- lib/Lingua/ZH/WordSegmenter.pm 2009-08-07 09:58:11.000000000 +0930
***************
*** 68,75 ****
}
while(my $line = <$FH>){
- chomp $line;
$line = decode($self->{dic_encoding},$line);
my ($word,$freq) = split(/\s+/,$line);
my $len=length($word);
--- 68,75 ----
}
while(my $line = <$FH>){
$line = decode($self->{dic_encoding},$line);
+ chomp $line;
my ($word,$freq) = split(/\s+/,$line);
my $len=length($word);
***************
*** 285,291 ****
=cut
! our $VERSION = '0.01';
=head1 SYNOPSIS
--- 285,291 ----
=cut
! our $VERSION = '0.02';
=head1 SYNOPSIS