This queue is for tickets about the Lingua-ZH-WordSegmenter CPAN distribution.

Report information
The Basics
Id: 48506
Status: new
Priority: 0/
Queue: Lingua-ZH-WordSegmenter

People
Owner: Nobody in particular
Requestors: JJSCHUTZ [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 0.01
Fixed in: (no value)



Subject: Dictionary loading fails with wide character error with 'use encoding "utf8"'
Please see the patch below, which simply swaps the chomp and decode lines. If 'use encoding "utf8"' has been enabled anywhere in the script that uses Lingua::ZH::WordSegmenter, then strings are automatically promoted to UTF-8 when they are modified. Thus chomp causes $line to be marked as UTF-8, and the subsequent decode() operation then fails with a 'Wide character' error. Placing the chomp after the decode resolves the problem.

*** lib/Lingua/ZH/WordSegmenter.pm.orig	2007-03-30 16:08:39.000000000 +0930
--- lib/Lingua/ZH/WordSegmenter.pm	2009-08-07 09:58:11.000000000 +0930
***************
*** 68,75 ****
  }

  while(my $line = <$FH>){
-     chomp $line;
      $line = decode($self->{dic_encoding},$line);
      my ($word,$freq) = split(/\s+/,$line);
      my $len=length($word);
--- 68,75 ----
  }

  while(my $line = <$FH>){
      $line = decode($self->{dic_encoding},$line);
+     chomp $line;
      my ($word,$freq) = split(/\s+/,$line);
      my $len=length($word);
***************
*** 285,291 ****

  =cut

! our $VERSION = '0.01';

  =head1 SYNOPSIS
--- 285,291 ----

  =cut

! our $VERSION = '0.02';

  =head1 SYNOPSIS
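The decode-before-chomp ordering that the patch introduces can be sketched in isolation. This is a minimal illustration, not module code: the in-memory scalar filehandle and the hard-coded 'UTF-8' encoding are stand-ins for the module's dictionary file and $self->{dic_encoding} setting.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Simulated dictionary contents: one "word frequency" line,
# stored as raw UTF-8 bytes, as if read from the dictionary file.
my $bytes = encode('UTF-8', "\x{4E2D}\x{6587} 10\n");
open my $FH, '<', \$bytes or die "cannot open in-memory dictionary: $!";

my @entries;
while (my $line = <$FH>) {
    $line = decode('UTF-8', $line);   # decode the raw bytes first...
    chomp $line;                      # ...then chomp the decoded string
    my ($word, $freq) = split(/\s+/, $line);
    push @entries, [$word, $freq];
}
close $FH;

# After decoding, length() counts characters rather than bytes,
# so the two-character word above has length 2.
```

Because chomp runs only on the already-decoded string, nothing can upgrade $line to UTF-8 before decode() sees it, which is exactly the ordering the patch establishes.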