Subject: segment merging problems with utf8-strings
This is with Perl 5.8.5 and Plucene v1.20.
The following short program triggers the problem:
################cut here##########
#!/usr/bin/perl
use strict;
use warnings;
use Plucene::Document;
use Plucene::Document::Field;
use Plucene::Analysis::SimpleAnalyzer;
use Plucene::Index::Writer;
use Encode;

# Third argument (1) asks the writer to create the index.
my $idx_writer = Plucene::Index::Writer->new("my_idx",
    Plucene::Analysis::SimpleAnalyzer->new(), 1);

# Build character strings (utf8 flag on) that contain a non-ASCII character.
my @lines;
foreach (qw/one two three four/) {
    push(@lines, decode("latin1", "$_ \xe9$_")); # \xe9 = e-acute
}

my $nr = 0;
foreach my $txt (@lines) {
    warn(++$nr, "\t", $txt, "\n");
    my $fld = Plucene::Document::Field->Text("txt", $txt);
    my $doc = Plucene::Document->new;
    $doc->add($fld);
    $idx_writer->add_document($doc);
    $idx_writer->optimize; # forces a segment merge, which triggers the error message
}
################cut here##########
Here is the output on a latin1 terminal. It shows that the utf8 flag is
lost and that the already-encoded multibyte sequences are converted to
UTF-8 again:
1 one éone
2 two étwo
3 three éthree
Can't add out-of-order term ÃÂtwo lt ÃÂéone (txt lt txt) at /pkgs/perl-5.8.5/lib/site_perl/5.8.5/Plucene/Index/SegmentMerger.pm line 154
4 four éfour
Can't add out-of-order term ÃÂtwo lt ÃÂÃÂÃÂéone (txt lt txt) at /pkgs/perl-5.8.5/lib/site_perl/5.8.5/Plucene/Index/SegmentMerger.pm line 154
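The double encoding that the output suggests can be sketched without Plucene, using only the Encode module. This is a guess at the mechanism, not a trace of what SegmentMerger.pm actually does; hex_dump is a hypothetical helper written for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode_utf8);

# Hypothetical helper: show a string as hex, one pair per character.
sub hex_dump {
    my ($str) = @_;
    return join " ", map { sprintf("%02X", ord($_)) } split //, $str;
}

# Character string with the utf8 flag on, as in the test program.
my $chars = decode("latin1", "\xe9one");

# Writing it out gives the UTF-8 bytes for e-acute: C3 A9.
my $bytes = encode_utf8($chars);

# Assumed failure mode: the bytes are read back without being decoded,
# so each byte counts as a latin1 character and is encoded once more.
my $twice = encode_utf8($bytes);

print "characters:    ", hex_dump($chars), "\n"; # E9 6F 6E 65
print "encoded once:  ", hex_dump($bytes), "\n"; # C3 A9 6F 6E 65
print "encoded twice: ", hex_dump($twice), "\n"; # C3 83 C2 A9 6F 6E 65
```

Each further pass through this cycle lengthens the prefix again, which would match the way the term in the second error message is longer than in the first.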
Can anyone reproduce this, or is this peculiar to my environment?