Bug #11332 for Plucene: segment merging problems with utf8-strings

Subject:

segment merging problems with utf8-strings

This is with perl 5.8.5, Plucene v1.20.

This short program tickles the problem:

################cut here##########
#!/usr/bin/perl
use Plucene::Document;
use Plucene::Document::Field;
use Plucene::Analysis::SimpleAnalyzer;
use Plucene::Index::Writer;
use Encode;

my $idx_writer = Plucene::Index::Writer->new("my_idx",
    Plucene::Analysis::SimpleAnalyzer->new(), 1);

foreach (qw/one two three four/) {
    push (@lines, decode("latin1", "$_ \xe9$_")); # \xe9 = e-acute
}

foreach my $txt (@lines) {
    warn(++$nr, "\t", $txt, "\n");
    my $fld = Plucene::Document::Field->Text("txt", $txt);
    my $doc = Plucene::Document->new;
    $doc->add($fld);
    $idx_writer->add_document($doc);
    $idx_writer->optimize; # tickle the errormsg
}
################cut here##########

The output suggests that the utf8 flag is lost during segment merging and that the multibyte sequences are then encoded to utf8 a second time. Here is the output on a latin1 terminal:

1 one Ã©one
2 two Ã©two
3 three Ã©three
Can't add out-of-order term ÃÂtwo lt ÃÂÃÂ©one (txt lt txt) at /pkgs/perl-5.8.5/lib/site_perl/5.8.5/Plucene/Index/SegmentMerger.pm line 154
4 four Ã©four
Can't add out-of-order term ÃÂtwo lt ÃÂÃÂÃÂÃÂ©one (txt lt txt) at /pkgs/perl-5.8.5/lib/site_perl/5.8.5/Plucene/Index/SegmentMerger.pm line 154

Can anyone reproduce this, or is this peculiar to my environment?
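
For anyone who wants to see the suspected double-encoding effect in isolation, here is a minimal sketch that uses only the Encode module and no Plucene code at all. It assumes (this is a guess, not something confirmed in the Plucene source) that the term bytes are read back without being decoded and are then encoded a second time. The variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Character string with the utf8 flag on, built the same way as in the report.
my $chars = decode("latin1", "\xe9one");

# Correct serialisation: a single pass through the UTF-8 encoder.
my $bytes = encode("UTF-8", $chars);

# Simulated fault (assumption): the bytes are treated as latin1 characters
# because the utf8 flag was lost, then encoded to UTF-8 once more.
my $twice = encode("UTF-8", decode("latin1", $bytes));

# Hex dump of both byte strings so the result does not depend on the terminal.
printf "once:  %s\n", join(" ", map { sprintf "%02x", ord } split //, $bytes);
printf "twice: %s\n", join(" ", map { sprintf "%02x", ord } split //, $twice);
```

In the second dump the two bytes for e-acute (c3 a9) have grown to four (c3 83 c2 a9), which on a latin1 terminal shows up as a run of "Ã" and "Â" characters much like the ones in the error messages above. Each additional unflagged round trip would lengthen the run again, which may be why the term in the second error message is longer than in the first.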