
This queue is for tickets about the Plucene CPAN distribution.

Report information
The Basics
Id: 11332
Status: new
Priority: 0/
Queue: Plucene

People
Owner: Nobody in particular
Requestors: paul.bijnens [...] xplanation.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: 1.20
Fixed in: (no value)



Subject: segment merging problems with utf8-strings
This is with perl 5.8.5, Plucene v1.20:

This short program tickles the problem:

################cut here##########
#!/usr/bin/perl
use Plucene::Document;
use Plucene::Document::Field;
use Plucene::Analysis::SimpleAnalyzer;
use Plucene::Index::Writer;
use Encode;

my $idx_writer = Plucene::Index::Writer->new("my_idx",
    Plucene::Analysis::SimpleAnalyzer->new(), 1);

foreach (qw/one two three four/) {
    push (@lines, decode("latin1", "$_ \xe9$_"));   # \xe9 = e-acute
}
foreach my $txt (@lines) {
    warn(++$nr, "\t", $txt, "\n");
    my $fld = Plucene::Document::Field->Text("txt", $txt);
    my $doc = Plucene::Document->new;
    $doc->add($fld);
    $idx_writer->add_document($doc);
    $idx_writer->optimize;    # tickle the errormsg
}
################cut here##########

On a latin1 terminal, you can see that the utf8 flag is lost and the multibytes are converted again to utf8:
Here is the output on a latin1 terminal:

1	one éone
2	two étwo
3	three éthree
Can't add out-of-order term ÃÂtwo lt ÃÂéone (txt lt txt) at /pkgs/perl-5.8.5/lib/site_perl/5.8.5/Plucene/Index/SegmentMerger.pm line 154
4	four éfour
Can't add out-of-order term ÃÂtwo lt ÃÂÃÂÃÂéone (txt lt txt) at /pkgs/perl-5.8.5/lib/site_perl/5.8.5/Plucene/Index/SegmentMerger.pm line 154

Can anyone reproduce this, or is this peculiar to my environment?
From: torben-spam-plucene [...] nehmer.net
[guest - Wed Feb 2 07:09:27 2005]:
> Can anyone reproduce this, or is this peculiar to my environment?
I can confirm this behavior. It is related to 11658: the bug you have outlined here is responsible for the chaotic output of the SegmentMerger. On each segment merge, it seems to me that the UTF-8 sequences are taken byte-wise and re-encoded as UTF-8. According to a good friend of mine, who is a Perl-Guru(tm), the way the InputStream and OutputStream classes work in this respect is dangerous, as they just assume strings to be UTF-8, which apparently is not the case.
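To illustrate the suspected mechanism outside of Plucene, here is a minimal sketch using only the core Encode module. It does not use or reproduce Plucene's stream classes; the assumption is simply that already-encoded bytes get read back without being decoded and are then encoded a second time. The helper hex_dump is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Character string containing e-acute (U+00E9), built as in the report.
my $chars = decode("latin1", "\xe9one");

# Encoding once yields the expected UTF-8 bytes: C3 A9 6F 6E 65
my $bytes = encode("UTF-8", $chars);

# Assumed failure mode: the bytes are never decoded after reading, so each
# byte is treated as a latin1 character and encoded again.
# Result: C3 83 C2 A9 6F 6E 65
my $twice = encode("UTF-8", $bytes);

# Decoding exactly once after reading recovers the original string.
my $round_trip = decode("UTF-8", $bytes);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_dump {
    return join " ", map { sprintf("%02X", ord($_)) } split(//, $_[0]);
}

printf "encoded once:  %s\n", hex_dump($bytes);
printf "encoded twice: %s\n", hex_dump($twice);
printf "round trip ok: %s\n", ($round_trip eq $chars ? "yes" : "no");
```

If this is what happens on every merge, each optimize call would add another layer of encoding, which would fit the growing prefixes in the error messages above. A term that is re-encoded on every pass also changes its byte order relative to other terms, which could explain the out-of-order complaint.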