Bug #12226 for Plucene: Bug fix for terms that are the single character 0

Sun Apr 10 04:49:29 2005 Guest - Ticket created

Subject:

Bug fix for terms that are the single character 0

The single term 0 (the digit zero) causes problems when indexing. To reproduce, try indexing, using a WhitespaceAnalyzer, the text

"a 0 is higher than . in ascii"

and/or

"a 0 causes problems with 0.0.0.0"

The attached file has one patch each for

lib/Plucene/Index/TermInfosWriter.pm
lib/Plucene/Index/SegmentTermEnum.pm

A unit test (t/regress-05.t) is also included that tests for this problem.

For more details see the thread titled "out-of-order term"
http://www.kasei.com/pipermail/plucene/2005-April/thread.html#345

===============================================================================
lib/Plucene/Index/TermInfosWriter.pm
131c131,132
<     my $text = $term->text || "";
---
>     my $text = $term->text;
>     if (not defined($text)) { $text = ''; }
===============================================================================
lib/Plucene/Index/SegmentTermEnum.pm
136c136
<     $self->{buffer} ||= " " x $length;
---
>     if (not defined($self->{buffer})) { $self->{buffer} = " " x $length; }
===============================================================================
t/regress-05.t
#!/usr/bin/perl -w

=head1 NAME

regress-05.t

Check an index is created with the terms you expect.

Introduced for testing bugs in Plucene v 1.21 which had problems dealing
with a term that was the single character zero (0).

We create an index using various chunks of text, then test that each term
in the index matches what we are expecting.

=cut

use strict;
use warnings;

use Plucene::Document;
use Plucene::Document::Field;
use Plucene::Index::Writer;
use Plucene::Analysis::WhitespaceAnalyzer;
use Plucene::Search::IndexSearcher;

use File::Temp qw(tempdir);
require Test::More;

$| = 0;

my $dir = tempdir(CLEANUP => 1);

my @strings = (
    'a simple test that should pass',
    'something lower than 0 in ascii is . (aka a period)',
    'a test with a 0 and 0.0.0.0 terms',
);

Test::More->import(tests => scalar(@strings));

foreach (@strings) { &test_build($_); }

sub test_build {
    my $string = shift;

    # Set up our index
    my $analyzer = Plucene::Analysis::WhitespaceAnalyzer->new();
    my $writer   = Plucene::Index::Writer->new($dir, $analyzer, 1);
    my $doc      = Plucene::Document->new;

    # Index the string and close the writer/index.
    $doc->add(Plucene::Document::Field->Text("content", $string));
    $writer->add_document($doc);
    $writer->optimize(); # This invalidates $writer
    undef $writer;       # Forces $writer->DESTROY() to be called, merging segments

    # Read the index back in and compare each term
    my $searcher = Plucene::Search::IndexSearcher->new( $dir );
    my $enum     = $searcher->reader->terms();

    # Expected terms: the words of the string, sorted, with duplicates removed
    my @all = sort split(/\s+/, $string);
    my @keys;
    for (my $i = 0; $i < scalar(@all); $i++) {
        if ( ($i > 0) and ($all[$i-1] eq $all[$i])) { next; }
        push(@keys, $all[$i]);
    }

    my ($pos, $success) = (0,1);
    while($enum->next) {
        if ($enum->term->text ne $keys[$pos++]) {
            $success = 0;
            last;
        }
    }

    if (not $success) {
        ok(0, "Term not matching expected result\n" .
              "Expecting term '" . $keys[$pos - 1] . "' but got '" .
              $enum->term->text . "'\nwhile testing the string '$string'");
    }
    elsif (scalar(@keys) != $pos) {
        ok(0, "Not enough terms in the index\n" .
              "Expecting " . scalar(@keys) . " but only found $pos\n" .
              "while testing the string '$string'");
    }
    else {
        ok(1);
    }
}
===============================================================================
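For readers following the two patches above: both original lines rely on Perl's truthiness test (|| and ||=), and the string "0" is false in Perl, so a term whose text is exactly "0" is treated like a missing value. The sketch below is plain Perl and does not touch Plucene; the helper names text_with_or and text_with_defined are hypothetical and exist only to contrast the two idioms. It uses defined() rather than the // operator because the Perl 5.8.4 mentioned later in this ticket predates //.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only; not part of Plucene.

# Mirrors the original line: any false value ("0", "", undef) becomes "".
sub text_with_or {
    my ($text) = @_;
    return $text || "";
}

# Mirrors the patched line: only undef becomes "".
sub text_with_defined {
    my ($text) = @_;
    if (not defined($text)) { $text = ''; }
    return $text;
}

for my $term ("a", "0", "0.0.0.0", undef) {
    my $label = defined($term) ? "'$term'" : "undef";
    printf "%-10s  ||: '%s'   defined: '%s'\n",
        $label, text_with_or($term), text_with_defined($term);
}
```

Only the exact string "0" is affected: "0.0.0.0" is a true value in Perl, so both helpers return it unchanged, which is consistent with the bug being reported for the single character 0 specifically.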

Sun Jul 17 06:35:24 2005 TMTM [...] cpan.org - Correspondence added

[guest - Sun Apr 10 04:49:29 2005]:

> The single term 0 (the digit zero) causes problems when indexing.
> A unit test (t/regress-05.t) is also included that tests for this
> problem.
>
> For more details see the thread titled "out-of-order term"
> http://www.kasei.com/pipermail/plucene/2005-April/thread.html#345

I'm not really liking the test here - it seems a little too low level. Can we not just have a test that indexes and searches, rather than reading the index back in?

Thanks,

Tony

Wed Jan 25 03:07:10 2006 Guest - Correspondence added

On Sun Jul 17 06:35:24 2005, TMTM wrote:

> I'm not really liking the test here - it seems a little too low level.
> Can we not just have a test that indexes and searches, rather than
> reading the index back in?

attached.

#!/usr/bin/perl -w

=head1 NAME

regress-05.t

Check an index is created with the terms you expect.

Introduced for testing bugs in Plucene v 1.21 which had problems dealing
with a term that was the single character zero (0).

Also tests for a bug present up to 1.24 that causes numeric terms to be
incorrectly indexed.

We create an index using various chunks of text, then test that we can
search the index correctly for those terms.

=cut

use strict;
use warnings;

use Plucene::Document;
use Plucene::Document::Field;
use Plucene::Index::Writer;
use Plucene::Analysis::WhitespaceAnalyzer;
use Plucene::Search::IndexSearcher;
use Plucene::QueryParser;

use File::Temp qw(tempdir);
require Test::More;

$| = 0;

my $dir = tempdir(CLEANUP => 1);

my @strings = (
    'a simple test that should pass',
    'something lower than 0 in ascii is . [aka a period]',
    'a test with a 0 and 0.0.0.0 terms',
);

Test::More->import(tests => scalar(@strings));

foreach (@strings) { &test_build($_); }

sub test_build {
    my $string = shift;

    # Setup our index
    my $analyzer = Plucene::Analysis::WhitespaceAnalyzer->new();
    my $writer   = Plucene::Index::Writer->new($dir, $analyzer, 1);
    my $doc      = Plucene::Document->new;

    # Index the string and close the writer/index.
    $doc->add(Plucene::Document::Field->Text("content", $string));
    $writer->add_document($doc);
    $writer->optimize(); # This invalidates $writer
    undef $writer;       # Forces $writer->DESTROY() to be called, merging segments

    # Prepare to search on the index
    my $searcher = Plucene::Search::IndexSearcher->new( $dir );
    my $parser   = Plucene::QueryParser->new({
        analyzer => Plucene::Analysis::WhitespaceAnalyzer->new(),
        default  => 'content'
    });

    # Split the indexed term into words and check each exists in
    # the index.
    my $hit   = 0;
    my @terms = split(/\s+/, $string);
    my @missed;
    foreach my $term (@terms) {
        #print("-$term-\n");
        my $query = $parser->parse("content:$term");
        my $hits  = $searcher->search($query);
        if ($hits->length() > 0) { $hit++; }
        else { push(@missed, $term); }
    }

    if ($hit == scalar(@terms)) {
        ok(1);
    }
    else {
        my $msg = "The following terms (minus the quotes) were either " .
                  "not indexed, or failed to be found when searched for:\n ";
        foreach my $missed (@missed) { $msg .= "'$missed',"; }
        chop($msg);
        $msg .= "\nwhile testing the string '$string'";
        ok(0, $msg);
    }
}

Wed Jan 25 03:07:11 2006 The RT System itself - Status changed from 'new' to 'open'

Thu Mar 02 17:50:11 2006 Guest - Correspondence added

I have a similar problem with the WhitespaceAnalyzer when characters other than a-z or 0-9 are involved.

When using the default values for /usr/local/share/perl/5.8.4/Plucene/Analysis/WhitespaceTokenizer.pm

sub token_re { qr/\S+/ }

the indexing will fail with an error similar to:

Docs out of order (44 < 53) at /usr/local/share/perl/5.8.4/Plucene/Index/SegmentMerger.pm line 149.

But when changing the token_re function into:

sub token_re { qr/[a-z\d]+/ }

which will only allow a-z and 0-9, the indexing has no problems whatsoever (at least I don't get the above error message).

This is using Plucene 1.24 downloaded through CPAN using

perl -MCPAN -e 'install Plucene'

on a Debian box running a Linux 2.6 kernel and Perl 5.8.4.
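To make the difference between the two token_re patterns above concrete, here is a minimal sketch in plain Perl that applies each pattern to one of the sample strings from the original report. It is not Plucene's tokenizer: the helper tokens_for is hypothetical and simply collects every match of a pattern, which is an assumption about how the pattern is used.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only; not part of Plucene.
# Returns every non-overlapping match of $re in $text, left to right.
sub tokens_for {
    my ($re, $text) = @_;
    my @tokens = $text =~ /($re)/g;
    return @tokens;
}

my $text = "a 0 causes problems with 0.0.0.0";

my @loose  = tokens_for(qr/\S+/,      $text);  # default pattern
my @narrow = tokens_for(qr/[a-z\d]+/, $text);  # pattern from the workaround

print join("|", @loose),  "\n";  # a|0|causes|problems|with|0.0.0.0
print join("|", @narrow), "\n";  # a|0|causes|problems|with|0|0|0|0
```

The narrower pattern never produces a token containing punctuation, but it still produces the single-character token 0, and it skips uppercase letters entirely. Whether the disappearance of the "Docs out of order" error is related to that difference is not established anywhere in this ticket.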

Fri Mar 03 02:27:17 2006 tony [...] kasei.com - Correspondence added

Subject:	Re: [rt.cpan.org #12226] Bug fix for terms that are the single character 0
Date:	Fri, 3 Mar 2006 07:26:56 +0000
To:	Guest via RT <bug-plucene [...] rt.cpan.org>
From:	Tony Bowden <tony [...] kasei.com>

On Thu, Mar 02, 2006 at 05:50:12PM -0500, Guest via RT wrote:

> When using the default values
> for /usr/local/share/perl/5.8.4/Plucene/Analysis/WhitespaceTokenizer.pm
> sub token_re { qr/\S+/ }
> the indexing will fail with an error similar to:
> Docs out of order (44 < 53)

Any chance of a test case for this?

Thanks,

Tony

Fri Mar 03 03:39:40 2006 mintywalker [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #12226] Bug fix for terms that are the single character 0
Date:	Fri, 3 Mar 2006 08:39:27 +0000
To:	bug-plucene [...] rt.cpan.org
From:	Minty <mintywalker [...] gmail.com>

not me, but I'll email the guy and see if he can help :)

On 3/3/06, Tony Bowden via RT <bug-plucene@rt.cpan.org> wrote:

> On Thu, Mar 02, 2006 at 05:50:12PM -0500, Guest via RT wrote:
> > When using the default values
> > for /usr/local/share/perl/5.8.4/Plucene/Analysis/WhitespaceTokenizer.pm
> > sub token_re { qr/\S+/ }
> > the indexing will fail with an error similar to:
> > Docs out of order (44 < 53)
>
> Any chance of a test case for this?
>
> Thanks,
>
> Tony

Fri Mar 03 06:59:09 2006 Guest - Correspondence added

From:

Apachez

I have emailed Minty a sample of data where the error occurs, along with the script I use to send data from the database (mysql) into plucene.

During my more aggressive tests to collect data for the sample I received another error which might in more detail point to where the actual error can be located:

"
Docs out of order (44 < 49) at /usr/local/share/perl/5.8.4/Plucene/Index/SegmentMerger.pm line 149.
(in cleanup) Can't call method "seek" on an undefined value at /usr/local/share/perl/5.8.4/Plucene/Index/TermInfosWriter.pm line 146 during global destruction.
"

Kind Regards
Apachez