Bug #11052 for Plucene: QueryParser should pass correct field name when tokenizing

Tue Jan 18 11:51:09 2005 Guest - Ticket created

Subject: QueryParser should pass correct field name when tokenizing

Distribution: Plucene-1.20
Perl version: ActiveState 5.8.6 build 811
OS: Windows XP

The current version QueryParser.pm always passes the name of its default field to the analyzer (via its call to the analyzer's tokenstream function). For queries of the form "FieldName:BasicClause", QueryParser should instead pass "FieldName" to the analyzer. I have verified that Java Lucene behaves this way.

I have made the following simple changes to QueryParser.pm (attached file) on my system and they fix the problem. I have not regression tested them.

C:\Perl\site\lib\Plucene>diff QueryParser.pm.bak QueryParser.pm
101c101
< $item->{term} = $self->_tokenize($extracted);
---
> $item->{term} = $self->_tokenize($extracted, $item->{field});
104c104
< $item->{term} = $self->_tokenize($1);
---
> $item->{term} = $self->_tokenize($1, $item->{field});
108c108
< $item->{term} = $self->_tokenize($1);
---
> $item->{term} = $self->_tokenize($1, $item->{field});
125c125
< my ($self, $image) = @_;
---
> my ($self, $image, $field) = @_;
127c127
< field => $self->{default},
---
> field => $field || $self->{default},
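The effect of the patched fallback can be sketched without Plucene installed. This is a minimal stand-in, not the real module: `RecordingAnalyzer` and `tokenize_with_field` are hypothetical names, and the stand-in analyzer takes a plain `text` argument instead of the `reader` handle the real `tokenstream` receives. Only the `$field || $default` fallback is taken from the diff.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an analyzer: it records which field name
# it was asked to tokenize for, then splits the text on whitespace.
package RecordingAnalyzer;

sub new { return bless { seen => [] }, shift }

sub tokenstream {
    my ($self, $args) = @_;
    push @{ $self->{seen} }, $args->{field};
    return [ split ' ', $args->{text} ];
}

package main;

# Mirrors the patched logic: use the field parsed from the query if
# there is one, otherwise fall back to the parser's default field.
sub tokenize_with_field {
    my ($analyzer, $default, $image, $field) = @_;
    my $tokens = $analyzer->tokenstream({
        field => $field || $default,
        text  => $image,
    });
    return join ' ', @$tokens;
}

my $analyzer = RecordingAnalyzer->new;
tokenize_with_field($analyzer, 'text', 'Your', 'author');    # clause "author:Your"
tokenize_with_field($analyzer, 'text', 'Name');              # bare term, no field
print join(', ', @{ $analyzer->{seen} }), "\n";              # prints "author, text"
```

Without the third argument being passed through, both calls would reach the analyzer with the default field `text`.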

package Plucene::QueryParser;

use strict;
use warnings;

use base 'Class::Accessor::Fast';

use Carp 'croak';
use IO::Scalar;
use Text::Balanced qw(extract_delimited extract_bracketed);

our $DefaultOperator = "OR";

__PACKAGE__->mk_accessors(qw(analyzer default));

=head1 NAME

Plucene::QueryParser - Turn query strings into Plucene::Search::Query objects

=head1 SYNOPSIS

    my $p = Plucene::QueryParser->new({
        analyzer => Plucene::Analysis::Analyzer $a,
        default  => "text"
    });

    my Plucene::Search::Query $q = $p->parse("foo bar:baz");

=head1 DESCRIPTION

This module is responsible for turning a query string into a
Plucene::Query object. It needs to have an Analyzer object to help it
tokenize incoming queries, and it also needs to know the default field
to be used if no field is given in the query string.

=head1 METHODS

=head2 new

    my $p = Plucene::QueryParser->new({
        analyzer => Plucene::Analysis::Analyzer $a,
        default  => "text"
    });

Construct a new query parser

=cut

sub new {
    my $self = shift->SUPER::new(@_);
    croak "You need to pass an analyzer"
        unless UNIVERSAL::isa($self->{analyzer}, "Plucene::Analysis::Analyzer");
    croak "No default field name supplied!" unless $self->{default};
    return $self;
}

=head2 parse

    my Plucene::Search::Query $q = $p->parse("foo bar:baz");

Turns the string into a query object.

=cut

sub parse {
    my $self = shift;
    local $_ = shift;
    my $ast = shift;
    my @rv;
    while ($_) {
        s/^\s+// and next;
        my $item;
        $item->{conj} = "NONE";
        s/^(AND|OR|\|\|)\s+//i;
        if ($1) {
            $item->{conj} = uc $1;
            $item->{conj} = "OR" if $item->{conj} eq "||";
        }
        if (s/^\+//) { $item->{mods} = "REQ"; }
        elsif (s/^(-|!|NOT)\s*//i) { $item->{mods} = "NOT"; }
        else { $item->{mods} = "NONE"; }

        if (s/^([^\s(":]+)://) { $item->{field} = $1 }

        # Subquery
        if (/^\(/) {
            my ($extracted, $remainer) = extract_bracketed($_, "(");
            if (!$extracted) { croak "Unbalanced subquery" }
            $_ = $remainer;
            $extracted =~ s/^\(//;
            $extracted =~ s/\)$//;
            $item->{query}    = "SUBQUERY";
            $item->{subquery} = $self->parse($extracted, 1);
        } elsif (/^"/) {
            my ($extracted, $remainer) = extract_delimited($_, '"');
            if (!$extracted) { croak "Unbalanced phrase" }
            $_ = $remainer;
            $extracted =~ s/^"//;
            $extracted =~ s/"$//;
            $item->{query} = "PHRASE";
            $item->{term}  = $self->_tokenize($extracted, $item->{field});
        } elsif (s/^(\S+)\*//) {
            $item->{query} = "PREFIX";
            $item->{term}  = $self->_tokenize($1, $item->{field});
        } else {
            s/([^\s\^]+)// or croak "Malformed query";
            $item->{query} = "TERM";
            $item->{term}  = $self->_tokenize($1, $item->{field});
            if ($item->{term} =~ / /) { $item->{query} = "PHRASE"; }
        }
        s/^~(\d+)// and $item->{slop} = $1;
        if (s/^\^(\d+(?:.\d+)?)//) { $item->{boost} = $1 }
        push @rv, bless $item,
            "Plucene::QueryParser::" . ucfirst lc $item->{query};
    }

    my $obj = bless \@rv, "Plucene::QueryParser::TopLevel";

    # If we only want the AST, don't convert to a Search::Query.
    if ($ast) { return $obj }
    return $obj->to_plucene($self->{default});
}

sub _tokenize {
    my ($self, $image, $field) = @_;
    my $stream = $self->{analyzer}->tokenstream({
        field  => $field || $self->{default},
        reader => IO::Scalar->new(\$image)
    });
    my @words;
    while (my $x = $stream->next) { push @words, $x->text }
    join(" ", @words);
}

package Plucene::QueryParser::TopLevel;

sub to_plucene {
    my ($self, $field) = @_;
    return $self->[0]->to_plucene($field)
        if @$self == 1 and $self->[0]->{mods} eq "NONE";

    my @clauses;
    $self->add_clause(\@clauses, $_, $field) for @$self;
    require Plucene::Search::BooleanQuery;
    my $query = new Plucene::Search::BooleanQuery;
    $query->add_clause($_) for @clauses;

    $query;
}

sub add_clause {
    my ($self, $clauses, $term, $field) = @_;
    my $q = $term->to_plucene($field);
    if ($term->{conj} eq "AND" and @$clauses) {

        # The previous term needs to become required
        $clauses->[-1]->required(1) unless $clauses->[-1]->prohibited;
    }

    if (    $Plucene::QueryParser::DefaultOperator eq "AND"
        and $term->{conj} eq "OR") {
        $clauses->[-1]->required(0) unless $clauses->[-1]->prohibited;
    }

    return unless $q;    # Shouldn't happen yet
    my $prohibited;
    my $required;
    if ($Plucene::QueryParser::DefaultOperator eq "OR") {

        # We set REQUIRED if we're introduced by AND or +; PROHIBITED if
        # introduced by NOT or -; make sure not to set both.
        $prohibited = ($term->{mods} eq "NOT");
        $required   = ($term->{mods} eq "REQ");
        $required   = 1 if $term->{conj} eq "AND" and !$prohibited;
    } else {

        # We set PROHIBITED if we're introduced by NOT or -; We set
        # REQUIRED if not PROHIBITED and not introduced by OR
        $prohibited = ($term->{mods} eq "NOT");
        $required = (!$prohibited and $term->{conj} ne "OR");
    }
    require Plucene::Search::BooleanClause;
    push @$clauses,
        Plucene::Search::BooleanClause->new({
            prohibited => $prohibited,
            required   => $required,
            query      => $q
        });
}

package Plucene::QueryParser::Term;

sub to_plucene {
    require Plucene::Search::TermQuery;
    require Plucene::Index::Term;
    my ($self, $field) = @_;
    $self->set_term($field);
    my $q = Plucene::Search::TermQuery->new({ term => $self->{pl_term} });
    $self->set_boost($q);
    return $q;
}

sub set_term {
    my ($self, $field) = @_;
    $self->{pl_term} = Plucene::Index::Term->new({
        field => (exists $self->{field} ? $self->{field} : $field),
        text  => $self->{term}
    });
}

sub set_boost {
    my ($self, $q) = @_;
    $q->boost($self->{boost}) if exists $self->{boost};
}

package Plucene::QueryParser::Phrase;

our @ISA = qw(Plucene::QueryParser::Term);

# This corresponds to the rules for "PHRASE" in the Plucene grammar

sub to_plucene {
    require Plucene::Search::PhraseQuery;
    require Plucene::Index::Term;
    my ($self, $field) = @_;
    my @words = split /\s+/, $self->{term};
    return $self->SUPER::to_plucene($field) if @words == 1;

    my $phrase = Plucene::Search::PhraseQuery->new;
    for my $word (@words) {
        my $term = Plucene::Index::Term->new({
            field => (exists $self->{field} ? $self->{field} : $field),
            text  => $word
        });
        $phrase->add($term);
    }
    if (exists $self->{slop}) {
        $phrase->slop($self->{slop});
    }
    $self->set_boost($phrase);
    return $phrase;
}

package Plucene::QueryParser::Subquery;

sub to_plucene {
    my ($self, $field) = @_;
    $self->{subquery}
        ->to_plucene(exists $self->{field} ? $self->{field} : $field);
}

package Plucene::QueryParser::Prefix;

our @ISA = qw(Plucene::QueryParser::Term);

sub to_plucene {
    require Plucene::Search::PrefixQuery;
    my ($self, $field) = @_;
    $self->set_term($field);
    my $q = Plucene::Search::PrefixQuery->new({ prefix => $self->{pl_term} });
    $self->set_boost($q);
    return $q;
}

1;

Sun Jan 23 16:30:44 2005 tony [...] kasei.com - Correspondence added

Date:	Sun, 23 Jan 2005 21:19:49 +0000
From:	Tony Bowden <tony [...] kasei.com>
To:	Guest via RT <bug-plucene [...] rt.cpan.org>
Subject:	Re: [cpan #11052] QueryParser should pass correct field name when tokenizing
RT-Send-Cc:

On Tue, Jan 18, 2005 at 11:51:10AM -0500, Guest via RT wrote:

> The current version QueryParser.pm always passes the name of its
> default field to the analyzer (via its call to the analyzer's tokenstream
> function). For queries of the form "FieldName:BasicClause", QueryParser
> should instead pass "FieldName" to the analyzer. I have verified that
> Java Lucene behaves this way.

Do you have a simple test that exposes this bug?

> I have made the following simple changes to QueryParser.pm (attached
> file) on my system and they fix the problem. I have not regression
> tested them.

Thanks. This passes all the tests, but I'd prefer not to integrate without a regression test for this.

Tony

Sun Jan 23 17:01:17 2005 Guest - Correspondence added

Attached is a script which exposes the bug, adapted (quickly) from the Plucene documentation example. It creates a simple Analyzer subclass, which prints the name of the field being analyzed.

The result I get with Plucene 1.20 is:

Creating index...
In TestAnalyzer::tokenstream(), $field = content
In TestAnalyzer::tokenstream(), $field = author
Starting query...
In TestAnalyzer::tokenstream(), $field = text
In TestAnalyzer::tokenstream(), $field = text
Results:
...

The result I expect is:

Creating index...
In TestAnalyzer::tokenstream(), $field = content
In TestAnalyzer::tokenstream(), $field = author
Starting query...
In TestAnalyzer::tokenstream(), $field = author
In TestAnalyzer::tokenstream(), $field = text
Results:
...

#!perl
package TestAnalyzer;
use base 'Plucene::Analysis::Analyzer';
use Plucene::Analysis::Analyzer;
use Plucene::Analysis::WhitespaceTokenizer;
use Data::Dumper;

sub tokenstream {
    my $class = shift;
    my $field = $_[0]->{field};
    my $tok;
    print "In TestAnalyzer::tokenstream(), \$field = $field\n";
    $tok = Plucene::Analysis::WhitespaceTokenizer->new(@_);
    return $tok;
}
1;

package main;
use Plucene::Document;
use Plucene::Document::Field;

print "Creating index...\n";
my $doc = Plucene::Document->new;
$doc->add(Plucene::Document::Field->Text("content", $content));
$doc->add(Plucene::Document::Field->Text("author", "Your Name"));

#Next, choose your analyser, and make an index writer.
use Plucene::Index::Writer;
use Plucene::Analysis::SimpleAnalyzer;
my $writer = Plucene::Index::Writer->new("my_index", TestAnalyzer->new(), 1);

#Now write your documents into the index.
$writer->add_document($doc);
undef $writer; # close

#When you come to search, parse the query and create a searcher:
print "Starting query...\n";
use Plucene::QueryParser;
use Plucene::Analysis::SimpleAnalyzer;
use Plucene::Search::IndexSearcher;

my $parser = Plucene::QueryParser->new({
    analyzer => TestAnalyzer->new(),
    default  => "text" # Default field for non-specified queries
});
my $query = $parser->parse('author:Your Name');
my $searcher = Plucene::Search::IndexSearcher->new("my_index");

#Decide what you're going to do with the results:
use Plucene::Search::HitCollector;
my @docs;
my $hc = Plucene::Search::HitCollector->new(collect => sub {
    my ($self, $doc, $score)= @_;
    push @docs, $searcher->doc($doc);
});
$searcher->search_hc($query, $hc);

use Data::Dumper;
print "Results:\n";
print Dumper @docs;

Wed Jul 20 06:30:00 2005 TMTM [...] cpan.org - Correspondence added

[guest - Sun Jan 23 17:01:17 2005]:

> Attached is a script which exposes the bug, adapted (quickly) from the
> Plucene documentation example. It creates a simple Analyzer subclass,
> which prints the name of the field being analyzed.

Sorry it's taken me so long to get around to this.

I can replicate this bug, and I can fix it, but I'm curious as to how it manifests itself. What problem does this bug actually cause? I'd like to add a test that's at a slightly higher level than the one you've supplied as well.

Thanks,

Tony

Wed Jul 20 12:02:13 2005 Guest - Correspondence added

No need to apologize. Unless you're getting paid to support this :-)

I haven't looked at this in a while myself, but basically the way the bug manifests is that any time a QueryParser object parses a query of the form "fieldname:pattern", the result is likely to be incorrect. The only case which gives the correct result is when fieldname is the same as the default field for that QueryParser object.

For example, if you've added the following fieldnames & data to a document:

name:Eric, age:39, hair:brown
name:Pamela, age:40, hair:blonde

And you try to run a query with the search string "hair:blonde", you will not get any matches, since what Plucene sees internally is "name:blonde".

I'll try to come up with a real code example later.
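One way a wrong field name at tokenization time could change search results is when the analyzer treats fields differently. The sketch below is an illustration under that assumption, not a reproduction from this ticket: the `analyze` function, the `id` field, and the hash used as an index are all hypothetical, and no Plucene code is involved.

```perl
use strict;
use warnings;

# Hypothetical per-field analysis: the "id" field is kept verbatim,
# every other field is lowercased.
sub analyze {
    my ($field, $text) = @_;
    return $field eq 'id' ? $text : lc $text;
}

# Index side: the term is stored as analyzed for its own field.
my %index;
$index{id}{ analyze('id', 'ABC123') } = 1;

# Query "id:ABC123" on a parser whose default field is "text".
my $default = 'text';
my $buggy   = analyze($default, 'ABC123');    # analyzed as if the field were "text"
my $fixed   = analyze('id',     'ABC123');    # analyzed for the field named in the query

printf "buggy: term=%s hit=%s\n", $buggy, (exists $index{id}{$buggy} ? 'yes' : 'no');
printf "fixed: term=%s hit=%s\n", $fixed, (exists $index{id}{$fixed} ? 'yes' : 'no');
```

With an analyzer that ignores the field name, both paths produce the same term, which may be why the problem is easy to miss in tests that use a single stock analyzer.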

Wed Jul 20 12:17:29 2005 tony [...] kasei.com - Correspondence added

Date:	Wed, 20 Jul 2005 17:17:16 +0100
From:	Tony Bowden <tony [...] kasei.com>
To:	Guest via RT <bug-plucene [...] rt.cpan.org>
Subject:	Re: [cpan #11052] QueryParser should pass correct field name when tokenizing
RT-Send-Cc:

On Wed, Jul 20, 2005 at 12:02:13PM -0400, Guest via RT wrote:

> No need to apologize. Unless you're getting paid to support this :-)

Unfortunately not :)

> name:Eric, age:39, hair:brown
> name:Pamela, age:40, hair:blonde
> And you try to run a query with the search string "hair:blonde", you
> will not get any matches, since what Plucene sees internally is
> "name:blonde".

Eeek. That's a fairly serious bug. I was pretty sure there were tests for this sort of thing though ...

> I'll try to come up with a real code example later.

Thanks. I'll see if I can get some time this evening to play about with it myself.

Tony