
This queue is for tickets about the KinoSearch CPAN distribution.

Report information
The Basics
Id: 21359
Status: resolved
Priority: 0/
Queue: KinoSearch

People
Owner: Nobody in particular
Requestors: mcrawfor [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in:
  • 0.12
  • 0.13
  • 0.14
  • 0.15
Fixed in: 0.20_02



Subject: Default tokenizer regex breaks unicode
The default regex in KinoSearch::Analysis::Tokenizer breaks unicode. Building a custom Tokenizer with just non-whitespace like so:

    my $tokenizer = KinoSearch::Analysis::Tokenizer->new( token_re => qr/\S+/ );

fixes the issue. I'm not sure why the built-in regex breaks unicode, but it seems like it could leave it alone without too much trouble.

Example that fails to match:
--------------------------------
#!/usr/bin/perl
use KinoSearch::InvIndexer;
use KinoSearch::Analysis::PolyAnalyzer;
use KinoSearch::Searcher;

my $uni = "\x{3028}\x{3063}\x{3057}\x{3024}";

my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );

my $invindexer = KinoSearch::InvIndexer->new(
    invindex => 'kino.idx',
    create   => 1,
    analyzer => $analyzer,
);

$invindexer->spec_field(
    name  => 'title',
    boost => 3,
);
$invindexer->spec_field( name => 'bodytext' );

my $doc = $invindexer->new_doc;
$doc->set_value( title    => $uni . " hellos" );
$doc->set_value( bodytext => 'horatio' );
$invindexer->add_doc($doc);
$invindexer->finish;

my $searcher = KinoSearch::Searcher->new(
    invindex => 'kino.idx',
    analyzer => $analyzer,
);

my $hits = $searcher->search( query => $uni );
while ( my $hit = $hits->fetch_hit_hashref ) {
    print "$hit->{title}\n";
}
------------------------------------

But this same example works if you just create the analyzer like:

    my $lc_normalizer = KinoSearch::Analysis::LCNormalizer->new();
    my $tokenizer     = KinoSearch::Analysis::Tokenizer->new( token_re => qr/\S+/ );
    my $stemmer       = KinoSearch::Analysis::Stemmer->new( language => 'en' );
    my $analyzer      = KinoSearch::Analysis::PolyAnalyzer->new(
        analyzers => [ $lc_normalizer, $tokenizer, $stemmer ] );

Which is essentially the default, except for the replaced token_re.
CC: KinoSearch discussion forum <kinosearch [...] rectangular.com>
Subject: Re: [rt.cpan.org #21359] Default tokenizer regex breaks unicode
Date: Thu, 7 Sep 2006 13:03:13 -0700
To: bug-KinoSearch [...] rt.cpan.org
From: Marvin Humphrey <marvin [...] rectangular.com>
On Sep 6, 2006, at 1:57 PM, via RT wrote:
> The default regex in KinoSearch::Analysis::Tokenizer breaks unicode.
Thank you for the report. Thank you especially for the test case, which I will incorporate into KinoSearch's test suite.

The problem exposed by your test appears to be due to the loss of the scalar's UTF8 flag as the text is absorbed into a KinoSearch::Analysis::TokenBatch object, then recreated later. By adding Encode::_utf8_on($_) at the right spot in Tokenizer::analyze, we get the desired behavior in your test with the stock English PolyAnalyzer. Unfortunately, the TokenBatch bug is not the only place where Unicode support does not work properly in KinoSearch 0.12/0.13.

All these issues were addressed a few weeks back, but there has not yet been a release incorporating the changes. The fix -- KS now converts everything to Unicode for internal processing -- is not backwards compatible, and so I'm trying to put together a single 0.20 release which aggregates multiple backwards-incompatible changes.

I would appreciate it if you would try a recent version from KinoSearch's subversion repository and see if it works properly for you. As of this email, the current repository revision is 1216, which I believe will work. However, there has been quite a bit of churn lately, and you may wish to try revision 1030.

    svn co -r 1216 http://www.rectangular.com/svn/kinosearch/trunk kinosearch

Best,

-- 
Marvin Humphrey
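A minimal sketch of the flag-loss effect described in the reply above, using only core Perl and Encode. It does not touch KinoSearch: `$word_re` is a hypothetical stand-in for a word-character token pattern, not the library's actual default, and encoding the string to octets is just an assumed way to imitate text whose UTF8 flag has been dropped.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The title from the report: four non-ASCII characters plus an ASCII word.
my $chars = "\x{3028}\x{3063}\x{3057}\x{3024} hellos";

# Imitate a lost UTF8 flag: the same text, held as raw UTF-8 octets.
my $bytes = encode('UTF-8', $chars);

# Hypothetical word-character token pattern (an assumption, not KinoSearch's own).
my $word_re = qr/\w+/;
# The workaround pattern from the report.
my $nonspace_re = qr/\S+/;

# On a character string, \w matches the non-ASCII characters.
my @char_tokens = $chars =~ /$word_re/g;
# On unflagged octets, \w matches only ASCII word characters.
my @byte_tokens = $bytes =~ /$word_re/g;
# \S matches any non-whitespace octet, so the octet run survives as one token.
my @nonspace_tokens = $bytes =~ /$nonspace_re/g;

# Switching the flag back on makes Perl read the octets as characters again.
my $flagged = $bytes;
Encode::_utf8_on($flagged);
my @flagged_tokens = $flagged =~ /$word_re/g;

printf "word pattern on characters:      %d tokens\n", scalar @char_tokens;
printf "word pattern on unflagged bytes: %d tokens\n", scalar @byte_tokens;
printf "non-space pattern on bytes:      %d tokens\n", scalar @nonspace_tokens;
printf "word pattern after _utf8_on:     %d tokens\n", scalar @flagged_tokens;
```

Encode's documentation lists `_utf8_on` as an internal function. Outside a targeted fix like the one described above, `Encode::decode('UTF-8', $bytes)` is the usual way to get a character string, because it also validates the input.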
CC: mcrawfor [...] cpan.org
Subject: Re: [rt.cpan.org #21359] Default tokenizer regex breaks unicode
Date: Thu, 7 Sep 2006 13:45:15 -0700 (PDT)
To: "marvin [...] rectangular.com via RT" <bug-KinoSearch [...] rt.cpan.org>
From: Miles Crawford <mcrawfor [...] u.washington.edu>
Revision 1216 does the trick! My only concern now is that it emits a warning:

    Wide character in print at /home/mcrawfor/bin/kino.pl line 46.

Of course, adding a

    binmode(STDOUT, ":utf8");

line fixes the warning, so I guess this just exposes my lack of understanding about perl's unicode handling. Why was it working before? Did perl just not know the unicode was there and my terminal interpreted it correctly anyway?

Ah well. Thanks for the extremely responsive attention to this issue, I eagerly await version 0.20! ;)

-miles
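The "Wide character in print" warning from the reply above can be reproduced without KinoSearch. The sketch below is an illustration under assumptions: it prints to in-memory filehandles instead of STDOUT so the result can be inspected, and all variable names are hypothetical. It shows that Perl emits the same UTF-8 octets with or without the `:utf8` layer, and that the layer only tells Perl the handle expects characters, which silences the warning.

```perl
use strict;
use warnings;

# A character string like the title returned by the search.
my $title = "\x{3028}\x{3063}\x{3057}\x{3024} hellos";

# Collect warnings instead of letting them reach STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# Handle with no encoding layer, like STDOUT in the original script.
open my $plain_fh, '>', \my $plain_out or die "open failed: $!";
print {$plain_fh} "$title\n";    # triggers "Wide character in print"
close $plain_fh;

# Handle with the :utf8 layer applied, as in the binmode fix.
open my $utf8_fh, '>', \my $utf8_out or die "open failed: $!";
binmode( $utf8_fh, ':utf8' );
print {$utf8_fh} "$title\n";     # no warning
close $utf8_fh;

printf "warnings seen: %d\n", scalar @warnings;
printf "same octets written: %s\n", $plain_out eq $utf8_out ? 'yes' : 'no';
```

This fits the question in the reply. A string held as unflagged octets prints without complaint, and a UTF-8 terminal renders it correctly. Once the text is a character string, Perl warns unless the handle has an encoding layer.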
This bug has been eliminated in the development branch of KinoSearch, currently available as releases 0.20_xx.