Bug #21359 for KinoSearch: Default tokenizer regex breaks unicode

Wed Sep 06 16:57:05 2006 mcrawfor [...] cpan.org - Ticket created

Subject:	Default tokenizer regex breaks unicode

The default regex in KinoSearch::Analysis::Tokenizer breaks unicode. Building a custom Tokenizer with just non-whitespace like so:

my $tokenizer= KinoSearch::Analysis::Tokenizer->new( token_re => qr/\S+/);

fixes the issue. I'm not sure why the built-in regex breaks unicode, but it seems like it could leave it alone without too much trouble.

Example that fails to match:

--------------------------------
#!/usr/bin/perl
use KinoSearch::InvIndexer;
use KinoSearch::Analysis::PolyAnalyzer;
use KinoSearch::Searcher;

my $uni = "\x{3028}\x{3063}\x{3057}\x{3024}";

my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en');

my $invindexer = KinoSearch::InvIndexer->new(
    invindex => 'kino.idx',
    create => 1,
    analyzer => $analyzer,
);

$invindexer->spec_field(
    name => 'title',
    boost => 3,
);
$invindexer->spec_field( name => 'bodytext' );

my $doc = $invindexer->new_doc;
$doc->set_value( title => $uni ." hellos" );
$doc->set_value( bodytext => 'horatio' );
$invindexer->add_doc($doc);
$invindexer->finish;

my $searcher = KinoSearch::Searcher->new(
    invindex => 'kino.idx',
    analyzer => $analyzer,
);

my $hits = $searcher->search( query => $uni );
while ( my $hit = $hits->fetch_hit_hashref ) {
    print "$hit->{title}\n";
}
------------------------------------

But this same example works if you just create the analyzer like:

my $lc_normalizer = KinoSearch::Analysis::LCNormalizer->new();
my $tokenizer= KinoSearch::Analysis::Tokenizer->new( token_re => qr/\S+/);
my $stemmer = KinoSearch::Analysis::Stemmer->new(language => 'en');
my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new(
    analyzers => [$lc_normalizer, $tokenizer, $stemmer]
);

Which is essentially the default, except for the replaced token_re.
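
The difference between the two token patterns can be reproduced without KinoSearch. The sketch below assumes the stock pattern is word-based, along the lines of qr/\b\w+(?:'\w+)?\b/ (an assumption; check the Tokenizer documentation for the installed version), and compares it with qr/\S+/ on the same text held once as a character string and once as raw UTF-8 bytes. The tokens() helper is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;
use Encode ();

# ASSUMPTION: a word-based pattern standing in for the stock token_re.
my $word_re = qr/\b\w+(?:'\w+)?\b/;

# The replacement pattern from the report.
my $space_re = qr/\S+/;

# Hypothetical helper: return an array ref of every match of $re in $text.
sub tokens {
    my ( $re, $text ) = @_;
    return [ $text =~ /($re)/g ];
}

# The title from the test case, as a character string (UTF8 flag on)...
my $chars = "\x{3028}\x{3063}\x{3057}\x{3024} hellos";

# ...and as the UTF-8 encoded bytes of the same text (UTF8 flag off).
my $bytes = Encode::encode( 'UTF-8', $chars );

for my $case ( [ characters => $chars ], [ bytes => $bytes ] ) {
    my ( $label, $text ) = @$case;
    printf "%-10s \\w-based: %d token(s), \\S+: %d token(s)\n",
        $label,
        scalar @{ tokens( $word_re,  $text ) },
        scalar @{ tokens( $space_re, $text ) };
}
```

On the character string both patterns find two tokens. On the byte string the word-based pattern finds only "hellos": without the UTF8 flag (and without the unicode_strings feature or the /u modifier) \w matches only ASCII word characters, while \S still accepts the high bytes. That would explain why swapping in qr/\S+/ hides the problem rather than curing it.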

Thu Sep 07 16:03:36 2006 marvin [...] rectangular.com - Correspondence added

CC:	KinoSearch discussion forum <kinosearch [...] rectangular.com>
Subject:	Re: [rt.cpan.org #21359] Default tokenizer regex breaks unicode
Date:	Thu, 7 Sep 2006 13:03:13 -0700
To:	bug-KinoSearch [...] rt.cpan.org
From:	Marvin Humphrey <marvin [...] rectangular.com>

On Sep 6, 2006, at 1:57 PM, via RT wrote:

> The default regex in KinoSearch::Analysis::Tokenizer breaks unicode.

Thank you for the report. Thank you especially for the test case, which I will incorporate into KinoSearch's test suite.

The problem exposed by your test appears to be due to the loss of the scalar's UTF8 flag as the text is absorbed into a KinoSearch::Analysis::TokenBatch object, then recreated later. By adding Encode::_utf8_on($_) at the right spot in Tokenizer::analyze, we get the desired behavior in your test with the stock English PolyAnalyzer. Unfortunately, the TokenBatch bug is not the only place where Unicode support does not work properly in KinoSearch 0.12/0.13.

All these issues were addressed a few weeks back, but there has not yet been a release incorporating the changes. The fix -- KS now converts everything to Unicode for internal processing -- is not backwards compatible, and so I'm trying to put together a single 0.20 release which aggregates multiple backwards-incompatible changes.

I would appreciate it if you would try a recent version from KinoSearch's subversion repository and see if it works properly for you. As of this email, the current repository revision is 1216, which I believe will work. However, there has been quite a bit of churn lately, and you may wish to try revision 1030.

svn co -r 1216 http://www.rectangular.com/svn/kinosearch/trunk kinosearch

Best,

--
Marvin Humphrey
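
The loss of the UTF8 flag described in this reply can be imitated in a few lines. This is a sketch of the general mechanism only, not KinoSearch's TokenBatch code: encoding the text to UTF-8 stands in (as an assumption) for whatever storage step drops the flag, and the variable names are made up for the illustration.

```perl
use strict;
use warnings;
use Encode ();

# Four characters above U+00FF, so Perl must store them with the UTF8 flag on.
my $text = "\x{3028}\x{3063}\x{3057}\x{3024}";

# ASSUMPTION: stand-in for a storage layer that keeps only the raw bytes.
my $bytes = Encode::encode( 'UTF-8', $text );

# Same underlying data, but Perl now sees 12 single-byte characters.
printf "flag on: %s, length: %d\n",
    ( utf8::is_utf8($bytes) ? 'yes' : 'no' ), length $bytes;

# Switching the flag back on in place restores the 4-character view.
my $restored = $bytes;
Encode::_utf8_on($restored);
printf "flag on: %s, length: %d\n",
    ( utf8::is_utf8($restored) ? 'yes' : 'no' ), length $restored;

# Decoding reaches the same string through the checked, public route.
my $decoded = Encode::decode( 'UTF-8', $bytes );
print "restored matches original\n" if $restored eq $text;
print "decoded matches original\n"  if $decoded eq $text;
```

Encode::_utf8_on does not validate the bytes it relabels, so it is only safe when the data is known to be well-formed UTF-8; Encode::decode is the more defensive choice when that is not guaranteed.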

Thu Sep 07 16:03:37 2006 The RT System itself - Status changed from 'new' to 'open'

Thu Sep 07 16:45:57 2006 mcrawfor [...] u.washington.edu - Correspondence added

CC:	mcrawfor [...] cpan.org
Subject:	Re: [rt.cpan.org #21359] Default tokenizer regex breaks unicode
Date:	Thu, 7 Sep 2006 13:45:15 -0700 (PDT)
To:	"marvin [...] rectangular.com via RT" <bug-KinoSearch [...] rt.cpan.org>
From:	Miles Crawford <mcrawfor [...] u.washington.edu>

Revision 1216 does the trick! My only concern now is that it emits a warning:

Wide character in print at /home/mcrawfor/bin/kino.pl line 46.

Of course, adding a

binmode(STDOUT, ":utf8");

line fixes the warning, so I guess this just exposes my lack of understanding about perl's unicode handling. Why was it working before? Did perl just not know the unicode was there and my terminal interpreted it correctly anyway? Ah well.

Thanks for the extremely responsive attention to this issue, I eagerly await version 0.20! ;)

-miles

On Thu, 7 Sep 2006, marvin@rectangular.com via RT wrote:

> <URL: http://rt.cpan.org/Ticket/Display.html?id=21359 >
>
> On Sep 6, 2006, at 1:57 PM, via RT wrote:
>
>> The default regex in KinoSearch::Analysis::Tokenizer breaks unicode.
>
> Thank you for the report. Thank you especially for the test case,
> which I will incorporate into KinoSearch's test suite.
>
> The problem exposed by your test appears to be due to the loss of the
> scalar's UTF8 flag as the text is absorbed into a
> KinoSearch::Analysis::TokenBatch object, then recreated later. By
> adding Encode::_utf8_on($_) at the right spot in Tokenizer::analyze,
> we get the desired behavior in your test with the stock English
> PolyAnalyzer. Unfortunately, the TokenBatch bug is not the only
> place where Unicode support does not work properly in KinoSearch
> 0.12/0.13.
>
> All these issues were addressed a few weeks back, but there has not
> yet been a release incorporating the changes. The fix -- KS now
> converts everything to Unicode for internal processing -- is not
> backwards compatible, and so I'm trying to put together a single 0.20
> release which aggregates multiple backwards-incompatible changes.
>
> I would appreciate it if you would try a recent version from
> KinoSearch's subversion repository and see if it works properly for
> you. As of this email, the current repository revision is 1216,
> which I believe will work. However, there has been quite a bit of
> churn lately, and you may wish to try revision 1030.
>
> svn co -r 1216 http://www.rectangular.com/svn/kinosearch/trunk kinosearch
>
> Best,
>
> --
> Marvin Humphrey
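
The "Wide character in print" warning raised in the reply above can be reproduced in isolation. The sketch below is a minimal illustration under stated assumptions: it prints to in-memory filehandles instead of STDOUT so that the output can be inspected, and it collects warnings through a $SIG{__WARN__} handler; the variable names are invented for the example.

```perl
use strict;
use warnings;

# Character string with code points above U+00FF, as in the test case.
my $title = "\x{3028}\x{3063}\x{3057}\x{3024} hellos";

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# 1. Handle without an encoding layer: Perl has no way to know how the
#    characters should be written as bytes, so it warns.
open my $raw, '>', \my $raw_buf or die "open failed: $!";
print {$raw} $title;
close $raw;

# 2. Handle with a :utf8 layer, the same effect as binmode($fh, ":utf8").
open my $enc, '>:utf8', \my $enc_buf or die "open failed: $!";
print {$enc} $title;
close $enc;

print "warnings seen: ", scalar(@warnings), "\n";
print "warning text:  $warnings[0]" if @warnings;
print "bytes written with layer: ", length($enc_buf), "\n";
```

The warning only appears once Perl knows it is holding characters rather than bytes, which fits the observation that the script was silent before the strings were handled as Unicode. Declaring the layer on the handle tells Perl how to serialize those characters; :encoding(UTF-8) is the stricter spelling of the same idea.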

Tue Mar 20 20:00:25 2007 CREAMYG [...] cpan.org - Broken in 0.14 added

Tue Mar 20 20:00:25 2007 CREAMYG [...] cpan.org - Broken in 0.15 added

Tue Mar 20 20:00:25 2007 CREAMYG [...] cpan.org - Fixed in 0.20_02 added

Tue Mar 20 20:08:29 2007 CREAMYG [...] cpan.org - Correspondence added

This bug has been eliminated in the development branch of KinoSearch, currently available as releases 0.20_xx.

Tue Mar 20 20:08:31 2007 CREAMYG [...] cpan.org - Status changed from 'open' to 'resolved'