This queue is for tickets about the KinoSearch CPAN distribution.

Report information
The Basics
Id: 18899
Status: resolved
Priority: 0/
Queue: KinoSearch

People
Owner: Nobody in particular
Requestors: aver [...] pvk.org.ru
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 0.10_01
Fixed in: 0.10



Subject: KinoSearch and locale
Hello. I've tried to use KinoSearch version 0.10_01 with English and Russian (KOI8-R charset) texts and had some problems. For English text it works as expected, but it doesn't work for texts in the KOI8-R charset, because the 'locale' Perl pragma is not used in the Tokenizer and LCNormalizer classes (they use lc() and regular expressions).

I've created LocalizedLCNormalizer (with 'use locale;') and LocalizedPolyAnalyzer (it calls Tokenizer with token_re and separator_re parameters defined in a scope with 'use locale;'), and they work well.

The same problem exists in the KinoSearch::Highlight::Highlighter class: it also uses regular expressions in a scope without 'use locale;', and doesn't work with the KOI8-R charset. Of course I can create a LocalizedHighlighter whose only difference is the locale pragma, but that isn't a good solution.

Why not use the locale pragma in the Perl part of KinoSearch?
Subject: LocalizedPolyAnalyzer.pm
package LocalizedPolyAnalyzer;
use strict;
use warnings;
use locale;
use Carp;    # croak() was used without this import

use KinoSearch::Analysis::PolyAnalyzer;
use KinoSearch::Analysis::Tokenizer;
use KinoSearch::Analysis::Stemmer;
use base qw( KinoSearch::Analysis::PolyAnalyzer );

our %instance_vars = __PACKAGE__->init_instance_vars();

sub init_instance {
    my $self = shift;
    my $language = $self->{language} = lc( $self->{language} );
    croak("Must specify 'language'") unless $language;
    $self->{analyzers} = [
        # replaces KinoSearch::Analysis::LCNormalizer
        LocalizedLCNormalizer->new( language => $language ),
        # same regexes as the stock Tokenizer, but compiled here, inside
        # the scope of 'use locale', so \w and \W honor LC_CTYPE
        KinoSearch::Analysis::Tokenizer->new(
            language     => $language,
            token_re     => qr/\b\w+(?:'\w+)?\b/,
            separator_re => qr/\W*/,
        ),
        KinoSearch::Analysis::Stemmer->new( language => $language ),
    ];
}

package LocalizedLCNormalizer;
use strict;
use warnings;
use locale;
use base qw( KinoSearch::Analysis::LCNormalizer );

our %instance_vars = __PACKAGE__->init_instance_vars();

sub analyze {
    my ( $self, $token_batch ) = @_;

    # lc() all of the terms, one by one; under 'use locale' this
    # lowercases KOI8-R letters correctly
    while ( $token_batch->next ) {
        $token_batch->set_text( lc( $token_batch->get_text ) );
    }
    return $token_batch;
}

1;
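For context, a hypothetical usage sketch. The InvIndexer invocation follows the KinoSearch synopsis of that era; the setlocale() call and the locale name are assumptions for a system with a KOI8-R locale installed:

    use POSIX qw( setlocale LC_ALL );
    use KinoSearch::InvIndexer;

    # assumed locale name; it must actually exist on the system
    setlocale( LC_ALL, 'ru_RU.KOI8-R' );

    my $analyzer   = LocalizedPolyAnalyzer->new( language => 'ru' );
    my $invindexer = KinoSearch::InvIndexer->new(
        invindex => '/path/to/invindex',
        create   => 1,
        analyzer => $analyzer,
    );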
CC: kinosearch [...] rectangular.com
Subject: Re: [rt.cpan.org #18899] KinoSearch and locale
Date: Wed, 26 Apr 2006 14:03:50 -0700
To: "javacocoon [...] gmail.com via RT" <bug-KinoSearch [...] rt.cpan.org>
From: Marvin Humphrey <marvin [...] rectangular.com>
Hi, thanks for the report!
> I've tried to use KinoSearch version 0.10_01 with English and Russian
> (KOI8-R charset) texts and had some problems. For English text it works
> as expected, but it doesn't work for texts in the KOI8-R charset,
> because the 'locale' Perl pragma is not used in the Tokenizer and
> LCNormalizer classes (they use lc() and regular expressions).
Mmm. I haven't got any test files in KinoSearch right now which deal specifically with testing the languages it claims to support. Lingua::Stem::Snowball and Lingua::StopWords both have pretty thorough test suites, but I hadn't considered the behavior of regexes or case conversion. It looks like all the Analyzer classes are going to need their own dedicated test sets.
> I've created LocalizedLCNormalizer (with 'use locale;') and
> LocalizedPolyAnalyzer (it calls Tokenizer with token_re and separator_re
> parameters defined in a scope with 'use locale;'), and they work well.
>
> The same problem exists in the KinoSearch::Highlight::Highlighter class:
> it also uses regular expressions in a scope without 'use locale;', and
> doesn't work with the KOI8-R charset.
Ideally, the Highlighter ultimately wouldn't use regexes at all. The insertion of the pre_tag and the post_tag, currently implemented with regexes, can and should be refactored using plain old substring/concat operations. The selection of an excerpt based on density of tokens doesn't need regexes. The only thing that requires regexes right now is finding the precise start and close boundaries for the excerpt.

What I'd really like to do is abstract out an interface for that task that can work with any encoding and any language. If we can get it to work with Japanese (no spaces between tokens), and either Arabic or Hebrew (written right-to-left), I'll be cheesed.

With an eye towards the future, the steps are:

1. Remove partial characters from the top if necessary.
2. Choose the excerpt start, preferring a sentence boundary and using the start of a guaranteed complete "word" otherwise.
3. If the excerpt doesn't start on a sentence boundary, indicate that by e.g. inserting an ellipsis.
4. Adjust the length of the excerpt, preferring a sentence boundary, and using the end of a guaranteed complete "word" otherwise.
5. Insert pre_tags and post_tags (see the sketch below).
6. Measure the length of the excerpt in visible chars, and remove as many tokens from the end as necessary to satisfy the excerpt_length constraint. [not implemented]
7. If the excerpt doesn't end on a sentence boundary, indicate that by e.g. working an ellipsis onto the end (removing a token or two if necessary to stay within the excerpt_length constraint).
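A minimal sketch of the regex-free tag insertion in step 5, assuming the Highlighter already knows each highlighted term's offset and length within the excerpt (the span structure and function name here are hypothetical, not KinoSearch's actual internals):

    use strict;
    use warnings;

    # Insert pre_tag/post_tag around highlighted spans with substr()
    # instead of regexes.  Walking the spans right-to-left keeps the
    # earlier offsets valid as text is inserted.
    sub insert_tags {
        my ( $excerpt, $spans, $pre_tag, $post_tag ) = @_;
        for my $span ( sort { $b->{start} <=> $a->{start} } @$spans ) {
            substr( $excerpt, $span->{start} + $span->{length}, 0 ) = $post_tag;
            substr( $excerpt, $span->{start}, 0 ) = $pre_tag;
        }
        return $excerpt;
    }

    my $tagged = insert_tags(
        "the quick brown fox",
        [ { start => 4, length => 5 }, { start => 16, length => 3 } ],
        '<strong>', '</strong>',
    );
    # yields: "the <strong>quick</strong> brown <strong>fox</strong>"

Because this works on plain offsets it sidesteps locale-dependent regexes entirely; for multi-byte encodings the offsets would have to be character counts rather than byte counts.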
> Of course I can create a LocalizedHighlighter whose only difference is
> the locale pragma, but that isn't a good solution.
Yes, I agree. If KinoSearch claims to support Russian, it should support Russian. :) Eventually, subclassing Analyzer will be possible, in which case your solution would be legal provided that it used only the documented public API. That stuff isn't public yet, though.
> Why not use the locale pragma in the Perl part of KinoSearch?
There are performance implications of turning on the locale pragma. I've been spelunking the Perl source code over the last couple of days trying to figure out exactly what they are. LCNormalizer at least seems to run a bit slower under "use locale;".

There are also security implications. Strings whose content may be affected by the value of a locale get marked as tainted. That wouldn't affect the analysis apparatus right now because of the way that TokenBatch and PostingsWriter work, but it might affect Highlighter, as I believe the excerpts would be tainted. Does that matter? I don't believe it does.

I'm leaning towards adopting "use locale;" as a good interim solution, but I need to understand the consequences just a little better.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
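The tainting behavior described above is easy to demonstrate in isolation; a standalone sketch (run under taint mode, e.g. perl -T):

    #!/usr/bin/perl -T
    use strict;
    use warnings;
    use locale;
    use Scalar::Util qw( tainted );

    # Under 'use locale' and taint mode, the result of a case-mapping
    # operation like lc() is considered locale-dependent and therefore
    # tainted, even though the input string was clean.
    my $clean   = "HELLO";
    my $lowered = lc($clean);

    print "input tainted:  ", ( tainted($clean)   ? 1 : 0 ), "\n";    # 0
    print "output tainted: ", ( tainted($lowered) ? 1 : 0 ), "\n";    # 1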
RT-Send-CC: kinosearch [...] rectangular.com
The "use locale" solution has been implemented. The impact upon indexing speed is basically negligible and having tainted excerpts shouldn't matter.