Subject: | KinoSearch and locale |
Hello.
I've tried to use KinoSearch version 0.10_01 with English and Russian
(KOI8-R charset) texts and had some problems. For English text it works
as expected, but it doesn't work for texts in the KOI8-R charset, because
the 'locale' Perl pragma is not used in the Tokenizer and LCNormalizer
classes (they use lc() and regular expressions). Without the pragma,
lc() and \w treat only ASCII characters in byte strings as letters.
I've created LocalizedLCNormalizer (with 'use locale;') and
LocalizedPolyAnalyzer (it calls Tokenizer with token_re & separator_re
parameters defined in a scope with 'use locale;'), and they work well.
The KinoSearch::Highlight::Highlighter class has the same problem: it also
uses regular expressions in a scope without 'use locale;', and doesn't
work with the KOI8-R charset. Of course I can create a LocalizedHighlighter
that differs only in the locale pragma, but that isn't a good solution.
Why not use the locale pragma in the Perl part of KinoSearch?
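For illustration, here is a minimal standalone sketch of the effect (it is
not KinoSearch code). It assumes the word "ПРИВЕТ" encoded as KOI8-R bytes
and a system locale named 'ru_RU.KOI8-R'. The locale name and the helper
subs lc_plain, lc_locale, is_word_plain and is_word_locale are made up for
this example, and the locale may be missing or named differently on a
given system.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# The word "ПРИВЕТ" as raw KOI8-R bytes (uppercase letters are 0xE0-0xFF).
my $upper = "\xF0\xF2\xE9\xF7\xE5\xF4";

# Compiled WITHOUT 'use locale': lc() and \w only know ASCII in byte strings.
sub lc_plain      { return lc $_[0] }
sub is_word_plain { return $_[0] =~ /^\w+$/ ? 1 : 0 }

{
    # Compiled WITH 'use locale': lc() and \w consult the LC_CTYPE locale.
    use locale;
    sub lc_locale      { return lc $_[0] }
    sub is_word_locale { return $_[0] =~ /^\w+$/ ? 1 : 0 }
}

printf "plain:  lowercased=%s word=%s\n",
    ( lc_plain($upper) ne $upper ? 'yes' : 'no' ),
    ( is_word_plain($upper)      ? 'yes' : 'no' );

# Assumption: a locale with this name is installed on the system.
if ( setlocale( LC_CTYPE, 'ru_RU.KOI8-R' ) ) {
    printf "locale: lowercased=%s word=%s\n",
        ( lc_locale($upper) ne $upper ? 'yes' : 'no' ),
        ( is_word_locale($upper)      ? 'yes' : 'no' );
}
else {
    print "ru_RU.KOI8-R locale is not available; skipping locale check\n";
}
```

Without the pragma, lc() leaves the KOI8-R bytes unchanged and \w does not
match them. Inside the 'use locale' scope both should follow the KOI8-R
rules, provided the locale can be set.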
Subject: | LocalizedPolyAnalyzer.pm |
package LocalizedPolyAnalyzer;
use strict;
use warnings;
use locale;
use Carp;    # provides croak(), which is called in init_instance
use KinoSearch::Analysis::PolyAnalyzer;
use KinoSearch::Analysis::Tokenizer;
use KinoSearch::Analysis::Stemmer;
use base qw( KinoSearch::Analysis::PolyAnalyzer );
our %instance_vars = __PACKAGE__->init_instance_vars();
sub init_instance {
    my $self = shift;
    my $language = $self->{language} = lc( $self->{language} );
    croak("Must specify 'language'") unless $language;
    $self->{analyzers} = [
        # locale-aware replacement for the stock LCNormalizer
        #KinoSearch::Analysis::LCNormalizer->new(language => $language),
        LocalizedLCNormalizer->new( language => $language ),
        # the regexes are compiled here, within the scope of 'use locale'
        #KinoSearch::Analysis::Tokenizer->new(language => $language),
        KinoSearch::Analysis::Tokenizer->new(
            language     => $language,
            token_re     => qr/\b\w+(?:'\w+)?\b/,
            separator_re => qr/\W*/
        ),
        KinoSearch::Analysis::Stemmer->new( language => $language ),
    ];
}
package LocalizedLCNormalizer;
use strict;
use warnings;
use locale;
use base qw( KinoSearch::Analysis::LCNormalizer );
our %instance_vars = __PACKAGE__->init_instance_vars();
sub analyze {
    my ( $self, $token_batch ) = @_;
    # lc all of the terms, one by one
    while ( $token_batch->next ) {
        $token_batch->set_text( lc( $token_batch->get_text ) );
    }
    return $token_batch;
}
1;