Subject: Default tokenizer regex breaks unicode
The default regex in KinoSearch::Analysis::Tokenizer breaks unicode text. Building a custom Tokenizer that matches any run of non-whitespace, like so:

    my $tokenizer = KinoSearch::Analysis::Tokenizer->new( token_re => qr/\S+/ );

fixes the issue. I'm not sure why the built-in regex breaks unicode, but it seems like it could leave unicode text alone without much trouble.
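My guess (an assumption about the cause, not something I've verified against the KinoSearch source) is that a word-character regex is being applied to the raw UTF-8 bytes of the string rather than to its characters. A standalone sketch of that failure mode, with no KinoSearch dependency:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical illustration: if a word-character regex runs against the
# raw UTF-8 bytes of a string instead of its characters, CJK text
# produces no tokens at all, while qr/\S+/ still yields one token.
my $uni = "\x{3057}\x{3063}";    # two hiragana characters

my ( $word_count, $ws_count );
{
    use bytes;    # force byte-level (ASCII) regex semantics
    $word_count = () = $uni =~ /\w+/g;    # no ASCII word chars in the bytes
    $ws_count   = () = $uni =~ /\S+/g;    # every byte is non-whitespace
}
print "\\w+ tokens: $word_count\n";    # \w+ tokens: 0
print "\\S+ tokens: $ws_count\n";      # \S+ tokens: 1
```

Under byte semantics the UTF-8 encoding of the hiragana contains no ASCII word characters, so `\w+` finds nothing to tokenize, whereas `\S+` treats the whole run of non-whitespace bytes as a single token. That would match the observed behavior.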
Example that fails to match:
--------------------------------
#!/usr/bin/perl
use strict;
use warnings;

use KinoSearch::InvIndexer;
use KinoSearch::Analysis::PolyAnalyzer;
use KinoSearch::Searcher;

my $uni = "\x{3028}\x{3063}\x{3057}\x{3024}";

my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );

my $invindexer = KinoSearch::InvIndexer->new(
    invindex => 'kino.idx',
    create   => 1,
    analyzer => $analyzer,
);
$invindexer->spec_field(
    name  => 'title',
    boost => 3,
);
$invindexer->spec_field( name => 'bodytext' );

my $doc = $invindexer->new_doc;
$doc->set_value( title    => $uni . " hellos" );
$doc->set_value( bodytext => 'horatio' );
$invindexer->add_doc($doc);
$invindexer->finish;

my $searcher = KinoSearch::Searcher->new(
    invindex => 'kino.idx',
    analyzer => $analyzer,
);
my $hits = $searcher->search( query => $uni );

# Should print the indexed title, but the loop body never runs.
while ( my $hit = $hits->fetch_hit_hashref ) {
    print "$hit->{title}\n";
}
------------------------------------
But this same example works if you build the analyzer by hand:

    my $lc_normalizer = KinoSearch::Analysis::LCNormalizer->new();
    my $tokenizer     = KinoSearch::Analysis::Tokenizer->new( token_re => qr/\S+/ );
    my $stemmer       = KinoSearch::Analysis::Stemmer->new( language => 'en' );
    my $analyzer      = KinoSearch::Analysis::PolyAnalyzer->new(
        analyzers => [ $lc_normalizer, $tokenizer, $stemmer ],
    );

which is essentially the default chain, except for the replaced token_re.