This queue is for tickets about the KinoSearch CPAN distribution.

Report information
The Basics
Id: 25400
Status: resolved
Priority: 0/
Queue: KinoSearch

People
Owner: CREAMYG [...] cpan.org
Requestors: mandrews [...] bit0.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: KinoSearch 0.15 crash bug + weird feature req
Date: Tue, 13 Mar 2007 00:34:52 -0400 (EDT)
To: bug-KinoSearch [...] rt.cpan.org
From: Mike Andrews <mandrews [...] bit0.com>
Platform is FreeBSD 6.2-RELEASE, both i386 and amd64 versions.

Using a fairly standard KinoSearch 0.15 setup (mostly boilerplate code), entering a URL as a search term causes Perl, and thus mod_perl and its Apache parent, to SIGSEGV. I'm guessing it's trying to add a field named 'http' to the search terms, and I don't have one by that name, but it's weird because entering other nonexistent field names just makes it return 0 results -- as it should.

Just before it crashes, I get this:

    Undefined subroutine &KinoSearch::Search::PhraseScorer::kerror called at /usr/local/lib/perl5/site_perl/5.8.8/mach/KinoSearch/Search/PhraseScorer.pm line 21.

Feeding the core file to gdb says this (on amd64):

    #0  0x0000000801c22bcd in Kino_PhraseScorer_destroy () from /usr/local/lib/perl5/site_perl/5.8.8/mach/auto/KinoSearch/KinoSearch.so
    #1  0x0000000801c1eae7 in XS_KinoSearch__Search__PhraseScorer_DESTROY () from /usr/local/lib/perl5/site_perl/5.8.8/mach/auto/KinoSearch/KinoSearch.so
    #2  0x00000008006bd11c in Perl_pp_entersub () from /usr/local/lib/perl5/5.8.8/mach/CORE/libperl.so
    #3  0x00000008006620d7 in S_call_body () from /usr/local/lib/perl5/5.8.8/mach/CORE/libperl.so
    #4  0x0000000800666c1c in Perl_call_sv () from /usr/local/lib/perl5/5.8.8/mach/CORE/libperl.so
    #5  0x00000008006bfab5 in Perl_sv_clear () from /usr/local/lib/perl5/5.8.8/mach/CORE/libperl.so
    #6  0x00000008006c0161 in Perl_sv_free () from /usr/local/lib/perl5/5.8.8/mach/CORE/libperl.so
    #7  0x00000008006df2a7 in Perl_leave_scope () from /usr/local/lib/perl5/5.8.8/mach/CORE/libperl.so
    #8  0x0000000800663689 in S_my_exit_jump () from /usr/local/lib/perl5/5.8.8/mach/CORE/libperl.so
    #9  0x0000000800668c1b in Perl_my_failure_exit () from /usr/local/lib/perl5/5.8.8/mach/CORE/libperl.so
    #10 0x00000008006e16e1 in Perl_die_where () from /usr/local/lib/perl5/5.8.8/mach/CORE/libperl.so
    #11 0x00000008006a80af in Perl_vdie () from /usr/local/lib/perl5/5.8.8/mach/CORE/libperl.so
    #12 0x00000008006a81cd in Perl_die () from /usr/local/lib/perl5/5.8.8/mach/CORE/libperl.so
    #13 0x00000008006bd7df in Perl_pp_entersub () from /usr/local/lib/perl5/5.8.8/mach/CORE/libperl.so
    #14 0x00000008006b5dbe in Perl_runops_standard () from /usr/local/lib/perl5/5.8.8/mach/CORE/libperl.so
    #15 0x00000008006675f2 in perl_run () from /usr/local/lib/perl5/5.8.8/mach/CORE/libperl.so
    #16 0x000000000040156f in main ()

I haven't tried 0.20_2 to see if it's fixed there.

Now for the feature request, and this is probably a bit out there because I don't know that a lot of people other than me would use this, but, would it be possible to separate out the highlighter from the excerpter? For my application I want to highlight all the search terms in a field but not actually do any excerpting of it at all. Setting the excerpt_size to a huge value keeps the full string there but still does some punctuation mangling at the end, adds ellipsis if there's no full-stop, etc.

Here's why I want a weird un-excerpted highlight: the search app I'm writing is searching just news headlines without the articles. There are some other fields that can be searched to narrow results down (a topic, a date, a partial URL) but basically each "document" is under 300 bytes, has already had HTML entities normalized, etc. So there's not much point in doing an excerpt of something that's already that short. And since news headlines tend to not have trailing punctuation, the full-stop check throws ellipses on the end by default...

I can see where some people might want un-highlighted excerpts too, such as command-line searches that don't use HTML (or curses). On the other hand, I might just be weird for having a search engine for short strings :)

For now, I made a custom slimmed-down version of generate_excerpt() by subclassing KinoSearch::Highlight::Highlighter and that works, and it looks like it'll work on 0.20 also (even though I know a lot of other stuff will break in the 0.15 to 0.20 conversion -- which is fine -- looking forward to trying the secondary sort feature there). So short term I have a workable solution.

I also hacked around the crash bug by just throwing out search terms that start with \w+: that don't match one of my known fields.

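
For anyone who needs the same stopgap, the workaround described above (dropping terms that start with \w+: when the prefix is not a known field) might look roughly like the sketch below. The field list and the name strip_unknown_fields() are made up for illustration and are not part of KinoSearch. The sketch splits on whitespace only, so it does not understand quoted phrases.

```perl
use strict;
use warnings;

# Hypothetical list of fields that actually exist in the index.
my %known_field = map { $_ => 1 } qw( title topic date url );

# Drop whitespace-separated terms that look like "field:value" when the
# field is not one we know about. Everything else passes through unchanged.
# Assumption: queries are simple; quoted phrases are not handled here.
sub strip_unknown_fields {
    my ($query) = @_;
    my @kept;
    for my $term ( split ' ', $query ) {
        # "http://example.com/" parses as the unknown "field" http.
        next if $term =~ /^(\w+):/ && !$known_field{$1};
        push @kept, $term;
    }
    return join ' ', @kept;
}

print strip_unknown_fields('budget http://example.com/news topic:politics'), "\n";
```

The cleaned string is what would then be handed to the query parser, so a pasted URL never reaches the code path that crashes.
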
Hello,

Thank you for the report. We'll tackle the bug first.

There are actually three bugs here. At least two and possibly the third as well do not affect the development branch of KS.

The call to kerror() triggers an exception because the kerror() function, which is supposed to be exported by KinoSearch::Util::ToolSet, is not. In 0.20_xx, kerror() has been added to ToolSet's export list.

The subsequent segfault is due to the way the hybrid Perl/C constructor is set up in version 0.15. A C object is created, then other member vars are added to it; however, if the full constructor routine does not complete, the destructor does not see the member vars that it expects to and segfaults. This is not a problem in 0.20 because the object is created with a single XS call rather than multiple calls.

What is puzzling is why kerror() was invoked in the first place. kerror() gets called when hash-style labeled parameters are either odd in number (indicating a missing element) or have names that do not verify against a known list. However, this is an internal routine, so it should certainly be getting the arguments right. I believe what is occurring is that one of the parameters (norms_reader) is being created on the fly by a routine called in list context which fails, resulting in an uneven number of arguments.

However, things should have shut down before this, or this crash would happen all the time. My suspicion is that at some point, the field 'http' got defined in your index somehow, but has no terms. The search makes it through an earlier filter based upon valid field names, then fails in an ugly way at the later stage when a norms_reader can't be created, setting off a cascade. If I am correct in my diagnosis, then that aspect of the bug is also gone in 0.20, as the norms_reader parameter has been eliminated.

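
To illustrate the suspected mechanism (a routine that fails in list context and leaves the labeled-parameter list with an odd number of elements), here is a minimal sketch in plain Perl. verify_args() and make_norms_reader() are hypothetical stand-ins written for this example and are not KinoSearch's actual code. Only the list-context behavior of a bare return is standard Perl.

```perl
use strict;
use warnings;

# Hypothetical parameter names, for illustration only.
my %allowed = map { $_ => 1 } qw( term_docs norms_reader );

# Hypothetical labeled-parameter check: returns an error message,
# or the empty string when the arguments look fine.
sub verify_args {
    my @args = @_;
    return 'odd number of arguments (' . scalar(@args) . ')' if @args % 2;
    my %args = @args;
    for my $name ( sort keys %args ) {
        return "invalid parameter: $name" unless $allowed{$name};
    }
    return '';
}

# A bare "return" yields an EMPTY LIST in list context, so on failure
# the value slot after "norms_reader =>" simply disappears.
sub make_norms_reader {
    my ($field_has_terms) = @_;
    return unless $field_has_terms;
    return { field => 'title' };
}

my $ok    = verify_args( term_docs => [], norms_reader => make_norms_reader(1) );
my $bad   = verify_args( term_docs => [], norms_reader => make_norms_reader(0) );
# Forcing scalar context turns the failure into undef and keeps the pairs aligned.
my $fixed = verify_args( term_docs => [], norms_reader => scalar make_norms_reader(0) );

print "ok:    '$ok'\n";
print "bad:   '$bad'\n";
print "fixed: '$fixed'\n";
```

In the failing call the argument list collapses to three elements, which is the "odd in number" condition described above.
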
With regards to the feature request...

I think the proper way to handle flexibility with regards to field handling will be to add an add_field() method to Highlighter.

    $highlighter->add_field(
        name           => 'title',
        excerpt_length => 150,
        formatter      => $formatter,
        encoder        => $encoder,
    );

The punctuation mangling should not occur at the end of a field value, regardless of whether that value ends in a full stop. I consider that a bug.

It's actually possible to "not highlight" now by supplying a "formatter" that doesn't do anything. In 0.15, you could do it this way...

    my $formatter = KinoSearch::Highlight::SimpleHTMLFormatter->new(
        pre_tag  => '',
        post_tag => '',
    );
    my $highlighter = KinoSearch::Highlight::Highlighter->new(
        excerpt_field => 'title',
        formatter     => $formatter,
    );

I'm personally not likely to get to add_field() or the punctuation mangling right away because I have my hands full at the moment with other issues such as sorting and range filters. If you find yourself feeling ambitious and want to work on Highlighter, please consider subscribing to the KinoSearch mailing list and please be aware that contributors must be comfortable with assigning code to the Apache Software Foundation. Otherwise, you can wait for me to add these features, as I agree that they are desirable.

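
As a rough picture of the behavior being requested (highlight every search term in a short field, with no excerpting and no trailing ellipsis), a standalone sketch might look like this. highlight_all() is a hypothetical helper that does not use or reflect the KinoSearch Highlighter API. It assumes the text is already HTML-escaped, as the report says, and it matches literal words only, so it knows nothing about the analyzer's stemming or tokenizing.

```perl
use strict;
use warnings;

# Hypothetical helper, independent of KinoSearch: wrap every occurrence
# of each term in the given tags and return the WHOLE string, with no
# truncation and no ellipsis. Matching is case-insensitive.
sub highlight_all {
    my ( $text, $terms, $pre_tag, $post_tag ) = @_;
    return $text unless @$terms;

    # Escape regex metacharacters; try longer terms first so a short
    # term cannot pre-empt a longer one that starts at the same place.
    my $alternation = join '|',
        map { quotemeta($_) }
        sort { length($b) <=> length($a) } @$terms;

    $text =~ s/\b($alternation)\b/$pre_tag$1$post_tag/gi;
    return $text;
}

print highlight_all(
    'Council approves new budget',
    [ 'budget', 'council' ],
    '<strong>', '</strong>',
), "\n";
```

Because the whole field value is returned, a headline with no trailing punctuation comes back exactly as stored apart from the added tags.
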
With the release of 0.161, the crash bug is now fixed in both the maint and devel branches. Also, Highlighter's interface has changed in the devel branch, and add_field() has been implemented. I'm leaving this bug open for now because the ellipsis issue still exists.
The last of the issues in this report (avoiding ellipsis at end of excerpt at the close of a field or sentence) was resolved in the KS dev branch some time ago, so it's time to close the issue.