Bug #21888 for KinoSearch: All docs in an index must have the same fields

Tue Oct 03 17:38:37 2006 mcrawfor [...] cpan.org - Ticket created

Subject: All docs in an index must have the same fields - badly reported

It seems that all documents in an index must contain the same fields. Perhaps that is a bug - I don't see it mentioned in the documentation anywhere. If this is by design, it seems like a reasonable restriction, but it is badly handled in the API, and badly reported when an error occurs (see below).

If KinoSearch died or returned false when you try to create this situation, it would be much easier to handle this problem programmatically. Perhaps if $invindexers that were created from existing indexes already knew about the spec'ed fields, this wouldn't be a problem? Perhaps calling $doc->set_value for a field that didn't exist in the invindexer should die?

Either way - the error I get when I try to have docs in an invindexer that do not all have the same fields looks like this:

    Error in function read_bytes at lib/KinoSearch/Store/InStream.pm:590: read_bytes: tried to read 1 bytes, got 0 at /usr/local/lib/perl/5.8.7/KinoSearch/Index/NormsReader.pm line 32
        KinoSearch::Index::NormsReader::_ensure_read('KinoSearch::Index::NormsReader=HASH(0x9834a64)') called at /usr/local/lib/perl/5.8.7/KinoSearch/Index/NormsReader.pm line 25
        KinoSearch::Index::NormsReader::get_bytes('KinoSearch::Index::NormsReader=HASH(0x9834a64)') called at /usr/local/lib/perl/5.8.7/KinoSearch/Index/SegWriter.pm line 138
        KinoSearch::Index::SegWriter::_merge_norms('KinoSearch::Index::SegWriter=HASH(0x985cba4)', 'KinoSearch::Index::SegReader=HASH(0x98545f0)', 'KinoSearch::Util::IntMap=SCALAR(0x985a6ec)') called at /usr/local/lib/perl/5.8.7/KinoSearch/Index/SegWriter.pm line 118
        KinoSearch::Index::SegWriter::add_segment('KinoSearch::Index::SegWriter=HASH(0x985cba4)', 'KinoSearch::Index::SegReader=HASH(0x98545f0)') called at /usr/local/lib/perl/5.8.7/KinoSearch/InvIndexer.pm line 289
        KinoSearch::InvIndexer::finish('KinoSearch::InvIndexer=HASH(0x9828854)') called at /usr/local/apache/htdocs/solstice/lib//Solstice/Model.pm line 322
        Solstice::Model::storeSearchIndex('WebQ::Model::Survey=HASH(0x93419c4)') called at /usr/local/apache/htdocs/apps/webq/lib//WebQ/Model/Survey.pm line 83
        WebQ::Model::Survey::index('WebQ::Model::Survey=HASH(0x93419c4)') called at index_surveys.pl line 25

Which seems to be very difficult to understand and react to as a user of the KinoSearch API.

This is broken in the 0.20_01 version of KinoSearch that you sent me awhile ago.
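Until the error reporting improves, one way to react to this failure programmatically is to trap the exception at the finish() call. The following is a minimal sketch and not part of KinoSearch: finish_or_explain is a hypothetical helper name, and the only assumption about the indexer is that its finish method dies on failure, as the trace above shows.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of KinoSearch): call finish() on an indexer
# object and turn any exception into a short message with some context.
# Assumes only that $invindexer->finish dies on failure.
sub finish_or_explain {
    my ( $invindexer, $index_path ) = @_;

    # eval BLOCK traps the die; the trailing "1" lets us tell success from
    # failure without relying on the contents of $@.
    my $ok = eval { $invindexer->finish; 1 };
    return 1 if $ok;

    my $err = $@ || 'unknown error';

    # Keep only the first line of the (possibly very long) stack trace.
    my ($first_line) = split /\n/, $err;
    die "Could not finish index '$index_path': $first_line\n";
}
```

The caller can then wrap finish_or_explain in its own eval and decide whether to retry, skip the object, or abort the run.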

Wed Oct 04 11:59:50 2006 marvin [...] rectangular.com - Correspondence added

Subject:	Re: [rt.cpan.org #21888] All docs in an index must have the same fields - badly reported
Date:	Wed, 4 Oct 2006 08:59:31 -0700
To:	bug-KinoSearch [...] rt.cpan.org
From:	Marvin Humphrey <marvin [...] rectangular.com>

On Oct 3, 2006, at 2:38 PM, via RT wrote:

> It seems that all documents in an index must contain the same fields.
> Perhaps that is a bug - I don't see it mentioned in the documentation
> anywhere. If this is by design, it seems like a reasonable restriction,
> but it is badly handled in the API, and badly reported when an error
> occurs (see below).

It's not by design, so we have a bug in the merge algorithm. KinoSearch is supposed to handle merging of segments which contain disparate fields -- in fact, t/213-segment_merging.t has tests designed to verify precisely this behavior. It creates an invindex with one field called "letters", then adds two other invindexes to it via add_invindexes() which have only a field called "content". Apparently the test is insufficiently rigorous.

There is a restriction on field use, but it's within the context of a single session. You can't do this...

    $doc->set_value( foo => 'foo foo' );

... unless you told the InvIndexer about field 'foo' beforehand during this particular indexing session, via spec_field().

Are you able to work around the bug for now by spec'ing all fields? The code where this bug resides is about to get an overhaul (when the file format changes).

[long stack trace snipped]
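The suggested workaround, spec'ing every field in every session, could look roughly like the sketch below. It is illustrative only: index_one_doc and @ALL_FIELDS are hypothetical names, the field names are borrowed from the reproduction snippet later in this ticket, and the indexer is assumed to offer the spec_field, new_doc, add_doc and finish calls used in that snippet. Whether this avoids the merge bug in a given revision has not been verified here.

```perl
use strict;
use warnings;

# Complete list of fields the application ever uses (assumed names, taken
# from the reproduction snippet in this ticket; adjust to your own data).
my @ALL_FIELDS = qw( title owner );

# Hypothetical helper: index one document in its own session, spec'ing every
# known field whether or not this particular document uses it.
# $invindexer is expected to behave like a KinoSearch::InvIndexer;
# $values is a hashref of field name => text.
sub index_one_doc {
    my ( $invindexer, $values ) = @_;

    # Spec all fields up front so every session sees the same field set.
    $invindexer->spec_field( name => $_ ) for @ALL_FIELDS;

    my $doc = $invindexer->new_doc;
    for my $field (@ALL_FIELDS) {
        # Only set values the document actually has.
        next unless defined $values->{$field};
        $doc->set_value( $field => $values->{$field} );
    }

    $invindexer->add_doc($doc);
    $invindexer->finish;
    return;
}
```

Usage would mirror the reproduction snippet: construct the InvIndexer with invindex, create and analyzer as shown there, then pass it to index_one_doc together with the values for that document.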

> Which seems to be very difficult to understand and react to as a
> user of the KinoSearch API.

I agree that it is hard for a user to understand. Fortunately, it tells me a good deal about what's going on. A SegReader thinks that a particular file (the norms file) exists because it knows about a given field. However, that file turns out not to exist, and KS blows up when it tries to read from it. Something is awry in the field-definition merging logic, as the SegReader should not think that file exists when it doesn't.

Out of curiosity, do you have some fields which are not indexed? That's a scenario that's not currently being tested.

Also, are you changing any field definitions between indexing sessions? KS is supposed to handle that, but the algo's kind of sketchy and in the future that will probably result in an error.

> This is broken in the 0.20_01 version of KinoSearch that you sent me
> awhile ago.

Just for the RT record, that's not the official 0.20_01 release (which isn't out yet), it's subversion repository revision 1216.

Thanks for the report,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

Wed Oct 04 11:59:52 2006 The RT System itself - Status changed from 'new' to 'open'

Wed Oct 04 15:03:33 2006 mcrawfor [...] u.washington.edu - Correspondence added

CC:	mcrawfor [...] cpan.org
Subject:	Re: [rt.cpan.org #21888] All docs in an index must have the same fields - badly reported
Date:	Wed, 4 Oct 2006 12:02:24 -0700 (PDT)
To:	"marvin [...] rectangular.com via RT" <bug-KinoSearch [...] rt.cpan.org>
From:	Miles Crawford <mcrawfor [...] u.washington.edu>


> <URL: http://rt.cpan.org/Ticket/Display.html?id=21888 >
>
> It's not by design, so we have a bug in the merge algorithm.
> KinoSearch is supposed to handle merging of segments which contain
> disparate fields -- in fact, t/213-segment_merging.t has tests
> designed to verify precisely this behavior. It creates an invindex
> with one field called "letters", then adds two other invindexes to it
> via add_invindexes() which have only a field called "content".
> Apparently the test is insufficiently rigorous.

Well, I like that - having a very heterogeneous index with a number of document types in it will be very useful to me.

The merge code may work - this surfaces for me during the initial creation of one index. The index is opened, fields are spec'ed, one doc is added, and the index is finished in a loop. In each iteration the spec'ed fields and the fields added to the doc match - but these fields differ from iteration to iteration.

> There is a restriction on field use, but it's within the context of a
> single session. You can't do this...
>
>     $doc->set_value( foo => 'foo foo' );
>
> ... unless you told the InvIndexer about field 'foo' beforehand
> during this particular indexing session, via spec_field().

Ah, no - the indexer fields always match the doc fields in my example - I sometimes just spec and add fields that are different between indexing sessions.

> Are you able to work around the bug for now by spec'ing all fields?
> The code where this bug resides is about to get an overhaul (when the
> file format changes).

I think I can work around it, yes. This won't go into production for months anyway, so there is some flexibility. I was kind of waiting for the file-format change, just to avoid that dance.

> I agree that it is hard for a user to understand. Fortunately, it
> tells me a good deal about what's going on. A SegReader thinks that

Yes - my original report was written with the assumption that the all-docs-have-the-same-fields behavior was expected. It makes much more sense now.

> Out of curiosity, do you have some fields which are not indexed?

No, all the fields in use are indexed. Not all are analyzed or vectorized though.

> Also, are you changing any field definitions between indexing sessions? KS
> is supposed to handle that, but the algo's kind of sketchy and in the future
> that will probably result in an error.

No, the field defs are always the same - in some cases a few are just left out.

I will write you a test case that exhibits the behavior. It'll take me a second to write it in just bare KinoSearch classes - right now I have KinoSearch wrapped a bit in our stuff. The pseudo-code looks kinda like this though:

    for (1..100) {
        $indexer = new indexer from existing index;
        $sometimes_true = some condition check;

        $indexer->spec_field(always here);
        if ($sometimes_true) {
            $indexer->spec_field(sometimes_here);
        }

        $doc = $indexer->new_doc;
        $doc->add_value(always here);
        if ($sometimes_true) {
            $doc->add_value(sometimes_here);
        }

        $indexer->finish
    }

I get the original stack trace die before completing the iteration. The index being used by the invindexer is always the same.

I realize it would be more efficient to leave the indexer open, but this indexing happens when I store an object, and I'm trying to jumpstart the index by just instantiating and storing all the objects currently in the datastore.

Again, I'll see if I can work up an actual failing test case.

Thanks,
-miles

>> This is broken in the 0.20_01 version of KinoSearch that you sent me
>> awhile ago.
>
> Just for the RT record, that's not the official 0.20_01 release
> (which isn't out yet), it's subversion repository revision 1216.
>
> Thanks for the report,
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/

Thu Oct 05 18:27:01 2006 mcrawfor [...] cpan.org - Correspondence added

Subject: All docs in an index must have the same fields

The following snippet reproduces 100% of the time for me. It runs a variable length based on the randomness, but it never finishes the iteration. I realize this is not a terribly efficient method of indexing a large body of docs, but it accurately models the way our object oriented code would be indexing as each object stores as it is created over time.

    use strict;
    use KinoSearch::InvIndexer;
    use KinoSearch::Analysis::PolyAnalyzer;

    my $index_filename = '/tmp/example.idx';

    for (1..1000) {
        warn $_;
        my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );
        my $create = -d $index_filename ? 0 : 1;    # does the index exist?
        my $invindexer = KinoSearch::InvIndexer->new(
            invindex => $index_filename,
            create   => $create,
            analyzer => $analyzer,
        );

        my $extra = (rand() < .5);
        $invindexer->spec_field( name => 'title' );
        $invindexer->spec_field( name => 'owner' ) if $extra;

        my $doc = $invindexer->new_doc;
        $doc->set_value( title => 'some text' );
        $doc->set_value( owner => 'some text' ) if $extra;
        $invindexer->add_doc($doc);

        $invindexer->finish;
    }

Wed Oct 11 02:57:15 2006 marvin [...] rectangular.com - Correspondence added

Subject:	Re: [rt.cpan.org #21888] All docs in an index must have the same fields
Date:	Tue, 10 Oct 2006 23:56:55 -0700
To:	bug-KinoSearch [...] rt.cpan.org
From:	Marvin Humphrey <marvin [...] rectangular.com>

On Oct 5, 2006, at 3:27 PM, via RT wrote:

> The following snippet reproduces 100% of the time for me.

Thanks for the test; I've incorporated its essence into the KS test suite. The error is 100% reproducible, and I had a go at debugging the problem.

Attempts at a quick fix were not fruitful. The bug appears to reside in some crufty code which is a legacy of the attempt to make KinoSearch file-format compatible with Lucene. Lucene compatibility was abandoned as a goal a long time ago, and this code is going to get an overhaul soon. Until it does, I'll be annoyed by a failing test... so, um, thanks for that. :)

Since you have a workaround, and since an intermediate fix would probably mean piling cruft on top of cruft, I'm going to hold off on fixing it for now.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

Tue Mar 20 20:10:55 2007 CREAMYG [...] cpan.org - Fixed in 0.20_02 added

Tue Mar 20 20:13:44 2007 CREAMYG [...] cpan.org - Correspondence added

The mechanism for specifying fields in KS 0.20_xx has changed significantly. As of 0.20_01, docs no longer need have the same fields. As of version 0.20_02, it is possible to add fields at any time during indexing.

Tue Mar 20 20:13:48 2007 CREAMYG [...] cpan.org - Status changed from 'open' to 'resolved'
