On Sun Feb 22 10:04:52 2009, NKH wrote:
> Removing document when the 'positions' argument is 0 and you don't have
> the original document keeps references to the document in the index
> database. If you add a document and then remove it, a search would still
> return the removed document id. This is very wrong as the engine should
> also remove the document id from the index or provide a garbage
> collection function.
>
> This means the following problems:
> - One must use unique ids
> - One must check the documents ids returned by a search
The doc says that "when removing a document, and when the index was
created without word positions, then the text representation of the
document must be given as second argument and must be the same as the
one that was supplied when calling the add() method".
So if this API is not respected, the index database does indeed get
corrupted. This is not a bug, however: the method was simply called
with improper arguments, so I'm rejecting the ticket.
An index created with option {positions=>0} just keeps a collection of
inverted lists $word_id => [$doc1, $doc2, ...], with one entry for each
word. This is optimized for providing fulltext search functionality at a
minimal cost in terms of storage; so if you know a word id, you can
retrieve which documents contain that word; but if you know a doc id,
there is no direct way to retrieve which words are in that document
(that would require walking through the entire index, which would be
much too slow).
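To make this concrete, here is a minimal in-memory sketch in Python of an index that stores only inverted lists. It is a toy model and not the module's actual implementation: the class name `TinyIndex`, the whitespace tokenizer, and the use of words instead of word ids as keys are all assumptions made for illustration.

```python
from collections import defaultdict


class TinyIndex:
    """Toy positions-free index: word -> set of doc ids (hypothetical)."""

    def __init__(self):
        self.inverted = defaultdict(set)

    @staticmethod
    def words(text):
        # Naive tokenizer (an assumption): lowercase, split on whitespace.
        return set(text.lower().split())

    def add(self, doc_id, text):
        for word in self.words(text):
            self.inverted[word].add(doc_id)

    def remove(self, doc_id, text):
        # The text is required: it is the only cheap way to know
        # which inverted lists mention this document.
        for word in self.words(text):
            self.inverted[word].discard(doc_id)
            if not self.inverted[word]:
                del self.inverted[word]

    def search(self, word):
        return sorted(self.inverted.get(word.lower(), ()))


index = TinyIndex()
index.add(1, "free lunch")
index.add(2, "lunch break")
index.remove(1, "free lunch")  # same text as was given to add()
```

If `remove` is called with a different text, the lists for the original words are never visited, so the document id stays in them. That is the stale-result behaviour described in the ticket.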
An index created with option {positions=>1} has a more complex
data structure, with enough information to retrieve which words belong to
a given document; but the cost is much more disk space and indexing time
... there is no free lunch!
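The trade-off can be sketched in Python with a toy index that also keeps a forward mapping from document id to words. This is only one possible way to obtain the property and is not necessarily how the module stores positions; the class name `TinyForwardIndex` and the whitespace tokenizer are assumptions made for illustration.

```python
from collections import defaultdict


class TinyForwardIndex:
    """Toy index with a forward map, so remove() needs only the doc id."""

    def __init__(self):
        self.inverted = defaultdict(set)  # word -> doc ids
        self.forward = {}                 # doc id -> words (the extra storage)

    def add(self, doc_id, text):
        # Naive tokenizer (an assumption): lowercase, split on whitespace.
        words = set(text.lower().split())
        self.forward[doc_id] = words
        for word in words:
            self.inverted[word].add(doc_id)

    def remove(self, doc_id):
        # The forward map tells us which inverted lists to clean up.
        for word in self.forward.pop(doc_id, ()):
            self.inverted[word].discard(doc_id)

    def search(self, word):
        return sorted(self.inverted.get(word.lower(), ()))


index = TinyForwardIndex()
index.add(1, "free lunch")
index.add(2, "lunch break")
index.remove(1)  # no text needed
```

Every document's word set is stored a second time in `forward`, which is where the extra space and indexing work come from in this sketch.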