Bug #43508 for Search-Indexer: Removed documents are matched even after they are removed from the database

Sun Feb 22 10:04:52 2009 NKH [...] cpan.org - Ticket created

Subject:

Removed documents are matched even after they are removed from the database

Removing a document when the 'positions' argument is 0, and you don't have the original document, leaves references to that document in the index database. If you add a document and then remove it, a search still returns the removed document id. This is very wrong: the engine should also remove the document id from the index, or provide a garbage collection function.

This causes the following problems:
- One must use unique ids
- One must check the document ids returned by a search

Sun Feb 22 14:49:57 2009 DAMI [...] cpan.org - Correspondence added

On Sun Feb 22 10:04:52 2009, NKH wrote:

> Removing document when the 'positions' argument is 0 and you don't have
> the original document keeps references to the document in the index
> database. If you add a document and then remove it, a search would still
> return the removed document id. This is very wrong as the engine should
> also remove the document id from the index or provide a garbage
> collection function.
>
> This means the following problems:
> - One must use unique ids
> - One must check the documents ids returned by a search

The doc says that "when removing a document, and when the index was created without word positions, then the text representation of the document must be given as second argument and must be the same as the one that was supplied when calling the add() method". So if this API is not respected, the index database does indeed get corrupted; but this is not a bug. The method was simply called with improper arguments, so I'm rejecting the ticket.

An index created with option {positions=>0} just keeps a collection of inverted lists $word_id => [$doc1, $doc2, ...], with one entry for each word. This is optimized for providing fulltext search functionality at a minimal cost in terms of storage. If you know a word id, you can retrieve which documents contain that word; but if you know a doc id, there is no direct way to retrieve which words are in that document (that would require walking through the entire index, which would be much too slow).

An index created with option {positions=>1} has a more complex data structure, with enough information to retrieve which words belong to a given document; but the cost is much more disk space and indexing time ... there is no free lunch!
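To illustrate the trade-off described above, here is a minimal Python sketch of an index without word positions. It is a toy model only, not Search::Indexer's actual code: the class `InvertedIndex` and its methods, including `remove_by_scan`, are hypothetical names chosen for the example, and words are used directly as keys instead of word ids.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy index without positions: each word maps to a set of doc ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for word in text.lower().split():
            self.postings[word].add(doc_id)

    def remove(self, doc_id, text):
        # Only the words found in `text` are visited, so `text` must be
        # the same text that was passed to add() for this doc id.
        for word in set(text.lower().split()):
            ids = self.postings.get(word)
            if ids is None:
                continue
            ids.discard(doc_id)
            if not ids:
                del self.postings[word]

    def remove_by_scan(self, doc_id):
        # Hypothetical "garbage collection": without the text, every
        # inverted list has to be visited. Returns the number visited.
        visited = 0
        for word in list(self.postings):
            visited += 1
            ids = self.postings[word]
            ids.discard(doc_id)
            if not ids:
                del self.postings[word]
        return visited

    def search(self, word):
        return sorted(self.postings.get(word.lower(), ()))


index = InvertedIndex()
index.add(1, "the quick brown fox")
index.add(2, "the lazy dog")

index.remove(1, "")            # original text not supplied: nothing is cleaned
stale = index.search("quick")  # doc 1 is still matched

index.remove(1, "the quick brown fox")  # same text as given to add()
clean = index.search("quick")           # doc 1 is gone
```

In this sketch, `remove` costs time proportional to the size of the document, while `remove_by_scan` costs time proportional to the size of the whole vocabulary, which is the "walking through the entire index" case mentioned above.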

Sun Feb 22 14:49:57 2009 The RT System itself - Status changed from 'new' to 'open'

Sun Feb 22 14:49:58 2009 DAMI [...] cpan.org - Status changed from 'open' to 'rejected'