
This queue is for tickets about the Search-Indexer CPAN distribution.

Report information
The Basics
Id: 43508
Status: rejected
Priority: 0/
Queue: Search-Indexer

People
Owner: Nobody in particular
Requestors: NKH [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 0.75
Fixed in: (no value)



Subject: Removed documents are matched even after they are removed from the database
Removing a document when the 'positions' argument is 0 and you don't have the original document keeps references to the document in the index database. If you add a document and then remove it, a search still returns the removed document id. This is very wrong, as the engine should also remove the document id from the index or provide a garbage collection function.

This causes the following problems:
- One must use unique ids
- One must check the document ids returned by a search
On Sun Feb 22 10:04:52 2009, NKH wrote:
> Removing a document when the 'positions' argument is 0 and you don't have
> the original document keeps references to the document in the index
> database. If you add a document and then remove it, a search still
> returns the removed document id. This is very wrong, as the engine should
> also remove the document id from the index or provide a garbage
> collection function.
>
> This causes the following problems:
> - One must use unique ids
> - One must check the document ids returned by a search
The doc says that "when removing a document, and when the index was created without word positions, then the text representation of the document must be given as second argument and must be the same as the one that was supplied when calling the add() method". So if this API is not respected, the index database does indeed get corrupted; but this is not a bug, it is just that the method was called with improper arguments, so I'm rejecting the ticket.

An index created with option {positions=>0} just keeps a collection of inverted lists $word_id => [$doc1, $doc2, ...], with one entry for each word. This is optimized for providing fulltext search functionality at a minimal cost in terms of storage. If you know a word id, you can retrieve which documents contain that word; but if you know a doc id, there is no direct way to retrieve which words are in that document (that would require walking through the entire index, which would be much too slow).

An index created with option {positions=>1} has a more complex data structure, with enough information to retrieve which words belong to a given document; but the cost is much more disk space and indexing time. There is no free lunch!
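
To make the trade-off concrete, here is a minimal toy model of an index without positions, written in Python rather than Perl. It is only a sketch of the idea described above, not the Search::Indexer implementation: the class `ToyIndex`, its methods, and the tokenizer are all hypothetical names invented for this illustration. It shows why removal needs the original text, and what the stale result looks like when the text is missing.

```python
import re
from collections import defaultdict


def words(text):
    """Split text into a set of lowercase word tokens (simplistic tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))


class ToyIndex:
    """Hypothetical inverted index without positions: word -> set of doc ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Record the doc id in the list of every word it contains.
        for w in words(text):
            self.postings[w].add(doc_id)

    def remove(self, doc_id, text):
        # Only the lists for words found in `text` are visited, so `text`
        # must match what was passed to add(); otherwise entries are left behind.
        for w in words(text):
            self.postings[w].discard(doc_id)

    def remove_by_scan(self, doc_id):
        # The alternative when the text is lost: walk every list in the index.
        # Cost grows with the whole vocabulary, not with the document size.
        for ids in self.postings.values():
            ids.discard(doc_id)

    def search(self, word):
        return sorted(self.postings.get(word.lower(), set()))


index = ToyIndex()
index.add(1, "the quick brown fox")
index.add(2, "the lazy dog")

# Removing without the original text leaves document 1 in the index.
index.remove(1, "")
stale = index.search("fox")

# Removing with the same text that was given to add() cleans it up.
index.remove(1, "the quick brown fox")
clean = index.search("fox")
remaining = index.search("the")
```

In this model `stale` still contains document 1, while `clean` is empty and `remaining` holds only document 2. The `remove_by_scan` method corresponds to the "walking through the entire index" option mentioned above, which works but touches every word list.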