Bug #98566 for Kasago: Indexing bug for large source trees

Subject:

Indexing bug for large source trees

For large source trees (around 200k lines), the implicit btree index that the unique constraint creates on words(word) can make postgres complain about index size. This patch adds an option to index that column with a hash index instead.

Subject:

Kasago.patch

diff --git a/CHANGES b/CHANGES
index c7aaa13..46c1059 100644
--- a/CHANGES
+++ b/CHANGES
@@ -1,4 +1,6 @@
 CHANGES file for Kasago:
 
+0.3 Tue Sep 3
+  Add support for hash indexing on words(word)
 0.29 Tue Jul 26 14:32:05 BST 2005
   - first release
\ No newline at end of file
diff --git a/lib/Kasago.pm b/lib/Kasago.pm
index 687a455..9abd51d 100644
--- a/lib/Kasago.pm
+++ b/lib/Kasago.pm
@@ -12,7 +12,7 @@ use PPI;
 use Search::QueryParser;
 use base qw( Class::Accessor::Chained::Fast );
 __PACKAGE__->mk_accessors(qw( dbh ));
-our $VERSION = '0.29';
+our $VERSION = '0.3';
 
 sub new {
     my $class = shift;
@@ -31,7 +31,7 @@ sub DESTROY {
 }
 
 sub init {
-    my $self = shift;
+    my ($self, $index_type) = @_;
     my $dbh = $self->dbh;
 
     eval {
@@ -65,12 +65,20 @@ CREATE TABLE files (
 CREATE INDEX source_id_index ON files(source_id);
 ");
 
-    $dbh->do("
+    my $words_table = "
 CREATE TABLE words (
   word_id SERIAL PRIMARY KEY,
-  word TEXT UNIQUE
+  word TEXT ";
+    if ($index_type eq 'hash') {
+        $words_table .= "
 ) WITHOUT OIDS;
-");
+CREATE INDEX words_word ON words USING hash (word);
+";
+    }
+    else {
+        $words_table .= " UNIQUE) WITHOUT OIDS;";
+    }
+    $dbh->do($words_table);
 
     $dbh->do("
 CREATE TABLE lines (
@@ -552,12 +560,15 @@ You pass a source name and the directory path:
 
   $kasago->import($source, $dir);
 
-=head2 init
+=head2 init ($index_type)
 
-This created the tables needed by Kasago in the database. You only need run this
-once. If you run this after initialisation, it will delete the index.
-
-  $kasago->init;
+This creates the tables needed by Kasago in the database. You only need run
+this once. If you run this after initialisation, it will delete the index.
+If $index_type eq 'hash' then a hash based index will be created on
+words(word). Otherwise an implicit btree index will be created. For large
+codebases, postgres can complain about index size for the btree index. The
+hash index fixes this, but at the expense of only being useful for equality
+operators.
   $kasago->init;
 
 =head2 search
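For anyone who wants to check the SQL without a database, here is a minimal standalone sketch of the two branches that the patched init() is meant to take. The helper name words_table_sql is hypothetical and is not part of Kasago; it only builds the statement text and does not talk to postgres. It assumes the pieces are appended with `.=` so that the CREATE TABLE prefix is kept in both branches.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Kasago): returns the DDL for the words
# table, following the two branches of the patched init().
sub words_table_sql {
    my ($index_type) = @_;

    # Common prefix shared by both variants.
    my $sql = "CREATE TABLE words (\n"
            . "  word_id SERIAL PRIMARY KEY,\n"
            . "  word TEXT";

    # Guard with defined() so that calling without an argument does not
    # trigger an "uninitialized value" warning from eq.
    if ( defined $index_type && $index_type eq 'hash' ) {
        # Plain column plus an explicit hash index.
        $sql .= "\n) WITHOUT OIDS;\n"
              . "CREATE INDEX words_word ON words USING hash (word);\n";
    }
    else {
        # Default: UNIQUE constraint, which creates an implicit btree index.
        $sql .= " UNIQUE\n) WITHOUT OIDS;\n";
    }

    return $sql;
}

print words_table_sql('hash'), "\n";
print words_table_sql();
```

With the patch applied, the hash variant would be selected by calling $kasago->init('hash'), and a plain $kasago->init keeps the old behaviour. One trade-off worth noting: postgres only supports unique indexes with btree, so the hash variant no longer enforces uniqueness of words.word at the database level.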