Subject: | Complexity of string construction is O(ML^2) |
Consider this short program:
###
use Text::Match::FastAlternatives;
my @dict = ();
for my $i (0 .. 1999) {
push @dict, ('b' x $i) . ('a' x (2000 - $i));
}
my $aho = Text::Match::FastAlternatives->new(@dict);
###
There are 2K strings, each of length 2K. One would expect O(ML) construction to take at most a couple of seconds (where M = number of strings, L = length of strings), however this runs for a substantial amount of time with the last line included; when it is excluded the runtime of the program is negligible.
This is due to the use of longest_suffix to determine the failure function for each vertex. When the number of vertices is O(ML), as here, and the time complexity of longest_suffix is O(L), this leads to O(ML^2) overall complexity.
The effect is just as pronounced with a dictionary containing a single string, ('a' x 1000000).
One of the key points in the KMP/Aho-Corasick algorithm is that longest_suffix can be computed in constant amortised time by exploiting substructure in the dictionary. See http://wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm for more information.