Subject: | Interleaved identical lines cause significant slowdown |
Short description:
Algorithm::Diff::diff() runs quickly on something like:
1 2 3 4 5 6 7 8 9 0
1 X 3 4 5 6 7 8 9 X
It runs much slower when identical entries are interleaved:
a 2 a 4 a 6 a 8 a 0
a X a 4 a 6 a 8 a X
Roughly speaking, it appears that diffs against the first type of comparison slow down proportional to the size of the compared arrays. Diffs against the second type of comparison appear to slow down proportional to the square of the size of the compared arrays.
I see this performance hit with both 1.15 and 1.1901, and I see it on all hosts I've tried (HP/UX, linux, Windows). For comparison, /bin/diff handles the interleaved identical entries without any slowdown. I originally encountered this issue when running diffs against large html files (where numerous identical lines are common for items such as tables).
Judging by Devel::Dprof, the difference is in the number of calls to _replaceNextLargerWith. I'm assuming the issue is that the optimization ("optimization: most of the time this will be true") is not true most of the time for the interleaved array comparison.
Here's the testcase program:
Show quoted text
--- snip here ---
#!/usr/local/bin/perl
use lib 'Algorithm-Diff-1.1901/lib/';
use warnings;
use strict;
use Benchmark qw(cmpthese);
use Algorithm::Diff qw(diff);
my $count = @ARGV ? shift : 500;
$count = 10 if $count < 10;
if ($count % 2) {
$count++;
}
print << "EO_MESSAGE";
Algorithm::Diff version is $Algorithm::Diff::VERSION.
Array size is $count.
EO_MESSAGE
my @plain_left = (1 .. $count);
my @plain_right = @plain_left;
$plain_right[1] = $plain_right[-1] = '8888888888';
my @leaved_left = map { $_ % 2 ? 'SAME' : $_ } @plain_left;
my @leaved_right = map { $_ % 2 ? 'SAME' : $_ } @plain_right;
my $run_count = @ARGV ? shift : 10;
cmpthese(
$run_count
, {
'Plain' => sub { diff(\@plain_left, \@plain_right) }
, 'Leaved' => sub { diff(\@leaved_left, \@leaved_right) }
}
);
--- snip here ---
Here are results for runs on linux (similar results appear on the other hosts). Here, the perl version is 5.6.1.
rhlgn003[9]$ diffem 100 200
Algorithm::Diff version is 1.1901.
Array size is 100.
Benchmark: timing 200 iterations of Leaved, Plain...
Leaved: 4 wallclock secs ( 4.13 usr + 0.00 sys = 4.13 CPU) @ 48.43/s (n=200)
Plain: 1 wallclock secs ( 0.55 usr + 0.00 sys = 0.55 CPU) @ 363.64/s (n=200)
Rate Leaved Plain
Leaved 48.4/s -- -87%
Plain 364/s 651% --
rhlgn003[10]$ diffem 200 200
Algorithm::Diff version is 1.1901.
Array size is 200.
Benchmark: timing 200 iterations of Leaved, Plain...
Leaved: 17 wallclock secs (16.79 usr + 0.00 sys = 16.79 CPU) @ 11.91/s (n=200)
Plain: 1 wallclock secs ( 1.09 usr + 0.00 sys = 1.09 CPU) @ 183.49/s (n=200)
Rate Leaved Plain
Leaved 11.9/s -- -94%
Plain 183/s 1440% --
rhlgn003[11]$ diffem 400 200
Algorithm::Diff version is 1.1901.
Array size is 400.
Benchmark: timing 200 iterations of Leaved, Plain...
Leaved: 70 wallclock secs (70.28 usr + 0.00 sys = 70.28 CPU) @ 2.85/s (n=200)
Plain: 3 wallclock secs ( 2.19 usr + 0.00 sys = 2.19 CPU) @ 91.32/s (n=200)
Rate Leaved Plain
Leaved 2.85/s -- -97%
Plain 91.3/s 3109% --
rhlgn003[12]$