Skip Menu |

This queue is for tickets about the Algorithm-Diff CPAN distribution.

Report information
The Basics
Id: 8507
Status: new
Priority: 0/
Queue: Algorithm-Diff

People
Owner: Nobody in particular
Requestors: edoverton [...] nyc.rr.com
Cc:
AdminCc:

Bug Information
Severity: Wishlist
Broken in:
  • 1.15
  • 1.1901
Fixed in: (no value)



Subject: Interleaved identical lines cause significant slowdown
Short description: Algorithm::Diff::diff() runs quickly on something like: 1 2 3 4 5 6 7 8 9 0 1 X 3 4 5 6 7 8 9 X It runs much slower when identical entries are interleaved: a 2 a 4 a 6 a 8 a 0 a X a 4 a 6 a 8 a X Roughly speaking, it appears that diffs against the first type of comparison slow down proportional to the size of the compared arrays. Diffs against the second type of comparison appear to slow down proportional to the square of the size of the compared arrays. I see this performance hit with both 1.15 and 1.1901, and I see it on all hosts I've tried (HP/UX, linux, Windows). For comparison, /bin/diff handles the interleaved identical entries without any slowdown. I originally encountered this issue when running diffs against large html files (where numerous identical lines are common for items such as tables). Judging by Devel::Dprof, the difference is in the number of calls to _replaceNextLargerWith. I'm assuming the issue is that the optimization ("optimization: most of the time this will be true") is not true most of the time for the interleaved array comparison. Here's the testcase program: Show quoted text
--- snip here --- #!/usr/local/bin/perl use lib 'Algorithm-Diff-1.1901/lib/'; use warnings; use strict; use Benchmark qw(cmpthese); use Algorithm::Diff qw(diff); my $count = @ARGV ? shift : 500; $count = 10 if $count < 10; if ($count % 2) { $count++; } print << "EO_MESSAGE"; Algorithm::Diff version is $Algorithm::Diff::VERSION. Array size is $count. EO_MESSAGE my @plain_left = (1 .. $count); my @plain_right = @plain_left; $plain_right[1] = $plain_right[-1] = '8888888888'; my @leaved_left = map { $_ % 2 ? 'SAME' : $_ } @plain_left; my @leaved_right = map { $_ % 2 ? 'SAME' : $_ } @plain_right; my $run_count = @ARGV ? shift : 10; cmpthese( $run_count , { 'Plain' => sub { diff(\@plain_left, \@plain_right) } , 'Leaved' => sub { diff(\@leaved_left, \@leaved_right) } } );
--- snip here --- Here are results for runs on linux (similar results appear on the other hosts). Here, the perl version is 5.6.1. rhlgn003[9]$ diffem 100 200 Algorithm::Diff version is 1.1901. Array size is 100. Benchmark: timing 200 iterations of Leaved, Plain... Leaved: 4 wallclock secs ( 4.13 usr + 0.00 sys = 4.13 CPU) @ 48.43/s (n=200) Plain: 1 wallclock secs ( 0.55 usr + 0.00 sys = 0.55 CPU) @ 363.64/s (n=200) Rate Leaved Plain Leaved 48.4/s -- -87% Plain 364/s 651% -- rhlgn003[10]$ diffem 200 200 Algorithm::Diff version is 1.1901. Array size is 200. Benchmark: timing 200 iterations of Leaved, Plain... Leaved: 17 wallclock secs (16.79 usr + 0.00 sys = 16.79 CPU) @ 11.91/s (n=200) Plain: 1 wallclock secs ( 1.09 usr + 0.00 sys = 1.09 CPU) @ 183.49/s (n=200) Rate Leaved Plain Leaved 11.9/s -- -94% Plain 183/s 1440% -- rhlgn003[11]$ diffem 400 200 Algorithm::Diff version is 1.1901. Array size is 400. Benchmark: timing 200 iterations of Leaved, Plain... Leaved: 70 wallclock secs (70.28 usr + 0.00 sys = 70.28 CPU) @ 2.85/s (n=200) Plain: 3 wallclock secs ( 2.19 usr + 0.00 sys = 2.19 CPU) @ 91.32/s (n=200) Rate Leaved Plain Leaved 2.85/s -- -97% Plain 91.3/s 3109% -- rhlgn003[12]$
After some further investigation, it seems I've simply hit a situation for worst case behavior in the Hunt-McIlroy algorithm. http://c2.com/cgi/wiki?DiffAlgorithm I assume there is a way to optimize around this situation, though, since /bin/diff fields this sort of difference very quickly.
Long story short, I'm working on a new LCS implementation based on the one detailed in "An O(NP) Sequence Comparison Algorithm" by Wu, Myers, Manber, and Miller. The gnu diff implementation is based on the algorithm in "An O(ND) Difference Algorithm and Its Variations" by Myers, and the O(NP) algorithm is an improvement on the O(ND) one. I'll submit a patch for Algorithm::Diff when I'm finished. Ed