Bug #27986 for File-Find-Duplicates: A suggestion for File::Find::Duplicates

Subject:	A suggestion for File::Find::Duplicates
Date:	Thu, 05 Jul 2007 17:48:04 +0200
To:	bug-File-Find-Duplicates [...] rt.cpan.org
From:	Tobias Gödderz <goedderz [...] uni-bonn.de>

Hello!

Today I wrote a script with nearly the same functionality as File::Find::Duplicates provides. Later I thought about uploading it to CPAN; then I found your module and had a look at it.

I noticed that, when two files have the same size, you decide whether they are equal by their md5sum, and as you probably know, there is a small chance of false positives.

I wrote a routine which compares a list of files (that is, filenames) with the same size and returns a list of lists, each inner list containing the filenames of files whose contents are identical. Files without multiple occurrences aren't returned.

It is pretty efficient: it reads each file only once, and reads them blockwise, so if it processes n files and uses a block size of 4 KB, it needs about n*4 KB of memory. I made some tests, and it seems to be as fast as calculating the md5sum, as I expected.

I thought you might be interested, and I would be glad if you want to use it, or parts of it, in File::Find::Duplicates instead of comparing md5sums.

Kind regards and looking forward to your reply,
Tobias Gödderz

--
perl -le 'open STDOUT, "|-" and print "uJa tsonrehtP lreahrekc" or print pack "nN"x4, unpack "vV"x4, <STDIN>'

sub cmpfiles {
    # default to $_; has to be an arrayref to an array of filenames
    my $files = $_;
    $files = shift if @_;

    my @lolofhs = ([
        map {
            open my $fh, "<", $_ or die "Can't open `$_': $!";
            binmode($fh);
            { fn => $_, fh => $fh }
        } @$files
    ]);
    my @result;

    # read files in parallel, 4kb-wise, compare them, and leave a list
    # of lists in @result, every inner list containing equal files
    while (@lolofhs) {
        my @fhs = @{shift @lolofhs};
        my %chunkify;
        for (@fhs) {
            my $content;
            read $_->{fh}, $content, 2**12;
            push @{$chunkify{$content}}, $_;
        }
        if (@fhs == grep eof($_->{fh}), @fhs) {
            push @result, grep @$_ > 1, values %chunkify;
        }
        else {
            push @lolofhs, grep @$_ > 1, values %chunkify;
        }
    }

    return map [ map $_->{fn}, @$_ ], @result;
}
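
The routine above expects every file in its input list to have the same size, so a caller has to build those groups first. The following is only a minimal sketch of that step, not code from File::Find::Duplicates or from the original script; the helper name group_by_size is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not part of the original routine):
# bucket file names by size and return only the buckets holding more than
# one file, each as an arrayref of file names.
sub group_by_size {
    my @paths = @_;
    my %by_size;
    for my $path (@paths) {
        my @st = stat $path or next;   # skip paths that cannot be stat'ed
        next unless -f _;              # plain files only (reuses the stat above)
        push @{ $by_size{ $st[7] } }, $path;   # element 7 of stat is the size in bytes
    }
    return grep { @$_ > 1 } values %by_size;
}
```

With cmpfiles in scope, the two pieces could be combined along the lines of: my @duplicate_sets = map { cmpfiles($_) } group_by_size(@ARGV);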