Subject: A suggestion for File::Find::Duplicates
Date: Thu, 05 Jul 2007 17:48:04 +0200
To: bug-File-Find-Duplicates [...] rt.cpan.org
From: Tobias Gödderz <goedderz [...] uni-bonn.de>
Hello!
Today I wrote a script with nearly the same functionality as
File::Find::Duplicates provides. Later I thought about uploading it to
CPAN; then I found your module and had a look at it.
I noticed that you decide whether two files are equal by their md5sum
if their sizes don't differ, and as you probably know, there is a
small chance of false positives, since two different files can share
the same digest.
I wrote a routine which compares a list of files (that is, filenames)
with the same size and returns a list of lists, each inner list
containing filenames of files whose contents are identical. Files
without multiple occurrences aren't returned.
It is pretty efficient; it reads each file only once, and reads them
blockwise, so if it processes n files and uses a block size of 4 KB, it
needs about n * 4 KB of memory. I ran some tests, and it seems to be
about as fast as calculating the md5sum, as I expected.
I thought you might be interested, and I would be glad if you wanted to
use it, or parts of it, in File::Find::Duplicates instead of comparing
md5sums.
Kind regards and looking forward to your reply,
Tobias Gödderz
--
perl -le 'open STDOUT, "|-"
and print "uJa tsonrehtP lreahrekc"
or print pack "nN"x4, unpack "vV"x4, <STDIN>'
sub cmpfiles {
    # default to $_; has to be an arrayref to an array of filenames
    my $files = $_;
    $files = shift if @_;
    # start with one group holding every file, each paired with its handle
    my @lolofhs = ([ map {
        open my $fh, "<", $_ or die "Can't open `$_': $!";
        binmode($fh);
        +{ fn => $_, fh => $fh }    # unary plus: anonymous hash, not a block
    } @$files ]);
    my @result;
    # read files in parallel, 4kb-wise, compare them, and leave a list
    # of lists in @result, every inner list containing equal files
    while(@lolofhs) {
        my @fhs = @{shift @lolofhs};
        # split the group by the content of the next block
        my %chunkify;
        for(@fhs) {
            my $content;
            defined read($_->{fh}, $content, 2**12)
                or die "Can't read `$_->{fn}': $!";
            push @{$chunkify{$content}}, $_;
        }
        if(@fhs == grep eof($_->{fh}), @fhs) {
            # every file is fully read: files still grouped are equal
            push @result, grep @$_ > 1, values %chunkify;
        }
        else {
            # not finished yet: requeue groups that can still hold duplicates
            push @lolofhs, grep @$_ > 1, values %chunkify;
        }
    }
    return map [ map $_->{fn}, @$_ ], @result;
}
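
Since cmpfiles expects files of the same size, the caller has to group
the candidates by size first. A minimal sketch of that step follows; the
helper name group_by_size is hypothetical and not part of the routine
above or of File::Find::Duplicates.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not part of the original routine):
# bucket filenames by size so each bucket can be handed to cmpfiles().
sub group_by_size {
    my @files = @_;
    my %by_size;
    for my $file (@files) {
        my $size = (stat $file)[7];    # element 7 of stat() is the size in bytes
        next unless defined $size;     # skip files that cannot be stat'ed
        push @{ $by_size{$size} }, $file;
    }
    # A file whose size is unique cannot have a duplicate.
    return grep { @$_ > 1 } values %by_size;
}

# Usage, assuming cmpfiles() from above is in scope:
# for my $candidates (group_by_size(@ARGV)) {
#     print join(" ", @$_), "\n" for cmpfiles($candidates);
# }
```

Each returned arrayref holds at least two filenames of equal size, so it
can be passed to cmpfiles directly.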