
This queue is for tickets about the File-Find-Duplicates CPAN distribution.

Report information
The Basics
Id: 27986
Status: new
Priority: 0/
Queue: File-Find-Duplicates

People
Owner: Nobody in particular
Requestors: goedderz [...] uni-bonn.de
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: A suggestion for File::Find::Duplicates
Date: Thu, 05 Jul 2007 17:48:04 +0200
To: bug-File-Find-Duplicates [...] rt.cpan.org
From: Tobias Gödderz <goedderz [...] uni-bonn.de>
Hello!

Today I wrote a script with nearly the same functionality as File::Find::Duplicates provides. Later I thought about uploading it to CPAN; then I found your module and had a look at it.

I noticed that, when two files have the same size, you decide whether they are equal by comparing their md5sums, and as you probably know, there is a small chance of false positives.

I wrote a routine which compares a list of files (that is, filenames) of the same size and returns a list of lists, each inner list containing the filenames of files whose contents are identical. Files without multiple occurrences aren't returned.

It is pretty efficient: it reads each file only once, and reads the files blockwise, so if it processes n files with a block size of 4 KB, it needs about n * 4 KB of memory for the blocks. I ran some tests, and it seems to be as fast as calculating the md5sum, as I expected.

I thought you might be interested, and I would be glad if you want to use it, or parts of it, in File::Find::Duplicates instead of comparing md5sums.

Kind regards and looking forward to your reply,
Tobias Gödderz

--
perl -le 'open STDOUT, "|-" and print "uJa tsonrehtP lreahrekc" or print pack "nN"x4, unpack "vV"x4, <STDIN>'
sub cmpfiles {
    # default to $_; has to be an arrayref to an array of filenames
    my $files = $_;
    $files = shift if @_;

    my @lolofhs = ([ map {
        open my $fh, "<", $_ or die "Can't open `$_': $!";
        binmode($fh);
        +{ fn => $_, fh => $fh }
    } @$files ]);
    my @result;

    # read files in parallel, 4kb-wise, compare them, and leave a list
    # of lists in @result, every inner list containing equal files
    while (@lolofhs) {
        my @fhs = @{shift @lolofhs};
        my %chunkify;
        for (@fhs) {
            my $content;
            read $_->{fh}, $content, 2**12;
            push @{$chunkify{$content}}, $_;
        }
        if (@fhs == grep eof($_->{fh}), @fhs) {
            push @result, grep @$_ > 1, values %chunkify;
        } else {
            push @lolofhs, grep @$_ > 1, values %chunkify;
        }
    }

    return map [ map $_->{fn}, @$_ ], @result;
}
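
The routine above expects filenames that already share a size. A minimal sketch of that preparatory step might look like the following. The helper name group_by_size is hypothetical and not part of File::Find::Duplicates, and the sketch assumes the filenames are passed on the command line.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not taken from File::Find::Duplicates):
# bucket filenames by size in bytes and keep only buckets holding at least
# two files, since a file with a unique size cannot have a duplicate.
sub group_by_size {
    my @files = @_;
    my %by_size;
    for my $file (@files) {
        my $size = (stat $file)[7];    # element 7 of stat is the size in bytes
        next unless defined $size;     # skip entries that cannot be stat'ed
        push @{ $by_size{$size} }, $file;
    }
    return grep { @$_ > 1 } values %by_size;
}

# Each returned group is an arrayref of same-size filenames, which is the
# input shape cmpfiles takes, so the two could be combined like this:
#   my @duplicates = map { cmpfiles($_) } group_by_size(@ARGV);
print join(" ", @$_), "\n" for group_by_size(@ARGV);
```

Grouping by size first keeps the number of files opened at once small, because only files within one size group are read in parallel.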