Subject: Bug/feature request: better handling of hard links
Date: Thu, 28 Sep 2017 17:22:27 -0400 (EDT)
To: bug-Filesys-DiskUsage [...] rt.cpan.org
From: "Thomas M. Payerle" <payerle [...] umd.edu>
It appears that Filesys::DiskUsage (as of 0.04) does not
properly handle hard links in Unix file systems.
E.g., suppose rootdir has two subdirs, A and B; A contains
a 10 GB file bigfile.zip, and B contains a hard link to
bigfile.zip in A. In this case, Filesys::DiskUsage
will report both rootdir/A and rootdir/B
as being 10 GB (not unreasonable) and rootdir as 20 GB. But
rootdir is only consuming 10 GB of space (the files A/bigfile.zip
and B/bigfile.zip share the same data blocks on the disk, so
although there are two paths to the file, there is only one
file and only 10 GB of space is consumed).
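For concreteness, a minimal script showing the numbers reported
(the /tmp/rootdir layout below is just an illustration, not an
actual test case attached to this report):

    #!/usr/bin/perl
    # Reproduction sketch: rootdir/A/bigfile.zip and rootdir/B/bigfile.zip
    # are hard links to the same 10 GB file.
    use strict;
    use warnings;
    use Filesys::DiskUsage qw(du);

    my $rootdir = '/tmp/rootdir';   # hypothetical scratch directory

    printf "A:       %d bytes\n", du("$rootdir/A");   # 10 GB
    printf "B:       %d bytes\n", du("$rootdir/B");   # 10 GB
    printf "rootdir: %d bytes\n", du($rootdir);       # reported 20 GB, actual 10 GB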
The standard Unix du command (at least the GNU version distributed
with recent Linux distributions) correctly reports 10 GB for rootdir
(as well as for rootdir/A and rootdir/B).
I glanced briefly at the GNU code, and it looks like it records
device and inode numbers for every file it traverses, and uses
that information to avoid counting the space consumed by any file
more than once. I.e., du rootdir/A will give 10 GB (as A/bigfile.zip
is there and consumes 10 GB). Similarly, du rootdir/B gives 10 GB.
But du rootdir detects that A/bigfile.zip and B/bigfile.zip have the
same device and inode numbers and counts the file only once, so
du rootdir also gives only 10 GB.
While a similar strategy might be useful for Filesys::DiskUsage, I
also see it as potentially problematic (keeping that bookkeeping in
Perl rather than C is less memory-efficient, and could cause
excessive memory consumption on large filesystems).
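For concreteness, something along the following lines is roughly what
that device/inode bookkeeping could look like in Perl (the function
name du_dedup and the %seen hash are mine, not anything in the module):

    #!/usr/bin/perl
    # Sketch: remember (device, inode) pairs so a file reachable through
    # several hard links is only counted once, as GNU du does.
    use strict;
    use warnings;
    use File::Find;

    sub du_dedup {
        my @dirs = @_;
        my %seen;       # "device:inode" of files already counted
        my $total = 0;
        find(
            sub {
                my ($dev, $ino, $size) = (lstat($_))[0, 1, 7];
                return unless defined $dev;
                return unless -f _;                 # regular files only (directory
                                                    # sizes ignored here for brevity)
                return if $seen{"$dev:$ino"}++;     # already counted via another link
                $total += $size;
            },
            @dirs
        );
        return $total;
    }

    # du_dedup('rootdir') gives ~10 GB in the example above, matching
    # GNU du, at the cost of one hash entry per file visited.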
However, a more performance-friendly option would be to divide the
size of each regular file by its number of links. (Only regular files
should be treated this way; a directory normally has several links
regardless: one from its parent, one for its own '.' entry, and one
for the '..' entry of each subdirectory. A regular file, on the other
hand, only has a link count above one if additional hard links to it
exist.) This should be an option, e.g. 'divide-among-hardlinks'
(a rough sketch is included at the end of this message).
Such a change, for our previous example (and assuming A/bigfile.zip
only has 2 links), would result in rootdir reporting 10 GB, and
rootdir/A and rootdir/B each reporting 5 GB.
This also disagrees with the standard du command, in its
interpretation of the usage of rootdir/A and rootdir/B, but there is
merit to both interpretations (there are a number of questions on the
web basically wondering why
du rootdir != (du rootdir/A) + (du rootdir/B)
). There will always be some ambiguity when reporting the disk usage
of a tree which does NOT contain all of the hard links to all of the
files in the tree.
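In case it helps, here is a rough sketch of the proposed
'divide-among-hardlinks' behaviour (again, du_divided is just a name
I made up for illustration):

    #!/usr/bin/perl
    # Sketch: charge each regular file size/nlink, so its cost is split
    # evenly among all of its hard links.
    use strict;
    use warnings;
    use File::Find;

    sub du_divided {
        my @dirs = @_;
        my $total = 0;
        find(
            sub {
                my ($nlink, $size) = (lstat($_))[3, 7];
                return unless defined $nlink;
                return unless -f _;            # regular files only; directories
                                               # have nlink > 1 for other reasons
                $total += $size / $nlink;      # a file with 2 links adds half its size
            },
            @dirs
        );
        return $total;
    }

    # With the earlier example (bigfile.zip has 2 links):
    #   du_divided('rootdir/A') ->  5 GB
    #   du_divided('rootdir/B') ->  5 GB
    #   du_divided('rootdir')   -> 10 GB

This needs no per-file bookkeeping, which is what makes it the more
performance-friendly of the two approaches.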
Tom Payerle
DIT-ACIGS/Mid-Atlantic Crossroads payerle@umd.edu
5825 University Research Court (301) 405-6135
University of Maryland
College Park, MD 20740-3831