It is common to end up with multiple copies of the same file, each taking up its own disk space, all on the same disk partition. For instance I often end up in this situation by making multiple copies of photos from my camera's memory cards -- eg, one that is just a straight copy of the camera card to its own directory, and another created as a backup of the card on import into my photo management software. After ensuring that there are backups of the photos on other hard drives (and ideally in other locations) it can be very helpful to reclaim some of the disk space occupied by the multiple independent copies. But frequently I do not want to just delete one of the directory trees, especially if I have not finished sorting all the new photos. So a way of combining common files into a single set of disk storage is helpful.
There are many tools to try to identify duplicate files, including rdfind, duff, fdupes, and hardlink, as well as several others. It is also possible to use rsync --link-dest=... to solve this problem where the layout of the directories is also identical. All of these tools come with their own idea about how to solve the problem: some only want to list the duplicates, some want to remove one of the duplicate files, and some give the option to hardlink them. And they all make their own assumptions about what is important.
However one limitation they have in common is that checksumming the files -- by far the most expensive part of the task, especially on a slower (eg, USB2) external hard drive -- is an integral part of the tool's operation. Which means that if you want to do a "dry run" (list changes) and then a "live run" (actually combine files), you end up waiting for the checksum calculations to happen twice for no good reason. On one attempt with rdfind (which is proud of how much faster it is than the alternatives), after 30+ minutes of elapsed time, and 15 minutes of CPU time, I still did not have any results on the dry run over 10,000 files (of which about 9,000 are in pairs of common files). Obviously this is mostly due to the slow external hard drive. But the idea of waiting at least that long twice (once for the dry run, and once for the live run) was very unappealing.
To avoid calculating the checksums twice, I dusted off an approach I have used a few times before, including to recover from iBooks stealing all my eBooks. Use find and md5sum to calculate the checksums first and save those to a file, then use a separate script to analyse those checksum files and look for identical files -- and optionally hard link them together. It is a very naive approach -- assuming that any files with identical content are not hard linked yet, and should be hard linked now. And it ignores considerations of file owners/groups, file permissions, file timestamps, etc -- basically assuming that the files with the same content originated from exactly the same place, and were copied preserving the original timestamps and permissions, such as in my photo card backup example above.
The basic idea is to calculate the checksums, perhaps leaving it running overnight:
find /backup/photos/bycard -type f | xargs md5sum | tee /tmp/photos-by-card.md5sum
find /backup/photos/bydate -type f | xargs md5sum | tee /tmp/photos-by-date.md5sum
Then see if there are common files that would be linked:
cut -f 1 -d ' ' /tmp/photos-by-*.md5sum | sort | uniq -c | grep -v "^ *1 "
which identifies md5sums represented by more than one file. And also do a "dry run" of the naive linking script:
lnsame /tmp/photos-by-card.md5sum /tmp/photos-by-date.md5sum
Perhaps followed by spot checking by hand that some of the files really are identical -- with ls -il, diff, etc.
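For instance, picking a photo that appears in both trees (these paths are purely hypothetical), the two copies should have different inode numbers before linking but identical content:
ls -il /backup/photos/bycard/card1/IMG_1234.JPG /backup/photos/bydate/2019-06-02/IMG_1234.JPG
diff /backup/photos/bycard/card1/IMG_1234.JPG /backup/photos/bydate/2019-06-02/IMG_1234.JPG
diff producing no output confirms the contents match; after the live run, ls -il should show the same inode number (and a link count of 2) for both paths.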
When you are happy that the analysis makes sense, the live hardlinking can be done with:
GO=1 lnsame /tmp/photos-by-card.md5sum /tmp/photos-by-date.md5sum
ie, by setting the environment variable GO to tell it to really make the changes.
The lnsame script is a trivial perl script that simply parses the md5sum output into a hash arranged by md5sum, and then proposes to link any sets of files with the same md5sum together, picking an arbitrary one (first out of the hash) to be the inode (and hence file owner/group/permissions/timestamp) to survive. It is definitely not suited for all purposes. But where it is, the opportunity to only calculate the checksums once can be invaluable.
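For illustration, the core of such a script might look something like the following simplified sketch (this is not the actual lnsame script; it assumes GNU style md5sum text output, handles errors only with warnings, and follows the GO environment variable convention described above):
#!/usr/bin/perl
# Simplified sketch only -- not the actual lnsame script.
# Reads GNU md5sum (text mode) output from the files named on the
# command line, groups the paths by checksum, and hard links
# duplicates together when the GO environment variable is set.
use strict;
use warnings;

my %paths_by_sum;
while (my $line = <>) {
    chomp $line;
    # "CHECKSUM  PATH" -- 32 hex digits, two spaces, then the filename
    next unless $line =~ /^([0-9a-f]{32})  (.+)$/;
    push @{ $paths_by_sum{$1} }, $2;
}

my $go = $ENV{GO};    # dry run unless GO is set

for my $sum (sort keys %paths_by_sum) {
    my @paths = @{ $paths_by_sum{$sum} };
    next if @paths < 2;
    # Keep the first path as the surviving inode; relink the rest to it.
    my $keep = shift @paths;
    for my $dup (@paths) {
        print "ln -f '$keep' '$dup'\n";
        if ($go) {
            unlink $dup or warn "unlink $dup: $!";
            link $keep, $dup or warn "link $keep -> $dup: $!";
        }
    }
}
The dry run just prints the proposed ln commands; only with GO set does it actually unlink and relink the duplicate paths.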
(Note that at present the perl script only understands the GNU style md5sum output; it could fairly easily be extended to also understand the BSD style md5 output, and optionally other checksum formats.)
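For reference the two formats differ only in layout. With an illustrative checksum and a hypothetical photo.jpg, a GNU md5sum line looks like:
9e107d9d372bb6826bd81d3542a419d6  photo.jpg
whereas the BSD md5 equivalent is:
MD5 (photo.jpg) = 9e107d9d372bb6826bd81d3542a419d6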
As a bonus with this approach, one can repeat the find/xargs md5sum after the linking, in each file tree, and then compare those with the md5sum checksums from before the linking, to gain some reassurance that nothing went missing. Most of the other tools simply require either trusting that they worked properly, or a third checksum pass beforehand to have something to compare with.
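For instance (using a hypothetical "after" filename for the second pass), something like:
find /backup/photos/bycard -type f | xargs md5sum | sort > /tmp/photos-by-card-after.md5sum
sort /tmp/photos-by-card.md5sum | diff - /tmp/photos-by-card-after.md5sum
should print nothing if every file still has the same checksum at the same path; the same check can then be repeated for the other tree.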