It is common to end up with multiple copies of the same file, each taking up its own disk space, all on the same disk partition. For instance I often end up in this situation after making multiple copies of photos off my camera's memory cards -- eg, one that is just a straight copy of the camera card to its own directory, and another one as a result of a backup of the card on import into my photo management software. After ensuring that there are backups of the photos on other hard drives (and ideally in other locations) it can be very helpful to reclaim some of the disk space occupied by the multiple independent copies. But frequently I do not want to just delete one of the directory trees, especially if I have not finished sorting all the new photos. So a way of combining common files into a single set of disk storage is helpful.

There are many tools for identifying duplicate files, including rdfind, duff, fdupes, and hardlink, among several others. It is also possible to use rsync --link-dest=... to solve this problem where the layout of the two directory trees is identical. All of these tools come with their own idea about how to solve the problem: some only want to list the duplicates, some want to remove one of the duplicate files, and some give the option to hardlink them. And they all make their own assumptions about what is important.
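For the rsync case, one way to do it is to re-copy one tree against the other, so that files rsync considers unchanged are hardlinked to the --link-dest copy rather than copied again. Something along these lines, using the same example trees as below (the bydate-deduped destination name is purely illustrative):

rsync -a --link-dest=/backup/photos/bycard/ /backup/photos/bydate/ /backup/photos/bydate-deduped/

after which the new tree could replace the original bydate tree. Note that rsync's default quick check compares only size and timestamp, not file content; adding -c makes it checksum everything, at the cost of reading every file.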

However one limitation they have in common is that checksumming the files -- by far the most expensive part of the task, especially on a slower (eg, USB2) external hard drive -- is an integral part of the tool's operation. This means that if you want to do a "dry run" (list changes) and then a "live run" (actually combine files), you end up waiting for the checksum calculations to happen twice for no good reason. On one attempt with rdfind (which is proud of how much faster it is than the alternatives), after 30+ minutes of elapsed time and 15 minutes of CPU time, I still did not have any results from the dry run over 10,000 files (of which about 9,000 were in pairs of common files). Obviously this is mostly due to the slow external hard drive. But the idea of waiting at least that long twice (once for the dry run, and once for the live run) was very unappealing.

To avoid calculating the checksums twice, I dusted off an approach I have used a few times before, including to recover from iBooks stealing all my eBooks: use find and md5sum to calculate the checksums first and save those to a file, then use a separate script to analyse those checksum files looking for identical files -- and optionally hard link them together. It is a very naive approach -- assuming any identical file is both not hard linked yet and should be hard linked now. And it ignores considerations of file owners/groups, file permissions, file timestamps, etc -- basically assuming that the files with the same content originated from exactly the same place, and were copied preserving the original timestamps and permissions, such as in my photo card backup example above.

The basic idea is to calculate the checksums, perhaps leaving it running overnight:

find /backup/photos/bycard -type f -print0 | xargs -0 md5sum | tee /tmp/photos-by-card.md5sum
find /backup/photos/bydate -type f -print0 | xargs -0 md5sum | tee /tmp/photos-by-date.md5sum

Then see if there are common files that would be linked:

cut -f 1 -d ' ' /tmp/photos-by-*.md5sum | sort | uniq -c | grep -v "^ *1 "

which identifies md5sums represented by more than one file. Then do a "dry run" of the naive linking script:

lnsame /tmp/photos-by-card.md5sum /tmp/photos-by-date.md5sum

Perhaps followed by spot checking by hand that some of the files really are identical -- with ls -il, diff, etc.
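For example (these particular filenames are purely illustrative):

ls -il /backup/photos/bycard/100CANON/IMG_1234.JPG /backup/photos/bydate/2016-01-10/IMG_1234.JPG
diff /backup/photos/bycard/100CANON/IMG_1234.JPG /backup/photos/bydate/2016-01-10/IMG_1234.JPG

ls -il shows the inode number in the first column, so before linking the two copies should have different inodes (and after a live run they should share one, with a link count of 2), while diff staying silent confirms the contents really are identical.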

When you are happy that the analysis makes sense, the live hardlinking can be done with:

GO=1 lnsame /tmp/photos-by-card.md5sum /tmp/photos-by-date.md5sum

ie, setting the environment variable GO tells it to really make the changes.

The lnsame script is a trivial Perl script that simply parses the md5sum output into a hash arranged by md5sum, and then proposes to link together any sets of files with the same md5sum, picking an arbitrary one (the first out of the hash) to be the inode (and hence file owner/group/permissions/timestamp) to survive. It is definitely not suited for all purposes. But where it is, the opportunity to calculate the checksums only once can be invaluable.

(Note that at present the Perl script only understands the GNU style md5sum output; it could fairly easily be extended to also understand the BSD style md5 output, and optionally other checksum formats.)
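For illustration, a minimal sketch of that kind of script -- not the actual lnsame, and with no error handling beyond warnings -- might look something like this, assuming GNU style md5sum output and the GO convention described above:

#!/usr/bin/perl
# Naively hardlink together files whose md5sums match (sketch only).
# Input: one or more GNU md5sum output files ("MD5SUM  FILENAME").
# Dry run by default; set the GO environment variable to really link.
use strict;
use warnings;

my %files_by_sum;

# Collect filenames grouped by md5sum from all input files.
while (my $line = <>) {
    chomp $line;
    next unless $line =~ /^([0-9a-f]{32}) [ *](.+)$/;
    push @{ $files_by_sum{$1} }, $2;
}

for my $sum (sort keys %files_by_sum) {
    my ($keep, @dups) = @{ $files_by_sum{$sum} };
    for my $dup (@dups) {
        print "ln $keep $dup\n";
        next unless $ENV{GO};
        # Naively replace the duplicate with a hardlink to the kept file.
        unlink $dup        or warn "unlink $dup: $!\n";
        link $keep, $dup   or warn "link $keep -> $dup: $!\n";
    }
}

A more careful version would link to a temporary name and rename it over the duplicate, to avoid losing the file if interrupted between the unlink and the link.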

As a bonus with this approach, one can repeat the find/xargs md5sum in each file tree after the linking, and then compare those checksums with the ones taken before linking, to gain some reassurance that nothing went missing. Most of the other tools simply require either trusting that they worked properly, or a third checksum pass beforehand to have something to compare with.
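For example (assuming GNU md5sum, and keeping the original checksum files around):

find /backup/photos/bycard -type f -print0 | xargs -0 md5sum | sort > /tmp/photos-by-card.after.md5sum
sort /tmp/photos-by-card.md5sum | diff - /tmp/photos-by-card.after.md5sum

should produce no output if every file still has the content it had before linking; md5sum -c /tmp/photos-by-card.md5sum is a simpler alternative that re-verifies each listed file in place.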