Wednesday, 26 December 2012

Deleting Duplicate Files in Backups

rsync-backup (and other variants) can reduce backup size by hard-linking unchanged files to previous backups, although this can make the backup process slower and requires some forethought.
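For reference, plain rsync can do this at backup time with its --link-dest option; a minimal sketch, with hypothetical backup paths:

```shell
# Hypothetical paths. --link-dest hard-links files that are unchanged
# since the previous backup, so only new or modified files consume
# additional disk space.
rsync -a --link-dest=/backups/2012-12-25 /data/ /backups/2012-12-26/
```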

This script allows post-backup size reduction by detecting duplicate files in backups and making them share the same hardlink target, saving disk space.

We keep a placeholder link for each hash in an MD5 folder. If a file's MD5 already exists as a name in the MD5 folder, the file is forced to be a hard link to that placeholder. Otherwise we link the file into the MD5 folder to create the placeholder.

mkdir -p MD5
find . -path ./MD5 -prune -o -type f -print | xargs md5sum | while read md5 file
do if test -f "MD5/$md5"
   then ln -f "MD5/$md5" "$file"
   else ln "$file" "MD5/$md5"
        echo "$md5 $file"
   fi
done

Because we keep an MD5 directory, we don't need the MD5 list to be sorted, and we can re-run the script later on a smaller subset of the disk without needing to refer to the full MD5 list.

We can also easily examine the MD5 directory to see how many copies of a specific file exist and how much disk space is saved.
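The link count on each placeholder tells the story: N links means N-1 copies exist in the tree. A rough report along those lines (a sketch, assuming GNU stat and the MD5 folder layout above):

```shell
# A placeholder with N hard links means N-1 copies exist in the tree;
# without deduplication each extra copy would occupy its own space.
# (GNU stat assumed; on BSD/macOS use "stat -f %l" and "stat -f %z".)
for f in MD5/*
do links=$(stat -c %h "$f")
   size=$(stat -c %s "$f")
   echo "$((links - 1)) copies, $(( (links - 2) * size )) bytes saved: $f"
done
```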

Any files in the MD5 directory with only one link have been deleted from the normal file-system tree and could also be deleted... but serve as a backup-backup!
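Those orphaned placeholders are easy to spot with find's -links test; a sketch (the deletion line is left commented out for safety):

```shell
# Placeholders with a link count of 1 no longer exist anywhere else
# in the backed-up tree.
find MD5 -type f -links 1
# Uncomment to reclaim the space instead of keeping the backup-backup:
# find MD5 -type f -links 1 -delete
```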


  1. Don't let it scan the MD5 folder though: "ln -f source target" (when the source is the target) will delete both the source and the target ;-)

    maybe find should be: find *
    and the MD5 folder should be: .MD5

  2. For this I use DuplicateFilesDeleter, an easy fix for duplicates.

    1. Is that a windows program?

      You'll note that I don't delete the duplicates but hard-link them.

  3. I inserted "-path ./MD5 -prune -o " to prevent descending into the MD5 dir (but I haven't tested it)
