Wednesday 26 December 2012

Deleting Duplicate Files in Backups

rsync-backup (and other variants) can reduce backup size by referring to previous backups, although this can make the backup process slower and requires some forethought.
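For reference, the usual rsync idiom for that approach looks something like this (the snapshot paths here are only examples):

# hard-link unchanged files against the previous snapshot (paths are examples)
rsync -a --link-dest=../2012-12-25 /home/ BACKUP-PATH/2012-12-26/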

The script below reduces size after the backup instead: it detects duplicate files across backups and makes them share the same hard-link target, saving disk space.

We keep a placeholder link for each checksum in an MD5 folder. If a file's MD5 entry already exists in that folder, the file is forced to become a hard link to it; otherwise we hard-link the file into the folder to create that entry.

cd BACKUP-PATH
mkdir -p MD5
# -prune keeps find out of the MD5 dir; the explicit -print0 stops find
# printing the pruned directory itself, and -print0/-0 keep odd filenames intact
find . -path ./MD5 -prune -o -type f -print0 | xargs -0 md5sum | while read -r md5 file
do if test -f "MD5/$md5"
   then ln -f "MD5/$md5" "$file"   # duplicate: replace it with a link to the MD5 entry
   else ln "$file" "MD5/$md5"      # first copy seen: create the MD5 entry
   fi
   echo "$md5 $file"
done

Because we keep an MD5 directory, the md5 list doesn't need to be sorted, and we can re-run the script later on a smaller subset of the disk without needing to refer to the full MD5 list.
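For example, a later run restricted to a single snapshot directory (the directory name below is just an illustration) still reuses the existing MD5 entries and adds any new ones:

cd BACKUP-PATH
# deduplicate just one snapshot against the shared MD5 folder
find ./2012-12-26 -type f -print0 | xargs -0 md5sum | while read -r md5 file
do if test -f "MD5/$md5"
   then ln -f "MD5/$md5" "$file"
   else ln "$file" "MD5/$md5"
   fi
done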

We can also easily examine the MD5 directory to see how many copies of a specific file exist and how much disk space is saved.
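With GNU find, for instance, the link count and size of each entry show where the savings are. An entry with n links stands for n-1 copies in the backup tree, so anything with more than 2 links has actually saved space:

# link count, size and name of every entry that merged at least two copies
find MD5 -type f -links +2 -printf '%n %s %p\n' | sort -n | tail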

Any files in the MD5 directory with only 1 link have been deleted from the normal file system tree and could also be deleted... but they serve as a backup-backup!
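If you do want that space back, GNU find can list the orphaned entries, and the commented-out -delete variant would remove them:

# entries whose only remaining link is the MD5 placeholder itself
find MD5 -type f -links 1
# ...and to actually reclaim the space (sacrificing the backup-backup):
# find MD5 -type f -links 1 -delete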

4 comments:

  1. Don't let it scan the MD5 folder though; "ln -f source target" (when the source is the same file as the target) will delete the source - and the target ;-)

    maybe find should be: find *
    and the MD5 folder should be: .MD5

  2. For this I use DuplicateFilesDeleter, an easy fix for duplicates.

    Replies
    1. Is that a Windows program?

      You'll note that I don't delete the duplicates but hard-link them

  3. I inserted "-path ./MD5 -prune -o " to prevent descending into the MD5 dir (but I haven't tested it)
