Wednesday, 26 December 2012

Deleting Duplicate Files in Backups

rsync-backup (and other variants) can reduce backup size by referring to previous backups, although this makes the backup process slower and requires some forethought.

This script allows post-backup size reduction by detecting duplicate files in backups and making them share the same hardlink target, saving disk space.

We keep a placeholder link in an MD5 folder, named after each file's MD5 hash. Each file whose hash already exists as a name in the MD5 folder is forced to be a hard link to that placeholder. Otherwise we link the file into the MD5 folder to create the placeholder.

mkdir -p MD5
find . -path ./MD5 -prune -o -type f -print | xargs md5sum | while read md5 file
do if test -f "MD5/$md5"
   then ln -f "MD5/$md5" "$file"
   else ln "$file" "MD5/$md5"
        echo "$md5 $file"
   fi
done

Because we keep an MD5 directory we don't need the list of MD5 sums to be sorted, and we can later re-run the script on a smaller subset of the disk without needing to refer to a full MD5 list.

We can also easily examine the MD5 directory to see how many copies of a specific file exist and how much disk space is saved.
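As a sketch of that inspection (assuming GNU coreutils `stat` and the MD5 directory built by the script above), the hard-link count of each placeholder tells you how many copies exist, and the size times the extra links gives a rough saving:

```shell
#!/bin/sh
# Rough report of copies and space saved, per MD5 placeholder.
# Assumes GNU stat and an MD5 directory as built by the script above.
for f in MD5/*; do
  [ -f "$f" ] || continue
  links=$(stat -c '%h' "$f")        # link count: copies in the tree + the placeholder
  size=$(stat -c '%s' "$f")         # file size in bytes
  copies=$((links - 1))             # copies in the backup tree itself
  saved=$(( (copies - 1) * size ))  # every duplicate beyond the first costs nothing
  echo "$copies copies, $saved bytes saved: $f"
done
```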

Any files in the MD5 directory with only one link have been deleted from the normal file system tree and could also be deleted... but they serve as a backup of the backup!
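A minimal sketch of that cleanup, assuming GNU find's `-links` and `-delete`; dry-run with `-print` before deleting anything:

```shell
#!/bin/sh
# List MD5 placeholders whose only remaining link is the placeholder itself,
# i.e. every copy has been deleted from the backup tree.
find MD5 -type f -links 1 -print
# Once happy with the list, delete them for real:
# find MD5 -type f -links 1 -delete
```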


  1. Don't let it scan the MD5 folder though: "ln -f source target" (when the source is the target) will delete the source - and the target ;-)

    maybe find should be: find *
    and the MD5 folder should be: .MD5

  2. For this I use DuplicateFilesDeleter, an easy fix for duplicates.

    1. Is that a windows program?

      You'll note that I don't delete the duplicates but hard-link them.

  3. I inserted "-path ./MD5 -prune -o " to prevent descending into the MD5 dir (but I haven't tested it)
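A quick way to test that prune (a sketch using a throwaway directory): note that `-print` after `-type f` is also needed, or find's default print will emit the pruned ./MD5 directory itself. Files under MD5 should not appear in the output:

```shell
#!/bin/sh
# Sanity check: -prune should stop find descending into ./MD5.
mkdir -p MD5
touch MD5/cafebabe plainfile
find . -path ./MD5 -prune -o -type f -print
# Expect ./plainfile in the output, and nothing under ./MD5
```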