This script shrinks backups after the fact: it detects duplicate files and makes them share the same hard-link target, saving disk space.
We keep a placeholder link in an MD5 folder, named after the file's MD5 sum. Each file whose MD5 entry already exists in the MD5 folder is replaced by a hard link to that entry. Otherwise we link the file into the MD5 folder to create the entry.
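cd BACKUP-PATH
mkdir -p MD5
# Skip the MD5 directory itself; -print0 / -0 keep file names with spaces intact
find . -path ./MD5 -prune -o -type f -print0 | xargs -0 md5sum |
while read -r md5 file
do
    if test -f "MD5/$md5"
    then
        ln -f "MD5/$md5" "$file"    # duplicate: relink to the stored copy
    else
        ln "$file" "MD5/$md5"       # first occurrence: register it
    fi
    echo "$md5 $file"
done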
Because we keep an MD5 directory, we don't need the MD5 list to be sorted, and we can re-run the script later on a smaller subset of the disk without referring to the full MD5 list.
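As a rough illustration of such a partial re-run, here is a minimal sketch that wraps the loop in a function. The name `dedupe`, its two arguments, and the `monday`/`tuesday` sample tree are made up for this example, and it assumes GNU `md5sum`, `find` and `xargs`.

```shell
# Hypothetical wrapper: $1 = backup root, $2 = path to scan, given as "." or "./subdir"
dedupe() {
    (
        cd "$1" || exit 1
        mkdir -p MD5
        find "$2" -path ./MD5 -prune -o -type f -print0 | xargs -0 md5sum |
        while read -r md5 file; do
            if test -f "MD5/$md5"; then
                ln -f "MD5/$md5" "$file"    # content already known: relink
            else
                ln "$file" "MD5/$md5"       # new content: register it
            fi
        done
    )
}

# Throwaway sample data (not real backups)
root=$(mktemp -d)
mkdir "$root/monday" "$root/tuesday"
printf 'same data\n' > "$root/monday/report.txt"
printf 'same data\n' > "$root/tuesday/report.txt"

dedupe "$root" ./monday     # first pass registers monday's file in MD5
dedupe "$root" ./tuesday    # later pass scans only tuesday, yet finds the duplicate
```

The second call never reads `monday` again; the entry left in the MD5 directory is enough to match the duplicate.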
We can also easily examine the MD5 directory to see how many copies of a specific file exist and how much disk space is saved.
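One possible way to do that inspection is sketched below. It assumes GNU `find` (for `-printf`), and the sample files and the entry name `demo-hash` are stand-ins, not real MD5 sums.

```shell
# Throwaway sample: one 6-byte file with three names in the tree plus its MD5 entry
demo=$(mktemp -d)
mkdir "$demo/MD5"
printf 'hello\n' > "$demo/a.txt"
ln "$demo/a.txt" "$demo/MD5/demo-hash"
ln "$demo/MD5/demo-hash" "$demo/b.txt"
ln "$demo/MD5/demo-hash" "$demo/c.txt"

# %n = hard-link count, %s = size in bytes, %f = entry name
find "$demo/MD5" -type f -printf '%n %s %f\n'

# An entry with n links has n-1 names in the tree; without hard links those
# would be n-1 separate copies, so (n-2) * size bytes are saved per entry.
saved=$(find "$demo/MD5" -type f -printf '%n %s\n' |
        awk '$1 > 2 { total += ($1 - 2) * $2 } END { print total + 0 }')
echo "bytes saved: $saved"
```

The link count minus one is the number of copies in the tree, since the MD5 entry itself accounts for one link.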
Any files in the MD5 directory with only 1 link have been deleted from the normal file system tree and could also be deleted... but they serve as a backup-backup!
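A small sketch of how such single-link entries could be listed, using the standard `find -links` test. The sample files and entry names are invented for the demonstration.

```shell
# Throwaway sample: two registered files, one of which is later removed from the tree
demo=$(mktemp -d)
mkdir "$demo/MD5"
printf 'kept\n' > "$demo/kept.txt"
printf 'gone\n' > "$demo/gone.txt"
ln "$demo/kept.txt" "$demo/MD5/hash-kept"
ln "$demo/gone.txt" "$demo/MD5/hash-gone"
rm "$demo/gone.txt"                     # simulate a file deleted from the backup tree

# Entries with exactly one link are no longer referenced from the tree
orphans=$(find "$demo/MD5" -type f -links 1)
echo "$orphans"

# To reclaim the space instead of keeping the backup-backup:
#   find "$demo/MD5" -type f -links 1 -exec rm {} +
```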
Don't let it scan the MD5 folder, though: "ln -f source target" (when the source is the target) will delete the source, and with it the target ;-)
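As an extra safety net, the link step could check whether both names already refer to the same file before forcing the link. This is only a sketch: `relink` is a hypothetical helper, and it relies on the `-ef` test (same device and inode), which bash, dash and ksh provide.

```shell
# Hypothetical helper: force-link $2 to $1 unless they are already the same file
relink() {
    if [ "$1" -ef "$2" ]; then
        return 0                # same inode: nothing to do, and nothing to lose
    fi
    ln -f "$1" "$2"
}

# Throwaway sample data
demo=$(mktemp -d)
mkdir "$demo/MD5"
printf 'data\n' > "$demo/MD5/h"
printf 'data\n' > "$demo/copy.txt"

relink "$demo/MD5/h" "$demo/MD5/h"      # source is the target: skipped
relink "$demo/MD5/h" "$demo/copy.txt"   # real duplicate: replaced by a hard link
```

The same check also skips files that an earlier run already linked, so a re-run does less work.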
Maybe find should be: find *
and the MD5 folder should be: .MD5
For this I use DuplicateFilesDeleter, an easy fix for duplicates.
Is that a Windows program?
You'll note that I don't delete the duplicates but hard-link them.
I inserted "-path ./MD5 -prune -o " to prevent descending into the MD5 dir (but have not tested it).