This script reduces the size of a backup after the fact: it detects duplicate files and replaces them with hard links to a single copy, saving disk space.
We keep a placeholder link in an MD5 folder, named after the file's MD5 checksum. Each file whose checksum already has an entry in the MD5 folder is replaced by a hard link to that entry. Otherwise, we hard-link the file into the MD5 folder to create the entry.
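    cd BACKUP-PATH
    mkdir -p MD5
    # Hash every regular file outside MD5. -print keeps the pruned MD5
    # directory itself out of the list; NUL separators survive spaces in names.
    find . -path ./MD5 -prune -o -type f -print0 | xargs -0 md5sum |
    while read -r md5 file
    do
        if test -f "MD5/$md5"
        then
            # Seen before: replace the file with a hard link to the MD5 entry.
            ln -f "MD5/$md5" "$file"
        else
            # First occurrence: create the MD5 entry as a link to the file.
            ln "$file" "MD5/$md5"
        fi
        echo "$md5 $file"
    done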
Because we keep an MD5 directory, we don't need the MD5 list to be sorted, and we can re-run the script later on a smaller subset of the disk without referring to the full MD5 list.
We can also easily examine the MD5 directory to see how many copies of a specific file exist and how much disk space is saved.
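One way to do that examination is sketched below. It is only an illustration: the function name md5_report is made up here, it relies on GNU find's -printf, and it assumes every link other than the MD5 entry itself lives in the backup tree.

```shell
# Hypothetical helper: for each MD5 entry, print how many copies exist in the
# backup tree and how many bytes the hard links save. Requires GNU find.
md5_report() {
    md5dir=${1:-MD5}
    # %n = hard link count, %s = size in bytes, %f = file name (the checksum)
    find "$md5dir" -type f -printf '%n %s %f\n' |
    awk '{
        copies = $1 - 1                          # one link is the MD5 entry itself
        saved = (copies > 1) ? (copies - 1) * $2 : 0
        total += saved
        printf "%s copies=%d saved=%d\n", $3, copies, saved
    }
    END { printf "total saved=%d bytes\n", total }'
}
```

Run it as md5_report MD5 from BACKUP-PATH. The saved figure counts only the space the duplicates would have used beyond the single stored copy.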
Any file in the MD5 directory with only 1 link has been deleted from the normal file system tree and could be deleted too... but it serves as a backup of the backup!
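Entries with a single link can be listed with find's -links test. The sketch below only lists them; the function name md5_orphans is made up for illustration.

```shell
# Hypothetical helper: list MD5 entries whose link count is 1, meaning no file
# in the backup tree shares their inode any more.
md5_orphans() {
    find "${1:-MD5}" -type f -links 1
}
```

Run it as md5_orphans MD5 from BACKUP-PATH. If you decide the extra copies are not worth keeping, the listed files can then be removed by hand.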