ttkb-oss/dedup: dedup finds and clones duplicate files

github.com/ttkb-oss/dedup

dedup finds files with identical content among the provided file arguments. Each duplicate is replaced with a clone of one of the others (using clonefile(2)). If no file is specified, the current directory is used.

Cloned files share data blocks with the file they were cloned from, saving space on disk. Unlike a hardlinked file, any future modification to either the clone or the original file will remain private to that file (copy-on-write).
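
To make the clone step concrete, here is a minimal sketch of replacing a duplicate with a clone via clonefile(2). The function name and temporary-name convention are illustrative, not dedup's actual implementation; cloning to a temporary name and renaming over the duplicate keeps the replacement atomic.

    #include <limits.h>
    #include <stdio.h>
    #include <sys/clonefile.h>
    #include <unistd.h>

    /* Illustrative only: replace duplicate `dup` with a copy-on-write
     * clone of `src`. Clone to a temporary name in the same directory,
     * then rename(2) over the duplicate so the swap is atomic and the
     * original survives a failed clone. */
    static int replace_with_clone(const char *src, const char *dup) {
        char tmp[PATH_MAX];
        snprintf(tmp, sizeof(tmp), "%s.dedup-tmp", dup);

        if (clonefile(src, tmp, CLONE_NOFOLLOW) != 0)
            return -1;
        if (rename(tmp, dup) != 0) {
            unlink(tmp); /* clean up the temporary clone */
            return -1;
        }
        return 0;
    }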

dedup works in two phases. First it evaluates all of the provided paths recursively, looking for duplicates. Once all duplicates are found, any files that are not already clones of the "best" clone source are replaced with clones.
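
A sketch of what the first phase could look like, assuming candidate duplicates are first bucketed by size before their contents are compared (the bucketing strategy is an assumption; this sketch just prints size and path for each regular file):

    #include <fts.h>
    #include <stdio.h>
    #include <sys/stat.h>

    /* Walk the given paths recursively with fts(3) and report each
     * regular file with its size. Files sharing a size are candidate
     * duplicates; content would be compared before any cloning. */
    static void scan(char *const paths[]) {
        FTS *fts = fts_open(paths, FTS_PHYSICAL | FTS_NOCHDIR, NULL);
        if (fts == NULL)
            return;

        FTSENT *ent;
        while ((ent = fts_read(fts)) != NULL) {
            if (ent->fts_info != FTS_F)
                continue; /* regular files only */
            printf("%lld\t%s\n",
                   (long long)ent->fts_statp->st_size, ent->fts_path);
        }
        fts_close(fts);
    }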

There are limits on which files can be cloned (a minimal check is sketched after this list):

- the file must be a regular file
- the file must have only one link
- the file and its directory must be writable by the user
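
A minimal version of these checks, assuming lstat(2)/access(2) semantics (the function name is illustrative, not dedup's code):

    #include <libgen.h>
    #include <limits.h>
    #include <stdbool.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Illustrative eligibility test: regular file, exactly one link,
     * and both the file and its directory writable by the user. */
    static bool can_replace(const char *path) {
        struct stat st;
        if (lstat(path, &st) != 0 || !S_ISREG(st.st_mode))
            return false;
        if (st.st_nlink != 1)
            return false; /* never touch files with extra hard links */
        if (access(path, W_OK) != 0)
            return false;

        char dir[PATH_MAX];
        strlcpy(dir, path, sizeof(dir)); /* dirname(3) may modify its argument */
        return access(dirname(dir), W_OK) == 0;
    }
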
The "best" source is chosen by first finding the file with the most hard links. Files with multiple hard links will not be replaced, so using them as the source of other clones allows their blocks to be shared without modifying the data to which they point. If all files have a single link, a file which shares the most clones with others is chosen. This ensures that files which have been previously processed will not need to be replaced during subsequent evaluations of the same directory. If none of the files have multiple links or clones, the first file encountered will be chosen.

Files with multiple hard links are not replaced because it is not possible to guarantee all other links to that inode exist within the tree(s) being evaluated. Replacing a link with a clone changes the semantics from two links pointing at the same mutable, shared storage to two links pointing at the same copy-on-write storage. For scenarios where hard links were previously being used because clones were not available, future versions may provide a flag to destructively replace hard links with clones. Future versions may also consider cloning files with multiple hard links if all links are within the space being evaluated and two or more hard link clusters reference duplicated data.
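
The semantic difference is easy to demonstrate on an APFS volume (the paths and contents here are arbitrary):

    #include <fcntl.h>
    #include <sys/clonefile.h>
    #include <unistd.h>

    /* Demonstration: a hard link shares mutable storage with the
     * original, while a clone is copy-on-write. */
    int main(void) {
        int fd = open("a", O_CREAT | O_TRUNC | O_WRONLY, 0644);
        write(fd, "old", 3);
        close(fd);

        link("a", "b");         /* hard link: same inode as "a" */
        clonefile("a", "c", 0); /* clone: new inode, shared blocks */

        fd = open("b", O_TRUNC | O_WRONLY);
        write(fd, "new", 3);    /* visible through "a" as well */
        close(fd);

        fd = open("c", O_TRUNC | O_WRONLY);
        write(fd, "xyz", 3);    /* private to "c"; "a" still reads "new" */
        close(fd);
        return 0;
    }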

If all files in a matched set are compressed with HFS transparent compression, none of the files will be deduplicated. Future versions of dedup may select one file from the set to decompress in place and then use that file as a clone source.
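
Transparent compression is visible in st_flags, so a check along these lines would identify such files (a sketch, not dedup's code):

    #include <stdbool.h>
    #include <sys/stat.h>

    /* A file stored with HFS transparent compression has the
     * UF_COMPRESSED flag set in st_flags (macOS). */
    static bool is_transparently_compressed(const char *path) {
        struct stat st;
        return lstat(path, &st) == 0 && (st.st_flags & UF_COMPRESSED) != 0;
    }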

dedup will only work on volumes that have the VOL_CAP_INT_CLONE capability. Currently that is limited to APFS.
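
The capability can be queried with getattrlist(2); a minimal sketch of such a check (the function name and buffer layout are illustrative):

    #include <stdbool.h>
    #include <string.h>
    #include <sys/attr.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Ask the volume containing `path` whether it advertises
     * VOL_CAP_INT_CLONE, i.e. whether clonefile(2) can work there. */
    static bool volume_supports_clone(const char *path) {
        struct attrlist al;
        memset(&al, 0, sizeof(al));
        al.bitmapcount = ATTR_BIT_MAP_COUNT;
        al.volattr = ATTR_VOL_INFO | ATTR_VOL_CAPABILITIES;

        struct {
            u_int32_t length;
            vol_capabilities_attr_t caps;
        } buf;

        if (getattrlist(path, &al, &buf, sizeof(buf), 0) != 0)
            return false;

        return (buf.caps.valid[VOL_CAPABILITIES_INTERFACES] & VOL_CAP_INT_CLONE) &&
               (buf.caps.capabilities[VOL_CAPABILITIES_INTERFACES] & VOL_CAP_INT_CLONE);
    }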

While dedup is primarily intended to save storage by using clones, it also provides -l and -s flags to replace duplicates with hard links or symbolic links, respectively. Care should be taken when using these options, however. Unlike clones, the replaced files share the metadata of one of the matched files, and it may not be obvious which one. If these options are used in automation where all files have default ownership and permissions, there should be little issue. The created files are also not copy-on-write: any modification will be visible through every link. These options should only be used if the consequences of each choice are understood.
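
For reference, a minimal sketch of the hard link case (-l), under the same assumed temporary-name convention as the clone sketch above. The rename(2) trick again keeps the swap atomic, but the result is a shared inode rather than copy-on-write storage:

    #include <limits.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Illustrative only: replace duplicate `dup` with a hard link to
     * `src`. Both names then share one inode, so metadata and any
     * future writes are common to both. */
    static int replace_with_hardlink(const char *src, const char *dup) {
        char tmp[PATH_MAX];
        snprintf(tmp, sizeof(tmp), "%s.dedup-tmp", dup);

        if (link(src, tmp) != 0)
            return -1;
        if (rename(tmp, dup) != 0) {
            unlink(tmp); /* clean up the temporary link */
            return -1;
        }
        return 0;
    }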