Recently I have been studying incremental file synchronization, and in particular the differential encoding of files, because that is really the core of file synchronization. On Linux, rsync is the most widely used algorithm, but it has a defect of its own: when the two files are completely unrelated, the efficiency of its differential encoding is very low, almost unacceptably so!
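To make the rsync idea concrete, here is a minimal sketch of its block-matching scheme: the old file is split into fixed-size blocks indexed by a weak checksum, and the new file is scanned position by position for matching blocks. The names (`weak_sum`, `delta`, `apply_delta`), the tiny block size, and the simplified checksum are all mine for illustration; real rsync uses a rolling checksum updated in O(1) as the window slides, plus a strong per-block hash to confirm matches.

```python
BLOCK = 4  # toy block size for the demo; real rsync blocks are hundreds of bytes

def weak_sum(block: bytes) -> int:
    # Adler-style weak checksum (simplified; recomputed here rather than rolled)
    a = sum(block) & 0xFFFF
    b = sum((len(block) - i) * c for i, c in enumerate(block)) & 0xFFFF
    return (b << 16) | a

def delta(old: bytes, new: bytes):
    """Encode new as a list of ops: ('copy', offset_in_old) or ('literal', byte)."""
    table = {}
    for i in range(0, len(old) - BLOCK + 1, BLOCK):
        table.setdefault(weak_sum(old[i:i + BLOCK]), []).append(i)
    ops, pos = [], 0
    while pos <= len(new) - BLOCK:
        window = new[pos:pos + BLOCK]
        match = None
        for i in table.get(weak_sum(window), []):
            if old[i:i + BLOCK] == window:  # strong byte-level confirmation
                match = i
                break
        if match is not None:
            ops.append(('copy', match))
            pos += BLOCK
        else:
            ops.append(('literal', new[pos]))
            pos += 1
    for b in new[pos:]:                     # trailing bytes shorter than a block
        ops.append(('literal', b))
    return ops

def apply_delta(old: bytes, ops) -> bytes:
    """Reconstruct the new file from the old file plus the delta."""
    out = bytearray()
    for op, v in ops:
        if op == 'copy':
            out += old[v:v + BLOCK]
        else:
            out.append(v)
    return bytes(out)
```

Because rsync slides the window one byte at a time over the new file, it can find matches at any offset, which is why it tends to recover more duplicate data than chunk-aligned schemes.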
With this problem in mind, I studied the content-defined chunking (CDC) algorithm and found that CDC solves exactly this problem: when the differences between the two files are very large, CDC is still very efficient. I tried differential encoding on two completely different installation packages, each about 180 MB. The rsync algorithm took far longer, while the CDC algorithm took only about 4 s! However, CDC has its own problem: when the files differ little, rsync and CDC take about the same time, but rsync finds more duplicate chunks, roughly 10% more than CDC.
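For reference, here is a minimal sketch of content-defined chunking using a Gear-style rolling hash (similar in spirit to what FastCDC uses); the random table, mask, and minimum chunk size are toy values of my own choosing. The key property: a chunk boundary depends only on the last few bytes of content, so inserting data near the front of a file only disturbs the first chunk, and the remaining boundaries re-synchronize. This is why CDC still finds duplicates after data has shifted.

```python
import random

random.seed(42)
GEAR = [random.getrandbits(32) for _ in range(256)]  # per-byte random table
MASK = 0x3F        # boundary when low 6 bits are zero: ~64-byte average chunks
MIN_CHUNK = 16     # toy minimum chunk size to avoid degenerate tiny chunks

def cdc_chunks(data: bytes):
    """Split data at content-defined boundaries using a Gear rolling hash."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if i - start + 1 >= MIN_CHUNK and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk with no boundary
    return chunks
```

To deduplicate, each side hashes its chunks and exchanges the hash sets; only chunks missing on the other side need to be transferred. The trade-off versus rsync is that matches can only happen at chunk granularity, which is roughly where the ~10% gap I measured comes from.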
For the differential encoding of large files, I think the two can be combined. First run the CDC algorithm. If it finds many identical chunks (you can judge this by the proportion of duplicate-chunk bytes to the total file size), then also run the rsync algorithm to extract even more identical chunks; if there are few identical parts (perhaps the two files are completely unrelated), skip the rsync pass entirely. In this way we avoid the very low efficiency of differentially encoding two completely unrelated files, while still extracting as many identical parts as possible when the files differ only slightly!
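The combined strategy above could be sketched as a simple decision step: compute the fraction of the new file's bytes that are covered by chunks also present in the old file, and only invoke the finer-grained rsync pass when that ratio is high enough. The function names and the 0.5 threshold here are hypothetical, not from any real tool.

```python
import hashlib

def duplicate_ratio(old_chunks, new_chunks) -> float:
    """Fraction of new-file bytes covered by chunks also present in the old file."""
    old_hashes = {hashlib.sha256(c).digest() for c in old_chunks}
    total = sum(len(c) for c in new_chunks)
    dup = sum(len(c) for c in new_chunks
              if hashlib.sha256(c).digest() in old_hashes)
    return dup / total if total else 0.0

def choose_encoder(old_chunks, new_chunks, threshold=0.5) -> str:
    """Decide whether a second, finer rsync pass is worth running after CDC."""
    if duplicate_ratio(old_chunks, new_chunks) >= threshold:
        return "cdc+rsync"   # many shared chunks: rsync may find ~10% more duplicates
    return "cdc-only"        # files mostly unrelated: skip the expensive rsync pass
```

The threshold would need tuning in practice: it trades the extra CPU time of the rsync pass against the bandwidth saved by the additional matches it finds.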