Incremental file synchronization algorithms: rsync and CDC

Source: Internet
Author: User
Tags: rsync

Recently I have been studying incremental file synchronization, focusing on the differential encoding of files, because this is the core of file synchronization. rsync on Linux is the most widely used algorithm for this, but it has a weakness: when the two files are completely unrelated, differential encoding becomes very inefficient, almost unacceptably so!
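To make the weakness concrete, here is a minimal Python sketch of rsync-style block matching: a weak checksum that can be rolled one byte at a time, plus a loop that counts how many window positions have to be probed. It is an illustration only, not rsync's actual code. The names `weak_checksum`, `roll`, and `match_blocks`, the 64-byte block size, and the use of SHA-1 as the strong hash are all assumptions made for the sketch.

```python
import hashlib

M = 1 << 16  # modulus for each 16-bit half of the weak checksum


def weak_checksum(block: bytes) -> tuple[int, int]:
    """Compute both halves (a, b) of the weak checksum from scratch."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, size: int) -> tuple[int, int]:
    """Slide the window one byte to the right in O(1)."""
    a = (a - out_byte + in_byte) % M
    b = (b - size * out_byte + a) % M
    return a, b


def match_blocks(old: bytes, new: bytes, size: int = 64) -> tuple[int, int]:
    """Return (matched_blocks, probes) when searching `new` for blocks of `old`."""
    # Index every fixed-size block of the old file by its weak checksum.
    table: dict[tuple[int, int], set[bytes]] = {}
    for off in range(0, len(old) - size + 1, size):
        block = old[off:off + size]
        table.setdefault(weak_checksum(block), set()).add(hashlib.sha1(block).digest())

    if len(new) < size:
        return 0, 0
    matched = probes = pos = 0
    a, b = weak_checksum(new[:size])
    while True:
        probes += 1
        strong = table.get((a, b))
        if strong and hashlib.sha1(new[pos:pos + size]).digest() in strong:
            matched += 1
            pos += size                      # a hit lets us skip a whole block
            if pos + size > len(new):
                break
            a, b = weak_checksum(new[pos:pos + size])
        else:
            if pos + size >= len(new):
                break
            a, b = roll(a, b, new[pos], new[pos + size], size)
            pos += 1                         # a miss advances by a single byte
    return matched, probes
```

With identical inputs the loop jumps a block at a time, so `probes` is about `len(new) / size`. With unrelated inputs nothing matches and the window advances one byte per probe, so `probes` is close to `len(new)`. This is one plausible reason for the slowdown described above.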

With this problem in mind, I studied the content-defined chunking (CDC) algorithm and found that it addresses exactly this case: when the two files differ greatly, CDC is very efficient. I ran differential encoding on two completely different installation package files. The file size is about 180 MB. The rsync algorithm takes about S, while the CDC algorithm takes only 4 s! However, CDC has its own drawback. When the files differ only slightly, rsync and CDC perform similarly, but rsync finds more duplicate chunks, about 10% more than CDC.
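As a rough illustration of content-defined chunking, the sketch below cuts a byte string wherever a rolling hash of the most recent bytes hits a fixed bit pattern, so chunk boundaries depend on content rather than on offsets. The post does not say which hash or parameters were used. The gear-style hash, the `GEAR` table, the function name `cdc_chunks`, and the mask and size limits are all assumptions chosen for the example.

```python
import random

_rng = random.Random(0)  # fixed seed so every run builds the same table
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def cdc_chunks(data: bytes, mask: int = 0xFFC00000,
               min_size: int = 256, max_size: int = 8192) -> list[bytes]:
    """Split `data` at content-defined boundaries (gear-style rolling hash)."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Each shift pushes older bytes out of the 32-bit state, so the hash
        # ends up depending only on roughly the last 32 bytes.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        # Cut when the top 10 bits are zero (and the chunk is long enough),
        # or when the chunk reaches the hard upper limit.
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks
```

Each byte is hashed once and each chunk is then looked up by its digest, so the work is a single linear pass whether or not the files have anything in common. Because boundaries follow the content, inserting a few bytes near the start of a file should change only the first chunk or two, and later chunks keep the same bytes and therefore the same digests.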

For the differential encoding of large files, I think the two can be combined. First, run the CDC algorithm. If many chunks are identical (judged by the proportion of the total file size that the identical chunks cover), then run the rsync algorithm to extract even more identical chunks. If few parts are identical (perhaps the two files are completely unrelated), there is no need to run rsync at all. This avoids the poor efficiency of differentially encoding two completely unrelated files, while still extracting as many identical parts as possible when the files are largely similar!
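The decision step of this combined approach can be sketched as follows. The functions take chunk lists produced by any CDC chunker. The names `duplicate_ratio` and `choose_strategy` and the 0.5 threshold are hypothetical; the post does not give a concrete cut-off.

```python
import hashlib


def duplicate_ratio(old_chunks: list[bytes], new_chunks: list[bytes]) -> float:
    """Fraction of the new file's bytes covered by chunks also present in the old file."""
    known = {hashlib.sha256(c).digest() for c in old_chunks}
    total = sum(len(c) for c in new_chunks)
    if total == 0:
        return 0.0
    shared = sum(len(c) for c in new_chunks
                 if hashlib.sha256(c).digest() in known)
    return shared / total


def choose_strategy(old_chunks: list[bytes], new_chunks: list[bytes],
                    threshold: float = 0.5) -> str:
    """Pick 'cdc+rsync' when enough data is shared, otherwise 'cdc-only'.

    The threshold is an assumed tuning knob, not a value from the post.
    """
    ratio = duplicate_ratio(old_chunks, new_chunks)
    return "cdc+rsync" if ratio >= threshold else "cdc-only"
```

For example, `choose_strategy([b"aaaa", b"bbbb"], [b"aaaa", b"cccc"])` sees half of the new data already present and returns `"cdc+rsync"`, while two chunk lists with nothing in common return `"cdc-only"`. A suitable threshold would have to be found by measuring on real files.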
