Rsync Core Algorithm

Source: Internet
Author: User
Tags: md5, hash

 

From: http://coolshell.cn/articles/7425.html
 

rsync is an efficient algorithm for synchronizing files on Unix/Linux. It can synchronize files and directories between two computers, and it reduces the amount of data transferred by working out which blocks of a file differ. One important feature of rsync that most similar programs and protocols lack is that a mirror only transfers the parts that have changed. rsync can copy and display directory attributes, copy files, and optionally compress data and recurse into directories. rsync uses an algorithm invented by Andrew Tridgell. This article does not cover how to use rsync; it only introduces the core algorithm. Unix is full of elegant ideas, and even a single command or tool can hold plenty worth learning. That is the Unix culture.

I did not originally plan to write this article, because many Chinese blogs have already covered this algorithm. After reading them, though, I found that they were either poor translations of foreign articles or messy, confusing introductions with mistakes in them, so I felt it was necessary to write an article about the rsync algorithm. (Of course, I wrote this in a hurry and it may contain errors of its own. Corrections are welcome.)

Problem

First, consider the problem rsync has to solve. If we only want to transfer the parts of a file that differ, we need to diff the files on both sides. But the two files live on different machines, so we cannot diff them directly. To run a diff we would have to send one file to the other machine, and that means transferring the whole file, which defeats the original goal of sending only the differences.

So we need a way for each side to learn how the two files differ without ever seeing the other side's file. This is where the rsync algorithm comes in.

Algorithm

The rsync algorithm works as follows. (Suppose the source file is named filesrc and the target file is named filedst.)

1) Block checksum algorithm. First, we split filedst into equal-sized blocks, for example 512 bytes each (the last block may be smaller), and then compute two checksums for each block:

    • One is the rolling checksum, a weak 32-bit checksum that uses the adler-32 algorithm invented by Mark Adler.
    • The other is the strong checksum, 128 bits long. It used to be MD4; now it is the MD5 hash algorithm.

Why two? On the hardware of years ago, running MD4 was too slow, so we need a fast algorithm to tell file blocks apart. But the collision probability of the weak adler-32 checksum is too high, so we also need a strong checksum to make sure two blocks really are the same. In other words, the weak checksum is used to detect differences, and the strong checksum is used to confirm equality. (For the checksum formulas, see this article.)
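What makes the weak checksum cheap is that it can be updated in constant time when the block window slides forward by one byte. The sketch below is only an illustration: it assumes the simplified two-sum form given in the rsync technical report (both sums taken modulo 2^16), not true Adler-32, which works modulo 65521. The names weak_checksum, roll, and combine are made up for this example.

```python
M = 1 << 16  # assumed modulus: the simplified form from the rsync technical report


def weak_checksum(block: bytes):
    """Compute the two 16-bit sums (a, b) of one block from scratch."""
    a = sum(block) % M
    # The first byte of the block gets weight len(block), the last byte weight 1.
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) % M
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, block_len: int):
    """Slide the window one byte: drop out_byte on the left, add in_byte on the right."""
    a = (a - out_byte + in_byte) % M
    b = (b - block_len * out_byte + a) % M
    return a, b


def combine(a: int, b: int) -> int:
    """Pack the two 16-bit sums into a single 32-bit checksum."""
    return a | (b << 16)


# Check that rolling gives the same result as recomputing at every offset.
data = b"the quick brown fox jumps over the lazy dog"
n = 8  # tiny block length, just for the demonstration
a, b = weak_checksum(data[0:n])
for k in range(1, len(data) - n + 1):
    a, b = roll(a, b, data[k - 1], data[k + n - 1], n)
    assert (a, b) == weak_checksum(data[k:k + n])
```

Each call to roll costs a handful of arithmetic operations regardless of the block length, whereas recomputing MD5 at every byte offset would cost a full hash of the block each time.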

2) Transmission algorithm. The synchronization target sends the synchronization source a checksum list for filedst. Each entry contains three items: the rolling checksum (32 bits), the MD5 checksum (128 bits), and the file block number.
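As a rough sketch of what the target computes before sending, the following builds one (block number, weak checksum, strong checksum) entry per block. It assumes Python's zlib.adler32 as the weak checksum, since the text names adler-32 (rsync's own weak checksum is a close relative rather than this exact function), and the name block_signatures is hypothetical.

```python
import hashlib
import zlib

BLOCK_SIZE = 512  # the block length used as an example in the text


def block_signatures(data: bytes, block_size: int = BLOCK_SIZE):
    """Return one (block_number, weak, strong) tuple per block of data."""
    signatures = []
    for number, offset in enumerate(range(0, len(data), block_size)):
        block = data[offset:offset + block_size]  # the last block may be shorter
        weak = zlib.adler32(block)                # 32-bit weak checksum
        strong = hashlib.md5(block).digest()      # 128-bit strong checksum
        signatures.append((number, weak, strong))
    return signatures


filedst = bytes(range(256)) * 5  # 1280 bytes: blocks of 512, 512 and 256 bytes
sigs = block_signatures(filedst)
print(len(sigs))  # 3
```

Each entry is 4 + 16 + 4 = 24 bytes if the block number is stored as a 32-bit integer, so with 512-byte blocks the list is under 5% of the size of filedst.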

You have probably guessed the next step: once the source machine has this list, it computes the same checksums over filesrc and compares them with those of filedst to find out which blocks have changed.

If you are paying attention, though, you will have two questions:

    • If a single character is inserted into filesrc, every block after it shifts by one character and no longer lines up with the blocks on the filedst side. Yet in theory only one character needs to be sent. How is this solved?
    • If the checksum list is very long, and the identical blocks on the two sides are not necessarily in the same order, each block has to be searched for. A linear search would be very slow. How is this solved?
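The first question is easy to reproduce. The sketch below (the helper name fixed_block_digests and the test data are made up for illustration) cuts both files at fixed 512-byte offsets and shows that inserting a single byte at the front of filesrc leaves no block in common with filedst.

```python
import hashlib

BLOCK_SIZE = 512


def fixed_block_digests(data: bytes, block_size: int = BLOCK_SIZE):
    """MD5 digest of each block cut at the fixed offsets 0, 512, 1024, ..."""
    return [hashlib.md5(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


# 2048 bytes of non-repeating content: the numbers 0..1023 as 2-byte big-endian values.
filedst = b"".join(i.to_bytes(2, "big") for i in range(1024))
filesrc = b"X" + filedst  # the same content with one byte inserted at the front

dst_digests = fixed_block_digests(filedst)
src_digests = fixed_block_digests(filesrc)
shared = set(dst_digests) & set(src_digests)
print(len(dst_digests), len(src_digests), len(shared))  # 4 5 0
```

Only one byte differs, yet a comparison at fixed offsets would resend the whole file. This is why the source has to test a block at every byte offset, not only at multiples of the block size.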

Good questions. Let's look at the algorithm on the synchronization source.

3) Checksum search algorithm. After the synchronization source receives the checksum array of filedst, it stores the entries in a hash table keyed by a hash of the rolling checksum, which gives O(1) lookups. The hash is 16 bits wide, so the table has 2^16 slots, and each rolling checksum is hashed to an integer between 0 and 2^16 - 1. (If hash tables are unfamiliar, go back to your university data structures textbook.)

By the way, I have seen many articles online claiming that the rolling checksums are sorted (such as this article and this article). Both cite and translate the original author's paper, and both misread it. Nothing is sorted: the checksum data of filedst is simply stored, keyed by rolling checksum, in a hash table with 2^16 slots. Collisions will of course occur, and colliding entries are chained into a linked list. This is what the original paper calls the second step: searching among the entries that collide.
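A minimal sketch of such a table follows. The folding function hash16 is an assumption (the text only says the hash is 16 bits wide), Python lists stand in for the linked lists, and all names are hypothetical.

```python
TABLE_SIZE = 1 << 16  # 2^16 slots, one per possible 16-bit hash value


def hash16(rolling: int) -> int:
    """Fold a 32-bit rolling checksum into a 16-bit slot index (assumed scheme)."""
    return ((rolling >> 16) + (rolling & 0xFFFF)) & 0xFFFF


def build_table(checksum_list):
    """Place each (rolling, md5_digest, block_number) entry into its slot's chain."""
    table = [[] for _ in range(TABLE_SIZE)]
    for entry in checksum_list:
        table[hash16(entry[0])].append(entry)  # colliding entries share one chain
    return table


def candidates(table, rolling: int):
    """Entries in the slot whose full 32-bit rolling checksum equals `rolling`."""
    return [entry for entry in table[hash16(rolling)] if entry[0] == rolling]


entries = [
    (0x00010002, b"a" * 16, 0),
    (0x00020001, b"b" * 16, 1),  # different checksum, same 16-bit hash as entry 0
    (0x12345678, b"c" * 16, 2),
]
table = build_table(entries)
print(len(table[hash16(0x00010002)]))          # 2
print(candidates(table, 0x00010002)[0][2])     # 0
```

A lookup touches one slot and walks a short chain, so its cost does not grow with the length of the checksum list as long as the entries spread evenly over the slots.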

4) Comparison algorithm. This is the key part of the algorithm. The details are as follows:

4.1) Take the first block of filesrc (assume a block length of 512), that is, bytes 1 through 512 of filesrc, compute its rolling checksum, and look the value up in the hash table.

4.2) If the lookup finds a potentially identical block in filedst, compare the MD5 checksums as well. The rolling checksum is weak, so the hit may be a collision. Computing the 128-bit MD5 checksum too brings the collision probability down to 2^-(32+128) = 2^-160, which is small enough to ignore. If both the rolling checksum and the MD5 checksum match, filedst contains an identical block, and we record its block number in filedst.

4.3) If the rolling checksum of filesrc is not found in the hash table, there is no need to compute the MD5 checksum: this part of the file contains different data. In short, whenever either the rolling checksum or the MD5 checksum fails to find a match in the checksum hash table of filedst, the algorithm rolls forward over filesrc. It steps ahead by 1 byte, takes bytes 2 through 513 of filesrc as the next block, computes its checksum, and goes back to (4.1). Now you can see why it is called a rolling checksum.

4.4) In this way, we find the bytes that lie between two adjacent matches in filesrc. These are the contents that must be sent to the synchronization target.
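Steps 4.1 to 4.4 can be sketched as a single loop. This is an illustration under several assumptions, not rsync's actual code: the weak checksum is the simplified two-sum form (modulo 2^16), a Python dict stands in for the 2^16-slot hash table, a trailing piece of filesrc shorter than one block is always sent as literal data, and the names weak, signatures, and delta are made up.

```python
import hashlib

M = 1 << 16  # assumed modulus of the simplified weak checksum


def weak(block: bytes):
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b


def signatures(dst: bytes, n: int):
    """Map weak checksum -> list of (md5 digest, block number) for the blocks of dst."""
    table = {}
    for number, off in enumerate(range(0, len(dst), n)):
        block = dst[off:off + n]
        table.setdefault(weak(block), []).append((hashlib.md5(block).digest(), number))
    return table


def delta(src: bytes, table, n: int):
    """Return a list of ops: ('match', block_number) or ('data', literal_bytes)."""
    ops, literal = [], bytearray()
    k = 0
    ab = weak(src[0:n]) if len(src) >= n else None
    while k + n <= len(src):
        hit = None
        if ab in table:  # 4.2: weak checksum found, confirm with MD5
            digest = hashlib.md5(src[k:k + n]).digest()
            hit = next((num for d, num in table[ab] if d == digest), None)
        if hit is not None:
            if literal:  # 4.4: flush the bytes between two matches
                ops.append(("data", bytes(literal)))
                literal.clear()
            ops.append(("match", hit))
            k += n  # continue right after the matched block
            if k + n <= len(src):
                ab = weak(src[k:k + n])
        else:  # 4.3: no match, keep byte k as literal and roll one byte
            literal.append(src[k])
            if k + n < len(src):
                a, b = ab
                a = (a - src[k] + src[k + n]) % M
                b = (b - n * src[k] + a) % M
                ab = (a, b)
            k += 1
    literal.extend(src[k:])  # tail shorter than one block
    if literal:
        ops.append(("data", bytes(literal)))
    return ops


n = 4  # tiny block length so the example is easy to follow
filedst = b"aaaabbbbccccdddd"
filesrc = b"aaaaXbbbbddddYZ"
ops = delta(filesrc, signatures(filedst, n), n)
print(ops)
# [('match', 0), ('data', b'X'), ('match', 1), ('match', 3), ('data', b'YZ')]
```

In the example, the inserted X costs one literal byte: after the mismatch at the window "Xbbb", the window rolls forward one byte and "bbbb" matches block 1 again.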

Illustration

Still not clear? All right, I will see this through to the end and draw a diagram for you (I will not explain what is in the figure).

This way, on the synchronization source, the rsync algorithm may end up with an array of data like the one in the figure. The red blocks are those that matched on the target and need not be transmitted (note: I show two chunk #5 blocks here; I believe you will understand why), and the white parts are the content that must be transmitted (note: the white blocks are of variable length). The synchronization source compresses this array (white parts carry the actual content, red parts carry only a label) and sends it to the target. On the target, rsync rebuilds the file from this table, and the synchronization is complete.
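The rebuild on the target side can be sketched as follows. The op format ('match', block number) and ('data', literal bytes) and the name apply_delta are assumptions made for this example; compression of the array is left out.

```python
def apply_delta(dst: bytes, ops, n: int) -> bytes:
    """Rebuild the new file from the old file dst and the list of ops."""
    out = bytearray()
    for kind, value in ops:
        if kind == "match":
            out += dst[value * n:(value + 1) * n]  # copy block `value` from the old file
        else:
            out += value                           # literal bytes sent by the source
    return bytes(out)


n = 4  # tiny block length for the example
filedst = b"aaaabbbbccccdddd"
# Block 1 is referenced twice, like the two chunk #5 blocks in the figure.
ops = [("match", 0), ("data", b"X"), ("match", 1), ("match", 1), ("data", b"YZ")]
rebuilt = apply_delta(filedst, ops, n)
print(rebuilt)  # b'aaaaXbbbbbbbbYZ'
```

Only the 3 literal bytes and 3 block labels travel over the network here; the 12 bytes of matched content are copied from the file the target already has.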

Finally, note that for some compressed files rsync may end up transferring more data, because a small change to the input can make the compressed output very different. For commands such as gzip and bzip2, remember to enable the "rsyncable" mode.

(End of article. Please credit the author and source when reprinting.)
