How to implement incremental file synchronization: the algorithm

Source: Internet
Author: User

Problem:

How can files be synchronized incrementally? Suppose a 10 MB text file is stored in two places, A and B, and the two copies are currently identical. If I modify the file on A, how can B automatically be brought back in line with A while transmitting as little data over the network as possible?

Application Scenarios:

There are many usage scenarios; here are just a few examples.

1. Machine A is an online machine and needs a backup machine B, so that service can be recovered quickly from B when A goes down or its hard disk is damaged through no human fault.

2. A scenario such as SVN: there is no need to send the whole file to the server and replace it on every modification; only the modified part is sent.

3. A mobile client modifies a text. If the text is 2 MB, must the entire file be uploaded on every update? At 2 MB per update, nobody would put up with it!

And so on.

Solution:

One. Divide and conquer

Divide and conquer is one of the most important basic ideas in computing. Here, a file is not treated as a single file but as a sequence of storage blocks. Each block might be 20 bytes in size; the value is configurable, and 20 bytes is given only as a reference. Once a file is divided into blocks, we only need to determine which blocks are still the same and which are not in order to find the parts that have been modified.
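As a minimal sketch of this chunking step, the C fragment below walks a buffer in fixed-size blocks. The names `BLOCK_SIZE`, `block_count` and `block_at` are illustrative assumptions, not part of the original article, and the 20-byte size is only the reference value mentioned above.

```c
#include <stddef.h>

/* Illustrative block size; 20 bytes is only a reference value. */
#define BLOCK_SIZE 20

/* Number of blocks needed to cover len bytes (the last block may be short). */
size_t block_count(size_t len)
{
    return (len + BLOCK_SIZE - 1) / BLOCK_SIZE;
}

/* Returns a pointer to block i of data and stores its length in *out_len.
   The caller must ensure i < block_count(len). */
const unsigned char *block_at(const unsigned char *data, size_t len,
                              size_t i, size_t *out_len)
{
    size_t start = i * BLOCK_SIZE;
    size_t remaining = len - start;

    *out_len = remaining < BLOCK_SIZE ? remaining : BLOCK_SIZE;
    return data + start;
}
```

With a 45-byte buffer, for instance, `block_count` reports 3 blocks, and the last one is only 5 bytes long.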

Two. Quick check

The previous section described comparing files block by block. Of course, we cannot upload every block to the other side for comparison, since that would defeat the purpose. A fast check brings hashing to mind: a hash table can find a key in O(1) on average because it locates the key by computing its hash value, and different keys usually have different hash values. However, checking the hash value alone is not reliable, because hash values can collide, so after the hash value matches, the keys themselves are compared to confirm that the right entry was found.

Following the idea of hashing, we can use a similar method for incremental file synchronization: compute the MD5 value of each storage block and send the MD5 values to the server. The server compares MD5 values to decide whether a block has been modified; if the MD5 values are not equal, the block has been modified.

Why MD5?

1) It converts input of any length to a fixed-length 128-bit value (an MD5 digest is 16 bytes).

2) In the vast majority of cases, different inputs produce different MD5 values, so hash collisions are rare.

Is that enough?

No. Computing MD5 takes a relatively long CPU time, so we need a cheaper check. Adler-32 is a fairly common choice.

Adler-32 has two advantages: 1. it is very fast to compute, much faster than MD5, so the cost is small; 2. once we have the checksum of bytes 0 to k, computing the checksum of bytes 1 to k+1, 2 to k+2, and so on is very cheap and takes only a few operations. (k can be understood as the 20-byte block size above.)
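The second advantage can be sketched in C as follows. The function name `adler32_roll` and its parameters are illustrative assumptions, not part of the original article. Given the checksum of a k-byte window, it derives the checksum of the window shifted by one byte in constant time, using A' = A - out + in and B' = B - k×out + A' - 1 (all mod 65521).

```c
#include <stddef.h>
#include <stdint.h>

#define MOD_ADLER 65521u

/* Adler-32 computed from scratch over len bytes. */
uint32_t adler32(const unsigned char *data, size_t len)
{
    uint32_t a = 1, b = 0;

    for (size_t i = 0; i < len; ++i) {
        a = (a + data[i]) % MOD_ADLER;
        b = (b + a) % MOD_ADLER;
    }
    return (b << 16) | a;
}

/* Given the checksum of a k-byte window, returns the checksum of the window
   shifted right by one byte: `out` leaves on the left, `in` enters on the right. */
uint32_t adler32_roll(uint32_t sum, size_t k, unsigned char out, unsigned char in)
{
    uint32_t a = sum & 0xFFFFu;
    uint32_t b = sum >> 16;
    /* (k * out) mod 65521, kept small enough to avoid overflow */
    uint32_t k_out = (uint32_t)(k % MOD_ADLER) * out % MOD_ADLER;

    /* MOD_ADLER is added before subtracting so the values never go negative */
    a = (a + MOD_ADLER - out + in) % MOD_ADLER;
    b = (b + a + (MOD_ADLER - 1 - k_out)) % MOD_ADLER;
    return (b << 16) | a;
}
```

Sliding a window across a file then costs a handful of arithmetic operations per byte, instead of recomputing the checksum over all k bytes at every position.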

Of course, its shortcoming is also obvious: the collision rate is much higher than that of MD5. So the client computes both the Adler-32 checksum and the MD5 value and sends them to the server. To save CPU time, the server first computes only the Adler-32 checksum for comparison; only if the values are equal does it perform the MD5 check. This saves the server a lot of CPU time.

Adler-32 algorithm:

A = 1 + D1 + D2 + ... + Dn (mod 65521)
B = (1 + D1) + (1 + D1 + D2) + ... + (1 + D1 + D2 + ... + Dn) (mod 65521)
  = n×D1 + (n−1)×D2 + (n−2)×D3 + ... + Dn + n (mod 65521)

Adler-32(D) = B × 65536 + A

C implementation:

const int MOD_ADLER = 65521;

unsigned long adler32(unsigned char *data, int len) /* where data is the location of the data in physical memory and
                                                       len is the length of the data in bytes */
{
    unsigned long a = 1, b = 0;
    int index;

    /* Process each byte of the data in order */
    for (index = 0; index < len; ++index)
    {
        a = (a + data[index]) % MOD_ADLER;
        b = (b + a) % MOD_ADLER;
    }

    return (b << 16) | a;
}
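To connect the formula with the code, the following sketch checks the function against a value worked out by hand for the three bytes of "abc". The helper name `adler32_self_test` is an illustrative assumption; the checksum function is repeated, with fixed-width types, only so that the fragment is self-contained.

```c
#include <stddef.h>
#include <stdint.h>

#define MOD_ADLER 65521u

/* Adler-32 over len bytes, following the A and B sums defined above. */
uint32_t adler32(const unsigned char *data, size_t len)
{
    uint32_t a = 1, b = 0;

    for (size_t i = 0; i < len; ++i) {
        a = (a + data[i]) % MOD_ADLER;
        b = (b + a) % MOD_ADLER;
    }
    return (b << 16) | a;
}

/* Hand-computed check for the bytes 'a', 'b', 'c' (97, 98, 99):
     A = 1 + 97 + 98 + 99          = 295 (0x0127)
     B = 3*97 + 2*98 + 1*99 + 3    = 589 (0x024D)
     Adler-32 = 589 * 65536 + 295  = 0x024D0127
   Returns 1 if the function reproduces that value, 0 otherwise. */
int adler32_self_test(void)
{
    static const unsigned char abc[] = { 'a', 'b', 'c' };

    return adler32(abc, sizeof abc) == 0x024D0127u;
}
```

Because neither sum reaches 65521 for such a short input, the modulo has no effect here and the arithmetic can be followed directly.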

Three. Implementing changes

Since the differing parts of the file have now been located, the client only needs to upload those changed parts to the server, and the server then applies the changes.
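One possible way for the server to apply the changes is sketched below. The article does not specify a format, so the `struct op` layout and the `apply_ops` function are hypothetical: each operation either copies a block the server already has in its old file or appends bytes uploaded by the client.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical instruction format; not taken from the original article. */
enum op_kind { OP_COPY, OP_DATA };

struct op {
    enum op_kind kind;
    size_t offset;              /* OP_COPY: start of the block in the old file */
    const unsigned char *data;  /* OP_DATA: bytes uploaded by the client */
    size_t len;                 /* number of bytes to copy or append */
};

/* Builds the new file in out (capacity cap) from the old file and the ops.
   Returns the number of bytes written, or (size_t)-1 on a bad op or overflow. */
size_t apply_ops(const unsigned char *old, size_t old_len,
                 const struct op *ops, size_t nops,
                 unsigned char *out, size_t cap)
{
    size_t written = 0;

    for (size_t i = 0; i < nops; ++i) {
        const unsigned char *src;

        if (ops[i].kind == OP_COPY) {
            /* the copied range must lie inside the old file */
            if (ops[i].offset > old_len || ops[i].len > old_len - ops[i].offset)
                return (size_t)-1;
            src = old + ops[i].offset;
        } else {
            src = ops[i].data;
        }
        if (ops[i].len > cap - written)
            return (size_t)-1;
        memcpy(out + written, src, ops[i].len);
        written += ops[i].len;
    }
    return written;
}
```

For instance, with an old file `itaohuiamsoman`, the operations copy(offset 1, 4 bytes), data `uiis`, copy(offset 9, 4 bytes), data `n` produce `taohuiissoman`; only 5 bytes of file content travel over the network.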

Example analysis:

After all that theory, here is a small example.

The client file (file.1) contains:

taohuiissoman

The server file (file.2) contains:

itaohuiamsoman

First, the client splits its file into blocks and computes the MD5 and Adler-32 values of each block.

For example, with a block size of 4 bytes, taoh is one block, and its MD5 and Adler-32 values are computed; the same is done for uiis and soma. The final letter n is shorter than 4 bytes, so it is left as is. The client then sends the computed MD5 and Adler-32 values in order, followed by the trailing character n itself.

After receiving them, the server splits its saved file.2 into 4-byte blocks.

This gives itao, huia, msom, an. None of these blocks has the same Adler-32 value as a block from file.1 (taoh, uiis, soma). The server therefore moves forward by one byte and takes the 4 bytes starting at t.

This yields taoh, whose Adler-32 value matches a block from file.1; the MD5 values are then compared and also match. The server records the match, skips the 4 characters of taoh, and looks at uiam, which matches no block of file.1. It moves forward by 1 byte and tries again starting at i. The Adler-32 value still does not match, so it keeps moving forward one byte at a time.

On reaching soma, it finds a matching block again.

These steps are repeated until the end of file.2.

From this simple example you can work out how any other insertion or deletion is handled.
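The walk-through above can be sketched in C as follows. This is an illustration under stated assumptions, not a definitive implementation: the names `struct sig`, `make_sigs` and `find_blocks` are hypothetical, and a 64-bit FNV-1a hash stands in for MD5 purely to keep the sketch free of external libraries.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK 4            /* block size used in the example */
#define MOD_ADLER 65521u

struct sig { uint32_t weak; uint64_t strong; };  /* hypothetical signature pair */

/* Weak checksum: Adler-32. */
static uint32_t adler32(const unsigned char *d, size_t n)
{
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < n; ++i) {
        a = (a + d[i]) % MOD_ADLER;
        b = (b + a) % MOD_ADLER;
    }
    return (b << 16) | a;
}

/* Strong check: 64-bit FNV-1a, an assumed stand-in for MD5. */
static uint64_t strong_hash(const unsigned char *d, size_t n)
{
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < n; ++i) {
        h ^= d[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Client side: one signature per full block of its file. Returns the count. */
size_t make_sigs(const unsigned char *f, size_t len, struct sig *out)
{
    size_t n = len / BLOCK;
    for (size_t i = 0; i < n; ++i) {
        out[i].weak = adler32(f + i * BLOCK, BLOCK);
        out[i].strong = strong_hash(f + i * BLOCK, BLOCK);
    }
    return n;
}

/* Server side: scan its own file. found[i] receives the offset at which
   client block i was found, or -1 if the server does not have that block. */
void find_blocks(const unsigned char *f, size_t len,
                 const struct sig *sigs, size_t nsigs, long *found)
{
    size_t pos = 0;

    for (size_t i = 0; i < nsigs; ++i)
        found[i] = -1;

    while (pos + BLOCK <= len) {
        /* recomputed here for brevity; a real scan would roll this value */
        uint32_t weak = adler32(f + pos, BLOCK);
        int hit = 0;

        for (size_t i = 0; i < nsigs && !hit; ++i) {
            if (sigs[i].weak != weak)
                continue;                      /* cheap check first */
            if (sigs[i].strong != strong_hash(f + pos, BLOCK))
                continue;                      /* expensive check only on a weak match */
            if (found[i] < 0)
                found[i] = (long)pos;
            hit = 1;
        }
        pos += hit ? BLOCK : 1;  /* skip a matched block, otherwise slide one byte */
    }
}
```

For the two example files, this scan finds the first client block (taoh) at offset 1 of the server file and the third (soma) at offset 9, and reports the second block (uiis) as missing, so only that block and the trailing n need to be uploaded.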
