Problem:
How can a file be synchronized incrementally? For example, a 10 MB text file is stored in two places, A and B, and the two copies are exactly the same. If I then modify the copy on A, how can B automatically be made consistent with A again while sending as little data over the network as possible?
Application Scenario:
There are too many uses to list them all; here are a few:
1. Machine A is the online production machine and needs a backup machine B, so that when A goes down, its hard disk is damaged, or its data becomes unavailable for some other reason, the data can be recovered quickly from B.
2. Scenarios such as SVN, where there is no need to send the whole file to the server and replace it on every change; only the modified part has to be sent.
3. A mobile client modifies a text file. If the file is 2 MB, does the entire file have to be uploaded on every update? At 2 MB a time, only a fool would use it!
And so on....
Solution:
One. Divide and conquer
Divide and conquer is one of the most important basic techniques in computing. In our eyes a file is not a single file but a pile of storage blocks. Each block might be 20 bytes in size; the exact value is up to you, and 20 bytes is only a reference figure here. Once the file is divided into blocks, we only need to compare the blocks one by one to find out which parts have been modified.
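The block-by-block comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the article's own code: both copies are assumed to be in memory and of equal length, and the names BLOCK_SIZE, block_count and find_changed_blocks are hypothetical. It compares the raw bytes locally, which is only possible when both copies are on one machine; across a network the comparison is done on checksums instead.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical block size; the text gives 20 bytes only as a reference. */
#define BLOCK_SIZE 20

/* Number of blocks needed to cover len bytes (the last block may be short). */
size_t block_count(size_t len)
{
    return (len + BLOCK_SIZE - 1) / BLOCK_SIZE;
}

/*
 * Compare two equally long buffers block by block and store the indices of
 * the differing blocks in changed[] (at most max_changed entries are stored).
 * Returns the total number of differing blocks.
 */
size_t find_changed_blocks(const unsigned char *old_data,
                           const unsigned char *new_data,
                           size_t len,
                           size_t *changed, size_t max_changed)
{
    size_t n = block_count(len);
    size_t found = 0;

    for (size_t i = 0; i < n; ++i) {
        size_t offset = i * BLOCK_SIZE;
        size_t size = len - offset < BLOCK_SIZE ? len - offset : BLOCK_SIZE;

        if (memcmp(old_data + offset, new_data + offset, size) != 0) {
            if (found < max_changed)
                changed[found] = i;
            ++found;
        }
    }
    return found;
}
```

With a 50-byte file and a single modified byte at offset 45, only block 2 (bytes 40 to 49) is reported, so only that block would need to be transferred.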
Two. Quick check
We just mentioned comparing the file block by block. Of course, this does not mean uploading every block so that it can be compared; that would make no sense. Fast comparison brings hashing to mind: a hash table can find a key in O(1) time. Why? Because it first checks the key by computing its hash value, and a given key always produces the same hash value. But checking only the hash value is unreliable, because hash values can collide, so after the hash value matches, the keys themselves are compared to confirm that the right entry has been found ...
Borrowing the idea of hashing, we can synchronize files incrementally in a similar way: for each block, compute its MD5 value and send that value to the server. The server compares it with the MD5 value of its own copy of the block to decide whether the block has changed; if the two MD5 values are not equal, the block has been modified.
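A rough sketch of this exchange is shown below. Standard C has no MD5 function, so the sketch substitutes 64-bit FNV-1a as a stand-in digest; this is an assumption made purely for illustration, and FNV-1a is not MD5 and is not collision resistant. The names block_digest, make_signature and count_modified are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 20  /* reference block size from the text */

/* Stand-in for MD5: 64-bit FNV-1a. NOT MD5; used only for illustration. */
uint64_t block_digest(const unsigned char *data, size_t len)
{
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV offset basis */
    for (size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= 0x100000001b3ULL;            /* FNV prime */
    }
    return h;
}

/* Client side: compute one digest per block; returns the number of blocks. */
size_t make_signature(const unsigned char *data, size_t len,
                      uint64_t *sig, size_t max_blocks)
{
    size_t n = 0;
    for (size_t off = 0; off < len && n < max_blocks; off += BLOCK_SIZE, ++n) {
        size_t size = len - off < BLOCK_SIZE ? len - off : BLOCK_SIZE;
        sig[n] = block_digest(data + off, size);
    }
    return n;
}

/* Server side: count the blocks whose digest differs from the server's own. */
size_t count_modified(const uint64_t *client_sig,
                      const uint64_t *server_sig, size_t n)
{
    size_t modified = 0;
    for (size_t i = 0; i < n; ++i)
        if (client_sig[i] != server_sig[i])
            ++modified;
    return modified;
}
```

Only the digests travel over the network: for a 60-byte file that is three 8-byte values in this sketch (three 16-byte values with real MD5), rather than the file contents.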
Why MD5?
1) It converts a string of any length into a fixed-length 128-bit value (16 bytes)
2) MD5 ensures that, in the vast majority of cases, different inputs produce different hash values, so hash collisions are relatively rare
Is that all?
No. Computing MD5 takes a relatively long CPU time, so we need a lighter-weight check. Adler-32 is a fairly common choice here
Adler-32 has two advantages:
1. It is very fast to compute, much faster than MD5, so the cost is small;
2. Once we have the checksum of the k bytes starting at offset 0, computing the checksum of the k bytes starting at offset 1, offset 2, and so on is very cheap: it takes only a handful of operations. (k can be understood as the 20-byte block size above.)
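The second advantage can be sketched as follows. This is a minimal illustration rather than code from the original article: the names rolling_adler, adler_init, adler_roll and adler_value are hypothetical. When the window slides by one byte, dropping the byte `out` and appending the byte `in`, the new sums follow from the old ones as a' = a - out + in and b' = b - k*out - 1 + a' (all mod 65521), where k is the window length.

```c
#include <stddef.h>
#include <stdint.h>

#define MOD_ADLER 65521u

typedef struct {
    uint32_t a;   /* 1 + sum of the bytes in the window, mod 65521 */
    uint32_t b;   /* sum of the running a values, mod 65521 */
    size_t   len; /* window length k */
} rolling_adler;

/* Compute the checksum state of data[0..len-1] from scratch. */
rolling_adler adler_init(const unsigned char *data, size_t len)
{
    rolling_adler s = { 1, 0, len };
    for (size_t i = 0; i < len; ++i) {
        s.a = (s.a + data[i]) % MOD_ADLER;
        s.b = (s.b + s.a) % MOD_ADLER;
    }
    return s;
}

/* Slide the window by one byte: drop `out` (the oldest byte), append `in`. */
void adler_roll(rolling_adler *s, unsigned char out, unsigned char in)
{
    /* a' = a - out + in; MOD_ADLER is added first to avoid underflow */
    s->a = (s->a + MOD_ADLER - out + in) % MOD_ADLER;

    /* b' = b - k*out - 1 + a' */
    uint32_t k_out = (uint32_t)((s->len % MOD_ADLER) * out % MOD_ADLER);
    s->b = (s->b + 2 * MOD_ADLER - k_out - 1 + s->a) % MOD_ADLER;
}

/* Pack the two sums into the 32-bit checksum. */
uint32_t adler_value(const rolling_adler *s)
{
    return (s->b << 16) | s->a;
}
```

Each call to adler_roll costs a few additions and one multiplication regardless of k, whereas recomputing the window from scratch costs k steps.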
Of course, its drawback is just as obvious: its collision rate is much higher than MD5's. Our client therefore computes both the Adler-32 checksum and the MD5 value and sends both to the server. To save CPU time, the server first computes only the Adler-32 checksum, and only when that value matches does it go on to the MD5 check. This saves the server a lot of computation.
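The two-step check on the server might look roughly like this. It is a sketch under assumptions, not the article's implementation: 64-bit FNV-1a stands in for MD5 because standard C provides no MD5 function (it is not MD5 and not collision resistant), and the names weak_sum, strong_sum, block_sig, sign_block and block_unchanged are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Cheap checksum: Adler-32. */
uint32_t weak_sum(const unsigned char *d, size_t n)
{
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < n; ++i) {
        a = (a + d[i]) % 65521u;
        b = (b + a) % 65521u;
    }
    return (b << 16) | a;
}

/* Expensive digest: stand-in for MD5 (64-bit FNV-1a, illustration only). */
uint64_t strong_sum(const unsigned char *d, size_t n)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < n; ++i) {
        h ^= d[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

typedef struct {
    uint32_t weak;
    uint64_t strong;
} block_sig;

/* Client side: compute both values for a block. */
block_sig sign_block(const unsigned char *d, size_t n)
{
    block_sig s = { weak_sum(d, n), strong_sum(d, n) };
    return s;
}

/*
 * Server side: returns 1 if the local block matches the client's signature.
 * *strong_calls counts how often the expensive digest had to be computed.
 */
int block_unchanged(const block_sig *from_client,
                    const unsigned char *local, size_t n,
                    unsigned *strong_calls)
{
    if (weak_sum(local, n) != from_client->weak)
        return 0;                 /* cheap rejection: strong digest skipped */
    ++*strong_calls;
    return strong_sum(local, n) == from_client->strong;
}
```

A block whose weak checksum already differs is rejected without touching the strong digest, so the expensive computation is spent only on blocks that are probably unchanged.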
Adler-32 algorithm (D is the sequence of bytes D1 ... Dn being checksummed, and n is its length):
A = 1 + D1 + D2 + ... + Dn (mod 65521)
B = (1 + D1) + (1 + D1 + D2) + ... + (1 + D1 + D2 + ... + Dn) (mod 65521)
  = n×D1 + (n-1)×D2 + (n-2)×D3 + ... + Dn + n (mod 65521)
Adler-32(D) = B × 65536 + A
C implementation:
const int MOD_ADLER = 65521;

/* data is the location of the data in memory and
   len is the length of the data in bytes */
unsigned long adler32(unsigned char *data, int len)
{
    unsigned long a = 1, b = 0;
    int index;

    /* Process each byte of the data in order */
    for (index = 0; index < len; ++index)
    {
        a = (a + data[index]) % MOD_ADLER;
        b = (b + a) % MOD_ADLER;
    }

    /* b goes in the high 16 bits, a in the low 16 bits */
    return (b << 16) | a;
}
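To see the two running sums at work, the small variant below prints a and b after every byte. It is an illustrative sketch only, and the name adler32_trace is hypothetical.

```c
#include <stdio.h>
#include <stddef.h>

/* Same computation as Adler-32, but printing the running sums per byte. */
unsigned long adler32_trace(const unsigned char *data, size_t len)
{
    unsigned long a = 1, b = 0;

    for (size_t i = 0; i < len; ++i) {
        a = (a + data[i]) % 65521;
        b = (b + a) % 65521;
        printf("byte %3u  a = %5lu  b = %5lu\n", (unsigned)data[i], a, b);
    }
    return (b << 16) | a;
}
```

For the three bytes of "abc" (97, 98, 99), a takes the values 98, 196, 295 and b takes the values 98, 294, 589, so the checksum is 589 × 65536 + 295 = 0x024D0127.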