Research on Data Synchronization Algorithms

Http://blog.csdn.net/liuben/archive/2010/08/06/5793706.aspx

1. Introduction
Data transmission and synchronization between network applications over a LAN or WAN is very common: remote data mirroring, backup, replication, synchronization, downloading, uploading, sharing, and so on. The simplest approach is full copying, but then entire copies travel back and forth over the network even though, in many cases, the copies differ only slightly, for example because they derive from the same original version. For large files, full copying wastes a great deal of network bandwidth and makes synchronization slow. WAN bandwidth and access latency remain pressing problems, and full copying prevents many network applications, such as distributed file systems (DFS) and cloud storage, from providing good quality of service. Rsync and RDC (Remote Differential Compression) are the two most common data synchronization algorithms that transmit only the differences between files, which saves network bandwidth and improves efficiency. Building on these two algorithms, and with the help of data deduplication (de-duplication) technology, this article studies and analyzes data synchronization algorithms in depth and develops a prototype system. It first introduces the rsync and RDC algorithms, then describes the algorithm design and the corresponding data structures in detail, focusing on file chunking, differential encoding, and file synchronization, and finally introduces two application modes (pull and push).

2. Related work
Rsync is an efficient remote file replication (synchronization) tool for Unix-like environments. It uses the well-known rsync algorithm to optimize the transfer, reducing the amount of data sent and improving file transfer efficiency. Suppose there are two computers, alpha and beta; alpha can access file A, beta can access file B, files A and B are very similar, and alpha and beta are connected by a low-speed network. The approximate process is as follows (refer to tech_report.ps by the rsync author Andrew Tridgell for the detailed procedure):
1. Beta splits file B into contiguous, non-overlapping, fixed-size blocks of S bytes; the last block may be shorter than S bytes.
2. For each block, beta computes two checksums: a 32-bit weak rolling checksum and a 128-bit strong MD4 checksum.
3. Beta sends these checksum values to alpha.
4. Alpha examines every window of file A of size S (at arbitrary offsets, not necessarily multiples of S) and searches for a window whose weak checksum and strong checksum both match a block of file B. This is done quickly thanks to the rolling checksum (a minimal sketch of such a rolling checksum follows this list).
5. Alpha sends beta a sequence of instructions for reconstructing a copy of file A; each instruction is either a reference to a matching block of file B or a literal block of data from file A (no match).
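The weak rolling checksum is what makes step 4 cheap: when the search window in file A slides forward by one byte, the checksum can be updated in O(1) instead of being recomputed over the whole block. Below is a minimal sketch of an rsync-style rolling checksum in C; the struct and function names are illustrative, not taken from the rsync source code.

/* Minimal rsync-style 32-bit rolling checksum (illustrative names). */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t a;     /* plain sum of the bytes in the window */
    uint32_t b;     /* position-weighted sum of the bytes   */
    size_t   len;   /* window length in bytes               */
} rolling_sum;

/* Compute the checksum of the initial window of len bytes. */
void rolling_init(rolling_sum *rs, const uint8_t *buf, size_t len)
{
    rs->a = rs->b = 0;
    rs->len = len;
    for (size_t i = 0; i < len; i++) {
        rs->a += buf[i];
        rs->b += (uint32_t)(len - i) * buf[i];
    }
}

/* Slide the window one byte forward: drop `out`, append `in`.  O(1). */
void rolling_slide(rolling_sum *rs, uint8_t out, uint8_t in)
{
    rs->a = rs->a - out + in;
    rs->b = rs->b - (uint32_t)rs->len * out + rs->a;
}

/* Pack the two 16-bit sums into a single 32-bit weak checksum. */
uint32_t rolling_digest(const rolling_sum *rs)
{
    return (rs->a & 0xffff) | (rs->b << 16);
}

Only when this cheap weak value matches one of file B's blocks does the sender go on to compute and compare the expensive strong checksum, so the O(1) rolling update dominates the cost of scanning file A.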
Rsync is a very good tool, but it still has some drawbacks:
1. Although the rolling checksum saves a great deal of checksum computation, and the checksum lookup is optimized, each window position still costs an extra hash lookup, and this overhead is not small.
2. In the rsync algorithm the computation of alpha and beta is unbalanced: alpha does a large amount of work while beta does very little. Since alpha is usually the server, the server bears the greater load.
3. The block size in rsync is fixed, so its ability to adapt to changes in the data is limited.
The typical representative of the RDC algorithm is DFSR (Distributed File System Replication) in Microsoft DFS. It differs from rsync in that it applies the same chunking rules to both the source and the destination file, so the amount of computation on the source and target sides is equal. The two algorithms also differ in emphasis: rsync pursues the highest possible rate of duplicate-data discovery and does not mind spending computation on it, while RDC trades the two off against each other, aiming to find the differences quickly with a small amount of computation, at the cost of discovering less duplicate data than rsync. In addition, rsync uses a fixed-length chunking strategy, whereas RDC uses variable-length chunking.

3. Data deduplication technology
De-duplication (dedupe), the deletion of duplicate data, is a relatively new and currently very popular storage technology that can significantly reduce the amount of data stored. It eliminates redundancy by detecting repeated data within a data set. With the help of dedupe, the efficiency of a storage system can be improved, costs can be saved, and the network bandwidth used during transmission can be reduced. It is also a green storage technology that can effectively reduce energy consumption.
Dedupe can be classified as file-level or block-level according to the granularity of deduplication. File-level dedupe is also known as Single Instance Storage (SIS), while block-level dedupe works at a granularity of roughly 4-24 KB per block. Obviously the block level achieves a higher deduplication ratio, so current mainstream dedupe products work at the block level. A file is split into data blocks (fixed-length or variable-length), and an MD5 or SHA-1 hash is computed over each block as its fingerprint (FP); two or more hash algorithms, or a hash combined with a CRC checksum, can be used to make the probability of a fingerprint collision extremely small. Blocks with the same fingerprint are considered identical, and only one copy needs to be kept in the storage system. A physical file thus corresponds to a logical representation consisting of a sequence of fingerprints (metadata). When the file is read, the logical file is read first, then the corresponding data blocks are fetched from the storage system according to the fingerprint sequence, and the physical file is reconstructed.
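As an illustration of block-level dedupe, the sketch below splits a buffer into fixed-size blocks, fingerprints each block with OpenSSL's MD5(), and counts how many blocks duplicate one already seen. The 4 KB block size, the table size, and the naive linear fingerprint table are illustrative simplifications; a real system would use an indexed fingerprint store and, usually, variable-length chunking.

/* Block-level dedupe sketch: fingerprint fixed-size blocks with MD5 and
 * count duplicates.  Block size and the linear table are illustrative. */
#include <stdint.h>
#include <string.h>
#include <openssl/md5.h>

#define BLK_SZ   4096
#define MAX_BLKS 65536

static uint8_t fp_table[MAX_BLKS][MD5_DIGEST_LENGTH];
static size_t  fp_count;

/* Returns 1 if the fingerprint was already present (duplicate block). */
static int fp_insert(const uint8_t fp[MD5_DIGEST_LENGTH])
{
    for (size_t i = 0; i < fp_count; i++)
        if (memcmp(fp_table[i], fp, MD5_DIGEST_LENGTH) == 0)
            return 1;
    if (fp_count < MAX_BLKS)
        memcpy(fp_table[fp_count++], fp, MD5_DIGEST_LENGTH);
    return 0;
}

/* Scan a buffer block by block and report how many blocks were duplicates. */
size_t dedupe_scan(const uint8_t *data, size_t len)
{
    size_t dupes = 0;
    for (size_t off = 0; off < len; off += BLK_SZ) {
        size_t n = (len - off < BLK_SZ) ? len - off : BLK_SZ;
        uint8_t fp[MD5_DIGEST_LENGTH];
        MD5(data + off, n, fp);      /* block fingerprint */
        dupes += fp_insert(fp);
    }
    return dupes;
}

In a real dedupe store, only the blocks whose fingerprints were not seen before would be written, and the file's metadata would record the fingerprint sequence.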
Dedupe technology is currently used mainly for data backup, because repeated backups of the same data produce large amounts of duplication, which suits the technique very well. In fact, dedupe can be applied in many settings: online, near-line, and offline storage systems, and even inside file systems, volume managers, NAS, and SANs. It can also be used for network data transfer and in archiving/packaging tools. Dedupe helps many applications reduce storage consumption, save network bandwidth, improve storage efficiency, shrink backup windows, and save energy.

4. Data Synchronization algorithm
As with rsync, suppose there are two computers, alpha and beta; alpha can access file A, beta can access file B, files A and B are very similar, and alpha and beta are connected by a slow network. The approximate flow of the dedupe-based data synchronization algorithm is similar to rsync's and is briefly described as follows:
1. Beta splits file B into equal-sized or variable-sized data blocks using a chunking algorithm such as FSP (fixed-size partition) or CDC (content-defined chunking).
2. For each block, beta computes an rsync-style weak checksum and a strong MD5 checksum, and records the block length len and its offset within file B.
3. Beta sends this block information to alpha.
4. Alpha splits file A with the same chunking technique, searches for matches against the block information received from beta, and generates the differential encoding information.
5. Alpha sends the differential encoding information to beta, together with the instructions for reconstructing file A.
6. Beta reconstructs file A from the differential encoding information and file B.
Several key problems have to be solved in the algorithm described above: file chunking, the description of block information, differential encoding, the description of differential encoding information, and file synchronization. File chunking, differential encoding, and file synchronization are introduced in the following sections; the layouts of the block information and of the differential encoding information are described here.
The data layout of the block information file consists of a file header (chunk_file_header) followed by a set of block description entries (chunk_block_entry), defined below. The header records the block size used for file B and the total number of data blocks. It is followed by one entry per block, giving the block length, the block's offset within file B, the weak checksum, and the strong MD5 checksum.
/* Define chunk file header and block entry */
#include <stdint.h>

typedef struct _chunk_file_header {
    uint32_t block_sz;          /* block size used to split file B */
    uint32_t block_nr;          /* total number of data blocks */
} chunk_file_header;
#define CHUNK_FILE_HEADER_SZ (sizeof(chunk_file_header))

typedef struct _chunk_block_entry {
    uint64_t offset;            /* block offset within file B */
    uint32_t len;               /* block length in bytes */
    uint8_t  md5[16 + 1];       /* strong MD5 checksum */
    uint8_t  csum[10 + 1];      /* weak (rolling) checksum */
} chunk_block_entry;
#define CHUNK_BLOCK_ENTRY_SZ (sizeof(chunk_block_entry))
The data layout of the differential encoding file likewise consists of a file header (delta_file_header) followed by a set of block description entries (delta_block_entry), defined below. The header records the total number of blocks and the length and offset of the last data block. It is followed by one entry per block, giving the block length, the offset, and a location flag. If embeded is 1, the block's data is stored in the delta file itself at the given offset, following the entry; if embeded is 0, the block is located at the given offset within file B. The last block of data is stored at the end of the delta file, and its length and offset are recorded in the header.
/* Define delta file header and block entry */
typedef struct _delta_file_header {
    uint32_t block_nr;          /* total number of blocks */
    uint32_t last_block_sz;     /* length of the last data block */
    uint64_t last_block_offset; /* its offset in the delta file */
} delta_file_header;
#define DELTA_FILE_HEADER_SZ (sizeof(delta_file_header))

typedef struct _delta_block_entry {
    uint64_t offset;            /* offset in the delta file or in file B */
    uint32_t len;               /* block length in bytes */
    uint8_t  embeded;           /* 1: block in delta file; 0: block in file B */
} delta_block_entry;
#define DELTA_BLOCK_ENTRY_SZ (sizeof(delta_block_entry))
For real-time performance, the block information and the differential encoding information do not necessarily have to be written to files; they can be kept in memory, but the data layout is the same as described above.

5. File chunking
In dedupe technology there are three main chunking algorithms: fixed-size partition (FSP), content-defined chunking (CDC), and sliding-block chunking. The fixed-size partition algorithm splits the file with a predefined block size and computes a weak checksum and a strong MD5 checksum for each block. The weak checksum exists mainly to improve the performance of differential encoding: the weak checksum is computed and looked up first, and only if it matches is the MD5 strong checksum computed and looked up. Because the weak checksum is much cheaper than MD5, this effectively improves encoding performance. Fixed-size chunking is simple and fast, but it is very sensitive to insertions and deletions, handles them very inefficiently, and cannot adjust or optimize itself according to changes in content.
The CDC algorithm is a variable-length chunking algorithm that uses a data fingerprint (such as a Rabin fingerprint) to split the file into chunks of varying length. Unlike fixed-size chunking, it places block boundaries according to the content, so the block size varies. While the algorithm runs, CDC slides a fixed-size window (for example 48 bytes) over the file data and computes a fingerprint at each position. When the fingerprint satisfies a condition, for example when its value modulo a preset divisor equals a chosen target, the window position is taken as a block boundary. The CDC algorithm can degenerate: if the fingerprint condition is never met, the block boundary is undetermined and the block grows too large. Implementations solve this by limiting the block size with upper and lower bounds. CDC is insensitive to changes in file content: inserting or deleting data affects only the few blocks around the change, and the remaining blocks are unaffected. CDC also has drawbacks: the block size is hard to choose, fine granularity costs too much overhead, and coarse granularity gives a poor dedupe ratio; balancing the two is the hard part. A minimal boundary-detection sketch follows.
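The sketch below shows CDC boundary detection in C under simplifying assumptions: a byte-wise rolling hash over a 48-byte window stands in for a real Rabin fingerprint, and the mask and the minimum/maximum block sizes are illustrative values rather than parameters taken from the article.

/* Content-defined chunking sketch: find the next block boundary.
 * A simple rolling hash replaces a real Rabin fingerprint; window size,
 * mask and block-size limits are illustrative assumptions. */
#include <stdint.h>
#include <stddef.h>

#define WIN_SZ   48        /* sliding window size in bytes                */
#define MASK     0x1FFF    /* (hash & MASK) == 0  =>  ~8 KB average block */
#define MIN_BLK  2048      /* lower bound on block size                   */
#define MAX_BLK  65536     /* upper bound: avoids oversized blocks        */

static uint32_t rotl32(uint32_t v, int n) { return (v << n) | (v >> (32 - n)); }

/* Per-byte random values for the rolling hash, filled once by xorshift32. */
static uint32_t gear[256];
static void gear_init(void)
{
    uint32_t x = 0x9E3779B9u;
    for (int i = 0; i < 256; i++) {
        x ^= x << 13; x ^= x >> 17; x ^= x << 5;
        gear[i] = x;
    }
}

/* Return the length of the next chunk starting at buf[0]. */
size_t cdc_next_chunk(const uint8_t *buf, size_t len)
{
    static int ready;
    if (!ready) { gear_init(); ready = 1; }

    if (len <= MIN_BLK)
        return len;                         /* too little data: one block */

    uint32_t h = 0;
    size_t limit = len < MAX_BLK ? len : MAX_BLK;

    for (size_t i = 0; i < limit; i++) {
        h = rotl32(h, 1) ^ gear[buf[i]];           /* byte enters window  */
        if (i >= WIN_SZ)                           /* byte leaves window  */
            h ^= rotl32(gear[buf[i - WIN_SZ]], WIN_SZ % 32);
        if (i + 1 >= MIN_BLK && (h & MASK) == 0)   /* fingerprint hit     */
            return i + 1;
    }
    return limit;                /* forced boundary at MAX_BLK or at EOF  */
}

Splitting a whole file is then just a loop that repeatedly calls cdc_next_chunk on the remaining data and records each block's offset, length, and checksums.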
The sliding-block algorithm combines the advantages of fixed-size and CDC chunking; the block size is fixed. It computes the weak checksum of a fixed-size window and, if that matches, computes the strong MD5 checksum; if both match, the window position is treated as a block boundary. The data fragment in front of such a block also becomes a block, and its length is variable. If the window slides a full block size without finding a match, that position is also recorded as a block boundary. The sliding-block algorithm handles insertions and deletions very efficiently and detects more redundant data than CDC, but its drawback is that it tends to produce data fragments.

6. Differential encoding
Differential encoding takes file A and the block information of file B as input. It first chunks file A and then matches file A's blocks against file B's block information. If a block matches, only a reference to the matching block of file B is emitted, which achieves the deduplication effect; otherwise the corresponding data of file A is written into the differential encoding file. For the matching itself, fixed-size and CDC chunking treat the two files symmetrically: file A is split with exactly the same algorithm as file B. The sliding-block variant is different and resembles rsync: file B is split into fixed-size blocks while file A is scanned with a sliding window, so the two sides do unequal amounts of work. In all cases a hash table is built from file B's block information, matching is performed by hash lookup (weak checksum first, then strong checksum), and the output file is constructed according to the differential encoding layout described above. A simplified encoder is sketched below.
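Below is a simplified differential encoder for the symmetric (fixed-size) case, written against the chunk_block_entry and delta_block_entry structures defined in section 4. The helpers weak_sum(), md5_sum() and find_block() are hypothetical placeholders for the weak checksum, the MD5 computation and the hash-table lookup over file B's block list; the delta file header, error handling and the sliding-block variant are omitted.

/* Simplified differential encoder (fixed-size chunking case).
 * Relies on the chunk_block_entry / delta_block_entry definitions above;
 * weak_sum(), md5_sum() and find_block() are hypothetical helpers. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

uint32_t weak_sum(const uint8_t *buf, size_t len);               /* weak checksum   */
void     md5_sum(const uint8_t *buf, size_t len, uint8_t *md5);  /* strong checksum */
const chunk_block_entry *find_block(uint32_t weak, const uint8_t *md5);

void delta_encode(FILE *file_a, FILE *delta, uint32_t block_sz)
{
    uint8_t buf[64 * 1024];                /* assumes block_sz <= 64 KB */
    size_t  n;

    while ((n = fread(buf, 1, block_sz, file_a)) > 0) {
        uint8_t md5[16];
        md5_sum(buf, n, md5);
        const chunk_block_entry *hit = find_block(weak_sum(buf, n), md5);

        delta_block_entry e;
        memset(&e, 0, sizeof(e));
        e.len = (uint32_t)n;

        if (hit != NULL) {
            e.embeded = 0;                 /* matched: reference the block in file B   */
            e.offset  = hit->offset;
            fwrite(&e, sizeof(e), 1, delta);
        } else {
            e.embeded = 1;                 /* no match: embed literal data from file A */
            e.offset  = (uint64_t)ftell(delta) + sizeof(e);
            fwrite(&e, sizeof(e), 1, delta);
            fwrite(buf, 1, n, delta);
        }
    }
}

With this layout, each matched block costs only one small entry on the wire, while unmatched data is shipped once as a literal; that is exactly the deduplication effect described above.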

7. File synchronization
Beta receives the differential encoding file delta and, combined with the existing file B, synchronizes it into a copy of file A. The synchronization algorithm traverses the delta file, reads each block description entry, reads the corresponding data from either the delta file or file B according to the embeded flag, and reconstructs file A block by block. A sketch follows.
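A minimal reconstruction loop might look as follows, again written against the delta_block_entry layout from section 4; the delta file header and the trailing last block are ignored, error handling is omitted, and copy_range() is a hypothetical helper that copies len bytes starting at a given offset of one file to the current position of another.

/* Simplified file synchronization: rebuild file A from delta + file B.
 * Relies on the delta_block_entry definition above; header handling and
 * error checking are omitted.  copy_range() is a hypothetical helper. */
#include <stdio.h>
#include <stdint.h>

/* Copy len bytes starting at offset in src to the current position of dst. */
void copy_range(FILE *src, uint64_t offset, uint32_t len, FILE *dst);

void file_sync(FILE *delta, FILE *file_b, FILE *file_a_out)
{
    delta_block_entry e;

    while (fread(&e, sizeof(e), 1, delta) == 1) {
        if (e.embeded) {
            /* literal data stored in the delta file right after this entry */
            copy_range(delta, e.offset, e.len, file_a_out);
            fseek(delta, (long)(e.offset + e.len), SEEK_SET);  /* skip past it */
        } else {
            /* matched block: fetch it from the existing local file B */
            copy_range(file_b, e.offset, e.len, file_a_out);
        }
    }
}

After the loop, file_a_out holds a copy of file A built mostly from data that was already present locally in file B.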

8. Pull and push modes
Data synchronization has two application modes, pull and push: pull synchronizes remote data to the local machine, and push synchronizes local data to the remote machine. In terms of the synchronization algorithm, the main difference between them is where the chunking and the differential encoding are performed. The steps of the two modes are described below.
Pull synchronization mode:
1. The local side chunks file A and generates the block description file chunk.
2. The chunk file is uploaded to the remote server.
3. The remote server performs differential encoding of file B against it and generates the delta file.
4. The delta file is downloaded to the local side.
5. The local side synchronizes file A to file B, which is equivalent to downloading file B onto the local file A.


Push synchronization mode:
1. The remote server chunks file B and generates the block description file chunk.
2. The chunk file is downloaded to the local side.
3. The local side performs differential encoding of file A against it and generates the delta file.
4. The delta file is uploaded to the remote server.
5. The remote server synchronizes file B to file A, which is equivalent to uploading the local file A onto the remote file B.

This article is from a CSDN blog; when reproducing it, please indicate the source: http://blog.csdn.net/liuben/archive/2010/08/06/5793706.aspx
