Research on Data Synchronization Algorithms

Source: Internet
Author: User
Tags: sha1, hash, dedupe

1. Introduction

Data transmission and synchronization are common in LAN- and WAN-based network applications such as remote data mirroring, backup, replication, synchronization, download, upload, and sharing. The simplest approach is to copy the data in full. However, after data has been replicated many times over the network, a large number of replicas exist, and in many cases these replicas differ only slightly from one another, typically because they evolved from the same file version. Copying whole files therefore consumes a great deal of network bandwidth and makes synchronization slow. WAN bandwidth and access latency remain pressing problems, and full replication prevents many network applications, such as distributed file systems (DFS) and cloud storage, from delivering good quality of service. Rsync and RDC (Remote Differential Compression) are two of the most common data synchronization algorithms; they transmit only the differences between files, saving network bandwidth and improving efficiency. Building on these two algorithms, this paper applies de-duplication (dedupe) technology to study and analyze data synchronization algorithms in depth and develops a prototype system. It first introduces the rsync and RDC algorithms, then describes the algorithm design and the corresponding data structures in detail, focusing on file splitting, differential encoding, and file synchronization, and finally introduces the two application modes, push and pull.

2. Related Work

Rsync is an efficient remote file replication (synchronization) tool for Unix-like environments. It uses the well-known rsync algorithm to optimize the process, reducing the amount of data communicated and improving file transmission efficiency. Assume there are two computers, alpha and beta; alpha can access file A, beta can access file B, files A and B are very similar, and alpha and beta are connected through a low-speed network. The general process is as follows (for details, see tech_report.ps by the rsync author Andrew Tridgell):
1. Beta splits file B into consecutive, non-overlapping, fixed-size data blocks of S bytes; the last block may be smaller than S bytes;
2. Beta computes two checksums for each data block: a 32-bit weak rolling checksum and a 128-bit strong MD4 checksum;
3. Beta sends these checksums to alpha;
4. Alpha searches file A for all data blocks of size S bytes (at arbitrary offsets, not only multiples of S) whose weak and strong checksums both match those of one of file B's blocks; the rolling checksum is what makes this search fast (see the sketch after this list);
5. Alpha sends beta the instructions for reconstructing file A; each instruction is either a reference to a matching data block of file B or a literal data block of file A (no match).
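To make the weak rolling checksum of steps 2 and 4 concrete, here is a minimal sketch in C. The function names and the plain 16-bit wraparound arithmetic are illustrative assumptions, not rsync's exact implementation.

#include <stdint.h>
#include <stddef.h>

/* Weak checksum of the block buf[0..len-1]: two 16-bit sums packed into 32 bits. */
static uint32_t weak_checksum(const uint8_t *buf, size_t len)
{
    uint32_t a = 0, b = 0;
    for (size_t i = 0; i < len; i++) {
        a += buf[i];                        /* running byte sum           */
        b += (uint32_t)(len - i) * buf[i];  /* position-weighted byte sum */
    }
    return (a & 0xffff) | ((b & 0xffff) << 16);
}

/* Slide the window one byte: remove out_byte on the left, add in_byte on the
 * right. This is what lets alpha test every byte offset of file A cheaply. */
static uint32_t roll_checksum(uint32_t sum, uint8_t out_byte, uint8_t in_byte, size_t len)
{
    uint32_t a = sum & 0xffff;
    uint32_t b = sum >> 16;
    a = (a - out_byte + in_byte) & 0xffff;
    b = (b - (uint32_t)len * out_byte + a) & 0xffff;
    return a | (b << 16);
}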
Rsync is a very good tool, but it still has some shortcomings:
1. Although the rolling checksum saves a great deal of checksum computation, and the checksum lookup is also optimized, it still incurs more than double the hash lookups, which is a significant cost;
2. In the rsync algorithm, the computational load of alpha and beta is unequal: alpha does far more work than beta. Since alpha is usually the server, this puts heavy pressure on it;
3. The data block size in rsync is fixed, so its adaptability to data changes is limited.
A typical example of the RDC algorithm is DFSR (Distributed File System Replication) in Microsoft DFS. It differs from rsync in that it splits both the source and the target file of the replication with consistent chunking rules, so the RDC computation on the source and target ends is roughly equal. The two algorithms also differ in emphasis: rsync pursues the highest possible rate of duplicate-data discovery, even at the cost of extra computation, while RDC makes a compromise, aiming to discover data differences quickly with a small amount of computation, at the price of finding less duplicate data than rsync. In addition, rsync uses a fixed-size chunking policy, while RDC uses a variable-size chunking policy.

3. Deduplication Technology

De-duplication (dedupe) is a new and popular storage technology that can greatly reduce the amount of data stored. It eliminates redundancy by keeping only one copy of duplicate data in a dataset. With dedupe technology, you can improve the efficiency of the storage system, effectively save costs, and reduce the network bandwidth needed during transmission. It is also a green storage technology that can effectively reduce energy consumption.
Dedupe can be divided into file-level and block-level deduplication according to its granularity. File-level dedupe is also called Single Instance Storage (SIS), while block-level dedupe works at a finer granularity, typically 4 KB to 24 KB. Block-level products obviously achieve a higher deduplication rate, so current mainstream dedupe products work at the block level. All files are split into data blocks (of fixed or variable length), and an MD5 or SHA-1 hash is computed over each block to obtain its fingerprint (FP); two or more hash algorithms, or a hash combined with a CRC check, can be used together to make the probability of a collision negligible. Data blocks with the same fingerprint are considered identical, and only one copy needs to be kept in the storage system. A physical file thus corresponds to a logical representation in the storage system consisting of a list of FP metadata. When the file is read, the logical file is read first, and then the corresponding data blocks are fetched from the storage system according to the FP sequence to reconstruct the physical file copy.
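As a concrete illustration of block fingerprinting, here is a minimal sketch that computes the MD5 fingerprint of a single data block with OpenSSL (SHA-1 could be used the same way); the sample block content and the use of main() are purely illustrative.

#include <stdio.h>
#include <stdint.h>
#include <openssl/md5.h>

int main(void)
{
    const uint8_t block[] = "example data block";
    uint8_t fp[MD5_DIGEST_LENGTH];

    MD5(block, sizeof(block) - 1, fp);        /* 16-byte fingerprint of the block */

    for (int i = 0; i < MD5_DIGEST_LENGTH; i++)
        printf("%02x", fp[i]);                /* print the FP as hex */
    printf("\n");
    return 0;
}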
Currently, dedupe is mainly used for data backup, because repeated backups of the same data leave a large amount of duplication, which suits this technology very well. In fact, dedupe can be applied in many scenarios, including online, nearline, and offline data storage systems, and can be implemented in file systems, volume managers, NAS, and SAN. It can also be used for network data transmission and for data packaging. Dedupe technology can help many applications reduce data storage, save network bandwidth, improve storage efficiency, shorten backup windows, and save energy.

4. Data Synchronization Algorithm

As with rsync, assume there are two computers, alpha and beta; alpha can access file A, beta can access file B, files A and B are very similar, and the two machines are connected through a low-speed network. The data synchronization algorithm based on dedupe technology is similar to rsync and proceeds as follows:
1. Beta splits file B into data blocks of equal or unequal size using a data splitting algorithm such as FSP (fixed-size partition) or CDC (content-defined chunking);
2. For each data block, beta computes a weak checksum (similar to rsync's) and a strong MD5 checksum, and records the block length len and its offset in file B;
3. Beta sends this data block information to alpha;
4. Alpha splits file A into data blocks with the same splitting technique, searches the block information sent by beta for matches, and generates the differential encoding information;
5. Alpha sends the differential encoding information to beta, together with the instruction to reconstruct file A;
6. Beta reconstructs file A from the differential encoding information and file B.
The algorithm above involves several key problems: file splitting, the description of data block information, differential encoding, the description of differential encoding information, and file synchronization. File splitting, differential encoding, and file synchronization are covered in the following sections; here we describe the layouts of the data block information and the differential encoding information.
The data file carrying the block information consists of a file header (chunk_file_header) followed by a set of block description entries (chunk_block_entry), defined below. The header records the block size and the total number of data blocks of file B. It is followed by one entry per data block; each entry gives the block length, the block's offset in file B, the weak checksum, and the strong MD5 checksum.
/* Define chunk file header and block entry */
typedef struct _chunk_file_header {
    uint32_t block_sz;
    uint32_t block_nr;
} chunk_file_header;
#define chunk_file_header_sz (sizeof(chunk_file_header))

typedef struct _chunk_block_entry {
    uint64_t offset;
    uint32_t len;
    uint8_t  md5[16 + 1];
    uint8_t  csum[10 + 1];
} chunk_block_entry;
#define chunk_block_entry_sz (sizeof(chunk_block_entry))
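As a hypothetical usage sketch of this layout, building on the definitions above, beta might serialize the block information like this (the function name and the omission of error handling for the writes are assumptions of this sketch):

#include <stdio.h>

/* Write the chunk description file: the header first, then one entry per block. */
static int write_chunk_file(const char *path, const chunk_file_header *hdr,
                            const chunk_block_entry *entries)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;
    fwrite(hdr, chunk_file_header_sz, 1, fp);                  /* block size and block count */
    fwrite(entries, chunk_block_entry_sz, hdr->block_nr, fp);  /* per-block descriptors      */
    fclose(fp);
    return 0;
}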
The data file carrying the differential encoding information likewise consists of a file header (delta_file_header) followed by a set of block description entries (delta_block_entry), as defined below. The header records the total number of data blocks of file A and the length and offset of the last data block. It is followed by one entry per data block; each entry gives the block's length, offset, and location flag. If embeded is 1, the block's data is located at the given offset within the differential encoding file itself; if embeded is 0, the block is located at the given offset within file B. The last data block of file A is stored at the end of the differential encoding file, and its length and offset are recorded in the header.
/* Define delta file header and block entry */
typedef struct _delta_file_header {
    uint32_t block_nr;
    uint32_t last_block_sz;
    uint64_t last_block_offset;  /* offset in delta file */
} delta_file_header;
#define delta_file_header_sz (sizeof(delta_file_header))

typedef struct _delta_block_entry {
    uint64_t offset;
    uint32_t len;
    uint8_t  embeded;  /* 1: block in delta file; 0: block in source file */
} delta_block_entry;
#define delta_block_entry_sz (sizeof(delta_block_entry))
For better real-time performance, the block information and the differential encoding information do not have to be written to files; they may be kept in a cache, but the data layout is the same as described above.

5. File Splitting

In dedupe technology, there are three main data splitting algorithms: fixed-size partition, content-defined chunking (CDC), and sliding block. The fixed-size partition algorithm splits the file into blocks of a predefined size and computes a weak checksum and a strong MD5 checksum for each block. The weak checksum is used mainly to improve differential encoding performance: the weak checksum is computed first and used for a hash lookup, and only when a match is found is the MD5 strong checksum computed for a further hash lookup. Because computing the weak checksum is much cheaper than MD5, this effectively improves encoding performance. The fixed-size partition algorithm is simple and fast, but it is very sensitive to data insertion and deletion, handles such changes inefficiently, and cannot adjust or optimize itself according to content changes.
The CDC algorithm is a variable-size chunking algorithm that uses data fingerprints (such as Rabin fingerprints) to split the file into blocks of varying length. Unlike fixed-size chunking, it places block boundaries according to the file content, so block sizes can vary. During execution, CDC slides a fixed-size window (for example, 48 bytes) over the file data and computes a fingerprint of the window; when the fingerprint satisfies a preset condition, for example when its value modulo a given integer equals a preset number, the window position is taken as a block boundary. The CDC algorithm can degenerate on unsuitable data: the fingerprint condition may never be met, no boundary can be determined, and a block grows excessively large. In practice this is solved by bounding the block size with upper and lower limits. The CDC algorithm is insensitive to changes in file content: inserting or deleting data affects only a few nearby blocks, while all other blocks are untouched. CDC also has drawbacks: the block granularity is difficult to choose, since too fine a granularity costs too much overhead while too coarse a granularity weakens the dedupe effect, and it is hard to balance the two.
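The following is a minimal sketch of CDC splitting that uses a Rabin-Karp style rolling hash over a 48-byte window with upper and lower block-size bounds; real systems typically use true Rabin fingerprints, and the base, mask, and size limits chosen here are illustrative assumptions.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define WINDOW_SZ  48
#define BASE       257u
#define CHUNK_MASK 0x1fffu   /* expected average block size around 8 KB */
#define MIN_BLOCK  2048
#define MAX_BLOCK  65536

static void cdc_split(const uint8_t *data, size_t len)
{
    uint32_t base_pow = 1;
    for (int i = 0; i < WINDOW_SZ; i++)
        base_pow *= BASE;                             /* BASE^WINDOW_SZ mod 2^32 */

    uint32_t hash = 0;
    size_t chunk_start = 0;

    for (size_t i = 0; i < len; i++) {
        hash = hash * BASE + data[i];                 /* roll the new byte in  */
        if (i >= WINDOW_SZ)
            hash -= base_pow * data[i - WINDOW_SZ];   /* roll the old byte out */

        size_t chunk_len = i + 1 - chunk_start;
        /* Cut when the fingerprint hits the preset value, or when the upper
         * block-size bound is reached (handles the degenerate case). */
        if ((chunk_len >= MIN_BLOCK && (hash & CHUNK_MASK) == CHUNK_MASK) ||
            chunk_len >= MAX_BLOCK) {
            printf("chunk: offset=%zu len=%zu\n", chunk_start, chunk_len);
            chunk_start = i + 1;
        }
    }
    if (chunk_start < len)                            /* trailing chunk */
        printf("chunk: offset=%zu len=%zu\n", chunk_start, len - chunk_start);
}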
The sliding block algorithm combines the advantages of fixed-size partitioning and CDC, and uses a fixed block size. It first computes the weak checksum for a fixed-size window of data; if that matches, it then computes the MD5 strong checksum, and if both match, the window position is treated as a block boundary. The data fragment in front of that block also becomes a block, and its length is not fixed. If the window slides a full block length without finding a match, that position is also taken as a block boundary. The sliding block algorithm is very efficient at handling insertions and deletions and detects more redundant data than CDC; its disadvantage is that it tends to produce data fragments.

6. Differential Encoding

Differential encoding takes file B's data block information and file A as input. It first splits file A into data blocks in the corresponding way (except for the sliding block algorithm, where file B is split with a fixed-size algorithm and file A with the sliding block algorithm), and then matches the blocks against file B's block information. If a block matches, it is represented by a reference to the corresponding block of file B, which yields the deduplication effect; otherwise the block's data from file A is written into the differential encoding file. In terms of block matching, fixed-size partitioning and CDC work in essentially the same way: file A is split with the same algorithm as file B. The sliding block algorithm differs from the other two and resembles rsync: file B is split with a fixed-size algorithm and file A with the sliding block algorithm, so the computation on the two sides is not equal. A hash table is then built from file B's block information, matching is performed by hash lookup, and the differential encoding file is written according to the layout defined above.
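Below is a hypothetical sketch of matching one data block of file A against file B's block descriptions, reusing the chunk_block_entry layout and the weak_checksum() sketch shown earlier plus OpenSSL's MD5(). It uses a linear scan for brevity where the real encoder would use a hash table, and it assumes the first four bytes of the csum field hold the 32-bit weak checksum in host byte order; the function name is also an assumption.

#include <string.h>
#include <openssl/md5.h>

/* Return the index of the matching block entry of file B, or -1 if the block
 * of file A is new and must be embedded in the differential encoding file. */
static int match_block(const uint8_t *block, uint32_t len,
                       const chunk_block_entry *entries, uint32_t nr)
{
    uint32_t weak = weak_checksum(block, len);
    uint8_t md5[MD5_DIGEST_LENGTH];
    int have_md5 = 0;

    for (uint32_t i = 0; i < nr; i++) {
        if (entries[i].len != len)
            continue;
        if (memcmp(entries[i].csum, &weak, sizeof(weak)) != 0)
            continue;                       /* cheap weak-checksum test first    */
        if (!have_md5) {
            MD5(block, len, md5);           /* compute the strong checksum once  */
            have_md5 = 1;
        }
        if (memcmp(entries[i].md5, md5, MD5_DIGEST_LENGTH) == 0)
            return (int)i;                  /* strong match: reference block i of file B */
    }
    return -1;
}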

7. File Synchronization

On beta, once the differential encoding file delta has been obtained, it can be combined with the existing file B to synchronize a copy of file A. The synchronization algorithm traverses the delta file, reads each block description entry, reads the corresponding data from the delta file or from file B according to the embeded flag, and thereby reconstructs file A, as sketched below.
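Here is a minimal sketch of that reconstruction on beta, building on the delta layout defined in section 4; error handling is trimmed and the function names are illustrative assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Copy len bytes starting at offset in src to the current position of dst. */
static void copy_block(FILE *src, uint64_t offset, uint32_t len, FILE *dst)
{
    uint8_t *buf = malloc(len);
    fseek(src, (long)offset, SEEK_SET);
    fread(buf, 1, len, src);
    fwrite(buf, 1, len, dst);
    free(buf);
}

/* delta: differential encoding file; fb: local file B; fa: reconstructed file A. */
static int sync_file(FILE *delta, FILE *fb, FILE *fa)
{
    delta_file_header hdr;
    fread(&hdr, delta_file_header_sz, 1, delta);

    for (uint32_t i = 0; i < hdr.block_nr; i++) {
        delta_block_entry ent;
        fread(&ent, delta_block_entry_sz, 1, delta);
        long next = ftell(delta);                        /* position of the next entry */

        if (ent.embeded)
            copy_block(delta, ent.offset, ent.len, fa);  /* literal data inside delta  */
        else
            copy_block(fb, ent.offset, ent.len, fa);     /* matched block from file B  */

        fseek(delta, next, SEEK_SET);
    }
    /* The final (possibly short) block of file A is described in the header. */
    if (hdr.last_block_sz > 0)
        copy_block(delta, hdr.last_block_offset, hdr.last_block_sz, fa);
    return 0;
}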

8. Pull and Push Modes

Data synchronization has two application modes: pull and push. Pull synchronizes remote data to the local machine, while push synchronizes local data to the remote machine. In terms of the synchronization algorithm, the main difference is where the chunking and the differential encoding take place. The pull and push synchronization procedures are as follows.
Pull synchronization mode process:
1. Split file A locally and generate the block description file chunk;
2. Upload the chunk file to the remote server;
3. The remote server performs differential encoding of file B against the chunk file and generates the differential encoding file delta;
4. Download the delta file to the local machine;
5. Locally reconstruct file B from file A and the delta file, synchronizing file A to file B; this is equivalent to downloading file B to the local machine.

Push synchronization mode process:
1. The remote server splits file B and generates the block description file chunk;
2. Download the chunk file to the local machine;
3. Locally perform differential encoding of file A against the chunk file and generate the delta file;
4. Upload the delta file to the remote server;
5. The remote server reconstructs file A from file B and the delta file, synchronizing file B to file A; this is equivalent to uploading file A to the remote server.
