Automatic data segmentation during deduplication

Last Update:2013-12-28 Source: Internet

Author: User

Tags: sha1 hash, sha1 hash algorithm


Deduplication is widely used in data backup. We have found that, for backup applications, removing duplicate data can shrink the stored data by a factor of about 20, which saves a great deal of storage space. How can duplicate data blocks be found? If blocks were compared byte by byte, the performance of the whole system would be unacceptable. To solve this problem, data fingerprinting can be used to find duplicate blocks.

The principle of this deduplication method is very simple. First, the input data stream is split into segments. The simplest way to do this is to split it into fixed-length data blocks. Once a block has been cut, its fingerprint is calculated and the system is searched for an existing block with the same fingerprint. If one exists, the input block is discarded. Otherwise, the block is stored in the system.

This principle shows that the fingerprint is the key to deduplication, and it raises three problems. First, computing fingerprints from data blocks takes a lot of computation, which reduces throughput on the write path. Second, different data blocks may have the same fingerprint. In practice the SHA-1 hash algorithm is usually used to compute fingerprints, but hash collisions can still occur. If a collision is not handled, a new block is wrongly discarded as a duplicate and the application loses data. Third, once the fingerprint has been computed, the system must still be searched for a block with the same fingerprint. On large-capacity storage this lookup is very time-consuming, so fast lookup is the key to online deduplication.
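The store-or-discard loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the DedupStore class, its in-memory dict index, and the 4 KB default block size are all assumptions made for the example. On a fingerprint match it also compares the block contents, which is one possible way to detect a SHA-1 collision instead of silently dropping data.

```python
import hashlib


class DedupStore:
    """Toy in-memory block store indexed by SHA-1 fingerprint (hypothetical)."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # fingerprint -> block contents

    def write(self, data):
        """Split data into fixed-size blocks and return the list of fingerprints."""
        recipe = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            fp = hashlib.sha1(block).digest()
            stored = self.blocks.get(fp)
            if stored is None:
                self.blocks[fp] = block      # new block: store it
            elif stored != block:
                # Same fingerprint, different contents: a hash collision.
                raise RuntimeError("fingerprint collision detected")
            # Otherwise the block is a duplicate and is not stored again.
            recipe.append(fp)
        return recipe

    def read(self, recipe):
        """Rebuild the original data from a list of fingerprints."""
        return b"".join(self.blocks[fp] for fp in recipe)


store = DedupStore(block_size=4)
recipe = store.write(b"aaaabbbbaaaacc")
print(len(recipe), len(store.blocks))
```

Here the input is cut into four blocks, but the repeated block is stored only once. The dict lookup stands in for the fingerprint index; at real storage scale that index no longer fits in memory, which is why the lookup becomes the bottleneck.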

Deduplication is easy to understand in principle, but building an effective system still involves many difficulties. Here we first discuss how to segment the data stream automatically. As mentioned above, the simplest method is fixed-size chunking, in which the input data stream is split according to a fixed window size. This method is simple, but it hurts deduplication efficiency. In a typical backup application, a full backup is performed once a week and an incremental backup every day in between. Each full backup differs from the previous one in only a small part of its data. If the data is split with a fixed window size, an insertion or deletion shifts every later block boundary, so the blocks after the change are all different from those of the previous backup, even though only a local region of the data actually changed. Fixed-size chunking therefore cannot remove duplicate data efficiently.
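The boundary-shift problem is easy to reproduce. The sketch below splits two versions of a buffer into fixed 64-byte blocks and counts the fingerprints they share. The sample data, the block size, and the helper names are chosen for illustration only.

```python
import hashlib


def fixed_chunks(data, size):
    """Split data into consecutive blocks of `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprints(chunks):
    """Return the set of SHA-1 fingerprints of the given blocks."""
    return {hashlib.sha1(c).digest() for c in chunks}


# Deterministic pseudo-random sample data (4,000 bytes).
old = b"".join(hashlib.sha1(str(i).encode()).digest() for i in range(200))
new = old[:10] + b"X" + old[10:]   # a single byte inserted near the start

old_fps = fingerprints(fixed_chunks(old, 64))
new_fps = fingerprints(fixed_chunks(new, 64))
shared = old_fps & new_fps
print(len(old_fps), len(new_fps), len(shared))
```

Although the two versions differ by a single byte, the set of shared fingerprints is empty: every block after the insertion starts one byte later than before, so none of them matches a block of the old version.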

To solve this problem, we need content-aware chunking, a chunking algorithm with a variable block size. With this method, if the backup data changes only slightly compared with the previous backup, only a few data blocks differ and most blocks are cut exactly as before, which allows duplicate data to be removed efficiently.

The most common chunking algorithm is the sliding window method. The figure below shows the algorithm:

[Figure: 222.jpg, http://www.bkjia.com/uploads/allimg/131228/004JW4W-0.jpg]

First, define a data block window. Starting from the beginning of the data stream, calculate the fingerprint value A of the data in the window, then take A modulo M (A % M, where M is a fixed feature value) to get B. Finally, compare B with the expected value C (C is a fixed value of the system). If they are equal, an anchor point has been found and a segment is obtained. If they are not equal, move the window by an offset (the offset is also fixed), calculate the fingerprint of the next window, and repeat the process until an anchor point is found. This is the idea behind content-aware automatic segmentation. The figure gives a visual description: we assume the feature value is abcdef, so the algorithm looks for the abcdef points in the stream and uses them as anchor points. With this algorithm, segment sizes vary. To prevent segments from being too large or too small, you can set a maximum and a minimum size. In addition, different feature values, offset values, and expected values yield different segmentation results.
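The anchor search just described can be sketched as follows. The fingerprint function is an assumption: the text does not say how A is computed, so CRC-32 from Python's zlib module stands in for it here and is recomputed for each window (real systems typically use a rolling hash such as a Rabin fingerprint, so that each slide costs constant time). The function name, the parameter names, and the default values are likewise illustrative.

```python
import zlib


def cdc_chunks(data, window=16, modulus=64, expected=0, offset=1,
               min_size=32, max_size=1024):
    """Split data at content-defined anchor points.

    A cut is made at the end of a window whose fingerprint A satisfies
    A % modulus == expected, provided the segment has reached min_size.
    A cut is forced once the segment reaches max_size.
    """
    chunks = []
    start = 0            # start of the current segment
    pos = 0              # start of the current window
    n = len(data)
    while start < n:
        end = pos + window                 # candidate cut point
        if end >= n:
            chunks.append(data[start:])    # tail: no further cut possible
            break
        size = end - start
        is_anchor = zlib.crc32(data[pos:end]) % modulus == expected
        if size >= max_size or (size >= min_size and is_anchor):
            chunks.append(data[start:end])
            start = pos = end              # next window starts at the anchor
        else:
            pos += offset                  # slide the window by the fixed offset
    return chunks
```

Concatenating the returned chunks always reproduces the input. With an offset of 1 no chunk exceeds max_size, and every chunk except possibly the last is at least min_size bytes long. A larger modulus makes anchors rarer and therefore produces larger segments on average.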

When the input data stream changes, segmentation based on the above algorithm still produces results similar to the previous ones.

[Figure: 22.jpg, http://www.bkjia.com/uploads/allimg/131228/004JU1M-1.jpg]

As shown in the figure, when some characters are added to segment 0, the original anchor point may no longer be detected by the sliding window. However, the subsequent segments can still be found. In this situation, I think the offset has a great impact on the detection results. If the offset is too small, the amount of computation is huge. If the offset is too large, the segmentation result is not ideal.
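The effect of the offset can be explored with a small experiment. The sketch below is a simplified chunker (no minimum or maximum segment size) that uses CRC-32 as a stand-in fingerprint; the constants, helper names, and sample data are assumptions for illustration. It also assumes that the window restarts at each anchor point, which the text does not specify. Three bytes are inserted into the first segment, and the code counts how many chunk fingerprints the old and new streams share for an offset of 1 and for an offset of 8.

```python
import hashlib
import zlib

WINDOW, MODULUS, EXPECTED = 16, 64, 0   # hypothetical parameters


def chunk(data, offset):
    """Cut after each window whose CRC-32 fingerprint % MODULUS == EXPECTED."""
    chunks, start, pos = [], 0, 0
    while pos + WINDOW < len(data):
        end = pos + WINDOW
        if zlib.crc32(data[pos:end]) % MODULUS == EXPECTED:
            chunks.append(data[start:end])
            start = pos = end        # next window starts at the anchor point
        else:
            pos += offset            # slide the window by the fixed offset
    if start < len(data):
        chunks.append(data[start:])  # remaining tail
    return chunks


def shared_chunks(old, new, offset):
    """Count distinct chunk fingerprints that both versions have in common."""
    old_fps = {hashlib.sha1(c).digest() for c in chunk(old, offset)}
    new_fps = {hashlib.sha1(c).digest() for c in chunk(new, offset)}
    return len(old_fps & new_fps)


# Deterministic pseudo-random sample data (20,000 bytes).
old = b"".join(hashlib.sha1(str(i).encode()).digest() for i in range(1000))
new = old[:10] + b"XYZ" + old[10:]   # 3 bytes inserted into the first segment

shared_fine = shared_chunks(old, new, offset=1)
shared_coarse = shared_chunks(old, new, offset=8)
print(shared_fine, shared_coarse)
```

With an offset of 8 the shared count is 0 in this setup: every cut falls at a multiple of 8 bytes from the start of the stream, and the 3-byte insertion moves all of the original anchors off that grid, so the chunker never resynchronizes. With an offset of 1 every byte position is examined, so the chunker can realign with the old anchors after the insertion, at the cost of roughly eight times as many fingerprint computations per byte scanned.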

Automatic segmentation of the data stream is the first step in deduplication. Research into and optimization of efficient content-aware chunking algorithms has a great impact on deduplication efficiency.

This article is from the "Storage path" blog, please be sure to keep this source http://alanwu.blog.51cto.com/3652632/1275636
