Research on data deduplication (de-duplication) technology (dedup util, published on SourceForge)

Tags: deduplication, sha1, rsync, dedup

Dedup util is an open source, lightweight file packaging tool based on block-level deduplication technology. It effectively reduces data volume and saves storage space. The project is hosted on SourceForge, and its source code is updated continuously. The internal layout of a data package produced by the tool is as follows:

--------------------------------------------------
| header | Unique Block Data | File Metadata |
--------------------------------------------------

A package consists of three parts: the header, the unique block data, and the logical file metadata. The header is a struct that records metadata such as the data block size, the number of unique data blocks, the size of a block ID, the number of files in the package, and the location of the file metadata within the package. All unique data blocks are stored immediately after the header; their size and count are given by the header fields. After the data blocks comes the metadata that logically represents the files in the package. It consists of multiple entries, one entry per file, as shown in the structure below. When unpacking, the data blocks are fetched according to each file's metadata and the original physical file is reconstructed.
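For illustration, the header can be pictured as a C struct like the sketch below; the struct and field names (dedup_package_header and so on) are hypothetical and may not match the actual deduputil definitions.

/* Hedged sketch of the package header described above; field names and
 * types are illustrative, not deduputil's actual definition. */
#include <stdint.h>
#include <stdio.h>

struct dedup_package_header {
    uint32_t block_size;       /* size of a regular data block, in bytes      */
    uint32_t block_num;        /* number of unique data blocks in the package */
    uint32_t block_id_size;    /* size of one block ID, in bytes              */
    uint32_t file_num;         /* number of logical files in the package      */
    uint64_t metadata_offset;  /* where the file metadata area begins         */
};

int main(void)
{
    /* the header sits at the front of the package, followed by the unique
       blocks and then the file metadata */
    printf("header size: %zu bytes\n", sizeof(struct dedup_package_header));
    return 0;
}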

The metadata representation of the logical file:

-----------------------------------------------------------------
| Entry Header | Pathname | Entry Data | Last Block Data |
-----------------------------------------------------------------

The entry header of a logical file records the file name length, the number of blocks, the block ID size, and the size of the last block. It is followed by the file name data, whose length is given in the entry header. After the file name comes the list of unique block IDs, each of which corresponds one-to-one with a block in the unique block set. Finally, the last data block of the file is stored verbatim: because it is usually smaller than a regular block and very unlikely to repeat, it is saved separately.
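Correspondingly, the entry header can be sketched as follows; again, the names are hypothetical rather than the actual deduputil definitions.

/* Hedged sketch of one logical-file entry; names are illustrative only. */
#include <stdint.h>

struct dedup_entry_header {
    uint32_t path_len;         /* length of the pathname that follows          */
    uint32_t block_num;        /* number of block IDs in the entry data        */
    uint32_t block_id_size;    /* size of one block ID, in bytes               */
    uint32_t last_block_size;  /* size of the trailing short block, in bytes   */
};

/* On disk, each entry is laid out as:
 *   struct dedup_entry_header
 *   char      pathname[path_len]
 *   block IDs entry_data[block_num]        -- indices into the unique block set
 *   char      last_block[last_block_size]  -- the short tail, stored verbatim
 */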

For more information, see http://blog.csdn.net/liuben/archive/2010/01/09/5166538.aspx

Dedup util is currently in the pre-alpha development stage and supports packaging files, unpacking, appending files, deleting files, listing the files in a package, and other functions. Preliminary tests show that deduplication can significantly reduce the size of the data package; even when the data does not have an obviously high repetition rate, the resulting package is smaller than one produced by the tar tool.

Source

Project URL: https://sourceforge.net/projects/deduputil

SVN repository URL: https://deduputil.svn.sourceforge.net/svnroot/deduputil

Compilation

1. Get the source code

svn co https://deduputil.svn.sourceforge.net/svnroot/deduputil deduputil

2. Install libz-dev

apt-get install libz-dev

If your system does not have apt-get, install libz-dev by other means.

3. Compile and install

./gen.sh

./configure

make

make install

[Command line]

usage: dedup [OPTION...] [FILE]...

The dedup tool packages files using deduplication techniques.

Examples:

dedup -c foobar.ded foo bar      # Create foobar.ded from files foo and bar.

dedup -a foobar.ded foo1 bar1    # Append files foo1 and bar1 into foobar.ded.

dedup -r foobar.ded foo1 bar1    # Remove files foo1 and bar1 from foobar.ded.

dedup -t foobar.ded              # List all files in foobar.ded.

dedup -x foobar.ded              # Extract all files from foobar.ded.

Options:

-c, --creat       create a new archive

-x, --extract     extract files from an archive

-a, --append      append files to an archive

-r, --remove      remove files from an archive

-t, --list        list files in an archive

-z, --compress    filter the archive through zlib compression

-b, --block       block size for deduplication, default is 4096

-H, --hashtable   hashtable bucket number, default is 10240

-d, --directory   change to directory, default is PWD

-v, --verbose     print verbose messages

-h, --help        give this help list

[Run platform]

So far the tool has been developed and tested only on Linux; other platforms have not been evaluated.

[TODO]

1. The data block collision problem

Although the probability of an MD5 collision is extremely small, a low-probability event can still happen. Technical measures are needed to resolve collisions and guarantee data safety, so that users can rely on the tool with confidence.

2. Variable-length data blocks

The current implementation uses fixed-length blocks, which is technically simpler; variable-length blocks may achieve a higher data reduction rate.

3. Similar file recognition

If two files differ only slightly, for example a few bytes inserted somewhere, identifying the differing blocks and handling them separately may increase the data reduction rate.

Author

Liu, focusing on storage technology, with interests in data mining and distributed computing. [email protected]

2010.06.02


This article is from the CSDN blog; when reproducing it, please credit the source: http://blog.csdn.net/liuben/archive/2010/06/02/5641891.aspx

Archived copy of the 1.3.0 package: http://files.cnblogs.com/dkblog/deduputil-1.3.0.tar.gz.zip

===============================================================

Research on Data deduplication (de-duplication) technology

1. Dedupe Overview

De-duplication, commonly called dedupe, is a mainstream and very popular storage technology that optimizes storage capacity. It eliminates redundancy by removing duplicate data from a data set and keeping only one copy of each unique piece of data. This can significantly reduce the physical storage space required and thus meet the ever-growing demand for data storage. Dedupe brings many practical benefits, mainly the following:
(1) Meets return on investment (ROI) and total cost of ownership (TCO) requirements;
(2) Effectively controls rapid data growth;
(3) Increases effective storage space and improves storage efficiency;
(4) Reduces total storage and management costs;
(5) Saves network bandwidth for data transmission;
(6) Saves floor space, power, cooling, and other operation and maintenance costs.

Dedupe is currently used heavily in data backup and archiving systems, because repeated backups of the same data produce large amounts of duplicate data, which suits this technology very well. In fact, dedupe can be used in many other settings, including online, near-line, and offline storage systems, and it can be implemented in file systems, volume managers, NAS, and SANs. It can also be applied to disaster recovery, data transmission, and synchronization, and, as a data compression technique, to data packaging. Dedupe helps many applications reduce data storage, save network bandwidth, improve storage efficiency, shrink backup windows, and cut costs.

There are two main metrics for dedupe: the data deduplication ratio and performance. Performance depends on the implementation, while the deduplication ratio is determined by the characteristics of the data and the application pattern, as shown in the table below [2]. Storage vendors currently quote deduplication ratios ranging from 20:1 to 500:1.

High data deduplication ratio              | Low data deduplication ratio
-------------------------------------------|-------------------------------------------
Data created by users                      | Data captured from the natural world
Data with a low rate of change             | Data with a high rate of change
Reference data, inactive data              | Active data
Applications with a low data change rate   | Applications with a high data change rate
Full backups                               | Incremental backups
Long-term data retention                   | Short-term data retention
Broad scope of data application            | Narrow scope of data application
Continuous business data processing        | Ordinary business data processing
Small data chunks                          | Large data chunks
Variable-length chunking                   | Fixed-length chunking
Content-aware data                         | Content-agnostic data
Temporal deduplication                     | Spatial deduplication

2. Key points of Dedupe implementation

Various factors should be considered when developing or applying dedupe technology, because they directly affect its performance and effectiveness.

(1) What: which data is deduplicated?
Is deduplication applied to data over time or to data across space, and to a global or a local data set? This is the first factor to consider, and it directly determines the implementation technique and the achievable deduplication ratio. Data accumulated over time, such as periodic backups and archived data, yields a higher deduplication ratio than spatial data, which is why dedupe is so widely used in backup and archiving. Likewise, it is easy to see that data across a global scope has a higher repetition rate than data within a local scope and will therefore give a higher deduplication ratio.

(2) When: when is deduplication performed?
Deduplication timing falls into two modes: inline (online) and post-process (offline). In inline mode, data is deduplicated as it is written to the storage system, so the amount of data actually transferred or written is small; this suits network backup and archiving systems and remote disaster-recovery systems that move data over a LAN or WAN. Because it requires real-time file chunking, fingerprint computation, and hash lookup, it consumes substantial system resources. In post-process mode, the data is first written to the storage system and deduplicated later at a suitable time. Compared with inline mode it consumes fewer system resources during writes, but the written data still contains duplicates, so extra storage space is needed to hold the data before deduplication. This mode suits direct-attached storage (DAS) and storage area network (SAN) architectures, where data transfer does not occupy network bandwidth. In addition, post-process deduplication needs a sufficient time window to run. In short, when to deduplicate should be decided by the actual storage application scenario.

(3) Where: where is deduplication performed?
Deduplication can be performed at either the source or the target side. Source-side deduplication happens at the data source before the data is transferred, which saves network bandwidth but consumes considerable source-side system resources. Target-side deduplication happens after the data has been transferred to the target, so it does not consume source-side resources but uses a large amount of network bandwidth. The advantage of target-side deduplication is that it is transparent to applications, has good interoperability, requires no special APIs, and existing application software can be used directly without modification.

(4) How: how is deduplication performed?
Deduplication involves many implementation details: How are files chunked? How are block fingerprints computed? How are blocks looked up? Should identical-data detection be used, or similar-data detection combined with delta encoding? Is the data content-aware, and does the content need to be parsed? All of these are closely tied to a concrete dedupe implementation. This article mainly studies identical-data detection and performs deduplication on binary files, which gives it broad applicability.

3. Key Dedupe technologies

The deduplication process of a storage system generally works as follows: the data file is first divided into a set of blocks and a fingerprint is computed for each block; the fingerprint is then used as the key for a hash lookup. A match means the block is a duplicate, and only its block index is stored; otherwise it is a new unique block, which is stored together with the related metadata. A physical file in the storage system thus corresponds to a logical representation consisting of a list of fingerprints (FPs) of its blocks. When the file is read, the logical file is read first, then the corresponding blocks are fetched from the storage system according to the FP sequence, and a copy of the physical file is reconstructed. From this process it can be seen that the key technologies of dedupe are file chunking, block fingerprint computation, and block retrieval.
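The write path just described can be illustrated with the following self-contained C sketch. It is not the deduputil implementation: a linear fingerprint table stands in for the hash table, and a toy 64-bit FNV-1a hash stands in for MD5, purely to keep the example short.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SIZE 4096
#define MAX_UNIQUE 100000

static uint64_t fingerprints[MAX_UNIQUE];   /* fingerprint of each unique block */
static size_t   unique_blocks = 0;

/* Toy 64-bit FNV-1a fingerprint; a real tool would use MD5 or SHA-1. */
static uint64_t fingerprint(const unsigned char *p, size_t n)
{
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Return the ID of the block with this fingerprint, adding it if new.
 * A linear scan is used for clarity; the real tool uses a hash table. */
static size_t lookup_or_insert(uint64_t fp, int *is_new)
{
    for (size_t i = 0; i < unique_blocks; i++)
        if (fingerprints[i] == fp) {
            *is_new = 0;
            return i;
        }
    fingerprints[unique_blocks] = fp;
    *is_new = 1;
    return unique_blocks++;
}

int main(int argc, char **argv)
{
    FILE *in = fopen(argc > 1 ? argv[1] : "/dev/stdin", "rb");
    unsigned char buf[BLOCK_SIZE];
    size_t n, logical = 0;

    if (!in) {
        perror("fopen");
        return 1;
    }
    while ((n = fread(buf, 1, BLOCK_SIZE, in)) == BLOCK_SIZE &&
           unique_blocks < MAX_UNIQUE) {
        int is_new = 0;
        size_t id = lookup_or_insert(fingerprint(buf, n), &is_new);
        logical++;                 /* the logical file records this block ID */
        if (is_new) {
            /* a real packer would append buf to the unique-block region here */
        }
        (void)id;
    }
    /* n < BLOCK_SIZE here: the short tail block is kept with the file metadata */
    printf("%zu full blocks, %zu unique\n", logical, unique_blocks);
    fclose(in);
    return 0;
}

Compiled with a C compiler and run on any file, it reports how many of the file's 4 KB blocks are unique.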

 (1) File data block segmentation

Dedupe can operate at file level or at block level, depending on the deduplication granularity. File-level dedupe is also known as single-instance storage (SIS, Single Instance Store); block-level dedupe works at a finer granularity, typically between 4 KB and 24 KB. Block level obviously yields a higher deduplication ratio, so current mainstream dedupe products operate at the block level. There are three main chunking algorithms: fixed-size partition (FSP), content-defined chunking (CDC), and sliding block (SB). The fixed-size algorithm slices the file using a predefined block size and computes both a weak checksum and an MD5 strong checksum for each block. The weak checksum mainly improves the performance of differential encoding: it is computed first and looked up in the hash table, and only on a hit is the MD5 strong checksum computed and looked up further. Because the weak checksum is far cheaper to compute than MD5, this effectively improves encoding performance. The fixed-size algorithm is simple and fast, but it is very sensitive to insertions and deletions, handles them inefficiently, and cannot adapt or be tuned to changes in content. A simplified sketch of FSP chunking is shown below.
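The sketch below (not the actual deduputil FSP source) simply enumerates fixed-size chunks of an input stream; fsp_chunk and the printed output are illustrative only.

/* Illustrative fixed-size partition (FSP) chunking sketch: every chunk is
 * BLOCK_SIZE bytes except possibly the last one. */
#include <stdio.h>

#define BLOCK_SIZE 4096

/* Print the (offset, length) of each fixed-size chunk read from `in`. */
void fsp_chunk(FILE *in)
{
    unsigned char buf[BLOCK_SIZE];
    long offset = 0;
    size_t n;

    while ((n = fread(buf, 1, BLOCK_SIZE, in)) > 0) {
        /* each full chunk would be fingerprinted and deduplicated;
           the final chunk (n < BLOCK_SIZE) is stored with the file metadata */
        printf("chunk at %ld, %zu bytes\n", offset, n);
        offset += (long)n;
    }
}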

The content-defined chunking (CDC) algorithm is a variable-size chunking algorithm that splits a file using data fingerprints, such as Rabin fingerprints, so the resulting block sizes vary. Unlike fixed-size chunking, it places block boundaries based on the file content. The algorithm slides a fixed-size window (for example 48 bytes) over the file data and computes a fingerprint of the window at each position; when the fingerprint satisfies a condition, for instance its value modulo a preset integer equals a predetermined number, the window position is taken as a block boundary. A pathological case can occur in which the condition is never satisfied, no boundary is found, and the block grows too large; implementations solve this by imposing upper and lower limits on the block size. CDC is insensitive to changes in file content: inserting or deleting data affects only a few nearby blocks, and the remaining blocks are untouched. CDC also has drawbacks: the block size is hard to choose, too fine a granularity makes the metadata overhead too large, and too coarse a granularity gives a poor deduplication effect; balancing the two is difficult. A simplified sketch of CDC chunking is shown below.
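The sketch below (not the actual deduputil CDC source) uses a simple polynomial rolling hash over a 48-byte window as a stand-in for a true Rabin fingerprint; the constants (window size, mask, magic value, minimum and maximum chunk sizes) are hypothetical.

/* Illustrative CDC chunking sketch with a Rabin-Karp style rolling hash. */
#include <stdio.h>
#include <stdint.h>

#define WIN_SIZE   48          /* sliding window size                        */
#define AVG_MASK   0x0fff      /* ~4 KB average: boundary when (hash & AVG_MASK) == MAGIC */
#define MAGIC      0x78
#define MIN_CHUNK  2048        /* lower bound on chunk size                  */
#define MAX_CHUNK  16384       /* upper bound on chunk size                  */
#define PRIME      31

/* Emit the (offset, length) of each content-defined chunk in buf[0..len). */
void cdc_chunk(const unsigned char *buf, size_t len)
{
    /* pow_win = PRIME^(WIN_SIZE-1), used to remove the outgoing byte */
    uint32_t pow_win = 1;
    for (int i = 0; i < WIN_SIZE - 1; i++)
        pow_win *= PRIME;

    size_t start = 0;               /* start of the current chunk */
    uint32_t hash = 0;

    for (size_t i = 0; i < len; i++) {
        /* update the rolling hash of the last WIN_SIZE bytes */
        if (i >= start + WIN_SIZE)
            hash -= pow_win * buf[i - WIN_SIZE];
        hash = hash * PRIME + buf[i];

        size_t chunk_len = i - start + 1;
        int at_boundary = chunk_len >= MIN_CHUNK &&
                          ((hash & AVG_MASK) == MAGIC || chunk_len >= MAX_CHUNK);
        if (at_boundary) {
            printf("chunk at %zu, %zu bytes\n", start, chunk_len);
            start = i + 1;
            hash = 0;               /* restart the window in the next chunk */
        }
    }
    if (start < len)                /* trailing chunk */
        printf("chunk at %zu, %zu bytes\n", start, len - start);
}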

The sliding block (SB) algorithm combines the advantages of fixed-size and content-defined chunking and uses a fixed block size. It first computes a weak checksum of the fixed-size window; if that matches a known checksum, it computes the MD5 strong checksum, and a strong match marks the position as a block boundary. The data fragment in front of the matched block also forms a block, which is variable in length. If the window slides a full block length without finding a match, that position is likewise treated as a block boundary. The sliding block algorithm handles insertions and deletions efficiently and can detect more redundant data than CDC; its drawback is that it tends to produce data fragments. A simplified sketch of sliding-block chunking is shown below.
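The sketch below (not the actual deduputil SB code) illustrates only the boundary logic; weak_known(), strong_known(), and weak_checksum() are assumed helpers backed by a table of checksums of already-known blocks, and a real implementation would roll the weak checksum instead of recomputing it at every position.

/* Illustrative sliding-block (SB) chunking sketch. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 4096

/* assumed helpers: is a block with this weak/strong checksum already known? */
int weak_known(uint32_t weak);
int strong_known(const unsigned char *block, size_t len);
uint32_t weak_checksum(const unsigned char *p, size_t len);

void sb_chunk(const unsigned char *buf, size_t len)
{
    size_t anchor = 0;                      /* end of the last emitted block    */
    size_t pos = 0;                         /* left edge of the sliding window  */

    while (pos + BLOCK_SIZE <= len) {
        uint32_t weak = weak_checksum(buf + pos, BLOCK_SIZE);

        if (weak_known(weak) && strong_known(buf + pos, BLOCK_SIZE)) {
            if (pos > anchor)               /* fragment before the match: variable length */
                printf("fragment at %zu, %zu bytes\n", anchor, pos - anchor);
            printf("duplicate block at %zu, %d bytes\n", pos, BLOCK_SIZE);
            pos += BLOCK_SIZE;
            anchor = pos;
            continue;
        }
        pos++;                              /* slide the window by one byte */
        if (pos - anchor == BLOCK_SIZE) {   /* no match within one block length */
            printf("new block at %zu, %d bytes\n", anchor, BLOCK_SIZE);
            anchor = pos;
        }
    }
    if (anchor < len)                       /* trailing data */
        printf("fragment at %zu, %zu bytes\n", anchor, len - anchor);
}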

 (2) Data block fingerprint calculation

A data fingerprint is the essential characteristic of a data block. Ideally, every unique block has a unique fingerprint and different blocks have different fingerprints. Since the blocks themselves are usually large, the goal of fingerprinting is to distinguish blocks using a much smaller piece of data (for example 16, 32, 64, or 128 bytes). A fingerprint is usually obtained by a mathematical computation over the block content; according to current research, hash functions come closest to this ideal, for example the one-way hashes MD5, SHA-1, SHA-256, and SHA-512, as well as Rabin hashes. Many string hash functions can also be used to compute block fingerprints. Unfortunately, all of these fingerprint functions suffer from collisions: different blocks can produce the same fingerprint. By comparison, MD5 and the SHA family have an extremely low collision probability, so they are commonly used for fingerprinting. MD5 produces a 128-bit digest and SHA-1 a 160-bit digest; the SHA-2 variants (SHA-256, SHA-512, where the suffix gives the digest length in bits) have an even lower collision probability but a much higher computational cost. In practice, a trade-off must be made between performance and data safety. Multiple hash algorithms can also be combined to compute a block's fingerprint.
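As a small example, a block fingerprint can be computed with OpenSSL's legacy SHA1() one-shot function (a hedged sketch: compile with -lcrypto; the function is deprecated since OpenSSL 3.0 but still available, and this is not necessarily how deduputil computes its fingerprints).

/* Compute and print the SHA-1 fingerprint of one data block. */
#include <stdio.h>
#include <openssl/sha.h>

void block_fingerprint(const unsigned char *block, size_t len,
                       unsigned char fp[SHA_DIGEST_LENGTH])
{
    SHA1(block, len, fp);       /* 20-byte digest identifies the block */
}

int main(void)
{
    const unsigned char block[] = "example data block";
    unsigned char fp[SHA_DIGEST_LENGTH];

    block_fingerprint(block, sizeof(block) - 1, fp);
    for (int i = 0; i < SHA_DIGEST_LENGTH; i++)
        printf("%02x", fp[i]);
    printf("\n");
    return 0;
}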

 (3) Data block retrieval

For a dedupe system with large storage capacity, the number of data blocks is enormous, especially when the chunking granularity is fine, so lookups in such a large fingerprint index can become a performance bottleneck. Many retrieval structures are available, such as dynamic arrays, databases, RB/B/B+/B* trees, and hash tables. Hash lookup is known for its O(1) search performance and is widely used wherever lookup performance matters, including in dedupe. The hash table resides in memory and consumes a large amount of it, so memory requirements should be planned before designing a dedupe system; they can be estimated from the fingerprint length and the number of data blocks (which in turn can be estimated from the storage capacity and the average block size).

A hash table (also called a hash map) is a data structure accessed directly by key value: it maps a key to a position in the table in order to speed up lookups. The mapping function is called the hash function, and the array that holds the records is called the hash table. Lookup follows essentially the same process as table construction: some keys are found directly at the address produced by the hash function, while keys that collide at the same address must be located with the collision-resolution method. For details, refer to standard hash table design.
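As a minimal illustration, a chained fingerprint hash table might look like the sketch below; the structure and function names are hypothetical, and deduputil's own hashtable (whose bucket count is set with -H) may be organized differently.

/* Minimal chained hash table keyed by a 16-byte block fingerprint,
 * mapping fingerprint -> block ID. */
#include <stdlib.h>
#include <string.h>

#define FP_LEN   16
#define BUCKETS  10240          /* default bucket count mentioned above */

struct fp_node {
    unsigned char   fp[FP_LEN]; /* block fingerprint (e.g. MD5)   */
    unsigned int    block_id;   /* index of the unique block      */
    struct fp_node *next;       /* chaining for collisions        */
};

static struct fp_node *buckets[BUCKETS];

/* Simple bucket index: fold the fingerprint bytes into an integer. */
static unsigned int bucket_of(const unsigned char *fp)
{
    unsigned int h = 0;
    for (int i = 0; i < FP_LEN; i++)
        h = h * 31 + fp[i];
    return h % BUCKETS;
}

/* Return 1 and set *block_id if the fingerprint is already known. */
int hashtable_lookup(const unsigned char *fp, unsigned int *block_id)
{
    for (struct fp_node *n = buckets[bucket_of(fp)]; n; n = n->next)
        if (memcmp(n->fp, fp, FP_LEN) == 0) {
            *block_id = n->block_id;
            return 1;
        }
    return 0;
}

/* Record a new fingerprint -> block ID mapping. */
void hashtable_insert(const unsigned char *fp, unsigned int block_id)
{
    unsigned int b = bucket_of(fp);
    struct fp_node *n = malloc(sizeof(*n));
    if (!n)
        return;                 /* out of memory: dropped silently in this sketch */
    memcpy(n->fp, fp, FP_LEN);
    n->block_id = block_id;
    n->next = buckets[b];
    buckets[b] = n;
}

A real system would also persist this table, or rebuild it from the package metadata when appending files to an existing package.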

4. Dedupe Data Security

Data security here has two aspects: data block collisions and data availability. Both are critical to users and must be considered in advance.

A data block fingerprint (FP) is usually computed with a hash function such as MD5, SHA-1, SHA-256, or SHA-512. Mathematically, if two blocks have different fingerprints, the blocks are certainly different; but if two blocks have the same fingerprint, we cannot conclude that they are identical, because hash functions collide. The team led by Professor Wang Xiaoyun of Shandong University has even found fast ways to construct MD5 collisions. The probability of an accidental collision, however, is extremely small, even lower than the probability of disk damage, so it is generally acceptable to treat blocks with identical fingerprints as identical. Because of this collision risk, dedupe is rarely applied to critical data, where a collision would cause enormous economic loss. There are two main ways to address the problem. The first is a full byte-by-byte comparison of blocks whose fingerprints match; the difficulty is that the raw block data is sometimes hard to obtain, and the comparison costs some performance. The open source software deduputil, which I developed, adopts this strategy; see the deduputil zero-collision block algorithm. The second is to minimize the collision probability, for example by using stronger hash functions with longer digests (such as SHA-512) or by combining two or more hash algorithms, which obviously affects performance. I used this method in "Research on data synchronization algorithms": two fingerprints are computed for each block, a weak checksum similar to the rsync rolling checksum and an MD5 strong checksum. Computing the weak checksum costs much less than MD5: the weak checksum of the target block is computed first, and only if it equals the source block's weak checksum is the MD5 checksum computed and compared. This greatly reduces the collision probability at a small performance cost, and with optimization the performance loss is negligible.
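The weak-plus-strong scheme can be sketched as follows; the rolling checksum mirrors the rsync-style weak checksum described above, MD5 comes from OpenSSL (link with -lcrypto), and the function names are illustrative rather than taken from deduputil or rsync.

/* Two-level check: cheap rolling weak checksum, MD5 only on a weak match. */
#include <stdint.h>
#include <string.h>
#include <openssl/md5.h>

/* rsync-style weak checksum of a len-byte window. */
uint32_t weak_checksum(const unsigned char *p, size_t len)
{
    uint32_t a = 0, b = 0;
    for (size_t i = 0; i < len; i++) {
        a += p[i];
        b += (uint32_t)(len - i) * p[i];
    }
    return (a & 0xffff) | (b << 16);
}

/* Roll the weak checksum one byte forward: drop `out`, add `in`. */
uint32_t roll_checksum(uint32_t sum, unsigned char out, unsigned char in, size_t len)
{
    uint32_t a = (sum & 0xffff) - out + in;
    uint32_t b = (sum >> 16) - (uint32_t)len * out + a;
    return (a & 0xffff) | (b << 16);
}

/* Called only when the weak checksums already match: confirm with MD5. */
int strong_match(const unsigned char *p, size_t len,
                 const unsigned char expected[MD5_DIGEST_LENGTH])
{
    unsigned char md[MD5_DIGEST_LENGTH];
    MD5(p, len, md);
    return memcmp(md, expected, MD5_DIGEST_LENGTH) == 0;
}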

Dedupe keeps only a single copy of each unique block, so if that copy is corrupted, every data file that references it becomes inaccessible; the pressure on data availability is therefore greater than in systems without dedupe. Data availability can be addressed with traditional data protection methods, including data redundancy (RAID1, RAID5, RAID6), local backup and replication, remote backup and replication, error-correcting data encoding techniques (such as Hamming codes and the information dispersal algorithm, IDA), and distributed storage. These techniques effectively eliminate single points of failure and improve data availability, at the cost of extra space traded for safety.

5. Open Source software Deduputil

Dedup util is a self-developed, open source, lightweight file packaging tool based on block-level deduplication; it effectively reduces stored data volume and saves storage space. Its main features are as follows:
(1) Supports three file chunking techniques: FSP fixed-size chunking, CDC variable-size chunking, and SB sliding-block chunking;
(2) Zero data block collisions, at the cost of some performance;
(3) Implements global, source-side, inline deduplication;
(4) Supports appending to and deleting from a data package and reporting deduplication ratio statistics;
(5) Supports compressing the data after deduplication.

Deduputil project information is as follows:
(1) SourceForge project page: http://sourceforge.net/projects/deduputil
(2) Introduction and usage: http://blog.csdn.net/liuben/archive/2010/06/02/5641891.aspx

6. Extended Reading

[1] SNIA DDSR SIG. http://www.snia.org/forums/dmf/programs/data_protect_init/ddsrsig
[2] The business value of data deduplication. http://www.snia.org/forums/dpco/knowledge/pres_tutorials/Dedupe_Business_Value_V5.pdf
[3] Evaluation criteria for data de-dupe. http://www.snia.org/forums/dmf/news/articles/DMF_DeDupe.PDF
[4] Ao Li, Shu Jiwu, Li Mingqiang. Data deduplication technology. Journal of Software, 2010, (5): 916-929.
[5] Cheng. Research on data deduplication technology. Huasai Technology, 2008, 4: 8-11.

From: http://blog.csdn.net/liuben/article/details/5829083

Http://www.cnblogs.com/dkblog/archive/2010/12/07/1980685.html
