Data deduplication 2 --- Research on high-performance duplicate data detection and deletion technology: some fragmentary notes


Here are some fragmentary notes on data deduplication that I summarized earlier; I am posting them here so we can discuss them.

1 The explosion of data volumes brings new challenges to the capacity, throughput, scalability, reliability, security, maintainability, and energy management of existing storage systems. Eliminating redundant information to optimize storage space efficiency has become the main technique for alleviating the storage capacity bottleneck; data compression [8] and data deduplication are the key technologies for eliminating information redundancy.

2 Data compression is the process of expressing raw data with fewer bits through encoding. It can be subdivided into lossless compression and lossy compression, depending on whether the encoding process loses original information. Lossless compression usually exploits statistical redundancy, using fewer bits to represent the bit strings that occur more frequently; given the coding algorithm and the dictionary information, the original data can be completely recovered from the losslessly compressed data.
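
As a concrete illustration, the lossless round trip can be sketched with Python's standard zlib module; the sample string and compression level below are arbitrary choices for illustration only.

    import zlib

    # A highly repetitive byte string compresses well because lossless
    # compression exploits statistical redundancy.
    original = b"ABABABABABAB" * 1000

    compressed = zlib.compress(original, 9)   # encode with fewer bits
    restored = zlib.decompress(compressed)    # fully recover the input

    assert restored == original               # lossless: nothing is lost
    print(len(original), "->", len(compressed), "bytes")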

3 Data deduplication

Data deduplication is the process of discovering and eliminating duplicate content in a dataset or data stream in order to improve the storage and/or transfer efficiency of data; it is also known as duplicate data elimination, or simply dedup. As a key technology for optimizing storage space, data deduplication helps reduce the resource consumption and total cost of ownership (TCO) of storage systems, with the following benefits:

Increase storage space efficiency by increasing the amount of data that a unit of storage space can hold.

Reduce the amount of equipment needed to store a unit of data, and reduce the corresponding energy consumption.

Improve network transmission efficiency by avoiding the transfer of duplicate content.

4 Data backup

Data backup improves data reliability by creating and storing multiple copies of the target dataset at different points in time. Based on policy, backups are generally divided into full backups, differential incremental backups, and cumulative incremental backups. A full backup copies and saves a complete copy of the target dataset each time, and therefore requires the longest backup time and the most storage space. A differential incremental backup only copies and saves the data modified since the last full or differential incremental backup, which reduces the amount of data transferred, but recovery must first restore the previous full backup and then replay every subsequent increment in turn, giving a poor recovery time objective (Recovery Time Objective, RTO). A cumulative incremental backup copies and saves all data modified since the last full backup, so recovery only needs the last full backup plus the most recent cumulative increment.
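
As a rough sketch of the selection logic behind the three backup policies, the snippet below compares each file's modification time against the timestamp of the relevant previous backup; the function and parameter names are hypothetical, and a real backup tool would also track deletions and metadata.

    import os

    def select_files(root, policy, last_full, last_diff=None):
        """Return the files a backup run would copy under the given policy.

        policy: 'full', 'differential', or 'cumulative'
        last_full / last_diff: UNIX timestamps of the previous backups.
        """
        selected = []
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                mtime = os.path.getmtime(path)
                if policy == "full":
                    selected.append(path)          # copy everything
                elif policy == "cumulative" and mtime > last_full:
                    selected.append(path)          # changed since the last full backup
                elif policy == "differential" and mtime > max(last_full, last_diff or 0):
                    selected.append(path)          # changed since the last full or differential backup
        return selected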

5 How to define duplicate data

Newer storage solutions introduce content-based approaches to defining duplicate data; such methods can work at three different levels of granularity: bytes, blocks, or files.

The byte-level duplicate data definition method mainly uses delta encoding algorithms. The algorithm treats a file as an ordered string consisting of a series of symbols, computes the difference between the file to be processed, Fnew, and an existing reference file, Fbase, and encodes it to form a delta file Delta(Fnew, Fbase). If Fnew and Fbase are highly similar, storing only Delta(Fnew, Fbase) achieves the effect of saving storage space.
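
The idea behind delta encoding can be sketched with Python's difflib, which derives edit operations between a reference version and a new version; real delta encoders such as xdelta use far more compact encodings, and the variable names below simply mirror Fbase and Fnew from the text.

    from difflib import SequenceMatcher

    f_base = b"The quick brown fox jumps over the lazy dog."
    f_new  = b"The quick brown fox quietly jumps over the lazy dog."

    # The delta stores only the regions of Fnew that differ from Fbase.
    matcher = SequenceMatcher(None, f_base, f_new)
    delta = [(tag, i1, i2, f_new[j1:j2])
             for tag, i1, i2, j1, j2 in matcher.get_opcodes()
             if tag != "equal"]                    # 'equal' regions need not be stored

    # Reconstruct Fnew from Fbase plus the delta.
    rebuilt, cursor = b"", 0
    for tag, i1, i2, data in delta:
        rebuilt += f_base[cursor:i1] + data
        cursor = i2
    rebuilt += f_base[cursor:]
    assert rebuilt == f_new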

File-level deduplication typically uses a highly reliable hash algorithm (such as MD5 or SHA-1) to generate an identifier (also called a fingerprint) with a very low collision probability for each file, and identifies duplicates by matching identical identifiers. Files are the data organization unit commonly used to manage unstructured data; common file types such as documents, images, audio, and video have specific data structures and are easy to handle as a whole in different storage areas (such as folders).
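
A minimal sketch of file-level deduplication: every file under a directory is fingerprinted with SHA-1 and duplicates are identified by identical fingerprints. The directory layout is a placeholder and the helper names are not from any particular system.

    import hashlib, os

    def file_fingerprint(path, piece=1 << 20):
        """SHA-1 fingerprint of a whole file, read in 1 MB pieces."""
        h = hashlib.sha1()
        with open(path, "rb") as f:
            while block := f.read(piece):
                h.update(block)
        return h.hexdigest()

    def find_duplicate_files(root):
        """Map each fingerprint to the list of files sharing it."""
        index = {}
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                index.setdefault(file_fingerprint(path), []).append(path)
        # Fingerprints seen more than once identify duplicate files.
        return {fp: paths for fp, paths in index.items() if len(paths) > 1}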

Block-level data deduplication is mainly used to detect duplicate data within similar files and across different files. It usually relies on hash-based content identification to achieve higher deduplication speed than the byte level and a wider duplicate detection range. The main methods for determining block boundaries fall into two categories: fixed-length chunking and variable-length chunking.

Fixed-length chunking cuts a file into blocks of fixed length L_fixed, computes a fingerprint for each block, and filters out the duplicate data objects among them. This method is simple and efficient, but when a new version of a file is created by inserting data, the boundaries of all blocks after the insertion point shift, so previously identical content is no longer recognized as duplicate.
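
The boundary-shift problem is easy to demonstrate; in the sketch below L_fixed is shrunk to 8 bytes purely for illustration, and a single inserted byte prevents every block after the insertion point from being re-detected.

    import hashlib

    L_FIXED = 8   # toy block length; real systems use e.g. 4-8 KB

    def fixed_chunks(data):
        """Cut data into fixed-length blocks and fingerprint each one."""
        return [hashlib.sha1(data[i:i + L_FIXED]).hexdigest()
                for i in range(0, len(data), L_FIXED)]

    old = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    new = old[:16] + b"!" + old[16:]      # one byte inserted after two blocks

    shared = set(fixed_chunks(old)) & set(fixed_chunks(new))
    # Only the two blocks before the insertion point still match; every later
    # boundary has shifted, so those duplicates go undetected.
    print(len(shared), "of", len(fixed_chunks(old)), "old blocks re-detected")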

The fixed-length chunking method based on a sliding window [30] parses the file data with a window of length W = L_fixed. If the data in the current window is determined to be a duplicate block, it is recorded and the window slides directly to the unprocessed data area; otherwise the window slides forward byte by byte, and once L_fixed bytes have been passed without finding a duplicate, they are emitted as a non-duplicate block. Such methods can resolve the boundary drift problem caused by data insertion, but they require frequently computing candidate block fingerprints and querying whether they are duplicates, which carries a high time overhead.

The content-defined chunking (Content Defined Chunking, CDC) method [31-33] analyzes the file data with a sliding window of length w << L_fixed, using the computationally inexpensive Rabin fingerprint algorithm [34] to compute the fingerprint of the window content; when the fingerprint satisfies a preset condition, the current window position is declared a block boundary, so boundaries move with the content and data insertion only affects nearby blocks. The resulting variable-length blocks are then fingerprinted with a highly reliable hash algorithm (such as MD5 or SHA-1) for duplicate detection. CDC is highly practical and is widely used in various deduplication solutions.
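
A minimal sketch of content-defined chunking: a small sliding window is hashed with a simple Rabin-Karp style rolling hash (standing in for a full Rabin fingerprint implementation), and a boundary is declared whenever the low bits of the window hash match a fixed pattern, so boundaries move with the content. The window size, mask, and block size limits are illustrative choices.

    import hashlib, os

    WINDOW = 16            # sliding window length w (w << average block length)
    MASK = (1 << 11) - 1   # boundary when hash & MASK == MASK -> ~2 KB average blocks
    MIN_LEN, MAX_LEN = 512, 8192
    BASE, MOD = 257, (1 << 61) - 1

    def cdc_chunks(data):
        """Split data into variable-length, content-defined chunks."""
        chunks, start, h = [], 0, 0
        top = pow(BASE, WINDOW - 1, MOD)         # weight of the outgoing byte
        for i, byte in enumerate(data):
            if i - start >= WINDOW:              # slide: drop the oldest byte
                h = (h - data[i - WINDOW] * top) % MOD
            h = (h * BASE + byte) % MOD
            length = i - start + 1
            if (length >= MIN_LEN and (h & MASK) == MASK) or length >= MAX_LEN:
                chunks.append(data[start:i + 1])
                start, h = i + 1, 0              # boundary found: restart the window
        if start < len(data):
            chunks.append(data[start:])
        return chunks

    data = os.urandom(64 * 1024)                 # arbitrary sample data
    fingerprints = [hashlib.sha1(c).hexdigest() for c in cdc_chunks(data)]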

6 Evaluation of deduplication efficiency

In a network-based distributed storage environment, data deduplication can be deployed either at the source (client side) or at the target (server side). Source-side deduplication [36,37] first computes the fingerprints of the data to be transmitted on the client and discovers and eliminates duplicate content by comparing fingerprints with the server; only the non-duplicate data content is then sent to the server, saving both network bandwidth and storage resources. Target-side deduplication (destination deduplication) [1] transfers the client's data directly to the server and detects and eliminates duplicate content on the server side. Both deployment methods improve storage space efficiency; the main difference is that source-side deduplication saves network bandwidth at the cost of consuming client-side computing resources.
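
A minimal sketch of the source-side exchange, with the "server" represented by an in-memory fingerprint store; in a real deployment the two roles sit on opposite ends of a network link, and only the fingerprint list and the missing blocks cross it.

    import hashlib

    class DedupServer:
        """Target side: stores blocks keyed by fingerprint."""
        def __init__(self):
            self.store = {}
        def missing(self, fingerprints):
            # Tell the client which fingerprints have never been seen.
            return [fp for fp in fingerprints if fp not in self.store]
        def put(self, blocks_by_fp):
            self.store.update(blocks_by_fp)

    def source_side_backup(server, blocks):
        """Client side: fingerprint locally, send only non-duplicate blocks."""
        by_fp = {hashlib.sha1(b).hexdigest(): b for b in blocks}
        needed = server.missing(list(by_fp))            # fingerprint exchange only
        server.put({fp: by_fp[fp] for fp in needed})    # bulk data for new blocks only
        return len(needed), len(by_fp)

    server = DedupServer()
    sent, total = source_side_backup(server, [b"alpha", b"beta", b"alpha", b"gamma"])
    print(f"sent {sent} of {total} unique blocks")      # duplicates never cross the wire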

The efficiency of data deduplication can be evaluated along the two dimensions of time and space. In terms of timing, deduplication methods can be divided into inline deduplication [38] and post-processing deduplication [39]. Inline deduplication completes the definition, detection, and deletion of duplicate content before data is written to the storage system; to ensure real-time operation, it typically maintains a full data index (such as a hash table) in memory and consumes considerable computational resources. Post-processing deduplication first writes the data to the storage system and then detects and eliminates duplicate content; it occupies little CPU and memory, but it incurs a large hard disk space overhead and cannot guarantee when the deduplication process will complete.
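
A minimal sketch of the inline write path, assuming a full in-memory fingerprint index as described above; block storage is modeled by a plain dictionary purely for illustration.

    import hashlib

    class InlineDedupStore:
        """Duplicate detection happens before anything reaches 'disk'."""
        def __init__(self):
            self.index = {}   # fingerprint -> block id (kept entirely in memory)
            self.disk = {}    # block id -> bytes (stand-in for the storage system)

        def write(self, block):
            fp = hashlib.sha1(block).hexdigest()
            if fp in self.index:              # duplicate: store only a reference
                return self.index[fp]
            block_id = len(self.disk)
            self.disk[block_id] = block       # unique content is written exactly once
            self.index[fp] = block_id
            return block_id

    store = InlineDedupStore()
    refs = [store.write(b) for b in (b"x" * 4096, b"y" * 4096, b"x" * 4096)]
    print(refs, "blocks on disk:", len(store.disk))     # [0, 1, 0] blocks on disk: 2

A post-processing design would instead append every incoming block to disk and run the same fingerprint comparison later as a background job.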

The quantitative evaluation index of time efficiency (deduplication performance) is throughput; the throughput of the deduplication host is usually capped by the throughput capability of its network card.

7 The main technical means of breaking through the performance bottleneck of current deduplication methods are building fast in-memory indexes, exploiting data locality, exploiting data similarity, and adopting new types of storage media.

8 The Data Domain Deduplication File System (DDDFS) technology architecture. DDDFS is divided into five layers: access interface, file service, content management, segment management, and container management. The access interface layer provides access services for standard protocols such as NFS, CIFS, and VTL; the file service layer is responsible for namespace and file metadata management; the content management layer divides files into variable-length data segments (blocks), computes segment fingerprints, and creates the segment sequence of each file; the segment management layer maintains the index information of data segments and detects and eliminates duplicate segments; and the container management layer aggregates segments with an average length of 8 KB into containers of 4 MB and compresses them to improve the storage space efficiency of the data. In addition, using containers as the storage unit also improves data read/write throughput.
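
A minimal sketch of the container idea described above: unique segments are appended to an open container until roughly 4 MB has accumulated, then the container is compressed and sealed as one unit. zlib stands in here for whatever compressor the real system applies, and the class name is hypothetical.

    import zlib

    CONTAINER_SIZE = 4 * 1024 * 1024    # 4 MB container, as in the DDDFS description

    class ContainerWriter:
        def __init__(self):
            self.open_segments = []     # unique segments waiting to be sealed
            self.open_bytes = 0
            self.sealed = []            # compressed containers ready for disk

        def add_segment(self, segment):
            self.open_segments.append(segment)
            self.open_bytes += len(segment)
            if self.open_bytes >= CONTAINER_SIZE:
                self.flush()

        def flush(self):
            if not self.open_segments:
                return
            # Compressing and writing the container as one large unit improves
            # both space efficiency and read/write throughput.
            self.sealed.append(zlib.compress(b"".join(self.open_segments)))
            self.open_segments, self.open_bytes = [], 0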

9 Ways to improve deduplication efficiency

Mark Lillibridge et al. [17] proposed the Sparse Indexing method in 2009, which exploits both the locality and the similarity of data to further reduce memory overhead. Sparse Indexing uses the two-thresholds two-divisors (TTTD) algorithm [41] to cut the backup data stream into chunks with an average length of 4 KB and, following the basic principle of content-defined chunking, divides the chunk sequence into data segments on the order of 10 MB. On the storage side, Sparse Indexing samples chunk fingerprints from each data segment at a rate governed by a parameter n and builds an in-memory index from them; a sampled chunk fingerprint is called a hook, and each hook maintains at most L segment identifiers associated with it. Each new data segment S is first sampled, and the M most similar stored segments, i.e. those sharing the most hooks with S, are then determined by querying the in-memory index. Finally, Sparse Indexing loads the full fingerprint information of those M similar segments, compares it with the fingerprint sequence of the new segment S, and eliminates the duplicates in the latter. As can be seen, Sparse Indexing is an approximate deduplication solution: it allows a small amount of duplicate content to survive between segments of low similarity, and in exchange reduces memory overhead and improves deduplication performance.
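
A minimal sketch of the sampling and champion-selection step: hooks are chosen here by a simple modulus test on the chunk fingerprint, and the sparse index maps each hook to at most L segment identifiers. The sampling rule and parameter values are illustrative, not the paper's exact choices.

    import hashlib
    from collections import Counter, defaultdict

    SAMPLE_MOD = 64   # roughly 1/64 of chunk fingerprints become hooks
    L = 8             # each hook remembers at most L segment ids
    M = 2             # number of most similar segments to load in full

    sparse_index = defaultdict(list)    # hook -> [segment ids]
    full_fingerprints = {}              # segment id -> full fingerprint list (on disk)

    def hooks(chunk_fps):
        return [fp for fp in chunk_fps if int(fp, 16) % SAMPLE_MOD == 0]

    def store_segment(seg_id, chunk_fps):
        full_fingerprints[seg_id] = chunk_fps
        for hook in hooks(chunk_fps):
            if len(sparse_index[hook]) < L:
                sparse_index[hook].append(seg_id)

    def similar_segments(chunk_fps):
        """Segments sharing the most hooks with the incoming segment."""
        votes = Counter(seg for hook in hooks(chunk_fps) for seg in sparse_index[hook])
        return [seg for seg, _ in votes.most_common(M)]

    def deduplicate_segment(seg_id, chunks):
        chunk_fps = [hashlib.sha1(c).hexdigest() for c in chunks]
        known = set()
        for seg in similar_segments(chunk_fps):
            known.update(full_fingerprints[seg])    # load the champions' fingerprints
        unique = [c for c, fp in zip(chunks, chunk_fps) if fp not in known]
        store_segment(seg_id, chunk_fps)
        return unique                               # only these chunks need to be stored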

To avoid the hard disk access bottleneck when looking up duplicate content and to further improve deduplication performance, Biplob Debnath et al. [21] used SSDs to store the complete fingerprint index in the ChunkStash solution. ChunkStash divides the data stream into a sequence of blocks with an average length of 8 KB, recording 64 bytes of fingerprint and metadata for each block; every 1024 blocks are encapsulated into a locality-preserving container, and the corresponding 64 KB of fingerprint information is stored in one logical page on the SSD. Because SSD access is 100 to 1000 times more efficient than disk access, this fundamentally improves fingerprint lookup performance. ChunkStash maintains three data structures in memory: a metadata cache, a write cache, and a compact hash table. The metadata cache holds fingerprint sequences of containers prefetched from the SSD in order to exploit data locality; the write cache buffers deduplicated data and its fingerprints so that containers can be packed and the SSD index written efficiently; and the compact hash table stores the SSD index information, with each entry holding a 2-byte compact fingerprint digest and a 4-byte SSD address, so that the deduplication process can read the full fingerprint information of the corresponding container directly from the SSD. For a newly received data block, ChunkStash first looks for a possible duplicate fingerprint in the metadata cache and the write cache; if none is found, it queries the compact hash table, and on a hash hit it reads the fingerprint page of the corresponding container from the SSD to confirm whether the block really is a duplicate. Because the compact hash table records only a 2-byte fingerprint digest, a hash collision during a query causes an unnecessary SSD read. Based on the figures above, with an average block length of 8 KB a 256 GB SSD can store the full block fingerprints of roughly 32 TB of data, with a corresponding memory overhead of about 24 GB. To further reduce memory overhead, Biplob Debnath et al. proposed adopting a sampling method similar to Sparse Indexing, inserting only some of each container's fingerprints into the compact hash table, which turns ChunkStash into an approximate deduplication method; their experiments show that with a sampling rate of 1.563%, only about 0.5% of duplicate blocks go unrecognized.
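
A minimal sketch of the compact-hash-table lookup path, with the SSD modeled as an in-memory list of fingerprint pages; each table entry keeps only a short digest plus the "address" of the container's fingerprint page, so a digest collision merely costs one wasted page read. Sizes and the digest rule are illustrative.

    import hashlib

    ssd_pages = []        # one fingerprint page per container (stand-in for the SSD)
    compact_table = {}    # 2-byte digest -> page index (the "4-byte SSD address")

    def digest2(fp):
        return fp[:4]     # first 2 bytes (4 hex characters) of the fingerprint

    def seal_container(chunk_fps):
        """Write the container's fingerprint page and index it in memory."""
        page_index = len(ssd_pages)
        ssd_pages.append(set(chunk_fps))
        for fp in chunk_fps:
            compact_table.setdefault(digest2(fp), page_index)
        return page_index

    def is_duplicate(chunk):
        fp = hashlib.sha1(chunk).hexdigest()
        page_index = compact_table.get(digest2(fp))
        if page_index is None:
            return False                  # definitely a new block
        # Hash hit: read the full fingerprint page from the "SSD" to confirm.
        # A digest collision shows up here as a page read that finds no match.
        return fp in ssd_pages[page_index]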

Wen Xia et al. [42] proposed the SiLo deduplication solution, which combines the locality and similarity of data to improve deduplication performance and reduce memory overhead. On top of variable-length chunks and files, SiLo introduces the concepts of segment and block: the variable-length chunk, with a length on the order of 8 KB, is the unit of deduplication; the segment, with a length on the order of 2 MB, is the unit of similarity evaluation, so a large file may be spread across multiple segments and the contents of many small files may be aggregated into one segment; and a block consists of multiple segments and, with a length on the order of 256 MB, is the unit of locality preservation. SiLo maintains three types of data structures in memory: a similarity hash table (Similarity Hash Table, SHT), a read cache, and a write cache. The SHT stores sample fingerprints from each segment; the read cache maintains the chunk fingerprint sequences of recently used blocks that contain duplicate data; and the write cache aggregates the captured non-duplicate chunks into segments and blocks and writes them to the hard disk. For a new data segment Si, SiLo first extracts its sample fingerprints and queries the SHT to judge whether duplicate or similar segments already exist in the system. If a similar segment is found, all chunk fingerprints of the block containing that similar segment are loaded into the read cache, and the chunk fingerprints of Si are then looked up in the cache so that its duplicate content can be eliminated. It is worth noting that if a segment adjacent to Si was detected as similar before Si arrived, then even when Si itself is not recognized as a similar segment, SiLo can still identify the duplicate content in Si by querying the fingerprint sequences already prefetched into the read cache. Compared with the solutions described above, SiLo does not depend one-sidedly on either the locality or the similarity of the data, and therefore adapts better to different types of datasets.
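
A minimal sketch of how the similarity hash table and the block-level read cache cooperate; the sampling rule, sizes, and data layout are simplified stand-ins for the structures described above.

    import hashlib
    from collections import Counter

    SAMPLE_MOD = 32

    sht = {}               # sample fingerprint -> block id holding the similar segment
    blocks_on_disk = {}    # block id -> {segment id: [chunk fingerprints]}
    read_cache = {}        # block id -> set of all chunk fingerprints in that block

    def index_segment(block_id, seg_id, chunk_fps):
        blocks_on_disk.setdefault(block_id, {})[seg_id] = chunk_fps
        for fp in chunk_fps:
            if int(fp, 16) % SAMPLE_MOD == 0:       # sample fingerprints go to the SHT
                sht.setdefault(fp, block_id)

    def dedup_segment(chunks):
        chunk_fps = [hashlib.sha1(c).hexdigest() for c in chunks]
        samples = [fp for fp in chunk_fps if int(fp, 16) % SAMPLE_MOD == 0]
        hits = Counter(sht[fp] for fp in samples if fp in sht)
        if hits:
            # Similar segment found: prefetch the whole block's fingerprints, so that
            # neighbouring segments later hit the read cache through locality alone.
            block_id = hits.most_common(1)[0][0]
            read_cache[block_id] = {fp for fps in blocks_on_disk[block_id].values()
                                    for fp in fps}
        known = set().union(*read_cache.values()) if read_cache else set()
        return [c for c, fp in zip(chunks, chunk_fps) if fp not in known]

Because the read cache is filled at the granularity of a whole block, a segment whose neighbour was already matched benefits from the prefetched fingerprints even when its own samples miss in the SHT, which is the combined locality-and-similarity effect described above.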

