Highlights of technical questions about deduplication

Q: What are the advantages and disadvantages of software-based and hardware-based deduplication products?

A: Software-based deduplication aims to eliminate redundancy at the source, while hardware-based deduplication emphasizes data reduction inside the storage system. Hardware-based deduplication offers no bandwidth savings, whereas source-side deduplication does; on the other hand, hardware-based deduplication usually achieves a higher compression level, and hardware-based products require less maintenance.

Hardware deduplication appliances have received much attention for their high performance, scalability, and ease of deployment. Under normal circumstances, the backup software treats a dedicated appliance as an ordinary "disk system" and is unaware of any internal deduplication process. Small businesses and remote offices usually avoid these appliances because they cost more than deduplicating with software, but they are an ideal choice for enterprise-level deployments.

Hardware-based deduplication may also be integrated into other storage (target) platforms. For example, deduplication is often a feature of virtual tape library (VTL) systems, which use disks instead of tapes to speed up backup jobs; adding deduplication maximizes the use of the VTL's disk capacity. In many cases, VTL deduplication is executed as an out-of-band process. One advantage of this is that all VTL content can be reduced by the deduplication engine; the downside is that deduplication is not real-time. However, some VTL systems perform in-band deduplication as data arrives from the backup server.

Q: I have heard that hardware-based deduplication products come in in-band and out-of-band varieties. Which of the two is better?

A: First, the general benefit of hardware-based deduplication products: they offload the processing burden associated with software-based deduplication. The deduplication function can also be integrated into other data protection hardware, such as backup platforms, VTL systems, and even general-purpose storage such as network attached storage (NAS). This approach is generally not aimed at shrinking the backup window or meeting a recovery objective; rather, users can usually achieve the highest compression level and thereby create the largest amount of usable storage space.

As for the in-band and out-of-band types you mentioned, each has its own advantages. The following are the differences between the two approaches and their respective strengths:

In-band deduplication reduces data as it is written to storage. Although the processing requires extra CPU capacity and can lengthen the backup window, in-band deduplication is efficient because the data is handled only once.

Out-of-band deduplication is performed after the data has been stored. This method does not affect the size of the backup window and eases the load on the CPU, avoiding a bottleneck between the backup server and storage. However, out-of-band deduplication consumes somewhat more disk space while it runs, and the post-process pass may take longer than the actual backup window. Disk contention is another problem: if users try to access the storage while deduplication is in progress, disk performance suffers.
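
To make the contrast concrete, here is a minimal sketch in Python of the two processing models; the class and method names are illustrative assumptions, not any product's API:

```python
import hashlib

class DedupTarget:
    """Toy disk target illustrating the two processing models."""

    def __init__(self):
        self.store = {}    # fingerprint -> deduplicated block
        self.staging = []  # raw landing area used only by the out-of-band path

    def _fp(self, block: bytes) -> str:
        return hashlib.sha256(block).hexdigest()

    # In-band: deduplicate while the backup stream is in flight.
    # Costs CPU inside the backup window, but each block is handled
    # once and never lands on disk in duplicate form.
    def write_inband(self, block: bytes) -> str:
        fp = self._fp(block)
        self.store.setdefault(fp, block)
        return fp

    # Out-of-band: ingest at full speed, deduplicate in a later pass.
    # The backup window is untouched, but raw copies occupy extra disk
    # until post-processing runs (and that pass contends for the disks).
    def write_raw(self, block: bytes) -> None:
        self.staging.append(block)

    def postprocess(self) -> None:
        for block in self.staging:
            self.write_inband(block)
        self.staging.clear()  # reclaim the temporary landing space
```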

Q: What are the advantages and disadvantages of the file-level and block-level deduplication technologies?

A: Deduplication technology greatly improves the value proposition of disk-based data protection, WAN-based remote-branch backup consolidation, and disaster recovery. The technology identifies duplicate data, eliminates redundancy, and reduces the total volume of data that must be transferred and stored. Some deduplication technologies operate at the file level, while others examine data more deeply at the sub-file, or block, level. Either way, determining whether a file or block is unique yields benefits; the two approaches differ in how much they reduce data capacity and in how long it takes them to identify duplicates.

File-level deduplication technology

File-level deduplication is also known as single-instance storage (SIS). The attributes of a file to be backed up or archived are checked against an index and compared with files already stored. If no identical file exists, the file is stored and the index is updated; otherwise, only a pointer to the existing file is saved. As a result, only one instance of a given file is stored, and subsequent copies are replaced by "stubs" that point to the original file.
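
A minimal sketch of the idea in Python follows; the index structure and function names are assumptions for illustration, not any vendor's implementation. Each incoming file is hashed whole, and a repeated hash stores only a stub pointing at the existing copy:

```python
import hashlib
from pathlib import Path

index = {}  # SHA-256 digest -> path of the single stored instance

def store_file(path: Path, vault: Path) -> Path:
    """Store one file under single-instance storage rooted at `vault`."""
    data = path.read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    if digest in index:
        # Duplicate file: save only a "stub" (a symlink here) that
        # points to the already-stored original.
        stub = vault / (path.name + ".stub")
        stub.symlink_to(index[digest])
        return stub
    # First instance: store the file itself and update the index.
    stored = vault / digest
    stored.write_bytes(data)
    index[digest] = stored
    return stored
```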

Block-level deduplication technology

Block-level deduplication operates at the sub-file level. As the name suggests, a file is divided into segments, either strips or blocks, and these segments are compared with previously stored information to check for redundancy.

The most common way to check for duplicate data is to assign an identifier to each data block, for example by using a hash algorithm to generate a unique ID or fingerprint for the block. The generated ID is then compared against a central index. If the ID already exists, the block has been processed and stored before, so only a pointer to the previously stored data needs to be saved. If the ID does not exist, the block is unique: the ID is added to the index and the block is stored on disk.
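
That paragraph maps almost line for line onto code. In the sketch below (Python, with SHA-256 standing in for whatever fingerprint a real product computes), each block's ID is checked against a central index, and restore reassembles content from the stored pointers:

```python
import hashlib

block_index = {}  # fingerprint -> stored block data

def dedupe_block(block: bytes) -> str:
    """Return the fingerprint that identifies `block` in the store."""
    fp = hashlib.sha256(block).hexdigest()
    if fp not in block_index:
        block_index[fp] = block  # unique block: store it and index the ID
    # If the ID already existed, nothing new is stored; the fingerprint
    # itself serves as the pointer to the previously stored data.
    return fp

def restore(fingerprints: list[str]) -> bytes:
    """Reassemble file content from its ordered list of block pointers."""
    return b"".join(block_index[fp] for fp in fingerprints)
```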

The block size each vendor checks varies. Some vendors use fixed-size blocks, while others use blocks of varying sizes (and some even allow end users to change the size of the fixed blocks, which adds to the confusion). A fixed block might be 8 KB or 64 KB; the difference is that the smaller the block, the higher the probability that it will be identified as redundant, which means more redundancy is eliminated and less data is stored. Fixed blocks have one problem: if a file changes and the deduplication product still uses the same block boundaries as the last check, it may fail to detect the redundant portions, because the data in the file has shifted or been removed while the boundaries have not, and the remaining comparisons become meaningless.
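
The boundary problem is easy to demonstrate. In this small Python example (8-byte blocks chosen purely for readability), inserting one byte at the front of the data shifts every fixed boundary, so none of the old blocks are recognized:

```python
def fixed_blocks(data: bytes, size: int = 8) -> list[bytes]:
    """Split data into fixed-size blocks (8 bytes here, for readability)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

original = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
edited = b"x" + original  # a single byte inserted at the front

# Every block boundary after the insertion has shifted, so no block of
# `edited` matches any block of `original`.
shared = set(fixed_blocks(original)) & set(fixed_blocks(edited))
print(shared)  # -> set(): zero redundancy detected despite 26 shared bytes
```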

Blocks of varying sizes improve the odds of detecting common redundancy, especially when files change. This method looks for natural patterns or breakpoints in the content to decide where to split the data, so it can detect duplicate data even after a file changes and its blocks shift. The disadvantage? Because block sizes vary, vendors must track and compare many more IDs, which increases both the index size and the computing time.
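
Variable-size splitting is usually implemented as content-defined chunking: a rolling value computed over a sliding window declares a boundary wherever it matches a chosen bit pattern, so boundaries follow the content rather than byte offsets. The sketch below uses a deliberately toy rolling value (a windowed byte sum); the window size and mask are arbitrary illustrative choices:

```python
import random

def content_defined_chunks(data: bytes, window: int = 8,
                           mask: int = 0x1F) -> list[bytes]:
    """Cut a chunk wherever a rolling value over the last `window` bytes
    matches `mask`; average chunk length is roughly 32 bytes here."""
    chunks, start = [], 0
    for i in range(window - 1, len(data)):
        # Toy rolling value: sum of the sliding window. Real systems use
        # Rabin or Gear fingerprints, which update in O(1) per byte.
        h = sum(data[i + 1 - window:i + 1])
        if (h & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

random.seed(0)
original = bytes(random.randrange(256) for _ in range(4096))
edited = original[:100] + b"!" + original[100:]  # one inserted byte

# Because boundaries depend on content, chunking realigns shortly after
# the edit, and most chunks still match their stored counterparts.
matching = set(content_defined_chunks(original)) & set(content_defined_chunks(edited))
print(len(matching), "of", len(set(content_defined_chunks(original))), "chunks unchanged")
```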

File-level and block-level technologies differ in more than how they operate; each method has its own advantages and disadvantages.

File-level deduplication is less efficient than block-level deduplication in the following respects:

If a file changes, the entire file must be stored again. A file such as a PowerPoint presentation may change in only a trivial way, for example updating the title slide with a new presenter or date, yet the whole document has to be stored anew. Block-level deduplication, by contrast, stores only the changes between one version of a file and the next. As a result, the compression ratio of file-level technology is generally much lower, while block-level technology can reduce stored data by ratios as high as 50:1.

File-level deduplication is more efficient than block-level deduplication in the following respects:

The index for file-level deduplication is very small, so determining whether data is duplicated takes little computing time, and the deduplication process has little impact on backup performance. File-level deduplication also imposes a low processing load, because the index is small and the number of comparisons is low. It likewise has little impact on recovery time: block-level deduplication must use its master index to match data blocks with their pointers in order to "reassemble" the data, whereas file-level technology stores unique files plus pointers to those files, so little reassembly is needed.
