ZFS and data deduplication

Source: Internet
Author: User
http://hub.opensolaris.org/bin/view/Community+Group+zfs/WebHome
https://blogs.oracle.com/bonwick/entry/zfs_dedup

 

What is deduplication?

Deduplication is the process of eliminating duplicate data. It can operate at the file level, the block level, or the byte level, using a hash algorithm with a very low collision probability to uniquely identify each unit of data (a file, a block, or a run of bytes). When a secure hash such as SHA-256 is used, the probability that two different data items collide is 1 in 2^256, or roughly 1 in 10^77.
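To make the hashing idea concrete, here is a short shell sketch (a generic demonstration, not part of ZFS itself) showing that identical data always produces the identical SHA-256 digest, which is what lets a deduplicating system detect duplicates without comparing data byte by byte:

```shell
# Two files with identical content produce the same SHA-256 digest.
printf 'hello world\n' > a.txt
printf 'hello world\n' > b.txt
sha256sum a.txt b.txt

# A single-byte change yields a completely different digest.
printf 'hello worle\n' > c.txt
sha256sum c.txt
```

A deduplicating store keeps one copy per unique digest and replaces further copies with references to it.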
At which level should deduplication operate: files, blocks, or bytes?
File-level deduplication hashes each file as a whole. This requires the least computation, but the drawback is that any modification to a file changes its hash signature, so all space previously saved for that file is lost, because the two files are no longer identical. This method works well for files such as JPEGs and MPEGs, but poorly for files like virtual machine images: even when two large images differ in only a small part, they are distinct files at the file level.
Block-level deduplication (on fixed-size blocks) requires more computation than file-level deduplication, but it can deduplicate large files such as virtual machine images. Most of the data in VM images is duplicated, for example the operating system portion of each image. With block-level deduplication, only the blocks unique to each image consume extra space; identical blocks are shared.
Byte-level deduplication requires the most computation, since it must determine where regions of duplicate data start and end. It is, however, an ideal choice for mail servers: an attachment may appear in many messages at different offsets, so it cannot be deduplicated at the block level. This kind of deduplication is generally implemented inside specific applications, such as Exchange Server, because the application understands the data it manages and can easily deduplicate it internally.

ZFS provides block-level deduplication, which is the best fit for general-purpose storage. ZFS uses SHA-256 to compute hash signatures.

When should duplicates be removed: now or later?

Besides the file/block/byte distinction described above, deduplication can also be synchronous (real-time, inline) or asynchronous (batch, offline). With synchronous deduplication, duplicates are eliminated as the data is written. With asynchronous deduplication, data is first written to disk and duplicates are removed by a later background process (for example, run at night). Asynchronous deduplication is typically used on storage systems with limited CPU power or limited multithreading, to minimize the impact on daily workloads. If CPU performance is sufficient, however, synchronous deduplication is recommended, because it avoids unnecessary disk writes.

ZFS deduplication is synchronous. It assumes a high-performance CPU and a highly multithreaded operating system (such as Solaris).

How to use ZFS deduplication

It is very easy to use. If you have a storage pool named tank and want deduplication on it, set:

zfs set dedup=on tank

Should you enable ZFS deduplication?

It depends on your data. If your data contains no duplicates, enabling deduplication adds overhead without any benefit. If your data does contain duplicates, ZFS deduplication saves space and can improve performance. The space saving is obvious; the performance improvement comes from eliminating disk writes for duplicate data and from duplicated blocks being shared in memory.
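Whether deduplication is paying off can be checked on a live pool: `zpool list` includes a DEDUP column showing the achieved ratio (the pool name tank is carried over from the examples in this article):

```shell
# The DEDUP column reports the deduplication ratio for the pool,
# e.g. 2.00x means each unique block is referenced twice on average.
zpool list tank
```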

Most storage environments contain a mix of data types. ZFS lets you enable deduplication only where it helps. For example, if your storage pool contains a home directory, a virtual machine image directory, and a source code directory, you can set:

zfs set dedup=off tank/home

zfs set dedup=on tank/vm

zfs set dedup=on tank/src
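Because dedup is an ordinary ZFS dataset property, the per-dataset settings can be confirmed afterwards with `zfs get` (again assuming the pool is named tank):

```shell
# Show the effective dedup setting for every dataset under tank;
# the SOURCE column indicates whether each value is set locally or inherited.
zfs get -r dedup tank
```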

Trust or verify?

Two data items are considered identical if their hash values are identical. With SHA-256, the probability that two different items collide is 1 in 2^256, which is vanishingly small. ZFS nevertheless provides a verify option, which performs a byte-for-byte comparison of items whose hashes match to confirm they really are the same; if they differ, the new data is written rather than shared. The syntax is:

zfs set dedup=verify tank

Hash Selection

Different hash functions require different amounts of computation. A recommended approach is to combine a weak hash (which requires fewer operations) with verify, providing fast deduplication with guaranteed correctness. In ZFS the option is:

zfs set dedup=fletcher4,verify tank

Unlike SHA-256, fletcher4 cannot be trusted to be collision-free, so it is only suitable when combined with verify, which handles collisions by comparing the data. In general, this combination performs better.

If you are not sure which hash is more efficient, use the default dedup=on.

Scalability and Efficiency

Most deduplication solutions handle only a limited amount of data, typically a few terabytes, because they require the deduplication table to reside in memory. ZFS places no such limit on data size and can handle petabyte-scale data. However, performance is best when the deduplication table does fit in memory.
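On an existing pool you can estimate how large the deduplication table (DDT) would be before committing to dedup=on: `zdb -S` simulates deduplication and prints a DDT histogram with the ratio the pool would achieve, and `zpool status -D` shows the live table once dedup is enabled (both are administrative commands run against the pool, here assumed to be named tank):

```shell
# Simulate deduplication to see what ratio the pool would achieve
# and how large the dedup table would be, without changing anything.
zdb -S tank

# Once dedup is enabled, show the in-core and on-disk DDT statistics.
zpool status -D tank
```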

 

