Do dedup first or do compression first?

Source: Internet
Author: User
Tags: dedupe

In all the enterprise storage products I have seen so far, whenever both dedup and compression are offered, dedup comes first and compression follows. That seems natural, but is the order necessary? Is it the best one?

Is this going to be a problem?

In a nutshell, assuming fixed-length blocks (segments, ...):

1. Compression is CPU-intensive. Done inline, it needs no additional lookups. The input usually cannot be too small, so that the compressor can find patterns and get a decent ratio. Compression is also easy to accelerate with ASICs; IBM, HP and many others have in fact done so.

2. Dedup is considerably more complex. Done inline, for example, it has to compute a hash (the cost depends on the algorithm, and is usually lower than the CPU cost of compression; computing multiple hashes is another matter), it needs additional fingerprint lookups and updates, and it may even incur I/O along the way (on a cache miss). The larger the block, the less metadata, but the worse the dedup ratio; the smaller the block, the finer the granularity and the better dedup may work, but the more metadata there is and the tighter memory becomes. Offline dedup is a different story. A rough sketch of an inline dedup-first write path follows.
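To make the extra bookkeeping concrete, here is a minimal sketch of an inline "dedup first, then compress" write path. The block size, fingerprint table, and store are simplified stand-ins for illustration, not any particular product's design.

    # Minimal sketch: per-block hash, fingerprint lookup, then compression of
    # unique blocks only. Not a vendor implementation.
    import hashlib
    import zlib

    BLOCK_SIZE = 8 * 1024          # fixed-length blocks, e.g. 8 KB
    fingerprints = {}              # fingerprint -> index of the stored block
    store = []                     # compressed unique blocks

    def write(data: bytes):
        for off in range(0, len(data), BLOCK_SIZE):
            block = data[off:off + BLOCK_SIZE]
            fp = hashlib.sha256(block).digest()     # hash cost is always paid inline
            if fp not in fingerprints:              # the extra lookup dedup requires
                store.append(zlib.compress(block))  # only unique blocks get compressed
                fingerprints[fp] = len(store) - 1
            # for a duplicate block, only a reference/refcount update would be needed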

Take XtremIO as an example, which uses fixed-length blocks. Its first-generation product shipped with inline dedup, and compression was only added later; this is closely tied to its architecture: content-based addressing, writes landing in NVRAM, and two internal forwarding hops. When a write comes in, it is hashed, and the hash determines the partition, i.e. which node the block is sent to. If compression were done first (frankly, it cannot be), the ingress node would have to bear compression plus hashing before the data could be balanced out to other nodes, which would hurt the global balance, even though compressing first might save on mirroring and lower cache consumption, since the data is smaller after compression.
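To illustrate why that ordering is baked in, here is a minimal sketch of content-based addressing; the node count and routing rule are illustrative assumptions, not XtremIO internals.

    # The block's fingerprint decides which node owns it, so the ingress node
    # only needs to hash before forwarding.
    import hashlib

    NUM_NODES = 4                      # illustrative cluster size

    def owner_node(block: bytes) -> int:
        fp = hashlib.sha256(block).digest()
        return fp[0] % NUM_NODES       # partition on a slice of the fingerprint

    # Compressing before this step would force the ingress node to pay
    # compression + hashing for every block before it could forward anything.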

In practical engineering, the key factors that influence the order are: 1) the choice of block size and how much extra engineering variable sizes bring, and 2) how dedupable and how compressible the workload itself is.

1. With dedup, the size of a unique block stays constant (8 KB, 16 KB, ...), which is friendlier to FS/LUN layouts built on fixed-length blocks (most are). Compressed output, by contrast, has a variable and unpredictable size, so there is extra work to do: you had better buffer and repack, and you may still end up wasting some disk space (see the sketch after this list).

2. For data with a high duplication rate, deduping first quickly reduces the amount of data still to be processed; conversely, for data with a high compression ratio, compressing first wins for the same reason. Unfortunately, this crucial information is often unknown in practice.
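On point 1, a rough sketch of the extra work variable-size compressed output creates, assuming a hypothetical fixed 4 KB allocation unit:

    # Compressed blocks come out at unpredictable sizes and must be packed into
    # fixed-size on-disk slots; the tail of the last slot is wasted space.
    import zlib

    SLOT = 4 * 1024                             # hypothetical fixed allocation unit

    def slots_and_waste(block: bytes):
        compressed = zlib.compress(block)
        slots = -(-len(compressed) // SLOT)     # round up to whole slots
        waste = slots * SLOT - len(compressed)  # space lost to rounding
        return slots, waste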

Recently, however, I came across a counter-example: compression first, dedup afterwards. Why? How does it work out?

Tegile All-Flash

The company was founded around 2010, started with hybrid arrays (2012, T3100-T3400), and only launched all-flash models in 2014 (T3600, T3800); getting its products and capabilities into the Visionaries quadrant of the Gartner Magic Quadrant within just one year is no small feat. All of its systems are built on the ZFS platform, a rare all-flash line that supports both block and FS, and its dedup/compression is fairly feature-rich: configurable per pool and per LUN/FS, with block sizes from 4 KB to 128 KB (fixed once set).

Recently its official blog published some tests and comparisons; the highlights are as follows.

1. In terms of the final data reduction, for a given data set the end result is the same and deterministic regardless of which step runs first. That means customers can rest assured the reduction effect does not suffer.

2. In terms of reduction efficiency (CPU, time), their tests show that compressing first usually has the advantage; only when the data has a very high dedup rate and almost no compressibility does deduping first come out ahead. So for most workloads, databases and the like, you might as well follow their approach and compress first.

CF – compress first
HF – hashing first
c – cost of compression per unit
h – cost of hashing per unit

x is the number of units to process, with SHA-256 as the hash and LZ4 as the compressor. The bracketed number is the time to process those units; lower is better.
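Reading the per-case numbers below, the general pattern behind them (my restatement; D and C are the fractions removed by dedup and by compression, symbols not used in the original post) is:

CF: x*c + (1 - C)*x*h — compress everything, then hash only what compression leaves
HF: x*h + (1 - D)*x*c — hash everything, then compress only the unique remainder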

Case 1: 0% dedupe, 0% compression

CF: x*c + x*h [4760 ns]
HF: x*h + x*c [4760 ns]
Verdict: CF and HF are equal

Case 2: 0% dedupe, 50% compression

CF: x*c + x/2*h [2940 ns]
HF: x*h + x*c [4760 ns]
Verdict: CF is better

Case 3: 50% dedupe, 50% compression

CF: x*c + x/2*h [2940 ns]
HF: x*h + x/2*c [4200 ns]
Verdict: CF is better

Case 4: 50% dedupe, 0% compression

CF: x*c + x*h [4760 ns]
HF: x*h + x/2*c [4200 ns]
Verdict: HF is better

Case 5: 80% dedupe, 0% compression

CF: x*c + x*h [4760 ns]
HF: x*h + x/5*c [3864 ns]
Verdict: HF is better

Case 6: 80% dedupe, 25% compression

CF: x*c + 3x/4*h [3850 ns]
HF: x*h + x/5*c [3864 ns]
Verdict: CF is better

Case 7: 80% dedupe, 50% compression

CF: x*c + x/2*h [2940 ns]
HF: x*h + x/5*c [3864 ns]
Verdict: CF is better

Case 8: 80% dedupe, 75% compression

CF: x*c + x/4*h [2030 ns]
HF: x*h + x/5*c [3864 ns]
Verdict: CF is better
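The absolute per-unit costs are not stated in the post, but they can be backed out of the table: Case 1 gives x*(c + h) = 4760 ns and Case 2 gives x*c + x/2*h = 2940 ns, which implies x*h = 3640 ns and x*c = 1120 ns. The short script below, using those inferred values as assumptions, reproduces all eight cases.

    # Reproduces the CF/HF cost table above; the two constants are inferred
    # from Cases 1 and 2, not stated in the original post.
    X_C = 1120   # ns to LZ4-compress all x units (inferred)
    X_H = 3640   # ns to SHA-256-hash all x units (inferred)

    def cf_cost(comp):
        # Compress first: compress all x units, then hash only what remains.
        return X_C + (1 - comp) * X_H

    def hf_cost(dedup):
        # Hash first: hash all x units, then compress only the unique remainder.
        return X_H + (1 - dedup) * X_C

    cases = [(0.00, 0.00), (0.00, 0.50), (0.50, 0.50), (0.50, 0.00),
             (0.80, 0.00), (0.80, 0.25), (0.80, 0.50), (0.80, 0.75)]

    for i, (d, c) in enumerate(cases, 1):
        cf, hf = cf_cost(c), hf_cost(d)
        verdict = "equal" if cf == hf else ("CF is better" if cf < hf else "HF is better")
        print(f"Case {i}: {d:.0%} dedupe, {c:.0%} compression -> "
              f"CF {cf:.0f} ns, HF {hf:.0f} ns, {verdict}")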

This at least offers a way of thinking about the problem and shows that someone has done it this way with acceptable results. What you should actually use must be decided in light of your overall architecture, application scenarios, workload, and so on. If you do not understand your own situation, no one else can help you.
