[Guide] What are the differences between data compression and data deduplication, and how should each be applied correctly in practice? I had not studied the principles and technologies of data compression before, so I did some homework, read and organized the relevant material, and compared and analyzed it against data deduplication technology.
In the face of rapid data growth, enterprises must constantly purchase large numbers of storage devices to meet ever-increasing storage demands. Yet simply adding capacity does not fundamentally solve the problem. First, storage procurement budgets climb higher and higher, beyond what most enterprises can afford. Second, as data centers expand, storage management costs, floor space, cooling, and energy consumption become increasingly serious concerns, with energy consumption especially prominent. Furthermore, large numbers of heterogeneous physical storage resources greatly increase the complexity of storage management, which easily leads to wasted storage resources and low utilization. We therefore need another way to address rapid information growth and stem the data "explosion". The concept of efficient storage was proposed for this purpose: to relieve the space-growth pressure on storage systems, reduce the space data occupies, simplify storage management, maximize use of existing resources, and lower costs. The five efficient-storage technologies currently recognized in the industry are data compression, deduplication, thin provisioning, automated storage tiering, and storage virtualization. Among these, data compression and deduplication are the two key data-reduction technologies. In short, data compression reduces redundancy by re-encoding data, while deduplication focuses on deleting duplicate data blocks to reduce data capacity.
Data Compression 
The origins of data compression can be traced to the Shannon coding proposed in 1948 by Claude Shannon, the father of information theory. In 1952, Huffman proposed the first practical encoding algorithm for data compression, which is still widely used today. In 1977, the Israeli mathematicians Jacob Ziv and Abraham Lempel proposed a new family of data compression methods, the Lempel-Ziv series (LZ77, LZ78, and several variants), whose simplicity, efficiency, and other strengths eventually made it the basis of today's main data compression algorithms. The LZ-series algorithms are lossless and are implemented with dictionary-based encoding; the four mainstream members are LZ77, LZSS, LZ78, and LZW. They fall into two classes:
The idea of the first class, the sliding-dictionary method, is to check whether the character sequence currently being compressed has already appeared in earlier input data. If it has, the repeated string is replaced with a "pointer" to its earlier occurrence, as shown in Figure 1. The "dictionary" here is implicit: it is simply the previously processed data itself. All algorithms of this class are based on LZ77, developed and published in 1977 by Abraham Lempel and Jacob Ziv. An improved variant, developed by Storer and Szymanski in 1982, is known as the LZSS algorithm.
Figure 1. The first class of dictionary-based coding
The idea of the second class of algorithms is to build an explicit "dictionary of phrases" from the input data. When a phrase that is already in the dictionary appears again during encoding, the encoder outputs the phrase's index number in the dictionary instead of the phrase itself, as shown in Figure 2. A. Lempel and J. Ziv first published this encoding method in 1978, as LZ78. Building on their research, Terry A. Welch published an improved version of the algorithm in 1984 and first applied it in high-speed disk controllers; the method has since been known as LZW (Lempel-Ziv-Welch) compression.
Figure 2. The second class of dictionary-based coding
The basic idea of the Lempel-Ziv family is to replace the original data with position information for compression, and to use that position information to restore the data during decompression; hence it is also called "dictionary" encoding. The current industry-standard compression algorithm in storage applications (ANSI, QIC, IETF, FRF, TIA/EIA) is LZS (Lempel-Ziv-Stac), proposed and patented by Stac Electronics; the patent is now held by Hifn, Inc. Applying data compression can significantly reduce the amount of data to be processed and stored, typically achieving compression ratios on the order of 2:1 to 3:1.
In 1977, Jacob Ziv and Abraham Lempel described a sliding-window-based caching technique for the most recently processed text (J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, May 1977). The algorithm is generally called LZ77. LZ77 and its variants exploit the fact that words and phrases in a text stream (or image patterns in a GIF) are likely to recur. When a repetition occurs, the repeated sequence can be replaced by a short code. The compressor scans for such repetitions and emits codes in place of the repeated sequences; over time, codes can be reused to capture new sequences. The algorithm must be designed so that the decompressor can derive the current mapping between codes and the original data sequences.
Figure 3. The LZ77 algorithm
The LZ77 compression algorithm (and its variants) uses two buffers. The sliding-history buffer contains the last N source characters already processed, and the lookahead buffer contains the next L characters. The algorithm tries to match two or more characters at the start of the lookahead buffer against strings in the sliding-history buffer. If no match is found, the first character of the lookahead buffer is output as a 9-bit literal and shifted into the sliding window, whose oldest character is discarded. If a match is found, the algorithm keeps scanning for the longest match, then outputs the matched string as a triple (flag, pointer, length). For a match of K characters, the K oldest characters are shifted out of the sliding window and the K newly encoded characters are shifted in.
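The matching loop just described can be sketched in a few lines of Python. This is a didactic toy, not a production encoder: the window and lookahead sizes and the (offset, length, next-character) triple layout are simplifying assumptions, and the bit-level packing of the 9-bit literals is omitted.

```python
def lz77_compress(data, window=255, lookahead=15):
    """Emit (offset, length, next_char) triples; offset 0 means no match (a literal)."""
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        # Scan the sliding-history window for the longest match with the lookahead buffer.
        for j in range(max(0, i - window), i):
            k = 0
            while (k < lookahead and i + k < len(data) - 1
                   and data[j + k] == data[i + k]):
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        # Each triple also carries the first character after the match.
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decompress(triples):
    buf = bytearray()
    for off, length, ch in triples:
        for _ in range(length):          # byte-at-a-time copy handles overlapping matches
            buf.append(buf[-off])
        buf.append(ch)
    return bytes(buf)
```

Note the per-position scan of the window: this is exactly the cost that grows when the window is enlarged, as discussed below.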
Although LZ77 is effective and adapts well to the current input, it has shortcomings. The algorithm uses a finite window to search for matches in the preceding text, so for text blocks that are very long relative to the window, many potential matches are missed. The window size can be increased, but at two costs: (1) the algorithm's processing time grows, because it must attempt to match the lookahead buffer at every position in the sliding window; and (2) the <pointer> field must be longer to allow longer jumps.
The LZS algorithm is based on LZ77 and consists of two parts: a sliding window and adaptive coding. During compression, it searches the sliding window for a block identical to the data being processed, and replaces that data with the block's offset and length within the window, thus achieving compression. If the sliding window contains no field identical to the data block being processed, or if the offset and length values would exceed the length of the block they replace, no substitution is made. The LZS algorithm is very simple to implement, easy to process, and well suited to all kinds of high-speed applications.
LZ77 handles the case of no matching string in the window by outputting a literal character, but this solution carries redundant information, in two ways: the null pointer, and the fact that the encoder may output an extra character that could have been part of the next matching string. The LZSS algorithm solves this problem more effectively. The idea is: if a matching string is longer than the pointer that would encode it, output the pointer; otherwise output the actual characters. Since the compressed output stream now contains both pointers and characters, an additional flag, the ID bit, is needed to distinguish them.
In the same computing environment, the LZSS algorithm achieves a higher compression ratio than LZ77, and decoding remains simple, which is why it became the basis for developing new algorithms. Many later file compression programs use LZSS ideas, such as PKZIP, ARJ, LHarc, and zoo; they differ mainly in pointer length and window size. LZSS can also be combined with entropy coding: ARJ combines it with Huffman coding, early PKZIP with Shannon-Fano, and later versions of PKZIP also adopted Huffman coding.
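The LZSS decision rule, emit a pointer only when it pays for itself and otherwise emit a flagged literal, can be sketched as follows. This is a toy illustration: the MIN_MATCH break-even threshold and the ('L', …)/('P', …) token tags are assumptions for readability, not any shipping on-disk format.

```python
MIN_MATCH = 3  # a pointer costs roughly two bytes, so only matches of 3+ pay off

def lzss_tokens(data, window=4095, lookahead=18):
    """Yield ('L', byte) literal tokens or ('P', offset, length) pointer tokens."""
    i = 0
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            while k < lookahead and i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        if best_len >= MIN_MATCH:        # pointer shorter than the text it replaces
            yield ('P', best_off, best_len)
            i += best_len
        else:                            # otherwise: flagged literal
            yield ('L', data[i])
            i += 1

def lzss_restore(tokens):
    buf = bytearray()
    for t in tokens:
        if t[0] == 'L':
            buf.append(t[1])
        else:
            _, off, length = t
            for _ in range(length):
                buf.append(buf[-off])
    return bytes(buf)
```

The token tag plays the role of the ID bit: it is what lets the decoder tell pointers and literals apart in the mixed stream.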
LZ78's coding idea is to continually extract new strings (intuitively, new "entries") from the character stream, and then represent each "entry" with a "code word". Encoding the character stream thus becomes replacing it with code words, producing a code stream and thereby compressing the data. Compared with LZ77, LZ78 reduces the number of string comparisons in each encoding step, while the compression ratio is similar.
LZW encoding revolves around a conversion table called the dictionary. The table stores character sequences called prefixes and assigns each entry a code word, or serial number. In effect, the table extends the 8-bit ASCII character set with added symbols representing the variable-length ASCII strings that occur in the text or image. The extended codes can be 9, 10, 11, 12, or more bits wide. Welch's paper uses 12 bits, which allows 4096 distinct codes; that is, the conversion table has 4096 entries, of which 256 store the predefined single characters and the remaining 3840 store prefixes.
An LZW encoder (implemented in software or hardware) converts input to output by managing this dictionary. Its input is a character stream (charstream) of 8-bit ASCII characters; its output is a code stream (codestream) of n-bit (for example, 12-bit) codes, each representing a single character or a string of characters.
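The dictionary management described above can be illustrated with a minimal LZW codec in Python. For clarity this sketch keys the table on byte strings and emits integer codes as a list, omitting the fixed 12-bit packing and code-width growth of Welch's design.

```python
def lzw_encode(data):
    """Encode bytes into a list of dictionary indices (code words)."""
    table = {bytes([i]): i for i in range(256)}   # the 256 predefined entries
    next_code = 256
    w, out = b"", []
    for b in data:
        wc = w + bytes([b])
        if wc in table:
            w = wc                                # keep extending the known prefix
        else:
            out.append(table[w])
            table[wc] = next_code                 # learn a new prefix on the fly
            next_code += 1
            w = bytes([b])
    if w:
        out.append(table[w])
    return out

def lzw_decode(codes):
    """Rebuild the same dictionary on the fly; no table is transmitted."""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    prev = table[codes[0]]
    out = bytearray(prev)
    for c in codes[1:]:
        # The not-yet-defined code can only be prev + its own first byte (KwKwK case).
        entry = table[c] if c in table else prev + prev[:1]
        out.extend(entry)
        table[next_code] = prev + entry[:1]
        next_code += 1
        prev = entry
    return bytes(out)
```

Note that the decoder reconstructs the dictionary from the code stream alone, which is why LZW needs no side channel for the table.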
The LZW algorithm is widely used, and it runs faster than LZ77 because it needs far fewer string-comparison operations. Further improvements to LZW add variable code-word lengths and delete old strings from the dictionary; with these improvements, the algorithm was adopted in the GIF image format and the Unix compress program. LZW was patented, the patent being owned by Unisys, a large American computer company; other than commercial software producers, users could apply the algorithm free of charge (the patent has since expired).
Data Deduplication
In real storage practice such as backup and archiving, large numbers of duplicate data blocks exist, consuming both transmission bandwidth and considerable storage resources: some new files differ from existing files only in a modified portion, and some files exist as multiple copies. If only one copy of each identical data block is retained, the amount of data actually stored drops dramatically; this is the basis of deduplication technology. The approach was first proposed by Professor Kai Li of Princeton University (one of the three founders of Data Domain) under the name global compression, and was brought to commercial application as capacity-optimized storage.
Deduplication (dedupe) is a data-reduction technology that can effectively optimize storage capacity. It deletes duplicate data in a dataset, retaining only one copy, to eliminate redundancy, as shown in Figure 4. Dedupe can markedly improve storage efficiency and utilization, reducing data to between 1/20 and 1/50 of its original size. It greatly reduces the demand for physical storage space and the network bandwidth needed for transmission, effectively saving equipment procurement and maintenance costs; it is also a green storage technology that effectively lowers energy consumption.
Figure 4. The principle of deduplication
Dedupe can be divided into file-level and block-level according to the granularity of deduplication. File-level dedupe is also called single-instance storage (SIS); block-level dedupe has a finer granularity, typically 4-24 KB. The block level clearly yields a higher deduplication ratio, so current mainstream dedupe products all operate at the block level. Dedupe splits a file into fixed-length or variable-length data blocks and computes each block's fingerprint (FP) with hash algorithms such as MD5 or SHA-1. Two or more hash algorithms can be applied together to make the probability of a fingerprint collision extremely small. Blocks with the same fingerprint are considered identical, and only one copy needs to be kept in the storage system. A physical file thus corresponds to a logical representation in the storage system, consisting of a group of fingerprints as metadata. When the file is read, the logical file is read first, and the corresponding data blocks are then fetched from the storage system according to the fingerprint sequence to reconstruct the physical file.
Dedupe can help many applications reduce data storage, save network bandwidth, improve storage efficiency, shrink backup windows, and effectively cut costs. Its most successful applications so far are data backup, disaster recovery, and archiving systems, but the technology can be used in many other settings, including online, nearline, and offline storage systems, and can be implemented in file systems, volume managers, NAS, and SAN. Dedupe can also serve as a data-reduction technique for data transmission and synchronization, or for data packaging. Why has dedupe succeeded mainly in the data-backup field and seen few applications elsewhere? There are two reasons. First, backing up the same data repeatedly produces large amounts of duplicate data, which suits the technology perfectly. Second, dedupe has drawbacks, chiefly in data security and performance: identifying identical data by hash fingerprint risks collisions and thus data inconsistency, and the block splitting, fingerprint computation, and block retrieval that dedupe requires consume considerable system resources and affect storage performance.
Dedupe is measured along two dimensions: deduplication ratio and performance. Performance depends on the specific implementation, while the deduplication ratio is determined by the characteristics of the data and the application pattern; storage vendors have published deduplication ratios as high as 500:1. Which data is deduplicated: duplicates in time or in space, global or local? When is it deduplicated: online or offline? Where: at the source or the target? And how? All of these factors must be weighed when applying dedupe, because they directly affect its ratio and performance. It is also worth noting that the hash-collision problem has no fundamental solution, so dedupe should be applied to critical business data only with great caution.
A storage system's deduplication process generally works as follows: first split the data file into a group of blocks and compute each block's fingerprint; then use the fingerprint as the key for a hash lookup. A match means the block is a duplicate, and only its index number is stored; otherwise the block is a new unique block, so the block itself is stored and its metadata created. From this process we can see that the key technologies of dedupe are file chunking, block fingerprint computation, and block retrieval.
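The chunk/fingerprint/lookup pipeline just described can be sketched as follows, assuming fixed 4 KB blocks and SHA-1 fingerprints. Real products add variable-length chunking, on-disk fingerprint indexes, and collision handling; this is only the core idea.

```python
import hashlib

BLOCK = 4096  # fixed-length chunking, as in block-level dedupe

def dedup_store(data, store=None):
    """Split into fixed blocks, keep one copy per fingerprint, return (recipe, store)."""
    store = {} if store is None else store    # fingerprint -> unique block
    recipe = []                               # the logical file: ordered fingerprints
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        fp = hashlib.sha1(block).hexdigest()  # the block's fingerprint
        if fp not in store:                   # new unique block: store it
            store[fp] = block
        recipe.append(fp)                     # duplicates cost only a reference
    return recipe, store

def dedup_restore(recipe, store):
    """Rebuild the physical file by following the fingerprint sequence."""
    return b"".join(store[fp] for fp in recipe)
```

The recipe list is the "logical representation" of the file: reading the file means reading the recipe, then fetching each block by fingerprint.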
The key to deduplication is the generation and identification of block "fingerprints". A block's fingerprint is the basis for deciding whether blocks are duplicates: if two different blocks produced the same fingerprint, content would be lost, with serious and irrecoverable consequences. In practice, the digest produced by a standard hash algorithm such as MD5 or SHA-1 is generally used as the fingerprint, to distinguish different blocks and ensure they do not conflict. However, the computation of MD5, SHA-1, and similar algorithms is very complex, and pure software implementations struggle to meet the performance requirements of storage applications, so fingerprint computation often becomes the performance bottleneck of deduplication.
Comparison and Analysis of Data Compression and Deduplication
Both data compression and deduplication aim to reduce data volume. The difference lies in their premises: data compression assumes there is redundancy in how information is represented, and grows out of information theory; deduplication relies on the repeated occurrence of identical data blocks, and is a practical engineering technique. Yet, as the analysis above shows, the two are essentially the same: both reduce data capacity by finding redundant data and replacing it with shorter pointers. The key differences between them lie in the scope of the redundancy, the way it is discovered, and its granularity, as well as in several implementation details.
(1) Scope of redundancy elimination
Data compression is usually applied to a data stream, and its scope of redundancy elimination is a sliding or cache window. For performance reasons this window is usually small, so compression acts only on local data, and its effect on a single file is significant. Deduplication first chunks all data and then removes redundancy globally in units of blocks, so it suits global storage systems containing large numbers of files, such as file systems, where its effect is more pronounced. Applying data compression to global data, or deduplication to a single file, greatly diminishes the data-reduction effect.
(2) How redundancy is discovered
Data compression discovers identical data mainly by string matching, using string-matching algorithms and their variants; this is exact matching. Deduplication discovers identical blocks by their data fingerprints, computed with hash functions; this is fuzzy matching. Exact matching is more complex to implement but has high precision and is more effective for fine-grained redundancy elimination; fuzzy matching is much simpler and better suited to large-granularity blocks, at some cost in precision.
(3) Redundancy granularity
The redundancy granularity of data compression is very small, down to data blocks of a few bytes, and is adaptive, with no need to specify a granularity range in advance. Deduplication is different: block granularity is large, generally 512 bytes to 8 KB, and is not adaptive. For fixed-length blocks the length must be specified in advance; for variable-length blocks, upper and lower bounds must be given. Smaller block granularity yields greater redundancy elimination, but also greater computational cost.
(4) Performance bottlenecks
The key performance bottleneck of data compression is string matching: the larger the sliding or cache window, the greater the computational workload. The bottlenecks of deduplication are data chunking and fingerprint computation; MD5, SHA-1, and similar hash functions are computationally heavy and CPU-intensive. In addition, fingerprints must be stored and looked up, which usually requires a large amount of memory for the hash table; if memory is limited, performance suffers severely.
(5) Data Security
The data compression discussed here is lossless and causes no data loss, so the data is safe. One problem with deduplication is that the block fingerprints produced by hashing may collide: two different blocks can yield the same fingerprint, in which case one of the blocks is lost, corrupting the data. Deduplication therefore carries a potential data-security risk.
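To put the collision risk in perspective, the standard birthday-bound estimate can be worked out in a couple of lines. This is an approximation that assumes the fingerprint behaves like a uniformly random value; the concrete block count below is an illustrative assumption.

```python
# Birthday-bound estimate: with n blocks and an m-bit fingerprint,
# P(at least one collision) is approximately n^2 / 2^(m + 1).
def collision_probability(n_blocks, fp_bits):
    return n_blocks ** 2 / 2 ** (fp_bits + 1)

# Example: 1 PB stored as 8 KB blocks is about 2^37 blocks; with a 160-bit
# SHA-1 fingerprint the estimate is about 2^-87, vanishingly small yet not zero.
p = collision_probability(2 ** 37, 160)
```

This is why the risk is negligible for most workloads yet, as the text notes, can never be ruled out in principle for critical business data.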
(6) Application perspective
Data compression processes streaming data directly, without first analyzing or gathering global statistics, so it combines well with other applications through pipelines, or acts transparently on a storage or network system in in-band fashion. Deduplication, by contrast, must chunk the data, store and look up fingerprints, and build logical representations of the original physical files; retrofitting it into an existing system is hard to do transparently and requires modifying the application. At present deduplication is not a universal built-in function; it appears mostly in product form, such as storage systems, file systems, or application systems. Data compression, then, is already a standard feature, while deduplication has not yet reached that status; from the application perspective, data compression is simpler.
Data compression and deduplication target different levels of redundancy and can be used in combination to achieve a higher overall data-reduction ratio. It is worth noting that when the two are applied together, to reduce the system's processing load and raise the overall ratio, deduplication should normally be applied first, with data compression then used to further shrink the logical representation and the unique data blocks. What happens if the order is reversed? Compression re-encodes the data, destroying its native redundant structure, so the effect of deduplication is greatly reduced and more time is consumed. Deduplicating first eliminates the redundant blocks, and compressing the remaining unique copies then stacks the two technologies' reduction effects while greatly cutting the time spent on compression. Thus deduplicating before compressing achieves both a higher overall reduction ratio and better performance. Below, gzip and the author's own small open-source tool deduputil are used to verify this rule of thumb.
Raw data: the linux-2.6.37 kernel source tree; du -h reports a capacity of 1081.8 MB.
Run time gzip -c -r linux-2.6.37 > linux-2.6.37.gz on the linux-2.6.37 directory: the result linux-2.6.37.gz is about 264 MB and takes 152.776 s;
Run time dedup -c -b 4096 linux-2.6.37 linux-2.6.37.ded on the linux-2.6.37 directory: the deduplicated linux-2.6.37.ded is about 622 MB and takes 28.890 s;
Run time dedup -c -b 4096 linux-2.6.37.gz linux-2.6.37.gz.ded on linux-2.6.37.gz: the result linux-2.6.37.gz.ded is about 241 MB and takes 7.216 s;
Run time gzip -c linux-2.6.37.ded > linux-2.6.37.ded.gz on linux-2.6.37.ded: the result linux-2.6.37.ded.gz is about 176 MB and takes 38.682 s;
The experiments show that dedup + gzip produces linux-2.6.37.ded.gz at 176 MB in a total of 67.572 s, while gzip + dedup produces linux-2.6.37.gz.ded at 241 MB in a total of 159.992 s. The experimental data further verify the analysis above: deduplicating before compressing achieves both a higher overall data-reduction ratio and better performance.
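The same ordering effect can be reproduced in a small, self-contained Python sketch. The setup is a synthetic assumption standing in for the kernel-tree experiment: duplicate 4 KB blocks placed beyond zlib's 32 KB window (so the compressor alone cannot exploit them), and a 20-byte SHA-1 fingerprint charged per recipe entry.

```python
import zlib, hashlib, random

BLOCK = 4096

def dedup(data):
    """Fixed-block dedupe: return (unique blocks concatenated, fingerprint recipe)."""
    store, recipe = {}, []
    for i in range(0, len(data), BLOCK):
        b = data[i:i + BLOCK]
        fp = hashlib.sha1(b).digest()
        store.setdefault(fp, b)          # keep only the first copy of each block
        recipe.append(fp)                # logical file: one fingerprint per block
    return b"".join(store.values()), recipe

def rand_bytes(n):
    return bytes(random.randrange(256) for _ in range(n))

random.seed(0)
shared = rand_bytes(BLOCK)               # one block duplicated three times
# Duplicates sit 64 KB apart: beyond zlib's 32 KB window, visible to global dedupe.
data = shared + rand_bytes(64 * 1024) + shared + rand_bytes(64 * 1024) + shared

# Compress only: zlib cannot reach the far-apart duplicates.
compress_only = len(zlib.compress(data))

# Dedupe first, then compress the unique blocks; charge 20 bytes per recipe entry.
unique, recipe = dedup(data)
dedup_then_compress = len(zlib.compress(unique)) + 20 * len(recipe)

print(compress_only, dedup_then_compress)    # dedupe-first ends up smaller
```

Because the two techniques remove redundancy at different scopes, dedupe-first wins here even though the filler data itself is incompressible.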
1 Data reduction technology http://tech.watchstor.com/management-116492.htm
2 Data compression principles http://jpkc.zust.edu.cn/2007/dmt/course/MMT03_05_1.htm
3 The LZ77 algorithm http://jpkc.zust.edu.cn/2007/dmt/course/MMT03_05_2.htm
4 The LZSS algorithm http://jpkc.zust.edu.cn/2007/dmt/course/MMT03_05_3.htm
5 The LZ78 algorithm http://jpkc.zust.edu.cn/2007/dmt/course/MMT03_05_4.htm
6 The LZW algorithm http://jpkc.zust.edu.cn/2007/dmt/course/MMT03_05_5.htm
7 Research on deduplication technology http://blog.csdn.net/liuben/archive/2010/08/21/5829083.aspx
8 Research on efficient storage technology http://blog.csdn.net/liuben/archive/2010/12/08/6064045.aspx
9 deduputil http://sourceforge.net/projects/deduputil/