Generally speaking, the data a computer processes contains some redundancy, and neighboring data items are often correlated. Such data can therefore be stored in a special encoding, different from the original one, so that it occupies less storage space; this process is called compression. The counterpart of compression is decompression, the process of restoring compressed data from the special encoding back to the original data.
Compression is widely used in large-scale data processing. Compressing data files effectively reduces the space needed to store them and speeds up data transfer over the network or to and from disk. In Hadoop, compression is applied to file storage, the data exchanged between the map phase and the reduce phase, and so on.
There are many ways to compress data, and different kinds of data suit different methods. Special data such as sound and images can use lossy compression, which allows some information to be lost during compression in exchange for a higher compression ratio. Music data, for instance, has its own relatively specialized encodings, so compression algorithms tailored to those specific encodings can be applied.
2 Introduction to Hadoop compression
As a general-purpose data processing platform, the main considerations for Hadoop when using compression are compression/decompression speed and whether compressed files can be split.
All compression algorithms trade off time against space: faster compression and decompression speeds usually come at the cost of more space, that is, a lower compression ratio. For example, when compressing data with the gzip command, the user can choose speed first or space first through different options: the -1 option optimizes for speed, while -9 optimizes for space and achieves the best compression ratio. Note that the compression and decompression speeds of some algorithms can differ considerably: gzip and zip are general-purpose compression tools that are fairly balanced in the time/space trade-off; bzip2 compresses more effectively than gzip and zip but is slower, and bzip2's decompression speed is faster than its compression speed.
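The JDK's DEFLATE implementation exposes the same 1-9 level scale as the gzip command, which makes the trade-off easy to observe. A minimal sketch (the class name and sample data are illustrative, not from the original text):

    import java.util.zip.Deflater;

    public class LevelDemo {
        // Compress input at the given DEFLATE level and return the compressed size.
        static int compressedSize(byte[] input, int level) {
            Deflater deflater = new Deflater(level);
            deflater.setInput(input);
            deflater.finish();
            byte[] buf = new byte[8192];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(buf);   // fills buf, returns bytes written
            }
            deflater.end();
            return total;
        }

        public static void main(String[] args) {
            byte[] data = "some fairly repetitive sample text ".repeat(10_000).getBytes();
            // Deflater.BEST_SPEED == 1 (like gzip -1), Deflater.BEST_COMPRESSION == 9 (like gzip -9)
            System.out.println("level 1: " + compressedSize(data, Deflater.BEST_SPEED) + " bytes");
            System.out.println("level 9: " + compressedSize(data, Deflater.BEST_COMPRESSION) + " bytes");
        }
    }

On repetitive input, level 9 usually produces a noticeably smaller output at a higher CPU cost; the exact figures depend on the data.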
When using MapReduce to process compressed files, the splittability of the compressed files must be considered. Suppose we need to process a 1 GB text file stored in HDFS. With the current HDFS block size of 64 MB, the file is stored as 16 blocks, and the corresponding MapReduce job divides it into 16 input splits, each processed by an independent map task. However, if the file is a gzip-format compressed file (assume its size is still 1 GB), the MapReduce job cannot divide it into 16 splits, because it is impossible to start decompressing at an arbitrary point within a gzip data stream. If, on the other hand, the file is a bzip2-format compressed file, the MapReduce job can split the input at the boundaries of the bzip2-compressed blocks and start decompressing at the beginning of each block. The bzip2 format places a 48-bit synchronization marker between blocks, which is why bzip2 supports splitting.
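Splittability can also be checked programmatically: in Hadoop's codec API, splittable formats such as bzip2 are represented by codecs that implement the SplittableCompressionCodec interface. A sketch, assuming a Hadoop client is on the classpath (the file names are made up for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class SplittableCheck {
        public static void main(String[] args) {
            CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
            for (String name : new String[] {"logs.gz", "logs.bz2"}) {   // illustrative names
                // The factory picks the codec from the file extension.
                CompressionCodec codec = factory.getCodec(new Path(name));
                boolean splittable = codec instanceof SplittableCompressionCodec;
                System.out.println(name + " -> " + codec.getClass().getSimpleName()
                        + ", splittable: " + splittable);
            }
        }
    }

With the default codec registrations this reports GzipCodec as not splittable for the .gz file and BZip2Codec as splittable for the .bz2 file, matching the behavior described above.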
Table 3-2 lists some common compression formats that can be used with Hadoop and their characteristics.

Table 3-2 Compression formats supported by Hadoop

    Compression format   Tool    Algorithm   File extension   Splittable
    DEFLATE              N/A     DEFLATE     .deflate         No
    gzip                 gzip    DEFLATE     .gz              No
    bzip2                bzip2   bzip2       .bz2             Yes
    LZO                  lzop    LZO         .lzo             No
To support multiple compression/decompression algorithms, Hadoop introduces codecs (encoder/decoders). Like the Hadoop serialization framework, the codec mechanism is also designed around the abstract factory pattern. The codecs currently supported by Hadoop are shown in Table 3-3.
Table 3-3 Compression algorithms and their codecs

    Compression format   Codec
    DEFLATE              org.apache.hadoop.io.compress.DefaultCodec
    gzip                 org.apache.hadoop.io.compress.GzipCodec
    bzip2                org.apache.hadoop.io.compress.BZip2Codec
    LZO                  com.hadoop.compression.lzo.LzopCodec
Given the codec corresponding to a compression format, the matching compression and decompression tools (streams, compressors, and decompressors) for that format can be obtained from it.
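As a concrete example, the codec returned by a CompressionCodecFactory can produce a decompression stream for a file, choosing the format from the file extension. The following sketch follows the usual Hadoop pattern (the input URI is taken from the command line; error handling is kept minimal):

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    // Decompress a file in HDFS, e.g. file.gz -> file, picking the codec by extension.
    public class FileDecompressor {
        public static void main(String[] args) throws Exception {
            String uri = args[0];
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);

            Path inputPath = new Path(uri);
            CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(inputPath);
            if (codec == null) {
                System.err.println("No codec found for " + uri);
                System.exit(1);
            }

            // Strip the compression suffix (e.g. ".gz") to name the output file.
            String outputUri =
                    CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());

            try (InputStream in = codec.createInputStream(fs.open(inputPath));
                 OutputStream out = fs.create(new Path(outputUri))) {
                IOUtils.copyBytes(in, out, 4096, false);   // streams closed by try-with-resources
            }
        }
    }

The symmetric operation, codec.createOutputStream(...), wraps an output stream so that data written to it is compressed in the codec's format.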