Hadoop data compression

Last Update:2016-01-31 Source: Internet

Author: User


File compression has two main advantages: it reduces the space needed to store files, and it speeds up data transfer. Both matter a great deal when handling big data in Hadoop, so this article looks at file compression in Hadoop.

Hadoop supports many compression formats; see the following table:

DEFLATE is a lossless data compression algorithm that combines the LZ77 algorithm with Huffman coding; an implementation can be found in the zlib library. The gzip format is based on the DEFLATE algorithm.
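The relationship between gzip and DEFLATE can be checked with a minimal sketch using Python's standard gzip and zlib modules. The sample data is made up for illustration; the byte offsets follow the gzip file layout (a two-byte magic number, a method byte, and an eight-byte trailer).

```python
import gzip
import struct
import zlib

# Made-up sample input, used only for illustration.
data = b"hadoop compression example " * 1000

# gzip.compress writes a gzip container around a DEFLATE stream.
gz = gzip.compress(data)

magic = gz[:2]    # gzip magic number, always 1f 8b
method = gz[2]    # compression method; 8 means DEFLATE
# The trailer holds the CRC-32 of the input and its length modulo 2**32.
crc, size = struct.unpack("<II", gz[-8:])

print(magic.hex(), method)
print(crc == zlib.crc32(data), size == len(data))

# zlib reads the same bytes when told to expect a gzip wrapper (wbits=31).
restored = zlib.decompress(gz, wbits=31)
```

In other words, gzip adds a header and a checksum trailer, while the compressed payload itself is DEFLATE data.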

Every compression algorithm trades time against space: it can either compress faster or produce smaller output. The trade-off is selected with an option, where -1 optimizes for speed and -9 optimizes for space. Taking gzip as an example, the following command favors faster compression:


gzip -1 file
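The same speed versus space choice can be tried from Python, whose gzip module takes a compresslevel argument from 1 to 9. This is only a rough sketch: the input is synthetic, the helper name compress_at is made up here, and the timings depend on the machine.

```python
import gzip
import time

# Synthetic, fairly repetitive input; real data will behave differently.
data = b"".join(
    f"record {i}, value {i * i % 97}\n".encode() for i in range(20000)
)


def compress_at(level):
    """Return (compressed_size, seconds) for one gzip compression level."""
    start = time.perf_counter()
    out = gzip.compress(data, compresslevel=level)
    return len(out), time.perf_counter() - start


size_fast, secs_fast = compress_at(1)  # comparable to `gzip -1`
size_best, secs_best = compress_at(9)  # comparable to `gzip -9`

print(f"level 1: {size_fast} bytes in {secs_fast:.4f}s")
print(f"level 9: {size_best} bytes in {secs_best:.4f}s")
```

On input like this, level 9 should give the smaller output and level 1 should usually finish sooner.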

gzip sits in the middle of the time/space trade-off. bzip2 compresses more effectively than gzip, but is slower. bzip2 decompresses faster than it compresses, yet it is still the slowest of the formats compared here, while its compression ratio is clearly the best. Snappy and LZ4 decompress much faster than LZO.
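A rough way to see the gzip and bzip2 trade-off on your own data is to time both with Python's standard library. This sketch covers only those two codecs, because Snappy, LZ4, and LZO need third-party packages; the input and the helper name measure are assumptions made for illustration, and no particular result is implied.

```python
import bz2
import gzip
import time

# Synthetic input; substitute a sample of real data for meaningful numbers.
data = b"".join(
    f"record {i}, value {i * i % 97}\n".encode() for i in range(20000)
)


def measure(name, compress, decompress):
    """Time one compress/decompress round trip and report the size ratio."""
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    unpacked = decompress(packed)
    t2 = time.perf_counter()
    if unpacked != data:
        raise ValueError(f"{name} round trip failed")
    return {
        "codec": name,
        "ratio": len(packed) / len(data),  # smaller means better compression
        "compress_s": t1 - t0,
        "decompress_s": t2 - t1,
    }


results = [
    measure("gzip", gzip.compress, gzip.decompress),
    measure("bzip2", bz2.compress, bz2.decompress),
]

for r in results:
    print(
        f"{r['codec']}: ratio={r['ratio']:.3f} "
        f"compress={r['compress_s']:.4f}s decompress={r['decompress_s']:.4f}s"
    )
```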

Below is a comparison of the efficiency of several of these compression algorithms:

Splittable indicates whether a compressed format can be split, that is, whether a reader can seek to an arbitrary point in the stream and start reading from there. Whether MapReduce can process compressed data in parallel depends critically on whether that data can be split.
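The effect is easy to reproduce outside Hadoop. The sketch below uses Python's zlib as a plain gzip decoder and tries to start decoding at two offsets; it illustrates the stream property only and is not how Hadoop's input formats are implemented. The data and the helper name can_decode_from are made up.

```python
import gzip
import zlib

# Made-up input: 50,000 short text lines.
data = b"".join(f"line {i}\n".encode() for i in range(50000))
gz = gzip.compress(data)


def can_decode_from(offset):
    """Try to decompress the gzip bytes starting at the given offset."""
    try:
        zlib.decompress(gz[offset:], wbits=31)  # wbits=31 expects a gzip header
        return True
    except zlib.error:
        return False


from_start = can_decode_from(0)
from_middle = can_decode_from(len(gz) // 2)

print("decode from start: ", from_start)
print("decode from middle:", from_middle)
```

A reader handed only the second half of the file has no header to parse and no marker telling it where a compressed block begins, which is why the whole stream must be read by one consumer.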

For example, suppose an uncompressed file is 1 GB and the HDFS block size is the default 64 MB. The file is stored as 16 blocks, and a MapReduce job that takes it as input creates 16 input splits, each processed by its own map task. What if the file is compressed with gzip? Treating each of the 16 blocks as a separate input does not work, because a gzip stream cannot be read starting from an arbitrary point. In practice, MapReduce recognizes that the file is gzip-compressed and that gzip does not support splitting, so it hands all 16 blocks to a single map task. Most of those blocks are not local to that task, so the whole job takes a long time.
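The arithmetic in this example can be written down as a small sketch. It assumes, as the example does, that one input split corresponds to one HDFS block; the function names are hypothetical and are not part of any Hadoop API.

```python
import math

MB = 1024 * 1024
GB = 1024 * MB


def block_count(file_size, block_size):
    """Number of HDFS blocks needed to store the file."""
    return math.ceil(file_size / block_size)


def map_task_count(file_size, block_size, splittable):
    """Map tasks for one input file, assuming one split per block.

    A file in a non-splittable format is read by a single map task,
    no matter how many blocks it occupies.
    """
    if not splittable:
        return 1
    return block_count(file_size, block_size)


blocks = block_count(1 * GB, 64 * MB)
tasks_plain = map_task_count(1 * GB, 64 * MB, splittable=True)
tasks_gzip = map_task_count(1 * GB, 64 * MB, splittable=False)

print(blocks, tasks_plain, tasks_gzip)
```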

The LZO format has the same problem, but the indexer tool in the Hadoop LZO library can build an index that makes LZO files splittable. bzip2 also supports splitting.

So how do you choose a compression format? It depends on the size of the file and the compression tools you use. Here are some recommendations, ordered from most to least efficient:

1. Use a container file format that supports both compression and splitting, such as SequenceFile, RCFile, or Avro data files, which we'll talk about later. For fast compression, you can use LZO, LZ4, or Snappy with them.

2. Use a compression format that supports splitting, such as bzip2, or one that can be indexed to support splitting, such as LZO.

3. Split the file into chunks in advance and compress each chunk separately, so that splittability is no longer a concern.

4. Do not compress files.

Note: Do not store a large data file in a compressed format that does not support splitting; the resulting non-local processing is very inefficient.
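Option 3 in the list above can be sketched in a few lines of Python. The chunk size, the input, and the helper name compress_in_chunks are arbitrary choices for illustration. Note that this sketch cuts on byte offsets, so a record may straddle two chunks; a real job would cut on record boundaries.

```python
import gzip


def compress_in_chunks(data, chunk_size):
    """Compress each fixed-size chunk as its own independent gzip stream."""
    return [
        gzip.compress(data[start:start + chunk_size])
        for start in range(0, len(data), chunk_size)
    ]


# Made-up input of roughly 1 MB.
data = b"".join(f"row {i}\n".encode() for i in range(100000))
chunk_size = 256 * 1024
chunks = compress_in_chunks(data, chunk_size)

# Any chunk can be decompressed on its own, without reading the others.
third = gzip.decompress(chunks[2])

# Decompressing the chunks in order and joining them restores the input.
restored = b"".join(gzip.decompress(c) for c in chunks)

print(len(chunks), len(third), restored == data)
```

Because every chunk is a complete gzip file, each one can be handed to a separate map task even though gzip itself is not splittable.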


