Hadoop in Detail (VII): Compression


File compression has two main benefits: it reduces the space needed to store files, and it speeds up data transfer. At the data volumes Hadoop deals with, both matter a great deal, so let's look at how file compression works in Hadoop.

Hadoop supports a variety of compression formats, summarized in the following table:

| Compression format | Tool  | Algorithm | File extension | Splittable          |
|--------------------|-------|-----------|----------------|---------------------|
| DEFLATE            | N/A   | DEFLATE   | .deflate       | No                  |
| gzip               | gzip  | DEFLATE   | .gz            | No                  |
| bzip2              | bzip2 | bzip2     | .bz2           | Yes                 |
| LZO                | lzop  | LZO       | .lzo           | No (yes if indexed) |
| LZ4                | N/A   | LZ4       | .lz4           | No                  |
| Snappy             | N/A   | Snappy    | .snappy        | No                  |

DEFLATE is a lossless data compression algorithm that combines LZ77 with Huffman coding; its reference implementation is the zlib library. gzip is a file format built on top of the DEFLATE algorithm, adding a header and a checksum.
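To see how these codecs surface in Hadoop's API, here is a minimal sketch, modeled on the classic StreamCompressor example, that pipes standard input through the gzip codec to standard output. It assumes the Hadoop client libraries are on the classpath; the class name is our own.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

// Compress stdin to stdout using Hadoop's gzip codec.
public class StreamCompressor {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Instantiate the codec via ReflectionUtils so it picks up the configuration.
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
    CompressionOutputStream out = codec.createOutputStream(System.out);
    IOUtils.copyBytes(System.in, out, 4096, false);
    out.finish(); // flush the compressed trailer without closing System.out
  }
}
```

With a Hadoop installation you could try something like `echo hello | hadoop StreamCompressor | gunzip` to verify the output round-trips through the standard gzip tool.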

Every compression algorithm trades space against time: you can favor faster compression or a better compression ratio, usually via a command-line switch, where -1 optimizes for speed and -9 for space. Taking gzip as an example, the following compresses a file as quickly as possible:

gzip -1 file

gzip sits in the middle of the time/space trade-off. bzip2 compresses more effectively than gzip but is slower; although bzip2 decompresses faster than it compresses, it is still the slowest format listed here, while its compression ratio is clearly the best. Snappy and LZ4 both decompress significantly faster than LZO.

The splittable column indicates whether files in a given compression format can be split, that is, whether a reader can seek to an arbitrary point in the stream and start reading from there. Whether compressed data can be split is critical for processing it with MapReduce.

For example, suppose an uncompressed file is 1 GB and the HDFS default block size is 64 MB. The file is stored as 16 blocks, and a MapReduce job using it as input creates 16 input splits, each processed independently by its own map task. Now suppose the file is gzip-compressed. Dividing it into 16 pieces and handing each piece to a map task would clearly be wrong, because it is impossible to start reading a gzip stream from an arbitrary point. Instead, when MapReduce handles the file it recognizes from the extension that it is gzip-compressed, knows gzip does not support splitting, and hands all 16 blocks to a single map task. Most of those blocks will not be local to that task, so the whole job takes far longer than it should.
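This detection is visible in the API. The sketch below (the path is made up) shows how a codec is inferred from a file's extension and how to check whether it supports splitting; in current Hadoop, only codecs implementing SplittableCompressionCodec, such as the bzip2 codec, allow an input split to start mid-stream, and the text input format performs essentially this same check.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class CodecCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    // The codec is inferred from the file extension, e.g. ".gz" -> GzipCodec.
    CompressionCodec codec = factory.getCodec(new Path("/data/logs.gz"));
    if (codec == null) {
      System.out.println("unknown extension: treated as uncompressed, splittable");
      return;
    }
    // Only codecs implementing SplittableCompressionCodec can be split.
    boolean splittable = codec instanceof SplittableCompressionCodec;
    System.out.println(codec.getClass().getSimpleName() + " splittable: " + splittable);
  }
}
```

For a .gz path this prints `GzipCodec splittable: false`, which is exactly why the 1 GB gzip file above ends up in a single map task.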

The LZO format suffers from the same problem as gzip, but by running the indexing tool that ships with the hadoop-lzo library, LZO files can be made splittable. bzip2 supports splitting natively.
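For reference, the hadoop-lzo indexer is typically invoked as follows (the jar and file paths here are hypothetical):

hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer /data/big_file.lzo

This writes a big_file.lzo.index file alongside the original, which lets MapReduce split the .lzo file at compressed-block boundaries; the library also provides a DistributedLzoIndexer that builds the index as a MapReduce job for large inputs.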

So how do you choose a compression format? It depends on the size of your files and the tools at your disposal. Here are a few suggestions, ordered roughly from most to least efficient:

1. Use a container file format with built-in compression that supports splitting, such as SequenceFile, RCFile, or Avro data files (we will cover these file formats later in this series), combined with a fast codec such as LZO, LZ4, or Snappy; see the sketch after this list.

2. Use a compression format that supports splitting, such as bzip2, or LZO made splittable through indexing.

3. Split the file into chunks in advance and compress each chunk separately; if each compressed chunk comes out close to one HDFS block in size, splittability stops being an issue.

4. Store the files uncompressed.
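To make option 1 concrete, here is a minimal sketch of writing a block-compressed SequenceFile with the Snappy codec. The class name and output path are made up, it assumes a Hadoop client on the classpath, and Snappy may require native libraries on older Hadoop versions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFileWriteDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/tmp/records.seq"); // hypothetical output path
    // Instantiate the codec via ReflectionUtils so it picks up the configuration.
    CompressionCodec codec = ReflectionUtils.newInstance(SnappyCodec.class, conf);
    // BLOCK compression compresses batches of records together, improving the
    // ratio, while the file's sync markers keep it splittable for MapReduce.
    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(LongWritable.class),
        SequenceFile.Writer.valueClass(Text.class),
        SequenceFile.Writer.compression(
            SequenceFile.CompressionType.BLOCK, codec));
    try {
      for (long i = 0; i < 100; i++) {
        writer.append(new LongWritable(i), new Text("record-" + i));
      }
    } finally {
      writer.close();
    }
  }
}
```

The resulting file gets Snappy's fast compression and decompression while remaining splittable, which is exactly the combination option 1 recommends.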

In short, it is a bad idea to store a large data file in a compression format that does not support splitting: the job loses data locality and processing becomes very inefficient.
