Hadoop in Detail (VII): Compression


File compression has two main benefits: it reduces the space needed to store files, and it speeds up data transfer. At the data volumes Hadoop deals with, both matter a great deal, so let's look at how file compression works in Hadoop.

Hadoop supports a variety of compression formats, summarized in the following table:

| Compression format | Tool  | Algorithm | File extension | Splittable          |
|--------------------|-------|-----------|----------------|---------------------|
| DEFLATE            | N/A   | DEFLATE   | .deflate       | No                  |
| gzip               | gzip  | DEFLATE   | .gz            | No                  |
| bzip2              | bzip2 | bzip2     | .bz2           | Yes                 |
| LZO                | lzop  | LZO       | .lzo           | No (yes if indexed) |
| LZ4                | N/A   | LZ4       | .lz4           | No                  |
| Snappy             | N/A   | Snappy    | .snappy        | No                  |

DEFLATE is a lossless data compression algorithm that combines LZ77 with Huffman coding; its reference implementation is the zlib library. gzip is a file format built on top of the DEFLATE algorithm, adding a header and a checksum.
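To see how these codecs surface in Hadoop's API, here is a minimal sketch, modeled on the classic StreamCompressor example, that pipes standard input through the gzip codec to standard output. It assumes the Hadoop client libraries are on the classpath; the class name is our own.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

// Compress stdin to stdout using Hadoop's gzip codec.
public class StreamCompressor {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Instantiate the codec via ReflectionUtils so it picks up the configuration.
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
    CompressionOutputStream out = codec.createOutputStream(System.out);
    IOUtils.copyBytes(System.in, out, 4096, false);
    out.finish(); // flush the compressed trailer without closing System.out
  }
}
```

With a Hadoop installation you could try something like `echo hello | hadoop StreamCompressor | gunzip` to verify the output round-trips through the standard gzip tool.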

Every compression algorithm trades space against time: you can favor faster compression or a better compression ratio, usually via a command-line switch, where -1 optimizes for speed and -9 for space. Taking gzip as an example, the following compresses a file as quickly as possible:

gzip -1 file

gzip sits in the middle of the time/space trade-off. bzip2 compresses more effectively than gzip but is slower; although bzip2 decompresses faster than it compresses, it is still the slowest format listed here, while its compression ratio is clearly the best. Snappy and LZ4 both decompress significantly faster than LZO.

The splittable column indicates whether files in a given compression format can be split, that is, whether a reader can seek to an arbitrary point in the stream and start reading from there. Whether compressed data can be split is critical for processing it with MapReduce.

For example, suppose an uncompressed file is 1 GB and the HDFS default block size is 64 MB. The file is stored as 16 blocks, and a MapReduce job using it as input creates 16 input splits, each processed independently by its own map task. Now suppose the file is gzip-compressed. Dividing it into 16 pieces and handing each piece to a map task would clearly be wrong, because it is impossible to start reading a gzip stream from an arbitrary point. Instead, when MapReduce handles the file it recognizes from the extension that it is gzip-compressed, knows gzip does not support splitting, and hands all 16 blocks to a single map task. Most of those blocks will not be local to that task, so the whole job takes far longer than it should.
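This detection is visible in the API. The sketch below (the path is made up) shows how a codec is inferred from a file's extension and how to check whether it supports splitting; in current Hadoop, only codecs implementing SplittableCompressionCodec, such as the bzip2 codec, allow an input split to start mid-stream, and the text input format performs essentially this same check.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class CodecCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    // The codec is inferred from the file extension, e.g. ".gz" -> GzipCodec.
    CompressionCodec codec = factory.getCodec(new Path("/data/logs.gz"));
    if (codec == null) {
      System.out.println("unknown extension: treated as uncompressed, splittable");
      return;
    }
    // Only codecs implementing SplittableCompressionCodec can be split.
    boolean splittable = codec instanceof SplittableCompressionCodec;
    System.out.println(codec.getClass().getSimpleName() + " splittable: " + splittable);
  }
}
```

For a .gz path this prints `GzipCodec splittable: false`, which is exactly why the 1 GB gzip file above ends up in a single map task.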

The LZO format suffers from the same problem as gzip, but by running the indexing tool that ships with the hadoop-lzo library, LZO files can be made splittable. bzip2 supports splitting natively.
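For reference, the hadoop-lzo indexer is typically invoked as follows (the jar and file paths here are hypothetical):

hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer /data/big_file.lzo

This writes a big_file.lzo.index file alongside the original, which lets MapReduce split the .lzo file at compressed-block boundaries; the library also provides a DistributedLzoIndexer that builds the index as a MapReduce job for large inputs.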

So how do you choose a compression format? It depends on the size of your files and the tools at your disposal. Here are a few suggestions, ordered roughly from most to least efficient:

1. Use a container file format with built-in compression that supports splitting, such as SequenceFile, RCFile, or Avro data files (we will cover these file formats later in this series), combined with a fast codec such as LZO, LZ4, or Snappy; see the sketch after this list.

2. Use a compression format that supports splitting, such as bzip2, or LZO made splittable through indexing.

3. Split the file into chunks in advance and compress each chunk separately; if each compressed chunk comes out close to one HDFS block in size, splittability stops being an issue.

4. Store the files uncompressed.
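To make option 1 concrete, here is a minimal sketch of writing a block-compressed SequenceFile with the Snappy codec. The class name and output path are made up, it assumes a Hadoop client on the classpath, and Snappy may require native libraries on older Hadoop versions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFileWriteDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/tmp/records.seq"); // hypothetical output path
    // Instantiate the codec via ReflectionUtils so it picks up the configuration.
    CompressionCodec codec = ReflectionUtils.newInstance(SnappyCodec.class, conf);
    // BLOCK compression compresses batches of records together, improving the
    // ratio, while the file's sync markers keep it splittable for MapReduce.
    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(LongWritable.class),
        SequenceFile.Writer.valueClass(Text.class),
        SequenceFile.Writer.compression(
            SequenceFile.CompressionType.BLOCK, codec));
    try {
      for (long i = 0; i < 100; i++) {
        writer.append(new LongWritable(i), new Text("record-" + i));
      }
    } finally {
      writer.close();
    }
  }
}
```

The resulting file gets Snappy's fast compression and decompression while remaining splittable, which is exactly the combination option 1 recommends.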

In short, it is a bad idea to store a large data file in a compression format that does not support splitting: the job loses data locality and processing becomes very inefficient.
