Application of Four Common Compression Formats in Hadoop

Source: Internet
Author: User

Four compression formats are in common use in Hadoop today: lzo, gzip, snappy, and bzip2. Drawing on practical experience, the author describes the advantages, disadvantages, and application scenarios of each, so that you can choose the format that fits your actual situation.

1 gzip compression

Advantages: The compression ratio is relatively high and compression/decompression is fairly fast; Hadoop supports gzip out of the box, so applications process gzip files exactly as they process plain text; supported by the Hadoop native library; most Linux systems ship with the gzip command, so it is easy to use.

Disadvantages: Does not support split, so a single gzip file cannot be divided among several map tasks.

Application scenarios: Consider gzip when each compressed file is within about 130 MB (one block). For example, compress one day's or one hour's logs into a single gzip file, and let the MapReduce job achieve concurrency by running over multiple gzip files. Hive programs, streaming programs, and MapReduce programs written in Java handle gzip files exactly like plain text, so existing programs need no changes after the data is compressed.
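The two gzip properties described above (transparent reading, no split) can be illustrated outside Hadoop with Python's standard `gzip` module. This is only a sketch: the log lines are made-up sample data and `can_read_from` is a hypothetical helper, not a Hadoop API. It shows that a gzip stream reads like text from the beginning, while decompression cannot begin at an arbitrary offset in the middle of the file, which is why one gzip file ends up in one map task.

```python
import gzip
import io
import zlib

# Made-up sample data standing in for one hour of logs.
log_lines = [f"2024-01-01 00:00:{i % 60:02d} INFO request {i}\n" for i in range(5000)]
raw = "".join(log_lines).encode("utf-8")

# Compress the whole log into a single gzip stream.
compressed = gzip.compress(raw)

# Reading from the start works just like reading a plain text file.
with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as fh:
    first_line = fh.readline()


def can_read_from(offset: int) -> bool:
    """Return True if decompression succeeds when started at this byte offset."""
    try:
        gzip.decompress(compressed[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        return False


print(repr(first_line))
print("readable from start:", can_read_from(0))
print("readable from middle:", can_read_from(len(compressed) // 2))
```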

2 lzo compression

Advantages: Compression/decompression is fairly fast and the compression ratio is reasonable; supports split, and is the most popular compression format in Hadoop; supported by the Hadoop native library; the lzop command can be installed on Linux, so it is easy to use.

Disadvantages: The compression ratio is lower than gzip's; Hadoop does not support it out of the box, so it must be installed; applications need some special handling for lzo files (an index must be built to support split, and the input format must be set to the lzo format).

Application scenarios: Consider lzo for large text files that are still larger than 200 MB after compression. The larger the single file, the more pronounced lzo's advantage.
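Why split support matters more as files grow can be seen with some rough arithmetic. The sketch below assumes a 128 MB HDFS block size and one map task per block for a splittable file; the real number of splits depends on the input format and job settings, and `map_tasks` is a hypothetical helper used only for illustration.

```python
import math

BLOCK_MB = 128  # assumed HDFS block size; not a value taken from the article


def map_tasks(file_mb: float, splittable: bool, block_mb: int = BLOCK_MB) -> int:
    """Estimate map tasks for one file: one per block if splittable, else one."""
    if not splittable:
        return 1
    return max(1, math.ceil(file_mb / block_mb))


for size_mb in (100, 200, 1024):
    print(size_mb, "MB:",
          "splittable ->", map_tasks(size_mb, True),
          "| not splittable ->", map_tasks(size_mb, False))
```

Under these assumptions a 1024 MB splittable file can be read by 8 map tasks in parallel, while a non-splittable file of the same size is read by a single task.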

3 snappy compression

Advantages: Very fast compression and a reasonable compression ratio; supported by the Hadoop native library.

Disadvantages: Does not support split; the compression ratio is lower than gzip's; Hadoop does not support it out of the box, so it must be installed; there is no corresponding command on Linux.

Application scenarios: When the map output of a MapReduce job is relatively large, use snappy to compress the intermediate data passed from map to reduce; it also suits the case where the output of one MapReduce job is the input of another.
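As a sketch of how map output compression is typically switched on, the snippet below collects the relevant job properties and renders them as `-D key=value` generic options for a job command line. The property names and codec class are the ones used by Hadoop 2.x and later (older releases used different names), so check them against your version; `as_generic_options` is a hypothetical helper, and the snippet only builds strings without contacting a cluster.

```python
# Job properties for compressing intermediate map output with snappy.
# Names follow Hadoop 2.x and later; verify them for your release.
MAP_OUTPUT_SNAPPY = {
    "mapreduce.map.output.compress": "true",
    "mapreduce.map.output.compress.codec": "org.apache.hadoop.io.compress.SnappyCodec",
}


def as_generic_options(props: dict) -> list:
    """Render properties as -D key=value arguments for a Hadoop job command line."""
    args = []
    for key, value in props.items():
        args.extend(["-D", f"{key}={value}"])
    return args


print(" ".join(as_generic_options(MAP_OUTPUT_SNAPPY)))
```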

4 bzip2 compression

Advantages: Supports split; has a high compression ratio, higher than gzip's; supported by Hadoop out of the box, although not by the native library; Linux systems ship with the bzip2 command, so it is easy to use.

Disadvantages: Compression/decompression is slow; not supported by the native library.

Application scenarios: Suitable when speed matters little but a high compression ratio is required, for example as the output format of a MapReduce job; when the output data is relatively large and must be compressed and archived to save disk space, and is unlikely to be used much afterwards; or when a single large text file must be compressed to save storage while still supporting split and remaining compatible with existing applications (that is, the applications need no modification).
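The trade-off between gzip and bzip2 can be checked on your own data before committing to a format. The sketch below uses Python's standard `gzip` and `bz2` modules on made-up sample text; lzo and snappy are left out because they are not in the standard library. `measure` is a hypothetical helper, and the numbers it reports depend heavily on the data and the machine, so treat them as a local measurement rather than a general benchmark.

```python
import bz2
import gzip
import time


def measure(name, compress, decompress, data: bytes) -> dict:
    """Time one compress/decompress round trip and report the compression ratio."""
    start = time.perf_counter()
    packed = compress(data)
    compress_s = time.perf_counter() - start

    start = time.perf_counter()
    restored = decompress(packed)
    decompress_s = time.perf_counter() - start

    if restored != data:
        raise ValueError(f"{name} round trip did not restore the input")
    return {
        "format": name,
        "ratio": len(data) / len(packed),
        "compress_s": compress_s,
        "decompress_s": decompress_s,
    }


# Made-up, log-like sample text; replace with a slice of real data.
sample = "".join(
    f"user{i % 97} GET /item/{i * 7 % 1013} 200\n" for i in range(20000)
).encode("utf-8")

results = [
    measure("gzip", gzip.compress, gzip.decompress, sample),
    measure("bzip2", bz2.compress, bz2.decompress, sample),
]
for r in results:
    print(f"{r['format']:6s} ratio={r['ratio']:.1f} "
          f"compress={r['compress_s']:.4f}s decompress={r['decompress_s']:.4f}s")
```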

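Finally, the following table compares the features of the four compression formats, as described in the sections above:

Format  Split              Native library  Compression ratio  Speed        Built into Hadoop  Linux command
gzip    No                 Yes             Relatively high    Fairly fast  Yes                gzip (preinstalled)
lzo     Yes (needs index)  Yes             Lower than gzip    Fairly fast  No (install)       lzop (install)
snappy  No                 Yes             Lower than gzip    Very fast    No (install)       None
bzip2   Yes                No              Higher than gzip   Slow         Yes                bzip2 (preinstalled)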
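The scenario guidance from the four sections can also be condensed into a rule of thumb. The function below is a hypothetical sketch, not a Hadoop API. It uses the article's figure of about 130 MB for gzip; the article does not say what to do between 130 MB and 200 MB, so treating everything above one block as a case for lzo is an assumption made here.

```python
def suggest_format(compressed_mb: float, intermediate: bool = False,
                   archive: bool = False) -> str:
    """Suggest a compression format from the article's scenarios (rule of thumb)."""
    if intermediate:
        # Map output, or data handed from one MapReduce job to the next.
        return "snappy"
    if archive:
        # Rarely read data where the compression ratio matters more than speed.
        return "bzip2"
    if compressed_mb <= 130:
        # Fits within one block, so the lack of split support does not hurt.
        return "gzip"
    # Assumption: any larger file benefits from a splittable format.
    return "lzo"


print(suggest_format(50))
print(suggest_format(500))
print(suggest_format(500, archive=True))
print(suggest_format(500, intermediate=True))
```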
