[Hadoop] Common compression formats for use in Hadoop (Spark)

Source: Internet
Author: User
Four compression formats are most commonly used in Hadoop today: lzo, gzip, snappy, and bzip2. Based on practical experience, this article describes the advantages, disadvantages, and application scenarios of each, so that you can choose the format that fits your situation.

1 gzip compression

Advantages: The compression ratio is high, and compression/decompression is relatively fast. Hadoop supports gzip out of the box, so applications process gzip files the same way as plain text. A Hadoop native library is available. Most Linux systems ship with the gzip command, which makes the format easy to use.

Disadvantages: Split is not supported.

Application scenario: Consider gzip when each compressed file is no larger than about 130 MB (one block). For example, compress each day's or each hour's log into one gzip file; a MapReduce job then achieves concurrency by running over multiple gzip files. Hive programs, streaming programs, and MapReduce programs written in Java handle these files exactly as they handle text, so existing programs need no modification after compression.
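As a rough illustration of the one-file-per-hour approach, the Python sketch below writes each hour's log lines to its own gzip file and checks that the compressed file stays within the size limit. The limit reuses the 130 MB figure from the text; the file naming scheme and all helper names are hypothetical, and the real block size is a cluster setting.

```python
import gzip
import os
import tempfile

# Size limit taken from the text above (about 130 MB, one block).
# The real block size is a cluster setting, so treat this as an assumption.
BLOCK_LIMIT_BYTES = 130 * 1024 * 1024


def write_hourly_gzip(directory, hour_label, lines):
    """Write one hour of log lines to its own .gz file and return the path."""
    # Hypothetical naming scheme: app-<hour>.log.gz
    path = os.path.join(directory, f"app-{hour_label}.log.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")
    return path


def fits_in_one_block(path, limit=BLOCK_LIMIT_BYTES):
    """Return True if the compressed file is no larger than the limit."""
    return os.path.getsize(path) <= limit


def read_gzip_lines(path):
    """Read the file back line by line, as plain text would be read."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for hour in ("2024010100", "2024010101"):
            lines = [f"{hour} INFO request {i}" for i in range(1000)]
            path = write_hourly_gzip(tmp, hour, lines)
            print(os.path.basename(path), fits_in_one_block(path))
```

Because each hour ends up in a separate file, a job can process the files in parallel even though no single gzip file can be split.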

2 Lzo Compression

Advantages: Compression/decompression is fast, with a reasonable compression ratio. It supports split and is the most popular compression format in Hadoop. A Hadoop native library is available, and the lzop command can be installed on Linux for ease of use.

Disadvantages: The compression ratio is lower than gzip's. Hadoop does not ship with LZO support, so it must be installed separately. Applications need some special handling for LZO files: to support split, an index must be built, and the InputFormat must be set to the LZO format.

Application scenario: Consider LZO for large text files that are still larger than 200 MB after compression. The larger the single file, the more pronounced LZO's advantage.
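To see why file size matters here, the following Python sketch estimates how many input splits (and therefore map tasks) a single file yields. It is a deliberate simplification: it assumes a 128 MB block size, one split per block, and ignores the finer details of how Hadoop actually computes splits. The function name is hypothetical.

```python
import math

# Assumed block size; the real value is configurable per cluster.
BLOCK_SIZE_BYTES = 128 * 1024 * 1024


def estimate_input_splits(file_size_bytes, splittable, block_size=BLOCK_SIZE_BYTES):
    """Rough number of input splits for one file (simplified model)."""
    if file_size_bytes <= 0:
        return 0
    if not splittable:
        # A non-splittable file is read by a single task, whatever its size.
        return 1
    return math.ceil(file_size_bytes / block_size)


if __name__ == "__main__":
    one_gb = 1024 * 1024 * 1024
    print("indexed LZO, 1 GB:", estimate_input_splits(one_gb, splittable=True))
    print("gzip, 1 GB:", estimate_input_splits(one_gb, splittable=False))
```

Under these assumptions, a splittable file gains more parallelism as it grows, while a non-splittable file of the same size is still handled by one task.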

3 Snappy Compression

Advantages: Very fast compression and a reasonable compression ratio; a Hadoop native library is available.

Disadvantages: Split is not supported; the compression ratio is lower than gzip's; Hadoop does not ship with Snappy support, so it must be installed separately; no corresponding command is available on Linux.

Application scenario: When a MapReduce job's map output is large, use Snappy to compress the intermediate data passed from map to reduce, or to compress the output of one MapReduce job that serves as the input to another.
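One way to enable this is through job configuration. The Python sketch below only assembles the settings as `-D key=value` arguments; it does not run a job. It assumes the Hadoop 2.x property names `mapreduce.map.output.compress` and `mapreduce.map.output.compress.codec`, the codec class `org.apache.hadoop.io.compress.SnappyCodec`, and a job that accepts generic options on its command line. The helper names are hypothetical.

```python
# Codec class name assumed to be available on the cluster.
SNAPPY_CODEC = "org.apache.hadoop.io.compress.SnappyCodec"


def map_output_compression_options(codec=SNAPPY_CODEC):
    """Properties that turn on compression of intermediate map output."""
    return {
        "mapreduce.map.output.compress": "true",
        "mapreduce.map.output.compress.codec": codec,
    }


def to_generic_options(properties):
    """Render properties as -D key=value arguments, sorted by key."""
    args = []
    for key, value in sorted(properties.items()):
        args.extend(["-D", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Print the arguments that would be added to the job command line.
    print(" ".join(to_generic_options(map_output_compression_options())))
```

Only the intermediate data is affected by these two properties; the job's input and final output formats stay as they are.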

4 bzip2 Compression

Advantages: It supports split and has a very high compression ratio, higher than gzip's. Hadoop ships with bzip2 support, although there is no native library. The bzip2 command is available on Linux systems, which makes the format easy to use.

Disadvantages: Compression/decompression is slow; there is no native library.

Application scenario: bzip2 suits cases where speed matters little but a high compression ratio is required. It can serve as the output format of a MapReduce job. It also fits large outputs that are compressed and archived to save disk space and are rarely read afterwards, and single large text files that need to be compressed to save storage while still supporting split and remaining compatible with existing applications (that is, the applications need no modification).
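The trade-off between ratio and speed can be measured on your own data before choosing a format. The Python sketch below compresses the same bytes with the standard `gzip` and `bz2` modules and reports size and elapsed time. The function name and the sample data are hypothetical, and results depend heavily on the input, so treat it as a way to measure rather than as a benchmark.

```python
import bz2
import gzip
import time


def compare_codecs(data):
    """Compress data with gzip and bzip2; return size, ratio, and time."""
    if not data:
        raise ValueError("data must be non-empty")
    results = {}
    for name, module in (("gzip", gzip), ("bzip2", bz2)):
        start = time.perf_counter()
        compressed = module.compress(data)
        elapsed = time.perf_counter() - start
        # Make sure the round trip restores the original bytes.
        if module.decompress(compressed) != data:
            raise RuntimeError(f"{name} round trip failed")
        results[name] = {
            "bytes": len(compressed),
            "ratio": len(compressed) / len(data),
            "seconds": elapsed,
        }
    return results


if __name__ == "__main__":
    # Hypothetical sample: repetitive log-like text.
    lines = [f"2024-01-01 00:00:{i % 60:02d} INFO user={i} action=view" for i in range(50000)]
    sample = "\n".join(lines).encode("utf-8")
    for codec, stats in compare_codecs(sample).items():
        print(codec, stats["bytes"], round(stats["ratio"], 3), round(stats["seconds"], 3))
```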

Finally, a table is used to compare the features (pros and cons) of the above 4 compression formats:

Comparison of features of the 4 compression formats

Compression format | Split | Native library | Compression ratio | Speed | Bundled with Hadoop | Linux command | Application changes after switching to the format
Gzip | No | Yes | Very high | Relatively fast | Yes, usable directly | Yes | None; handled the same as text
Lzo | Yes | Yes | Fairly high | Very fast | No, must be installed | Yes | An index must be built and the input format specified
Snappy | No | Yes | Fairly high | Very fast | No, must be installed | No | None; handled the same as text
Bzip2 | Yes | No | Highest | Slow | Yes, usable directly | Yes | None; handled the same as text
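The yes/no columns of this comparison can also be encoded as data and filtered by requirement. The Python sketch below is one such encoding; the dictionary keys and the function name are hypothetical, and the values simply restate the comparison above.

```python
# Yes/no features restated from the comparison above.
# Note: LZO split support additionally requires building an index.
FORMATS = {
    "gzip": {"split": False, "native": True, "bundled": True, "linux_command": True},
    "lzo": {"split": True, "native": True, "bundled": False, "linux_command": True},
    "snappy": {"split": False, "native": True, "bundled": False, "linux_command": False},
    "bzip2": {"split": True, "native": False, "bundled": True, "linux_command": True},
}


def formats_matching(**required):
    """Return the sorted names of formats whose features equal all requirements."""
    return sorted(
        name
        for name, features in FORMATS.items()
        if all(features[key] == value for key, value in required.items())
    )


if __name__ == "__main__":
    print("splittable:", formats_matching(split=True))
    print("splittable and bundled:", formats_matching(split=True, bundled=True))
```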

