Application of 4 Common Compression Formats in Hadoop

Source: Internet
Author: User

LZO, gzip, Snappy, and bzip2 are currently the four most commonly used compression formats in Hadoop. Based on practical experience, this article describes the advantages, disadvantages, and application scenarios of each format, so that you can choose the one that fits your actual situation.

1 gzip Compression

Advantages: High compression ratio and relatively fast compression/decompression. Hadoop supports gzip out of the box, so applications process gzip files the same way they process plain text. A Hadoop native library is available, and most Linux systems ship with the gzip command, so it is easy to use.

Disadvantages: Does not support split.

Application scenario: Consider gzip when each compressed file stays within about 130 MB (one block size). For example, compress one day or one hour of logs into a single gzip file, then let the MapReduce job achieve concurrency by running over multiple gzip files. Hive programs, streaming programs, and MapReduce programs written in Java handle these files exactly as they handle text, so existing programs need no changes after compression.
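The gzip scenario can be sketched with Python's standard gzip module, outside Hadoop. This is only an illustration: the 128 MB block size, the file name, and the helper names (write_gzip_log, fits_in_one_block, read_gzip_log) are assumptions and do not come from the original article.

```python
import gzip
import os
import tempfile

# Assumed HDFS block size; the real value depends on the cluster configuration.
BLOCK_SIZE = 128 * 1024 * 1024


def write_gzip_log(path, lines):
    """Write text lines to a gzip file and return its compressed size in bytes."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
    return os.path.getsize(path)


def fits_in_one_block(compressed_size, block_size=BLOCK_SIZE):
    """A gzip file cannot be split, so it is best kept within one block."""
    return compressed_size <= block_size


def read_gzip_log(path):
    """Read the file back line by line, the same way plain text is read."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


if __name__ == "__main__":
    # Hypothetical one-hour log used only for demonstration.
    hourly_log = [f"2014-04-01 10:00:{i % 60:02d} GET /page/{i}" for i in range(10000)]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "access-10.log.gz")
        size = write_gzip_log(path, hourly_log)
        print("compressed bytes:", size)
        print("fits in one block:", fits_in_one_block(size))
        print("round trip ok:", read_gzip_log(path) == hourly_log)
```

Writing one such file per hour or per day gives the job several independent inputs, which is how concurrency is obtained even though no single gzip file can be split.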

2 LZO Compression

Advantages: Relatively fast compression/decompression and a reasonable compression ratio. Supports split and is one of the most popular compression formats in Hadoop. A Hadoop native library is available, and the lzop command can be installed on Linux, so it is easy to use.

Disadvantages: Compression ratio is lower than gzip. Hadoop does not support LZO out of the box, so it must be installed separately. Applications need some special handling for LZO files: an index must be built to support split, and the InputFormat must be set to the LZO format.

Application scenario: Consider LZO for a very large text file that is still larger than 200 MB after compression. The larger the single file, the more obvious the advantage of LZO.
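Why a splittable format pays off more as a single file grows can be shown with simple arithmetic. The sketch below is a rough estimate only: the 128 MB block size and the function name input_splits are assumptions, and it treats an LZO file as splittable only because an index has been built for it.

```python
import math

# Assumed HDFS block size in MB; adjust to match the cluster.
BLOCK_SIZE_MB = 128


def input_splits(file_size_mb, splittable, block_size_mb=BLOCK_SIZE_MB):
    """Estimate how many map tasks can read one file in parallel."""
    if file_size_mb <= 0:
        return 0
    if not splittable:
        # A non-splittable file is always read by a single map task.
        return 1
    return math.ceil(file_size_mb / block_size_mb)


if __name__ == "__main__":
    for size_mb in (100, 200, 1024):
        print(
            size_mb,
            "MB -> splittable:",
            input_splits(size_mb, True),
            "non-splittable:",
            input_splits(size_mb, False),
        )
```

Under these assumptions a file that fits in one block gains nothing from being splittable, while a 1024 MB file goes from one map task to eight.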

3 Snappy Compression

Advantages: Very fast compression and a reasonable compression ratio. A Hadoop native library is available.

Disadvantages: Does not support split. Compression ratio is lower than gzip. Hadoop does not support Snappy out of the box, so it must be installed separately. Linux systems have no corresponding command.

Application scenario: When the map output of a MapReduce job is large, use Snappy to compress the intermediate data passed from map to reduce, or to compress the output of one MapReduce job that serves as the input of another.
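One way to turn on compression of intermediate map output is to pass configuration properties as -D options when submitting a job. The helper below only builds that argument list and does not run Hadoop. The property names are the ones used by Hadoop 2.x; older releases use different names, so check them against your own version. The function name is hypothetical.

```python
def map_output_compression_args(codec="org.apache.hadoop.io.compress.SnappyCodec"):
    """Build -D options that enable compression of intermediate map output."""
    # Property names as used in Hadoop 2.x (assumption: verify for your release).
    settings = {
        "mapreduce.map.output.compress": "true",
        "mapreduce.map.output.compress.codec": codec,
    }
    args = []
    for key, value in settings.items():
        args.extend(["-D", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Print the options so they can be pasted into a job submission command.
    print(" ".join(map_output_compression_args()))
```

Because only the intermediate data is compressed, the lack of split support in Snappy does not matter here.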

4 bzip2 Compression

Advantages: Supports split. Very high compression ratio, higher than gzip. Hadoop supports bzip2 out of the box, but without a native library. Linux systems ship with the bzip2 command, so it is easy to use.

Disadvantages: Slow compression/decompression; no native library support.

Application scenario: Suitable when speed requirements are low but a high compression ratio is needed. It can serve as the output format of a MapReduce job. It also suits large output data that must be compressed and archived to save disk space and will rarely be used afterwards, and a single large text file that should take less storage while still supporting split and staying compatible with existing applications, that is, without modifying them.
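The trade-off between compression ratio and speed can be observed locally with Python's standard gzip and bz2 modules. This is a rough sketch, not a benchmark: the sample data is invented, the helper names are assumptions, and the results depend heavily on the data and the machine.

```python
import bz2
import gzip
import time


def measure(name, compress, data):
    """Return (name, compressed size in bytes, elapsed seconds) for one compressor."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    return name, len(packed), elapsed


def compare(data):
    """Compress the same bytes with gzip and bzip2 and collect the results."""
    return [
        measure("gzip", gzip.compress, data),
        measure("bzip2", bz2.compress, data),
    ]


if __name__ == "__main__":
    # Hypothetical log-like sample; real data will behave differently.
    sample = b"".join(
        b"2014-04-01 INFO request %d served\n" % i for i in range(20000)
    )
    print("original bytes:", len(sample))
    for name, size, seconds in compare(sample):
        print(f"{name}: {size} bytes in {seconds:.4f} s")
```

Running this on a representative sample of your own data is a quick way to judge whether the extra compression of bzip2 is worth the extra time.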

Finally, the following table compares the characteristics (pros and cons) of the four compression formats:

Comparison of characteristics of the 4 compression formats

Format | Split | Native | Compression ratio | Speed | Bundled with Hadoop | Linux command | Changes needed in the application after switching to the compressed format
------ | ----- | ------ | ----------------- | ----- | ------------------- | ------------- | ---------------------------------------------------------------------------
gzip | No | Yes | Very high | Relatively fast | Yes, can be used directly | Yes | None; handled the same as text
LZO | Yes | Yes | Relatively high | Very fast | No, must be installed | Yes | Must build an index and specify the input format
Snappy | No | Yes | Relatively high | Very fast | No, must be installed | No | None; handled the same as text
bzip2 | Yes | No | Highest | Slow | Yes, can be used directly | Yes | None; handled the same as text
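The yes/no columns of the comparison can also be written down as data and filtered by requirement. The sketch below is an illustration only: the dictionary keys and the candidates function are hypothetical names, and the values simply restate the Split, Native, Bundled with Hadoop, and Linux command columns.

```python
# Yes/no columns of the comparison, restated as booleans.
FORMATS = {
    "gzip": {"split": False, "native": True, "bundled": True, "linux_cmd": True},
    "lzo": {"split": True, "native": True, "bundled": False, "linux_cmd": True},
    "snappy": {"split": False, "native": True, "bundled": False, "linux_cmd": False},
    "bzip2": {"split": True, "native": False, "bundled": True, "linux_cmd": True},
}


def candidates(**required):
    """Return the format names whose properties match every required value."""
    return [
        name
        for name, props in FORMATS.items()
        if all(props[key] == value for key, value in required.items())
    ]


if __name__ == "__main__":
    print("splittable:", candidates(split=True))
    print("splittable and bundled:", candidates(split=True, bundled=True))
    print("native and bundled:", candidates(native=True, bundled=True))
```

Compression ratio and speed are left out because they are relative rankings, not yes/no properties.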

Permanent link to this article: http://www.linuxidc.com/Linux/2014-04/101230.htm
