Comparison of features of four compression formats in Hadoop

Source: Internet
Author: User

1 Gzip Compression

Advantages: the compression ratio is relatively high, and compression and decompression are also relatively fast. Hadoop supports Gzip out of the box, so applications process Gzip files just as they process plain text; a Hadoop native library is available; and most Linux systems ship with the gzip command, so it is easy to use.

Disadvantage: split is not supported, so a single Gzip file cannot be divided among several map tasks.
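
The lack of split support can be observed outside Hadoop as well. The following is a minimal sketch using Python's standard zlib and gzip modules rather than Hadoop itself; the sample data and the choice of the midpoint as the starting offset are assumptions made for illustration. A reader that starts in the middle of a Gzip stream, as a second map task would have to, finds no header there and cannot decode anything.

```python
import gzip
import zlib

# Made-up log-like sample data, only for illustration.
lines = [f"2024-01-01 00:00:{i % 60:02d} INFO request id={i}\n" for i in range(20000)]
raw = "".join(lines).encode("utf-8")
compressed = gzip.compress(raw)


def can_decode_from(data: bytes, offset: int) -> bool:
    """Return True if a gzip reader starting at `offset` can decode the rest."""
    try:
        # wbits=31 tells zlib to expect a gzip header at the start of the input.
        zlib.decompress(data[offset:], wbits=31)
        return True
    except zlib.error:
        return False


from_start = can_decode_from(compressed, 0)
from_middle = can_decode_from(compressed, len(compressed) // 2)
print("decode from start: ", from_start)
print("decode from middle:", from_middle)
```

Decoding from the start succeeds, while decoding from the middle fails, which is why Hadoop has to hand the whole file to one task.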

Application scenario: Gzip is a good choice when each file, once compressed, fits within one block. For example, the logs of one day or one hour can be compressed into a single Gzip file, and a MapReduce job then processes many such files concurrently. Hive programs and streaming programs behave the same way as MapReduce programs written in Java: after the data is compressed, the existing program does not need to be modified.
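
The one-file-per-hour layout can be sketched with Python's standard gzip module. This is only a stand-in for what a log pipeline would do before loading data into Hadoop; the file names, the line format, and the helper names write_hourly_logs and count_lines are hypothetical. The point it illustrates is that each compressed file is read exactly like text, so the reading code does not change.

```python
import gzip
import tempfile
from pathlib import Path


def write_hourly_logs(directory: Path, hours: int, lines_per_hour: int) -> list:
    """Write one gzip file per hour of made-up log lines."""
    paths = []
    for hour in range(hours):
        path = directory / f"access-{hour:02d}.log.gz"
        with gzip.open(path, "wt", encoding="utf-8") as out:
            for n in range(lines_per_hour):
                out.write(f"{hour:02d}:00 GET /item/{n}\n")
        paths.append(path)
    return paths


def count_lines(path: Path) -> int:
    """Read a gzip file line by line, as if it were plain text."""
    with gzip.open(path, "rt", encoding="utf-8") as src:
        return sum(1 for _ in src)


with tempfile.TemporaryDirectory() as tmp:
    files = write_hourly_logs(Path(tmp), hours=3, lines_per_hour=1000)
    counts = [count_lines(p) for p in files]

print(counts)
```

In a MapReduce job, each of these files would typically be handled by its own map task, which is where the concurrency comes from.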

2 LZO Compression

Advantages: compression and decompression are relatively fast, and the compression ratio is reasonable; split is supported, which makes it the most popular compression format in Hadoop; a Hadoop native library is supported; and the lzop command can be installed on Linux, so it is easy to use.

Disadvantages: the compression ratio is lower than that of Gzip; LZO is not bundled with Hadoop and must be installed separately; and applications need special handling for LZO files (an index must be built to support split, and the input format must be set to an LZO input format).
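
Why an index makes a compressed file splittable can be illustrated with a small sketch. It assumes a file made of independently compressed blocks plus a list of the byte offsets where those blocks start. Python's standard library has no LZO codec, so zlib is used here as a stand-in, and the layout below is not the real .lzo or index file format; the function names are hypothetical.

```python
import zlib


def compress_in_blocks(data: bytes, block_size: int):
    """Compress `data` block by block; return (payload, index of block offsets)."""
    payload = bytearray()
    index = []  # byte offset at which each compressed block starts
    for start in range(0, len(data), block_size):
        index.append(len(payload))
        payload += zlib.compress(data[start:start + block_size])
    return bytes(payload), index


def read_block(payload: bytes, index: list, block_no: int) -> bytes:
    """Decompress one block via the index, without reading earlier blocks."""
    begin = index[block_no]
    end = index[block_no + 1] if block_no + 1 < len(index) else len(payload)
    return zlib.decompress(payload[begin:end])


data = b"".join(b"record %06d\n" % i for i in range(10000))
payload, index = compress_in_blocks(data, block_size=16 * 1024)

# A task assigned the third block can jump straight to it.
third = read_block(payload, index, 2)
print(len(index), "blocks; third block starts with", third[:14])
```

Without the offsets, a reader would have to scan from the beginning of the file to find a block boundary, which is the situation the index is built to avoid.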

Application scenario: LZO is worth considering for a large text file that is still large after compression. The larger the single file, the more obvious the advantage of LZO.
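
The effect of file size can be made concrete with a rough estimate of how many map tasks one input file yields. This is a simplified sketch: the 128 MB block size is an assumption (the real value is cluster configuration), the function name map_tasks is hypothetical, and actual split counts also depend on the input format and split-size settings.

```python
import math

BLOCK_SIZE_MB = 128  # assumed HDFS block size; check the cluster configuration


def map_tasks(file_size_mb: float, splittable: bool,
              block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Roughly estimate the number of map tasks for a single input file."""
    if not splittable:
        return 1  # the whole file goes to a single map task
    return max(1, math.ceil(file_size_mb / block_size_mb))


for size in (100, 1024, 10240):
    print(f"{size} MB: unsplittable={map_tasks(size, False)}, "
          f"splittable={map_tasks(size, True)}")
```

Under these assumptions a 10 GB splittable file can be spread over about 80 tasks, while an unsplittable file of the same size is still processed by one.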

3 Snappy Compression

Advantages: compression is fast and the compression ratio is reasonable; a Hadoop native library is supported.

Disadvantages: split is not supported; the compression ratio is lower than that of Gzip; Snappy is not bundled with Hadoop and must be installed separately; and there is no corresponding Linux command.

Application scenario: when the map output of a MapReduce job is large, Snappy works well as the compression format for the intermediate data passed from map to reduce. It also suits data that is the output of one MapReduce job and the input of another.
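
Compressing the intermediate map output is usually a matter of job configuration. The sketch below builds the generic -D options that could be passed when submitting a job. The property names and the codec class are the ones used by Hadoop 2 and later and should be checked against the version in use; the helper name map_output_compression_flags is hypothetical.

```python
def map_output_compression_flags(codec_class: str) -> list:
    """Build -D options that turn on compression of intermediate map output."""
    props = {
        "mapreduce.map.output.compress": "true",
        "mapreduce.map.output.compress.codec": codec_class,
    }
    flags = []
    for key, value in props.items():
        flags += ["-D", f"{key}={value}"]
    return flags


flags = map_output_compression_flags("org.apache.hadoop.io.compress.SnappyCodec")
print(" ".join(flags))
```

Because only the intermediate data is affected, the final job output and the application logic stay as they were.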

4 Bzip2 Compression

Advantages: split is supported; the compression ratio is high, higher than that of Gzip; Bzip2 is bundled with Hadoop, although no native library is supported; and Linux systems provide the bzip2 command, so it is easy to use.

Disadvantages: compression and decompression are slow; no native library is supported.

Application scenario: Bzip2 suits cases where speed matters little but a high compression ratio is required, for example as the output format of a MapReduce job. It also suits large output data that, after processing, must be compressed and archived to save disk space and will seldom be used later. Finally, it suits compressing a single large text file to save storage space when split support is still needed and the existing application must keep working (that is, the application does not need to be modified).

Finally, the following table compares the features (advantages and disadvantages) of the four compression formats:

Comparison of features of the four compression formats

Compression format | Split | Native | Compression ratio | Speed | Bundled with Hadoop? | Linux command | Does the application need changes after switching to this format?
Gzip | No | Yes | Very high | Relatively fast | Yes, usable directly | Yes | No; handled the same as text
LZO | Yes | Yes | Relatively high | Very fast | No, must be installed | Yes | Yes; an index must be built and the input format specified
Snappy | No | Yes | Relatively high | Very fast | No, must be installed | No | No; handled the same as text
Bzip2 | Yes | No | Highest | Slow | Yes, usable directly | Yes | No; handled the same as text
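
The ratio and speed columns are relative, so it can be worth measuring them on representative data. The following is a minimal sketch that times Gzip and Bzip2 with Python's standard library. LZO and Snappy are left out because the standard library has no codec for them, and the sample data and the helper name measure are made up for illustration; real numbers depend heavily on the data and the machine.

```python
import bz2
import gzip
import time


def measure(name, compress, decompress, data):
    """Compress `data`, verify the round trip, and report ratio and time."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    if decompress(packed) != data:
        raise ValueError(f"{name} round trip failed")
    return {"format": name, "ratio": len(data) / len(packed), "seconds": elapsed}


# Made-up log-like sample; replace with a slice of real data for a useful result.
sample = "".join(
    f"2024-01-01 12:00:{i % 60:02d} user={i % 500} action=view\n"
    for i in range(50000)
).encode("utf-8")

results = [
    measure("gzip", gzip.compress, gzip.decompress, sample),
    measure("bzip2", bz2.compress, bz2.decompress, sample),
]
for row in results:
    print(f"{row['format']:>5}: ratio {row['ratio']:.1f}x, "
          f"compress time {row['seconds']:.3f}s")
```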

