The compression formats most commonly used in Hadoop today are lzo, gzip, snappy, and bzip2. Drawing on practical experience, the author describes the advantages, disadvantages, and application scenarios of these 4 formats, so that readers can choose the appropriate one for their actual situation.
1 gzip compression
Advantages: The compression ratio is high and compression/decompression is relatively fast; Hadoop supports it out of the box, so gzip files are processed in the application exactly like plain text; a Hadoop native library is available; most Linux systems ship with the gzip command, so it is easy to use.
Disadvantages: Split is not supported.
Application scenario: When each file is no larger than about 130M after compression (within 1 block), gzip is worth considering. For example, one day's or one hour's logs can be compressed into a single gzip file, and a MapReduce program then achieves concurrency by running over multiple gzip files. Hive programs, streaming programs, and MapReduce programs written in Java handle these files exactly as they handle text, so the original program does not need to be modified after compression.
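The two properties above, that gzip data reads back exactly like the original text but cannot be split, can be illustrated with a small sketch that uses only the Python standard library. This is not Hadoop code, and the sample log lines are made up for illustration. Decoding works from the start of the stream but fails from an arbitrary offset in the middle, which is why a whole gzip file has to be handled by a single map task.

```python
import gzip
import zlib

# Hypothetical sample data standing in for an hourly log file.
lines = [f"00:00:{i % 60:02d} INFO request id={i}\n" for i in range(5000)]
raw = "".join(lines).encode("utf-8")

compressed = gzip.compress(raw)

# Read from the start: the decompressed bytes are identical to the original text.
assert gzip.decompress(compressed) == raw


def can_decode_from(data, offset):
    """Try to start gzip decoding at the given byte offset."""
    decoder = zlib.decompressobj(31)  # 31 = expect a gzip header
    try:
        decoder.decompress(data[offset:])
        return True
    except zlib.error:
        return False


from_start = can_decode_from(compressed, 0)
from_middle = can_decode_from(compressed, len(compressed) // 2)
print("decode from start: ", from_start)
print("decode from middle:", from_middle)
```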
2 lzo compression
Advantages: Compression/decompression is relatively fast, with a reasonable compression ratio; split is supported, and it is the most popular compression format in Hadoop; a Hadoop native library is available; the lzop command can be installed on Linux systems, so it is easy to use.
Disadvantages: The compression ratio is lower than gzip's; Hadoop does not ship with it, so it must be installed; lzo files need some special handling in the application (to support split, an index must be built, and the InputFormat must be set to the lzo format).
Application scenario: Worth considering for a large text file that is still larger than 200M after compression; the larger the single file, the more obvious lzo's advantage.
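The two extra steps mentioned above (building an index and specifying the input format) might look roughly like the following sketch, which only assembles the command lines as Python lists. It assumes the separately installed hadoop-lzo project; the jar paths and the HDFS paths are hypothetical placeholders, and the indexer and input format class names are the ones published by hadoop-lzo, so they should be checked against the installed version.

```python
# Hypothetical locations; adjust to the actual installation.
HADOOP_LZO_JAR = "/path/to/hadoop-lzo.jar"
STREAMING_JAR = "/path/to/hadoop-streaming.jar"


def lzo_index_command(lzo_path):
    """Command that builds the split index for an .lzo file stored on HDFS."""
    return [
        "hadoop", "jar", HADOOP_LZO_JAR,
        "com.hadoop.compression.lzo.DistributedLzoIndexer",
        lzo_path,
    ]


def lzo_streaming_command(input_path, output_path, mapper, reducer):
    """Streaming job command that reads indexed .lzo input with the lzo input format."""
    return [
        "hadoop", "jar", STREAMING_JAR,
        "-libjars", HADOOP_LZO_JAR,  # generic options go before streaming options
        "-inputformat", "com.hadoop.mapred.DeprecatedLzoTextInputFormat",
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]


# Hypothetical paths and scripts, for illustration only.
index_cmd = lzo_index_command("/logs/big_file.lzo")
job_cmd = lzo_streaming_command("/logs/big_file.lzo", "/out/result", "mapper.py", "reducer.py")
print(" ".join(index_cmd))
print(" ".join(job_cmd))
```

The lists can be passed to subprocess.run on a machine where the hadoop command is available.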
3 snappy compression
Advantages: Very fast compression and a reasonable compression ratio; a Hadoop native library is available.
Disadvantages: Split is not supported; the compression ratio is lower than gzip's; Hadoop does not ship with it, so it must be installed; there is no corresponding command on Linux systems.
Application scenario: When the map output of a MapReduce job is large, snappy can be used to compress the intermediate data between map and reduce, or to compress the output of one MapReduce job that serves as the input to another.
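As a rough sketch of the scenario above, the helpers below assemble the -D options that switch on snappy for intermediate map output and for job output. The property names assume the Hadoop 2.x (or later) naming scheme, and the helper function names are made up for this example; older releases use different property names, so treat this as a starting point rather than a definitive configuration.

```python
SNAPPY_CODEC = "org.apache.hadoop.io.compress.SnappyCodec"


def snappy_map_output_options():
    """-D options that compress the intermediate map output with snappy."""
    return [
        "-D", "mapreduce.map.output.compress=true",
        "-D", "mapreduce.map.output.compress.codec=" + SNAPPY_CODEC,
    ]


def snappy_job_output_options():
    """-D options that compress the final job output with snappy."""
    return [
        "-D", "mapreduce.output.fileoutputformat.compress=true",
        "-D", "mapreduce.output.fileoutputformat.compress.codec=" + SNAPPY_CODEC,
    ]


map_opts = snappy_map_output_options()
out_opts = snappy_job_output_options()
print(" ".join(map_opts + out_opts))
```

The options are generic Hadoop options, so they go before any job-specific arguments on the command line.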
4 bzip2 compression
Advantages: Split is supported; the compression ratio is very high, higher than gzip's; Hadoop ships with it, although there is no native library support; Linux systems come with the bzip2 command, so it is easy to use.
Disadvantages: Compression/decompression is slow; there is no native library support.
Application scenario: Suitable when speed requirements are low but a high compression ratio is needed, for example as the output format of a MapReduce job; or when the output data is large and the processed data must be compressed and archived to save disk space and is seldom used afterwards; or when a single large text file must be compressed to save storage space while still supporting split and remaining compatible with existing applications (that is, the applications do not need to be modified).
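The trade-off between compression ratio and speed can be checked on one's own data with a small measurement like the sketch below. It uses the gzip and bz2 modules from the Python standard library rather than the Hadoop codecs, and the sample text is synthetic, so the numbers it prints are only indicative and will differ from what a cluster sees on real logs.

```python
import bz2
import gzip
import time

# Synthetic sample text; replace with a slice of real data for a meaningful comparison.
sample = "".join(
    f"user={i % 997} action=view page=/item/{i * 7 % 10007} status=200\n"
    for i in range(50000)
).encode("utf-8")


def measure(name, compress, decompress, data):
    """Compress data once, verify the round trip, and report ratio and time."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    assert decompress(packed) == data  # round trip must be lossless
    return {"format": name, "ratio": len(data) / len(packed), "seconds": elapsed}


results = [
    measure("gzip", gzip.compress, gzip.decompress, sample),
    measure("bzip2", bz2.compress, bz2.decompress, sample),
]
for r in results:
    print(f"{r['format']:6s} ratio={r['ratio']:.2f} compress_time={r['seconds']:.3f}s")
```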
Finally, the following table compares the features (advantages and disadvantages) of the 4 compression formats:
Comparison of features of the 4 compression formats
Compression format | Split | Native | Compression ratio | Speed | Ships with Hadoop | Linux command | Changes needed in the original application after switching to the compressed format
gzip | No | Yes | Very high | Relatively fast | Yes, usable directly | Yes | None; handled the same as text
lzo | Yes | Yes | Relatively high | Very fast | No, must be installed | Yes | An index must be built and the input format specified
snappy | No | Yes | Relatively high | Very fast | No, must be installed | No | None; handled the same as text
bzip2 | Yes | No | Highest | Slow | Yes, usable directly | Yes | None; handled the same as text
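The selection advice in this article can be condensed into a small helper, sketched below. The function name, its parameters, and the order in which the rules are checked are assumptions made for this example; the 130M and 200M thresholds are the ones given in the application scenarios, and real decisions will usually weigh more factors than this.

```python
BLOCK_SIZE_MB = 130      # "within 1 block" threshold from the gzip scenario
LZO_THRESHOLD_MB = 200   # size above which lzo is suggested for large text files


def suggest_format(compressed_size_mb, intermediate_data=False,
                   archive=False, can_modify_app=True):
    """Return a suggested compression format following the article's scenarios."""
    if intermediate_data:
        # Map output, or output that feeds another MapReduce job.
        return "snappy"
    if archive:
        # Rarely read afterwards; ratio matters more than speed.
        return "bzip2"
    if compressed_size_mb <= BLOCK_SIZE_MB:
        # Fits in one block, so the lack of split support does not hurt.
        return "gzip"
    if compressed_size_mb > LZO_THRESHOLD_MB and can_modify_app:
        # Large file; indexing and a custom input format are acceptable.
        return "lzo"
    # Split is needed but the application should stay unchanged.
    return "bzip2"


print(suggest_format(100))
print(suggest_format(500))
print(suggest_format(500, can_modify_app=False))
print(suggest_format(10, intermediate_data=True))
```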