Four compression formats are in common use in Hadoop today: lzo, gzip, snappy, and bzip2. Drawing on practical experience, the author describes the advantages, disadvantages, and application scenarios of each, so that readers can choose the right format for their actual situation.
1 gzip compression
Advantages: relatively high compression ratio with fairly fast compression / decompression; supported by Hadoop itself, so applications process gzip files exactly as they would plain text; has a Hadoop native library; most Linux systems ship with the gzip command, so it is easy to use.
Disadvantages: Does not support split.
Application scenarios: consider gzip when each file, after compression, is within about 130M (one block size). For example, compress one day's or one hour's logs into a single gzip file, then have the MapReduce program read multiple gzip files to achieve concurrency. Hive programs, streaming programs, and MapReduce programs written in Java handle these files exactly as they handle plain text, so existing programs need no changes after the data is compressed.
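
The "does not support split" drawback can be illustrated outside Hadoop. The following is a minimal sketch using only Python's standard library; the helper name `can_decode_from` and the sample log lines are hypothetical, chosen for illustration. It tries to decode a gzip stream from its start and from its midpoint. A reader that begins in the middle finds no gzip header, which is the situation a second map task would face if a single gzip file were split across blocks.

```python
import gzip
import zlib


def can_decode_from(gz_bytes: bytes, offset: int) -> bool:
    """Return True if a gzip decoder starting at `offset` can decode the rest."""
    try:
        # wbits=31 tells zlib to expect a gzip header at the start of the input.
        zlib.decompress(gz_bytes[offset:], wbits=31)
        return True
    except zlib.error:
        return False


if __name__ == "__main__":
    # Hypothetical log data, used only to have something to compress.
    log = b"".join(b"2024-01-01 INFO request id=%d\n" % i for i in range(50000))
    packed = gzip.compress(log)
    print("decode from start:   ", can_decode_from(packed, 0))
    print("decode from midpoint:", can_decode_from(packed, len(packed) // 2))
```

This is why the advice above is to keep each gzip file to roughly one block and to get concurrency from many files rather than from splitting one file.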
2 lzo compression
Advantages: fairly fast compression / decompression with a reasonable compression rate; supports split, and is the most popular compression format in Hadoop; has a Hadoop native library; the lzop command can be installed on Linux systems, so it is easy to use.
Disadvantages: compression rate is lower than gzip; not supported by Hadoop itself, so it must be installed separately; applications need special handling for lzo files (an index must be built to support split, and the InputFormat must be set to the lzo format).
Application scenarios: consider lzo for a large text file that is still larger than 200M after compression; the larger the single file, the more obvious the advantage of lzo.
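
The idea behind building an index to support split can be sketched in a few lines. This is not LZO and not the hadoop-lzo index format: as a stand-in, the sketch compresses data in independent blocks with Python's `zlib` and records the offset of each compressed block. The function names `compress_in_blocks` and `read_split` are hypothetical. The point is only that once block offsets are known, a reader can start decoding at any block boundary, which is what allows several map tasks to share one large file.

```python
import zlib


def compress_in_blocks(data: bytes, block_size: int):
    """Compress `data` as independent blocks; return (blob, index of block offsets)."""
    blob = bytearray()
    index = []
    for start in range(0, len(data), block_size):
        index.append(len(blob))  # offset where this compressed block begins
        blob += zlib.compress(data[start:start + block_size])
    return bytes(blob), index


def read_split(blob: bytes, index: list, first_block: int, last_block: int) -> bytes:
    """Decode blocks [first_block, last_block) without touching earlier blocks."""
    parts = []
    for i in range(first_block, last_block):
        begin = index[i]
        end = index[i + 1] if i + 1 < len(index) else len(blob)
        parts.append(zlib.decompress(blob[begin:end]))
    return b"".join(parts)


if __name__ == "__main__":
    text = b"".join(b"record %d\n" % i for i in range(10000))
    blob, index = compress_in_blocks(text, block_size=16 * 1024)
    # A "second mapper" reads only the second half of the blocks.
    half = len(index) // 2
    tail = read_split(blob, index, half, len(index))
    print(len(index), "blocks;", len(tail), "bytes decoded from the second split")
```

Without the index, a reader would have to scan from the beginning of the file to find block boundaries, and the file would effectively be processed by a single task.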
3 snappy compression
Advantages: very fast compression with a reasonable compression rate; has a Hadoop native library.
Disadvantages: does not support split; compression rate is lower than gzip; not supported by Hadoop itself, so it must be installed separately; there is no corresponding command on Linux systems.
Application scenarios: when the map output of a MapReduce job is relatively large, as the compression format for the intermediate data between map and reduce; or as the format for the output of one MapReduce job that serves as the input of another.
4 bzip2 compression
Advantages: supports split; high compression rate, higher than gzip; supported by Hadoop itself, though without native library support; the bzip2 command comes with Linux systems, so it is easy to use.
Disadvantages: compression / decompression is slow; no native library support.
Application scenarios: suitable when speed requirements are low but a high compression rate is needed. It can be used as the output format of MapReduce jobs; or when the output data is relatively large and the processed data needs to be compressed and archived to save disk space and is unlikely to be used much later; or when a single large text file must be compressed to reduce storage space while still supporting split and staying compatible with existing applications (i.e., the applications do not need to be modified).
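
The trade-off between gzip and bzip2 can be checked on your own data before choosing a format. The following is a rough sketch using Python's standard `gzip` and `bz2` modules, not Hadoop's codecs, so absolute timings will differ from a cluster. The helper names and the synthetic log lines are hypothetical. It reports the compressed-to-original size ratio and the compression / decompression time for each format.

```python
import bz2
import gzip
import time


def make_sample_log(lines: int = 20000) -> bytes:
    """Build synthetic log text; replace with a sample of real data in practice."""
    return "".join(
        f"2024-01-01 00:00:{i % 60:02d} INFO user={i % 997} page=/item/{i * 7 % 5000}\n"
        for i in range(lines)
    ).encode()


def measure(name, compress, decompress, data: bytes) -> dict:
    """Time one compress/decompress round trip and report the size ratio."""
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    restored = decompress(packed)
    t2 = time.perf_counter()
    if restored != data:
        raise ValueError(f"{name} round trip did not restore the input")
    return {
        "codec": name,
        "ratio": len(packed) / len(data),  # smaller means better compression
        "compress_s": t1 - t0,
        "decompress_s": t2 - t1,
    }


def compare(data: bytes) -> list:
    return [
        measure("gzip", gzip.compress, gzip.decompress, data),
        measure("bzip2", bz2.compress, bz2.decompress, data),
    ]


if __name__ == "__main__":
    for row in compare(make_sample_log()):
        print(row)
```

Results depend heavily on the data, so it is worth running the comparison on a representative sample rather than relying on the synthetic lines used here.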
Finally, a table compares the features (advantages and disadvantages) of the four compression formats above: