1 gzip Compression
Advantages: the compression ratio is relatively high, and compression/decompression is relatively fast. Hadoop supports gzip out of the box, so applications process gzip files exactly as they process plain text; a Hadoop native library is available; and most Linux systems ship with the gzip command, which makes it convenient to use.
Disadvantage: splitting is not supported.
Application scenario: gzip is a good choice when each compressed file fits within one block. For example, compress the logs of one day or one hour into a single gzip file, then run a MapReduce program over many gzip files so that they are processed concurrently. Hive programs and streaming programs behave the same way as MapReduce programs written in Java: after compression, the original program does not need to be modified.
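The scenario above can be sketched with Python's standard `gzip` module. This is only an illustration, not Hadoop code: the 128 MB block size, the file naming scheme, and the helper names are assumptions, so check the block size configured on your own cluster.

```python
import gzip
import os
import tempfile

# Assumed HDFS block size (hypothetical value; check dfs.blocksize on your cluster).
BLOCK_SIZE = 128 * 1024 * 1024


def write_hourly_gzip(lines_by_hour, out_dir):
    """Write one gzip file per hour; return {path: compressed size in bytes}."""
    sizes = {}
    for hour, lines in lines_by_hour.items():
        path = os.path.join(out_dir, f"access-{hour}.log.gz")
        with gzip.open(path, "wt", encoding="utf-8") as f:
            f.writelines(line + "\n" for line in lines)
        sizes[path] = os.path.getsize(path)
    return sizes


def fits_in_one_block(size_bytes, block_size=BLOCK_SIZE):
    """A gzip file cannot be split, so it should fit within a single block."""
    return size_bytes <= block_size


def read_lines(path):
    """Read a gzip file line by line, the same way plain text would be read."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


if __name__ == "__main__":
    logs = {"2024010100": ["GET /a 200", "GET /b 404"], "2024010101": ["GET /c 200"]}
    with tempfile.TemporaryDirectory() as out_dir:
        for path, size in write_hourly_gzip(logs, out_dir).items():
            print(os.path.basename(path), size, fits_in_one_block(size), read_lines(path))
```

Keeping one file per hour means each file can be handed to its own map task, which is where the concurrency in this scenario comes from.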
2 lzo Compression
Advantages: compression/decompression is fast and the compression ratio is reasonable; splitting is supported, and lzo is among the most popular compression formats in Hadoop; a Hadoop native library is available; and the lzop command can be installed on Linux, which makes it convenient to use.
Disadvantages: the compression ratio is lower than gzip's; Hadoop does not ship with it, so it must be installed separately; and applications must handle lzo files specially (an index must be built to support splitting, and the input format must be set to the lzo format).
Application scenario: consider lzo for a large text file that is still large after compression, that is, one that still spans multiple blocks. The larger the single file, the more pronounced lzo's advantage.
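Why an index makes a compressed file splittable can be shown with a small conceptual sketch. It uses `zlib` from the Python standard library as a stand-in, because LZO is not in the standard library; the function names and the index layout are hypothetical and do not reproduce the real lzo index format or Hadoop's lzo input format.

```python
import zlib


def compress_with_index(data: bytes, block_size: int):
    """Compress data as independent blocks.

    Returns (payload, index), where index lists the (offset, length)
    of every compressed block inside payload.
    """
    payload = bytearray()
    index = []
    for start in range(0, len(data), block_size):
        block = zlib.compress(data[start:start + block_size])
        index.append((len(payload), len(block)))
        payload += block
    return bytes(payload), index


def read_block(payload: bytes, index, i: int) -> bytes:
    """Decompress only block i, without touching the blocks before it."""
    offset, length = index[i]
    return zlib.decompress(payload[offset:offset + length])


if __name__ == "__main__":
    data = b"".join(b"line %d of a large text file\n" % i for i in range(10000))
    payload, index = compress_with_index(data, block_size=64 * 1024)
    # A worker can start at any indexed block, which is what a split needs.
    print(len(index), "blocks;", "last block starts with", read_block(payload, index, len(index) - 1)[:20])
```

Without the index, a reader would have to decompress from the beginning to find a block boundary, which is why the index must be created before lzo files can be split.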
3 snappy Compression
Advantages: fast compression and a reasonable compression ratio; a Hadoop native library is available.
Disadvantages: splitting is not supported; the compression ratio is lower than gzip's; Hadoop does not ship with it, so it must be installed separately; and there is no corresponding Linux command.
Application scenario: when the map output of a MapReduce job is large, use snappy to compress the intermediate data between map and reduce; or use it for the output of one MapReduce job that serves as the input of another.
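Both uses described above are switched on through job configuration. The sketch below collects the relevant settings and renders them as `-D key=value` generic options. The property names and the codec class name are those used by Hadoop 2 and later (older releases used different names), so verify them against your version; the helper functions themselves are hypothetical.

```python
# Codec class name as shipped in Hadoop's io.compress package.
SNAPPY_CODEC = "org.apache.hadoop.io.compress.SnappyCodec"


def map_output_compression_props(codec=SNAPPY_CODEC):
    """Settings that compress the intermediate data from map to reduce."""
    return {
        "mapreduce.map.output.compress": "true",
        "mapreduce.map.output.compress.codec": codec,
    }


def job_output_compression_props(codec=SNAPPY_CODEC):
    """Settings that compress the final output of a job."""
    return {
        "mapreduce.output.fileoutputformat.compress": "true",
        "mapreduce.output.fileoutputformat.compress.codec": codec,
    }


def to_generic_options(props):
    """Render a property dict as a flat list of -D key=value arguments."""
    args = []
    for key, value in props.items():
        args += ["-D", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_generic_options(map_output_compression_props())))
```

The resulting arguments are meant to be placed after the job's main class or streaming jar on the command line, assuming the job parses generic options.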
4 Bzip2 Compression
Advantages: splitting is supported; the compression ratio is very high, higher than gzip's; Hadoop ships with it, although without native library support; and Linux systems provide the bzip2 command, which makes it convenient to use.
Disadvantages: compression/decompression is slow; there is no native library support.
Application scenario: bzip2 suits cases where speed matters little but a high compression ratio is required. It can serve as the output format of MapReduce jobs; or, when the output data is large, it can compress and archive processed data that will rarely be used afterwards, saving disk space. It also fits when a single large text file must be compressed to save storage space while still supporting splitting and remaining compatible with existing applications (that is, the applications do not need to be modified).
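The trade-off between compression ratio and speed can be measured on your own data before choosing a format. A minimal sketch using Python's standard `gzip` and `bz2` modules follows; the synthetic log data and the helper names are assumptions, and the numbers it prints depend entirely on the input and the machine, so no particular result is claimed here.

```python
import bz2
import gzip
import time


def sample_log(n=20000):
    """Build synthetic, log-like text (hypothetical data for illustration only)."""
    lines = (f"2024-01-01 00:00:{i % 60:02d} INFO request id={i} status=200\n" for i in range(n))
    return "".join(lines).encode("utf-8")


def measure(name, compress, decompress, data):
    """Time one compress/decompress round trip and report the size ratio."""
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    unpacked = decompress(packed)
    t2 = time.perf_counter()
    if unpacked != data:
        raise ValueError(f"{name} round trip did not restore the input")
    return {
        "format": name,
        "ratio": len(packed) / len(data),  # smaller means better compression
        "compress_s": t1 - t0,
        "decompress_s": t2 - t1,
    }


def compare(data):
    """Measure gzip and bzip2 on the same input."""
    return [
        measure("gzip", gzip.compress, gzip.decompress, data),
        measure("bzip2", bz2.compress, bz2.decompress, data),
    ]


if __name__ == "__main__":
    for row in compare(sample_log()):
        print(row)
```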
Finally, the following table compares the features (advantages and disadvantages) of the four compression formats:
Comparison of features in four compression formats
| Compression format | Split | Native | Compression ratio | Speed | Bundled with Hadoop? | Linux command | Application changes after switching format |
|---|---|---|---|---|---|---|---|
| Gzip | No | Yes | Very high | Relatively fast | Yes, usable directly | Yes | None; handled the same as text |
| Lzo | Yes | Yes | Relatively high | Very fast | No, must be installed | Yes | Yes; an index must be built and the input format specified |
| Snappy | No | Yes | Relatively high | Very fast | No, must be installed | No | None; handled the same as text |
| Bzip2 | Yes | No | Highest | Slow | Yes, usable directly | Yes | None; handled the same as text |