If reprinted, please indicate the source. Hadoop In-Depth Research (7) -- Compression
File compression has two main benefits: it reduces the storage space a file occupies, and it speeds up data transfer. In the context of Hadoop and big data, both points are particularly important, so let's start by looking at file compression in Hadoop.
Hadoop supports many compression formats. The standard summary table from the definitive guide looks like this:

Compression format | Tool  | Algorithm | File extension | Splittable
DEFLATE            | N/A   | DEFLATE   | .deflate       | No
gzip               | gzip  | DEFLATE   | .gz            | No
bzip2              | bzip2 | bzip2     | .bz2           | Yes
LZO                | lzop  | LZO       | .lzo           | No (yes if indexed)
LZ4                | N/A   | LZ4       | .lz4           | No
Snappy             | N/A   | Snappy    | .snappy        | No
DEFLATE is a lossless data compression algorithm that combines the LZ77 algorithm with Huffman coding; a reference implementation can be found in the zlib library. gzip is a file format built on top of the DEFLATE algorithm.
All compression algorithms trade space against time: compressing faster generally yields larger output, and compressing smaller takes longer. Most tools let you control this trade-off with a flag, where -1 optimizes for speed and -9 for space. Taking gzip as an example, the following command asks for the fastest compression:
gzip -1 file
gzip sits in the middle of this time/space trade-off. bzip2 compresses more effectively than gzip, but more slowly; its decompression is faster than its compression, yet bzip2 is still the slowest of these formats overall, even though its compression ratio is clearly the best. At the speed-oriented end, snappy and lz4 decompress much faster than lzo.
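In Hadoop itself, these formats are surfaced through implementations of the CompressionCodec interface, such as GzipCodec and BZip2Codec. As a minimal sketch, the following program (adapted from the StreamCompressor example in the definitive guide) compresses standard input to standard output using a codec class named on the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

// Compresses stdin to stdout with the codec class named in args[0],
// e.g. org.apache.hadoop.io.compress.GzipCodec.
public class StreamCompressor {
    public static void main(String[] args) throws Exception {
        Class<?> codecClass = Class.forName(args[0]);
        Configuration conf = new Configuration();
        CompressionCodec codec =
            (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish(); // flush compressed data without closing System.out
    }
}

Run with GzipCodec, the output can be decompressed with the ordinary gunzip tool.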
The Splittable column indicates whether the compression format supports splitting, that is, whether you can start reading the data from an arbitrary point in the stream. Whether a format is splittable is critical when the compressed data is to be processed by MapReduce.
For example, suppose an uncompressed file is 1 GB and HDFS uses the default block size of 64 MB. The file is stored as 16 blocks, and a MapReduce job using it as input creates 16 input splits, each processed by its own map task. If the file had instead been compressed with gzip, dividing it into 16 splits and handing each to a map task would clearly be inappropriate, because it is impossible to start reading a gzip stream from an arbitrary point. In practice, MapReduce recognizes from the file extension that this is a gzip-compressed file, and since gzip does not support splitting, it hands all 16 blocks to a single map task. Most of those blocks will not be local to that map, and the whole job will take considerably longer.
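How does MapReduce know whether a file can be split? Roughly speaking (this is just a sketch of what TextInputFormat does in recent Hadoop versions), the input format looks up a codec by the file's extension and checks whether that codec supports splitting; the file path here is a hypothetical example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

// If no codec matches the extension, the file is plain text and splittable;
// otherwise the codec itself must support splitting (bzip2 does, gzip does not).
public class SplittableCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        Path file = new Path("/data/logs.gz"); // hypothetical path
        CompressionCodec codec = factory.getCodec(file);
        boolean splittable =
            (codec == null) || (codec instanceof SplittableCompressionCodec);
        System.out.println(file + " splittable: " + splittable);
    }
}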
An lzo-compressed file has the same problem, but lzo becomes splittable once the file has been preprocessed with the indexer tool from the hadoop-lzo library. bzip2 supports splitting natively.
So how do you select a compression format? That depends on factors such as file size, the file format, and the tools you are using. Here are some suggestions, roughly ordered from most to least efficient:
1. Use a container file format that has built-in compression and supports splitting, such as SequenceFile, RCFile, or Avro data files (these file formats will be discussed later). For fast compression, pair them with lzo, lz4, or snappy (see the sketch after this list).
2. Use a compression format that supports splitting, such as bzip2, or lzo with an index.
3. Divide the file into chunks in advance and compress each chunk separately; this removes the need to worry about splitting at all.
4. Do not compress the files at all.
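For option 1, here is a minimal sketch of configuring a MapReduce job to write block-compressed SequenceFile output with snappy (the rest of the job setup is omitted, and the codec choice is just an example):

import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CompressedOutput {
    // Configure a job to write block-compressed SequenceFiles with snappy.
    // BLOCK compression compresses groups of records together, which usually
    // gives a better ratio than compressing each record on its own.
    public static void configureOutput(Job job) {
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
    }
}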
In general, it is a bad idea to store a large data file in a compression format that does not support splitting: the resulting loss of data locality makes processing very inefficient.
Thanks to Tom White. Most of this article comes from his Hadoop: The Definitive Guide; since the Chinese translation is quite poor, I worked from the original English edition and added my own understanding, along with material from the official documentation. Consider this a set of reading notes.