File compression has two main advantages: it reduces the space needed to store files, and it speeds up data transfer. Both matter a great deal at the data volumes Hadoop handles, so I'm going to look at file compression in Hadoop.
Hadoop supports many compression formats; see the following table:
DEFLATE is a lossless data compression algorithm that combines the LZ77 algorithm with Huffman coding; an implementation can be found in the zlib library. The gzip format is based on DEFLATE, with an additional header and trailer.
Every compression algorithm trades space against time: it can compress faster, or it can produce smaller output. Most tools let you choose with a parameter, where -1 optimizes for speed and -9 optimizes for space. Taking gzip as an example, the following command favors compression speed:
gzip -1 file
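The same speed-versus-size setting is exposed by most DEFLATE libraries. As a rough illustration, the Python sketch below compresses a made-up sample with the standard gzip module at levels 1 and 9. The sample data and the helper name compressed_sizes are assumptions for illustration only; real ratios depend entirely on your data.

```python
import gzip


def compressed_sizes(data: bytes) -> dict:
    """Return the gzip-compressed size of data at level 1 (fastest) and level 9 (smallest)."""
    return {level: len(gzip.compress(data, compresslevel=level)) for level in (1, 9)}


# Hypothetical sample input: repetitive log-like records.
sample = b"".join(
    b"record %d: hadoop compression trade-off\n" % i for i in range(20000)
)

sizes = compressed_sizes(sample)
print(f"original:           {len(sample)} bytes")
print(f"level 1 (fastest):  {sizes[1]} bytes")
print(f"level 9 (smallest): {sizes[9]} bytes")
```

Level 9 spends more CPU time searching for matches, which is why it usually produces the smaller file.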
gzip sits in the middle of the time/space trade-off. bzip2 compresses more effectively than gzip, but it is slower. bzip2 decompresses faster than it compresses, but it is still the slowest of the formats compared here, although its compression ratio is clearly the best. Snappy and LZ4 decompress much faster than LZO.
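To see this trade-off on your own data, you can time the codecs directly. The following is a minimal sketch that uses only Python's standard library, so it covers gzip and bzip2 but not LZO, LZ4, or Snappy, which need third-party packages. The helper name measure and the sample data are hypothetical, and the timings you get will vary by machine and input.

```python
import bz2
import gzip
import time


def measure(name, compress, decompress, data):
    """Time one compress/decompress round trip and report the compression ratio."""
    start = time.perf_counter()
    packed = compress(data)
    compress_s = time.perf_counter() - start

    start = time.perf_counter()
    restored = decompress(packed)
    decompress_s = time.perf_counter() - start

    return {
        "codec": name,
        "ratio": len(packed) / len(data),  # smaller means better compression
        "compress_s": compress_s,
        "decompress_s": decompress_s,
        "lossless": restored == data,
    }


# Hypothetical sample input: repetitive log-like records.
sample = b"".join(
    b"record %d: hadoop compression trade-off\n" % i for i in range(20000)
)

results = [
    measure("gzip", gzip.compress, gzip.decompress, sample),
    measure("bzip2", bz2.compress, bz2.decompress, sample),
]
for r in results:
    print(
        f"{r['codec']:6s} ratio={r['ratio']:.3f} "
        f"compress={r['compress_s']:.4f}s decompress={r['decompress_s']:.4f}s"
    )
```

A single run on a small sample is noisy; for a decision that matters, repeat the measurement on a representative file.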
The following is a comparison of the efficiency of some of these compression algorithms:
Splittable indicates whether the compressed format can be split, that is, whether a reader can seek to an arbitrary point in the stream and start reading from there. Whether compressed data can be split is critical to whether MapReduce can process it efficiently.
For example, take an uncompressed file of 1GB. With an HDFS block size of 64MB (the default in older Hadoop releases), the file is stored as 16 blocks, and a MapReduce job that uses it as input creates 16 splits, each handled by its own map task. What if the file is compressed with gzip? Treating each of the 16 blocks as a separate input does not work, because a gzip stream cannot be read starting from an arbitrary point. In practice, MapReduce recognizes that the file is gzip-compressed and that gzip does not support splitting, so it does not split the file: a single map task processes all 16 blocks. Most of those blocks are not local to that map task, so data locality is lost and the whole job takes a long time.
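Both halves of this example can be checked with a short Python sketch: the block arithmetic, and the fact that a gzip stream cannot be decoded from the middle. The helper names block_count and readable_from and the sample data are hypothetical, and the sample is small enough to run anywhere; this only illustrates the property and is not how Hadoop itself reads input.

```python
import gzip
import zlib

FILE_SIZE = 1024 * 1024 * 1024  # the 1GB file from the example
BLOCK_SIZE = 64 * 1024 * 1024   # the 64MB HDFS block size from the example


def block_count(file_size: int, block_size: int) -> int:
    """Number of blocks needed to store file_size bytes (ceiling division)."""
    return (file_size + block_size - 1) // block_size


def readable_from(blob: bytes, offset: int) -> bool:
    """Return True if a gzip reader can start decoding blob at the given byte offset."""
    try:
        gzip.decompress(blob[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # No gzip header at this offset, or the remaining bytes do not decode.
        return False


print(block_count(FILE_SIZE, BLOCK_SIZE))  # 16

# Hypothetical sample input standing in for the large file.
sample = b"".join(b"record %d\n" % i for i in range(50000))
blob = gzip.compress(sample)

print(readable_from(blob, 0))               # True: the start of the stream
print(readable_from(blob, len(blob) // 2))  # False: the middle of the stream
```

A map task handed only the second half of the compressed bytes is in the same position as the second call: it has no header and no decoder state to start from.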
The LZO format has the same problem, but the indexer tool in the Hadoop LZO library can build an index that makes LZO files splittable. bzip2 is also splittable.
So how do you choose a compression format? It depends on the size of the files and the compression tools you use. Here are some recommendations, ordered from most to least efficient:
1. Use a file format that has compression built in and supports splitting, such as SequenceFile, RCFile, or Avro files, which we'll talk about later. If fast compression is the goal, use the LZO, LZ4, or Snappy codec with it.
2. Use a compression format that is splittable, such as bzip2, or LZO made splittable by indexing.
3. Split the file into chunks in advance and compress each chunk separately, so that splittability is no longer a concern.
4. Do not compress files.
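Option 3 in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name compress_in_chunks, the 16KB chunk limit, and the sample records are hypothetical, and in practice you would pick a chunk size so that each compressed chunk is close to the HDFS block size.

```python
import gzip


def compress_in_chunks(lines, max_chunk_bytes):
    """Group whole lines into chunks of at most max_chunk_bytes and gzip each chunk separately."""
    chunks, current, size = [], [], 0
    for line in lines:
        # Close the current chunk before it would exceed the limit,
        # so that no record is cut in half.
        if current and size + len(line) > max_chunk_bytes:
            chunks.append(gzip.compress(b"".join(current)))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append(gzip.compress(b"".join(current)))
    return chunks


# Hypothetical sample input: newline-terminated records.
lines = [b"record %d\n" % i for i in range(10000)]
chunks = compress_in_chunks(lines, max_chunk_bytes=16 * 1024)

# Each chunk is a complete gzip stream, so any one can be read on its own.
last = gzip.decompress(chunks[-1])
print(len(chunks), "chunks; last chunk holds", last.count(b"\n"), "records")
```

Because every chunk is an independent gzip stream that ends on a record boundary, each one can be handed to a separate map task even though gzip itself is not splittable.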
Note: Do not store a large data file in a compressed format that is not splittable. A single map task has to process the whole file, and the resulting non-local processing is very inefficient.
Hadoop data compression