File compression has two main benefits: it reduces the space needed to store files, and it speeds up data transfer. In the context of big data on Hadoop, both points are particularly important, so let's look at how file compression works in Hadoop.
Hadoop supports a wide variety of compression formats; the table below summarizes the common ones:

Format     Tool    Algorithm   File extension   Splittable
DEFLATE    N/A     DEFLATE     .deflate         No
gzip       gzip    DEFLATE     .gz              No
bzip2      bzip2   bzip2       .bz2             Yes
LZO        lzop    LZO         .lzo             No (yes if indexed)
LZ4        N/A     LZ4         .lz4             No
Snappy     N/A     Snappy      .snappy          No
Deflate is a lossless data compression algorithm that combines LZ77 with Huffman coding; its reference implementation can be found in the zlib library. Gzip is a format built on the deflate algorithm.
All compression algorithms trade space against time: you can ask for faster compression or for a smaller output, usually via a parameter where -1 means optimize for speed and -9 means optimize for space. Taking gzip as an example, the following compresses as fast as possible:
gzip -1 file
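The same level-based trade-off is visible from Java, since java.util.zip is the JDK's binding to zlib's DEFLATE implementation. Below is a minimal sketch (class name and sample data are made up for illustration) that compresses one buffer at level 1 and at level 9 and prints the resulting sizes:

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class DeflateLevels {

    // Compress the input at the given level and return the compressed size.
    static int compressedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level); // 1 = BEST_SPEED, 9 = BEST_COMPRESSION
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.size();
    }

    public static void main(String[] args) {
        // Made-up, repetitive sample data so the difference is visible.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("some fairly repetitive sample text ");
        }
        byte[] data = sb.toString().getBytes();
        System.out.println("level 1 (speed): " + compressedSize(data, Deflater.BEST_SPEED) + " bytes");
        System.out.println("level 9 (space): " + compressedSize(data, Deflater.BEST_COMPRESSION) + " bytes");
    }
}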
Gzip sits in the middle of this time/space trade-off. bzip2 compresses more effectively than gzip but is slower; its decompression is faster than its compression, yet it is still the slowest of these formats, while achieving the best compression ratio. Snappy and LZ4 decompress much faster than LZO.
The Splittable column indicates whether a compression format can be split, that is, whether reading can start from an arbitrary point in the compressed stream. Whether compressed data can be split is critical for processing it with MapReduce.
For example, suppose an uncompressed file is 1 GB and the default HDFS block size is 64 MB: the file is stored as 16 blocks, and MapReduce uses each block as the input of a separate map task. If the file is gzip-compressed, splitting it into 16 blocks and giving each block to its own map task is not possible, because a gzip stream cannot be read starting from an arbitrary position. In practice, when MapReduce handles a compressed file it recognizes that the file is gzip-compressed, knows that gzip does not support splitting, and hands all 16 blocks to a single map task. That task does a lot of non-local processing, and the whole job takes much longer.
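This decision is essentially the check that Hadoop's text input format performs: resolve a codec from the file extension and treat the file as splittable only if there is no codec (plain text) or the codec implements SplittableCompressionCodec. A rough sketch, with made-up file names for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplitCheck {
    public static void main(String[] args) {
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        // Hypothetical file names: uncompressed, gzip-compressed, bzip2-compressed.
        for (String name : new String[] {"logs.txt", "logs.txt.gz", "logs.txt.bz2"}) {
            CompressionCodec codec = factory.getCodec(new Path(name));
            // No codec means plain text (splittable); otherwise only codecs that
            // implement SplittableCompressionCodec (e.g. bzip2) can be split.
            boolean splittable = codec == null || codec instanceof SplittableCompressionCodec;
            System.out.println(name + " -> "
                    + (codec == null ? "no codec" : codec.getClass().getSimpleName())
                    + ", splittable=" + splittable);
        }
    }
}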
The LZO format has the same problem, but if LZO files are preprocessed with the indexing tool from the Hadoop LZO library, LZO becomes splittable. bzip2 supports splitting natively.
So how do you choose a compression format? It depends on the size of your files and the tools you use. The following suggestions are ordered from most to least efficient:
1. Use a container file format that has compression built in and supports splitting, such as SequenceFile, RCFile, or Avro data files, which we will cover later; combine it with a fast compression codec such as LZO, LZ4, or Snappy (see the sketch after this list).
2. Use a compression format that supports splitting, such as bzip2, or LZO made splittable through indexing.
3. Split the file into chunks in advance and compress each chunk separately, so splittability no longer matters.
4. Do not compress files
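As an illustration of the first suggestion, here is a minimal sketch that writes a block-compressed SequenceFile using the Snappy codec; the container format takes care of splitting, so the codec does not have to. The output path and record contents are made up, and running it requires the native Snappy library to be available to Hadoop:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFileWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("events.seq"); // hypothetical output path
        CompressionCodec codec = ReflectionUtils.newInstance(SnappyCodec.class, conf);
        // BLOCK compression compresses batches of records together, which
        // usually gives a better ratio than compressing each record on its own.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, codec))) {
            for (long i = 0; i < 100; i++) {
                writer.append(new LongWritable(i), new Text("record-" + i)); // made-up records
            }
        }
    }
}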
Storing a large data file in a compression format that does not support splitting is not appropriate; the resulting non-local processing is very inefficient.
Thanks to Tom White: most of this article comes from his Definitive Guide. The Chinese translation is quite poor, so I worked from the English original and some official documentation and added some of my own understanding.