The "can be sliced" field description in the Hadoop compression format


File compression has two benefits: it reduces the disk space needed to store files, and it speeds up data transfer over the network and to and from disk.
For storage, every compression algorithm trades space against time; for processing, it trades CPU cost against transfer speed.

The following table lists the compression formats commonly used with Hadoop:

Compression format   Tool    Algorithm   File extension   Splittable
DEFLATE              None    DEFLATE     .deflate         No
gzip                 gzip    DEFLATE     .gz              No
bzip2                bzip2   bzip2       .bz2             Yes
LZO                  lzop    LZO         .lzo             No (yes if indexed)
LZ4                  None    LZ4         .lz4             No
Snappy               None    Snappy      .snappy          No

In the table above the other columns are self-explanatory, but what does "splittable" mean?
The official definition is whether the compression format supports seeking to an arbitrary position in the data stream and reading the data onward from that point.
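
To see what that check looks like in practice, here is a minimal sketch, assuming a Hadoop 2.x or later client library is on the classpath. It mirrors the test that TextInputFormat performs: it resolves a codec from the file name's extension and asks whether that codec implements SplittableCompressionCodec (as the bzip2 codec does). The input path is just an illustrative command-line argument.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class SplittableCheck {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // CompressionCodecFactory maps file extensions (.gz, .bz2, ...) to codec classes.
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            Path input = new Path(args[0]);               // e.g. /data/logs.bz2 (illustrative path)
            CompressionCodec codec = factory.getCodec(input);

            if (codec == null) {
                System.out.println("Uncompressed: each HDFS block can become its own input split.");
            } else if (codec instanceof SplittableCompressionCodec) {
                System.out.println(codec.getClass().getSimpleName()
                        + ": splittable, blocks can be processed by parallel map tasks.");
            } else {
                System.out.println(codec.getClass().getSimpleName()
                        + ": NOT splittable, the whole file goes to a single map task.");
            }
        }
    }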

Consider, as an example, a 1 GB file stored uncompressed in HDFS. If the HDFS block size is 128 MB, the file is stored as 8 blocks. When the file is used as input to a MapReduce or Spark job, 8 map tasks are created, each taking one block as its input, and the blocks are processed in parallel.

Now suppose the file is compressed with gzip and, after compression, is still 1 GB. As before, HDFS stores it as 8 blocks. This time, however, no map task can process its block independently of the others. The official explanation is that the data stored in HDFS has been cut into blocks, and a gzip stream cannot be read starting from an arbitrary position. A more intuitive way to put it: think of a complete file as having a beginning marker and an end marker; once it is split into blocks, some blocks carry the beginning, some carry the end, and most carry neither, so multiple tasks cannot decompress and process the file in parallel. For such a non-splittable file, all of its HDFS blocks have to be handed to a single map task. Most of those blocks do not live on the node running that task, so they must be transferred across the network, and since nothing runs in parallel the job can take a long time.
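
The effect on parallelism boils down to a bit of arithmetic. The sketch below is a deliberately simplified illustration of the idea (it is not Hadoop's actual FileInputFormat.getSplits code): a splittable 1 GB file with a 128 MB block size yields 8 input splits, while a non-splittable one yields a single split.

    public class SplitCountSketch {
        // Simplified model: a non-splittable file becomes one split; a splittable
        // file becomes roughly one split per HDFS block (rounding up).
        static long numSplits(long fileSize, long blockSize, boolean splittable) {
            if (!splittable) {
                return 1;                                  // one map task reads everything
            }
            return (fileSize + blockSize - 1) / blockSize; // ceiling division
        }

        public static void main(String[] args) {
            long oneGiB = 1024L * 1024 * 1024;
            long block  = 128L * 1024 * 1024;              // 128 MB HDFS block size
            System.out.println("uncompressed or bzip2: " + numSplits(oneGiB, block, true)  + " splits"); // 8
            System.out.println("gzip:                  " + numSplits(oneGiB, block, false) + " splits"); // 1
        }
    }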

Note that an LZO-compressed file runs into the same problem, because this format also does not support reading from an arbitrary position (the compressed stream carries no synchronization points). However, the indexer tool shipped with the hadoop-lzo library can be run over an LZO file as a preprocessing step to build a split index; with that index in place, and with the appropriate input format configured for the job, the file effectively becomes splittable, as the sketch below illustrates.
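
As a sketch of how that fits together, the driver below assumes the open-source hadoop-lzo library (the com.hadoop.compression.lzo and com.hadoop.mapreduce packages) is on the classpath; the jar location and file paths in the comments are illustrative assumptions, not prescriptions. The index is built once as a preprocessing step, and the job then uses the LZO-aware input format so the file is split across multiple map tasks.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    import com.hadoop.mapreduce.LzoTextInputFormat;

    public class LzoJobDriver {
        public static void main(String[] args) throws Exception {
            // Step 1 (run once, outside this driver): build the split index next to
            // the .lzo file, for example:
            //   hadoop jar /path/to/hadoop-lzo.jar \
            //       com.hadoop.compression.lzo.DistributedLzoIndexer /data/big.lzo
            // which writes /data/big.lzo.index.

            // Step 2: submit a job whose input format reads that index and splits
            // the file at the LZO block boundaries the index records.
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "lzo-splittable-example");
            job.setJarByClass(LzoJobDriver.class);
            job.setInputFormatClass(LzoTextInputFormat.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));   // indexed .lzo input
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // job output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }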
