Using the LZO compression algorithm in Hadoop reduces the size of the data as well as the time spent reading it from and writing it to disk. On top of that, LZO files are made up of blocks, so LZO-compressed data can be split into chunks and processed in parallel by Hadoop. This makes LZO a very useful compression format on Hadoop.
LZO by itself is not splittable, so when the data is in plain text format, an LZO-compressed file used as job input is handled by a single map task. A SequenceFile, however, is organized into blocks, so writing the data as a SequenceFile and compressing it with the LZO codec makes the LZO-compressed file splittable.
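As a concrete illustration, here is a minimal sketch (my own example, not code from the original article) of how a job could be configured to write block-compressed SequenceFile output with the LZO codec. It assumes the hadoop-lzo library is installed and provides com.hadoop.compression.lzo.LzoCodec; adjust the class name to whatever your installation actually ships.

    // Minimal sketch: write job output as a block-compressed SequenceFile
    // using the LZO codec, so downstream jobs can split the compressed output.
    // Assumes hadoop-lzo is on the classpath (com.hadoop.compression.lzo.LzoCodec).
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    import com.hadoop.compression.lzo.LzoCodec;

    public class LzoSequenceFileOutput {
        // Call this from the job driver after creating the Job.
        public static void configureOutput(Job job, Path outputDir) {
            job.setOutputFormatClass(SequenceFileOutputFormat.class);
            SequenceFileOutputFormat.setOutputPath(job, outputDir);
            // Compress a block of records at a time; the SequenceFile block
            // structure is what keeps the LZO-compressed output splittable.
            SequenceFileOutputFormat.setCompressOutput(job, true);
            SequenceFileOutputFormat.setOutputCompressorClass(job, LzoCodec.class);
            SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
        }
    }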
Since compressed data is usually only about a quarter of the size of the original, storing compressed data in HDFS lets the cluster hold more data and extends the useful life of the cluster. Moreover, because MapReduce jobs are usually I/O-bound, storing compressed data means fewer I/O operations and faster jobs. However, there are two awkward points about using compression on Hadoop. First, some compression formats cannot be split into chunks for parallel processing, gzip for example. Second, some other formats can be split, but their decompression is so slow that the job bottleneck shifts to the CPU, bzip2 for example.
Suppose we have a 1.1GB gzip file stored in HDFS with a 128MB block size, so it is split into 9 blocks. To process each block in parallel in MapReduce, every mapper would have to start reading at some arbitrary byte offset in the file, but the context dictionary that gzip decompression relies on would be empty at that point, which means the blocks of a gzip-compressed file cannot be decompressed independently and in parallel on Hadoop. As a result, a large gzip-compressed file on Hadoop can only be handled by a single mapper, which is inefficient and no better than not using MapReduce at all. The bzip2 format, on the other hand, can be split into blocks, but its decompression is very slow and cannot keep up with streaming reads, so it cannot be used efficiently in Hadoop either; even if it is used, the slow decompression shifts the job bottleneck to the CPU.
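The splittability of a codec can also be checked programmatically. The sketch below (my own illustration, not from the article) uses Hadoop's CompressionCodecFactory to look up the codec for a file by its extension and reports whether the codec itself supports splitting: plain gzip does not, bzip2 does, and an .lzo file only becomes splittable once an external index is built for it, as described later. The .lzo lookup assumes hadoop-lzo's LzopCodec has been registered in io.compression.codecs.

    // Illustrative sketch: which codecs are inherently splittable?
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class SplittabilityCheck {
        public static void main(String[] args) {
            CompressionCodecFactory factory =
                    new CompressionCodecFactory(new Configuration());
            for (String file : new String[] {"logs.gz", "logs.bz2", "logs.lzo"}) {
                CompressionCodec codec = factory.getCodec(new Path(file));
                if (codec == null) {
                    System.out.println(file + ": no codec registered");
                } else {
                    // gzip: false; bzip2: true (but slow to decompress);
                    // lzo: false here, because splitting relies on the external index.
                    boolean splittable = codec instanceof SplittableCompressionCodec;
                    System.out.println(file + ": " + codec.getClass().getSimpleName()
                            + ", splittable = " + splittable);
                }
            }
        }
    }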
A compression algorithm that can be split into chunks for parallel processing and is also very fast would be ideal. That algorithm is LZO. An LZO-compressed file is made up of many small blocks (about 256KB each), so a Hadoop job can split its input along block boundaries. LZO was also designed with efficiency in mind: its decompression speed is about twice that of gzip, which lets it save a lot of disk reads and writes. Its compression ratio is worse than gzip's (an LZO-compressed file is roughly 50% larger than the corresponding gzip file), but it is still only about 20%-50% of the size of the uncompressed file, so jobs still run considerably faster. Below is a set of comparison figures, measured on 8.0GB of uncompressed data:
compressed format | file | size (GB) | compression time (s) | decompression time (s)
None | some_logs | 8.0 | - | -
Gzip | some_logs.gz | 1.3 | 241 |
LZO | some_logs.lzo | 2.0 | / | /
As you can see, the LZO-compressed file is somewhat larger than the gzip-compressed file, but both are still far smaller than the original, and LZO compresses almost 5 times faster than gzip and decompresses about twice as fast. LZO files can be split along block boundaries: for a 1.1GB LZO-compressed file, the mapper that processes the second 128MB block must be able to find the next block boundary before it can decompress anything. LZO does not write extra header data into the file to make this possible; instead, an index file (foo.lzo.index) is created alongside each foo.lzo file. The index file simply records the offset of each block in the data, so with the offsets known, reads are very fast, usually reaching 90-100 MB/second, that is, 10-12 seconds to read 1GB. Once the index file has been created, any LZO-compressed file can be split by loading the index and read block by block. Each mapper therefore gets the right block, which means LZO can be used in parallel and efficiently in Hadoop MapReduce simply by wrapping the input in an LzopInputStream. If a job currently uses TextInputFormat as its input format, you can compress the input with lzop, make sure the index is created correctly, and change TextInputFormat to LzoTextInputFormat; the job then runs as correctly as before, only faster. If the index is not created, a large LZO-compressed file can still be processed, but by a single mapper, without being split.
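As a sketch of what that change looks like in practice (my own illustration, assuming the twitter/hadoop-lzo library with its LzoIndexer and LzoTextInputFormat classes is available; class and package names may differ in your version):

    // Sketch of a driver that indexes its .lzo input and then runs with
    // LzoTextInputFormat instead of TextInputFormat. Mapper/reducer setup is
    // omitted and would be the same as in the original TextInputFormat job.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    import com.hadoop.compression.lzo.LzoIndexer;
    import com.hadoop.mapreduce.LzoTextInputFormat;

    public class LzoJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path input = new Path(args[0]);
            Path output = new Path(args[1]);

            // One-off step: write foo.lzo.index next to each foo.lzo so the
            // input can be split along LZO block boundaries.
            new LzoIndexer(conf).index(input);

            Job job = Job.getInstance(conf, "lzo input job");
            job.setJarByClass(LzoJobDriver.class);
            // The only change from a plain-text job: use LzoTextInputFormat.
            job.setInputFormatClass(LzoTextInputFormat.class);
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);
            // job.setMapperClass(...); job.setReducerClass(...); as before.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }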