[Repost] Using LZO Compression in Hadoop

Source: Internet
Author: User

Using the LZO compression algorithm in Hadoop reduces both the size of the data and the time spent reading it from and writing it to disk. Beyond that, an LZO file is organized into blocks, which allows the data to be broken into chunks and processed in parallel by Hadoop. This makes LZO a very useful compression format on Hadoop.

LZO itself is not splittable, so when the data is plain text, an LZO-compressed file used as job input is handled as a single map task per file. A SequenceFile, however, is stored in blocks, so a SequenceFile written with the LZO compression format can still be split.
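As an illustration, here is a minimal sketch (not from the original article) of a job that writes block-compressed SequenceFile output using the LzoCodec from the hadoop-lzo library; the class and method names are the standard Hadoop/hadoop-lzo ones, but adjust them to your cluster and versions:

// Sketch: write job output as a block-compressed SequenceFile with LzoCodec,
// so that downstream jobs can split the compressed data.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import com.hadoop.compression.lzo.LzoCodec;

public class LzoSequenceFileJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "write lzo-compressed sequencefile");
    job.setJarByClass(LzoSequenceFileJob.class);

    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);

    // Block-level compression inside the SequenceFile keeps the output splittable:
    // splits fall on SequenceFile sync points rather than relying on LZO itself.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, LzoCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}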

Since compressed data is usually only about 1/4 the size of the original, storing compressed data in HDFS lets the cluster hold more data and extends its useful life. Moreover, because MapReduce jobs are usually bottlenecked on I/O, storing compressed data means fewer I/O operations and faster jobs. There are, however, two awkward points about using compression on Hadoop. First, some compression formats cannot be split into chunks and processed in parallel, gzip for example. Second, some other formats do support splitting, but decompress so slowly that the job bottleneck simply moves to the CPU, bzip2 for example.

Suppose we have a 1.1 GB gzip file stored in HDFS with a 128 MB block size, so it is divided into 9 blocks. To process each block in parallel in MapReduce, every mapper would depend on all the data that precedes its block: the second mapper starts at an arbitrary byte offset in the file, where the context dictionary gzip needs for decompression is empty. This means a gzip-compressed file cannot be processed correctly in parallel on Hadoop. As a result, a large gzip file on Hadoop can only be handled by a single mapper, which is inefficient and really no different from not using MapReduce at all. The other format, bzip2, compresses well and its output can even be split into blocks, but its decompression is very, very slow and it cannot be read as a stream, so it cannot be used efficiently on Hadoop either. Even if it is used, its slow decompression shifts the job bottleneck onto the CPU.
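To make the distinction concrete, the small sketch below (an assumed setup, not part of the original article) instantiates Hadoop's GzipCodec and BZip2Codec and checks which of them implements the SplittableCompressionCodec interface; only the bzip2 codec does, which is why a .gz file always falls to a single mapper:

// Sketch: ask Hadoop which codecs advertise splittability.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CodecSplittabilityCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    Class<?>[] codecClasses = { GzipCodec.class, BZip2Codec.class };
    for (Class<?> c : codecClasses) {
      CompressionCodec codec =
          (CompressionCodec) ReflectionUtils.newInstance(c, conf);
      // GzipCodec does not implement SplittableCompressionCodec; BZip2Codec does,
      // but its slow decompression remains the bottleneck described above.
      boolean splittable = codec instanceof SplittableCompressionCodec;
      System.out.println(codec.getClass().getSimpleName()
          + " splittable: " + splittable);
    }
  }
}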

A compression algorithm that can be split into chunks for parallel processing and is also very fast would be ideal, and that is exactly what LZO offers. An LZO-compressed file is made up of many small blocks (about 256 KB), so a Hadoop job can split its input along block boundaries. LZO was also designed with efficiency in mind: it decompresses about twice as fast as gzip, which saves a great deal of disk reading and writing. Its compression ratio is lower than gzip's, and the compressed file is roughly half again as large as the gzip equivalent, but it is still only about 20%-50% of the size of the uncompressed file, so jobs can run much faster. The following comparison was made with 8.0 GB of uncompressed data:

Compression format   File            Size (GB)   Compression time (s)   Decompression time (s)
None                 some_logs       8.0         -                      -
Gzip                 some_logs.gz    1.3         241                    -
LZO                  some_logs.lzo   2.0         -                      -

As you can see, the LZO-compressed file is slightly larger than the gzip-compressed file, but still much smaller than the original file; LZO compresses almost five times faster than gzip and decompresses about twice as fast.

LZO files can be split along block boundaries. For a 1.1 GB LZO-compressed file, for example, the mapper that processes the second 128 MB block must be able to find the next block boundary before it can start decompressing. LZO does not write any extra data headers to make this possible; instead, an LZO index file (foo.lzo.index) is created alongside each foo.lzo file. The index simply records the offset of each block in the data, so reads are very fast because the offsets are known: typically 90-100 MB/s, that is, about 10-12 seconds per GB. Once the index file has been created, any LZO-compressed file can be split by loading the index and read block by block. Each mapper therefore gets the right blocks, which means LZO can be used in parallel and efficiently in Hadoop MapReduce; all that is needed is an LzopInputStream wrapper.

If a job's InputFormat is currently TextInputFormat, you can compress the input files with lzop, make sure their indexes are created correctly, and change TextInputFormat to LzoTextInputFormat; the job then runs just as correctly as before, only faster. Sometimes a large file even compresses well enough under LZO that it can be handled by a single mapper without needing to be split at all.
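A hedged sketch of that pattern (not the article's own code) is shown below: the .lzo inputs are indexed once, then the job simply switches from TextInputFormat to the LzoTextInputFormat shipped with the hadoop-lzo library. The paths are examples only.

// The indexer is typically run once per file from the command line, e.g.:
//   hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer /data/logs
// (or the DistributedLzoIndexer variant when there are many files).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.hadoop.mapreduce.LzoTextInputFormat;

public class LzoInputJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "job over splittable lzo input");
    job.setJarByClass(LzoInputJob.class);

    // The only change from a plain-text job: LzoTextInputFormat instead of
    // TextInputFormat. With foo.lzo.index files present, each split starts at a
    // known LZO block boundary, so the mappers can run in parallel.
    job.setInputFormatClass(LzoTextInputFormat.class);

    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path("/data/logs"));      // example path
    FileOutputFormat.setOutputPath(job, new Path("/data/output"));  // example path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}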
