Enable Hadoop to support splittable LZO compression


The LZO compression algorithm can be used in Hadoop not only to reduce data size and disk read/write time, but also, because LZO is block-based, to let the data be split into chunks and processed by Hadoop in parallel. This property makes LZO a very useful compression format on Hadoop.

A plain LZO file is not itself splittable, so when the data is in text format, an LZO-compressed file is handed to a job as a single input and processed by a single map task. A SequenceFile, however, is splittable by design, so writing data in the SequenceFile format with LZO compression does make the LZO-compressed data splittable.
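
As a minimal sketch of that combination (assuming hadoop-lzo's LzoCodec is on the classpath, and using illustrative key/value types and a placeholder output path), a SequenceFile can be written with block-level LZO compression like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.util.ReflectionUtils;
    import com.hadoop.compression.lzo.LzoCodec;

    public class LzoSequenceFileWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path out = new Path("/tmp/some_logs.seq");  // placeholder output path

            // ReflectionUtils wires the Configuration into the codec instance.
            CompressionCodec codec = ReflectionUtils.newInstance(LzoCodec.class, conf);

            // BLOCK compression compresses batches of records together; combined
            // with the SequenceFile sync markers, this keeps the file splittable.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, out, LongWritable.class, Text.class,
                    SequenceFile.CompressionType.BLOCK, codec);
            try {
                writer.append(new LongWritable(1L), new Text("example record"));
            } finally {
                writer.close();
            }
        }
    }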

Since compressed data is usually only about a quarter the size of the original, storing compressed data on HDFS lets the cluster hold more data and extends its useful life. In addition, because MapReduce jobs are usually I/O-bound, storing compressed data means fewer I/O operations and more efficient jobs. However, compression on Hadoop also has annoyances. First, some compression formats cannot be split and processed in parallel, such as gzip. Second, other formats can be split into blocks but decompress so slowly that the job bottleneck shifts to the CPU, such as bzip2. For example, suppose a 1 GB gzip file is stored on HDFS with a 128 MB block size, so it spans roughly nine blocks. To process each block in parallel in MapReduce, a separate mapper would have to start reading at each block boundary; but a mapper that starts at an arbitrary byte offset (say, the second block) begins with an empty gzip context dictionary, because gzip decompression depends on everything that came before. This means gzip-compressed files cannot be correctly processed in parallel on Hadoop. A large gzip file on Hadoop can therefore only be handled by a single mapper, which is inefficient and no different from not using MapReduce at all. As for bzip2, although it compresses well and its files can even be split, decompression is very slow and cannot keep pace with streaming reads, so it cannot be used efficiently on Hadoop either; even where it is used, its poor decompression efficiency shifts the job bottleneck to the CPU.

A compression format that can be split for parallel processing and is also very fast would be ideal, and LZO is exactly that. An LZO-compressed file is composed of many small blocks (about 256 KB each), so Hadoop jobs can be split along block boundaries. LZO was also designed with efficiency in mind: its decompression speed is about twice that of gzip, which saves a great deal of disk read/write time. Its compression ratio is not as good as gzip's - the compressed file is noticeably larger than the corresponding gzip file - but it is still far smaller than the uncompressed data, saving a large fraction of the storage space and greatly speeding up job execution. The following comparison uses an 8.0 GB uncompressed log file:

Compression format   File            Size (GB)   Compression time (s)   Decompression time (s)
None                 some_logs       8.0         -                       -
Gzip                 some_logs.gz    1.3         241                     72
LZO                  some_logs.lzo   2.0         55                      35

As the table shows, the LZO-compressed file is somewhat larger than the gzip-compressed file but still much smaller than the original; LZO compresses almost five times faster than gzip and decompresses about twice as fast. LZO files can be split at block boundaries: for example, with a 1.1 GB LZO-compressed file, the mapper that processes the second 128 MB HDFS block must be able to locate the boundary of the next LZO block in order to decompress it. LZO itself does not write any marker into the data to make this possible; instead, hadoop-lzo builds an LZO index file, written for every foo.lzo file as foo.lzo.index. The index file simply records the offset of each block in the data, so reads at known offsets become very fast - generally 90-100 MB/second, i.e. 10-12 seconds to read a 1 GB file. Once the index file has been created, any LZO-compressed file can be split into the correct chunks by loading its index and reading one block at a time, so each mapper gets exactly the right blocks. This means that, by wrapping the input in an LzopInputStream, LZO can be used in parallel and efficiently in Hadoop MapReduce. If a job's input format is TextInputFormat, you can compress the input files with lzop, make sure their indexes have been created, replace TextInputFormat with LzoTextInputFormat, and the job will run correctly as before, only faster. Sometimes a large file, once compressed with LZO, can even be handled efficiently by a single mapper without being split at all.
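
Indexes are normally created with the command shown later in this article, but they can also be built from code. Here is a minimal sketch, assuming the LzoIndexer class from the hadoop-lzo project and its index(Path) method, with a placeholder HDFS path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import com.hadoop.compression.lzo.LzoIndexer;

    public class IndexLzoLogs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // LzoIndexer walks the given path and writes a foo.lzo.index file
            // next to every foo.lzo file it finds.
            LzoIndexer indexer = new LzoIndexer(conf);
            indexer.index(new Path("hdfs://namenode:9000/lzo_logs"));  // placeholder path
        }
    }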

Install LZO in a Hadoop cluster

Setting up an LZO environment on Hadoop is quite easy:

  1. Install the lzop native libraries.
    Example: sudo yum install lzop lzo2
  2. Download the hadoop-lzo support source code from: http://github.com/kevinweil/hadoop-lzo
  3. Build the checked-out code; usually: ant compile-native tar
  4. Deploy the compiled hadoop-lzo-*.jar to a valid directory on every slave in the Hadoop cluster, such as $HADOOP_HOME/lib
  5. Deploy the compiled hadoop-lzo native library binaries to a valid directory on every node as well, such as $HADOOP_HOME/lib/native/Linux-amd64-64
  6. Add the following to core-site.xml:
    <property>
      <name>io.compression.codecs</name>
      <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
    </property>
    <property>
      <name>io.compression.codec.lzo.class</name>
      <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
  7. Add the following to mapred-site.xml:
    <property>
      <name>mapred.child.env</name>
      <value>JAVA_LIBRARY_PATH=/path/to/your/native/hadoop-lzo/libs</value>
    </property>
  8. If you also want MapReduce to compress its intermediate (map) output, add the following to mapred-site.xml as well:
    <property>
      <name>mapred.map.output.compression.codec</name>
      <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
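
With the jars, native libraries, and configuration in place, a quick sanity check (a minimal sketch; the class name here is made up for illustration) is to ask Hadoop's CompressionCodecFactory whether a .lzo path resolves to the LZO codec:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class CheckLzoCodec {
        public static void main(String[] args) {
            // Loads core-site.xml from the classpath, including io.compression.codecs.
            Configuration conf = new Configuration();
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);

            // If hadoop-lzo is configured correctly, this prints
            // com.hadoop.compression.lzo.LzopCodec.
            CompressionCodec codec = factory.getCodec(new Path("some_logs.lzo"));
            System.out.println(codec == null ? "LZO codec not found" : codec.getClass().getName());
        }
    }

Compile it against the Hadoop jars and run it with the hadoop command so that the cluster configuration is picked up.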

If all of the above steps succeed, you can try LZO now. For example, compress a log file with lzop into an LZO file such as lzo_logs, upload it to HDFS, and then index it with the following command:
hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer hdfs://namenode:9000/lzo_logs

If you want to write a job that uses LZO, take an existing job such as wordcount and change its input format from TextInputFormat to LzoTextInputFormat; with no other modification, the job can read LZO-compressed files from HDFS and process them in parallel, as sketched below.
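
Here is a minimal sketch of such a job, assuming LzoTextInputFormat is the one shipped in hadoop-lzo's com.hadoop.mapreduce package, and using placeholder input and output paths; apart from the setInputFormatClass line it is an ordinary wordcount driver:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import com.hadoop.mapreduce.LzoTextInputFormat;

    public class LzoWordCount {
        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    ctx.write(word, ONE);
                }
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "lzo word count");
            job.setJarByClass(LzoWordCount.class);

            // The only LZO-specific change: split indexed .lzo files on their
            // block boundaries instead of using plain TextInputFormat.
            job.setInputFormatClass(LzoTextInputFormat.class);

            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path("/lzo_logs"));      // indexed .lzo input (placeholder)
            FileOutputFormat.setOutputPath(job, new Path("/wordcounts"));  // placeholder output dir
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }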
