Compression using LZO in Hadoop


Using the LZO compression algorithm in Hadoop reduces the size of the data and therefore the time spent reading it from and writing it to disk. Because LZO is block-based, it allows the data to be broken into chunks that Hadoop can process in parallel. These properties make LZO a very handy compression format for Hadoop.

LZO by itself is not splittable, so when text data compressed with LZO is used as job input, each file is handled by a single map task. A SequenceFile, however, is itself organized in blocks, so writing the data as SequenceFiles and compressing them with LZO gives you files that are splittable.
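As a minimal sketch of the SequenceFile route, the snippet below writes a block-compressed SequenceFile using com.hadoop.compression.lzo.LzoCodec from hadoop-lzo; the output path and the LongWritable/Text key and value types are just illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import com.hadoop.compression.lzo.LzoCodec;

public class LzoSequenceFileWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // LzoCodec needs the configuration so it can locate the native library.
        LzoCodec codec = new LzoCodec();
        codec.setConf(conf);

        // BLOCK compression keeps the SequenceFile splittable while using LZO.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("/tmp/events.seq"),   // hypothetical output path
                LongWritable.class, Text.class,
                SequenceFile.CompressionType.BLOCK, codec);
        writer.append(new LongWritable(1L), new Text("example record"));
        writer.close();
    }
}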

Since compressed data is usually only about a quarter the size of the original, storing compressed data in HDFS lets the cluster hold more data and extends its useful life. Moreover, because MapReduce jobs usually bottleneck on I/O, storing data in compressed form means fewer I/O operations and more efficient job runs. However, using compression on Hadoop has two awkward aspects: first, some compression formats cannot be split for parallel processing, gzip for example; second, some other formats do support splitting, but their decompression is so slow that the job bottleneck shifts to the CPU, bzip2 for example.

For example, suppose we have a 1.1 GB gzip file stored on HDFS in 128 MB blocks, so it is split into roughly nine pieces. To process each block in parallel in MapReduce, every mapper would depend on the data that precedes its block: a mapper that starts at an arbitrary byte offset in the file finds that the dictionary gzip needs for decompression is empty at that point, which means a gzip-compressed file cannot be decompressed block by block in parallel on Hadoop. A large gzip file can therefore only be handled by a single mapper, which is very inefficient and no better than not using MapReduce at all. bzip2, by contrast, compresses well and can even be split, but its decompression is very slow and cannot be read in a streaming fashion, so it is also inefficient to use in Hadoop; even where it is used, the slow decompression shifts the job bottleneck to the CPU.

A compression algorithm that is splittable, can be processed in parallel, and is also very fast would be ideal, and LZO is exactly that. An LZO-compressed file is made up of many small blocks (around 256 KB), so a Hadoop job can split its input along block boundaries. LZO was also designed with efficiency in mind: its decompression speed is roughly twice that of gzip, which saves a great deal of disk reading and writing. Its compression ratio is worse than gzip's, with compressed files roughly half again as large, but they are still 20-50% smaller than the uncompressed data, so job execution speed can improve considerably. The following comparison was made with 8.0 GB of uncompressed data:

The comparison shows that the LZO-compressed file is somewhat larger than the gzip-compressed file but still much smaller than the original, that LZO compresses at almost five times the speed of gzip, and that it decompresses at roughly twice the speed of gzip.

LZO files can be split along block boundaries. For a 1.1 GB LZO-compressed file, for instance, the mapper processing the second 128 MB block must be able to find the next block boundary before it can decompress anything. LZO does not write any header data to make this possible; instead, an LZO index file is built, so that each foo.lzo file gets a companion foo.lzo.index. The index file simply records the offset of each block in the data, so once the offsets are known, reads become very fast, typically 90-100 MB/s, i.e. 10-12 seconds to read a gigabyte. Once the index file has been created, any LZO-compressed file can be split by loading the index and reading block by block, so every mapper gets the correct blocks. In other words, all you need is an LzopInputStream wrapper to use LZO in parallel and efficiently in Hadoop MapReduce. If a job's InputFormat is currently TextInputFormat, you can compress its input files with lzop, make sure the index is created correctly, and replace TextInputFormat with LzoTextInputFormat; the job then runs as correctly as before, only faster. Sometimes a large LZO-compressed file can even be processed efficiently by a single mapper without splitting at all.

Installing LZO in the Hadoop cluster

Setting up LZO in Hadoop is straightforward:

Install the lzop and LZO native libraries, for example:

sudo yum install lzop lzo2

Download the hadoop-lzo source code from: http://github.com/kevinweil/hadoop-lzo

Check out the code from the link above and compile it, usually with: ant compile-native tar

Deploy the compiled hadoop-lzo-*.jar to every slave in the Hadoop cluster, in a directory on the classpath such as $HADOOP_HOME/lib.

Deploy the compiled hadoop-lzo native library binaries to each node as well, in a valid directory such as $HADOOP_HOME/lib/native/Linux-amd64-64.
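A minimal sketch of these two deployment steps, assuming the ant build left the jar under build/ and the native libraries under build/native/Linux-amd64-64/lib/ (check your actual build output), that slave1, slave2, and slave3 are your worker hosts, and that $HADOOP_HOME points to the same path locally and on every node:

for host in slave1 slave2 slave3; do
  # put the hadoop-lzo jar on each node's Hadoop classpath
  scp build/hadoop-lzo-*.jar "$host:$HADOOP_HOME/lib/"
  # put the native libgplcompression libraries next to Hadoop's other native libs
  scp build/native/Linux-amd64-64/lib/* "$host:$HADOOP_HOME/lib/native/Linux-amd64-64/"
done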

Configure the following in core-site.xml:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>

<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

Configure the following in mapred-site.xml:

<property>
  <name>mapred.child.env</name>
  <value>JAVA_LIBRARY_PATH=/path/to/your/native/hadoop-lzo/libs</value>
</property>

If you also want the intermediate (map output) results of MapReduce jobs to be compressed, write the following configuration into mapred-site.xml as well.

<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
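Note that this codec setting only takes effect when map output compression is actually switched on; with the classic property names used here, that usually means also adding:

<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>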

If all of these steps succeed, you can try LZO out. For example, compress a log file with lzop (say lzo_logs), upload it to HDFS, and then build its index with the following command:

hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer hdfs://namenode:9000/lzo_logs

If you want a job of your own to use LZO, take an existing job such as WordCount, change its TextInputFormat to LzoTextInputFormat, and leave everything else unchanged; the job can then read LZO-compressed files from HDFS, split them, and process them in parallel.
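Below is a minimal sketch of such a driver, wiring the stock WordCount mapper and reducer from the hadoop-examples jar to com.hadoop.mapreduce.LzoTextInputFormat from hadoop-lzo; the class name and the input/output path arguments are illustrative assumptions, and the hadoop-examples and hadoop-lzo jars are assumed to be on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.WordCount;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.hadoop.mapreduce.LzoTextInputFormat;

public class LzoWordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "lzo wordcount");
        job.setJarByClass(LzoWordCount.class);

        // The only LZO-specific change: read .lzo files, split on indexed block boundaries.
        job.setInputFormatClass(LzoTextInputFormat.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);  // standard WordCount mapper
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);   // standard WordCount reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. a directory of indexed .lzo files
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}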

Example:

1. Create the flow table:

use gj;

CREATE TABLE top_flow
(
  src_address    string,
  dst_address    string,
  src_port       int,
  dst_port       int,
  trans_protocol int,
  packets        bigint,
  bytes          bigint,
  flags          string,
  start_time     timestamp,
  duration       double,
  end_time       timestamp,
  sensor         string
)
PARTITIONED BY (dd int, hh int, protocol int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

-- LOCATION 'hdfs://hadoop:8020/user/hive/warehouse/gj.db/top_flow/'

Top_flow partition directory layout (the gj.db warehouse directory permissions are set to 777):

hdfs://hadoop:8020/user/hive/warehouse/gj.db/top_flow/dd=20140707/hh=1/protocol=6/ip=192.168.1.1/

Check the table definition in impala-shell:

[hadoop:21000] > DESC FORMATTED top_flow;

For comparison, a sim_event partition path looks like:

hdfs://hadoop:8020/user/hive/warehouse/gj.db/sim_event/dd=20140707/hh=1/devtype=topidp/ip=192.168.1.1/logtype=ipsav/

CREATE TABLE sim_event
(
  event_id
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Add Partition:

ALTER TABLE top_flow ADD PARTITION (dd=20140707, hh=1, protocol=6);

hdfs dfs -chmod -R a+w hdfs://hadoop:8020/user/hive/warehouse/gj.db/top_flow/dd=20140707/hh=1/protocol=6

Create a new table with the same structure:

CREATE TABLE top_flow1 LIKE top_flow;

To add a partition:

ALTER TABLE top_flow1 ADD PARTITION (dd=20140707, hh=1, protocol=6);

ALTER TABLE top_flow1 PARTITION (dd=20140707, hh=1, protocol=6) SET FILEFORMAT PARQUET;

Inserting data

INSERT INTO TABLE top_flow1 PARTITION (dd=20140707, hh=1, protocol=6)
SELECT src_address, dst_address, src_port, dst_port, trans_protocol, packets, bytes, flags, start_time, duration, end_time, sensor
FROM top_flow
WHERE dd=20140707 AND hh=1 AND protocol=6;

Delete data from the source table

ALTER TABLE top_flow DROP PARTITION (dd=20140707, hh=1, protocol=6);

or move the data files instead.

---- Important: the compressed (LZO) table

1. First install the LZO packages from the local media (CD-ROM):

yum install hadoop-lzo-cdh4

-- the native files need to be copied to the native directory on every node

yum install impala-lzo

2. Edit core-site.xml (and copy the updated file to the specified directory on each node), adding:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,org.apache.hadoop.io.compress.SnappyCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>

Copy the native files:

cp -r /usr/lib/hadoop/lib/native ${HADOOP_HOME}/lib/

Restart MapReduce and Impala services

CREATE TABLE top_flow1 (
  src_address    string,
  dst_address    string,
  src_port       int,
  dst_port       int,
  trans_protocol int,
  packets        bigint,
  bytes          bigint,
  flags          string,
  start_time     timestamp,
  duration       double,
  end_time       timestamp,
  sensor         string
)
PARTITIONED BY (dd int, hh int, protocol int)
STORED AS
  INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

INSERT INTO TABLE top_flow1 PARTITION (dd=20140707, hh=1, protocol=6)
SELECT src_address, dst_address, src_port, dst_port, trans_protocol, packets, bytes, flags, start_time, duration, end_time, sensor
FROM top_flow
WHERE dd=20140707 AND hh=1 AND protocol=6;

Create the index so the file can be split into chunks:

hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar com.hadoop.compression.lzo.DistributedLzoIndexer hdfs://hdfs:8020/user/hive/warehouse/gj.db/top_flow1/dd=20140707/hh=1/protocol=6/000000_0.lzo

[localhost:21000] > INVALIDATE METADATA;
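After the metadata refresh, a quick sanity check (a hypothetical verification query against the partition loaded above) confirms that Impala can read the indexed LZO data:

[localhost:21000] > SELECT count(*) FROM top_flow1 WHERE dd=20140707 AND hh=1 AND protocol=6;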

