Environment:
CentOS 6.4 (64-bit)
Hadoop 2.2.0
Sun JDK 1.7.0_45
Hive 0.12.0
Preparations:
Yum-y install lzo-devel zlib-devel GCC Autoconf automake libtool
Let's get started!
(1) Install LZO
wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
tar -zxvf lzo-2.06.tar.gz
cd lzo-2.06
./configure --enable-shared --prefix=/usr/local/hadoop/lzo/
make && make test && make install
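As a quick optional sanity check, the shared library should now sit under the prefix we passed to configure (exact file names can vary slightly by version):
ls /usr/local/hadoop/lzo/lib
# expect liblzo2.a, liblzo2.so, liblzo2.so.2, and friends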
Once the installation is complete, copy everything under /usr/local/hadoop/lzo/lib/ to /usr/lib/ and /usr/lib64/:
sudo cp /usr/local/hadoop/lzo/lib/* /usr/lib/
sudo cp /usr/local/hadoop/lzo/lib/* /usr/lib64/
Then configure the environment variable (vim /etc/bashrc):
export PATH=/usr/local/hadoop/lzo/:$PATH
(2) Install lzop
wget http://www.lzop.org/download/lzop-1.03.tar.gz
tar -zxvf lzop-1.03.tar.gz
export C_INCLUDE_PATH=/usr/local/hadoop/lzo/include/
PS: if this variable is not set, configure fails with the following error: configure: error: LZO header files not found. Please check your installation or set the environment variable `CPPFLAGS'. Next:
cd lzop-1.03
./configure --enable-shared --prefix=/usr/local/hadoop/lzop
make && make install
(3) Link lzop into /usr/bin/
ln -s /usr/local/hadoop/lzop/bin/lzop /usr/bin/lzop
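A quick way to confirm the link resolves to the right binary:
ls -l /usr/bin/lzop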
(4) Test lzop
lzop /home/hadoop/data/access_20131219.log
Running lzop may produce this error: lzop: error while loading shared libraries: liblzo2.so.2: cannot open shared object file: No such file or directory
Solution: add the environment variable export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64
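An alternative to exporting LD_LIBRARY_PATH in every shell is to register the library directory with the dynamic linker once (a sketch; requires root, and the file name lzo.conf is arbitrary):
echo "/usr/local/hadoop/lzo/lib" | sudo tee /etc/ld.so.conf.d/lzo.conf
sudo ldconfig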
If a compressed file with the .lzo suffix appears, /home/hadoop/data/access_20131219.log.lzo, the preceding steps were done correctly.
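lzop can also verify the archive itself via its built-in test mode:
lzop -t /home/hadoop/data/access_20131219.log.lzo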
(5) Install hadoop-lzo
Of course, there is one more prerequisite: Maven plus SVN or Git must already be set up (I use SVN), so I won't cover that here. Without them, there is no point in proceeding!
I use https://github.com/twitter/hadoop-lzo here.
Use SVN to check out the code from https://github.com/twitter/hadoop-lzo/trunk and modify the <properties> section of the pom.xml file.
From:
<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <hadoop.current.version>2.1.0-beta</hadoop.current.version>
  <hadoop.old.version>1.0.4</hadoop.old.version>
</properties>
To:
<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <hadoop.current.version>2.2.0</hadoop.current.version>
  <hadoop.old.version>1.0.4</hadoop.old.version>
</properties>
Run the following commands in sequence:
mvn clean package -Dmaven.test.skip=true
tar -cBf - -C target/native/Linux-amd64-64/lib . | tar -xBvf - -C /home/hadoop/hadoop-2.2.0/lib/native/
cp target/hadoop-lzo-0.4.20-SNAPSHOT.jar /home/hadoop/hadoop-2.2.0/share/hadoop/common/
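At this point the native bridge library that hadoop-lzo builds (libgplcompression.*) should be in place; a quick check:
ls /home/hadoop/hadoop-2.2.0/lib/native/ | grep gplcompression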
The next step is to synchronize /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar and /home/hadoop/hadoop-2.2.0/lib/native/ to all other Hadoop nodes, as sketched below. Note: make sure the jar package and the files under /home/hadoop/hadoop-2.2.0/lib/native/ have execute permission for every user that runs Hadoop.
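A minimal sketch of that synchronization, assuming passwordless SSH to the workers and that their hostnames are listed in the stock slaves file (adjust paths and host list to your cluster):
for node in $(grep -v '^#' /home/hadoop/hadoop-2.2.0/etc/hadoop/slaves); do
  scp /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar $node:/home/hadoop/hadoop-2.2.0/share/hadoop/common/
  rsync -a /home/hadoop/hadoop-2.2.0/lib/native/ $node:/home/hadoop/hadoop-2.2.0/lib/native/
done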
(6) Configure Hadoop
Append the following to $HADOOP_HOME/etc/hadoop/hadoop-env.sh:
export LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib
Append the following to $HADOOP_HOME/etc/hadoop/core-site.xml:
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,
         org.apache.hadoop.io.compress.DefaultCodec,
         com.hadoop.compression.lzo.LzoCodec,
         com.hadoop.compression.lzo.LzopCodec,
         org.apache.hadoop.io.compress.BZip2Codec
  </value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
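After restarting the cluster so the new codec list takes effect, hadoop fs -text makes a handy end-to-end check, since it decompresses through the configured codecs; the HDFS destination /tmp/ here is just an example:
hadoop fs -put /home/hadoop/data/access_20131219.log.lzo /tmp/
hadoop fs -text /tmp/access_20131219.log.lzo | head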
Append the following to $HADOOP_HOME/etc/hadoop/mapred-site.xml (these are the Hadoop 1.x property names; Hadoop 2.2 still honors them as deprecated aliases of mapreduce.map.output.compress and mapreduce.map.output.compress.codec):
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
<property>
  <name>mapred.child.env</name>
  <value>LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib</value>
</property>
(7) Try LZO in Hive
A: First create the logs_app_nginx table.
CREATE TABLE logs_app_nginx (
  ip STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  rt STRING,
  referer STRING,
  agent STRING,
  forwarded STRING
)
PARTITIONED BY (
  date STRING,
  host STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
B: Import data
LOAD DATA LOCAL INPATH '/home/hadoop/data/access_20131230_25.log.lzo' INTO TABLE logs_app_nginx PARTITION (date=20131229, host=25);
The format of the /home/hadoop/data/access_20131219.log file is as follows:
221.207.93.109 - [23/Dec/2013:23:22:38 +0800] "GET /clientGetResourceDetail.action?id=318880&token=ocm HTTP/1.1" 200 199 0.008 "xxx.com" "Android4.1.2/Lenovo A706/ch_lenovo/80" "-"
Running lzop /home/hadoop/data/access_20131219.log produces the LZO-compressed file /home/hadoop/data/access_20131219.log.lzo.
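To confirm the load registered the partition, a quick check from the shell (assuming the hive CLI is on the PATH):
hive -e "SHOW PARTITIONS logs_app_nginx;"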
C: Index the LZO file
$HADOOP_HOME/bin/hadoop jar /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/logs_app_nginx
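DistributedLzoIndexer does the indexing as a MapReduce job; for a single small file, hadoop-lzo also ships a local variant, com.hadoop.compression.lzo.LzoIndexer, invoked the same way:
$HADOOP_HOME/bin/hadoop jar /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer /user/hive/warehouse/logs_app_nginx
The resulting .index files are what allow MapReduce to split a large .lzo file across several mappers.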
D: Run the map/reduce task with Hive.
set hive.exec.reducers.max=10;
set mapred.reduce.tasks=10;
select ip, rt from logs_app_nginx limit 10;
On the Hive console you should see output similar to the following, which means everything is working!
hive> set hive.exec.reducers.max=10;
hive> set mapred.reduce.tasks=10;
hive> select ip, rt from logs_app_nginx limit 10;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1388065803340_0009, Tracking URL = http://lrts216:8088/proxy/application_1388065803340_0009/
Kill Command = /home/hadoop/hadoop-2.2.0/bin/hadoop job -kill job_1388065803340_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
09:13:39,163 Stage-1 map = 0%, reduce = 0%
09:13:45,343 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.22 sec
09:13:46,369 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.22 sec
MapReduce Total cumulative CPU time: 1 seconds 220 msec
Ended Job = job_1388065803340_0009
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.22 sec   HDFS Read: 63570 HDFS Write: 315 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 220 msec
OK
221.207.93.109 "xxx.com"
Time taken: 17.498 seconds, Fetched: 10 row(s)