Hadoop 2.2.0 + Hive: using LZO compression for MapReduce tasks

Environment:

CentOS 6.4 (64-bit)

Hadoop 2.2.0

Sun JDK 1.7.0_45

Hive 0.12.0

Preparations:

yum -y install lzo-devel zlib-devel gcc autoconf automake libtool

Let's get started!

(1) install lzo

wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
tar -zxvf lzo-2.06.tar.gz
cd lzo-2.06
./configure --enable-shared --prefix=/usr/local/hadoop/lzo/
make && make test && make install

Once the installation is complete, copy /usr/local/hadoop/lzo/lib/* to /usr/lib/ and /usr/lib64/:

sudo cp /usr/local/hadoop/lzo/lib/* /usr/lib/
sudo cp /usr/local/hadoop/lzo/lib/* /usr/lib64/

Then configure the environment variable (vim /etc/bashrc):

export PATH=/usr/local/hadoop/lzo/:$PATH
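
After copying, you can confirm that the dynamic linker actually sees the library. A quick sanity check (a minimal sketch, assuming the paths used above):

sudo ldconfig                # refresh the dynamic linker cache
ldconfig -p | grep liblzo2   # should list liblzo2.so.2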

(2) install lzop
wget http://www.lzop.org/download/lzop-1.03.tar.gz
tar -zxvf lzop-1.03.tar.gz

export C_INCLUDE_PATH=/usr/local/hadoop/lzo/include/

PS: If this variable is not set, configure fails with the following error: "configure: error: LZO header files not found. Please check your installation or set the environment variable 'CPPFLAGS'." Then continue:

cd lzop-1.03
./configure --enable-shared --prefix=/usr/local/hadoop/lzop
make && make install

(3) Link lzop into /usr/bin/
ln -s /usr/local/hadoop/lzop/bin/lzop /usr/bin/lzop
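
To confirm the link resolves, lzop's standard --version flag is a quick check:

which lzop
lzop --version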

(4) test lzop
lzop /home/hadoop/data/access_20131219.log

When you run lzop, you may hit this error:

lzop: error while loading shared libraries: liblzo2.so.2: cannot open shared object file: No such file or directory

Solution: add the environment variable export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64.

If a compressed file with the .lzo suffix appears (/home/hadoop/data/access_20131219.log.lzo), the preceding steps are correct.
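
As an alternative to exporting LD_LIBRARY_PATH in every shell, the library directory can be registered with the dynamic linker permanently. A minimal sketch, assuming the install prefix used above:

echo "/usr/local/hadoop/lzo/lib" | sudo tee /etc/ld.so.conf.d/lzo.conf
sudo ldconfig
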
(5) install hadoop-lzo

Of course, there is one more prerequisite: Maven plus SVN or Git must already be configured (I use SVN); I won't cover that setup here. Without it, you cannot proceed!

I use https://github.com/twitter/hadoop-lzo here.

Check out the code with SVN from https://github.com/twitter/hadoop-lzo/trunk, then modify the Hadoop version in the pom.xml file.

From:

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <hadoop.current.version>2.1.0-beta</hadoop.current.version>
    <hadoop.old.version>1.0.4</hadoop.old.version>
</properties>

To:

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <hadoop.current.version>2.2.0</hadoop.current.version>
    <hadoop.old.version>1.0.4</hadoop.old.version>
</properties>

Run the following commands in sequence:

mvn clean package -Dmaven.test.skip=true
tar -cvf - -C target/native/Linux-amd64-64/lib . | tar -xBvf - -C /home/hadoop/hadoop-2.2.0/lib/native/
cp target/hadoop-lzo-0.4.20-SNAPSHOT.jar /home/hadoop/hadoop-2.2.0/share/hadoop/common/

The next step is to synchronize /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar and the contents of /home/hadoop/hadoop-2.2.0/lib/native/ to all other Hadoop nodes, as scripted below. Note: make sure the libraries under /home/hadoop/hadoop-2.2.0/lib/native/ are executable by every user that runs Hadoop.
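
For example, the synchronization could be scripted with rsync; the hostnames slave1 and slave2 below are hypothetical placeholders for your own worker nodes:

for node in slave1 slave2; do
    rsync -av /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar ${node}:/home/hadoop/hadoop-2.2.0/share/hadoop/common/
    rsync -av /home/hadoop/hadoop-2.2.0/lib/native/ ${node}:/home/hadoop/hadoop-2.2.0/lib/native/
done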

(6) Configure hadoop

Append the following content to the file $HADOOP_HOME/etc/hadoop/hadoop-env.sh:

export LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib

Append the following content to the file $HADOOP_HOME/etc/hadoop/core-site.xml:

<property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
<property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
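
To verify the codec configuration is picked up, one simple check (assuming the .lzo file created in step 4) is to let Hadoop decompress it; hadoop fs -text selects the codec from io.compression.codecs based on the file extension:

hadoop fs -put /home/hadoop/data/access_20131219.log.lzo /tmp/
hadoop fs -text /tmp/access_20131219.log.lzo | head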

Append the following content to the file $HADOOP_HOME/etc/hadoop/mapred-site.xml:

<property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
</property>
<property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
<property>
    <name>mapred.child.env</name>
    <value>LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib</value>
</property>
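
Note: in Hadoop 2.x these mapred.* names are deprecated aliases; mapreduce.map.output.compress and mapreduce.map.output.compress.codec are the current spellings. Both forms still take effect, so the configuration above works as-is.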

(7) use lzo in hive

A: First create the logs_app_nginx table.

create table logs_app_nginx (
    ip string,
    user string,
    time string,
    request string,
    status string,
    size string,
    rt string,
    referer string,
    agent string,
    forwarded string
)
partitioned by (
    date string,
    host string
)
row format delimited
fields terminated by '\t'
stored as inputformat "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";

B: import data

load data local inpath '/home/hadoop/data/access_20131230_25.log.lzo' into table logs_app_nginx partition (date=20131229, host=25);
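
To confirm the partition was registered and the .lzo file landed in the warehouse (the path below assumes the default hive.metastore.warehouse.dir), a quick check:

hive -e "show partitions logs_app_nginx;"
hadoop fs -ls /user/hive/warehouse/logs_app_nginx/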

The format of the /home/hadoop/data/access_20131219.log file is as follows:

221.207.93.109 - [23/Dec/2013:23:22:38 +0800] "GET /clientGetResourceDetail.action?id=318880&token=ocm HTTP/1.1" 200 199 0.008 "xxx.com" "android4.1.2/Lenovo A706/ch_lenovo/80" "-"

Run lzop /home/hadoop/data/access_20131219.log to generate the compressed file /home/hadoop/data/access_20131219.log.lzo.

C: index the .lzo file

$HADOOP_HOME/bin/hadoop jar /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/logs_app_nginx
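
Indexing writes a small .index file next to each .lzo file so MapReduce can split the compressed file across several mappers; without an index, every .lzo file is read by a single mapper. DistributedLzoIndexer runs the indexing itself as a MapReduce job; the same jar also ships a local variant that may suit small data sets:

$HADOOP_HOME/bin/hadoop jar /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer /user/hive/warehouse/logs_app_nginx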

D: run a MapReduce task from Hive.

set hive.exec.reducers.max=10;
set mapred.reduce.tasks=10;
select ip, rt from logs_app_nginx limit 10;

On the Hive console, output similar to the following means everything is working. (Hive sets the number of reduce tasks to 0 here because a plain select ... limit query has no reduce operator, so the mapred.reduce.tasks setting is ignored.)

hive> set hive.exec.reducers.max=10;
hive> set mapred.reduce.tasks=10;
hive> select ip, rt from logs_app_nginx limit 10;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1388065803340_0009, Tracking URL = http://lrts216:8088/proxy/application_1388065803340_0009/
Kill Command = /home/hadoop/hadoop-2.2.0/bin/hadoop job -kill job_1388065803340_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
09:13:39,163 Stage-1 map = 0%, reduce = 0%
09:13:45,343 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.22 sec
09:13:46,369 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.22 sec
MapReduce Total cumulative CPU time: 1 seconds 220 msec
Ended Job = job_1388065803340_0009
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.22 sec   HDFS Read: 63570   HDFS Write: 315   SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 220 msec
OK
221.207.93.109 "xxx.com"
Time taken: 17.498 seconds, Fetched: 10 row(s)
