Environment:
CentOS 6.4 (64-bit)
Hadoop 2.2.0
Sun JDK 1.7.0_45
Hive 0.12.0
Preparations:
Yum-y install lzo-devel zlib-devel GCC Autoconf automake libtool
Let's get started!
(1) Install LZO
wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
tar -zxvf lzo-2.06.tar.gz
cd lzo-2.06
./configure --enable-shared --prefix=/usr/local/hadoop/lzo/
make && make test && make install
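As a quick optional sanity check, the shared library should now sit under the prefix we passed to configure (exact file names can vary slightly by version):
ls /usr/local/hadoop/lzo/lib
# expect liblzo2.a, liblzo2.so, liblzo2.so.2, and friends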
Once the installation is complete, copy everything under /usr/local/hadoop/lzo/lib/ to /usr/lib/ and /usr/lib64/:
sudo cp /usr/local/hadoop/lzo/lib/* /usr/lib/
sudo cp /usr/local/hadoop/lzo/lib/* /usr/lib64/
Then configure the environment variable (vim /etc/bashrc):
export PATH=/usr/local/hadoop/lzo/:$PATH
(2) Install lzop
wget http://www.lzop.org/download/lzop-1.03.tar.gz
tar -zxvf lzop-1.03.tar.gz
export C_INCLUDE_PATH=/usr/local/hadoop/lzo/include/
PS: if this variable is not set, configure fails with the following error: configure: error: LZO header files not found. Please check your installation or set the environment variable `CPPFLAGS'. Next:
cd lzop-1.03
./configure --enable-shared --prefix=/usr/local/hadoop/lzop
make && make install
(3) Link lzop into /usr/bin/
ln -s /usr/local/hadoop/lzop/bin/lzop /usr/bin/lzop
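A quick way to confirm the link resolves to the right binary:
ls -l /usr/bin/lzop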
(4) Test lzop
lzop /home/hadoop/data/access_20131219.log
Running lzop may produce this error: lzop: error while loading shared libraries: liblzo2.so.2: cannot open shared object file: No such file or directory
Solution: add the environment variable export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64
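An alternative to exporting LD_LIBRARY_PATH in every shell is to register the library directory with the dynamic linker once (a sketch; requires root, and the file name lzo.conf is arbitrary):
echo "/usr/local/hadoop/lzo/lib" | sudo tee /etc/ld.so.conf.d/lzo.conf
sudo ldconfig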
If a compressed file with the .lzo suffix appears, /home/hadoop/data/access_20131219.log.lzo, the preceding steps were done correctly.
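lzop can also verify the archive itself via its built-in test mode:
lzop -t /home/hadoop/data/access_20131219.log.lzo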
(5) Install hadoop-lzo
Of course, there is one more prerequisite: Maven plus SVN or Git must already be set up (I use SVN), so I won't cover that here. Without them, there is no point in proceeding!
I use https://github.com/twitter/hadoop-lzo here.
Use SVN to check out the code from https://github.com/twitter/hadoop-lzo/trunk and modify the <properties> section of the pom.xml file.
From:
<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <hadoop.current.version>2.1.0-beta</hadoop.current.version>
  <hadoop.old.version>1.0.4</hadoop.old.version>
</properties>
To:
<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <hadoop.current.version>2.2.0</hadoop.current.version>
  <hadoop.old.version>1.0.4</hadoop.old.version>
</properties>
Run the following commands in sequence:
mvn clean package -Dmaven.test.skip=true
tar -cBf - -C target/native/Linux-amd64-64/lib . | tar -xBvf - -C /home/hadoop/hadoop-2.2.0/lib/native/
cp target/hadoop-lzo-0.4.20-SNAPSHOT.jar /home/hadoop/hadoop-2.2.0/share/hadoop/common/
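At this point the native bridge library that hadoop-lzo builds (libgplcompression.*) should be in place; a quick check:
ls /home/hadoop/hadoop-2.2.0/lib/native/ | grep gplcompression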
The next step is to synchronize /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar and /home/hadoop/hadoop-2.2.0/lib/native/ to all other Hadoop nodes, as sketched below. Note: make sure the jar package and the files under /home/hadoop/hadoop-2.2.0/lib/native/ have execute permission for every user that runs Hadoop.
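A minimal sketch of that synchronization, assuming passwordless SSH to the workers and that their hostnames are listed in the stock slaves file (adjust paths and host list to your cluster):
for node in $(grep -v '^#' /home/hadoop/hadoop-2.2.0/etc/hadoop/slaves); do
  scp /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar $node:/home/hadoop/hadoop-2.2.0/share/hadoop/common/
  rsync -a /home/hadoop/hadoop-2.2.0/lib/native/ $node:/home/hadoop/hadoop-2.2.0/lib/native/
done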
(6) Configure Hadoop
Append the following to $HADOOP_HOME/etc/hadoop/hadoop-env.sh:
export LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib
Append the following to $HADOOP_HOME/etc/hadoop/core-site.xml:
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,
         org.apache.hadoop.io.compress.DefaultCodec,
         com.hadoop.compression.lzo.LzoCodec,
         com.hadoop.compression.lzo.LzopCodec,
         org.apache.hadoop.io.compress.BZip2Codec
  </value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
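After restarting the cluster so the new codec list takes effect, hadoop fs -text makes a handy end-to-end check, since it decompresses through the configured codecs; the HDFS destination /tmp/ here is just an example:
hadoop fs -put /home/hadoop/data/access_20131219.log.lzo /tmp/
hadoop fs -text /tmp/access_20131219.log.lzo | head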
Append the following to $HADOOP_HOME/etc/hadoop/mapred-site.xml (these are the Hadoop 1.x property names; Hadoop 2.2 still honors them as deprecated aliases of mapreduce.map.output.compress and mapreduce.map.output.compress.codec):
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
<property>
  <name>mapred.child.env</name>
  <value>LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib</value>
</property>
(7) Try LZO in Hive
A: First create the logs_app_nginx table.
CREATE TABLE logs_app_nginx (
  ip STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  rt STRING,
  referer STRING,
  agent STRING,
  forwarded STRING
)
PARTITIONED BY (
  date STRING,
  host STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
B: Import data
LOAD DATA LOCAL INPATH '/home/hadoop/data/access_20131230_25.log.lzo' INTO TABLE logs_app_nginx PARTITION (date=20131229, host=25);
The format of the /home/hadoop/data/access_20131219.log file is as follows:
221.207.93.109 - [23/Dec/2013:23:22:38 +0800] "GET /clientGetResourceDetail.action?id=318880&token=ocm HTTP/1.1" 200 199 0.008 "xxx.com" "Android4.1.2/Lenovo A706/ch_lenovo/80" "-"
Running lzop /home/hadoop/data/access_20131219.log produces the LZO-compressed file /home/hadoop/data/access_20131219.log.lzo.
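To confirm the load registered the partition, a quick check from the shell (assuming the hive CLI is on the PATH):
hive -e "SHOW PARTITIONS logs_app_nginx;"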
C: Index the LZO file
$HADOOP_HOME/bin/hadoop jar /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/logs_app_nginx
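DistributedLzoIndexer does the indexing as a MapReduce job; for a single small file, hadoop-lzo also ships a local variant, com.hadoop.compression.lzo.LzoIndexer, invoked the same way:
$HADOOP_HOME/bin/hadoop jar /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer /user/hive/warehouse/logs_app_nginx
The resulting .index files are what allow MapReduce to split a large .lzo file across several mappers.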
D: Run the map/reduce task with Hive.
set hive.exec.reducers.max=10;
set mapred.reduce.tasks=10;
select ip, rt from logs_app_nginx limit 10;
On the Hive console you should see output similar to the following, which means everything is working!
hive> set hive.exec.reducers.max=10;
hive> set mapred.reduce.tasks=10;
hive> select ip, rt from logs_app_nginx limit 10;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1388065803340_0009, Tracking URL = http://lrts216:8088/proxy/application_1388065803340_0009/
Kill Command = /home/hadoop/hadoop-2.2.0/bin/hadoop job -kill job_1388065803340_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
09:13:39,163 Stage-1 map = 0%, reduce = 0%
09:13:45,343 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.22 sec
09:13:46,369 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.22 sec
MapReduce Total cumulative CPU time: 1 seconds 220 msec
Ended Job = job_1388065803340_0009
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.22 sec   HDFS Read: 63570 HDFS Write: 315 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 220 msec
OK
221.207.93.109 "xxx.com"
Time taken: 17.498 seconds, Fetched: 10 row(s)