Environment Description:
OS Version: RHEL 5.7 64-bit
Hadoop version: hadoop-0.20.2
HBase version: hbase-0.90.5
Pig Version: pig-0.9.2
The access log file used here is provided as an attachment to the original article.
The log is stored on the local file system at: /home/hadoop/access_log.txt
The log format is:
220.181.108.151 - - [31/Jan/2012:00:02:32 +0800] "GET /home.php?mod=space&uid=158&do=album&view=me&from=space HTTP/1.1" 8784 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
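Because the fields in this format are separated by single spaces, the client IP is the first space-delimited token. The sketch below (in Python, not Pig) illustrates this on a shortened, cleaned-up version of the sample line. The names `SAMPLE`, `first_fields`, and `client_ip` are hypothetical, and the exact spacing of the line is an assumption.

```python
# Shortened, cleaned-up version of the sample line (spacing is an assumption).
SAMPLE = ('220.181.108.151 - - [31/Jan/2012:00:02:32 +0800] '
          '"GET /home.php?mod=space&uid=158 HTTP/1.1" 8784')


def first_fields(line, limit=9):
    """Split on single spaces and keep at most `limit` leading fields."""
    return line.rstrip("\n").split(" ")[:limit]


def client_ip(line):
    """Return the first space-delimited field, which holds the client IP."""
    return first_fields(line, 1)[0]


if __name__ == "__main__":
    print(client_ip(SAMPLE))
    print(first_fields(SAMPLE))
```

Note that quoted parts containing spaces, such as the request string, are spread across several tokens by a plain space split. That does not matter here, since only the first field is used.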
1) Create the input directory in the HDFS file system:
grunt> mkdir input
grunt> ls
hdfs://node1.test.com:9000/user/hadoop/input <dir>
grunt> cd input
grunt> ls
grunt> pwd
hdfs://node1.test.com:9000/user/hadoop/input
2) Copy the log file from the local file system to log.txt in the current HDFS directory:
grunt> copyFromLocal /home/hadoop/access_log.txt log.txt
2014-10-14 10:53:49,667 [Thread-7] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream java.net.NoRouteToHostException: No route to host
2014-10-14 10:53:49,667 [Thread-7] INFO org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-7546596643624545852_1118
2014-10-14 10:53:49,669 [Thread-7] INFO org.apache.hadoop.hdfs.DFSClient - Excluding datanode 172.16.41.154:50010
# Check the resulting file
grunt> ls
hdfs://node1.test.com:9000/user/hadoop/input/log.txt<r 2> 7118627
3) Load the file contents into relation A, using a space (' ') as the delimiter:
grunt> A = LOAD '/user/hadoop/input/log.txt'
>> USING PigStorage(' ')
>> AS (ip, a1, a2, a3, a4, a5, a6, a7, a8);
4) Project only the ip field:
grunt> B = FOREACH A GENERATE ip;
5) Group B by ip and store the result in C:
grunt> C = GROUP B BY ip;
6) Count the number of hits per IP:
grunt> D = FOREACH C GENERATE group, COUNT($1);
Show the results:
grunt> DUMP D;
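To sanity-check the Pig output on a small sample, the same project, group, and count logic can be sketched locally in Python. This is only an illustration and not part of the original workflow. The function `count_ips` and the sample lines are hypothetical, and the sketch assumes the IP is the first space-delimited field of each line.

```python
from collections import Counter


def count_ips(lines):
    """Count hits per client IP, taking the IP as the first space-delimited field."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        counts[line.split(" ", 1)[0]] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the real log.
    sample = [
        "220.181.108.151 - - [31/Jan/2012:00:02:32 +0800] ...",
        "10.0.0.1 - - [31/Jan/2012:00:02:33 +0800] ...",
        "220.181.108.151 - - [31/Jan/2012:00:02:34 +0800] ...",
    ]
    for ip, hits in count_ips(sample).most_common():
        print(ip, hits)
```

For the real file, an open file handle can be passed directly, for example `count_ips(open("/home/hadoop/access_log.txt"))`, and the totals compared with a few rows of the DUMP output.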
This article is from the "Shine_forever blog"; please keep this source attribution: http://shineforever.blog.51cto.com/1429204/1563850
Using Pig to count the number of accesses per IP in the access_log log