Using Pig to count the number of accesses per IP in an access_log file

Source: Internet
Author: User

Environment:

OS version: RHEL 5.7, 64-bit

Hadoop version: hadoop-0.20.2

HBase version: hbase-0.90.5

Pig version: pig-0.9.2


The access log file can be downloaded from the attachment to the original article.


The log is stored on the local file system at /home/hadoop/access_log.txt

The log format is:

220.181.108.151 - - [31/Jan/2012:00:02:32 +0800] "GET /home.php?mod=space&uid=158&do=album&view=me&from=space HTTP/1.1" 8784 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
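
The later Pig steps split each line on a single space and keep only the first field, so it helps to see where each token lands. The following is a minimal Python sketch, not part of the original workflow. It assumes the sample line above uses the usual Apache access-log spacing, which is a reconstruction and may differ from the real file.

```python
# Sample line taken from the log format above; the exact spacing is assumed.
sample = (
    '220.181.108.151 - - [31/Jan/2012:00:02:32 +0800] '
    '"GET /home.php?mod=space&uid=158&do=album&view=me&from=space HTTP/1.1" '
    '8784 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; '
    '+http://www.baidu.com/search/spider.html)"'
)

# Split on a single space, the same delimiter the Pig load step uses.
fields = sample.split(" ")

ip = fields[0]          # client IP address, the only field the analysis needs
timestamp = fields[3]   # date and time; the zone offset lands in fields[4]

print(ip)
print(timestamp, fields[4])
print(len(fields))
```

Because the request and user-agent strings contain spaces themselves, a space-delimited split yields more tokens than logical log fields. That is harmless here, since only the first token (the IP) is used.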


1) Create the input directory in the HDFS file system:

grunt> mkdir input

grunt> ls

hdfs://node1.test.com:9000/user/hadoop/input <dir>


grunt> cd input

grunt> ls

grunt> pwd

hdfs://node1.test.com:9000/user/hadoop/input


2) Copy the log file from the local file system to log.txt in the current HDFS directory:

grunt> copyFromLocal /home/hadoop/access_log.txt log.txt

The INFO messages show that the client could not reach DataNode 172.16.41.154:50010 and excluded it; the copy still completed, as the directory listing afterwards confirms.

2014-10-14 10:53:49,667 [Thread-7] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream java.net.NoRouteToHostException: No route to host

2014-10-14 10:53:49,667 [Thread-7] INFO org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-7546596643624545852_1118

2014-10-14 10:53:49,669 [Thread-7] INFO org.apache.hadoop.hdfs.DFSClient - Excluding datanode 172.16.41.154:50010

# List the copied file

grunt> ls

hdfs://node1.test.com:9000/user/hadoop/input/log.txt<r 2> 7118627


3) Load the file contents into relation A, using a single space (' ') as the delimiter:

grunt> A = LOAD '/user/hadoop/input/log.txt'

>> USING PigStorage(' ')

>> AS (ip, a1, a2, a3, a4, a5, a6, a7, a8);


4) Project only the ip field:

grunt> B = FOREACH A GENERATE ip;


5) Group B by ip:

grunt> C = GROUP B BY ip;


6) Count the number of accesses per IP ($1 is the bag of grouped records):

grunt> D = FOREACH C GENERATE group, COUNT($1);




Display the results:

grunt> DUMP D;

[Screenshot of the output, snap3.jpg: http://s3.51cto.com/wyfs02/M00/4C/67/wKioL1Q8ujORUDYGAA0I7AaqpQE715.jpg]

[Screenshot of the output, snap4.jpg: http://s3.51cto.com/wyfs02/M00/4C/66/wKiom1Q8uf2A0pLQAARYff1F624627.jpg]
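
To sanity-check the Pig output on a small sample, the project, group, and count steps can be mirrored locally. The following is a minimal Python sketch under stated assumptions, not the author's method: the function name `count_ips` and the shortened sample lines are hypothetical, and the IP is taken to be the first space-delimited token on each line.

```python
from collections import Counter


def count_ips(lines):
    """Count accesses per client IP, using the first space-delimited token as the IP."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        ip = line.split(" ", 1)[0]  # same field the Pig steps project
        counts[ip] += 1
    return counts


# Hypothetical sample lines, shortened for illustration.
sample_lines = [
    '220.181.108.151 - - [31/Jan/2012:00:02:32 +0800] "GET /home.php HTTP/1.1"',
    '220.181.108.151 - - [31/Jan/2012:00:02:40 +0800] "GET /forum.php HTTP/1.1"',
    '10.0.0.7 - - [31/Jan/2012:00:03:01 +0800] "GET /index.php HTTP/1.1"',
]

# Print IPs from most to least frequent.
for ip, hits in count_ips(sample_lines).most_common():
    print(ip, hits)
```

Passing an open file object for /home/hadoop/access_log.txt to `count_ips` instead of the sample list would give per-IP totals to compare against the DUMP output.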








This article is from the "Shine_forever" blog; please keep this source attribution: http://shineforever.blog.51cto.com/1429204/1563850
