Use Hive's regular parser RegexSerDe to analyze nginx logs

Source: Internet
Author: User

Use Hive's regular parser RegexSerDe to analyze nginx logs

1. Environment:

Hadoop-2.6.0 & apache-hive-1.2.0-bin

2. Use Hive to analyze nginx logs. The website access logs are as follows:

Cat/home/hadoop/hivetestdata/nginx.txt
192.168.1.128--[09/Jan/2015: 12: 38: 08 + 0800] "GET/avatar/helloworld.png HTTP/1.1" 200 1521 "http://write.blog.csdn.net/postlist" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) chrome/34.0.1847.131 Safari/537.36 & quot"
183.60.212.153--[19/Feb/2015: 10: 23: 29 + 0800] "GET/o2o/media.html? Menu = 3 HTTP/1.1 "200 16691"-"" Mozilla/5.0 (compatible; EasouSpider; + http://www.easou.com/search/spider.html )"


This log contains nine columns separated by spaces,
Each column indicates the client access IP address, user ID, user, access time, request page, Request status, size of the returned file, jump source, and browser UA.


We use the regular expression in Hive to match the nine columns:
([^] *) ([^] *) ([^] *) (\ [. * \]) (\ ". *? \ ") (-| [0-9] *) (-| [0-9] *) (\". *? \")(\".*? \")
At the same time, we can specify the serialization and deserialization Parser (SerDe) of the parsing file in Hive, and a built-in org. apache. hadoop. hive. serde2.RegexSerDe regular parser, which can be used directly.


3. Table creation statement
Create table logs
(
Host STRING,
Identity STRING,
Username STRING,
Time STRING,
Request STRING,
Status STRING,
Size STRING,
Referer STRING,
Agent STRING
)
Row format serde 'org. apache. hadoop. hive. serde2.regexserde'
With serdeproperties (
"Input. regex "=" ([^] *) ([^] *) ([^] *) (\ [. * \]) (\ ". *? \ ") (-| [0-9] *) (-| [0-9] *) (\". *? \")(\".*? \")",
"Output. format. string "=" % 1 $ s % 2 $ s % 3 $ s % 4 $ s % 5 $ s % 6 $ s % 7 $ s % 8 $ s % 9 $ s"
)
Stored as textfile;



4. load data:
Load data local inpath '/home/hadoop/hivetestdata/nginx.txt' into table logs;

Query IP addresses with an access volume of more than 100 per hour:
Select substring (time, 2, 14) datetime, host, count (*) as count
From logs
Group by substring (time, 2, 14), host
Having count> 100

Sort by datetime, count;

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.