Use Hive's regular parser RegexSerDe to analyze Nginx logs
Use Hive's regular parser RegexSerDe to analyze Nginx logs
1. Environment:
Hadoop-2.6.0 & apache-hive-1.2.0-bin
2. Use Hive to analyze nginx logs. The website access logs are as follows:
Cat/home/hadoop/hivetestdata/nginx.txt
192.168.1.128--[09/Jan/2015: 12: 38: 08 + 0800] "GET/avatar/helloworld.png HTTP/1.1" 200 1521 "http://write.blog.linuxidc.net/postlist" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) chrome/34.0.1847.131 Safari/537.36 & quot"
183.60.212.153--[19/Feb/2015: 10: 23: 29 + 0800] "GET/o2o/media.html? Menu = 3 HTTP/1.1 "200 16691"-"" Mozilla/5.0 (compatible; baiduuSpider; + )"
This log contains nine columns separated by spaces,
Each column indicates the client access IP address, user ID, user, access time, request page, Request status, size of the returned file, jump source, and browser UA.
We use the regular expression in Hive to match the nine columns:
([^] *) ([^] *) ([^] *) (. Equals )(\".*? \ ") (-| [0-9] *) (-| [0-9] *) (\". *? \")(\".*? \")
At the same time, we can specify the serialization and deserialization Parser (SerDe) of the parsing file in Hive, and a built-in org. apache. hadoop. hive. serde2.RegexSerDe regular parser, which can be used directly.
3. Table creation statement
Create table logs
(
Host STRING,
Identity STRING,
Username STRING,
Time STRING,
Request STRING,
Status STRING,
Size STRING,
Referer STRING,
Agent STRING
)
Row format serde 'org. apache. hadoop. hive. serde2.regexserde'
With serdeproperties (
"Input. regex "=" ([^] *) ([^] *) ([^] *) (\ [. * \]) (\ ". *? \ ") (-| [0-9] *) (-| [0-9] *) (\". *? \")(\".*? \")",
"Output. format. string "=" % 1 $ s % 2 $ s % 3 $ s % 4 $ s % 5 $ s % 6 $ s % 7 $ s % 8 $ s % 9 $ s"
)
Stored as textfile;
4. load data:
Load data local inpath '/home/hadoop/hivetestdata/nginx.txt' into table logs;
Query IP addresses with an access volume of more than 100 per hour:
Select substring (time, 2, 14) datetime, host, count (*) as count
From logs
Group by substring (time, 2, 14), host
Having count> 100
Sort by datetime, count;
Hive programming guide PDF (Chinese Version)
Hadoop cluster-based Hive Installation
Differences between Hive internal tables and external tables
Hadoop + Hive + Map + reduce cluster installation and deployment
Install in Hive local standalone Mode
WordCount word statistics for Hive Learning
Hive operating architecture and configuration and deployment
Hive details: click here
Hive: click here
This article permanently updates the link address: