Use Hive's regular parser RegexSerDe to analyze nginx logs
1. Environment:
Hadoop-2.6.0 & apache-hive-1.2.0-bin
2. Use Hive to analyze nginx logs. The website access logs are as follows:
Cat/home/hadoop/hivetestdata/nginx.txt
192.168.1.128--[09/Jan/2015: 12: 38: 08 + 0800] "GET/avatar/helloworld.png HTTP/1.1" 200 1521 "http://write.blog.csdn.net/postlist" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) chrome/34.0.1847.131 Safari/537.36 & quot"
183.60.212.153--[19/Feb/2015: 10: 23: 29 + 0800] "GET/o2o/media.html? Menu = 3 HTTP/1.1 "200 16691"-"" Mozilla/5.0 (compatible; EasouSpider; + http://www.easou.com/search/spider.html )"
This log contains nine columns separated by spaces,
Each column indicates the client access IP address, user ID, user, access time, request page, Request status, size of the returned file, jump source, and browser UA.
We use the regular expression in Hive to match the nine columns:
([^] *) ([^] *) ([^] *) (\ [. * \]) (\ ". *? \ ") (-| [0-9] *) (-| [0-9] *) (\". *? \")(\".*? \")
At the same time, we can specify the serialization and deserialization Parser (SerDe) of the parsing file in Hive, and a built-in org. apache. hadoop. hive. serde2.RegexSerDe regular parser, which can be used directly.
3. Table creation statement
Create table logs
(
Host STRING,
Identity STRING,
Username STRING,
Time STRING,
Request STRING,
Status STRING,
Size STRING,
Referer STRING,
Agent STRING
)
Row format serde 'org. apache. hadoop. hive. serde2.regexserde'
With serdeproperties (
"Input. regex "=" ([^] *) ([^] *) ([^] *) (\ [. * \]) (\ ". *? \ ") (-| [0-9] *) (-| [0-9] *) (\". *? \")(\".*? \")",
"Output. format. string "=" % 1 $ s % 2 $ s % 3 $ s % 4 $ s % 5 $ s % 6 $ s % 7 $ s % 8 $ s % 9 $ s"
)
Stored as textfile;
4. load data:
Load data local inpath '/home/hadoop/hivetestdata/nginx.txt' into table logs;
Query IP addresses with an access volume of more than 100 per hour:
Select substring (time, 2, 14) datetime, host, count (*) as count
From logs
Group by substring (time, 2, 14), host
Having count> 100
Sort by datetime, count;