Analyzing Nginx logs with Hive

The Nginx log used here is the website's access log, for example:
180.173.250.74 - - [08/Jan/2015:12:38:08 +0800] "GET /avatar/xxx.png HTTP/1.1" 200 968
"http://www.iteblog.com/archives/994"
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/34.0.1847.131 Safari/537.36"
This log contains nine columns (line breaks have been added here for readability), each separated by a space. In order, they are: client IP, user identity, user name, access time, request, response status code, size of the returned file, referer, and browser user agent (UA). Parsing this log with ordinary string splitting is awkward, but with a regular expression it is easy to match all nine columns:
([^ ]*) ([^ ]*) ([^ ]*) (\[.*\]) (\".*?\") (-|[0-9]*) (-|[0-9]*) (\".*?\") (\".*?\")
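Before wiring the pattern into a table definition, it can be sanity-checked from the Hive shell with the built-in regexp_extract(subject, pattern, group) function. A minimal sketch (the sample line is shortened here, and SELECT without a FROM clause needs a reasonably recent Hive):

-- Pull the 4th capture group (the access time) out of the sample line.
SELECT regexp_extract(
  '180.173.250.74 - - [08/Jan/2015:12:38:08 +0800] "GET /avatar/xxx.png HTTP/1.1" 200 968 "http://www.iteblog.com/archives/994" "Mozilla/5.0"',
  '([^ ]*) ([^ ]*) ([^ ]*) (\\[.*\\]) (".*?") (-|[0-9]*) (-|[0-9]*) (".*?") (".*?")',
  4);
-- returns: [08/Jan/2015:12:38:08 +0800]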
This makes it possible to match the value of each column. In Hive we can specify a parser (SerDe) for the input files, and a regex-based parser, org.apache.hadoop.hive.contrib.serde2.RegexSerDe, ships with Hive, so we can use it directly. The whole CREATE TABLE statement can therefore be written like this:
CREATE TABLE logs (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (\\[.*\\]) (\".*?\") (-|[0-9]*) (-|[0-9]*) (\".*?\") (\".*?\")",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
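With the table defined, the raw log file just needs to land in the table's storage directory; one way is Hive's LOAD DATA statement. A minimal sketch (the local path below is hypothetical):

-- Copy a local access log into the table's warehouse directory
-- (the file path is only an example).
LOAD DATA LOCAL INPATH '/var/log/nginx/www.iteblog.com.access.log'
INTO TABLE logs;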
Once the log files sit in the table's directory, Hive can read them directly, whether they are plain text or gzip-compressed. The following query finds the IPs whose hourly request count exceeds 20:
hive> SELECT substring(time, 2, 14) date, host, count(*) AS count
      FROM logs
      GROUP BY substring(time, 2, 14), host
      HAVING count > 20
      SORT BY date, count;
The result is one row per hour and client IP together with its request count; for example, 211.151.238.52 shows up with 144 requests in a single hour on 29/Dec.
Other analyses follow the same pattern.
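For example, a breakdown of requests by HTTP status code is just another aggregation over the same table; a minimal sketch:

-- How many requests ended in each HTTP status code,
-- most frequent statuses first.
SELECT status, count(*) AS cnt
FROM logs
GROUP BY status
ORDER BY cnt DESC;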
If you're familiar with bash, you can also skip Hive and do simple analyses with awk, sort, and the like. For example, to count today's requests per IP and list the top 10:
[root@iteblog]# awk '{print $1}' www.iteblog.com.access.log | sort | uniq -c |
> sort -nr | head -n 10
241 46.119.121.149
224 66.249.65.51
... 66.249.65.49
219 66.249.65.47
... 211.151.238.52
184 207.46.13.96
183 157.55.39.44
182 112.247.104.147
173 157.55.39.239
169 157.55.39.106
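For comparison, the same top-10 ranking can also be expressed against the logs table defined earlier; a rough Hive equivalent of the awk pipeline above:

-- Top 10 client IPs by number of requests, the Hive counterpart
-- of awk | sort | uniq -c | sort -nr | head.
SELECT host, count(*) AS cnt
FROM logs
GROUP BY host
ORDER BY cnt DESC
LIMIT 10;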