Analyzing Nginx logs with Hive

The Nginx log analyzed here is a website access log, for example:

    180.173.250.74 - - [08/Jan/2015:12:38:08 +0800] "GET /avatar/xxx.png HTTP/1.1" 200 968
    "http://www.iteblog.com/archives/994"
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)
    Chrome/34.0.1847.131 Safari/537.36"

This log has 9 columns (the line break above was added only for readability), each separated by a space. In order they are: client IP, user identity, user name, access time, request, request status, response size, referer, and browser user agent (UA). Parsing such a line with a plain delimiter is awkward, but with a regular expression it is easy to match all nine columns:

    ([^ ]*) ([^ ]*) ([^ ]*) (\[.*\]) (\".*?\") (-|[0-9]*) (-|[0-9]*) (\".*?\") (\".*?\")
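
To sanity-check the pattern, Hive's built-in regexp_extract function can pull a single capture group out of a raw line. The one-column staging table raw_logs(line STRING) used below is just an assumption for illustration:

    -- Hypothetical staging table raw_logs(line STRING) holding unparsed log lines.
    SELECT regexp_extract(
             line,
             '([^ ]*) ([^ ]*) ([^ ]*) (\\[.*\\]) (".*?") (-|[0-9]*) (-|[0-9]*) (".*?") (".*?")',
             6) AS status   -- capture group 6 is the HTTP status code
    FROM raw_logs
    LIMIT 5;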

This lets us extract the value of each column. In Hive we can specify the input parser (SerDe) for a table, and a regex-based parser, org.apache.hadoop.hive.contrib.serde2.RegexSerDe, ships with Hive, so we can use it directly. The whole CREATE TABLE statement can then be written like this:

[Java]View PlainCopy
  1. CREATE TABLE Logs (
  2. Host STRING,
  3. Identity STRING,
  4. User STRING,
  5. Time STRING,
  6. Request STRING,
  7. Status STRING,
  8. Size STRING,
  9. Referer STRING,
  10. Agent STRING)
  11. ROW FORMAT SERDE ' org.apache.hadoop.hive.contrib.serde2.RegexSerDe '
  12. With Serdeproperties (
  13. "Input.regex" = "([^]*) ([^]*) ([^]*) (\\[.*\\]) (\". *?\ ") (-|[ 0-9]*)
  14. (-| [0-9]*] (\". *?\") (\ ". *?\") ",
  15. "output.format.string" = "%1 $ s%2$s%3$s%4$s%5$s%6$s%7$s%8$s%9$s"
  16. )
  17. STORED as Textfile;
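
Depending on the Hive installation, the hive-contrib jar that contains this SerDe may not be on the classpath and has to be registered first; a minimal sketch, with an assumed jar location:

    -- The jar path is an assumption; adjust it to wherever hive-contrib lives in your installation.
    ADD JAR /usr/lib/hive/lib/hive-contrib.jar;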

The log files only need to end up in the table's directory; Hive reads both gzip-compressed (.gz) and uncompressed files there directly.
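
One way to get a log file into that directory from the Hive CLI is a LOAD DATA statement; the local path below is an assumption:

    -- Local path is an assumption; LOAD DATA copies the file into the table's warehouse directory.
    LOAD DATA LOCAL INPATH '/var/log/nginx/www.iteblog.com.access.log' INTO TABLE logs;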

With the data in place, a query like the following lists the IPs that made more than 20 requests in a single hour:

    hive> SELECT substring(time, 2, 14) date, host, count(*) AS count
        > FROM logs
        > GROUP BY substring(time, 2, 14), host
        > HAVING count > 20
        > SORT BY date, count;
    29/Dec/...:10   106.39.255.133   ...
    29/Dec/...:10   211.99.9.68      ...
    29/Dec/...:10   60.10.71.97      ...
    29/Dec/...:10   222.128.29.21    ...
    29/Dec/...      211.151.238.52   144
    ...


Other kinds of analysis can be run against the table in the same way.
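
For instance, a sketch of listing the ten most requested URLs from the same logs table:

    -- Top ten request lines by number of hits.
    SELECT request, count(*) AS cnt
    FROM logs
    GROUP BY request
    ORDER BY cnt DESC
    LIMIT 10;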


If you are familiar with bash, you can also use awk, sort, and similar tools instead of Hive. For example, to count today's requests per IP and show the ten busiest, the command can be written like this:

    [root@iteblog]# awk '{print $1}' www.iteblog.com.access.log | sort | uniq -c |
    > sort -nr | head -n 10
        241 46.119.121.149
        224 66.249.65.51
            66.249.65.49
        219 66.249.65.47
            211.151.238.52
        184 207.46.13.96
        183 157.55.39.44
        182 112.247.104.147
        173 157.55.39.239
        169 157.55.39.106
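
For comparison, the same top-ten list of client IPs can be produced from the Hive table defined earlier; a sketch:

    -- Top ten client IPs by number of requests.
    SELECT host, count(*) AS cnt
    FROM logs
    GROUP BY host
    ORDER BY cnt DESC
    LIMIT 10;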
