Counting IP Access Times with awk
I have a file with more than 2 million records and want to use awk in the shell to compute statistics on it. The file format is as follows:
# keyword # url # IP address #
Test|123|1
Test|123|1
Test|123|2
Test2|12|1
Test2|123|1
Test2|123|2
The goal is to compute, for each keyword and URL pair, the total number of visits and the number of distinct IPs, and to write the result to a file.
This is simple in SQL: select keyword, url, count(1), count(distinct ip) group by keyword, url. However, the data volume is too large and the report never finishes running, so I would like to do it in the shell. I am not proficient in shell and do not know how to implement this quickly, especially the distinct part.
The ideal result is:
# keyword # url # distinct IPs # search count
Test 123 2 3
Test2 123 2 2
Test2 12 1 1
The awk solution is as follows:
awk -F "|" '{a[$1" "$2]++; b[$1" "$2" "$3]++} (b[$1" "$2" "$3]==1) {++c[$1" "$2]} END{for (i in a) print i, c[i], a[i]}' file
Output (awk does not guarantee the order of the lines):
Test2 123 2 2
Test2 12 1 1
Test 123 2 3
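The same idea can be spelled out with named arrays. The following is only a sketch: it uses the six sample records from above, and the temporary input and output files are placeholders created with mktemp rather than anything from the original question. The seen array marks each (keyword, url, ip) combination the first time it appears, which is what gives the distinct count.

```shell
#!/bin/sh
# Placeholder files for the sketch; replace with real paths.
input=$(mktemp)
output=$(mktemp)
trap 'rm -f "$input" "$output"' EXIT

# Sample data in the keyword|url|ip format shown above.
cat > "$input" <<'EOF'
Test|123|1
Test|123|1
Test|123|2
Test2|12|1
Test2|123|1
Test2|123|2
EOF

awk -F '|' '
{
    pair = $1 " " $2            # grouping key: keyword and url
    visits[pair]++              # total visits for this pair
    if (!seen[pair, $3]++)      # true only the first time this ip shows up for the pair
        distinct[pair]++        # so this counts distinct IPs
}
END {
    for (pair in visits)
        print pair, distinct[pair], visits[pair]
}' "$input" | LC_ALL=C sort > "$output"   # sort only to get a stable line order

cat "$output"
```

awk reads the file once and keeps one array entry per distinct combination, so memory use grows with the number of distinct (keyword, url, ip) triples rather than with the number of lines.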
Counting Accesses per IP per Hour in One Day's Apache Log
The log format is as follows:
127.0.0.1 - - [03/Feb/2013:14:18:10 +0800] "GET /ucenterrvicecenter/scenterrequest.php HTTP/1.0" 302 242
127.0.0.1 - - [03/Feb/2013:14:18:10 +0800] "GET /ucenterrvicecenter/scenterrequest.php HTTP/1.0" 200 -
111.111.111.35 - - [03/Feb/2013:14:18:32 +0800] "GET /myadmin/ HTTP/1.1" 401 933
111.111.111.35 - root [03/Feb/2013:14:18:33 +0800] "GET /myadmin/ HTTP/1.1" 200 1826
111.111.111.35 - root [03/Feb/2013:14:18:34 +0800] "GET /myadmin/main.php?token=67b1c9d29f9ac9107627bb991c8d2ca6 HTTP/1.1" 200 7633
111.111.111.35 - - [03/Feb/2013:14:18:34 +0800] "GET /myadmin/css/print.css?token=67b1c9d29f9ac9107627bb991c8d2ca6 HTTP/1.1" 200 1063
111.111.111.35 - root [03/Feb/2013:14:18:34 +0800] "GET /myadmin/css/phpmyadmin.css.php?token=67b1c9d29f9ac9107627bb991c8d2ca6&js_frame=right&nocache=1359872314 HTTP/1.1" 200 20322
111.111.111.35 - root [03/Feb/2013:14:18:34 +0800] "GET /myadmin/navigation.php?token=67b1c9d29f9ac9107627bb991c8d2ca6 HTTP/1.1" 200 1362
111.111.111.35 - root [03/Feb/2013:14:18:36 +0800] "GET /myadmin/css/phpmyadmin.css.php?token=67b1c9d29f9ac9107627bb991c8d2ca6&js_frame=left&nocache=1359872314 HTTP/1.1" 200 3618
111.111.111.35 - root [03/Feb/2013:14:18:38 +0800] "GET /myadmin/navigation.php?server=1&db=ucenter&table=&lang=zh-utf-8&collation_connection=utf8_unicode_ci HTTP/1.1" 200 9631
The code is as follows:
[root@localhost sampdb]# awk -v FS="[:]" '{gsub(/ *-.*/, "", $1); num[$2" "$1]++} END{for (i in num) print i, num[i]}' data1
14 127.0.0.1 2
14 111.111.111.35 8
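An alternative that some readers may find easier to follow keeps awk's default whitespace splitting and pulls the hour out of the timestamp field with split(). This is a sketch that assumes the usual Apache layout, where the IP is field 1 and the timestamp is field 4; the temporary file and the four sample lines (taken from the log above) are only there to make it runnable.

```shell
#!/bin/sh
log=$(mktemp)                     # placeholder for the real access log
trap 'rm -f "$log"' EXIT

cat > "$log" <<'EOF'
127.0.0.1 - - [03/Feb/2013:14:18:10 +0800] "GET /ucenterrvicecenter/scenterrequest.php HTTP/1.0" 302 242
127.0.0.1 - - [03/Feb/2013:14:18:10 +0800] "GET /ucenterrvicecenter/scenterrequest.php HTTP/1.0" 200 -
111.111.111.35 - - [03/Feb/2013:14:18:32 +0800] "GET /myadmin/ HTTP/1.1" 401 933
111.111.111.35 - root [03/Feb/2013:14:18:33 +0800] "GET /myadmin/ HTTP/1.1" 200 1826
EOF

hourly=$(awk '
{
    split($4, t, ":")             # $4 looks like "[03/Feb/2013:14:18:10"; t[2] is the hour
    count[t[2] " " $1]++          # key: hour, then IP
}
END {
    for (key in count)
        print key, count[key]
}' "$log" | LC_ALL=C sort)        # sort only to get a stable line order

printf '%s\n' "$hourly"
```

Because the fields are split on whitespace, the IP in $1 needs no cleanup, and a logged user name such as root in field 3 does not shift the timestamp field.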
Counting the Number of Accesses per IP in a Log with awk
Given an existing log, we need to count how many times each IP accessed the site.
180.153.114.199 - - [03/Jul/2013:14:44:43 +0800] GET /wp-login.php?redirect_to=http%3a%2f%2fdemo.catjia.com%2fwp-admin%2fplugin-install.php%3ftab%3dsearch%26s%3dvasiliki%26plugin-search-input%3d%25e6%2590%259c%25e7%25b4%25a2%25e6%258f%2592%25e4%25bb%25b6&reauth=1 HTTP/1.1 2355 - Mozilla/4.0 -
101.226.33.200 - - [03/Jul/2013:14:45:52 +0800] GET /wp-admin/plugin-install.php?tab=search&type=term&s=photogram&plugin-search-input=%e6%90%9c%e7%b4%a2%e6%8f%92%e4%bb%b6 HTTP/1.1 302 0 - Mozilla/4.0 -
101.226.33.200 - - [03/Jul/2013:14:45:52 +0800] GET /wp-login.php?redirect_to=http%3a%2f%2fdemo.catjia.com%2fwp-admin%2fplugin-install.php%3ftab%3dsearch%26type%3dterm%26s%3dphotogram%26plugin-search-input%3d%25e6%2590%259c%25e7%25b4%25a2%25e6%258f%2592%25e4%25bb%25b6&reauth=1 HTTP/1.1 2370 - Mozilla/4.0 -
113.110.176.131 - - [03/Jul/2013:15:03:57 +0800] GET /wp-content/themes/catjia-lio/images/menu_hover_bg.png HTTP/1.1 304 0 http://demo.catjia.com/wp-content/themes/catjia-lio/style.css Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0 -
180.153.205.103 - - [03/Jul/2013:15:13:59 +0800] GET /wp-admin/options-general.php HTTP/1.1 302 0 - Mozilla/4.0 -
180.153.205.103 - - [03/Jul/2013:15:13:59 +0800] GET /wp-login.php?redirect_to=http%3a%2f%2fdemo.catjia.com%2fwp-admin%2foptions-general.php&reauth=1 HTTP/1.1 2269 - Mozilla/4.0 -
101.226.51.227 - - [03/Jul/2013:15:14:07 +0800] GET /wp-admin/options-general.php?settings-updated=true HTTP/1.1 302 0 - Mozilla/4.0 -
101.226.51.227 - - [03/Jul/2013:15:14:07 +0800] GET /wp-login.php?redirect_to=http%3a%2f%2fdemo.catjia.com%2fwp-admin%2foptions-general.php%3fsettings-updated%3dtrue&reauth=1 HTTP/1.1 2291 - Mozilla/4.0 -
At first glance there are too many log records. Where do we start?
Many people know that awk can extract the first column, which is the IP address.
But what comes after extracting it? How do you count how many times each IP appears?
It sounds complicated, but it becomes easy once you have done it a few times. Use the IP as the index of an awk array, add 1 to that element for every line, and print the whole array at the end:
# awk '{a[$1]+=1;} END{for (i in a) {print a[i]" "i}}' demo.catjia.com_access.log
2 180.153.206.26
120 113.110.176.131
2 101.226.33.200
2 101.226.66.175
2 112.65.193.16
2 101.226.51.227
2 112.64.235.86
2 101.226.33.223
1 101.227.252.23
2 180.153.205.103
2 101.226.33.216
2 112.64.235.89
4 180.153.114.199
2 112.64.235.254
2 180.153.206.34
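As a cross-check, the same per-IP counts can be produced without an awk array by using the classic sort and uniq -c pipeline. This is a sketch on three sample lines in the space-separated layout of the log above; the temporary file is a placeholder for the real log, and the final awk call only strips the padding that uniq -c puts in front of each count.

```shell
#!/bin/sh
log=$(mktemp)                     # placeholder for demo.catjia.com_access.log
trap 'rm -f "$log"' EXIT

cat > "$log" <<'EOF'
180.153.205.103 - - [03/Jul/2013:15:13:59 +0800] GET /wp-admin/options-general.php HTTP/1.1 302 0 - Mozilla/4.0 -
180.153.205.103 - - [03/Jul/2013:15:13:59 +0800] GET /wp-admin/options-general.php HTTP/1.1 302 0 - Mozilla/4.0 -
101.226.51.227 - - [03/Jul/2013:15:14:07 +0800] GET /wp-admin/options-general.php?settings-updated=true HTTP/1.1 302 0 - Mozilla/4.0 -
EOF

# cut keeps the first space-separated field (the IP); sort puts equal IPs
# next to each other so that uniq -c can count each run.
counts=$(cut -d ' ' -f 1 "$log" | LC_ALL=C sort | uniq -c | awk '{print $1, $2}')

printf '%s\n' "$counts"
```

The awk array version makes a single pass and needs no sorting, which usually matters on large logs; the pipeline is handy when you want a quick second opinion on the numbers.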
If you want to keep the results, redirect the output to a text file.
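A minimal sketch of the redirection, assuming sample lines in the space-separated layout of the log above; both file names are placeholders created with mktemp.

```shell
#!/bin/sh
log=$(mktemp)                     # placeholder for the real access log
out=$(mktemp)                     # placeholder result file; any writable path works
trap 'rm -f "$log" "$out"' EXIT

cat > "$log" <<'EOF'
180.153.205.103 - - [03/Jul/2013:15:13:59 +0800] GET /wp-admin/options-general.php HTTP/1.1 302 0 - Mozilla/4.0 -
180.153.205.103 - - [03/Jul/2013:15:13:59 +0800] GET /wp-admin/options-general.php HTTP/1.1 302 0 - Mozilla/4.0 -
101.226.51.227 - - [03/Jul/2013:15:14:07 +0800] GET /wp-admin/options-general.php?settings-updated=true HTTP/1.1 302 0 - Mozilla/4.0 -
EOF

# ">" creates the file or overwrites it; ">>" would append to it instead.
awk '{a[$1]+=1} END{for (i in a) print a[i], i}' "$log" > "$out"

cat "$out"
```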
Now we have the count for each IP, but the output gets harder to read as the data grows. What if we want to know which IP has the most visits?
Then sort the output.
# awk '{a[$1]+=1;} END{for (i in a) {print a[i]" "i}}' demo.catjia.com_access.log | sort
1 101.227.252.23
120 113.110.176.131
2 101.226.33.200
2 101.226.33.216
2 101.226.33.223
2 101.226.51.227
2 101.226.66.175
2 112.64.235.254
2 112.64.235.86
2 112.64.235.89
2 112.65.193.16
2 180.153.205.103
2 180.153.206.26
2 180.153.206.34
4 180.153.114.199
At first glance this looks sorted, but look carefully: why is the IP with 120 visits in second place? Shouldn't it be last?
In fact, sort needs the -g option here (general numeric sort). Without it, sort compares the lines as text, character by character, so "120" comes before "2" and we get the order above.
Here is the result after adding the -g option:
# awk '{a[$1]+=1;} END{for (i in a) {print a[i]" "i}}' demo.catjia.com_access.log | sort -g
1 101.227.252.23
2 101.226.33.200
2 101.226.33.216
2 101.226.33.223
2 101.226.51.227
2 101.226.66.175
2 112.64.235.254
2 112.64.235.86
2 112.64.235.89
2 112.65.193.16
2 180.153.205.103
2 180.153.206.26
2 180.153.206.34
4 180.153.114.199
120 113.110.176.131
That is the result we wanted.
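To list the busiest IPs first, the numeric sort can be reversed and cut off with head. The sketch below feeds four of the count lines from the output above through sort. It uses -n, the numeric option defined by POSIX, instead of -g, which is a GNU extension; for plain integer counts like these the two order the lines the same way.

```shell
#!/bin/sh
# Four "count ip" lines taken from the awk output above.
counts='2 180.153.206.26
120 113.110.176.131
1 101.227.252.23
4 180.153.114.199'

# Plain sort compares text, so "120 ..." lands before "2 ...".
as_text=$(printf '%s\n' "$counts" | LC_ALL=C sort)

# -n compares the leading number; -r reverses, so the highest count comes first.
top2=$(printf '%s\n' "$counts" | sort -rn | head -n 2)

printf 'text order:\n%s\n\ntop 2 by visits:\n%s\n' "$as_text" "$top2"
```

With the real log, the same tail end (sort -rn | head -n 2) can be appended to the awk command instead of the sample lines.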