Take the following log record as an example:
202.189.63.115 - - [31/Aug/2012:15:42:31 +0800] "GET / HTTP/1.1" 200 1365 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1"
$0 is the whole record line.
$1 is the client IP, "202.189.63.115".
$4 is the first half of the request time, "[31/Aug/2012:15:42:31".
$5 is the second half of the request time, "+0800]".
And so on.
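To see the split for yourself, the short sketch below feeds a cleaned-up copy of the sample record to awk and prints each field next to its index. The variable names line and fields are only illustrative, and the status code 200 in the record is an assumption.

```shell
# Cleaned-up copy of the sample record; the status code 200 is an assumption.
line='202.189.63.115 - - [31/Aug/2012:15:42:31 +0800] "GET / HTTP/1.1" 200 1365 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1"'

# Walk over all NF fields and print each one next to its index.
fields=$(printf '%s\n' "$line" | awk '{ for (i = 1; i <= NF; i++) printf "$%d = %s\n", i, $i }')
printf '%s\n' "$fields"
```

On this record the loop prints 19 lines. The browser string alone takes up fields $12 through $19, which is why whitespace splitting is awkward for that part of the record.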
When we use the default field separator (whitespace), we can parse the following pieces of information from the log:
awk '{print $1}' access.log      # IP address (%h)
awk '{print $2}' access.log      # RFC 1413 identity (%l)
awk '{print $3}' access.log      # user ID (%u)
awk '{print $4,$5}' access.log   # date and time (%t)
awk '{print $7}' access.log      # URI (part of %r)
awk '{print $9}' access.log      # status code (%>s)
awk '{print $10}' access.log     # response size (%b)
It is easy to see that the default field separator alone makes it awkward to parse other information, such as the request line, the referring page, and the browser type, because these values contain an indeterminate number of spaces. Therefore, we change the field separator to the double quote (") so that this information can be read easily.
awk -F\" '{print $2}' access.log   # request line (%r)
awk -F\" '{print $4}' access.log   # referring page
awk -F\" '{print $6}' access.log   # browser
Note: To keep the Unix/Linux shell from treating " as the start of a string, we escape it with a backslash.
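As a quick check of the double-quote separator, the sketch below splits a cleaned-up copy of the sample record and pulls out the three quoted values. The variable names are illustrative, and the status code 200 in the record is an assumption.

```shell
# Cleaned-up copy of the sample record; the status code 200 is an assumption.
line='202.189.63.115 - - [31/Aug/2012:15:42:31 +0800] "GET / HTTP/1.1" 200 1365 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1"'

# With " as the separator, the even-numbered fields hold the quoted values.
request=$(printf '%s\n' "$line" | awk -F\" '{ print $2 }')
referer=$(printf '%s\n' "$line" | awk -F\" '{ print $4 }')
agent=$(printf '%s\n' "$line" | awk -F\" '{ print $6 }')

printf 'request: %s\nreferer: %s\nagent:   %s\n' "$request" "$referer" "$agent"
```

The odd-numbered fields ($1, $3, $5) hold the unquoted text between the quoted values, so they are of little use with this separator.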
Examples of awk usage scenarios
Counting browser types
If we want to know which types of browsers have visited the site, sorted by number of requests in descending order, we can use the following command:
awk -F\" '{print $6}' access.log | sort | uniq -c | sort -nr
This command line first extracts the browser field and then pipes it into the first sort command. The first sort groups identical lines so that uniq -c can count how many times each browser appears. The last sort command orders those counts numerically, from highest to lowest, and prints the result.
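The counting idiom can be tried without a log file. In the sketch below, the browser names are made-up stand-ins for field $6, and sort -nr is used so that the counts compare numerically.

```shell
# Hypothetical user-agent values standing in for field $6 of a real log.
agents='Firefox
Chrome
Firefox
Safari
Firefox
Chrome'

# sort groups identical lines, uniq -c counts each group,
# and sort -nr orders the counts from highest to lowest.
counts=$(printf '%s\n' "$agents" | sort | uniq -c | sort -nr)
printf '%s\n' "$counts"
```

The first sort is what makes this work: uniq only merges adjacent duplicate lines, so unsorted input would produce several separate counts for the same browser.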
Discovering problems with the system
We can use the following command line to count the status codes returned by the server and to discover possible problems with the system.
awk '{print $9}' access.log | sort | uniq -c | sort
Normally, status codes 200 and 30x should occur most often. 40x usually indicates a client-side access problem, and 50x usually indicates a server-side problem.
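For a coarser overview, the status codes can be grouped by their first digit. The sketch below is one possible way to do that, run against a small hand-made log; the log lines, the temporary file, and the variable names are all invented for illustration.

```shell
# Build a tiny hypothetical access log in a temporary file.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - [31/Aug/2012:14:01:00 +0800] "GET / HTTP/1.1" 200 1365 "-" "UA"
10.0.0.2 - - [31/Aug/2012:14:02:00 +0800] "GET /old HTTP/1.1" 301 0 "-" "UA"
10.0.0.1 - - [31/Aug/2012:14:03:00 +0800] "GET /missing HTTP/1.1" 404 512 "-" "UA"
10.0.0.3 - - [31/Aug/2012:15:04:00 +0800] "GET /missing HTTP/1.1" 404 512 "-" "UA"
10.0.0.3 - - [31/Aug/2012:15:05:00 +0800] "GET /app HTTP/1.1" 503 0 "-" "UA"
EOF

# Use the first digit of the status code ($9) as the class, e.g. 404 -> 4xx,
# and count the records in each class with an associative array.
classes=$(awk '{ n[substr($9, 1, 1) "xx"]++ } END { for (c in n) print c, n[c] }' "$log" | sort)
printf '%s\n' "$classes"
rm -f "$log"
```

A sudden rise in the 4xx or 5xx counts is usually the first hint that something is wrong, after which the per-code commands help narrow it down.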
Here are some common status codes:
- 200: The request succeeded, and the requested response headers or body are returned with this response.
- 206: The server successfully processed a partial GET request.
- 301: The requested resource has been permanently moved to a new location.
- 302: The requested resource temporarily resides under a different URI.
- 400: Bad request. The server could not understand the request.
- 401: Unauthorized. The request requires user authentication.
- 403: Forbidden. The server understood the request but refuses to fulfill it.
- 404: Not found. The requested resource does not exist on the server.
- 500: The server encountered an unexpected condition that prevented it from fulfilling the request.
- 503: The server is currently unable to handle requests because of temporary maintenance or overload.
The HTTP status code definitions can be found in: Hypertext Transfer Protocol -- HTTP/1.1
Examples of awk commands for status codes:
1. Find and display all requests with a status code of 404
awk '($9 ~ /404/)' access.log
2. List the status code and URI of all requests with a status code of 404, sorted so that identical URIs are adjacent (append "| uniq -c" to count them)
awk '($9 ~ /404/)' access.log | awk '{print $9,$7}' | sort
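If the goal is a count per URI, awk can do the tallying itself. The sketch below is one possible variant, run against a made-up log. It compares $9 with 404 directly rather than with a regular expression, and the log lines and variable names are invented for illustration.

```shell
# Build a tiny hypothetical access log in a temporary file.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - [31/Aug/2012:14:01:00 +0800] "GET / HTTP/1.1" 200 1365 "-" "UA"
10.0.0.2 - - [31/Aug/2012:14:02:00 +0800] "GET /missing HTTP/1.1" 404 512 "-" "UA"
10.0.0.3 - - [31/Aug/2012:14:03:00 +0800] "GET /gone HTTP/1.1" 404 512 "-" "UA"
10.0.0.4 - - [31/Aug/2012:14:04:00 +0800] "GET /missing HTTP/1.1" 404 512 "-" "UA"
EOF

# Count 404 responses per URI ($7), then list the most frequent first.
notfound=$(awk '$9 == 404 { n[$7]++ } END { for (u in n) print n[u], u }' "$log" | sort -nr)
printf '%s\n' "$notfound"
rm -f "$log"
```

The URI at the top of this list is the natural candidate for the referring-page lookup.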
Now suppose that one request (for example, the URI /path/to/notfound) produces a large number of 404 errors. We can find out which referring page and which browser the request came from with the following command:
awk -F\" '($2 ~ "^GET /path/to/notfound "){print $4,$6}' access.log
Tracking who is hotlinking the site's images
System administrators sometimes find that other sites are, for whatever reason, using images stored on their site. To find out exactly who is using the images on your website without authorization, we can use the following command:
awk -F\" '($2 ~ /\.(jpg|gif|png)/ && $4 !~ /^http:\/\/www\.example\.com/) {print $4}' access.log | sort | uniq -c | sort
Note: Before using, change www.example.com to the domain name of your website.
- Split each line on the double quote (");
- The request line must contain ".jpg", ".gif", or ".png";
- The referring page must not start with your site's domain name (in this case, www.example.com);
- Print all matching referring pages and count the number of occurrences of each.
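The steps above can be tried on a small hand-made log before pointing the command at real data. In the sketch below, the log lines, the other.example.net referrer, and the variable names are all invented for illustration.

```shell
# Build a tiny hypothetical access log in a temporary file.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - [31/Aug/2012:14:01:00 +0800] "GET /img/a.jpg HTTP/1.1" 200 100 "http://www.example.com/index.html" "UA"
10.0.0.2 - - [31/Aug/2012:14:02:00 +0800] "GET /img/a.jpg HTTP/1.1" 200 100 "http://other.example.net/page" "UA"
10.0.0.3 - - [31/Aug/2012:14:03:00 +0800] "GET /img/b.png HTTP/1.1" 200 100 "http://other.example.net/page" "UA"
10.0.0.4 - - [31/Aug/2012:14:04:00 +0800] "GET /index.html HTTP/1.1" 200 100 "http://other.example.net/page" "UA"
EOF

# Keep image requests ($2) whose referring page ($4) is not on our own site,
# then count how often each foreign referring page appears.
hotlinks=$(awk -F\" '($2 ~ /\.(jpg|gif|png)/ && $4 !~ /^http:\/\/www\.example\.com/) { print $4 }' "$log" | sort | uniq -c | sort -n)
printf '%s\n' "$hotlinks"
rm -f "$log"
```

Only the two image requests referred from the other site are counted. The image request from www.example.com and the request for /index.html are both filtered out.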
Commands related to client IP addresses
Count the total number of distinct client IPs:
awk '{print $1}' access.log | sort | uniq | wc -l
Count the number of pages that each IP visited:
awk '{++S[$1]} END {for (a in S) print a,S[a]}' log_file
Sort the per-IP page counts from smallest to largest:
awk '{++S[$1]} END {for (a in S) print S[a],a}' log_file | sort -n
See which pages a given IP accessed (for example, 202.106.19.100):
grep '^202\.106\.19\.100 ' access.log | awk '{print $1,$7}'
Count how many distinct IPs accessed the site during the 14:00 hour on August 31, 2012:
awk '{print $4,$1}' access.log | grep 31/Aug/2012:14 | awk '{print $2}' | sort | uniq | wc -l
List the 10 IP addresses with the most requests:
awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -10
Commands related to response size
List the 100 requests with the largest transfer size:
awk '{print $10 " " $1 " " $4 " " $7}' access.log | sort -nr | head -100
List the pages whose output is larger than 204800 bytes (200 KB), with the number of times each page appears:
awk '($10 > 204800){print $7}' access.log | sort -n | uniq -c | sort -nr | head -100
Commands related to page response time
Suppose the last column of the log records the time taken to serve the request (%T). For example, we can customize the log format as:
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %T" combined
You can then use the following command to list all log records with a response time of more than 3 seconds:
awk '($NF > 3){print $0}' access.log
Note: NF is the number of fields in the current record. $NF is the last field.
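To illustrate $NF, the sketch below filters a small made-up log whose last column is assumed to hold the %T value in seconds. The log lines and variable names are invented for illustration.

```shell
# Build a tiny hypothetical access log whose last column is the response time.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - [31/Aug/2012:14:01:00 +0800] "GET / HTTP/1.1" 200 1365 "-" "UA" 1
10.0.0.2 - - [31/Aug/2012:14:02:00 +0800] "GET /report HTTP/1.1" 200 9000 "-" "UA" 4
10.0.0.3 - - [31/Aug/2012:14:03:00 +0800] "GET /export HTTP/1.1" 200 9000 "-" "UA" 7
EOF

# $NF is the last field of each record, whatever the number of fields is.
# Print the URI and the response time of every request slower than 3 seconds.
slow=$(awk '($NF > 3) { print $7, $NF }' "$log")
printf '%s\n' "$slow"
rm -f "$log"
```

Because $NF always refers to the last field, the filter keeps working even though the browser string splits into a different number of fields from one record to the next.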
List the 20 most frequent requests that took longer than 5 seconds:
awk '($NF > 5){print $0}' access.log | awk -F\" '{print $2}' | sort -n | uniq -c | sort -nr | head -20