The format of the IBM HTTP Server access log
Define the format of the log
We can use a predefined format or customize the format of the access log in the IBM HTTP Server configuration files. Unless otherwise noted below, assume that the log uses the classic predefined format named combined.
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined
The following is a brief description of each field:
%h = The client IP address that originated the request. The address recorded here is not necessarily that of the real end-user client; it may be the publicly mapped address of a client on a private network, or the address of a proxy server.
%l = The client identity per RFC 1413 (ident); only clients that implement the RFC 1413 specification provide this information.
%u = The ID of the authenticated user.
%t = The time the request was received.
%r = The request line from the client.
%>s = The status code the server returned to the client.
%b = The size in bytes of the body returned to the client, not including the response headers.
%{Referer}i = The referring page.
%{User-agent}i = The browser type.
The following is a sample log entry (a single line, reconstructed here; the referring page is a placeholder):

202.189.63.115 - - [31/Aug/2012:15:42:31 +0800] "GET / HTTP/1.1" 200 1365 "http://www.example.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1"
Set up rotating logs
Because a web server can receive a huge number of visits in a day, we should write the access log to a series of separate files; this prevents any single file from growing too large to open in an editor. For example, we can configure a new log file to be started every 5 MB.
Linux Server:
TransferLog "|/opt/ibm/httpserver/bin/rotatelogs /opt/ibm/httpserver/logs/access_log 5M"
Windows Server:
CustomLog "|C:/ibm/httpserver/bin/rotatelogs.exe C:/ibm/httpserver/logs/access%Y_%m_%d_%H_%M_%S.log 5M" combined
Introduction to awk
awk is a "pattern scanning and processing language." It allows you to create short programs that read input files, sort data, process it, perform calculations on the input, and generate reports. Its name comes from the surname initials of its creators: Alfred Aho, Peter Weinberger, and Brian Kernighan.
The awk command discussed in this article refers primarily to the built-in program /bin/gawk that is widely included in the Linux operating system, which is the GNU version of the Unix awk program. This command reads and runs programs written in the awk language. On the Windows platform, you can use Cygwin to run awk in an emulated environment.
Basically, awk reads records (that is, lines of text) from its input (standard input, or one or more files) and looks for those that match a specified pattern. Each time a match is found, it performs the associated action (such as writing to standard output or to an external file).

awk language basics
To understand awk programs, we outline their basics below. An awk program can consist of one or more lines of text, the core of which is a combination of a pattern and an action:
pattern {action}
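As a quick illustration of pattern/action pairs (a minimal, self-contained sketch; the input file and its contents are made up):

```shell
# Create a tiny made-up input file: one record per line,
# fields separated by spaces.
cat > fruits.txt <<'EOF'
apple 10
banana 25
cherry 7
EOF

# Pattern only: the default action {print} prints every matching record.
awk '$2 > 9' fruits.txt          # prints the apple and banana lines

# Action only: with no pattern, every record matches.
awk '{print $1}' fruits.txt      # prints the first field of each line

# Pattern and action together.
awk '$1 == "cherry" {print $0}' fruits.txt   # prints: cherry 7
```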
The pattern is matched against each line of text in the input. For each line that matches, awk performs the corresponding action. The action is enclosed in curly braces, which separate it from the pattern. awk scans the input line by line, using the record separator (a newline character by default) to delimit each record it reads, and the field separator (spaces or tabs by default) to split each line of text into fields. The fields can be referenced as $1, $2, ..., $n: $1 denotes the first field, $2 the second, and $n the nth; $0 denotes the entire record. Either the pattern or the action may be omitted. If the pattern is omitted, every line matches; if the action is omitted, the default action {print} is executed, which prints the entire record.

Use awk to extract information from the log
Because we specify the fixed format of the access log in the IBM HTTP Server configuration file, we can easily use awk parsing to extract the data we need.
Take the following log entry as an example:
202.189.63.115 - - [31/Aug/2012:15:42:31 +0800] "GET / HTTP/1.1" 200 1365 "http://www.example.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1"
$0 is the entire record line; $1 is the client IP, "202.189.63.115"; $4 is the first half of the request time, "[31/Aug/2012:15:42:31"; $5 is the second half, "+0800]".
And so on ...
Using the default field separator, we can parse the following kinds of information from the log:
awk '{print $1}' access.log  # IP address (%h)
awk '{print $2}' access.log  # RFC 1413 identity (%l)
awk '{print $3}' access.log  # user ID (%u)
awk '{print $4,$5}' access.log  # date and time (%t)
awk '{print $7}' access.log  # URI (part of %r)
awk '{print $9}' access.log  # status code (%>s)
awk '{print $10}' access.log  # response size (%b)
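To verify these field positions, we can feed awk a single sample line (the same hypothetical entry as above, in a made-up file) and print several fields at once:

```shell
# One hypothetical combined-format log line.
cat > sample_access.log <<'EOF'
202.189.63.115 - - [31/Aug/2012:15:42:31 +0800] "GET / HTTP/1.1" 200 1365 "http://www.example.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1"
EOF

# Print the IP address, status code, and response size in one pass.
awk '{print $1, $9, $10}' sample_access.log
# prints: 202.189.63.115 200 1365
```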
It is not hard to see that the default field separator alone makes it inconvenient to parse out other information, such as the request line, the referring page, and the browser type, because this information contains an indeterminate number of spaces. Therefore, we need to change the field separator to " so that this information can be read easily.
awk -F\" '{print $2}' access.log  # request line (%r)
awk -F\" '{print $4}' access.log  # referring page
awk -F\" '{print $6}' access.log  # browser
Note: to prevent the Unix/Linux shell from misinterpreting the " character, we escaped it with a backslash.
Now that we have mastered the basics of awk and how it parses logs, we are ready to begin our real-world "adventure."
Example awk scenarios

Count browser types
If we want to know which types of browsers have visited the site, sorted by frequency in descending order, we can use the following command:
awk -F\" '{print $6}' access.log | sort | uniq -c | sort -nr
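A self-contained way to try this pipeline, using a tiny inline sample log (the IPs and user-agent strings are made up):

```shell
# Three hypothetical log lines; two share the same (shortened) user-agent.
cat > ua_access.log <<'EOF'
1.1.1.1 - - [31/Aug/2012:15:42:31 +0800] "GET / HTTP/1.1" 200 100 "-" "Mozilla/5.0 Firefox/15.0.1"
2.2.2.2 - - [31/Aug/2012:15:43:02 +0800] "GET / HTTP/1.1" 200 100 "-" "Mozilla/5.0 Firefox/15.0.1"
3.3.3.3 - - [31/Aug/2012:15:44:10 +0800] "GET / HTTP/1.1" 200 100 "-" "curl/7.21.0"
EOF

# Extract the browser field, group identical values, count them,
# and sort the counts in descending order.
awk -F\" '{print $6}' ua_access.log | sort | uniq -c | sort -nr
# prints the Firefox user-agent with count 2, then curl with count 1
```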
This command line first extracts the browser field, then pipes the output to the first sort command. The first sort groups identical lines together so that the uniq command can count the occurrences of each distinct browser. The final sort arranges the counts in descending order and prints them.

Identify problems with the system
We can use the command line below to count the status codes returned by the server and discover possible problems with the system.
awk '{print $9}' access.log | sort | uniq -c | sort
Normally, status codes 200 and 30x should occur most often. 40x usually indicates a client-side problem, and 50x a server-side problem.
Here are some common status codes:
200 - The request succeeded; the response headers or data body expected by the request are returned with the response.
206 - The server successfully processed a partial GET request.
301 - The requested resource has been permanently moved to a new location.
302 - The requested resource temporarily resides under a different URI.
400 - Bad request; the server cannot understand the current request.
401 - Unauthorized; the current request requires user authentication.
403 - Forbidden; the server understood the request but refuses to execute it.
404 - Not found; the resource does not exist on the server.
500 - The server encountered an unexpected condition that prevented it from completing the request.
503 - The server is temporarily unable to process the request, due to maintenance or overload.
The definitions of HTTP status codes can be found in: Hypertext Transfer Protocol -- HTTP/1.1.
Here are some examples of awk commands involving status codes:
1. Find and display all requests with a status code of 404
awk '($9 ~ /404/)' access.log
2. Count all requests with status codes of 404
awk '($9 ~ /404/)' access.log | awk '{print $9,$7}' | sort
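A useful variant (our own sketch, not part of the original command set) counts how many 404s each URI produced, most frequent first:

```shell
# Hypothetical lines: two 404s for the same path, plus one successful request.
cat > err_access.log <<'EOF'
1.1.1.1 - - [31/Aug/2012:15:42:31 +0800] "GET /missing HTTP/1.1" 404 209 "-" "curl/7.21.0"
2.2.2.2 - - [31/Aug/2012:15:43:02 +0800] "GET /missing HTTP/1.1" 404 209 "-" "curl/7.21.0"
3.3.3.3 - - [31/Aug/2012:15:44:10 +0800] "GET / HTTP/1.1" 200 1365 "-" "curl/7.21.0"
EOF

# Keep only 404 records, print their URIs, and count each URI.
awk '($9 ~ /404/) {print $7}' err_access.log | sort | uniq -c | sort -nr
# output: the count (2) followed by /missing
```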
Now suppose that a certain request (for example, URI /path/to/notfound) generates a large number of 404 errors. We can find out which referring pages this request came from, and from which browsers, with the following command:
awk -F\" '($2 ~ "^GET /path/to/notfound") {print $4,$6}' access.log
Trace who is hotlinking images from the site
System administrators sometimes find that, for whatever reason, other sites are using images stored on their own site. If you want to know who is using images from your website without authorization, you can use the following command:
awk -F\" '($2 ~ /\.(jpg|gif|png)/ && $4 !~ /^http:\/\/www\.example\.com/) \
{print $4}' access.log | sort | uniq -c | sort
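To see the filter in action on made-up data (www.example.com stands in for your own domain, as above; the other hostnames are invented):

```shell
# Two hypothetical image requests: one referred from our own site,
# one hotlinked from elsewhere.
cat > img_access.log <<'EOF'
1.1.1.1 - - [31/Aug/2012:15:42:31 +0800] "GET /logo.png HTTP/1.1" 200 4096 "http://www.example.com/index.html" "curl/7.21.0"
2.2.2.2 - - [31/Aug/2012:15:43:02 +0800] "GET /logo.png HTTP/1.1" 200 4096 "http://evil.example.org/stolen.html" "curl/7.21.0"
EOF

# Only the request whose referrer is NOT our own site is reported.
awk -F\" '($2 ~ /\.(jpg|gif|png)/ && $4 !~ /^http:\/\/www\.example\.com/) {print $4}' img_access.log
# prints: http://evil.example.org/stolen.html
```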
Note: before using this, change www.example.com to your own site's domain name. The command uses " to split each line into fields; the request line must contain ".jpg", ".gif", or ".png"; the referring page must not begin with your site's domain name string (in this case, www.example.com); all matching referring pages are printed and their occurrences counted.

Commands related to client IP addresses
Count how many distinct IP addresses have accessed the site:
awk '{print $1}' access.log | sort | uniq | wc -l
Count the number of pages accessed by each IP:
awk '{++S[$1]} END {for (a in S) print a,S[a]}' log_file
Sort the number of pages accessed by each IP, from fewest to most:
awk '{++S[$1]} END {for (a in S) print S[a],a}' log_file | sort -n
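The associative array S above can be tried out on a small made-up log (file name and contents are hypothetical):

```shell
# Three hypothetical requests from two distinct IPs.
cat > ip_access.log <<'EOF'
10.0.0.1 - - [31/Aug/2012:15:42:31 +0800] "GET /a HTTP/1.1" 200 100 "-" "curl/7.21.0"
10.0.0.1 - - [31/Aug/2012:15:43:02 +0800] "GET /b HTTP/1.1" 200 100 "-" "curl/7.21.0"
10.0.0.2 - - [31/Aug/2012:15:44:10 +0800] "GET /a HTTP/1.1" 200 100 "-" "curl/7.21.0"
EOF

# Count requests per IP, then sort numerically by count.
awk '{++S[$1]} END {for (a in S) print S[a],a}' ip_access.log | sort -n
# prints:
#   1 10.0.0.2
#   2 10.0.0.1
```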
View which pages a particular IP (for example, 202.106.19.100) has accessed:
grep ^202.106.19.100 access.log | awk '{print $1,$7}'
Count how many distinct IP addresses visited during the 14:00 hour on August 31, 2012:
awk '{print $4,$1}' access.log | grep 31/Aug/2012:14 | awk '{print $2}' | sort | uniq | wc -l
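The same statistic can also be computed in a single awk invocation, using an associative array as a set of distinct IPs (our own sketch; the log file name and contents are made up):

```shell
# Two requests from the same IP inside the 14:00 hour, one outside it.
cat > hour_access.log <<'EOF'
10.0.0.1 - - [31/Aug/2012:14:05:31 +0800] "GET / HTTP/1.1" 200 100 "-" "curl/7.21.0"
10.0.0.1 - - [31/Aug/2012:14:10:02 +0800] "GET / HTTP/1.1" 200 100 "-" "curl/7.21.0"
10.0.0.2 - - [31/Aug/2012:15:44:10 +0800] "GET / HTTP/1.1" 200 100 "-" "curl/7.21.0"
EOF

# Record each matching IP as an array key, then count the distinct keys.
awk '$4 ~ /31\/Aug\/2012:14/ {ips[$1]=1} END {n=0; for (i in ips) n++; print n}' hour_access.log
# prints: 1
```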
Top 10 IP addresses by number of accesses:
awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -10
Commands related to response size