This article briefly introduces Apache log analysis. It draws mainly on the Log Files section of the official Apache manual: http://httpd.apache.org/docs/2.2/logs.html
I. Log analysis
If Apache is installed with the default configuration, two files, access_log and error_log, are generated in the /logs directory.
1. access_log
access_log is the access log; it records every request handled by the Apache server. Its location and content are controlled by the CustomLog directive, and the LogFormat directive can be used to simplify the selection of the log's content and format.
For example, one of my servers is configured as follows:
CustomLog "|/usr/sbin/rotatelogs /var/log/apache2/%Y_%m_%d_other_vhosts_access.log 86400 480" vhost_combined
-rw-r--r-- 1 root 22310750 12-05 23:59 2010_12_05_other_vhosts_access.log
-rw-r--r-- 1 root 26873180 12-06 23:59 2010_12_06_other_vhosts_access.log
-rw-r--r-- 1 root 26810003 12-07 23:59 2010_12_07_other_vhosts_access.log
-rw-r--r-- 1 root 24530219 12-08 2010_12_08_other_vhosts_access.log
-rw-r--r-- 1 root 24536681 12-09 23:59 2010_12_09_other_vhosts_access.log
-rw-r--r-- 1 root 14003409 12-10 14:57 2010_12_10_other_vhosts_access.log
With this CustomLog directive, the log is piped through rotatelogs, which starts a new log file every 86400 seconds (one day); the 480 is the offset from UTC in minutes (UTC+8). In addition, a scheduled task clears all log files older than a week. This keeps things tidy: each day's log is in its own file, and logs older than a set period are removed. LogFormat defines the log record format.
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combinedproxy
LogFormat "%h %l %u %t \"%r\" %>s %b" common
LogFormat "%{Referer}i -> %U" referer
LogFormat "%{User-agent}i" agent
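The weekly cleanup mentioned above is not shown in the article. As a rough illustration only, the selection step could be sketched in Python as follows, assuming the daily file names follow the rotatelogs pattern from the CustomLog line. The function name expired_logs and the keep_days parameter are made up for this sketch and are not part of the original setup.

```python
import re
from datetime import date, timedelta

# Matches the daily file names produced by the rotatelogs pattern above,
# e.g. 2010_12_05_other_vhosts_access.log
NAME_RE = re.compile(r"^(\d{4})_(\d{2})_(\d{2})_other_vhosts_access\.log$")


def expired_logs(names, today, keep_days=7):
    """Return the names whose embedded date is more than keep_days before today.

    Hypothetical helper: it only selects candidates and deletes nothing.
    """
    cutoff = today - timedelta(days=keep_days)
    old = []
    for name in names:
        m = NAME_RE.match(name)
        if not m:
            continue  # not one of the rotated access logs; leave it alone
        year, month, day = map(int, m.groups())
        if date(year, month, day) < cutoff:
            old.append(name)
    return old


names = [
    "2010_12_05_other_vhosts_access.log",
    "2010_12_09_other_vhosts_access.log",
    "2010_12_10_other_vhosts_access.log",
    "error.log",
]
print(expired_logs(names, today=date(2010, 12, 15)))
```

In real use the names would come from something like os.listdir("/var/log/apache2"), and the returned files would be passed to os.remove from a daily scheduled job.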
Tail any access_log file. Below is a typical access record:
218.19.140.242 - - [10/Dec/2010:09:31:17 +0800] "GET /query/trendxml/district/todayreturn/month/2009-12-14/2010-12-09/haizhu_tianhe.xml HTTP/1.1" 200 1933 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 (.NET CLR 3.5.30729)"
There are nine fields in total; they are broken down one by one below:
218.19.140.242
-
-
[10/Dec/2010:09:31:17 +0800]
"GET /query/trendxml/district/todayreturn/month/2009-12-14/2010-12-09/haizhu_tianhe.xml HTTP/1.1"
200
1933
"-"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 (.NET CLR 3.5.30729)"
1) 218.19.140.242 is the IP address of the client that sent the request to the Apache server. By default, the first field is the IP address of the remote host. If you want Apache to log the host name instead, you can set HostnameLookups to On, but this is not recommended because it can slow the server down considerably. Also, the address here is not necessarily that of the client's own machine: if the client goes through a proxy server, this is the proxy's IP address, not the originating machine's.
2) - This field is empty and shown as "-". It is meant to identify the visitor. The value comes from identd on the client's machine, and Apache will not try to obtain it unless IdentityCheck is On. (I do not fully understand this field; it is basically always empty, so the original text from the manual is quoted here.)
The "hyphen" in the output indicates that the requested piece of information is not available. In this case, the information that is not available is the RFC 1413 identity of the client determined by identd on the clients machine. This information is highly unreliable and should almost never be used except on tightly controlled internal networks. Apache httpd will not even attempt to determine this information unless IdentityCheck is set to On.
3) - This field is also empty here. It records the user ID established by HTTP authentication: if a site requires users to authenticate, the user's identity is logged in this field.
4) [10/Dec/2010:09:31:17 +0800] The fourth field is the time of the request, in the format [day/month/year:hour:minute:second zone]. The trailing +0800 indicates that the server is in the UTC+8 time zone.
5) "GET /..haizhu_tianhe.xml HTTP/1.1" is the most useful information in the whole record. First, it shows that the server received a GET request; second, it gives the resource path the client requested; third, it shows that the client used the HTTP/1.1 protocol. The same parts can be logged individually with the format "%m %U%q %H", that is, request method, path, query string, and protocol.
6) 200 is the status code the server sent back to the client. It tells us whether the request succeeded, was redirected, or ran into an error. The value 200 means the server responded to the request successfully. In general, a code starting with 2 indicates success, a code starting with 3 indicates redirection, a code starting with 4 indicates a client error, and a code starting with 5 indicates a server error. For details, see the HTTP specification (RFC 2616, section 10): http://www.w3.org/Protocols/rfc2616/rfc2616.txt
7) 1933 is the number of bytes the server sent to the client (with %b, not counting the response headers). During log analysis, summing this field gives the total number of bytes the server sent over a given period.
8) "-" is the Referer field (%{Referer}i in the LogFormat above): the page the client reports having come from. "-" means no Referer header was sent.
9) "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 (.NET CLR 3.5.30729)" is the User-Agent field, which mainly records the client's browser information.
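The nine fields above can also be pulled apart programmatically. The following is a minimal Python sketch, assuming the log uses the combined format shown earlier. The regular expression, the function name parse_access_line and the status_class labels are illustrative choices, not something taken from Apache itself. It deliberately ignores escaped quotes inside the quoted fields.

```python
import re
from datetime import datetime

# One named group per field of the "combined" LogFormat:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
COMBINED_RE = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# First digit of the status code -> class, as described in item 6.
STATUS_CLASSES = {
    "2": "success",
    "3": "redirection",
    "4": "client error",
    "5": "server error",
}


def parse_access_line(line):
    """Split one combined-format line into a dict, or return None if it does not match."""
    m = COMBINED_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    # [day/month/year:hour:minute:second zone]
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    # The request line is "method path protocol"; malformed requests are left as None.
    parts = rec["request"].split(" ")
    if len(parts) == 3:
        rec["method"], rec["path"], rec["protocol"] = parts
    else:
        rec["method"] = rec["path"] = rec["protocol"] = None
    rec["status"] = int(rec["status"])
    rec["status_class"] = STATUS_CLASSES.get(str(rec["status"])[0], "other")
    # %b logs "-" when no body bytes were sent.
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec


SAMPLE = (
    '218.19.140.242 - - [10/Dec/2010:09:31:17 +0800] '
    '"GET /query/trendxml/district/todayreturn/month/2009-12-14/2010-12-09/haizhu_tianhe.xml HTTP/1.1" '
    '200 1933 "-" '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.8) '
    'Gecko/20100722 Firefox/3.6.8 (.NET CLR 3.5.30729)"'
)

rec = parse_access_line(SAMPLE)
print(rec["host"], rec["time"].isoformat(), rec["method"], rec["status"],
      rec["status_class"], rec["size"])
```

Summing rec["size"] over the parsed lines of a period gives the byte total mentioned in item 7.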
2. error_log
error_log is the error log; it records any errors encountered while processing requests. Its name and location are set by the ErrorLog directive. When something goes wrong on the server, the error log is usually the first place to look, which makes it the most important log file.
Tail error_log and pick any record:
[Fri Dec 10 15:03:59 2010] [error] [client 218.19.140.242] File does not exist: /home/htmlfile/tradedata/favicon.ico
It also consists of several fields:
[Fri Dec 10 15:03:59 2010]
[error]
[client 218.19.140.242]
File does not exist: /home/htmlfile/tradedata/favicon.ico
1) [Fri Dec 10 15:03:59 2010] records the time when the error occurred. Note that the format differs from the time format used in access_log above.
2) [error] is the severity level of the message. Which levels are written to the log is controlled by the LogLevel directive. The 404 above is logged at the error level.
3) [client 218.19.140.242] records the IP address of the client.
4) File does not exist: /home/htmlfile/tradedata/favicon.ico is the description of the error. Here the client requested a file that does not exist or used a wrong path, so a 404 error was returned.
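An error_log record can be split the same way. The following is a minimal Python sketch, assuming the Apache 2.2 layout shown above: time, level, an optional client field, then the message. The name parse_error_line and the regular expression are illustrative assumptions, and other Apache versions may use a different layout.

```python
import re
from datetime import datetime

# [time] [level] [client ip] message   (the client part is treated as optional)
ERROR_RE = re.compile(
    r'^\[(?P<time>[^\]]+)\] \[(?P<level>[^\]]+)\] '
    r'(?:\[client (?P<client>[^\]]+)\] )?(?P<message>.*)$'
)


def parse_error_line(line):
    """Split one error_log line into a dict, or return None if it does not match."""
    m = ERROR_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    # e.g. "Fri Dec 10 15:03:59 2010" (no time zone, unlike access_log)
    rec["time"] = datetime.strptime(rec["time"], "%a %b %d %H:%M:%S %Y")
    return rec


SAMPLE = (
    "[Fri Dec 10 15:03:59 2010] [error] [client 218.19.140.242] "
    "File does not exist: /home/htmlfile/tradedata/favicon.ico"
)

rec = parse_error_line(SAMPLE)
print(rec["time"].isoformat(), rec["level"], rec["client"], rec["message"])
```

Filtering the parsed records on rec["level"] or rec["client"] is then a one-line list comprehension.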
II. Practical log analysis scripts
Now that the log formats are clear, here are some log analysis commands that have been posted on the Internet.
1. View the number of Apache processes (the [h]ttpd pattern keeps grep from counting itself)
ps aux | grep '[h]ttpd' | wc -l
2. Analyze the log to count requests per IP for the current day (in the format shown above, the client IP is field $1)
cat default-access_log | grep "10/Dec/2010" | awk '{print $1}' | sort | uniq -c | sort -nr
3. View the URLs accessed by a specified IP address on the current day
cat default-access_log | grep "10/Dec/2010" | grep "218.19.140.242" | awk '{print $7}' | sort | uniq -c | sort -nr
4. View the top 10 URLs of the current day
cat default-access_log | grep "10/Dec/2010" | awk '{print $7}' | sort | uniq -c | sort -nr | head -n 10
5. See what a specified IP address has been requesting (in the format shown above, the URL is field $7)
cat default-access_log | grep 218.19.140.242 | awk '{print $1 "\t" $7}' | sort | uniq -c | sort -nr | less
6. View the minutes with the most requests (find the hotspots)
awk '{print $4}' default-access_log | cut -c 14-18 | sort | uniq -c | sort -nr | head
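For readers who prefer a script to one-liners, items 2 to 6 can be approximated in Python with collections.Counter. This is a rough sketch, assuming whitespace-separated lines in the combined format shown earlier. The function names and the shortened sample lines are invented for illustration.

```python
from collections import Counter


def top_ips(lines, day, n=10):
    """Requests per client IP (field 1) for lines containing day, e.g. "10/Dec/2010"."""
    counts = Counter(line.split()[0] for line in lines if day in line)
    return counts.most_common(n)


def top_urls(lines, day, n=10, ip=None):
    """Most requested URLs (field 7) on a given day, optionally for a single client IP."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if day not in line or len(parts) < 7:
            continue
        if ip is not None and parts[0] != ip:
            continue
        counts[parts[6]] += 1
    return counts.most_common(n)


def busiest_minutes(lines, n=10):
    """Requests per HH:MM, the same slice as cut -c 14-18 on field 4."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) > 3:
            # parts[3] looks like "[10/Dec/2010:09:31:17"
            counts[parts[3][13:18]] += 1
    return counts.most_common(n)


# Shortened, made-up lines in the same format, for illustration only.
SAMPLE_LINES = [
    '218.19.140.242 - - [10/Dec/2010:09:31:17 +0800] "GET /a.xml HTTP/1.1" 200 1933 "-" "Mozilla/5.0"',
    '218.19.140.242 - - [10/Dec/2010:09:31:40 +0800] "GET /a.xml HTTP/1.1" 200 1933 "-" "Mozilla/5.0"',
    '10.0.0.1 - - [10/Dec/2010:09:32:05 +0800] "GET /b.xml HTTP/1.1" 404 209 "-" "Mozilla/5.0"',
]

print(top_ips(SAMPLE_LINES, "10/Dec/2010"))
print(top_urls(SAMPLE_LINES, "10/Dec/2010", ip="218.19.140.242"))
print(busiest_minutes(SAMPLE_LINES))
```

To run it against a real file, pass an open file object as lines, for example open("default-access_log").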
III. Use awstats to analyze logs automatically
Of course, if you want the simplest and most intuitive way to analyze logs, use a dedicated tool. awstats is a popular Perl-based web log analysis tool; it is powerful and also supports IIS and other servers.
http://awstats.sourceforge.net
For installation and configuration, see http://blog.s135.com/post/199/
Author: 21 aspnet