- An access log is the log generated by Apache, Nginx, or another web server. Each entry corresponds to one request to the site and carries a lot of information. Analyzing the access log well gives an overall picture of how the site is running, and when something goes wrong, the analysis results can roughly locate the problem. Engineers responsible for site operations and architecture need to know the access log very well, and engineers working on strategy can use it to evaluate the effect of their strategies and to study user behavior.
- Analyzing the access log also makes it possible to estimate user visits, peak access periods, and access by region, which provides a good data reference for performance testing and server capacity expansion.
Composition of the Accesslog
101.226.166.254 - - [21/Oct/2013:20:34:28 +0800] "GET /movie_cat.php?year=2013 HTTP/1.1" 200 5209 "http://www.baidu.com" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; MDDR; .NET4.0C; .NET4.0E; .NET CLR 1.1.4322; Tablet PC 2.0); 360Spider"
The fields of this record mean the following:
- 101.226.166.254: the user's IP address
- [21/Oct/2013:20:34:28 +0800]: the access time
- GET: the HTTP request method; GET and POST are the two most common
- /movie_cat.php?year=2013: the requested page is a dynamic page; movie_cat.php is the backend interface being requested, and year=2013 is the parameter passed to it
- 200: the response status code. 200 means success; other common values are 301 (permanent redirect), 4xx (error in the request), and 5xx (server-side error)
- 5209: the number of bytes transferred
- "http://www.baidu.com": the Referer field, that is, the page the user came from
- "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; MDDR; .NET4.0C; .NET4.0E; .NET CLR 1.1.4322; Tablet PC 2.0); 360Spider": the User-Agent field, which commonly records the operating system, browser version, browser engine, and similar information
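The field breakdown above can be sketched as a small awk helper. This is a minimal sketch, not a definitive parser: it assumes each record follows the common "combined" layout (client IP, two placeholder fields, bracketed timestamp, then the quoted request, status, bytes, quoted referer, and quoted agent), and the function name parse_access_line is hypothetical.

```shell
# parse_access_line (hypothetical name): read combined-format records on
# stdin and print one "name=value" line per field.
# Assumption: the quoted parts never contain an embedded double quote.
parse_access_line() {
  awk -F'"' '{
    split($1, head, " ")        # ip, identd, user, [time, zone]
    split($2, req, " ")         # method, url, protocol
    split($3, resp, " ")        # status, bytes
    time = head[4] " " head[5]
    gsub(/\[|\]/, "", time)     # strip the surrounding brackets
    print "ip=" head[1]
    print "time=" time
    print "method=" req[1]
    print "url=" req[2]
    print "status=" resp[1]
    print "bytes=" resp[2]
    print "referer=" $4
    print "agent=" $6
  }'
}

# Usage: head -n 1 access_log | parse_access_line
```

Splitting on the double quote first keeps the request, referer, and agent intact even though they contain spaces; the unquoted pieces are then split on whitespace.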
Data statistics and analysis
- Interface request frequency: count requests per interface, by day and by hour. This shows how the site is running, how often each interface is requested, and what users' habits are.
- Response time: the average response time per day, per interface, and per interface by hour. Requests with a long response time may point to a performance defect, and the corresponding interface should be optimized.
- Exception analysis: requests whose status code is not 200, or whose response time exceeds a given threshold. A large number of 404 responses is bad for SEO and should be avoided as far as possible.
- Parameter statistics: an interface behind a dynamic page usually takes several parameters. When one or a few of them are particularly important, the statistics can be refined further to give results per parameter value for that interface.
- IP source statistics: count the source IPs of web requests. Combined with IP geolocation, this yields access by region; the same per-IP statistics can also help identify possible attacks or intrusion attempts.
- Spider crawl analysis: search engine spiders usually identify themselves in the agent field. Analyzing that field shows how many times per day Baidu, Google, and other search engines crawled the site and which pages spiders crawl most often, which is the basis for SEO.
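Some of the statistics listed above can be sketched with a few awk helpers. This is a minimal sketch that assumes the combined log format (timestamp in field 4, URL in field 7, status code in field 9, and the agent as the sixth double-quote-delimited field). The function names are hypothetical, and the crawler name to look for is supplied by the caller.

```shell
# requests_per_hour FILE PATH: hits per hour for one interface.
# The query string is ignored; output is "day:hour count", sorted as text.
requests_per_hour() {
  awk -v path="$2" '{
    url = $7
    sub(/\?.*/, "", url)              # drop the query string
    if (url != path) next
    split($4, t, ":")                 # "[21/Oct/2013", "20", "34", "28"
    hits[substr(t[1], 2) ":" t[2]]++  # key such as 21/Oct/2013:20
  } END { for (h in hits) print h, hits[h] }' "$1" | sort
}

# status_counts FILE: "count status" pairs, most frequent first.
status_counts() {
  awk '{ c[$9]++ } END { for (s in c) print c[s], s }' "$1" | sort -nr
}

# spider_hits FILE NAME: requests whose agent contains NAME (case-insensitive).
spider_hits() {
  awk -F'"' -v bot="$2" '
    index(tolower($6), tolower(bot)) { n++ }
    END { print n + 0 }' "$1"
}

# Usage:
#   requests_per_hour access_log /movie_cat.php
#   status_counts access_log
#   spider_hits access_log 360Spider
```

A status_counts result dominated by 404 or 5xx codes is the kind of exception the list above suggests looking into first.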
Shell Script analysis for Accesslog
- Count the established TCP connections on port 80
netstat -tan | grep "ESTABLISHED" | grep ":80" | wc -l
- The remote addresses with the most connections to the current web server:
netstat -ntu | awk '{print $5}' | sort | uniq -c | sort -n -r
Sample output:
231 ::ffff:127.0.0.1:8095
23 ::ffff:192.168.50.201:5432
2 ::ffff:192.168.50.203:80
1 servers)
1 ::ffff:192.168.50.56:43314
1 ::ffff:192.168.50.21:2996
1 ::ffff:192.168.50.21:2989
1 ::ffff:192.168.50.200:8060
1 ::ffff:192.168.50.12:1300
1 ::ffff:192.168.50.12:1299
1 ::ffff:192.168.50.12:1298
1 ::ffff:127.0.0.1:57933
1 Address
1 192.168.50.41:65310
1 192.168.50.41:64949
1 192.168.50.41:49653
- View the 10 IPs with the most requests in a log
cat access_log | cut -d ' ' -f 1 | sort | uniq -c | sort -nr | awk '{print $0}' | head -n 10 | less
Sample output:
14085 121.207.252.122
13753 218.66.36.119
11069 220.162.237.6
1188 59.63.158.118
1025 ::1
728 220.231.141.28
655 114.80.126.139
397 117.25.55.100
374 222.76.112.211
348 120.6.214.70
- View the IPs that appear more than 100 times in the log
cat access_log | cut -d ' ' -f 1 | sort | uniq -c | awk '{if ($1 > 100) print $0}' | sort -nr | less
Sample output:
14085 121.207.252.122
13753 218.66.36.119
11069 220.162.237.6
1188 59.63.158.118
1025 ::1
728 220.231.141.28
655 114.80.126.139
397 117.25.55.100
374 222.76.112.211
348 120.6.214.70
252 58.211.82.150
252 159.226.126.21
206 121.204.57.94
192 59.61.111.58
186 218.85.73.40
145 221.231.139.30
134 121.14.148.220
123 222.246.128.220
122 61.147.123.46
119 121.204.105.58
107 116.9.75.237
105 118.123.5.173
.....
- Count the requests for a given page on a given day
grep '12/Nov/2012' access_log | grep "******.htm" | wc -l
- List the URLs whose access time exceeds 30 ms (the response time must be logged as the last field of each record)
cat access_log | awk '($NF > 30){print $7}' | sort | uniq -c | sort -nr | head -20
- List the .php URLs whose response time exceeds 60 (in the unit the log records) and count their occurrences
cat access_log | awk '($NF > 60 && $7 ~ /\.php/){print $7}' | sort | uniq -c | sort -nr | head -100
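Beyond filtering on a threshold, the same last-field convention can give an average response time per URL. The following is a minimal sketch under the same assumption as the two commands above, namely that field 7 is the URL and the last field is the response time; the function name avg_response_time is hypothetical.

```shell
# avg_response_time FILE: print "average count url" for every URL,
# slowest average first. The unit is whatever the server logs.
avg_response_time() {
  awk '{
    sum[$7] += $NF      # accumulate response time per URL
    cnt[$7]++           # and the number of requests
  } END {
    for (u in sum) printf "%.1f %d %s\n", sum[u] / cnt[u], cnt[u], u
  }' "$1" | sort -nr
}

# Usage: avg_response_time access_log | head -20
```

Printing the request count next to the average helps tell a consistently slow interface from one that had a single slow request.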
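- Count the UV of the /index.html page (approximated here by distinct client IPs, taken as the first field of each record)
grep "/index.html" access.log | cut -d " " -f 1 | sort | uniq | wc -l
- Count the PV of the /index.html page
grep "/index.html" access.log | wc -l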
Definition of IP, UV, and PV
- IP (unique IP): IP stands for Internet Protocol; here it means the number of distinct IP addresses. The same IP address is counted only once between 00:00 and 24:00.
- PV (page views): the number of pages viewed or clicked. Every refresh by a user counts once.
- UV (unique visitors): a unique visitor is one client computer that accesses your website. The same client is counted only once between 00:00 and 24:00.
The difference between IP, PV, and UV
- IP (unique IP): the number of distinct IP addresses from which computers visited the website. This figure is easy to collect and reliable, so it is an important measure of website traffic.
- PV (page views): PV reflects the number of pages viewed on the site, so every refresh counts once. PV grows with the number of visitors, but it measures the number of pages viewed, not the number of visitors.
- UV (unique visitors): can be understood as the number of computers that access the website. The site identifies a visiting computer by its cookies. If a visitor changes IP address but does not clear cookies and then visits the same site again, the site's UV count stays the same.
An example:
- Three people share one computer that connects through ADSL. They visit the site "goto52", and each of them views 2 pages. The site's traffic statistics are then:
- IP (unique IP): 1
- PV (page views): 6 (3 people multiplied by 2 pages)
- UV (unique visitors): 1
- If each of the three re-dials the ADSL connection, and so gets a new IP address, before viewing their 2 pages, then:
- IP (unique IP): 3
- PV (page views): 6
- UV (unique visitors): 1
- Therefore IP (unique IP) reflects the number of network addresses, while UV (unique visitors) reflects the number of actual users; each UV corresponds to an actual viewer more accurately than each IP does.
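The example above can be reproduced with a small counting sketch. The input layout "ip visitor_id url" is hypothetical: visitor_id stands for a cookie-based identifier, which a stock access log does not contain unless the server is configured to log it, and the function name traffic_stats is likewise an assumption.

```shell
# traffic_stats (hypothetical): read "ip visitor_id url" records on stdin,
# one page view per line, all from the same day, and print IP, PV, and UV.
traffic_stats() {
  awk '{
    pv++                                               # every page view counts
    if (!($1 in ips))      { ips[$1]; ip++ }           # first time this IP is seen
    if (!($2 in visitors)) { visitors[$2]; uv++ }      # first time this cookie is seen
  } END { printf "IP=%d PV=%d UV=%d\n", ip, pv, uv }'
}

# Three people on one computer (one cookie, c1) and one IP, 2 pages each:
#   printf '%s\n' '10.0.0.1 c1 /a' '10.0.0.1 c1 /b' '10.0.0.1 c1 /a' \
#     '10.0.0.1 c1 /b' '10.0.0.1 c1 /a' '10.0.0.1 c1 /b' | traffic_stats
```

Feeding the same six page views with three different IPs but the same cookie changes only the IP count, which matches the second case in the example.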
About access_log files and the definitions of IP, UV, and PV