How to analyze a Web site log file

Source: Internet
Author: User

If your blog or website runs on paid hosting and you are its webmaster, yet you know nothing about the raw access log or simply dismiss it, I can only say that you are not a competent webmaster: once the site runs into trouble, you will be helpless. I believe everyone installs statistics code on their own site, such as Google Analytics, Quantum Statistics, Baidu Statistics, CNZZ, 51.la, and so on. These tools count the site's traffic, that is, every page a visitor can view. They cannot, however, record raw access to the resources on your host, such as who downloaded an image, or access to places where no statistics code was added, such as a back-end admin page.

The vast majority of paid hosts provide raw access logs. The web server automatically records information about every visit and saves it in the raw access log file. If your host does not provide logging, consider switching hosts when your plan expires. The log records access to every resource on the site, including images, CSS, JS, Flash, HTML, MP3, and every other resource loaded when a page is opened. It also records who accessed each resource, how it was accessed, and what the result was. In short, the raw access log records the usage of every resource on the host.

What is the point of analyzing site logs?

1. We can see more precisely how search engine spiders crawl our site, and we can block fake spiders (these mostly scrape content and increase the cost of running our server) once we have checked whether a visitor claiming to be Baiduspider is genuine;

2. By analyzing the site log, we can determine exactly which pages search engine spiders crawl and how long they spend, and then fine-tune the site accordingly;

3. HTTP status codes: every time a search engine spider or a user visits the site, the server returns a status code such as 301, 404, or 200. We can use this information to make a quick diagnosis of problems on the site and deal with them promptly.
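One common way to check whether a visitor claiming to be Baiduspider is genuine is a reverse DNS lookup on its IP followed by a forward lookup on the resulting host name. The sketch below is only an illustration of that idea: the accepted domain suffixes are an assumption to verify against Baidu's current webmaster documentation, and the stub resolvers, host names, and IP addresses are hypothetical so the example runs without network access.

```python
import socket

# Assumption: genuine Baiduspider IPs reverse-resolve to hosts under these domains.
BAIDU_SUFFIXES = (".baidu.com", ".baidu.jp")

def is_genuine_baiduspider(ip, reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then confirm with a forward lookup."""
    try:
        hostname = reverse(ip)[0].lower()
        if not hostname.endswith(BAIDU_SUFFIXES):
            return False
        # The forward lookup must lead back to the same IP.
        return ip in forward(hostname)[2]
    except OSError:  # covers socket.herror and socket.gaierror
        return False

# Offline demonstration with stub resolvers (hypothetical host names and IPs).
def stub_reverse(ip):
    names = {"192.0.2.10": "baiduspider-192-0-2-10.crawl.baidu.com",
             "192.0.2.99": "host99.scraper.example"}
    if ip not in names:
        raise socket.herror("no PTR record")
    return (names[ip], [], [ip])

def stub_forward(hostname):
    return (hostname, [], ["192.0.2.10"])

print(is_genuine_baiduspider("192.0.2.10", stub_reverse, stub_forward))
print(is_genuine_baiduspider("192.0.2.99", stub_reverse, stub_forward))
print(is_genuine_baiduspider("192.0.2.50", stub_reverse, stub_forward))
```

Called with only an IP, the function uses the real resolvers from the `socket` module; the stubs are there purely to make the example repeatable.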

Where are the site log files stored?

Most virtual hosts provide log files, but the folder and file names used to store them differ from one hosting system to another. On the virtual host the author uses, log files are stored in the Wwwlogs folder.

What do the records in a site log file look like?

Each line of the raw access log is a record similar to the following:

116.231.220.179 - - [25/Mar/2015:11:21:15 +0800] "GET /blog/article/10.html HTTP/1.1" 200 8671 "http://www.weiaipin.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:36.0)"

Here is what each part of this record means:

116.231.220.179

This is the IP address of the visitor (which may also be a robot).

[25/Mar/2015:11:21:15 +0800]

This is the date and time at which the visitor accessed the resource. +0800 is the time zone offset, meaning the local time is 8 hours ahead of GMT.

"GET /blog/article/10.html HTTP/1.1"

The request information, including the request method, the requested resource, and the protocol used. This line means: fetch the page /blog/article/10.html using the GET method over the HTTP/1.1 protocol, where 10.html is a page on the site.

200 8671

200 is the status code (HTTP code) returned for the request; different status codes have different meanings, listed in the status code reference at the end of this article. 8671 is the traffic consumed by this request (size), in bytes.

"http://www.weiaipin.cn/"

The visitor's source (Referer). This field tells us where the visitor came from to reach this page. It may be another page on your site, a search engine results page, and so on. With this source information you can track down pages that hotlink your resources.

"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:36.0)"

The browser type (Agent) used by the visitor. It records the user's operating system, browser model, and similar information.
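The field breakdown above can be sketched as a small parser. This is a minimal example that assumes the log uses exactly the layout of the sample record; the pattern, the `parse_line` function, and the field names are illustrative choices, not part of any log tool mentioned in this article. The month abbreviation is parsed with `%b`, which assumes an English (C) locale.

```python
import re
from datetime import datetime

# Pattern for the record layout shown above; other log formats need another pattern.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the timestamp, status code, and size into typed values.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

sample = ('116.231.220.179 - - [25/Mar/2015:11:21:15 +0800] '
          '"GET /blog/article/10.html HTTP/1.1" 200 8671 '
          '"http://www.weiaipin.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:36.0)"')
record = parse_line(sample)
print(record["ip"], record["status"], record["size"], record["path"])
# prints: 116.231.220.179 200 8671 /blog/article/10.html
```

Lines that do not fit the pattern return `None`, so a caller can count or inspect them separately instead of failing on the first odd record.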

How do I analyze the content of a site log?

1. Pay attention to resources that are accessed frequently

If you find in the log that a resource (web page, image, MP3, etc.) is accessed frequently, you should find out where it is being used. If the source of these requests (Referer) is not your site or is empty, and the status code (HTTP code) is 200, your resource is probably being hotlinked. The Referer tells you the URL of the hotlinking page, which may be the reason your traffic has exploded, and you should set up hotlink protection. On my site, for example, the file Japan.mp3 is requested constantly, and what I looked at was only part of the log. The person behind it is persistent: I deleted the file long ago, yet the requests for Japan.mp3 keep coming, no fewer than several hundred within a single hour. Seeing that I had set up hotlink protection, they forged the Referer and Agent and kept changing IP addresses. Unfortunately for them it does not work: the file no longer exists, so every request gets status code 403 or 404.
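The hotlinking check described above (status 200, with a Referer that is empty or not your own site) can be sketched as follows. This example assumes the log has already been parsed into dicts with `path`, `status`, and `referer` keys; the `OWN_HOSTS` set, the `/music/` path, and the `other.example` host are hypothetical values used only for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

OWN_HOSTS = {"www.weiaipin.cn"}  # assumption: hosts that count as "your site"

def hotlink_candidates(records, own_hosts=OWN_HOSTS):
    """Count successful requests per (path, referer host) with an empty or external referer."""
    counts = Counter()
    for rec in records:
        if rec["status"] != 200:
            continue
        referer = rec["referer"]
        host = urlparse(referer).hostname if referer not in ("", "-") else None
        if host in own_hosts:
            continue  # request came from one of our own pages
        counts[(rec["path"], host or "(empty)")] += 1
    return counts

# Hypothetical parsed records.
records = [
    {"path": "/music/Japan.mp3", "status": 200, "referer": "http://other.example/page"},
    {"path": "/music/Japan.mp3", "status": 200, "referer": "http://other.example/page2"},
    {"path": "/music/Japan.mp3", "status": 200, "referer": "-"},
    {"path": "/blog/article/10.html", "status": 200, "referer": "http://www.weiaipin.cn/"},
    {"path": "/music/Japan.mp3", "status": 404, "referer": "http://other.example/page"},
]
for (path, host), n in hotlink_candidates(records).most_common():
    print(n, path, host)
```

An empty Referer is not proof of hotlinking, since direct visits and some privacy tools also omit it, so the counts are best read as a list of resources worth a closer look.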

2. Watch for requests for resources that do not exist on your site

Some requests are for resources that are not part of this site, so the HTTP code is either 403 or 404. Judging by their names, though, the requested files may be ones that hold database information; if someone obtained such a file, attacking your site would become much easier. The purpose of these requests is to scan your site for vulnerabilities: by blindly trying to download files with known weaknesses, the scanner may well find a hole in your site. Observation shows that the Agent used in these requests is almost always Mozilla/4.0, Mozilla/5.0, libwww-perl/, or some other unconventional browser type. The log formatting tool I provide includes an alarm function for these requests. We can block these agents to prevent scanning; the specific method is described below.
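A rough filter for the scanning pattern described above might look like the sketch below. It assumes parsed records with `ip`, `path`, `status`, and `agent` keys. The list of suspicious file suffixes is an illustrative assumption, not taken from the article, and the sample IPs and paths are hypothetical.

```python
from collections import defaultdict

BARE_AGENTS = {"Mozilla/4.0", "Mozilla/5.0"}            # agents with no browser details
SUSPICIOUS_SUFFIXES = (".mdb", ".sql", ".bak", ".zip")  # assumption: illustrative list only

def is_probe(rec):
    """True for a 403/404 request with an unconventional agent or a suspicious file name."""
    if rec["status"] not in (403, 404):
        return False
    agent = rec["agent"].strip()
    odd_agent = agent in BARE_AGENTS or agent.lower().startswith("libwww-perl/")
    odd_path = rec["path"].lower().endswith(SUSPICIOUS_SUFFIXES)
    return odd_agent or odd_path

def probes_by_ip(records):
    """Group the probed paths by the IP that requested them."""
    grouped = defaultdict(list)
    for rec in records:
        if is_probe(rec):
            grouped[rec["ip"]].append(rec["path"])
    return dict(grouped)

# Hypothetical parsed records.
records = [
    {"ip": "203.0.113.5", "path": "/data/site.mdb", "status": 404, "agent": "Mozilla/4.0"},
    {"ip": "203.0.113.5", "path": "/admin/backup.sql", "status": 404, "agent": "libwww-perl/5.805"},
    {"ip": "198.51.100.7", "path": "/old-page.html", "status": 404,
     "agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:36.0)"},
    {"ip": "198.51.100.7", "path": "/blog/article/10.html", "status": 200, "agent": "Mozilla/5.0"},
]
print(probes_by_ip(records))
```

An ordinary 404 from a full browser agent, such as a visitor following a stale link, is deliberately not flagged, which keeps the report focused on requests that look automated.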

3. Observe the visits of search engine spiders

By observing the log, you can see how often spiders visit your site and therefore whether search engines favor it, which are the questions SEO cares about. The log formatting tool also includes a prompt function for search engine spiders. The agents used by the spiders of common search engines are listed below:

Google Spider: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Baidu Spider: Baiduspider+(+http://www.baidu.com/search/spider.htm)

Yahoo! Spider: Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

Yahoo! China Spider: Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)

Microsoft Bing Spider: msnbot/2.0b (+http://search.msn.com/msnbot.htm)

Google AdSense Spider: Mediapartners-Google

Youdao Spider: Mozilla/5.0 (compatible; YoudaoBot/1.0; http://www.youdao.com/help/webmaster/spider/; )

Soso Blog Search Spider: Sosoblogspider+(+http://help.soso.com/soso-blog-spider.htm)

Sogou Spider: Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)

Twiceler Crawler: Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)

Google Image Search Spider: Googlebot-Image/1.0

Russian Yandex Search Engine Spider: Yandex/1.01.001 (compatible; Win16; I)

Alexa Spider: ia_archiver (+http://www.alexa.com/site/help/webmasters; [Email protected])

Feedsky Spider: Mozilla 5.0 (compatible; Feedsky crawler/1.0; http://www.feedsky.com)

Korean Yeti Spider: Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)
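To see how often each spider in the list above visits, the Agent field can be matched against a distinctive token from each string. The following is a minimal sketch: the token table is derived from the list above, the matching is a simple case-insensitive substring test, and the function names are illustrative. Since the Agent can be forged, a match only shows what the visitor claims to be.

```python
from collections import Counter

# (token, label) pairs; more specific tokens come first so they win the match.
SPIDER_TOKENS = [
    ("googlebot-image", "Google image search"),
    ("googlebot", "Google"),
    ("mediapartners-google", "Google AdSense"),
    ("baiduspider", "Baidu"),
    ("slurp china", "Yahoo! China"),
    ("slurp", "Yahoo!"),
    ("msnbot", "Microsoft Bing"),
    ("youdaobot", "Youdao"),
    ("sosoblogspider", "Soso blog search"),
    ("sogou web spider", "Sogou"),
    ("twiceler", "Twiceler"),
    ("yandex", "Yandex"),
    ("ia_archiver", "Alexa"),
    ("feedsky", "Feedsky"),
    ("yeti", "Naver Yeti"),
]

def identify_spider(agent):
    """Return the spider label for an Agent string, or None for ordinary visitors."""
    lowered = agent.lower()
    for token, label in SPIDER_TOKENS:
        if token in lowered:
            return label
    return None

def spider_counts(agents):
    """Count visits per spider label, ignoring agents that match no token."""
    labels = (identify_spider(a) for a in agents)
    return Counter(label for label in labels if label is not None)

agents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Googlebot-Image/1.0",
    "Baiduspider+(+http://www.baidu.com/search/spider.htm)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:36.0)",
]
print(spider_counts(agents))
```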

4. Observe visitor behavior

By looking at the formatted log, you can trace the series of requests made by a single IP over a period of time. The more access records a single IP has, the higher your site's PV (page views) and the better its user stickiness; if a single IP has very few access records, you should think about how to make your content more attractive. Analyzing visitor behavior gives you a solid reference for building the site: which content works, which does not, and in which direction the site should develop. By seeing what visitors have done, you can also infer their intentions and spot malicious users in time.
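Counting page views per IP, as described above, takes only a few lines once the log is parsed. This sketch assumes records with `ip`, `path`, and `status` keys; treating paths that end in `.html` or `/` as pages is an assumption about the site's URL layout, and the sample records are hypothetical.

```python
from collections import Counter

PAGE_SUFFIXES = (".html", "/")  # assumption: what counts as a "page" on this site

def page_views_per_ip(records):
    """Count successful page requests per IP, skipping images, CSS, and other assets."""
    views = Counter()
    for rec in records:
        if rec["status"] == 200 and rec["path"].endswith(PAGE_SUFFIXES):
            views[rec["ip"]] += 1
    return views

# Hypothetical parsed records.
records = [
    {"ip": "116.231.220.179", "path": "/", "status": 200},
    {"ip": "116.231.220.179", "path": "/blog/article/10.html", "status": 200},
    {"ip": "116.231.220.179", "path": "/css/site.css", "status": 200},
    {"ip": "203.0.113.5", "path": "/blog/article/10.html", "status": 200},
]
views = page_views_per_ip(records)
single = sum(1 for n in views.values() if n == 1)
print(views.most_common())
print(f"{single} of {len(views)} visitors viewed only one page")
```

Several visitors can share one IP address and one visitor can change addresses, so per-IP counts are an approximation of per-visitor behavior.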

What are the common site log analysis tools?

I have tried many site log analysis tools. Three are commonly used and fairly full-featured: Light-Years SEO Log Analysis System, Anti-Fire Website Log Analyzer, and Web Log Explorer.

For an evaluation of how these three tools work in practice, see another article I wrote, "Common Web site log analysis software use summary."

Appendix: complete list of website log status codes

1xx-Informational

These status codes indicate a provisional response. The client should be prepared to receive one or more 1xx responses before receiving a regular response.

100-Continue.

101-Switching protocols.

2xx-Success

This class of status codes indicates that the server successfully accepted the client request.

200-OK. The client request was successful.

201-Created.

202-Accepted.

203-Non-authoritative information.

204-No content.

205-Reset content.

206-Partial content.

3xx-Redirection

The client browser must take further action to fulfill the request. For example, the browser might have to request a different page on the server, or repeat the request through a proxy server.

301-The object has been permanently moved, that is, permanent redirection.

302-The object has been temporarily moved.

304-Not modified.

307-Temporary redirect.

4xx-Client error

An error occurred and the client appears to be at fault. For example, the client requested a page that does not exist, or the client did not provide valid authentication information.

400-Bad request.

401-Access denied. IIS defines a number of different 401 errors that indicate a more specific cause of the error. These specific error codes are displayed in the browser but are not displayed in the IIS log:

401.1-Logon failed.

401.2-Logon failed due to server configuration.

401.3-Unauthorized due to ACL on the resource.

401.4-Authorization failed by filter.

401.5-Authorization failed by ISAPI/CGI application.

401.7-Access denied by the URL authorization policy on the Web server. This error code is specific to IIS 6.0.

403-Forbidden. IIS defines a number of different 403 errors that indicate a more specific cause of the error:

403.1-Execute access forbidden.

403.2-Read access forbidden.

403.3-Write access forbidden.

403.4-SSL required.

403.5-SSL 128 required.

403.6-IP address rejected.

403.7-Client certificate required.

403.8-Site access denied.

403.9-Too many users.

403.10-Invalid configuration.

403.11-Password change.

403.12-Mapper denied access.

403.13-Client certificate revoked.

403.14-Directory listing denied.

403.15-Client access licenses exceeded.

403.16-Client certificate is untrusted or invalid.

403.17-Client certificate has expired or is not yet valid.

403.18-The requested URL cannot be executed in the current application pool. This error code is specific to IIS 6.0.

403.19-CGI cannot be executed for the client in this application pool. This error code is specific to IIS 6.0.

403.20-Passport logon failed. This error code is specific to IIS 6.0.

404-Not found.

404.0-(None) File or directory not found.

404.1-Web site not accessible on the requested port.

404.2-Web service extension lockdown policy prevents this request.

404.3-MIME map policy prevents this request.

405-The HTTP verb used to access this page is not allowed (method not allowed).

406-The client browser does not accept the MIME type of the requested page.

407-Proxy authentication required.

412-Precondition failed.

413-Request entity too large.

414-Request URI too long.

415-Unsupported media type.

416-Requested range not satisfiable.

417-Expectation failed.

423-Locked error.

5xx-Server error

The server could not complete the request because it encountered an error.

500-Internal server error.

500.12-Application is busy restarting on the Web server.

500.13-Web server is too busy.

500.15-Direct requests for Global.asa are not allowed.

500.16-UNC authorization credentials are incorrect. This error code is specific to IIS 6.0.

500.18-URL authorization store cannot be opened. This error code is specific to IIS 6.0.

500.100-Internal ASP error.

501-Header values specify a configuration that is not implemented.

502-The Web server received an invalid response while acting as a gateway or proxy server.

502.1-CGI application timeout.

502.2-Error in CGI application.

503-Service unavailable. This error code is specific to IIS 6.0.

504-Gateway timeout.

505-HTTP version not supported.
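For a quick diagnosis, the status codes found in a log can be grouped into the five classes listed above. The sketch below assumes the status codes have already been extracted as integers; the function name and sample values are illustrative. IIS sub-status codes such as 403.6 are not handled here, since the list above notes that they are not always written to the log.

```python
from collections import Counter

STATUS_CLASSES = {
    1: "1xx informational",
    2: "2xx success",
    3: "3xx redirection",
    4: "4xx client error",
    5: "5xx server error",
}

def summarize_statuses(statuses):
    """Count status codes by class, using the first digit of each code."""
    by_class = Counter()
    for status in statuses:
        by_class[STATUS_CLASSES.get(status // 100, "unknown")] += 1
    return by_class

# Hypothetical status codes taken from parsed log records.
statuses = [200, 200, 301, 304, 404, 404, 403, 500, 200]
summary = summarize_statuses(statuses)
for name in sorted(summary):
    print(name, summary[name])
```

A sudden rise in the 4xx or 5xx share compared with earlier logs is usually the first sign that something on the site needs attention.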

This article is reproduced from "for the Love Spell"; the original address is http://www.weiaipin.cn/blog/article/31.html
