1. Introduction to spider names
In website logs, the common spider names are: Baidu -> Baiduspider, Google -> Googlebot, MSN -> msnbot, Yahoo -> Slurp, Youdao -> YoudaoBot, Sogou -> Sogou+web+spider. To trace the crawling activity of a particular spider, search the log for its name.
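The name search described above can be sketched in a few lines of Python. This is only a minimal illustration: the function names `detect_spider` and `count_spider_hits` are made up for this example, matching is done case-insensitively on the whole log line, and Sogou is matched on the substring "sogou" alone, which is an assumption of the sketch.

```python
# Substrings to look for, taken from the spider names listed above.
# Sogou is matched on "sogou" only (an assumption of this sketch).
SPIDER_TOKENS = {
    "Baidu": "baiduspider",
    "Google": "googlebot",
    "MSN": "msnbot",
    "Yahoo": "slurp",
    "Youdao": "youdaobot",
    "Sogou": "sogou",
}


def detect_spider(log_line):
    """Return the engine whose spider token appears in the line, or None."""
    lowered = log_line.lower()
    for engine, token in SPIDER_TOKENS.items():
        if token in lowered:
            return engine
    return None


def count_spider_hits(lines):
    """Count log lines per search engine, skipping '#' directive lines."""
    counts = {}
    for line in lines:
        if line.startswith("#"):
            continue  # header lines such as "# Fields: ..." are not requests
        engine = detect_spider(line)
        if engine is not None:
            counts[engine] = counts.get(engine, 0) + 1
    return counts
```

To use it on a real log, pass an open file object, for example `count_spider_hits(open(path, encoding="utf-8", errors="replace"))`, where `path` is wherever your log file lives.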
2. Crawler return type
After each request, the server returns an HTTP status code to the spider. By checking the status code in the log, you can see the result of the crawl. The main HTTP status codes are:
(1) code 200 indicates that the page was crawled normally.
(2) code 304 indicates that the content has not changed since the previous crawl. This code is often returned for website images.
(3) code 404 indicates that the requested link is a dead link. Dead links arise in two ways: either the page once existed and was later deleted, or the page never existed but another site links to it.
(4) code 302 indicates temporary redirection.
(5) code 301 indicates permanent redirection.
(6) code 500 indicates a program error.
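As a rough illustration of how the codes above can be used when reviewing a log, the following sketch maps each code to its meaning and tallies a list of codes. The names `STATUS_MEANINGS`, `describe_status`, and `summarize_statuses` are hypothetical, and any code not in the list above is simply grouped as "other".

```python
from collections import Counter

# Meanings of the status codes listed above.
STATUS_MEANINGS = {
    200: "crawled normally",
    301: "permanent redirect",
    302: "temporary redirect",
    304: "not modified since previous crawl",
    404: "dead link",
    500: "program error",
}


def describe_status(code):
    """Return the meaning of a status code, or 'other' if it is not listed."""
    return STATUS_MEANINGS.get(int(code), "other")


def summarize_statuses(codes):
    """Tally how many requests fall under each meaning."""
    return Counter(describe_status(code) for code in codes)
```

A large share of "dead link" or "program error" entries in the tally is usually the first thing worth investigating.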
3. Log code interpretation
# Software: Microsoft Internet Information Services 6.0
# Version: 1.0
# Date: 16:00:39
# Fields: date time s-sitename s-computername s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs-version cs(User-Agent) cs(Cookie) cs(Referer) cs-host sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken
date: the date of the request;
time: the time of the request;
s-sitename: the identifier of your site (virtual host) in IIS;
s-ip: the IP address of the server;
cs-method: the request method. The two common methods are GET, which requests a URL, and POST, which submits a form;
cs-uri-stem: the file that was requested;
cs-uri-query: the parameters appended to the requested address, that is, the string after the ? in an .asp URL, such as id=12. If there are no parameters, a hyphen (-) is recorded;
s-port: the server port that was accessed;
cs-username: the name of the visitor;
c-ip: the IP address of the client (the visitor or spider);
cs(User-Agent): the user agent of the client, which identifies the browser or spider that made the request;
sc-status: the HTTP status code. 200 indicates success, 403 indicates no permission, 404 indicates that the page cannot be found, and 500 indicates a program error;
sc-bytes: the number of bytes the server sent to the client;
cs-bytes: the number of bytes the client sent to the server.
Case study:
2013-12-22 18:47:12 W3SVC2137573334 D-901195C886694 119.147.151.150 GET /.aspx id=2230&TypeId=91 80 - 123.125.71.28 HTTP/1.1 Mozilla/5.0+(compatible;+Baiduspider/2.0;++http://www.baidu.com/search/spider.html) - - www.111cn.net 200 0 0 59004 243 2250
This entry records a visit by Baidu's spider, identified by Baiduspider/2.0 in the user agent. GET /.aspx id=2230&TypeId=91 means that the spider requested the file /.aspx with the parameters id=2230&TypeId=91, and the server returned 200, so the page was crawled normally.
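The field-by-field reading of the case study can also be done programmatically. The sketch below assumes the fields are separated by single spaces in the order given by the # Fields line, and it uses the case-study entry with its spacing normalized. The function name `parse_w3c_line` is made up for this example.

```python
# Field names in the order given by the "# Fields:" line of the log header.
FIELDS = (
    "date time s-sitename s-computername s-ip cs-method cs-uri-stem "
    "cs-uri-query s-port cs-username c-ip cs-version cs(User-Agent) "
    "cs(Cookie) cs(Referer) cs-host sc-status sc-substatus "
    "sc-win32-status sc-bytes cs-bytes time-taken"
).split()

# The case-study entry, with spacing normalized (an assumption of this sketch).
SAMPLE = (
    "2013-12-22 18:47:12 W3SVC2137573334 D-901195C886694 119.147.151.150 "
    "GET /.aspx id=2230&TypeId=91 80 - 123.125.71.28 HTTP/1.1 "
    "Mozilla/5.0+(compatible;+Baiduspider/2.0;"
    "++http://www.baidu.com/search/spider.html) "
    "- - www.111cn.net 200 0 0 59004 243 2250"
)


def parse_w3c_line(line, fields=FIELDS):
    """Split one log entry on whitespace and pair each value with its field."""
    values = line.split()
    if len(values) != len(fields):
        raise ValueError(
            "expected %d fields, got %d" % (len(fields), len(values))
        )
    return dict(zip(fields, values))


record = parse_w3c_line(SAMPLE)
# Values are kept as strings exactly as they appear in the log.
print(record["c-ip"], record["cs-uri-stem"], record["sc-status"])
```

If your log header lists a different set of fields, pass that list as the `fields` argument instead of relying on the default.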
Tips
If you want to analyze website logs more precisely, try a dedicated tool, such as an IIS log analysis tool or a professional Apache log analysis tool.