Xiao Su: Analyzing raw website access logs, from the basics onward

Source: Internet
Author: User


Recently I ran a small survey of about 50 SEOers who had either just entered the field or had 1-2 years of experience, including many job interviewees. Very few of them could actually analyze logs in their work. When asked what role access logs play in SEO, many shook their heads, or knew only the basics and had never done it hands-on, mainly because their company's platform never gave them the chance to practice. Below, the author shares what he knows about raw website access logs:

What is an access log

A website access log is a file ending in .log in which the web server records raw information about the requests it receives and processes, as well as run-time errors; in other words, the server log. It lets us SEOers see clearly which IP a visitor came from, at what time, with which operating system, browser and screen resolution, which page of the site was visited, and whether the request succeeded.

When to analyze logs, and what a log entry looks like

Do we need to analyze the logs every day? No. Log analysis is tedious, so it is usually done once a month or once every half month. If your site has been running normally, once a month is enough, or a quick review will do.

In practice, logs get more attention when something is wrong with the site: the author then reviews half a month of logs, focusing on the spiders' behavior. For example, the analysis checks whether 404 errors, a misconfigured robots file, injected malicious code or similar problems have driven the spiders away, and then looks for a fix.

The following is an access record from the author's daily log analysis:

119.254.22.200 - - [10/Apr/2012:00:04:54 +0800] "GET /bbjk/index.html HTTP/1.0" 200 25269 "-" "Sogou web spider/4.0 (+http://www.sogou.com/docs/help/webmasters.htm#07)"

119.254.22.200 is the IP address of the visitor.

10/Apr/2012:00:04:54 +0800 is the access date, time and time zone.

GET /bbjk/index.html HTTP/1.0 means the page /bbjk/index.html (under the site's domain) was fetched over the HTTP/1.0 protocol (GET is the request method).

200 is the server response status code.

25269 is the size of the page in bytes.

Sogou web spider/4.0 (+http://www.sogou.com/docs/help/webmasters.htm#07) is the signature of the Sogou spider.
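The field breakdown above can be sketched in Python. This is a minimal sketch, assuming the server writes the common Apache/Nginx "combined" log layout; the pattern and the names `LOG_PATTERN` and `parse_line` are illustrative, and the sample is the Sogou record with its spacing restored.

```python
import re

# Assumed layout: Apache/Nginx "combined" log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


sample = (
    '119.254.22.200 - - [10/Apr/2012:00:04:54 +0800] '
    '"GET /bbjk/index.html HTTP/1.0" 200 25269 "-" '
    '"Sogou web spider/4.0 (+http://www.sogou.com/docs/help/webmasters.htm#07)"'
)
record = parse_line(sample)
print(record["ip"], record["time"], record["path"], record["status"], record["size"])
```

Lines that do not fit the assumed layout return None, so a custom log format would need its own pattern.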

Note: To the site, a spider is just an ordinary visitor. Do not assume spiders are especially powerful. Many people think that if content can only be seen after logging in, a spider can still crawl the post-login pages; that is impossible, unless the site gives spiders special treatment.

How to analyze website access logs

A few years ago, when the author first got into SEO, tools were scarce, and the author always analyzed access logs by hand. Manual analysis is of course time-consuming and laborious, so only the kinds of manual log analysis the author finds most useful are explained here.

The author's manual analysis now mostly concentrates on the pattern of daily spider crawling and how it relates to the site's update data. Each site has to be observed on its own terms, and in the end a clear pattern will emerge.

The author tallies each day's spider visits by time of day into a report, for example:

2012-4-18

1:00-2:00: 5 crawls

2:00-3:00: 3 crawls

3:00-4:00: 10 crawls

If you are careful, you can turn this into a trend chart, which is very intuitive. This kind of statistics is mostly used right after a site launches or after an anomaly, when log analysis needs to be stepped up. In day-to-day operation, the focus is on the spiders' daily crawling pattern, so that batches of articles can be published at the right times to increase indexing.
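The hourly tally described above could be automated along these lines. This is a rough sketch, not the author's tool: the function name `crawls_per_hour`, the sample lines and their IP are hypothetical, and a spider is matched simply by a token in the log line, which by itself cannot tell a real spider from a fake one.

```python
from collections import Counter
from datetime import datetime


def crawls_per_hour(lines, spider_token):
    """Count log lines containing spider_token, grouped by hour of day."""
    counts = Counter()
    for line in lines:
        if spider_token.lower() not in line.lower():
            continue
        start = line.find("[")
        end = line.find("]", start)
        if start == -1 or end == -1:
            continue
        try:
            # Timestamp such as 18/Apr/2012:01:15:02 +0800
            stamp = datetime.strptime(line[start + 1:end], "%d/%b/%Y:%H:%M:%S %z")
        except ValueError:
            continue  # skip lines with an unexpected timestamp
        counts[stamp.hour] += 1
    return counts


# Hypothetical sample lines, for illustration only.
sample_lines = [
    '192.0.2.10 - - [18/Apr/2012:01:15:02 +0800] "GET /a.html HTTP/1.1" 200 512 "-" "Baiduspider/2.0"',
    '192.0.2.10 - - [18/Apr/2012:01:40:10 +0800] "GET /b.html HTTP/1.1" 200 512 "-" "Baiduspider/2.0"',
    '192.0.2.10 - - [18/Apr/2012:02:05:33 +0800] "GET /c.html HTTP/1.1" 200 512 "-" "Baiduspider/2.0"',
    '192.0.2.77 - - [18/Apr/2012:02:06:00 +0800] "GET /c.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
counts = crawls_per_hour(sample_lines, "baiduspider")

# A crude text "trend chart": one # per crawl.
for hour in sorted(counts):
    print(f"{hour:02d}:00-{hour + 1:02d}:00  {'#' * counts[hour]}  {counts[hour]}")
```

Running the same tally over several days and comparing the rows is one way to see when the spider tends to visit.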

Manual log analysis is tedious and can even dampen your mood, but tools are now plentiful, and using them gets twice the result with half the effort.

The author's recommendation is the Light-Years log analysis tool. It is very simple to use, so the author will not demonstrate it here; interested readers can search for it on Baidu. Its advantage is that the generated report clearly shows spider crawling anomalies, such as 404s, and the traces of page crawls. The only regret is that the author has not yet found a tool that analyzes spider crawling patterns and generates a trend chart.

Note: In log analysis we are usually looking for problems in the log so that we can fix them, so pay special attention to the 404 and 301 status codes.
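As a small illustration of watching those status codes, the sketch below counts which URLs returned 404 or 301. It assumes the quoted request is followed by the status code, as in the common combined log format; `status_summary` and the sample lines are hypothetical.

```python
from collections import Counter


def status_summary(lines, codes=("404", "301")):
    """Count (status, path) pairs for requests whose status is in codes."""
    hits = Counter()
    for line in lines:
        parts = line.split('"')
        if len(parts) < 3:
            continue
        request = parts[1].split()   # e.g. ['GET', '/old.html', 'HTTP/1.1']
        after = parts[2].split()     # e.g. ['404', '0']
        if len(request) < 2 or not after:
            continue
        if after[0] in codes:
            hits[(after[0], request[1])] += 1
    return hits


# Hypothetical sample lines, for illustration only.
sample_lines = [
    '192.0.2.10 - - [18/Apr/2012:01:15:02 +0800] "GET /old.html HTTP/1.1" 404 0 "-" "Baiduspider/2.0"',
    '192.0.2.10 - - [18/Apr/2012:01:20:45 +0800] "GET /old.html HTTP/1.1" 404 0 "-" "Baiduspider/2.0"',
    '192.0.2.10 - - [18/Apr/2012:01:30:12 +0800] "GET /moved HTTP/1.1" 301 0 "-" "Baiduspider/2.0"',
    '192.0.2.10 - - [18/Apr/2012:01:31:00 +0800] "GET /index.html HTTP/1.1" 200 1024 "-" "Baiduspider/2.0"',
]
hits = status_summary(sample_lines)
for (status, path), count in hits.most_common():
    print(status, path, count)
```

URLs that show up repeatedly with a 404 are the first candidates to repair or redirect.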

Distinguishing real spiders from fake ones

Why distinguish real spiders from fake ones? Information is booming, and many scraping tools imitate spider signatures when downloading content so that the target site will not notice them. As a result, many SEOers wrongly believe spiders are crawling large numbers of pages, yet find that indexing does not increase. Below, the author explains how to tell real spiders from fake ones fairly easily, along with a few points that need special attention.

1, Real spiders

220.181.108.96 - - [07/Apr/2012:01:22:21 +0800] "GET /site/sex/index.php HTTP/1.1" 302 "" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"

This is a log fragment from one of the author's sites. Take the IP, open a CMD window on Windows, run nslookup 220.181.108.96 and check the output:

  

If the IP belongs to a real Baidu spider, the lookup returns a Baidu domain name directly.

2, Fake spiders

A classic fake spider is the Chinaz query tool, which imitates Baidu Spider. Its IP is 125.90.88.96, and a lookup on it does not return a Baidu domain name. Interested readers can run nslookup 125.90.88.96 themselves; the author will not include a screenshot.

A fake spider generally appears in the log like this: XXX.XXX.XXX.XXX - - [07/Apr/2012:01:22:21 +0800] "GET /site/sex/index.php HTTP/1.1" 302 "" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"

It looks identical to the real one, but the IP gives it away!

3, Special-case spiders

1 CDN acceleration causes IP confusion.

This usually happens when the site is behind CDN acceleration. Looking at the Apache access log, you find many spider entries whose IPs are very similar. If you apply the identification method published by Baidu and run nslookup on those IPs, they will not resolve to a Baidu domain, because they are the IPs of CDN nodes, and this causes confusion. With a CDN enabled, the actual number of spider visits is generally <= the number of spider entries in the log.

2 Whether anonymous Baidu spiders exist.

Anonymous spiders? Baidu engineer Lee has always stressed that Baidu spiders do not visit sites anonymously. However, based on material found online plus data from one of the author's own sites, the author speculates that there are two possibilities:

The first: if anonymous spiders exist, that obviously contradicts Lee's statement. So consider the question from another angle: if they do exist, what would they be for? Many SEOers, the author included, have speculated that such spiders may be used to check whether a site treats spiders and users differently. If that is the case, there is no need for a guilty conscience; just run your site honestly.

The second: Baidu employees may have visited your site from the office. Baidu employees are people too; perhaps one of them came across your site and visited it, leaving a Baidu IP in the log and causing confusion. (In fact, many Baidu departments, such as the Network Union department, routinely collect customer information.)

Note: Identifying real and fake spiders requires weighing several factors; do not judge authenticity by the IP alone.
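The nslookup check from this section can also be scripted. This is a minimal sketch under stated assumptions: the hostname suffixes in `BAIDU_SUFFIXES` are taken from Baidu's public guidance and should be verified before use, all function names are hypothetical, and the demonstration uses stubbed lookups with made-up hostnames so that it runs offline. As the note says, the result is one signal among several; behind a CDN the logged IP may belong to a CDN node.

```python
import socket

# Assumption: suffixes from Baidu's public guidance; verify before relying on them.
BAIDU_SUFFIXES = (".baidu.com", ".baidu.jp")


def claims_baidu(user_agent):
    """True if the user agent string presents itself as Baiduspider."""
    return "baiduspider" in user_agent.lower()


def hostname_is_baidu(hostname):
    """True if a reverse-DNS hostname ends with an assumed Baidu suffix."""
    return hostname.lower().rstrip(".").endswith(BAIDU_SUFFIXES)


def reverse_lookup(ip):
    """Return the reverse-DNS hostname for ip, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def classify(user_agent, ip, lookup=reverse_lookup):
    """Combine the user agent claim with the reverse lookup of the IP."""
    if not claims_baidu(user_agent):
        return "not baidu"
    hostname = lookup(ip)
    if hostname is None:
        return "unverified"
    return "likely real" if hostname_is_baidu(hostname) else "likely fake"


ua = "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
# Stubbed lookups so the example runs offline; both hostnames are made up.
print(classify(ua, "220.181.108.96", lookup=lambda ip: "spider.example.baidu.com"))
print(classify(ua, "125.90.88.96", lookup=lambda ip: "host.example.net"))
```

Calling `classify(ua, ip)` without the `lookup` argument performs a live reverse DNS query, the scripted counterpart of running nslookup by hand.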

Finally, this article grew out of experience the author recently accumulated while analyzing the site he works on, http://baby.wenkang.cn. The plan was to share a brief write-up, but it has run to nearly 3,000 words. If you have any questions, add my QQ: 123464947 (Su) to get in touch and discuss. As the saying goes, when three people walk together, one of them can be my teacher. Let us improve together!
