Learn to analyze the spider's crawl characteristics to understand how your site is doing

Source: Internet
Author: User


In daily website operation and maintenance we often need to look at the space's WWW (web server) logs to understand how spiders are crawling the site, and adjust our routine work accordingly. The following takes you step by step through log setup and the analysis of the spider's crawl characteristics, so that you fully understand the meaning of each parameter and can use it as a reference for your own adjustments.

First: You need to make sure that logging is enabled on your virtual host or server. The control panel of most virtual-space providers has a WWW log feature and lets the webmaster download the logs for analysis. Below is the log layout the editor uses; since every space provider arranges its panel and operations differently, treat this only as a reference.

First click the entry in Figure I, or enter the interface shown in Figure II, then click to download the web log and the interface in Figure IV will appear. Each TXT file in Figure IV is named by year-month-day and shows the size of the log; click one to see the detailed information.


Second: Find traces of the spider in the log. A single TXT log can run to hundreds of KB and thousands of lines, so checking it line by line is not realistic; we need to understand the spider's identifying characteristics and use the search function to locate it quickly. Because spider user-agent strings contain the word "spider", searching for "spider" will bring up visits from every spider, such as Baidu, Google, 360 and so on. Baidu's spider identifies itself as "Baiduspider", and here we use the Baidu spider as the example.

First open the downloaded TXT document with Notepad and use the Edit > Find function (Figure V) to search quickly. Type "Baidu" in the search box and confirm, and you will find the lines recording the Baidu spider's crawls (Figure VI).
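If the log is too big to read through comfortably in Notepad, a short script can pull out only the spider's lines. The following is a minimal sketch in Python, assuming the log was downloaded as a plain-text file (the filename ex120523.log here is made up) and that the Baidu spider identifies itself with the string "Baiduspider", as described above:

    # Minimal sketch: pull Baidu spider entries out of a downloaded web log.
    # The filename is a hypothetical example; point it at your own file.
    LOG_FILE = "ex120523.log"

    with open(LOG_FILE, encoding="utf-8", errors="ignore") as f:
        spider_lines = [line.rstrip("\n") for line in f if "Baiduspider" in line]

    print(f"Found {len(spider_lines)} Baidu spider requests")
    for line in spider_lines[:10]:      # show the first few hits
        print(line)

The same idea works for any other spider: change the search string to the user-agent fragment you are interested in (for example "Googlebot").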


Third: After locating a line recording a Baidu spider crawl, the editor explains each parameter and the situation it corresponds to (see the sample diagram).


Parameter 1: This is the time at which the Baidu spider crawled the content. It generally differs from your computer's clock by 8 hours, because the log uses GMT while Beijing time is GMT+8; add 8 hours to the logged time to get the corresponding Beijing time. So the crawl time shown in Parameter 1 corresponds to May 23, 13:08 Beijing time.
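The +8 hour conversion described above is simple enough to script. The snippet below is only a sketch; the timestamp string and its format are made up to look like the date and time columns of a typical log, so adjust the format string to match yours:

    from datetime import datetime, timedelta

    # Sketch: convert a GMT log timestamp to Beijing time (GMT+8).
    # The example timestamp and its "YYYY-MM-DD HH:MM:SS" layout are assumptions.
    gmt_text = "2012-05-23 05:08:00"
    gmt_time = datetime.strptime(gmt_text, "%Y-%m-%d %H:%M:%S")
    beijing_time = gmt_time + timedelta(hours=8)

    print("GMT:    ", gmt_time)
    print("Beijing:", beijing_time)   # 2012-05-23 13:08:00, i.e. May 23, 13:08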

Parameter 2: This is how the content was crawled. GET means a fetch request, and the /index.html that follows it is the page that was crawled; here the spider fetched the home page. If what follows GET is only "/", the spider did not crawl any actual content, and this should get the site maintainer's attention: there may be a problem with your content, the home page layout, or the articles themselves, and specific problems need specific analysis.

Parameter 3: This is the IP address of the server at the moment the spider crawled the content. Many domain names are now resolved through CNAME records, so many webmasters do not know their site's actual IP; this is the IP the space provider exposed to the spider when it crawled. When your site has problems, you can check how many sites sit on this IP and how they are indexed, to judge whether your own site is being implicated by them.

Parameter 4: This is the protocol status code. 200 means the request was normal, 404 means the file could not be found, and 500 is an internal server error. Normally every page on the site should return 200; after a site revision, 404 errors are common. Here you need to investigate the specific cause according to the return value.
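To see Parameters 2-4 at a glance across all of the spider's visits, you can split each Baiduspider line into fields and tally the requested pages and status codes. The sketch below assumes an IIS/W3C-style log whose "#Fields:" header names the columns; the field names used (cs-method, cs-uri-stem, sc-status) are the standard W3C ones, and the filename is hypothetical, so check both against your own log:

    from collections import Counter

    # Sketch: summarize what the Baidu spider fetched and which status codes it got.
    LOG_FILE = "ex120523.log"          # hypothetical filename

    fields = []
    pages, statuses = Counter(), Counter()

    with open(LOG_FILE, encoding="utf-8", errors="ignore") as f:
        for line in f:
            if line.startswith("#Fields:"):
                fields = line.split()[1:]          # remember the column order
                continue
            if not fields or "Baiduspider" not in line:
                continue
            row = dict(zip(fields, line.split()))
            pages[(row.get("cs-method"), row.get("cs-uri-stem"))] += 1
            statuses[row.get("sc-status")] += 1

    print("Most requested pages:", pages.most_common(5))
    print("Status codes:", statuses)

A pile of 404s, or many bare "GET /" requests, in this summary corresponds to the warning signs described in Parameters 2 and 4.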

Digression: Every webmaster with a new site anxiously waits for spiders to crawl and index it so that it can get a good ranking. But Baidu's review of new sites has become very strict, and the review period is generally more than 20 days, so getting your content crawled by the Baidu spider and earning a good ranking has become harder and harder. As spiders grow more intelligent, fooling them with cheating or black-hat tricks is no longer easy, and even if you succeed at first, a later in-depth inspection by Baidu's anti-cheating team will find the site and punish it in proportion to the degree of cheating. So webmasters are advised to build their sites honestly and devote themselves to being white-hat masters, so the enterprise websites they maintain can keep their rankings worry-free.

The above article was first published on A5 by Sichuan boric acid http://www.cdxzhg.com. The author hopes to encourage all webmasters together; if you need to reprint it, please indicate the source. Thank you for your cooperation.
