Webmaster Sharing: Six Aspects of Spider Crawling and Indexing (Part 1)

Source: Internet
Author: User


We all know that search engines want to provide users with high-quality search results. Before a page can be ranked it must first be indexed, and before it can be indexed it must be crawled by a search engine spider, which then selectively crawls further and decides what to include based on what it finds. This article analyzes spider crawling and indexing from six aspects, in the hope that novice webmasters will gain a better understanding of how search engines work; knowing this has real guiding value for website optimization. Now, on to today's text.

First, common spiders: A spider is simply the program a search engine uses to visit web pages; in English it is called a spider, and it is also known as a robot, or bot. Looking through the IIS logs, you can see various spiders' visits to your pages, which provides useful guidance for optimizing the site. When a spider visits a website, it issues page requests and receives HTTP status codes in return; the spider stores these status codes in its own database, laying the groundwork for later calculations. Common spiders include the Baidu spider (Baiduspider), the Yahoo spider (Yahoo! Slurp), the Microsoft Bing spider (msnbot), the Sogou spider (Sogou web spider), the Google spider (Googlebot), and so on. Under normal circumstances these visits appear in the IIS logs, so webmasters should spend time carefully reviewing how spiders visit their site and then adjust the site accordingly.
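To illustrate the idea of spotting spiders in your logs, here is a minimal sketch (not any search engine's or web server's actual code) that matches user-agent strings against the spider signatures mentioned above; the signature list, function names, and log format are my own assumptions:

```python
# Hypothetical crawler user-agent substrings, based on the spiders named above
SPIDER_SIGNATURES = {
    "baiduspider": "Baidu",
    "googlebot": "Google",
    "bingbot": "Bing",
    "msnbot": "Bing (legacy)",
    "slurp": "Yahoo",
    "sogou": "Sogou",
}

def identify_spider(user_agent):
    """Return the search engine name if the user agent matches a known spider."""
    ua = user_agent.lower()
    for signature, engine in SPIDER_SIGNATURES.items():
        if signature in ua:
            return engine
    return None  # an ordinary browser, or an unknown bot

def summarize_log(entries):
    """Collect HTTP status codes per engine from (user_agent, status) pairs."""
    counts = {}
    for ua, status in entries:
        engine = identify_spider(ua)
        if engine is not None:
            counts.setdefault(engine, []).append(status)
    return counts
```

Reviewing which status codes each spider received (lots of 404s, for example) is exactly the kind of adjustment signal the paragraph above describes.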

Second, tracking links: Tracking links means the spider follows the links on a page to crawl from one page to the next. Because the entire Internet is composed of pages connected by links, in theory a spider could crawl every page. But because the actual link structure between sites is very complex, spiders adopt certain strategies to crawl as many pages as possible. There are generally two common strategies: depth-first and breadth-first. Depth-first means crawling along a chain of links until there are no more links to follow, then returning to the starting page to follow the next chain. Breadth-first means crawling all the links on the first layer, and only after the first layer has been fully crawled moving on to the second layer of links. In theory, given enough time, a spider could crawl all pages, but in practice search engines crawl and index only a small portion of the web. So for us, it pays to build enough external links that spiders have more opportunities to find, crawl, and index our pages.
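The two strategies can be sketched on a toy in-memory link graph; the graph, page names, and function names below are hypothetical, purely for illustration:

```python
from collections import deque

# A toy link graph: page -> pages it links to (hypothetical site structure)
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": [],
}

def crawl_depth_first(start):
    """Follow links as deep as possible before backtracking."""
    order, stack, seen = [], [start], set()
    while stack:
        page = stack.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        # Push links in reverse so the first link on the page is crawled first
        stack.extend(reversed(LINKS.get(page, [])))
    return order

def crawl_breadth_first(start):
    """Crawl every link on the current layer before moving to the next layer."""
    order, queue, seen = [], deque([start]), {start}
    while queue:
        page = queue.popleft()
        order.append(page)
        for nxt in LINKS.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order
```

Starting from "A", depth-first visits A, B, D, C, E (down the first chain, then back), while breadth-first visits A, B, C, D, E (layer by layer).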

Third, file storage: File storage is a key technology for search engines, and also a challenge. When a search engine finishes crawling a page, the data is stored in the original page database. The data stored there is exactly the same as the page the user sees in the browser, and each URL is given a unique number. In addition, the engine must store all the data needed for various weight calculations, such as the relationships between links used in iterative PageRank computation; the amount of data involved is enormous. Even for sites that no longer exist, we can often still access the search engine's snapshot page, which is the copy the search engine keeps in its own database, stored independently of the webmaster's own site data. Normal snapshot updates and ranking fluctuations are directly related to how the search engine stores these files.
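A minimal sketch of that idea, assuming a simple in-memory store: each URL receives a unique number the first time it is seen, and the stored copy acts as the "snapshot", kept independently of the live site. The class and method names are my own, not any search engine's actual design:

```python
class PageStore:
    """Toy original-page database: each URL gets a unique id, and the raw
    HTML is stored exactly as fetched (the 'snapshot')."""

    def __init__(self):
        self._id_by_url = {}  # URL -> unique page number
        self._pages = {}      # page number -> stored HTML

    def store(self, url, html):
        # Assign a unique number on first sight; reuse it on later crawls
        if url not in self._id_by_url:
            self._id_by_url[url] = len(self._id_by_url) + 1
        page_id = self._id_by_url[url]
        self._pages[page_id] = html  # overwriting = a snapshot update
        return page_id

    def snapshot(self, url):
        """Return the stored copy, even if the live site is gone."""
        page_id = self._id_by_url.get(url)
        return self._pages.get(page_id)
```

Because the snapshot lives only in the engine's own store, it keeps serving even after the original site disappears, which is exactly the behavior described above.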

So far we have covered the first three aspects: common spiders, tracking links, and file storage. Even as background knowledge, this content can still serve webmasters well. A search engine is itself a huge system, involving computation on a scale we can hardly imagine. Sometimes a site we are optimizing clearly fluctuates, and webmasters grow anxious, puzzling over what could be wrong with the site; in fact, much of the time the cause is not on our side. We are facing a huge computing system that is itself gradually maturing and improving, so occasional anomalies fall within the normal range. Nobody likes to see their site fluctuate, but we should not fixate on it; spending more time improving our content is what really matters.

Well, that's it for this article. If you have any good ideas, you are welcome to contact me. This article is from: Jinhua game download, URL: http://www.mobiledy.com/. Reprinting is also welcome; please keep the link. Thank you!
