An analysis of the crawling principles of search engine crawlers

Source: Internet
Author: User
Keywords: search engine, crawl, crawler


This article briefly analyzes some of the basic principles by which a crawler crawls web pages. From them you can understand several basic page-ranking factors: link building and the layout of the page. Please share your own experience; if the writing falls short, don't scold, and thank you for reading!

The working principle of a crawler covers crawling, strategy, and storage. Crawling is the crawler's basic labor, strategy is its intelligence center, and storage is its work product. Below we walk through the crawler's working principle following that process.

1: Start crawling from the seed site

The World Wide Web has a "butterfly" (bow-tie) structure, a non-linear organization of pages, so there is a question of crawl order. The crawl-order strategy must ensure that as many pages as possible are crawled.

In general, a crawler chooses the left side of the butterfly structure as its crawl starting point, typically the homepage of a large portal such as Sina.com or Sohu.com. Each time a page is crawled, its URLs are analyzed: a page contains strings of links pointing to other pages, and these guide the crawler onward. (From this we can begin to understand why an engine crawls left before right and top before bottom.)
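The "analyze the URLs each time a page is crawled" step can be sketched with Python's standard-library HTML parser. The portal URL and the HTML snippet below are purely illustrative, not real crawl data:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags, resolved against the page URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links so they can be queued as crawl targets.
                    self.links.append(urljoin(self.base_url, value))

# Each fetched page is parsed for URLs, which become new crawl targets.
parser = LinkExtractor("http://www.example-portal.com/")
parser.feed('<a href="/news/1.html">top story</a> <a href="sports/">sports</a>')
print(parser.links)
```

In a real crawler the extracted links feed back into the crawl frontier; the order in which that frontier is drained is exactly the strategy question the next two sections discuss.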

A: Depth-first strategy (Depth-first traversal)

A depth-first traversal strategy is similar to a line of family succession, as in a feudal dynasty: the eldest son inherits first; if the eldest son has died, his son (the eldest grandson) takes priority over the second son (think that through carefully); and if both the eldest son and the eldest grandson have died, the second son inherits. This order of priority is the depth-first strategy. (From this we can understand the order in which a spider crawls pages.)

B: Width-first strategy (Breadth-first traversal)

Width-first is also called breadth-first, or level priority. For example: we give precedence to the oldest ancestors (the grandparents), then to the parents, and finally to our own peers; a crawler can adopt the same level-by-level strategy. There are three main reasons for using breadth-first:

1> Important pages tend to be close to the seed. For example, when we open a news site, the hottest news is right on the homepage; as we browse deeper and the PV count grows, the pages we see become less and less important.

2> The actual depth of the World Wide Web can reach 17 levels, and the path to a given web page may be much deeper, but there is always a short path to it.

3> Breadth-first favors cooperative crawling by many crawlers (Mozk bases this on analysis of predecessors' data and on IIS log analysis; for the moment there are differing views, and discussion and exchange are welcome). Cooperating crawlers usually crawl within a site first and only later follow outbound links, which keeps the crawl well contained.
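For contrast with the depth-first sketch above, a breadth-first crawl drains the frontier level by level with a queue. Again the link graph is a toy assumption, not real data:

```python
from collections import deque

def breadth_first_order(graph, seed):
    """Visit pages level by level: the seed first, then all its direct links, etc."""
    order, seen = [], {seed}
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()
        order.append(url)
        for link in graph.get(url, []):
            if link not in seen:       # marking on enqueue avoids duplicates
                seen.add(link)
                frontier.append(link)
    return order

# Both of the seed's direct links (A, B) are crawled before any
# second-level page (A1, B1).
graph = {
    "home": ["A", "B"],
    "A": ["A1"],
    "B": ["B1"],
}
print(breadth_first_order(graph, "home"))  # ['home', 'A', 'B', 'A1', 'B1']
```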

Note: optimize your links to avoid crawl dead loops, and also to avoid leaving crawlable resources uncrawled and wasting crawl effort. (For how to build reasonable internal links, refer to this site.)
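One common source of crawl dead loops is the same page appearing under several URL spellings. A simple defense is to normalize URLs before adding them to the visited set; this sketch (using only the standard library, with made-up example URLs) collapses case, trailing slashes, and fragments:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Canonicalize a URL so the same page isn't crawled twice
    under different spellings (host case, trailing slash, #fragment)."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))  # drop the fragment

seen = set()
for raw in ["http://Example.com/a/",
            "http://example.com/a#top",
            "http://example.com/a"]:
    seen.add(normalize(raw))
print(len(seen))  # 1: all three spellings collapse to one crawl target
```

Real crawlers add more rules (query-parameter ordering, default ports, redirect resolution), but the visited-set idea is the same.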

2: Web page crawl priority strategy

The crawl priority strategy for web pages is also known as the "page selection problem": the crawler usually tries to capture important pages first, so that limited resources (crawlers, server load) take care of the most important pages as much as possible.

So which pages count as important?

Many factors determine a page's importance. The major ones are link popularity (you already know how important links are), link importance, average link depth, site quality, and historical weight.

The popularity of a link is primarily determined by the quantity and quality of its backlinks; we define it as IB(P).

The importance of a link is a function of the URL string itself, judged simply by examining the string: for example, ".com" and a homepage-style name score higher than ".cc" and a deep "map" page. (This is only an analogy and is not absolute; we usually default the homepage to index.*, but other names work too, and ranking is a combination of factors, so a .com does not necessarily rank well; it is only a small factor.) We define it as IL(P).

Average link depth: in my personal opinion, the average link depth of the whole site is calculated according to the breadth-first principle analyzed above, and pages closer to the seed site are considered more important. We define it as ID(P).

We define the overall importance of a web page as I(P).

So:

I(P) = X*IB(P) + Y*IL(P)

ID(P) is already guaranteed by the breadth-first traversal rule, so it is not included as a separate term in the importance function. This ensures that pages of high importance are crawled first, so such a crawl order is entirely reasonable and scientific.
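The scoring formula above is easy to sketch in code. The weights X and Y below are illustrative assumptions (real engines tune them empirically), and the IB/IL scores for the two candidate pages are made up for the example:

```python
def importance(ib, il, x=0.6, y=0.4):
    """I(P) = X*IB(P) + Y*IL(P): a weighted mix of link popularity (IB)
    and URL-string importance (IL). Weights x, y are illustrative only."""
    return x * ib + y * il

# Two hypothetical candidate pages: one rich in quality backlinks,
# one with a "nice" URL but few backlinks. The backlink-rich page
# scores higher, matching the intuition that IB dominates.
print(importance(ib=0.9, il=0.3))  # about 0.66
print(importance(ib=0.2, il=0.8))  # about 0.44
```

A crawler would sort its frontier by this score and fetch the highest-scoring pages first, which is exactly the "take care of the most important pages with limited resources" idea from point 2.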

Point 1 of this article explains a single point; point 2 analyzes a broader aspect. The writing is imperfect, so please share your own experience.

The goal of SEO is to improve the quality of a site, and improving the quality of a site means improving its user-experience friendliness; the ultimate goal of optimizing for users is to stay in the search engines' good graces and build an evergreen site. The above is from Mozk. SEO is a process of reverse-inferring rankings, so not everything here is right; it is just one data analysis, and any information can only be a reference. In the end you must rely on your own practice. You are welcome at my small site WWW.WOAISEO.COM, where Mozk is learning SEO with you.
