An analysis of the crawling principles of search engine crawlers

Source: Internet
Author: User
Keywords: search engine, crawl, crawler


This article briefly analyzes some of the basic principles by which a crawler crawls web pages. From them you can understand several basic page-ranking factors: link building and the layout of the page. Please share your own experience; if the writing falls short, don't scold, and thank you for reading!

The working principle of a crawler covers crawling, strategy, and storage. Crawling is the crawler's basic labor, strategy is its intelligence center, and storage is its work product. Below we walk through the crawler's working principle following that process.

1: Start crawling from the seed site

The World Wide Web has a "butterfly" (bow-tie) structure, a non-linear organization of pages, so there is a question of crawl order. The crawl-order strategy must ensure that as many pages as possible are crawled.

In general, a crawler chooses the left side of the butterfly structure as its crawl starting point, typically the homepage of a large portal such as Sina.com or Sohu.com. Each time a page is crawled, its URLs are analyzed: a page contains strings of links pointing to other pages, and these guide the crawler onward. (From this we can begin to understand why an engine crawls left before right and top before bottom.)
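The "analyze the URLs each time a page is crawled" step can be sketched with Python's standard-library HTML parser. The portal URL and the HTML snippet below are purely illustrative, not real crawl data:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags, resolved against the page URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links so they can be queued as crawl targets.
                    self.links.append(urljoin(self.base_url, value))

# Each fetched page is parsed for URLs, which become new crawl targets.
parser = LinkExtractor("http://www.example-portal.com/")
parser.feed('<a href="/news/1.html">top story</a> <a href="sports/">sports</a>')
print(parser.links)
```

In a real crawler the extracted links feed back into the crawl frontier; the order in which that frontier is drained is exactly the strategy question the next two sections discuss.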

A: Depth-first strategy (Depth-first traversal)

A depth-first traversal strategy is similar to a line of family succession, as in a feudal dynasty: the eldest son inherits first; if the eldest son has died, his son (the eldest grandson) takes priority over the second son (think that through carefully); and if both the eldest son and the eldest grandson have died, the second son inherits. This order of priority is the depth-first strategy. (From this we can understand the order in which a spider crawls pages.)

B: Width-first strategy (Breadth-first traversal)

Width-first is also called breadth-first, or level priority. For example: we give precedence to the oldest ancestors (the grandparents), then to the parents, and finally to our own peers; a crawler can adopt the same level-by-level strategy. There are three main reasons for using breadth-first:

1> Important pages tend to be close to the seed. For example, when we open a news site, the hottest news is right on the homepage; as we browse deeper and the PV count grows, the pages we see become less and less important.

2> The actual depth of the World Wide Web can reach 17 levels, and the path to a given web page may be much deeper, but there is always a short path to it.

3> Breadth-first favors cooperative crawling by many crawlers (Mozk bases this on analysis of predecessors' data and on IIS log analysis; for the moment there are differing views, and discussion and exchange are welcome). Cooperating crawlers usually crawl within a site first and only later follow outbound links, which keeps the crawl well contained.
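For contrast with the depth-first sketch above, a breadth-first crawl drains the frontier level by level with a queue. Again the link graph is a toy assumption, not real data:

```python
from collections import deque

def breadth_first_order(graph, seed):
    """Visit pages level by level: the seed first, then all its direct links, etc."""
    order, seen = [], {seed}
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()
        order.append(url)
        for link in graph.get(url, []):
            if link not in seen:       # marking on enqueue avoids duplicates
                seen.add(link)
                frontier.append(link)
    return order

# Both of the seed's direct links (A, B) are crawled before any
# second-level page (A1, B1).
graph = {
    "home": ["A", "B"],
    "A": ["A1"],
    "B": ["B1"],
}
print(breadth_first_order(graph, "home"))  # ['home', 'A', 'B', 'A1', 'B1']
```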

Note: optimize your links to avoid crawl dead loops, and also to avoid leaving crawlable resources uncrawled and wasting crawl effort. (For how to build reasonable internal links, refer to this site.)
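One common source of crawl dead loops is the same page appearing under several URL spellings. A simple defense is to normalize URLs before adding them to the visited set; this sketch (using only the standard library, with made-up example URLs) collapses case, trailing slashes, and fragments:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Canonicalize a URL so the same page isn't crawled twice
    under different spellings (host case, trailing slash, #fragment)."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))  # drop the fragment

seen = set()
for raw in ["http://Example.com/a/",
            "http://example.com/a#top",
            "http://example.com/a"]:
    seen.add(normalize(raw))
print(len(seen))  # 1: all three spellings collapse to one crawl target
```

Real crawlers add more rules (query-parameter ordering, default ports, redirect resolution), but the visited-set idea is the same.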

2: Web page crawl priority strategy

The crawl priority strategy for web pages is also known as the "page selection problem": the crawler usually tries to capture important pages first, so that limited resources (crawlers, server load) take care of the most important pages as much as possible.

So which pages count as important?

Many factors determine a page's importance. The major ones are link popularity (you already know how important links are), link importance, average link depth, site quality, and historical weight.

The popularity of a link is primarily determined by the quantity and quality of its backlinks; we define it as IB(P).

The importance of a link is a function of the URL string itself, judged simply by examining the string: for example, ".com" and a homepage-style name score higher than ".cc" and a deep "map" page. (This is only an analogy and is not absolute; we usually default the homepage to index.*, but other names work too, and ranking is a combination of factors, so a .com does not necessarily rank well; it is only a small factor.) We define it as IL(P).

Average link depth: in my personal opinion, the average link depth of the whole site is calculated according to the breadth-first principle analyzed above, and pages closer to the seed site are considered more important. We define it as ID(P).

We define the overall importance of a web page as I(P).

So:

I(P) = X*IB(P) + Y*IL(P)

ID(P) is already guaranteed by the breadth-first traversal rule, so it is not included as a separate term in the importance function. This ensures that pages of high importance are crawled first, so such a crawl order is entirely reasonable and scientific.
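The scoring formula above is easy to sketch in code. The weights X and Y below are illustrative assumptions (real engines tune them empirically), and the IB/IL scores for the two candidate pages are made up for the example:

```python
def importance(ib, il, x=0.6, y=0.4):
    """I(P) = X*IB(P) + Y*IL(P): a weighted mix of link popularity (IB)
    and URL-string importance (IL). Weights x, y are illustrative only."""
    return x * ib + y * il

# Two hypothetical candidate pages: one rich in quality backlinks,
# one with a "nice" URL but few backlinks. The backlink-rich page
# scores higher, matching the intuition that IB dominates.
print(importance(ib=0.9, il=0.3))  # about 0.66
print(importance(ib=0.2, il=0.8))  # about 0.44
```

A crawler would sort its frontier by this score and fetch the highest-scoring pages first, which is exactly the "take care of the most important pages with limited resources" idea from point 2.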

Point 1 of this article explains a single point; point 2 analyzes a broader aspect. The writing is imperfect, so please share your own experience.

The goal of SEO is to improve the quality of a site, and improving the quality of a site means improving its user-experience friendliness; the ultimate goal of optimizing for users is to stay in the search engines' good graces and build an evergreen site. The above is from Mozk. SEO is a process of reverse-inferring rankings, so not everything here is right; it is just one data analysis, and any information can only be a reference. In the end you must rely on your own practice. You are welcome at my small site WWW.WOAISEO.COM, where Mozk is learning SEO with you.
