Search Engine Crawl Algorithms

Source: Internet
Author: User
Keywords: crawl, web pages

Abstract: A search engine's crawl-index-query pipeline looks simple, but the algorithm behind each step is very complex. Crawling is carried out by the spider; the crawl action itself is easy to implement, but which pages to crawl, and in what order, must be decided by algorithms.

A search engine's crawl-index-query pipeline looks simple, but the algorithms behind each step are very complex. Page crawling is carried out by the spider. The crawl action itself is easy to implement; deciding which pages to crawl, and which to crawl first, requires an algorithm. Several crawl algorithms are introduced below:

1. Breadth-first crawl strategy

As we all know, most sites organize their pages in a tree structure. In such a tree-shaped link structure, which pages get crawled first, and why? The breadth-first crawl strategy follows the tree structure: it crawls all sibling links first, and only when those are complete does it move down to the next level of links, as in the following figure:

Note that the diagram uses a link structure rather than a site structure: the links here can come from any page, not only internal links. This is an idealized breadth-first crawl strategy. In the actual crawl process a spider can never be fully breadth-first, only breadth-first within limits, as in the following figure:

In the figure above, when our spider reached link G, the algorithm judged that page G had no value, so the unfortunate G link and its subordinate H link were pruned by the spider. Why was the G link pruned? OK, let's analyze it.
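The limited breadth-first crawl described above can be sketched in a few lines of Python. This is an illustrative sketch, not any search engine's actual code: `get_links` and `page_has_value` are hypothetical callbacks standing in for link extraction and the value-judgment algorithm, and the depth limit models the "breadth-first within limits" behavior.

```python
from collections import deque

def bfs_crawl(seed, get_links, page_has_value, max_depth=3):
    """Breadth-first crawl: visit all links at one level before
    descending, and prune low-value pages so that their subtrees
    (like G and its child H above) are never crawled."""
    queue = deque([(seed, 0)])   # FIFO queue gives breadth-first order
    visited = {seed}
    crawled = []
    while queue:
        url, depth = queue.popleft()
        if not page_has_value(url):   # low-value page: skip it and its subtree
            continue
        crawled.append(url)
        if depth >= max_depth:        # "limited" breadth-first: bounded depth
            continue
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return crawled
```

With a toy link graph where page G is judged worthless, the crawl visits A's whole level before descending and never reaches H, matching the figure.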

2. Non-exhaustive link weight calculation

Every search engine has its own PageRank calculation method (meaning page weight in general here, not Google PR), and updates it regularly. The Internet is nearly infinite and generates a huge number of new links every day, so the search engine's link-weight calculation can only be a non-exhaustive traversal. Why does Google PR take around three months to update? Why does Baidu do a big update only 1-2 times a month? Because the search engine computes link weights with a non-exhaustive traversal algorithm. In fact, with current technology it would not be hard to update weights more frequently; computation and storage speeds are entirely sufficient. So why not do it? Either because it is unnecessary, or because it has already been done but not made public. So, what is non-exhaustive link-weight calculation?

Suppose we form a set of K links, where R denotes a link's PageRank, S the number of links a page contains, Q whether a link participates in weight transfer, and β the damping factor. The weight a link receives is then given by the formula:
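The original formula image did not survive in this copy of the article. A standard PageRank-style reading of the definitions above would be R(p) = (1 − β) + β · Σᵢ Qᵢ·Rᵢ/Sᵢ over the inbound links i ∈ K; the sketch below implements that hedged reconstruction, not the article's exact formula.

```python
def link_weight(inbound, beta=0.85):
    """One PageRank-style weight update for a single page.
    `inbound` is a list of (R, S, Q) tuples, one per inbound link:
      R -- the source page's weight,
      S -- the number of links the source page contains,
      Q -- 1 if the link participates in weight transfer,
           0 if it was flagged as cheating or manually removed.
    Hedged reconstruction: the article's formula image is missing."""
    transferred = sum(q * r / s for r, s, q in inbound if s > 0)
    # the (1 - beta) term keeps every page above zero weight, so a
    # page whose inbound links all have Q = 0 can still pass weight on
    return (1 - beta) + beta * transferred
```

For example, one participating link from a weight-1.0 page with two outlinks, plus one removed link (Q = 0), yields 0.15 + 0.85 × 0.5 = 0.575.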

The formula shows that it is Q that decides a link's weight: if a link is found to be cheating, is manually removed by the search engine, or is excluded for other reasons, Q is set to 0, and then no quantity of external links will help. β is the damping factor; its main roles are to prevent weights from reaching 0 (which would stop a link from participating in weight transfer) and to discourage cheating. The damping factor β is generally 0.85. Why is the sum multiplied by the damping factor? Because not all links on a page participate in weight transfer; the search engine discounts the total by the remaining 15% to account for filtered links.

However, this non-exhaustive weight calculation can only start once a certain number of links has accumulated, so its update cycle is generally slow and cannot meet users' need for real-time information. On this basis, a real-time weight-allocation crawl strategy was developed: as soon as the spider finishes crawling a page and its entry links, it allocates weights immediately, redistributes those weights into the to-crawl link library, and then crawls links in descending order of weight.
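The real-time weight-allocation crawl amounts to replacing the breadth-first FIFO queue with a priority queue ordered by weight. The sketch below illustrates that idea under stated assumptions: `estimate_weight` is a hypothetical function that assigns a weight to each newly discovered link from its parent's weight (real engines would use the full link-weight calculation).

```python
import heapq

def weighted_crawl(seed_weights, get_links, estimate_weight, limit=100):
    """Real-time weight-allocation crawl: each time a page is fetched,
    weights are immediately assigned to its outlinks and pushed into
    the frontier, so the spider always fetches the highest-weight
    link next. heapq is a min-heap, so weights are stored negated."""
    frontier = [(-w, url) for url, w in seed_weights.items()]
    heapq.heapify(frontier)
    seen = set(seed_weights)
    order = []
    while frontier and len(order) < limit:
        neg_w, url = heapq.heappop(frontier)   # highest-weight link first
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                # immediate weight allocation for the new link
                heapq.heappush(frontier, (-estimate_weight(link, -neg_w), link))
    return order
```

Unlike breadth-first order, a high-weight deep link can jump ahead of low-weight shallow links in the frontier.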

3. Social-engineering crawl strategy

A social-engineering strategy injects artificial intelligence, or machine intelligence trained by humans, into the spider's crawl process to decide crawl priority. The crawl strategies of this kind that I know of so far are:

A. Hot-topic priority strategy: links for suddenly trending keywords are crawled first, without passing through the strict weighting and filtering steps, because new links will soon arrive to cover the topic and users will actively choose among the results.

B. Authority priority strategy: the search engine assigns each site an authority score, determined from the site's history, update patterns, and so on, and crawls links from high-authority sites first.

C. User-click strategy: when most users searching keywords from an industry thesaurus frequently click the same site's search result, the search engine will crawl that site more often.

D. History reference strategy: for sites that update frequently, the search engine builds an update history for the site and uses it to predict future update volume and set the crawl frequency.
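The four signals above could be blended into a single crawl-priority score. The sketch below is purely illustrative: the field names, weights, and the rule that hot pages bypass normal filtering are assumptions drawn from the descriptions above, not any search engine's known formula.

```python
def crawl_priority(page, weights=(0.4, 0.3, 0.2, 0.1)):
    """Blend the four social-engineering signals into one score.
    `page` is a dict with (all hypothetical field names, each in [0, 1]):
      hot         -- 1.0 if the page matches a trending keyword,
      authority   -- site authority from history, updates, etc.,
      click_rate  -- share of result clicks the site gets for its
                     industry keywords,
      update_freq -- predicted update frequency from the site's
                     update history."""
    w_hot, w_auth, w_click, w_hist = weights
    # hot pages skip the strict weighting and filtering entirely
    if page["hot"] >= 1.0:
        return 1.0
    return (w_hot * page["hot"] + w_auth * page["authority"]
            + w_click * page["click_rate"] + w_hist * page["update_freq"])
```

The spider would then feed this score into its priority-ordered frontier in place of (or alongside) the link weight.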

Guidance for SEO work:

Having explained the search engine's crawl principles in some depth, let's now draw out the practical guidance these principles offer for SEO work:

A. Regular, consistent updates encourage the spider to crawl the site's pages on schedule;

B. A company-operated site carries higher authority than a personal site;

C. Long-established sites are crawled more easily;

D. Links should be distributed appropriately across a page; too many or too few are both bad;

E. Sites popular with users are also popular with search engines;

F. Important pages should sit at shallower levels of the site structure;

G. Industry-authority content on a site improves the site's overall authority.

That's it for this tutorial. The next tutorial's topic is: calculating page value and site weight.
