Search Engine Web Page Collection Algorithms

Source: Internet
Author: User


Stage one: crawl everything, large and small

Search engine crawlers start with a "crawl everything, large and small" strategy: every link found on a page is added to the queue of URLs to crawl, and the URLs in each newly crawled page are extracted mechanically. The approach is old but works well. It also explains why many webmasters see the spider visit their site yet find that their pages are not indexed: this is only the first stage.
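The mechanical "queue every link you find" loop can be sketched in Python as follows. This is only an illustration under stated assumptions: `crawl_everything`, `LinkExtractor`, the `fetch` callable and the `FAKE_WEB` dictionary are hypothetical names introduced here, and the in-memory dictionary stands in for real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_everything(seed_url, fetch, max_pages=100):
    """Crawl in discovery order, queuing every link that is found.

    `fetch` is a hypothetical callable mapping a URL to its HTML,
    or to None when the page cannot be retrieved.
    """
    frontier = deque([seed_url])  # URLs waiting to be crawled
    seen = {seed_url}             # URLs that have already been queued
    crawled = []

    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        crawled.append(url)

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return crawled


# Tiny in-memory "web" standing in for real HTTP requests (assumption).
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "http://example.com/b": "",
    "http://example.com/c": '<a href="/">home</a>',
}

order = crawl_everything("http://example.com/", FAKE_WEB.get)
print(order)
```

Nothing in this loop judges whether a page is worth keeping, which is consistent with a page being visited by the spider without being indexed.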

Stage two: page rating

The second stage grades pages by importance. PageRank is a well-known link analysis algorithm for measuring the importance of a web page, so it is natural to use the idea of PageRank to order the URLs waiting to be crawled. This is the reason behind the enthusiasm for "posting external links"; according to a friend, the market for posting external links in China is worth billions of dollars a year.

The purpose of a crawler is to download web pages, but PageRank is a global algorithm: its results are reliable only once all pages have been downloaded. During a crawl only part of the web has been seen, and for small and medium-sized sites on poor servers the crawler may retrieve only part of the content, so a reliable PageRank score cannot be obtained at the crawl stage.
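To illustrate why a score computed mid-crawl can mislead, here is a minimal power-iteration PageRank in Python, run once on a partial view of a toy graph and once on the full graph. The function name, the damping factor of 0.85, the handling of pages without known outlinks, and the four-page graph are all assumptions made for this sketch, not details from the article.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over the pages in `graph`.

    `graph` maps each known page to the list of pages it links to.
    Links to pages outside `graph` are ignored, and a page with no
    usable outlinks spreads its score evenly over all pages.
    """
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = [t for t in graph[page] if t in graph]
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# The full link structure, which the crawler only knows at the very end.
full_web = {
    "A": ["B"],
    "B": ["C"],
    "C": ["B"],
    "D": ["B", "C"],
}

# Mid-crawl view: A and B are downloaded, C is only a discovered URL
# (its outlinks are unknown) and D has not been found yet.
partial_web = {
    "A": ["B"],
    "B": ["C"],
    "C": [],
}

partial_rank = pagerank(partial_web)
full_rank = pagerank(full_web)
print(max(partial_rank, key=partial_rank.get))  # top page in the partial view
print(max(full_rank, key=full_rank.get))        # top page in the full graph
```

In this toy graph the partial view puts C on top, while the full graph puts B on top, so an ordering based on the partial scores would differ from the final one.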

Stage three: the OCIP strategy

The OCIP strategy (more commonly written OPIC, for Online Page Importance Computation) is best seen as an improvement on the PageRank algorithm. Before the algorithm starts, every page is given the same amount of "cash". Whenever a page A is downloaded, A distributes its cash evenly among the pages it links to and its own cash is emptied. This is one reason why the fewer outbound links a page has, the more weight each of them carries.

The pages waiting to be crawled are sorted by the cash they hold, and the page with the most cash is downloaded first. OCIP follows roughly the same idea as PageRank. The difference is that PageRank must iterate over the whole graph on every computation while OCIP does not, so it is far faster to compute than PageRank and is suitable for real-time use. This may be why many pages are indexed within seconds.
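The cash mechanism described above can be sketched in a few lines of Python. This is a simplified illustration, not a full implementation: `ocip_crawl_order` and `link_graph` are hypothetical names, the set of pages is assumed to be known up front so that each can receive the same starting cash, ties are broken alphabetically, and cash sent to a page that has already been downloaded is simply dropped.

```python
def ocip_crawl_order(link_graph):
    """Return the download order under a cash-based (OCIP-style) policy.

    `link_graph` is a hypothetical stand-in for the web: it maps a URL
    to the URLs that page links to.
    """
    pages = set(link_graph)
    for targets in link_graph.values():
        pages.update(targets)

    cash = {page: 1.0 for page in pages}  # every page starts with the same cash
    downloaded = []
    done = set()

    while cash:
        # Pick the pending page holding the most cash (ties: alphabetical).
        url = min(cash, key=lambda u: (-cash[u], u))
        amount = cash.pop(url)            # the page's cash is emptied
        downloaded.append(url)
        done.add(url)

        # Simplification: only pages still waiting receive a share.
        outlinks = [t for t in link_graph.get(url, []) if t not in done]
        if outlinks:
            share = amount / len(outlinks)  # split evenly among linked pages
            for target in outlinks:
                cash[target] += share
    return downloaded


toy_graph = {
    "A": ["D"],
    "B": ["D"],
    "C": ["A"],
    "D": ["C"],
}

crawl_order = ocip_crawl_order(toy_graph)
print(crawl_order)
```

Each download only touches the downloaded page's own outlinks, with no pass over the whole graph, which is where the speed advantage over PageRank described above comes from.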

Stage four: the large-site priority strategy

The large-site priority idea is very direct: the importance of a page is measured at the level of its site. The pages in the queue of URLs to crawl are grouped by site, and the site with the most pages waiting to be downloaded has its links downloaded first. In essence, the strategy prefers to download URLs from large websites, because large websites tend to contain more pages. Since large websites are often well known and their pages are generally of higher quality, the idea is simple but has some basis.

Experiments show that although the algorithm is simple and crude, it is effective at collecting high-quality pages. This is one of the most important reasons why, when content from many sites is republished, the large sites that republish it can rank ahead of you.
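A minimal Python sketch of the selection step, assuming that "site" means the host part of the URL and that ties are broken by host name; `next_batch_large_site_first` and the example URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def next_batch_large_site_first(pending_urls):
    """Return (site, urls) for the site with the most pages waiting."""
    by_site = defaultdict(list)
    for url in pending_urls:
        by_site[urlparse(url).netloc].append(url)  # group by host name
    if not by_site:
        return None, []
    # The site with the largest backlog wins; ties are broken by host name.
    site = min(by_site, key=lambda s: (-len(by_site[s]), s))
    return site, by_site[site]


pending = [
    "http://big.example/1",
    "http://small.example/x",
    "http://big.example/2",
    "http://big.example/3",
    "http://small.example/y",
]

site, batch = next_batch_large_site_first(pending)
print(site, batch)
```

Here `big.example` has three pages waiting against two for `small.example`, so its URLs are downloaded first.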
