Stage 1: Take everything, large and small
In the first stage, a search engine crawls the Web with a "take everything" strategy: every link that can be found on a page is added to the queue of URLs to be crawled, and the URLs on each newly crawled page are mechanically extracted in turn. The approach is fairly old, but it works very well. It also explains why many webmasters report that the spider has visited their site and yet the pages were not indexed: being crawled is only the first stage.
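The "take everything" loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not say how the queue is ordered, so a first-in, first-out queue is assumed here, and `crawl_everything`, `get_links` and the toy link graph `TOY_WEB` are hypothetical names that stand in for a real fetcher and real pages.

```python
from collections import deque


def crawl_everything(seeds, get_links, max_pages=1000):
    """Visit pages in discovery order, queueing every link that is found."""
    frontier = deque(seeds)   # URLs waiting to be crawled (FIFO is an assumption)
    seen = set(seeds)         # URLs already queued, so nothing is queued twice
    order = []                # order in which pages were "downloaded"
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in get_links(url):   # mechanically extract every link
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


# Toy link graph standing in for real pages (hypothetical page names).
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}

order = crawl_everything(["a"], lambda url: TOY_WEB.get(url, []))
print(order)
```

Note that the loop only records that a page was fetched. Nothing in it decides whether the page is worth indexing, which is the gap the later stages address.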
Stage 2: Page rating
The second stage grades web pages by importance. PageRank is a well-known link analysis algorithm for measuring the importance of a page, so it is natural to use the idea behind PageRank to rank the URLs waiting to be crawled. This is what lies behind webmasters' enthusiasm for posting external links. According to a friend, the market for posting external links in China is worth billions of dollars a year.
The purpose of a crawler is to download web pages, but PageRank is a global algorithm: its results are reliable only once all pages have been downloaded. For small and medium-sized sites in particular, if the server is unreliable and the crawler sees only part of the content during the crawl, a reliable PageRank score cannot be obtained in the crawl phase.
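To make the "global algorithm" point concrete, here is a minimal sketch that runs a plain power-iteration PageRank on a tiny graph twice: once with every page known, and once with only part of the graph crawled. The function `pagerank`, the damping factor of 0.85, the handling of pages without outlinks, and both toy graphs are assumptions made for illustration, not details taken from any search engine.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict of page -> list of outlinks."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = graph.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no known outlinks: spread its rank over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# The complete toy graph.
FULL = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
# A partial crawl: "b" and "d" have not been downloaded, so the links
# they contain are unknown and "d" has not even been discovered.
PARTIAL = {"a": ["b", "c"], "c": ["a"]}

full_rank = pagerank(FULL)
partial_rank = pagerank(PARTIAL)
print(round(full_rank["c"], 3), round(partial_rank["c"], 3))
```

The score of page `c` differs noticeably between the two runs, because the partial graph is missing links that point to it. Scores computed mid-crawl are therefore estimates at best.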
Stage 3: The OCIP strategy
The OCIP strategy (the algorithm is usually written OPIC, for Online Page Importance Computation, in the literature) is best seen as a refinement of the PageRank algorithm. Before the algorithm starts, every page is given the same amount of "cash". Whenever a page A is downloaded, A divides its cash equally among the pages it links to, and its own cash is reset to zero. This is one of the reasons why the fewer outbound links a page has, the more weight each of those links passes on.
Pages waiting to be crawled are then sorted by the amount of cash they hold, and the richest pages are downloaded first. OCIP follows roughly the same idea as PageRank. The difference is that PageRank has to be computed iteratively, whereas OCIP does not, so it is far faster than PageRank and is suitable for real-time computation. This may be why many pages get indexed within seconds.
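The cash mechanism described above can be sketched as follows. This is a simplified illustration, not the published algorithm: it downloads each page only once, it gives a page its starting cash at the moment the page is first discovered (assumed here as a stand-in for "every page starts with the same cash"), and it breaks ties alphabetically. The names `cash_ordered_crawl`, `get_links`, `initial_cash` and `TOY_WEB` are hypothetical.

```python
def cash_ordered_crawl(seeds, get_links, initial_cash=1.0, max_pages=100):
    """Download the page holding the most cash first, then hand its cash on."""
    cash = {url: initial_cash for url in seeds}
    done = set()
    downloaded = []
    while len(downloaded) < max_pages:
        candidates = [url for url in cash if url not in done]
        if not candidates:
            break
        # Richest page first; ties are broken alphabetically (an assumption).
        url = min(candidates, key=lambda u: (-cash[u], u))
        downloaded.append(url)
        done.add(url)
        links = get_links(url)
        if links:
            share = cash[url] / len(links)   # equal split among linked pages
            for link in links:
                # A newly discovered page starts with initial_cash (assumption).
                cash[link] = cash.get(link, initial_cash) + share
        cash[url] = 0.0                      # the downloaded page is emptied
    return downloaded


# Toy link graph standing in for real pages (hypothetical page names).
TOY_WEB = {
    "home": ["news", "about"],
    "news": ["story", "about"],
    "about": ["story"],
    "story": [],
}

order = cash_ordered_crawl(["home"], lambda url: TOY_WEB.get(url, []))
print(order)
```

Each download costs one division and a few additions, with no repeated pass over the whole graph, which is the property that makes this kind of scoring usable while the crawl is still running.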
Stage 4: The large-site-first strategy
The idea behind large-site priority is very direct: importance is measured per site rather than per page. The pages in the to-crawl queue are grouped by the site they belong to, and whichever site has the most pages waiting to be downloaded has its links downloaded first. In essence, the strategy prefers to download URLs from large websites, because large websites tend to contain more pages. Since large websites are often well-known sites whose pages are generally of higher quality, the idea is simple but has some basis.
Experiments show that although the algorithm is simple and crude, it is very effective at collecting high-quality pages. This is one of the most important reasons why, when a site's content is reposted, the major sites that repost it can rank ahead of you.
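A minimal sketch of the per-site grouping might look like this. It assumes that "site" means the URL's hostname and that ties between sites are broken alphabetically. The function name `order_by_site_backlog` and the example URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def order_by_site_backlog(pending_urls):
    """Reorder pending URLs so the site with the most waiting pages goes first."""
    by_site = defaultdict(list)
    for url in pending_urls:
        # Treat the hostname as the "site"; URLs without one share the key "".
        by_site[urlsplit(url).hostname or ""].append(url)
    # Largest backlog first; ties are broken by hostname (an assumption).
    sites = sorted(by_site, key=lambda site: (-len(by_site[site]), site))
    return [url for site in sites for url in by_site[site]]


pending = [
    "http://small.example.org/a",
    "http://big.example.com/1",
    "http://big.example.com/2",
    "http://big.example.com/3",
    "http://small.example.org/b",
]

ordered = order_by_site_backlog(pending)
print(ordered)
```

A real crawler would recompute the backlog as new URLs arrive. The sketch orders a fixed snapshot of the queue only.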