2. Breadth-first crawler and crawler with preference (3)

Source: Internet
Author: User

4. Crawler with preference

When selecting the next URL to crawl, a crawler does not always have to follow the queue's "first-in-first-out" order. Instead, it can extract the most important URLs from the queue first. This policy is also called "page selection". It lets limited network resources be spent on the webpages of highest importance.

So which webpages are important?

Many factors determine the importance of a webpage. The main ones are link popularity, link importance, average link depth, website quality, and historical weight.

Link popularity is determined mainly by the quantity and quality of backlinks (that is, links pointing to the current URL). We define it as IB(P).
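As a rough illustration of the "quantity" half of link popularity, the sketch below counts distinct backlinks per URL. The function name `backlink_counts`, the dictionary-based link graph, and the example URLs are assumptions made for this example. Backlink quality is not modeled here.

```python
from collections import defaultdict

def backlink_counts(link_graph):
    """Map each URL to the number of distinct pages that link to it."""
    sources = defaultdict(set)
    for source, targets in link_graph.items():
        for target in targets:
            if target != source:  # ignore self-links
                sources[target].add(source)
    return {url: len(pages) for url, pages in sources.items()}

# Hypothetical link graph: page -> pages it links to.
graph = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": ["http://a.example/"],
}
ib = backlink_counts(graph)
```

In this made-up graph, `http://c.example/` has two backlinks and the other two pages have one each.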

Link importance is a function of the URL string. It examines only the string itself; for example, URLs containing ".com" or "home" are considered more important than those containing ".cc" or "map". We define it as IL(P).
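A string-only score of this kind might be sketched as follows. The weight tables, their numeric values, and the function name `url_string_importance` are hypothetical. Only the ordering (".com" above ".cc", "home" above "map") comes from the text.

```python
from urllib.parse import urlparse

# Hypothetical weights; only their relative order follows the text.
HOST_SUFFIX_WEIGHTS = {".com": 1.0, ".cc": 0.3}
PATH_KEYWORD_WEIGHTS = {"home": 0.5, "map": 0.1}

def url_string_importance(url):
    """Score a URL by looking only at its string, never at page content."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path.lower()
    score = 0.0
    for suffix, weight in HOST_SUFFIX_WEIGHTS.items():
        if host.endswith(suffix):
            score += weight
    for keyword, weight in PATH_KEYWORD_WEIGHTS.items():
        if keyword in path:
            score += weight
    return score

home_score = url_string_importance("http://example.com/home")
map_score = url_string_importance("http://example.cc/map")
```

With these assumed weights, `home_score` is larger than `map_score`, and a URL matching none of the rules scores zero.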

Average link depth: compute the link depth of pages across the site using the breadth-first principle analyzed above; pages closer to the seed site are considered more important. We define it as ID(P).
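One way to obtain link depths is a breadth-first traversal from the seed, as in the sketch below. The function name `link_depths`, the in-memory link graph, and the page names are assumptions for illustration. A real crawler would discover links while fetching pages.

```python
from collections import deque

def link_depths(link_graph, seed):
    """Return the minimum number of link hops from the seed to each reachable page."""
    depths = {seed: 0}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for target in link_graph.get(page, []):
            if target not in depths:  # first visit is the shortest path in BFS
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical site structure: page -> pages it links to.
site = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c"],
    "c": ["seed"],
}
depths = link_depths(site, "seed")
```

Pages with a smaller depth are closer to the seed and would be treated as more important under this criterion.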

If we define the importance of a web page as I(P), it is determined by the following formula:

I(P) = X * IB(P) + Y * IL(P)

The parameters X and Y adjust the relative weights of IB(P) and IL(P). ID(P) is already guaranteed by the breadth-first traversal rule, so it is not treated as a major indicator function and does not appear in the formula.
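The weighted sum can be written directly as a small function. The default weights `x=0.7` and `y=0.3` and the sample scores are arbitrary assumptions. The text does not prescribe particular values.

```python
def page_importance(ib, il, x=0.7, y=0.3):
    """Combine link popularity (ib) and link importance (il) as x*ib + y*il."""
    return x * ib + y * il

# Hypothetical scores for one page, with explicit weights.
score = page_importance(ib=2, il=1.5, x=1.0, y=2.0)
```

Here the result is 1.0 * 2 + 2.0 * 1.5 = 5.0. Raising `x` favors pages with many backlinks, while raising `y` favors pages whose URL strings look important.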

How can we implement a best-first crawler? The simplest way is to implement the TODO table as a priority queue and use the importance of each URL as the priority of its queue element. That way, the URL selected for expansion at each step is the most important one currently in the queue.
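A minimal sketch of such a TODO table, built on Python's `heapq` module, is shown below. The class name `PriorityFrontier`, its methods, and the example URLs and scores are assumptions for illustration, not a definitive design.

```python
import heapq
import itertools

class PriorityFrontier:
    """TODO table that always hands out the most important pending URL."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker: earlier pushes win

    def push(self, url, importance):
        if url in self._seen:  # never queue the same URL twice
            return
        self._seen.add(url)
        # heapq is a min-heap, so store the negated importance.
        heapq.heappush(self._heap, (-importance, next(self._counter), url))

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)

frontier = PriorityFrontier()
frontier.push("http://a.example/", 0.2)
frontier.push("http://b.example/", 0.9)
frontier.push("http://c.example/", 0.5)
frontier.push("http://b.example/", 0.1)  # duplicate, ignored

order = [frontier.pop() for _ in range(len(frontier))]
```

The URLs come out in descending order of importance (b, c, a) regardless of insertion order. The counter keeps first-in-first-out behavior among URLs with equal importance.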
