4. Crawlers with preference
Sometimes, when selecting the next URL to crawl from the URL queue, you may not want to follow the queue's "first-in-first-out" order. Instead, the most important URLs are taken from the queue first. This policy is also called "page selection". It lets limited network resources be spent on the pages that matter most.
So which web pages count as important?
Many factors go into judging the importance of a web page. The main ones are link popularity, link importance, average link depth, website quality, and historical weight.
Link popularity is mainly determined by the quantity and quality of backlinks (that is, links pointing to the current URL). We define it as IB(P).
Link importance is a function of the URL string. It examines only the string itself; for example, URLs containing ".com" or "home" are considered more important than URLs containing ".cc" or "map". We define it as IL(P).
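As a minimal sketch of what a string-only IL(P) might look like, the function below scores a URL from its top-level domain and a keyword in its path. The function name link_importance, the score tables, and the averaging rule are all assumptions made for illustration; the text only states that ".com" and "home" rank above ".cc" and "map".

```python
from urllib.parse import urlparse

# Hypothetical scores, chosen only to reflect the ordering given in the text
# (".com" above ".cc", "home" above "map"). Real values would need tuning.
TLD_SCORES = {"com": 1.0, "cc": 0.3}
KEYWORD_SCORES = {"home": 1.0, "map": 0.2}
DEFAULT_SCORE = 0.5  # assumed score for anything not listed above


def link_importance(url: str) -> float:
    """Illustrative IL(P): score a URL by inspecting its string only."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # Take the text after the last dot of the host name as the TLD.
    tld = host.rsplit(".", 1)[-1].lower() if "." in host else ""
    tld_score = TLD_SCORES.get(tld, DEFAULT_SCORE)

    # Use the first listed keyword that appears in the path, if any.
    path = parsed.path.lower()
    keyword_score = DEFAULT_SCORE
    for keyword, score in KEYWORD_SCORES.items():
        if keyword in path:
            keyword_score = score
            break

    # Assumed combination rule: a plain average of the two scores.
    return (tld_score + keyword_score) / 2
```

With these assumed scores, link_importance("http://example.com/home") ranks above link_importance("http://example.cc/map"), which matches the ordering described above.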
Average link depth: computed for the entire site using the breadth-first principle analyzed above, on the assumption that pages closer to the seed site are more important. We define it as ID(P).
If we define the importance of a web page as I(P), it is given by the following formula:
I(P) = X * IB(P) + Y * IL(P)
The parameters X and Y adjust the relative weights of IB(P) and IL(P). ID(P) is already guaranteed by the breadth-first traversal rule, so it does not need to appear as a separate term in the importance function.
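The weighted sum can be written as a one-line function. This is only a sketch: the name page_importance and the default weights 0.7 and 0.3 are assumptions, since the text does not give concrete values for X and Y.

```python
def page_importance(ib: float, il: float, x: float = 0.7, y: float = 0.3) -> float:
    """Combine link popularity IB(P) and link importance IL(P) into I(P).

    x and y are the tuning weights X and Y; the defaults are hypothetical.
    """
    return x * ib + y * il


# Two hypothetical pages: one with many good backlinks but a plain URL,
# one with few backlinks but a "good-looking" URL.
popular_page = page_importance(ib=0.9, il=0.4)
pretty_url_page = page_importance(ib=0.2, il=1.0)
```

Raising X makes backlinks dominate the ranking, while raising Y favors pages whose URL string looks important.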
How can we implement a best-first crawler? The simplest way is to implement the TODO table as a priority queue and use the importance of each URL as the priority of its queue element. That way, the URL selected for expansion each time is always the most important page still waiting in the queue.
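A minimal sketch of such a TODO table, built on Python's heapq module, is shown below. The class name PriorityFrontier, its methods, and the example URLs and scores are assumptions made for illustration; the importance value is expected to come from whatever I(P) the crawler uses.

```python
import heapq
import itertools


class PriorityFrontier:
    """TODO table backed by a heap: the most important URL is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()                 # URLs already queued, to avoid duplicates
        self._counter = itertools.count()  # tie-breaker: FIFO among equal scores

    def push(self, url, importance):
        """Queue a URL with its importance; return False if already queued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so store the negated importance.
        heapq.heappush(self._heap, (-importance, next(self._counter), url))
        return True

    def pop(self):
        """Remove and return (url, importance) for the most important URL."""
        neg_importance, _, url = heapq.heappop(self._heap)
        return url, -neg_importance

    def __len__(self):
        return len(self._heap)


frontier = PriorityFrontier()
frontier.push("http://example.com/map", 0.2)
frontier.push("http://example.com/home", 0.9)
frontier.push("http://example.com/about", 0.5)

crawl_order = []
while frontier:
    url, _ = frontier.pop()
    crawl_order.append(url)  # a real crawler would fetch and parse the page here
```

Here crawl_order ends up as home, about, map, regardless of insertion order. The counter keeps ordering stable when two URLs have the same importance, so equal-priority URLs still come out first-in-first-out.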