A Newbie Also Wants to Play with Search Engines: Analysis of Some Technical Points of Crawlers (Supplement)

After a long break, I am continuing this series. This article is a supplement to the previous one: Analysis of Some Technical Points of Crawlers.

This article mainly deals with the remaining questions, starting with: how to handle the extracted URLs.

3. How to handle the extracted URLs (crawling policy)?

The crawling policy determines the order in which a crawler visits pages starting from the seed URLs. The following are several typical crawling policies (I will only introduce them briefly here; see the relevant literature for more details):

(1) Depth-first strategy

Most people will understand this term immediately: the strategy is implemented as a depth-first graph traversal. Once we model the web as a graph, with each page as a node and each link as an edge, crawling naturally becomes a graph-traversal problem, and depth-first traversal (in effect, a depth-first walk of the link tree) is an obvious choice. Consider the following example ordering:

Traversing the example graph depth-first, the order is a --> b --> c --> d --> e --> f --> g --> h.

In most cases, a general-purpose crawler does not adopt this as its global crawling policy: on a complex web graph it is easy to go too far down a single branch, and the approach is also hard to control and parallelize in practice.

Reference: depth-first traversal
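
To make the ordering concrete, here is a minimal depth-first crawl sketch in Python. It only illustrates the traversal order, not a real crawler: fetch_links is a hypothetical helper that downloads a page and returns its out-links, and max_depth is an assumed cap on how deep a branch may go (the "going too far forward" problem mentioned above).

def dfs_crawl(seed_url, fetch_links, max_depth=5):
    # Minimal depth-first crawl sketch. fetch_links(url) is a hypothetical
    # helper that downloads a page and returns the URLs it links to.
    visited = set()

    def visit(url, depth):
        if url in visited or depth > max_depth:
            return
        visited.add(url)
        for link in fetch_links(url):   # follow one branch all the way down
            visit(link, depth + 1)      # before moving on to the next sibling

    visit(seed_url, 0)
    return visited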

(2) Breadth-first strategy

Like the first method, this is a graph traversal, but level by level (layered traversal). It is simple, intuitive, and has a long history: it has been in use since search-engine crawlers first appeared. New crawling strategies often use it as a baseline, but note that it is also a surprisingly strong method; in practice many newer strategies do not actually outperform breadth-first traversal. To this day it remains the preferred crawling policy of many production crawler systems.

Traversing the same example breadth-first, the order is a --> b --> f --> c --> d --> e --> g --> h.

According to some research, most valuable web pages lie no more than 10 levels deep (I cannot remember the exact figure; it is probably much less than 10). In other words, the web graph is usually shallow, but each level fans out roughly geometrically. For example, a simple news portal is typically organized as: homepage (one page) --> topic list pages (a dozen or so) --> content pages (thousands to hundreds of thousands). Comparing the two strategies, breadth-first is generally considered the better fit for this shape. In addition, breadth-first makes it easy to introduce parallelism at an appropriate level; for example, each topic found on a topic list page can be handed to its own worker thread (or sub-crawler), roughly as in the sketch below.
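
The following is an equally minimal breadth-first sketch, again assuming the hypothetical fetch_links helper. The only structural change from the depth-first version is that a FIFO queue replaces the recursion, which is also what makes it easy to hand newly discovered topic-list URLs to separate workers.

from collections import deque

def bfs_crawl(seed_urls, fetch_links, max_pages=10000):
    # Minimal breadth-first crawl sketch: a FIFO queue yields the
    # level-by-level order a --> b --> f --> ... described above.
    visited = set(seed_urls)
    queue = deque(seed_urls)
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        for link in fetch_links(url):
            if link not in visited:      # pages discovered later are
                visited.add(link)        # crawled later (same level first)
                queue.append(link)
    return visited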

(3) Incomplete (partial) PageRank strategy

First, a quick note on PageRank: the PageRank algorithm is part of Google's ranking formula. It is a method Google uses to estimate the level/importance of a web page, and it is one of the criteria Google uses to measure the quality of a website.

Since PageRank gives a good importance score for pages, it is natural to want to use it to prioritize URLs. But there is a problem: PageRank is a global algorithm, meaning its results are only reliable once all web pages have been downloaded, while a crawler, whose job is precisely to download those pages, can only see a fraction of them while it is running. Pages therefore cannot obtain reliable PageRank scores during the crawling stage. The workaround is to take the already-downloaded pages together with the URLs in the queue to be crawled, form a page set, and compute PageRank within this set. After the computation, the URLs waiting to be crawled are sorted by PageRank score in descending order, and this ordering is the list of URLs the crawler should fetch next. This is also why the method is called "incomplete PageRank".

Reference: an encyclopedia entry plus a video tutorial (a Hadoop series that is quite good; it briefly walks through a PageRank implementation).
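
As a rough illustration of the idea (not an optimized or production-grade implementation), the sketch below runs a few plain power-iteration rounds of PageRank over the set of already-downloaded pages plus the frontier URLs, then sorts the frontier by the resulting scores. The names link_graph and frontier are assumptions, and dangling-node handling is omitted for brevity.

def rank_frontier(link_graph, frontier, damping=0.85, iterations=20):
    # link_graph: downloaded page -> list of pages it links to (downloaded or not)
    # frontier:   URLs waiting to be crawled
    nodes = set(link_graph) | {v for links in link_graph.values() for v in links} | set(frontier)
    rank = {u: 1.0 / len(nodes) for u in nodes}

    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / len(nodes) for u in nodes}
        for page, links in link_graph.items():
            if links:                                    # spread this page's rank
                share = damping * rank[page] / len(links)
                for target in links:
                    new_rank[target] += share
        rank = new_rank

    # crawl the highest-scoring frontier URLs first
    return sorted(frontier, key=lambda u: rank[u], reverse=True)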

(4) OPIC strategy

OPIC stands for "On-line Page Importance Computation" and can be viewed as an improved version of PageRank. Before the algorithm starts, every page is given the same amount of "cash". Each time a page P is downloaded, P distributes its cash evenly among the pages it links to and then clears its own cash to zero. The pages in the URL queue to be crawled are sorted by the amount of cash they currently hold, and the page with the most cash is downloaded first. In its overall framework OPIC is broadly similar to PageRank; the difference is that PageRank must be recomputed iteratively each time, whereas OPIC needs no iterative process, so it is much faster and suitable for on-line (real-time) computation. In addition, PageRank includes a random-jump (teleport) factor toward pages without links, whereas OPIC has no such term. Experiments show that OPIC is a good importance measure, and its effect is slightly better than breadth-first traversal.

This is only a summary; I have never worked with a concrete implementation, and I have the feeling this kind of process is actually well suited to recommendation systems as well.

Reference: the book "This Is the Search Engine: A Detailed Explanation of Core Technologies".
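
Here is a small sketch of the cash bookkeeping described above, under the simplification that only the seed URLs start with cash (rather than every page on the Internet); fetch_links is again a hypothetical helper, and the code only shows the ordering rule, not the full OPIC algorithm with its cash history.

def opic_crawl(seed_urls, fetch_links, max_pages=1000):
    cash = {url: 1.0 for url in seed_urls}   # simplification: only seeds start with cash
    frontier = set(seed_urls)
    downloaded = set()

    while frontier and len(downloaded) < max_pages:
        # download the page that currently holds the most cash
        page = max(frontier, key=lambda u: cash.get(u, 0.0))
        frontier.discard(page)
        downloaded.add(page)

        links = [u for u in fetch_links(page) if u not in downloaded]
        if links:
            share = cash.get(page, 0.0) / len(links)
            for link in links:                   # spread the cash evenly over out-links
                cash[link] = cash.get(link, 0.0) + share
                frontier.add(link)
        cash[page] = 0.0                         # then clear the page's own cash
    return downloaded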

(5) Large-site-first strategy

This one is easy to understand from its name: during crawling, URLs from large websites are preferred, i.e. page importance is judged by the website a page belongs to. The URLs in the queue to be crawled are grouped by website, and the site with the most pages waiting to be downloaded is crawled first. The essence of this idea is to prioritize downloading from large websites, since large sites tend to contain more pages, are often run by well-known companies, and their page quality is generally high. The idea is simple but has a reasonable basis, and experiments show that it, too, outperforms breadth-first traversal.

I personally feel that this strategy can be combined with other strategies to achieve better results.
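
As a rough illustration of the grouping step (the data structures are assumptions), the pending URLs can be bucketed by host and the host with the largest backlog served first:

from collections import defaultdict
from urllib.parse import urlsplit

def pick_next_batch(pending_urls):
    # group the URLs waiting to be crawled by their website (host name)
    by_site = defaultdict(list)
    for url in pending_urls:
        by_site[urlsplit(url).netloc].append(url)
    # the site with the most pages waiting to be downloaded goes first
    biggest_site = max(by_site, key=lambda site: len(by_site[site]))
    return by_site[biggest_site]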

(6) Web page update strategy

Dynamism is one of the Internet's defining features: new pages appear all the time, existing pages change, and some pages are deleted. For a crawler, this dynamism does not stop once a page has been fetched; the locally downloaded pages can be seen as a mirror of the live web, and the crawler should keep that mirror as consistent with the web as possible. Suppose a page has been deleted or its content has changed substantially, but the search engine is unaware of this and still ranks it by its old content; serving it as a search result obviously makes for a poor user experience. Crawlers are therefore also responsible for keeping their local copies synchronized with the pages on the Internet, and how well they do this depends on the web page update policy they use. The task of the update policy is to decide when to re-crawl previously downloaded pages so that the local copies stay as consistent as possible with the originals. Three common update policies are the historical-reference policy, the user-experience policy, and the clustering-sampling policy.
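
The concrete policies are beyond the scope of this post, but as a toy illustration in the spirit of the historical-reference policy, a crawler can shorten the re-crawl interval for pages that changed since the last visit and lengthen it for stable ones. All the interval values below are illustrative guesses, not figures from the literature.

def next_crawl(last_crawl_time, page_changed, current_interval,
               min_interval=600, max_interval=7 * 24 * 3600):
    # all times are in seconds; current_interval is this page's present re-crawl interval
    if page_changed:
        new_interval = max(min_interval, current_interval / 2)   # changed: revisit sooner
    else:
        new_interval = min(max_interval, current_interval * 2)   # stable: back off
    return last_crawl_time + new_interval, new_interval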

To be honest, most of the policies above matter mainly for high-performance general-purpose crawlers, and a newbie-level, entry-level crawler will find it hard to apply them well. So here are a few basic crawling tips instead:

(1) Restrict domain names. Most vertical crawlers should consider this; if you do not want the crawled data to be full of advertisements, this simple filter is very effective.

(2) Raise the priority of list pages. There is not much to explain here: most of the time, the valuable content you want to crawl is reached through them.

(3) Filter URLs based on the results of test crawls. For example, if you do not need certain sub-sections of a website, add the corresponding URL prefixes to a blacklist and skip any URL that matches it. A small sketch combining these three tips follows below.
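
Here is that sketch; the domain list, the blacklist prefixes, and the "list page" test are all site-specific assumptions that you would replace with your own rules.

from urllib.parse import urlsplit

ALLOWED_DOMAINS = {"news.example.com"}                      # tip (1): restrict domains
BLACKLIST_PREFIXES = ("http://news.example.com/ads/",)      # tip (3): skip unwanted sections

def score_url(url):
    # return None to drop the URL, or a priority (lower = crawled earlier)
    if urlsplit(url).netloc not in ALLOWED_DOMAINS:
        return None
    if url.startswith(BLACKLIST_PREFIXES):
        return None
    if "/list/" in url or url.endswith("index.html"):        # tip (2): list pages first
        return 0
    return 1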

That is all for now. I feel the writing is getting more and more watered down, and with graduation approaching I am a bit distracted. I will sort out my thoughts and try to write the next part, on URL filtering and de-duplication.
