Analysis and Implementation of Key Distributed Web Crawler Technologies: Distributed Web Crawler Architecture Design

Source: Internet
Author: User

I. Study Scope

A distributed web crawler consists of multiple crawlers, each of which performs tasks similar to those of a standalone crawler: it downloads web pages from the Internet, saves them to a local disk, extracts URLs from them, and follows those URLs to continue crawling. Because parallel crawlers must split the download work among themselves, a crawler may send URLs it has extracted to other crawlers. These crawlers may be located in the same LAN or spread across different geographical locations.
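The extract-and-hand-off step described above can be sketched as follows. This is only an illustration under stated assumptions: the source does not say how download tasks are split, so assigning each URL to a crawler by hashing its host name is an assumed policy, and the names `extract_urls`, `owner_of`, and `route` are hypothetical.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_urls(base_url, html):
    """Return the list of absolute URLs linked from one downloaded page."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def owner_of(url, num_crawlers):
    """Map a URL to a crawler index by hashing its host (assumed policy)."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_crawlers


def route(urls, my_id, num_crawlers):
    """Split extracted URLs into those kept locally and those sent to peers."""
    keep, forward = [], {}
    for url in urls:
        owner = owner_of(url, num_crawlers)
        if owner == my_id:
            keep.append(url)
        else:
            forward.setdefault(owner, []).append(url)
    return keep, forward
```

Hashing on the host rather than the full URL keeps all pages of one site with the same crawler, which is one common way to split the work; other partitioning rules would fit the description equally well.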

Depending on the degree of distribution of crawlers, distributed crawlers can be divided into the following two categories:

1. LAN-based distributed crawler: All crawlers run in the same LAN and communicate with each other over high-speed network connections. They access the Internet and download web pages through the same network, so all of the network load is concentrated at the egress of that LAN. The LAN's high bandwidth keeps communication between crawlers efficient. However, the total bandwidth of the network egress is fixed, so the number of crawlers is limited by the LAN egress bandwidth.

2. WAN-based distributed crawler: When the crawlers of a parallel crawler run in different geographical locations (or network locations), we call this parallel crawler a distributed crawler. For example, the crawlers may be located in China, Japan, and the United States, each responsible for downloading web pages from its own region; or they may be located in Chinanet, CERNET, and Ceinet, each downloading the web pages of its own network. The advantage of this kind of crawler is that it spreads network traffic to some extent and reduces the load at any single network egress. When crawlers are distributed across different geographic (or network) locations, the time they need to communicate with each other must be considered. Crawlers generally have to communicate over the Internet, so the bandwidth available between them may be limited.

In practice, LAN-based distributed web crawlers are more widely used. WAN-based crawlers are complex and therefore costly to design and implement, so they are generally used only by large, well-resourced companies with heavy collection workloads. The crawler designed in this thesis is a LAN-based distributed web crawler.
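For the LAN-based case, the limit that the egress places on the number of crawlers can be estimated with simple arithmetic. The sketch below assumes each crawler sustains a roughly constant download rate; the function name and the example figures are hypothetical and not taken from the source.

```python
def max_crawlers(egress_mbps, per_crawler_mbps):
    """Upper bound on crawler count when all crawlers share one LAN egress."""
    if per_crawler_mbps <= 0:
        raise ValueError("per_crawler_mbps must be positive")
    return int(egress_mbps // per_crawler_mbps)


# Example: a 100 Mbit/s egress shared by crawlers that each use about 8 Mbit/s.
print(max_crawlers(100, 8))  # prints 12
```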

 

II. Overall Analysis of Distributed Web Crawlers

The overall design of a distributed web crawler focuses on how the crawlers communicate. Based on the communication method, distributed web crawlers can currently be divided into master-slave mode, autonomous mode, and hybrid mode.

In master-slave mode, one host acts as the control node and manages all the hosts that run crawlers. A crawler only receives tasks from the control node and submits newly generated tasks back to it; it never needs to communicate with other crawlers. This approach is easy to implement and easy to manage. The control node, however, must communicate with every crawler, so it keeps an address list that stores information about all crawlers in the system. When the number of crawlers changes, the control node must update the address list; this process is transparent to the crawlers. The drawback is that as the number of crawlers grows, the control node becomes the bottleneck of the entire system and degrades the performance of the whole distributed web crawler.

Figure: Overall structure of the master-slave mode
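The role of the control node in master-slave mode can be sketched as a small class that owns the address list and the task queue. This is a minimal in-memory illustration, not the author's implementation: the class and method names are hypothetical, the address format is assumed, and the duplicate check in `submit` is an added assumption rather than something the source describes.

```python
from collections import deque


class ControlNode:
    """Control node: holds the address list and the global task queue."""

    def __init__(self):
        self.address_list = {}  # crawler id -> address (assumed format)
        self.tasks = deque()    # URLs waiting to be crawled
        self.seen = set()       # URLs already queued (assumed dedup step)

    def register(self, crawler_id, address):
        """Add a crawler; only the control node's address list changes."""
        self.address_list[crawler_id] = address

    def unregister(self, crawler_id):
        """Remove a crawler; other crawlers are not notified."""
        self.address_list.pop(crawler_id, None)

    def submit(self, urls):
        """Accept newly generated tasks from a crawler."""
        for url in urls:
            if url not in self.seen:
                self.seen.add(url)
                self.tasks.append(url)

    def next_task(self, crawler_id):
        """Hand the next task to a registered crawler, or None if unavailable."""
        if crawler_id not in self.address_list or not self.tasks:
            return None
        return self.tasks.popleft()
```

Every `submit` and `next_task` call goes through this one object, which illustrates why the control node becomes the bottleneck as the number of crawlers grows.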

 

Autonomous mode means that the system has no coordinator and the crawlers must communicate with one another directly, which makes each crawler more complex than in master-slave mode. Autonomous mode can use either full-connection or ring communication.

In full-connection communication, every crawler can send messages to every other crawler. Each crawler maintains an address list that stores the locations of all crawlers in the system, so data can be sent directly to the crawler that needs it. When the number of crawlers in the system changes, the address list of every crawler must be updated.

In ring communication, the crawlers logically form a ring network in which data travels in one direction, either clockwise or counterclockwise. Each crawler's address list stores only its predecessor and its successor. When a crawler receives data, it checks whether the data is addressed to itself. If not, it forwards the data to its successor; if so, it keeps the data and does not forward it. Assume the entire system has N crawlers. When the number of crawlers in the system changes, only N-1 crawler address lists need to be updated.
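The forwarding rule of ring communication can be sketched as follows. This is an in-process illustration only: real crawlers would exchange messages over the network, and the names `RingCrawler`, `build_ring`, and `send` are hypothetical. The check that stops after a full lap is an added safeguard, not part of the source's description.

```python
class RingCrawler:
    """A crawler that knows only its predecessor and successor."""

    def __init__(self, crawler_id):
        self.crawler_id = crawler_id
        self.predecessor = None
        self.successor = None
        self.inbox = []


def build_ring(ids):
    """Link crawlers into a logical ring in the order given."""
    crawlers = [RingCrawler(i) for i in ids]
    n = len(crawlers)
    for i, crawler in enumerate(crawlers):
        crawler.successor = crawlers[(i + 1) % n]
        crawler.predecessor = crawlers[(i - 1) % n]
    return crawlers


def send(start, target_id, data):
    """Pass data along successors from `start`; return hops used, or None."""
    node, hops = start, 0
    while True:
        if node.crawler_id == target_id:
            node.inbox.append(data)  # addressed to this crawler: keep it
            return hops
        node = node.successor        # not ours: forward to the successor
        hops += 1
        if node is start:
            return None              # full lap completed: no such crawler
```

For example, in a ring built from the ids 0 to 3, `send(ring[1], 3, "http://example.com/")` passes through crawler 2 and is delivered to crawler 3 after two hops.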

 

 

The hybrid mode is a compromise that combines the features of the two modes above. In this mode, all crawlers can communicate with each other and all of them can allocate tasks. Among them, however, there is one special crawler, whose main job is to centrally allocate the tasks that remain unassigned after the ordinary crawlers have done their own allocation. With this method, each ordinary crawler only needs to maintain the address list for its collection range. The special crawler, in addition to the address list for the collection range, also stores the address list for tasks that require centralized allocation.

Figure: Overall structure of the hybrid mode
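The two-level allocation in hybrid mode can be sketched as follows. Several details here are assumptions made for illustration and do not come from the source: a collection range is modeled as a domain suffix, the special crawler assigns leftover hosts round-robin, and all names (`owner_by_range`, `SpecialCrawler`, `assign`) are hypothetical.

```python
from urllib.parse import urlparse


def owner_by_range(range_table, url):
    """Look up a URL in the collection-range address list (suffix -> crawler)."""
    host = urlparse(url).hostname or ""
    for suffix, owner in range_table.items():
        if host == suffix or host.endswith("." + suffix):
            return owner
    return None


class SpecialCrawler:
    """Centrally allocates tasks that no collection range covers."""

    def __init__(self, workers):
        self.workers = workers   # crawler names eligible for central tasks
        self.central_table = {}  # host -> crawler name, assigned centrally
        self._next = 0

    def allocate(self, url):
        host = urlparse(url).hostname or ""
        if host not in self.central_table:
            # Assumed policy: hand unseen hosts to workers in turn.
            worker = self.workers[self._next % len(self.workers)]
            self.central_table[host] = worker
            self._next += 1
        return self.central_table[host]


def assign(range_table, special, url):
    """Any crawler tries its own range table first, then asks the special crawler."""
    owner = owner_by_range(range_table, url)
    return owner if owner is not None else special.allocate(url)
```

Only URLs that fall outside every collection range reach the special crawler, so it handles far less traffic than the control node in master-slave mode.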

 

III. Architecture of a Large-Scale Distributed Web Crawler:

From these figures, we can see that a distributed web crawler is a very complex system in which many factors must be considered. Performance is an important indicator. Hardware resources are of course also required, but they are outside the scope of this series. Starting with the next article, I will introduce, step by step, solutions to the problems that need to be considered, beginning with single-host web crawlers. If you have a better solution, I would welcome your advice.

A remark by Auspicious makes sense: a person can only do a few things in a lifetime. I hope you will support this series.

 

 
