Distributed Multi-Crawler System: Architecture Design

Tags: exception handling, message queue, redis
Preface:

In crawler development there are business scenarios that require crawling hundreds or even thousands of sites at the same time, which calls for a framework that supports multiple crawlers. The design should take the following points into account:

  • Code reuse and functional modularity. If you write a complete crawler for every site, a great deal of the work is repeated; development is inefficient, and by the end the whole crawler project becomes bloated and hard to manage.
  • Easy to extend. The most intuitive requirement of a multi-crawler framework is that adding a new target site should only require writing the small amount of content that is truly specific to it (crawl rules, parsing rules, storage rules), and the faster the better.
  • Robustness and maintainability. With so many sites being crawled at once, errors are more likely: dropped connections, crawls interrupted halfway, "dirty data", and so on. You therefore need good log monitoring, real-time visibility into the crawler system's state, and precise, detailed error reporting; you also need thorough exception handling, otherwise you may come back from a holiday to find the crawler died days ago over some minor problem (although in practice I do check on the crawler remotely now and then). A minimal sketch of such a fault-tolerant worker loop is given after the next paragraph.
  • Distributed operation. Crawling many sites usually means a large data volume, so being able to scale out across machines is a necessary feature. In the distributed setup, pay attention to the message queue and to unified deduplication across the nodes.
  • Crawler optimization. This is a big topic in its own right, but at a minimum the framework should be asynchronous, or use coroutines plus multiple processes.
  • A concise structure, so that future, as-yet-unknown functional modules are easy to add.

The requirements above are already quite clear. Below is an architecture design I did last year, shared here now; the concrete code implementation will not be released for the time being.
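To make the robustness requirement concrete, here is a minimal Python sketch of the kind of fault-tolerant worker loop argued for above: failures are logged with a full traceback and the loop keeps running, so one bad page or a brief network outage does not kill the node. The callback name `process_one_task` and the logger name are placeholders for this example, not part of the original design.

    import logging
    import time

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    log = logging.getLogger("crawler.worker")

    def run_worker(process_one_task):
        """Process tasks forever; log failures in detail instead of dying on them."""
        while True:
            try:
                process_one_task()
            except KeyboardInterrupt:
                log.info("worker stopped by operator")
                break
            except Exception:
                # One bad page, a dropped connection, or "dirty data" should not
                # bring the node down: record the full traceback and carry on.
                log.exception("task failed, continuing after a short back-off")
                time.sleep(1)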



Main text:

The design idea behind the architecture is explained by walking through the two diagrams below.
The framework is divided into two main parts: the Downloader and the Analyzer (parser). The Downloader is responsible for crawling web pages; the Analyzer is responsible for parsing them and writing the results to storage. The two communicate through a message queue (MQ) and can be deployed on different machines or on the same one. The number of each is flexible: there might, for example, be five machines downloading and two analyzing, and this can be adjusted at any time according to the state of the crawler system. As the first diagram shows, the MQ carries two pipelines: "html/js files" and "seeds to be crawled". The Downloader takes a seed from the seed queue, calls the corresponding crawl module according to the seed's information to fetch the page, and then pushes the result into the html/js channel. The Analyzer takes a page from the html/js channel, calls the corresponding parsing module according to the information it carries, stores the target fields, and, when necessary, pushes newly discovered seeds back into the MQ. The Downloader also maintains a user-agent pool, a proxy pool, and a cookie pool, so it can cope with crawling complex websites. Module calls are dispatched through the factory pattern.
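As an illustration of how a Downloader node and the factory-style module dispatch might look, here is a minimal Python sketch that uses Redis lists as the two MQ channels. The queue names `seeds` and `htmls`, the `CRAWLERS` registry, and the `crawl_example_site` function are hypothetical names chosen for this example; the original implementation is not published.

    import json

    import redis
    import requests

    r = redis.StrictRedis(host="localhost", port=6379, db=0, decode_responses=True)

    # Hypothetical registry of per-site crawl modules (the "factory" part):
    # adding a new site only means registering one more function here.
    CRAWLERS = {}

    def register(site):
        def wrapper(func):
            CRAWLERS[site] = func
            return func
        return wrapper

    @register("example_site")
    def crawl_example_site(seed):
        # Site-specific crawl rule: a plain GET is enough for this toy example.
        resp = requests.get(seed["url"], timeout=10)
        return resp.text

    def downloader_loop():
        """Pop a seed from the 'seeds' channel, fetch the page with the matching
        site module, and push the result into the 'htmls' channel."""
        while True:
            _, raw = r.blpop("seeds")              # blocking pop of the next seed
            seed = json.loads(raw)
            crawl = CRAWLERS[seed["site"]]         # factory-style dispatch by site name
            html = crawl(seed)
            r.rpush("htmls", json.dumps(
                {"site": seed["site"], "url": seed["url"], "html": html}))

An Analyzer node would be the mirror image: block-pop from `htmls`, dispatch to a per-site parsing module, store the target fields, and push any newly discovered seeds back onto `seeds`.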

The second picture is another way of expressing the previous one. The HTMLS queue and the seed queue are independent queues that can be split apart, and there can even be several of each with no coupling between them; they can be adjusted flexibly according to the crawler's state and the hardware environment. As a seed queue, a Redis instance with 8 GB of memory can hold roughly 50 to 80 million seeds. A key point of a distributed crawler is deduplication: as the diagram shows, the multiple Analyzer nodes share a single deduplication queue, which keeps the data unified and free of duplicates. The deduplication queue can be placed on a single machine. It is implemented with a Bloomfilter algorithm on top of Redis (see "Redis-based Bloomfilter deduplication (with Python code)" for details); theoretically 8 GB of memory is enough to deduplicate 3 billion URLs, provided a small probability of missed URLs (Bloom-filter false positives) is acceptable.
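For reference, a deduplication check along these lines could be sketched as a Bloom filter kept in a Redis bitmap and shared by all Analyzer nodes. This is only an illustrative toy under assumed parameters (the class name, Redis key, bit size, and hash count are all made up here); the article cited above contains the author's actual Python implementation.

    import hashlib

    import redis

    class RedisBloomFilter:
        """Minimal Bloom filter on a Redis bitmap, shared by all Analyzer nodes."""

        def __init__(self, conn, key="crawler:bloom", bit_size=1 << 31, hash_count=7):
            self.conn = conn
            self.key = key
            self.bit_size = bit_size      # 2^31 bits = 256 MB, within Redis's 512 MB string limit
            self.hash_count = hash_count

        def _offsets(self, url):
            # Derive several bit positions from the URL with salted hashes.
            for i in range(self.hash_count):
                digest = hashlib.md5(f"{i}:{url}".encode("utf-8")).hexdigest()
                yield int(digest, 16) % self.bit_size

        def seen(self, url):
            """Return True if the URL was (probably) seen before; otherwise mark it and return False."""
            offsets = list(self._offsets(url))
            hit = all(self.conn.getbit(self.key, off) for off in offsets)
            if not hit:
                for off in offsets:
                    self.conn.setbit(self.key, off, 1)
            return hit

    # Usage sketch: only push a seed back into the queue if it has not been seen.
    # bf = RedisBloomFilter(redis.StrictRedis())
    # if not bf.seen("http://example.com/page"):
    #     ...  # push the new seed into the seed queue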



Conclusion:

Writing a framework that supports distributed, multiple crawlers is genuinely difficult to implement. Beyond getting the main functions working, you also have to keep the code strictly standardized and meet the crawler's requirements for efficiency and robustness. You will grow a great deal after finishing something like this.

That is all I will share today; discussion is welcome.



Please indicate the source when reprinting, thank you. (Original link: http://blog.csdn.net/bone_ace/article/details/55000416)
