Python Crawler Tutorial -34- Distributed Crawler Introduction
Installing Scrapy_redis
- 1. Open "cmd"
- 2. Activate the Anaconda environment you want to use
- 3. Install with pip
- 4. Run it to verify the installation (a command sketch follows this list)
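For reference, here is a minimal sketch of those steps as commands. The environment name `base` (Anaconda's default) is just a placeholder; activate whichever environment you actually use:

```
:: in cmd: activate the Anaconda environment, install with pip, then verify
conda activate base
pip install scrapy-redis
python -c "import scrapy_redis"
```

If the last command prints no error, scrapy_redis is importable from that environment.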
Structure of a distributed crawler
Master-Slave distributed crawler
- In the so-called master-slave mode, one server acts as the master and several servers act as slaves. The master is responsible for managing all connected slaves, including managing slave connections, task scheduling and distribution, and result collection and aggregation. Each slave only needs to pick up tasks from the master, complete them on its own, and finally upload the results; it does not need to communicate with other slaves. This approach is simple and easy to manage, but the master clearly has to communicate with every slave, so the master's performance is the bottleneck that limits the whole system. This is especially true when a large number of slaves are connected, which can easily degrade the performance of the entire crawler system.
- Master-Slave distributed crawler structure diagram:
This is the classic master-slave distributed crawler structure diagram: the control node (ControlNode) is the master mentioned above, and the crawler node (SpiderNode) is the slave mentioned above. The following diagram shows the execution flow of the crawler node (slave).
- Control node execution flow diagram:
- These two diagrams explain the whole crawler framework very clearly; let's walk through it here:
- 1. The whole distributed crawler system consists of two parts: the master control node and the slave crawler nodes
- 2. The master control node is responsible for: task scheduling for the slave nodes, URL management, and result processing
- 3. Each slave crawler node is responsible for: crawler scheduling on that node, HTML download management, and HTML content parsing management
- 4. System workflow: the master distributes tasks (URLs that have not been crawled yet); a slave picks up a task (a URL) from the master's URL manager and completes it on its own, downloading the HTML for that URL and parsing its content; the parsed content contains the target data and new URLs. When the work is done, the slave submits the result (target data + new URLs) to the master's data extraction process (part of the master's result processing), which performs two tasks: sending the new URLs to the URL manager and sending the target data to the data storage process. The master's URL manager validates each received URL (has it been crawled already?) and files it accordingly (URLs not yet crawled go into the not-crawled collection, URLs already crawled go into the crawled collection). The slaves then keep looping: fetch a task from the URL manager, execute it, submit the result... A minimal sketch of this loop follows below.
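To make the loop concrete, here is a minimal Python sketch of the master's URL manager / result processing and a slave's task loop, built on redis-py. This is not the tutorial's own code: the Redis key names and the `fetch_html` / `parse_html` callables are hypothetical placeholders, and a single "seen" set stands in for the not-crawled/crawled bookkeeping described above.

```python
# A minimal sketch of the master-slave loop described above (illustration only).
# Redis key names and the fetch_html / parse_html callables are hypothetical.
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# ---------- master side: URL manager + result processing ----------
def submit_url(url):
    """URL manager: enqueue a URL only if it has never been seen (de-duplication)."""
    if r.sadd("urls:seen", url):          # SADD returns 1 only for brand-new members
        r.lpush("urls:pending", url)      # not-crawled collection / task queue

def process_one_result():
    """Result processing: store target data, feed new URLs back to the URL manager."""
    _, raw = r.brpop("results")           # wait for a result submitted by a slave
    result = json.loads(raw)
    r.lpush("data:store", json.dumps(result["data"]))   # hand target data to storage
    for url in result["new_urls"]:
        submit_url(url)                   # new URLs go through validation / dedup

# ---------- slave side: pick up a task, do it, report back ----------
def slave_loop(fetch_html, parse_html):
    """Take a URL task, download and parse it, then submit the result to the master."""
    while True:
        _, url = r.brpop("urls:pending")                  # pick up a task (URL)
        html = fetch_html(url.decode())                   # HTML download
        data, new_urls = parse_html(html)                 # HTML content parsing
        r.lpush("results", json.dumps({"data": data, "new_urls": new_urls}))
```

In practice, scrapy_redis provides this shared URL queue and duplicate filter out of the box through its Redis-backed scheduler and dupefilter, so each Scrapy process only needs to point at the same Redis instance.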
- That's all for this article, bye
- This note may not be reprinted by any person or organization