Python Crawler Tutorial - 34 - Introduction to Distributed Crawlers

Source: Internet
Author: User
Tags: redis, tutorial

  • Distributed crawlers are widely used in practice; this article gives a brief introduction to them

    What is a distributed crawler
  • A distributed crawler is a crawler installed on more than one computer, with the emphasis on joint, coordinated crawling. A single-machine crawler, by contrast, runs on only one computer
  • Search engines are essentially crawlers: they are responsible for crawling content from web sites all over the world and showing you the relevant content when you search for a keyword. They are giant spiders, and the amount of content they crawl is beyond imagination; it can no longer be handled by a single crawler, so they work distributed. If one server is not enough, use 1000. These servers are spread around the world and cooperate to complete the crawling work together; that is a distributed crawler
  • Problems with a single-machine crawler:
    • The efficiency of a single computer is limited
    • I/O throughput is limited by network transfer rates
  • Problems with multiple crawlers:
    • Multiple crawlers must share data
      • For example, when one crawler has already downloaded certain content from a site, the other crawlers need to know about it to avoid crawling it again; this and similar problems require data sharing
    • The machines can be physically separated, i.e. distributed
  • Requirements for multiple crawlers:
    • They need to share a queue
    • Deduplication, so that one crawler does not re-crawl URLs that other crawlers have already crawled
  • Understanding distributed crawlers:
    • Suppose tens of thousands of URLs need to be crawled, and there are more than 100 crawlers located in different cities across the country
    • The URLs are divided among the crawlers, but the crawlers differ in efficiency, hence the shared queue and shared data: the more efficient crawlers take on more tasks instead of waiting idle for the inefficient ones
  • Redis
    • Redis is completely open source and free, complies with the BSD license, and is a high-performance key-value database
    • It is an in-memory database: data is stored in memory
    • Data can also be persisted to the hard disk
    • It can deduplicate (for example, with its set type), as shown in the sketch after this list
    • Redis can be understood as a collection of dict, set, and list structures
    • Content saved in Redis can be given a lifetime (an expiration time)
    • Redis Tutorial: Redis Tutorial - Rookie Tutorial
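    • For illustration, a minimal sketch of Redis as the shared queue and deduplication store described above, using the redis-py client; the key names crawler:queue and crawler:seen are hypothetical:

        import redis

        # connect to a local Redis server (assumed to run on the default port)
        r = redis.Redis(host='localhost', port=6379, db=0)

        url = 'http://example.com/page1'

        # deduplication: SADD returns 1 only if the URL is not in the set yet
        if r.sadd('crawler:seen', url):
            # shared queue: push the new URL for any crawler to pick up
            r.lpush('crawler:queue', url)

        # a crawler (possibly on another machine) pops the next task;
        # BRPOP blocks for up to 5 seconds and returns (key, value) or None
        task = r.brpop('crawler:queue', timeout=5)
        if task:
            print('crawl:', task[1].decode())

        # lifetime: let the dedup set expire after one day
        r.expire('crawler:seen', 86400)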
  • Databases for saving the crawled content (a small storage sketch follows this list):
    • MongoDB: works in memory, with data persisted to the hard disk
    • MySQL
    • and so on
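    • For illustration, a minimal sketch of saving a parsed item into MongoDB with pymongo; the database and collection names (and the item fields) are hypothetical:

        from pymongo import MongoClient

        # connect to a local MongoDB server (assumed to run on the default port)
        client = MongoClient('localhost', 27017)
        collection = client['crawler']['items']

        # save one parsed result
        collection.insert_one({
            'url': 'http://example.com/page1',
            'title': 'Example page',
        })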
Installing Scrapy_redis
    • 1. Open "cmd"
    • 2. Activate the Anaconda environment you use
    • 3. Install it with pip
    • 4. Run it (see the command sketch below)
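    • The steps above as commands, a minimal sketch; the environment name spider is hypothetical:

        # 1-2. open cmd and activate the Anaconda environment
        conda activate spider
        # 3. install scrapy-redis with pip
        pip install scrapy-redis
        # 4. check that the package can be imported
        python -c "import scrapy_redis"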

      Structure of a distributed crawler
Master-Slave distributed crawler
    • In the so-called master-slave mode, one server acts as the master and several servers act as slaves. The master is responsible for managing all connected slaves, including managing slave connections, scheduling and distributing tasks, and collecting and aggregating results. Each slave only needs to pick up tasks from the master, complete them on its own, and finally upload the results; it does not need to communicate with other slaves. This approach is simple and easy to manage, but the master obviously has to communicate with every slave, so the master's performance is the bottleneck that limits the whole system; in particular, when a large number of slaves are connected, the performance of the whole crawler system easily degrades
    • Master-Slave distributed crawler structure diagram:

      This is the classic master-slave distributed crawler structure diagram: the control node (ControlNode) is the master mentioned above, and the crawler node (SpiderNode) is the slave mentioned above. The next diagram shows the execution flow of the crawler node (slave)
    • Control node execution flow diagram:
    • These two diagrams explain the whole crawler framework very clearly; let us summarize them here:
    • 1. The whole distributed crawler system consists of two parts: the master control node and the slave crawler nodes
    • 2. The master control node is responsible for: task scheduling for the slave nodes, URL management, and result processing
    • 3. A slave crawler node is responsible for: scheduling the crawler on that node, managing the HTML downloads, and managing the parsing of the HTML content
    • 4. System workflow: the master distributes tasks (URLs that have not yet been crawled); a slave picks up a task (URL) from the master's URL manager and completes it: downloading the HTML and parsing its content, where the parsed content contains both the target data and new URLs; when the work is done, the slave submits the result (target data + new URLs) to the master's data extraction process (part of the master's result processing), which completes two tasks: putting the new URLs into the URL manager and sending the target data to the data storage process; the master's URL manager validates the received URLs (have they been crawled?) and processes them (uncrawled URLs go into the to-crawl collection, URLs that have been crawled move into the crawled collection); the slaves then loop: fetch a task from the URL manager, execute it, submit the result ... A scrapy-redis sketch of the slave side follows this list
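    • For illustration, a minimal scrapy-redis sketch of a slave crawler that takes its tasks from a shared Redis queue; the spider name, the redis_key, and the Redis address are hypothetical:

        from scrapy_redis.spiders import RedisSpider

        class SlaveSpider(RedisSpider):
            """A slave node: pulls its start URLs from a shared Redis list."""
            name = 'slave'
            # the master (or any producer) pushes URLs to this Redis key
            redis_key = 'slave:start_urls'

            def parse(self, response):
                # target data: here simply the page title (made-up example)
                yield {'url': response.url,
                       'title': response.css('title::text').get()}
                # new URLs go back through the shared scheduler/dupefilter
                for href in response.css('a::attr(href)').getall():
                    yield response.follow(href, callback=self.parse)

        # settings.py, the parts relevant to scrapy-redis:
        # SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # shared queue
        # DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # shared dedup
        # SCHEDULER_PERSIST = True                  # keep the queue between runs
        # REDIS_URL = 'redis://localhost:6379'

      Each slave can then be started with "scrapy crawl slave", and tasks can be seeded from the master with, for example: redis-cli lpush slave:start_urls http://example.com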
    • That is all for this article. Bye!
    • No person or organization may reprint this note without permission
