This article mainly covers the principles behind distributed crawling with Scrapy, the Python crawler framework. Take a look; I hope it helps everyone learning Python crawling.
A review of the Scrapy workflow
Scrapy's single-machine architecture
In this architecture, the crawl queue is maintained locally by the Scheduler on a single machine. The key to having multiple servers crawl the same data together is to share that crawl queue.
Distributed architecture
Modifying the single-machine diagram above gives the distributed architecture.
The important question here is: how is the queue maintained?
We generally maintain it with Redis. Redis is a non-relational database that stores data as key-value pairs with a flexible structure.
Redis is also an in-memory data structure store, so it is fast, and it provides a variety of structures such as lists and sets, which makes maintaining the queue convenient.
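To make this concrete, here is a minimal sketch (not scrapy-redis itself) of maintaining a shared request queue with a Redis list through the redis-py client; the host, port, and key name are placeholders:

import redis

# Connect to the Redis server shared by all slaves (host/port are placeholders)
r = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)

QUEUE_KEY = 'crawler:requests'  # hypothetical key for the shared crawl queue

def push_request(url):
    # Any machine can push a new request onto the shared queue
    r.lpush(QUEUE_KEY, url)

def pop_request():
    # Any machine can take the next request to crawl; returns None when the queue is empty
    data = r.rpop(QUEUE_KEY)
    return data.decode('utf-8') if data else None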
How is deduplication done?
With the help of a Redis set. Redis provides the set data structure, and the fingerprint of every request is stored in a Redis set.
Before a request is added to the request queue, its fingerprint is checked against the set. If the fingerprint already exists, the request is not added; if it does not, the request is added to the queue and its fingerprint is added to the set.
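A simplified sketch of that check with redis-py follows; the real scrapy-redis RFPDupeFilter computes fingerprints from the full request, but here a URL hash stands in for it and the key names are placeholders:

import hashlib
import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)

FINGERPRINT_KEY = 'crawler:dupefilter'  # hypothetical set of seen fingerprints
QUEUE_KEY = 'crawler:requests'          # hypothetical shared request queue

def enqueue_if_new(url):
    # A URL hash stands in for the request fingerprint
    fp = hashlib.sha1(url.encode('utf-8')).hexdigest()
    # SADD returns 1 if the fingerprint was newly added, 0 if it was already present
    if r.sadd(FINGERPRINT_KEY, fp):
        r.lpush(QUEUE_KEY, url)
        return True
    return False  # duplicate request, not queued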
How are interruptions handled? If a slave goes down for some reason, how do we recover?
This is handled by a check at start-up: when each slave's Scrapy process starts, it checks whether the Redis request queue is currently empty. If it is not empty, the next request is taken from the queue and crawled. If it is empty, the crawl starts over, and the first machine to run the crawl pushes the initial requests onto the queue.
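That start-up behaviour could be sketched like this (a simplification of what scrapy-redis actually does; the start URL and key name are assumptions):

import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)

QUEUE_KEY = 'crawler:requests'    # hypothetical shared request queue
START_URL = 'http://example.com'  # hypothetical start page

def next_request():
    if r.llen(QUEUE_KEY) == 0:
        # The queue is empty, so the crawl is (re)starting: seed it with the start request.
        # The fingerprint set keeps later slaves from adding it a second time.
        r.lpush(QUEUE_KEY, START_URL)
    # Take the next request from the shared queue and crawl it
    return r.rpop(QUEUE_KEY)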
How is the above architecture implemented?
There is a library, scrapy-redis, that already provides these features for us.
scrapy-redis rewrites Scrapy's Scheduler, queue, and other components; with it we can easily build a distributed Scrapy architecture.
Building a distributed crawler
The prerequisite is to install the scrapy_redis module: pip install scrapy_redis
The spider code here is the user-information crawler from an earlier post.
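The spider itself does not need to change for scrapy-redis; only the settings do (next section). As a stand-in for that earlier crawler, a minimal spider skeleton might look like this (the name, start URL, and parsing logic are placeholders, not the original code):

import scrapy

class UserSpider(scrapy.Spider):
    # Placeholder spider standing in for the user-information crawler
    name = 'user'
    start_urls = ['http://example.com/users']  # placeholder start page

    def parse(self, response):
        # Placeholder parsing: yield one item per link found on the page
        for href in response.css('a::attr(href)').getall():
            yield {'url': response.urljoin(href)}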
Modify the configuration in settings.py:
Replace the Scrapy scheduler:
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
Add the deduplication class:
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
Add the item pipeline. If this configuration is added, every scraped item is also stored in the Redis database, so it is usually omitted:
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300
}
Configure the shared crawl queue; this needs the Redis connection information. Here user:pass are the Redis username and password (they can be omitted if authentication is not enabled) and host is the address of the Redis server.
REDIS_URL = 'redis://user:pass@host:9001'
When set to True, the dupefilter set and the requests queue in Redis are not emptied, so the fingerprints and request queue persist in the Redis database. The default is False, and it is normally not set.
SCHEDULER_PERSIST = True
Set whether to empty the crawl queue when the crawler restarts. When True, the fingerprints and request queue are erased every time the crawler restarts, so this is generally set to False.
SCHEDULER_FLUSH_ON_START = True
Running distributed
Copy the modified code to each of the servers. The database can either be installed on every server or shared between them; here I connect them all to the same MongoDB database. And of course, don't forget: every server must have Scrapy, scrapy_redis, and pymongo installed.
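Since the items end up in MongoDB, each server can use a pipeline along these lines (a sketch with pymongo; the database name, collection name, and the MONGO_URI/MONGO_DATABASE settings keys are assumptions, not the original code):

import pymongo

class MongoPipeline(object):
    # Hypothetical pipeline that writes every scraped item to the shared MongoDB
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI and MONGO_DATABASE are assumed settings names
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'crawler'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db['users'].insert_one(dict(item))
        return item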
This way, after each crawler starts, you can see the following in the Redis database: the dupefilter key holds the fingerprint set, and the requests key holds the request queue.
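You can check this from Python as well (a sketch with redis-py; 'user' stands in for your spider's name, since the scrapy-redis keys default to <spidername>:dupefilter and <spidername>:requests):

import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)

# Number of request fingerprints seen so far (stored as a Redis set)
print(r.scard('user:dupefilter'))

# Number of pending requests in the shared queue
# (by default scrapy-redis keeps them in a sorted set)
print(r.zcard('user:requests'))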
Source: Blog Park
Python Scrapy distributed crawling principles, explained in detail