Python Scrapy Distributed Crawling: Principles Explained

Tags: python, scrapy

This article covers the principles behind distributed crawling with Python's Scrapy framework. Take a look; I hope it is helpful to everyone learning Python crawlers.

A review of Scrapy's workflow

Scrapy's stand-alone architecture

Scrapy's default architecture is single-machine: the crawl queue is maintained locally by the Scheduler. The key to having multiple servers crawl the same data cooperatively is to share that crawl queue.

Distributed architecture

We now modify this single-machine architecture to make it distributed.

The key question here: what do we use to maintain the shared queue?

Generally we use Redis to maintain it. Redis is a non-relational database that stores data in key-value form with flexible structures.

Moreover, Redis is an in-memory data structure store, so it is very fast, and it provides a variety of structures such as lists and sets, which makes queue maintenance convenient.
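As a minimal sketch of the idea (not scrapy_redis's actual implementation; the key name and JSON serialization here are my own assumptions), a shared FIFO queue can be built on a Redis list:

import json
import redis

# Toy shared crawl queue backed by a Redis list.
# Every slave pushes to and pops from the same 'crawl:requests' key,
# so the queue is naturally shared across machines.
r = redis.StrictRedis(host='localhost', port=6379)

def push_request(url, priority=0):
    r.lpush('crawl:requests', json.dumps({'url': url, 'priority': priority}))

def pop_request():
    data = r.rpop('crawl:requests')  # lpush + rpop = FIFO order
    return json.loads(data) if data else None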

How do we deduplicate requests?

With the help of a Redis set. Redis provides a set data structure, and we store the fingerprint of each request in a Redis set.

Before adding a request to the request queue, we check whether its fingerprint is already in the set. If it already exists, the request is not added; if it does not, the request is added to the queue and its fingerprint is added to the set.
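A minimal sketch of that check, assuming a SHA1 over the method and URL as the fingerprint (scrapy_redis actually reuses Scrapy's request fingerprinting, which also covers the request body):

import hashlib
import redis

r = redis.StrictRedis(host='localhost', port=6379)

def fingerprint(method, url):
    # Simplified fingerprint; the real one also hashes the request body.
    return hashlib.sha1(('%s %s' % (method, url)).encode('utf-8')).hexdigest()

def try_enqueue(method, url):
    # SADD returns 0 if the member already existed, i.e. a duplicate request.
    if r.sadd('crawl:dupefilter', fingerprint(method, url)) == 0:
        return False
    r.lpush('crawl:requests', url)
    return True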

How do we prevent interruptions from losing work? If a slave node goes down for some reason, how do we recover?

With a check at startup: whenever a slave's Scrapy process starts, it looks at whether the current Redis request queue is empty.

If it is not empty, the next request is fetched from the queue and crawling continues. If it is empty, the crawl starts over, and the first node in the cluster pushes the start requests onto the queue.
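A sketch of that startup check, keeping the key names from the sketches above (the seed URL is a placeholder):

import redis

r = redis.StrictRedis(host='localhost', port=6379)

START_URLS = ['http://example.com/users']  # placeholder seed URLs

def on_spider_start():
    if r.llen('crawl:requests') > 0:
        return  # an earlier run left requests behind: resume from the queue
    # Queue is empty: this node seeds the crawl with the start requests.
    for url in START_URLS:
        r.lpush('crawl:requests', url)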

How do we implement the architecture above?

The Scrapy-Redis library provides these features for us.

Scrapy-Redis rewrites Scrapy's Scheduler, queues, and other components; with it, we can easily build a distributed Scrapy architecture.

Building a distributed crawler

The prerequisite is to install the scrapy_redis module: pip install scrapy-redis

The spider code here reuses an earlier spider that crawls user information.

Modify the configuration in settings.py:

Replace Scrapy's scheduler:

SCHEDULER = "scrapy_redis.scheduler.Scheduler"

Add the deduplication class:

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

Add the pipeline:

If you add this configuration, every crawled item is also stored in the Redis database; that is generally not wanted, so this setting is usually left out.

ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300
}

Configure the shared crawl queue; here you need the Redis connection information.

Here user:pass stands for the username and password; if there are none, this part can be left empty.

REDIS_URL = 'redis://user:pass@host:9001'

Set this to True so that the dupefilter set and request queue in Redis are not emptied when the crawler stops:

This keeps the fingerprints and the request queue persisted in the Redis database. The setting defaults to False and is often left unset.

SCHEDULER_PERSIST = True

Set whether to flush the crawl queue when the crawler restarts:

If set to True, the fingerprints and request queue are erased every time the crawler restarts, so this is generally set to False.

SCHEDULER_FLUSH_ON_START = True
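With these settings in place, the spider itself only needs to read its start URLs from Redis. Below is a minimal sketch using scrapy_redis's RedisSpider; the spider name, redis_key, and parse logic are placeholders rather than the original user-information spider:

from scrapy_redis.spiders import RedisSpider

class UserSpider(RedisSpider):
    name = 'user'
    # The spider blocks on this Redis key, waiting for start URLs.
    redis_key = 'user:start_urls'

    def parse(self, response):
        # Placeholder parsing logic.
        yield {'url': response.url, 'title': response.css('title::text').get()}

Any node can then kick off the crawl by pushing a URL to user:start_urls (for example with redis-cli lpush user:start_urls http://example.com), and every node shares the resulting queue.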

Distributing the crawler

Copy the modified code above to each of the servers. As for the database, you can install one on each server, or they can all share one; here I connect every server to the same MongoDB database. Also, do not forget:

every server must have Scrapy, scrapy_redis, and pymongo installed.
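For the shared MongoDB, a minimal pipeline sketch with pymongo might look like this; the MONGO_URI setting and the database/collection names are my own placeholders, not taken from the original spider:

import pymongo

class MongoPipeline(object):
    def __init__(self, mongo_uri):
        self.mongo_uri = mongo_uri

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection string from settings.py (placeholder name).
        return cls(mongo_uri=crawler.settings.get('MONGO_URI'))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client['crawl']  # placeholder database name

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db['users'].insert_one(dict(item))  # placeholder collection
        return item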

This way, after each crawler starts, you can see the following keys in the Redis database: dupefilter is the fingerprint set, and requests is the request queue.
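You can verify this from Python as well; the key names below follow scrapy_redis's <spider>:dupefilter / <spider>:requests convention, with 'user' standing in for your spider's name:

import redis

r = redis.StrictRedis(host='localhost', port=6379)

# The dupefilter key is a Redis set of request fingerprints.
print('fingerprints:', r.scard('user:dupefilter'))
# By default the request queue is a Redis sorted set (a priority queue).
print('pending requests:', r.zcard('user:requests'))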
