This article mainly covers the principles behind distributed crawling with Scrapy, the Python crawler framework. Take a look; I hope it helps everyone learning Python crawling.
A review of the Scrapy workflow
Scrapy's single-machine architecture
In this architecture, the crawl queue is maintained locally by the Scheduler on a single machine. The key to having multiple servers crawl the same data together is to share that crawl queue.
Distributed architecture
Modifying the single-machine diagram above gives the distributed architecture.
The important question here is: how is the queue maintained?
We generally maintain it with Redis. Redis is a non-relational database that stores data as key-value pairs with a flexible structure.
Redis is also an in-memory data structure store, so it is fast, and it provides a variety of structures such as lists and sets, which makes maintaining the queue convenient.
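To make this concrete, here is a minimal sketch (not scrapy-redis itself) of maintaining a shared request queue with a Redis list through the redis-py client; the host, port, and key name are placeholders:

import redis

# Connect to the Redis server shared by all slaves (host/port are placeholders)
r = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)

QUEUE_KEY = 'crawler:requests'  # hypothetical key for the shared crawl queue

def push_request(url):
    # Any machine can push a new request onto the shared queue
    r.lpush(QUEUE_KEY, url)

def pop_request():
    # Any machine can take the next request to crawl; returns None when the queue is empty
    data = r.rpop(QUEUE_KEY)
    return data.decode('utf-8') if data else None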
How is deduplication done?
With the help of a Redis set. Redis provides the set data structure, and the fingerprint of every request is stored in a Redis set.
Before a request is added to the request queue, its fingerprint is checked against the set. If the fingerprint already exists, the request is not added; if it does not, the request is added to the queue and its fingerprint is added to the set.
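A simplified sketch of that check with redis-py follows; the real scrapy-redis RFPDupeFilter computes fingerprints from the full request, but here a URL hash stands in for it and the key names are placeholders:

import hashlib
import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)

FINGERPRINT_KEY = 'crawler:dupefilter'  # hypothetical set of seen fingerprints
QUEUE_KEY = 'crawler:requests'          # hypothetical shared request queue

def enqueue_if_new(url):
    # A URL hash stands in for the request fingerprint
    fp = hashlib.sha1(url.encode('utf-8')).hexdigest()
    # SADD returns 1 if the fingerprint was newly added, 0 if it was already present
    if r.sadd(FINGERPRINT_KEY, fp):
        r.lpush(QUEUE_KEY, url)
        return True
    return False  # duplicate request, not queued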
How are interruptions handled? If a slave goes down for some reason, how do we recover?
This is handled by a check at start-up: when each slave's Scrapy process starts, it checks whether the Redis request queue is currently empty. If it is not empty, the next request is taken from the queue and crawled. If it is empty, the crawl starts over, and the first machine to run the crawl pushes the initial requests onto the queue.
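That start-up behaviour could be sketched like this (a simplification of what scrapy-redis actually does; the start URL and key name are assumptions):

import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)

QUEUE_KEY = 'crawler:requests'    # hypothetical shared request queue
START_URL = 'http://example.com'  # hypothetical start page

def next_request():
    if r.llen(QUEUE_KEY) == 0:
        # The queue is empty, so the crawl is (re)starting: seed it with the start request.
        # The fingerprint set keeps later slaves from adding it a second time.
        r.lpush(QUEUE_KEY, START_URL)
    # Take the next request from the shared queue and crawl it
    return r.rpop(QUEUE_KEY)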
How is the above architecture implemented?
There is a library, scrapy-redis, that already provides these features for us.
scrapy-redis rewrites Scrapy's Scheduler, queue, and other components; with it we can easily build a distributed Scrapy architecture.
Building a distributed crawler
The prerequisite is to install the scrapy_redis module: pip install scrapy_redis
The spider code here is the user-information crawler from an earlier post.
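The spider itself does not need to change for scrapy-redis; only the settings do (next section). As a stand-in for that earlier crawler, a minimal spider skeleton might look like this (the name, start URL, and parsing logic are placeholders, not the original code):

import scrapy

class UserSpider(scrapy.Spider):
    # Placeholder spider standing in for the user-information crawler
    name = 'user'
    start_urls = ['http://example.com/users']  # placeholder start page

    def parse(self, response):
        # Placeholder parsing: yield one item per link found on the page
        for href in response.css('a::attr(href)').getall():
            yield {'url': response.urljoin(href)}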
Modify the configuration in settings.py:
Replace the Scrapy scheduler:
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
Add the deduplication class:
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
Add the item pipeline. If this configuration is added, every scraped item is also stored in the Redis database, so it is usually omitted:
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300
}
Configure the shared crawl queue; this needs the Redis connection information. Here user:pass are the Redis username and password (they can be omitted if authentication is not enabled) and host is the address of the Redis server.
REDIS_URL = 'redis://user:pass@host:9001'
When set to True, the dupefilter set and the requests queue in Redis are not emptied, so the fingerprints and request queue persist in the Redis database. The default is False, and it is normally not set.
SCHEDULER_PERSIST = True
Set whether to empty the crawl queue when the crawler restarts. When True, the fingerprints and request queue are erased every time the crawler restarts, so this is generally set to False.
SCHEDULER_FLUSH_ON_START = True
Running distributed
Copy the modified code to each of the servers. The database can either be installed on every server or shared between them; here I connect them all to the same MongoDB database. And of course, don't forget: every server must have Scrapy, scrapy_redis, and pymongo installed.
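Since the items end up in MongoDB, each server can use a pipeline along these lines (a sketch with pymongo; the database name, collection name, and the MONGO_URI/MONGO_DATABASE settings keys are assumptions, not the original code):

import pymongo

class MongoPipeline(object):
    # Hypothetical pipeline that writes every scraped item to the shared MongoDB
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI and MONGO_DATABASE are assumed settings names
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'crawler'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db['users'].insert_one(dict(item))
        return item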
This way, after each crawler starts, you can see the following in the Redis database: the dupefilter key holds the fingerprint set, and the requests key holds the request queue.
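You can check this from Python as well (a sketch with redis-py; 'user' stands in for your spider's name, since the scrapy-redis keys default to <spidername>:dupefilter and <spidername>:requests):

import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)

# Number of request fingerprints seen so far (stored as a Redis set)
print(r.scard('user:dupefilter'))

# Number of pending requests in the shared queue
# (by default scrapy-redis keeps them in a sorted set)
print(r.zcard('user:requests'))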
Source: Blog Park
Python Scrapy distributed crawling principles, explained in detail