Crawler strategies used by Scrapy-redis:
The slave side takes tasks from the Master, fetches the data, and generates new tasks while fetching, which it submits back to the Master. The Master side is responsible for receiving the tasks submitted by the slaves and adding them to the queue of URLs to be crawled.
When Scrapy-redis runs distributed, it creates two keys in Redis: <spider.name>:requests, which serves as the task queue, and <spider.name>:dupefilter, which is used for deduplication.
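As an illustration, the sketch below uses redis-py to list the keys that scrapy-redis creates for a spider; the spider name "test" and the local Redis address are assumptions made only for this example.

import redis

# Connect to the Redis instance shared by the Master and the slave crawlers
# (host/port are assumptions for this example).
r = redis.Redis(host="localhost", port=6379, db=0)

# List the keys scrapy-redis created for a spider named "test":
# expect test:requests (task queue) and test:dupefilter (seen fingerprints).
for key in r.keys("test:*"):
    print(key, r.type(key))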
Queue Task Assignment
When the slave side parses out a new URL task, it first checks whether the task already exists in <spider.name>:dupefilter; if not, it records it and pushes it to the <spider.name>:requests task queue, where it is saved in the following format:
{'body': '', '_encoding': 'utf-8', 'cookies': {}, 'meta': {}, 'headers': {}, 'url': u'http://www.test.com/test', 'dont_filter': False, 'priority': 0, 'callback': 'parse_item', 'method': 'GET', 'errback': None}
<spider.name>:requests serves as the task queue; once a task is assigned, it is popped from the queue and thereby removed.
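A minimal sketch of this push/pop cycle, assuming a plain Redis list and pickle serialization (scrapy-redis wraps this logic in its own scheduler and queue classes; the key name and the dict below simply mirror the format shown above):

import pickle
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
QUEUE_KEY = "test:requests"  # <spider.name>:requests

# A task in the format shown above.
task = {
    'body': '', '_encoding': 'utf-8', 'cookies': {}, 'meta': {},
    'headers': {}, 'url': u'http://www.test.com/test',
    'dont_filter': False, 'priority': 0, 'callback': 'parse_item',
    'method': 'GET', 'errback': None,
}

# Enqueue a newly submitted request (Master side).
r.lpush(QUEUE_KEY, pickle.dumps(task))

# Pop the next request to crawl (slave side); the pop removes it from the queue.
raw = r.rpop(QUEUE_KEY)
if raw is not None:
    print(pickle.loads(raw)['url'])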
Deduplication
The SHA1 fingerprint of each assigned task is saved to <spider.name>:dupefilter, in the form:
1babbfde30b0030559373ebe3e2a7a0955527e5f
Each time a task is added to the queue, it is first checked against <spider.name>:dupefilter to determine whether it already exists.
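A sketch of that check, using a simplified SHA1 fingerprint over the method and URL together with Redis's SADD, whose return value tells us whether the fingerprint was already recorded (scrapy-redis itself relies on Scrapy's request fingerprinting; the key name is the dupefilter key described above):

import hashlib
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
DUPEFILTER_KEY = "test:dupefilter"  # <spider.name>:dupefilter

def fingerprint(method, url):
    # Simplified fingerprint: SHA1 over the method and URL only.
    return hashlib.sha1(f"{method} {url}".encode("utf-8")).hexdigest()

fp = fingerprint("GET", "http://www.test.com/test")

# SADD returns 1 if the fingerprint is new, 0 if it already existed.
if r.sadd(DUPEFILTER_KEY, fp):
    print("new task: push it to test:requests")
else:
    print("duplicate task: skip it")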
Resuming an interrupted crawl
When the crawler stops, the tasks remaining in the <spider.name>:requests queue are preserved in Redis, so the crawl continues from where it left off the next time the crawler starts.
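This behaviour is governed by the scheduler persistence setting; a sketch of the relevant settings.py entries for scrapy-redis (the Redis URL is an assumption for the example):

# settings.py (sketch)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True  # keep <spider.name>:requests and <spider.name>:dupefilter when the spider closes
REDIS_URL = "redis://localhost:6379"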
Scrapy-redis in detail