Scrapy-redis in detail

Source: Internet
Author: User

Crawling strategy used by Scrapy-redis:

The slave side takes tasks from the master and crawls the data. While crawling, it generates new tasks and submits them back to the master. The master is responsible for deduplicating the tasks submitted by the slaves and adding them to the queue of requests waiting to be crawled.

When running in distributed mode, Scrapy-redis creates two keys in Redis: (spider.name):requests, which is used as the task queue, and (spider.name):dupefilter, which is used for deduplication.
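The naming of the two keys can be illustrated with a short sketch. The helper `redis_keys` and the spider name `myspider` are hypothetical and exist only for this illustration; they are not part of Scrapy-redis.

```python
def redis_keys(spider_name: str) -> dict:
    """Build the two per-spider Redis key names described above.

    Hypothetical helper for illustration only.
    """
    return {
        "requests": f"{spider_name}:requests",      # task queue
        "dupefilter": f"{spider_name}:dupefilter",  # fingerprints already seen
    }


keys = redis_keys("myspider")
print(keys["requests"])
print(keys["dupefilter"])
```

Because the keys are derived from the spider name, every slave running the same spider shares one queue and one deduplication set.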

Queue Task Assignment

When a slave parses out a new URL task, it first checks whether the task already exists in key:dupefilter. If it does not, the task is recorded there and pushed to the key:requests task queue, where it is saved in the following format:

{'body': '', '_encoding': 'utf-8', 'cookies': {}, 'meta': {}, 'headers': {}, 'url': u'http://www.test.com/test', 'dont_filter': False, 'priority': 0, 'callback': 'parse_item', 'method': 'GET', 'errback': None}

key:requests serves as the task assignment queue: once a task has been assigned, it is popped (removed) from the queue.
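The check, push, and pop flow above can be sketched without a running Redis server. This is a minimal sketch under stated assumptions, not the Scrapy-redis implementation: `InMemoryStore` is a hypothetical stand-in that mimics the Redis commands SADD, LPUSH, and RPOP; the queue is treated as a simple FIFO list; tasks are serialized as JSON; and a task is identified by its URL for brevity rather than by a SHA1 value.

```python
import json


class InMemoryStore:
    """Hypothetical stand-in for Redis, just enough for this sketch."""

    def __init__(self):
        self.sets = {}
        self.lists = {}

    def sadd(self, key, member):
        # Like Redis SADD: returns 1 if the member was added, 0 if present.
        members = self.sets.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1

    def lpush(self, key, value):
        # Like Redis LPUSH: insert at the head of the list.
        self.lists.setdefault(key, []).insert(0, value)

    def rpop(self, key):
        # Like Redis RPOP: remove and return the tail, or None if empty.
        items = self.lists.get(key)
        return items.pop() if items else None


def submit(store, spider, request):
    """Queue a task unless it was already recorded. Returns True if queued."""
    if store.sadd(f"{spider}:dupefilter", request["url"]) == 0:
        return False  # already seen, drop it
    store.lpush(f"{spider}:requests", json.dumps(request))
    return True


def next_task(store, spider):
    """Pop the oldest task; popping removes it from the queue."""
    raw = store.rpop(f"{spider}:requests")
    return json.loads(raw) if raw is not None else None


store = InMemoryStore()
task = {"url": "http://www.test.com/test", "callback": "parse_item", "method": "GET"}
print(submit(store, "myspider", task))   # first submission is queued
print(submit(store, "myspider", task))   # duplicate is rejected
print(next_task(store, "myspider"))      # task is handed out and removed
print(next_task(store, "myspider"))      # queue is now empty
```

With a real Redis client the same three commands would run against the shared server, which is what lets several slaves draw from one queue.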

Deduplication

The SHA1 value of each assigned task is saved to key:dupefilter, in the form of:

1babbfde30b0030559373ebe3e2a7a0955527e5f

Each time a task is added to the queue, check whether it already exists in key:dupefilter.

Resuming an Interrupted Crawl

When the crawler stops, the tasks in the key:requests queue remain, and crawling continues from them the next time it starts.
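The SHA1-based check can be sketched as follows. This is a simplified illustration: the `fingerprint` function hashes only the method, URL, and body, which is an assumption and not necessarily the exact set of fields or normalization that Scrapy uses, and a plain Python set stands in for key:dupefilter.

```python
import hashlib


def fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """Return a 40-character SHA1 hex digest identifying a request.

    Simplified for illustration; the hashed fields are an assumption.
    """
    h = hashlib.sha1()
    h.update(method.upper().encode("utf-8"))
    h.update(url.encode("utf-8"))
    h.update(body)
    return h.hexdigest()


def is_new(seen: set, method: str, url: str, body: bytes = b"") -> bool:
    """Record the fingerprint and report whether it was unseen before."""
    fp = fingerprint(method, url, body)
    if fp in seen:
        return False
    seen.add(fp)
    return True


seen = set()  # stand-in for the key:dupefilter set in Redis
print(is_new(seen, "GET", "http://www.test.com/test"))  # first time
print(is_new(seen, "GET", "http://www.test.com/test"))  # duplicate
```

Storing a fixed-length digest instead of the full request keeps the deduplication set compact regardless of URL or body size.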
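A Scrapy project is typically pointed at the shared Redis queue through its settings.py. The fragment below is a sketch that assumes the setting names documented by the scrapy-redis project; the Redis address is a placeholder, and the names should be checked against the installed version.

```python
# settings.py (fragment)

# Schedule requests through Redis instead of the in-process scheduler.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests through the shared Redis dupefilter key.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the requests queue and dupefilter in Redis when the spider closes,
# so that a later run can continue from the remaining tasks.
SCHEDULER_PERSIST = True

# Placeholder address; replace with the actual Redis server.
REDIS_URL = "redis://localhost:6379"
```

With persistence enabled, restarting the spider should pick up whatever is still in the requests queue rather than starting from scratch.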
