Scrapy-redis empty-run problem: automatically close the crawler after the requests under redis_key have been exhausted

Source: Internet
Author: User



First, let's look at why the crawler waits instead of being closed:



1. Scrapy's internal signal system triggers the spider_idle signal when the crawler exhausts the requests in its internal queue.

2. The crawler's signal manager, on receiving the spider_idle signal, calls the handlers registered for that signal.

3. After all handlers for the signal have been called, the engine shuts down the spider if it is still idle.
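The three steps above can be sketched with a pure-Python stand-in. This is only an illustration of the signal flow, not Scrapy's actual API; all names here are hypothetical:

```python
class DontCloseSpider(Exception):
    """A handler raises this to veto closing the spider (mirrors Scrapy's exception)."""


class MiniSignalManager:
    """Toy stand-in for Scrapy's signal manager."""

    def __init__(self):
        self.handlers = []

    def connect(self, handler):
        # register a handler for the spider_idle signal
        self.handlers.append(handler)

    def fire_spider_idle(self):
        """Call every handler; shut the spider down unless one raises DontCloseSpider."""
        keep_alive = False
        for handler in self.handlers:
            try:
                handler()
            except DontCloseSpider:
                keep_alive = True
        return 'idle' if keep_alive else 'closed'


manager = MiniSignalManager()
print(manager.fire_spider_idle())  # no handler vetoes, so the spider is closed


def keep_spider_alive():
    raise DontCloseSpider


manager.connect(keep_spider_alive)
print(manager.fire_spider_idle())  # a handler raised DontCloseSpider: spider stays idle
```

This mirrors step 3: the spider is closed only when no handler objects.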



The solution in scrapy-redis is to register a spider_idle() method with the signal manager for the spider_idle signal; when spider_idle is triggered, the signal manager calls the spider_idle() method on the crawler. The scrapy-redis source is as follows:


def spider_idle(self):
    """Schedules a request if available, otherwise waits."""
    # XXX: Handle a sentinel to close the spider.
    # schedule_next_requests() generates new requests from redis
    self.schedule_next_requests()
    # raise DontCloseSpider so the crawler stays alive
    raise DontCloseSpider
Solution ideas:
    • From the above, we know that the key to shutting down the crawler is the spider_idle signal.
    • The spider_idle signal is triggered only when the crawler's queue is empty, roughly every 5s.
    • We can use the same mechanism: register our own spider_idle() handler with the signal manager for the spider_idle signal.
    • In that spider_idle() handler, write the end condition for the crawler; here, the condition is that the redis_key in Redis has stayed empty for long enough.
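The shutdown condition from these ideas can be expressed independently of Scrapy. Below, a hypothetical helper tracks the timestamps of consecutive idle signals, and a boolean stands in for the Redis `EXISTS redis_key` check:

```python
import time


def on_idle_signal(idle_list, idle_number, key_exists):
    """Process one spider_idle signal.

    Returns (new_idle_list, should_close). idle_list holds the timestamps of
    consecutive idle signals; key_exists is the result of checking whether
    redis_key still exists in Redis.
    """
    idle_list = idle_list + [time.time()]
    if len(idle_list) > 2 and key_exists:
        # requests are still pending: reset the consecutive-idle window
        return [idle_list[-1]], False
    if len(idle_list) > idle_number:
        # idle for more than idle_number consecutive signals: close the spider
        return idle_list, True
    return idle_list, False


# simulate: 4 idle signals with an empty key and a threshold of 3
idle_list, close = [], False
for _ in range(4):
    idle_list, close = on_idle_signal(idle_list, idle_number=3, key_exists=False)
print(close)  # the threshold is exceeded on the 4th signal
```

If the key still exists, the window resets to the latest timestamp, so only *consecutive* idle signals count toward the threshold.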


In the directory containing the settings.py file, create a file named extensions.py and put the following code in it:


# -*- coding: utf-8 -*-

# Define here the extension that closes an idle RedisSpider
import logging
import time

from scrapy import signals
from scrapy.exceptions import NotConfigured

logger = logging.getLogger(__name__)


class RedisSpiderSmartIdleClosedExensions(object):

    def __init__(self, idle_number, crawler):
        self.crawler = crawler
        self.idle_number = idle_number
        self.idle_list = []
        self.idle_count = 0

    @classmethod
    def from_crawler(cls, crawler):
        # first check if the extension should be enabled and raise
        # NotConfigured otherwise
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured

        # this extension only supports RedisSpider
        if 'redis_key' not in crawler.spidercls.__dict__.keys():
            raise NotConfigured('Only supports RedisSpider')

        # get the idle-count threshold from settings
        idle_number = crawler.settings.getint('IDLE_NUMBER', 360)

        # instantiate the extension object
        ext = cls(idle_number, crawler)

        # connect the extension object to signals
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)

        # return the extension object
        return ext

    def spider_opened(self, spider):
        logger.info("opened redis spider %s, continuous idle limit: %d",
                    spider.name, self.idle_number)

    def spider_closed(self, spider):
        logger.info("closed spider %s, idle count %d, continuous idle count %d",
                    spider.name, self.idle_count, len(self.idle_list))

    def spider_idle(self, spider):
        self.idle_count += 1
        self.idle_list.append(time.time())
        idle_list_len = len(self.idle_list)

        # Check whether the key still exists in redis; once all requests
        # under the key have been consumed, the key no longer exists.
        if idle_list_len > 2 and spider.server.exists(spider.redis_key):
            # requests are still pending: reset the consecutive-idle window
            self.idle_list = [self.idle_list[-1]]

        elif idle_list_len > self.idle_number:
            logger.info('continuous idle count exceeded %d times; '
                        'idle shutdown condition met, closing the spider. '
                        'idle start time: %s, close spider time: %s',
                        self.idle_number, self.idle_list[0], self.idle_list[-1])
            # execute the close-spider operation
            self.crawler.engine.close_spider(spider, 'closespider_pagecount')





Add the following configuration to settings.py, replacing lianjia_ershoufang with your own project name.


MYEXT_ENABLED = True  # enable the extension
IDLE_NUMBER = 360     # idle-duration threshold; one unit is 5s

# Activate the extension in the EXTENSIONS setting
EXTENSIONS = {
    'lianjia_ershoufang.extensions.RedisSpiderSmartIdleClosedExensions': 500,
}

MYEXT_ENABLED: whether to enable the extension; set to True to enable, False to disable.
IDLE_NUMBER: the continuous-idle threshold for closing the crawler. If the number of consecutive idle signals exceeds IDLE_NUMBER, the crawler is closed. The default is 360, i.e. 30 minutes, since the idle signal fires about 12 times per minute.
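The relationship between IDLE_NUMBER and wall-clock time can be checked with a quick calculation (the 5-second signal interval is the figure quoted above):

```python
IDLE_NUMBER = 360
SIGNAL_INTERVAL_S = 5  # spider_idle fires roughly every 5 seconds

max_idle_minutes = IDLE_NUMBER * SIGNAL_INTERVAL_S / 60
print(max_idle_minutes)  # 360 signals * 5s = 1800s, i.e. 30.0 minutes
```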






