Scrapy-redis empty-run problem: automatically close the crawler after the redis_key links have been crawled

First, let's understand why the crawler keeps waiting instead of being closed:

1. Scrapy's internal signaling system triggers the spider_idle signal when the spider exhausts the requests in its internal queue.

2. The crawler's signal manager receives the spider_idle signal and calls every handler registered for that signal.

3. When all handlers of the signal have been called, the engine shuts down the spider if it is still idle.
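Before looking at how scrapy-redis hooks into this mechanism, here is a minimal sketch (the class name is assumed for illustration, not from the original article) of how any Scrapy extension can register a handler for spider_idle with the signal manager and veto the shutdown by raising DontCloseSpider:

from scrapy import signals
from scrapy.exceptions import DontCloseSpider

class KeepAliveExtension(object):
    """Sketch: keeps a spider alive by vetoing every idle shutdown."""

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # Register our handler with the signal manager for spider_idle
        crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)
        return ext

    def spider_idle(self, spider):
        # Raising DontCloseSpider tells the engine not to shut the spider down
        raise DontCloseSpider

As long as at least one registered handler raises DontCloseSpider, step 3 never completes and the spider stays alive.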

This is exactly what scrapy-redis does: it registers a spider_idle() method with the signal manager for the spider_idle signal, and when spider_idle fires, the signal manager calls spider_idle() on the crawler. The scrapy-redis source is as follows:

def spider_idle(self):
    """Schedules a request if available, otherwise waits."""
    # XXX: Handle a sentinel to close the spider.
    self.schedule_next_requests()   # generate new requests from Redis
    raise DontCloseSpider           # raise DontCloseSpider so the engine keeps the crawler alive
Solution ideas:
    • From the analysis above, we know that the key to shutting down the crawler is the spider_idle signal.
    • The spider_idle signal is triggered only when the crawler queue is empty, and it fires at 5-second intervals.
    • We can therefore use the same mechanism and register our own spider_idle() method with the signal manager for the spider_idle signal.
    • In that spider_idle() method we write the termination condition: here, the crawler is closed once the redis_key no longer exists in Redis (see the sketch after this list).
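The termination condition relies on a Redis behavior: scrapy-redis pops start requests from the structure stored at redis_key, and Redis deletes a list key once its last element is popped. A quick redis-py sketch (the key name is made up for illustration) shows the check the extension will rely on:

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# Push one start URL the way scrapy-redis expects, then drain the queue.
r.rpush('myspider:start_urls', 'https://example.com')
print(r.exists('myspider:start_urls'))   # 1 -> key present, requests remain
r.lpop('myspider:start_urls')
print(r.exists('myspider:start_urls'))   # 0 -> queue drained, safe to close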

In the directory that contains the settings.py file, create a file named extensions.py and put the following code in it:

# -*- coding: utf-8 -*-
# Define here the extension for closing an idle RedisSpider

import logging
import time

from scrapy import signals
from scrapy.exceptions import NotConfigured

logger = logging.getLogger(__name__)


class RedisSpiderSmartIdleClosedExensions(object):

    def __init__(self, idle_number, crawler):
        self.crawler = crawler
        self.idle_number = idle_number
        self.idle_list = []
        self.idle_count = 0

    @classmethod
    def from_crawler(cls, crawler):
        # First check if the extension should be enabled and raise
        # NotConfigured otherwise
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured

        # This extension only supports RedisSpider
        if 'redis_key' not in crawler.spidercls.__dict__.keys():
            raise NotConfigured('Only supports RedisSpider')

        # Get the idle threshold from settings (number of 5s idle units)
        idle_number = crawler.settings.getint('IDLE_NUMBER', 360)

        # Instantiate the extension object
        ext = cls(idle_number, crawler)

        # Connect the extension object to signals
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)

        # Return the extension object
        return ext

    def spider_opened(self, spider):
        logger.info("opened spider %s, redis spider idle, continuous idle limit: %d",
                    spider.name, self.idle_number)

    def spider_closed(self, spider):
        logger.info("closed spider %s, idle count %d, continuous idle count %d",
                    spider.name, self.idle_count, len(self.idle_list))

    def spider_idle(self, spider):
        self.idle_count += 1                 # total number of idle signals seen
        self.idle_list.append(time.time())   # timestamps of consecutive idle signals
        idle_list_len = len(self.idle_list)

        # Check whether redis_key still exists in Redis; once all requests
        # have been consumed, Redis deletes the key, so EXISTS returns False
        if idle_list_len > 2 and spider.server.exists(spider.redis_key):
            # Requests are still pending: reset the consecutive-idle window
            self.idle_list = [self.idle_list[-1]]
        elif idle_list_len > self.idle_number:
            logger.info('\n continuous idle count exceeded {} times,'
                        '\n idle shutdown condition met, closing the spider.'
                        '\n idle start time: {}, close spider time: {}'.format(
                            self.idle_number, self.idle_list[0], self.idle_list[-1]))
            # Perform the close-spider operation
            self.crawler.engine.close_spider(spider, 'closespider_pagecount')
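Note that the extension refuses to load unless the spider class defines a redis_key attribute. For reference, a minimal RedisSpider it can attach to might look like this (the spider name and key are illustrative, not from the original article):

from scrapy_redis.spiders import RedisSpider

class ErshoufangSpider(RedisSpider):
    name = 'ershoufang'
    redis_key = 'ershoufang:start_urls'   # the key the extension polls with EXISTS

    def parse(self, response):
        # Minimal parse: yield the page title for demonstration
        yield {'url': response.url, 'title': response.css('title::text').get()}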

  

Add the following configuration to settings.py, replacing lianjia_ershoufang with your own project name:

MYEXT_ENABLED = True    # enable the extension
IDLE_NUMBER = 360       # idle duration in 5s units: 12 units per minute, so 360 units = 30 minutes

# Activate the extension in the EXTENSIONS setting
EXTENSIONS = {
    'lianjia_ershoufang.extensions.RedisSpiderSmartIdleClosedExensions': 500,
}
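For a quick test you can lower the threshold from the command line instead of editing settings.py, e.g. scrapy crawl ershoufang -s IDLE_NUMBER=12 (spider name assumed), which closes the spider after roughly one minute of continuous idling; Scrapy's -s flag overrides any setting for a single run.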
