First, let's understand why the crawler waits instead of shutting down:
1. When the crawler exhausts the requests in its internal queue, Scrapy's internal signal system triggers the spider_idle signal.
2. The crawler's signal manager receives the spider_idle signal and calls every handler registered for it.
3. After all handlers for the signal have been called, the engine shuts down the spider if it is still idle (the sketch below demonstrates this mechanism).
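To make this concrete, here is a minimal sketch of an extension that connects to spider_idle and vetoes shutdown by raising DontCloseSpider. The class name KeepAliveExtension is hypothetical, not part of Scrapy or scrapy-redis:

import logging

from scrapy import signals
from scrapy.exceptions import DontCloseSpider

logger = logging.getLogger(__name__)

class KeepAliveExtension(object):
    """Hypothetical extension: keeps the spider alive on every idle beat."""

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)
        return ext

    def spider_idle(self, spider):
        logger.info("spider %s is idle", spider.name)
        # Raising DontCloseSpider tells the engine not to shut the spider
        # down; if no handler raises it, the engine closes the idle spider.
        raise DontCloseSpider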
The scrapy-redis solution is to register a spider_idle() method with the signal manager for the spider_idle signal; when spider_idle fires, the signal manager calls that spider_idle() method on the spider. The scrapy-redis source code is as follows:
def spider_idle(self):
    """Schedules a request if available, otherwise waits."""
    # XXX: Handle a sentinel to close the spider.
    self.schedule_next_requests()  # generate new requests from redis
    raise DontCloseSpider  # raise DontCloseSpider so the spider stays alive
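For context, schedule_next_requests() is what pulls new requests out of Redis and feeds them to the engine. Paraphrased from the scrapy-redis source (the exact code may differ between versions):

def schedule_next_requests(self):
    """Schedules a request if available."""
    # next_requests() reads a batch of requests from the redis_key queue
    for req in self.next_requests():
        self.crawler.engine.crawl(req, spider=self)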
Solution Ideas:
- From the analysis above, we know that the spider_idle signal is the key to the crawler shutting down.
- The spider_idle signal is triggered only when the crawler's queue is empty, and it fires every 5 s while the spider stays idle.
- So we can use the same technique: register our own spider_idle() method with the signal manager for the spider_idle signal.
- In that spider_idle() method we write the shutdown condition: here, we check whether the spider's redis_key still exists in Redis, and close the crawler once the key has stayed empty long enough.
In the same directory as the settings.py file, create a file named extensions.py and put the following code in it:
# -*- coding: utf-8 -*-

# Define here the extension that closes an idle RedisSpider

import logging
import time

from scrapy import signals
from scrapy.exceptions import NotConfigured

logger = logging.getLogger(__name__)


class RedisSpiderSmartIdleClosedExensions(object):

    def __init__(self, idle_number, crawler):
        self.crawler = crawler
        self.idle_number = idle_number
        self.idle_list = []
        self.idle_count = 0

    @classmethod
    def from_crawler(cls, crawler):
        # first check if the extension should be enabled and raise
        # NotConfigured otherwise
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured

        # this extension only supports RedisSpider
        if 'redis_key' not in crawler.spidercls.__dict__:
            raise NotConfigured('Only supports RedisSpider')

        # get the idle limit from settings
        idle_number = crawler.settings.getint('IDLE_NUMBER', 360)

        # instantiate the extension object
        ext = cls(idle_number, crawler)

        # connect the extension object to signals
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)

        # return the extension object
        return ext

    def spider_opened(self, spider):
        logger.info("opened spider %s, continuous idle limit: %d",
                    spider.name, self.idle_number)

    def spider_closed(self, spider):
        logger.info("closed spider %s, idle count %d, continuous idle count %d",
                    spider.name, self.idle_count, len(self.idle_list))

    def spider_idle(self, spider):
        self.idle_count += 1
        self.idle_list.append(time.time())
        idle_list_len = len(self.idle_list)

        # Check whether the redis_key still exists in Redis. Once all requests
        # have been consumed the key is deleted, so if it still exists the
        # spider is only temporarily idle and the idle streak is reset.
        if idle_list_len > 2 and spider.server.exists(spider.redis_key):
            self.idle_list = [self.idle_list[-1]]

        elif idle_list_len > self.idle_number:
            logger.info('\n continuous idle count exceeded %d times'
                        '\n idle shutdown condition met, closing the spider'
                        '\n idle start time: %s, close spider time: %s',
                        self.idle_number, self.idle_list[0], self.idle_list[-1])
            # execute the close-spider operation
            self.crawler.engine.close_spider(spider, 'closespider_pagecount')
Then add the following configuration to settings.py, replacing lianjia_ershoufang with your own project name:
MYEXT_ENABLED = True  # enable the extension
IDLE_NUMBER = 360     # idle limit in 5 s units; 360 units = 30 minutes

# Activate the extension in the EXTENSIONS setting
EXTENSIONS = {
    'lianjia_ershoufang.extensions.RedisSpiderSmartIdleClosedExensions': 500,
}
MYEXT_ENABLED: whether to enable the extension; set it to True to enable, False to disable.
IDLE_NUMBER: the continuous idle limit for closing the crawler; if the number of consecutive idle signals exceeds IDLE_NUMBER, the crawler is closed. The default is 360, i.e. 30 minutes: the idle signal fires every 5 s, so there are 12 idle units per minute.
This solves the scrapy-redis idle-run problem: once the requests under redis_key have been consumed, the crawler closes automatically.
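To test it, seed the spider's Redis queue and let it run dry. A quick sketch, assuming a local Redis instance and a redis_key of lianjia:start_urls (both are placeholders; adjust them to your setup):

import redis

# Push one start URL into the list the RedisSpider reads from. After the
# spider consumes it and stays idle past IDLE_NUMBER signals, the
# extension closes the spider automatically.
r = redis.StrictRedis(host='localhost', port=6379, db=0)
r.lpush('lianjia:start_urls', 'https://example.com/ershoufang/')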