Scrapy-redis implements two kinds of distribution: distributed crawling and distributed item processing, which are realized by the scheduler module and the pipelines module respectively.
I. Introduction of each component of Scrapy-redis
(I) connection.py
Responsible for instantiating Redis connections based on the configuration in the settings. It is called by the dupefilter and the scheduler; in short, every part that needs to access Redis goes through this module.
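As an illustration only, a minimal sketch of such a connection factory might look like the following, assuming the redis-py client and the REDIS_HOST / REDIS_PORT setting names used later in this article; the real connection.py may differ in its details:

import redis

# Hypothetical sketch of a connection factory; the default values are assumptions.
DEFAULT_HOST = 'localhost'
DEFAULT_PORT = 6379

def from_settings(settings):
    """Build a redis-py client from the Scrapy settings object."""
    host = settings.get('REDIS_HOST', DEFAULT_HOST)
    port = settings.get('REDIS_PORT', DEFAULT_PORT)
    return redis.Redis(host=host, port=port)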
(II) dupefilter.py
Responsible for request deduplication, which is implemented quite cleverly using a Redis set data structure. Note, however, that the scheduler does not use the dupefilter key maintained by this module to do request scheduling; it uses the queues implemented in the queue.py module instead.
When a request is not a duplicate, it is stored in the queue and popped when it is scheduled.
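The core idea can be sketched as follows (a simplified illustration, not the actual scrapy-redis code): the fingerprint of each request is added to a Redis set with SADD, and a return value of 0 means the fingerprint was already present, i.e. the request is a duplicate. The class name and key name below are assumptions.

import redis

class DupeFilterSketch(object):
    """Simplified duplicate filter backed by a Redis set (illustration only)."""

    def __init__(self, server, key='dupefilter'):
        self.server = server  # a redis.Redis instance
        self.key = key        # name of the Redis set holding request fingerprints

    def request_seen(self, fingerprint):
        # SADD returns 1 if the member is new, 0 if it already existed
        added = self.server.sadd(self.key, fingerprint)
        return added == 0

In practice the fingerprint would come from Scrapy's request_fingerprint() utility rather than being passed in by hand.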
(III) queue.py
Its function is as described in II, but there are three ways the queue is implemented:
the FIFO SpiderQueue, the SpiderPriorityQueue, and the LIFO SpiderStack. The default is the second one, which explains the behaviour analyzed in the previous article (link).
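As a rough sketch of the priority-queue idea (not the actual scrapy-redis implementation), a Redis sorted set can hold serialized requests with the priority as the score. The key name and serialization here are assumptions, and the pop below is not atomic, whereas the real code pops inside a Redis pipeline:

import redis

class PriorityQueueSketch(object):
    """Simplified priority queue on a Redis sorted set (illustration only)."""

    def __init__(self, server, key='requests'):
        self.server = server
        self.key = key

    def push(self, data, priority=0):
        # redis-py >= 3.0 signature: zadd(name, {member: score});
        # lower scores are popped first in this sketch
        self.server.zadd(self.key, {data: priority})

    def pop(self):
        # take the member with the lowest score, then remove it (not atomic)
        members = self.server.zrange(self.key, 0, 0)
        if members:
            self.server.zrem(self.key, members[0])
            return members[0]
        return None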
(IV) pipelines.py
This is where distributed item processing is achieved: it stores the item in Redis so that it can be processed elsewhere.
In addition, note that although this is also a pipeline, its implementation differs from the one in the article (link:), because here the pipeline needs to read the configuration, which is done through the from_crawler() function.
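A hedged sketch of that pattern is shown below: from_crawler() gives the pipeline access to crawler.settings, and the item is serialized to JSON and pushed onto a per-spider Redis list. The key naming and serialization are assumptions, not necessarily exactly what scrapy_redis.pipelines.RedisPipeline does:

import json
import redis

class RedisPipelineSketch(object):
    """Simplified Redis item pipeline (illustration only)."""

    def __init__(self, host, port):
        self.server = redis.Redis(host=host, port=port)

    @classmethod
    def from_crawler(cls, crawler):
        # from_crawler() is needed because the pipeline must read the
        # project settings to know where Redis lives
        return cls(
            host=crawler.settings.get('REDIS_HOST', 'localhost'),
            port=crawler.settings.getint('REDIS_PORT', 6379),
        )

    def process_item(self, item, spider):
        # push the serialized item onto a per-spider list for later workers
        self.server.rpush('%s:items' % spider.name, json.dumps(dict(item)))
        return item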
(V) scheduler.py
This extension replaces Scrapy's built-in scheduler (as specified by the SCHEDULER setting in the settings file), and it is this extension that implements distributed scheduling of the crawler. The data structures it uses for the queue come from the Redis database, i.e. the queues implemented in queue.py.
As stated at the beginning, Scrapy-redis provides two kinds of distribution: distributed crawling, realized by the scheduler module, and distributed item processing, realized by the pipelines module. The other modules serve as auxiliary functional modules.
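A hedged sketch of how such a scheduler replacement hangs together, assuming a Redis-backed queue object with push()/pop() and a duplicate filter with request_seen(); the method names follow Scrapy's scheduler interface, but this is not the actual scrapy_redis.scheduler.Scheduler code:

class SchedulerSketch(object):
    """Simplified Redis-backed scheduler (illustration only)."""

    def __init__(self, queue, dupefilter):
        self.queue = queue            # shared Redis-backed request queue
        self.dupefilter = dupefilter  # shared Redis-backed duplicate filter

    def enqueue_request(self, request):
        # drop duplicates; otherwise push onto the queue shared by all workers
        if self.dupefilter.request_seen(request):
            return False
        self.queue.push(request)
        return True

    def next_request(self):
        # any crawler process may pop the next request from Redis
        return self.queue.pop()

Because the queue and the dupefilter both live in Redis, any number of crawler processes can share the same scheduling state.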
(VI) spider.py
The spider designed here reads the URLs to crawl from Redis and then performs the crawl. If more URLs are returned during the crawl, it continues until all requests are complete, then reads more URLs from Redis and repeats the cycle.
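A hedged sketch of the reading side of that loop, assuming the start URLs sit in a Redis list named <spider.name>:start_urls as in the example below; the real RedisMixin hooks into Scrapy's spider_idle signal rather than polling like this:

import redis

def next_start_url(server, spider_name):
    """Pop the next start URL for a spider, or return None if the list is empty."""
    return server.lpop('%s:start_urls' % spider_name)

# Example usage (assumes a local Redis and a spider named 'ebay_redis'):
# server = redis.Redis()
# url = next_start_url(server, 'ebay_redis')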
II. The relationship between the components
III. Scrapy-redis case analysis
(1) spiders/ebay_redis.py
# Imports assumed for this excerpt (not shown in the original; module paths vary with the Scrapy version)
from scrapy.selector import Selector
from scrapy.utils.response import get_base_url
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor as SLE
from scrapy_redis.spiders import RedisMixin
from example.items import EbayPhoneItem

class EbayCrawler(RedisMixin, CrawlSpider):
    """Spider that reads URLs from Redis queue (mycrawler:start_urls)."""
    name = 'ebay_redis'
    redis_key = 'ebay_redis:start_urls'
    rules = (
        # follow all links
        # Rule(SgmlLinkExtractor(), callback='parse_page', follow=True),
        Rule(SLE(allow=('[^\s]+/itm/',)), callback='parse_item'),
    )

    # This is the key method: its name begins with an underscore,
    # and it establishes the association with Redis
    def _set_crawler(self, crawler):
        CrawlSpider._set_crawler(self, crawler)
        RedisMixin.setup_redis(self)

    # parse SKU pages
    def parse_item(self, response):
        sel = Selector(response)
        base_url = get_base_url(response)
        item = EbayPhoneItem()
        print base_url
        item['baseurl'] = [base_url]
        item['goodsname'] = sel.xpath("//h1[@id='itemTitle']/text()").extract()
        return item
This class inherits from RedisMixin (a class in scrapy_redis/spiders.py) and CrawlSpider. It loads the settings from the configuration file, establishes the association with Redis, and performs crawling and parsing at the same time. The key method is _set_crawler(self, crawler), and the key attribute is redis_key, which defaults to spider.name:start_urls if it is not initialized.
How the _set_crawler() method is invoked:
scrapy/crawl.py: Crawler.crawl() ->
scrapy/crawl.py: Crawler._create_spider() ->
CrawlSpider.from_crawler() ->
scrapy/spiders/Spider.from_crawler() ->
ebay_redis.py: _set_crawler()
(2) setting.py
SPIDER_MODULES = ['example.spiders']
NEWSPIDER_MODULE = 'example.spiders'

ITEM_PIPELINES = {
    'example.pipelines.ExamplePipeline': 300,
    # Enabling RedisPipeline below writes items to the Redis list
    # spider.name:items, for later distributed processing of items
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Do not clean up the Redis queues, allowing crawls to be paused or restarted
SCHEDULER_PERSIST = True

SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'

# Only takes effect when the queue class is SpiderQueue or SpiderStack;
# maximum idle time before the spider is closed
SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Redis connection settings
REDIS_HOST = '123.56.184.53'
REDIS_PORT = 6379
(3) process_items.py:
import json
import redis

def main():
    pool = redis.ConnectionPool(host='123.56.184.53', port=6379, db=0)
    r = redis.Redis(connection_pool=pool)
    while True:
        # process queue as FIFO, change 'blpop' to 'brpop' to process as LIFO
        source, data = r.blpop(["ebay_redis:items"])
        item = json.loads(data)
        try:
            print u"Processing: %(name)s <%(link)s>" % item
        except KeyError:
            print u"Error processing: %r" % item

if __name__ == '__main__':
    main()
This module pops items from the corresponding Redis list and processes them; multiple processes can be run to process the items in a distributed fashion.
(4) The execution process is as follows:
First, start the Redis service on the Redis server:
./redis-server
Next, push the start URL:
./redis-cli lpush ebay_redis:start_urls http://www.ebay.com/sch/Cell-Phones-Smartphones-/9355/i.html
Then run the crawler:
scrapy runspider ebay_redis.py
Multiple crawlers can be started at the same time to crawl the URLs in ebay_redis:start_urls in a distributed fashion; after crawling, the results are stored in the ebay_redis:items list for subsequent processing.
Finally, you can inspect the items queue:
./redis-cli llen ebay_redis:items shows the total number of items in the list.
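The same check can also be done from Python with redis-py, for example with a small convenience script using the host and port from the settings above (illustration only):

import redis

r = redis.Redis(host='123.56.184.53', port=6379)
# number of crawled items still waiting in the list
print r.llen('ebay_redis:items')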