Scrapy distributed crawl through Redis


Scrapy-redis implements two kinds of distribution: distributed crawling and distributed item processing, realized by the scheduler module and the pipelines module respectively.

I. Introduction to each component of Scrapy-redis

(I) connection.py

Responsible for instantiating the Redis connection based on the configuration in settings. It is called by the dupefilter and the scheduler; in short, every part that needs to access Redis uses this module.
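The idea can be sketched in a few lines of redis-py (the helper name and defaults below are illustrative rather than the exact scrapy-redis API; REDIS_HOST and REDIS_PORT match the settings shown later in this article):

import redis

def redis_from_settings(settings):
    # Build a Redis client from the crawler settings; the setting names
    # match those configured in settings.py later in this article.
    return redis.Redis(
        host=settings.get('REDIS_HOST', 'localhost'),
        port=settings.getint('REDIS_PORT', 6379),
    )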

(II) dupefilter.py

Responsible for request deduplication; the implementation is quite clever, using the Redis set data structure. Note, however, that the scheduler does not use the dupefilter key maintained by this module to schedule requests; it uses the queue implemented in the queue.py module instead.

When a request is not a duplicate, it is saved to the queue and popped when it is scheduled.
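The deduplication trick can be illustrated with plain redis-py (a simplified sketch: the key name is made up, and scrapy-redis fingerprints the whole request rather than just the URL):

import hashlib

import redis

server = redis.Redis(host='localhost', port=6379)

def request_seen(url, key='dupefilter:fingerprints'):
    # SADD returns 1 if the member was newly added and 0 if it already
    # existed, so a Redis set answers "seen before?" in a single call.
    fp = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return server.sadd(key, fp) == 0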

(III) queue.py

Its function is as described in (II), but there are three implementations of the queue:

the FIFO SpiderQueue, the SpiderPriorityQueue, and the LIFO SpiderStack. The default is the second one, which is the reason for the situation analyzed in the previous article (link).
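The FIFO and LIFO variants both sit on a Redis list and differ only in which end they pop from, while the priority queue is built on a sorted set. A rough sketch of the list-based behaviour (the key name is illustrative):

import redis

server = redis.Redis(host='localhost', port=6379)
key = 'myspider:requests'   # illustrative key name

# FIFO (SpiderQueue style): push on the left, pop from the right
server.lpush(key, 'request-1')
server.lpush(key, 'request-2')
print(server.rpop(key))     # b'request-1'

# LIFO (SpiderStack style): push and pop from the same end
server.lpush(key, 'request-3')
print(server.lpop(key))     # b'request-3'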

(IV) pipelines.py

This implements distributed item processing: items are stored in Redis so that they can be processed by separate worker processes.

In addition, you can see that although this is also a pipeline, its implementation differs from the one in the article (link:) because it needs to read the configuration, which is why it uses the from_crawler() method.
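A simplified pipeline along these lines might look as follows (the class name and defaults are illustrative; the real RedisPipeline also handles its own serializer and key configuration):

import json

import redis

class RedisItemPipeline(object):

    def __init__(self, host, port):
        self.server = redis.Redis(host=host, port=port)

    @classmethod
    def from_crawler(cls, crawler):
        # from_crawler() exposes the crawler settings, which is why the
        # pipeline is built here rather than in a plain constructor.
        return cls(
            host=crawler.settings.get('REDIS_HOST', 'localhost'),
            port=crawler.settings.getint('REDIS_PORT', 6379),
        )

    def process_item(self, item, spider):
        # Push every item onto the spider.name:items list for later
        # distributed processing (see process_items.py below).
        self.server.rpush('%s:items' % spider.name, json.dumps(dict(item)))
        return item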

(V) scheduler.py

This extension is a replacement for Scrapy's built-in scheduler (as specified by the SCHEDULER setting), and it is through this extension that distributed scheduling of the crawler is implemented. The queue it schedules from is backed by the Redis data structures described above.
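Conceptually, the replacement scheduler simply delegates its enqueue and dequeue operations to one of the Redis-backed queues above, so every crawler process connected to the same Redis server shares a single request queue. A bare-bones skeleton of that idea (not the real class):

class RedisBackedScheduler(object):

    def __init__(self, queue, dupefilter):
        self.queue = queue    # one of the queue.py implementations
        self.df = dupefilter  # the Redis-set based dupefilter

    def enqueue_request(self, request):
        # Requests go into Redis rather than an in-memory structure, so
        # any worker attached to the same server can pick them up.
        if self.df.request_seen(request):
            return False
        self.queue.push(request)
        return True

    def next_request(self):
        return self.queue.pop()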

To summarize: Scrapy-redis implements two kinds of distribution, distributed crawling and distributed item processing, realized by the scheduler module and the pipelines module; the other modules serve as auxiliary functional modules.

(VI) spider.py

The spider designed here reads the URLs to crawl from Redis and then performs the crawl; if more URLs are yielded during the crawl, it keeps going until all requests are finished, then reads further URLs from Redis and repeats the process.
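A simplified stand-in for that loop (the real RedisMixin hooks Scrapy's spider_idle signal to fetch more URLs instead of draining the list up front; the class and key names here are illustrative):

import redis
import scrapy

class RedisFedSpider(scrapy.Spider):
    name = 'redis_fed'
    redis_key = 'redis_fed:start_urls'   # list that feeds the crawl

    def start_requests(self):
        server = redis.Redis(host='localhost', port=6379)
        while True:
            url = server.lpop(self.redis_key)
            if url is None:
                # Queue drained; the real mixin waits for new URLs on the
                # spider_idle signal instead of stopping here.
                break
            yield scrapy.Request(url.decode('utf-8'), callback=self.parse)

    def parse(self, response):
        self.logger.info('crawled %s', response.url)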

II. The relationship between components

III. Scrapy-redis case analysis

(1) spiders/ebay_redis.py

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor as SLE
from scrapy.selector import Selector
from scrapy.utils.response import get_base_url
from scrapy_redis.spiders import RedisMixin

from example.items import EbayPhoneItem   # project items module (import path assumed)


class EbayCrawler(RedisMixin, CrawlSpider):
    """Spider that reads URLs from a Redis queue (ebay_redis:start_urls)."""

    name = 'ebay_redis'
    redis_key = 'ebay_redis:start_urls'

    rules = (
        # follow all links
        # Rule(SgmlLinkExtractor(), callback='parse_page', follow=True),
        Rule(SLE(allow=(r'[^\s]+/itm/',)), callback='parse_item'),
    )

    # This is the key method: its name begins with an underscore, and it
    # establishes the relationship with Redis
    def _set_crawler(self, crawler):
        CrawlSpider._set_crawler(self, crawler)
        RedisMixin.setup_redis(self)

    # parse SKU pages
    def parse_item(self, response):
        sel = Selector(response)
        base_url = get_base_url(response)
        item = EbayPhoneItem()
        print(base_url)
        item['baseUrl'] = [base_url]
        item['goodsName'] = sel.xpath("//h1[@id='itemtitle']/text()").extract()
        return item

This class inherits from RedisMixin (a class in scrapy_redis/spiders.py) and CrawlSpider. It loads the settings from the configuration file, establishes the association with Redis, and also handles crawling and parsing. The key method is _set_crawler(self, crawler), and the key attribute is redis_key, which defaults to spider.name:start_urls if it is not initialized.

The call chain through which the _set_crawler() method is invoked:

scrapy/crawler.py: Crawler.crawl() ->
scrapy/crawler.py: Crawler._create_spider() ->
CrawlSpider.from_crawler() ->
scrapy/spiders: Spider.from_crawler() ->
ebay_redis.py: _set_crawler()

(2) settings.py

SPIDER_MODULES = ['example.spiders']
NEWSPIDER_MODULE = 'example.spiders'

ITEM_PIPELINES = {
    'example.pipelines.ExamplePipeline': 300,
    # Configuring RedisPipeline below writes each item to the Redis list
    # keyed spider.name:items, for later distributed item processing
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Do not clean up the Redis queues, allowing crawls to be paused or restarted
SCHEDULER_PERSIST = True

SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'

# Only takes effect when the queue class is SpiderQueue or SpiderStack;
# it is the maximum idle time before the spider is closed
SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Redis connection settings
REDIS_HOST = '123.56.184.53'
REDIS_PORT = 6379

(3) process_items.py:

import json

import redis


def main():
    pool = redis.ConnectionPool(host='123.56.184.53', port=6379, db=0)
    r = redis.Redis(connection_pool=pool)
    while True:
        # process queue as FIFO; change 'blpop' to 'brpop' to process as LIFO
        source, data = r.blpop(["ebay_redis:items"])
        item = json.loads(data)
        try:
            print(u"Processing: %(name)s <%(link)s>" % item)
        except KeyError:
            print(u"Error processing: %r" % item)


if __name__ == '__main__':
    main()

This module takes items from the corresponding Redis list and processes them; multiple copies of it can be run as separate processes to handle items in a distributed way.

(4) The execution process is as follows:

First, start the Redis service on the Redis server:

./redis-server

Second, push the start URL into Redis:

./redis-cli lpush ebay_redis:start_urls http://www.ebay.com/sch/Cell-Phones-Smartphones-/9355/i.html

Then run the crawler:

scrapy runspider ebay_redis.py

Multiple crawlers can be executed at the same time, crawling the URLs in ebay_redis:start_urls in a distributed way; once the crawl finishes, the results are stored in the ebay_redis:items list for subsequent processing.

Finally, you can check the items queue:

./redis-cli llen ebay_redis:items

which shows the total number of items in the list.
