Transforming Scrapy with scrapy-redis for distributed multi-process crawling


I. Rationale:
Scrapy-redis is a Redis-based distributed component for Scrapy. It uses Redis to store and schedule the requests to be crawled, and to store the scraped items for subsequent processing. scrapy-redis rewrites some of Scrapy's more critical code, turning Scrapy into a distributed crawler that can run concurrently on multiple hosts.
For reference, see the official scrapy-redis GitHub repository.

II. Preparatory work:
1. Install and start Redis; both Windows and Linux users can refer to this article.
2. Install the Scrapy + Python environment.
3. Install the scrapy-redis environment:

$ pip install scrapy-redis
$ pip install redis
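
Before wiring Redis into Scrapy, it is worth a quick sanity check that the server and the redis-py client can talk to each other. A minimal sketch, assuming Redis is running locally on the default port 6379:

import redis

# Connect to the local Redis server (adjust host/port if yours differ).
r = redis.StrictRedis(host='127.0.0.1', port=6379)
print(r.ping())  # Prints True when the connection works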

III. Transforming the Scrapy crawler:
1. First, configure Redis in settings.py (this is already configured in the scrapy-redis example project):

SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
REDIS_URL = None          # Can usually be omitted
REDIS_HOST = '127.0.0.1'  # Can also be changed to localhost, depending on your setup
REDIS_PORT = 6379

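The scrapy-redis example project also enables a Redis-based duplicate filter and an item pipeline that writes scraped items to Redis (which is where the items key described in section IV comes from). A sketch of those two extra settings, assuming the defaults shipped with scrapy-redis:

# Filter duplicate requests through Redis rather than in memory.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Push scraped items into a Redis list for later processing.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}
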
2. Transforming items.py:

from scrapy.item import Item, Field
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst, Join


class ExampleItem(Item):
    name = Field()
    description = Field()
    link = Field()
    crawled = Field()
    spider = Field()
    url = Field()


class ExampleLoader(ItemLoader):
    default_item_class = ExampleItem
    default_input_processor = MapCompose(lambda s: s.strip())
    default_output_processor = TakeFirst()
    description_out = Join()
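
For context, a loader like this is typically used inside a spider callback to populate the item. A minimal, hypothetical sketch (the callback name and CSS selectors are illustrative only, not part of the original example):

def parse_detail(self, response):
    loader = ExampleLoader(response=response)
    loader.add_css('name', 'title::text')        # hypothetical selector
    loader.add_css('description', 'p::text')     # hypothetical selector
    loader.add_value('link', response.url)
    loader.add_value('url', response.url)
    return loader.load_item()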

3. Transforming the spider: start_urls becomes redis_key, so the spider fetches its requests from Redis, and it inherits from RedisSpider instead of scrapy.Spider.

from scrapy_redis.spiders import RedisSpider


class MySpider(RedisSpider):
    """Spider that reads URLs from the Redis queue (myspider:start_urls)."""
    name = 'myspider_redis'
    redis_key = 'myspider:start_urls'

    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        domain = kwargs.pop('domain', '')
        self.allowed_domains = filter(None, domain.split(','))
        super(MySpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        return {
            'name': response.css('title::text').extract_first(),
            'url': response.url,
        }
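
As the __init__ above suggests, the allowed domains can be passed in when the spider is launched, using Scrapy's -a spider-argument flag (the domains below are placeholders):

$ scrapy crawl myspider_redis -a domain=example.com,example.org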

IV. Start the crawler:

$ scrapy crawl myspider_redis

You can start the crawler in more than one terminal (or on more than one host) to observe the effect of multiple processes. After starting, the crawler sits in a waiting state, because the Redis list is still empty at this point. You therefore need to push a start URL into Redis from the redis-cli console, after which you will see all of the crawlers happily get to work.

lpush myspider:start_urls http://www.***.com
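
The same seed URL can also be pushed programmatically with redis-py instead of the redis-cli console; a small sketch, using a placeholder URL in place of the elided address above:

import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379)
# Push a seed URL onto the key the spider listens on (the redis_key defined above).
r.lpush('myspider:start_urls', 'http://www.example.com')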

Three keys can then be seen in the Redis database: the first holds the fingerprints of requests that have already been filtered and downloaded (the duplicate filter), the second holds the scraped items, and the third holds the requests that are still waiting to be processed.
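
To poke at these keys yourself, a small sketch with redis-py (the key names follow scrapy-redis's default <spider name>:<suffix> pattern, so they assume the myspider_redis spider above and the RedisPipeline setting suggested earlier):

import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379)
# List every key scrapy-redis has created for this spider.
print(r.keys('myspider_redis:*'))
# Number of scraped items sitting in the items list.
print(r.llen('myspider_redis:items'))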
