I. Rationale:
Scrapy-redis is a Redis-based distributed component for Scrapy. It uses Redis to store and schedule the requests to be crawled and to store the crawled items for subsequent processing. Scrapy-redis rewrites some of Scrapy's key components, turning Scrapy into a distributed crawler that can run concurrently on multiple hosts.
Reference: the official scrapy-redis GitHub repository.
II. Preparatory work:
1. Install and start Redis; Windows and Linux users can refer to this article.
2. Install the Scrapy + Python environment.
3. Install the scrapy_redis package:
$ pip install scrapy-redis
$ pip install redis
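Once both packages are installed, a quick way to confirm that the redis client library can reach a running Redis server is a one-line ping. This is just a minimal sketch, assuming Redis is running locally on the default port 6379:

import redis

# Connect to the local Redis server and verify connectivity.
r = redis.StrictRedis(host='127.0.0.1', port=6379)
print(r.ping())  # prints True if the server is reachable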
III. Transform the Scrapy crawler:
1. First, configure Redis in settings.py (already configured in the scrapy-redis example project):
"scrapy_redis.scheduler.Scheduler" SCHEDULER_PERSIST = True SCHEDULER_QUEUE_CLASS = ‘scrapy_redis.queue.SpiderPriorityQueue‘ REDIS_URL = None # 一般情况可以省去 REDIS_HOST = ‘127.0.0.1‘ # 也可以根据情况改成 localhost REDIS_PORT = 6379
2. Transformation of items.py:
from scrapy.item import Item, Field
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst, Join


class ExampleItem(Item):
    name = Field()
    description = Field()
    link = Field()
    crawled = Field()
    spider = Field()
    url = Field()


class ExampleLoader(ItemLoader):
    default_item_class = ExampleItem
    default_input_processor = MapCompose(lambda s: s.strip())
    default_output_processor = TakeFirst()
    description_out = Join()
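For context, here is a rough sketch of how ExampleLoader might be used inside a spider's parse callback. The spider name, the CSS selectors and the example.items module path are illustrative assumptions, not part of the original example:

import scrapy
from example.items import ExampleLoader  # assumed module path for the item definitions above

class LoaderDemoSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate the loader.
    name = 'loader_demo'

    def parse(self, response):
        # Populate an ExampleItem through the loader's input/output processors.
        loader = ExampleLoader(response=response)
        loader.add_css('name', 'title::text')
        loader.add_value('link', response.url)
        return loader.load_item()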
3. Transformation of the spider: start_urls becomes redis_key, so requests are fetched from Redis, and the spider inherits from RedisSpider instead of scrapy.Spider.
from scrapy_redis.spiders import RedisSpider


class MySpider(RedisSpider):
    """Spider that reads URLs from a Redis queue (myspider:start_urls)."""
    name = 'myspider_redis'
    redis_key = 'myspider:start_urls'

    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        domain = kwargs.pop('domain', '')
        self.allowed_domains = filter(None, domain.split(','))
        super(MySpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        return {
            'name': response.css('title::text').extract_first(),
            'url': response.url,
        }
IV. Start the crawler:
$ scrapy crawl myspider_redis
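Spider arguments such as the domain keyword handled in the spider's __init__ above can be passed on the command line with -a; the domains here are placeholders, for example:

$ scrapy crawl myspider_redis -a domain=example.com,example.org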
You can start several of these processes to observe the effect of multi-process crawling. After starting them, you will find that the crawlers sit idle, waiting to crawl, because the Redis list is still empty. You therefore need to push a start URL from the Redis console, after which you will see all the crawlers happily get to work.
lpush myspider:start_urls http://www.***.com
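Equivalently, the seed URL can be pushed from Python with redis-py; a sketch, with the URL left as a placeholder as in the original:

import redis

# Push a start URL into the spider's start_urls list; idle spiders will pick it up.
r = redis.StrictRedis(host='127.0.0.1', port=6379)
r.lpush('myspider:start_urls', 'http://www.***.com')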
Three keys can then be seen in the Redis database: the first holds the fingerprints of requests that have already been filtered and downloaded, the second holds the scraped items, and the third is the queue of requests still to be processed (by default these are <spider>:dupefilter, <spider>:items and <spider>:requests).
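These keys can also be inspected from redis-py; a quick sketch, assuming the default key patterns and the spider name myspider_redis from the example above:

import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379)
print(r.keys('myspider_redis:*'))           # dupefilter, items and requests keys
print(r.llen('myspider_redis:items'))       # scraped items are stored in a Redis list
print(r.zcard('myspider_redis:requests'))   # pending requests live in a sorted set (priority queue)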