The construction process of a scrapy-redis distributed crawler (theory)

1. Background
Scrapy is a general-purpose crawler framework, but it does not support distributed crawling on its own. Scrapy-redis provides a set of Redis-based components (components only) that make distributed crawling with Scrapy straightforward.

2. Environment
System: Windows 7
scrapy-redis
redis 3.0.5
python 3.6.1

3. Principle
3.1. Compare the architecture diagrams of Scrapy and Scrapy-redis.
scrapy architecture diagram:
scrapy-redis architecture diagram:
Compared with plain Scrapy there is one additional Redis component, and it mainly affects two places: the scheduler and the data (item) processing.

3.2. The Scrapy-Redis distributed strategy.
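In practice, wiring those two places into Redis is a matter of configuration. The snippet below is a minimal sketch of the relevant settings.py entries, using the scheduler, duplicate filter and item pipeline that ship with scrapy-redis; the host and port values are placeholders.

# settings.py (sketch): route scheduling and item storage through Redis
# Use the scrapy-redis scheduler instead of Scrapy's default one
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Deduplicate requests with the Redis-based fingerprint filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Keep the request queue and dupefilter in Redis between runs
SCHEDULER_PERSIST = True
# Store scraped items in Redis so the Master side can collect them
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}
# Connection to the Master's Redis database (placeholder values)
REDIS_HOST = '172.16.1.99'
REDIS_PORT = 6379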
A distributed crawler needs a Master side (the core server). A Redis database is set up on the Master to store the start_urls, requests and items. The Master is responsible for URL fingerprint deduplication, request distribution and data storage (usually a MongoDB instance is also installed on the Master to take the items out of Redis). Besides the Master there is another role, the Slaver (the side that runs the crawler program), which is responsible for crawling the data and submitting any new Requests generated during crawling to the Master's Redis database. As shown in the figure above, suppose we have four computers: A, B, C and D; any one of them can act as the Master or as a Slaver. The whole process is:
First, the Slaver side takes a task (a Request with its url) from the Master side and crawls the data. While crawling, any new Requests it generates are submitted back to the Master for processing. The Master side has only one Redis database; it is responsible for deduplicating the submitted Requests, assigning tasks, adding the not-yet-processed Requests to the queue to be crawled, and storing the crawled data.
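While a crawl is running, everything the Master holds can be inspected directly in Redis. The snippet below is a small sketch that assumes scrapy-redis's default key names (<spider>:requests as a sorted set, <spider>:dupefilter as a set, <spider>:items as a list) and a spider named amazon (the name used later in this article); the keys will differ if you override those settings.

import redis

# Connect to the Master's Redis (placeholder host)
r = redis.Redis(host='172.16.1.99', port=6379, db=0)

# Pending requests (kept in a sorted set by the default priority queue)
print(r.zcard('amazon:requests'))
# Fingerprints of requests that have already been seen
print(r.scard('amazon:dupefilter'))
# Items pushed by RedisPipeline, waiting to be moved to permanent storage
print(r.llen('amazon:items'))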
This is the strategy that Scrapy-Redis uses by default. Implementing it on our side is very simple, because Scrapy-Redis has already taken care of things like task scheduling for us; we only need to inherit from RedisSpider and specify a redis_key.
The disadvantage is that what Scrapy-Redis schedules is a Request object, which carries a fairly large amount of information (not only the url but also the callback function, headers and so on). This can slow the crawler down and consume a lot of Redis storage space, so maintaining efficiency requires a certain level of hardware.

4. Operation process

Step 1: In the crawler on the Slaver side, specify the redis_key and the address of the Redis database, for example:
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    """Spider that reads urls from a redis queue (amazonCategory:start_urls)."""
    name = 'amazon'
    redis_key = 'amazonCategory:start_urls'

    # Specify the connection parameters of the redis database
    # (these can equally be placed in settings.py)
    custom_settings = {
        'REDIS_HOST': '172.16.1.99',
        'REDIS_PORT': 6379,
    }
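A Slaver machine then starts this crawler with the usual Scrapy command, for example scrapy crawl amazon from inside the project (assuming the layout above). Several Slaver machines can run the same command at the same time; they will all pull requests from the same Redis queue on the Master.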
Step 2: Start the crawler on the Slaver side. The crawler enters a waiting state until the redis_key appears in Redis; the log looks like this:
2017-12-12 15:54:18 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-12-12 15:54:18 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2017-12-12 15:54:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2017-12-12 15:54:18 [myspider_redis] INFO: Reading start URLs from redis key 'myspider:start_urls' (batch size: 110, encoding: utf-8)
2017-12-12 15:54:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'redisClawerSlaver.middlewares.ProxiesMiddleware',
'redisClawerSlaver.middlewares.HeadersMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-12-12 15:54:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-12-12 15:54:18 [scrapy.middleware] INFO: Enabled item pipelines:
['redisClawerSlaver.pipelines.ExamplePipeline',
'scrapy_redis.pipelines.RedisPipeline']
2017-12-12 15:54:18 [scrapy.core.engine] INFO: Spider opened
2017-12-12 15:54:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-12 15:55:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-12 15:56:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Step 3: Run a script that pushes the start_urls under the redis_key in the Redis database, for example:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import redis

# Store the start_url under redis_key in redis so the waiting crawlers can pick it up
redis_Host = "172.16.1.99"
redis_key = 'amazonCategory:start_urls'

# Create the redis database connection
rediscli = redis.Redis(host=redis_Host, port=6379, db=0)

# First clear all the old requests in redis (flushdb wipes the whole database)
flushdbRes = rediscli.flushdb()
print(f"flushdbRes = {flushdbRes}")

# Push the starting url onto the list the spiders are watching
rediscli.lpush(redis_key, "https://www.baidu.com")
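Optionally, you can confirm that the url really landed in Redis before the crawlers grab it; a small sketch reusing the same connection object:

# Sanity check: list what is currently queued under redis_key
# (a spider that is already waiting may pop it almost immediately)
print(rediscli.lrange(redis_key, 0, -1))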
Step 4: The crawler on the Slaver side starts crawling data. The log looks like this:
2017-12-12 15:56:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
parse url = https://www.baidu.com, status = 200, meta = {'download_timeout': 25.0, 'proxy': 'http://proxy.abuyun.com:9020', 'download_slot': 'www.baidu.com', 'download_latency': 0.2569999694824219, 'depth': 7}
parse url = https://www.baidu.com, status = 200, meta = {'download_timeout': 25.0, 'proxy': 'http://proxy.abuyun.com:9020', 'download_slot': 'www.baidu.com', 'download_latency': 0.8840000629425049, 'depth': 8}
2017-12-12 15:57:18 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 2 pages/min), scraped 1 items (at 1 items/min)
Step 5: Run a script that dumps the items from Redis into MongoDB.
For this part of the code, please refer to: The construction process of a scrapy-redis distributed crawler (code chapter).
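As a rough idea of what that script does, here is a minimal sketch only, assuming the default scrapy-redis item key '<spidername>:items', JSON-serialized items, and a MongoDB instance reachable from the Master; the database and collection names are placeholders:

import json
import redis
import pymongo

redis_cli = redis.Redis(host="172.16.1.99", port=6379, db=0)
mongo_cli = pymongo.MongoClient(host="127.0.0.1", port=27017)
collection = mongo_cli["amazon_db"]["items"]  # placeholder database/collection names

while True:
    # Block until an item appears under the key RedisPipeline writes to
    _, data = redis_cli.blpop("amazon:items")
    item = json.loads(data)
    collection.insert_one(item)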
5. Environment installation and code writing
5.1. Scrapy-redis environment installation
pip install scrapy-redis
Code location: the installed scrapy-redis package; you can modify and customize it later.
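If you want to find that location on your own machine, one small sketch (nothing scrapy-redis specific) is:

# Print where the installed scrapy_redis package lives (typically under site-packages)
import scrapy_redis
print(scrapy_redis.__file__)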
5.2. Writing the scrapy-redis distributed crawler
The first step: download the example code from the official repository at https://github.com/rmax/scrapy-redis (git needs to be installed).
git clone https://github.com/rmax/scrapy-redis.git
The repository provides two example spiders, one inheriting from Spider + Redis (RedisSpider) and one from CrawlSpider + Redis (RedisCrawlSpider).
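The RedisSpider flavour was already sketched in step 1 above; the CrawlSpider + Redis flavour looks roughly like the following sketch, modelled on the example project, with the names and the single rule kept purely illustrative:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider

class MyCrawler(RedisCrawlSpider):
    """Crawl-style spider that reads its start urls from a redis queue."""
    name = 'mycrawler_redis'
    redis_key = 'mycrawler:start_urls'

    # Follow every link found and hand the pages to parse_page
    rules = (
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        return {
            'name': response.css('title::text').extract_first(),
            'url': response.url,
        }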
The second step: modify your own crawler based on the example code provided in the repository.