The construction process of a Scrapy-redis distributed crawler (theory chapter)
1. Background
Scrapy is a general-purpose crawler framework, but it does not support distributed crawling by itself. Scrapy-redis is designed to make it easier to implement distributed crawling with Scrapy, and it provides a number of Redis-based components (components only).
2. Environment
System: Win7; scrapy-redis; Redis 3.0.5; Python 3.6.1
3. Principle
3.1. Comparing the Scrapy and Scrapy-redis architecture diagrams
Scrapy framework architecture:
Scrapy-redis framework architecture:
Scrapy-redis adds a Redis component, which mainly affects two places: the first is the scheduler, the second is data processing (the item pipeline).
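In practice, switching an existing Scrapy project over mostly amounts to swapping those two pieces in settings.py. A minimal sketch, using scrapy-redis' documented setting names (the pipeline priority value is arbitrary):

# settings.py (sketch)

# 1) Scheduler: use the Redis-backed scheduler and duplicate filter
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True  # keep the request queue and fingerprints in Redis between runs

# 2) Data processing: push scraped items into Redis through RedisPipeline
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}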
3.2. Scrapy-redis distributed strategy
As a distributed crawler, there must be a master (the core server). On the master side, a Redis database is set up to store the start_urls, requests, and items. The master is responsible for deduplicating URL fingerprints, allocating requests, and storing the data (typically MongoDB is also installed on the master to persist the items held in Redis). Besides the master, there is the slaver role (the crawler executors), which is mainly responsible for running the spiders that crawl the data and for submitting the new requests generated during crawling to the master's Redis database.
As pictured above, suppose we have four computers: A, B, C, and D. Any of them can play either the master or the slaver role. The whole process is:
First, the slaver fetches tasks (requests, URLs) from the master and crawls the data. While the slaver crawls, the new requests it produces are submitted back to the master for processing. There is only one Redis database, on the master side; it is responsible for deduplicating the unprocessed requests and assigning tasks, adding the processed requests to the queue to be crawled, and storing the crawled data.
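With scrapy-redis' default key names, everything the master holds lives in Redis under per-spider keys such as '<spider>:start_urls', '<spider>:requests', '<spider>:dupefilter' and '<spider>:items'. A small illustrative snippet for peeking at them (the host and the spider name 'amazon' are assumptions, and the request queue is assumed to be the default priority queue, i.e. a sorted set):

import redis

rediscli = redis.Redis(host="172.16.1.99", port=6379, db=0)

print(rediscli.llen("amazon:start_urls"))   # start URLs waiting to be picked up
print(rediscli.zcard("amazon:requests"))    # scheduled requests (priority queue)
print(rediscli.scard("amazon:dupefilter"))  # request fingerprints already seen
print(rediscli.llen("amazon:items"))        # items pushed by RedisPipeline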
This is Scrapy-redis' default strategy, and it is very simple to implement, because task scheduling and the other bookkeeping are already handled by Scrapy-redis: we only need to inherit from RedisSpider and specify a redis_key.
The disadvantage is that what Scrapy-redis schedules are Request objects, which carry a large amount of information (not only the URL, but also the callback, headers, and so on). This may slow the crawler down and consume a large amount of Redis storage space, so if you want to keep things efficient, a certain level of hardware is required.
4. Run process
Step one: In the slaver-side spider, specify the redis_key and the address of the Redis database, for example:
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    """Spider that reads URLs from a redis queue (amazoncategory:start_urls)."""
    name = 'amazon'
    redis_key = 'amazoncategory:start_urls'

# In settings.py, specify the connection parameters for the Redis database:
REDIS_HOST = '172.16.1.99'
REDIS_PORT = 6379
Step two: Start the slaver-side crawler. The spider enters a waiting state, waiting for the redis_key to appear in Redis. The log is as follows:
2017-12-12 15:54:18 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-12-12 15:54:18 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2017-12-12 15:54:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-12-12 15:54:18 [myspider_redis] INFO: Reading start URLs from redis key 'myspider:start_urls' (batch size: 110, encoding: utf-8)
2017-12-12 15:54:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'RedisClawerSlaver.middlewares.ProxiesMiddleware',
 'RedisClawerSlaver.middlewares.HeadersMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-12-12 15:54:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-12-12 15:54:18 [scrapy.middleware] INFO: Enabled item pipelines:
['RedisClawerSlaver.pipelines.ExamplePipeline',
 'scrapy_redis.pipelines.RedisPipeline']
2017-12-12 15:54:18 [scrapy.core.engine] INFO: Spider opened
2017-12-12 15:54:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-12 15:55:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-12 15:56:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Step three: Run a script that populates the Redis database with the redis_key (start_urls):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Store the start_url in Redis and let the crawler start crawling.
import redis

redis_host = "172.16.1.99"
redis_key = 'amazoncategory:start_urls'

# Create the Redis database connection
rediscli = redis.Redis(host=redis_host, port=6379, db=0)

# First empty all requests in Redis
flushdbres = rediscli.flushdb()
print(f"flushdbres = {flushdbres}")

# Push the start URL onto the redis_key list
rediscli.lpush(redis_key, "https://www.baidu.com")
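To confirm that the start URL actually landed in Redis, a quick check (not part of the original script) using the same connection could be:

print(rediscli.llen(redis_key))           # expected: 1
print(rediscli.lrange(redis_key, 0, -1))  # expected: [b'https://www.baidu.com']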
Step four: The slaver-side crawler starts crawling data. The log is as follows:
2017-12-12 15:56:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
parse url = https://www.baidu.com, status = 200, meta = {'download_timeout': 25.0, 'proxy': 'http://proxy.abuyun.com:9020', 'download_slot': 'www.baidu.com', 'download_latency': 0.2569999694824219, 'depth': 7}
parse url = https://www.baidu.com, status = 200, meta = {'download_timeout': 25.0, 'proxy': 'http://proxy.abuyun.com:9020', 'download_slot': 'www.baidu.com', 'download_latency': 0.8840000629425049, 'depth': 8}
2017-12-12 15:57:18 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 2 pages/min), scraped 1 items (at 1 items/min)
Step five: Run a script that dumps the items from Redis into MongoDB.
For this part of the code, please refer to: Scrapy-redis distributed crawler construction process (code chapter).
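The full implementation is in that companion article; as a rough sketch of the idea only (the Redis/MongoDB addresses and the database/collection names below are illustrative assumptions), such a script keeps popping the serialized items that scrapy_redis.pipelines.RedisPipeline pushes onto the '<spider>:items' list and inserts them into MongoDB:

import json

import pymongo
import redis

rediscli = redis.Redis(host="172.16.1.99", port=6379, db=0)
mongocli = pymongo.MongoClient(host="localhost", port=27017)
collection = mongocli["amazon"]["items"]

while True:
    # blpop blocks until an item is available on the 'amazon:items' list
    _, data = rediscli.blpop("amazon:items")
    collection.insert_one(json.loads(data))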
5. Environment installation and code writing
5.1. Scrapy-redis Environment Installation
pip install scrapy-redis
Code location: the following can be modified for customization.
5.2. Writing the Scrapy-redis distributed crawler
Step one: Download the official example code from https://github.com/rmax/scrapy-redis (git needs to be installed):
git clone https://github.com/rmax/scrapy-redis.git
The official repository provides two kinds of example spiders, inheriting from Spider + Redis and CrawlSpider + Redis respectively.
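Roughly, the two variants look like this (a sketch modeled on the example project; the spider names, keys and parse logic are illustrative):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider, RedisSpider

# Variant 1: Spider + Redis -- start URLs are read from a Redis list
class MySpiderRedis(RedisSpider):
    name = "myspider_redis"
    redis_key = "myspider:start_urls"

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").extract_first()}

# Variant 2: CrawlSpider + Redis -- same idea, plus link-following rules
class MyCrawlerRedis(RedisCrawlSpider):
    name = "mycrawler_redis"
    redis_key = "mycrawler:start_urls"
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        yield {"url": response.url, "title": response.css("title::text").extract_first()}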
Step two: Modify the example code provided by the official repository to fit your own crawler.