The construction process of Scrapy-redis Distributed Crawler (Theoretical chapter)

1. Background
Scrapy is a general-purpose crawler framework, but it does not support distributed crawling on its own. Scrapy-redis exists to make distributed crawling with Scrapy easier by providing a set of Redis-based components (components only).

2. Environment
System: Win7
scrapy-redis
redis 3.0.5
python 3.6.1

3. Principle
3.1. Comparing the Scrapy and Scrapy-redis architecture diagrams

Scrapy framework architecture:

Scrapy-redis framework architecture:

Scrapy-redis adds a Redis component on top of Scrapy, which mainly affects two places: the scheduler and the data processing (item pipeline).

3.2. The Scrapy-redis distributed strategy

A distributed crawler needs a master (the core server). On the master, a Redis database is set up to store the start_urls, requests, and items. The master is responsible for de-duplicating URL fingerprints, allocating requests, and storing the data (typically MongoDB is also installed on the master to persist the items collected in Redis). Besides the master there are slaver nodes (the crawler executors), which are responsible for crawling the data and submitting the new requests generated during the crawl to the master's Redis database. As pictured above, suppose we have four computers, A, B, C, and D; any one of them can act as the master or as a slaver. The whole process is:
First, the slavers take tasks (requests, URLs) from the master and crawl the data; while crawling, the new requests they produce are submitted back to the master for processing. There is only one Redis database, on the master side; it de-duplicates unhandled requests, assigns tasks, adds new requests to the queue to be crawled, and stores the crawled data.
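To make this concrete, the shared state that the master's Redis holds can be inspected directly. Below is a minimal sketch, assuming the default scrapy-redis key names for a spider named amazon and the Redis address used later in this article:

import redis

# Assumed address of the master's Redis instance (the same one used below).
r = redis.Redis(host="172.16.1.99", port=6379, db=0)

# With scrapy-redis's default key names, a spider named "amazon" keeps:
print(r.llen("amazoncategory:start_urls"))  # start URLs under the configured redis_key (list)
print(r.zcard("amazon:requests"))           # scheduled requests shared by all slavers (sorted set)
print(r.scard("amazon:dupefilter"))         # request fingerprints used for de-duplication (set)
print(r.llen("amazon:items"))               # scraped items pushed by RedisPipeline (list)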


This is the default strategy of Scrapy-redis, and it is very simple to implement, because Scrapy-redis already handles task scheduling and the other bookkeeping for us; we only need to inherit from RedisSpider and specify a redis_key.

The disadvantage is that what Scrapy-redis schedules are Request objects, which carry a large amount of information (not only the URL but also the callback, headers, and so on). This may slow the crawler down and will occupy a large amount of Redis storage space, so if you want to keep things efficient, a certain level of hardware is required.

4. Running the process
Step one: in the crawler on the slaver side, specify the redis_key and the address of the Redis database, for example:

from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    """Spider that reads start URLs from a Redis queue (amazoncategory:start_urls)."""
    name = 'amazon'
    redis_key = 'amazoncategory:start_urls'

# Specify the connection parameters for the Redis database (in settings.py)
REDIS_HOST = '172.16.1.99'
REDIS_PORT = 6379
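Beyond the host and port, a slaver also needs Scrapy's scheduler, duplicate filter, and item pipeline switched to the Redis-backed versions so that all nodes share one request queue. A minimal settings.py sketch along those lines (the persist flag and the pipeline priority are assumed defaults, not values from this article):

# settings.py on the slaver side -- a minimal sketch
# Replace Scrapy's scheduler and duplicate filter with the Redis-backed
# versions so that every node shares one request queue and fingerprint set.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and fingerprints in Redis between runs (optional).
SCHEDULER_PERSIST = True

# Push scraped items into Redis so the master can collect them.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# Address of the master's Redis database.
REDIS_HOST = '172.16.1.99'
REDIS_PORT = 6379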
Step two: start the crawler on the slaver side. The crawler enters a waiting state, waiting for the redis_key to appear in Redis. The log looks like this:
2017-12-12 15:54:18 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-12-12 15:54:18 [scrapy.crawler] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2017-12-12 15:54:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-12-12 15:54:18 [myspider_redis] INFO: Reading start URLs from redis key 'myspider:start_urls' (batch size: 110, encoding: utf-8)
2017-12-12 15:54:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'redisClawerSlaver.middlewares.ProxiesMiddleware',
 'redisClawerSlaver.middlewares.HeadersMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-12-12 15:54:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-12-12 15:54:18 [scrapy.middleware] INFO: Enabled item pipelines:
['redisClawerSlaver.pipelines.ExamplePipeline',
 'scrapy_redis.pipelines.RedisPipeline']
2017-12-12 15:54:18 [scrapy.core.engine] INFO: Spider opened
2017-12-12 15:54:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-12 15:55:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-12 15:56:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Step three: run a script that populates the Redis database with the redis_key (start_urls).
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Store the start URL in Redis so the waiting crawlers can begin crawling.

import redis

redis_host = "172.16.1.99"
redis_key = 'amazoncategory:start_urls'

# Create the Redis database connection
rediscli = redis.Redis(host=redis_host, port=6379, db=0)

# First empty all requests already in Redis
flushdbres = rediscli.flushdb()
print(f"flushdbres = {flushdbres}")

# Push the start URL onto the list the spiders are listening on
rediscli.lpush(redis_key, "https://www.baidu.com")

Step four: the crawlers on the slaver side begin to crawl the data. The log looks like this:

2017-12-12 15:56:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
parse url = https://www.baidu.com, status = 200, meta = {'download_timeout': 25.0, 'proxy': 'http://proxy.abuyun.com:9020', 'download_slot': 'www.baidu.com', 'download_latency': 0.2569999694824219, 'depth': 7}
parse url = https://www.baidu.com, status = 200, meta = {'download_timeout': 25.0, 'proxy': 'http://proxy.abuyun.com:9020', 'download_slot': 'www.baidu.com', 'download_latency': 0.8840000629425049, 'depth': 8}
2017-12-12 15:57:18 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 2 pages/min), scraped 1 items (at 1 items/min)
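For reference, a parse method along the lines of the sketch below would produce the "parse url = ..." lines above; the selectors and item fields are assumptions, not code from this article. Items it yields are pushed to Redis by scrapy_redis.pipelines.RedisPipeline, and new requests go back into the shared scheduler queue on the master:

from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = 'amazon'
    redis_key = 'amazoncategory:start_urls'

    def parse(self, response):
        # Mirrors the "parse url = ..." lines in the log above.
        self.logger.info("parse url = %s, status = %s, meta = %s",
                         response.url, response.status, response.meta)

        # Yielded items are serialized by scrapy_redis.pipelines.RedisPipeline
        # and pushed onto the "<spider name>:items" list in the master's Redis.
        yield {'url': response.url, 'title': response.css('title::text').extract_first()}

        # New requests go into the shared Redis scheduler queue, so any
        # slaver in the cluster can claim them.
        for href in response.css('a::attr(href)').extract():
            yield response.follow(href, callback=self.parse)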
Step five: run a script that takes the items out of Redis and stores them in MongoDB.
For that part of the code, please refer to the companion article: Scrapy-redis distributed crawler construction process (code chapter). A rough sketch of the idea follows.
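As an illustration only (not the code from the companion article), the sketch below pops serialized items off the Redis list that RedisPipeline fills and inserts them into MongoDB; the database and collection names, and both server addresses, are assumptions:

import json
import redis
import pymongo

# Assumed addresses; point these at your own master Redis and MongoDB hosts.
rediscli = redis.Redis(host="172.16.1.99", port=6379, db=0)
mongocli = pymongo.MongoClient("mongodb://localhost:27017/")
collection = mongocli["amazon"]["items"]

while True:
    # RedisPipeline pushes JSON-serialized items onto "<spider name>:items";
    # blpop blocks until an item arrives, then removes it from the list.
    _, data = rediscli.blpop("amazon:items")
    collection.insert_one(json.loads(data))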
5. Environment installation and code writing
5.1. Installing the Scrapy-redis environment
pip install scrapy-redis


Code location: the following can be modified and customized as needed.
5.2. Writing the Scrapy-redis distributed crawler
The first step: download the official example code from https://github.com/rmax/scrapy-redis (git needs to be installed):

git clone https://github.com/rmax/scrapy-redis.git

The official repository provides two kinds of example spiders, inheriting from Spider + Redis and CrawlSpider + Redis respectively.
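The second variant looks roughly like the sketch below, modeled on the repository's example (the rule and callback are illustrative):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider

class MyCrawler(RedisCrawlSpider):
    """CrawlSpider variant that reads start URLs from a Redis queue."""
    name = 'mycrawler_redis'
    redis_key = 'mycrawler:start_urls'

    rules = (
        # Follow every link and hand each response to parse_page.
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        return {'name': response.css('title::text').extract_first(),
                'url': response.url}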

The second step: modify your own crawler based on the example code provided there.
