Problems a distributed crawler has to solve:
Centralized management of the request queue
Centralized management of deduplication
Storage management
You can find scrapy-redis on GitHub.
Related module: redis (the Python Redis client).
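Both packages install from PyPI; a typical setup (a sketch, assuming pip) is:

pip install scrapy scrapy-redis redis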
settings.py
# Use the scrapy-redis deduplication component instead of Scrapy's default dupefilter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scheduler component from scrapy-redis instead of the default scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Allow pausing; the request records in Redis are not lost
SCHEDULER_PERSIST = True
# Default scrapy-redis request queue form (by priority)
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
# Queue form: first in, first out
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
# Stack form: last in, first out
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"
# RedisPipeline just puts the data into Redis, so nothing needs to be written for that,
# but the data still has to be written to MySQL separately (see the pipeline sketch below)
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
# Connect to the Redis database
REDIS_URL = 'redis://:@127.0.0.1:6379'
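As noted above, RedisPipeline only pushes items into Redis; writing them to MySQL still needs a pipeline of your own. A minimal sketch, assuming pymysql and a hypothetical local database spider with a baike(title, content) table; register it in ITEM_PIPELINES alongside RedisPipeline:

import pymysql

class MysqlPipeline(object):
    # minimal sketch: hypothetical local MySQL instance and table names
    def open_spider(self, spider):
        self.conn = pymysql.connect(host='127.0.0.1', user='root',
                                    password='', db='spider', charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        self.cursor.execute(
            'INSERT INTO baike (title, content) VALUES (%s, %s)',
            (''.join(item.get('title', [])), item.get('content', ''))
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()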
Run the distributed crawler:
scrapy runspider myspider.py
# myspider.py is the file name of the distributed spider you want to run
After starting from the command line, the spider waits, listening for a start URL on Redis,
i.e. the redis_key = 'mybaike:start_url' set on the spider.
Then lpush the key name and a URL into Redis, for example:
# lpush mybaike:start_url "http://www.baike.com"
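The same seeding can be done from Python with the redis module; a minimal sketch, assuming Redis runs on 127.0.0.1:6379 with no password:

import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)
# push the start URL under the key the spider listens on (redis_key)
r.lpush('mybaike:start_url', 'http://www.baike.com')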
The keys generated in Redis by default are:
myspider:requests
myspider:dupefilter
# If the item pipeline that stores data in Redis is enabled, there is also this key:
myspider:items
# The command to delete all keys in the current Redis database is: flushdb
# View all keys: keys *
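The crawl state can also be inspected from Python; a sketch, assuming the example spider name 'mybaike' and the default priority queue (the requests key is a sorted set, the dupefilter a set, the items key a list):

import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)
print(r.keys('mybaike:*'))            # all keys belonging to this spider
print(r.zcard('mybaike:requests'))    # pending requests (sorted set with SpiderPriorityQueue)
print(r.scard('mybaike:dupefilter'))  # request fingerprints already seen
print(r.llen('mybaike:items'))        # scraped items, if RedisPipeline is enabled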
spider.py ## Based on RedisCrawlSpider, i.e. inherit from this class when you use CrawlSpider-style Rules to follow pagination links
import scrapy
from scrapy.selector import Selector
from Scrapy_Redist.items import ScrapyRedistItem
from scrapy_redis.spiders import RedisCrawlSpider
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MybaidukeSpider(RedisCrawlSpider):  # Based on RedisCrawlSpider
    name = 'mybaike'
    allowed_domains = ['baike.baidu.com']
    # start_urls = ['https://baike.baidu.com/item/Python/407313']
    redis_key = 'mybaike:start_url'
    rules = [Rule(LinkExtractor(allow=("item/(.*)")), callback="get_parse", follow=True)]

    def get_parse(self, response):
        items = ScrapyRedistItem()
        sel = Selector(response)
        title = sel.xpath('//dd[@class="lemmaWgt-lemmaTitle-title"]/h1/text()').extract()
        contentList = sel.xpath('//div[@class="lemma-summary"]//text()')
        content = ''
        for c in contentList:
            content += c.extract().strip()
        items['title'] = title
        items['content'] = content
        yield items
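The ScrapyRedistItem imported here comes from the project's items.py; its definition is not shown in the original, but a minimal sketch matching the fields used by the spider would be:

import scrapy

class ScrapyRedistItem(scrapy.Item):
    title = scrapy.Field()    # page title extracted by the spider
    content = scrapy.Field()  # summary text

spider.py ## Based on RedisSpider, for the case where no CrawlSpider Rules are needed: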
import scrapy
from scrapy.selector import Selector
from Scrapy_Redist.items import ScrapyRedistItem
from scrapy_redis.spiders import RedisSpider

class MybaidukeSpider(RedisSpider):
    name = 'mybaike'
    allowed_domains = ['baike.baidu.com']
    # start_urls = ['https://baike.baidu.com/item/Python/407313']
    redis_key = 'mybaike:start_url'
    # This line is very important: it is the key name the spider reads start URLs from in Redis.
    # rules = [Rule(LinkExtractor(allow=("item/(.*)")), callback="get_parse", follow=True)]
    # If you need Rules to follow pagination links, inherit from RedisCrawlSpider instead (as above).

    def get_parse(self, response):
        items = ScrapyRedistItem()
        sel = Selector(response)
        title = sel.xpath('//dd[@class="lemmaWgt-lemmaTitle-title"]/h1/text()').extract()
        contentList = sel.xpath('//div[@class="lemma-summary"]//text()')
        content = ''
        for c in contentList:
            content += c.extract().strip()
        items['title'] = title
        items['content'] = content
        yield items
Everything else is almost the same as an ordinary Scrapy project.