Scrapy-redis Distributed crawler

Problems a distributed crawler has to solve:

Centralized management of the request queue

Centralized deduplication of requests

Storage management

You can find scrapy-redis on GitHub.

Related module: redis (the Python Redis client)

settings.py

# Use the scrapy-redis deduplication component instead of Scrapy's default dupefilter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Use the scrapy-redis scheduler instead of Scrapy's default scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Allow pausing and resuming; the request records kept in Redis are not lost
SCHEDULER_PERSIST = True

# Request queue type: pick ONE of the three settings below.
# Default scrapy-redis request queue (ordered by priority)
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"

# Plain queue: first in, first out (FIFO)
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"

# Stack: last in, first out (LIFO)
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"

# RedisPipeline stores the scraped items in Redis, so you don't have to write that pipeline yourself.
# If the data also has to end up in MySQL, write an additional pipeline for that (a sketch follows below).
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
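RedisPipeline only writes to Redis, so data that also has to land in MySQL needs a pipeline of its own. A minimal sketch, assuming the pymysql driver and a baike(title, content) table; the class name, table, and credentials are illustrative, not part of the original project:

# pipelines.py -- hypothetical MySQL pipeline; enable it next to RedisPipeline, e.g.
# 'Scrapy_Redist.pipelines.MysqlPipeline': 300
import pymysql

class MysqlPipeline(object):
    def open_spider(self, spider):
        # Connection parameters are placeholders; adjust them to your database.
        self.conn = pymysql.connect(host='127.0.0.1', user='root', password='',
                                    db='baike', charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # The spider stores title as a list of strings (extract()), so join it first.
        title = ''.join(item.get('title', []))
        # Assumes a table baike(title, content) already exists.
        self.cursor.execute(
            'INSERT INTO baike (title, content) VALUES (%s, %s)',
            (title, item.get('content'))
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()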




# Connect to the Redis database
REDIS_URL = 'redis://:@127.0.0.1:6379'
Run distributed crawlers

scrapy runspider myspider.py
# myspider.py is the file of the distributed spider you want to run

After starting, the command blocks and waits for a start URL to appear in Redis, i.e. in the key configured on the spider with redis_key = 'mybaike:start_url'.
Then LPUSH the start URL onto that key in Redis, for example:

    lpush mybaike:start_url "http://www.baike.com"
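The start URL can also be seeded from Python with the redis-py client instead of redis-cli; a minimal sketch, assuming Redis runs locally on the default port as in the REDIS_URL above:

import redis

# Push the start URL into the key the spider is listening on (its redis_key).
r = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)
r.lpush('mybaike:start_url', 'http://www.baike.com')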

The keys that scrapy-redis creates in Redis by default (myspider stands for the spider's name):

myspider:requests

myspider:dupefilter

# If the RedisPipeline item pipeline is enabled, the scraped data is stored under this key:
myspider:items

#The command to delete all keys in redis is: flushdb
#View all keys: keys *
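The same inspection can be done from Python with redis-py; a small sketch, assuming the spider of this article (name mybaike) and the default priority queue, which keeps pending requests in a sorted set:

import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)
print(r.keys('*'))                    # same as `keys *` in redis-cli
print(r.zcard('mybaike:requests'))    # pending requests (a sorted set with the priority queue)
print(r.scard('mybaike:dupefilter'))  # request fingerprints already seen (a set)
print(r.llen('mybaike:items'))        # items stored by RedisPipeline (a list)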

spider.py  # Based on RedisCrawlSpider, i.e. the class to inherit from when you need CrawlSpider-style rules (for example to follow pagination links)
import scrapy
from scrapy.selector import Selector
from Scrapy_Redist.items import ScrapyRedistItem
from scrapy_redis.spiders import RedisCrawlSpider
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MybaidukeSpider(RedisCrawlSpider):  # Based on RedisCrawlSpider
    name = 'mybaike'
    allowed_domains = ['baike.baidu.com']
    # start_urls = ['https://baike.baidu.com/item/Python/407313']
    redis_key = 'mybaike:start_url'
    rules = [Rule(LinkExtractor(allow=(r"item/(.*)",)), callback="get_parse", follow=True)]

    def get_parse(self, response):
        items = ScrapyRedistItem()

        Seit = Selector(response)
        title = Seit.xpath('//dd[@class="lemmaWgt-lemmaTitle-title"]/h1/text()').extract()
        contentList = Seit.xpath('//div[@class="lemma-summary"]//text()')
        content = ''
        for c in contentList:
            content += c.extract().strip()

        items['title'] = title
        items['content'] = content
        yield items
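Both spider versions import ScrapyRedistItem from the project's items.py, which the article does not show. A minimal definition inferred from the fields the spiders populate:

# items.py of the Scrapy_Redist project (inferred, not shown in the original article)
import scrapy

class ScrapyRedistItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()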
 

import scrapy
from scrapy.selector import Selector
from Scrapy_Redist.items import ScrapyRedistItem

from scrapy_redis.spiders import RedisSpider

class MybaidukeSpider(RedisSpider):  # Based on RedisSpider (no rules)
    name = 'mybaike'
    allowed_domains = ['baike.baidu.com']
    # start_urls = ['https://baike.baidu.com/item/Python/407313']

    # This line is essential: it is the Redis key the spider listens on for start URLs
    redis_key = 'mybaike:start_url'

    # rules = [Rule(LinkExtractor(allow=(r"item/(.*)",)), callback="parse", follow=True)]
    # Rules like the one above only work with RedisCrawlSpider; if you need them
    # (e.g. for following pagination links), inherit from RedisCrawlSpider instead.

    def parse(self, response):
        # Without rules, RedisSpider calls the default callback parse() for every start URL,
        # so the callback is named parse here instead of get_parse.
        items = ScrapyRedistItem()

        Seit = Selector(response)
        title = Seit.xpath('//dd[@class="lemmaWgt-lemmaTitle-title"]/h1/text()').extract()
        contentList = Seit.xpath('//div[@class="lemma-summary"]//text()')
        content = ''
        for c in contentList:
            content += c.extract().strip()

        items['title'] = title
        items['content'] = content
        yield items
Everything else in the project (items, settings, project layout) is the same as an ordinary Scrapy project.
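With RedisPipeline enabled, every scraped item is serialized to JSON and pushed onto the <spider name>:items list (here mybaike:items). A minimal consumer sketch that drains that list, for example to forward the data elsewhere later, might look like the following; the key name and connection details are assumptions:

import json
import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)

# Pop items one by one from the list that RedisPipeline fills; each entry is a JSON string.
while True:
    raw = r.lpop('mybaike:items')
    if raw is None:
        break
    item = json.loads(raw)
    print(item.get('title'), len(item.get('content', '')))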
