scrapy_redis is a Redis-based Scrapy component that makes it easy to build a simple distributed crawler. The component provides three main functions:
(1) dupefilter -- URL deduplication rule (used by the scheduler)
(2) scheduler -- request scheduler
(3) pipeline -- data persistence
I. Install Redis
Download Redis from the official website and install it on your computer.
II. Install the scrapy_redis component
Open a terminal and run pip install scrapy-redis (macOS/Linux).
By default, the component is installed into the site-packages directory of the corresponding Python installation, for example /usr/local/lib/python3.7/site-packages/scrapy_redis.
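To quickly confirm that the component is importable and that a local Redis server answers (this assumes Redis is running locally on the default port 6379), a small check script might look like this:

# Sanity check: scrapy_redis is importable and Redis responds to PING.
# Assumes a local Redis server on the default port 6379.
import scrapy_redis
import redis  # installed as a dependency of scrapy-redis

print(scrapy_redis.__file__)  # shows where the package was installed
print(redis.Redis(host="127.0.0.1", port=6379).ping())  # True if Redis is reachable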
III. Details of the scrapy_redis functions
(1) URL deduplication
1. Source code: /usr/local/lib/python3.7/site-packages/scrapy_redis/dupefilter.py
Configuration in settings.py:
# Redis configuration
REDIS_HOST = "127.0.0.1"
REDIS_PORT = 6379
REDIS_PARAMS = {}
REDIS_ENCODING = "utf-8"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
DUPEFILTER_KEY = "dupefilter:%(timestamp)s"
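Conceptually, the dupefilter stores a fingerprint of every scheduled request in a Redis set under DUPEFILTER_KEY; a request whose fingerprint is already in the set is treated as a duplicate and dropped. A minimal sketch of the idea (illustrative, not scrapy_redis's exact code; the key name is just an example):

# Minimal sketch of Redis-set-based deduplication (illustrative, not the library code).
import redis
from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

conn = redis.Redis(host="127.0.0.1", port=6379)
key = "dupefilter:demo"  # example key, in the spirit of DUPEFILTER_KEY

def seen(request):
    fp = request_fingerprint(request)  # stable hash of the request's method, URL, and body
    added = conn.sadd(key, fp)         # sadd returns 1 if the fingerprint is new, 0 if already present
    return added == 0

print(seen(Request("https://example.com")))  # False the first time
print(seen(Request("https://example.com")))  # True on the repeat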
2. Rewrite dupefilter
You can customize dupefilter as needed.
Create a file dupefilter.py in the spiders directory and write the following code:
"Override dupefilter" From writable import rfpdupefilterfrom scrapy_redis.connection import into mydupefilter (rfpdupefilter): @ classmethod def from_settings (CLS, settings): Server = setting (settings) key = "my_scrapy_2_dupfilter" # rewrite key DEBUG = settings. getbool ('dupefilter _ debug') return Cls (server, key = key, DEBUG = Debug)
Configure settings.py:
# Redis configuration
REDIS_HOST = "127.0.0.1"   # host
REDIS_PORT = 6379          # port
REDIS_PARAMS = {}          # connection parameters
REDIS_ENCODING = "utf-8"   # encoding
# Point DUPEFILTER_CLASS at your own dupefilter (adjust the module path to your project)
DUPEFILTER_CLASS = "myproject.spiders.dupefilter.MyDupeFilter"
(2) Scheduler
1. Breadth-first and depth-first (see the sketch after this list)
(1) Stack: last in, first out -- depth-first -- LifoQueue (Redis list)
(2) Queue: first in, first out -- breadth-first -- FifoQueue (Redis list)
(3) Priority queue -- PriorityQueue (Redis sorted set)
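The difference between the three queue types is easiest to see at the Redis level. A rough sketch of the behaviours (illustrative only, not scrapy_redis's actual implementation; the demo:* keys are made up):

# Rough illustration of the three queue behaviours at the Redis level (demo keys are made up).
import redis

conn = redis.Redis(host="127.0.0.1", port=6379)

# FIFO (breadth-first): push on the left, pop on the right
conn.lpush("demo:fifo", "a", "b", "c")
print(conn.rpop("demo:fifo"))  # b'a' -- the oldest element comes out first

# LIFO (depth-first): push on the left, pop on the left
conn.lpush("demo:lifo", "a", "b", "c")
print(conn.lpop("demo:lifo"))  # b'c' -- the newest element comes out first

# Priority: a sorted set ordered by score
conn.zadd("demo:prio", {"low": 10, "high": -1})
print(conn.zrange("demo:prio", 0, 0))  # [b'high'] -- the lowest score comes out first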
2. In settings.py:
# Redis configuration
REDIS_HOST = "127.0.0.1"
REDIS_PORT = 6379
REDIS_PARAMS = {}
REDIS_ENCODING = "utf-8"

# Deduplication rule
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'  # priority queue (the default); others: PriorityQueue (sorted set), FifoQueue (list), LifoQueue (list)
SCHEDULER_QUEUE_KEY = '%(spider)s:requests'            # key in redis where the scheduler's requests are stored, e.g. chouti:requests
SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"     # serializer for the data stored in redis; pickle by default
SCHEDULER_PERSIST = True                               # keep the scheduler queue and dedup records when the spider closes; True = keep, False = clear
SCHEDULER_FLUSH_ON_START = True                        # clear the scheduler queue and dedup records on start; True = clear, False = keep
# SCHEDULER_IDLE_BEFORE_CLOSE = 10                     # how long to wait when fetching from an empty scheduler queue before giving up
SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'     # key in redis where the dedup records are stored
SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # class that implements the dedup rule
DEPTH_PRIORITY = -1                                    # when PriorityQueue is used, DEPTH_PRIORITY can be set to -1 or 1
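While a crawl is running, you can peek at the keys the scheduler maintains in Redis. The key names below assume the SCHEDULER_QUEUE_KEY and SCHEDULER_DUPEFILTER_KEY patterns above and a spider named chouti; this is only a convenience sketch:

# Peek at the scheduler's Redis keys during a crawl (sketch; key names assume a spider named "chouti").
import redis

conn = redis.Redis(host="127.0.0.1", port=6379)
print(conn.zcard("chouti:requests"))    # pending requests (PriorityQueue is stored as a sorted set)
print(conn.scard("chouti:dupefilter"))  # number of request fingerprints already recorded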
(3) Data Persistence
1. Source Code
The following example crawls the news titles and links from the Chouti new hot list:
The spider, chouti.py:
# -*- coding: utf-8 -*-
"""Crawl the news titles and URLs of the Chouti new hot list and persist them"""
import scrapy
from scrapy.http import Request
from ..items import MyScrapy3Item


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://chouti.com/']

    def parse(self, response):
        # print(response, response.request.priority, response.meta.get('depth'))
        items = response.xpath("//div[@id='content-list']/div[@class='item']")
        for item in items:
            title = item.xpath(".//div[@class='part1']/a/text()").extract_first().strip()  # title
            href = item.xpath(".//div[@class='part1']/a/@href").extract_first().strip()    # link
            yield MyScrapy3Item(title=title, href=href)  # yield an Item object

        # pagination
        page_list = response.xpath('//*[@id="dig_lcpage"]//a/@href').extract()
        for url in page_list:
            url = "https://dig.chouti.com" + url
            yield Request(url=url, callback=self.parse)
items.py:
import scrapy


class MyScrapy3Item(scrapy.Item):
    title = scrapy.Field()
    href = scrapy.Field()
Configure settings.py:
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,  # use scrapy_redis's persistence pipeline
}
# ----------- Other configurations ----------------------
DEPTH_LIMIT = 2  # crawl depth

# Redis configuration (required)
REDIS_HOST = "127.0.0.1"
REDIS_PORT = 6379
REDIS_PARAMS = {}
REDIS_ENCODING = "utf-8"

# Deduplication rule
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'  # queue class; the default is PriorityQueue (sorted set), others: FifoQueue (list), LifoQueue (list)
SCHEDULER_QUEUE_KEY = '%(spider)s:requests'             # key in redis for pending requests, e.g. chouti:requests
SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"      # serializer for the data stored in redis; pickle by default
SCHEDULER_PERSIST = True                                # keep the scheduler queue and dedup records when the spider closes; True = keep, False = clear
SCHEDULER_FLUSH_ON_START = True                         # clear the scheduler queue and dedup records on start; True = clear, False = keep
# SCHEDULER_IDLE_BEFORE_CLOSE = 10                      # how long to wait when fetching from an empty scheduler queue before giving up
SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'      # key in redis where the dedup records are stored
SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # class that implements the dedup rule
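With ITEM_PIPELINES pointing at scrapy_redis's RedisPipeline, every yielded item ends up serialized on a Redis list named after the spider (chouti:items here), which is why the data can later be read back with list commands. A minimal sketch of that idea (illustrative, not the library's exact code):

# Minimal sketch of what a Redis item pipeline does (illustrative, not scrapy_redis's exact code).
import json

import redis


class SketchRedisPipeline(object):
    def open_spider(self, spider):
        self.conn = redis.Redis(host="127.0.0.1", port=6379)
        self.key = "%s:items" % spider.name  # e.g. "chouti:items"

    def process_item(self, item, spider):
        self.conn.rpush(self.key, json.dumps(dict(item)))  # append the serialized item to the list
        return item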
Create a file start_chouti.py in the project root to run the crawler (alternatively, run the command directly in the terminal):
from scrapy.cmdline import execute

if __name__ == "__main__":
    execute(["scrapy", "crawl", "chouti", "--nolog"])
You can create another .py file to inspect the data stored in Redis:
# View data in three ways
import redis

conn = redis.Redis(host="127.0.0.1", port=6379)
# conn.flushall()  # clear redis
print(conn.keys())  # view all keys: [b'chouti:dupefilter', b'chouti:items']
# 1. Get a range of items
# res = conn.lrange('chouti:items', 0, 3)  # fetch the first few persisted items
# print(res)
"""
Result (unicode-escaped JSON; titles abbreviated):
[b'{"title": "...", "href": "https://mp.weixin.qq.com/s/eiWj7ky53xEDoMRFXC1EGg"}',
 b'{"title": "...", "href": "https://mp.weixin.qq.com/s/erLgWmL1GhpyWqwOTIlRvQ"}',
 b'{"title": "...", "href": "http://www.qdaily.com/articles/57753.html"}',
 b'{"title": "...", "href": "https://wallstreetcn.com/articles/3428455"}']
"""
# 2. Pop items one by one
# item = conn.lpop('chouti:items')
# print(item)
"""
Result (title abbreviated):
b'{"title": "...", "href": "https://mp.weixin.qq.com/s/eiWj7ky53xEDoMRFXC1EGg"}'
"""
# 3. Producer-consumer model
while True:
    item = conn.blpop('chouti:items')  # pop items one by one; block when the list is empty
    print(item)
With scrapy_redis's persistence, producing data (the crawler pushing items into Redis) and consuming data (reading items back out) become two independent operations that can run concurrently without affecting each other.
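For example, a standalone consumer script, completely separate from the crawler process, can drain items from Redis, decode the JSON payloads back into dicts, and store them wherever needed. A sketch, assuming the chouti:items key and JSON-serialized items shown above:

# Standalone consumer sketch: drains items while the crawler keeps producing.
import json

import redis

conn = redis.Redis(host="127.0.0.1", port=6379)

while True:
    _, raw = conn.blpop('chouti:items')  # blocks until an item is available
    item = json.loads(raw)               # items were stored as JSON strings
    print(item["title"], item["href"])   # replace with a database insert, file write, etc.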
2. If you want to store the data somewhere else, you can inherit from and override scrapy_redis's pipeline (see the sketch below).
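A minimal sketch of that idea: subclass scrapy_redis's RedisPipeline, add your own storage step, and keep (or drop) the call to the parent's process_item. The backup file name here is just an example; register the class in ITEM_PIPELINES in place of scrapy_redis.pipelines.RedisPipeline:

# Sketch: also write each item to a local file while keeping the Redis persistence.
import json

from scrapy_redis.pipelines import RedisPipeline


class MyRedisPipeline(RedisPipeline):
    def process_item(self, item, spider):
        # extra storage step (the file name is just an example)
        with open("items_backup.jsonl", "a", encoding="utf-8") as f:
            f.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return super().process_item(item, spider)  # still push the item into Redis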