scrapy_redis is a Redis-based Scrapy component that makes it easy to build a simple distributed crawler. The component provides three main functions:
(1) dupefilter -- URL deduplication rule (used by the scheduler)
(2) scheduler -- request scheduler
(3) pipeline -- data persistence
I. Install Redis
Download Redis from the official website and install it on your computer.
II. Install the scrapy_redis component
Open a terminal and run pip install scrapy-redis (macOS/Linux).
By default, the component is installed into the site-packages directory of the corresponding Python installation, for example /usr/local/lib/python3.7/site-packages/scrapy_redis.
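To quickly confirm that the component is importable and that a local Redis server answers (this assumes Redis is running locally on the default port 6379), a small check script might look like this:

# Sanity check: scrapy_redis is importable and Redis responds to PING.
# Assumes a local Redis server on the default port 6379.
import scrapy_redis
import redis  # installed as a dependency of scrapy-redis

print(scrapy_redis.__file__)  # shows where the package was installed
print(redis.Redis(host="127.0.0.1", port=6379).ping())  # True if Redis is reachable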
III. Details of the scrapy_redis functions
(1) URL deduplication
1. Source code: /usr/local/lib/python3.7/site-packages/scrapy_redis/dupefilter.py
Configuration in settings.py:
# Redis configuration
REDIS_HOST = "127.0.0.1"
REDIS_PORT = 6379
REDIS_PARAMS = {}
REDIS_ENCODING = "utf-8"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
DUPEFILTER_KEY = "dupefilter:%(timestamp)s"
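Conceptually, the dupefilter stores a fingerprint of every scheduled request in a Redis set under DUPEFILTER_KEY; a request whose fingerprint is already in the set is treated as a duplicate and dropped. A minimal sketch of the idea (illustrative, not scrapy_redis's exact code; the key name is just an example):

# Minimal sketch of Redis-set-based deduplication (illustrative, not the library code).
import redis
from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

conn = redis.Redis(host="127.0.0.1", port=6379)
key = "dupefilter:demo"  # example key, in the spirit of DUPEFILTER_KEY

def seen(request):
    fp = request_fingerprint(request)  # stable hash of the request's method, URL, and body
    added = conn.sadd(key, fp)         # sadd returns 1 if the fingerprint is new, 0 if already present
    return added == 0

print(seen(Request("https://example.com")))  # False the first time
print(seen(Request("https://example.com")))  # True on the repeat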
2. Rewrite dupefilter
You can customize dupefilter as needed.
Create a file dupefilter.py in the spiders directory and write the following code:
"Override dupefilter" From writable import rfpdupefilterfrom scrapy_redis.connection import into mydupefilter (rfpdupefilter): @ classmethod def from_settings (CLS, settings): Server = setting (settings) key = "my_scrapy_2_dupfilter" # rewrite key DEBUG = settings. getbool ('dupefilter _ debug') return Cls (server, key = key, DEBUG = Debug)
Configure settings.py:
# Redis configuration
REDIS_HOST = "127.0.0.1"   # host
REDIS_PORT = 6379          # port
REDIS_PARAMS = {}          # connection parameters
REDIS_ENCODING = "utf-8"   # encoding
# Point DUPEFILTER_CLASS at your own dupefilter (adjust the module path to your project)
DUPEFILTER_CLASS = "myproject.spiders.dupefilter.MyDupeFilter"
(2) Scheduler
1. Breadth-first and depth-first (see the sketch after this list)
(1) Stack: last in, first out -- depth-first -- LifoQueue (Redis list)
(2) Queue: first in, first out -- breadth-first -- FifoQueue (Redis list)
(3) Priority queue -- PriorityQueue (Redis sorted set)
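The difference between the three queue types is easiest to see at the Redis level. A rough sketch of the behaviours (illustrative only, not scrapy_redis's actual implementation; the demo:* keys are made up):

# Rough illustration of the three queue behaviours at the Redis level (demo keys are made up).
import redis

conn = redis.Redis(host="127.0.0.1", port=6379)

# FIFO (breadth-first): push on the left, pop on the right
conn.lpush("demo:fifo", "a", "b", "c")
print(conn.rpop("demo:fifo"))  # b'a' -- the oldest element comes out first

# LIFO (depth-first): push on the left, pop on the left
conn.lpush("demo:lifo", "a", "b", "c")
print(conn.lpop("demo:lifo"))  # b'c' -- the newest element comes out first

# Priority: a sorted set ordered by score
conn.zadd("demo:prio", {"low": 10, "high": -1})
print(conn.zrange("demo:prio", 0, 0))  # [b'high'] -- the lowest score comes out first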
2. In settings.py:
# Redis configuration
REDIS_HOST = "127.0.0.1"
REDIS_PORT = 6379
REDIS_PARAMS = {}
REDIS_ENCODING = "utf-8"

# Deduplication rule
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'  # priority queue (the default); others: PriorityQueue (sorted set), FifoQueue (list), LifoQueue (list)
SCHEDULER_QUEUE_KEY = '%(spider)s:requests'            # key in redis where the scheduler's requests are stored, e.g. chouti:requests
SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"     # serializer for the data stored in redis; pickle by default
SCHEDULER_PERSIST = True                               # keep the scheduler queue and dedup records when the spider closes; True = keep, False = clear
SCHEDULER_FLUSH_ON_START = True                        # clear the scheduler queue and dedup records on start; True = clear, False = keep
# SCHEDULER_IDLE_BEFORE_CLOSE = 10                     # how long to wait when fetching from an empty scheduler queue before giving up
SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'     # key in redis where the dedup records are stored
SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # class that implements the dedup rule
DEPTH_PRIORITY = -1                                    # when PriorityQueue is used, DEPTH_PRIORITY can be set to -1 or 1
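While a crawl is running, you can peek at the keys the scheduler maintains in Redis. The key names below assume the SCHEDULER_QUEUE_KEY and SCHEDULER_DUPEFILTER_KEY patterns above and a spider named chouti; this is only a convenience sketch:

# Peek at the scheduler's Redis keys during a crawl (sketch; key names assume a spider named "chouti").
import redis

conn = redis.Redis(host="127.0.0.1", port=6379)
print(conn.zcard("chouti:requests"))    # pending requests (PriorityQueue is stored as a sorted set)
print(conn.scard("chouti:dupefilter"))  # number of request fingerprints already recorded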
(3) Data Persistence
1. Source Code
The following example crawls the news titles and links from the Chouti new hot list:
The spider, chouti.py:
# -*- coding: utf-8 -*-
"""Crawl the news titles and URLs of the Chouti new hot list and persist them"""
import scrapy
from scrapy.http import Request
from ..items import MyScrapy3Item


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://chouti.com/']

    def parse(self, response):
        # print(response, response.request.priority, response.meta.get('depth'))
        items = response.xpath("//div[@id='content-list']/div[@class='item']")
        for item in items:
            title = item.xpath(".//div[@class='part1']/a/text()").extract_first().strip()  # title
            href = item.xpath(".//div[@class='part1']/a/@href").extract_first().strip()    # link
            yield MyScrapy3Item(title=title, href=href)  # yield an Item object

        # pagination
        page_list = response.xpath('//*[@id="dig_lcpage"]//a/@href').extract()
        for url in page_list:
            url = "https://dig.chouti.com" + url
            yield Request(url=url, callback=self.parse)
items.py:
import scrapy


class MyScrapy3Item(scrapy.Item):
    title = scrapy.Field()
    href = scrapy.Field()
Configure settings.py:
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,  # use scrapy_redis's persistence pipeline
}
# ----------- Other configurations ----------------------
DEPTH_LIMIT = 2  # crawl depth

# Redis configuration (required)
REDIS_HOST = "127.0.0.1"
REDIS_PORT = 6379
REDIS_PARAMS = {}
REDIS_ENCODING = "utf-8"

# Deduplication rule
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'  # queue class; the default is PriorityQueue (sorted set), others: FifoQueue (list), LifoQueue (list)
SCHEDULER_QUEUE_KEY = '%(spider)s:requests'             # key in redis for pending requests, e.g. chouti:requests
SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"      # serializer for the data stored in redis; pickle by default
SCHEDULER_PERSIST = True                                # keep the scheduler queue and dedup records when the spider closes; True = keep, False = clear
SCHEDULER_FLUSH_ON_START = True                         # clear the scheduler queue and dedup records on start; True = clear, False = keep
# SCHEDULER_IDLE_BEFORE_CLOSE = 10                      # how long to wait when fetching from an empty scheduler queue before giving up
SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'      # key in redis where the dedup records are stored
SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # class that implements the dedup rule
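With ITEM_PIPELINES pointing at scrapy_redis's RedisPipeline, every yielded item ends up serialized on a Redis list named after the spider (chouti:items here), which is why the data can later be read back with list commands. A minimal sketch of that idea (illustrative, not the library's exact code):

# Minimal sketch of what a Redis item pipeline does (illustrative, not scrapy_redis's exact code).
import json

import redis


class SketchRedisPipeline(object):
    def open_spider(self, spider):
        self.conn = redis.Redis(host="127.0.0.1", port=6379)
        self.key = "%s:items" % spider.name  # e.g. "chouti:items"

    def process_item(self, item, spider):
        self.conn.rpush(self.key, json.dumps(dict(item)))  # append the serialized item to the list
        return item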
Create a file start_chouti.py in the project root to run the crawler (alternatively, run the command directly in the terminal):
from scrapy.cmdline import execute

if __name__ == "__main__":
    execute(["scrapy", "crawl", "chouti", "--nolog"])
You can create another .py file to inspect the data stored in Redis:
# View data in three ways
import redis

conn = redis.Redis(host="127.0.0.1", port=6379)
# conn.flushall()  # clear redis
print(conn.keys())  # view all keys: [b'chouti:dupefilter', b'chouti:items']
# 1. Get a range of items
# res = conn.lrange('chouti:items', 0, 3)  # fetch the first few persisted items
# print(res)
"""
Result (unicode-escaped JSON; titles abbreviated):
[b'{"title": "...", "href": "https://mp.weixin.qq.com/s/eiWj7ky53xEDoMRFXC1EGg"}',
 b'{"title": "...", "href": "https://mp.weixin.qq.com/s/erLgWmL1GhpyWqwOTIlRvQ"}',
 b'{"title": "...", "href": "http://www.qdaily.com/articles/57753.html"}',
 b'{"title": "...", "href": "https://wallstreetcn.com/articles/3428455"}']
"""
# 2. Pop items one by one
# item = conn.lpop('chouti:items')
# print(item)
"""
Result (title abbreviated):
b'{"title": "...", "href": "https://mp.weixin.qq.com/s/eiWj7ky53xEDoMRFXC1EGg"}'
"""
# 3. Producer-consumer model
while True:
    item = conn.blpop('chouti:items')  # pop items one by one; block when the list is empty
    print(item)
With scrapy_redis's persistence, producing data (the crawler pushing items into Redis) and consuming data (reading items back out) become two independent operations that can run concurrently without affecting each other.
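For example, a standalone consumer script, completely separate from the crawler process, can drain items from Redis, decode the JSON payloads back into dicts, and store them wherever needed. A sketch, assuming the chouti:items key and JSON-serialized items shown above:

# Standalone consumer sketch: drains items while the crawler keeps producing.
import json

import redis

conn = redis.Redis(host="127.0.0.1", port=6379)

while True:
    _, raw = conn.blpop('chouti:items')  # blocks until an item is available
    item = json.loads(raw)               # items were stored as JSON strings
    print(item["title"], item["href"])   # replace with a database insert, file write, etc.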
2. If you want to store the data somewhere else, you can inherit from and override scrapy_redis's pipeline (see the sketch below).
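A minimal sketch of that idea: subclass scrapy_redis's RedisPipeline, add your own storage step, and keep (or drop) the call to the parent's process_item. The backup file name here is just an example; register the class in ITEM_PIPELINES in place of scrapy_redis.pipelines.RedisPipeline:

# Sketch: also write each item to a local file while keeping the Redis persistence.
import json

from scrapy_redis.pipelines import RedisPipeline


class MyRedisPipeline(RedisPipeline):
    def process_item(self, item, spider):
        # extra storage step (the file name is just an example)
        with open("items_backup.jsonl", "a", encoding="utf-8") as f:
            f.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return super().process_item(item, spider)  # still push the item into Redis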