99. Distributed crawlers and scrapy-redis
Contents of this article:
- Introduction
- The scrapy-redis component
I. Introduction
By default, the Scrapy Scheduler maintains a local task queue (which stores the Request objects together with their callback information) plus a local deduplication queue (which stores the URLs that have already been visited).
Therefore, the key to distributed crawling is to have a dedicated host run a shared queue, such as Redis,
and then rewrite the Scrapy Scheduler so that the new Scheduler fetches Requests from the shared queue and filters out duplicate Requests. In summary, distributed crawling comes down to three points:
#1. A shared task queue
#2. A rewritten Scheduler, so that both deduplication and task scheduling go through the shared queue
#3. Customized deduplication rules for the Scheduler (using the Redis set type)
The above three points are the core functions of the scrapy-redis component.
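To make the idea of a shared queue concrete before looking at scrapy-redis itself, here is a minimal sketch that uses the redis-py client directly: one crawler node pushes a serialized task onto a Redis list and another node pops it. The key name 'crawler:requests' and the task layout are made up for this illustration; they are not part of scrapy-redis.

# A minimal sketch of the "shared queue" idea, assuming a local Redis server
# and the redis-py package; the key name 'crawler:requests' is hypothetical.
import json
import redis

server = redis.Redis(host='localhost', port=6379)

# A producer (any crawler node) pushes a serialized task onto the shared list.
task = {'url': 'http://www.baidu.com', 'callback': 'parse'}
server.lpush('crawler:requests', json.dumps(task))

# A consumer (any other crawler node) pops a task; every node sees the same queue,
# which is what scrapy-redis automates for the Scrapy Scheduler.
raw = server.rpop('crawler:requests')
if raw:
    print(json.loads(raw))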
# Installation: pip3 install scrapy-redis
# Source code: D:\python3.6\Lib\site-packages\scrapy_redis
II. The scrapy-redis component
1. Using only the scrapy-redis deduplication function
#1. Source code: D:\python3.6\Lib\site-packages\scrapy_redis\dupefilter.py

#2. Configure scrapy to use the shared deduplication queue provided by Redis

#2.1 Configure the Redis connection in settings.py
REDIS_HOST = 'localhost'                             # host name
REDIS_PORT = 6379                                    # port
REDIS_URL = 'redis://user:pass@hostname:100'         # connection URL (takes priority over the settings above)
REDIS_PARAMS = {}                                    # Redis connection parameters
REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'  # specify the Python module used to connect to Redis
REDIS_ENCODING = "utf-8"                             # Redis encoding type
# Default configuration: D:\python3.6\Lib\site-packages\scrapy_redis\defaults.py

#2.2 Use the shared deduplication queue
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the deduplication provided by scrapy-redis; reading the source code shows it is implemented with a Redis set

#2.3 Specify the key name in Redis; the key is the set that stores the non-duplicate Request fingerprint strings
DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'
# Source code: dupefilter.key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}

#2.4 Source code analysis of the deduplication rule, dupefilter.py
def request_seen(self, request):
    """Returns True if request was already seen.

    Parameters
    ----------
    request : scrapy.http.Request

    Returns
    -------
    bool

    """
    fp = self.request_fingerprint(request)
    # This returns the number of values added, zero if already exists.
    added = self.server.sadd(self.key, fp)
    return added == 0

#2.5 The request is converted into a fingerprint string before being saved to the set
from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

req = Request(url='http://www.baidu.com')
result = request_fingerprint(req)
print(result)  # 75d6587d87b3f4f3aa574b33dbd69ceeb9eafe7b

#2.6 Notes:
# - If only the order of the URL query parameters differs, the fingerprint is the same;
# - Request headers are excluded from the calculation by default; include_headers can specify which headers to include
# - Example:
from scrapy.utils import request
from scrapy.http import Request

req = Request(url='http://www.baidu.com?name=8&id=1', callback=lambda x: print(x), cookies={'k1': 'vvvvv'})
result1 = request.request_fingerprint(req, include_headers=['cookies', ])
print(result1)

req = Request(url='http://www.baidu.com?id=1&name=8', callback=lambda x: print(x), cookies={'k1': 666})
result2 = request.request_fingerprint(req, include_headers=['cookies', ])

print(result1 == result2)  # True
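To tie the two pieces above together, here is a small illustrative sketch (not scrapy-redis source code) that computes request fingerprints the same way and stores them in a Redis set with sadd, so a request whose query parameters are merely reordered counts as already seen. The key name 'demo:dupefilter' is invented for the example; the real key comes from DUPEFILTER_KEY.

# Illustrative only: mimics what RFPDupeFilter does, using a made-up key name.
import redis
from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

server = redis.Redis(host='localhost', port=6379)
key = 'demo:dupefilter'   # hypothetical key; the real one is built from DUPEFILTER_KEY

def seen(request):
    # sadd returns 1 if the fingerprint was new, 0 if it was already in the set
    fp = request_fingerprint(request)
    return server.sadd(key, fp) == 0

print(seen(Request(url='http://www.baidu.com?name=8&id=1')))  # False: first time
print(seen(Request(url='http://www.baidu.com?id=1&name=8')))  # True: same fingerprint, already seen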
2. Distributed crawling using scrapy-redis deduplication + Scheduling
#1. Source code: D:\python3.6\Lib\site-packages\scrapy_redis\scheduler.py

#2. settings.py configuration

# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# The scheduler serializes non-duplicate tasks with pickle and puts them into the shared task queue.
# The priority queue is used by default; options: PriorityQueue (sorted set), FifoQueue (list), LifoQueue (list)
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

# Serializer for the request objects stored in Redis; pickle by default
SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"

# Key in Redis under which the scheduler's serialized request tasks are stored
SCHEDULER_QUEUE_KEY = '%(spider)s:requests'

# Whether to keep the scheduler queue and deduplication records when the spider is closed; True = keep, False = clear
SCHEDULER_PERSIST = True

# Whether to clear the scheduler queue and deduplication records before starting; True = clear, False = do not clear
SCHEDULER_FLUSH_ON_START = False

# Maximum time to wait when fetching from the scheduler returns nothing (no data obtained in the end).
# Without it, an empty queue would return immediately, causing too many empty loops and a spike in CPU usage.
SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Key in Redis under which the deduplication records are stored
SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'

# Class that handles the deduplication rule; it puts the string obtained from request_fingerprint(request) into the deduplication queue
SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
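Putting the pieces together, a minimal settings.py for a distributed project might look like the sketch below. It only combines options already listed above; the host name 'my-redis-host' is a placeholder, not a value from the original text.

# A minimal sketch of settings.py for distributed crawling, combining the options above.
# 'my-redis-host' is a placeholder; point it at the shared Redis instance.
REDIS_HOST = 'my-redis-host'
REDIS_PORT = 6379

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # shared dedup set in Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # shared task queue in Redis
SCHEDULER_PERSIST = True                                     # keep queue/dedup records between runs

# Every machine runs the same spider with these settings; they all pull requests
# from '%(spider)s:requests' and share '%(spider)s:dupefilter'.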
3. Persistence
# After data is fetched from the target site and parsed, it is saved as an item object; the engine hands it to the pipeline for persistence / storage in a database. scrapy-redis provides a pipeline component that can store the items in Redis for us.

#1. When persisting items to Redis, specify the key and the serialization function
REDIS_ITEMS_KEY = '%(spider)s:items'
REDIS_ITEMS_SERIALIZER = 'json.dumps'

#2. A Redis list is used to store the item data
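As a sketch, enabling the pipeline shipped with scrapy-redis and then reading the stored items back with a small standalone script could look like this; the spider name 'myspider' and the pipeline priority 300 are assumptions made for the example.

# In settings.py: enable the Redis item pipeline provided by scrapy-redis
# (priority 300 is an arbitrary example value).
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# A separate helper script that reads the items back from the Redis list;
# 'myspider' is a hypothetical spider name, so the key is 'myspider:items'.
import json
import redis

server = redis.Redis(host='localhost', port=6379)
for raw in server.lrange('myspider:items', 0, -1):
    print(json.loads(raw))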
4. Get the starting URL from Redis
A Scrapy program crawls the target site and stops once the crawl is finished. If the target site later updates its content and we want to crawl it again, we have to restart Scrapy. scrapy-redis provides a way for Scrapy to obtain its start URLs from Redis: if there is no start URL in Redis, Scrapy does not shut down but simply checks again after a while. That way, we only need to write a simple script that periodically pushes a start URL into the Redis queue.

# The specific configuration is as follows

#1. When writing the crawler, the start URLs are obtained from this key in Redis
REDIS_START_URLS_KEY = '%(name)s:start_urls'

#2. When obtaining the start URLs, fetch from a set or from a list? True = set, False = list
REDIS_START_URLS_AS_SET = False    # if True, self.server.spop is used; if False, self.server.lpop is used
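For illustration, a spider that reads its start URLs from Redis and the small feeder script mentioned above might look like the sketch below. The spider name 'myspider', the redis_key value, and the target URL are assumptions made for the example, not part of the original text.

# myspider.py -- a minimal sketch of a spider that takes its start URLs from Redis
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = 'myspider'                     # hypothetical spider name
    redis_key = 'myspider:start_urls'     # matches REDIS_START_URLS_KEY = '%(name)s:start_urls'

    def parse(self, response):
        # the spider idles and waits for new URLs in Redis instead of closing
        yield {'url': response.url, 'title': response.css('title::text').get()}

# feed_urls.py -- a simple script, run periodically, that pushes a start URL into the list
import redis

server = redis.Redis(host='localhost', port=6379)
server.lpush('myspider:start_urls', 'http://www.baidu.com')   # the spider pops it with lpop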