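4. Why deduplicate URLs?
When a crawler runs, we do not want the same page downloaded more than once: repeated downloads waste CPU and put extra load on the engine, so URLs should be deduplicated during the crawl. There is a second reason: in a large-scale crawl, when a failure occurs we do not want to re-crawl URLs that were already fetched, because re-running them wastes both resources and time.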
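5. How to decide how strong the deduplication should be?
Here the deduplication period determines the strength:
Period of one hour or less: do not persist the crawled links (persisting means storing the URLs so that an incremental crawling scheme can use them later)
Period of one day or less (or fewer than 300,000 URLs in total): apply simple persistence to the crawled links
Period longer than one day: apply full persistence to the crawled links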
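The rules above can be written down as a small decision function. This is only a sketch: the function name `choose_persistence`, the level labels, and the way the URL-count condition is combined with the period are assumptions made for illustration, not part of any library.

```python
def choose_persistence(period_hours, total_urls=None):
    """Map a dedup period (and an optional URL count) to a persistence level.

    Hypothetical helper: the labels "none", "simple" and "full" are
    illustrative names for the three tiers described in the text.
    """
    if period_hours <= 1:
        return "none"      # short-lived crawl: keep seen URLs in memory only
    small_crawl = total_urls is not None and total_urls < 300_000
    if period_hours <= 24 or small_crawl:
        return "simple"    # e.g. a local file or an embedded key-value store
    return "full"          # long-running crawl: durable storage is warranted


levels = [
    choose_persistence(0.5),
    choose_persistence(12),
    choose_persistence(72, total_urls=100_000),
    choose_persistence(72, total_urls=1_000_000),
]
print(levels)
```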
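Step 2: Install the dependency packages:
Step 3: Install scrapy-deltafetch
Open a terminal and install it with a single command: pip install scrapy-deltafetch
The following adds the package installation process on Ubuntu 16.04 (reference post: http://jinbitou.net/2018/01/27/2579.html)
The output of a successful installation is pasted here. First, install the Berkeley DB database.
Then install scrapy-deltafetch. Before doing so, install its dependency bsddb3 as well.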
(course-python3.5-env) $ pip install bsddb3
Collecting bsddb3
  Using cached https://files.pythonhosted.org/packages/ba/a7/131dfd4e3a5002ef30e20bee679d5e6bcb2fcc6af21bd5079dc1707a132c/bsddb3-6.2.5.tar.gz
Building wheels for collected packages: bsddb3
  Running setup.py bdist_wheel for bsddb3 ... done
  Stored in directory: /home/bourne/.cache/pip/wheels/58/8e/e5/bfbc89dd084aa896e471476925d48a713bb466842ed760d43c
Successfully built bsddb3
Installing collected packages: bsddb3
Successfully installed bsddb3-6.2.5
(course-python3.5-env) $ pip install scrapy-deltafetch
Collecting scrapy-deltafetch
  Using cached https://files.pythonhosted.org/packages/90/81/08bd21bc3ee364845d76adef09d20d85d75851c582a2e0bb7f959d49b8e5/scrapy_deltafetch-1.2.1-py2.py3-none-any.whl
Requirement already satisfied: bsddb3 in ./course-python3.5-env/lib/python3.5/site-packages (from scrapy-deltafetch) (6.2.5)
Requirement already satisfied: Scrapy>=1.1.0 in ./course-python3.5-env/lib/python3.5/site-packages (from scrapy-deltafetch) (1.5.0)
Requirement already satisfied: PyDispatcher>=2.0.5 in ./course-python3.5-env/lib/python3.5/site-packages (from Scrapy>=1.1.0->scrapy-deltafetch) (2.0.5)
Requirement already satisfied: lxml in ./course-python3.5-env/lib/python3.5/site-packages (from Scrapy>=1.1.0->scrapy-deltafetch) (4.2.1)
Requirement already satisfied: cssselect>=0.9 in ./course-python3.5-env/lib/python3.5/site-packages (from Scrapy>=1.1.0->scrapy-deltafetch) (1.0.3)
Requirement already satisfied: queuelib in ./course-python3.5-env/lib/python3.5/site-packages (from Scrapy>=1.1.0->scrapy-deltafetch) (1.5.0)
Requirement already satisfied: w3lib>=1.17.0 in ./course-python3.5-env/lib/python3.5/site-packages (from Scrapy>=1.1.0->scrapy-deltafetch) (1.19.0)
Requirement already satisfied: service-identity in ./course-python3.5-env/lib/python3.5/site-packages (from Scrapy>=1.1.0->scrapy-deltafetch) (17.0.0)
Requirement already satisfied: Twisted>=13.1.0 in ./course-python3.5-env/lib/python3.5/site-packages (from Scrapy>=1.1.0->scrapy-deltafetch) (18.4.0)
Requirement already satisfied: parsel>=1.1 in ./course-python3.5-env/lib/python3.5/site-packages (from Scrapy>=1.1.0->scrapy-deltafetch) (1.4.0)
Requirement already satisfied: pyOpenSSL in ./course-python3.5-env/lib/python3.5/site-packages (from Scrapy>=1.1.0->scrapy-deltafetch) (17.5.0)
Requirement already satisfied: six>=1.5.2 in ./course-python3.5-env/lib/python3.5/site-packages (from Scrapy>=1.1.0->scrapy-deltafetch) (1.11.0)
Requirement already satisfied: attrs in ./course-python3.5-env/lib/python3.5/site-packages (from service-identity->Scrapy>=1.1.0->scrapy-deltafetch) (18.1.0)
Requirement already satisfied: pyasn1-modules in ./course-python3.5-env/lib/python3.5/site-packages (from service-identity->Scrapy>=1.1.0->scrapy-deltafetch) (0.2.1)
Requirement already satisfied: pyasn1 in ./course-python3.5-env/lib/python3.5/site-packages (from service-identity->Scrapy>=1.1.0->scrapy-deltafetch) (0.4.2)
Requirement already satisfied: incremental>=16.10.1 in ./course-python3.5-env/lib/python3.5/site-packages (from Twisted>=13.1.0->Scrapy>=1.1.0->scrapy-deltafetch) (17.5.0)
Requirement already satisfied: constantly>=15.1 in ./course-python3.5-env/lib/python3.5/site-packages (from Twisted>=13.1.0->Scrapy>=1.1.0->scrapy-deltafetch) (15.1.0)
Requirement already satisfied: Automat>=0.3.0 in ./course-python3.5-env/lib/python3.5/site-packages (from Twisted>=13.1.0->Scrapy>=1.1.0->scrapy-deltafetch) (0.6.0)
Requirement already satisfied: hyperlink>=17.1.1 in ./course-python3.5-env/lib/python3.5/site-packages (from Twisted>=13.1.0->Scrapy>=1.1.0->scrapy-deltafetch) (18.0.0)
Requirement already satisfied: zope.interface>=4.4.2 in ./course-python3.5-env/lib/python3.5/site-packages (from Twisted>=13.1.0->Scrapy>=1.1.0->scrapy-deltafetch) (4.5.0)
Requirement already satisfied: cryptography>=2.1.4 in ./course-python3.5-env/lib/python3.5/site-packages (from pyOpenSSL->Scrapy>=1.1.0->scrapy-deltafetch) (2.2.2)
Requirement already satisfied: idna>=2.5 in ./course-python3.5-env/lib/python3.5/site-packages (from hyperlink>=17.1.1->Twisted>=13.1.0->Scrapy>=1.1.0->scrapy-deltafetch) (2.6)
Requirement already satisfied: setuptools in ./course-python3.5-env/lib/python3.5/site-packages (from zope.interface>=4.4.2->Twisted>=13.1.0->Scrapy>=1.1.0->scrapy-deltafetch) (39.1.0)
Requirement already satisfied: cffi>=1.7; platform_python_implementation != "PyPy" in ./course-python3.5-env/lib/python3.5/site-packages (from cryptography>=2.1.4->pyOpenSSL->Scrapy>=1.1.0->scrapy-deltafetch) (1.11.5)
Requirement already satisfied: asn1crypto>=0.21.0 in ./course-python3.5-env/lib/python3.5/site-packages (from cryptography>=2.1.4->pyOpenSSL->Scrapy>=1.1.0->scrapy-deltafetch) (0.24.0)
Requirement already satisfied: pycparser in ./course-python3.5-env/lib/python3.5/site-packages (from cffi>=1.7; platform_python_implementation != "PyPy"->cryptography>=2.1.4->pyOpenSSL->Scrapy>=1.1.0->scrapy-deltafetch) (2.18)
Installing collected packages: scrapy-deltafetch
Successfully installed scrapy-deltafetch-1.2.1
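Installing the package does not switch the middleware on by itself. As far as the scrapy-deltafetch README describes it, the middleware is enabled from the project's settings.py roughly as sketched below. The priority value 100 is the README's example value rather than a requirement, and the surrounding project is assumed to be an ordinary Scrapy project.

```python
# settings.py fragment (sketch): enable the DeltaFetch spider middleware.
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,  # example priority, adjust as needed
}

# The middleware stays inactive unless this flag is set.
DELTAFETCH_ENABLED = True
```

With these two settings in place, requests for pages that already produced items in an earlier run should be skipped on the next run.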
# Excerpt from the DeltaFetch spider middleware in scrapy-deltafetch
def process_spider_output(self, response, result, spider):
    for r in result:
        if isinstance(r, Request):  # Is this a request (a URL)? If so, check it below
            key = self._get_key(r)  # Generate the key with _get_key()
            if key in self.db:  # Is the key already in the database?
                logger.info("Ignoring already visited: %s" % r)  # Log that the key is in the database and the request is ignored
                if self.stats:
                    self.stats.inc_value('deltafetch/skipped', spider=spider)
                continue
        elif isinstance(r, (BaseItem, dict)):  # An item coming out of the spider
            key = self._get_key(response.request)  # Dedup only the URL of the result page (the page that yielded data), not the intermediate pages
            self.db[key] = str(time.time())  # Store the key in the database together with a timestamp
            if self.stats:
                self.stats.inc_value('deltafetch/stored', spider=spider)
        yield r

def _get_key(self, request):
    # Option 1: a unique identifier you design yourself (deltafetch_key in request.meta).
    # Option 2: the fingerprint produced by Scrapy's built-in dedup scheme; its source shows that it uses a hash algorithm.
    key = request.meta.get('deltafetch_key') or request_fingerprint(request)
    # request_fingerprint() returns `hashlib.sha1().hexdigest()`, is a string
    return to_bytes(key)
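The key point of the excerpt above is that a URL is recorded only when its page actually yields an item, and later requests for a recorded URL are dropped. The toy model below imitates that behaviour without Scrapy. `TinyDeltaFetch` is a hypothetical class, a plain dict stands in for the Berkeley DB file, strings stand in for Request objects, and dicts stand in for items.

```python
import time


class TinyDeltaFetch:
    """Toy stand-in for the middleware logic; not the real implementation."""

    def __init__(self):
        self.db = {}  # key (bytes) -> timestamp string

    def process_output(self, response_url, results):
        for r in results:
            if isinstance(r, str):            # a str models a follow-up request URL
                if r.encode() in self.db:
                    continue                  # its page already produced an item: skip
            elif isinstance(r, dict):         # a dict models a scraped item
                self.db[response_url.encode()] = str(time.time())
            yield r


mw = TinyDeltaFetch()
# First pass: page /1 yields an item and a link to /2.
first = list(mw.process_output("http://a/1", [{"title": "x"}, "http://a/2"]))
# Second pass: an index page links to /1 and /2; only /1 was recorded.
second = list(mw.process_output("http://a/0", ["http://a/1", "http://a/2"]))
print(first, second)
```

Note that /2 is still followed in the second pass: it was requested earlier, but it never produced an item, so nothing was stored for it.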
"""
This module provides some useful functions for working with
scrapy.http.Request objects
"""

from __future__ import print_function
import hashlib
import weakref
from six.moves.urllib.parse import urlunparse

from w3lib.http import basic_auth_header
from scrapy.utils.python import to_bytes, to_native_str

from w3lib.url import canonicalize_url
from scrapy.utils.httpobj import urlparse_cached


_fingerprint_cache = weakref.WeakKeyDictionary()
def request_fingerprint(request, include_headers=None):
    """
    Return the request fingerprint.

    The request fingerprint is a hash that uniquely identifies the resource the
    request points to. For example, take the following two urls:

    http://www.example.com/query?id=111&cat=222
    http://www.example.com/query?cat=222&id=111

    Even though those are two different URLs both point to the same resource
    and are equivalent (ie. they should return the same response).

    Another example are cookies used to store session ids. Suppose the
    following page is only accessible to authenticated users:

    http://www.example.com/members/offers.html

    Lot of sites use a cookie to store the session id, which adds a random
    component to the HTTP Request and thus should be ignored when calculating
    the fingerprint.

    For this reason, request headers are ignored by default when calculating
    the fingerprint. If you want to include specific headers use the
    include_headers argument, which is a list of Request headers to include.

    """
    if include_headers:
        include_headers = tuple(to_bytes(h.lower())
                                 for h in sorted(include_headers))
    cache = _fingerprint_cache.setdefault(request, {})
    if include_headers not in cache:
        fp = hashlib.sha1()  # Hash algorithm: produces a fingerprint that serves as the unique identifier
        fp.update(to_bytes(request.method))
        fp.update(to_bytes(canonicalize_url(request.url)))
        fp.update(request.body or b'')
        if include_headers:
            for hdr in include_headers:
                if hdr in request.headers:
                    fp.update(hdr)
                    for v in request.headers.getlist(hdr):
                        fp.update(v)
        cache[include_headers] = fp.hexdigest()
    return cache[include_headers]


def request_authenticate(request, username, password):
    """Authenticate the given request (in place) using the HTTP basic access
    authentication mechanism (RFC 2617) and the given username and password
    """
    request.headers['Authorization'] = basic_auth_header(username, password)


def request_httprepr(request):
    """Return the raw HTTP representation (as bytes) of the given request.
    This is provided only for reference since it's not the actual stream of
    bytes that will be sent when performing the request (that's controlled
    by Twisted).
    """
    parsed = urlparse_cached(request)
    path = urlunparse(('', '', parsed.path or '/', parsed.params, parsed.query, ''))
    s = to_bytes(request.method) + b" " + to_bytes(path) + b" HTTP/1.1\r\n"
    s += b"Host: " + to_bytes(parsed.hostname or b'') + b"\r\n"
    if request.headers:
        s += request.headers.to_string() + b"\r\n"
    s += b"\r\n"
    s += request.body
    return s


def referer_str(request):
    """ Return Referer HTTP header suitable for logging. """
    referrer = request.headers.get('Referer')
    if referrer is None:
        return referrer
    return to_native_str(referrer, errors='replace')
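To see why the two example URLs in the docstring above end up with the same fingerprint, here is a standard-library-only approximation. `simple_fingerprint` is a hypothetical helper: it sorts the query parameters as a rough stand-in for w3lib's `canonicalize_url`, which does considerably more, so the digests it produces are not guaranteed to match Scrapy's.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def simple_fingerprint(method, url, body=b""):
    """SHA-1 over method + roughly canonicalized URL + body (illustrative only)."""
    parts = urlsplit(url)
    # Sorting the query pairs makes parameter order irrelevant.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    canonical = urlunsplit((parts.scheme, parts.netloc, parts.path or "/", query, ""))
    fp = hashlib.sha1()
    fp.update(method.encode())
    fp.update(canonical.encode())
    fp.update(body)
    return fp.hexdigest()


fp_a = simple_fingerprint("GET", "http://www.example.com/query?id=111&cat=222")
fp_b = simple_fingerprint("GET", "http://www.example.com/query?cat=222&id=111")
fp_post = simple_fingerprint("POST", "http://www.example.com/query?id=111&cat=222")
print(fp_a == fp_b, fp_a == fp_post)
```

The method and body are part of the hash, so a POST to the same URL is treated as a different request, while headers (and therefore session cookies) are left out.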
(3) Hands-on example
Create a project named spider_city_58, then generate the spider.py spider
(1) Modify spider.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['58.com']
    start_urls = ['http://cd.58.com/']

    def parse(self, response):
        # Schedule two more city pages; the second one is pre-marked as seen
        # in the pipeline, so the dupefilter should drop it.
        yield Request('http://bj.58.com', callback=self.parse)
        yield Request('http://wh.58.com', callback=self.parse)
(2) Create init_utils.py and edit it
#author: "xian"
#date: 2018/6/1
from scrapy.http import Request


def init_add_request(spider, url):
    """
    Call this when Scrapy starts up to register URLs that have already been
    crawled, so the crawler does not fetch them again.
    """
    rf = spider.crawler.engine.slot.scheduler.df  # Locate the instantiated dupefilter object

    request = Request(url)
    rf.request_seen(request)  # Call request_seen() so the URL is recorded as already seen
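Calling `request_seen` once is enough to pre-register a URL because the dupefilter records the fingerprint the first time it sees it and reports a duplicate on every later call. The model below shows that behaviour in isolation. `TinyDupeFilter` is a hypothetical class that hashes the raw URL string, which is an assumption made to keep the sketch free of Scrapy; the real filter fingerprints the whole request.

```python
import hashlib


class TinyDupeFilter:
    """Toy model of a fingerprint-based duplicate filter."""

    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, url):
        fp = hashlib.sha1(url.encode()).hexdigest()
        if fp in self.fingerprints:
            return True              # duplicate: the scheduler would drop it
        self.fingerprints.add(fp)    # first sighting: remember it
        return False


df = TinyDupeFilter()
seeded = df.request_seen("http://wh.58.com")    # pre-seeding at startup
repeat = df.request_seen("http://wh.58.com")    # the spider yields it later
other = df.request_seen("http://bj.58.com")     # a URL that was not seeded
print(seeded, repeat, other)
```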
(3) Modify pipeline.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

from .init_utils import init_add_request


class City58Pipeline(object):
    def process_item(self, item, spider):
        return item

    def open_spider(self, spider):
        init_add_request(spider, 'http://wh.58.com')
(4) Modify settings.py
(5) Create the test file main.py
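The settings change itself is not shown here. Since the pipeline's `open_spider` hook only runs when the pipeline is enabled, one plausible version is the fragment below. The dotted path is an assumption: it presumes the project package is named spider_city_58 and that the class lives in Scrapy's default pipelines.py module, so adjust it to the actual layout. The value 300 is simply the conventional example priority.

```python
# settings.py fragment (sketch): enable the pipeline so that open_spider()
# runs at startup. The module path below is assumed, not taken from the source.
ITEM_PIPELINES = {
    "spider_city_58.pipelines.City58Pipeline": 300,
}
```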
#author: "xian"
#date: 2018/6/1
from scrapy.cmdline import execute

execute('scrapy crawl spider'.split())
Run result:
Conclusion: deduplication with scrapy-redis will be analyzed in a later post!