

4. Why do we need URL deduplication?

When running a crawler, we do not want the same page to be downloaded more than once: repeat downloads waste CPU and add load to the crawl engine, so the URLs we crawl need to be deduplicated. There is a second reason: when crawling at large scale, if a failure occurs we do not want to re-run URLs that have already been crawled (re-running them wastes resources and time).

5. How strong does the deduplication need to be?

Here, the crawl cycle is used to decide how strong the deduplication needs to be (a minimal sketch of the idea follows this list):

If the crawl cycle is within one hour, there is no need to persist the crawled links (although storing the URLs makes it easier to design an incremental crawl scheme later).

If the cycle is within one day (or the total is below roughly 300,000 URLs), a simple persistence of the crawled links is enough.

If the cycle is longer than one day, the crawled links must be fully persisted.
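To make the three levels concrete, here is a minimal sketch (my own illustration, not code from the original post) of a seen-URL filter: short cycles can use it purely in memory, while longer cycles pass a file path so the fingerprints survive restarts.

import hashlib
import os

class SeenUrls(object):
    """Illustrative seen-URL filter; persist_path=None means in-memory only."""

    def __init__(self, persist_path=None):
        self.persist_path = persist_path
        self.fingerprints = set()
        # For longer crawl cycles, reload the fingerprints stored by earlier runs.
        if persist_path and os.path.exists(persist_path):
            with open(persist_path) as f:
                self.fingerprints.update(line.strip() for line in f)

    def _fingerprint(self, url):
        # SHA1 of the URL, similar in spirit to Scrapy's request fingerprint.
        return hashlib.sha1(url.encode('utf-8')).hexdigest()

    def seen(self, url):
        """Return True if the URL was already crawled, otherwise record it."""
        fp = self._fingerprint(url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.persist_path:
            with open(self.persist_path, 'a') as f:
                f.write(fp + '\n')
        return False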

Step 2: Install the dependent packages.

Step 3: Install scrapy-deltafetch.

Open a terminal and install it with a single command: pip install scrapy-deltafetch

The following shows the package installation process on Ubuntu 16.04 (see this blog post: http://jinbitou.net/2018/01/27/2579.html).

Here is the output of a successful installation. First install the Berkeley DB database itself,

then install the Python binding bsddb3, and finally scrapy-deltafetch:

(course-python3.5-env) bourne@ubuntu:~$ pip install bsddb3
Collecting bsddb3
  Using cached https://files.pythonhosted.org/packages/ba/a7/131dfd4e3a5002ef30e20bee679d5e6bcb2fcc6af21bd5079dc1707a132c/bsddb3-6.2.5.tar.gz
Building wheels for collected packages: bsddb3
  Running setup.py bdist_wheel for bsddb3 ... done
  Stored in directory: /home/bourne/.cache/pip/wheels/58/8e/e5/bfbc89dd084aa896e471476925d48a713bb466842ed760d43c
Successfully built bsddb3
Installing collected packages: bsddb3
Successfully installed bsddb3-6.2.5
(course-python3.5-env) bourne@ubuntu:~$ pip install scrapy-deltafetch
Collecting scrapy-deltafetch
  Using cached https://files.pythonhosted.org/packages/90/81/08bd21bc3ee364845d76adef09d20d85d75851c582a2e0bb7f959d49b8e5/scrapy_deltafetch-1.2.1-py2.py3-none-any.whl
Requirement already satisfied: bsddb3 in ./course-python3.5-env/lib/python3.5/site-packages (from scrapy-deltafetch) (6.2.5)
Requirement already satisfied: Scrapy>=1.1.0 in ./course-python3.5-env/lib/python3.5/site-packages (from scrapy-deltafetch) (1.5.0)
(... further "Requirement already satisfied" lines for Scrapy's own dependencies: Twisted, lxml, parsel, w3lib, cssselect, queuelib, pyOpenSSL, cryptography, service-identity, etc. ...)
Installing collected packages: scrapy-deltafetch
Successfully installed scrapy-deltafetch-1.2.1
(course-python3.5-env) bourne@ubuntu:~$
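Before wiring anything into a project, a quick sanity check of my own (not part of the original post) confirms that both packages are importable in the same virtualenv:

# Both the Berkeley DB binding and the deltafetch middleware should import
# cleanly; DeltaFetch is the spider middleware enabled later in settings.py.
import bsddb3
import scrapy_deltafetch

print(bsddb3)
print(scrapy_deltafetch.DeltaFetch)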

 

The key logic of scrapy-deltafetch sits in two methods of its DeltaFetch spider middleware, process_spider_output() and _get_key():

def process_spider_output(self, response, result, spider):
    for r in result:
        if isinstance(r, Request):
            # it is a Request (a URL), so go on to the next step
            key = self._get_key(r)  # generate the key via _get_key()
            if key in self.db:
                # the key is already in the database: log it and skip the request
                logger.info("Ignoring already visited: %s" % r)
                if self.stats:
                    self.stats.inc_value('deltafetch/skipped', spider=spider)
                continue
        elif isinstance(r, (BaseItem, dict)):
            # an item came out of the spider, so this page actually produced data;
            # only such result pages are deduplicated, not the intermediate pages
            key = self._get_key(response.request)
            self.db[key] = str(time.time())  # store the key in the database with a timestamp
            if self.stats:
                self.stats.inc_value('deltafetch/stored', spider=spider)
        yield r

def _get_key(self, request):
    # Either a unique identifier you design yourself (deltafetch_key in
    # request.meta), or the fingerprint generated by Scrapy's built-in
    # request_fingerprint(); following its source shows it uses a hashing algorithm.
    key = request.meta.get('deltafetch_key') or request_fingerprint(request)
    # request_fingerprint() returns hashlib.sha1().hexdigest(), which is a string
    return to_bytes(key)
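As the comment in _get_key() points out, the built-in fingerprint can be replaced by an identifier of your own by putting a deltafetch_key into request.meta. A minimal sketch (my own example; the spider and selectors are hypothetical):

import scrapy
from scrapy.http import Request

class ItemSpider(scrapy.Spider):
    name = 'item_spider'                # hypothetical spider, for illustration only
    start_urls = ['http://cd.58.com/']

    def parse(self, response):
        for href in response.css('a::attr(href)').extract():
            yield Request(
                response.urljoin(href),
                callback=self.parse_item,
                meta={'deltafetch_key': href},  # custom key read by _get_key()
            )

    def parse_item(self, response):
        yield {'url': response.url}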
 1 "" 2 This module provides some useful functions for working with 3 Scrapy.http.Request Objects 4 "" "5 6 from __futur E__ Import print_function 7 import hashlib 8 import weakref 9 from Six.moves.urllib.parse import Urlunparse10 one from W3li B.http Import basic_auth_header12 from Scrapy.utils.python import to_bytes, to_native_str13 from w3lib.url import canon Icalize_url15 from scrapy.utils.httpobj import urlparse_cached16 _fingerprint_cache = Weakref. Weakkeydictionary () def request_fingerprint (Request, Include_headers=none): "" "Return the request fingerprint.22 the request fingerprint is a hash of uniquely identifies the resource The24 request points to. For example, take the following, urls:25, http://www.example.com/query?id=111&cat=22227 http://www.example.com /query?cat=222&id=11128 even though those is the different URLs both point to the same resource30 and is Equivale NT (ie. they should return the same response). Another example is COokies used to store session IDs. Suppose the33 following page is only accesible to authenticated users:34 3 http://www.example.com/members/offers.html36 7 lot of sites use a cookie to store the session ID, which adds a random38 component to the HTTP Request and thus should b E ignored when calculating39 the fingerprint.40 to this reason, request headers is ignored by default when Calculatin G42 the Fingeprint. If you want to include specific headers use the43 include_headers argument, which are a list of Request headers to include. "" "" "if include_headers:47 include_headers = tuple (To_bytes (H.lower ()) for H in sorted (include_headers)) = _fingerprint_cache.setdefault (Request, {}) if include_headers not in cache:51 fp = HASHLIB.SHA1 () #哈希算法, generates a dark pattern for a unique Mark Fp.update (To_bytes (Request.method)) fp.update (To_bytes (Canonicalize_url)) Request.url ( Request.body or B ") if include_headers:56 for HDR in include_headers:57 if HDR in request.headers:58 fp.upDate (HDR) for V in Request.headers.getlist (HDR): Fp.update (v) cache[include_headers] = Fp.hexdigest () return cache[include_headers]63 def request_authenticate (request, username, password): "" "Autenticate the given request Using the HTTP basic ACCESS67 authentication mechanism (RFC 2617) and the given username and password68 "" " equest.headers[' Authorization ' = basic_auth_header (username, password) (REQUEST_HTTPREPR) (Request): 73 "" Return The raw HTTP representation (as bytes) of the given request.74 this was provided only for reference since it's not t  He actual stream of75 bytes that'll be send when performing the request (that's controlled76 by Twisted). "" "parsed = urlparse_cached (Request)-Path = Urlunparse ((",", ", Parsed.path or '/', Parsed.params, Parsed.query, ')") s = to_by TES (Request.method) + B "" + to_bytes (path) + B "http/1.1\r\n" Bayi s + = B "Host:" + to_bytes (parsed.hostname or B ") + B" \r\ N "request.headers:83 s + = requEst.headers.to_string () + B "\ r \ n" s + = b "\ r \ n" + + + request.body86 return s87-referer_str (Request): "" "Re Turn Referer HTTP header suitable for logging.  "" "referrer = Request.headers.get (' Referer ') if referrer is none:93 return referrer94 return To_native_str (referrer, errors= ' replace ')

(3) Hands-on example

Create a project named spider_city_58 and generate the spider.py crawler.

(1) Modify spider.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['58.com']
    start_urls = ['http://cd.58.com/']

    def parse(self, response):
        yield Request('http://bj.58.com', callback=self.parse)
        yield Request('http://wh.58.com', callback=self.parse)

(2) Create init_utils.py

# author: "xian"
# date: 2018/6/1
from scrapy.http import Request


def init_add_request(spider, url):
    """
    Add URLs that have already been crawled when Scrapy starts,
    so the spider does not need to crawl them again.
    """
    rf = spider.crawler.engine.slot.scheduler.df  # the scheduler's dupefilter instance
    request = Request(url)
    rf.request_seen(request)  # mark this request as already seen
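For reference, the df object grabbed here is the scheduler's dupefilter (Scrapy's default RFPDupeFilter). Its request_seen() method, paraphrased from memory of Scrapy ~1.5 (check scrapy/dupefilters.py for the exact source), records the request fingerprint, which is why pre-feeding a Request marks that URL as already crawled:

def request_seen(self, request):
    fp = self.request_fingerprint(request)  # same SHA1 fingerprint as above
    if fp in self.fingerprints:
        return True                         # already seen -> the request is dropped
    self.fingerprints.add(fp)
    if self.file:
        self.file.write(fp + os.linesep)    # persisted when JOBDIR is set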

(3) Modify pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

from .init_utils import init_add_request


class City58Pipeline(object):
    def process_item(self, item, spider):
        return item

    def open_spider(self, spider):
        init_add_request(spider, 'http://wh.58.com')

(4) Modify settings.py
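A minimal sketch of the settings this example needs, assuming the project is named spider_city_58 with the default pipelines.py layout (the priority numbers are just the usual examples):

# settings.py (sketch)

# Enable the pipeline that pre-feeds the already-crawled URL on start-up.
ITEM_PIPELINES = {
    'spider_city_58.pipelines.City58Pipeline': 300,
}

# Enable scrapy-deltafetch as a spider middleware (these are the settings
# documented by the scrapy-deltafetch project).
SPIDER_MIDDLEWARES = {
    'scrapy_deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True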

(5) Create the test file main.py

# author: "xian"
# date: 2018/6/1
from scrapy.cmdline import execute

execute('scrapy crawl spider'.split())

Run result: because the pipeline pre-seeded http://wh.58.com into the dupefilter, that request should be filtered out as a duplicate when the spider yields it, while http://bj.58.com is still crawled.

Conclusion: deduplication with scrapy-redis will be analyzed in a follow-up post!
