(2) What should Scrapy do for distributed crawlers? Reflections on Scrapy and an introduction to its core objects
This article looks at what a crawler framework should provide, introduces Scrapy's core components, and walks through the usual way of working with them:
I. Conjecture
A crawler, as we usually think of it, contains at least the following basic elements:
1. A request sender (an encapsulation of the request, e.g. headers, to avoid being blocked)
2. A document parser (treats the fetched page as an HTML document or a plain string)
3. A carrier for the parsed results (a standard-format data container)
4. An operator for the extracted objects (saves them to a file or a database)
5. An error handler for the whole process (an exception monitor over the entire flow)
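The five roles above can be sketched in plain Python, independent of any framework. All names here are illustrative (they are not Scrapy's API), and the sender returns a canned document so the sketch stays self-contained:

```python
import json
import re

# 1. The sender: wraps request details (e.g. headers) so requests look
#    like a browser and are less likely to be blocked.
class Sender:
    def __init__(self, user_agent="Mozilla/5.0"):
        self.headers = {"User-Agent": user_agent}

    def fetch(self, url):
        # A real sender would issue an HTTP request with self.headers;
        # a canned document keeps this sketch self-contained.
        return "<a class='titlelnk' href='/post/1'>Hello</a>"

# 2./3. The parser: treats the page as a string and fills the data carriers.
LINK = re.compile(r"<a class='titlelnk' href='([^']*)'>([^<]*)</a>")

def parse(html):
    return [{"Title": text, "TitleUrl": url} for url, text in LINK.findall(html)]

# 4. The operator: persists the parsed items (here, serialized to JSON).
def store(items):
    return json.dumps(items)

# 5. The error handler: monitors the entire flow.
def crawl(url):
    try:
        return store(parse(Sender().fetch(url)))
    except Exception as exc:
        return json.dumps({"error": str(exc)})

print(crawl("http://www.cnblogs.com/"))  # [{"Title": "Hello", "TitleUrl": "/post/1"}]
```

A framework like Scrapy provides a polished, pluggable version of each of these roles, which is what the next section verifies.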
II. Verification
Let's take a look at what core objects Scrapy provides
Basic concepts:
- Command line tool: manages Scrapy projects
- Items: define the data to be crawled
- Spiders: write the crawling rules for a website
- Selectors: extract data from web pages using XPath
- Scrapy terminal (Scrapy shell): test data-extraction code in an interactive environment
- Item Loaders: populate items with the crawled data
- Item Pipeline: post-process and store the crawled data
- Feed exports: output crawled data in different formats to different storage backends
- Link Extractors: convenience classes for extracting follow-up links
Reference: https://scrapy-chs.readthedocs.org/zh_CN/0.24/
As expected, every object we conjectured has a counterpart in Scrapy.
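Take Selectors as an example: the concept boils down to evaluating XPath expressions against the page. The stdlib's xml.etree supports a small XPath subset, enough to mimic the idea on a made-up snippet (Scrapy's own selectors are lxml-backed and far more capable):

```python
import xml.etree.ElementTree as ET

html = """<html><body>
<a class="titlelnk" href="/p/1.html">First post</a>
<a class="other" href="/about">About</a>
</body></html>"""

root = ET.fromstring(html)
# Equivalent in spirit to hxs.select('//a[@class="titlelnk"]')
links = root.findall('.//a[@class="titlelnk"]')
titles = [a.text for a in links]
urls = [a.get("href") for a in links]
print(titles, urls)  # ['First post'] ['/p/1.html']
```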
III. Crawling
We know that a crawler generally works by the following recipe:
take a target url => write extraction rules (regular expressions or XPath syntax) => process the extracted data
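That recipe fits in a few lines of stdlib Python; the HTML snippet below is made up for illustration:

```python
import re

# Target page content (normally fetched from the url).
html = '<a class="titlelnk" href="http://www.cnblogs.com/p/1.html">Post One</a>'

# Rule: a regular expression that captures each link and its text.
rule = re.compile(r'<a class="titlelnk" href="([^"]+)">([^<]+)</a>')

# Process: turn every match into a record we can store later.
records = [{"url": u, "title": t} for u, t in rule.findall(html)]
print(records)
```

Scrapy's value is in industrializing each step of this recipe, as the project walkthrough below shows.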
The Scrapy method is as follows:
1) Create a project
On the command line, switch to the folder where the code should live and enter the following command:
scrapy startproject cnblogs
This generates a cnblogs directory under that folder; switch into it (don't forget this step).
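For reference, a freshly generated project looks roughly like this (layout of Scrapy 0.24-era projects; minor details may differ by version):

```
cnblogs/
    scrapy.cfg            # deploy configuration
    cnblogs/
        __init__.py
        items.py          # the data carriers
        pipelines.py      # post-processing of items
        settings.py       # project settings
        spiders/
            __init__.py   # spiders go in this folder
```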
items.py holds the data carriers we need. Modify it to the following code:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy.item import Field, Item


class CnblogsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    Title = Field()
    TitleUrl = Field()
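scrapy.Item behaves like a dict that only accepts the declared fields; assigning an undeclared key raises KeyError, and field names are case-sensitive. A rough stdlib emulation of that behavior (illustrative only, not Scrapy's actual implementation):

```python
class Field(dict):
    """Stands in for scrapy.Field: a container for field metadata."""

class Item(dict):
    """Dict-like, but only declared Field attributes may be set."""
    def __setitem__(self, key, value):
        declared = [k for k, v in type(self).__dict__.items() if isinstance(v, Field)]
        if key not in declared:
            raise KeyError("%r is not a declared field" % key)
        dict.__setitem__(self, key, value)

class CnblogsItem(Item):
    Title = Field()
    TitleUrl = Field()

item = CnblogsItem()
item["Title"] = "A post"       # OK: declared field
try:
    item["title"] = "oops"     # rejected: field names are case-sensitive
except KeyError as exc:
    print("rejected:", exc)
```

This is why the spider below must assign item['Title'] and item['TitleUrl'], matching the declared field names exactly.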
Add BasicGroupSpider.py under the spiders folder with the following content (note: the original code assigned item['title'] and item['titleurl'], which would raise KeyError because the declared fields are Title and TitleUrl; the keys are corrected here):

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from cnblogs.items import CnblogsItem


class CnblogsSpider(BaseSpider):
    name = "cnblogs"                          # spider name
    allowed_domains = ["cnblogs.com"]
    start_urls = ['http://www.cnblogs.com/']

    def parse(self, response):
        self.log("Fetch cnblogs homepage: %s" % response.url)
        hxs = HtmlXPathSelector(response)
        # items = hxs.select('//a[@class="titlelnk"]')
        items = hxs.select('//a[contains(@class, "titlelnk")]')
        listitems = []
        for author in items:
            # print author.select('text()').extract()
            item = CnblogsItem()
            # keys must match the fields declared in CnblogsItem
            item['Title'] = author.select('text()').extract()
            item['TitleUrl'] = author.select('@href').extract()
            listitems.append(item)
        return listitems
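Point 4 of our conjecture (the operator that saves results) maps onto Scrapy's Item Pipeline. Below is a minimal sketch of one; the class name and in-memory storage are our own inventions (a real pipeline would open a file in open_spider and be enabled via ITEM_PIPELINES in settings.py):

```python
import json

class JsonLinesPipeline(object):
    """Illustrative pipeline: collects each scraped item as one JSON line."""

    def open_spider(self, spider):
        self.lines = []  # a real pipeline would open a file here

    def process_item(self, item, spider):
        self.lines.append(json.dumps(dict(item)))
        return item  # always return the item so later pipeline stages see it

# Roughly how Scrapy would drive it:
pipe = JsonLinesPipeline()
pipe.open_spider(spider=None)
pipe.process_item({"Title": ["A post"], "TitleUrl": ["/p/1.html"]}, spider=None)
print(pipe.lines[0])
```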
OK. Go back to the console interface of step 1 and enter the following command:
scrapy crawl cnblogs --logfile=test.log -o cnblogs.json -t json
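The -o cnblogs.json -t json flags export the collected items as a JSON array (this is the Feed exports component at work). Once the crawl finishes, the file can be inspected with a few lines of stdlib Python; the sample below is made up, shaped like the exporter's output:

```python
import json

# What cnblogs.json looks like: a list of exported items. Each field is a
# list of strings, because Selector.extract() returns lists.
sample = '[{"Title": ["Post one"], "TitleUrl": ["/p/1.html"]}]'

for entry in json.loads(sample):
    print(entry["Title"][0], "->", entry["TitleUrl"][0])  # Post one -> /p/1.html
```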
IV. Results
The code above is straightforward; anyone who has written a crawler before will recognize what each part does. After the crawl, the scraped items are in cnblogs.json and the log is in test.log.
Summary:
This article analyzed the general components of a crawler framework and verified our conjecture against Scrapy's core objects. There are many Python crawler frameworks, but few are as worth picking up and studying in depth as Scrapy.