One: CrawlSpider introduction
CrawlSpider is a subclass of Spider. In addition to the features it inherits from Spider, it adds its own more powerful capabilities, the most notable of which is the LinkExtractor (link extractor). Spider is the base class for all crawlers and is designed to crawl only the pages in the start_urls list; when you want to keep crawling the URLs extracted from the pages that have already been crawled, CrawlSpider is the more appropriate choice.
Two: CrawlSpider usage
Example: crawl the author and content of each post from the paginated pages of https://www.qiushibaike.com
1. Create a Scrapy project
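For example (a sketch of the commands; the project name crawlPro is assumed here so that it matches the imports used later):

scrapy startproject crawlPro
cd crawlPro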
2. Create a crawler file
Note: compared with the usual genspider instruction, the extra "-t crawl" option means the spider that gets created is based on the CrawlSpider class rather than the base Spider class.
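A sketch of the command, assuming the spider name crawldemo and the target domain used in the code below:

scrapy genspider -t crawl crawldemo www.qiushibaike.com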
3. The resulting directory structure is as follows:
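An illustrative layout, assuming the project name crawlPro and the spider file crawldemo.py created above (details vary slightly between Scrapy versions):

crawlPro/
├── scrapy.cfg
└── crawlPro/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── crawldemo.py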
crawldemo.py Crawler File Settings:
LinkExtractor: as the name implies, the link extractor; it extracts the links matching a specified rule (for example the allow regular expression) from the response.
Rule: the rule parser. It fetches the pages behind the links pulled out by the link extractor and parses their content according to the specified rule (the callback function).
The Rule parameters are as follows:
Parameter 1: the link extractor to use.
Parameter 2: the callback function, i.e. the rule used to parse the data of each extracted page.
Parameter 3 (follow): whether the link extractor should also be applied to the pages it extracts; when callback is None, follow defaults to True, otherwise it defaults to False.
rules = (): stores one or more rule parsers; each Rule object represents one extraction rule. The snippet below puts the three parameters together.
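A minimal sketch of a single rule, using the pagination pattern that appears in the full spider file further down:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = (
    Rule(
        LinkExtractor(allow=r'/8hr/page/\d+'),  # parameter 1: the link extractor
        callback='parse_item',                  # parameter 2: callback that parses each extracted page
        follow=True,                            # parameter 3: keep applying the extractor to the extracted pages
    ),
)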
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from crawlPro.items import CrawlproItem


class CrawldemoSpider(CrawlSpider):
    name = 'crawldemo'
    # allowed_domains = ['www.qiushibaike.com']
    start_urls = ['http://www.qiushibaike.com/']

    # The rules tuple stores the rule parsers (each one encapsulating a particular parsing rule)
    rules = (
        # Rule: the rule parser; it parses the pages fetched from the links the link extractor
        # pulled out, according to the specified rule (the callback function)
        # LinkExtractor: the link extractor walks the response of the start URL and extracts
        # the URLs matching the pattern
        Rule(LinkExtractor(allow=r'/8hr/page/\d+'), callback='parse_item', follow=True),
        # follow=True keeps following the extracted pages so every page is reached
        # (duplicate requests are filtered out automatically)
    )

    def parse_item(self, response):
        # i = {}
        # i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        # i['name'] = response.xpath('//div[@id="name"]').extract()
        # i['description'] = response.xpath('//div[@id="description"]').extract()
        # return i
        divs = response.xpath('//div[@id="content-left"]/div')
        for div in divs:
            item = CrawlproItem()
            # extract the author of a joke
            item['author'] = div.xpath('./div[@class="author clearfix"]/a[2]/h2/text()').extract_first().strip('\n')
            # extract the content of a joke
            item['content'] = div.xpath('.//div[@class="content"]/span/text()').extract_first().strip('\n')
            yield item  # submit the item to the pipeline
items.py File Settings:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class CrawlproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()
pipelines.py Pipeline File Settings:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class CrawlproPipeline(object):

    def __init__(self):
        self.fp = None

    def open_spider(self, spider):
        print('Start crawler')
        self.fp = open('./data.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # write each item submitted by the spider to the file for persistent storage
        self.fp.write(item['author'] + ':' + item['content'] + '\n')
        return item

    def close_spider(self, spider):
        print('End crawler')
        self.fp.close()
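For the pipeline to actually receive items, it also has to be enabled in settings.py. A minimal sketch, assuming the project module is named crawlPro as above (300 is just a conventional priority value):

ITEM_PIPELINES = {
    'crawlPro.pipelines.CrawlproPipeline': 300,
}

The spider can then be run with "scrapy crawl crawldemo", after which the pipeline writes the collected authors and contents to ./data.txt.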