One: CrawlSpider introduction
CrawlSpider is a subclass of Spider. In addition to the features it inherits from Spider, it adds its own more powerful capabilities, the most notable of which is the LinkExtractor (link extractor). Spider is the base class for all crawlers and is designed to crawl only the pages in the start_urls list; when you want to keep crawling the URLs extracted from the pages that have already been crawled, CrawlSpider is the more appropriate choice.
Two: CrawlSpider usage
Example: crawl the author and content of each post from the paginated pages of https://www.qiushibaike.com
1. Create a Scrapy project
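For example (a sketch of the commands; the project name crawlPro is assumed here so that it matches the imports used later):

scrapy startproject crawlPro
cd crawlPro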
2. Create a crawler file
Note: compared with the usual genspider instruction, the extra "-t crawl" option means the spider that gets created is based on the CrawlSpider class rather than the base Spider class.
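A sketch of the command, assuming the spider name crawldemo and the target domain used in the code below:

scrapy genspider -t crawl crawldemo www.qiushibaike.com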
3. The resulting directory structure is as follows:
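An illustrative layout, assuming the project name crawlPro and the spider file crawldemo.py created above (details vary slightly between Scrapy versions):

crawlPro/
├── scrapy.cfg
└── crawlPro/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── crawldemo.py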
crawldemo.py Crawler File Settings:
LinkExtractor: as the name implies, the link extractor; it extracts the links matching a specified rule (for example the allow regular expression) from the response.
Rule: the rule parser. It fetches the pages behind the links pulled out by the link extractor and parses their content according to the specified rule (the callback function).
The Rule parameters are as follows:
Parameter 1: the link extractor to use.
Parameter 2: the callback function, i.e. the rule used to parse the data of each extracted page.
Parameter 3 (follow): whether the link extractor should also be applied to the pages it extracts; when callback is None, follow defaults to True, otherwise it defaults to False.
rules = (): stores one or more rule parsers; each Rule object represents one extraction rule. The snippet below puts the three parameters together.
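A minimal sketch of a single rule, using the pagination pattern that appears in the full spider file further down:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = (
    Rule(
        LinkExtractor(allow=r'/8hr/page/\d+'),  # parameter 1: the link extractor
        callback='parse_item',                  # parameter 2: callback that parses each extracted page
        follow=True,                            # parameter 3: keep applying the extractor to the extracted pages
    ),
)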
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from crawlPro.items import CrawlproItem


class CrawldemoSpider(CrawlSpider):
    name = 'crawldemo'
    # allowed_domains = ['www.qiushibaike.com']
    start_urls = ['http://www.qiushibaike.com/']

    # The rules tuple stores the rule parsers (each one encapsulating a particular parsing rule)
    rules = (
        # Rule: the rule parser; it parses the pages fetched from the links the link extractor
        # pulled out, according to the specified rule (the callback function)
        # LinkExtractor: the link extractor walks the response of the start URL and extracts
        # the URLs matching the pattern
        Rule(LinkExtractor(allow=r'/8hr/page/\d+'), callback='parse_item', follow=True),
        # follow=True keeps following the extracted pages so every page is reached
        # (duplicate requests are filtered out automatically)
    )

    def parse_item(self, response):
        # i = {}
        # i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        # i['name'] = response.xpath('//div[@id="name"]').extract()
        # i['description'] = response.xpath('//div[@id="description"]').extract()
        # return i
        divs = response.xpath('//div[@id="content-left"]/div')
        for div in divs:
            item = CrawlproItem()
            # extract the author of a joke
            item['author'] = div.xpath('./div[@class="author clearfix"]/a[2]/h2/text()').extract_first().strip('\n')
            # extract the content of a joke
            item['content'] = div.xpath('.//div[@class="content"]/span/text()').extract_first().strip('\n')
            yield item  # submit the item to the pipeline
items.py File Settings:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class CrawlproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()
pipelines.py Pipeline File Settings:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class CrawlproPipeline(object):

    def __init__(self):
        self.fp = None

    def open_spider(self, spider):
        print('Start crawler')
        self.fp = open('./data.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # write each item submitted by the spider to the file for persistent storage
        self.fp.write(item['author'] + ':' + item['content'] + '\n')
        return item

    def close_spider(self, spider):
        print('End crawler')
        self.fp.close()
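For the pipeline to actually receive items, it also has to be enabled in settings.py. A minimal sketch, assuming the project module is named crawlPro as above (300 is just a conventional priority value):

ITEM_PIPELINES = {
    'crawlPro.pipelines.CrawlproPipeline': 300,
}

The spider can then be run with "scrapy crawl crawldemo", after which the pipeline writes the collected authors and contents to ./data.txt.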