Scrapy crawler framework: CrawlSpider link extractor and rule parser

One: CrawlSpider introduction

CrawlSpider is a subclass of Spider. In addition to the features it inherits from Spider, it adds its own more powerful capabilities, the most notable of which is the LinkExtractor (link extractor). Spider is the base class for all crawlers and is designed only to crawl the pages listed in start_urls; when you want to follow the URLs extracted from the crawled pages and continue crawling from there, CrawlSpider is the more appropriate choice.

Two: CrawlSpider usage

Example: crawl the post author and content from each page of https://www.qiushibaike.com/

1. Create a Scrapy project
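Assuming the project is named crawlPro (the module name used by the imports in the spider code below), the project is created with the standard command:

    scrapy startproject crawlPro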

2. Create a crawler file

Note: compared with the command used before, this one adds the "-t crawl" option, which means the generated crawler is based on the CrawlSpider class rather than on the base Spider class.
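For this example, with the spider named crawldemo and www.qiushibaike.com as the target domain (both taken from the crawler file below), the command would be:

    scrapy genspider -t crawl crawldemo www.qiushibaike.com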

3. The resulting directory structure is as follows:
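A typical layout for a project named crawlPro (the assumed project name) looks like this:

    crawlPro/
    ├── scrapy.cfg
    └── crawlPro/
        ├── __init__.py
        ├── items.py
        ├── middlewares.py
        ├── pipelines.py
        ├── settings.py
        └── spiders/
            ├── __init__.py
            └── crawldemo.py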

crawldemo.py Crawler File Settings:

LinkExtractor: as the name implies, the link extractor; it extracts the links that match a specified pattern from the response.
Rule: the rule parser; it takes the links extracted by the link extractor, requests the corresponding pages, and parses their content according to the specified rule (callback).

The Rule parameters are:

Parameter 1: the link extractor (a LinkExtractor instance) whose links the rule will follow.

Parameter 2: callback, the function that parses the data in the pages fetched by the rule.

Parameter 3: follow, which controls whether the link extractor is also applied to the pages extracted by this rule; when callback is None, follow defaults to True.

rules = (): a tuple holding one or more rule parsers; each Rule object represents one extraction rule. A minimal sketch of such a rule follows.
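The sketch below is only for illustration: ExampleSpider, its name, and the empty parse_item body are placeholders, while the allow pattern and start URL are the ones used in the full crawler file further down.

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class ExampleSpider(CrawlSpider):
        # hypothetical spider used only to illustrate the three Rule parameters
        name = 'example'
        start_urls = ['http://www.qiushibaike.com/']

        rules = (
            Rule(
                LinkExtractor(allow=r'/8hr/page/\d+'),  # parameter 1: the link extractor
                callback='parse_item',                  # parameter 2: the parsing callback
                follow=True,                            # parameter 3: keep following links on the fetched pages
            ),
        )

        def parse_item(self, response):
            pass  # the parsing logic goes here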

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from crawlPro.items import CrawlproItem


class CrawldemoSpider(CrawlSpider):
    name = 'crawldemo'
    # allowed_domains = ['www.qiushibaike.com']
    start_urls = ['http://www.qiushibaike.com/']

    # the rules tuple stores the rule parsers (each encapsulates a parsing rule)
    rules = (
        # Rule: rule parser; it parses every page fetched from the links the
        # extractor found, using the specified callback
        # LinkExtractor: the link extractor walks the start URL's response and
        # extracts the URLs that match the given pattern
        # follow=True keeps applying the rule to the fetched pages so every page
        # is reached (duplicate requests are filtered out automatically)
        Rule(LinkExtractor(allow=r'/8hr/page/\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        divs = response.xpath('//div[@id="content-left"]/div')
        for div in divs:
            item = CrawlproItem()
            # extract the author of a joke
            item['author'] = div.xpath('./div[@class="author clearfix"]/a[2]/h2/text()').extract_first().strip('\n')
            # extract the content of a joke
            item['content'] = div.xpath('.//div[@class="content"]/span/text()').extract_first().strip('\n')
            yield item  # submit the item to the pipeline

items.py File Settings:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class CrawlproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()

pipelines.py Pipeline File Settings:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class CrawlproPipeline(object):
    def __init__(self):
        self.fp = None

    def open_spider(self, spider):
        print('Start crawler')
        self.fp = open('./data.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # write the item submitted by the crawler to the file for persistent storage
        self.fp.write(item['author'] + ':' + item['content'] + '\n')
        return item

    def close_spider(self, spider):
        print('End crawler')
        self.fp.close()
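As the comment in pipelines.py notes, the pipeline only runs if it is registered in the project's settings.py. A minimal sketch, assuming the project module is named crawlPro as in the imports above:

    # settings.py (excerpt)
    ITEM_PIPELINES = {
        'crawlPro.pipelines.CrawlproPipeline': 300,  # 300 is the execution priority
    }

The spider can then be started from the project directory with scrapy crawl crawldemo, after which data.txt should contain one author/content pair per line.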
