CrawlSpider and Middleware

Source: Internet
Author: User

CrawlSpider

I. INTRODUCTION

CrawlSpider is a subclass of Spider. In addition to the features it inherits from Spider, it adds its own more powerful functionality, the most notable of which is the LinkExtractor (link extractor). Spider is the base class of all crawlers and is designed to crawl only the pages in the start_urls list; when the crawl should continue into the URLs extracted from the pages already crawled, CrawlSpider is the more appropriate choice.

II. USAGE

1. Create a Scrapy project: scrapy startproject projectName

2. Create a crawler file: scrapy genspider -t crawl spiderName www.xxx.com

- Compared with the plain genspider command, this one adds "-t crawl", which means the created crawler is based on the CrawlSpider class rather than the Spider base class.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CrawlDemoSpider(CrawlSpider):
    name = 'crawlDemo'
    # allowed_domains = ['www.qiushibaike.com']
    start_urls = ['http://www.qiushibaike.com/']

    # Link extractor: extracts from the page the URLs that match the rule
    link = LinkExtractor(allow=r'/8hr/page/\d+')

    # The rules tuple holds the different rule parsers (each encapsulating a parsing rule)
    rules = (
        # Rule parser: applies the specified rule (the callback function)
        # to every page reached through a link the extractor pulled out
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # print(response.url)
        divs = response.xpath('//div[@id="content-left"]/div')
        for div in divs:
            author = div.xpath('./div[@class="author clearfix"]/a[2]/h2/text()').extract_first()
            print(author)

crawler file
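The extraction idea behind parse_item can be shown without Scrapy. The sketch below uses the standard library's HTMLParser to pull author names out of <h2> tags, mimicking what the XPath in the spider does (Scrapy itself uses lxml-backed selectors; the sample HTML here is made up for illustration):

```python
from html.parser import HTMLParser


class AuthorExtractor(HTMLParser):
    """Collect the text content of every <h2> tag, like the XPath above."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.authors = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        # Only keep text that appears inside an <h2> element
        if self.in_h2 and data.strip():
            self.authors.append(data.strip())


# Hypothetical markup shaped like the page the spider targets
html = '<div id="content-left"><div><h2>alice</h2></div><div><h2>bob</h2></div></div>'
parser = AuthorExtractor()
parser.feed(html)
# parser.authors now holds ["alice", "bob"]
```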

Rule: the rule parser. It parses the content of the pages behind the links pulled out by the link extractor, according to the specified rules.

Rule(LinkExtractor(allow=r'items/'), callback='parse_item', follow=True)

- Parameter description:

Parameter 1: the link extractor.

Parameter 2: the callback function that parses the data according to the specified rules.

Parameter 3: whether the link extractor should keep working on the pages reached through the links it extracted. When callback is None, parameter 3 defaults to True.

LinkExtractor: as the name implies, the link extractor.

    LinkExtractor(
        allow=r'items/',       # links whose URL matches the regular expression are extracted; if empty, everything matches
        deny=xxx,              # links matching this regular expression are NOT extracted
        restrict_xpaths=xxx,   # links matching the XPath expression are extracted
        restrict_css=xxx,      # links matching the CSS expression are extracted
        deny_domains=xxx,      # links under these domains are NOT extracted
    )

- Function: extracts the links in the response that match the rules.

rules=(): holds the different rule parsers. Each Rule object represents one extraction rule.

The overall CrawlSpider crawl process:

a) The crawler file first fetches the page content of the starting URLs.

b) The link extractor pulls the links out of the page content from step a) according to the specified extraction rules.

c) The rule parser parses the content of the pages behind the extracted links according to the specified parsing rules.

d) The parsed data is encapsulated into items and submitted to the pipeline for persistent storage.
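The filtering behind allow= is ordinary regular-expression matching. This small sketch applies the pattern from the spider example to a list of made-up URLs to show which candidate links would survive (real LinkExtractor filtering also handles deny, domains, and deduplication, which are omitted here):

```python
import re

# The allow pattern from the spider example above
ALLOW = re.compile(r"/8hr/page/\d+")

# Hypothetical candidate links found on a crawled page
links = [
    "https://www.qiushibaike.com/8hr/page/2/",
    "https://www.qiushibaike.com/article/12345",
]

# Keep only links whose URL matches the allow pattern
matched = [u for u in links if ALLOW.search(u)]
# matched == ["https://www.qiushibaike.com/8hr/page/2/"]
```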

Middleware

The most common use is setting up a proxy.

Write the proxy logic in middlewares.py:

class MyDaili(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = "http://120.76.231.27:3128"

middlewares.py

Then enable the middleware in settings.py:

# Enable the middleware in settings.py
DOWNLOADER_MIDDLEWARES = {
    'firstBlood.middlewares.FirstbloodSpiderMiddleware': 543,
}

settings.py
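A natural extension of the single-proxy middleware above is rotating through a pool of proxies. The sketch below is an assumption-laden illustration: the class name RandomProxyMiddleware and the second proxy URL are made up, and the stand-in request object only imitates the request.meta attribute that Scrapy's real Request provides:

```python
import random

PROXIES = [
    "http://120.76.231.27:3128",    # the proxy from the example above
    "http://111.111.111.111:8888",  # hypothetical second proxy
]


class RandomProxyMiddleware:
    """Assign a randomly chosen proxy to every outgoing request."""

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(PROXIES)
        return None  # returning None lets processing continue normally


# Quick check with a stand-in for Scrapy's Request object
class _FakeRequest:
    def __init__(self):
        self.meta = {}


req = _FakeRequest()
RandomProxyMiddleware().process_request(req, spider=None)
# req.meta["proxy"] is now one of the entries in PROXIES
```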
