No. 341: Python distributed crawler builds a search engine, Scrapy explained: writing the spider file to loop-crawl content
Writing the spider file to loop-crawl content
The Request() method hands a specified URL to the downloader, which fetches the page. It takes two required parameters:

Parameters:
url='URL'
callback=page-processing function

Request() must be used with yield.
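A minimal sketch of the pattern (the parse_detail callback name here is a placeholder for illustration, not part of the example spider below):

    from scrapy.http import Request

    def parse(self, response):
        # hand the URL to the downloader; self.parse_detail runs on the downloaded response
        yield Request(url='http://blog.jobbole.com/all-posts/', callback=self.parse_detail)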
The parse.urljoin() method, from the parse module of the urllib library, performs automatic URL joining: if the URL passed as the second parameter is a relative path, it is automatically joined onto the first parameter.
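For example, a quick sketch of how parse.urljoin() behaves (the article path is made up for illustration):

    from urllib import parse

    # relative second argument: joined onto the base URL
    parse.urljoin('http://blog.jobbole.com/all-posts/', '/110287/')
    # -> 'http://blog.jobbole.com/110287/'

    # absolute second argument: returned unchanged
    parse.urljoin('http://blog.jobbole.com/all-posts/', 'http://example.com/post/')
    # -> 'http://example.com/post/'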
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request   # Request hands URLs back to the downloader
from urllib import parse          # import the parse module from the urllib library


class PachSpider(scrapy.Spider):
    name = 'pach'
    allowed_domains = ['blog.jobbole.com']                # allowed crawl domain
    start_urls = ['http://blog.jobbole.com/all-posts/']   # start URL

    def parse(self, response):
        """Get the article URLs on the list page and hand them to the downloader"""
        # get the article URLs on the current page
        lb_url = response.xpath('//a[@class="archive-title"]/@href').extract()
        for i in lb_url:
            # print(parse.urljoin(response.url, i))
            # urljoin() from urllib's parse module joins URLs automatically:
            # a relative second argument is joined onto the first
            # hand each article URL to the downloader; parse_wzhang handles the response
            yield Request(url=parse.urljoin(response.url, i), callback=self.parse_wzhang)

        # get the next-page list URL, hand it to the downloader, and loop back into parse()
        x_lb_url = response.xpath('//a[@class="next page-numbers"]/@href').extract()
        if x_lb_url:
            yield Request(url=parse.urljoin(response.url, x_lb_url[0]), callback=self.parse)

    def parse_wzhang(self, response):
        title = response.xpath('//div[@class="entry-header"]/h1/text()').extract()  # get article title
        print(title)
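Assuming the spider lives inside a project created with scrapy startproject, it can be run from the project directory by its name:

    scrapy crawl pach

Each yield Request(...) sends a URL through the downloader; the next-page request calls back into parse(), which is what makes the crawl loop over every list page.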