The parse method of a Scrapy spider can return two kinds of values: an item (a BaseItem subclass) or a Request. Recursive crawling is achieved by returning Requests.
If the data to be crawled is on the current page, parse it and return the item directly (in the code below, replace the commented line with yield item);
If the data to be crawled is on a page that the current page links to, return a Request and specify parse_item as its callback;
If part of the data is on the current page and the rest is on a linked page (as in a blog or forum, where the list page has the title, summary, and URL, while the detail page has the full content), pass the partially filled item to parse_item through the Request's meta parameter; the callback then fills in the item's remaining fields.
To crawl the current page and then another page (such as the next page), return a Request with parse as the callback.
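The third case, carrying a half-filled item through meta, is the least obvious, so here is a minimal Scrapy-free sketch of the pattern. The fake pages, the fetch loop, and the field names are invented stand-ins for Scrapy's real Request/Response machinery:

```python
# Toy model of Scrapy's "partial item + meta" pattern. PAGES and crawl()
# are invented for illustration; in a real spider, Scrapy's engine does
# the fetching and the callback dispatch.
PAGES = {
    "/list": {"posts": [{"title": "Hello", "url": "/post/1"}]},
    "/post/1": {"body": "full article text"},
}

def parse(response):
    # List page: build a partial item, then hand it to the detail-page
    # callback through meta, like Request(url, meta={'item': item}, ...).
    for post in response["posts"]:
        item = {"title": post["title"], "url": post["url"]}
        yield {"request": post["url"], "meta": {"item": item}, "callback": parse_item}

def parse_item(response):
    # Detail page: recover the partial item from meta and finish it.
    item = response["meta"]["item"]
    item["content"] = response["body"]
    yield item

def crawl(start_url):
    # Minimal stand-in for Scrapy's engine: follow requests, collect items.
    items = []
    queue = [{"request": start_url, "meta": {}, "callback": parse}]
    while queue:
        req = queue.pop()
        response = dict(PAGES[req["request"]], meta=req["meta"])
        for result in req["callback"](response):
            (queue if "request" in result else items).append(result)
    return items

print(crawl("/list"))
```

The point of the pattern is that the item itself never lives in a global variable; it travels with the request, so concurrent crawls of many detail pages cannot clobber each other's state.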
It seems a little odd that parse cannot return the item list while the parse_item callback can; I don't know why.
In addition, the text obtained directly with extract() does not include the contents of <a> and other child tags. You can switch to d.xpath('node()').extract() to get the text with the HTML included, and then filter out the tags to obtain plain text.
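A small stdlib sketch of the difference: xml.etree and a regex tag stripper stand in for Scrapy's Selector here, but the semantics are analogous, since text() only selects direct text nodes while node() also selects child elements:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment: the <a> text is lost by a plain text() extraction.
fragment = '<div>Hello <a href="/post">world</a> again</div>'
root = ET.fromstring(fragment)

# Analogue of xpath('text()').extract(): only the direct text nodes of
# <div>, so the contents of <a> are skipped.
direct_text = [t for t in [root.text] + [c.tail for c in root] if t]

# Analogue of xpath('node()').extract() followed by tag filtering:
# serialize the child elements too, then strip the markup.
inner_html = (root.text or "") + "".join(
    ET.tostring(c, encoding="unicode") for c in root
)
plain_text = re.sub(r"<[^>]+>", "", inner_html)

print(direct_text)  # the <a> text "world" is missing
print(plain_text)   # full plain text, tags filtered out
```

Note that a regex is good enough for simple fragments like this, but for arbitrary HTML a proper tag stripper is safer.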
I couldn't find a way to get the inner HTML directly.
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from dirbot.items import Article


class YouyousuiyueSpider(Spider):
    name = "youyousuiyue2"
    allowed_domains = ["youyousuiyue.sinaapp.com"]
    start_urls = ['http://youyousuiyue.sinaapp.com']

    def load_item(self, d):
        item = Article()
        title = d.xpath('header/h1/a')
        item['title'] = title.xpath('text()').extract()
        print item['title'][0]
        item['url'] = title.xpath('@href').extract()
        return item

    def parse_item(self, response):
        item = response.meta['item']
        sel = Selector(response)
        d = sel.xpath('//div[@class="entry-content"]/div')
        item['content'] = d.xpath('text()').extract()
        return item

    def parse(self, response):
        """
        The lines below is a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html

        @url http://youyousuiyue.sinaapp.com
        @scrapes name
        """
        print 'parsing', response.url

        sel = Selector(response)
        articles = sel.xpath('//div[@id="content"]/article')
        for d in articles:
            item = self.load_item(d)
            # If the item were complete on this page: yield item
            yield Request(item['url'][0], meta={'item': item},
                          callback=self.parse_item)

        link = sel.xpath('//div[@class="nav-previous"]/a/@href').extract()[0]
        if link[-1] == '4':
            return
        else:
            print 'yielding', link
            yield Request(link, callback=self.parse)
For the full code, see: https://github.com/junglezax/dirbot
References:
http://doc.scrapy.org/en/latest/intro/tutorial.html
http://www.icultivator.com/p/3166.html