Scrapy: Crawling Web Data Recursively

Source: Internet
Author: User
Tags: xpath

The parse method of a Scrapy spider can return two kinds of values: an item (BaseItem) or a Request. Recursive crawling is achieved through Requests.
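
As an illustration, here is a minimal sketch of the two cases, using the same old-style Scrapy imports as the code further below; the spider, item class, URLs, and XPath expressions are all made-up examples:

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.item import Item, Field


class ExampleItem(Item):
    # hypothetical item with a single field
    title = Field()


class ExampleSpider(Spider):
    name = "example"  # hypothetical spider
    start_urls = ['http://example.com/']

    def parse(self, response):
        sel = Selector(response)
        # Case 1: the data is on the current page -- yield the item directly.
        item = ExampleItem()
        item['title'] = sel.xpath('//h1/text()').extract()
        yield item
        # Case 2: the data is on a linked page -- yield a Request and let
        # the callback build the item from that page.
        for href in sel.xpath('//a[@class="detail"]/@href').extract():
            yield Request(href, callback=self.parse_item)

    def parse_item(self, response):
        sel = Selector(response)
        item = ExampleItem()
        item['title'] = sel.xpath('//h1/text()').extract()
        yield item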

If the data to be crawled is on the current page, you can parse it and yield the item directly (in the code below, change the commented line to yield item);

If the data to be crawled is on a page the current page links to, return a Request and specify parse_item as the callback;

If the data to be crawled is partly on the current page and partly on a page it links to (for example, on a blog or forum the listing page has the title, summary, and URL, while the detail page has the full content), pass the data collected from the current page to parse_item via the Request's meta parameter; parse_item then parses the remaining fields of the item.
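
A minimal sketch of that hand-off (field names, URLs, and selectors are hypothetical):

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.item import Item, Field


class PostItem(Item):
    # hypothetical fields: the listing page supplies title and url,
    # the detail page supplies content
    title = Field()
    url = Field()
    content = Field()


class ListDetailSpider(Spider):
    name = "list_detail"  # hypothetical spider
    start_urls = ['http://example.com/list']

    def parse(self, response):
        sel = Selector(response)
        for d in sel.xpath('//article'):
            item = PostItem()
            # fields available on the listing page
            item['title'] = d.xpath('header/h1/a/text()').extract()
            item['url'] = d.xpath('header/h1/a/@href').extract()
            # pass the half-filled item to the detail-page callback via meta
            yield Request(item['url'][0], meta={'item': item},
                          callback=self.parse_item)

    def parse_item(self, response):
        # recover the partially filled item and complete it
        item = response.meta['item']
        sel = Selector(response)
        item['content'] = sel.xpath('//div[@class="entry-content"]//text()').extract()
        return item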

To crawl the current page and then another page (such as the next page), yield a Request with callback=self.parse.
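
A sketch of that next-page pattern, where the rel="next" selector is just an example:

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request


class PagingSpider(Spider):
    name = "paging"  # hypothetical spider
    start_urls = ['http://example.com/page/1']

    def parse(self, response):
        sel = Selector(response)
        # ... yield items extracted from the current page here ...
        nxt = sel.xpath('//a[@rel="next"]/@href').extract()
        if nxt:
            # recurse: fetch the next page with parse itself as the callback
            yield Request(nxt[0], callback=self.parse)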

It's a little odd that parse can't return a list of items while parse_item, used as a callback, can; I don't know why.

In addition, text obtained directly with extract() does not include the contents of <a> and other child tags. You can use d.xpath('node()').extract() instead to get the text with the HTML included, then filter out the tags to be left with plain text.
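
For example, a small sketch; the regex is just one simple way to do the tag filtering:

import re

from scrapy.selector import Selector

d = Selector(text='<div>see <a href="#">this link</a> here</div>').xpath('//div')

# text() returns only the text nodes directly under the div, skipping <a>:
direct = d.xpath('text()').extract()   # [u'see ', u' here']

# node() returns every child node, with child elements serialized as HTML:
parts = d.xpath('node()').extract()    # [u'see ', u'<a href="#">this link</a>', u' here']

# strip the tags to recover the plain text
plain = re.sub(r'<[^>]+>', '', ''.join(parts))   # u'see this link here'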

I couldn't find a way to get the HTML directly.

from scrapy.spider import Spider
from scrapy.selector import Selector
from dirbot.items import Article
import json
import re
import string
from scrapy.http import Request


class YouyousuiyueSpider(Spider):
    name = "youyousuiyue2"
    allowed_domains = ["youyousuiyue.sinaapp.com"]
    start_urls = ['http://youyousuiyue.sinaapp.com']

    def load_item(self, d):
        # build an item from the fields available on the listing page
        item = Article()
        title = d.xpath('header/h1/a')
        item['title'] = title.xpath('text()').extract()
        print item['title'][0]
        item['url'] = title.xpath('@href').extract()
        return item

    def parse_item(self, response):
        # recover the half-filled item passed in via meta and complete it
        item = response.meta['item']
        sel = Selector(response)
        d = sel.xpath('//div[@class="entry-content"]/div')
        item['content'] = d.xpath('text()').extract()
        return item

    def parse(self, response):
        """The lines below are a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html

        @url http://youyousuiyue.sinaapp.com
        @scrapes name
        """
        print 'parsing', response.url
        sel = Selector(response)
        articles = sel.xpath('//div[@id="content"]/article')
        for d in articles:
            item = self.load_item(d)
            # hand the item off to the detail page; if all the data were on
            # this page, this line would simply be: yield item
            yield Request(item['url'][0], meta={'item': item},
                          callback=self.parse_item)

        link = sel.xpath('//div[@class="nav-previous"]/a/@href').extract()[0]
        if link[-1] == '4':
            # stop when the next-page link ends in '4' (page 4)
            return
        else:
            print 'yielding', link
            yield Request(link, callback=self.parse)
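
Assuming the spider name above, the spider is then run with Scrapy's standard command-line tool:

scrapy crawl youyousuiyue2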

For the full code, see: https://github.com/junglezax/dirbot

References:

http://doc.scrapy.org/en/latest/intro/tutorial.html

http://www.icultivator.com/p/3166.html

