Scrapy: Crawling Web Data Recursively

Source: Internet
Author: User
Tags: xpath

The parse method of a Scrapy spider can return two kinds of values: an item (BaseItem) or a Request. Recursive crawling is achieved through Requests.
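
As an illustration, here is a minimal sketch of the two cases, using the same old-style Scrapy imports as the code further below; the spider, item class, URLs, and XPath expressions are all made-up examples:

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.item import Item, Field


class ExampleItem(Item):
    # hypothetical item with a single field
    title = Field()


class ExampleSpider(Spider):
    name = "example"  # hypothetical spider
    start_urls = ['http://example.com/']

    def parse(self, response):
        sel = Selector(response)
        # Case 1: the data is on the current page -- yield the item directly.
        item = ExampleItem()
        item['title'] = sel.xpath('//h1/text()').extract()
        yield item
        # Case 2: the data is on a linked page -- yield a Request and let
        # the callback build the item from that page.
        for href in sel.xpath('//a[@class="detail"]/@href').extract():
            yield Request(href, callback=self.parse_item)

    def parse_item(self, response):
        sel = Selector(response)
        item = ExampleItem()
        item['title'] = sel.xpath('//h1/text()').extract()
        yield item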

If the data to be crawled is on the current page, you can parse it and yield the item directly (in the code below, change the commented line to yield item);

If the data to be crawled is on a page the current page links to, return a Request and specify parse_item as the callback;

If the data to be crawled is partly on the current page and partly on a page it links to (for example, on a blog or forum the listing page has the title, summary, and URL, while the detail page has the full content), pass the data collected from the current page to parse_item via the Request's meta parameter; parse_item then parses the remaining fields of the item.
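
A minimal sketch of that hand-off (field names, URLs, and selectors are hypothetical):

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.item import Item, Field


class PostItem(Item):
    # hypothetical fields: the listing page supplies title and url,
    # the detail page supplies content
    title = Field()
    url = Field()
    content = Field()


class ListDetailSpider(Spider):
    name = "list_detail"  # hypothetical spider
    start_urls = ['http://example.com/list']

    def parse(self, response):
        sel = Selector(response)
        for d in sel.xpath('//article'):
            item = PostItem()
            # fields available on the listing page
            item['title'] = d.xpath('header/h1/a/text()').extract()
            item['url'] = d.xpath('header/h1/a/@href').extract()
            # pass the half-filled item to the detail-page callback via meta
            yield Request(item['url'][0], meta={'item': item},
                          callback=self.parse_item)

    def parse_item(self, response):
        # recover the partially filled item and complete it
        item = response.meta['item']
        sel = Selector(response)
        item['content'] = sel.xpath('//div[@class="entry-content"]//text()').extract()
        return item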

To crawl the current page and then another page (such as the next page), yield a Request with callback=self.parse.
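
A sketch of that next-page pattern, where the rel="next" selector is just an example:

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request


class PagingSpider(Spider):
    name = "paging"  # hypothetical spider
    start_urls = ['http://example.com/page/1']

    def parse(self, response):
        sel = Selector(response)
        # ... yield items extracted from the current page here ...
        nxt = sel.xpath('//a[@rel="next"]/@href').extract()
        if nxt:
            # recurse: fetch the next page with parse itself as the callback
            yield Request(nxt[0], callback=self.parse)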

It's a little odd that parse can't return a list of items while parse_item, used as a callback, can; I don't know why.

In addition, text obtained directly with extract() does not include the contents of <a> and other child tags. You can use d.xpath('node()').extract() instead to get the text with the HTML included, then filter out the tags to be left with plain text.
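
For example, a small sketch; the regex is just one simple way to do the tag filtering:

import re

from scrapy.selector import Selector

d = Selector(text='<div>see <a href="#">this link</a> here</div>').xpath('//div')

# text() returns only the text nodes directly under the div, skipping <a>:
direct = d.xpath('text()').extract()   # [u'see ', u' here']

# node() returns every child node, with child elements serialized as HTML:
parts = d.xpath('node()').extract()    # [u'see ', u'<a href="#">this link</a>', u' here']

# strip the tags to recover the plain text
plain = re.sub(r'<[^>]+>', '', ''.join(parts))   # u'see this link here'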

I couldn't find a way to get the HTML directly.

from scrapy.spider import Spider
from scrapy.selector import Selector
from dirbot.items import Article
import json
import re
import string
from scrapy.http import Request


class YouyousuiyueSpider(Spider):
    name = "youyousuiyue2"
    allowed_domains = ["youyousuiyue.sinaapp.com"]
    start_urls = ['http://youyousuiyue.sinaapp.com']

    def load_item(self, d):
        # build an item from the fields available on the listing page
        item = Article()
        title = d.xpath('header/h1/a')
        item['title'] = title.xpath('text()').extract()
        print item['title'][0]
        item['url'] = title.xpath('@href').extract()
        return item

    def parse_item(self, response):
        # recover the half-filled item passed in via meta and complete it
        item = response.meta['item']
        sel = Selector(response)
        d = sel.xpath('//div[@class="entry-content"]/div')
        item['content'] = d.xpath('text()').extract()
        return item

    def parse(self, response):
        """The lines below are a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html

        @url http://youyousuiyue.sinaapp.com
        @scrapes name
        """
        print 'parsing', response.url
        sel = Selector(response)
        articles = sel.xpath('//div[@id="content"]/article')
        for d in articles:
            item = self.load_item(d)
            # hand the item off to the detail page; if all the data were on
            # this page, this line would simply be: yield item
            yield Request(item['url'][0], meta={'item': item},
                          callback=self.parse_item)

        link = sel.xpath('//div[@class="nav-previous"]/a/@href').extract()[0]
        if link[-1] == '4':
            # stop when the next-page link ends in '4' (page 4)
            return
        else:
            print 'yielding', link
            yield Request(link, callback=self.parse)
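
Assuming the spider name above, the spider is then run with Scrapy's standard command-line tool:

scrapy crawl youyousuiyue2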

For the full code, see: https://github.com/junglezax/dirbot

References:

http://doc.scrapy.org/en/latest/intro/tutorial.html

http://www.icultivator.com/p/3166.html

