Example code for several crawling methods of a Scrapy spider
This section describes the Scrapy crawling framework, focusing on its Spider component.
A spider can crawl in several ways:
- Crawl a single page
- Build URLs from a given list and crawl multiple pages
- Find the 'next page' link and follow it
- Follow links into detail pages and crawl them
The following are examples:
1. Crawl a single page
# By Han Xiaoyang (hanxiaoyang.ml@gmail.com)
import scrapy

class JulyeduSpider(scrapy.Spider):
    name = "julyedu"
    start_urls = ['https://www.julyedu.com/category/index']

    def parse(self, response):
        # Parse each course block on the single listing page and yield one item per course.
        for julyedu_class in response.xpath('//div[@class="course_info_box"]'):
            print(julyedu_class.xpath('a/h4/text()').extract_first())
            print(julyedu_class.xpath('a/p[@class="course-info-tip"][1]/text()').extract_first())
            print(julyedu_class.xpath('a/p[@class="course-info-tip"][2]/text()').extract_first())
            print(response.urljoin(julyedu_class.xpath('a/img[1]/@src').extract_first()))
            print("\n")
            yield {
                'title': julyedu_class.xpath('a/h4/text()').extract_first(),
                'desc': julyedu_class.xpath('a/p[@class="course-info-tip"][1]/text()').extract_first(),
                'time': julyedu_class.xpath('a/p[@class="course-info-tip"][2]/text()').extract_first(),
                'img_url': response.urljoin(julyedu_class.xpath('a/img[1]/@src').extract_first()),
            }
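A spider like this can be run without creating a full Scrapy project. Below is a minimal sketch, assuming Scrapy 2.1+ (where the FEEDS setting is available) and that JulyeduSpider is defined in the same script; the output file name is a hypothetical choice for this example:

from scrapy.crawler import CrawlerProcess

# Run the spider in-process and export the yielded items to a JSON file.
process = CrawlerProcess(settings={
    'FEEDS': {'classes.json': {'format': 'json'}},  # 'classes.json' is a hypothetical name
})
process.crawl(JulyeduSpider)
process.start()  # blocks until the crawl is finished

Alternatively, saving the spider to a file and running scrapy runspider spider_file.py -o classes.json from the command line achieves the same result.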
2. Build URLs from a given list to crawl multiple pages
# By Han Xiaoyang (hanxiaoyang.ml@gmail.com)
import scrapy

class CnBlogSpider(scrapy.Spider):
    name = "cnblogs"
    allowed_domains = ["cnblogs.com"]
    # Generate the URLs of listing pages 1-10 up front.
    start_urls = ['http://www.cnblogs.com/pick/?p=%s' % p for p in range(1, 11)]

    def parse(self, response):
        for article in response.xpath('//div[@class="post_item"]'):
            print(article.xpath('div[@class="post_item_body"]/h3/a/text()').extract_first().strip())
            print(response.urljoin(article.xpath('div[@class="post_item_body"]/h3/a/@href').extract_first()).strip())
            print(article.xpath('div[@class="post_item_body"]/p/text()').extract_first().strip())
            print(article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/a/text()').extract_first().strip())
            print(response.urljoin(article.xpath('div[@class="post_item_body"]/div/a/@href').extract_first()).strip())
            print(article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/span[@class="article_comment"]/a/text()').extract_first().strip())
            print(article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/span[@class="article_view"]/a/text()').extract_first().strip())
            print("")
            yield {
                'title': article.xpath('div[@class="post_item_body"]/h3/a/text()').extract_first().strip(),
                'link': response.urljoin(article.xpath('div[@class="post_item_body"]/h3/a/@href').extract_first()).strip(),
                'summary': article.xpath('div[@class="post_item_body"]/p/text()').extract_first().strip(),
                'author': article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/a/text()').extract_first().strip(),
                'author_link': response.urljoin(article.xpath('div[@class="post_item_body"]/div/a/@href').extract_first()).strip(),
                'comment': article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/span[@class="article_comment"]/a/text()').extract_first().strip(),
                'view': article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/span[@class="article_view"]/a/text()').extract_first().strip(),
            }
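When the page list cannot be precomputed, the same effect can be achieved by overriding start_requests() instead of building start_urls. A minimal sketch; the spider name is hypothetical, and the page range simply mirrors the example above:

import scrapy

class CnBlogStartRequestsSpider(scrapy.Spider):
    name = "cnblogs_start_requests"  # hypothetical name for this sketch

    def start_requests(self):
        # Yield one Request per listing page instead of declaring start_urls.
        for p in range(1, 11):
            yield scrapy.Request('http://www.cnblogs.com/pick/?p=%s' % p,
                                 callback=self.parse)

    def parse(self, response):
        # Same extraction logic as CnBlogSpider.parse above.
        pass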
3. Find the 'next page' link and follow it
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('span/small[@class="author"]/text()').extract_first(),
            }
        # Keep requesting the next page until no 'next' link is found.
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
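Since Scrapy 1.4, response.follow can replace the urljoin-plus-Request pair: it accepts a relative URL and builds the absolute Request itself. A sketch of the same pagination step, assuming such a Scrapy version:

next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
if next_page is not None:
    # response.follow resolves the relative URL against the current page.
    yield response.follow(next_page, callback=self.parse)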
4. Follow links into detail pages and crawl them
# By Han Xiaoyang (hanxiaoyang.ml@gmail.com)
import scrapy

class QQNewsSpider(scrapy.Spider):
    name = 'qqnews'
    start_urls = ['http://news.qq.com/society_index.shtml']

    def parse(self, response):
        # Collect the article links on the index page and follow each one.
        for href in response.xpath('//*[@id="news"]/div/em/a/@href'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        print(response.xpath('//div[@class="qq_article"]/div/h1/text()').extract_first())
        print(response.xpath('//span[@class="a_time"]/text()').extract_first())
        print(response.xpath('//span[@class="a_catalog"]/a/text()').extract_first())
        print("\n".join(response.xpath('//div[@id="Cnt-Main-Article-QQ"]/p[@class="text"]/text()').extract()))
        print("")
        yield {
            'title': response.xpath('//div[@class="qq_article"]/div/h1/text()').extract_first(),
            'content': "\n".join(response.xpath('//div[@id="Cnt-Main-Article-QQ"]/p[@class="text"]/text()').extract()),
            'time': response.xpath('//span[@class="a_time"]/text()').extract_first(),
            'cate': response.xpath('//span[@class="a_catalog"]/a/text()').extract_first(),
        }
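The same "collect links, then parse each detail page" pattern can also be written declaratively with CrawlSpider rules. A minimal sketch; the spider name and the LinkExtractor allow pattern are assumptions for illustration, not taken from the original article:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class QQNewsCrawlSpider(CrawlSpider):
    name = 'qqnews_crawl'  # hypothetical name for this sketch
    allowed_domains = ['qq.com']
    start_urls = ['http://news.qq.com/society_index.shtml']

    rules = (
        # Follow links that look like article pages (pattern is an assumption)
        # and hand each response to parse_item.
        Rule(LinkExtractor(allow=r'\.shtml$'), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {
            'title': response.xpath('//div[@class="qq_article"]/div/h1/text()').extract_first(),
        }

Note that CrawlSpider reserves the parse method for its own link-following logic, so the callback must use a different name such as parse_item.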
Summary
The examples above cover the main crawling patterns of a Scrapy spider: parsing a single page, generating the start URLs from a list, following pagination links, and following links into detail pages. I hope they are helpful; if anything is missing or unclear, please leave a message. Thank you for your support!