Example code of several crawling methods of a Scrapy spider


This article introduces the Scrapy crawler framework, focusing on its Spider component.

A Spider can crawl in several ways:

  1. Crawl a single page
  2. Build URLs from a given list and crawl multiple pages
  3. Find the 'next page' link and follow it
  4. Follow links from an index page into detail pages

All four variants share the same spider skeleton, sketched below.
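
To make that shared structure concrete, here is a minimal sketch of the skeleton: a name, a list of start URLs, and a parse callback that Scrapy invokes with each downloaded response. The spider name and URL are placeholders, not part of the original examples.

import scrapy


class MinimalSpider(scrapy.Spider):
    name = "minimal"                       # placeholder name, used by `scrapy crawl minimal`
    start_urls = ['http://example.com/']   # placeholder URL

    def parse(self, response):
        # Scrapy calls parse() with the response of each start URL;
        # yield dicts to emit items, or Request objects to keep crawling.
        yield {'url': response.url}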

The following are examples:

1. Crawl a single page

# By Han Xiaoyang (hanxiaoyang.ml@gmail.com)
import scrapy


class JulyeduSpider(scrapy.Spider):
    name = "julyedu"
    start_urls = ['https://www.julyedu.com/category/index']

    def parse(self, response):
        # Each course card on the page is a div with class "course_info_box"
        for julyedu_class in response.xpath('//div[@class="course_info_box"]'):
            title = julyedu_class.xpath('a/h4/text()').extract_first()
            desc = julyedu_class.xpath('a/p[@class="course-info-tip"][1]/text()').extract_first()
            time = julyedu_class.xpath('a/p[@class="course-info-tip"][2]/text()').extract_first()
            img_url = response.urljoin(julyedu_class.xpath('a/img[1]/@src').extract_first())
            print(title, desc, time, img_url)
            yield {'title': title, 'desc': desc, 'time': time, 'img_url': img_url}
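
As a side note, recent Scrapy releases document .get() and .getall() as the preferred spellings of .extract_first() and .extract(). If you are on such a version, the loop body above could equivalently be sketched as:

# Equivalent extraction using the newer .get() selector method
for julyedu_class in response.xpath('//div[@class="course_info_box"]'):
    yield {
        'title': julyedu_class.xpath('a/h4/text()').get(),
        'desc': julyedu_class.xpath('a/p[@class="course-info-tip"][1]/text()').get(),
        'time': julyedu_class.xpath('a/p[@class="course-info-tip"][2]/text()').get(),
        'img_url': response.urljoin(julyedu_class.xpath('a/img[1]/@src').get()),
    }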

2. Build URLs from a given list and crawl multiple pages

# By Han Xiaoyang (hanxiaoyang.ml@gmail.com)
import scrapy


class CnBlogSpider(scrapy.Spider):
    name = "cnblogs"
    allowed_domains = ["cnblogs.com"]
    # Build the start URLs for listing pages 1-10 from a list comprehension
    start_urls = ['http://www.cnblogs.com/pick/?p=%s' % p for p in range(1, 11)]

    def parse(self, response):
        for article in response.xpath('//div[@class="post_item"]'):
            body = article.xpath('div[@class="post_item_body"]')
            foot = body.xpath('div[@class="post_item_foot"]')
            item = {
                'title': body.xpath('h3/a/text()').extract_first().strip(),
                'link': response.urljoin(body.xpath('h3/a/@href').extract_first()).strip(),
                'summary': body.xpath('p/text()').extract_first().strip(),
                'author': foot.xpath('a/text()').extract_first().strip(),
                'author_link': response.urljoin(body.xpath('div/a/@href').extract_first()).strip(),
                'comment': foot.xpath('span[@class="article_comment"]/a/text()').extract_first().strip(),
                'view': foot.xpath('span[@class="article_view"]/a/text()').extract_first().strip(),
            }
            print(item)
            yield item
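
Instead of precomputing start_urls with a list comprehension, the same ten pages can be generated lazily by overriding start_requests(), a standard scrapy.Spider hook. A minimal sketch (the spider name is a hypothetical variant; it reuses the parse logic above):

import scrapy


class CnBlogLazySpider(scrapy.Spider):
    name = "cnblogs_lazy"    # hypothetical variant of the spider above
    allowed_domains = ["cnblogs.com"]

    def start_requests(self):
        # Yield one Request per listing page instead of building start_urls up front
        for p in range(1, 11):
            yield scrapy.Request('http://www.cnblogs.com/pick/?p=%s' % p, callback=self.parse)

    def parse(self, response):
        ...  # same extraction logic as CnBlogSpider.parse above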

3. Find the 'next page' link and follow it

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('span/small[@class="author"]/text()').extract_first(),
            }
        # The href lives on the <a> inside the "next" list item
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
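
If you are on Scrapy 1.4 or later, response.follow() accepts relative URLs directly, so the urljoin step can be dropped. The pagination tail of parse() could then be written as:

# Equivalent pagination with response.follow (Scrapy >= 1.4),
# which resolves the relative URL against the current response itself
next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
if next_page is not None:
    yield response.follow(next_page, callback=self.parse)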

4. Follow links from an index page into detail pages

# By Han Xiaoyang (hanxiaoyang.ml@gmail.com)
import scrapy


class QQNewsSpider(scrapy.Spider):
    name = 'qqnews'
    start_urls = ['http://news.qq.com/society_index.shtml']

    def parse(self, response):
        # Collect every article link on the index page and follow it
        for href in response.xpath('//*[@id="news"]/div/em/a/@href'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        # Extract the article fields from the detail page
        item = {
            'title': response.xpath('//div[@class="qq_article"]/div/h1/text()').extract_first(),
            'content': "\n".join(response.xpath('//div[@id="Cnt-Main-Article-QQ"]/p[@class="text"]/text()').extract()),
            'time': response.xpath('//span[@class="a_time"]/text()').extract_first(),
            'cate': response.xpath('//span[@class="a_catalog"]/a/text()').extract_first(),
        }
        print(item)
        yield item
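
When the detail callback also needs data found on the index page, Scrapy's standard Request.meta dict can carry it across the two callbacks. A sketch built on the spider above; the list_title field is illustrative, not part of the original code:

import scrapy


class QQNewsMetaSpider(scrapy.Spider):
    name = 'qqnews_meta'    # hypothetical variant of QQNewsSpider
    start_urls = ['http://news.qq.com/society_index.shtml']

    def parse(self, response):
        for link in response.xpath('//*[@id="news"]/div/em/a'):
            full_url = response.urljoin(link.xpath('@href').extract_first())
            # Carry the link text from the index page into the detail callback
            yield scrapy.Request(full_url, callback=self.parse_question,
                                 meta={'list_title': link.xpath('text()').extract_first()})

    def parse_question(self, response):
        yield {
            'list_title': response.meta.get('list_title'),  # value set in parse() above
            'title': response.xpath('//div[@class="qq_article"]/div/h1/text()').extract_first(),
        }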

Summary

That covers the example code for the several crawling methods of a Scrapy spider. I hope it is helpful. If you are interested, you can continue to explore related topics on this site. If anything is unclear or missing, please leave a comment. Thank you for your support!
