Using Scrapy to batch-download comics. Reference: http://python.jobbole.com/87155
First, create the project
# scrapy startproject comics
After the command completes, the following directory structure is created:
.
├── comics
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg
Second, create the Spider class
start_requests: called when the crawler starts. By default it calls the make_requests_from_url method to crawl the links in start_urls. The method can be customized: if it is overridden, start_urls is not used by default, and you can generate your own requests here instead, for example logging in first or reading URLs from a database. It returns (yields) Request objects.
start_urls is a property provided by the framework: a list containing the URLs of the destination pages. Once start_urls is set there is no need to override start_requests; the crawler crawls the addresses in start_urls in turn and automatically calls parse as the callback after each request completes.
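As a minimal sketch of that shortcut (the spider name comics_urls is made up for illustration), the following is equivalent to overriding start_requests the way the next snippet does:

import scrapy

class ComicsUrlsSpider(scrapy.Spider):
    name = "comics_urls"  # hypothetical name, for illustration only

    # the default start_requests crawls each of these URLs and
    # automatically calls self.parse with the response
    start_urls = ['http://www.xeall.com/shenshi']

    def parse(self, response):
        self.log(response.body)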
# cd comics/spiders
# vim comic.py

#!/usr/bin/python
# coding:utf-8
import scrapy


class Comics(scrapy.Spider):
    name = "comics"

    def start_requests(self):
        urls = ['http://www.xeall.com/shenshi']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.log(response.body)
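With the Spider class in place, the crawler can be started from the project root, using the value of the name attribute:

# scrapy crawl comics

Scrapy requests the URL yielded by start_requests, and parse simply logs the raw response body for now.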
Third, start crawling comics
The crawler's main task is to download the pictures of every comic in the list: crawl the comics on the current list page, move on to the next page of the list, and keep looping until every comic has been crawled. The idea: get the URL of each comic on the current page, visit it to get the comic's name and all of its pictures, batch-download them, then repeat for the next list page.
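In Scrapy terms this loop maps onto two callbacks: parse handles a comic-list page, yielding one Request per comic plus one Request for the next list page, while comics_parse handles a comic's picture pages and keeps following the next-page link until it reaches the last one. Steps 1-3 below implement exactly this.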
1. Get the URL of each comic on the current page, and the URL of the next list page
The URL of a single comic in the page source:
[Screenshot: a comic entry's URL in the list page source]
# find the URL of every comic
# (Selector is imported with: from scrapy.selector import Selector)
def parse(self, response):
    # response: the Response object returned for the request
    content = Selector(response=response)
    # get the comic list items
    com_count = content.xpath("//div[@class='mainleft']/ul/li")
    # collect the URL of every comic on the current page
    comics_url_list = []
    base_url = 'http://www.xeall.com'
    for i in range(len(com_count)):
        com_url = content.xpath(
            "//div[@class='mainleft']/ul/li[{}]/a/@href".format(i + 1)).extract()
        url = base_url + com_url[0]
        comics_url_list.append(url)
    # process each comic on the current page
    for url in comics_url_list:
        yield scrapy.Request(url=url, callback=self.comics_parse)
    # get the URL of the next page
    url_num = content.xpath("//div[@class='mainleft']/div[@class='pages']/ul/li")
    next_url = content.xpath(
        "//div[@class='mainleft']/div[@class='pages']/ul/li[{}]/a/@href".format(
            len(url_num) - 3)).extract()
    # print('Total pages: {}, Next: {}'.format(url_num, next_url))
    # check whether there is a next page (the last page yields nothing here)
    if next_url:
        next_page = 'http://www.xeall.com/shenshi/' + next_url[0]
        if next_page is not None:
            yield scrapy.Request(next_page, callback=self.parse)
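As an aside, on recent Scrapy versions the same link extraction can be written more compactly, without an explicit Selector or a hard-coded base_url; a minimal sketch (same XPath as above, not the article's original code):

def parse(self, response):
    # response.xpath queries the page directly; no Selector wrapper needed
    for href in response.xpath("//div[@class='mainleft']/ul/li/a/@href").extract():
        # response.urljoin resolves a possibly relative href against the page URL
        yield scrapy.Request(response.urljoin(href), callback=self.comics_parse)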
2. Get all pages of the comic
The page source of the different picture pages:
[Screenshot: page source of a comic's picture pages]
The name and picture URL of the current comic:
[Screenshot: page source showing the comic's name and picture URL]
# extract the data of each comic
def comics_parse(self, response):
    content = Selector(response=response)
    # number of the current page
    page_num = content.xpath(
        "//div[@class='dede_pages']/ul/li[@class='thisclass']/a/text()").extract()
    # URL of the picture on the current page
    current_url = content.xpath("//div[@class='mhcon_left']/ul/li/p/img/@src").extract()
    # name of the comic
    comic_name = content.xpath("//div[@class='mhcon_left']/ul/li/p/img/@alt").extract()
    # self.log('img url: ' + current_url[0])
    # save the picture locally
    self.save_img(page_num[0], comic_name[0], current_url[0])
    # URL of the next page; on the comic's last page the next-page
    # link's href attribute is '#'
    page_num = content.xpath("//div[@class='dede_pages']/ul/li")
    next_page = content.xpath(
        "//div[@class='dede_pages']/ul/li[{}]/a/@href".format(len(page_num))).extract()
    # href='#' on the last page
    if next_page[0] == '#':
        print('parse comics: ' + comic_name[0] + ' finished.')
    else:
        next_page = 'http://www.xeall.com/shenshi/' + next_page[0]
        yield scrapy.Request(next_page, callback=self.comics_parse)
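Note that extract() returns a list of matched strings, which is why each value is taken with [0] before use; the last-page check likewise has to compare next_page[0] against '#', since a list can never equal a string.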
3. Persist data to storage
# takes the picture number, comic title and picture URL as parameters
# (requires "import os" and "import requests" at the top of the file)
def save_img(self, img_mun, title, img_url):
    # save the picture locally
    # self.log('saving pic: ' + img_url)
    # folder that holds all the comics
    document = os.path.join(os.getcwd(), 'cartoon')
    # each comic's folder is named after its title
    comics_path = os.path.join(document, title)
    exists = os.path.exists(comics_path)
    if not exists:
        print('create document: ' + title)
        os.makedirs(comics_path)
    # each picture is named after its page number
    pic_name = comics_path + '/' + img_mun + '.jpg'
    # if the picture has already been downloaded, skip it
    exists = os.path.exists(pic_name)
    if exists:
        print('pic exists: ' + pic_name)
        return
    try:
        response = requests.get(img_url, timeout=30)
        # data returned by the request
        data = response
        with open(pic_name, 'wb') as f:
            for chunk in data.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)
                    f.flush()
        print('save image finished: ' + pic_name)
    except Exception as e:
        print('save image error.')
        print(e)
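One caveat worth knowing: without stream=True, requests downloads the whole image body up front and iter_content merely iterates over it in memory. For large files the transfer can be deferred to the chunk loop (stream is a standard requests parameter):

# stream=True defers the body download until iter_content is consumed
response = requests.get(img_url, timeout=30, stream=True)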
Full source address: https://github.com/yaoliang83/Scrapy-for-Comics