Python Scrapy Learning Notes (II)

Source: Internet
Author: User
Tags: xpath, python, scrapy

Using Scrapy for batch crawling; reference: http://python.jobbole.com/87155


First, create the project

# scrapy startproject comics


After the command completes, the following directory structure is created:

.
├── comics
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg


Second, create the Spider class


start_requests: Called when the crawler starts. By default it calls the make_requests_from_url method to request each link in start_urls. If this method is overridden, start_urls is not used automatically; instead you can generate custom requests here, for example to log in first or to read URLs from a database. The method returns (yields) Request objects.

start_urls is an attribute provided by the framework: an array containing the URLs of the target pages. Once start_urls is set, there is no need to override start_requests; the crawler requests each address in start_urls in turn and automatically calls parse as the callback when each request completes.
# cd comics/spiders
# vim comic.py

#!/usr/bin/python
# coding:utf-8
import scrapy

class Comics(scrapy.Spider):
    name = "comics"

    def start_requests(self):
        urls = ['http://www.xeall.com/shenshi']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.log(response.body)
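Since start_requests here only iterates over a fixed list, the same spider can be written more compactly with the start_urls attribute alone. A minimal sketch, assuming the same file and class names as above:

import scrapy

class Comics(scrapy.Spider):
    name = "comics"
    # The framework generates the initial requests from this list and
    # calls self.parse on each response automatically.
    start_urls = ['http://www.xeall.com/shenshi']

    def parse(self, response):
        self.log(response.body)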


Third, start crawling comics

The crawler's main task is to fetch the list of pictures for each comic: crawl every comic on the current list page, move on to the next list page, and keep looping until all comics have been crawled. The idea: collect the URL of each comic on the current page, visit each comic's URL to get its name and all of its pictures, download them as a batch, and repeat.

1. Get the URL of each comic on the current page, and the URL of the next list page


[Screenshot: the URL of a single comic in the page source]



# Find the URLs of all comics on the current page
def parse(self, response):
    # Build a Selector from the response returned by the request
    content = Selector(response=response)
    # Get the comic list items
    com_count = content.xpath("//div[@class='mainleft']/ul/li")
    # Collect the URL of every comic on this page
    comics_url_list = []
    base_url = 'http://www.xeall.com'
    for i in range(len(com_count)):
        com_url = content.xpath(
            "//div[@class='mainleft']/ul/li[{}]/a/@href".format(i + 1)).extract()
        url = base_url + com_url[0]
        comics_url_list.append(url)
    # Process each comic on the current page
    for url in comics_url_list:
        yield scrapy.Request(url=url, callback=self.comics_parse)
    # Get the URL of the next list page
    url_num = content.xpath("//div[@class='mainleft']/div[@class='pages']/ul/li")
    next_url = content.xpath(
        "//div[@class='mainleft']/div[@class='pages']/ul/li[{}]/a/@href".format(
            len(url_num) - 3)).extract()
    # print('Total pages: {}, Next: {}'.format(url_num, next_url))
    # Determine whether there is a next page (i.e. this is not the last page)
    if next_url:
        next_page = 'http://www.xeall.com/shenshi/' + next_url[0]
        if next_page is not None:
            yield scrapy.Request(next_page, callback=self.parse)
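As an aside, the indexed XPath per item can be avoided by iterating over the list-item selectors directly and letting response.urljoin build the absolute URLs. A sketch of the same logic, under the assumption that the page structure is exactly as described above:

def parse(self, response):
    # Follow the link of each comic entry on the current page
    for li in response.xpath("//div[@class='mainleft']/ul/li"):
        href = li.xpath("./a/@href").extract_first()
        if href:
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.comics_parse)
    # The next-page link sits fourth from the end of the pager,
    # matching the len(url_num) - 3 index used above
    pager = response.xpath("//div[@class='mainleft']/div[@class='pages']/ul/li")
    if len(pager) >= 4:
        next_href = pager[-4].xpath("./a/@href").extract_first()
        if next_href:
            yield scrapy.Request(response.urljoin(next_href),
                                 callback=self.parse)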


2. Get all pages of a comic


[Screenshot: the page source of different picture pages]

[Screenshot: the page source showing the current comic's name and picture URL]

# Extract the data of each comic
def comics_parse(self, response):
    # Selector built from the response returned by the request
    content = Selector(response=response)
    # Number of the current page
    page_num = content.xpath(
        "//div[@class='dede_pages']/ul/li[@class='thisclass']/a/text()").extract()
    # URL of the first picture on the current page
    current_url = content.xpath(
        "//div[@class='mhcon_left']/ul/li/p/img/@src").extract()
    # Name of the comic
    comic_name = content.xpath(
        "//div[@class='mhcon_left']/ul/li/p/img/@alt").extract()
    # self.log('img url: ' + current_url[0])
    # Save the picture locally
    self.save_img(page_num[0], comic_name[0], current_url[0])
    # URL of the next page of pictures; on the last page of a comic
    # the next-page link's href attribute is '#'
    page_num = content.xpath("//div[@class='dede_pages']/ul/li")
    next_page = content.xpath(
        "//div[@class='dede_pages']/ul/li[{}]/a/@href".format(len(page_num))).extract()
    # On the last page href == '#'
    if next_page[0] == '#':
        print('parse comics: ' + comic_name[0] + ' finished.')
    else:
        next_page = 'http://www.xeall.com/shenshi/' + next_page[0]
        yield scrapy.Request(next_page, callback=self.comics_parse)
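XPath expressions like these depend entirely on the site's markup, so it helps to try them interactively before wiring them into the spider. Scrapy's shell subcommand loads a page and exposes the response object; for example (using the list-page XPath from parse):

# scrapy shell 'http://www.xeall.com/shenshi'
>>> response.xpath("//div[@class='mainleft']/ul/li/a/@href").extract()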


3. Persist data to storage

# Save one picture; the parameters are the picture (page) number,
# the comic title, and the picture URL
def save_img(self, img_mun, title, img_url):
    # self.log('saving pic: ' + img_url)
    # Folder in which the comics are saved
    document = os.path.join(os.getcwd(), 'cartoon')
    # Each comic's folder is named after its title
    comics_path = os.path.join(document, title)
    exists = os.path.exists(comics_path)
    if not exists:
        print('create document: ' + title)
        os.makedirs(comics_path)
    # Each picture is named after its page number
    pic_name = comics_path + '/' + img_mun + '.jpg'
    # If the picture has already been downloaded, do not download it again
    exists = os.path.exists(pic_name)
    if exists:
        print('pic exists: ' + pic_name)
        return
    try:
        response = requests.get(img_url, timeout=30)
        # Data returned by the request
        data = response
        with open(pic_name, 'wb') as f:
            for chunk in data.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)
                    f.flush()
        print('save image finished: ' + pic_name)
    except Exception as e:
        print('save image error.')
        print(e)
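For the snippets above to run, comic.py needs the corresponding imports at the top; requests is a third-party library (installed with pip install requests):

import os

import requests
import scrapy
from scrapy.selector import Selector

With the spider's name attribute set to "comics", the whole crawl is then started from the project root:

# scrapy crawl comics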


Full source code: https://github.com/yaoliang83/Scrapy-for-Comics

