Using Scrapy to batch-download comics. Reference: http://python.jobbole.com/87155
First, create the project
# scrapy startproject comics
After the command completes, the following directory structure is created:
.
├── comics
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg
Second, create the Spider class
start_requests: called when the crawler starts. By default it calls the make_requests_from_url method to crawl the links in start_urls. The method can be customized: if it is overridden, start_urls is not used by default, and you can generate your own requests here instead, for example logging in first or reading URLs from a database. It returns (yields) Request objects.
start_urls is a property provided by the framework: a list containing the URLs of the destination pages. Once start_urls is set there is no need to override start_requests; the crawler crawls the addresses in start_urls in turn and automatically calls parse as the callback after each request completes.
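As a minimal sketch of that shortcut (the spider name comics_urls is made up for illustration), the following is equivalent to overriding start_requests the way the next snippet does:

import scrapy

class ComicsUrlsSpider(scrapy.Spider):
    name = "comics_urls"  # hypothetical name, for illustration only

    # the default start_requests crawls each of these URLs and
    # automatically calls self.parse with the response
    start_urls = ['http://www.xeall.com/shenshi']

    def parse(self, response):
        self.log(response.body)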
# cd comics/spiders
# vim comic.py

#!/usr/bin/python
# coding:utf-8
import scrapy


class Comics(scrapy.Spider):
    name = "comics"

    def start_requests(self):
        urls = ['http://www.xeall.com/shenshi']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.log(response.body)
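With the Spider class in place, the crawler can be started from the project root, using the value of the name attribute:

# scrapy crawl comics

Scrapy requests the URL yielded by start_requests, and parse simply logs the raw response body for now.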
Third, start crawling comics
The crawler's main task is to download the pictures of every comic in the list: crawl the comics on the current list page, move on to the next page of the list, and keep looping until every comic has been crawled. The idea: get the URL of each comic on the current page, visit it to get the comic's name and all of its pictures, batch-download them, then repeat for the next list page.
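In Scrapy terms this loop maps onto two callbacks: parse handles a comic-list page, yielding one Request per comic plus one Request for the next list page, while comics_parse handles a comic's picture pages and keeps following the next-page link until it reaches the last one. Steps 1-3 below implement exactly this.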
1. Get the URL of each comic on the current page, and the URL of the next list page
The URL of a single comic in the page source:
[Screenshot: a comic entry's URL in the list page source]
# find the URL of every comic
# (Selector is imported with: from scrapy.selector import Selector)
def parse(self, response):
    # response: the Response object returned for the request
    content = Selector(response=response)
    # get the comic list items
    com_count = content.xpath("//div[@class='mainleft']/ul/li")
    # collect the URL of every comic on the current page
    comics_url_list = []
    base_url = 'http://www.xeall.com'
    for i in range(len(com_count)):
        com_url = content.xpath(
            "//div[@class='mainleft']/ul/li[{}]/a/@href".format(i + 1)).extract()
        url = base_url + com_url[0]
        comics_url_list.append(url)
    # process each comic on the current page
    for url in comics_url_list:
        yield scrapy.Request(url=url, callback=self.comics_parse)
    # get the URL of the next page
    url_num = content.xpath("//div[@class='mainleft']/div[@class='pages']/ul/li")
    next_url = content.xpath(
        "//div[@class='mainleft']/div[@class='pages']/ul/li[{}]/a/@href".format(
            len(url_num) - 3)).extract()
    # print('Total pages: {}, Next: {}'.format(url_num, next_url))
    # check whether there is a next page (the last page yields nothing here)
    if next_url:
        next_page = 'http://www.xeall.com/shenshi/' + next_url[0]
        if next_page is not None:
            yield scrapy.Request(next_page, callback=self.parse)
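As an aside, on recent Scrapy versions the same link extraction can be written more compactly, without an explicit Selector or a hard-coded base_url; a minimal sketch (same XPath as above, not the article's original code):

def parse(self, response):
    # response.xpath queries the page directly; no Selector wrapper needed
    for href in response.xpath("//div[@class='mainleft']/ul/li/a/@href").extract():
        # response.urljoin resolves a possibly relative href against the page URL
        yield scrapy.Request(response.urljoin(href), callback=self.comics_parse)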
2. Get all pages of the comic
The page source of the different picture pages:
[Screenshot: page source of a comic's picture pages]
The name and picture URL of the current comic:
[Screenshot: page source showing the comic's name and picture URL]
# extract the data of each comic
def comics_parse(self, response):
    content = Selector(response=response)
    # number of the current page
    page_num = content.xpath(
        "//div[@class='dede_pages']/ul/li[@class='thisclass']/a/text()").extract()
    # URL of the picture on the current page
    current_url = content.xpath("//div[@class='mhcon_left']/ul/li/p/img/@src").extract()
    # name of the comic
    comic_name = content.xpath("//div[@class='mhcon_left']/ul/li/p/img/@alt").extract()
    # self.log('img url: ' + current_url[0])
    # save the picture locally
    self.save_img(page_num[0], comic_name[0], current_url[0])
    # URL of the next page; on the comic's last page the next-page
    # link's href attribute is '#'
    page_num = content.xpath("//div[@class='dede_pages']/ul/li")
    next_page = content.xpath(
        "//div[@class='dede_pages']/ul/li[{}]/a/@href".format(len(page_num))).extract()
    # href='#' on the last page
    if next_page[0] == '#':
        print('parse comics: ' + comic_name[0] + ' finished.')
    else:
        next_page = 'http://www.xeall.com/shenshi/' + next_page[0]
        yield scrapy.Request(next_page, callback=self.comics_parse)
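Note that extract() returns a list of matched strings, which is why each value is taken with [0] before use; the last-page check likewise has to compare next_page[0] against '#', since a list can never equal a string.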
3. Persist data to storage
# takes the picture number, comic title and picture URL as parameters
# (requires "import os" and "import requests" at the top of the file)
def save_img(self, img_mun, title, img_url):
    # save the picture locally
    # self.log('saving pic: ' + img_url)
    # folder that holds all the comics
    document = os.path.join(os.getcwd(), 'cartoon')
    # each comic's folder is named after its title
    comics_path = os.path.join(document, title)
    exists = os.path.exists(comics_path)
    if not exists:
        print('create document: ' + title)
        os.makedirs(comics_path)
    # each picture is named after its page number
    pic_name = comics_path + '/' + img_mun + '.jpg'
    # if the picture has already been downloaded, skip it
    exists = os.path.exists(pic_name)
    if exists:
        print('pic exists: ' + pic_name)
        return
    try:
        response = requests.get(img_url, timeout=30)
        # data returned by the request
        data = response
        with open(pic_name, 'wb') as f:
            for chunk in data.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)
                    f.flush()
        print('save image finished: ' + pic_name)
    except Exception as e:
        print('save image error.')
        print(e)
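One caveat worth knowing: without stream=True, requests downloads the whole image body up front and iter_content merely iterates over it in memory. For large files the transfer can be deferred to the chunk loop (stream is a standard requests parameter):

# stream=True defers the body download until iter_content is consumed
response = requests.get(img_url, timeout=30, stream=True)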
Full source address: https://github.com/yaoliang83/Scrapy-for-Comics