I had crawled this site before. I first tried to crawl other comics sites, but ran into many anti-crawler restrictions: some sites append dynamic parameters to their image URLs that change every second, so a link scraped one second is invalid the next; other sites keep image addresses stable but return 403 once you visit too frequently. I finally found a comics site without these limits, so here is a demo of a Selenium crawler.
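If you do have to fetch from a host that throttles frequent visitors, the usual mitigations are to pace your requests and send a browser-like User-Agent. Here is a minimal sketch with urllib2; the URL and the 2-second delay are illustrative assumptions, not values from any of the sites mentioned:

# Hedged sketch: polite image fetching to reduce the chance of 403s
# from a rate-limited host. IMG_URLS and the 2-second pause are
# illustrative assumptions.
import time
import urllib2

IMG_URLS = ['http://example.com/comic/page1.jpg']  # hypothetical URLs

def fetch_image(url):
    request = urllib2.Request(url)
    # A desktop-browser User-Agent; the default urllib2 agent is an
    # easy target for server-side bot filters.
    request.add_header('User-Agent',
                       'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/59.0.3071.115 Safari/537.36')
    return urllib2.urlopen(request).read()

if __name__ == '__main__':
    for url in IMG_URLS:
        data = fetch_image(url)
        time.sleep(2)  # pause between requests to stay under the rate limit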
# -*- coding:utf-8 -*-
# Crawl Kuku comics
__author__ = 'Fengzhankui'

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import os
import urllib2
import chrom  # local helper module holding a dict of user-agent strings


class getManhua(object):
    def __init__(self):
        self.num = 5  # number of pages to crawl
        self.starturl = 'http://comic.kukudm.com/comiclist/2154/51850/1.htm'
        self.browser = self.getBrowser()
        self.getPic(self.browser)

    def getBrowser(self):
        # Give PhantomJS a desktop Chrome user agent so the site serves
        # the normal page instead of blocking the headless browser.
        dcap = dict(DesiredCapabilities.PHANTOMJS)
        dcap["phantomjs.page.settings.userAgent"] = (
            "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36")
        browser = webdriver.PhantomJS(desired_capabilities=dcap)
        try:
            browser.get(self.starturl)
        except:
            print 'open url fail'
        browser.implicitly_wait(20)
        return browser

    def getPic(self, browser):
        cartoonTitle = browser.title.split('_')[0]
        self.createDir(cartoonTitle)
        os.chdir(cartoonTitle)
        for i in range(1, self.num + 1):  # pages 1 .. num
            i = str(i)
            imgurl = browser.find_element_by_tag_name('img').get_attribute('src')
            print imgurl
            with open('page' + i + '.jpg', 'wb') as fp:
                # Download the image with urllib2, spoofing a Firefox
                # user agent taken from the chrom helper module.
                agent = chrom.pcUserAgent.get('Firefox 4.0.1 - Windows')
                request = urllib2.Request(imgurl)
                request.add_header(agent.split(':', 1)[0], agent.split(':', 1)[1])
                response = urllib2.urlopen(request)
                fp.write(response.read())
            print 'page' + i + ' success'
            # The last <a> on the page links to the next page.
            nextTag = browser.find_elements_by_tag_name('a')[-1].get_attribute('href')
            browser.get(nextTag)
            browser.implicitly_wait(20)

    def createDir(self, cartoonTitle):
        if os.path.exists(cartoonTitle):
            print 'exists'
        else:
            os.mkdir(cartoonTitle)


if __name__ == '__main__':
    getManhua()
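The chrom module imported above is a local helper that is not shown in the post. Judging from how it is used (pcUserAgent.get(...) returns a 'Header-Name:value' string that the crawler splits on the first colon), it is presumably a small dictionary of user-agent strings along these lines; treat this as a hypothetical reconstruction, not the author's actual file:

# chrom.py -- hypothetical reconstruction of the missing helper module.
# Each value is a 'Header-Name:value' string that the crawler splits
# on the first ':' to build the request header.
pcUserAgent = {
    'Firefox 4.0.1 - Windows': 'User-Agent:Mozilla/5.0 (Windows NT 6.1; '
                               'rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
}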
To deal with the anti-crawler mechanism, I set request headers in both Selenium and urllib2, since the site filters requests to weed out crawlers. This demo only crawls 5 pages starting from the start URL, and to guard against image and network delays it uses a 20-second wait, so startup takes a little longer and you need to be patient.
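As an aside, an explicit wait can be more predictable than a blanket 20-second implicit wait, since it returns as soon as the image element shows up instead of risking the full timeout on every page. A small sketch; the tag-name selector and timeout are assumptions carried over from the code above:

# Hedged sketch: explicit wait for the comic image instead of a
# fixed implicit wait. Selector and timeout mirror the demo above.
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def wait_for_image(browser, timeout=20):
    # Blocks until an <img> element is present, or raises
    # TimeoutException after `timeout` seconds.
    return WebDriverWait(browser, timeout).until(
        EC.presence_of_element_located((By.TAG_NAME, 'img')))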
Run process
[Screenshot: output of the crawler run]
Crawling Kuku comics with Python and Selenium