A program that crawls the pictures on the Qiushibaike site. The program has a timeout function and the ability to handle exceptions.
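The "timeout with exception handling" behavior boils down to a try/except/else pattern: attempt the fetch, skip the URL on failure, and process the data only on success. A minimal sketch of that control flow, using a hypothetical `fetch` stub in place of the real network call:

```python
def fetch(url, timeout):
    # Stub standing in for a real HTTP read; raises for the "bad" URL
    # to simulate a timeout (hypothetical, for illustration only).
    if "bad" in url:
        raise IOError("timed out")
    return b"<html>ok</html>"

results = []
for url in ["http://good.example/1", "http://bad.example/2", "http://good.example/3"]:
    try:
        data = fetch(url, timeout=10)
    except Exception:
        # On timeout or any other error, record the failure and move on
        results.append((url, None))
        continue
    else:
        # Only reached when the fetch succeeded
        results.append((url, data))
```

The `continue` in the `except` branch is what keeps one slow or dead link from aborting the whole crawl.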
The source code is directly below:
# -*- coding: utf-8 -*-
"""
Created on October 20, 2016
@author: audi
"""
import urllib2
import re
from bs4 import BeautifulSoup
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

count = 0
path = "pic/tupian"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; '
                         'rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}

for x in range(1, 10):
    temp_url = "http://www.qiushibaike.com/imgrank/page/%d" % x
    req = urllib2.Request(url=temp_url, headers=headers)
    try:
        data = urllib2.urlopen(req, timeout=10).read()
    except:
        print "Opening the page link timed out!"
        continue
    else:
        print "Page opened successfully; starting to parse the data:"
        soup = BeautifulSoup(data, 'html.parser', from_encoding='utf-8')
        # Format of the div tag holding an image link:
        # <div class="thumb">
        #     <a href="/article/117795261" target="_blank">
        #         <img src="..." />
        #     </a>
        # </div>
        # Find the div tags of all images
        content = soup.find_all('div', class_='thumb')
        # The links set holds the final image links
        links = set()
        # Filter again to get the image links
        for i in content:
            temp_link = i.find_all('a', href=re.compile(r"/article/\d"))
            temp_link = temp_link[0].find('img', src=re.compile(r"\.(jpg|JPG|jpeg)"))
            temp_link = temp_link['src']
            links.add(temp_link)
        for link in links:
            try:
                pic_data = urllib2.urlopen(link, timeout=3).read()
            except:
                print "Failed to open the current child link:"
                continue
            else:
                file_name = path + str(count) + '.jpg'
                count += 1
                f = open(file_name, "wb")
                f.write(pic_data)
                f.close()
                print "Crawled link number " + str(count) + ": " + link
print "Congratulations, the image crawl is finished!"
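The link-filtering step above narrows the page down to `.jpg`/`.jpeg` image URLs and stores them in a set, so duplicate links collapse automatically. A standalone Python 3 sketch of that idea, using `re` on a hypothetical HTML snippet instead of BeautifulSoup so it runs without a network connection:

```python
import re

# Hypothetical HTML mimicking two identical "thumb" blocks from the page
html = '''
<div class="thumb">
  <a href="/article/117795261" target="_blank">
    <img src="//pic.example.com/pictures/117795261/medium/app117795261.jpg" />
  </a>
</div>
<div class="thumb">
  <a href="/article/117795261" target="_blank">
    <img src="//pic.example.com/pictures/117795261/medium/app117795261.jpg" />
  </a>
</div>
'''

def extract_image_links(page):
    """Return the deduplicated set of .jpg/.jpeg image links on the page."""
    found = re.findall(r'<img src="([^"]+\.(?:jpg|JPG|jpeg))"', page)
    # A set drops repeats, just as the script's `links = set()` does
    return set(found)

links = extract_image_links(html)
```

Both thumb blocks point at the same file, so `links` ends up with a single entry; in the original script this deduplication prevents downloading the same picture twice.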
My first Python crawler