I have been learning guitar recently, and saving guitar tabs one at a time is too tedious, so I wrote a small program to download them.
First install BeautifulSoup, a library for parsing HTML:
pip install beautifulsoup4
The program uses BeautifulSoup with the html5lib parser, so install html5lib as well:
pip install html5lib
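As a quick illustration of what the two packages do together, the sketch below parses a small HTML fragment with the html5lib parser and picks out links with a CSS selector. The fragment, the titles, and the helper name extract_links are made up for the example; only the selector and the base URL come from the program in this post.

```python
from bs4 import BeautifulSoup

# Made-up fragment shaped like a listing page (not the site's real markup).
SAMPLE_HTML = """
<div id="ct">
  <dl><dt><a href="tab/img/1.html">Song A</a></dt></dl>
  <dl><dt><a href="tab/img/2.html">Song B</a></dt></dl>
</div>
"""


def extract_links(html, base="http://www.17jita.com/"):
    """Return a title/link dict for every anchor matched by the selector."""
    soup = BeautifulSoup(html, "html5lib")
    return [
        {"title": a.text, "link": base + a["href"]}
        for a in soup.select("#ct dl > dt > a")
    ]


links = extract_links(SAMPLE_HTML)
for entry in links:
    print(entry["title"] + " -> " + entry["link"])
```

If the selector matches nothing, select returns an empty list, which is the condition the program uses to detect that it has run past the last listing page.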
The code is as follows:
# -*- coding: utf-8 -*-
import os
import logging
import urllib2
import cookielib

from bs4 import BeautifulSoup

# One cookie jar shared by every request
cookiejar = cookielib.CookieJar()


def get(url):
    """Fetch a URL and return the response body."""
    req = urllib2.Request(url)
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookiejar))
    response = opener.open(req)
    return response.read()


def download_guitar_image(url, target):
    """Download one tab image and save it to the target path."""
    print 'start download guitar image ...'
    req = urllib2.Request(url)
    req.add_header('Accept', 'image/webp,image/*,*/*;q=0.8')
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookiejar))
    response = opener.open(req)
    content = response.read()
    with open(target, 'wb') as code:
        code.write(content)


# Parse the link addresses of the guitar tab image pages
def parse_guitar_img_link():
    page_list = []
    url_base = 'http://www.17jita.com/'
    page = 1
    while True:
        url = url_base + 'tab/img/index.php?page=' + str(page)
        print url
        html = get(url)
        soup = BeautifulSoup(html, 'html5lib')
        items = soup.select('#ct dl > dt > a')
        # An empty result means we have gone past the last listing page
        if not items:
            break
        for item in items:
            page_list.append({'title': item.text, 'link': url_base + item['href']})
        page += 1
    return page_list


def download_guitar_image_link_list(url):
    """Collect the image links from every page of one tab."""
    image_link_list = []
    page = 1
    while True:
        page_url = url
        if page > 1:
            # Pages after the first are assumed to use a suffix such as xxx_2.html
            page_url = url.replace('.html', '_' + str(page) + '.html')
        try:
            html = get(page_url)
            soup = BeautifulSoup(html, 'html5lib')
            img_list = soup.select('#article_contents a > img')
            for img in img_list:
                image_link_list.append(img['src'])
        except urllib2.URLError, e:
            # The loop ends when a page cannot be fetched (for example a 404)
            msg = u'Download ' + page_url + u' error, reason: ' + unicode(e.reason)
            print msg
            logging.error(msg)
            break
        page += 1
    return image_link_list


if __name__ == '__main__':
    logging.basicConfig(
        level=logging.DEBUG,
        format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S',
        filename='guitar.log',
        filemode='a')
    path = 'guitar'
    if not os.path.exists(path):
        os.mkdir(path)
    page_list = parse_guitar_img_link()
    for page in page_list:
        print page['link'] + '(' + page['title'] + ')'
        # One folder per tab, named after its title
        guitar_path = path + '/' + page['title'].encode('GBK')
        if not os.path.exists(guitar_path):
            os.mkdir(guitar_path)
        image_link_list = download_guitar_image_link_list(page['link'])
        for image_link in image_link_list:
            print '\t' + image_link
            # Keep everything from the last '/' onward as the file name
            filename = image_link[image_link.rindex('/'):]
            filepath = guitar_path + filename.encode('GBK')
            download_guitar_image(image_link, filepath)
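The program above is written for Python 2 (urllib2, cookielib, print statements). In Python 3 those two modules became urllib.request and http.cookiejar. The following is a minimal sketch of an equivalent get helper, assuming Python 3. The timeout parameter is an addition for illustration and is not in the original code.

```python
import http.cookiejar
import urllib.request

# One cookie jar and one opener shared by every request.
cookiejar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookiejar)
)


def get(url, timeout=30):
    """Fetch a URL and return the raw response bytes.

    timeout (in seconds) is an assumed default, not taken from the original.
    """
    request = urllib.request.Request(url)
    with opener.open(request, timeout=timeout) as response:
        return response.read()
```

Building the opener once, rather than inside every call, keeps the behaviour the same while avoiding repeated setup. The rest of the script would also need print() calls and the "except ... as e" syntax to run on Python 3.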
The program still has some problems that I am working on. For example, if a download is interrupted, the remaining guitar tabs are not downloaded.
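One possible way to soften the interruption problem is to skip files that already exist and to carry on after a failed image instead of stopping. The sketch below assumes Python 3. The names download_all and fetch are hypothetical: fetch stands for any callable that takes a URL and returns bytes, such as a wrapper around the request code. This is an illustration of the idea, not the fix the program ended up with.

```python
import os


def download_all(image_links, folder, fetch):
    """Download each link into folder and return the links that failed.

    Files that already exist are skipped, so re-running after an
    interruption only fetches what is still missing.
    """
    os.makedirs(folder, exist_ok=True)
    failed = []
    for link in image_links:
        # File name is the part of the URL after the last '/'.
        target = os.path.join(folder, link[link.rindex("/") + 1:])
        if os.path.exists(target):
            continue  # downloaded in an earlier run
        try:
            content = fetch(link)
        except OSError:
            # urllib.error.URLError and socket timeouts are OSError subclasses.
            failed.append(link)
            continue
        # Write under a temporary name first so an interrupted write
        # never leaves a truncated image under the final name.
        tmp = target + ".part"
        with open(tmp, "wb") as f:
            f.write(content)
        os.replace(tmp, target)
    return failed
```

Calling the function again with the same arguments retries only the missing images, and the returned list can be logged or retried later.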
Crawling guitar tabs from 17jita.com with Python