The Python crawler's path: a simple web-capture upgrade (adding multithreading support)


Reprinted from the author's blog: http://www.mylonly.com/archives/1418.html

After two nights of struggle, the crawler from the previous article (Python crawler: simple web capture) has been improved a bit. The task of collecting image links and the task of downloading images are now handled by separate threads. In addition, the crawler is no longer limited to the first page of image links: it crawls every image under http://desk.zol.com.cn/meinv/, and it can download the images at a choice of resolutions; the setup is explained in the comments inside the code.
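The core idea of the upgrade is a producer/consumer split: one thread gathers image links into a shared queue while another drains the queue and downloads. A minimal modern-Python sketch of that structure (the page list and URL suffix are made up for illustration; the real script parses the ZOL pages):

```python
import queue
import threading

url_queue = queue.Queue()
downloaded = []

def collect_links(pages):
    # Producer: in the real crawler this parses each atlas page and
    # enqueues the image links it finds.
    for page in pages:
        url_queue.put(page + "/big.jpg")
    url_queue.put(None)  # sentinel: no more links are coming

def download_images():
    # Consumer: block on the queue until the sentinel arrives.
    while True:
        url = url_queue.get()
        if url is None:
            break
        downloaded.append(url)  # the real crawler fetches and saves here

pages = ["http://desk.zol.com.cn/meinv/p1", "http://desk.zol.com.cn/meinv/p2"]
producer = threading.Thread(target=collect_links, args=(pages,))
consumer = threading.Thread(target=download_images)
producer.start()
consumer.start()
producer.join()
consumer.join()
print(len(downloaded))  # → 2
```

`queue.Queue` is thread-safe, so no explicit locking is needed between the two threads; the sentinel value is one common way to tell the consumer the producer is done.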

The code still has some rough edges: Ctrl-C cannot terminate the program, probably because the threads do not respond to the main program's termination signal (so it is best to run the program in the background), and the way work is allocated to the threads could also be improved. These may be fixed gradually.
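One common way to address the Ctrl-C problem, sketched here as an assumption rather than the author's fix, is to mark the worker threads as daemon threads: the interpreter is then free to exit (for example on a KeyboardInterrupt in the main thread) without waiting for them to finish.

```python
import threading
import time

def worker():
    # Stand-in for the long-running download loop.
    while True:
        time.sleep(0.1)

t = threading.Thread(target=worker)
t.daemon = True  # daemon threads do not block interpreter shutdown
t.start()
print(t.daemon)  # → True
```

The trade-off is that daemon threads are killed abruptly at shutdown, so a partially written image file may be left behind; a cleaner design would poll a shutdown flag inside the loop.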

#!/usr/bin/python
#coding:utf-8
#############################################################
# File Name: main.py
# Author: mylonly
# Mail: [email protected]
# Created Time: Wed 1 Jun 08:22:12 PM CST
#############################################################
import re, urllib2, HTMLParser, threading, Queue, time

# Entry link of each atlas
htmlDoorList = []
# HTML links that contain images
htmlUrlList = []
# Queue of image URLs
imageUrlList = Queue.Queue(0)
# Number of captured images
imageGetCount = 0
# Number of downloaded images
imageDownloadCount = 0
# Start address of each atlas, used to detect termination
nextHtmlUrl = ''
# Local save path
localSavePath = '/data/1920x1080/'
# If you want another resolution, change replace_str. For example, the
# following resolutions are available:
# 1920x1200, 1980x1920, 1680x1050, 1600x900, 1440x900, 1366x768,
# 1280x1024, 1024x768, 1280x800
replace_str = '1920x1080'
replaced_str = '960x600'

# Inner-page parser
class ImageHtmlParser(HTMLParser.HTMLParser):
    def __init__(self):
        self.nextUrl = ''
        HTMLParser.HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        global imageUrlList
        if tag == 'img' and len(attrs) > 2:
            if attrs[0] == ('id', 'bigImg'):
                url = attrs[1][1]
                url = url.replace(replaced_str, replace_str)
                imageUrlList.put(url)
                global imageGetCount
                imageGetCount = imageGetCount + 1
                print url
        elif tag == 'a' and len(attrs) == 4:
            if attrs[0] == ('id', 'pageNext') and attrs[1] == ('class', 'next'):
                global nextHtmlUrl
                nextHtmlUrl = attrs[2][1]

# Index-page parser
class IndexHtmlParser(HTMLParser.HTMLParser):
    def __init__(self):
        self.urlList = []
        self.index = 0
        self.nextUrl = ''
        self.tagList = ['li', 'a']
        self.classList = ['photo-list-padding', 'pic']
        HTMLParser.HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        if tag == self.tagList[self.index]:
            for attr in attrs:
                if attr[1] == self.classList[self.index]:
                    if self.index == 0:
                        # First level found
                        self.index = 1
                    else:
                        # Second level found
                        self.index = 0
                        print attrs[1][1]
                        self.urlList.append(attrs[1][1])
                    break
        elif tag == 'a':
            for attr in attrs:
                if attr[0] == 'id' and attr[1] == 'pageNext':
                    self.nextUrl = attrs[1][1]
                    print 'nextUrl:', self.nextUrl
                    break

# Index-page HTML parser
indexParser = IndexHtmlParser()
# Inner-page HTML parser
imageParser = ImageHtmlParser()

# Get all entry links from the index page
print 'Start scanning the home page...'
host = 'http://desk.zol.com.cn'
indexUrl = '/meinv/'
while indexUrl != '':
    print 'Crawling web page:', host + indexUrl
    request = urllib2.Request(host + indexUrl)
    try:
        m = urllib2.urlopen(request)
        con = m.read()
        indexParser.feed(con)
        if indexUrl == indexParser.nextUrl:
            break
        else:
            indexUrl = indexParser.nextUrl
    except urllib2.URLError, e:
        print e.reason
print 'Home page scan complete, all atlas links have been obtained:'
htmlDoorList = indexParser.urlList

# Get every image URL from the entry links
class getImageUrl(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        for door in htmlDoorList:
            print 'Start to get the image addresses, the entry address is:', door
            global nextHtmlUrl
            nextHtmlUrl = ''
            while door != '':
                print 'Start getting pictures from page %s...' % (host + door)
                if nextHtmlUrl != '':
                    request = urllib2.Request(host + nextHtmlUrl)
                else:
                    request = urllib2.Request(host + door)
                try:
                    m = urllib2.urlopen(request)
                    con = m.read()
                    imageParser.feed(con)
                    print 'The next page address is:', nextHtmlUrl
                    if door == nextHtmlUrl:
                        break
                except urllib2.URLError, e:
                    print e.reason
        print 'All picture addresses have been obtained:', imageUrlList

class getImage(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        global imageUrlList
        print 'Start downloading pictures...'
        while True:
            print 'Number of captured pictures:', imageGetCount
            print 'Number of downloaded pictures:', imageDownloadCount
            image = imageUrlList.get()
            print 'Download file path:', image
            try:
                cont = urllib2.urlopen(image).read()
                patter = '[0-9]*\.jpg'
                match = re.search(patter, image)
                if match:
                    print 'Downloading file:', match.group()
                    fileName = localSavePath + match.group()
                    f = open(fileName, 'wb')
                    f.write(cont)
                    f.close()
                    global imageDownloadCount
                    imageDownloadCount = imageDownloadCount + 1
                else:
                    print 'no match'
                if imageUrlList.empty():
                    break
            except urllib2.URLError, e:
                print e.reason
        print 'Files are all downloaded...'

get = getImageUrl()
get.start()
print 'Image-link thread started:'
time.sleep(2)
download = getImage()
download.start()
print 'Image-download thread started:'
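Two small pieces of the script above are worth illustrating in isolation: the resolution switch is a plain substring replacement on the image URL, and the local file name is derived with the regex `[0-9]*\.jpg`. A self-contained sketch (the URL below is invented for illustration):

```python
import re

replace_str = '1920x1080'
replaced_str = '960x600'
localSavePath = '/data/1920x1080/'

# Swap the resolution segment of the image URL (hypothetical example URL).
url = "http://b.zol-img.com.cn/desk/bizhi/image/960x600/12345.jpg"
url = url.replace(replaced_str, replace_str)

# Derive the local file name from the numeric basename of the URL.
match = re.search(r"[0-9]*\.jpg", url)
file_name = localSavePath + match.group() if match else None
print(url)
print(file_name)  # → /data/1920x1080/12345.jpg
```

This also shows why `replace_str` must exactly match one of the resolutions the site actually serves: if the substring is absent, `replace` silently leaves the URL unchanged and the download falls back to the original resolution.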

