The Python crawler's path: a simple web-capture upgrade (adding multithreading support)


Reprinted from the author's blog: http://www.mylonly.com/archives/1418.html

After two nights of struggle, the crawler from the previous article (Python crawler: simple web capture) has been improved a bit. The task of collecting image links and the task of downloading images are now handled by separate threads. In addition, the crawler is no longer limited to the first page of image links: it crawls every image under http://desk.zol.com.cn/meinv/, and it can download the images at a choice of resolutions; the setup is explained in the comments inside the code.
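The core idea of the upgrade is a producer/consumer split: one thread gathers image links into a shared queue while another drains the queue and downloads. A minimal modern-Python sketch of that structure (the page list and URL suffix are made up for illustration; the real script parses the ZOL pages):

```python
import queue
import threading

url_queue = queue.Queue()
downloaded = []

def collect_links(pages):
    # Producer: in the real crawler this parses each atlas page and
    # enqueues the image links it finds.
    for page in pages:
        url_queue.put(page + "/big.jpg")
    url_queue.put(None)  # sentinel: no more links are coming

def download_images():
    # Consumer: block on the queue until the sentinel arrives.
    while True:
        url = url_queue.get()
        if url is None:
            break
        downloaded.append(url)  # the real crawler fetches and saves here

pages = ["http://desk.zol.com.cn/meinv/p1", "http://desk.zol.com.cn/meinv/p2"]
producer = threading.Thread(target=collect_links, args=(pages,))
consumer = threading.Thread(target=download_images)
producer.start()
consumer.start()
producer.join()
consumer.join()
print(len(downloaded))  # → 2
```

`queue.Queue` is thread-safe, so no explicit locking is needed between the two threads; the sentinel value is one common way to tell the consumer the producer is done.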

The code still has some rough edges: Ctrl-C cannot terminate the program, probably because the threads do not respond to the main program's termination signal (so it is best to run the program in the background), and the way work is allocated to the threads could also be improved. These may be fixed gradually.
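One common way to address the Ctrl-C problem, sketched here as an assumption rather than the author's fix, is to mark the worker threads as daemon threads: the interpreter is then free to exit (for example on a KeyboardInterrupt in the main thread) without waiting for them to finish.

```python
import threading
import time

def worker():
    # Stand-in for the long-running download loop.
    while True:
        time.sleep(0.1)

t = threading.Thread(target=worker)
t.daemon = True  # daemon threads do not block interpreter shutdown
t.start()
print(t.daemon)  # → True
```

The trade-off is that daemon threads are killed abruptly at shutdown, so a partially written image file may be left behind; a cleaner design would poll a shutdown flag inside the loop.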

#!/usr/bin/python
#coding:utf-8
#############################################################
# File Name: main.py
# Author: mylonly
# Mail: [email protected]
# Created Time: Wed 1 Jun 08:22:12 PM CST
#############################################################
import re, urllib2, HTMLParser, threading, Queue, time

# Entry link of each atlas
htmlDoorList = []
# HTML links that contain images
htmlUrlList = []
# Queue of image URLs
imageUrlList = Queue.Queue(0)
# Number of captured images
imageGetCount = 0
# Number of downloaded images
imageDownloadCount = 0
# Start address of each atlas, used to detect termination
nextHtmlUrl = ''
# Local save path
localSavePath = '/data/1920x1080/'
# If you want another resolution, change replace_str. For example, the
# following resolutions are available:
# 1920x1200, 1980x1920, 1680x1050, 1600x900, 1440x900, 1366x768,
# 1280x1024, 1024x768, 1280x800
replace_str = '1920x1080'
replaced_str = '960x600'

# Inner-page parser
class ImageHtmlParser(HTMLParser.HTMLParser):
    def __init__(self):
        self.nextUrl = ''
        HTMLParser.HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        global imageUrlList
        if tag == 'img' and len(attrs) > 2:
            if attrs[0] == ('id', 'bigImg'):
                url = attrs[1][1]
                url = url.replace(replaced_str, replace_str)
                imageUrlList.put(url)
                global imageGetCount
                imageGetCount = imageGetCount + 1
                print url
        elif tag == 'a' and len(attrs) == 4:
            if attrs[0] == ('id', 'pageNext') and attrs[1] == ('class', 'next'):
                global nextHtmlUrl
                nextHtmlUrl = attrs[2][1]

# Index-page parser
class IndexHtmlParser(HTMLParser.HTMLParser):
    def __init__(self):
        self.urlList = []
        self.index = 0
        self.nextUrl = ''
        self.tagList = ['li', 'a']
        self.classList = ['photo-list-padding', 'pic']
        HTMLParser.HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        if tag == self.tagList[self.index]:
            for attr in attrs:
                if attr[1] == self.classList[self.index]:
                    if self.index == 0:
                        # First level found
                        self.index = 1
                    else:
                        # Second level found
                        self.index = 0
                        print attrs[1][1]
                        self.urlList.append(attrs[1][1])
                    break
        elif tag == 'a':
            for attr in attrs:
                if attr[0] == 'id' and attr[1] == 'pageNext':
                    self.nextUrl = attrs[1][1]
                    print 'nextUrl:', self.nextUrl
                    break

# Index-page HTML parser
indexParser = IndexHtmlParser()
# Inner-page HTML parser
imageParser = ImageHtmlParser()

# Get all entry links from the index page
print 'Start scanning the home page...'
host = 'http://desk.zol.com.cn'
indexUrl = '/meinv/'
while indexUrl != '':
    print 'Crawling web page:', host + indexUrl
    request = urllib2.Request(host + indexUrl)
    try:
        m = urllib2.urlopen(request)
        con = m.read()
        indexParser.feed(con)
        if indexUrl == indexParser.nextUrl:
            break
        else:
            indexUrl = indexParser.nextUrl
    except urllib2.URLError, e:
        print e.reason
print 'Home page scan complete, all atlas links have been obtained:'
htmlDoorList = indexParser.urlList

# Get every image URL from the entry links
class getImageUrl(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        for door in htmlDoorList:
            print 'Start to get the image addresses, the entry address is:', door
            global nextHtmlUrl
            nextHtmlUrl = ''
            while door != '':
                print 'Start getting pictures from page %s...' % (host + door)
                if nextHtmlUrl != '':
                    request = urllib2.Request(host + nextHtmlUrl)
                else:
                    request = urllib2.Request(host + door)
                try:
                    m = urllib2.urlopen(request)
                    con = m.read()
                    imageParser.feed(con)
                    print 'The next page address is:', nextHtmlUrl
                    if door == nextHtmlUrl:
                        break
                except urllib2.URLError, e:
                    print e.reason
        print 'All picture addresses have been obtained:', imageUrlList

class getImage(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        global imageUrlList
        print 'Start downloading pictures...'
        while True:
            print 'Number of captured pictures:', imageGetCount
            print 'Number of downloaded pictures:', imageDownloadCount
            image = imageUrlList.get()
            print 'Download file path:', image
            try:
                cont = urllib2.urlopen(image).read()
                patter = '[0-9]*\.jpg'
                match = re.search(patter, image)
                if match:
                    print 'Downloading file:', match.group()
                    fileName = localSavePath + match.group()
                    f = open(fileName, 'wb')
                    f.write(cont)
                    f.close()
                    global imageDownloadCount
                    imageDownloadCount = imageDownloadCount + 1
                else:
                    print 'no match'
                if imageUrlList.empty():
                    break
            except urllib2.URLError, e:
                print e.reason
        print 'Files are all downloaded...'

get = getImageUrl()
get.start()
print 'Image-link thread started:'
time.sleep(2)
download = getImage()
download.start()
print 'Image-download thread started:'
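Two small pieces of the script above are worth illustrating in isolation: the resolution switch is a plain substring replacement on the image URL, and the local file name is derived with the regex `[0-9]*\.jpg`. A self-contained sketch (the URL below is invented for illustration):

```python
import re

replace_str = '1920x1080'
replaced_str = '960x600'
localSavePath = '/data/1920x1080/'

# Swap the resolution segment of the image URL (hypothetical example URL).
url = "http://b.zol-img.com.cn/desk/bizhi/image/960x600/12345.jpg"
url = url.replace(replaced_str, replace_str)

# Derive the local file name from the numeric basename of the URL.
match = re.search(r"[0-9]*\.jpg", url)
file_name = localSavePath + match.group() if match else None
print(url)
print(file_name)  # → /data/1920x1080/12345.jpg
```

This also shows why `replace_str` must exactly match one of the resolutions the site actually serves: if the substring is absent, `replace` silently leaves the URL unchanged and the download falls back to the original resolution.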

