Multi-threaded Qiushibaike (糗事百科) crawler case
The requirements for this case are the same as in the previous single-process Qiushibaike case: http://www.cnblogs.com/miqi1992/p/8081929.html
Queue (queue object)
Queue is a module in the Python standard library and can be imported directly with import Queue. Queues are the most common way to exchange data between threads.
Thinking about multithreading in Python
Locking shared resources is a key part of multithreaded programming, because Python's built-in types such as list and dict are not thread-safe. Queue, however, is thread-safe, so queues are the recommended way to pass data between threads.
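To see why locking matters, here is a minimal sketch (not part of the original case; the counter and worker function are illustrative) of two threads incrementing a shared counter under a threading.Lock:

```python
import threading

total = 0
lock = threading.Lock()

def worker():
    global total
    for _ in range(100000):
        with lock:   # without the lock, concurrent "total += 1" can lose updates
            total += 1

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print total          # reliably 200000 only because each increment is locked
```

Even under CPython's GIL, total += 1 is not atomic (it compiles to separate load, add, and store steps), so unsynchronized increments can interleave and drop updates.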
- Initialization: class Queue.Queue(maxsize), FIFO (first in, first out)
- Common methods in the module:
  - Queue.qsize() returns the size of the queue
  - Queue.empty() returns True if the queue is empty, otherwise False
  - Queue.full() returns True if the queue is full, otherwise False
  - Queue.full corresponds to the maxsize given at initialization
  - Queue.get([block[, timeout]]) takes an item from the queue; timeout is the wait time
- Create a Queue object:
  - import Queue
  - myqueue = Queue.Queue(maxsize=10)
- Put a value into the queue: myqueue.put(value)
- Take a value out of the queue: myqueue.get() (see the sketch after this list)
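Putting the calls above together, here is a minimal runnable sketch (the queue name and the values queued are just for illustration):

```python
import Queue

myqueue = Queue.Queue(maxsize=10)  # FIFO queue holding at most 10 items

print myqueue.empty()   # True: nothing has been queued yet
myqueue.put(1)          # put a value into the queue
myqueue.put(2)
print myqueue.qsize()   # 2
print myqueue.full()    # False: fewer than maxsize items queued
print myqueue.get()     # 1 -- first in, first out
```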
Multithreading
```python
#-*- coding:utf-8 -*-

import requests
from lxml import etree
from Queue import Queue
import threading
import time
import json


class Thread_crawl(threading.Thread):
    """Fetch thread class"""
    def __init__(self, threadID, q):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.q = q

    def run(self):
        print "Starting: " + self.threadID
        self.qiushi_spider()
        print "Exiting: " + self.threadID

    def qiushi_spider(self):
        while True:
            if self.q.empty():
                break
            else:
                page = self.q.get()
                print 'qiushi_spider=', self.threadID, ',page=', str(page)
                url = 'http://www.qiushibaike.com/8hr/page/' + str(page) + '/'
                headers = {
                    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                                  '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
                    'Accept-Language': 'zh-CN,zh;q=0.8'
                }
                # Give up after several failed attempts, to avoid an infinite loop
                timeout = 4
                while timeout > 0:
                    timeout -= 1
                    try:
                        content = requests.get(url, headers=headers)
                        data_queue.put(content.text)
                        break
                    except Exception as e:
                        print "qiushi_spider", e
                if timeout < 0:
                    print 'timeout', url


class Thread_parser(threading.Thread):
    """Page parsing class"""
    def __init__(self, threadID, queue, lock, f):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.queue = queue
        self.lock = lock
        self.f = f

    def run(self):
        print 'Starting', self.threadID
        global total, exitFlag_Parser
        while not exitFlag_Parser:
            try:
                # The queue's get() method removes an item from the head of the
                # queue and returns it. The optional argument block defaults to True:
                #   - if the queue is empty and block is True, get() suspends the
                #     calling thread until an item is available
                #   - if the queue is empty and block is False, get() raises the
                #     Empty exception
                item = self.queue.get(False)
                if not item:
                    pass
                self.parse_data(item)
                self.queue.task_done()
                print 'Thread_parser=', self.threadID, ',total=', total
            except:
                pass
        print "Exiting", self.threadID

    def parse_data(self, item):
        """
        Parse one page.
        :param item: page content
        :return:
        """
        global total
        try:
            html = etree.HTML(item)
            result = html.xpath('//div[contains(@id, "qiushi_tag")]')
            for site in result:
                try:
                    imgUrl = site.xpath('.//img/@src')[0]
                    title = site.xpath('.//h2')[0].text
                    content = site.xpath('.//div[@class="content"]/span')[0].text.strip()
                    vote = None
                    comments = None
                    try:
                        # number of votes
                        vote = site.xpath('.//i')[0].text
                        # print site.xpath('.//*[@class="number"]')[0].text
                        # comment information
                        comments = site.xpath('.//i')[1].text
                    except:
                        pass
                    result = {
                        'imageUrl': imgUrl,
                        'title': title,
                        'content': content,
                        'vote': vote,
                        'comments': comments,
                    }
                    with self.lock:
                        self.f.write(json.dumps(result, ensure_ascii=False).encode('utf-8') + '\n')
                except Exception as e:
                    print "site in result", e
        except Exception as e:
            print "parse_data", e
        with self.lock:
            total += 1


data_queue = Queue()
exitFlag_Parser = False
lock = threading.Lock()
total = 0


def main():
    output = open('qiushibaike.json', 'a')

    # Initialize the page queue with page numbers 1-10
    pageQueue = Queue(10)
    for page in range(1, 11):
        pageQueue.put(page)

    # Initialize the crawl threads
    crawlthreads = []
    crawlList = ["crawl-1", "crawl-2", "crawl-3"]
    for threadID in crawlList:
        thread = Thread_crawl(threadID, pageQueue)
        thread.start()
        crawlthreads.append(thread)

    # Initialize the parser threads in parserList
    parserthreads = []
    parserList = ["parser-1", "parser-2", "parser-3"]
    # start each parser thread
    for threadID in parserList:
        thread = Thread_parser(threadID, data_queue, lock, output)
        thread.start()
        parserthreads.append(thread)

    # Wait until the page queue is drained
    while not pageQueue.empty():
        pass

    # Wait for all crawl threads to finish
    for t in crawlthreads:
        t.join()

    while not data_queue.empty():
        pass

    # Notify the parser threads to exit
    global exitFlag_Parser
    exitFlag_Parser = True

    for t in parserthreads:
        t.join()

    print 'Exiting Main Thread'
    with lock:
        output.close()


if __name__ == '__main__':
    main()
```
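One design note on main(): the two "while not ...empty(): pass" loops busy-wait, spinning a CPU core until the queues drain, and the parsers poll with get(False) inside a bare except. Since task_done() is already called for every parsed page, a gentler shutdown is possible with Queue.join() and daemon threads. The sketch below is not the original author's code; it assumes the rest of the program is unchanged and that the parser's run() switches to a blocking self.queue.get():

```python
# Hypothetical variant of the shutdown in main(), not the original code.
for threadID in parserList:
    thread = Thread_parser(threadID, data_queue, lock, output)
    thread.setDaemon(True)   # daemon threads die when the main thread exits
    thread.start()

for t in crawlthreads:
    t.join()                 # every page has been fetched and queued

data_queue.join()            # blocks until each put() is matched by task_done()
output.close()               # safe: all parser writes are finished
```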