Multi-threaded Qiushibaike (糗事百科) crawler case
The requirements for this case are the same as in the previous single-process Qiushibaike case: http://www.cnblogs.com/miqi1992/p/8081929.html
Queue (queue object)
Queue is a module in the Python standard library and can be imported directly with import Queue. Queues are the most common way to exchange data between threads.
Thinking about multithreading in Python
Locking shared resources is a key part of multithreaded programming, because Python's built-in types such as list and dict are not thread-safe. Queue, however, is thread-safe, so queues are the recommended way to pass data between threads.
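To see why locking matters, here is a minimal sketch (not part of the original case; the counter and worker function are illustrative) of two threads incrementing a shared counter under a threading.Lock:

```python
import threading

total = 0
lock = threading.Lock()

def worker():
    global total
    for _ in range(100000):
        with lock:   # without the lock, concurrent "total += 1" can lose updates
            total += 1

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print total          # reliably 200000 only because each increment is locked
```

Even under CPython's GIL, total += 1 is not atomic (it compiles to separate load, add, and store steps), so unsynchronized increments can interleave and drop updates.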
- Initialization: class Queue.Queue(maxsize), FIFO (first in, first out)
- Common methods in the module:
  - Queue.qsize() returns the size of the queue
  - Queue.empty() returns True if the queue is empty, otherwise False
  - Queue.full() returns True if the queue is full, otherwise False
  - Queue.full corresponds to the maxsize given at initialization
  - Queue.get([block[, timeout]]) takes an item from the queue; timeout is the wait time
- Create a Queue object:
  - import Queue
  - myqueue = Queue.Queue(maxsize=10)
- Put a value into the queue: myqueue.put(value)
- Take a value out of the queue: myqueue.get() (see the sketch after this list)
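Putting the calls above together, here is a minimal runnable sketch (the queue name and the values queued are just for illustration):

```python
import Queue

myqueue = Queue.Queue(maxsize=10)  # FIFO queue holding at most 10 items

print myqueue.empty()   # True: nothing has been queued yet
myqueue.put(1)          # put a value into the queue
myqueue.put(2)
print myqueue.qsize()   # 2
print myqueue.full()    # False: fewer than maxsize items queued
print myqueue.get()     # 1 -- first in, first out
```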
Multithreading
```python
#-*- coding:utf-8 -*-

import requests
from lxml import etree
from Queue import Queue
import threading
import time
import json


class Thread_crawl(threading.Thread):
    """Fetch thread class"""
    def __init__(self, threadID, q):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.q = q

    def run(self):
        print "Starting: " + self.threadID
        self.qiushi_spider()
        print "Exiting: " + self.threadID

    def qiushi_spider(self):
        while True:
            if self.q.empty():
                break
            else:
                page = self.q.get()
                print 'qiushi_spider=', self.threadID, ',page=', str(page)
                url = 'http://www.qiushibaike.com/8hr/page/' + str(page) + '/'
                headers = {
                    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                                  '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
                    'Accept-Language': 'zh-CN,zh;q=0.8'
                }
                # Give up after several failed attempts, to avoid an infinite loop
                timeout = 4
                while timeout > 0:
                    timeout -= 1
                    try:
                        content = requests.get(url, headers=headers)
                        data_queue.put(content.text)
                        break
                    except Exception as e:
                        print "qiushi_spider", e
                if timeout < 0:
                    print 'timeout', url


class Thread_parser(threading.Thread):
    """Page parsing class"""
    def __init__(self, threadID, queue, lock, f):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.queue = queue
        self.lock = lock
        self.f = f

    def run(self):
        print 'Starting', self.threadID
        global total, exitFlag_Parser
        while not exitFlag_Parser:
            try:
                # The queue's get() method removes an item from the head of the
                # queue and returns it. The optional argument block defaults to True:
                #   - if the queue is empty and block is True, get() suspends the
                #     calling thread until an item is available
                #   - if the queue is empty and block is False, get() raises the
                #     Empty exception
                item = self.queue.get(False)
                if not item:
                    pass
                self.parse_data(item)
                self.queue.task_done()
                print 'Thread_parser=', self.threadID, ',total=', total
            except:
                pass
        print "Exiting", self.threadID

    def parse_data(self, item):
        """
        Parse one page.
        :param item: page content
        :return:
        """
        global total
        try:
            html = etree.HTML(item)
            result = html.xpath('//div[contains(@id, "qiushi_tag")]')
            for site in result:
                try:
                    imgUrl = site.xpath('.//img/@src')[0]
                    title = site.xpath('.//h2')[0].text
                    content = site.xpath('.//div[@class="content"]/span')[0].text.strip()
                    vote = None
                    comments = None
                    try:
                        # number of votes
                        vote = site.xpath('.//i')[0].text
                        # print site.xpath('.//*[@class="number"]')[0].text
                        # comment information
                        comments = site.xpath('.//i')[1].text
                    except:
                        pass
                    result = {
                        'imageUrl': imgUrl,
                        'title': title,
                        'content': content,
                        'vote': vote,
                        'comments': comments,
                    }
                    with self.lock:
                        self.f.write(json.dumps(result, ensure_ascii=False).encode('utf-8') + '\n')
                except Exception as e:
                    print "site in result", e
        except Exception as e:
            print "parse_data", e
        with self.lock:
            total += 1


data_queue = Queue()
exitFlag_Parser = False
lock = threading.Lock()
total = 0


def main():
    output = open('qiushibaike.json', 'a')

    # Initialize the page queue with page numbers 1-10
    pageQueue = Queue(10)
    for page in range(1, 11):
        pageQueue.put(page)

    # Initialize the crawl threads
    crawlthreads = []
    crawlList = ["crawl-1", "crawl-2", "crawl-3"]
    for threadID in crawlList:
        thread = Thread_crawl(threadID, pageQueue)
        thread.start()
        crawlthreads.append(thread)

    # Initialize the parser threads in parserList
    parserthreads = []
    parserList = ["parser-1", "parser-2", "parser-3"]
    # start each parser thread
    for threadID in parserList:
        thread = Thread_parser(threadID, data_queue, lock, output)
        thread.start()
        parserthreads.append(thread)

    # Wait until the page queue is drained
    while not pageQueue.empty():
        pass

    # Wait for all crawl threads to finish
    for t in crawlthreads:
        t.join()

    while not data_queue.empty():
        pass

    # Notify the parser threads to exit
    global exitFlag_Parser
    exitFlag_Parser = True

    for t in parserthreads:
        t.join()

    print 'Exiting Main Thread'
    with lock:
        output.close()


if __name__ == '__main__':
    main()
```
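One design note on main(): the two "while not ...empty(): pass" loops busy-wait, spinning a CPU core until the queues drain, and the parsers poll with get(False) inside a bare except. Since task_done() is already called for every parsed page, a gentler shutdown is possible with Queue.join() and daemon threads. The sketch below is not the original author's code; it assumes the rest of the program is unchanged and that the parser's run() switches to a blocking self.queue.get():

```python
# Hypothetical variant of the shutdown in main(), not the original code.
for threadID in parserList:
    thread = Thread_parser(threadID, data_queue, lock, output)
    thread.setDaemon(True)   # daemon threads die when the main thread exits
    thread.start()

for t in crawlthreads:
    t.join()                 # every page has been fetched and queued

data_queue.join()            # blocks until each put() is matched by task_done()
output.close()               # safe: all parser writes are finished
```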