Python Crawler (18): Multithreaded Qiushibaike (Embarrassing Encyclopedia) Case

Tags: thread, class, xpath

Multithreaded Qiushibaike (Embarrassing Encyclopedia) case

The case requirements are the same as in the previous single-process Qiushibaike case: http://www.cnblogs.com/miqi1992/p/8081929.html

Queue (queue object)

Queue is a module in the Python standard library, so it can be used directly with import Queue. A queue is the most common way to exchange data between threads.

The idea behind multithreading in Python
When threads share resources, locking is essential, because Python's built-in types such as list and dict are not thread-safe. Queue, however, is thread-safe, so using queues to pass data between threads is recommended.
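To illustrate the locking point, here is a minimal sketch (not from the original article; the counter and worker names are made up for the example) that protects a shared counter with threading.Lock. Without the lock, the interleaved += updates can lose increments even under the GIL.

import threading

counter = 0                      # shared value, not thread-safe on its own
lock = threading.Lock()          # guards every update to counter

def worker():
    global counter
    for _ in range(100000):
        with lock:               # only one thread may update at a time
            counter += 1

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)                   # 400000 with the lock; unpredictable without it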

    1. Initialization: Queue.Queue(maxsize) creates a FIFO (first in, first out) queue
    2. Common methods of the module (a short sketch using them follows this list):
      • Queue.qsize() returns the size of the queue
      • Queue.empty() returns True if the queue is empty, otherwise False
      • Queue.full() returns True if the queue is full, otherwise False
      • Queue.full() corresponds to the maxsize given at construction time
      • Queue.get([block[, timeout]]) gets an item from the queue; timeout is how long to wait
    3. Create a queue object
      • import Queue
      • myQueue = Queue.Queue(maxsize=10)
    4. Put a value into the queue
      • myQueue.put(10)
    5. Take a value out of the queue
      • myQueue.get()
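As a small runnable sketch of these calls (not part of the original case; the producer/consumer names are illustrative, and it assumes Python 2's Queue module, which Python 3 renames to queue):

# -*- coding: utf-8 -*-
import threading
from Queue import Queue, Empty   # Python 3: from queue import Queue, Empty

page_queue = Queue(maxsize=10)   # FIFO queue holding at most 10 items

def producer():
    for page in range(1, 11):
        page_queue.put(page)     # blocks if the queue is full

def consumer():
    while True:
        try:
            page = page_queue.get(block=True, timeout=2)
        except Empty:            # nothing arrived within 2 seconds: stop
            break
        print("got page %d" % page)
        page_queue.task_done()   # mark this item as processed

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start()
c.start()
p.join()
c.join()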
Multithreading

# -*- coding: utf-8 -*-
# Python 2 code: the Queue module was renamed to queue in Python 3.
from __future__ import print_function

import json
import threading

import requests
from lxml import etree
from Queue import Queue, Empty


class Thread_crawl(threading.Thread):
    """Fetch thread: takes page numbers from a queue and downloads the pages."""

    def __init__(self, threadID, q):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.q = q

    def run(self):
        print("Starting:", self.threadID)
        self.qiushi_spider()
        print("Exiting:", self.threadID)

    def qiushi_spider(self):
        while True:
            if self.q.empty():
                break
            page = self.q.get()
            print('qiushi_spider=', self.threadID, 'page=', str(page))
            url = 'http://www.qiushibaike.com/8hr/page/' + str(page) + '/'
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                              '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
                'Accept-Language': 'zh-CN,zh;q=0.8'
            }
            # Try a few times and then give up, to avoid looping forever on errors.
            retries = 4
            while retries > 0:
                retries -= 1
                try:
                    content = requests.get(url, headers=headers)
                    data_queue.put(content.text)
                    break
                except Exception as e:
                    print("qiushi_spider", e)
            else:
                print("request failed repeatedly:", url)


class Thread_parser(threading.Thread):
    """Parse thread: takes downloaded pages from data_queue and extracts the items."""

    def __init__(self, threadID, queue, lock, f):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.queue = queue
        self.lock = lock
        self.f = f

    def run(self):
        print("Starting:", self.threadID)
        global total, exitFlag_Parser
        while not exitFlag_Parser:
            # queue.get() removes and returns an item from the head of the queue.
            # The optional block argument defaults to True:
            #   - if the queue is empty and block is True, get() suspends the calling
            #     thread until an item becomes available;
            #   - if the queue is empty and block is False, get() raises Empty.
            try:
                item = self.queue.get(False)
            except Empty:
                continue
            self.parse_data(item)
            self.queue.task_done()
            print("Thread_parser=", self.threadID, 'total=', total)
        print("Exiting:", self.threadID)

    def parse_data(self, item):
        """Parse one page of HTML and append the extracted records to the output file.

        :param item: the page content (HTML text)
        """
        global total
        try:
            html = etree.HTML(item)
            result = html.xpath('//div[contains(@id, "qiushi_tag")]')
            for site in result:
                try:
                    imgUrl = site.xpath('.//img/@src')[0]
                    title = site.xpath('.//h2')[0].text
                    content = site.xpath('.//div[@class="content"]/span')[0].text.strip()
                    vote = None
                    comments = None
                    try:
                        vote = site.xpath('.//i')[0].text       # number of votes
                        comments = site.xpath('.//i')[1].text   # number of comments
                    except IndexError:
                        pass
                    data = {
                        'imageUrl': imgUrl,
                        'title': title,
                        'content': content,
                        'vote': vote,
                        'comments': comments
                    }
                    with self.lock:
                        self.f.write(json.dumps(data, ensure_ascii=False).encode('utf-8') + '\n')
                except Exception as e:
                    print("site in result", e)
        except Exception as e:
            print("parse_data", e)

        with self.lock:
            total += 1


data_queue = Queue()        # downloaded pages waiting to be parsed
exitFlag_Parser = False     # set to True to tell the parser threads to stop
lock = threading.Lock()     # protects the output file and the total counter
total = 0


def main():
    output = open('qiushibaike.json', 'a')

    # Initialize the page queue with page numbers 1-10.
    pageQueue = Queue(10)
    for page in range(1, 11):
        pageQueue.put(page)

    # Start the crawl threads.
    crawlthreads = []
    crawlList = ["crawl-1", "crawl-2", "crawl-3"]
    for threadID in crawlList:
        thread = Thread_crawl(threadID, pageQueue)
        thread.start()
        crawlthreads.append(thread)

    # Start the parser threads.
    parserthreads = []
    parserList = ["parser-1", "parser-2", "parser-3"]
    for threadID in parserList:
        thread = Thread_parser(threadID, data_queue, lock, output)
        thread.start()
        parserthreads.append(thread)

    # Wait until the page queue has been emptied.
    while not pageQueue.empty():
        pass

    # Wait for all crawl threads to finish.
    for t in crawlthreads:
        t.join()

    # Wait until the data queue has been emptied.
    while not data_queue.empty():
        pass

    # Tell the parser threads to exit.
    global exitFlag_Parser
    exitFlag_Parser = True

    for t in parserthreads:
        t.join()

    print('Exiting Main Thread')
    with lock:
        output.close()


if __name__ == '__main__':
    main()
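A design note: main() above waits with busy loops (while not pageQueue.empty(): pass) plus a global exit flag, even though the parser threads already call task_done(). A common alternative is the task_done()/join() pattern. Below is a minimal standalone sketch of that pattern, not from the original article; the worker function and the daemon-thread shutdown are illustrative choices.

# Standalone sketch of the task_done()/join() pattern (Python 2's Queue module;
# Python 3 would use "from queue import Queue").
import threading
from Queue import Queue

q = Queue()
for n in range(1, 11):
    q.put(n)                      # each put() adds one "unfinished task"

def worker():
    while True:
        n = q.get()               # blocks until an item is available
        print("handled %d" % n)
        q.task_done()             # marks one task as finished

t = threading.Thread(target=worker)
t.daemon = True                   # let the process exit although worker loops forever
t.start()

q.join()                          # blocks until task_done() was called for every put()
print("all items processed")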
