Python multithreaded multi-queue crawler (BeautifulSoup)

Source: Internet
Author: User

The program works roughly as follows:

The program sets up two queues: `queue` holds the URLs to fetch, and `out_queue` holds the downloaded page source.

The `ThreadUrl` threads take URLs from `queue`, download each page with `urllib2.urlopen`, and put the page source into `out_queue`.

The `DatamineThread` threads take page source from `out_queue`, extract the desired content with the BeautifulSoup module, and print it.

This is just a basic framework that can be extended as needed.

The code is commented in detail; if you spot a mistake, corrections are welcome.

Note that this is Python 2 code: it uses the `Queue` module, `urllib2`, `print` statements, and the old BeautifulSoup 3 import.

```python
import Queue
import threading
import urllib2
import time
from BeautifulSoup import BeautifulSoup

hosts = ["http://yahoo.com", "http://taobao.com", "http://apple.com",
         "http://ibm.com", "http://www.amazon.cn"]

queue = Queue.Queue()      # queue that holds the URLs
out_queue = Queue.Queue()  # queue that holds the downloaded page source


class ThreadUrl(threading.Thread):
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            host = self.queue.get()
            url = urllib2.urlopen(host)
            chunk = url.read()
            self.out_queue.put(chunk)  # hand the page source to out_queue
            self.queue.task_done()     # each task_done() marks one task complete


class DatamineThread(threading.Thread):
    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        while True:
            chunk = self.out_queue.get()
            soup = BeautifulSoup(chunk)
            print soup.findAll(['title'])  # pull the <title> tags out of the source
            self.out_queue.task_done()


start = time.time()


def main():
    for i in range(5):
        t = ThreadUrl(queue, out_queue)  # each thread stores page source in out_queue
        t.setDaemon(True)                # daemon thread: dies with the main thread
        t.start()
    for host in hosts:                   # put all the URLs into the queue
        queue.put(host)
    for i in range(5):
        dt = DatamineThread(out_queue)   # each thread parses the <title> content
        dt.setDaemon(True)
        dt.start()
    queue.join()      # block until every URL has been fetched...
    out_queue.join()  # ...and every page has been parsed


main()
print "Total time: %s" % (time.time() - start)
```
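Since the program above targets Python 2, here is a rough Python 3 sketch of the same two-queue pattern. The network call is stubbed out as a hypothetical `fetch()` helper (it just fabricates a page containing a `<title>` tag), and the title is extracted with plain string slicing instead of BeautifulSoup, so the pipeline itself can be run and tested in isolation; swap in `urllib.request.urlopen` and a real parser as needed.

```python
import queue
import threading


def fetch(host):
    # Hypothetical stand-in for urllib2.urlopen(host).read():
    # fabricates a tiny page whose title is the host itself.
    return "<title>%s</title>" % host


def run_pipeline(hosts, n_workers=2):
    url_queue = queue.Queue()   # holds URLs waiting to be downloaded
    page_queue = queue.Queue()  # holds downloaded page source
    results = []                # list.append is atomic under the GIL

    def downloader():
        while True:
            host = url_queue.get()
            page_queue.put(fetch(host))  # hand the page to the parser stage
            url_queue.task_done()

    def parser():
        while True:
            chunk = page_queue.get()
            # Extract the <title> contents with string slicing
            # (a real crawler would use BeautifulSoup here).
            start = chunk.find("<title>") + len("<title>")
            end = chunk.find("</title>")
            results.append(chunk[start:end])
            page_queue.task_done()

    for _ in range(n_workers):
        threading.Thread(target=downloader, daemon=True).start()
        threading.Thread(target=parser, daemon=True).start()

    for host in hosts:
        url_queue.put(host)

    url_queue.join()   # block until every URL has been downloaded...
    page_queue.join()  # ...and every page has been parsed
    return results
```

Under these assumptions, `run_pipeline(hosts)` returns the extracted titles once both queues have drained; the daemon workers are simply left blocked on `get()` afterwards, which is acceptable for a sketch but worth replacing with a sentinel-based shutdown in a long-running program.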

