Python multithreading with multiple queues (a BeautifulSoup web crawler)

Last updated: 2015-04-28. Source: Internet

Tags: python, web crawler, multithreading, architecture, synchronization, queue

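The program works roughly as follows:

It sets up two queues: queue holds the URLs to fetch, and out_queue holds the page source downloaded from them.

Each ThreadUrl thread takes a URL from queue, fetches its page source with urlopen, and puts that source into out_queue.

Each DatamineThread thread takes page source from out_queue, uses the BeautifulSoup module to extract the desired content, and prints it.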
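The two-queue design described above can be sketched in a form that runs offline. This is only an illustration under several assumptions: it targets Python 3 (where the module is named queue), it replaces the real download with a hypothetical FAKE_PAGES dictionary, and it swaps BeautifulSoup for the standard library's HTMLParser so no third-party package is needed. The names fetch_worker, parse_worker, and run_pipeline are made up for this sketch.

```python
import queue
import threading
from html.parser import HTMLParser

# Hypothetical stand-in for a real download: maps a URL to canned HTML.
FAKE_PAGES = {
    "http://example.com/a": "<html><head><title>Page A</title></head></html>",
    "http://example.com/b": "<html><head><title>Page B</title></head></html>",
}


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def fetch_worker(url_queue, page_queue):
    """Stage 1: take a URL, look up its source, pass the source on."""
    while True:
        url = url_queue.get()
        page_queue.put(FAKE_PAGES[url])
        url_queue.task_done()


def parse_worker(page_queue, titles, lock):
    """Stage 2: take page source, extract the title, record it."""
    while True:
        html = page_queue.get()
        parser = TitleParser()
        parser.feed(html)
        with lock:
            titles.append(parser.title)
        page_queue.task_done()


def run_pipeline(urls, workers=2):
    url_queue, page_queue = queue.Queue(), queue.Queue()
    titles, lock = [], threading.Lock()
    for _ in range(workers):
        threading.Thread(target=fetch_worker,
                         args=(url_queue, page_queue), daemon=True).start()
        threading.Thread(target=parse_worker,
                         args=(page_queue, titles, lock), daemon=True).start()
    for url in urls:
        url_queue.put(url)
    url_queue.join()   # every URL handled, so every page is already queued
    page_queue.join()  # every page parsed
    return sorted(titles)
```

Calling run_pipeline(list(FAKE_PAGES)) returns the two titles in sorted order. The order of the two join() calls matters: each fetch worker puts the page into the second queue before it calls task_done() on the first, so once the first join() returns, nothing is still in flight between the stages.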

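This is only a basic framework; it can be extended as needed.

The program is commented in detail. If you spot any problems, corrections are very welcome.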

    # Python 2 code: uses the Queue and urllib2 modules and BeautifulSoup 3
    import Queue
    import threading
    import urllib2
    import time
    from BeautifulSoup import BeautifulSoup

    hosts = ["http://yahoo.com", "http://taobao.com", "http://apple.com",
             "http://ibm.com", "http://www.amazon.cn"]

    queue = Queue.Queue()      # queue of URLs to fetch
    out_queue = Queue.Queue()  # queue of fetched page source


    class ThreadUrl(threading.Thread):
        def __init__(self, queue, out_queue):
            threading.Thread.__init__(self)
            self.queue = queue
            self.out_queue = out_queue

        def run(self):
            while True:
                host = self.queue.get()
                url = urllib2.urlopen(host)
                chunk = url.read()
                self.out_queue.put(chunk)  # hand the page source to out_queue
                self.queue.task_done()     # each call marks one queued item as done


    class DatamineThread(threading.Thread):
        def __init__(self, out_queue):
            threading.Thread.__init__(self)
            self.out_queue = out_queue

        def run(self):
            while True:
                chunk = self.out_queue.get()
                soup = BeautifulSoup(chunk)
                # find the title tag(s) in the page source
                print soup.findAll(['title'])
                self.out_queue.task_done()


    start = time.time()


    def main():
        for i in range(5):
            # each of these threads fetches page source into out_queue
            t = ThreadUrl(queue, out_queue)
            t.setDaemon(True)  # daemon thread: does not keep the program alive
            t.start()

        # put all the URLs into queue
        for host in hosts:
            queue.put(host)

        for i in range(5):
            # each of these threads parses the <title> tag out of the page source
            dt = DatamineThread(out_queue)
            dt.setDaemon(True)
            dt.start()

        # block until every item in each queue has been processed;
        # the main thread resumes only after that
        queue.join()
        out_queue.join()


    main()
    print "Total time :%s" % (time.time() - start)
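One limitation of the listing is worth noting: if urlopen raises (a timeout or DNS failure, for example), the ThreadUrl thread exits before it calls task_done(), and queue.join() then never returns. A possible extension, sketched here in Python 3 with hypothetical names (safe_worker, flaky_fetch, demo) and a simulated failure instead of a real network call, is to wrap the work in try/finally so the queue's counter is always decremented.

```python
import queue
import threading


def safe_worker(in_queue, out_queue, errors, handler):
    """Apply handler to each item; always call task_done, even on failure."""
    while True:
        item = in_queue.get()
        try:
            out_queue.put(handler(item))
        except Exception as exc:
            # Record the failure instead of letting it kill the thread.
            errors.append((item, exc))
        finally:
            in_queue.task_done()


def flaky_fetch(url):
    """Hypothetical fetcher: fails for one URL to simulate a network error."""
    if "bad" in url:
        raise IOError("simulated network failure for " + url)
    return "<title>" + url + "</title>"


def demo():
    in_queue, out_queue, errors = queue.Queue(), queue.Queue(), []
    threading.Thread(target=safe_worker,
                     args=(in_queue, out_queue, errors, flaky_fetch),
                     daemon=True).start()
    for url in ["http://ok.example", "http://bad.example"]:
        in_queue.put(url)
    in_queue.join()  # returns even though one fetch raised
    return out_queue.qsize(), len(errors)
```

Here demo() finishes with one page in the output queue and one recorded error, rather than hanging. Whether to retry, log, or drop failed URLs is a design choice this sketch leaves open.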
