Open-source project recommendation: Databot, a high-performance data-driven development framework for Python (web crawler case study)

Last updated: 2018-08-25 Source: Internet



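The project suddenly reached 300 stars on GitHub today.

I have worked in data-related roles for many years and know the problems of data development first-hand. Data processing work mainly covers web crawling, ETL, and machine learning. Development is a matter of building a data-processing pipeline by joining modules together. The steps can be summarized as: acquire, transform, merge, store, and send. Data engineering differs from business-system development in many ways. A data project is mostly about laying pipes, with modules connected through data dependencies, whereas developing a business system is more like constructing a building. Crawler engineers and algorithm engineers often produce very messy data-processing code, because an accurate design is impossible before seeing real data, let alone one that meets performance requirements. Some time ago I spent a lot of time studying the asyncio library in depth and decided to develop a data-driven framework that addresses the problems of data processing work in terms of modularity, flexibility, and performance. That was my original motivation for creating the open-source Databot framework.

After a little over half a month the framework is basically complete. It can handle data processing work such as crawling, ETL, and quantitative trading, and it performs very well. You are welcome to use it and to send feedback.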

Project address: github.com/kkyon/databot

Installation: pip3 install -U databot

Code examples: github.com/kkyon/databot/tree/master/examples

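Multithreading vs. asynchronous coroutines:

In general, asynchronous coroutines have the advantage for highly concurrent data I/O. Threads consume a lot of resources and thread switching is expensive, so the commonly recommended thread count is about twice the number of CPU cores. Because of the GIL, Python can hardly improve performance through multithreading.

With asyncio, by contrast, very high throughput can be achieved, and the number of concurrent operations is almost unlimited.

For details, see this article:

pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html

On an ordinary laptop, Python asyncio completed one million web page requests in 9 minutes.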
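To illustrate why coroutines suit I/O-heavy work, here is a minimal sketch using only the standard library. It does not touch the network: fake_fetch is a hypothetical stand-in that sleeps instead of issuing a request, and the URLs, delay, and concurrency cap are assumptions chosen for illustration.

```python
import asyncio
import time


async def fake_fetch(url: str, delay: float = 0.05) -> str:
    """Stand-in for an HTTP request: sleeps instead of hitting the network."""
    await asyncio.sleep(delay)
    return f"response for {url}"


async def fetch_all(urls, max_concurrency: int = 100):
    """Run all simulated requests concurrently, capped by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with semaphore:
            return await fake_fetch(url)

    # gather() returns results in the same order as the input awaitables.
    return await asyncio.gather(*(bounded(u) for u in urls))


def run_demo(n: int = 200):
    """Time n simulated requests; returns (results, elapsed_seconds)."""
    urls = [f"https://example.com/page/{i}" for i in range(n)]
    start = time.perf_counter()
    results = asyncio.run(fetch_all(urls))
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = run_demo()
    print(f"{len(results)} simulated requests in {elapsed:.2f}s")
```

Run one after another, 200 waits of 0.05 s would take about 10 s. Because the waits overlap on a single thread, the whole batch should finish in a small fraction of that.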

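Databot performance test results:

The test uses the Baidu crawler example:

There is a batch of keywords to search on the Baidu search engine, recording the article titles on the first ten result pages. This kind of task comes up often in SEO, public-opinion monitoring, and similar scenarios. The test used 100 keywords (1,000 pages to fetch) and finished in about three minutes. The test environment and results are as follows: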

# ---run result----
An HTTP response takes about 1 second
# Postman test result for a page request: 1100 ms

Ping time is about 42 ms
# PING www.a.shifen.com (180.97.33.108): 56 data bytes
# 64 bytes from 180.97.33.108: icmp_seq=0 ttl=55 time=41.159 ms

Databot test result: about 50 items scraped per second and about 6 pages processed per second.

# got len item 9274 speed:52.994286 per second,total cost: 175s
# got len item 9543 speed:53.016667 per second,total cost: 180s
# got len item 9614 speed:51.967568 per second,total cost: 185s
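The reported speeds can be checked with simple arithmetic. The sketch below recomputes items per second from the three log lines above and estimates pages per second from the stated workload of 100 keywords with ten pages each. The variable and function names are illustrative, and the 180-second figure is taken from the "about three minutes" in the text.

```python
# (items collected, elapsed seconds) copied from the log lines above
samples = [(9274, 175), (9543, 180), (9614, 185)]


def items_per_second(items: int, seconds: int) -> float:
    """Average throughput since the start of the run."""
    return items / seconds


rates = [round(items_per_second(i, s), 6) for i, s in samples]

# 100 keywords x 10 result pages each = 1000 pages in total
total_pages = 100 * 10
pages_per_second = total_pages / 180  # roughly three minutes

if __name__ == "__main__":
    for (items, seconds), rate in zip(samples, rates):
        print(f"{items} items in {seconds}s -> {rate} items/s")
    print(f"about {pages_per_second:.1f} pages/s")
```

The recomputed rates match the logged values, and 1,000 pages in roughly 180 seconds comes to between 5 and 6 pages per second, in line with the rounded figure quoted above.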

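Problems with Python asyncio:

asyncio itself is conceptually complex, for example Future versus Task and ensure_future versus create_task.

Writing coroutines places high demands on engineers, especially in data projects.

Few third-party libraries support asyncio, so development has to combine it with multithreading and multiprocessing.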
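The concepts named above can be seen side by side in a short standard-library sketch. The functions compute and blocking_io are hypothetical placeholders; the point is only to show how a Task relates to a Future, how create_task and ensure_future differ, and how a blocking library call can be handed to a thread pool.

```python
import asyncio
import time


async def compute(x: int) -> int:
    """A trivial coroutine standing in for async work."""
    await asyncio.sleep(0.01)
    return x * 2


def blocking_io(x: int) -> int:
    """Stand-in for a library call that has no asyncio support."""
    time.sleep(0.01)
    return x + 1


async def demo():
    loop = asyncio.get_running_loop()

    # create_task() takes a coroutine and schedules it as a Task.
    task = asyncio.create_task(compute(1))

    # ensure_future() accepts any awaitable: a coroutine is wrapped in a Task,
    # while an existing Task or Future is returned unchanged.
    same_task = asyncio.ensure_future(task)

    # A bare Future is only a placeholder; something else must set its result.
    future = loop.create_future()
    loop.call_soon(future.set_result, "done")

    # A blocking call goes to the default thread pool so the loop stays free.
    threaded = loop.run_in_executor(None, blocking_io, 1)

    return {
        "task_is_future": isinstance(task, asyncio.Future),
        "same_object": same_task is task,
        "task_result": await task,
        "future_result": await future,
        "threaded_result": await threaded,
    }


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Even this small example involves four distinct mechanisms, which gives a sense of the learning curve the author describes.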

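Databot philosophy:

Data engineers focus only on the core logic and write modular functions, without having to think about asyncio's particulars. Databot takes care of external I/O, concurrency, and scheduling.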

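Databot basic concepts:

Databot's design is very simple, with only three concepts: Pipe, Route, and Node.

A Pipe is the main flow. A program can have several Pipes, which may be connected to each other or independent. Routes and Nodes are always contained inside a Pipe.

A Route is a router; it mainly routes data and aggregates or merges it. The Route types are Branch, Return, Fork, Join, and BlockedJoin. Branch and Fork do not change the data in the main flow, whereas Return and Join put the processed data back into the main flow. Routes can be nested to build complex data networks.

A Node is a data-driven node that holds data-processing logic. HTTP, MySQL, and AioFile nodes, user-defined functions, Timer, and Loop are all Nodes.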
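The difference between a route that leaves the main flow untouched and one that feeds its output back can be shown with a toy model. This is a synchronous sketch written for illustration only: ToyBranch, ToyReturn, toy_pipe, and run_steps are hypothetical names and are not Databot's classes or its implementation.

```python
# Toy model only: hypothetical classes, NOT Databot's implementation.

def run_steps(steps, values):
    """Push a list of values through the steps in order."""
    for step in steps:
        if isinstance(step, Route):
            values = step.apply(values)
        else:
            out = []
            for v in values:
                result = step(v)
                # A node may return one value or a list of values.
                out.extend(result if isinstance(result, list) else [result])
            values = out
    return values


class Route:
    def __init__(self, *steps):
        self.steps = steps

    def apply(self, values):
        raise NotImplementedError


class ToyBranch(Route):
    """Runs its steps on a copy of the data; the main flow is unchanged."""

    def apply(self, values):
        run_steps(self.steps, list(values))
        return values


class ToyReturn(Route):
    """Runs its steps and puts the processed data back into the main flow."""

    def apply(self, values):
        return run_steps(self.steps, list(values))


def toy_pipe(source, *steps):
    """The main flow: feed every source value through the steps."""
    return run_steps(steps, list(source))


seen = []  # collects what the branch produced
result = toy_pipe(
    [1, 2, 3],
    lambda x: x * 10,                        # a plain function acts as a node
    ToyBranch(lambda x: x + 1, seen.append), # side branch, main flow untouched
    ToyReturn(lambda x: [x, x + 5]),         # output replaces main-flow data
)
```

After this runs, seen holds the values the branch computed, while result reflects only the node and the ToyReturn step, because the branch never wrote back to the main flow.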

How to install Databot:

pip3 install -U databot

GitHub address: github.com/kkyon/databot

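Crawler code walkthrough:

For more examples, see: github.com/kkyon/databot/tree/master/examples

For the Baidu crawler example, the main flow works as follows:

get_all_items is a user-written function that parses the result items on a page.

get_all_page_url is a user-written function that extracts the pagination links from a page.

Loop iterates over the list and sends each link into the pipe.
HttpLoader reads each URL, downloads the HTML, and puts the resulting HTTP response object into the pipe.
The first Branch copies the data (the HTTP response) into a branch, where get_all_items parses it into the final results, which are written to a file. The main flow data is unaffected and still holds the HTTP response.
The second Branch copies the HTTP response from the pipe into another branch, where get_all_page_url extracts all pagination links. HttpLoader then downloads the corresponding pages, which are parsed and saved.

Every step above is invoked and run concurrently by the Databot framework.

BotFrame.render('baiduspider') can be used to generate a diagram of the pipe structure. It requires Graphviz to be installed: www.graphviz.org/download/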

Main function code:

def main():
    words = ['貿易戰', '世界盃']
    # Absolute search URL (https scheme assumed)
    baidu_url = 'https://www.baidu.com/s?wd=%s'
    urls = [baidu_url % (word) for word in words]

    outputfile = aiofile('baidu.txt')
    Pipe(
        Loop(urls),
        HttpLoader(),
        Branch(get_all_items, outputfile),
        Branch(get_all_page_url, HttpLoader(), get_all_items, outputfile),
    )

    # Generate the flowchart
    BotFrame.render('baiduspider')
    BotFrame.run()


main()

[Figure: flowchart generated by BotFrame.render('baiduspider')]

Full code:

import logging
import uuid

from bs4 import BeautifulSoup
from databot.botframe import BotFrame
from databot.db.aiofile import aiofile
from databot.flow import Pipe, Branch, Loop
from databot.http.http import HttpLoader

logging.basicConfig(level=logging.DEBUG)


# Structure that holds one parsed search result
class ResultItem:

    def __init__(self):
        self.id: str = ''
        self.name: str = ''
        self.url: str = ' '
        self.page_rank: int = 0
        self.page_no: int = 0

    def __repr__(self):
        return '%s,%s,%d,%d' % (str(self.id), self.name, self.page_no, self.page_rank)


# Parse the individual result items on a page
def get_all_items(response):
    soup = BeautifulSoup(response.text, "lxml")
    items = soup.select('div.result.c-container')
    result = []
    for rank, item in enumerate(items):
        r = ResultItem()
        r.id = uuid.uuid4()
        r.page_rank = rank
        r.name = item.h3.get_text()
        result.append(r)
    return result


# Parse the pagination links
def get_all_page_url(response):
    itemList = []
    soup = BeautifulSoup(response.text, "lxml")
    page = soup.select('div#page')
    for item in page[0].find_all('a'):
        href = item.get('href')
        no = item.get_text()
        # Stop at the "next page" link (Baidu labels it in simplified Chinese)
        if '下一页' in no:
            break
        # Absolute URL (https scheme assumed)
        itemList.append('https://www.baidu.com' + href)

    return itemList


def main():
    words = ['貿易戰', '世界盃']
    # Absolute search URL (https scheme assumed)
    baidu_url = 'https://www.baidu.com/s?wd=%s'
    urls = [baidu_url % (word) for word in words]

    outputfile = aiofile('baidu.txt')
    Pipe(
        Loop(urls),
        HttpLoader(),
        Branch(get_all_items, outputfile),
        Branch(get_all_page_url, HttpLoader(), get_all_items, outputfile),
    )
    # Generate the flowchart
    BotFrame.render('baiduspider')
    BotFrame.run()


main()
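The listing builds its search URLs by interpolating keywords into a format string. Since the keywords are non-ASCII, one option is to percent-encode the query explicitly with the standard library, so the URL no longer depends on how the HTTP client handles raw Chinese characters. The sketch below is an assumption-laden alternative, not part of Databot: build_search_urls is a hypothetical helper, and the https scheme on the base URL is assumed.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base search endpoint; the https scheme is an assumption.
BAIDU_SEARCH = "https://www.baidu.com/s"


def build_search_urls(words):
    """Return one percent-encoded search URL per keyword."""
    return [f"{BAIDU_SEARCH}?{urlencode({'wd': word})}" for word in words]


if __name__ == "__main__":
    for url in build_search_urls(['貿易戰', '世界盃']):
        # Decode the query again to show the keyword round-trips intact.
        print(url, parse_qs(urlsplit(url).query)['wd'][0])
```

The resulting list could be passed to Loop in place of the urls list built in main().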
