Python: Trying Out urllib3, an HTTP Connection Pool in Practice

Last updated: 2018-12-03. Source: Internet

Uploaded by: User


The library and related materials can be downloaded from http://code.google.com/p/urllib3/.

First, here is how it is used:

    # coding=utf8
    import urllib3
    import datetime
    import time
    import urllib

    # Create a connection pool for one specific host
    http_pool = urllib3.HTTPConnectionPool('ent.qq.com')

    # Record the start time
    strStart = time.strftime('%X %x %Z')

    for i in range(0, 100, 1):
        print i
        # Build the URL string
        url = 'http://ent.qq.com/a/20111216/%06d.htm' % i
        print url
        # Fetch the content synchronously
        r = http_pool.urlopen('GET', url, redirect=False)
        print r.status, r.headers, len(r.data)

    # Print the times
    print 'start time : ', strStart
    print 'end time : ', time.strftime('%X %x %Z')

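This is fairly simple: first create the connection pool http_pool, then fetch URL resources from the same host ('ent.qq.com') one after another.
Capturing the packets with Wireshark shows: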

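Every request for http://ent.qq.com/a/20111216/******.htm has the same src port, 13136, so the port (and therefore the underlying TCP connection) is being reused.
According to the urllib3 documentation, this relies on HTTP keep-alive, and the Connection field of every response is indeed keep-alive.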
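
The same observation can be reproduced without a packet capture. The following is a minimal sketch in Python 3 that uses only the standard library rather than urllib3: a throwaway local server (the handler name PortRecordingHandler and the list seen_client_ports are made up for this illustration) records the client source port of each request, while a single http.client.HTTPConnection sends five requests over keep-alive.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

seen_client_ports = []  # source port of each request, as seen by the server


class PortRecordingHandler(BaseHTTPRequestHandler):
    # HTTP/1.1 is required for the server to keep connections open
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        seen_client_ports.append(self.client_address[1])
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence the default per-request logging


# Port 0 lets the operating system pick a free port
server = ThreadingHTTPServer(("127.0.0.1", 0), PortRecordingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
for i in range(5):
    conn.request("GET", "/a/20111216/%06d.htm" % i)
    response = conn.getresponse()
    response.read()  # drain the body so the socket can carry the next request
conn.close()

server.shutdown()
server.server_close()

# A single distinct port means all five requests shared one TCP connection
print(sorted(set(seen_client_ports)))
```

If a new HTTPConnection were created inside the loop instead, the server would normally see a different source port for each request.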

So how is this connection pool implemented?


    def urlopen(self, method, url, body=None, headers=None, retries=3,
                redirect=True, assert_same_host=True):
        # Many conditional checks have been removed here
        try:
            # Get a connection
            conn = self._get_conn()
            # Build the request
            self.num_requests += 1
            conn.request(method, url, body=body, headers=headers)
            # Set the timeout
            conn.sock.settimeout(self.timeout)
            httplib_response = conn.getresponse()
            # ......
            # Parse the HTTP response
            response = HTTPResponse.from_httplib(httplib_response)
            # Put the current connection back into the queue for reuse
            self._put_conn(conn)
        except
            # Error handling
            ...

        # Redirect handling; this is done recursively
        if (redirect and
            response.status in [301, 302, 303, 307] and
            'location' in response.headers):  # Redirect, retry
            log.info("Redirecting %s -> %s" %
                     (url, response.headers.get('location')))
            return self.urlopen(method, response.headers.get('location'), body,
                                headers, retries - 1, redirect,
                                assert_same_host)
        # Return the result
        return response

As the simplified code above shows, it first gets a connection, then builds the request, sends it, and finally reads the response.
Note that every connection is obtained by calling _get_conn, and once the request has completed, _put_conn is called to put the connection back into the pool. The relevant code is as follows:
    def _new_conn(self):
        # Create a connection
        return HTTPConnection(host=self.host, port=self.port)

    def _get_conn(self, timeout=None):
        # Try to get a connection from the pool
        conn = None
        try:
            conn = self.pool.get(block=self.block, timeout=timeout)
            # Has this connection already been established?
            if conn and conn.sock and select([conn.sock], [], [], 0.0)[0]:
                # Either data is buffered (bad), or the connection is dropped.
                log.warning("Connection pool detected dropped "
                            "connection, resetting: %s" % self.host)
                conn.close()
        except Empty, e:
            pass  # Oh well, we'll create a new connection then
        # If the queue is empty, or the queued connection was dropped,
        # a connection is created to the same port
        return conn or self._new_conn()

    def _put_conn(self, conn):
        # Put the current connection into the queue. The queue's default
        # maximum size is 1; a connection beyond that size is discarded
        try:
            self.pool.put(conn, block=False)
        except Full, e:
            # This should never happen if self.block == True
            log.warning("HttpConnectionPool is full, discarding connection: %s"
                        % self.host)
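
The get/put cycle in that excerpt can be tried in isolation. Below is a minimal sketch in Python 3, not the urllib3 implementation: MiniPool and FakeConnection are hypothetical names invented for this illustration, the pool never blocks (the original passes block=self.block), and the dropped-connection check with select is left out. Unlike the excerpt, which only logs a warning, this sketch also closes a connection that does not fit in the pool.

```python
import queue


class FakeConnection:
    """Hypothetical stand-in for HTTPConnection; counts how many were created."""

    created = 0

    def __init__(self, host):
        FakeConnection.created += 1
        self.host = host
        self.closed = False

    def close(self):
        self.closed = True


class MiniPool:
    """Queue-backed pool for a single host (illustrative only)."""

    def __init__(self, host, maxsize=1):
        self.host = host
        self.pool = queue.Queue(maxsize)

    def get_conn(self):
        # Reuse a pooled connection if there is one, otherwise create one
        try:
            return self.pool.get(block=False)
        except queue.Empty:
            return FakeConnection(self.host)

    def put_conn(self, conn):
        # Return True if the connection was kept, False if it was discarded
        try:
            self.pool.put(conn, block=False)
            return True
        except queue.Full:
            conn.close()
            return False


pool = MiniPool("ent.qq.com")
first = pool.get_conn()        # pool is empty, so a connection is created
pool.put_conn(first)           # handed back for reuse
second = pool.get_conn()       # the pooled object comes back
extra = FakeConnection("ent.qq.com")
pool.put_conn(second)
kept = pool.put_conn(extra)    # the pool already holds one connection
print(second is first, FakeConnection.created, kept)
```

With maxsize=1, sequential requests keep recycling one connection, which matches the single source port seen in the capture.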
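I compared the pool above with the plain urllib library by fetching different pages from the same domain in succession, and the speed did not improve noticeably. A likely reason is that the server is close to my machine: the pool's main optimization is to reduce the number of TCP handshakes and slow starts, and with low latency that benefit does not show up clearly.
Are there any good suggestions on how to run this kind of performance test?
Someone has also asked whether urllib3 should provide a pool of pools, that is, automatically create one pool per host when accessing different sites, under the name HTTPOcean.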
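
One possible way to structure such a test is sketched below, in Python 3 with only the standard library. It is a rough harness rather than a rigorous benchmark: the local server, the handler name OkHandler, and the two timing functions are assumptions made for this illustration. It times the same list of paths once with a fresh connection per request and once over a single reused connection, alternating the two for a few rounds and keeping the best time of each.

```python
import http.client
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class OkHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"      # allow keep-alive
    disable_nagle_algorithm = True     # avoid small-write delays skewing timings

    def do_GET(self):
        body = b"x" * 1024
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence the default per-request logging


def time_new_connections(host, port, paths):
    """One TCP connection (and handshake) per request."""
    start = time.perf_counter()
    for path in paths:
        conn = http.client.HTTPConnection(host, port)
        conn.request("GET", path)
        conn.getresponse().read()
        conn.close()
    return time.perf_counter() - start


def time_reused_connection(host, port, paths):
    """All requests share one keep-alive connection."""
    start = time.perf_counter()
    conn = http.client.HTTPConnection(host, port)
    for path in paths:
        conn.request("GET", path)
        conn.getresponse().read()
    conn.close()
    return time.perf_counter() - start


server = ThreadingHTTPServer(("127.0.0.1", 0), OkHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

paths = ["/a/20111216/%06d.htm" % i for i in range(100)]
results = {"new": [], "reused": []}
for _ in range(3):  # alternate the two variants to even out warm-up effects
    results["new"].append(
        time_new_connections("127.0.0.1", server.server_port, paths))
    results["reused"].append(
        time_reused_connection("127.0.0.1", server.server_port, paths))

server.shutdown()
server.server_close()

print("new connection per request: %.4fs (best of 3)" % min(results["new"]))
print("single reused connection:   %.4fs (best of 3)" % min(results["reused"]))
```

On the loopback interface a handshake costs very little, so the gap here is probably small, much like the result described above. Pointing the two timing functions at a server with a higher round-trip time would be the more telling comparison.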
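
The idea of a pool of pools can be sketched as a dictionary that lazily creates one pool per (scheme, host, port) key. The Python 3 code below is illustrative only: PoolOfPools and HostPool are hypothetical names, and HostPool is an empty placeholder rather than a real connection pool. Later urllib3 releases include a PoolManager class that fills this role; this sketch does not use it.

```python
from urllib.parse import urlsplit


class HostPool:
    """Hypothetical placeholder for a per-host connection pool."""

    def __init__(self, scheme, host, port):
        self.scheme = scheme
        self.host = host
        self.port = port


class PoolOfPools:
    """Hands out one HostPool per (scheme, host, port), created on first use."""

    DEFAULT_PORTS = {"http": 80, "https": 443}

    def __init__(self):
        self.pools = {}

    def pool_for_url(self, url):
        parts = urlsplit(url)
        port = parts.port or self.DEFAULT_PORTS[parts.scheme]
        key = (parts.scheme, parts.hostname, port)
        if key not in self.pools:
            self.pools[key] = HostPool(*key)
        return self.pools[key]


ocean = PoolOfPools()
a = ocean.pool_for_url("http://ent.qq.com/a/20111216/000001.htm")
b = ocean.pool_for_url("http://ent.qq.com/a/20111216/000002.htm")
c = ocean.pool_for_url("http://code.google.com/p/urllib3/")
print(a is b, a is c, len(ocean.pools))
```

Two URLs on the same host share a pool, while a URL on another host gets its own, so callers only ever deal with URLs.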
