Fixing garbled, undecodable response content when Python requests HTML with a gzip header

Last updated: 2015-04-23. Source: Internet

Uploaded by: User


Tags: python, garbled text, http, gzip, urllib2

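1. Problem background

When fetching web data with the urllib2 module, you may want to send request headers that reduce the amount of data transferred. The data that comes back is then gzip-compressed. Decoding it directly with content.decode("utf8") raises an exception, and the page's actual character encoding cannot be detected either.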
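To make the symptom concrete, the following sketch reproduces it without any network access. It is written for Python 3 (an assumption; the article's own script targets Python 2.7), and the sample HTML string and the helper name `try_decode` are made up for illustration.

```python
import gzip

# Hypothetical page body, standing in for what a server would send.
page = "<html><body>你好, gzip</body></html>".encode("utf-8")

# What arrives on the wire when the server honours "Accept-Encoding: gzip".
compressed = gzip.compress(page)


def try_decode(raw):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


# A gzip stream starts with the magic bytes 0x1f 0x8b. The byte 0x8b cannot
# start a UTF-8 sequence, so decoding the compressed bytes fails.
print(try_decode(compressed))
# After decompression the same bytes decode cleanly.
print(try_decode(gzip.decompress(compressed)))
```

The first call returns None because the decode fails; the second returns the original HTML text.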

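2. Problem analysis

When an HTTP request includes "Accept-Encoding": "gzip, deflate" in its headers and the web server supports it, the response body is returned compressed. The benefit is less network traffic; the client is expected to decompress the body according to the response headers and only then decode it. The urllib2 module returns the HTTP response body as raw data, without decompressing it, and that is the root cause of the garbled output.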
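The client-side step described here (look at the response header, decompress, then decode) can be sketched as a small helper. This is a Python 3 sketch under stated assumptions, not the article's code: the function name `decode_body` is hypothetical, and the raw-deflate fallback is included because some servers are known to send deflate data without the zlib wrapper.

```python
import gzip
import zlib


def decode_body(body, content_encoding):
    """Undo the Content-Encoding that the server reported for `body` (bytes)."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            # The usual case: deflate data wrapped in a zlib header.
            return zlib.decompress(body)
        except zlib.error:
            # Fallback (assumption): a raw deflate stream with no header.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    # No header, or an encoding we do not handle: return the body unchanged.
    return body


html = "<p>hello</p>".encode("utf-8")
print(decode_body(gzip.compress(html), "gzip").decode("utf-8"))
print(decode_body(html, None).decode("utf-8"))
```

With a real response, `content_encoding` would come from the response's Content-Encoding header and `body` from reading the response.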

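3. Solutions

3.1 Remove "Accept-Encoding": "gzip, deflate" from the request header

This is the quickest fix and yields data that can be decoded directly. The drawback is that the amount of data transferred increases considerably.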
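A minimal Python 3 sketch of this option follows. The helper name `without_accept_encoding` and the sample data are hypothetical; the size comparison only illustrates the trade-off, and the real ratio depends entirely on the page being fetched (the sample here is artificially repetitive).

```python
import gzip


def without_accept_encoding(headers):
    """Return a copy of `headers` with any Accept-Encoding entry dropped."""
    return {name: value for name, value in headers.items()
            if name.lower() != "accept-encoding"}


headers = {
    "User-agent": "Mozilla/5.0",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
}
plain_headers = without_accept_encoding(headers)
print(sorted(plain_headers))

# Rough illustration of the cost: bytes sent uncompressed vs. gzip-compressed.
sample = ("<tr><td>row</td></tr>\n" * 500).encode("utf-8")
print(len(sample), len(gzip.compress(sample)))
```

Passing `plain_headers` instead of `headers` when building the request means the server has no reason to compress the response, so the body can be decoded directly.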

3.2 Use the zlib module to decompress, then decode, to obtain readable plain text

This is the approach used in this article.
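The key detail of the zlib approach is the `wbits` argument to `zlib.decompress`. The Python 3 sketch below, using a made-up sample string, shows why the default value is not enough for gzip data and what the two common alternatives do.

```python
import gzip
import zlib

text = "可讀的明文資料 readable plaintext"
gz = gzip.compress(text.encode("utf-8"))

# With the default wbits, zlib expects a zlib header and rejects gzip data.
try:
    zlib.decompress(gz)
    default_ok = True
except zlib.error:
    default_ok = False
print(default_ok)

# 16 + MAX_WBITS: expect a gzip header and trailer.
as_gzip = zlib.decompress(gz, 16 + zlib.MAX_WBITS)

# 32 + MAX_WBITS: auto-detect either a gzip or a zlib header.
auto = zlib.decompress(gz, 32 + zlib.MAX_WBITS)

print(as_gzip.decode("utf-8") == text, auto.decode("utf-8") == text)
```

The auto-detecting form is convenient when the server may answer with either gzip or zlib-wrapped deflate data.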

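4. Source code walkthrough

The code is shown below. It is a typical example of simulating an HTML form and submitting the request data via POST, written for Python 2.7.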


    #!/usr/bin/env python2.7
    import sys
    import zlib
    import chardet
    import urllib
    import urllib2
    import cookielib


    def main():
        # Python 2 workaround: make utf-8 the interpreter's default codec.
        reload(sys)
        sys.setdefaultencoding('utf-8')

        url = 'http://xxx.yyy.com/test'
        values = {
            "form_field1": "value1",
            "form_field2": "TRUE",
        }
        post_data = urllib.urlencode(values)

        # Keep cookies across requests made with this opener.
        cj = cookielib.CookieJar()
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

        headers = {
            "User-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:36.0) Gecko/20100101 Firefox/36.0",
            "Referer": "http://xxx.yyy.com/test0",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            "Accept-Encoding": "gzip, deflate",
            "Connection": "keep-alive",
            # "Cookie": "QSession=",
            "Content-Type": "application/x-www-form-urlencoded",
        }

        # Passing post_data makes urllib2 send a POST request.
        req = urllib2.Request(url, post_data, headers)
        response = opener.open(req)
        content = response.read()

        # urllib2 does not decompress the body, so inspect Content-Encoding.
        encoding = response.headers.get('Content-Encoding', '').lower()
        if encoding == 'gzip':
            # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
            html = zlib.decompress(content, 16 + zlib.MAX_WBITS)
        elif encoding == 'deflate':
            # The request also accepts deflate; assume the zlib-wrapped form.
            html = zlib.decompress(content)
        else:
            html = content

        # Report the detected character encoding, then decode as UTF-8.
        result = chardet.detect(html)
        print(result)
        print html.decode("utf8")


    if __name__ == '__main__':
        main()

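Running this script requires the following environment (the third-party chardet package must also be installed, since the script imports it):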
- Mac OS 10.9+
- Python 2.7.x
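The script above always decodes the decompressed bytes as UTF-8, whatever chardet reports. One alternative, sketched here in Python 3 and not taken from the article, is to use the charset declared in the response's Content-Type header and fall back to UTF-8 when none is given. The helper names `charset_from_content_type` and `decode_html` are hypothetical.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type or "application/octet-stream"
    # get_content_charset() returns the charset lowercased, or None.
    return msg.get_content_charset() or default


def decode_html(raw, content_type):
    """Decode already-decompressed bytes using the declared charset."""
    charset = charset_from_content_type(content_type)
    # errors="replace" keeps going on stray bytes instead of raising.
    return raw.decode(charset, errors="replace")


body = "中文網頁".encode("gbk")
print(decode_html(body, "text/html; charset=GBK"))
print(charset_from_content_type("text/html"))
```

A page may declare no charset, or a wrong one, in which case a detector such as chardet remains a useful fallback.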
