Python + Requests: An Encoding Detection Bug

Last updated: 2015-08-16. Source: Internet

Requests is an HTTP library released under the Apache2 License. It is written in Python and aims to be friendlier and easier to use.

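Requests is built on urllib3 and therefore inherits all of its features. It supports HTTP keep-alive and connection pooling, cookie-based session persistence, file uploads, automatic detection of the response content encoding, internationalized URLs, and automatic encoding of POST data. Modern, international, and human-friendly.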

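While using Requests recently I ran into a problem: when fetching certain Chinese web pages the text comes out garbled, and printing r.encoding shows ISO-8859-1. Why does this happen? Reading the source, I found that the default encoding detection is quite simple. It looks only at the Content-Type response header. If the header carries a charset parameter, the encoding is identified correctly. If there is no charset but the content type contains "text", the encoding is assumed to be ISO-8859-1. See utils.py: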

    # Excerpt from requests/utils.py (the module imports cgi at the top).
    def get_encoding_from_headers(headers):
        """Returns encodings from given HTTP Header Dict.

        :param headers: dictionary to extract encoding from.
        """
        content_type = headers.get('content-type')

        if not content_type:
            return None

        content_type, params = cgi.parse_header(content_type)

        if 'charset' in params:
            return params['charset'].strip("'\"")

        if 'text' in content_type:
            return 'ISO-8859-1'
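To make the header-only rule concrete, here is a minimal standalone sketch of the same decision. It is not the library's code: the function name encoding_from_content_type is hypothetical, and the Content-Type value is split by hand instead of with cgi.parse_header so that the example depends on nothing but the standard library.

```python
def encoding_from_content_type(content_type):
    """Mimic the header-only rule: explicit charset, else ISO-8859-1 for text."""
    if not content_type:
        return None

    # Split "text/html; charset=gb2312" into the media type and its parameters.
    media_type, _, rest = content_type.partition(';')
    params = {}
    for item in rest.split(';'):
        key, sep, value = item.partition('=')
        if sep:
            params[key.strip().lower()] = value.strip().strip("'\"")

    if 'charset' in params:
        return params['charset']

    # No charset given: any textual type falls back to ISO-8859-1.
    if 'text' in media_type.lower():
        return 'ISO-8859-1'
    return None


for header in ('text/html; charset=gb2312', 'text/html', 'image/png'):
    print(header, '->', encoding_from_content_type(header))
```

The second case is the one that causes trouble: a server that sends a bare text/html header for a GB2312 or UTF-8 page ends up labelled ISO-8859-1.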

In fact, Requests does provide a way to read the encoding from the content itself. It just is not used by default. See utils.py:

    # Excerpt from requests/utils.py (the module imports re at the top).
    def get_encodings_from_content(content):
        """Returns encodings from given content string.

        :param content: bytestring to extract encodings from.
        """
        charset_re = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
        pragma_re = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
        xml_re = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')

        return (charset_re.findall(content) +
                pragma_re.findall(content) +
                xml_re.findall(content))
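The following sketch applies the same three patterns to a few small documents to show what they pick up. The helper name encodings_from_text and the sample documents are made up for illustration. One assumption worth stating: on Python 3 these string patterns only work on text, so a bytes body has to be decoded first; decoding as ISO-8859-1 never fails and leaves the ASCII markup intact.

```python
import re

# The same three patterns as in the utils.py excerpt, compiled once.
CHARSET_RE = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
PRAGMA_RE = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
XML_RE = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')


def encodings_from_text(text):
    """Return every encoding name declared in the markup, in pattern order."""
    return CHARSET_RE.findall(text) + PRAGMA_RE.findall(text) + XML_RE.findall(text)


html5 = '<html><head><meta charset="gb2312"></head></html>'
html4 = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
xml = '<?xml version="1.0" encoding="UTF-8"?><feed/>'

print(encodings_from_text(html5))
print(encodings_from_text(html4))
print(encodings_from_text(xml))

# A bytes body is decoded first so that the string patterns can be applied.
raw_body = html5.encode('gb2312')
print(encodings_from_text(raw_body.decode('ISO-8859-1')))
```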

It also provides encoding detection based on chardet. See models.py:

    # Excerpt from requests/models.py (a property of the Response class).
    @property
    def apparent_encoding(self):
        """The apparent encoding, provided by the lovely Charade library
        (Thanks, Ian!)."""
        return chardet.detect(self.content)['encoding']

So how can this be fixed? First, look at an example:

    >>> r = requests.get('http://cn.python-requests.org/en/latest/')
    >>> r.headers['content-type']
    'text/html'
    >>> r.encoding
    'ISO-8859-1'
    >>> r.apparent_encoding
    'utf-8'
    >>> requests.utils.get_encodings_from_content(r.content)
    ['utf-8']

    >>> r = requests.get('http://reader.360duzhe.com/2013_24/index.html')
    >>> r.headers['content-type']
    'text/html'
    >>> r.encoding
    'ISO-8859-1'
    >>> r.apparent_encoding
    'gb2312'
    >>> requests.utils.get_encodings_from_content(r.content)
    ['gb2312']
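The garbling itself can be reproduced without any network access. The sketch below uses a made-up four-character string in place of a real page: it encodes the text as GB2312, decodes it with the ISO-8859-1 fallback, and then shows that the original bytes are still recoverable because ISO-8859-1 maps every byte to exactly one character.

```python
original = '中文网页'

# What the server sends: GB2312 bytes, two bytes per Chinese character.
body = original.encode('gb2312')

# What a client produces when it assumes ISO-8859-1: one character per byte.
garbled = body.decode('ISO-8859-1')

# Nothing is lost yet: re-encoding as ISO-8859-1 returns the original bytes,
# which can then be decoded with the correct encoding.
recovered = garbled.encode('ISO-8859-1').decode('gb2312')

print(len(original), len(garbled))
print(recovered == original)
```

This is why the wrong guess shows up as mojibake rather than as a decoding error: every byte sequence is valid ISO-8859-1.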

With that understanding, the problem can be worked around with a monkey patch:

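    import requests

    def monkey_patch():
        prop = requests.models.Response.content

        def content(self):
            _content = prop.fget(self)
            # Only override the ISO-8859-1 fallback, and only when there is a body.
            if self.encoding == 'ISO-8859-1' and _content:
                # Clear the fallback first: apparent_encoding reads self.content,
                # which is this patched property, and would otherwise recurse.
                self.encoding = None
                # The regexes expect text; ISO-8859-1 can decode any byte string.
                encodings = requests.utils.get_encodings_from_content(
                    _content.decode('ISO-8859-1'))
                detected = encodings[0] if encodings else self.apparent_encoding
                # Fall back to UTF-8 if detection returned nothing.
                _content = _content.decode(detected or 'utf-8', 'replace').encode('utf8', 'replace')
                self._content = _content
                # The body is now UTF-8, so r.text must decode it as UTF-8.
                self.encoding = 'utf-8'
            return _content

        requests.models.Response.content = property(content)

    monkey_patch()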
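For readers who would rather not patch the library globally, a per-response helper is one possible alternative. This is only a sketch and not part of Requests: fix_encoding is a hypothetical name, and it relies only on the encoding, content, and apparent_encoding attributes that a Response exposes. To keep the example runnable offline it is exercised with a stand-in object rather than a real response.

```python
import re
from types import SimpleNamespace

# Same idea as the <meta charset=...> pattern in utils.py.
META_CHARSET_RE = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)


def fix_encoding(response):
    """Replace the ISO-8859-1 fallback with a better guess, in place."""
    if response.encoding != 'ISO-8859-1':
        return response  # a charset was declared (or nothing was guessed)

    # Look for a declaration near the top of the document; ISO-8859-1
    # decodes any byte string, so this step cannot raise.
    head = response.content[:4096].decode('ISO-8859-1')
    declared = META_CHARSET_RE.findall(head)

    # Prefer the page's own declaration, else the statistical guess.
    response.encoding = declared[0] if declared else response.apparent_encoding
    return response


# Stand-in for a Response: GB2312 page served with a bare text/html header.
page = '<html><head><meta charset="gb2312"></head><body>中文</body></html>'
fake = SimpleNamespace(encoding='ISO-8859-1',
                       content=page.encode('gb2312'),
                       apparent_encoding='GB2312')

print(fix_encoding(fake).encoding)
print(fake.content.decode(fake.encoding))
```

With a real response the call would look like r = fix_encoding(requests.get(url)), after which r.text decodes with the corrected encoding. One trade-off: a page that really is ISO-8859-1 but was served without an explicit charset will also be re-guessed.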

Requests: HTTP for Humans
Python + Requests: An Improved Fix for Garbled Chinese Text When Scraping
