A Guide to Climbing Out of Python Encoding Pits

Last updated: 2018-12-06 Source: Internet

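I have been learning Python recently. It is a remarkably compact language, and I like how that compactness is backed by a powerful set of libraries. Not long after starting, though, I ran into a headache-inducing encoding problem. After looking through a lot of material online, I am summarizing it here, both as a record for myself and as a service to those who come after me. If it saves you a few detours, I will be honored.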

First, the symptom:

import os
# Python 2: list a directory that contains Chinese file names
for i in os.listdir(r"E:\Torchlight II"):
    print i

The code is simple: we use os.listdir to walk the directory E:\Torchlight II (Torchlight?! :)). Because some files in this directory have Chinese names, the final print produced garbled output.

So where does the problem come from? No rush, let's analyze it step by step.

From two discussions I found online, we can be almost certain that the problem is this:

This means that the python console app can't write the given character to the console's encoding.

More specifically, the python console app created a _io.TextIOWrapper instance with an encoding that cannot represent the given character.

sys.stdout --> _io.TextIOWrapper --> (your console)
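To make the quoted explanation concrete, here is a minimal sketch in Python 3 syntax (the article's own snippets are Python 2). It puts a text wrapper with an ASCII encoding over an in-memory buffer, which stands in for a console that cannot represent Chinese characters. The buffer, the variable names, and the sample text are assumptions for illustration, not the author's setup.

```python
import io

# An in-memory byte buffer stands in for the console (illustrative assumption).
fake_console = io.BytesIO()

# A text layer whose encoding cannot represent Chinese characters.
strict_out = io.TextIOWrapper(fake_console, encoding="ascii")

try:
    strict_out.write("中文")
    write_failed = False
except UnicodeEncodeError:
    # The wrapper cannot encode the characters into its target encoding.
    write_failed = True

# With errors="replace", unencodable characters become "?" instead of raising.
lenient_buffer = io.BytesIO()
lenient_out = io.TextIOWrapper(lenient_buffer, encoding="ascii", errors="replace")
lenient_out.write("中文")
lenient_out.flush()
replaced_bytes = lenient_buffer.getvalue()

print(write_failed, replaced_bytes)
```

The failure happens in the text layer, before any byte reaches the "console", which matches the chain sys.stdout --> _io.TextIOWrapper --> (your console) in the quote.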

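Reading this, you may be thinking what I was thinking: can't we just set the console's encoding to one that understands Chinese characters, so that Chinese displays correctly? Hold on, let's Google a little longer: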

Python determines the encoding of stdout and stderr based on the value of the LC_CTYPE variable, but only if the stdout is a tty. So if I just output to the terminal, LC_CTYPE (or LC_ALL) define the encoding. However, when the output is piped to a file or to a different process, the encoding is not defined, and defaults to 7-bit ASCII.

A more detailed explanation:

1). When Python finds its output attached to a terminal, it sets the sys.stdout.encoding attribute to the terminal's encoding. The print statement's handler will automatically encode unicode arguments into str output.

2). When Python does not detect the desired character set of the output, it sets sys.stdout.encoding to None, and print will invoke the "ascii" codec.
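A quick way to see which of the two cases applies is to inspect the stream itself. The sketch below is written for Python 3, and describe_stream is a hypothetical helper name. Note that Python 3 usually assigns an encoding even when output is piped, so the ASCII fallback described in the quotes is specific to Python 2.

```python
import sys

def describe_stream(stream):
    """Return (is_tty, encoding) for a text stream such as sys.stdout."""
    is_tty = bool(getattr(stream, "isatty", lambda: False)())
    encoding = getattr(stream, "encoding", None)
    return is_tty, encoding

is_tty, encoding = describe_stream(sys.stdout)
print("attached to a terminal:", is_tty)
print("stdout encoding:", encoding)
```

Running the script once directly in a terminal and once with its output redirected to a file shows how the two situations differ on a given system.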

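Ho ho, so the idea is workable, just not very elegant, because we would have to change system settings. In fact, the discussion above is based on Linux, where we may need to change the value of an environment variable (LC_CTYPE or LANG). On Windows, the console's encoding is tied to the operating system's regional settings. For example, on Chinese Win7 the console's default encoding is GBK (cp936). You can try the following code: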

import locale
# Python 2: the second item is the locale's default encoding, e.g. cp936
print locale.getdefaultlocale()[1]
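For readers on Python 3, a rough counterpart is sketched below. It uses locale.getpreferredencoding, because locale.getdefaultlocale is deprecated in recent Python 3 releases. The value printed depends entirely on the machine it runs on.

```python
import locale

# Encoding the OS locale prefers for text; on a Chinese-locale Windows
# system this is typically reported as cp936 (GBK).
preferred = locale.getpreferredencoding(False)
print(preferred)
```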

Since the console's encoding is awkward to change, can we set sys.stdout.encoding instead to get what we want? Unfortunately not: the attribute is read-only.
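The read-only behavior can be checked without touching the real sys.stdout. This Python 3 sketch builds a stand-in text stream over an in-memory buffer (an assumption for illustration) and attempts the assignment:

```python
import io

# A stand-in for sys.stdout: a text wrapper over an in-memory byte buffer.
stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")

try:
    stream.encoding = "gbk"
    assignment_rejected = False
except AttributeError:
    # The encoding attribute cannot be reassigned after the stream is created.
    assignment_rejected = True

print(assignment_rejected, stream.encoding)
```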

Is there no way out? Not at all; we are actually very close. Let's sort through the material gathered above and see what we know so far:

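1). The console cannot display Chinese correctly, and the console's encoding is determined by the operating system (on Windows);

2). My operating system is the Chinese edition of Win7 (GBK), as reported by enc = locale.getdefaultlocale()[1];

3). The console's encoding determines the value of sys.stdout.encoding; in my environment, sys.stdout.encoding = utf-8;

4). The strings returned when listing the directory (E:\Torchlight II) are also GBK-encoded.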
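Putting these facts side by side, the mismatch can be reproduced in a few lines. This is a Python 3 sketch with an illustrative file name: it encodes text as GBK, as the operating system would, and then reads the bytes back as UTF-8, as the output side would.

```python
original = "火炬之光"  # a sample Chinese file name (illustrative)
gbk_bytes = original.encode("gbk")  # what a GBK system hands back as raw bytes

# Interpreting GBK bytes as UTF-8 does not recover the text; invalid
# sequences are shown as U+FFFD replacement characters.
misread = gbk_bytes.decode("utf-8", errors="replace")

# Interpreting them with the encoding they were written in does.
recovered = gbk_bytes.decode("gbk")

# ascii() keeps the demo printable on any console, whatever its encoding.
print(ascii(misread))
print(recovered == original)
```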

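Can you see the problem now? The strange question marks and angle-bracket symbols in the output at the top appear because the strings themselves are GBK-encoded bytes, yet sys.stdout.encoding = utf-8, so on output those bytes are treated as UTF-8. GBK bytes read as UTF-8 do not form the intended characters, which is of course wrong. Now that the cause is clear, let's fix the code: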

import os
# Python 2: decode the GBK byte strings into unicode before printing
for i in os.listdir(r"E:\Torchlight II"):
    print i.decode('gbk')

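In the code we explicitly tell Python to decode the incoming strings as GBK. After this step the data is proper unicode and can safely be handed to print for output (even though sys.stdout.encoding = utf-8 at this point).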
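The same decode-first idea, sketched in Python 3 terms: raw byte names are decoded with the encoding they were written in, and only then passed on. The helper decode_names and the simulated listing are hypothetical; real code would get its bytes from the operating system.

```python
def decode_names(raw_names, source_encoding="gbk"):
    """Decode raw byte file names into text using their original encoding."""
    return [raw.decode(source_encoding) for raw in raw_names]

# Simulated directory listing as raw GBK bytes (hypothetical file names).
raw_listing = ["存档.sav".encode("gbk"), b"readme.txt"]

names = decode_names(raw_listing)

# Once decoded, the text can be re-encoded for whatever the output expects.
utf8_ready = [name.encode("utf-8") for name in names]

print([ascii(name) for name in names])
```

Decoding at the boundary, as soon as the bytes arrive, means the rest of the program works with text only and the output encoding no longer matters until the final write.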

P.S.:

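While searching Google I also came across many similar encoding problems. They come in many shapes and have many solutions, some of them specific to Python itself, but in essence they are all the same: they are about encoding and decoding characters. Once that is understood, all of these problems can be resolved.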

Here are a few references I consider valuable:

　　http://docs.python.org/howto/unicode.html#history-of-character-codes Unicode HOWTO

　　http://farmdev.com/talks/unicode/ Unicode In Python, Completely Demystified

　　http://www.stereoplex.com/blog/python-unicode-and-unicodedecodeerror Python, Unicode and UnicodeDecodeError

　　http://www.joelonsoftware.com/articles/Unicode.html The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets

---Please respect the author's work; when reposting, credit the original author and link to the original post :)---
