Character Set Encoding and Python (Part 2): Unicode and UTF-8

Last updated: 2017-02-04 Source: Internet


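Unicode and UTF-8 in Python

The previous article covered the history of character sets and briefly explained the relationship between Unicode and UTF-8. To summarize: UTF-8, UTF-16, and UTF-32 belong to the same category and serve the same purpose, with UTF-8 being the most widely used. Unicode and UTF-8, however, are not the same kind of thing: Unicode is the representation form (a character set that assigns a code point to each character), while UTF-8 is a storage form (an encoding of those code points as bytes).

Unicode is the representation form (UTF-8 can be decoded into Unicode).
UTF-8, UTF-16, and UTF-32 are storage forms (Unicode can be encoded into UTF-8).

In other words: text has to be encoded (for example as UTF-8) when it is stored, and stored UTF-8 has to be decoded into Unicode before it is worked with. Code operates on Unicode; files hold UTF-8.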
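The round trip described above can be sketched as follows. This is a minimal illustration written for Python 3 (an assumption; the sessions in this article use Python 2), where str holds Unicode code points and bytes holds the encoded form.

```python
# Python 3 sketch: text lives in memory as Unicode code points and is
# turned into UTF-8 bytes only for storage or transmission.
text = "张三"

code_points = [hex(ord(ch)) for ch in text]  # the abstract Unicode form
stored = text.encode("utf-8")                # the storage form (bytes)
restored = stored.decode("utf-8")            # back to Unicode text

print(code_points)
print(stored)
print(restored == text)
```

Each of the two characters has one code point but takes three bytes in UTF-8.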

Without Unicode

    In [1]: name = '张三'

    In [2]: print name
    张三

    In [3]: name
    Out[3]: '\xe5\xbc\xa0\xe4\xb8\x89'     # UTF-8 encoded: the storage form

    In [4]: len(name)
    Out[4]: 6

    In [5]: name[0:2]     # slicing
    Out[5]: '\xe5\xbc'

    In [6]: print name[0:1]
    ?

    In [7]: type(name)     # the type is str
    Out[7]: str

With Unicode:

In Python 2, prefix the string literal with u:

    In [8]: name = u'张三'

    In [9]: name
    Out[9]: u'\u5f20\u4e09'     # Unicode code points: the representation form

    In [10]: print name
    张三

    In [11]: print name[0:1]
    张

    In [12]: name[0:1]
    Out[12]: u'\u5f20'

    In [13]: len(name)
    Out[13]: 2

    In [15]: type(name)
    Out[15]: unicode     # the type is unicode
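The difference between the two sessions comes down to counting bytes versus counting characters. A minimal Python 3 sketch of the same contrast (assuming Python 3, where the byte form is bytes and the text form is str):

```python
text = "张三"
raw = text.encode("utf-8")

char_count = len(text)   # counts characters (code points)
byte_count = len(raw)    # counts bytes

first_char = text[0:1]   # slicing text keeps characters whole
broken = raw[0:2]        # slicing bytes can cut a character in half

# Decoding the truncated bytes fails because the sequence is incomplete.
try:
    broken.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

print(char_count, byte_count, first_char, broken, decodable)
```

This is why slicing or measuring text is safer on the Unicode form than on the encoded bytes.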

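Now for the key part.

Decoding and encoding functions

Python provides built-in methods for converting between Unicode and UTF-8: decode() and encode().

Encoding, encode(): from the representation form to a storage form.

Decoding, decode(): from a storage form to the representation form.

Unicode itself is not tied to any particular encoding: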

    In [37]: name = u'张三'

    In [38]: b_name = name.encode('utf-8')     # can be encoded as UTF-8

    In [39]: b_name
    Out[39]: '\xe5\xbc\xa0\xe4\xb8\x89'

    In [47]: type(b_name)     # the type is str
    Out[47]: str

    In [40]: b_name2 = name.encode('utf-16')     # or as UTF-16

    In [41]: b_name2
    Out[41]: '\xff\xfe _\tN'

    In [42]: b_name3 = name.encode('utf-32')     # or as UTF-32

    In [43]: b_name3
    Out[43]: '\xff\xfe\x00\x00 _\x00\x00\tN\x00\x00'

    In [44]: j_name = b_name.decode('utf-8')     # decode UTF-8 into Unicode

    In [45]: j_name
    Out[45]: u'\u5f20\u4e09'

    In [46]: type(j_name)     # the type is unicode
    Out[46]: unicode
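To make the comparison between the storage forms concrete, the following Python 3 sketch measures the same text under each encoding. The explicit "-le" codec names are used here so that the result does not depend on the machine's byte order; that choice is an assumption for the illustration, not something the sessions above do.

```python
text = "张三"

# Explicit byte order: no byte order mark (BOM) is written.
sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# The generic "utf-16" codec prepends a 2-byte BOM, as seen in Out[41].
with_bom = text.encode("utf-16")
bom_overhead = len(with_bom) - sizes["utf-16-le"]

# Whatever the storage form, decoding returns the same Unicode text.
same = all(text.encode(enc).decode(enc) == text
           for enc in ("utf-8", "utf-16", "utf-32"))

print(sizes, bom_overhead, same)
```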

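This is why writing Unicode directly to a file fails. The error message says the ordinal is not in range(128): ASCII covers only the values 0-127, and Chinese characters fall well outside that range. To understand the error: Unicode is a representation form, so it has to be encoded with some concrete encoding before it can be stored. Python 2 uses ASCII as its default encoding, so it tries to store the text as ASCII. The text here is Chinese, which ASCII cannot represent, so the write fails: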

    In [47]: name = u'张三'

    In [50]: with open('/tmp/test', 'w') as f:
        ...:     f.write(name)
        ...:
    ---------------------------------------------------------------------------
    UnicodeEncodeError                        Traceback (most recent call last)
    <ipython-input-4-0d87fa01de83> in <module>()
          1 with open('/tmp/test', 'w') as f:
    ----> 2     f.write(name)

    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
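The same failure can be reproduced without a file by asking for the ASCII codec explicitly. A minimal Python 3 sketch (an assumption; in Python 2 the ASCII codec is applied implicitly by f.write):

```python
text = "张三"

try:
    text.encode("ascii")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    # The exception records which codec failed and where (end is exclusive).
    details = (exc.encoding, exc.start, exc.end)

# Error handlers trade the exception for lossy or escaped output.
replaced = text.encode("ascii", errors="replace")
escaped = text.encode("ascii", errors="backslashreplace")

print(failed, details, replaced, escaped)
```

The error handlers are shown only to illustrate what ASCII can hold; they do not preserve the original text the way a UTF-8 encode does.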

The fix follows directly: encode the text as UTF-8 (or UTF-16, and so on) before writing.

    In [51]: with open('/tmp/test', 'w') as f:
        ...:     f.write(name.encode('utf-8'))     # write to the file as UTF-8
        ...:

    In [52]: with open('/tmp/test', 'r') as f:
        ...:     new_name = f.read()
        ...:

    In [53]: new_name.decode('utf-8')     # decode UTF-8 into Unicode
    Out[53]: u'\u5f20\u4e09'
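For comparison, here is a sketch of the same manual encode-then-write approach in Python 3, where the file has to be opened in binary mode to accept bytes. The temporary path is a hypothetical stand-in for /tmp/test.

```python
import os
import tempfile

text = "张三"
path = os.path.join(tempfile.mkdtemp(), "test")  # hypothetical temp file

# Binary mode: the program is responsible for encoding and decoding.
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))

with open(path, "rb") as f:
    raw = f.read()

restored = raw.decode("utf-8")
print(raw, restored)
```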

Character set differences between Python 2 and Python 3

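The differences between Python 2 and Python 3 with respect to character sets:
Python 3 has two types that represent sequences of characters: bytes and str. Instances of bytes contain raw 8-bit values; instances of str contain Unicode characters.
Python 2 also has two types that represent sequences of characters, called str and unicode. Unlike in Python 3, instances of str contain raw 8-bit values, while instances of unicode contain Unicode characters.

1. In Python 2, str is an ordinary byte string and unicode is Unicode text. A literal with no prefix is a str; a literal with the u prefix is a unicode.

    In [15]: name = u'张三'

    In [16]: type(name)
    Out[16]: unicode

2. In Python 3, a string literal with no prefix is a str.
3. The str of Python 3 is the unicode of Python 2, and the str of Python 2 is the bytes of Python 3.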
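A short Python 3 sketch of what this separation means in practice: the two types are distinct and do not mix implicitly, even when the content is pure ASCII.

```python
text = "张三"
data = text.encode("utf-8")

kinds = (type(text).__name__, type(data).__name__)

# bytes and str never compare equal, even for pure ASCII content.
ascii_equal = (b"abc" == "abc")

# Concatenating the two types is an error rather than an implicit conversion.
try:
    "abc" + b"def"
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(kinds, ascii_equal, mixed_ok)
```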

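The open function

In Python 2, the standard library's codecs module handles encoding and decoding automatically: the open function it provides (codecs.open) accepts an encoding parameter.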

    In [55]: import codecs

    In [56]: name = u'张三'

    In [57]: with codecs.open('/tmp/test', 'w', encoding='utf-8') as f:
        ...:     f.write(name)
        ...:

    In [58]: with codecs.open('/tmp/test', 'r', encoding='utf-8') as f:
        ...:     new_name = f.read()
        ...:

    In [59]: new_name
    Out[59]: u'\u5f20\u4e09'

In Python 3, the built-in open function itself accepts an encoding parameter, so the encoding can be specified directly. Usage is the same as with the codecs module in Python 2:

    >>> name = '张三'
    >>> name
    '张三'
    >>> with open('/tmp/test', 'w', encoding='utf-8') as f:
    ...     f.write(name)
    ...
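A version-neutral variant is sketched below using io.open, which accepts an encoding parameter in Python 2.6+ and is the same function as the built-in open in Python 3. The temporary path is a hypothetical stand-in for /tmp/test, and the text is written with escapes so that the source file needs no encoding declaration.

```python
import io
import os
import tempfile

text = u"\u5f20\u4e09"  # the two characters of the name used above
path = os.path.join(tempfile.mkdtemp(), "test")  # hypothetical temp file

# Text mode with an explicit encoding: the file object encodes on write...
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# ...and decodes on read.
with io.open(path, "r", encoding="utf-8") as f:
    restored = f.read()

# Reading in binary mode shows the bytes that actually landed on disk.
with io.open(path, "rb") as f:
    raw = f.read()

print(restored == text, raw)
```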

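Summary

There are many ways to represent Unicode characters as binary data; the most common encoding is UTF-8.
The str of Python 3 and the unicode of Python 2 are not associated with any particular binary encoding. To convert Unicode characters to binary data, use the encode method; to convert binary data to Unicode characters, use the decode method.
When programming, do the encoding and decoding at the outermost boundary of your interfaces. The core of the program should use Unicode character types and make no assumptions about character encoding.

Python 3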

    # In Python 3, write a helper that accepts str or bytes and always returns str:
    def to_str(bytes_or_str):
        if isinstance(bytes_or_str, bytes):
            value = bytes_or_str.decode('utf-8')
        else:
            value = bytes_or_str
        return value  # Instance of str

    # And a helper that accepts str or bytes and always returns bytes:
    def to_bytes(bytes_or_str):
        if isinstance(bytes_or_str, str):
            value = bytes_or_str.encode('utf-8')
        else:
            value = bytes_or_str
        return value  # Instance of bytes

Python 2

    # In Python 2, write a helper that accepts str or unicode and always returns unicode:
    def to_unicode(unicode_or_str):
        if isinstance(unicode_or_str, str):
            value = unicode_or_str.decode('utf-8')
        else:
            value = unicode_or_str
        return value  # Instance of unicode

    # And a helper that accepts str or unicode and always returns str:
    def to_str(unicode_or_str):
        if isinstance(unicode_or_str, unicode):
            value = unicode_or_str.encode('utf-8')
        else:
            value = unicode_or_str
        return value  # Instance of str
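As a rough illustration of keeping encoding and decoding at the boundary, the Python 3 sketch below decodes on the way in, works on str in the middle, and encodes on the way out. The greet function is hypothetical and exists only for this example; the two helpers are repeated so the block runs on its own.

```python
def to_str(bytes_or_str):
    """Accept str or bytes and always return str."""
    if isinstance(bytes_or_str, bytes):
        return bytes_or_str.decode("utf-8")
    return bytes_or_str


def to_bytes(bytes_or_str):
    """Accept str or bytes and always return bytes."""
    if isinstance(bytes_or_str, str):
        return bytes_or_str.encode("utf-8")
    return bytes_or_str


def greet(raw_name):
    """Hypothetical boundary function: bytes or str in, UTF-8 bytes out."""
    name = to_str(raw_name)                # decode at the input boundary
    greeting = "Hello, " + name + "!"      # core logic works on Unicode only
    return to_bytes(greeting)              # encode at the output boundary


from_bytes = greet(b"\xe5\xbc\xa0\xe4\xb8\x89")
from_text = greet("张三")
print(from_bytes == from_text, from_bytes.decode("utf-8"))
```

Because the core only ever sees str, it behaves the same whether the caller supplied bytes or text.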

The latter part of this article is excerpted from Effective Python: 59 Specific Ways to Write Better Python, Item 3: Know the Differences Between bytes, str, and unicode.
