Python Troubleshooting: "SyntaxError: Non-ASCII character" and Handling Chinese Text in Python

Last updated: 2018-12-06 Source: Internet

Uploaded by: User


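Handling Chinese text in Python has long been a headache for beginners, and this article walks through the topic in detail. It is almost certain that a future version of Python will solve this problem for good, so that we no longer have to go to this much trouble.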

First, check the Python version:

>>> import sys
>>> sys.version
'2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)]'

(1) Use Notepad to create a file named ChineseTest.py, saved in the default ANSI encoding:

s = "中文"
print s

Run it and see what happens:

E:\Project\Python\Test>python ChineseTest.py

  File "ChineseTest.py", line 1
SyntaxError: Non-ASCII character '\xd6' in file ChineseTest.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

Quietly change the file's encoding to UTF-8:

E:\Project\Python\Test>python ChineseTest.py
  File "ChineseTest.py", line 1
SyntaxError: Non-ASCII character '\xe4' in file ChineseTest.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

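That does not help...
Since the message gives a URL, let's take a look. A quick read shows that if a source file contains non-ASCII characters, an encoding declaration must appear on its first or second line. Change the encoding of ChineseTest.py back to ANSI and add the declaration: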

# coding=gbk
s = "中文"
print s

Try again:

E:\Project\Python\Test>python ChineseTest.py
中文

Now it works :)
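To make the "first or second line" rule concrete, here is a minimal sketch of how such a declaration can be detected. The pattern is adapted from the one given in PEP 263; the function name declared_encoding and the two sample byte strings are made up for illustration, and the real interpreter applies additional rules that this sketch ignores.

```python
# -*- coding: utf-8 -*-
import re

# Pattern adapted from PEP 263: a comment that contains "coding=" or
# "coding:" followed by an encoding name.
CODING_RE = re.compile(br"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source_bytes):
    """Return the encoding declared on line 1 or 2 of the source, or None.

    Hypothetical helper for illustration only; it is a simplification of
    what the interpreter does.
    """
    for line in source_bytes.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None


# The bytes of ChineseTest.py when saved as ANSI (GBK), without and with
# the declaration. \xd6\xd0\xce\xc4 is the GBK encoding of the two characters.
undeclared = b's = "\xd6\xd0\xce\xc4"\nprint s\n'
declared = b"# coding=gbk\n" + undeclared

print(declared_encoding(undeclared))  # None
print(declared_encoding(declared))    # gbk
```

The Emacs-style form "# -*- coding: utf-8 -*-" matches the same pattern, and so does a declaration placed on the second line after a "#!" line.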

(2) Check the length of the string:

# coding=gbk
s = "中文"
print len(s)

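Result: 4.
Here s is of type str, so len counts bytes. Under GBK each Chinese character takes two bytes, the same as two English characters, so the length is 4.
Now write it like this: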

# coding=gbk
s = "中文"
s1 = u"中文"
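s2 = unicode(s, "gbk") # if the encoding argument is omitted, Python decodes with its default, ASCII
s3 = s.decode("gbk") # converting str to unicode is called decoding; the unicode function does the same thing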
print len(s1)
print len(s2)
print len(s3)

Result:
2
2
2
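
The difference between counting bytes and counting characters can be checked directly. The sketch below is written to run under Python 2.6+ and Python 3 (an assumption; the article itself uses 2.5), and it spells the two characters as Unicode escapes so the result does not depend on how the file is saved. The helper hex_bytes is a made-up name.

```python
# -*- coding: utf-8 -*-
text = u"\u4e2d\u6587"  # the two characters used throughout this article

gbk_bytes = text.encode("gbk")     # what an ANSI (GBK) file stores
utf8_bytes = text.encode("utf-8")  # what a UTF-8 file stores


def hex_bytes(data):
    """Format a byte string as space-separated hex pairs (illustrative helper)."""
    return " ".join("%02x" % b for b in bytearray(data))


print(len(text))              # 2 characters
print(len(gbk_bytes))         # 4 bytes
print(len(utf8_bytes))        # 6 bytes
print(hex_bytes(gbk_bytes))   # d6 d0 ce c4
print(hex_bytes(utf8_bytes))  # e4 b8 ad e6 96 87
```

The first byte of each encoding (d6 for GBK, e4 for UTF-8) is exactly the character reported in the two SyntaxError messages in part (1).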
(3) Next, file handling: create a file test.txt in ANSI format with the content abc中文, and read it with Python

# coding=gbk
print open("Test.txt").read()

Result: abc中文
Change the file format to UTF-8:
Result: abc涓枃
Clearly, the content needs to be decoded here:

# coding=gbk
import codecs
print open("Test.txt").read().decode("utf-8")

Result: abc中文
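
An alternative to calling decode on the raw bytes is to let the file object decode while reading. The following is a sketch, not the article's method: it assumes Python 2.6+ or Python 3, where io.open accepts an encoding argument, and it writes its own temporary Test.txt so that it is self-contained. The function name read_text is hypothetical.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

expected = u"abc\u4e2d\u6587"  # "abc" followed by the two Chinese characters


def read_text(path, encoding):
    """Read a whole file and decode it with the given encoding."""
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


# Create a UTF-8 file (without BOM) to read back.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "Test.txt")
with io.open(path, "wb") as f:
    f.write(expected.encode("utf-8"))

content = read_text(path, "utf-8")
print(content == expected)  # True

# Clean up the temporary file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```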
I edited the test.txt above with EditPlus. When I edit it with the Notepad that ships with Windows and save it as UTF-8,
running the same script raises an error:

# coding=gbk
import codecs
print open("Test.txt").read().decode("utf-8")

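It turns out that some programs, such as Notepad, insert three invisible bytes (0xEF 0xBB 0xBF, the BOM) at the beginning of a file when saving it as UTF-8.
We therefore have to strip these bytes ourselves when reading. Python's codecs module defines a constant for them: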

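# coding=gbk
import codecs
data = open("Test.txt").read()
if data[:3] == codecs.BOM_UTF8:  # strip the BOM that Notepad prepends, if present
    data = data[3:]
print data.decode("utf-8")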

Result: abc中文
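
The BOM handling can be tried without touching a file. The sketch below builds the bytes Notepad would write and compares three ways of decoding them. Note that the "utf-8-sig" codec is not mentioned in the original text; it is a standard codec (available since Python 2.5) that is swapped in here as an alternative to stripping the BOM by hand. The helper strip_bom is a made-up name.

```python
# -*- coding: utf-8 -*-
import codecs

expected = u"abc\u4e2d\u6587"
# Bytes as written by an editor that prepends the UTF-8 BOM (EF BB BF).
with_bom = codecs.BOM_UTF8 + expected.encode("utf-8")


def strip_bom(data):
    """Remove a leading UTF-8 BOM from a byte string, if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


plain = with_bom.decode("utf-8")              # BOM survives as u"\ufeff"
manual = strip_bom(with_bom).decode("utf-8")  # BOM removed by hand
via_codec = with_bom.decode("utf-8-sig")      # BOM removed by the codec

print(plain[0] == u"\ufeff")  # True
print(manual == expected)     # True
print(via_codec == expected)  # True
```

The leftover u"\ufeff" in the first case is a character that GBK cannot represent, which would be consistent with print failing on a GBK console.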

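(4) A remaining question
In part (2) we used the unicode function and the decode method to convert a str to unicode. Why did both take "gbk" as their argument?
The first guess is that it is because our encoding declaration uses gbk (# coding=gbk), but is that really the reason?
Modify the source file: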

# coding=utf-8
s = "中文"
print unicode(s, "utf-8")

Run it, and it raises an error:

Traceback (most recent call last):
  File "ChineseTest.py", line 3, in <module>
    s = unicode(s, "utf-8")
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid data

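Clearly, if the earlier example worked because both sides used gbk, then keeping both sides consistent with utf-8 here should work too, rather than raise an error.
Taking the example one step further, suppose we still convert with gbk here: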

# coding=utf-8
s = "中文"
print unicode(s, "gbk")

Result: 中文
I looked through an English-language article that roughly explains how print works in Python:
When Python executes a print statement, it simply passes the output to the operating system (using fwrite() or something like it), and some other program is responsible for actually displaying that output on the screen. For example, on Windows, it might be the Windows console subsystem that displays the result. Or if you're using Windows and running Python on a Unix box somewhere else, your Windows SSH client is actually responsible for displaying the data. If you are running Python in an xterm on Unix, then xterm and your X server handle the display.

To print data reliably, you must know the encoding that this display program expects.

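In short, print in Python passes the string straight to the operating system, so what you print has to end up in the encoding the console expects. Chinese Windows uses CP936 (almost identical to gbk), so gbk works here. Note also that the encoding passed to unicode() or decode() has to match the bytes the str actually holds, not the coding declaration; the "invalid data" error above suggests that the file was still saved as ANSI (GBK) even though it declared utf-8.
One last test: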

# coding=utf-8
s = "中文"
print unicode(s, "cp936")

Result: 中文
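
The behaviour in this part can be reproduced with plain byte strings. The sketch below rests on one assumption that the article does not state: that ChineseTest.py was still saved as ANSI (GBK) when the declaration was changed to utf-8, so the literal held GBK bytes. Under that assumption, decoding as utf-8 fails while gbk and cp936 both succeed. The helper try_decode is a made-up name.

```python
# -*- coding: utf-8 -*-
text = u"\u4e2d\u6587"
# Assumption: these are the bytes the str literal held, because the file
# was saved as ANSI (GBK) regardless of the "# coding=utf-8" line.
gbk_bytes = text.encode("gbk")


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid for encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


print(try_decode(gbk_bytes, "utf-8") is None)  # True: not valid UTF-8
print(try_decode(gbk_bytes, "gbk") == text)    # True
print(try_decode(gbk_bytes, "cp936") == text)  # True
```

In other words, the argument that works is the one matching the encoding the bytes were written in.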

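This article was originally written in Chinese and published on aliyun.com. An English version is also available, for information purposes only. This website makes no representation or warranty, express or implied, as to the accuracy, completeness, or reliability of the article or any translation of it. If you have any concerns or complaints about the article, please email info-contact@alibabacloud.com with a detailed description. Staff will contact you within 5 working days, and infringing content will be removed once verified.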
