How to use Chinese in Python

Last Update:2016-06-16 Source: Internet

Author: User



First, check the Python version:
>>> import sys
>>> sys.version
'2.5.1 (r251:54863, Apr, 08:51:08) [MSC v.1310 (Intel)]'

(i)
Create a file named chinesetest.py with Notepad, saved in the default ANSI encoding:
s = "中文"
print s

Run it and see:
E:\project\python\test>python chinesetest.py
  File "chinesetest.py", line 1
SyntaxError: Non-ASCII character '\xd6' in file chinesetest.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

Now change the file encoding to UTF-8 and run it again:
E:\project\python\test>python chinesetest.py
  File "chinesetest.py", line 1
SyntaxError: Non-ASCII character '\xe4' in file chinesetest.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
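
The two error messages name different bytes because the same text is stored differently under each encoding. A minimal sketch, written in Python 3 syntax rather than the article's Python 2.5, and assuming the sample string is the two characters 中文 (spelled with escapes so the snippet does not depend on how the file is saved):

```python
# Assumption: the sample string is U+4E2D U+6587 (the characters 中文).
text = "\u4e2d\u6587"

# "ANSI" in Notepad on a Simplified Chinese Windows system means GBK/CP936.
gbk_bytes = text.encode("gbk")
utf8_bytes = text.encode("utf-8")

# The first byte of each sequence is the one the SyntaxError reports.
print(hex(gbk_bytes[0]))   # 0xd6
print(hex(utf8_bytes[0]))  # 0xe4
print(len(gbk_bytes), len(utf8_bytes))  # 4 6
```

Either way the interpreter sees a non-ASCII byte on line 1 and has no declaration telling it how to interpret that byte.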

No luck.
Since the message provides a URL, let's read it. After a quick look, it turns out that if a source file contains non-ASCII characters, an encoding declaration must appear on the first or second line. Change the encoding of chinesetest.py back to ANSI and add the encoding declaration:
# coding=gbk
s = "中文"
print s

Try again:
E:\project\python\test>python chinesetest.py
中文

It works now :)
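
To see which encoding a declaration selects, the standard library can parse it for you. The sketch below uses Python 3's tokenize.detect_encoding, which applies the PEP 263 rules; the helper name declared_encoding is hypothetical and not part of the article.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding Python 3 would pick for these bytes (PEP 263)."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# A GBK-encoded source file with a declaration on the first line.
with_decl = b'# coding=gbk\ns = "\xd6\xd0\xce\xc4"\n'
# A pure-ASCII source file with no declaration.
without_decl = b's = "ok"\n'

print(declared_encoding(with_decl))     # gbk
print(declared_encoding(without_decl))  # utf-8
```

Note that Python 3 falls back to UTF-8 when no declaration is present, whereas the Python 2.5 interpreter used in this article falls back to ASCII, which is why the undeclared file failed.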
(ii)
Now look at its length:
# coding=gbk
s = "中文"
print len(s)
Result: 4.
s is of type str, which is a byte string. In GBK each Chinese character takes two bytes, the same as two English characters, so the length is 4.
Now write it this way:
# coding=gbk
s = "中文"
s1 = u"中文"
s2 = unicode(s, "gbk")  # if the encoding argument is omitted, Python decodes with its default, ASCII
s3 = s.decode("gbk")    # decode converts str to unicode; the unicode function does the same thing
print len(s1)
print len(s2)
print len(s3)
Results:
2
2
2
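
The same byte-versus-character distinction can be reproduced in Python 3, where the Python 2 str corresponds to bytes and the Python 2 unicode corresponds to str. This is only a sketch under that mapping, again assuming the sample string is 中文:

```python
text = "\u4e2d\u6587"      # assumption: the sample string, two characters
raw = text.encode("gbk")    # the bytes a GBK-encoded Python 2 str would hold

print(len(raw))                # 4 (bytes)
print(len(raw.decode("gbk")))  # 2 (characters)

# Decoding with ASCII, the Python 2 default, cannot work for these bytes.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)  # False
```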
(iii)
Next, look at file handling.
Create a file test.txt, saved in ANSI format, with the content:
ABC中文
Read it with Python:
# coding=gbk
print open("test.txt").read()
Result: ABC中文
Change the file format to UTF-8 and run it again:
Result: ABC followed by garbled characters, because the UTF-8 bytes are displayed as if they were GBK.
Clearly, the content needs to be decoded:
# coding=gbk
import codecs
print open("test.txt").read().decode("utf-8")
Result: ABC中文
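
The garbling and its fix can be reproduced without touching the console. A sketch in Python 3 syntax, using a temporary file instead of the article's test.txt and assuming the file content is ABC followed by 中文:

```python
import os
import tempfile

content = "ABC\u4e2d\u6587"  # assumption: the text stored in the file

# Write a throwaway UTF-8 file without a BOM.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(content.encode("utf-8"))

# Read the raw bytes back, then clean up.
with open(path, "rb") as f:
    raw = f.read()
os.remove(path)

wrong = raw.decode("gbk", errors="replace")  # what a GBK console effectively does
right = raw.decode("utf-8")                  # the encoding the file was saved in

print(right == content)  # True
print(wrong == content)  # False
```

The ASCII prefix survives either way, since GBK and UTF-8 agree on ASCII bytes; only the Chinese characters are damaged.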
The test.txt above was edited with EditPlus. When I instead edit it with the Notepad editor that ships with Windows and save it as UTF-8, running the script fails:
Traceback (most recent call last):
  File "chinesetest.py", line 3, in <module>
    print open("test.txt").read().decode("utf-8")
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence
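
Note that the failure is an encode error, not a decode error: decoding succeeds, but the result starts with U+FEFF, which GBK cannot represent when the text is sent to the console. A small Python 3 sketch of that sequence, with the file content assumed as before:

```python
import codecs

# Bytes as Notepad would save them: a UTF-8 BOM followed by the text.
raw = codecs.BOM_UTF8 + "ABC\u4e2d\u6587".encode("utf-8")

decoded = raw.decode("utf-8")
print(decoded[0] == "\ufeff")  # True: the BOM survives as a character

# Converting to GBK for display is the step that fails.
try:
    decoded.encode("gbk")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)  # False
```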

It turns out that some software, such as Notepad, inserts three invisible bytes (0xEF 0xBB 0xBF, the BOM) at the beginning of the file when saving it as UTF-8.
So we need to strip these bytes when reading the file. The codecs module in Python defines a constant for them:
# coding=gbk
import codecs
data = open("test.txt").read()
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]
print data.decode("utf-8")
Result: ABC中文
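
The BOM check can be wrapped in a small function so it works whether or not the BOM is present. The sketch below is in Python 3 syntax and the helper name decode_utf8_file_bytes is hypothetical. It also shows the standard "utf-8-sig" codec, which skips a leading BOM on its own.

```python
import codecs


def decode_utf8_file_bytes(raw):
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.decode("utf-8")


payload = "ABC\u4e2d\u6587"  # assumption: the text stored in the file
with_bom = codecs.BOM_UTF8 + payload.encode("utf-8")
without_bom = payload.encode("utf-8")

print(decode_utf8_file_bytes(with_bom) == payload)     # True
print(decode_utf8_file_bytes(without_bom) == payload)  # True

# The built-in codec does the same stripping.
print(with_bom.decode("utf-8-sig") == payload)         # True
```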

(iv) A few remaining questions
In part (ii), we used the unicode function and the decode method to convert str to unicode. Why is the argument to both of them "gbk"?
The first guess is that it is because GBK was declared in our source file (# coding=gbk). But is that really so?
Modify the source file:
# coding=utf-8
s = "中文"
print unicode(s, "utf-8")
Run it, and it fails:
Traceback (most recent call last):
  File "chinesetest.py", line 3, in <module>
    s = unicode(s, "utf-8")
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid data
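If the earlier examples worked because both sides used GBK, then keeping both sides consistent as UTF-8 should work as well, yet it fails. (The error is consistent with the file itself still being saved as ANSI: the declaration changes how Python reads the source, not the bytes stored on disk.)
As a further test, what if we still convert with GBK here:
# coding=utf-8
s = "中文"
print unicode(s, "gbk")
Result: 中文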
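
The two outcomes can be reproduced directly on the bytes. This Python 3 sketch assumes the string literal was stored as GBK bytes, which is what the "bytes in position 0-1" error suggests. The helper try_decode is hypothetical.

```python
# Assumption: the literal in the file is 中文 stored as GBK bytes.
gbk_bytes = "\u4e2d\u6587".encode("gbk")


def try_decode(raw, encoding):
    """Return the decoded text, or None if the bytes are invalid for that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


print(try_decode(gbk_bytes, "utf-8") is None)          # True: not valid UTF-8
print(try_decode(gbk_bytes, "gbk") == "\u4e2d\u6587")  # True: valid GBK
```

What matters for decoding is the encoding the bytes were actually written in, whatever the declaration at the top of the file says.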
An English document outlines how print works in Python:
When Python executes a print statement, it simply passes the output to the operating system (using fwrite() or something like it), and some other program is responsible for actually displaying that output on the screen. For example, on Windows, it might be the Windows console subsystem that displays the result. Or if you're using Windows and running Python on a Unix box somewhere else, your Windows SSH client is actually responsible for displaying the data. If you are running Python in an xterm on Unix, then xterm and your X server handle the display.

To print data reliably, you must know the encoding that this display program expects.

Put simply, print in Python passes the string straight to the operating system, so the text has to end up in an encoding that the program displaying it expects. The console on Chinese-language Windows uses CP936 (almost the same as GBK), so GBK works here.
Final test:
# coding=utf-8
s = "中文"
print unicode(s, "cp936")
Result: 中文
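
One way to apply the "know what the display expects" rule is to encode explicitly for a chosen target. The sketch below is in Python 3 syntax. The function encode_for_display is a hypothetical helper, and falling back to sys.stdout.encoding and then UTF-8 is an assumption of this sketch, not something the article prescribes.

```python
import sys


def encode_for_display(text, encoding=None):
    """Encode text for a display that expects `encoding`.

    Falls back to the current stdout encoding, then UTF-8 (an assumption
    of this sketch). Unencodable characters become '?' instead of raising.
    """
    target = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(target, errors="replace")


print(encode_for_display("\u4e2d\u6587", "cp936"))  # b'\xd6\xd0\xce\xc4'
print(encode_for_display("\u4e2d\u6587", "ascii"))  # b'??'
```

Using errors="replace" trades accuracy for robustness: output never raises, but characters the display cannot show are lost.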



This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
