Use Chinese in Python

Last Update:2018-12-04 Source: Internet

Author: User

Handling Chinese text in Python has always been a headache for beginners. This article explains the problem in detail. It is almost certain that future versions of Python will solve it completely, so that we no longer have to worry about it.

Let's take a look at the Python version:
>>> import sys
>>> sys.version
'2.5.1 (r133:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)]'

(1)
Use Notepad to create a file named chinesetest.py, saved with the default ANSI encoding:
s = "中文"
print s

Test it:
E:/project/Python/test> python chinesetest.py
  File "chinesetest.py", line 1
SyntaxError: Non-ASCII character '\xd6' in file chinesetest.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

Quietly change the file's encoding to UTF-8 and try again:
E:/project/Python/test> python chinesetest.py
  File "chinesetest.py", line 1
SyntaxError: Non-ASCII character '\xe4' in file chinesetest.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

Still no luck.
Since the error message provides a URL, let's take a look. In short: if a source file contains non-ASCII characters, an encoding declaration must appear on its first or second line. Change the encoding of chinesetest.py back to ANSI and add the encoding declaration:
# coding=gbk
s = "中文"
print s

Try again:
E:/project/Python/test> python chinesetest.py
中文

It works :)
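The two bytes named in the error messages can be checked directly. The following is a minimal sketch, assuming Python 3 (not the 2.5.1 interpreter used in this article) and assuming the test string is 中文; the variable names are illustrative only.

```python
# Python 3 sketch: inspect the bytes an editor would write for the same text.
text = "\u4e2d\u6587"  # the two characters 中文

gbk_bytes = text.encode("gbk")      # bytes of an "ANSI" (GBK) save on Chinese Windows
utf8_bytes = text.encode("utf-8")   # bytes of a UTF-8 save

# The first byte of each encoding is the one quoted in the SyntaxError.
print(hex(gbk_bytes[0]))   # first byte of the GBK form
print(hex(utf8_bytes[0]))  # first byte of the UTF-8 form
```

Either way the file contains non-ASCII bytes, which is why changing the file encoding alone does not remove the error.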
(2)
Take a look at its length:
# coding=gbk
s = "中文"
print len(s)
Result: 4.
Here s is of type 'str', which is a byte string. In GBK each Chinese character takes two bytes (the same as two English characters), so the length is 4.
Now write it as follows:
# coding=gbk
s = "中文"
s1 = u"中文"
s2 = unicode(s, "gbk")  # if the encoding argument is omitted, Python's default ASCII codec is used for decoding
s3 = s.decode("gbk")    # convert str to unicode; decode() and unicode() serve the same purpose
print len(s1)
print len(s2)
print len(s3)
Result:
2
2
2
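The difference between counting bytes and counting characters can be reproduced as follows. This is a sketch under the assumption of Python 3, where `bytes` plays the role of the 2.x `str` type and `str` plays the role of `unicode`; the variable names are made up for illustration.

```python
# Python 3 sketch: byte length versus character length.
text = "\u4e2d\u6587"            # the two characters 中文

gbk_bytes = text.encode("gbk")   # comparable to the 2.x byte string in a GBK file
byte_len = len(gbk_bytes)        # counts bytes
char_len = len(text)             # counts characters

# Decoding with the matching codec gives the text back.
round_trip = gbk_bytes.decode("gbk")

print(byte_len, char_len, round_trip == text)
```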

(3)
Next let's take a look at the file processing:
Create a file named test.txt in ANSI format with the following content:
ABC Chinese
Read data using Python
# Coding = GBK
Print open ("test.txt"). Read ()
Result: ABC (Chinese)
The file format into UTF-8:
Result: ABC Juan
Obviously, decoding is required here:
# Coding = GBK
Import codecs
Print open ("test.txt"). Read ().Decode ("UTF-8 ")
Result: ABC (Chinese)
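For comparison, here is a hedged sketch of the same read-then-decode step, assuming Python 3 and a temporary file standing in for test.txt. The temporary directory and variable names are assumptions made for this example only.

```python
# Python 3 sketch: read raw bytes and decode them, or let open() decode.
import os
import tempfile

text = "abc\u4e2d\u6587"  # abc followed by 中文

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")  # stand-in for the article's test.txt
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Equivalent of open(...).read().decode("utf-8") in the article.
    with open(path, "rb") as f:
        raw = f.read()
    decoded = raw.decode("utf-8")

    # Passing the encoding to open() performs the decoding while reading.
    with open(path, encoding="utf-8") as f:
        read_back = f.read()

print(decoded == text, read_back == text)
```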
The test.txt above was edited with EditPlus. However, when I edit it with the Notepad that ships with Windows and save it as UTF-8, running the script raises an error:
Traceback (most recent call last):
  File "chinesetest.py", line 3, in <module>
    print open("test.txt").read().decode("utf-8")
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence

It turns out that some software, such as Notepad, is saving-8. When a file is encoded, three invisible characters are inserted at the beginning of the file (0xef 0xbb 0xbf, that is, BOM ).
Therefore, we need to remove these characters when reading,The codecs module in Python defines this constant:
# Coding = GBK
Import codecs
Data = open ("test.txt"). Read ()
If data [: 3] = codecs. bom_utf8:
Data = data [3:]
Print data. Decode ("UTF-8 ")
Result: ABC (Chinese)
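The BOM handling can be sketched without a real file. The example below assumes Python 3 and builds the bytes in memory instead of reading a Notepad-saved test.txt. It shows the manual strip used above alongside the standard `utf-8-sig` codec, which skips a leading BOM if one is present.

```python
# Python 3 sketch: two ways to drop a UTF-8 BOM before using the text.
import codecs

payload = "abc\u4e2d\u6587"                       # abc followed by 中文
raw = codecs.BOM_UTF8 + payload.encode("utf-8")   # bytes as a BOM-writing editor saves them

# Plain UTF-8 decoding keeps the BOM as the character U+FEFF.
naive = raw.decode("utf-8")

# Manual strip, mirroring the article's data[:3] check.
data = raw[len(codecs.BOM_UTF8):] if raw.startswith(codecs.BOM_UTF8) else raw
manual = data.decode("utf-8")

# The utf-8-sig codec removes a leading BOM automatically.
auto = raw.decode("utf-8-sig")

print(naive[0] == "\ufeff", manual == payload, auto == payload)
```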

(4) A few remaining questions
In part (2), we used the unicode function and the decode method to convert str to unicode. Why did both of them take "gbk" as their argument?
The first guess is that it is because our encoding declaration uses GBK (# coding=gbk), but is that really the reason?
Modify the source file:
# coding=utf-8
s = "中文"
print unicode(s, "utf-8")
Run it, and an error occurs:
Traceback (most recent call last):
  File "chinesetest.py", line 3, in <module>
    s = unicode(s, "utf-8")
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid data
If the earlier example worked simply because GBK was used on both sides, then keeping UTF-8 consistent on both sides here should work as well and not raise an error. Clearly it does not.
As a further example, suppose the conversion here still uses GBK:
# coding=utf-8
s = "中文"
print unicode(s, "gbk")
Result: 中文
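The two outcomes above can be imitated in a small sketch. It assumes Python 3 and, as a labeled assumption about the experiment, that the string holds GBK-encoded bytes (as it would if the file were still saved as ANSI); the flag and variable names are illustrative.

```python
# Python 3 sketch: decoding GBK bytes with a non-matching and a matching codec.
text = "\u4e2d\u6587"            # the two characters 中文
gbk_bytes = text.encode("gbk")   # assumption: the bytes actually stored are GBK

try:
    gbk_bytes.decode("utf-8")    # the codec does not match the bytes
    utf8_failed = False
except UnicodeDecodeError as exc:
    utf8_failed = True
    print("decode failed:", exc)

recovered = gbk_bytes.decode("gbk")  # the codec matches the bytes
print(recovered == text)
```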
I read an English article that roughly explains how print works in Python:
When Python executes a print statement, it simply passes the output to the operating system (using fwrite() or something like it), and some other program is responsible for actually displaying that output on the screen. For example, on Windows, it might be the Windows console subsystem that displays the result. Or if you're using Windows and running Python on a Unix box somewhere else, your Windows SSH client is actually responsible for displaying the data. If you are running Python in an xterm on Unix, then xterm and your X server handle the display.

To print data reliably, you must know the encoding that this display program expects.

Simply put, print in Python passes the string directly to the operating system, so you need to decode the str using the same encoding that the operating system uses. Chinese Windows uses cp936 (almost the same as GBK), so GBK works here.
One last test:
# coding=utf-8
s = "中文"
print unicode(s, "cp936")
Result: 中文
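To see which encoding the display side expects, something like the following can be used. This is a sketch assuming Python 3; falling back to `locale.getpreferredencoding` when `sys.stdout` reports no encoding is a choice made for this example, not something the article prescribes.

```python
# Python 3 sketch: find the output encoding and encode text for it.
import locale
import sys

text = "\u4e2d\u6587"  # the two characters 中文

# Encoding the terminal or pipe expects; fall back to the locale default.
target = getattr(sys.stdout, "encoding", None) or locale.getpreferredencoding(False)

# errors="replace" avoids an exception when the target cannot represent the text.
encoded = text.encode(target, errors="replace")

# In Python's codec registry, cp936 and gbk yield the same bytes for this text.
same_bytes = text.encode("cp936") == text.encode("gbk")

print("display encoding:", target)
print("bytes sent to the display:", encoded)
print("cp936 matches gbk here:", same_bytes)
```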

======================================

This article is reposted from the Python mailing list: [CPyUG:47963] Python Unicode and Chinese (reposted; the original author is unknown)
