Handling Chinese text in Python has always been a headache for beginners. This article walks through the problem in detail. Future versions of Python will almost certainly solve it completely, so it will not be a worry forever. First, check the Python version:
>>> import sys
>>> sys.version
'2.5.1 (R133:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)]'
(1) Use Notepad to create a file named ChineseTest.py. Notepad saves it in ANSI encoding by default. The string is the two-character Chinese word "中文" ("Chinese"):
s = "中文"
print s
Test it:
E:\Project\Python\Test>python ChineseTest.py
File "ChineseTest.py", line 1
SyntaxError: Non-ASCII character '\xd6' in file ChineseTest.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
Quietly change the file encoding to UTF-8 and run it again:
E:\Project\Python\Test>python ChineseTest.py
File "ChineseTest.py", line 1
SyntaxError: Non-ASCII character '\xe4' in file ChineseTest.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
That does not help either...
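The two error messages name different bytes because the same text is stored differently under each encoding. Here is a minimal sketch of that, written for Python 3 (where bytes and text are separate types, unlike the Python 2.5 used in this article). The variable names are only illustrative:

```python
# Python 3 sketch: the same two characters produce different bytes per encoding.
TEXT = "\u4e2d\u6587"  # the two-character string "中文" ("Chinese")

gbk_bytes = TEXT.encode("gbk")     # ANSI on Simplified Chinese Windows is CP936/GBK
utf8_bytes = TEXT.encode("utf-8")  # the bytes written by a UTF-8 save

# The first byte of each form is the one the SyntaxError complains about.
print(hex(gbk_bytes[0]))   # 0xd6
print(hex(utf8_bytes[0]))  # 0xe4
```

Either way the file contains bytes outside the ASCII range, which is why changing the file encoding alone does not get rid of the error.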
Since the message provides a URL, let's take a look. After a quick read, it turns out that if a source file contains non-ASCII characters, an encoding declaration is required on the first or second line. Change the encoding of ChineseTest.py back to ANSI and add the encoding declaration:
# coding=gbk
s = "中文"
print s
It works now :)
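The first-or-second-line rule from PEP 263 can also be checked programmatically. As a hedged illustration, Python 3's standard tokenize module provides detect_encoding, which looks for the declaration in the first two lines of a source file. Python 2.5 has no such helper, and Python 3 falls back to UTF-8 rather than ASCII when no declaration is found. The sample source bytes below are made up for the example:

```python
import io
import tokenize

# Hypothetical source file contents: a declaration line followed by code.
source = b"# coding=gbk\ns = 1\n"

# detect_encoding takes a readline callable that yields bytes.
encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(source).readline)
print(encoding)  # gbk
```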
(2) Now look at the string's length:
# coding=gbk
s = "中文"
print len(s)
Result: 4.
s is of type str, which is a byte string. In GBK each Chinese character takes two bytes, the same space as two English characters, so the length is 4.
Now write it as follows:
# coding=gbk
s = "中文"
s1 = u"中文"
s2 = unicode(s, "gbk")  # if the encoding argument is omitted, Python decodes with its default encoding, ASCII
s3 = s.decode("gbk")  # decode converts str to unicode; it has the same effect as the unicode function
print len(s1)
print len(s2)
print len(s3)
Result:
2
2
2
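The difference between counting bytes and counting characters can be reproduced in Python 3, where the two types are called bytes and str. This is only a sketch under that assumption, with bytes standing in for the Python 2 str and str standing in for the Python 2 unicode:

```python
TEXT = "\u4e2d\u6587"  # "中文"

raw = TEXT.encode("gbk")     # bytes, comparable to a Python 2 str
decoded = raw.decode("gbk")  # text, comparable to a Python 2 unicode

print(len(raw))      # 4: two bytes per character in GBK
print(len(decoded))  # 2: two characters
```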
(3) Next let's take a look at the file processing:
Create a file named test.txt in ANSI format with the following content:
Abc Chinese
Read data using python
# Coding = gbk
Print open ("Test.txt"). read ()
Result: abc (Chinese)
The file format into UTF-8:
Result: abc Juan
Obviously, decoding is required here:
# Coding = gbk
Import codecs
Print open ("Test.txt"). read (). decode ("UTF-8 ")
Result: abc (Chinese)
I had edited test.txt with EditPlus. But when I edited it with the Notepad that comes with Windows and saved it in UTF-8 format, running the script produced an error:
Traceback (most recent call last):
File "ChineseTest.py", line 3, in <module>
print open("test.txt").read().decode("utf-8")
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence
It turns out that some software, such as Notepad, inserts three invisible bytes (0xEF 0xBB 0xBF, the BOM) at the beginning of the file when saving it as UTF-8.
Therefore, we need to strip these bytes when reading. The codecs module in Python defines a constant for them:
# Coding = gbk
Import codecs
Data = open ("Test.txt"). read ()
If data [: 3] = codecs. BOM_UTF8:
Data = data [3:]
Print data. decode ("UTF-8 ")
Result: abc (Chinese)
(4) A few remaining questions:
In part (2), we used the unicode function and the decode method to convert str to unicode. Why did both of them take "gbk" as the argument?
The first guess is that it is because our encoding declaration uses gbk (# coding=gbk). But is that really the reason?
Modify the source file:
# coding=UTF-8
s = "中文"
print unicode(s, "UTF-8")
Run it, and an error is raised:
Traceback (most recent call last):
File "ChineseTest.py", line 3, in <module>
s = unicode(s, "UTF-8")
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid data
If the declaration were what mattered, then just as gbk on both sides worked earlier, keeping UTF-8 on both sides should work too and raise no error. Evidently it does not.
As a further test, suppose the conversion still uses gbk:
# coding=UTF-8
s = "中文"
print unicode(s, "gbk")
Result: 中文
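Both outcomes can be reproduced from the bytes alone, without any coding declaration. This is a Python 3 sketch that assumes, as in the tests above, that the file was saved as ANSI, so the bytes of the string are GBK:

```python
TEXT = "\u4e2d\u6587"  # "中文"
file_bytes = TEXT.encode("gbk")  # the bytes an ANSI (GBK) save puts in the file

# Decoding GBK bytes as UTF-8 fails, whatever the source file declares.
try:
    file_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(utf8_ok)                           # False
print(file_bytes.decode("gbk") == TEXT)  # True
```

What decides the right codec is how the bytes were actually written, which is why "gbk" keeps working after the declaration is changed to UTF-8.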
I read an English document that explains how print works in Python:
When Python executes a print statement, it simply passes the output to the operating system (using fwrite() or something like it), and some other program is responsible for actually displaying that output on the screen. For example, on Windows, it might be the Windows console subsystem that displays the result. Or if you're using Windows and running Python on a Unix box somewhere else, your Windows SSH client is actually responsible for displaying the data. If you are running Python in an xterm on Unix, then xterm and your X server handle the display.
To print data reliably, you must know the encoding that this display program expects.
To put it simply, print in Python hands the string straight to the operating system, so the encodings have to match what the system uses. Chinese Windows uses CP936 (almost the same as gbk) both for ANSI files and for the console. The bytes in s are therefore really CP936, regardless of the coding declaration, and gbk is the right codec to decode them.
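The point about the display program's encoding can be illustrated with a small helper. This is a Python 3 sketch and not how print itself is implemented. The function name encode_for_console and the UTF-8 fallback are assumptions made for the example:

```python
import sys

def encode_for_console(text, encoding=None):
    """Encode text for a display program, replacing unsupported characters."""
    # Assumption: fall back to UTF-8 when the stream reports no encoding.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, errors="replace")

# U+FEFF (the BOM) has no GBK representation, so strict encoding fails,
# which matches the UnicodeEncodeError shown in part (3).
try:
    "\ufeffabc".encode("gbk")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(strict_ok)  # False
# With errors="replace" the unsupported character is substituted instead.
safe_bytes = encode_for_console("\ufeffabc", "gbk")
print(safe_bytes)
```

Passing the encoding explicitly, as in the last call, avoids depending on whatever console the script happens to run in.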
One last test:
# coding=UTF-8
s = "中文"
print unicode(s, "cp936")
Result: 中文