Solving Python's Chinese Garbled-Text Problem


http://blog.chinaunix.net/u/3204/showart_389639.html

http://www.woodpecker.org.cn/diveintopython/xml_processing/unicode.html

Python's handling of Chinese text has always been a headache for beginners. This article explains the issue in detail. It is almost certain that future versions of Python will solve the problem completely, so in the long run there is nothing to worry about.

Let's first take a look at the Python version used here:
>>> import sys
>>> sys.version
'2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)]'
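As a side note, Python 2's default codec normally reports 'ascii', which is why conversions that omit an explicit encoding fail on Chinese bytes, as we will see below. A quick check in the interpreter:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'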

(1)
Create a file chinesetest.py with Notepad; by default it is saved as ANSI:
s = "中文"
print s

Test it:
E:\project\python\test>python chinesetest.py
  File "chinesetest.py", line 1
SyntaxError: Non-ASCII character '\xd6' in file chinesetest.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

Quietly change the file encoding to UTF-8:
E:\project\python\test>python chinesetest.py
  File "chinesetest.py", line 1
SyntaxError: Non-ASCII character '\xe4' in file chinesetest.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

Still no help...
Since the message gives a URL, let's read it. After a quick look we learn that if a file contains non-ASCII characters, an encoding declaration must appear on the first or second line. So change chinesetest.py back to ANSI and add the declaration:
# coding=gbk
s = "中文"
print s

Try again:
E:\project\python\test>python chinesetest.py
中文

Normal :)
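As an aside, PEP 263 also accepts the Emacs-style form of the declaration, which many editors recognize; either spelling should behave the same way here:
# -*- coding: gbk -*-
s = "中文"
print s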
(2)
Now take a look at its length:
# coding=gbk
s = "中文"
print len(s)
Result: 4
s is of type 'str'. Under GBK each Chinese character takes two bytes, so the length is 4.
Now write it like this:
# coding=gbk
s = "中文"
s1 = u"中文"
s2 = unicode(s, "gbk")  # if the encoding argument is omitted, Python's default codec (ascii) is used for decoding
s3 = s.decode("gbk")    # decode converts str to unicode; the unicode function does the same thing
print len(s1)
print len(s2)
print len(s3)
Result:
2
2
2
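To make the contrast explicit, here is a minimal sketch along the same lines (assuming, as above, that the file is saved as ANSI/GBK): len on a str counts bytes, while len on a unicode counts characters.
# coding=gbk
s = "中文"              # str: raw GBK bytes, two bytes per Chinese character
u = s.decode("gbk")     # unicode: real characters
print type(s), len(s)   # <type 'str'> 4
print type(u), len(u)   # <type 'unicode'> 2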
(3)
Next, let's look at reading files:
Create a file named test.txt, saved as ANSI, with the following content:
abc中文
Read it with Python:
# coding=gbk
print open("test.txt").read()
Result: abc中文
Change the file encoding to UTF-8:
Result: abc涓... (mojibake: the console misreads the UTF-8 bytes as GBK)
Obviously, decoding is needed here:
# coding=gbk
import codecs
print open("test.txt").read().decode("utf-8")
Result: abc中文
I had been editing test.txt with EditPlus, but when I edited it with the Notepad that ships with Windows and saved it in UTF-8 format, running the script raised an error:
Traceback (most recent call last):
  File "chinesetest.py", line 3, in <module>
    print open("test.txt").read().decode("utf-8")
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence

It turns out that some software, such as Notepad, inserts three invisible bytes (0xEF 0xBB 0xBF, the BOM) at the beginning of a file when saving it as UTF-8.
So these bytes need to be stripped when reading. The codecs module in Python defines a constant for them:
# coding=gbk
import codecs
data = open("test.txt").read()
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]
print data.decode("utf-8")
Result: abc中文
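An alternative worth knowing (assuming Python 2.5 or later, which ships the 'utf-8-sig' codec): codecs.open can strip the BOM automatically while decoding, so the manual check above is not needed.
# coding=gbk
import codecs
# 'utf-8-sig' consumes a leading BOM if one is present and decodes the rest as UTF-8
print codecs.open("test.txt", "r", "utf-8-sig").read()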

(4) A few remaining questions
In part (2), we used the unicode function and the decode method to convert str to unicode. Why do both of them take "gbk" as the argument?
The first thought is that our encoding declaration uses gbk (# coding=gbk), but is that really the reason?
Modify the source file:
# coding=utf-8
s = "中文"
print unicode(s, "utf-8")
Running it raises an error:
Traceback (most recent call last):
  File "chinesetest.py", line 3, in <module>
    s = unicode(s, "utf-8")
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid data
Strange: if using gbk on both sides works, then keeping both sides consistently utf-8 should also work and not raise an error.
Going one step further, what if the conversion here still uses gbk?
# coding=utf-8
s = "中文"
print unicode(s, "gbk")
Result: 中文
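A small diagnostic sketch that helps in situations like this: repr() shows the raw bytes a str literal actually contains, which in turn tells you which codec will decode it successfully (GBK bytes and UTF-8 bytes for the same Chinese text look quite different).
# coding=utf-8
s = "中文"
print repr(s)   # e.g. '\xd6\xd0\xce\xc4' means the file was really saved as ANSI/GBK,
                # while '\xe4\xb8\xad\xe6\x96\x87' means it was really saved as UTF-8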
I found an English document that explains how print works in Python:

When Python executes a print statement, it simply passes the output to the operating system (using fwrite() or something like it), and some other program is responsible for actually displaying that output on the screen. For example, on Windows, it might be the Windows console subsystem that displays the result. Or if you're using Windows and running Python on a Unix box somewhere else, your Windows SSH client is actually responsible for displaying the data. If you are running Python in an xterm on Unix, then xterm and your X server handle the display.

To print data reliably, you must know the encoding that this display program expects.

To put it simply, print in Python passes the string straight to the operating system, so the str needs to be decoded using the same encoding the operating system uses. Windows uses cp936 (almost identical to GBK), which is why gbk works here.
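Building on that explanation, a minimal sketch (assuming a Windows console; the cp936 fallback is only a guess for cases where the encoding cannot be detected): sys.stdout.encoding reports what the display program expects, so gbk does not have to be hardcoded.
# coding=gbk
import sys
s = "中文"
u = s.decode("gbk")
console = sys.stdout.encoding or "cp936"   # fallback is an assumption, not guaranteed
print u.encode(console, "replace")         # 'replace' substitutes characters the console cannot show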
One last test:
# coding=utf-8
s = "中文"
print unicode(s, "cp936")
Result: 中文
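Another option instead of hardcoding "cp936": ask the platform for its preferred encoding. On a Simplified Chinese Windows install this is usually cp936, though it is not guaranteed to match every console.
import locale
print locale.getpreferredencoding()   # typically 'cp936' on Simplified Chinese Windows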

Chinese Processing in Python, and More
Posted by Linghu
Some time ago a friend who was writing his first Python program complained to me about Python's handling of Chinese. To be honest, I have never spent much energy on encodings, but it made me wonder: is Python's encoding handling really that weak? So I did some experiments, and here are a few simple observations.

Before Unicode appeared, handling Chinese was troublesome because Chinese characters had no independent encoding of their own: in the GB encodings (GB2312, GBK, and GB18030) each character is represented by two or more single-byte codes, which leads to problems such as telling Chinese bytes apart from ASCII and counting characters correctly. With Unicode, Chinese, English, and many other scripts all have their own independent code points, so those problems disappear. Therefore, wherever the environment allows it, using Unicode to process Chinese is the best choice.

Generally, the main workflow for processing Chinese can be summarized as follows (a small sketch follows below):

Convert the original string to Unicode using its corresponding encoding
Do the intermediate processing in Unicode
Convert the Unicode result to the required output encoding
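A minimal sketch of that three-step round trip (the choice of GBK input and UTF-8 output is just for illustration):
# coding=gbk
raw = "中文处理"                    # 1. original str in a local encoding (GBK here)
u = raw.decode("gbk")               #    decode it into unicode
first_two = u[:2]                   # 2. intermediate work on unicode: slicing counts characters, not bytes
out = first_two.encode("utf-8")     # 3. encode the result into the required output encoding
print repr(out)                     # '\xe4\xb8\xad\xe6\x96\x87'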
In Python, the Unicode-related elements include the u prefix, the unicode function, and codecs. What are the connections and differences among them?

The u prefix indicates that the string constant that follows is a unicode string. However, it only marks the string as unicode; it does not change the encoding of the string's contents. For example:

s = u"测试"
The value of s is:

u'\xb2\xe2\xca\xd4'
while

s = "测试"
gives the value:

'\xb2\xe2\xca\xd4'
As you can see, apart from the leading u, the internal encoding of the characters is exactly the same. So the u prefix alone is not a good way to obtain a unicode Chinese string. What is the u prefix good for, then? I guess it may be useful for non-English characters, but that is just a guess and I have not confirmed it.

The unicode function is different from the u prefix: it actually performs an encoding conversion, and codecs are used in that process.

S = Unicode ("test", "gb2312 ")
The value of S is:

U' \ u6d4b \ u8bd5'
We can see that S is a real double-byte unicode encoded string.
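A quick check (reusing the GB2312 bytes shown above) confirms it:
>>> s = unicode("\xb2\xe2\xca\xd4", "gb2312")
>>> type(s), len(s)
(<type 'unicode'>, 2)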

Combining this with the Chinese-processing workflow above, we get a general process for handling Chinese in Python:

Use the unicode function to convert strings to Unicode
Use Unicode strings throughout the program
Use the encode method to convert the Unicode results to the required encoding
The following is a simple demonstration that uses the re library to search a Chinese string and print the result:

>>> import re
>>> p = re.compile(unicode("测试(.*)", "gb2312"))
>>> s = unicode("测试一二三", "gb2312")
>>> for i in p.findall(s):
...     print i.encode("gb2312")
一二三
There are several points to note:

The so-called "correct" encoding means that the specified encoding must be consistent with the encoding of the string itself. In fact, this is not so easy to judge, in general, we directly enter the simplified Chinese characters, there are two possible encoding: gb2312 (GBK, gb18030), and UTF-8.
When converting with encode, you must make sure that the target encoding has a code for every character being converted. encode generally works through a mapping table between Unicode and the local encoding; each local encoding maps only to a part of Unicode, and the parts covered differ. For example, the range of Unicode covered by Big5 differs from the range covered by GBK (although the two partially overlap). So some Unicode characters (for example, ones obtained from GB2312 text) can be mapped to GBK but may have no mapping in Big5; converting them to Big5 is then very likely to fail because no code can be found. UTF-8, however, covers the same range as Unicode itself (only the encoded form differs), so in theory a character from any local encoding can always be converted to UTF-8.
PS1: GB2312, GBK, and GB18030 are essentially the same family of encoding standards; each later one simply extends the character set of the previous one.
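To illustrate note 2 with a small sketch (assuming the standard big5 codec, and assuming that the simplified-only character U+6C49 has no Big5 mapping, which is my understanding rather than something stated above):
u = u"\u6c49"                    # a simplified-only character: present in GBK, absent from Big5
print repr(u.encode("gbk"))      # works: the character exists in GBK
print repr(u.encode("utf-8"))    # works: UTF-8 can represent any unicode character
try:
    u.encode("big5")             # no mapping in Big5, so this raises
except UnicodeEncodeError, e:
    print e
print repr(u.encode("big5", "replace"))   # a '?' is substituted instead of raising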
I tried this: my .py file is encoded as UTF-8, but neither # coding=utf-8 nor # coding=gbk worked for me; only # coding=cp936 did. My operating system is XP, with Python 2.6.
Python code:

# coding=cp936
s = "中文"
print unicode(s, "cp936")

Source: http://blog.sina.com.cn/s/blog_5d687bea0100bm88.html