Solving Chinese Character Problems in Python

Last Update: 2017-01-13 Source: Internet

Author: User

I wrote a test program in Python using the Eclipse environment, and running it produced this error:

SyntaxError: Non-ASCII character '\xbd' in file e:workspacemakeupdatafilesindexsrcmakeindex.py on line, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

The cause is that Python does not recognize the Chinese encoding of the source file. The fix is to add the following as the first line:

# -*- coding: gbk -*-
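The effect of the declaration can be checked without an editor. The following is a minimal sketch written for Python 3, where a source file without a declaration is assumed to be UTF-8 rather than ASCII, so the error text differs from the Python 2 message above. The helper name `try_compile` is hypothetical and only serves the illustration.

```python
# Bytes of a one-line source file saved as GBK; b"\xd6\xd0" is the character 中.
body = b's = "\xd6\xd0"\n'
declaration = b"# -*- coding: gbk -*-\n"


def try_compile(source_bytes):
    """Return the namespace produced by running the source, or None if it is rejected."""
    try:
        code = compile(source_bytes, "<demo>", "exec")
    except (SyntaxError, UnicodeDecodeError):
        # Undecodable source bytes are reported as a SyntaxError;
        # UnicodeDecodeError is caught as well purely as a precaution.
        return None
    namespace = {}
    exec(code, namespace)
    return namespace


without_declaration = try_compile(body)
with_declaration = try_compile(declaration + body)

print(without_declaration is None)   # the undeclared GBK bytes are rejected
print(with_declaration["s"])         # the declared source yields the character
```

The same bytes are rejected or accepted depending only on whether the first line tells the parser how to decode them.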

Before going further, let's first understand that there are two kinds of strings in Python: generic strings (each character is represented by 8 bits) and Unicode strings (each character is represented by one or more bytes). The two can be converted to each other. On the subject of Unicode, Joel Spolsky gives a vivid description in "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", and Jason Orendorff gives a more comprehensive one in "Unicode for Programmers", so I will not say more here. Look at the following code:

s = u"中文你好"
print s

Running the above code makes Python report the following error:

SyntaxError: Non-ASCII character '\xd6' in file g:workspacechinese_problemsrctest.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
The message says that a non-ASCII character was encountered and refers us to PEP 263. PEP 263 (a Python Enhancement Proposal) makes it clear that Python was aware of the internationalization problem and proposes a solution. Following the proposal, we have the following code:
# -*- coding: gb2312 -*-   # must be on the first or second line
print "-------------code 1----------------"
a = "中文a我爱你"
print a
print a.find("我")
b = a.replace("爱", "喜欢")
print b
print "--------------code 2----------------"
x = "中文a我爱你"
y = unicode(x, "gb2312")
print y.encode("gb2312")
print y.find(u"我")
z = y.replace(u"爱", u"喜欢")
print z.encode("gb2312")
print "---------------code 3----------------"
print y

The results of running the program are as follows:

-------------code 1----------------
中文a我爱你
5
中文a我喜欢你
--------------code 2----------------
中文a我爱你
3
中文a我喜欢你
---------------code 3----------------
Traceback (most recent call last):
  File "g:downloadseclipseworkspacepsrchello.py", line, in <module>
    print y
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

We can see that once the encoding declaration is added, Chinese can be used normally, and in code 1 and code 2 the console prints the Chinese correctly. However, the code above also reveals several problems:
1. Code 1 and code 2 use print differently: code 1 prints directly, while code 2 encodes before printing.
2. Code 1 and code 2 search for the same character "我" ("I") in the same string, yet the results differ (5 and 3, respectively).
3. Code 3 raises an error when it prints the Unicode string y directly (which is why code 2 encodes first).

Why is this? We can walk through the process of using Python in our minds. First, we write the source code in an editor and save it as a file. If the source code has an encoding declaration and the editor understands that syntax, the file is saved to disk in the declared encoding. Note that the encoding declaration and the actual encoding of the source file are not necessarily the same: you can perfectly well declare UTF-8 in the encoding declaration but save the source file as GB2312. Of course, nobody would go looking for trouble by doing this deliberately, and a good IDE can enforce the consistency of the two, but if we write code in Notepad or EditPlus, the problem can creep in unnoticed.
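A mismatch between the declaration and the bytes on disk can be detected before the file is ever run. The sketch below is written for Python 3 and uses the standard `tokenize.detect_encoding` function to read the declared encoding; the helper name `declaration_matches` and the sample sources are hypothetical.

```python
import io
import tokenize


def declaration_matches(source_bytes):
    """Return True if the bytes decode cleanly with the encoding they declare."""
    declared, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    try:
        source_bytes.decode(declared)
    except UnicodeDecodeError:
        return False
    return True


template = '# -*- coding: {0} -*-\ns = "中文"\n'

# Declared gb2312 and saved as gb2312: consistent.
consistent = template.format("gb2312").encode("gb2312")
# Declared utf-8 but saved as gb2312: the case described above.
mismatched = template.format("utf-8").encode("gb2312")

print(declaration_matches(consistent))
print(declaration_matches(mismatched))
```

A clean decode is only a necessary condition. Some wrong encodings decode without error and silently produce the wrong characters, so this check catches the UTF-8 versus GB2312 mix-up but is not a general proof of consistency.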
Once we have a .py file we can run it, which means handing the code to the Python parser. When the parser reads the file, it first parses the encoding declaration in the file. Assuming the file is encoded as gb2312, the parser converts the contents of the file from gb2312 to Unicode and then converts the Unicode to a UTF-8 byte string. After this step, the parser works on these UTF-8 bytes. When it encounters a Unicode string literal, it creates a Unicode string from the corresponding UTF-8 bytes. When the program uses a generic string, the parser converts the UTF-8 bytes back, through Unicode, into bytes in the declared encoding (here gb2312) and uses them to create a generic string object. In other words, a Unicode string and a generic string hold different data in memory: the former holds the Unicode characters decoded from the UTF-8 data, while the latter holds GB2312 bytes.
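The conversion chain described above can be imitated with explicit encode and decode calls. This is only a sketch in Python 3 of the data flow, not the parser's actual implementation; the function name `parse_literals` is hypothetical, and it treats its whole input as the content of one string literal.

```python
def parse_literals(source_bytes, declared_encoding):
    """Imitate how a literal's bytes travel from the file to string objects."""
    # Step 1: decode the file bytes to Unicode using the declared encoding.
    as_unicode = source_bytes.decode(declared_encoding)
    # Step 2: re-encode as UTF-8, the form the parser works on.
    as_utf8 = as_unicode.encode("utf-8")
    # Step 3a: a Unicode literal is built from the UTF-8 bytes.
    unicode_literal = as_utf8.decode("utf-8")
    # Step 3b: a generic literal goes back through Unicode to the declared encoding.
    generic_literal = as_utf8.decode("utf-8").encode(declared_encoding)
    return unicode_literal, generic_literal


file_bytes = "中文a我爱你".encode("gb2312")
unicode_literal, generic_literal = parse_literals(file_bytes, "gb2312")

print(len(unicode_literal))   # counted in characters
print(len(generic_literal))   # counted in bytes
```

The generic literal ends up byte-for-byte identical to what was saved on disk, while the Unicode literal is independent of the file's encoding.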
Now that we know how strings are stored in memory, let's look at how print works. Print is only responsible for handing the corresponding byte string in memory to the operating system, so that the relevant program (such as the cmd window) can display it. There are two cases:
1. If the string is a generic string, print simply hands the corresponding byte string in memory to the operating system, as in code 1 of the example.
2. If the string is a Unicode string, it must be encoded before it is handed over. We can encode it explicitly by calling the Unicode string's encode method with a suitable encoding (code 2 in the example). Otherwise, when the output stream does not specify an encoding of its own, Python encodes with the default encoding, which is ASCII (code 3 in the example). ASCII cannot encode Chinese, so Python raises an error.
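The two outcomes in case 2 can be reproduced by calling `encode` directly. This is a minimal sketch in Python 3, where `str` plays the role of the article's Unicode string; the variable names are illustrative.

```python
text = "中文a我爱你"

# Explicit encoding with a codec that covers Chinese, as code 2 does.
encoded = text.encode("gb2312")

# Encoding with ASCII, which is what happens implicitly in code 3.
try:
    text.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

print(encoded.decode("gb2312") == text)
print(ascii_error.start, ascii_error.end)   # range of the first characters ASCII cannot encode
```

The error range covers the first two characters, which corresponds to the "position 0-1" reported in the traceback above.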
At this point, we can answer the first and third of the three questions above. As for the second question, Python has two kinds of strings, generic strings and Unicode strings, and each has its own character-handling methods. For the former, the methods work on bytes, and in GB2312 each Chinese character occupies two bytes, so the result is 5. For the latter, every character counts as one unit regardless of its language, so the result is 3.
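The difference between the two indexes can be reproduced in Python 3 by searching the same text once as bytes and once as characters. This is a sketch in which `bytes` stands in for the article's generic string and `str` for its Unicode string.

```python
text = "中文a我爱你"              # character view (Unicode string)
raw = text.encode("gb2312")        # byte view (the article's generic string)

# Searching the bytes counts two bytes for each Chinese character.
byte_index = raw.find("我".encode("gb2312"))
# Searching the text counts each character once.
char_index = text.find("我")

# Replacement works in both views as long as both sides use the same encoding.
replaced_bytes = raw.replace("爱".encode("gb2312"), "喜欢".encode("gb2312"))
replaced_text = text.replace("爱", "喜欢")

print(byte_index, char_index)
print(replaced_bytes.decode("gb2312") == replaced_text)
```

Working on characters keeps indexes independent of the encoding, which is one practical reason to prefer Unicode strings for text processing.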
Although this article discusses only the Chinese problem in console programs, the Chinese problems in file reading and writing and in network transmission are similar in principle. The advent of Unicode largely addresses the internationalization of software, and Python provides very good support for Unicode, so I recommend using Unicode strings when writing Python programs and saving files in UTF-8 encoding. There is also a detailed description of how to use UTF-8 with Python.
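For file reading and writing, the same principle means stating the encoding at the boundary and keeping Unicode text inside the program. The following is a minimal sketch in Python 3 using `io.open` with an explicit `encoding` argument; the temporary file name is an assumption made for the example.

```python
import io
import os
import tempfile

text = "中文a我爱你"
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Encode on the way out: the file on disk holds UTF-8 bytes.
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(text)

# Decode on the way in: the program gets Unicode text back.
with io.open(path, "r", encoding="utf-8") as handle:
    round_trip = handle.read()

# Reading in binary mode shows the bytes that were actually stored.
with io.open(path, "rb") as handle:
    raw = handle.read()

print(round_trip == text)
print(len(raw))   # size of the file in bytes
```

Stating the encoding explicitly avoids depending on the platform default, which differs between systems.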

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
