Character encoding is a frequent headache for Chinese programmers, and Python is no exception. So how should we understand and solve Python's encoding problems?
Python uses Unicode internally, while the outside world presents a variety of encodings, such as the GBK, GB2312, and UTF-8 encodings that Chinese programs often have to handle. How are these encodings converted into Python's internal Unicode?
First, let's look at how strings are used in source code files. A source file is a text file, so it must be stored in some encoding. By default, Python 2 assumes the source file is ASCII-encoded. For example, suppose the code contains a variable assignment:
s1 = 'a'
print s1
Python treats this 'a' as an ASCII-encoded character. Everything works when only English characters are used, but not if you use Chinese, such as:
s1 = '哈'
print s1
Running this file produces an error, and the cause is the encoding. Python treats the contents of the file as ASCII by default, but ASCII contains no Chinese characters, so an exception is raised.
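The underlying reason can be sketched outside of a source file. The snippet below is only an illustration, not what the interpreter literally does when it reads a file: it takes the UTF-8 bytes of the Chinese character (written as escapes so the snippet itself stays pure ASCII and runs on both Python 2 and Python 3) and checks whether they can be read as ASCII. The helper name `decodes_as_ascii` is made up for this example.

```python
# UTF-8 bytes of the Chinese character U+54C8, written with escapes so that
# this snippet contains only ASCII characters.
raw = b'\xe5\x93\x88'

def decodes_as_ascii(data):
    """Return True if the given bytes can be decoded as ASCII."""
    try:
        data.decode('ascii')
        return True
    except UnicodeDecodeError:
        return False

print(decodes_as_ascii(b'a'))   # True
print(decodes_as_ascii(raw))    # False
```

Every byte of the Chinese character is above 127, which is outside the ASCII range, so the ASCII decoder rejects it.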
The solution is to tell Python which encoding the file uses. For Chinese, the common choices are UTF-8, GBK, and GB2312. Just add the following line at the top of the source file:
# -*- coding: utf-8 -*-
This tells Python that the text in this file is encoded in UTF-8, so that Python interprets the characters as UTF-8 and converts them to Unicode for internal processing.
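As a rough sketch of how such a declaration can be recognised, the following uses a regular expression based on the pattern described in PEP 263. The function `declared_encoding` is a hypothetical helper written for this example, not part of Python, and it simplifies the real rules (it only looks for the comment in the first two lines).

```python
import re

# Pattern based on the one described in PEP 263 for an encoding declaration.
CODING_RE = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')

def declared_encoding(first_lines):
    """Return the encoding named in the first two lines, or None."""
    for line in first_lines[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None

print(declared_encoding(['# -*- coding: utf-8 -*-', "s1 = 'a'"]))  # utf-8
print(declared_encoding(["s1 = 'a'", 'print(s1)']))                 # None
```

The decorative `-*-` markers are optional; what matters is a comment containing `coding:` or `coding=` followed by the encoding name.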
However, if you run this code in the Windows console, the program executes, but what appears on the screen is not the character '哈'. This is caused by a mismatch between the encoding of the Python code and the encoding of the console. The console on Chinese-language Windows uses GBK, while the code uses UTF-8, so sending UTF-8 bytes to a GBK console naturally does not display the correct Chinese character.
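The mismatch can be illustrated with a small sketch. It assumes the standard `gbk` codec is available in your Python build and uses escapes for the Chinese character so that it runs unchanged on Python 2 and Python 3.

```python
text = u'\u54c8'                    # the Chinese character as a Unicode object
utf8_bytes = text.encode('utf-8')   # what a UTF-8 source file stores
gbk_bytes = text.encode('gbk')      # what a GBK console expects to receive

# The same character is represented by different bytes in each encoding.
print(repr(utf8_bytes))
print(repr(gbk_bytes))
print(utf8_bytes == gbk_bytes)      # False

# Reading the UTF-8 bytes as if they were GBK does not recover the character.
misread = utf8_bytes.decode('gbk', 'replace')
print(misread == text)              # False
```

This is essentially what the console does with the raw bytes: it interprets them with its own encoding, and the result is a different, wrong piece of text.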
One solution is to switch the source file to GBK, that is, save the file in GBK and change the first line of code to:
# -*- coding: gbk -*-
Another way is to keep the source file in UTF-8 but add a u prefix in front of '哈', that is:
s1 = u'哈'
print s1
This prints the character '哈' correctly.
The u here means that the string that follows is stored as Unicode. Python reads the character '哈' in the code according to the UTF-8 encoding declared on the first line and then converts it into a Unicode object. If we check the data types, type('哈') gives <type 'str'>, while type(u'哈') gives <type 'unicode'>. In other words, the u prefix marks the literal as a Unicode object, which is held in memory in Unicode form. Without the u, it is just a byte string in some particular encoding, and which encoding that is depends on how Python identified the encoding of the source file, here UTF-8.
When a Unicode object is printed, Python automatically converts it to the encoding of the output environment. If the output is an ordinary string rather than a Unicode object, its bytes are written out as they are, in the string's own encoding, which leads to the problem described above.
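The difference between the two kinds of string can be seen by comparing their lengths. This is a minimal sketch using escaped literals so that it runs on both Python 2 and Python 3; in Python 2 the `b'...'` literal is an ordinary str.

```python
byte_string = b'\xe5\x93\x88'   # the character as UTF-8 bytes (plain str in Python 2)
text = u'\u54c8'                # the same character as a Unicode object

print(len(byte_string))         # 3: three encoded bytes
print(len(text))                # 1: one character
```

The ordinary string only knows about bytes, while the Unicode object knows it holds a single character, which is why only the latter can be converted reliably to whatever encoding the console uses.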
Besides the u prefix, Unicode objects can also be created with the unicode class, and converted with the encode and decode methods of strings.
The constructor of the unicode class takes a string argument and an encoding argument and wraps the string as a Unicode object. For example, because we are using UTF-8, we pass 'utf-8' as the encoding argument to wrap the character as a Unicode object, which is then output correctly to the console:
s1 = unicode('哈', 'utf-8')
print s1
In addition, an ordinary string can be converted to a Unicode object with the decode method. Many people do not understand what the decode and encode methods of a Python string mean, so here is a brief description.
decode parses an ordinary string according to the encoding given as its argument and produces the corresponding Unicode object. For example, our code uses UTF-8, so converting a string to Unicode looks like this:
s2 = '哈'.decode('utf-8')
At this point, s2 is a Unicode object holding the character '哈'. It is the same as unicode('哈', 'utf-8') and u'哈'.
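The equivalence of the three forms can be checked with a short sketch. To keep it runnable on Python 3 as well, it falls back to `str` when the name `unicode` does not exist; that fallback is an assumption made for portability and is not needed on Python 2.

```python
try:
    unicode            # exists on Python 2
except NameError:
    unicode = str      # portability assumption: str plays this role on Python 3

raw = b'\xe5\x93\x88'          # UTF-8 bytes of the character

a = unicode(raw, 'utf-8')      # constructor with an encoding argument
b = raw.decode('utf-8')        # decode method of the byte string
c = u'\u54c8'                  # u-prefixed literal

print(a == b == c)             # True
```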
encode does the opposite: it converts a Unicode object into an ordinary string in the encoding given as its argument, as in the following code:
s3 = unicode('哈', 'utf-8').encode('utf-8')
s3 is now back to the UTF-8 encoded '哈'.
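A round trip makes the relationship between decode and encode concrete. The sketch below again uses escaped bytes so it runs on both Python 2 and Python 3, and it assumes the standard `gbk` codec is available.

```python
raw_utf8 = b'\xe5\x93\x88'           # UTF-8 bytes of the character

text = raw_utf8.decode('utf-8')      # bytes -> Unicode object
back = text.encode('utf-8')          # Unicode object -> bytes again
print(back == raw_utf8)              # True

# The same Unicode object can be encoded into a different encoding.
as_gbk = text.encode('gbk')
print(len(as_gbk))                   # 2: GBK uses two bytes for this character
print(as_gbk.decode('gbk') == text)  # True
```

Going through a Unicode object like this is also how a string is converted from one encoding to another: decode with the source encoding, then encode with the target encoding.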