A simple way to solve Chinese encoding problems in Python files

Last Update:2016-06-10 Source: Internet

Author: User

Tags sublime text

Reading and writing Chinese

To read a UTF-8 encoded Chinese file, first use Sublime Text to save it as UTF-8 without a BOM, and then use the following code:

import codecs

with codecs.open(note_path, 'r+', 'utf-8') as f:
    line = f.readline()
    print(line)

This allows you to correctly read the Chinese characters in the file.
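If converting the file in an editor is inconvenient, the BOM can also be handled in code. The following is a minimal sketch, not part of the original article and written for Python 3. It relies on the standard 'utf-8-sig' codec, which skips a leading BOM when one is present, and it uses io.open in place of codecs.open. The temporary file, the sample text and the helper name read_first_line are assumptions made for illustration.

```python
import codecs
import io
import os
import tempfile


def read_first_line(path, encoding="utf-8-sig"):
    """Return the first line of a text file, decoded with the given codec."""
    with io.open(path, "r", encoding=encoding) as f:
        return f.readline()


# Build a throwaway file that starts with a UTF-8 BOM (illustration only).
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as raw:
    raw.write(codecs.BOM_UTF8 + u"中文笔记\n".encode("utf-8"))

with_bom = read_first_line(sample_path, "utf-8")         # BOM survives as u"\ufeff"
without_bom = read_first_line(sample_path, "utf-8-sig")  # BOM is skipped
os.remove(sample_path)

print(repr(with_bom))
print(repr(without_bom))
```

With the plain 'utf-8' codec the BOM stays at the start of the first line as an invisible character, which is why the article recommends a BOM-free file.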

Similarly, if you want to write Chinese to a file you have created, it is best to open it the same way:

with codecs.open(st, 'a+', 'utf-8') as book_note:
    book_note.write(st)

Creating a file with a Chinese name

Next, a file is created that uses the string just read as its filename.

If you create the file directly from the string read above, an error occurs:

st = digest_path + "\\" + onenote[0] + ".txt"
print(st)
with open(st, 'a+') as book_note:
    pass  # the open() call itself fails here

Debugging shows that the cause is the line break at the end of the string that was read. Strip the string when building the name, and the file can be created:

st = digest_path + "\\" + onenote[0].strip() + ".txt"
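The effect of the trailing line break can be checked in isolation. This is an illustrative sketch under assumptions, written for Python 3: digest_dir is a hypothetical stand-in for digest_path, the sample strings are invented, and os.path.join replaces the manual "\\" concatenation.

```python
import io
import os
import shutil
import tempfile

# A line as returned by readline(): the title plus its trailing line break.
raw_title = u"读书笔记\n"

digest_dir = tempfile.mkdtemp()  # hypothetical stand-in for digest_path

# strip() removes the line break (and any surrounding whitespace),
# leaving a string that is usable as a filename.
clean_title = raw_title.strip()
st = os.path.join(digest_dir, clean_title + u".txt")

with io.open(st, "a+", encoding="utf-8") as book_note:
    book_note.write(u"第一条笔记\n")

file_was_created = os.path.exists(st)
shutil.rmtree(digest_dir)  # clean up the temporary directory

print(file_was_created)
```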

Chinese encoding is a frequent headache for Chinese programmers, and Python is no exception. So how should we understand and solve Python's encoding problems?

The key point is that Python uses Unicode internally, while the outside world uses many different encodings, such as the GBK, GB2312 and UTF-8 that Chinese programs often have to deal with. How are these encodings converted into the internal Unicode?

First, let's look at how strings are used in source files. A source file is a text file, so its contents are necessarily stored in some encoding. By default, Python 2 assumes that a source file is ASCII-encoded. For example, suppose the code contains a variable assignment:

s1 = 'a'
print(s1)

Python treats this 'a' as an ASCII-encoded character. Everything works as long as only English characters are used, but not if Chinese is used, for example:

s1 = '哈'
print(s1)

Running this file raises an error, which is an encoding problem. Python assumes by default that the contents of the source file are ASCII, but ASCII contains no Chinese characters, so an exception is raised.

The solution is to tell Python which encoding the file uses. For Chinese, the common choices are UTF-8, GBK and GB2312. Just add the following line at the very top of the source file:

# -*- coding: utf-8 -*-

This tells Python that the text in this file is encoded in UTF-8, so Python interprets the characters as UTF-8 and converts them to Unicode for internal processing.
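The interpreter recognizes this declaration as a comment on the first or second line of the file. As a rough illustration (an addition, not from the original article), Python 3's standard tokenize.detect_encoding reports which encoding a source file declares. Note that Python 3 falls back to UTF-8 rather than ASCII when no declaration is present. The helper name declared_encoding and the sample sources are hypothetical.

```python
import io
import tokenize

# Two tiny source files held in memory as raw bytes (ASCII-only content).
utf8_source = b"# -*- coding: utf-8 -*-\ns1 = 'a'\n"
gbk_source = b"# -*- coding: gbk -*-\ns1 = 'a'\n"


def declared_encoding(source_bytes):
    """Return the encoding Python 3 would use to decode this source."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _first_lines = tokenize.detect_encoding(readline)
    return encoding


print(declared_encoding(utf8_source))
print(declared_encoding(gbk_source))
```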

However, if you run this code in the Windows console, the program executes, but what appears on the screen is not the expected character. This is caused by a mismatch between the encoding of the Python source and the encoding of the console. The console on Chinese-language Windows uses GBK, while the code uses UTF-8. Python sends the UTF-8 encoded bytes to the GBK console unchanged, so the Chinese character cannot be displayed correctly.
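The mismatch can be reproduced without a Windows console. The sketch below is an illustration in Python 3, not the article's own code. It assumes the console simply interprets the bytes it receives as GBK, and the sample text is an arbitrary choice.

```python
# -*- coding: utf-8 -*-
text = u"哈哈"

utf8_bytes = text.encode("utf-8")  # what a UTF-8 source file stores
gbk_bytes = text.encode("gbk")     # what a GBK console expects to receive

# Interpreting the UTF-8 bytes as GBK, as the console would, gives other characters.
garbled = utf8_bytes.decode("gbk", errors="replace")

print(len(utf8_bytes), len(gbk_bytes))
print(garbled == text)
```

The two encodings do not even agree on the number of bytes per character, so the console has no way to recover the original text from UTF-8 input.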

One solution is to switch the source file to GBK, that is, to change the first line of the code to:

# -*- coding: gbk -*-

Another way is to keep the source file in UTF-8 but add a u in front of '哈':

s1 = u'哈'
print(s1)

This prints the character '哈' correctly.

The u here means that the string that follows is stored as Unicode. Python decodes the character '哈' in the source according to the UTF-8 encoding declared on the first line and then converts it to a Unicode object. If we check the data types, type('哈') gives <type 'str'>, while type(u'哈') gives <type 'unicode'>. In other words, prefixing the literal with u marks it as a Unicode object, which is held in memory in Unicode form. Without the u, it is just a string of bytes in some encoding, and that encoding depends on the encoding Python identified for the source file, here UTF-8.
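The same distinction can be inspected directly. This sketch is an addition that assumes Python 3, where the two types have different names: Python 2's unicode corresponds to Python 3's str, and Python 2's str corresponds to Python 3's bytes.

```python
text = u"哈"                    # Unicode text (Python 2: unicode, Python 3: str)
encoded = text.encode("utf-8")  # encoded bytes (Python 2: str, Python 3: bytes)

# One character of text, but several bytes once encoded as UTF-8.
print(type(text).__name__, len(text))
print(type(encoded).__name__, len(encoded))
```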

When a Unicode object is printed, Python automatically converts it to the encoding of the output environment. If the output is not a Unicode object but an ordinary string, its bytes are printed as they are, in the string's own encoding, which causes the problem described above.

Besides the u prefix, you can create and convert Unicode objects with the unicode class and with the encode and decode methods of strings.

The constructor of the unicode class takes a string argument and an encoding argument and wraps the string as a Unicode object. For example, since we use UTF-8 here, passing 'utf-8' as the encoding argument wraps the character as a Unicode object, which is then output correctly to the console:

s1 = unicode('哈', 'utf-8')
print(s1)

An ordinary string can also be converted to a Unicode object with the decode method. Many people are unsure what the decode and encode methods of a Python string do, so here is a brief description.

decode parses an ordinary string according to the encoding given as its argument and produces the corresponding Unicode object. For example, our code uses UTF-8, so converting a string to Unicode looks like this:

s2 = '哈'.decode('utf-8')

At this point s2 is a Unicode object holding the character '哈'. It is the same as unicode('哈', 'utf-8') and u'哈'.

encode does the opposite: it converts a Unicode object into an ordinary string in the encoding given as its argument, as in the following code:

s3 = unicode('哈', 'utf-8').encode('utf-8')

s3 is now back to the UTF-8 encoded '哈'.
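Putting decode and encode together gives a general way to move text from one encoding to another, with Unicode as the intermediate form. The sketch below is an illustration written for Python 3, where the encoded form is a bytes object. The helper name transcode is hypothetical.

```python
def transcode(data, source_encoding, target_encoding):
    """Decode bytes to Unicode, then encode that text in the target encoding."""
    return data.decode(source_encoding).encode(target_encoding)


utf8_bytes = u"哈".encode("utf-8")

# UTF-8 bytes -> Unicode -> GBK bytes, and back again.
gbk_bytes = transcode(utf8_bytes, "utf-8", "gbk")
round_trip = transcode(gbk_bytes, "gbk", "utf-8")

print(gbk_bytes != utf8_bytes)
print(round_trip == utf8_bytes)
```

The byte sequences differ between the two encodings, but both decode to the same Unicode character, which is why Unicode works as the common internal form.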


