Python Chinese encoding problems

Source: Internet
Author: User

Character encoding is a frequent headache for Chinese programmers, and Python is no exception. So how should we understand and solve Python's encoding problems?

The key point is that Python uses Unicode internally, while the outside world uses many different encodings, such as the GBK, GB2312, and UTF-8 encodings that Chinese programs often deal with. How are these encodings converted to the internal Unicode representation?

First, let's look at how strings are used in source code files. A source file is a text file, so it must be stored in some encoding. By default, Python (Python 2, which the code in this article uses) assumes that the source file is ASCII-encoded. For example, suppose the code contains a variable assignment:

s1 = 'a'
print s1

Python treats this 'a' as an ASCII-encoded character. Everything works as long as only English characters are used, but suppose you use Chinese, such as:

s1 = '哈'
print s1

Running this file produces an error, and that error is an encoding problem. Python treats the contents of the source file as ASCII by default, but ASCII contains no Chinese characters, so an exception is raised.
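The failure can be reproduced on raw bytes. The following is a minimal sketch written in Python 3 syntax (an assumption; the article's own snippets are Python 2), where `bytes` plays the role of the Python 2 `str`. The helper name `is_ascii_bytes` is hypothetical.

```python
def is_ascii_bytes(raw):
    """Return True if the byte string can be decoded as pure ASCII."""
    try:
        raw.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False

english_source = b"s1 = 'a'"
# The same assignment with the character U+54C8, saved as UTF-8 bytes.
chinese_source = "s1 = '\u54c8'".encode("utf-8")

print(is_ascii_bytes(english_source))  # True
print(is_ascii_bytes(chinese_source))  # False
```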

The solution is to tell Python which encoding the file uses. For Chinese, common choices are UTF-8, GBK, and GB2312. Just add the following line at the top of the source file:

# -*- coding: utf-8 -*-

This tells Python that the text in this file is encoded in UTF-8, so Python interprets the characters as UTF-8 and converts them to Unicode for internal processing.
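To see which encoding a declaration line selects, the standard library function `tokenize.detect_encoding` can be used. This is a sketch in Python 3 (an assumption; the article targets Python 2), and the helper name `declared_encoding` is hypothetical. Note one difference: with no declaration at all, Python 3 falls back to UTF-8, whereas the Python 2 behavior described in this article falls back to ASCII.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the encoding Python would use to read these source bytes."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding

utf8_source = b"# -*- coding: utf-8 -*-\ns1 = 'a'\n"
gbk_source = b"# -*- coding: gbk -*-\ns1 = 'a'\n"

print(declared_encoding(utf8_source))  # utf-8
print(declared_encoding(gbk_source))   # gbk
```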

However, if you run this code in the Windows console, the program executes, but what appears on the screen is not the expected character. The cause is a mismatch between the encoding of the Python source and the encoding of the console. The Windows console on a Chinese-language system uses GBK, while the code uses UTF-8, so Python sends UTF-8 encoded bytes to a GBK console and the Chinese character cannot be displayed correctly.
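The mismatch can be simulated without a real console by decoding bytes with the "wrong" codec. This is a rough sketch in Python 3 syntax (an assumption); the helper name `as_seen_on_console` is hypothetical, and a real console may render invalid bytes differently from `errors="replace"`.

```python
def as_seen_on_console(raw, console_encoding):
    """Interpret raw bytes the way a console using console_encoding would."""
    return raw.decode(console_encoding, errors="replace")

text = "\u54c8\u54c8"  # two copies of the character U+54C8

# UTF-8 bytes interpreted as GBK come out as different characters.
garbled = as_seen_on_console(text.encode("utf-8"), "gbk")
# GBK bytes interpreted as GBK come out unchanged.
correct = as_seen_on_console(text.encode("gbk"), "gbk")

print(garbled == text)
print(correct == text)
```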

The first solution is to save the source file as GBK and change the first line of the code to:

# -*- coding: gbk -*-

Another way is to keep the source file in UTF-8 but add a u prefix before '哈', that is:

s1 = u'哈'
print s1

This prints the character '哈' correctly.

The u prefix means that the string that follows is stored as Unicode. Python reads the character '哈' in the source according to the UTF-8 encoding declared on the first line and then converts it to a unicode object. If we check the data type with type, type('哈') gives <type 'str'>, while type(u'哈') gives <type 'unicode'>. In other words, prefixing a literal with u makes it a unicode object that exists in memory in Unicode form. Without the u, it is just a string of bytes in some encoding, and that encoding depends on the encoding Python identified for the source file, here UTF-8.
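The difference between the two kinds of object shows up in their lengths. The sketch below uses Python 3 syntax (an assumption), where `str` corresponds to the Python 2 `unicode` type and `bytes` corresponds to the Python 2 `str` type. The helper name `describe` is hypothetical.

```python
def describe(value):
    """Return the type name and length of a text or byte string."""
    return type(value).__name__, len(value)

text = "\u54c8"             # one Unicode character, U+54C8
raw = text.encode("utf-8")  # the same character as UTF-8 bytes

print(describe(text))  # ('str', 1)
print(describe(raw))   # ('bytes', 3)
```

The Unicode object holds one character, while the encoded string holds the three bytes that UTF-8 uses for that character.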

When a unicode object is printed, Python automatically converts it to the encoding of the output environment, here the console. If the output is an ordinary string rather than a unicode object, its bytes are written out directly in whatever encoding the string already has, which produces the garbled output described above.
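The automatic conversion can be imitated with an in-memory stand-in for the console. This is only a sketch in Python 3 (an assumption): `io.TextIOWrapper` over `io.BytesIO` plays the role of a console with a fixed encoding, and the helper name `bytes_sent_to_console` is hypothetical.

```python
import io

def bytes_sent_to_console(text, console_encoding):
    """Write text to a fake console and return the bytes it received."""
    raw = io.BytesIO()
    console = io.TextIOWrapper(raw, encoding=console_encoding, newline="")
    console.write(text)  # text is encoded with the console's encoding
    console.flush()
    data = raw.getvalue()
    console.detach()     # leave the underlying buffer open
    return data

text = "\u54c8"
print(bytes_sent_to_console(text, "gbk") == text.encode("gbk"))
print(bytes_sent_to_console(text, "utf-8") == text.encode("utf-8"))
```

The same Unicode text reaches each console as bytes in that console's own encoding, which is why printing a Unicode object works regardless of the source file encoding.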

To obtain unicode objects, you can use the unicode class and the encode and decode methods of strings, in addition to the u prefix.

The constructor of the unicode class takes a string argument and an encoding argument and wraps the string as a unicode object. For example, because our file is UTF-8 encoded, we pass 'utf-8' as the encoding argument to wrap the character as a unicode object, which is then printed correctly on the console:

s1 = unicode('哈', 'utf-8')
print s1

In addition, an ordinary string can be converted to a unicode object with the decode method. Many people are unsure what the decode and encode methods of Python strings mean, so here is a brief description.

decode interprets an ordinary string according to the encoding given in its argument and produces the corresponding unicode object. For example, our code uses UTF-8, so converting a string to unicode looks like this:

s2 = '哈'.decode('utf-8')

At this point, s2 is a unicode object holding the character '哈'. It is in fact the same as unicode('哈', 'utf-8') and u'哈'.
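The equivalence of the three forms can be checked directly. The sketch below is in Python 3 syntax (an assumption), so `str(raw, "utf-8")` stands in for the Python 2 `unicode(raw, 'utf-8')` constructor and a plain literal stands in for the u-prefixed literal.

```python
raw = b"\xe5\x93\x88"  # UTF-8 bytes of the character U+54C8

from_literal = "\u54c8"               # analogous to the u-prefixed literal
from_constructor = str(raw, "utf-8")  # analogous to unicode(raw, 'utf-8')
from_decode = raw.decode("utf-8")     # same call as in the article

print(from_literal == from_constructor == from_decode)  # True
```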

encode does the opposite: it converts a unicode object into an ordinary string in the encoding given in its argument, as in the following code:

s3 = unicode('哈', 'utf-8').encode('utf-8')

s3 is now back to the UTF-8 encoded string '哈'.
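Chaining decode and encode gives a simple way to convert between encodings. The following is a minimal sketch in Python 3 syntax (an assumption); the helper name `transcode` is hypothetical and not part of any library.

```python
def transcode(raw, source_encoding, target_encoding):
    """Decode bytes to Unicode text, then encode the text in a new encoding."""
    return raw.decode(source_encoding).encode(target_encoding)

utf8_bytes = "\u54c8".encode("utf-8")

# Decoding and re-encoding with the same codec returns the original bytes.
print(transcode(utf8_bytes, "utf-8", "utf-8") == utf8_bytes)  # True

# Converting to GBK yields the bytes a GBK console expects.
gbk_bytes = transcode(utf8_bytes, "utf-8", "gbk")
print(len(utf8_bytes), len(gbk_bytes))  # 3 2
```

Unicode acts as the intermediate form here: every conversion between two external encodings passes through it.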

