Python Encoding Problems

Last Update:2017-10-17 Source: Internet

Author: User


I. Encoding history:

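1. ASCII is mainly used to represent modern English and other Western European languages. Strictly speaking it is a 7-bit code that defines 128 characters (2 ** 7 = 128), each stored in one byte. A full 8-bit byte can hold 2 ** 8 = 256 distinct values (0 to 255), and the 8-bit extensions of ASCII use the upper half for additional characters.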

2. To process Chinese characters, GB2312 was designed for Simplified Chinese. GB2312 covers too few characters, so it was extended to GBK and later to GB18030 (mobile phones and MP3 players generally supported only GB2312). From ASCII through GB2312 and GBK to GB18030, each encoding is backward compatible with the previous one. GB2312 and GBK encode a Chinese character in two bytes; GB18030 additionally uses four-byte sequences for characters beyond GBK.

3. Because ASCII cannot represent all the characters and symbols in the world, an encoding that can is required: Unicode (also translated as unified code or universal code). It assigns a single, unique code to every character in every language. Early Unicode represented each character in 16 bits (2 bytes), that is, 2 ** 16 = 65536 possible values; the code space has since grown beyond 16 bits, up to U+10FFFF.

4. UTF-8 is a compact, optimized encoding of Unicode. It is variable-length (1 to 4 bytes per character), which saves a good deal of space. A Chinese-language Windows installation uses GBK as its default encoding, while Linux typically uses UTF-8 by default.
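
The relationships in the list above can be checked from Python 3. The following is a minimal sketch; the sample characters "A" and "中" are chosen here purely for illustration and are not from the original article.

```python
# Sample characters chosen for illustration only.
ascii_char = "A"
han_char = "中"

# An ASCII character keeps the same single byte in every encoding listed
# above, which is what "backward compatible" means in practice.
ascii_bytes = {
    enc: ascii_char.encode(enc)
    for enc in ("ascii", "gb2312", "gbk", "gb18030", "utf-8")
}

# A common Chinese character takes 2 bytes in the GB family and 3 in UTF-8.
han_lengths = {
    enc: len(han_char.encode(enc))
    for enc in ("gb2312", "gbk", "gb18030", "utf-8")
}

# Plain ASCII cannot represent the Chinese character at all.
try:
    han_char.encode("ascii")
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False

print(ascii_bytes)
print(han_lengths)
print(ascii_can_encode)
```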

II. py3 encoding:

Py3 has two data types for this: str, which stores Unicode text, and bytes, which stores raw byte data. Python 3 draws a clear line between text and binary data and never decodes bytes automatically. Text is always Unicode, represented by the str type; binary data is represented by the bytes type. Python 3 does not mix str and bytes in any implicit way, which keeps the distinction sharp. You cannot concatenate a string and a bytes object, search for a string inside a bytes object (or vice versa), or pass a string to a function that expects bytes (or vice versa).
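
The rule that str and bytes never mix implicitly can be observed directly. This is a small sketch under Python 3; the helper raises_type_error is a hypothetical name introduced only for this illustration.

```python
text = "abc"    # str: Unicode text
data = b"abc"   # bytes: raw binary data


def raises_type_error(operation):
    """Return True if calling operation() raises TypeError."""
    try:
        operation()
    except TypeError:
        return True
    return False


# Concatenating or searching across the two types is rejected.
concat_fails = raises_type_error(lambda: text + data)
search_fails = raises_type_error(lambda: data in text)

# The conversion has to be explicit.
combined = text + data.decode("ascii")

print(concat_fails, search_fails, combined)
```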

In py3, encode converts a str to bytes, and decode converts bytes back to a str.

import json

s = '苑昊'
print(type(s))        # <class 'str'>
print(json.dumps(s))  # "\u82d1\u660a"

b = s.encode('utf-8')
print(type(b))        # <class 'bytes'>
print(b)              # b'\xe8\x8b\x91\xe6\x98\x8a'

u = b.decode('utf-8')
print(type(u))        # <class 'str'>
print(u)              # 苑昊
print(json.dumps(u))  # "\u82d1\u660a"

print(len('苑昊'))    # 2

III. Encoding of files from disk to memory:

Take a text editor such as Word. In what form does the text we are editing exist in memory before it is saved? As Unicode data: Unicode is a universal code in which every character has a unique code, so it offers the best compatibility.

When we save, what is stored on the disk is a bytes string encoded in some particular way, for example utf8 (a variable-length encoding that saves space) or gbk. Every text editor therefore has a default encoding for saving files, such as utf8 or gbk. When we click Save, the editor silently does the encoding for us. When we open the file again, the software silently decodes the data into Unicode and presents the plain text to the user. In that sense, Unicode is closer to the user and bytes are closer to the computer.
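
What an editor does silently on Save and Open can be written out by hand. The sketch below assumes utf-8 as the editor's default encoding; the file name demo.txt is hypothetical.

```python
import os
import tempfile

text = "苑昊"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")  # hypothetical file name

    # "Save": the Unicode text is encoded to bytes on its way to disk.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # What actually sits on disk is bytes.
    with open(path, "rb") as f:
        raw = f.read()

    # "Open": the bytes are decoded back to Unicode with the same encoding.
    with open(path, encoding="utf-8") as f:
        restored = f.read()

print(raw)                      # the encoded bytes
print(len(raw), len(restored))  # byte count versus character count
```

Two characters become six bytes on disk, and the round trip only works because the same encoding is used in both directions.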

Program execution: first, establish one concept. The py interpreter is itself a piece of software, much like a text editor. Now let's trace the encoding steps from creating a py file to executing it:

Open PyCharm, create the file hello.py, and write:

s = '苑昊'

print(s)

When we save, hello.py is written to disk in PyCharm's default encoding. When the file is closed and opened again, PyCharm decodes the content it reads using that same default encoding, and once the content has been converted to Unicode in memory we can see the plain text. If we click the Run button or run the file from the command line, the py interpreter is invoked to open the file. It decodes the bytes on disk into Unicode data just as the editor does. The difference is what happens next: the interpreter compiles the Unicode source text into bytecode and executes it, which ultimately has the operating system run machine instructions on the CPU.

So the question is: our text editor has its own default encoding and decoding method. Does the interpreter have one too?

Of course. The default is ASCII in py2 and utf8 in py3, and it can be queried as follows:

import sys
print(sys.getdefaultencoding())
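
The interpreter default is only one of several encodings in play. As a hedged sketch, the following collects the others that Python 3 exposes through the standard library; the function name encoding_defaults and the dictionary keys are hypothetical labels used only here. The stdout and locale values depend on the machine, which is the root of the console problem discussed next.

```python
import locale
import sys


def encoding_defaults():
    """Collect the encodings Python 3 reports on this machine."""
    return {
        # Used when str.encode() / bytes.decode() get no argument.
        "str/bytes default": sys.getdefaultencoding(),
        # Used to convert file names for the operating system.
        "file system": sys.getfilesystemencoding(),
        # Normally used by open() when no encoding is given.
        "locale preferred": locale.getpreferredencoding(False),
        # Used by print() when writing to standard output.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


for name, value in encoding_defaults().items():
    print(name, "->", value)
```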

Common encoding problems:

Garbled characters in cmd

hello.py

# coding: utf8

print('苑昊')

The encoding when the file is saved is also utf8.
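
How the coding declaration and the saved bytes fit together can be inspected with the standard tokenize module. This is a sketch, not a description of the interpreter's internals: it builds the bytes of hello.py in memory instead of reading a real file.

```python
import codecs
import io
import tokenize

# The bytes of hello.py as saved in utf8 (built in memory for this sketch).
source_bytes = "# coding: utf8\nprint('苑昊')\n".encode("utf-8")

# detect_encoding reads the declaration from the first lines of the file.
declared, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
declared_name = codecs.lookup(declared).name  # normalize spelling

# With no declaration, Python 3 falls back to utf-8.
fallback, _ = tokenize.detect_encoding(io.BytesIO(b"print('hi')\n").readline)

# Decoding with the declared encoding recovers the Unicode source text.
source_text = source_bytes.decode(declared)

print(declared_name, fallback)
print(len(source_bytes), len(source_text))
```

If the file were saved in one encoding while the declaration named another, this decoding step is where the mismatch would surface.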

Think about it: why does the script run without problems in the IDE under both py2 and py3, while under cmd.exe py3 displays correctly and py2 shows garbled characters?

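On Windows we run the script in the terminal, cmd.exe, and cmd.exe is itself a piece of software. python2 reads the file according to the declared utf8 encoding, which matches how it was saved, so that step is fine. The problem appears at print: in py2 the string literal is utf8-encoded byte data, and it is passed to cmd.exe as is for display. On a Chinese-language Windows system cmd.exe decodes with GBK by default, and decoding utf8 bytes as GBK naturally produces garbled characters.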

py3 displays correctly because the string it passes to print is Unicode data, which is converted into a form cmd.exe can recognize, so the display is fine.
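
The two behaviours can be simulated in Python 3 without a Windows console. The sketch assumes a console that decodes with gbk; the variable names are hypothetical, and real consoles and newer Python versions may handle output differently.

```python
text = "苑昊"
console_encoding = "gbk"  # assumption: a Chinese-language cmd.exe

# py2-style: the utf8 bytes from the source file reach the console unchanged,
# and the console decodes them with its own encoding.
py2_output = text.encode("utf-8")
shown_py2 = py2_output.decode(console_encoding, errors="replace")

# py3-style: the Unicode string is encoded for the console before it is
# written, so the console's decoding matches.
py3_output = text.encode(console_encoding)
shown_py3 = py3_output.decode(console_encoding)

print(shown_py2 == text)  # garbled, so this is False
print(shown_py3 == text)  # round trip succeeds, so this is True
```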

Once you understand the principle, there are many ways to fix this, such as:

print(u'苑昊')

Written this way, the script also displays correctly under py2 in cmd.

Reference: http://www.cnblogs.com/yuanchenqi/articles/5956943.html

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
