How to resolve string and character encoding problems in Python

Source: Internet
Author: User

The content of this section:

    1. Preface

    2. Related concepts

    3. Default encoding in Python

    4. Support for strings in Python2 and Python3

    5. Character encoding conversion

First, preface

Character encoding in Python is a well-worn topic, and many people have already written about it. Some of those articles repeat each other, and some go very deep. I recently watched a teaching video from a well-known training institution that covered this problem again, and the explanation was still unsatisfactory, so I decided to write this article. On the one hand it lets me organize the relevant knowledge, and on the other hand I hope it gives others a little help.

"The default encoding of Python2 is ASCII, which cannot recognize Chinese characters, so the character encoding must be specified explicitly; the default encoding of Python3 is Unicode, which can recognize Chinese characters."

You have probably seen explanations of Chinese text handling in Python like the one above in many articles, and it probably felt clear the first time you read it. After a while, though, when the same problems keep coming back, that understanding turns out to be less solid than it seemed. Once we understand what the "default encoding" mentioned above actually does, the meaning of that sentence becomes much clearer.

Note that "what character encoding is" and "how character encodings developed" are not discussed in this section; for those topics, refer to my previous article.

Second, related concepts

1. Characters and bytes

A character is not the same thing as a byte. A character is a symbol that humans can recognize; to be kept in computer storage, it has to be represented as bytes that the computer can process. A character often has several representations, and different representations use different numbers of bytes. These representations are character encodings: for example, the letters A-Z can be represented in ASCII (one byte each), in a two-byte Unicode form such as UTF-16 (two bytes each), or in UTF-8 (one byte each). The role of a character encoding is to convert human-readable characters into machine-readable bytes, and back again.
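As a small illustration of these byte counts, the following Python 3 sketch encodes the letter A with three encodings. The codec name 'utf-16-le' (UTF-16 without a byte order mark) is chosen here as an assumption to show a two-byte form, and the variable names are only illustrative.

```python
# Encode the same character with three encodings and compare byte counts.
letter = "A"

ascii_bytes = letter.encode("ascii")      # one byte
utf16_bytes = letter.encode("utf-16-le")  # two bytes (little-endian, no BOM)
utf8_bytes = letter.encode("utf-8")       # one byte, identical to the ASCII form

for name, data in [("ascii", ascii_bytes),
                   ("utf-16-le", utf16_bytes),
                   ("utf-8", utf8_bytes)]:
    print(name, data, len(data))
```

The character is the same in all three cases; only its byte representation changes.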

A Unicode string is the real string, while ASCII, UTF-8, GBK and other character encodings produce byte strings. This is why Python's official documentation often uses phrases such as "Unicode string" and "translating a Unicode string into a sequence of bytes".

Code is written to a file, and the characters in that file are stored as bytes, so it is understandable that a string defined in a source file could be treated as a byte string. What we need, however, is a string, not a byte string. A good programming language should strictly distinguish between the two and support both well. Java does this so well that, before I learned Python and PHP, I never had to think about these issues, which programmers should not need to handle. Unfortunately, many programming languages blur "strings" and "byte strings" and use byte strings as if they were strings; PHP and Python2 both belong to this group. The clearest way to see the difference is to take the length of a string that contains Chinese characters:

    • The length of a string should be the number of characters it contains, whether they are Chinese or English

    • The length of the corresponding byte string depends on the character encoding used in the encoding (encode) step (for example, in UTF-8 a Chinese character takes 3 bytes, and in GBK a Chinese character takes 2 bytes)

Note: The character encoding of the Windows cmd terminal defaults to GBK (on Simplified Chinese Windows), so each Chinese character typed into cmd is represented by two bytes.

    >>> # Python2, typed into a GBK terminal
    >>> a = 'Hello,中国'   # byte string; length is the number of bytes = len('Hello,') + len('中国') = 6 + 2*2 = 10
    >>> b = u'Hello,中国'  # string; length is the number of characters = len('Hello,') + len('中国') = 6 + 2 = 8
    >>> c = unicode(a, 'gbk')  # b is in effect shorthand for c: decode a GBK-encoded byte string into a Unicode string
    >>>
    >>> print(type(a), len(a))
    (<type 'str'>, 10)
    >>> print(type(b), len(b))
    (<type 'unicode'>, 8)
    >>> print(type(c), len(c))
    (<type 'unicode'>, 8)
    >>>
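The same counts can be reproduced in Python 3, where the character string and its encoded bytes are separate objects. This is a minimal sketch that assumes the sample text 'Hello,中国' (six ASCII characters followed by two Chinese characters); the variable names are illustrative.

```python
# Compare the length of a string with the lengths of its encoded byte strings.
text = "Hello,中国"

gbk_bytes = text.encode("gbk")     # each Chinese character takes 2 bytes in GBK
utf8_bytes = text.encode("utf-8")  # each Chinese character takes 3 bytes in UTF-8

print(len(text))        # number of characters: 6 + 2
print(len(gbk_bytes))   # number of bytes: 6 + 2*2
print(len(utf8_bytes))  # number of bytes: 6 + 2*3
```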

The support for strings in Python3 has changed a lot, and the specific content will be described below.

2. Encoding and decoding

First, some background: Unicode, like other character sets, is a mapping between characters and numbers, but here the number is called a code point, and it is usually written in hexadecimal.

There is a description of the relationship between Unicode strings, byte strings, and encodings in the official Python documentation:

A Unicode string is a sequence of code points, with code point values ranging from 0 to 0x10FFFF (1114111 in decimal). This sequence of code points needs to be represented as a sequence of bytes (values from 0 to 255) in storage (both memory and disk), and the rules for converting a Unicode string into a sequence of bytes are called an encoding.
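To make the idea of a code point concrete, here is a short Python 3 sketch using the built-in ord() and chr() functions and sys.maxunicode. The sample characters are chosen only for illustration.

```python
import sys

# ord() maps a character to its code point; chr() maps a code point back.
for ch in "A中":
    code_point = ord(ch)
    print(ch, code_point, hex(code_point))
    assert chr(code_point) == ch

# The largest code point a Python 3 string can hold.
largest_code_point = sys.maxunicode
print(largest_code_point, hex(largest_code_point))
```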

Here "encoding" refers not to a character encoding as such, but to the encoding process and the rules it uses to map Unicode code points to bytes. This mapping does not have to be a simple one-to-one mapping, and an encoding does not have to handle every possible Unicode character. For example:

The rules for converting a Unicode string to ASCII are simple. For each code point:

    • If the code point value is <128, the byte has the same value as the code point

    • If the code point value is >=128, the Unicode string cannot be represented in this encoding (in which case Python raises a UnicodeEncodeError exception)

Converting a Unicode string to UTF-8 encoding uses the following rules:

    • If the code point value is <128, it is represented by the corresponding byte value (the same as in ASCII)

    • If the code point value is >=128, it is converted to a 2-byte, 3-byte, or 4-byte sequence in which each byte in the sequence is between 128 and 255.
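The two sets of rules above can be checked with a short Python 3 sketch. The sample characters and the helper try_ascii are hypothetical names introduced for this illustration; only str.encode and UnicodeEncodeError come from Python itself.

```python
# Sample characters: U+0041, U+00E9, U+4E2D, U+1F600.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

def try_ascii(ch):
    """Return the ASCII bytes for ch, or None if ch cannot be encoded as ASCII."""
    try:
        return ch.encode("ascii")
    except UnicodeEncodeError:
        return None

ascii_results = {ch: try_ascii(ch) for ch in samples}
utf8_results = {ch: ch.encode("utf-8") for ch in samples}

for ch in samples:
    # code point, ASCII result (None if not representable), UTF-8 bytes, UTF-8 length
    print(hex(ord(ch)), ascii_results[ch], utf8_results[ch], len(utf8_results[ch]))
```

Only the first sample has an ASCII form, while UTF-8 handles all four by using between one and four bytes.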

Simple summary:

    • Encoding (encode): the process and rules for converting a Unicode string (its code points) into the byte string of a specific character encoding

    • Decoding (decode): the process and rules for converting the byte string of a specific character encoding into the corresponding Unicode string (its code points)

Both encoding and decoding therefore depend on one essential factor: the specific character encoding. When the same character is encoded with different character encodings, the resulting byte values and byte counts differ in most cases, and the same bytes decoded with different character encodings generally yield different characters.
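A brief Python 3 sketch of this point, using one Chinese character as an assumed example: encoding and decoding round-trip only when the same character encoding is used on both sides. The use of 'latin-1' as the mismatched codec is an arbitrary choice for the demonstration.

```python
text = "中"

gbk_bytes = text.encode("gbk")     # two bytes
utf8_bytes = text.encode("utf-8")  # three bytes, with different values

# Decoding with the encoding that produced the bytes recovers the character.
assert gbk_bytes.decode("gbk") == text
assert utf8_bytes.decode("utf-8") == text

# Decoding with a different encoding can "succeed" yet yield other characters.
mismatched = utf8_bytes.decode("latin-1")
print(gbk_bytes, utf8_bytes, repr(mismatched))
```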

Third, the default encoding in Python

1. Python source code file execution process

As we all know, files on disk are stored in binary form, and text files are stored as bytes in some character encoding. The character encoding of a source code file is chosen in the editor; for example, when we write a Python program in PyCharm, we set the project encoding and the file encoding to UTF-8. When the Python code is saved, it is first converted to UTF-8 encoded bytes (the encode step) and then written to disk. When the code in the file is executed, the Python interpreter reads the byte string from the file and must convert it to a Unicode string (the decode step) before doing anything else.

As explained above, this conversion (decoding) requires us to state which character encoding was used for the bytes stored in the file, so that the interpreter knows which Unicode code points those bytes correspond to. The way to declare the character encoding is one that everyone is familiar with:

    # -*- coding: utf-8 -*-
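The declaration above is what gets looked up when a source file is decoded. As a hedged illustration, the standard library function tokenize.detect_encoding applies the PEP 263 detection rule to a stream of bytes; the two sample sources below are made up for this example, and this is a sketch of the lookup, not the interpreter's own loading code.

```python
import io
import tokenize

# A made-up source file saved as GBK, with a matching encoding declaration.
declared_source = "# -*- coding: gbk -*-\nname = '中国'\n".encode("gbk")
declared_encoding, _ = tokenize.detect_encoding(io.BytesIO(declared_source).readline)

# A made-up source file with no declaration at all.
plain_source = b"x = 1\n"
fallback_encoding, _ = tokenize.detect_encoding(io.BytesIO(plain_source).readline)

print(declared_encoding)   # encoding named in the declaration
print(fallback_encoding)   # Python 3 fallback when nothing is declared
print(declared_source.decode(declared_encoding))
```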

2. Default encoding

So which character encoding does the Python interpreter use to convert the bytes read from a code file into Unicode code points if we do not declare one at the beginning of the file? Just as software usually ships with default options, the Python interpreter has a built-in default character encoding to handle this case, and this is the "default encoding" mentioned at the beginning of the article. As a result, the character problems in Python can be summed up in one sentence: a decoding error (such as UnicodeDecodeError) occurs when the bytes cannot be converted with the default character encoding.

The default encodings used by the Python2 and Python3 interpreters are different; we can get the default encoding with sys.getdefaultencoding():

    >>> # Python2
    >>> import sys
    >>> sys.getdefaultencoding()
    'ascii'

    >>> # Python3
    >>> import sys
    >>> sys.getdefaultencoding()
    'utf-8'
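In Python 3 the same default also shows up when encode() and decode() are called without arguments. The sketch below assumes a standard Python 3 interpreter; the sample text is illustrative.

```python
import sys

default_encoding = sys.getdefaultencoding()
print(default_encoding)

text = "中国"
implicit = text.encode()         # no encoding argument given
explicit = text.encode("utf-8")  # encoding named explicitly

print(implicit == explicit)
print(implicit.decode() == text)  # decode() without arguments also assumes UTF-8
```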

Therefore, for Python2, when the interpreter decodes the bytes of a source file that contains Chinese characters, it first checks whether the header of the file declares the character encoding of the bytes stored in it. If no encoding is declared, it decodes with the default character encoding, ASCII, which fails and produces the following error:

    SyntaxError: Non-ASCII character '\xc4' in file xxx.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

For Python3 the execution process is the same, except that the Python3 interpreter uses UTF-8 as the default encoding. This does not mean that Chinese text is handled without any problems. For example, when we develop on Windows and the Python project and code files use the default GBK encoding, the code files are saved to disk as GBK-encoded bytes. When the Python3 interpreter executes such a file and tries to decode it as UTF-8, decoding also fails, with the following error:

    SyntaxError: Non-UTF-8 code starting with '\xc4' in file xxx.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
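Both failures can be imitated in Python 3 with plain bytes.decode calls. This is only a sketch of the underlying decoding step, under the assumption that the file content is the single made-up line shown; the interpreter itself reports the problem as a SyntaxError rather than the UnicodeDecodeError caught here.

```python
# Bytes of a made-up one-line source file saved with GBK encoding.
source_bytes = "print('你好')\n".encode("gbk")

# Try the Python2 default (ASCII) and the Python3 default (UTF-8).
failures = {}
for codec in ("ascii", "utf-8"):
    try:
        source_bytes.decode(codec)
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        failures[codec] = source_bytes[err.start]

print({codec: hex(byte) for codec, byte in failures.items()})

# Decoding with the encoding the file was actually saved in works.
decoded = source_bytes.decode("gbk")
print(decoded, end="")
```

In both failing cases the offending byte is the first byte of the first Chinese character, which matches the '\xc4' reported in the error messages.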

3. Best practices

    • After you create a project, confirm that the character encoding for the project is set to UTF-8

    • To be compatible with both Python2 and Python3, declare the character encoding at the head of the code: # -*- coding: utf-8 -*-

Fourth, support for strings in Python2 and Python3

In fact, Python3 improved string support not only by changing the default encoding but also by reimplementing strings with built-in Unicode support; in this respect Python is now as good as Java. Let's look at how Python2 and Python3 differ in their support for strings:

Python2

Support for strings in Python2 is provided by the following three classes

    class basestring(object)
    class str(basestring)
    class unicode(basestring)

Running help(str) and help(bytes) shows the definition of the str class in both cases, which means that str is the byte string in Python2, and the unicode object is the real string.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    a = '你好'   # byte string (UTF-8 bytes from the source file)
    b = u'你好'  # Unicode string

    print(type(a), len(a))
    print(type(b), len(b))

    Output:
    (<type 'str'>, 6)
    (<type 'unicode'>, 2)

Python3

Python3 simplifies string support at the class level: the unicode class is removed and a bytes class is added. On the surface, it looks as if str and unicode were merged in Python3.

    class bytes(object)
    class str(object)

In fact, Python3 recognized the earlier mistake and began to distinguish clearly between strings and bytes. str in Python3 is a real string, and bytes are represented by a separate bytes class. In other words, a string literal in Python3 is a string by default, with built-in Unicode support, which reduces the burden of string processing on programmers.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    a = '你好'                # str (Unicode string)
    b = u'你好'               # the u prefix is accepted and also gives str
    c = '你好'.encode('gbk')  # bytes

    print(type(a), len(a))
    print(type(b), len(b))
    print(type(c), len(c))

    Output:
    <class 'str'> 2
    <class 'str'> 2
    <class 'bytes'> 4
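One practical consequence of this separation is that Python 3 refuses to mix the two types silently. The following sketch, with illustrative variable names and sample text, shows the difference in indexing and what happens on concatenation.

```python
text = "你好"              # str: a sequence of characters
data = text.encode("gbk")  # bytes: a sequence of integers from 0 to 255

print(isinstance(text, str), isinstance(data, bytes))
print(text[0], data[0])    # a one-character string versus an integer

# Concatenating str and bytes is an error instead of an implicit conversion.
mixing_error = None
try:
    text + data
except TypeError as err:
    mixing_error = err

print(type(mixing_error).__name__)
```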

Fifth, character encoding conversion

As mentioned above, a Unicode string can be converted to and from the byte string of any character encoding.

This naturally raises a question: can byte strings in different character encodings be converted to each other through Unicode? The answer is yes.

The character encoding conversion process for strings in Python2 is:

    byte string --> decode('original character encoding') --> Unicode string --> encode('new character encoding') --> byte string

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    utf_8_a = '我爱中国'  # byte string in UTF-8 (the source file encoding)
    gbk_a = utf_8_a.decode('utf-8').encode('gbk')
    print(gbk_a.decode('gbk'))

    Output:
    我爱中国

A string defined in Python3 is Unicode by default, so it does not need to be decoded first and can be encoded directly into a new character encoding:

    string --> encode('new character encoding') --> byte string

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    utf_8_a = '我爱中国'  # str (Unicode string)
    gbk_a = utf_8_a.encode('gbk')
    print(gbk_a.decode('gbk'))

    Output:
    我爱中国
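When the starting point in Python 3 is a bytes object rather than a str (for example, data read from a file in binary mode), the full decode-then-encode path still applies. The helper transcode below is a hypothetical name introduced for this sketch, and the sample text is illustrative.

```python
def transcode(data: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Re-encode data by decoding it to str (Unicode) and encoding it again."""
    return data.decode(source_encoding).encode(target_encoding)

utf8_data = "我爱中国".encode("utf-8")
gbk_data = transcode(utf8_data, "utf-8", "gbk")

print(len(utf8_data), len(gbk_data))  # byte counts before and after
print(gbk_data.decode("gbk"))         # the characters are unchanged
```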

Finally, note that Unicode is not the Youdao dictionary or Google Translate; it does not translate Chinese into English. A correct character encoding conversion only changes the byte representation of the same characters; the characters themselves must not change, so not every conversion between character encodings is meaningful. What does this mean? For example, when "中国" ("China") is converted from GBK to UTF-8, its representation changes from 4 bytes to 6 bytes, but the characters it expresses must still be "中国"; they must not turn into "hello" or "Chinese".
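The closing example can be checked in Python 3. The sketch below also tries one target encoding that has no representation for Chinese characters; 'latin-1' is picked here purely as an assumed example of such an encoding.

```python
text = "中国"

gbk_bytes = text.encode("gbk")
utf8_bytes = gbk_bytes.decode("gbk").encode("utf-8")

print(len(gbk_bytes), len(utf8_bytes))         # the byte counts differ
same_text = utf8_bytes.decode("utf-8") == text  # the characters do not
print(same_text)

# A conversion is only possible if the target encoding can represent the text.
try:
    text.encode("latin-1")
    representable = True
except UnicodeEncodeError:
    representable = False

print(representable)
```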
