String and character encoding in Python
I. Preface
Character encoding in Python is a well-worn topic, and many articles have been written about it. Some merely repeat what others have said, and some are well written. Recently I watched a video from a well-known training institution that covered this issue again, but it was still unsatisfactory, so I decided to write this article: on the one hand to organize the relevant knowledge, and on the other hand in the hope of helping others.
Python2's default encoding is ASCII and cannot recognize Chinese characters, so the character encoding must be explicitly specified; Python3's default encoding is Unicode and can recognize Chinese characters.
I believe you have seen an explanation like the one above in many articles about handling Chinese in Python, and that you felt you understood it at first. After a while, however, the same problems keep coming up and the explanation no longer seems so clear. Once we understand what the "default encoding" in that sentence actually does, its meaning becomes much clearer.
Note that "what character encoding is" and "how character encodings developed" are not discussed in this section. For more information, see <this article>.
II. Concepts
1. Characters and bytes
A character is not the same thing as a byte. A character is a symbol that humans can recognize; to save such symbols in computer storage, they must be represented as bytes, which the computer can recognize. A character often has several representations, and different representations use different bytes. These different representations are the character encodings. For example, the letters A-Z can be represented in ASCII (one byte each), in a two-byte Unicode form such as UTF-16 (two bytes each), or in UTF-8 (one byte each). A character encoding defines how human-readable characters are converted into machine-readable bytes, and how those bytes are converted back.
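As a small illustration of the byte counts mentioned above, the following Python 3 sketch encodes the letter A with three encodings. It assumes UTF-16 (big-endian, without a byte-order mark) as a stand-in for the two-byte "Unicode" representation; the variable names are only illustrative.

```python
# One character, several byte representations (Python 3).
letter = 'A'

ascii_bytes = letter.encode('ascii')      # one byte
utf16_bytes = letter.encode('utf-16-be')  # two bytes (assumed two-byte form, no BOM)
utf8_bytes = letter.encode('utf-8')       # one byte, identical to ASCII

for name, data in [('ASCII', ascii_bytes),
                   ('UTF-16', utf16_bytes),
                   ('UTF-8', utf8_bytes)]:
    # Print the encoding name, the raw bytes, and how many bytes were needed.
    print(name, data, len(data))
```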
A Unicode string is the real string, while a byte string is the result of encoding a string with a character encoding such as ASCII, UTF-8, or GBK. This is why the official Python documentation often uses phrases such as "Unicode string" and "translating a Unicode string into a sequence of bytes".
The code we write lives in a file, and characters are saved in the file as bytes, so it is understandable that a string defined in a file is treated as a byte string. What we need, however, is a string, not a byte string. A good programming language should strictly distinguish between the two and provide clean, complete support for both. Java does this so well that, before I learned Python and PHP, I never had to think about these issues, which programmers should not have to handle. Unfortunately, many programming languages blur "strings" and "byte strings" and use byte strings as strings; PHP and Python2 both belong to this group. The best way to illustrate the problem is to take the length of a string that contains Chinese characters:
- The length of a string is its number of characters, whether they are Chinese or English.
- The length of the corresponding byte string depends on the character encoding used to encode it (for example, UTF-8 needs 3 bytes for a Chinese character, while GBK needs 2 bytes).
Note: On a Chinese-language Windows system the default character encoding of the cmd terminal is GBK, so each Chinese character entered in cmd is represented by two bytes.
>>> # Python2
>>> a = 'Hello,中国'   # byte string, length = len('Hello,') + len('中国') = 6 + 2*2 = 10
>>> b = u'Hello,中国'  # string, length = len('Hello,') + len('中国') = 6 + 2 = 8
>>> c = unicode(a, 'gbk')  # b is effectively shorthand for c: both decode a GBK-encoded byte string into a Unicode string
>>>
>>> print(type(a), len(a))
(<type 'str'>, 10)
>>> print(type(b), len(b))
(<type 'unicode'>, 8)
>>> print(type(c), len(c))
(<type 'unicode'>, 8)
>>>
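For readers without a Python2 interpreter, the same length comparison can be sketched in Python 3. This assumes the example string is 'Hello,中国' (six ASCII characters followed by two Chinese characters), and encodes it explicitly instead of relying on the terminal's encoding.

```python
# Python 3: a str counts characters; its encoded form counts bytes.
text = 'Hello,中国'

gbk_bytes = text.encode('gbk')     # 6 ASCII bytes + 2 characters * 2 bytes
utf8_bytes = text.encode('utf-8')  # 6 ASCII bytes + 2 characters * 3 bytes

print(len(text))        # number of characters
print(len(gbk_bytes))   # number of bytes in GBK
print(len(utf8_bytes))  # number of bytes in UTF-8
```

The character count stays at 8 whichever encoding is chosen; only the byte count changes (10 for GBK, 12 for UTF-8).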
Python 3 has made great changes to the string support. The specific content is described below.
2. Encoding and decoding
First, some background: Unicode is also a mapping between characters and numbers. The number is called a code point, and it is usually written in hexadecimal.
In the official Python documentation, the relationship between Unicode strings, byte strings, and encoding is described as follows:
A Unicode string is a sequence of code points. A code point ranges from 0 to 0x10FFFF (1114111 in decimal). In storage (both memory and disk), this sequence of code points must be represented as a set of bytes (values from 0 to 255). The rules for converting a Unicode string into a sequence of bytes are called an encoding.
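The notion of a code point can be explored directly in Python 3 with the built-in ord() and chr() functions. The snippet below is a minimal sketch; the sample characters are arbitrary choices.

```python
# ord() maps a character to its code point; chr() maps a code point back.
code_points = {ch: ord(ch) for ch in 'A中'}
for cp in code_points.values():
    # Show each code point in the usual U+XXXX hexadecimal notation.
    print('U+%04X' % cp, cp)

max_code_point = 0x10FFFF
print(max_code_point)  # the upper bound, written in decimal

# Anything beyond the upper bound is not a valid code point.
try:
    chr(max_code_point + 1)
    beyond_range_ok = True
except ValueError:
    beyond_range_ok = False
```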
Here "encoding" does not mean a character encoding as such, but the encoding process and the rules that map the code points of Unicode characters to bytes. This mapping does not have to be a simple one-to-one mapping, and an encoding does not have to handle every possible Unicode character. For example:
The rules for converting a Unicode string to ASCII are simple. For each code point:
- If the code point is < 128, the byte has the same value as the code point.
- If the code point is 128 or greater, the Unicode string cannot be represented in this encoding (in this case Python raises a UnicodeEncodeError exception).
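The two ASCII rules above can be written out by hand and compared with Python's built-in codec. The function encode_ascii below is a hypothetical, illustrative helper (it raises ValueError rather than UnicodeEncodeError to keep the sketch short); it is not how CPython implements the codec.

```python
def encode_ascii(text):
    """Hand-rolled ASCII encoder following the two rules (illustrative only)."""
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        if code_point >= 128:
            # Rule 2: this code point has no ASCII representation.
            raise ValueError('code point %d is not representable in ASCII' % code_point)
        # Rule 1: the byte value equals the code point.
        out.append(code_point)
    return bytes(out)

hand_rolled = encode_ascii('abc')
built_in = 'abc'.encode('ascii')

# The built-in codec rejects non-ASCII text with UnicodeEncodeError.
try:
    '中'.encode('ascii')
    builtin_rejected = False
except UnicodeEncodeError:
    builtin_rejected = True
```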
Converting a Unicode string to UTF-8 uses the following rules:
- If the code point is < 128, it is represented by the byte with the same value (the same as in ASCII).
- If the code point is 128 or greater, it is converted into a sequence of 2, 3, or 4 bytes, and every byte in the sequence is between 128 and 255.
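The UTF-8 rules can be checked with a few sample characters. The samples below are arbitrary choices (a Latin letter, an accented letter, a Chinese character, and an emoji) picked to cover the 1, 2, 3, and 4 byte cases.

```python
# Sample characters written with escapes so the source stays unambiguous.
samples = ['A', '\u00e9', '\u4e2d', '\U0001F600']

utf8_lengths = []
for ch in samples:
    encoded = ch.encode('utf-8')
    utf8_lengths.append(len(encoded))
    # Show the code point, the byte count, and the individual byte values.
    print('U+%04X' % ord(ch), len(encoded), list(encoded))

# For code points >= 128, every byte of the sequence is in the range 128-255.
multibyte_all_high = all(
    byte >= 128
    for ch in samples[1:]
    for byte in ch.encode('utf-8')
)
```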
Summary:
- Encoding (encode): the process and rules for converting a Unicode string (the code points in the string) into a byte string in a specific character encoding.
- Decoding (decode): the process and rules for converting a byte string in a specific character encoding back into the corresponding Unicode string (its code points).
Both encoding and decoding therefore depend on one important factor: the specific character encoding. The same character encoded with different character encodings usually yields different byte values and a different number of bytes, and the same bytes decoded with different character encodings usually yield different characters.
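To see why the specific character encoding matters in both directions, the following Python 3 sketch encodes the same text two ways and then decodes with the wrong encoding. The sample text '中国' is an assumption chosen for illustration; errors='replace' is used in the second case so the result can be inspected regardless of how the codec treats unmapped byte pairs.

```python
text = '中国'
gbk_bytes = text.encode('gbk')
utf8_bytes = text.encode('utf-8')

# Decoding with the encoding that produced the bytes restores the text.
round_trip_ok = (gbk_bytes.decode('gbk') == text and
                 utf8_bytes.decode('utf-8') == text)

# Decoding GBK bytes as UTF-8 fails outright.
try:
    gbk_bytes.decode('utf-8')
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True

# Decoding UTF-8 bytes as GBK does not restore the original characters.
garbled = utf8_bytes.decode('gbk', errors='replace')
print(len(gbk_bytes), len(utf8_bytes), wrong_decode_failed, garbled != text)
```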
III. Default encoding in Python
1. Execution of Python source code files
We all know that files on disk are stored in binary form, and that text files are stored as the bytes of a specific encoding. The character encoding of a program's source files is chosen in the editor. For example, when we write a Python program in PyCharm we set the project encoding and the file encoding to UTF-8; when the Python code is saved, it is converted into the corresponding UTF-8 bytes (the encode process) and written to disk. When executing the code in the file, the Python interpreter must first convert the bytes in the file into a Unicode string (the decode process) before performing any further operations.
As explained above, this conversion (decoding) requires us to say which character encoding the bytes stored in the file use, so that the interpreter can find the Unicode code points that correspond to those bytes. The way to declare it should be familiar:
# -*- coding:utf-8 -*-
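The standard library exposes the same declaration lookup through tokenize.detect_encoding, which makes it possible to sketch the decode step in Python 3. The in-memory "file" below and its GBK declaration are assumptions made for the example.

```python
import io
import tokenize

# A source "file" saved as GBK bytes, with the encoding declared on line 1.
source_bytes = "# -*- coding: gbk -*-\ns = '中国'\n".encode('gbk')

# detect_encoding reads the first lines and returns the declared encoding.
declared, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)

# With the declared encoding known, the bytes can be decoded to text.
source_text = source_bytes.decode(declared)

# Without any declaration, Python 3 falls back to UTF-8.
fallback, _ = tokenize.detect_encoding(io.BytesIO(b"x = 1\n").readline)
print(declared, fallback)
```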
2. Default encoding
If we do not declare a character encoding at the top of the code file, which character encoding does the Python interpreter use to convert the bytes read from the file into Unicode code points? Just as much software ships with default options, the Python interpreter has a default character encoding for exactly this situation; this is the "default encoding" mentioned at the beginning of the article. The Chinese character problem in Python can therefore be summarized in one sentence: when the bytes cannot be converted using the default character encoding, a decoding error (UnicodeDecodeError) occurs.
The Python2 and Python3 interpreters use different default encodings. We can obtain the default encoding with sys.getdefaultencoding():
>>> # Python2
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> # Python3
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
Therefore, in Python2, when the interpreter needs to decode the bytes of Chinese characters, it first checks whether the header of the current code file declares the character encoding of the bytes saved in that file. If nothing is declared, it decodes with the default character encoding, ASCII, which fails with the following error:
SyntaxError: Non-ASCII character '\xc4' in file xxx.py on line 11, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
For Python3 the execution process is the same, except that the Python3 interpreter uses UTF-8 as the default encoding. This does not mean it handles Chinese in every case. For example, when developing on Windows, a Python project and its code files may use the default GBK encoding, which means the code file is saved to disk as GBK bytes. When the Python3 interpreter executes that file and tries to decode it as UTF-8, decoding also fails, with the following error:
SyntaxError: Non-UTF-8 code starting with '\xc4' in file xxx.py on line 11, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
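The Python3 failure can be reproduced with a small experiment that writes a GBK-encoded script to a temporary directory and runs it with the current interpreter. This is a sketch under stated assumptions: run_gbk_script, the file name demo_gbk.py, and the script body are hypothetical names invented for the example, and the script prints only a length so that its output stays ASCII.

```python
import os
import subprocess
import sys
import tempfile

# Script body containing Chinese characters; it prints an ASCII-only result.
BODY = "s = '中国'\nprint(len(s))\n"

def run_gbk_script(declare_encoding):
    """Save BODY as GBK bytes, optionally with a coding declaration, and run it."""
    header = "# -*- coding: gbk -*-\n" if declare_encoding else ""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo_gbk.py")
        with open(path, "wb") as f:
            f.write((header + BODY).encode("gbk"))
        result = subprocess.run([sys.executable, path],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
    return (result.returncode,
            result.stdout.decode("ascii", "replace"),
            result.stderr.decode("ascii", "replace"))

# No declaration: the interpreter assumes UTF-8 and rejects the file.
code_without, _, err_without = run_gbk_script(declare_encoding=False)
# With the declaration: the interpreter decodes the file as GBK and runs it.
code_with, out_with, _ = run_gbk_script(declare_encoding=True)
print(code_without, code_with, out_with.strip())
```

Declaring the encoding is only one way out; saving the file as UTF-8 in the first place avoids the problem entirely.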
3. Best practices
- After creating a project, check whether the project's character encoding is set to UTF-8
- To be compatible with both Python2 and Python3, declare the character encoding in the header of the code file:
# -*- coding:utf-8 -*-
IV. String support in Python2 and Python3
In fact, Python3 improved string support not only by changing the default encoding but by re-implementing strings with built-in Unicode support. In this respect Python is now as good as Java. Let's look at the differences in string support between Python2 and Python3:
Python2
String support in Python2 is provided by the following three classes:
class basestring(object)
class str(basestring)
class unicode(basestring)
Run help(str) and help(bytes) and you will find that both show the definition of the str class. In other words, str in Python2 is a byte string, and the unicode object is the real string.
#!/usr/bin/env python
# -*- coding: utf-8 -*-

a = '你好'   # byte string: two Chinese characters stored as UTF-8 bytes
b = u'你好'  # unicode string: two characters

print(type(a), len(a))
print(type(b), len(b))

Output:
(<type 'str'>, 6)
(<type 'unicode'>, 2)
Python3
Python3 simplifies the class hierarchy for string support: it removes the unicode class and adds a bytes class. On the surface, str and unicode appear to have been merged into one in Python3.
class bytes(object)
class str(object)
In fact, Python3 recognized the earlier mistake and began to distinguish clearly between strings and bytes. str in Python3 is a real string, and bytes are represented by a separate bytes class. In other words, a string literal in Python3 defines a string by default, with built-in Unicode support, which reduces the programmer's burden when processing strings.
#!/usr/bin/env python
# -*- coding: utf-8 -*-

a = '你好'                # str: two characters
b = u'你好'               # the u prefix is still accepted and also gives a str
c = '你好'.encode('gbk')  # bytes: two characters * 2 bytes in GBK

print(type(a), len(a))
print(type(b), len(b))
print(type(c), len(c))

Output:
<class 'str'> 2
<class 'str'> 2
<class 'bytes'> 4
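One practical consequence of this strict separation is that Python3 refuses to mix the two types implicitly. The sketch below assumes a two-character Chinese sample string; the variable names are illustrative.

```python
text = '你好'               # str: a sequence of characters
data = text.encode('utf-8')  # bytes: a sequence of byte values

# Python 3 does not silently convert between str and bytes.
try:
    text + data
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# Conversion must be explicit, and it round-trips cleanly.
round_trip = data.decode('utf-8')
print(type(text).__name__, type(data).__name__, mixing_allowed)
```

In Python2 the equivalent concatenation would trigger an implicit ASCII decode of the byte string, which is a common source of the errors discussed earlier.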
V. Character encoding conversion
As mentioned above, a Unicode string can be encoded into the bytes of any character encoding, so a natural question is: can bytes in different character encodings be converted to each other by way of Unicode? The answer is yes.
The character encoding conversion process for strings in Python2 is as follows:
byte string --> decode('original encoding') --> Unicode string --> encode('new encoding') --> byte string
#!/usr/bin/env python
# -*- coding: utf-8 -*-

utf_8_a = '我爱中国'  # Python2: a UTF-8 encoded byte string
gbk_a = utf_8_a.decode('utf-8').encode('gbk')
print(gbk_a.decode('gbk'))

Output:
我爱中国
A string defined in Python3 is Unicode by default, so there is no need to decode it first; it can be encoded directly into the new character encoding:
string --> encode('new encoding') --> byte string
#!/usr/bin/env python
# -*- coding: utf-8 -*-

utf_8_a = '我爱中国'  # Python3: already a Unicode string
gbk_a = utf_8_a.encode('gbk')
print(gbk_a.decode('gbk'))

Output:
我爱中国
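When the starting point in Python3 is bytes rather than a str (for example, data read from a file or a socket), the full decode-then-encode path still applies. The helper transcode below is a hypothetical name introduced for this sketch, and the sample text is assumed.

```python
def transcode(data, source_encoding, target_encoding):
    """Hypothetical helper: re-encode bytes by passing through a str."""
    return data.decode(source_encoding).encode(target_encoding)

utf8_bytes = '我爱中国'.encode('utf-8')  # 4 characters * 3 bytes
gbk_bytes = transcode(utf8_bytes, 'utf-8', 'gbk')  # 4 characters * 2 bytes

# The bytes differ, but both decode to the same four characters.
same_text = gbk_bytes.decode('gbk') == utf8_bytes.decode('utf-8')
print(len(utf8_bytes), len(gbk_bytes), same_text)
```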
Note that Unicode is neither a dictionary nor Google Translate; it cannot translate a Chinese character into an English one. A correct character encoding conversion only changes the bytes that represent a character; the character itself must not change. For this reason, not every character encoding conversion is meaningful. How should this be understood? For example, after "中国" is converted from GBK to UTF-8, it is represented by six bytes instead of four, but the text should still read "中国", not "你好" or "China".
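Both halves of this point can be checked in Python 3: a meaningful conversion changes only the bytes, while a conversion into an encoding that lacks the characters loses them. The sketch assumes the sample text '中国' and uses errors='replace' purely to make the loss visible.

```python
text = '中国'

# Meaningful conversion: GBK bytes -> str -> UTF-8 bytes.
gbk_bytes = text.encode('gbk')
utf8_bytes = gbk_bytes.decode('gbk').encode('utf-8')
unchanged = utf8_bytes.decode('utf-8') == text

# Meaningless conversion: ASCII has no Chinese characters, so with
# errors='replace' each character collapses to a question mark.
lossy = text.encode('ascii', errors='replace')

print(len(gbk_bytes), len(utf8_bytes), unchanged, lossy)
```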