String and character encoding in Python

Source: Internet
Author: User





Contents of this article:
    1. Preface
    2. Related concepts
    3. The default encoding in Python
    4. String support in Python2 and Python3
    5. Character encoding conversion
First, preface


Character encoding in Python is a well-worn topic, and many people have already written about it. Some of those articles repeat one another, and some go very deep. I recently watched a teaching video from a well-known training institution that covered this topic again, and the explanation was still unsatisfying, so I decided to write this article. It is partly a way to organize my own understanding and partly, I hope, of some help to others.


The default encoding of Python2 is ASCII, which does not recognize Chinese characters, so the character encoding must be declared explicitly; the default encoding of Python3 is Unicode, which can recognize Chinese characters.


I believe you have seen explanations like the one above in many articles about handling Chinese text in Python, and that it seemed clear the first time you read it. Some time later, though, when the same problems come up again, that apparent understanding turns out to be less clear than it seemed. Once we understand what the "default encoding" mentioned above actually does, the meaning of that sentence becomes much clearer.


Note that "what character encoding is" and "how character encodings developed" are not covered in this section; for those topics, see my previous << this article >>.

Second, related concepts


1. Characters and bytes



A character is not the same thing as a byte. A character is a symbol that humans can recognize; to store these symbols, a computer has to represent them as bytes. A character usually has several possible representations, and different representations use different numbers of bytes. These representations are character encodings. For example, the letters A-Z can be represented in ASCII (one byte each), in UTF-16 (two bytes each), or in UTF-8 (one byte each). The role of a character encoding is to convert human-readable characters into bytes that the machine can store, and back again.
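To make the point about byte counts concrete, here is a minimal Python3 sketch. The helper name byte_lengths and the choice of sample characters and codecs are my own, picked only for illustration.

```python
# Encode the same character with several codecs and compare the byte counts.
CODECS_TO_TRY = ["ascii", "utf-8", "utf-16-le", "gbk"]


def byte_lengths(ch):
    """Return {codec name: byte count}, or None where the codec cannot encode ch."""
    result = {}
    for name in CODECS_TO_TRY:
        try:
            result[name] = len(ch.encode(name))
        except UnicodeEncodeError:
            # This codec has no representation for the character.
            result[name] = None
    return result


for ch in ["A", "中"]:
    print(repr(ch), byte_lengths(ch))
```

The letter A needs one byte in ASCII, UTF-8 and GBK but two in UTF-16, while 中 cannot be encoded in ASCII at all and needs three bytes in UTF-8 and two in UTF-16 and GBK.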



Unicode is the real string, whereas ASCII, UTF-8, GBK and other character encodings describe byte strings. This is why Python's official documentation often uses phrases such as "Unicode string" and "translating a Unicode string into a sequence of bytes".



The code we write is stored in a file, and characters are stored in that file as bytes, so it is understandable that a string defined in a source file could be treated as a byte string. What we need, however, is a string, not a byte string. A good programming language should distinguish strictly between the two and support both well. Java does this so well that, before I learned Python and PHP, I had never had to think about these issues, which programmers should not need to handle themselves. Unfortunately, many programming languages blur the line between "strings" and "byte strings" and use byte strings as if they were strings; PHP and Python2 are both languages of this kind. The clearest way to see the difference is to take the length of a string that contains Chinese characters:


    • The length of a string should be the number of characters in it, whether they are Chinese or English
    • The length of the corresponding byte string depends on the character encoding used in the encoding (encode) step (for example, UTF-8 needs 3 bytes for a Chinese character, while GBK needs 2)


Note: On Chinese-language Windows systems the cmd terminal uses GBK as its default character encoding, so each Chinese character typed into cmd is represented by two bytes

>>> # Python2
>>> a = 'Hello,中国'  # byte string; length is the number of bytes = len('Hello,') + len('中国') = 6 + 2 * 2 = 10
>>> b = u'Hello,中国'  # string; length is the number of characters = len(u'Hello,') + len(u'中国') = 6 + 2 = 8
>>> c = unicode(a, 'gbk')  # b is effectively shorthand for c: decode a GBK-encoded byte string into a Unicode string
>>>
>>> print(type(a), len(a))
(<type 'str'>, 10)
>>> print(type(b), len(b))
(<type 'unicode'>, 8)
>>> print(type(c), len(c))
(<type 'unicode'>, 8)
>>>
String support changed substantially in Python3; the details are covered below.

2. Encoding and decoding

First, some background: Unicode is also a mapping between characters and numbers, but the numbers here are called code points. Code points are integers that are conventionally written in hexadecimal.
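As a small illustration of code points, the following Python3 sketch uses the built-in ord() and chr() functions. The sample text and variable names are my own.

```python
# Map characters to their Unicode code points and back again.
text = "A中"

code_points = [ord(ch) for ch in text]          # plain integers
as_hex = [hex(cp) for cp in code_points]        # the same integers in hexadecimal
as_notation = ["U+%04X" % cp for cp in code_points]  # conventional U+XXXX notation

print(code_points)
print(as_hex)
print(as_notation)

# chr() is the inverse of ord(): it turns a code point back into a character.
restored = "".join(chr(cp) for cp in code_points)
print(restored == text)
```

The code point of 中 is 20013 in decimal, which is 0x4E2D in hexadecimal and is usually written U+4E2D.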

The official Python documentation has a description of the relationship between Unicode strings, byte strings and encoding:

A Unicode character string is a sequence of code points. The code points range from 0 to 0x10FFFF (the corresponding decimal number is 1114111). This sequence of code points needs to be represented as a set of bytes (values between 0 and 255) in storage (including memory and physical disks), and the rule for converting Unicode strings into sequences of bytes is called encoding.

Here "encoding" refers not to a character set but to the encoding process and the rules it uses to map Unicode code points to bytes. The mapping does not have to be a simple one-to-one mapping, and an encoding does not have to handle every possible Unicode character. For example:

The rules for converting a Unicode string to ASCII are simple. For each code point:

    • If the code point is < 128, each byte is the same as the value of the code point
    • If the code point is >= 128, the Unicode string cannot be represented in this encoding (in this case, Python raises a UnicodeEncodeError exception)

Converting a Unicode string to UTF-8 uses the following rules:

    • If the code point is < 128, it is represented by the corresponding byte value (the same as for ASCII)
    • If the code point is >= 128, it is converted into a sequence of 2, 3, or 4 bytes, and each byte in the sequence is between 128 and 255
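The UTF-8 rules above can be sketched by hand and compared with Python's built-in codec. This is only an illustrative Python3 sketch: the function name utf8_encode_code_point is my own, the bit layout is the standard UTF-8 one rather than anything stated in this article, and unlike the real codec it does not reject surrogate code points.

```python
def utf8_encode_code_point(cp):
    """Encode a single code point to UTF-8 bytes by hand (illustrative only)."""
    if cp < 0x80:
        # Below 128: one byte, identical to the ASCII byte.
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:
        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range: %r" % cp)


for ch in ["A", "\u00e9", "中", "\U0001F600"]:
    by_hand = utf8_encode_code_point(ord(ch))
    # The hand-written result should match the built-in UTF-8 codec.
    print("U+%04X" % ord(ch), by_hand.hex(), by_hand == ch.encode("utf-8"))

# The ASCII rule: code points >= 128 cannot be encoded at all.
try:
    "中".encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)
```

Every byte produced for a non-ASCII character is 128 or greater, which is why ASCII text embedded in UTF-8 is never confused with part of a multi-byte sequence.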
Brief summary:

    • Encoding (encode): the process and rules for converting a Unicode string (a sequence of code points) into a byte string in a specific character encoding
    • Decoding (decode): the process and rules for converting a byte string in a specific character encoding into the corresponding Unicode string (a sequence of code points)

Both encoding and decoding therefore depend on one essential piece of information: the specific character encoding. The same character usually produces different byte values and a different number of bytes under different character encodings, and the same bytes usually decode to different characters.
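The following Python3 sketch shows why the character encoding must be known when decoding. The sample text and the choice of latin-1 as a deliberately wrong encoding are my own assumptions for illustration.

```python
text = "中国"

utf8_bytes = text.encode("utf-8")   # 6 bytes
gbk_bytes = text.encode("gbk")      # 4 bytes

# Decoding with the encoding that produced the bytes restores the text.
round_trip_ok = (utf8_bytes.decode("utf-8") == text
                 and gbk_bytes.decode("gbk") == text)

# Decoding with a different encoding may "succeed" but yield the wrong characters.
# latin-1 maps every byte to one character, so it never raises.
garbled = utf8_bytes.decode("latin-1")

# Or it may fail outright: these GBK bytes are not valid UTF-8.
try:
    gbk_bytes.decode("utf-8")
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True

print(len(utf8_bytes), len(gbk_bytes))
print(round_trip_ok, garbled != text, len(garbled), wrong_decode_failed)
```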

Third, the default encoding in Python
1. The execution process of Python source code files

We all know that files on disk are stored in binary form, and that text files are stored as bytes in some encoding. The character encoding of a source file is determined by the editor. For example, when we write a Python program in PyCharm, we set the project encoding and file encoding to UTF-8, so the Python code is encoded into the corresponding UTF-8 bytes (the encode step) when it is saved to disk. When the file is executed, the Python interpreter reads the byte string from the file and must convert it to a Unicode string (the decode step) before doing anything else.

As explained above, this conversion (decoding) requires knowing which character encoding the bytes in the file use; only then can the interpreter work out which Unicode code points, and therefore which characters, those bytes stand for. The way to declare that encoding will be familiar to everyone:

# -*- coding: utf-8 -*-
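To see how such a declaration is read, the Python3 standard library exposes tokenize.detect_encoding(), which looks for the declaration in the first two lines of a source file. The sketch below is only an illustration; the helper name declared_encoding and the sample sources are my own.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding Python3 would use to decode this source code."""
    # detect_encoding() expects a readline callable that yields bytes.
    encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


with_declaration = b"#!/usr/bin/env python\n# -*- coding: gbk -*-\nx = 1\n"
without_declaration = b"x = 1\n"

print(declared_encoding(with_declaration))
print(declared_encoding(without_declaration))
```

When no declaration is present, detect_encoding() falls back to UTF-8, the Python3 default for source files.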
 

2. Default encoding

So, if we do not declare the character encoding at the top of the source file, which encoding does the Python interpreter use to convert the bytes it reads into Unicode code points? Just as software ships with default settings, the Python interpreter has a built-in default character encoding for this situation. This is the "default encoding" mentioned at the beginning of the article. The Chinese-character problem in Python can therefore be summarized in one sentence: when bytes cannot be decoded with the default character encoding, a decoding error occurs (a UnicodeDecodeError in general, reported as a SyntaxError when the bytes come from a source file).

Python2 and Python3 use different default encodings. We can obtain the default encoding with sys.getdefaultencoding():

>>> # Python2
>>> import sys
>>> sys.getdefaultencoding()
'ascii'

>>> # Python3
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
Therefore, in Python2, when the interpreter reads the bytes of Chinese characters and tries to decode them, it first checks whether the top of the source file declares the character encoding of the bytes saved in that file. If nothing is declared, it decodes with the default encoding, ASCII, and decoding fails with the following error:

SyntaxError: Non-ASCII character '\xc4' in file xxx.py on line 11, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
For Python3 the process is the same, except that the interpreter uses UTF-8 as the default encoding. That does not mean Chinese text always works, however. For example, when we develop on Windows and the project and source files use the system default GBK encoding, the source files are saved to disk as GBK-encoded bytes. When the Python3 interpreter runs such a file and tries to decode it as UTF-8, decoding also fails, with the following error:

SyntaxError: Non-UTF-8 code starting with '\xc4' in file xxx.py on line 11, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
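The Python3 failure can be reproduced without creating a file, because compile() also accepts source code as bytes. This is a minimal sketch under my own assumptions: the helper name try_compile, the file name example.py and the sample source are invented for illustration, and the exact wording of the error message may differ from the one shown above.

```python
# Source code containing Chinese characters, saved as GBK bytes.
source_text = "s = '你好'\n"
gbk_source = source_text.encode("gbk")


def try_compile(source_bytes):
    """Compile and run source bytes; return (value of s, error or None)."""
    try:
        code = compile(source_bytes, "example.py", "exec")
    except SyntaxError as exc:
        # Bytes that cannot be decoded surface as a SyntaxError.
        return None, exc
    namespace = {}
    exec(code, namespace)
    return namespace["s"], None


# No declaration: Python3 assumes UTF-8, and the GBK bytes cannot be decoded.
value_plain, error_plain = try_compile(gbk_source)

# With a declaration, the same bytes are decoded as GBK and the code runs.
value_declared, error_declared = try_compile(b"# -*- coding: gbk -*-\n" + gbk_source)

print(value_plain, type(error_plain).__name__)
print(value_declared, error_declared)
```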
3. Best practices

    • After creating a project, first confirm that the project's character encoding is set to UTF-8
    • For compatibility with both Python2 and Python3, declare the character encoding at the top of each source file: # -*- coding: utf-8 -*-
Fourth, Python2 and Python3 support for strings
In fact, Python3 did not merely change the default encoding; it reimplemented strings with built-in Unicode support. In this respect Python is now as good as Java. Let's look at how string support differs between Python2 and Python3:

Python2

The support for strings in Python2 is provided by the following three classes

class basestring(object)
    class str(basestring)
    class unicode(basestring)
Executing help(str) and help(bytes) in Python2 shows the definition of the str class in both cases, which confirms that str is a byte string in Python2 and that the unicode object is the real string.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

a = '你好'
b = u'你好'

print(type(a), len(a))
print(type(b), len(b))
Output result:

(<type 'str'>, 6)
(<type 'unicode'>, 2)
Python3

String support in Python3 has been simplified at the class level: the unicode class was removed and a bytes class was added. On the surface, it looks as if str and unicode were merged.

class bytes(object)
class str(object)
In fact, Python3 recognized the earlier mistake and began to distinguish clearly between strings and bytes. str in Python3 is a real string, and bytes are represented by a separate bytes class. In other words, string literals in Python3 are Unicode strings by default, which builds in Unicode support and reduces the burden on programmers.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

a = '你好'
b = u'你好'
c = '你好'.encode('gbk')

print(type(a), len(a))
print(type(b), len(b))
print(type(c), len(c))
Output result:

<class 'str'> 2
<class 'str'> 2
<class 'bytes'> 4
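The strict separation between str and bytes in Python3 can also be seen from what the language refuses to do. The following sketch is my own illustration; the variable names are arbitrary.

```python
text = "你好"                 # str: a sequence of Unicode characters
data = text.encode("utf-8")   # bytes: a sequence of byte values

# The two types are never mixed implicitly.
try:
    text + data
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# A str never compares equal to bytes, even for pure ASCII content.
ascii_equal = ("a" == b"a")

# Indexing shows the difference: str yields characters, bytes yields integers.
first_char = text[0]
first_byte = data[0]

print(mixing_allowed, ascii_equal, first_char, first_byte)
print(data.decode("utf-8") == text)
```

Moving between the two types always requires an explicit encode() or decode() call, which is exactly the distinction Python2 blurred.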
Fifth, character encoding conversion
As mentioned above, a Unicode string can be converted to and from bytes in any character encoding.

This naturally raises a question: can bytes in different character encodings be converted into one another by way of Unicode? The answer is yes.

The character encoding conversion process of strings in Python2 is:

Byte string -> decode('original character encoding') -> Unicode string -> encode('new character encoding') -> byte string

#!/usr/bin/env python
# -*- coding: utf-8 -*-


utf_8_a = '我爱中国'
gbk_a = utf_8_a.decode('utf-8').encode('gbk')
print(gbk_a.decode('gbk'))
Output result:

我爱中国
Strings defined in Python3 are Unicode by default, so there is no need to decode first; they can be encoded directly into the new character encoding:

String -> encode('new character encoding') -> byte string

#!/usr/bin/env python
# -*- coding: utf-8 -*-


utf_8_a = '我爱中国'
gbk_a = utf_8_a.encode('gbk')
print(gbk_a.decode('gbk'))
Output result:

我爱中国
Finally, note that Unicode is not the Youdao dictionary or Google Translate; it cannot translate Chinese into English. A correct character encoding conversion changes only the byte representation of a character, never the character itself, so not every conversion between character encodings is meaningful. What does this mean? For example, after GBK-encoded "中国" is converted to UTF-8, its representation grows from 4 bytes to 6 bytes, but the text should still read "中国"; it should not turn into some other text.
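The GBK to UTF-8 example can be checked with a small Python3 sketch. The helper name transcode is my own; it simply chains the decode and encode steps described in this section.

```python
def transcode(data, source_encoding, target_encoding):
    """Convert bytes from one character encoding to another by way of Unicode."""
    return data.decode(source_encoding).encode(target_encoding)


gbk_bytes = "中国".encode("gbk")
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")

# The byte representation changes (4 bytes become 6) ...
print(len(gbk_bytes), len(utf8_bytes))
# ... but the characters themselves do not.
print(gbk_bytes.decode("gbk") == utf8_bytes.decode("utf-8"))
```

The same helper could be applied to data read from a file opened in binary mode, as long as the source encoding is actually known.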
