Another look at strings and character encoding in Python (recommended)

Source: Internet
Author: User

Contents of this article:

1. Foreword

2. Related concepts

3. Default encoding in Python

4. String support in Python 2 and Python 3

5. Character encoding conversion

I. Foreword

Character encoding in Python is a well-worn topic, and plenty of articles have been written about it. Some merely repeat what others have said; others go quite deep. I recently watched a teaching video from a well-known training institution that covered the subject yet again, and the explanation was still unsatisfactory, so I decided to write this article. On the one hand it lets me organize what I know; on the other hand I hope it helps others.

"The default encoding of Python 2 is ASCII, which cannot recognize Chinese characters, so the character encoding must be specified explicitly; the default encoding of Python 3 is Unicode, which can recognize Chinese characters."

I'm sure you have seen this explanation in many articles about handling Chinese in Python, and I'm sure it seemed clear the first time you read it. But after a while, having run into related problems again and again, you start to feel that your understanding is not that clear after all. Once we know what the default encoding is actually used for, the meaning of that sentence becomes much clearer.

II. Related Concepts

1. Characters and bytes

A character is not the same thing as a byte. A character is a symbol that humans can recognize; to be kept in computer storage, it has to be represented by bytes, which the computer can recognize. A character usually has several possible representations, and different representations use different numbers of bytes. These representations are what we call character encodings: for example, the letters A-Z can be represented in ASCII (one byte each), in UTF-16 (two bytes each), or in UTF-8 (one byte each). The job of a character encoding is to convert human-readable characters into machine-readable bytes, and back again.
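As a rough illustration of the byte counts mentioned above, the following Python 3 sketch encodes a single character with several codecs and counts the resulting bytes. The helper name byte_length is hypothetical, and utf-16-le is used here as the two-byte Unicode form (without a byte order mark).

```python
def byte_length(char, encoding):
    """Return how many bytes `char` occupies in `encoding` (illustrative helper)."""
    return len(char.encode(encoding))

# The letter A: one byte in ASCII and UTF-8, two bytes in UTF-16.
for encoding in ("ascii", "utf-16-le", "utf-8"):
    print("A", encoding, byte_length("A", encoding))

# A Chinese character: the count depends on the encoding.
for encoding in ("gbk", "utf-16-le", "utf-8"):
    print("中", encoding, byte_length("中", encoding))
```

The character stays the same in every case; only its byte representation changes with the encoding.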

A Unicode string is the real string, while a byte string is what you get after expressing it in a character encoding such as ASCII, UTF-8, or GBK. That is why the official Python documentation so often uses the phrase "Unicode string" and talks about "translating a Unicode string into a sequence of bytes".

We write our code in files, and characters are stored in files as bytes, so it is understandable that a string defined in a source file might be treated as a byte string. But what we want is a string, not a byte string. A good programming language should distinguish strictly between the two and provide well-designed, complete support for both. Java does this so well that, before I learned Python and PHP, I had never thought about these problems, which programmers should not have to deal with in the first place. Unfortunately, many programming languages blur the line between "string" and "byte string" and use byte strings as if they were strings; PHP and Python 2 both belong to this group. The best way to illustrate the problem is to take the length of a string that contains Chinese characters:

    • The length of a string should be the number of characters in it, whether they are Chinese or English
    • The length of the byte string corresponding to that string depends on the character encoding used in the encoding (encode) step (for example, in UTF-8 a Chinese character takes 3 bytes; in GBK a Chinese character takes 2 bytes)

Note: the character encoding of the Windows cmd terminal defaults to GBK, so each Chinese character typed into cmd is represented by two bytes.

>>> # Python2
>>> a = 'Hello,中国'  # byte string; length is the number of bytes = len('Hello,') + len('中国') = 6 + 2*2 = 10
>>> b = u'Hello,中国'  # string; length is the number of characters = len('Hello,') + len('中国') = 6 + 2 = 8
>>> c = unicode(a, 'gbk')  # b is really shorthand for c: both decode a GBK-encoded byte string into a Unicode string
>>>
>>> print(type(a), len(a))
(<type 'str'>, 10)
>>> print(type(b), len(b))
(<type 'unicode'>, 8)
>>> print(type(c), len(c))
(<type 'unicode'>, 8)
>>>

String support changed considerably in Python 3; the details are described later.

2. Encoding and decoding

First, some background: Unicode is itself a character encoding in the sense that it maps characters to numbers. These numbers are called code points, and they are conventionally written in hexadecimal.
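To make the idea of a code point concrete, here is a small Python 3 sketch built on the built-in ord() and chr() functions. The helper code_point_label is a hypothetical name introduced only for this illustration.

```python
def code_point_label(char):
    """Format a character's code point in the usual U+XXXX notation."""
    return "U+%04X" % ord(char)

# ord() maps a character to its code point (an integer).
for char in "A中":
    print(char, ord(char), code_point_label(char))

# chr() goes the other way, from code point back to character.
print(chr(0x4E2D))
```

A code point is only a number; no bytes are involved until the string is encoded.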

The official Python documentation describes the relationship between Unicode strings, byte strings, and encodings as follows:

A Unicode string is a sequence of code points, which are numbers from 0 to 0x10FFFF (1,114,111 in decimal). This sequence of code points needs to be represented as a set of bytes (values from 0 to 255) in storage (both memory and disk). The rules for translating a Unicode string into a sequence of bytes are called an encoding.

The word "encoding" here does not refer to a character encoding as such, but to the encoding process and the rules it uses to map Unicode code points to bytes. The mapping does not have to be a simple one-to-one mapping, and an encoding does not have to handle every possible Unicode character. For example:

The rules for converting a Unicode string to ASCII are simple. For each code point:

    • If the code point is < 128, each byte is the same as the value of the code point
    • If the code point is >= 128, the Unicode string cannot be represented in this encoding (Python raises a UnicodeEncodeError exception in this case)
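The two ASCII rules above can be written out by hand. The following Python 3 sketch is only an illustration of the rule, not how the built-in codec is implemented; encode_ascii is a hypothetical helper name.

```python
def encode_ascii(text):
    """Hand-written version of the ASCII rule (illustrative only)."""
    out = bytearray()
    for position, char in enumerate(text):
        code_point = ord(char)
        if code_point >= 128:
            # Same exception type that str.encode('ascii') raises.
            raise UnicodeEncodeError(
                "ascii", text, position, position + 1,
                "ordinal not in range(128)")
        # Code point < 128: the byte value equals the code point.
        out.append(code_point)
    return bytes(out)

for sample in ("Hello", "Hello,中国"):
    try:
        print(sample, encode_ascii(sample))
    except UnicodeEncodeError as exc:
        print(sample, "cannot be encoded:", exc.reason)
```

For text that contains only code points below 128, the result should match the built-in text.encode('ascii').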

The rules for converting a Unicode string to UTF-8 are:

    • If the code point is < 128, it is represented by the corresponding byte value (the same byte as in ASCII)
    • If the code point is >= 128, it is turned into a sequence of 2, 3, or 4 bytes, where each byte of the sequence is between 128 and 255
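The UTF-8 rules above can also be sketched by hand. The bit layout used below comes from the UTF-8 definition rather than from this article, encode_utf8 is a hypothetical helper name, and the sketch assumes the input contains no lone surrogate code points (which the built-in codec rejects).

```python
def encode_utf8(text):
    """Hand-written UTF-8 encoder for illustration (no surrogate checks)."""
    out = bytearray()
    for char in text:
        cp = ord(char)
        if cp < 0x80:
            # 1 byte: same value as ASCII.
            out.append(cp)
        elif cp < 0x800:
            # 2 bytes: 110xxxxx 10xxxxxx
            out += bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        elif cp < 0x10000:
            # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        else:
            # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
    return bytes(out)

sample = "Hello,中国"
print(encode_utf8(sample))
print(encode_utf8(sample) == sample.encode("utf-8"))
```

Every byte produced for a code point of 128 or above has its high bit set, which is why those bytes fall between 128 and 255.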

In short:

    • Encoding (encode): the process and rules for converting a Unicode string (the code points in it) into a byte string in a specific character encoding
    • Decoding (decode): the process and rules for converting a byte string in a specific character encoding into the corresponding Unicode string (the code points in it)

As we can see, both encoding and decoding depend on one essential ingredient: the specific character encoding. In most cases the same character yields different byte values and a different number of bytes under different character encodings, and the same bytes decode to different characters (or fail to decode) under different encodings.
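A short Python 3 sketch of why the character encoding matters in both directions: the same text produces different bytes under GBK and UTF-8, and bytes can only be decoded reliably with the encoding that produced them. The helper can_decode is a hypothetical name.

```python
def can_decode(data, encoding):
    """Return True if `data` is valid in `encoding`, False otherwise."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

text = "中国"
gbk_bytes = text.encode("gbk")
utf8_bytes = text.encode("utf-8")

# Same characters, different byte values and different lengths.
print(gbk_bytes, len(gbk_bytes))
print(utf8_bytes, len(utf8_bytes))

# Decoding works with the matching encoding and fails with the wrong one.
print(can_decode(gbk_bytes, "gbk"))
print(can_decode(gbk_bytes, "utf-8"))
```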

III. The default encoding in Python

1. How a Python source file is executed

As we all know, files on disk are stored in binary form, and text files are stored as bytes in some particular encoding. The character encoding of a source file is determined by the editor; for example, when we write a Python program in PyCharm, we set both the project encoding and the file encoding to UTF-8. When the Python code is saved, it is converted to UTF-8 bytes (the encode step) and written to disk. When the file is executed, the Python interpreter reads the byte string from the file and must convert it to a Unicode string (the decode step) before doing anything else.

As explained above, this conversion (decoding) requires us to state which character encoding the bytes in the file use, so that the interpreter can find the corresponding Unicode code points. The way to declare the encoding is familiar to everyone:

# -*- coding: utf-8 -*-
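To see how such a declaration is read, the Python 3 standard library exposes tokenize.detect_encoding(), which inspects the first lines of a source file's raw bytes. The sketch below feeds it in-memory bytes instead of a real file; declared_encoding is a hypothetical helper name, and the sample sources are made up for the illustration.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the encoding Python 3 would use to decode this source."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding

with_cookie = b"# -*- coding: gbk -*-\nprint('hi')\n"
without_cookie = b"print('hi')\n"

print(declared_encoding(with_cookie))
# With no declaration, Python 3 falls back to its default source encoding.
print(declared_encoding(without_cookie))
```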

2. Default encoding

So if we do not declare a character encoding at the top of the source file, which encoding does the Python interpreter use to convert the bytes it reads from the file into Unicode code points? Just as software we configure comes with many default options, the Python interpreter has a built-in default character encoding for this situation, and that is the "default encoding" mentioned at the beginning of this article. The so-called "Chinese character problem" in Python can therefore be summed up in one sentence: when bytes cannot be converted using the default character encoding, a decoding error (UnicodeDecodeError) occurs.

The Python 2 and Python 3 interpreters use different default encodings, which we can obtain with sys.getdefaultencoding():

>>> # Python2
>>> import sys
>>> sys.getdefaultencoding()
'ascii'

>>> # Python3
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'

So in Python 2, when the interpreter reads the bytes of Chinese characters and tries to decode them, it first checks whether the header of the current source file declares the character encoding of the bytes stored in that file. If nothing is declared, it decodes with the default character encoding, "ASCII", and decoding fails with the following error:

SyntaxError: Non-ASCII character '\xc4' in file xxx.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

The process is the same in Python 3, except that the Python 3 interpreter uses "UTF-8" as the default encoding. That does not mean Chinese text will always work, though. For example, if we develop on Windows and the project and source files use the default GBK encoding, the source file is saved to disk as GBK-encoded bytes. When the Python 3 interpreter executes the file and tries to decode it as UTF-8, decoding also fails, with the following error:

SyntaxError: Non-UTF-8 code starting with '\xc4' in file xxx.py on line, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
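The Python 3 failure described above can be reproduced without touching the disk, because the built-in compile() also accepts raw source bytes. This is a sketch under that assumption; try_compile is a hypothetical helper, and the exact error text may differ from the message quoted above since the source is not read from a file.

```python
def try_compile(source_bytes):
    """Compile raw source bytes, returning 'ok' or the error type name."""
    try:
        compile(source_bytes, "xxx.py", "exec")
        return "ok"
    except (SyntaxError, UnicodeDecodeError) as exc:
        # CPython normally reports this as SyntaxError; UnicodeDecodeError
        # is caught as well purely as a precaution.
        return type(exc).__name__

# Source saved as GBK bytes, with and without an encoding declaration.
gbk_source = "s = '中国'\n".encode("gbk")
cookie = b"# -*- coding: gbk -*-\n"

print(try_compile(gbk_source))           # no declaration: decoded as UTF-8
print(try_compile(cookie + gbk_source))  # declaration matches the bytes
```

Declaring the encoding that the bytes were actually saved in is what makes the second call succeed.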

3. Best practices

    • After creating a project, confirm that the project's character encoding is set to UTF-8
    • To stay compatible with both Python 2 and Python 3, declare the character encoding at the top of each source file: # -*- coding: utf-8 -*-

IV. String support in Python 2 and Python 3

In fact, Python 3's improvement in string support goes beyond changing the default encoding: strings were reimplemented with built-in Unicode support, and in this respect Python is now as good as Java. Let's look at how string support differs between Python 2 and Python 3:

Python2

String support in Python 2 is provided by the following three classes:

class basestring(object)
  class str(basestring)
  class unicode(basestring)

Running help(str) and help(bytes) shows the definition of the str class in both cases, which means that str in Python 2 is a byte string, and that the unicode object is what corresponds to a real string.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

a = '你好'   # byte string: the UTF-8 bytes of the two characters
b = u'你好'  # Unicode string

print(type(a), len(a))
print(type(b), len(b))

Output:

(<type 'str'>, 6)

(<type 'unicode'>, 2)

Python3

In Python 3, string support was simplified at the class level: the unicode class was removed and a bytes class was added. On the surface, you could say that str and unicode were merged in Python 3.

class bytes(object)
class str(object)

In reality, Python 3 recognized the earlier mistake and began to distinguish explicitly between strings and bytes. So str in Python 3 is a real string, and bytes are represented by the separate bytes class. In other words, a string literal in Python 3 is a real (Unicode) string by default, which gives built-in Unicode support and eases the programmer's burden in string handling.
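One practical consequence of this separation, sketched in Python 3: str and bytes cannot be mixed implicitly, so the conversion between them always has to be spelled out with encode() or decode(). The helper try_concat is a hypothetical name used only here.

```python
def try_concat(left, right):
    """Return left + right, or None if Python refuses to mix the types."""
    try:
        return left + right
    except TypeError:
        return None

text = "你好"                # str: a sequence of characters
data = text.encode("utf-8")  # bytes: a sequence of byte values

print(type(text).__name__, len(text))
print(type(data).__name__, len(data))

# Mixing str and bytes is rejected; decoding first makes it work.
print(try_concat(text, data))
print(try_concat(text, data.decode("utf-8")))
```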

#!/usr/bin/env python
# -*- coding: utf-8 -*-

a = '你好'
b = u'你好'               # the u prefix is accepted but redundant in Python 3
c = '你好'.encode('gbk')  # bytes: 2 bytes per character in GBK

print(type(a), len(a))
print(type(b), len(b))
print(type(c), len(c))

Output:

<class 'str'> 2

<class 'str'> 2

<class 'bytes'> 4

V. Character encoding conversion

As mentioned above, a Unicode string can be converted to and from a byte string in any character encoding, as shown in the figure:

This naturally raises a question: can byte strings in different character encodings be converted to one another by going through Unicode? The answer is yes.

In Python 2, the character encoding conversion process for a string is:

byte string --> decode('original character encoding') --> Unicode string --> encode('new character encoding') --> byte string

#!/usr/bin/env python
# -*- coding: utf-8 -*-


utf_8_a = '我爱中国'  # "I love China"; a UTF-8 byte string, as declared in the file header
gbk_a = utf_8_a.decode('utf-8').encode('gbk')
print(gbk_a.decode('gbk'))

Output:

我爱中国

A string defined in Python 3 is Unicode by default, so there is no need to decode it first; it can be encoded directly into the new character encoding:

string --> encode('new character encoding') --> byte string

#!/usr/bin/env python
# -*- coding: utf-8 -*-


utf_8_a = '我爱中国'  # "I love China"; already a Unicode string in Python 3
gbk_a = utf_8_a.encode('gbk')
print(gbk_a.decode('gbk'))

Output:

我爱中国
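When the starting point in Python 3 is a byte string rather than a str (for example, data read from a file in binary mode), the full decode-then-encode route still applies. A minimal sketch, with transcode as a hypothetical helper name:

```python
def transcode(data, source_encoding, target_encoding):
    """Convert bytes from one character encoding to another via str."""
    return data.decode(source_encoding).encode(target_encoding)

utf8_bytes = "我爱中国".encode("utf-8")
gbk_bytes = transcode(utf8_bytes, "utf-8", "gbk")

# The byte counts differ, but the decoded text is the same.
print(len(utf8_bytes), len(gbk_bytes))
print(gbk_bytes.decode("gbk"))
```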

One last point: Unicode is not a dictionary, nor is it Google Translate; it does not translate Chinese into English. A correct character encoding conversion only changes the byte representation of the same characters; the characters themselves must not change, so not every conversion between character encodings is meaningful. What does that mean? For example, when GBK-encoded "中国" (China) is converted to UTF-8, it merely goes from 4 bytes to 6 bytes; the characters it represents should still be "中国", not some other text such as "hello" or "Chinese".
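The point can be checked directly in Python 3. The sketch below converts GBK bytes to UTF-8, confirms that the characters are unchanged, and then shows two conversions that are not meaningful: a target encoding that cannot represent the characters, and a decode with the wrong encoding that silently yields different characters. The helper can_encode is a hypothetical name.

```python
def can_encode(text, encoding):
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

text = "中国"
gbk_bytes = text.encode("gbk")
utf8_bytes = gbk_bytes.decode("gbk").encode("utf-8")

print(len(gbk_bytes), len(utf8_bytes))     # 4 6
print(utf8_bytes.decode("utf-8") == text)  # True: same characters

# ASCII has no representation for these characters.
print(can_encode(text, "ascii"))           # False

# Decoding GBK bytes as Latin-1 "succeeds" but yields different characters.
mojibake = gbk_bytes.decode("latin-1")
print(mojibake == text)                    # False
```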

The earlier sections spent a lot of space on concepts and theory, and the later ones focused on practice. I hope this helps others.
