This chapter describes how to write and work with Python strings. With the vexing problem of character encoding sorted out, we can now study them.
In Python 3, strings are Unicode, which means a Python string can hold text in many languages, for example:
>>> print('包含中文的str')
包含中文的str
For a single character, Python provides the ord() function to get the character's integer code point, and the chr() function to convert a code point back to the corresponding character:
>>> ord('A')
65
>>> ord('中')
20013
>>> chr(66)
'B'
>>> chr(25991)
'文'
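As a small sketch of how the two functions mirror each other (the variable names are only illustrative), a character can be converted to its code point and back again:

```python
# Round-trip between characters and their Unicode code points.
code_a = ord('A')         # integer code point of 'A'
code_zhong = ord('中')    # works for non-ASCII characters too

char_b = chr(66)          # character for code point 66
char_wen = chr(25991)     # character for code point 25991

print(code_a, code_zhong)        # 65 20013
print(char_b, char_wen)          # B 文
print(chr(ord('中')) == '中')    # True
```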
If you know a character's integer code point, you can also write the str using hexadecimal \u escapes:
>>> '\u4e2d\u6587'
'中文'
The two formulations are completely equivalent.
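A quick way to check this equivalence is to compare the two spellings directly. The sketch below uses illustrative variable names and also prints the code points behind each character:

```python
escaped = '\u4e2d\u6587'   # written with hexadecimal code points
literal = '中文'            # written directly

print(escaped == literal)               # True
print([hex(ord(c)) for c in literal])   # ['0x4e2d', '0x6587']
```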
Python's string type is str. In memory it is represented as Unicode, and one character may occupy several bytes. To send a str over the network or save it to disk, you need to convert it to bytes, a type whose unit is the byte.
Python writes data of type bytes in single or double quotation marks with a b prefix:
x = b'ABC'
Note the difference between 'ABC' and b'ABC': the former is a str, while the latter, although it displays the same content, is bytes, in which each character occupies exactly one byte.
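The difference can be observed directly. This is a minimal sketch with illustrative variable names, showing that indexing bytes yields integers and that a str's length counts characters while the length of bytes counts bytes:

```python
text = 'ABC'     # str: a sequence of Unicode characters
data = b'ABC'    # bytes: a sequence of integers in the range 0-255

print(type(text).__name__, type(data).__name__)   # str bytes
print(text[0], data[0])                           # A 65

chinese = '中文'
print(len(chinese), len(chinese.encode('utf-8'))) # 2 6
```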
A str, which is represented in Unicode, can be encoded into bytes in a specified encoding with the encode() method, for example:
>>> 'abc'.encode('ascii')
b'abc'
>>> '中文'.encode('utf-8')
b'\xe4\xb8\xad\xe6\x96\x87'
>>> '中文'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
A str containing only English text can be ASCII-encoded to bytes with the same content, and a str containing Chinese can be encoded to bytes with UTF-8. A str containing Chinese cannot be ASCII-encoded, because Chinese code points fall outside the ASCII range, so Python raises an error.
In bytes, any byte that cannot be displayed as an ASCII character is shown in the form \x##.
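In a program, as opposed to the interactive prompt, the failed encoding surfaces as a UnicodeEncodeError exception that can be caught. The sketch below wraps encode() in a hypothetical helper, try_encode, which is not part of Python and is introduced only for illustration:

```python
def try_encode(text, encoding):
    """Hypothetical helper: return the encoded bytes, or None on failure."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

print(try_encode('abc', 'ascii'))     # b'abc'
print(try_encode('中文', 'utf-8'))    # b'\xe4\xb8\xad\xe6\x96\x87'
print(try_encode('中文', 'ascii'))    # None

# Bytes outside the printable ASCII range are displayed as \x## escapes.
print(bytes([97, 255]))               # b'a\xff'
```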
Conversely, if we read a byte stream from the network or from disk, the data we get is bytes. To turn bytes into a str, use the decode() method:
>>> b'abc'.decode('ascii')
'abc'
>>> b'\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
'中文'
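Encoding and decoding with the same codec are inverse operations. A minimal round-trip sketch, with illustrative variable names:

```python
original = '中文'
encoded = original.encode('utf-8')   # str -> bytes
decoded = encoded.decode('utf-8')    # bytes -> str

print(encoded)               # b'\xe4\xb8\xad\xe6\x96\x87'
print(decoded == original)   # True
```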
If the bytes contain a sequence that cannot be decoded, the decode() method raises an error:
>>> b'\xe4\xb8\xad\xff'.decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 3: invalid start byte
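One possible way to deal with such input, shown here only as a sketch and not as the recommended approach, is to catch the UnicodeDecodeError and use its start attribute to find where decoding failed. The variable names are illustrative:

```python
bad = b'\xe4\xb8\xad\xff'   # valid UTF-8 for one character, then an invalid byte

failed_at = None
try:
    text = bad.decode('utf-8')
except UnicodeDecodeError as exc:
    failed_at = exc.start                    # index of the offending byte
    text = bad[:failed_at].decode('utf-8')   # decode only the valid prefix

print(failed_at, text)   # 3 中
```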
That covers the basics of Python strings and their encoding.