It is said that every Python developer has been tripped up by character encoding. The most common errors are UnicodeEncodeError and UnicodeDecodeError. You think you have solved one, and then the same error shows up somewhere else. Whether to use decode or encode to convert between str and unicode is also hard to remember and easy to mix up. So where does the problem lie?
In order to understand the problem, I decided to analyze the composition of the Python string and the details of the character encoding.
Bytes and characters
All data stored by a computer, whether text, pictures, video, audio, or software, is ultimately a sequence of bytes made up of 0s and 1s. One byte equals 8 bits.
A character is a symbol: a Chinese character, an English letter, a digit, or a punctuation mark can each be called a character.
Bytes are convenient for storage and network transmission, while characters are meant for display and reading. For example, the character "p" is stored on disk as the binary data 01110000, which occupies one byte.
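As a small illustration of this byte-level view, the sketch below prints the bit pattern of a one-byte character. It uses only the built-ins `ord` and `format`; the helper name `bits` is made up for this example, and the code is written to run on both Python 2.7 and Python 3.

```python
def bits(byte_value):
    """Return the 8-bit binary representation of an integer in 0..255."""
    return format(byte_value, "08b")

code_point = ord("p")        # numeric value of the character
pattern = bits(code_point)   # its bit pattern as a string
print(pattern)               # 01110000
```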
Encoding and decoding
When we open a text file in an editor we see characters, but what is finally saved on disk is a sequence of binary bytes. The conversion from characters to bytes is called encoding (encode), and the reverse is called decoding (decode); the two are inverse operations. Encoding is for storage and transmission; decoding is for display and reading.
For example, the character "p" is encoded and saved to disk as the binary byte sequence 01110000, which occupies one byte. The character "禅" (Zen), by contrast, may be stored as the 3 bytes "11100111 10100110 10000101". Why? We will come back to this later.
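The round trip between characters and bytes can be sketched as follows. This is only an illustration, not the article's own code: it writes the character from the examples as the escape `u"\u7985"` so that it does not depend on the source file or terminal encoding, and it should run on both Python 2.7 and Python 3.

```python
text = u"\u7985"                    # the character used in this article's examples

utf8_bytes = text.encode("utf-8")   # character -> bytes (encoding)
gbk_bytes = text.encode("gbk")      # same character, different encoding

# bytearray yields integers on both Python 2 and Python 3
utf8_bits = " ".join(format(b, "08b") for b in bytearray(utf8_bytes))
print(utf8_bits)                    # 11100111 10100110 10000101
print(len(utf8_bytes))              # 3
print(len(gbk_bytes))               # 2

# bytes -> character (decoding) with the same codec restores the original
round_trip_utf8 = utf8_bytes.decode("utf-8")
round_trip_gbk = gbk_bytes.decode("gbk")
```

The same character yields byte sequences of different lengths depending on the encoding, yet decoding with the matching codec always gives the original character back.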
Why is encoding in Python so painful? Of course, developers are not to blame.
This is because Python 2 uses ASCII as its default encoding, and ASCII cannot handle Chinese. Why not UTF-8? Guido van Rossum wrote the first lines of Python in the winter of 1989 and released the first version as open source in February 1991, while Unicode was not published until October 1991. In other words, UTF-8 did not yet exist when Python was created. That is the first reason.
Python 2 also splits strings into two types, unicode and str, which confuses developers. That is the second reason. Python 3 reworked strings thoroughly and kept a single text string type; more on that later.
str and unicode
Python 2 divides strings into two types, unicode and str. In essence, str is a sequence of binary bytes. In the example below, the str value "禅" prints as the hexadecimal \xec\xf8 (its GBK encoding, as entered in a GBK terminal), and the corresponding binary byte sequence is "11101100 11111000".
>>> s = '禅'
>>> s
'\xec\xf8'
>>> type(s)
<type 'str'>
The unicode value u"禅" corresponds to the Unicode code point u'\u7985'.
>>> u = u"禅"
>>> u
u'\u7985'
>>> type(u)
<type 'unicode'>
To save Unicode text to a file or send it over the network, it must first be encoded into the str type. Python provides the encode method to convert from unicode to str, and the decode method for the reverse.
Encode
>>> u = u"禅"
>>> u
u'\u7985'
>>> u.encode("utf-8")
'\xe7\xa6\x85'
Decode
>>> s = "禅"  # entered in a UTF-8 terminal, so s holds '\xe7\xa6\x85'
>>> s.decode("utf-8")
u'\u7985'
Many beginners cannot remember whether to use encode or decode when converting between str and unicode. It helps to remember that str is essentially a string of binary data, while unicode is a sequence of characters (symbols). Encoding is the process of converting characters into binary data, so converting unicode to str uses the encode method, and the reverse uses the decode method.
"Encoding always takes a Unicode string and returns a bytes sequence, and decoding always takes a bytes sequence and returns a Unicode string."
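One way to make this rule hard to forget is to wrap it in two small helpers. The names `to_bytes` and `to_text` are hypothetical (they are not part of Python or of the original article), and the default of UTF-8 is an assumption chosen for the example. The sketch is written so that it runs on Python 2.7 and Python 3 alike.

```python
text_type = type(u"")    # unicode on Python 2, str on Python 3
bytes_type = type(b"")   # str on Python 2, bytes on Python 3

def to_bytes(value, encoding="utf-8"):
    """Encode text to bytes; return bytes unchanged."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value

def to_text(value, encoding="utf-8"):
    """Decode bytes to text; return text unchanged."""
    if isinstance(value, bytes_type):
        return value.decode(encoding)
    return value

zen = u"\u7985"
raw = to_bytes(zen)       # text -> bytes: encode
restored = to_text(raw)   # bytes -> text: decode
```

Calling these helpers at the edges of a program (when reading input and when writing output) keeps everything in between as text, so no implicit conversion is ever needed.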
Now that we understand the relationship between str and unicode, let's see when UnicodeEncodeError and UnicodeDecodeError occur.
UnicodeEncodeError
Unicodeencodeerror occurs when a Unicode string is converted to a sequence of str bytes, consider an example of saving a string of Unicode strings to a file
#-*-Coding:utf-8-*-def Main (): name = U ' python zen ' f = open ("Output.txt", "W") F.write (name)
Error log
Unicodeencodeerror: ' ASCII ' codec can ' t encode characters in position 6-7: Ordinal not in range (128)
Why does UnicodeEncodeError appear?
When the write method is called, Python first checks the type of the string. If it is str, it is written directly to the file with no encoding needed, because a str is already a sequence of binary bytes.
If the string is unicode, Python first calls encode to convert it to a str byte sequence before saving it to the file, and this implicit encode uses Python's default encoding, ASCII.
Equivalent:
>>> U "The Zen of Python". Encode ("ASCII")
However, we know that the ASCII character set contains only 128 Latin letters, not including Chinese characters, so there is an ' ASCII ' codec can ' t encode characters error. To use encode correctly, you must specify a character set that contains Chinese characters, such as: UTF-8, GBK.
>>> U "The Zen of Python". Encode ("Utf-8") ' python\xe4\xb9\x8b\xe7\xa6\x85 ' >>> u "python Zen". Encode ("GBK") ' Python\xd6\xae\xec\xf8 '
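The failing and the working encodings can be compared side by side. The sketch below is an illustration that runs on both Python 2.7 and Python 3; it spells the Chinese characters as escapes so the result does not depend on the terminal, and it reads the standard `encoding`, `start`, and `end` attributes of the exception to show which characters ASCII rejected.

```python
name = u"Python\u4e4b\u7985"

try:
    name.encode("ascii")
except UnicodeEncodeError as exc:
    # start is the first offending index, end is one past the last
    failed = (exc.encoding, exc.start, exc.end)
else:
    failed = None

print(failed)

# Encodings that cover Chinese characters succeed
utf8_name = name.encode("utf-8")
gbk_name = name.encode("gbk")
```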
Therefore, in order to correctly write Unicode strings to the file, the string should be pre-UTF-8 or GBK-encoded conversion.
def main (): name = U ' python zen ' name = Name.encode (' utf-8 ') with open ("Output.txt", "W") as F: F.write ( Name
Of course, there is more than one way to write a unicode string correctly, but the principle is the same, so the others are not covered here. Writing strings to a database or sending them over the network follows the same principle.
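One such alternative, shown here only as a sketch and not as the article's method, is to let the file object do the encoding. `io.open` accepts an `encoding` argument on Python 2.6+ and Python 3. The temporary directory is an assumption made so the example does not overwrite a real output.txt.

```python
import io
import os
import tempfile

name = u"Python\u4e4b\u7985"

# A temporary path stands in for the article's output.txt
path = os.path.join(tempfile.mkdtemp(), "output.txt")

# io.open encodes text on write, so no manual encode() call is needed
with io.open(path, "w", encoding="utf-8") as f:
    f.write(name)

# Reading in binary mode shows the bytes that actually reached the disk
with io.open(path, "rb") as f:
    stored = f.read()

# Reading in text mode with the same encoding restores the characters
with io.open(path, "r", encoding="utf-8") as f:
    loaded = f.read()
```

The principle is unchanged: the text is still encoded to bytes before it is stored, only the call to encode happens inside the file object.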
UnicodeDecodeError
UnicodeDecodeError occurs when a byte sequence of type str is decoded into a string of type unicode.
>>> a = u"禅"
>>> a
u'\u7985'
>>> b = a.encode("utf-8")
>>> b
'\xe7\xa6\x85'
>>> b.decode("gbk")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gbk' codec can't decode byte 0x85 in position 2: incomplete multibyte sequence
UnicodeDecodeError occurs here because the UTF-8 encoded byte sequence '\xe7\xa6\x85' is decoded with GBK. GBK encodes a Chinese character in two bytes, while UTF-8 uses three bytes for this character, so decoding with GBK leaves one extra byte that cannot be parsed. The key to avoiding UnicodeDecodeError is to decode with the same encoding that was used to encode.
This also answers the question from the beginning of the article: the character "禅" may take 3 bytes or 2 bytes when saved to a file, depending on which encoding is specified when encode is called.
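The effect of matching and mismatching codecs can be checked with a small helper. `try_decode` is a hypothetical name introduced for this sketch; it simply reports a failed decode as `None` instead of raising. The code is meant to run on both Python 2.7 and Python 3.

```python
raw = u"\u7985".encode("utf-8")   # three UTF-8 bytes

def try_decode(data, encoding):
    """Return the decoded text, or None if data is not valid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

same_codec = try_decode(raw, "utf-8")   # matches the encoding step
wrong_codec = try_decode(raw, "gbk")    # mismatch: leftover byte cannot be parsed
```

A mismatch does not always raise an error. Depending on the bytes involved it can also decode silently into the wrong characters, which is one more reason to record the encoding used when the data was written.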
Here is another example of UnicodeDecodeError.
>>> x = u"Python"
>>> y = "之禅"
>>> x + y
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
When + is applied to a str and a unicode string, Python implicitly decodes the str byte sequence into the same unicode type as x. It does so with the default ASCII encoding, and since ASCII does not cover Chinese characters, the error is raised. The implicit conversion is equivalent to:
>>> y.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
The correct way is to decode y explicitly with the matching encoding (UTF-8 here, or GBK if the bytes were GBK-encoded).
>>> x = u"Python"
>>> y = "之禅"
>>> y = y.decode("utf-8")
>>> x + y
u'Python\u4e4b\u7985'
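The same fix can be written in a form that does not depend on the terminal encoding, by spelling out the bytes. This is a sketch under the assumption that the byte string is UTF-8 encoded, as it is in the session above. It runs on Python 2.7 and Python 3; note that Python 3 refuses to mix the two types implicitly and raises a TypeError, so the explicit decode is required there as well.

```python
x = u"Python"
y = b"\xe4\xb9\x8b\xe7\xa6\x85"   # UTF-8 bytes, e.g. read from a file or socket

# Decode explicitly with the encoding the bytes were written in,
# then concatenate two values of the same text type
combined = x + y.decode("utf-8")
```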
Everything above is based on Python 2. Characters and encodings in Python 3 will be covered in a separate article, so stay tuned.