When a Python program raises a UnicodeDecodeError or UnicodeEncodeError at run time, how should you understand and fix it? A computer must convert text to numbers before it can process it, because computers work in binary.
String encoding
(1) Unicode, the general-purpose character set
(2) A concrete encoding that converts Unicode into bytes
Bytes
The basic unit of data storage. One byte equals 8 bits, so a single byte can represent 256 distinct states.
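The 256 states follow directly from the 8 bits. A minimal Python 3 sketch (the variable names are illustrative only) that checks the arithmetic and shows that a single byte cannot hold the value 256:

```python
# One byte holds 8 bits, so it can take 2**8 distinct values (0..255).
states = 2 ** 8
largest = bytes([255])  # the largest value a single byte can hold

try:
    bytes([256])        # 256 does not fit in one byte
    fits = True
except ValueError:
    fits = False

print(states, largest, fits)
```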
Character
A character is a unit of information: a general term for letters, ideographs, and symbols. An English letter is a character, a Chinese character is a character, and a punctuation mark is also a character.
Character set
A collection of characters within a defined range; different character sets define different numbers of characters. The ASCII character set has a total of 128 characters, including English letters, Arabic numerals, punctuation marks, and control characters. The GB2312 character set defines 7,445 characters and covers most commonly used Chinese characters.
Character encoding
A character encoding is a concrete scheme for mapping the code of each character in a character set to a byte stream. Common character encodings include ASCII, UTF-8, and GBK.
A character set and a character encoding correspond to each other.
For example, the ASCII character set corresponds to the ASCII encoding. The ASCII encoding specifies that every character is encoded in the low 7 bits of a single byte. For example, the code of "A" is 65 and its single-byte representation is 0x41, so the bits written to a storage device are 01000001.
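The numbers in the ASCII example can be checked with Python 3 built-ins. This is only a small sketch; the variable names are chosen for illustration.

```python
ch = "A"

code_point = ord(ch)                 # the character's number: 65
as_hex = hex(code_point)             # the same number in hexadecimal
as_bits = format(code_point, "08b")  # the 8 bits stored in the byte
encoded = ch.encode("ascii")         # the single byte that is written out

print(code_point, as_hex, as_bits, encoded)
```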
Encoding and decoding
Encoding is the process of converting characters into a byte stream; decoding is the reverse process of parsing a byte stream back into characters.
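A round trip makes the two directions concrete. The sketch below assumes Python 3 and uses UTF-8 as the example encoding; the sample text is arbitrary.

```python
text = "中文 abc"

data = text.encode("utf-8")      # encoding: characters -> bytes
restored = data.decode("utf-8")  # decoding: bytes -> characters

# Six characters become ten bytes: each Chinese character takes 3 bytes in UTF-8.
print(type(data).__name__, len(text), len(data))
```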
ASCII encoding
A 128-character encoding that maps English characters to binary values.
GB2312
A common encoding for Chinese. Each Chinese character takes two bytes, so in theory two bytes can represent 256 x 256 = 65,536 symbols.
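The two-byte claim can be checked with the `gb2312` codec that ships with Python 3. A minimal sketch, using "中" as an arbitrary sample character:

```python
ch = "中"

gb_bytes = ch.encode("gb2312")  # one Chinese character -> two bytes
theoretical_max = 256 * 256     # upper bound for any two-byte scheme

print(len(gb_bytes), theoretical_max)
```

The real character set is far smaller than this upper bound, because only part of the two-byte space is assigned.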
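Unicode (also called the unified code or universal code)
It makes it possible to convert and process text across languages and platforms.
The original fixed-width Unicode encoding (UCS-2) uses two bytes per character, while ASCII uses one byte.
The ASCII encoding of A is 01000001, while its two-byte Unicode encoding is 0000000001000001. For text that ASCII could already represent, this doubles the storage and wastes space. To save space, intermediate transformation formats were introduced: UTF-8 and UTF-16.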
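The bit patterns above can be reproduced in Python 3. In this sketch, `utf-16-be` stands in for the two-byte form, since for a character like "A" it yields the same two bytes; that substitution is an assumption made for illustration.

```python
ch = "A"

ascii_bits = format(ch.encode("ascii")[0], "08b")

# Two-byte, big-endian form of the same character.
two_byte = ch.encode("utf-16-be")
two_byte_bits = "".join(format(b, "08b") for b in two_byte)

# UTF-8 keeps ASCII characters at one byte, which is where the saving comes from.
utf8_bytes = ch.encode("utf-8")

print(ascii_bits, two_byte_bits, len(utf8_bytes))
```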
UTF-8
A variable-length character encoding that uses 1 to 4 bytes per character. An English letter is encoded as one byte, and a Chinese character usually as three bytes.
Coding rules for UTF-8:
A. For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code point. For English letters, UTF-8 and ASCII are therefore identical.
B. For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n+1 is set to 0; the first two bits of every following byte are set to 10. All remaining bits hold the symbol's Unicode code point.
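To see rules A and B at work, here is a hand-written encoder for a single code point, compared against Python 3's built-in codec. The function name `utf8_encode_code_point` is hypothetical, and the sketch deliberately skips validation (for example, it does not reject surrogate code points).

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point by hand, following rules A and B."""
    if cp < 0x80:      # rule A, 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:     # rule B, 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:   # rule B, 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # rule B, 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def bits(data: bytes) -> str:
    """Render bytes as space-separated groups of 8 bits."""
    return " ".join(format(b, "08b") for b in data)


for ch in ["A", "é", "你"]:
    manual = utf8_encode_code_point(ord(ch))
    # The manual result should agree with the built-in codec.
    print(ch, bits(manual), manual == ch.encode("utf-8"))
```

Printing the bits makes the leading `0`, `110`, `1110`, and `10` markers visible in each byte.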
Encoding comparison example:
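One way to produce such a comparison is to encode the same characters with several codecs and look at the resulting bytes. The sample characters and the list of encodings below are assumptions picked for illustration, and `encoded_hex` is a hypothetical helper.

```python
samples = ["A", "中"]
encodings = ["ascii", "gbk", "utf-8", "utf-16-be"]


def encoded_hex(ch, enc):
    """Return the encoded bytes as hex, or None if the codec cannot encode ch."""
    try:
        return ch.encode(enc).hex()
    except UnicodeEncodeError:
        return None


table = {ch: {enc: encoded_hex(ch, enc) for enc in encodings} for ch in samples}

for ch, row in table.items():
    print(ch, row)
```

"A" has the same single byte in ASCII, GBK, and UTF-8, while "中" cannot be represented in ASCII at all and takes a different number of bytes in each of the other encodings.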
String encoding in Python 3 uses str and bytes
(1) str: a string of Unicode characters
(2) bytes: Unicode text converted to a specific encoding such as UTF-8
The default encoding is Unicode. encode and decode in Python 3:
encode converts Unicode text into the specified character encoding
decode converts bytes in another character encoding back to Unicode
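A short Python 3 sketch of the two types and the two conversions. It relies on the fact that `str.encode()` called without arguments uses UTF-8 in Python 3; the sample string is arbitrary.

```python
import sys

s = "你好"                # str: a sequence of Unicode characters
b = s.encode("utf-8")     # bytes: the same text in a concrete encoding
back = b.decode("utf-8")  # decode turns the bytes back into str

# With no argument, encode() falls back to the interpreter default, UTF-8.
default_encoding = sys.getdefaultencoding()
same_as_default = s.encode() == b

print(type(s).__name__, type(b).__name__, b, default_encoding)
```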
Causes of UnicodeEncodeError and UnicodeDecodeError
The root cause of both errors is that Python 2 uses ASCII by default for decode and encode operations.
>>> s = '你好'
>>> s.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
When s is converted to a unicode string, the decode method uses ASCII by default. The ASCII character set does not contain the Chinese characters "你好", so a UnicodeDecodeError is raised. The correct approach is to specify the UTF-8 character encoding explicitly.
>>> s.decode('utf-8')
u'\u4f60\u597d'
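The transcript above is Python 2. The same failure can be reproduced in Python 3 by decoding UTF-8 bytes with the ASCII codec explicitly, as in this sketch (variable names are illustrative):

```python
raw = "你好".encode("utf-8")  # the bytes a UTF-8 terminal or file would supply

message = None
failing_byte = None
try:
    raw.decode("ascii")        # wrong codec: 0xe4 is outside the ASCII range
except UnicodeDecodeError as exc:
    message = str(exc)
    failing_byte = raw[exc.start]  # the byte the codec could not decode

fixed = raw.decode("utf-8")    # the right codec decodes without error

print(message)
print(hex(failing_byte), fixed)
```

The exception object carries the offending position in `exc.start`, which helps locate the bad byte in longer inputs.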
The encode operation behaves the same way. When a unicode string is converted to a str, the conversion uses ASCII by default. The ASCII character set does not contain the Chinese characters "你好", so a UnicodeEncodeError is raised.
>>> a = u'你好'
>>> a.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
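In Python 3 the same error appears when ASCII is requested explicitly. The sketch below also shows two standard `errors=` handlers that avoid the exception when a lossy or escaped result is acceptable; whether that trade-off is appropriate depends on the application.

```python
text = "你好"

error = None
try:
    text.encode("ascii")  # ASCII has no representation for these characters
except UnicodeEncodeError as exc:
    error = exc           # keep a reference; `exc` is cleared after the block

utf8_bytes = text.encode("utf-8")                      # a codec that covers the text
replaced = text.encode("ascii", errors="replace")      # lossy: each character becomes ?
escaped = text.encode("ascii", errors="backslashreplace")  # keeps the code points as escapes

print(error)
print(utf8_bytes, replaced, escaped)
```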
When a str is mixed with a unicode string, Python 2 implicitly converts the str to unicode. If the str contains Chinese characters, a UnicodeDecodeError is raised, because Python 2 uses ASCII by default for the decode operation.
>>> s = '你好'  # str type
>>> y = u'python'  # unicode type
>>> s + y  # implicit conversion, i.e. s.decode('ascii') + y
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
The correct approach is to decode explicitly with the UTF-8 character encoding:
>>> s.decode('utf-8') + y
u'\u4f60\u597dpython'
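For comparison, Python 3 performs no implicit conversion at all: mixing bytes and str raises a TypeError, so the explicit decode becomes mandatory. A minimal sketch, with names mirroring the transcript:

```python
s = "你好".encode("utf-8")  # bytes, playing the role of the Python 2 str
y = "python"                # str (Unicode)

try:
    s + y                   # Python 3 refuses to guess an encoding
    mixed_ok = True
except TypeError:
    mixed_ok = False

combined = s.decode("utf-8") + y  # decode explicitly, then concatenate

print(mixed_ok, combined)
```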
All garbled text can be traced to one cause: characters that were encoded in one format were decoded using a different, inconsistent format.
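The mismatch is easy to reproduce in Python 3: encode with one codec and decode with another. In this sketch the choice of UTF-8 and GBK is only an example pairing, and `errors="replace"` is used so that the wrong decode yields garbled text rather than an exception.

```python
text = "你好"
utf8_bytes = text.encode("utf-8")

# Wrong codec on the way back: the bytes are valid, the interpretation is not.
garbled = utf8_bytes.decode("gbk", errors="replace")

# Same codec on both sides: the original text is recovered.
correct = utf8_bytes.decode("utf-8")

print(repr(garbled), repr(correct))
```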