Character encoding in Python

When converting between encodings, we must know for certain which encoding the string we read is in, and which encoding we want to produce. We first use decode to turn the byte string into a Unicode string, and then use encode to encode that Unicode string in the expected target encoding. Note: even when the content was read correctly, the result of decoding it and encoding it again can still come out garbled.

The general form, where str1 is the string we read, is str1.decode("GBK").encode("UTF-8"). A str cannot be encoded, because it is already in an encoded form. A unicode string cannot be decoded, because it is already the decoded form.

Python (2.x) has two different string types: the byte string (str) and the Unicode string (unicode). Encoding and decoding in Python are therefore conversions between unicode and bytes: encoding is unicode -> bytes, and decoding is bytes -> unicode. Strictly speaking, str is a byte string: the sequence of bytes produced by encoding a unicode string. unicode is the true string, obtained by decoding a str with the correct character encoding. Decoding a unicode is a mistake, and so is encoding a str.

If a source file contains non-ASCII characters, the character encoding must be declared at the top of the file, for example # -*- coding: utf-8 -*-. In fact, Python only looks for the #, the word coding, and the encoding name; the other characters are there for looks. Python also supports many encodings, most of which have several aliases that are not case sensitive; for example, UTF-8 can be written as u8. See http://docs.python.org/library/codecs.html#standard-encodings.

To convert GBK data to UTF-8, the usual call is s.decode('gbk').encode('utf-8'). However, today the conversion raised the following exception: UnicodeDecodeError: 'gbk' codec can't decode bytes in position 7-8: illegal multibyte sequence. It is caused by illegal characters in the input.
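As a minimal sketch of the decode-then-encode conversion described above: the text is written for Python 2, where the byte string is str and the text string is unicode. The sketch below assumes Python 3 instead, where the same two types are called bytes and str. The helper name gbk_to_utf8 and the sample text are hypothetical, chosen only for illustration.

```python
def gbk_to_utf8(raw: bytes) -> bytes:
    """Transcode GBK-encoded bytes to UTF-8 bytes via a Unicode string."""
    text = raw.decode("gbk")      # bytes -> Unicode (decoding)
    return text.encode("utf-8")   # Unicode -> bytes (encoding)


# Build the sample GBK data from a Unicode literal so the bytes are valid GBK.
gbk_bytes = "中文".encode("gbk")
utf8_bytes = gbk_to_utf8(gbk_bytes)

# Decoding with the wrong codec either raises or yields garbled text.
# Here the UTF-8 bytes are not valid ASCII, so decoding raises.
try:
    utf8_bytes.decode("ascii")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

# In Python 3 the two rules are enforced by the types themselves:
# bytes has no encode method, and str has no decode method.
bytes_can_encode = hasattr(gbk_bytes, "encode")
str_can_decode = hasattr("text", "decode")
```

Building the input with encode, rather than typing raw byte values, keeps the example independent of any particular byte table.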
Illegal characters of this kind are especially common in data produced by some programs written in C/C++, where full-width spaces are implemented in several different ways, for example \xa3\xa0 or \xa4\x57. These byte pairs are all meant to be full-width spaces, but they are not "valid" full-width spaces (the real full-width space is \xA1\xA1), so an exception occurs during transcoding. This problem is a headache, because a single invalid character anywhere in the string means that the entire string, and sometimes the entire article, cannot be transcoded.

Solution: s.decode('gbk', 'ignore').encode('utf-8'). The prototype of the decode function is decode([encoding], [errors='strict']), so the second parameter controls the error handling policy. The default, strict, raises an exception when an invalid character is encountered. If it is set to ignore, invalid characters are dropped. If it is set to replace, invalid characters are replaced with a marker (U+FFFD when decoding, ? when encoding). If it is set to xmlcharrefreplace, they are replaced with XML character references; this policy applies only when encoding.

http://blog.iamzsx.me/show.html?id=81001

Files containing Chinese characters should be saved as UTF-8. Declare # coding: UTF-8 at the beginning of the file, and keep all strings in the file as unicode. Use the encode and decode functions as needed to convert file contents and other input data: decode incoming data to unicode as early as possible, which reduces encoding trouble.
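The error handling policies can be compared side by side in a short sketch. It assumes Python 3 (bytes.decode rather than the Python 2 str.decode used in the text), and the sample data is made up: valid GBK text with a single stray 0xFF byte inserted, which is not a valid GBK lead byte. It does not reproduce the exact full-width-space bytes mentioned above.

```python
# Valid GBK bytes with one invalid byte spliced into the middle.
data = "中文".encode("gbk") + b"\xff" + "编码".encode("gbk")

# errors="strict" is the default: one bad byte fails the whole string.
try:
    data.decode("gbk")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "ignore" drops the invalid byte and keeps everything else.
ignored = data.decode("gbk", "ignore")

# "replace" substitutes U+FFFD (the Unicode replacement character) when decoding.
replaced = data.decode("gbk", "replace")

# "xmlcharrefreplace" works on the encoding side: characters the target
# codec cannot represent become XML numeric character references.
xml_safe = "5€".encode("ascii", "xmlcharrefreplace")

# The full transcoding step from the text, with invalid bytes skipped.
utf8_bytes = data.decode("gbk", "ignore").encode("utf-8")
```

Note that ignore silently loses data, so replace may be the better choice when it matters to see where the input was damaged.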