For example, to convert a byte string s from GBK to UTF-8, you can do the following:
s.decode('gbk').encode('utf-8')
In real projects, however, I found that this approach often raises an exception:
UnicodeDecodeError: 'gbk' codec can't decode bytes in position 30664-30665: illegal multibyte sequence
This happens because the data contains illegal byte sequences. Programs written in C/C++ in particular often encode the full-width space in nonstandard ways, such as \xa3\xa0 or \xa4\x57. These sequences are meant to be full-width spaces, but they are not valid GBK (the real full-width space is \xa1\xa1), so decoding raises an exception.
This problem is a headache, because a single invalid sequence anywhere in the string prevents the entire string, and sometimes an entire article, from being transcoded.
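To see which bytes trip the decoder, the exception itself can be inspected. The following is only a minimal sketch, written for Python 3, where the raw data is a bytes object. The helper name find_bad_bytes and the sample data are made up for illustration.

```python
def find_bad_bytes(data, encoding="gbk"):
    """Return (start, bad_bytes) for the first undecodable sequence, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start and exc.end delimit the offending bytes within exc.object.
        return exc.start, exc.object[exc.start:exc.end]
    return None


# Hypothetical sample: ASCII text, a valid full-width space (\xa1\xa1),
# then one of the nonstandard sequences mentioned above (\xa3\xa0).
sample = b"title" + b"\xa1\xa1" + b"\xa3\xa0" + b"body"
print(find_bad_bytes(sample))
```

The reported offset is the same kind of position that appears in the error message, so it can be used to dump the surrounding bytes when tracking down where the bad data came from.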
Solution:
s.decode('gbk', 'ignore').encode('utf-8')
The signature of decode is decode([encoding[, errors='strict']]), so the second parameter controls the error-handling policy. The default is strict, which raises an exception when an invalid character is encountered;
If it is set to ignore, invalid characters are skipped;
If it is set to replace, each invalid character is replaced with a placeholder (U+FFFD when decoding, ? when encoding);
If it is set to xmlcharrefreplace, the character is replaced with an XML character reference (this policy applies only when encoding).
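The policies can be compared side by side. This is a rough sketch for Python 3, where decode is a method of bytes. The helper gbk_to_utf8 and the damaged sample are assumptions for illustration: a lone 0xFF byte, which can never start a GBK character, stands in for real corrupted data.

```python
def gbk_to_utf8(data, errors="strict"):
    """Transcode GBK bytes to UTF-8 bytes using the given error policy."""
    return data.decode("gbk", errors).encode("utf-8")


# Hypothetical sample: 0xFF can never start a GBK character.
damaged = b"ab\xffcd"

for policy in ("ignore", "replace"):
    print(policy, gbk_to_utf8(damaged, policy))

try:
    gbk_to_utf8(damaged)  # default policy: strict
except UnicodeDecodeError as exc:
    print("strict raised:", exc.reason)

# xmlcharrefreplace is an encoding-side policy: characters the target
# codec cannot represent become numeric character references.
print(u"\u4e2d".encode("ascii", "xmlcharrefreplace"))
```

Note that ignore silently drops data, so replace may be the safer choice when you need to see afterwards where the damage was.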
Python documentation
decode([encoding[, errors]])
Decodes the string using the codec registered for encoding. encoding defaults to the default string encoding. errors may be given to set a different error handling scheme. The default is 'strict', meaning that encoding errors raise UnicodeError. Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error, see section 4.8.1.
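The mention of codecs.register_error means you can also plug in your own policy. Below is a sketch only, for Python 3. The handler name gbk_space and the function fullwidth_space_handler are made up, and the handler rests on a strong assumption: that every undecodable sequence in the data is a malformed two-byte full-width space. If that does not hold for your data, it will quietly turn other damage into spaces.

```python
import codecs


def fullwidth_space_handler(exc):
    """Decode-error handler: emit U+3000 and skip a two-byte sequence."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # this sketch only handles decoding errors
    # Assumption: every undecodable sequence is a malformed two-byte space.
    resume_at = min(exc.start + 2, len(exc.object))
    return (u"\u3000", resume_at)


# "gbk_space" is a made-up name for this sketch.
codecs.register_error("gbk_space", fullwidth_space_handler)

# Hypothetical sample containing one of the nonstandard sequences.
sample = b"a\xa3\xa0b"
text = sample.decode("gbk", "gbk_space")
print(repr(text))
```

A handler receives the exception object and returns a tuple of the replacement text and the position at which decoding should resume, which is why it can skip two bytes at once instead of one.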