In Python, you can invoke the decode and encode methods on a string to implement transcoding.
For example, to convert a string object s from GBK to UTF-8 in your code, you can do the following:
s.decode('gbk').encode('utf-8')
However, in real-world development, I found that this approach often raises an exception:
UnicodeDecodeError: 'gbk' codec can't decode bytes in position 30664-30665: illegal multibyte sequence
This happens because the string contains illegal characters. In particular, some programs written in C++ produce full-width spaces in several different ways, such as \xa3\xa0 or \xa4\x57. These look like full-width spaces, but they are not "legal" full-width spaces (the true full-width space is \xa1\xa1), so an exception is raised during transcoding.
This is a headache, because a single illegal character in the string means that the entire string, sometimes an entire article, cannot be transcoded.
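The failure can be reproduced with a few bytes. The following is a minimal sketch, assuming Python 3, where decode is a method of bytes rather than str; the helper name decodes_cleanly and the sample data are hypothetical and only serve as an illustration.

```python
# Assumption: Python 3 (bytes.decode). The sample bytes are made up for
# illustration; only \xa3\xa0 and \xa1\xa1 come from the text above.

def decodes_cleanly(data: bytes, encoding: str = "gbk") -> bool:
    """Return True if data decodes under the default 'strict' policy."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

valid = "你好".encode("gbk")            # ordinary GBK text
legal_space = b"\xa1\xa1"               # the legal full-width space
fake_space = b"\xa3\xa0"                # looks like a space, but is not legal GBK

print(decodes_cleanly(valid + legal_space))
print(decodes_cleanly(valid + fake_space))
```

A check like this can be used to find which inputs need special handling before transcoding a whole batch of files.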
Workaround:
s.decode('gbk', 'ignore').encode('utf-8')
This works because the prototype of decode is decode([encoding], [errors='strict']): the second parameter controls the error-handling policy. The default is strict, which raises an exception when an illegal character is encountered;
If set to ignore, illegal characters are skipped;
If set to replace, each illegal character is replaced with a placeholder (U+FFFD when decoding);
If set to xmlcharrefreplace, an XML character reference is used (this policy applies only when encoding).
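To compare the policies side by side, here is a small sketch, again assuming Python 3 and a hypothetical sample in which the illegal bytes sit at the end of the data.

```python
# Assumption: Python 3 (bytes.decode). The sample data is hypothetical.
raw = "你好".encode("gbk") + b"\xa3\xa0"   # valid text + illegal bytes

# 'strict' (the default) refuses the whole input.
try:
    raw.decode("gbk")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' silently drops the bytes that cannot be decoded.
ignored = raw.decode("gbk", "ignore")

# 'replace' substitutes U+FFFD so the damage stays visible.
replaced = raw.decode("gbk", "replace")

# Either result can then be encoded to UTF-8 without errors.
utf8_bytes = ignored.encode("utf-8")

print(strict_failed)
print(repr(ignored))
print(repr(replaced))
```

Note that ignore loses data without leaving a trace, while replace keeps a marker at the damaged position. How many bytes are skipped or replaced per error may differ between Python versions, so it is worth inspecting the output on real data.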
Python documentation
decode([encoding[, errors]])
Decodes the string using the codec registered for encoding. encoding defaults to the default string encoding. errors may be given to set a different error handling scheme. The default is 'strict', meaning that encoding errors raise UnicodeError. Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error, see section 4.8.1.
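The documentation also allows custom policies registered via codecs.register_error. As a sketch of that idea (not something the original workaround uses), the handler below maps the two non-standard full-width spaces mentioned earlier to a real full-width space (U+3000) and drops any other undecodable byte. It assumes Python 3; the handler name gbk_fake_space, the function name, and the sample data are all hypothetical.

```python
import codecs

# Byte pairs reported above as non-standard full-width spaces.
FAKE_SPACES = {b"\xa3\xa0", b"\xa4\x57"}

def fake_space_handler(exc):
    """Map known fake full-width spaces to U+3000; drop other bad bytes."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    # exc.object is the full input; exc.start is where decoding failed.
    pair = exc.object[exc.start:exc.start + 2]
    if pair in FAKE_SPACES:
        # Return the replacement text and the position to resume from.
        return ("\u3000", exc.start + 2)
    return ("", exc.end)

# "gbk_fake_space" is an arbitrary name chosen for this sketch.
codecs.register_error("gbk_fake_space", fake_space_handler)

raw = "你好".encode("gbk") + b"\xa3\xa0" + "世界".encode("gbk")
text = raw.decode("gbk", "gbk_fake_space")
utf8_bytes = text.encode("utf-8")
print(repr(text))
```

Compared with ignore, this keeps the space the author of the data presumably intended, at the cost of maintaining the list of known bad byte pairs.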