Conversion of string internal code is a common problem during development.
In Java, We can first call getbyte () for a string to generate a new string for transcoding, or use charset In the NIO package.
In python, you can call the decode and encode methods on the string for transcoding.
For example, to convert a String object s from a GBK internal code to a UTF-8, you can do the following:
S. Decode ('gbk'). encode ('utf-8 ′)
However, in actual development, I found that exceptions often occur in this method:
Unicodedecodeerror: 'gbk' codec can't decode bytes in position 30664-30665: Illegal multibyte Sequence
This is because an invalid character is encountered, especially when it is written in C/C ++.ProgramIn, full-width spaces often have different implementation methods, such as \ xa3 \ xa0, or \ Xa4 \ x57. These characters are all full-width spaces, however, they are not "valid" fullwidth spaces (the real fullwidth space is \ xA1 \ xA1), so an exception occurs during transcoding.
This problem is a headache, because as long as there is an invalid character in the string, the entire string-sometimes, the wholeArticle-- Transcoding fails.
Fortunately, tiny found a perfect solution (I was criticized for not reading the document carefully, sweated ......)
S. Decode ('gbk', 'ignore'). encode ('utf-8 ′)
Because the prototype of the decode function is decode ([encoding], [errors = 'strict ']), you can use the second parameter to control the error handling policy. The default parameter is strict, an exception is thrown when an invalid character is encountered;
If it is set to ignore, invalid characters are ignored;
If it is set to replace,? Replace invalid characters;
If it is set to xmlcharrefreplace, the XML character reference is used.
From: http://workgroup.cn /? P = 83