This article introduces a piece of Python "black magic" for encoding conversion and analyzes Python's encoding conversion methods. In the encoding conversion routines of other languages' libraries, characters that cannot be decoded are usually handled in one of only two (or three) ways:
- Throw an exception
- Replace with a replacement character
- Skip
However, in the messy real world, unreliable text sources mean there will always be troublemakers such as mixed encodings, and in such cases none of the approaches above gives a good result.
So the question is: does Python offer a better way?
The answer is yes!
Encoding conversion in Python is actually a two-step process:
source -> unicode -> dest
First, convert the string from the original encoding to unicode. Then convert unicode to the target encoding.
Step 1 is generally done with one of two functions, decode() or unicode().
Step 2 is done with the encode() function.
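The two steps can be sketched as follows. This is a minimal illustration written for Python 3, where bytes.decode() plays the role of decode()/unicode(); the helper name transcode is an assumption made for this example and is not part of any library.

```python
def transcode(data: bytes, src: str, dest: str, errors: str = "strict") -> bytes:
    """Hypothetical helper: convert bytes from one encoding to another."""
    text = data.decode(src, errors)   # step 1: source -> unicode
    return text.encode(dest)          # step 2: unicode -> dest


gbk_bytes = "你好".encode("gbk")
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")
print(utf8_bytes)
```

Only the first step takes the errors argument here, because that is the step where undecodable input shows up.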
The black magic we mentioned here is implemented in the first step.
Both decode and unicode functions have an optional parameter called errors. Take a look at the official description:
"errors may be given to set a different error handling scheme. Default is 'strict' meaning that encoding errors raise a UnicodeDecodeError. Other possible values are 'ignore' and 'replace' as well as any other name registered with codecs.register_error that is able to handle UnicodeDecodeErrors."
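This parameter usually takes one of three values:
- strict: the default. If a decoding error occurs, UnicodeDecodeError is raised.
- ignore: skip the bytes that cannot be decoded.
- replace: substitute a replacement character (U+FFFD when decoding, '?' when encoding).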
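A short demonstration of the three values, written for Python 3 (bytes.decode() instead of unicode()). The sample bytes are made up for this example; 0xFF was chosen because it never appears in valid UTF-8.

```python
data = b"abc\xffdef"  # 0xFF can never occur in valid UTF-8

strict_error = None
try:
    data.decode("utf-8")  # errors="strict" is the default
except UnicodeDecodeError as exc:
    strict_error = exc
    print("strict raised an error at byte", exc.start)

ignored = data.decode("utf-8", "ignore")    # the bad byte is dropped
replaced = data.decode("utf-8", "replace")  # the bad byte becomes U+FFFD

print(repr(ignored))
print(repr(replaced))
```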
Did you notice the last sentence of that description? This is where the show begins!
The codecs module has a function called register_error, which lets users register their own error-handling functions.
Here we use it to handle UnicodeDecodeError.
Let's look at the function prototype:
codecs.register_error(name, error_handler)
name: the name of the error handler. This is the value passed as the errors parameter of the decode function.
error_handler: the handler function. It takes the exception object as its only parameter and returns a tuple of two elements: the first is the replacement string for the bytes that failed to decode, and the second is the position from which decoding continues.
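Before the real implementation, here is a minimal sketch of the mechanism, written for Python 3. The handler name hex_replace and its behavior (showing each undecodable byte as <XX>) are assumptions made purely for illustration.

```python
import codecs


def hex_replace(exc):
    """Hypothetical handler: render each undecodable byte as <XX>."""
    if not isinstance(exc, UnicodeDecodeError):
        raise TypeError("don't know how to handle %r" % exc)
    bad = exc.object[exc.start:exc.end]          # the bytes that failed to decode
    replacement = "".join("<%02X>" % b for b in bad)
    return replacement, exc.end                  # (replacement text, resume position)


codecs.register_error("hex_replace", hex_replace)

result = b"abc\xffdef".decode("utf-8", "hex_replace")
print(result)  # abc<FF>def
```

The registered name is then used exactly like 'ignore' or 'replace'.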
With the above basic concepts. Let's take a look at the specific implementation:
    import codecs

    def cjk_error(exc):
        # Python 2 code: retry the undecodable bytes as a double-byte CJK character.
        if not isinstance(exc, UnicodeDecodeError):
            raise TypeError("don't know how to handle %r" % exc)
        if exc.end + 1 > len(exc.object):
            raise TypeError('unknown codec, the object too short!')
        ch1 = ord(exc.object[exc.start:exc.end])       # lead byte
        newpos = exc.end + 1                           # position to resume decoding at
        ch2 = ord(exc.object[exc.start + 1:newpos])    # trail byte
        sk = exc.object[exc.start:newpos]
        if 0x81 <= ch1 <= 0xFE and (0x40 <= ch2 <= 0x7E or 0x80 <= ch2 <= 0xFE):  # GBK
            return (unicode(sk, 'cp936'), newpos)
        if 0x81 <= ch1 <= 0xFE and (0x40 <= ch2 <= 0x7E or 0xA1 <= ch2 <= 0xFE):  # BIG5
            return (unicode(sk, 'big5'), newpos)
        raise TypeError('unknown codec !')

    codecs.register_error("cjk_replace", cjk_error)
I copied the cjk_error handler above from the Internet. I thought it was good at first, but later I found that the algorithm is not reliable.
For example, utf8 and gbk overlap in the first two bytes. When a UTF-8 string is decoded as gbk, the error is only reported at the third byte, because the first two bytes happen to fall within the gbk range and are decoded as one Chinese character.
For example:
    a = "你"                      # utf8 encoding: '\xe4\xbd\xa0'
    c = unicode(a[:2], 'gbk')    # returns normally
    c = unicode(a, 'gbk')        # UnicodeDecodeError; the error occurs at the third byte
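The same overlap can be reproduced in Python 3, where the byte string has to be built explicitly. This is only a sketch for checking the behavior; the variable names are chosen for this example.

```python
raw = "你".encode("utf-8")           # b'\xe4\xbd\xa0'

# The first two bytes form a valid GBK pair, so they decode without complaint.
first_two = raw[:2].decode("gbk")

error_start = None
try:
    raw.decode("gbk")
except UnicodeDecodeError as exc:
    error_start = exc.start          # index of the byte where decoding failed

print(repr(first_two), error_start)
```

The error position is index 2, the third byte, so an error handler never gets a chance to look at the first two bytes before they are decoded.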
In this case, we have made the following improvements:
    import codecs

    def cjk_replace(exc):
        # Python 2 code.
        if not isinstance(exc, UnicodeDecodeError):
            raise TypeError("invalid exception type %s" % exc)
        src = exc.encoding
        if src in ('gbk', 'gb18030', 'big5'):
            # The two bytes before the error may have been misread as one CJK
            # character; try them together with the failing byte as UTF-8.
            # Text already decoded before exc.start is not retracted.
            beg = exc.start - 2
            if beg >= 0:
                try:
                    # Resume right after the bytes consumed here.
                    return unicode(exc.object[beg:exc.end], 'utf8'), exc.end
                except UnicodeDecodeError:
                    pass
        if exc.end + 1 > len(exc.object):
            raise TypeError('unknown codec, the object too short!')
        ch1 = ord(exc.object[exc.start:exc.end])       # lead byte
        newpos = exc.end + 1                           # position to resume decoding at
        ch2 = ord(exc.object[exc.start + 1:newpos])    # trail byte
        sk = exc.object[exc.start:newpos]
        if src != 'gbk' and 0x81 <= ch1 <= 0xFE and (0x40 <= ch2 <= 0x7E or 0x80 <= ch2 <= 0xFE):  # GBK
            return (unicode(sk, 'cp936'), newpos)
        if src != 'big5' and 0x81 <= ch1 <= 0xFE and (0x40 <= ch2 <= 0x7E or 0xA1 <= ch2 <= 0xFE):  # BIG5
            return (unicode(sk, 'big5'), newpos)
        raise TypeError('unknown codec !')

    codecs.register_error("cjk_replace", cjk_replace)
Of course, this logic is still not rigorous, although it is somewhat more practical for this kind of malformed, mixed-encoding text.
Still, since Python provides this capability, it is worth discussing how we could do even better.