Black Magic for Python Encoding Conversion
When we use encoding-conversion libraries in other languages, there are usually only two (or three) ways to handle characters that cannot be decoded:
- Throw an exception
- Replace with a replacement character
- Skip
However, in the messy real world, text comes from all sorts of unreliable sources, so we constantly run into troublesome cases such as mixed encodings. In those situations none of the options above really helps.
So the question is: does Python offer a better way?
The answer is yes!
Python's encoding conversion is actually a two-step process:
source -> unicode -> dest
First the string is decoded from its original encoding to unicode, then the unicode string is encoded to the target encoding.
Step 1 is normally done with the decode() method or the unicode() built-in function.
Step 2 is done with the encode() method.
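Putting the two steps together, a minimal Python 2 sketch (the sample text and encodings here are just illustrative assumptions):

```python
gbk_bytes = u'\u4f60\u597d'.encode('gbk')   # sample GBK-encoded bytes

# Step 1: source encoding -> unicode
text = gbk_bytes.decode('gbk')              # equivalently: unicode(gbk_bytes, 'gbk')

# Step 2: unicode -> destination encoding
utf8_bytes = text.encode('utf-8')
```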
The black magic discussed here lives in the first step.
Both decode() and unicode() take an optional parameter called errors. Take a look at the official description:
errors may be given to set a different error handling scheme. Default is 'strict' meaning that encoding errors raise a UnicodeDecodeError. Other possible values are 'ignore' and 'replace' as well as any other name registered with codecs.register_error that is able to handle UnicodeDecodeErrors.
This parameter usually takes one of three values (a quick demo follows the list):
- 'strict' (the default): raise a UnicodeDecodeError when a decoding error occurs.
- 'ignore': skip the offending bytes.
- 'replace': substitute a replacement character (U+FFFD).
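For instance, in Python 2 (the byte string is a made-up example: valid GBK text followed by one stray byte):

```python
data = u'\u4f60'.encode('gbk') + '\xff'      # valid GBK bytes plus a broken byte

# data.decode('gbk')                         # 'strict' (default): raises UnicodeDecodeError
print repr(data.decode('gbk', 'ignore'))     # u'\u4f60'        - the bad byte is dropped
print repr(data.decode('gbk', 'replace'))    # u'\u4f60\ufffd'  - replaced with U+FFFD
```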
Did you notice the last sentence of that description? That is where the show begins!
The codecs module has a function called register_error. It lets you register your own error-handling scheme, which is then used to handle UnicodeDecodeError.
Let's look at the function prototype:
codecs.register_error(name, error_handler)
name: the name of the error-handling scheme; this is the string you later pass as the errors argument of decode().
error_handler: the handler function. It receives the exception object as its only argument and returns a two-element tuple: the first element is the unicode text to substitute for the bad bytes, and the second is the position at which decoding should resume.
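Before the real thing, here is a minimal sketch of that contract (the handler name fill_question is made up for illustration): it substitutes a question mark for whatever the codec could not decode and resumes right after it.

```python
import codecs

def fill_question(exc):
    if not isinstance(exc, UnicodeDecodeError):
        raise TypeError("don't know how to handle %r" % exc)
    # (replacement text, position at which decoding resumes)
    return (u'?', exc.end)

codecs.register_error('fill_question', fill_question)

# '\xc4\xe3' is a valid GBK character, the trailing '\xff' is not decodable.
print repr('\xc4\xe3\xff'.decode('gbk', 'fill_question'))   # u'\u4f60?'
```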
With these basics in hand, let's look at a concrete implementation:
```python
import codecs

def cjk_replace(exc):
    if not isinstance(exc, UnicodeDecodeError):
        raise TypeError("don't know how to handle %r" % exc)
    if exc.end + 1 > len(exc.object):
        raise TypeError('unknown codec, the object is too short!')
    ch1 = ord(exc.object[exc.start:exc.end])        # the byte the codec choked on
    newpos = exc.end + 1
    ch2 = ord(exc.object[exc.start + 1:newpos])     # the byte right after it
    sk = exc.object[exc.start:newpos]               # the two bytes as one candidate character
    if 0x81 <= ch1 <= 0xFE and (0x40 <= ch2 <= 0x7E or 0x7E <= ch2 <= 0xFE):  # looks like GBK
        return (unicode(sk, 'cp936'), newpos)
    if 0x81 <= ch1 <= 0xFE and (0x40 <= ch2 <= 0x7E or 0xA1 <= ch2 <= 0xFE):  # looks like BIG5
        return (unicode(sk, 'big5'), newpos)
    raise TypeError('unknown codec!')

codecs.register_error("cjk_replace", cjk_replace)
```
I copied this from the Internet. It looked fine at first, but on closer inspection the algorithm does not hold up.
For example, UTF-8 and GBK overlap in their byte ranges: when a UTF-8 byte string is decoded as GBK, the error only surfaces at the third byte, because the first two bytes of the UTF-8 sequence also happen to form a valid Chinese character in the GBK range.
For example:

```python
a = "你"                   # UTF-8 encoding: '\xe4\xbd\xa0'
c = unicode(a[:2], 'gbk')  # returns normally
c = unicode(a, 'gbk')      # UnicodeDecodeError: the error occurs at the third byte
```
To handle this case, the code was improved as follows:
```python
import codecs

def cjk_replace(exc):
    if not isinstance(exc, UnicodeDecodeError):
        raise TypeError("invalid exception type %s" % exc)
    src = exc.encoding
    if src in ('gbk', 'gb18030', 'big5'):
        # The failing bytes may actually be UTF-8: back up two bytes and try.
        beg = exc.start - 2
        if beg >= 0:
            try:
                return unicode(exc.object[beg:exc.end], 'utf8'), exc.end + 1
            except Exception:
                pass
    if exc.end + 1 > len(exc.object):
        raise TypeError('unknown codec, the object is too short!')
    ch1 = ord(exc.object[exc.start:exc.end])
    newpos = exc.end + 1
    ch2 = ord(exc.object[exc.start + 1:newpos])
    sk = exc.object[exc.start:newpos]
    if src != 'gbk' and 0x81 <= ch1 <= 0xFE and (0x40 <= ch2 <= 0x7E or 0x7E <= ch2 <= 0xFE):  # GBK
        return (unicode(sk, 'cp936'), newpos)
    if src != 'big5' and 0x81 <= ch1 <= 0xFE and (0x40 <= ch2 <= 0x7E or 0xA1 <= ch2 <= 0xFE):  # BIG5
        return (unicode(sk, 'big5'), newpos)
    raise TypeError('unknown codec!')

codecs.register_error("cjk_replace", cjk_replace)
```
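Once registered, the handler is selected by name through the errors argument, exactly like 'ignore' or 'replace'. A hypothetical usage sketch (Python 2): the input bytes are made up, and since the heuristics above can still give up on some byte sequences, the call is wrapped defensively and no particular output is promised.

```python
# Made-up input: GBK text with a UTF-8 encoded character glued onto the end.
mixed = u'\u4f60\u597d'.encode('gbk') + u'\u4f60'.encode('utf-8')

try:
    print repr(mixed.decode('gbk', 'cjk_replace'))
except (UnicodeDecodeError, TypeError, IndexError) as err:
    print 'recovery failed:', err
```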
Of course, this logic is still not rigorous, but for this kind of malformed text it is a bit more practical.
In any case, since Python exposes this capability, it is worth discussing how we could do even better.