Python black magic: encoding conversion

Last Update:2018-07-18 Source: Internet

Author: User


This article introduces a piece of Python black magic for encoding conversion and analyzes how Python converts between encodings.

When the encoding conversion libraries of other languages meet characters they cannot interpret, they usually offer only two (or three) ways of handling them:

Throw an exception
Replace with a replacement character
Skip

In the messy real world, however, unreliable text always brings discordant elements such as mixed encodings, and in that case the options above fall short.

So the question is: does Python offer a better way?

The answer is yes!

Encoding conversion in Python is actually a two-step process:

source -> unicode -> dest

First, the string is converted from its original encoding to Unicode. Then the Unicode text is converted to the target encoding.

Step 1 is usually done with one of two functions, decode() or unicode().
Step 2 is done with the encode() function.
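The two steps can be sketched as follows. The article's own code targets Python 2, where unicode() exists; this sketch assumes Python 3, where bytes.decode() performs step 1 and str.encode() performs step 2. The sample text and the GBK-to-UTF-8 direction are chosen only for illustration.

```python
# Step 0: some bytes in the source encoding (GBK here, as an assumption).
gbk_bytes = "\u4f60\u597d".encode("gbk")   # the text "你好" stored as GBK

# Step 1: source -> Unicode.
text = gbk_bytes.decode("gbk")

# Step 2: Unicode -> dest.
utf8_bytes = text.encode("utf-8")

print(repr(gbk_bytes), repr(text), repr(utf8_bytes))
```

The intermediate value is always a Unicode string, which is why the error handling discussed next attaches to the decode step.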

The black magic we mentioned here is implemented in the first step.

Both decode and unicode functions have an optional parameter called errors. Take a look at the official description:

errors may be given to set a different error
handling scheme. Default is 'strict' meaning that encoding errors raise
a UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
as well as any other name registered with codecs.register_error that is
able to handle UnicodeDecodeErrors.

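This parameter usually takes one of three values:

strict: the default. If a decoding error occurs, UnicodeDecodeError is raised.
ignore: skip the bytes that cannot be decoded.
replace: substitute a replacement character for the bytes that cannot be decoded.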
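A small sketch of the three built-in values, assuming Python 3 and an arbitrary byte string containing one byte (0xFF) that is never valid in UTF-8:

```python
raw = b"abc\xffdef"  # 0xFF cannot appear in valid UTF-8

# strict (the default): the error is raised.
try:
    raw.decode("utf-8")
    strict_outcome = "decoded"
except UnicodeDecodeError as err:
    strict_outcome = "error at byte %d" % err.start

# ignore: the offending byte is dropped.
ignored = raw.decode("utf-8", errors="ignore")

# replace: the offending byte becomes U+FFFD, the Unicode replacement character.
replaced = raw.decode("utf-8", errors="replace")

print(strict_outcome, repr(ignored), repr(replaced))
```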

Did you notice the last sentence of that description? This is where the show begins!

The codecs module has a function called register_error, which lets users register their own error handling functions.
Here we use it to handle UnicodeDecodeError.

Let's look at the function prototype:

codecs.register_error(name, error_handler)

name: the name of the error handler. This is the value passed as the errors parameter of the decode function.
error_handler: the handler function. It accepts the exception as its only parameter.
It returns a tuple of two elements: the first is the corrected (replacement) string, and the second is the position at which decoding continues.
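To make the contract concrete, here is a minimal handler, assuming Python 3. The handler name "hex_escape" and its behaviour (showing each bad byte as <XX>) are invented for this illustration and are not part of the article's code.

```python
import codecs

def hex_escape(exc):
    """Hypothetical handler: render each undecodable byte as <XX>."""
    if not isinstance(exc, UnicodeDecodeError):
        raise TypeError("don't know how to handle %r" % exc)
    bad = exc.object[exc.start:exc.end]              # the bytes that failed
    replacement = "".join("<%02X>" % b for b in bad)
    # Return (replacement text, position where decoding resumes).
    return replacement, exc.end

# The first argument is the name later passed as errors=...
codecs.register_error("hex_escape", hex_escape)

text = b"abc\xffdef".decode("utf-8", errors="hex_escape")
print(text)
```

The exception object carries everything the handler needs: exc.object is the input, exc.start and exc.end delimit the failing bytes, and exc.encoding names the codec in use.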

With these basic concepts in place, let's look at a concrete implementation:

import codecs

def cjk_error(exc):
    # Only decoding errors can be repaired here.
    if not isinstance(exc, UnicodeDecodeError):
        raise TypeError("don't know how to handle %r" % exc)
    # A double-byte character needs one more byte after the failing one.
    if exc.end + 1 > len(exc.object):
        raise TypeError('unknown codec, the object is too short!')
    # Assumes the reported error span is a single byte.
    ch1 = ord(exc.object[exc.start:exc.end])
    newpos = exc.end + 1
    ch2 = ord(exc.object[exc.start + 1:newpos])
    sk = exc.object[exc.start:newpos]
    if 0x81 <= ch1 <= 0xFE and (0x40 <= ch2 <= 0x7E or 0x80 <= ch2 <= 0xFE):  # GBK
        return (unicode(sk, 'cp936'), newpos)
    if 0x81 <= ch1 <= 0xFE and (0x40 <= ch2 <= 0x7E or 0xA1 <= ch2 <= 0xFE):  # BIG5
        return (unicode(sk, 'big5'), newpos)
    raise TypeError('unknown codec!')

codecs.register_error("cjk_replace", cjk_error)

I copied this from the Internet. It looked good at first, but I later found that the algorithm is unreliable.
The byte ranges of UTF-8 and GBK overlap: the first two bytes of a UTF-8 character can also form a valid Chinese character in the GBK range. So when a UTF-8 string is decoded as GBK, the error is only reported at the third byte.
For example:

a = "你"  # UTF-8 encoding: '\xe4\xbd\xa0'
c = unicode(a[:2], 'gbk')  # returns normally
c = unicode(a, 'gbk')  # UnicodeDecodeError: the error occurs at the third byte
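The same overlap can be reproduced in Python 3, where the input is a bytes object and decode() takes the place of unicode(). This is only a sketch of the behaviour described above, using the same three bytes:

```python
utf8_bytes = "\u4f60".encode("utf-8")   # "你" -> b'\xe4\xbd\xa0'

# The first two bytes happen to form a valid (but wrong) GBK character.
prefix = utf8_bytes[:2].decode("gbk")

# Decoding all three bytes as GBK fails only at the leftover third byte.
try:
    utf8_bytes.decode("gbk")
    error_start = None
except UnicodeDecodeError as err:
    error_start = err.start             # index of the byte that failed

print(repr(prefix), error_start)
```

By the time the handler is called, the first two bytes have already been turned into a wrong character, which is what makes this case awkward to repair.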

To handle this case, we made the following improvements:

import codecs

def cjk_replace(exc):
    if not isinstance(exc, UnicodeDecodeError):
        raise TypeError("invalid exception type %s" % exc)
    src = exc.encoding
    if src in ('gbk', 'gb18030', 'big5'):
        # The two bytes before the error may have been consumed as one
        # double-byte character; retry them plus the failing byte as UTF-8.
        beg = exc.start - 2
        if beg >= 0:
            try:
                # Resume right after the bytes that were just decoded.
                return unicode(exc.object[beg:exc.end], 'utf8'), exc.end
            except UnicodeDecodeError:
                pass
    # A double-byte character needs one more byte after the failing one.
    if exc.end + 1 > len(exc.object):
        raise TypeError('unknown codec, the object is too short!')
    ch1 = ord(exc.object[exc.start:exc.end])
    newpos = exc.end + 1
    ch2 = ord(exc.object[exc.start + 1:newpos])
    sk = exc.object[exc.start:newpos]
    if src != 'gbk' and 0x81 <= ch1 <= 0xFE and (0x40 <= ch2 <= 0x7E or 0x80 <= ch2 <= 0xFE):  # GBK
        return (unicode(sk, 'cp936'), newpos)
    if src != 'big5' and 0x81 <= ch1 <= 0xFE and (0x40 <= ch2 <= 0x7E or 0xA1 <= ch2 <= 0xFE):  # BIG5
        return (unicode(sk, 'big5'), newpos)
    raise TypeError('unknown codec!')

codecs.register_error("cjk_replace", cjk_replace)

Of course, this logic is still not rigorous enough, although it copes with this kind of malformed encoding a little better in practice.
Still, since Python provides this capability, it is worth discussing how we could do better.
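One possible direction, sketched here under stated assumptions rather than offered as the article's method: decode as UTF-8 first and fall back to GBK only for the bytes that fail. UTF-8 is strict about its byte patterns, so it is less likely to swallow foreign bytes silently. The sketch assumes Python 3; the handler name "gbk_fallback" is invented for this example, and it treats every failing position as the start of one two-byte GBK character, which will not hold for all inputs.

```python
import codecs

def gbk_fallback(exc):
    """Hypothetical handler: retry undecodable UTF-8 bytes as one GBK character."""
    if not isinstance(exc, UnicodeDecodeError):
        raise TypeError("don't know how to handle %r" % exc)
    # Take the failing byte plus the next one as a candidate GBK pair.
    pair = exc.object[exc.start:exc.start + 2]
    try:
        return pair.decode("gbk"), exc.start + len(pair)
    except UnicodeDecodeError:
        # Not valid GBK either: emit U+FFFD and skip a single byte.
        return "\ufffd", exc.start + 1

codecs.register_error("gbk_fallback", gbk_fallback)

# Mixed input: "你好" in UTF-8 followed by "世界" in GBK.
mixed = "\u4f60\u597d".encode("utf-8") + "\u4e16\u754c".encode("gbk")
recovered = mixed.decode("utf-8", errors="gbk_fallback")
print(recovered)
```

This is still a heuristic: a GBK pair whose bytes happen to form valid UTF-8 would be decoded wrongly without the handler ever being called.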

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
