Python black magic encoding conversion method

Source: Internet
Author: User
This article introduces a Python "black magic" technique for encoding conversion and analyzes how Python converts between encodings. Interested readers can use it as a reference.

When we use libraries in other languages for encoding conversion, there are usually only two (or three) ways to handle characters that cannot be understood:

  • Throw an exception

  • Replace with a replacement character

  • Skip

However, in the complex real world, unreliable text sources mean there will always be some discordant input, such as text that mixes several encodings. In that case, none of the options above gives a good result.

So the question is: does Python offer a better way?

The answer is yes!

Encoding conversion in Python is actually a two-step process:


source -> unicode -> dest

First, convert the string from the original encoding to unicode. Then convert unicode to the target encoding.
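
The two steps can be sketched as follows. This is a minimal illustration written for Python 3, which is an assumption on our part: the article's own code targets Python 2, where the intermediate type is `unicode` rather than `str`. The helper name `convert` and the sample text are hypothetical.

```python
def convert(data: bytes, source: str, dest: str) -> bytes:
    """Re-encode bytes from one encoding to another by way of Unicode."""
    text = data.decode(source)   # step 1: source encoding -> Unicode (str)
    return text.encode(dest)     # step 2: Unicode -> target encoding


src_bytes = "你好".encode("gbk")                  # sample GBK input
utf8_bytes = convert(src_bytes, "gbk", "utf-8")
print(utf8_bytes.decode("utf-8"))                 # 你好
```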

Step 1 is generally done with either the decode() or the unicode() function.
Step 2 is done with the encode() function.

The black magic we mentioned here is implemented in the first step.

Both the decode and unicode functions have an optional parameter called errors. Take a look at the official description:

  "errors may be given to set a different error handling scheme. Default is 'strict' meaning that encoding errors raise a UnicodeDecodeError. Other possible values are 'ignore' and 'replace' as well as any other name registered with codecs.register_error that is able to handle UnicodeDecodeErrors."

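This parameter usually takes one of three values:

  • strict: the default. If a decoding error occurs, UnicodeDecodeError is raised.

  • ignore: skip the bytes that cannot be decoded.

  • replace: substitute a replacement character for the bytes that cannot be decoded.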
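
The three built-in values can be compared side by side. The sketch below assumes Python 3, where `bytes.decode()` plays the role of the Python 2 `decode()`/`unicode()` calls; the sample bytes are made up for illustration.

```python
data = b"abc\xffdef"   # 0xFF never appears in valid UTF-8

try:
    data.decode("utf-8")          # errors="strict" is the default
except UnicodeDecodeError as exc:
    print("strict raised:", exc.reason, "at byte", exc.start)

ignored = data.decode("utf-8", errors="ignore")     # drops the bad byte
replaced = data.decode("utf-8", errors="replace")   # inserts U+FFFD instead

print(repr(ignored))    # 'abcdef'
print(repr(replaced))   # 'abc\ufffddef'
```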

Okay. Did you notice the last sentence of that description? This is where the show begins!

The codecs module has a function called register_error, which lets users register their own error-handling functions.
Here we use it to handle UnicodeDecodeError.

Let's look at the function prototype:


codecs.register_error(name, error_handler)

name: the name of the error handler. This is the value passed as the errors parameter of the decode function.
error_handler: the handler function. It accepts the exception object as its only parameter
and returns a tuple of two elements: the first is the replacement string produced by the error correction, and the second is the position at which decoding should continue.
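
To make the handler contract concrete, here is a minimal sketch of a custom handler, assuming Python 3. The handler name `hex_escape` and its behavior (showing each bad byte as a visible marker) are hypothetical and chosen only to illustrate the `(replacement, resume position)` return value.

```python
import codecs


def hex_escape(exc):
    """Replace each undecodable byte with a visible <0xNN> marker."""
    if not isinstance(exc, UnicodeDecodeError):
        raise TypeError("don't know how to handle %r" % exc)
    bad = exc.object[exc.start:exc.end]            # the offending bytes
    replacement = "".join("<0x%02X>" % b for b in bad)
    return replacement, exc.end                    # resume right after them


codecs.register_error("hex_escape", hex_escape)    # hypothetical handler name

result = b"abc\xffdef".decode("utf-8", errors="hex_escape")
print(result)   # abc<0xFF>def
```

Once registered, the name can be passed as the errors argument anywhere a codec accepts one.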

With these basic concepts in place, let's take a look at a specific implementation:


import codecs

def cjk_error(exc):
    # This handler only knows how to deal with decoding errors.
    if not isinstance(exc, UnicodeDecodeError):
        raise TypeError("don't know how to handle %r" % exc)
    # A second byte is needed to form a double-byte character.
    if exc.end + 1 > len(exc.object):
        raise TypeError('unknown codec, the object too short!')
    ch1 = ord(exc.object[exc.start:exc.end])
    newpos = exc.end + 1
    ch2 = ord(exc.object[exc.start + 1:newpos])
    sk = exc.object[exc.start:newpos]
    # GBK (the trail byte range excludes 0x7F)
    if 0x81 <= ch1 <= 0xFE and (0x40 <= ch2 <= 0x7E or 0x80 <= ch2 <= 0xFE):
        return (unicode(sk, 'cp936'), newpos)
    # BIG5
    if 0x81 <= ch1 <= 0xFE and (0x40 <= ch2 <= 0x7E or 0xA1 <= ch2 <= 0xFE):
        return (unicode(sk, 'big5'), newpos)
    raise TypeError('unknown codec!')

codecs.register_error("cjk_replace", cjk_error)

I copied this from the Internet. I thought it was good at first, but later I found that the algorithm is not reliable.
For example, UTF-8 and GBK overlap in the first two bytes. When a UTF-8 string is decoded as GBK, the error is reported at the third byte, because the first two bytes also fall within the GBK range and decode as one Chinese character.
For example:


a = '\xe4\xbd\xa0'           # "你" encoded as UTF-8
c = unicode(a[:2], 'gbk')    # returns normally
c = unicode(a, 'gbk')        # UnicodeDecodeError; the error occurs at the third byte
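
The same behavior can be observed in Python 3 (an assumption on our part; the snippet above is Python 2). The sketch below inspects the exception to confirm where the GBK decoder gives up; the variable name `error_start` is illustrative.

```python
a = "你".encode("utf-8")          # b'\xe4\xbd\xa0'

# The first two bytes happen to form a valid GBK character,
# so this succeeds but yields the wrong character.
wrong = a[:2].decode("gbk")
print(repr(wrong))

try:
    a.decode("gbk")
except UnicodeDecodeError as exc:
    error_start = exc.start       # index of the byte that failed
    print(exc.reason, "at byte index", error_start)
```

The failing index is 2, that is, the third byte, which matches the description above.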

To handle this case, we made the following improvement:


import codecs

def cjk_replace(exc):
    if not isinstance(exc, UnicodeDecodeError):
        raise TypeError("invalid exception type %s" % exc)
    src = exc.encoding
    if src in ('gbk', 'gb18030', 'big5'):
        # The two bytes before the error may have been consumed as one
        # double-byte character; try them with the failing byte as UTF-8.
        beg = exc.start - 2
        if beg >= 0:
            try:
                # Resume at exc.end; exc.end + 1 would skip a byte.
                return unicode(exc.object[beg:exc.end], 'utf8'), exc.end
            except UnicodeDecodeError:
                pass
    # A second byte is needed to form a double-byte character.
    if exc.end + 1 > len(exc.object):
        raise TypeError('unknown codec, the object too short!')
    ch1 = ord(exc.object[exc.start:exc.end])
    newpos = exc.end + 1
    ch2 = ord(exc.object[exc.start + 1:newpos])
    sk = exc.object[exc.start:newpos]
    # GBK (the trail byte range excludes 0x7F)
    if src != 'gbk' and 0x81 <= ch1 <= 0xFE and (0x40 <= ch2 <= 0x7E or 0x80 <= ch2 <= 0xFE):
        return (unicode(sk, 'cp936'), newpos)
    # BIG5
    if src != 'big5' and 0x81 <= ch1 <= 0xFE and (0x40 <= ch2 <= 0x7E or 0xA1 <= ch2 <= 0xFE):
        return (unicode(sk, 'big5'), newpos)
    raise TypeError('unknown codec!')

codecs.register_error("cjk_replace", cjk_replace)

Of course, this logic is still not rigorous, although it is somewhat more practical for this kind of malformed text.
Still, since Python provides this capability, it is worth discussing how we could do better.

