Introduction to codecs, Python's module for text encoding and conversion

Last Update:2017-05-14 Source: Internet

Author: User


This article introduces codecs, the Python module dedicated to encoding conversion. Its interface can also be extended to other kinds of transformations. Python has good support for multilingual text and can handle characters in any encoding currently in use. Here I take a closer look at how Python handles text in different languages.

One thing to note is that Python performs every encoding conversion through its internal encoding. The conversion process is as follows:

The code is as follows:

Original encoding -> internal encoding -> destination encoding
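The pipeline above can be sketched in a few lines. This is a minimal illustration written for Python 3, where `bytes` plays the role of Python 2's `str` and `str` plays the role of Python 2's `unicode`. The helper name `transcode` and the sample text are assumptions made for this example, not part of the original article.

```python
# Python 3 sketch: bytes ~ Python 2 str, str ~ Python 2 unicode.

def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert encoded bytes from one encoding to another via Unicode.

    `transcode` is a hypothetical helper used only for illustration.
    """
    text = data.decode(source)   # original encoding -> internal Unicode
    return text.encode(target)   # internal Unicode -> destination encoding

# Sample input: two Chinese characters, 2 bytes each in GB2312.
gb_bytes = "风云".encode("gb2312")
utf8_bytes = transcode(gb_bytes, "gb2312", "utf-8")

# The same text needs a different number of bytes in each encoding.
print(len(gb_bytes), len(utf8_bytes))
```

The text never goes straight from one byte encoding to the other: the intermediate Unicode value is what makes any pair of encodings convertible.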

Python uses Unicode internally, but Unicode has two storage formats to consider: UCS-2, which has 65,536 code points, and UCS-4, which has 2,147,483,648 code points. Python supports both. The format is chosen at compile time with --enable-unicode=ucs2 or --enable-unicode=ucs4. How can we tell which one an installed Python uses? One way is to check the value of sys.maxunicode:

The code is as follows:

import sys
print sys.maxunicode

If the output is 65535, the build uses UCS-2; if the output is 1114111, it uses UCS-4.
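The check can be wrapped in a small function. The sketch below is written for Python 3, and the function name `describe_unicode_build` is made up for this example. Note that Python 3.3 and later always report 1114111, so the UCS-2 branch only applies to older builds.

```python
import sys

def describe_unicode_build(maxunicode: int) -> str:
    """Map a sys.maxunicode value to the build type described above.

    Hypothetical helper for illustration only.
    """
    if maxunicode == 0xFFFF:      # 65535
        return "UCS-2 (narrow build)"
    if maxunicode == 0x10FFFF:    # 1114111
        return "UCS-4 (wide build)"
    return "unknown"

print(describe_unicode_build(sys.maxunicode))
```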
Keep in mind that once a string has been converted to the internal encoding, it is no longer of type str. It is of type unicode:

The code is as follows:

a = "wind and the cloud"
print type(a)
# Decode the gb2312 byte string into a unicode object
b = unicode(a, "gb2312")
print type(b)

Output:

The code is as follows:

<type 'str'>
<type 'unicode'>

At this point b can easily be converted to other encodings, such as UTF-8:

The code is as follows:

c = b.encode("UTF-8")
print c

The output of c may look garbled on a terminal that expects another encoding. That is expected, because c is now a UTF-8 encoded string.
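To see where the garbling comes from, the sketch below (Python 3, with a sample string chosen for this example) encodes text as UTF-8 and then deliberately reads those bytes back with the wrong codec, which is roughly what a GB2312 terminal does when it receives UTF-8 output.

```python
text = "风云"                       # sample text, an assumption for this example
utf8_bytes = text.encode("utf-8")   # Unicode -> UTF-8 bytes

# Interpret the UTF-8 bytes as if they were GB2312.
# errors="replace" substitutes a marker for byte sequences GB2312 cannot decode.
misread = utf8_bytes.decode("gb2312", errors="replace")

# Reading the bytes back with the matching codec restores the text.
roundtrip = utf8_bytes.decode("utf-8")

# ascii() escapes non-ASCII characters so the example prints on any terminal.
print(ascii(misread))
print(ascii(roundtrip))
```

The bytes themselves are not damaged; only the interpretation is wrong, which is why decoding with the correct codec recovers the original text.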

Now for the codecs module, which is closely related to the concepts described above. codecs is used for encoding conversion. Its interface can also be extended to other kinds of transformations, but that is not covered here.

The code is as follows:

# -*- coding: gb2312 -*-
import codecs, sys

print '-' * 60
# Look up the gb2312 codec
look = codecs.lookup("gb2312")
# Look up the UTF-8 codec
look2 = codecs.lookup("UTF-8")

# A 7-character Chinese string, which is 14 bytes when this file is saved as gb2312
a = "我爱北京天安门"

print len(a), a
# Convert a to internal unicode. Why is the method named decode? Because it decodes the gb2312 string into unicode.
b = look.decode(a)
# b[0] is the data and b[1] is the length consumed. The data is now of type unicode.
print b[1], b[0], type(b[0])
# Convert the internal unicode back to a gb2312 string. The encode method returns a str.
b2 = look.encode(b[0])
# Notice the difference? The reported length changed from 14 to 7. encode consumes unicode characters, so it reports the number of characters; decode consumed bytes, so it reported the number of bytes.
print b2[1], b2[0], type(b2[0])
# The 7 above is the number of characters consumed, not the length of the result. len(b2[0]) is still 14, because b2[0] is a byte string.
print len(b2[0])
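For readers on Python 3, the same tuple-returning behaviour can be checked with the sketch below. It assumes the sample string is the 7-character phrase from the listing above. In Python 3 the input to decode must be `bytes` and the input to encode must be `str`.

```python
import codecs

info = codecs.lookup("gb2312")          # CodecInfo object for gb2312

raw = "我爱北京天安门".encode("gb2312")  # 7 characters -> 14 bytes

# decode returns (text, number of bytes consumed)
text, bytes_consumed = info.decode(raw)

# encode returns (bytes, number of characters consumed)
encoded, chars_consumed = info.encode(text)

print(bytes_consumed, chars_consumed, len(encoded))  # prints: 14 7 14
```

The second element of each tuple describes the input that was consumed, which is why it differs between the two calls even though the text is the same.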

The code above shows the most common use of codecs. Another question: what if the file we are processing uses some other character encoding? Reading it also needs special handling, and codecs provides a method for that as well.

The code is as follows:

# -*- coding: gb2312 -*-
import codecs, sys

# codecs.open lets you specify the encoding of the file being opened; the content is converted to internal unicode automatically as it is read
bfile = codecs.open("dddd.txt", 'r', "big5")
# bfile = open("dddd.txt", 'r')

ss = bfile.read()
bfile.close()
# Print the converted result. If the file were opened with the built-in open function instead, the output would most likely be garbled.
print ss, type(ss)

The example above reads a Big5 file. To try it, find or create a Big5-encoded file.
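If no Big5 file is at hand, one can be generated first. The following Python 3 sketch writes a temporary Big5 file and reads it back with codecs.open. The sample text and the temporary file are assumptions for this example. In Python 3 the built-in open also accepts an encoding argument and is usually the simpler choice.

```python
import codecs
import os
import tempfile

sample = "你好"   # sample text; both characters exist in Big5

# Create a temporary file to stand in for dddd.txt.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Write the text as Big5 bytes.
    with codecs.open(path, "w", "big5") as out:
        out.write(sample)

    # Read it back; codecs.open decodes Big5 to Unicode while reading.
    with codecs.open(path, "r", "big5") as bfile:
        ss = bfile.read()

    # Read the raw bytes to compare sizes.
    with open(path, "rb") as rawfile:
        raw = rawfile.read()
finally:
    os.remove(path)

# Two characters on the Unicode side, four bytes on disk.
print(type(ss).__name__, len(ss), len(raw))
```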

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
