Introduction to codecs, Python's character encoding conversion module

Source: Internet
Author: User
Python has good support for multiple languages and can process text in any character encoding currently in use. This article takes a closer look at how Python handles text in different languages.

The first thing to understand is that when Python transcodes text, it goes through its internal encoding. The conversion process is:


source encoding -> internal encoding (Unicode) -> target encoding
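
As a minimal sketch of this pipeline, the snippet below re-encodes GB2312 bytes as UTF-8 by way of Unicode. It assumes Python 3, where the internal Unicode type is called str and encoded byte strings are bytes. The helper name transcode and the sample text are hypothetical.

```python
def transcode(data, source_encoding, target_encoding):
    """Re-encode bytes by passing through Unicode (hypothetical helper)."""
    text = data.decode(source_encoding)    # source encoding -> internal Unicode
    return text.encode(target_encoding)    # internal Unicode -> target encoding


sample = "\u5317\u4eac"                    # "北京", two characters (sample text)
gb_bytes = sample.encode("gb2312")         # 2 bytes per character in GB2312
utf8_bytes = transcode(gb_bytes, "gb2312", "utf-8")

print(len(gb_bytes), len(utf8_bytes))      # 4 6
```

The same text occupies a different number of bytes in each encoding, which is why the conversion always goes through Unicode instead of mapping bytes to bytes directly.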


Internally, Python handles text as Unicode. Unicode has two storage formats to consider: UCS-2, which has 65,536 code points, and UCS-4, which has 2,147,483,648 code points. Python supports both, and the choice is made at compile time with --enable-unicode=ucs2 or --enable-unicode=ucs4. So which format does the Python you have installed use? One way to find out is to check the value of sys.maxunicode:


import sys
print sys.maxunicode

If the output is 65535, the build uses UCS-2; if the output is 1114111, it uses UCS-4.
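
The check can be wrapped in a small function. This is only a sketch, and the name unicode_build is hypothetical. It is written so that it also runs on Python 3; note that from Python 3.3 onward the compile-time option no longer exists and sys.maxunicode is always 1114111.

```python
import sys


def unicode_build(maxunicode=sys.maxunicode):
    """Classify a sys.maxunicode value (hypothetical helper)."""
    if maxunicode == 0xFFFF:       # 65535: narrow (UCS-2) build
        return "UCS-2"
    if maxunicode == 0x10FFFF:     # 1114111: wide (UCS-4) build
        return "UCS-4"
    return "unknown"


print(unicode_build())
```
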
Keep in mind that once a string has been converted to the internal encoding, it is no longer of type str. It is of type unicode:


a = "gobbled"
print type(a)
# Decode the gb2312 byte string into a unicode object
b = unicode(a, "gb2312")
print type(b)


Output:

<type 'str'>
<type 'unicode'>


At this point, b can easily be converted to other encodings, such as UTF-8:


c = b.encode("utf-8")
print c


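The printed value of c may look garbled, and that is expected: c is now a UTF-8 encoded string, and a console that expects another encoding, such as GB2312, will not render it correctly.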
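
The effect can be reproduced without a console. The sketch below assumes Python 3 and uses a hypothetical sample string; GBK (a superset of GB2312) stands in for the encoding a console might assume.

```python
text = "\u5317\u4eac"                      # "北京", sample text (assumption)
utf8_bytes = text.encode("utf-8")          # plays the role of c in the article

# Reading the UTF-8 bytes as if they were GBK yields different characters.
garbled = utf8_bytes.decode("gbk", errors="replace")
# Decoding with the encoding that was actually used restores the text.
restored = utf8_bytes.decode("utf-8")

print(garbled != text, restored == text)   # True True
```

The bytes themselves are intact; only the interpretation is wrong, so decoding with the right encoding recovers the original text.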

Now it is time to talk about the codecs module, which is closely related to the concepts above. codecs is designed for encoding conversion. Through its interface it can also be extended to other kinds of transformations, but those are not covered here.


# -*- coding: gb2312 -*-
import codecs, sys

print '-' * 60
# Look up the GB2312 codec
look = codecs.lookup("gb2312")
# Look up the UTF-8 codec
look2 = codecs.lookup("utf-8")

# 7 Chinese characters, 14 bytes when the source file is saved as GB2312
a = "我爱北京天安门"

print len(a), a
# Convert a to internal Unicode. The method is named decode because it decodes the gb2312 byte string into Unicode
b = look.decode(a)
# The result is a tuple: b[0] is the data, b[1] is the length of input consumed; the data is now of type unicode
print b[1], b[0], type(b[0])
# Convert the internal Unicode back to a gb2312-encoded string; the data returned by encode is of type str
b2 = look.encode(b[0])
# Notice the difference? The reported length changed from 14 to 7. It is now the number of characters consumed; before, it was the number of bytes
print b2[1], b2[0], type(b2[0])
# encode reports the number of characters, but len(b2[0]) is still 14, because b2[0] is a byte string again
print len(b2[0])
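
For readers on Python 3, the same round trip can be sketched as follows. The variable names are hypothetical, and the text is written with escape sequences so the result does not depend on how the source file is saved. The object returned by codecs.lookup exposes encode and decode functions that each return a tuple of the converted data and the length of input consumed.

```python
import codecs

codec = codecs.lookup("gb2312")            # returns a CodecInfo object

# "我爱北京天安门": 7 characters
text = "\u6211\u7231\u5317\u4eac\u5929\u5b89\u95e8"
raw = text.encode("gb2312")                # 14 bytes in GB2312

decoded, consumed_bytes = codec.decode(raw)      # bytes -> str
encoded, consumed_chars = codec.encode(decoded)  # str -> bytes

print(consumed_bytes, consumed_chars, len(encoded))   # 14 7 14
```

decode counts the bytes it consumed and encode counts the characters it consumed, which accounts for the change from 14 to 7 described in the comments above.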

The code above shows the most common use of codecs. Another question: what if the file we are working with uses a different character encoding? Reading it requires special handling, and codecs provides a method for this as well.


# -*- coding: gb2312 -*-
import codecs, sys

# Use the open function provided by codecs to specify the encoding of the file; its content is converted to internal Unicode automatically when read
bfile = codecs.open("dddd.txt", 'r', "big5")
#bfile = open("dddd.txt", 'r')

ss = bfile.read()
bfile.close()
# Print the converted result. If the file were opened with the built-in open function instead, the output here would be garbled
print ss, type(ss)


The example above handles Big5. To try it, find a Big5-encoded file and run the code against it.
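
If no Big5 file is at hand, one can be created first. The sketch below assumes Python 3; the file name big5_sample.txt and the sample text are hypothetical. It writes a small Big5 file into a temporary directory, reads it back through codecs.open, and then reads the raw bytes for comparison.

```python
import codecs
import os
import tempfile

sample = "\u4e2d\u6587"                    # "中文", stand-in text (assumption)

# Create a small Big5-encoded file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "big5_sample.txt")
with codecs.open(path, "w", "big5") as f:
    f.write(sample)

# Reading through codecs.open decodes Big5 to Unicode automatically.
with codecs.open(path, "r", "big5") as f:
    decoded = f.read()

# Reading in binary mode returns the raw Big5 bytes instead.
with open(path, "rb") as f:
    raw = f.read()

print(decoded == sample, len(raw))         # True 4
```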