An Introduction to codecs, Python's Natural-Language Encoding Conversion Module

Source: Internet
Author: User

Python handles multiple languages well and can work with text in almost any encoding in use today. This article takes a closer look at how Python handles text in different languages.

Be aware that when Python converts text from one encoding to another, it goes through its internal encoding. The conversion process is:

source encoding -> internal encoding -> target encoding
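The pipeline above can be sketched in a few lines. This is a minimal illustration that assumes Python 3, where the internal form is the str type and encoded data is bytes; the helper name convert and the sample text are hypothetical and not part of the original article.

```python
# A sketch of: source encoding -> internal encoding -> target encoding.
# Assumes Python 3 (bytes for encoded data, str for the internal Unicode form).

def convert(data, source, target):
    """Re-encode bytes from the source encoding to the target encoding."""
    text = data.decode(source)   # source encoding -> internal Unicode (str)
    return text.encode(target)   # internal Unicode -> target encoding

# Hypothetical sample: seven Chinese characters stored as GB2312 bytes.
gb_bytes = "我爱北京天安门".encode("gb2312")
utf8_bytes = convert(gb_bytes, "gb2312", "utf-8")

print(len(gb_bytes), len(utf8_bytes))
```

The two byte strings carry the same text but have different lengths, because GB2312 uses two bytes for each of these characters while UTF-8 uses three.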

Internally Python works with Unicode, but the interpreter can store Unicode in one of two formats: UCS-2, which has 65,536 code points, and UCS-4, which has 2,147,483,648 code points. Python supports both, and the choice is made at compile time with --enable-unicode=ucs2 or --enable-unicode=ucs4. So how do we tell which one a default Python installation uses? One way is to check the value of sys.maxunicode:

import sys
print sys.maxunicode

If the output is 65535, the build uses UCS-2; if the output is 1114111, the build uses UCS-4.
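The same check can be wrapped in a small function. This sketch assumes Python 3 syntax, and the function name unicode_build is hypothetical. Note that since Python 3.3 the narrow/wide distinction no longer exists and sys.maxunicode is always 1114111, so the UCS-2 branch is only reachable on older interpreters.

```python
import sys

def unicode_build(maxunicode=sys.maxunicode):
    """Classify an interpreter build from its sys.maxunicode value."""
    if maxunicode == 0xFFFF:        # 65535
        return "UCS-2"
    if maxunicode == 0x10FFFF:      # 1114111
        return "UCS-4"
    return "unknown"

print(sys.maxunicode, unicode_build())
```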
Keep in mind that once a string has been converted to the internal encoding, it is no longer of type str. It is of type unicode:

a = "whirlwind"
print type(a)
b = unicode(a, "gb2312")
print type(b)

Output:

<type 'str'>
<type 'unicode'>
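For readers on Python 3, the type names differ: the encoded form is bytes and the internal Unicode form is str. The following sketch assumes Python 3 and reuses the sample string from above; it is a rough counterpart of the snippet, not the article's original code.

```python
# Assumes Python 3: bytes plays the role of Python 2's str,
# and str plays the role of Python 2's unicode.
raw = "whirlwind".encode("gb2312")   # encoded data
text = raw.decode("gb2312")          # internal Unicode text

print(type(raw))
print(type(text))
```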

At this point b can easily be converted to any other encoding, for example UTF-8:

c = b.encode("utf-8")
print c

If the text contains non-ASCII characters, the printed value of c may look garbled. That is expected: c is a UTF-8 byte string, and it displays incorrectly on a terminal that uses a different encoding.
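The garbling can be reproduced deliberately by decoding UTF-8 bytes with the wrong codec. This sketch assumes Python 3 and a hypothetical Chinese sample string; errors="replace" is used so that byte sequences that are invalid in GB2312 are substituted instead of raising UnicodeDecodeError.

```python
# Assumes Python 3. The sample text is hypothetical.
text = "我爱北京天安门"
utf8_bytes = text.encode("utf-8")

# Reading UTF-8 bytes as if they were GB2312 yields garbled text.
misread = utf8_bytes.decode("gb2312", errors="replace")
# Reading them with the matching codec recovers the original text.
correct = utf8_bytes.decode("utf-8")

print(repr(misread))
print(correct)
```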

Now it is time to talk about the codecs module, which is closely tied to the concepts above. codecs is dedicated to encoding conversion. Its interface can also be extended to other kinds of transformations, which are not covered here.

# -*- encoding: gb2312 -*-
# Save this file in GB2312 so the string literal below is stored as GB2312 bytes
import codecs, sys

print '-' * 60
# Look up the GB2312 codec
look = codecs.lookup("gb2312")
# Look up the UTF-8 codec
look2 = codecs.lookup("utf-8")

# Seven Chinese characters ("I love Beijing Tiananmen"), 14 bytes in GB2312
a = "我爱北京天安门"

print len(a), a
# Convert a to internal Unicode. The method is named decode because it decodes the gb2312 string into Unicode
b = look.decode(a)
# b[0] is the data and b[1] is the length; the data is now of type unicode
print b[1], b[0], type(b[0])
# Convert the internal Unicode back to a gb2312-encoded string; the encode method returns a str
b2 = look.encode(b[0])
# Notice the difference? The reported length changed from 14 to 7. It is now the number of characters; before, it was the number of bytes
print b2[1], b2[0], type(b2[0])
# Although a character count was returned above, len(b2[0]) is still 14, not 7; only the codec's encode counts characters
print len(b2[0])
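The byte count versus character count behavior can be checked on Python 3 as well. This sketch assumes Python 3, where codecs.lookup returns a CodecInfo object whose decode and encode methods each return a (result, length consumed) pair; the variable names are illustrative.

```python
import codecs

# Assumes Python 3. Look up the GB2312 codec.
codec = codecs.lookup("gb2312")

raw = "我爱北京天安门".encode("gb2312")        # 7 characters as GB2312 bytes

# decode reports how many bytes of input it consumed
text, consumed_bytes = codec.decode(raw)
# encode reports how many characters of input it consumed
encoded, consumed_chars = codec.encode(text)

print(consumed_bytes, len(text))
print(consumed_chars, len(encoded))
```

The length returned by each call describes the input that was consumed, which is why decode reports bytes and encode reports characters.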

The code above shows the most common use of codecs. Another question: what if the file we need to process is stored in some other character encoding? Reading such a file also needs special handling, and codecs provides a method for that as well.

# -*- encoding: gb2312 -*-
import codecs, sys

# Use the open function provided by codecs to specify the encoding of the file; its contents are converted to internal Unicode automatically when read
bfile = codecs.open("dddd.txt", 'r', "big5")
#bfile = open("dddd.txt", 'r')

ss = bfile.read()
bfile.close()
# Print the converted result. If you open the file with the built-in open function instead, the output will be garbled.
print ss, type(ss)


The example above handles Big5; find a Big5-encoded file and try it yourself.
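If no Big5 file is at hand, one can be generated first. This sketch assumes Python 3 and swaps in the built-in open with its encoding parameter, which is the usual Python 3 counterpart of codecs.open. The sample text and the temporary directory are hypothetical stand-ins; only the file name comes from the example above.

```python
import os
import tempfile

sample = "你好世界"   # hypothetical sample text, representable in Big5

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dddd.txt")

    # Write the text as raw Big5 bytes to create a test file.
    with open(path, "wb") as f:
        f.write(sample.encode("big5"))

    # Read it back, letting open() decode Big5 into str while reading.
    with open(path, "r", encoding="big5") as f:
        ss = f.read()

    # For comparison, read the undecoded bytes.
    with open(path, "rb") as f:
        raw = f.read()

print(ss, type(ss))
print(raw)
```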
