Python handles multiple languages well: it can work with characters in arbitrary encodings. This article takes a closer look at how Python deals with many different languages.
One thing to be aware of is that when Python converts between encodings, it goes through its internal encoding. The conversion process is:
The code is as follows:
source encoding -> internal encoding -> target encoding
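As a minimal sketch of this two-step process (assuming Python 2, which this article targets; the string literal is only an illustration):
The code is as follows:
# source bytes -> internal Unicode -> target bytes
s_gb = u'\u4e2d\u6587'.encode('gb2312')   # pretend these bytes arrived gb2312-encoded ("中文")
u = s_gb.decode('gb2312')                 # step 1: decode to the internal Unicode
s_utf8 = u.encode('utf-8')                # step 2: encode to the target encoding
print type(s_gb), type(u), type(s_utf8)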
Internally, Python handles text as Unicode, but Unicode storage comes in two formats: UCS-2, which has 65,536 code points, and UCS-4, which has 2,147,483,648 code points. Python supports both formats; the choice is made at compile time via --enable-unicode=ucs2 or --enable-unicode=ucs4. So how do we determine which format our own default-installed Python uses? One way is to check the value of sys.maxunicode:
The code is as follows:
import sys
print sys.maxunicode
If the printed value is 65535, the build uses UCS-2; if it is 1114111, the build uses UCS-4.
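If you need this check often, it can be wrapped in a small helper (a hedged sketch; unicode_build is a made-up name, not a standard function):
The code is as follows:
import sys
def unicode_build():
    # hypothetical helper: maps sys.maxunicode to the build type
    return "UCS-2" if sys.maxunicode == 65535 else "UCS-4"
print unicode_build()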
One thing we should realize is that when a string has been converted to the internal encoding, it is no longer of type str; it is of type unicode:
The code is as follows:
#-*- encoding: gb2312 -*-
a = "旋风"   # "whirlwind"
print type(a)
b = unicode(a, "gb2312")
print type(b)
Output:
The code is as follows:
<type 'str'>
<type 'unicode'>
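As an aside (a Python 2 detail noted here, not from the original article), unicode(a, "gb2312") and the str.decode method are equivalent spellings, so the conversion can also be written:
The code is as follows:
b = a.decode("gb2312")   # same result as unicode(a, "gb2312"): a unicode object
print type(b)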
At this point b can easily be converted to any other encoding, for example to UTF-8:
The code is as follows:
c = b.encode("utf-8")
print c
The output of c looks garbled, and that is as expected, because c is a UTF-8 byte string being printed to a console that is not using UTF-8.
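As a quick sanity check (a minimal sketch reusing the variables b and c from above), you can decode c back to Unicode and confirm the round trip is lossless:
The code is as follows:
u2 = c.decode("utf-8")
print u2 == b   # True: gb2312 -> Unicode -> utf-8 -> Unicode loses nothing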
Now it is time to talk about the codecs module, which is closely related to the concepts above. codecs is dedicated to encoding conversion, and its interface can also be extended to other kinds of transformations on encodings, which are not covered here.
The code is as follows:
#-*- encoding: gb2312 -*-
import codecs, sys

print '-' * 60
# create a gb2312 codec
look = codecs.lookup("gb2312")
# create a utf-8 codec
look2 = codecs.lookup("utf-8")
a = "我爱北京天安门"
print len(a), a
# Convert a to the internal Unicode. But why is the method named decode? My
# understanding: it decodes the gb2312 byte string into Unicode.
b = look.decode(a)
# b[0] is the converted data, b[1] is the number of bytes consumed; the type is now unicode
print b[1], b[0], type(b[0])
# Convert the internal Unicode back into a gb2312-encoded string; the encode method returns a str
b2 = look.encode(b[0])
# Notice anything different? After the conversion the reported length changed from 14 to 7!
# The length returned now is the number of characters; before, it was the number of bytes.
print b2[1], b2[0], type(b2[0])
# Although 7 was returned above, that does not mean len(b2[0]) is 7; it is still 14.
# Only the codecs encode call counts characters.
print len(b2[0])
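The same conversions can also be spelled with the convenience functions codecs.getdecoder and codecs.getencoder, which return the decode/encode callables directly. A minimal sketch, assuming the source file is saved as gb2312 like the example above:
The code is as follows:
#-*- encoding: gb2312 -*-
import codecs
decode_gb = codecs.getdecoder("gb2312")
encode_gb = codecs.getencoder("gb2312")
u, nbytes = decode_gb("我爱北京天安门")   # same (data, length consumed) pair as look.decode
s, nchars = encode_gb(u)
print nbytes, nchars, type(u), type(s)   # 14 7 <type 'unicode'> <type 'str'>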
The code above is the most common usage of codecs. A further question: what if the file we are dealing with uses some other character encoding? Reading and processing it also needs special handling, and codecs provides a method for this as well.
The code is as follows:
#-*- encoding: gb2312 -*-
import codecs, sys

# Use the open method provided by codecs and specify the encoding of the file
# being opened; the content is converted to the internal Unicode automatically on read
bfile = codecs.open("dddd.txt", 'r', "big5")
#bfile = open("dddd.txt", 'r')
ss = bfile.read()
bfile.close()
# Print the result of the conversion. If you open the file with the built-in open
# function instead, the output will be garbled.
print ss, type(ss)
The code above handles Big5; you can find a Big5-encoded file and try it out.
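codecs.open works for writing as well: open the file in 'w' mode with an encoding, write unicode objects, and they are encoded on the way to disk. A minimal sketch (the filename is illustrative; traditional characters are used because Big5 only covers traditional Chinese):
The code is as follows:
import codecs
ofile = codecs.open("out_big5.txt", 'w', "big5")   # illustrative filename
ofile.write(u"\u6211\u611b\u5317\u4eac\u5929\u5b89\u9580")   # u"我愛北京天安門": unicode in, Big5 bytes on disk
ofile.close()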