Python supports many languages and can process arbitrary characters. Here, I will take a closer look at how Python handles text in different languages.
One thing to note is that whenever Python needs to convert between encodings, it goes through its internal encoding. The conversion process is:
original encoding -> internal encoding (Unicode) -> destination encoding
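For example, converting a GB2312 byte string to UTF-8 is really two steps: decode into the internal Unicode representation, then encode to the target charset. A minimal sketch (the string literal is just an illustration; the source file is assumed to be saved as GB2312):
# -*- encoding: gb2312 -*-
s = "我爱北京天安门"            # a GB2312-encoded byte string
u = s.decode("gb2312")          # step 1: original encoding -> internal unicode
t = u.encode("utf-8")           # step 2: internal unicode -> destination encoding
print type(s), type(u), type(t)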
Python uses Unicode internally, but there are two internal Unicode formats to consider: UCS-2, which has 65536 code points, and UCS-4, which has 2147483648 code points. Python supports both formats; the choice is made when the interpreter is compiled, with --enable-unicode=ucs2 or --enable-unicode=ucs4. How can we tell which format a default Python installation uses? One way is to check the value of sys.maxunicode:
import sys
print sys.maxunicode
If the output is 65535, it is a UCS-2 build; if the output is 1114111, it is a UCS-4 build.
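Putting that check into a small script (a minimal sketch based only on the two values above):
import sys

# Report which Unicode build this interpreter was compiled with.
if sys.maxunicode == 65535:
    print "UCS-2 (narrow) Unicode build"
elif sys.maxunicode == 1114111:
    print "UCS-4 (wide) Unicode build"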
We need to realize that when a string is converted to the internal encoding, it is no longer of the str type! It is of the unicode type:
A = "wind and the cloud"
Print type ()
B = A. Unicode (a, "gb2312 ")
Print type (B)
Output:
<type 'str'>
<type 'unicode'>
At this point, b can easily be converted to other encodings, such as UTF-8:
c = b.encode("utf-8")
print c
The output of c looks garbled on a GB2312 console, and that is expected, because c is now a UTF-8 encoded byte string.
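As a quick check (a minimal sketch continuing with a, b and c from above), decoding c back from UTF-8 recovers the same unicode object, so the round trip is lossless:
d = c.decode("utf-8")            # UTF-8 bytes -> internal unicode again
print d == b                     # True: the same unicode text
print d.encode("gb2312") == a    # True: back to the original GB2312 bytes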
Now let's talk about the codecs module, which is closely related to the concepts above. codecs is used for encoding conversion; its interface can also be extended to other kinds of transformations, but that is not covered here.
# -*- encoding: gb2312 -*-
import codecs, sys

print '-' * 60
# Look up a GB2312 codec
look = codecs.lookup("gb2312")
# Look up a UTF-8 codec
look2 = codecs.lookup("utf-8")

a = "我爱北京天安门"   # "I love Beijing Tiananmen", a GB2312-encoded byte string
print len(a),
# Convert a to the internal unicode. Why is the method named decode?
# The way I understand it, it decodes the GB2312 byte string into unicode.
b = look.decode(a)
# The returned b[0] is the data and b[1] is the length consumed; b[0] is now unicode.
print b[1], b[0], type(b[0])
# Convert the internal unicode back to a GB2312-encoded string.
# The encode method returns a str.
b2 = look.encode(b[0])
# Notice anything different? After the conversion the reported length changes from 14 to 7!
# The length returned now is the number of characters; the earlier one was the number of bytes.
print b2[1], b2[0], type(b2[0])
# Although the count above is 7 characters, len(b2[0]) is still 14;
# only the codecs encode call reports the character count.
print len(b2[0])
The code above shows the most common use of codecs. Another question: what if the file we need to process is in a different character encoding? Reading it also requires special handling, and codecs provides a method for that too.
# -*- encoding: gb2312 -*-
import codecs, sys

# Use the open function provided by codecs to specify the encoding of the file;
# the content is converted to internal unicode automatically during reading.
bfile = codecs.open("dddd.txt", 'r', "big5")
# bfile = open("dddd.txt", 'r')
ss = bfile.read()
bfile.close()
# Print the result. The converted text is displayed here; if the file were opened
# with the built-in open function instead, the output would be garbled.
print ss, type(ss)
The example above handles Big5; to try it out, find a Big5-encoded file.
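codecs.open works the same way for writing: unicode passed to write() is encoded to the target charset on the fly. A minimal sketch (the file name out_utf8.txt is just an example):
# -*- encoding: gb2312 -*-
import codecs

# Write unicode text out as UTF-8 bytes; codecs.open does the encoding for us.
ufile = codecs.open("out_utf8.txt", 'w', "utf-8")
ufile.write(u"我爱北京天安门")   # a unicode string, stored on disk as UTF-8
ufile.close()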