Summary: Python has been able to process Unicode characters since version 1.6.
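I. Several common encoding formats.
1.1, ASCII, expressed in 1 byte per character.
1.2, UTF-8, expressed in 1 to 4 bytes per character (1 to 3 bytes for characters in the Basic Multilingual Plane, which covers most Chinese and Japanese text). An ASCII character occupies only 1 byte, so ASCII is a subset of UTF-8.
1.3, UTF-16, expressed in 2 bytes per character (4 bytes for characters outside the Basic Multilingual Plane). In this article, "UTF-16" also stands for Python's unicode type, which narrow builds of Python 2 store internally as 16-bit units.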
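The byte counts listed above can be checked with a short sketch. It is written for Python 3 (an assumption, since the article itself targets Python 2), and the sample characters and the helper name encoded_length are chosen only for illustration. Note that a character outside the Basic Multilingual Plane, such as an emoji, takes 4 bytes in both UTF-8 and UTF-16.

```python
# Python 3 sketch: how many bytes one character occupies in each encoding.
samples = ["A", "é", "中", "😀"]  # illustrative characters, not from the article


def encoded_length(ch, encoding):
    """Return the number of bytes `ch` occupies in `encoding`."""
    return len(ch.encode(encoding))


for ch in samples:
    # "utf-16-le" is used so that no byte order mark is prepended.
    print(ascii(ch), encoded_length(ch, "utf-8"), encoded_length(ch, "utf-16-le"))

# ASCII is a subset of UTF-8: ASCII text yields identical bytes in both.
ascii_is_subset = "hello".encode("ascii") == "hello".encode("utf-8")
```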
II. Encoding and decoding of Python source files. A Python program goes through the following stages from being written to being executed:
Editor ----> source code ----> interpreter ----> output result
2.1. The editor determines the encoding format of the source code (this is set in the editor).
2.2. The interpreter also needs to know the encoding format of the source code. Unfortunately, it is difficult to determine a source file's encoding from the encoded data alone.
2.3. Supplement: on Windows, when UltraEdit saves the source code as UTF-8, it records a BOM mark in the file (the details of the BOM need not be studied here), so the ActivePython interpreter automatically recognizes the source file as UTF-8. If you edit the source file in Eclipse, however, the file is encoded as UTF-8 in the editor but no BOM is recorded, so you must add # coding=UTF-8 on the first or second line of the source file, with no space between "coding" and "=". Interestingly, a comment is used to tell the interpreter the encoding of the source file.
2.4. Example: we want to output "I am Chinese" to the terminal.
- # coding=UTF-8  tells the Python interpreter that the source file is UTF-8 encoded. I use Eclipse + PyDev.
- print "I am Chinese"  # The source file itself also needs to be saved with UTF-8 encoding
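The two points in section II (the data alone does not reveal its encoding, and a BOM or a coding comment tells the interpreter what to use) can be sketched as follows. This is a Python 3 illustration, not the article's Python 2 setup: it relies on the standard library function tokenize.detect_encoding, which applies the same coding-comment rule, and the helper name declared_encoding and the sample byte strings are assumptions made for the example.

```python
import codecs
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding Python 3 detects from a BOM or coding comment."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# The same bytes are valid text in more than one encoding, so the data
# alone cannot tell the interpreter which one was intended.
raw = "你好".encode("gb2312")
as_gb2312 = raw.decode("gb2312")    # the intended text
as_latin1 = raw.decode("latin-1")   # decodes without error, but to other characters

with_bom = codecs.BOM_UTF8 + b"print('hi')\n"
with_comment = b"# coding=latin-1\nprint('hi')\n"
# A space before "=" means the comment is not recognized as a declaration.
with_spaced_comment = b"# coding = latin-1\nprint('hi')\n"

print(declared_encoding(with_bom))             # utf-8-sig
print(declared_encoding(with_comment))         # iso-8859-1
print(declared_encoding(with_spaced_comment))  # utf-8 (the Python 3 default)
```

The last case falls back to the default because the declaration was not matched, which is why the spacing of the coding comment matters.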
III. Encoding conversion. Converting between two encodings must use unicode (which this article calls UTF-16) as the transfer station: decode from the source encoding first, then encode into the target encoding.
For example, suppose there is a Japanese file jap.txt containing the text "は 中 す. ", encoded in the Japanese encoding SHIFT_JIS.
There is also a file chn.txt containing the text "People's Republic of China", encoded in the Chinese encoding GB2312.
How can we merge the content of the two files and store it in utf.txt without producing garbled characters? We can convert the content of both files to UTF-8, because UTF-8 can represent both Chinese and Japanese characters.
- # coding=UTF-8
- try:
-     jap = open("e:/jap.txt", "r")
-     chn = open("e:/chn.txt", "r")
-     utf = open("e:/utf.txt", "w")
-     jap_text = jap.readline()
-     chn_text = chn.readline()
-     # Decode to unicode first, then encode to UTF-8
-     jap_text_utf8 = jap_text.decode("SHIFT_JIS").encode("UTF-8")  # the author notes that not converting to UTF-8 can also work
-     chn_text_utf8 = chn_text.decode("GB2312").encode("UTF-8")  # Encoding names are not case-sensitive; the same is true for UTF-8
-     utf.write(jap_text_utf8)
-     utf.write(chn_text_utf8)
-     # Close the files so that the output is flushed to disk
-     jap.close()
-     chn.close()
-     utf.close()
- except IOError, e:
-     print "open file error", e
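For readers on Python 3, the same decode-then-encode idea can be sketched as below. This is an illustration under stated assumptions rather than the article's own code: open() does the decoding and encoding through its encoding argument, the helper name merge_to_utf8 is hypothetical, and the sample files are created in a temporary directory with stand-in text instead of the article's e:/ paths.

```python
import os
import tempfile


def merge_to_utf8(sources, dest_path):
    """Decode each (path, encoding) source to str, then write everything as UTF-8."""
    with open(dest_path, "w", encoding="utf-8", newline="") as dest:
        for path, encoding in sources:
            with open(path, "r", encoding=encoding, newline="") as src:
                dest.write(src.read())


# Build two small sample files so the sketch is self-contained.
workdir = tempfile.mkdtemp()
jap_path = os.path.join(workdir, "jap.txt")
chn_path = os.path.join(workdir, "chn.txt")
utf_path = os.path.join(workdir, "utf.txt")

with open(jap_path, "wb") as f:
    f.write("日本語\n".encode("shift_jis"))        # hypothetical sample text
with open(chn_path, "wb") as f:
    f.write("中华人民共和国\n".encode("gb2312"))    # hypothetical sample text

merge_to_utf8([(jap_path, "shift_jis"), (chn_path, "gb2312")], utf_path)

# Read the result back as bytes and decode it to confirm it is valid UTF-8.
with open(utf_path, "rb") as f:
    merged = f.read().decode("utf-8")
```

In Python 3 the text held between reading and writing is a str, which plays the role of the unicode transfer station described in this section.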
IV. The Tk library supports ASCII, UTF-16, and UTF-8.
- # coding=UTF-8
- from Tkinter import *
- str1 = ""  # stays empty if the file cannot be read
- try:
-     jap = open("e:/jap.txt", "r")
-     str1 = jap.readline()
-     jap.close()
- except IOError, e:
-     print "open file error", e
- root = Tk()
- label1 = Label(root, text=str1.decode("SHIFT_JIS"))  # garbled characters are displayed without the decode
- label1.grid()
- root.mainloop()
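A rough Python 3 counterpart of the Tk example is sketched here. It assumes the module name tkinter (the Python 3 spelling of Tkinter); the helper names decode_for_label and show_label and the sample text are hypothetical. The window function is only defined, not called, because it needs a display.

```python
def decode_for_label(raw_bytes, encoding="shift_jis"):
    """Decode file bytes to str so that Tk receives text, not raw bytes."""
    return raw_bytes.decode(encoding)


def show_label(text):
    """Open a window with a single label (needs a display and the tkinter module)."""
    import tkinter as tk  # named Tkinter in Python 2

    root = tk.Tk()
    tk.Label(root, text=text).grid()
    root.mainloop()


# Hypothetical stand-in for a line read from jap.txt in binary mode.
sample = "日本語".encode("shift_jis")
label_text = decode_for_label(sample)
```

Calling show_label(label_text) on a machine with a display would then show the decoded text in a window.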
The above covers the basics of how Python handles character encodings. I hope it will be helpful to you.