004 - Python basics: character encoding and transcoding
I. Three encoding methods
- ASCII: A character encoding based on the Latin alphabet, designed mainly for modern English. Standard ASCII uses 7 bits, so it defines 2 ** 7 = 128 characters (codes 0 to 127), each of which fits in one byte. The 8-bit "extended ASCII" variants such as Latin-1 use the whole byte (2 ** 8 = 256 values) to add Western European characters, but they are separate encodings, not part of ASCII itself.
- Unicode (unified code, universal code): A character set that assigns every character and symbol a unique number, called a code point. Early versions assumed 16 bits (2 bytes) per character, that is, 2 ** 16 = 65536 code points. The current standard runs from U+0000 to U+10FFFF, so 2 bytes are no longer enough for every character.
- UTF-8: A variable-length encoding of Unicode. Instead of spending at least 2 bytes on every character, it sorts characters into classes: ASCII characters take 1 byte, most European characters take 2 bytes, most East Asian characters take 3 bytes, and the remaining characters (emoji, for example) take 4 bytes. UTF-8 saves a lot of space for mostly-ASCII text and is backward compatible with ASCII.
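The byte counts in the list above can be checked directly in Python 3. This is only a small sketch: the sample characters and the helper name `encoded_length` are chosen here for illustration and are not part of the original article.

```python
# Sample characters: ASCII letter, Western European letter, Chinese character, emoji.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

def encoded_length(ch, encoding):
    """Return how many bytes `ch` needs in `encoding`, or None if it cannot be encoded."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None

for ch in samples:
    # Print the code point rather than the character so the output is safe on any console.
    print("U+%04X" % ord(ch),
          "ascii:", encoded_length(ch, "ascii"),
          "utf-8:", encoded_length(ch, "utf-8"),
          "utf-16:", encoded_length(ch, "utf-16-le"))
```

ASCII can only encode the first sample, while UTF-8 grows from 1 to 4 bytes as the code point gets larger.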
Detailed article (the "ultimate edition" on Python encoding):
http://www.cnblogs.com/yuanchenqi/articles/5956943.html
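In a word, Unicode is the standard that says which number each character maps to, and it is the form in which text is represented in memory. UTF-8 and the other UTF encodings are implementations that say how those numbers are stored and transmitted as bytes. That is the difference between Unicode and UTF.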
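To make the distinction concrete, here is a minimal Python 3 sketch. It takes one character, whose Unicode code point never changes, and encodes it three ways. The variable names are illustrative only.

```python
ch = "\u4e2d"                         # the Chinese character for "middle"

code_point = ord(ch)                  # the Unicode number, independent of any encoding
utf8_bytes = ch.encode("utf-8")       # three bytes
utf16_bytes = ch.encode("utf-16-be")  # two bytes
gbk_bytes = ch.encode("gbk")          # two bytes, but different ones

print(hex(code_point))
print(utf8_bytes, utf16_bytes, gbk_bytes)

# Every byte sequence decodes back to the same character, as long as
# the matching encoding is used.
decoded = {utf8_bytes.decode("utf-8"),
           utf16_bytes.decode("utf-16-be"),
           gbk_bytes.decode("gbk")}
```

One code point, three different byte sequences: the code point is the Unicode part, and the bytes are the storage part.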
II. Encoding of files from disk to memory
What is the data stored on the disk?
The answer is a byte string (bytes) encoded in some particular way. UTF-8, for example, is a space-saving variable-length encoding, and GBK is an older legacy encoding that is still in use. Every text editor therefore has a default encoding for saving files, such as UTF-8 or GBK. When we click Save, the editor silently encodes the text for us.
When we open the file again, the editor just as quietly decodes the bytes into Unicode and presents the readable text to the user.
Therefore, unicode is closer to the user, and bytes is closer to the computer.
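The save-and-open cycle can be imitated in a few lines of Python 3. This is a sketch, not what any particular editor does internally: the file name `note.txt`, the sample text, and the choice of GBK as the "editor default" are all assumptions made for the example.

```python
import os
import tempfile

text = "\u4f60\u597d, world"   # "hello" in Chinese followed by ASCII text

# "Save": the Unicode text is encoded to bytes with the chosen encoding.
path = os.path.join(tempfile.mkdtemp(), "note.txt")
with open(path, "w", encoding="gbk") as f:
    f.write(text)

# What is actually on disk is bytes, not characters.
with open(path, "rb") as f:
    raw = f.read()

# "Open": decoding with the same encoding recovers the text.
same = raw.decode("gbk")

# Decoding with a different encoding fails (or, in other cases, yields garbage).
try:
    wrong = raw.decode("utf-8")
except UnicodeDecodeError:
    wrong = None

print(raw)
print(same == text, wrong)
```

This is why a file saved by one program can look garbled in another: the bytes are the same, but the decoding step uses a different encoding.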
After all that, what does this have to do with running our programs?
First, get one concept clear: the Python interpreter is itself a piece of software, and in this respect it behaves much like a text editor.
Now let's restore the encoding process from creation to execution of a py file:
Open PyCharm, create the file hello.py, and write:
    ret = 1 + 1
    s = 'yuan hao'
    print(s)
When we save it, hello.py is written to disk in PyCharm's default encoding. When the file is closed and opened again, PyCharm reads the bytes, decodes them with the same default encoding into Unicode in memory, and we see our plain text again.
If you click the Run button or run the file from the command line, the Python interpreter opens the file and decodes the bytes on disk into Unicode, just as the editor does. The difference is what happens next: the interpreter compiles the Unicode source into bytecode, and its virtual machine (itself a compiled program, written in C in the case of CPython) executes that bytecode as machine instructions on the CPU.
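The interpreter's steps can be replayed by hand in Python 3. The sketch below uses the built-in `compile` and `exec` functions to stand in for what CPython does when it runs a file. The source text and the file name passed to `compile` are made up for the example.

```python
# What is stored on disk: bytes in some encoding (UTF-8 here).
source_bytes = "s = '\u4f60\u597d'\nresult = 1 + 1\n".encode("utf-8")

# Step 1: decode the bytes into Unicode text, exactly as an editor would.
source_text = source_bytes.decode("utf-8")

# Step 2: compile the text into a code object that holds bytecode.
code_obj = compile(source_text, "hello.py", "exec")

# Step 3: let the interpreter's virtual machine execute the bytecode.
namespace = {}
exec(code_obj, namespace)

print(namespace["result"], len(code_obj.co_code))
```

If step 1 used the wrong encoding, the error would appear before a single line of the program had run, which is why the source encoding matters to the interpreter.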
So the question is: our text editor has its own default encoding and decoding method. Does the interpreter have one too?
Of course. Python 2 defaults to ASCII and Python 3 defaults to UTF-8. You can query it as follows:
    import sys
    print(sys.getdefaultencoding())
Do you still remember this statement?
    #coding:utf8
Yes, this is because the Python 2 interpreter would otherwise decode a UTF-8 encoded file with its default encoding, ASCII. As soon as the program contains Chinese characters, decoding fails. So we declare # coding: utf8 at the top of the file to tell the interpreter not to use the default encoding and to decode the file as UTF-8 instead. The Python 3 interpreter is much more convenient because it assumes UTF-8 source files by default.
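Python 3's standard library exposes the lookup of this declaration through `tokenize.detect_encoding`, so its effect can be observed without running a whole file. The sketch below shows Python 3 behavior only, where the fallback is UTF-8 rather than ASCII. The helper name `declared_encoding` and the sample sources are assumptions made for the example.

```python
import codecs
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the normalized name of the codec Python 3 would use to decode this source."""
    name, _first_lines = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    # Normalize aliases such as "utf8" to the canonical codec name.
    return codecs.lookup(name).name

with_declaration = b"# coding: gbk\ns = 'hi'\n"
without_declaration = b"s = 'hi'\n"

print(declared_encoding(with_declaration))
print(declared_encoding(without_declaration))
```

With the declaration, the source is treated as GBK. Without it, Python 3 falls back to UTF-8.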
III. Transcoding
Note:
1. In Python 2 the default encoding is ASCII, and in Python 3 the default encoding is UTF-8.
2. Unicode text can be encoded as UTF-32 (4 bytes per character), UTF-16 (2 or 4 bytes), or UTF-8 (1 to 4 bytes), so UTF-8 is one encoding of Unicode, not a separate character set.
3. In Python 3, encode converts the string type (str) to the bytes type, and decode converts the bytes type back to the string type.
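Note 3 can be checked with a short Python 3 snippet. It is only a sketch with an arbitrary sample string.

```python
text = "\u7f16\u7801"            # a two-character Chinese word meaning "encoding"

data = text.encode("utf-8")      # str -> bytes
back = data.decode("utf-8")      # bytes -> str

print(type(text).__name__, type(data).__name__, type(back).__name__)
print(len(text), len(data))      # 2 characters, 6 bytes

# In Python 3 the two directions are kept strictly apart:
# str only has encode(), bytes only has decode().
str_can_decode = hasattr(text, "decode")
bytes_can_encode = hasattr(data, "encode")
```

In Python 2, by contrast, the plain str type was already a byte string, which is why the Python 2 example has to call decode first.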
In Python 2:
    # -*- coding: UTF-8 -*-
    __author__ = 'Alex li'

    import sys
    print(sys.getdefaultencoding())

    msg = "我爱北京天安门"  # "I love Tiananmen Square in Beijing"; a UTF-8 byte string in Python 2
    # Decode to unicode first, then encode to the target encoding.
    msg_gb2312 = msg.decode("UTF-8").encode("gb2312")
    # GBK is a superset of GB2312, so GB2312 bytes can be decoded as GBK.
    gb2312_to_gbk = msg_gb2312.decode("gbk").encode("gbk")

    print(msg)
    print(msg_gb2312)
    print(gb2312_to_gbk)
In Python 3:
    # -*- coding: gb2312 -*-   # optional; if kept, the file must really be saved as GB2312
    __author__ = 'Alex li'

    import sys
    print(sys.getdefaultencoding())

    msg = "我爱北京天安门"  # "I love Tiananmen Square in Beijing"
    # msg_gb2312 = msg.decode("UTF-8").encode("gb2312")  # Python 2 style; str has no decode() in Python 3
    msg_gb2312 = msg.encode("gb2312")  # str is already Unicode in Python 3, so no decode is needed
    gb2312_to_unicode = msg_gb2312.decode("gb2312")
    gb2312_to_utf8 = msg_gb2312.decode("gb2312").encode("UTF-8")

    print(msg)
    print(msg_gb2312)
    print(gb2312_to_unicode)
    print(gb2312_to_utf8)
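Both examples follow the same pattern: to move from one encoding to another, go through Unicode in the middle. A minimal Python 3 sketch of that pattern is shown here. The function name `transcode` and the sample text are assumptions made for illustration, not part of the original examples.

```python
def transcode(data, source_encoding, target_encoding):
    """Re-encode bytes by passing through Unicode: bytes -> str -> bytes."""
    return data.decode(source_encoding).encode(target_encoding)

sample = "\u5929\u5b89\u95e8"              # three Chinese characters ("Tiananmen")
gb2312_data = sample.encode("gb2312")      # 2 bytes per character
utf8_data = transcode(gb2312_data, "gb2312", "utf-8")   # 3 bytes per character

# GBK is a superset of GB2312, so GB2312 bytes also decode correctly as GBK.
via_gbk = transcode(gb2312_data, "gbk", "utf-8")

print(len(gb2312_data), len(utf8_data))
print(utf8_data.decode("utf-8") == sample)
```

There is no direct GB2312-to-UTF-8 step: the decode call always produces Unicode first, and the encode call then chooses the target encoding.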