I recently used Python 3.4 to write some scripts and found that its handling of character encodings differs considerably from Python 2.6. I took the opportunity to organize the related knowledge so that it is easy to look up when needed.
First, the concepts and the differences:
Script character encoding: the encoding the interpreter uses to read the script file. It can be specified explicitly with # -*- coding: utf-8 -*-
Interpreter character encoding: the encoding the interpreter's internal logic uses when it processes the str type.
Python 2 reads script files as ASCII by default (for the historical reasons, please search online).
Python 2 has two string types, str and unicode, which can be converted to each other with decode and encode.
Python 3 reads script files as UTF-8 by default (Chinese is finally supported out of the box, thumbs up).
Python 3 distinguishes text from binary data using str and bytes respectively, and these are likewise converted to each other with encode and decode.
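As a minimal sketch of the str/bytes split in Python 3 (the variable names are illustrative and do not come from the original scripts), text is encoded to bytes and bytes are decoded back to text:

```python
# Python 3: str holds text, bytes holds encoded binary data.
text = '中文'

utf8_bytes = text.encode('utf-8')   # str -> bytes using UTF-8
gbk_bytes = text.encode('gbk')      # same text, different byte sequence

print(utf8_bytes)   # b'\xe4\xb8\xad\xe6\x96\x87'
print(gbk_bytes)    # b'\xd6\xd0\xce\xc4'

# bytes -> str must use the same encoding that produced the bytes.
assert utf8_bytes.decode('utf-8') == text
assert gbk_bytes.decode('gbk') == text
```

These are the same two byte sequences that appear in the Python 2.6 session later in this article.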
About the default script character encoding: because the default encoding used to read script files has changed, the handling of a lot of Chinese-related content has changed as well. Take the following script as an example.
    import sys

    print(sys.getdefaultencoding())
    print('中文')
Running it with the Python 3.4 interpreter gives:
    > python34 test.py
    utf-8
    中文
Running it with the Python 2.6 interpreter gives:
    > python26 test.py
      File "test.py", line 4
    SyntaxError: Non-ASCII character '\xe4' in file test.py on line 4, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
Python 2.6 reports an error because, as stated earlier, Python 2 reads script files as ASCII by default. The script file contains Chinese characters, which ASCII does not cover, hence the error. If we change the script slightly:
    # -*- coding: utf-8 -*-
    import sys

    print(sys.getdefaultencoding())
    print('中文')
With the script character encoding declared, running it again with the Python 2.6 interpreter gives:
    > python26 test.py
    ascii
    涓枃
Because the script file encoding is explicitly declared as UTF-8, the file is read without any problem. In other words, if a Python 2 script file contains non-ASCII characters, be sure to declare the script file encoding explicitly. For Python 3 the default script file encoding is UTF-8, so there is no problem (this is discussed in more detail later in the article).
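The effect of the declaration can be sketched in Python 3 with the standard library's tokenize.detect_encoding, which reports the encoding a source file declares. The helper name script_encoding and the sample source bytes are assumptions made for illustration only:

```python
import io
import tokenize

def script_encoding(source_bytes):
    """Return the source encoding Python 3 detects for these bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

body = "print('中文')\n".encode('utf-8')
declared = b"# -*- coding: utf-8 -*-\n" + body

# ASCII (the Python 2 default for script files) cannot decode the Chinese bytes.
try:
    body.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(ascii_ok)                                      # False
print(script_encoding(declared))                     # utf-8
print(script_encoding(b"# -*- coding: gbk -*-\n"))   # gbk
print(script_encoding(b"import sys\n"))              # utf-8 (Python 3 default)
```

The last call shows the Python 3 behavior described above: with no declaration at all, the file is treated as UTF-8.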
But look back at the output of that Python 2.6 run: the result is displayed as garbled text.
The garbled text is related to the other difference we want to discuss, the interpreter character encoding. We declared UTF-8 for reading the script content, but Python 2.6 on a Chinese-language Windows platform uses GBK by default to decode characters for output. If you don't believe it, see for yourself:
    > python26
    ActivePython 2.6.6.15 (ActiveState Software Inc.) based on
    Python 2.6.6 (r266:84292, Aug 24 2010, 16:01:11) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> s='中文'
    >>> s
    '\xd6\xd0\xce\xc4'
    >>> s.decode('gbk').encode('utf-8')
    '\xe4\xb8\xad\xe6\x96\x87'
    >>> print('\xd6\xd0\xce\xc4')
    中文
    >>> print('\xe4\xb8\xad\xe6\x96\x87')
    涓枃
Here is a complete description of how the garbled text above comes about:
The interpreter reads "中文" using the declared script file encoding, UTF-8, so the string content is '\xe4\xb8\xad\xe6\x96\x87'. On output, the default GBK encoding is used to interpret those bytes. UTF-8 uses 3 bytes for one Chinese character, while GBK uses 2 bytes. The 6 bytes of the original 2 characters are therefore treated as 3 GBK characters on output, which of course produces garbled text. Look at the garbled text and you will see that it is 3 characters.
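The byte arithmetic can be checked with a short Python 3 sketch. The pairing step is a simplified model of how a GBK decoder walks these particular bytes, not the decoder itself, and the variable names are illustrative:

```python
text = '中文'
utf8_bytes = text.encode('utf-8')
gbk_bytes = text.encode('gbk')

print(len(utf8_bytes))  # 6: three bytes per character
print(len(gbk_bytes))   # 4: two bytes per character

# A GBK decoder consumes these bytes two at a time (every lead byte here is
# >= 0x81), so the six UTF-8 bytes split into three units, not two characters.
pairs = [utf8_bytes[i:i + 2] for i in range(0, len(utf8_bytes), 2)]
print(pairs)  # [b'\xe4\xb8', b'\xad\xe6', b'\x96\x87']
```

None of the three byte pairs corresponds to the GBK encoding of either original character, which is why the output cannot be read.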
Let's verify this explanation with code:
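    # -*- coding: utf-8 -*-
    import sys

    print(sys.getdefaultencoding())
    print('中文')  # UTF-8 bytes read from the source file
    print('\xe4\xb8\xad\xe6\x96\x87')  # the same UTF-8 bytes, written as escapes
    print('\xe4\xb8\xad\xe6\x96\x87'.decode('gbk', 'ignore'))  # UTF-8 bytes decoded as GBK
    print('\xd6\xd0\xce\xc4'.decode('gbk').encode('utf-8'))  # GBK bytes converted to UTF-8 bytes
    print('中文'.decode('utf-8'))  # UTF-8 bytes decoded to unicode
    print('\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8'))  # same, from the escaped bytes
    print('\xd6\xd0\xce\xc4')  # raw GBK bytes
    print('\xd6\xd0\xce\xc4'.decode('gbk'))  # GBK bytes decoded to unicode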
Look at the output:
    > python26 test.py
    ascii
    涓枃
    涓枃
    涓枃
    涓枃
    中文
    中文
    中文
    中文
Clearly, the hexadecimal bytes in GBK format are output as normal Chinese, and the hexadecimal bytes in UTF-8 format are also output normally once they are explicitly decoded with utf-8.
By the same reasoning, two more phenomena can be observed:
If the .py file is saved in utf-8 format and contains the characters "中文", opening it as GBK also shows "中文" as garbled text, identical to the program output above;
If the .py file is saved in GBK format, print('中文') also displays normally;
The root cause of garbled text is that the encodings used to encode and to decode the same string are inconsistent.
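This root cause can be sketched in Python 3 with a small helper. The name round_trip is hypothetical, and errors='replace' is a choice made here so that a mismatch shows up as garbled text rather than as an exception:

```python
def round_trip(text, encode_with, decode_with):
    """Encode text with one codec and decode the resulting bytes with another."""
    raw = text.encode(encode_with)
    # errors='replace' substitutes U+FFFD for byte sequences the codec cannot map.
    return raw.decode(decode_with, errors='replace')

# Matching codecs give back the original text.
print(round_trip('中文', 'utf-8', 'utf-8') == '中文')  # True
print(round_trip('中文', 'gbk', 'gbk') == '中文')      # True

# Mismatched codecs do not.
print(round_trip('中文', 'utf-8', 'gbk') == '中文')    # False
print(round_trip('中文', 'gbk', 'utf-8') == '中文')    # False
```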
As for the problem described above, if both the file storage and the script file encoding use UTF-8, Python 3.4 has no problem, because Python 3's default interpreter character encoding is utf-8, so Chinese can be handled by default.
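The defaults can be inspected directly in Python 3. This is a small sketch; the value reported for standard output depends on the console and environment, so no particular result is assumed for it:

```python
import sys

# Default encoding for implicit str <-> bytes conversions: 'utf-8' in Python 3.
default_encoding = sys.getdefaultencoding()
print(default_encoding)

# Encoding used when str is written to standard output. It is determined by
# the console or environment and is not necessarily UTF-8.
stdout_encoding = getattr(sys.stdout, 'encoding', None)
print(stdout_encoding)
```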
Conclusions:
- Store Python 2 script files in GBK format where possible; store Python 3 script files in utf-8 format where possible;
- If a Python 2 script contains Chinese characters, be sure to declare a script file encoding that supports Chinese at the beginning of the script;
- In Python 2, make sure the encodings used to encode and decode the same string are consistent;
Note: all test script files used here were saved in utf-8 format.
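The consistency rule also applies when reading and writing files. A minimal Python 3 sketch, assuming a temporary directory and a hypothetical file name sample.txt, names the encoding explicitly on both sides:

```python
import os
import tempfile

text = '中文'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.txt')  # hypothetical file name

    # Write and read with the same explicitly named encoding.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    with open(path, 'r', encoding='utf-8') as f:
        same = f.read()

    # The bytes on disk are the UTF-8 encoding of the text.
    with open(path, 'rb') as f:
        raw = f.read()

print(same == text)  # True
print(raw)           # b'\xe4\xb8\xad\xe6\x96\x87'
```

Passing encoding explicitly avoids relying on a platform default that may differ from the encoding the file was saved in.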
Differences in default encodings in Python2 and Python3