1. Python encoding basics
1.1 str and unicode
Python has two data types for string data, str and unicode, and both derive from the base class basestring. For example, s = "中文" is a string of type str, while u = u"中文" is a string of type unicode. A unicode string is obtained by decoding a str, and a unicode string can in turn be encoded into a str. That is:
str --decode--> unicode
unicode --encode--> str
Strictly speaking, str is a byte string. For the UTF-8 encoded str "中文", len() returns 6, because the value is actually "\xe4\xb8\xad\xe6\x96\x87". For the unicode string u"中文" (actually u"\u4e2d\u6587"), len() returns 2.
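The byte-versus-character distinction above can be checked with a short sketch. It is written so that it should also run on Python 3, where the Python 2 str corresponds to bytes and unicode corresponds to str; the variable names are illustrative only.

```python
# Sketch: a byte string counts bytes, a unicode string counts characters.
# Escapes are used so the result does not depend on this file's encoding.
text = u"\u4e2d\u6587"            # the unicode string u"中文"
raw = text.encode("utf-8")        # unicode --encode--> byte string

byte_length = len(raw)            # number of bytes
char_length = len(text)           # number of characters
round_trip = raw.decode("utf-8")  # byte string --decode--> unicode

print(byte_length, char_length, round_trip == text)
```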
1.2 Header Code Declaration
If a Python source file contains non-ASCII characters, such as Chinese, the character encoding of the source must be declared at the head of the file, in the following format:
# -*- coding: utf-8 -*-
This format looks complicated, but Python only checks for the #, the word coding, and the encoding name, so it can be abbreviated to #coding:utf-8, or even written as #coding:u8.
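To see why the abbreviated forms are accepted, the sketch below applies the regular expression documented in PEP 263 to a few header lines. The helper name declared_encoding is hypothetical, and the real interpreter does this check internally rather than through this function.

```python
import codecs
import re

# Pattern documented in PEP 263 for finding the declared encoding.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

def declared_encoding(line):
    """Return the encoding name declared in a header line, or None."""
    match = CODING_RE.match(line)
    return match.group(1) if match else None

headers = [
    "# -*- coding: utf-8 -*-",
    "#coding:utf-8",
    "#coding:u8",
    "# plain comment",
]
found = [declared_encoding(h) for h in headers]
print(found)

# "u8" is a codec alias that resolves to the same codec as "utf-8".
canonical = codecs.lookup("u8").name
print(canonical)
```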
2. Common encoding issues in Python 2.x
2.1 Header encoding declaration and file encoding issues
The encoding declared in the file header determines how Python interprets str literals in the source. For example, if the header declares UTF-8, Python parses the code s = "中文" as UTF-8, and repr(s) shows that the bytes are "\xe4\xb8\xad\xe6\x96\x87". If the header declares GBK, Python parses s as GBK, and the result is "\xd6\xd0\xce\xc4".
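The two byte sequences quoted above can be reproduced without touching any file header by encoding the same text explicitly. This is only an illustration of the two encodings, not of the parser itself.

```python
# Sketch: the same two characters produce different bytes per encoding.
text = u"\u4e2d\u6587"           # u"中文"
as_utf8 = text.encode("utf-8")   # three bytes per character here
as_gbk = text.encode("gbk")      # two bytes per character here

print(repr(as_utf8))
print(repr(as_gbk))
```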
Note that the encoding of the file itself must match the encoding declared in the header, otherwise problems arise. On Linux, the encoding of the file itself can be checked in vim with the command :set fenc. If the file itself is encoded as GBK but the header declares UTF-8, any Chinese text in the source causes trouble: the Chinese str is stored as GBK bytes, but Python parses it as UTF-8, so it reports an error such as SyntaxError: (unicode error) 'utf8' codec can't decode byte.
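The mismatch described above comes down to decoding GBK bytes as UTF-8. A minimal sketch of that underlying failure, outside the source-file parser, might look like this:

```python
# Sketch: bytes saved as GBK, then decoded with the wrong (UTF-8) codec.
gbk_bytes = u"\u4e2d\u6587".encode("gbk")   # what a GBK-saved file holds

try:
    gbk_bytes.decode("utf-8")               # what a utf-8 header implies
    mismatch_error = None
except UnicodeDecodeError as exc:
    mismatch_error = exc

print(mismatch_error)

# Decoding with the encoding the bytes were really written in succeeds.
recovered = gbk_bytes.decode("gbk")
```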
2.2 Default encoding issues
Here's a look at the problem caused by the Python default encoding:
#coding: utf-8
u = u"中文"
print repr(u) # u'\u4e2d\u6587'

s = "中文"
print repr(s) # '\xe4\xb8\xad\xe6\x96\x87'

u2 = s.decode("utf-8")
print repr(u2) # u'\u4e2d\u6587'

#s2 = u.decode("utf-8") # encoding error
#u2 = s.encode("utf-8") # decoding error
Note the two commented-out lines in the example: it is best not to call decode directly on a unicode object, nor encode directly on a str. Calling u.decode("utf-8") directly is equivalent to u.encode(default_encoding).decode("utf-8"), where default_encoding is the default encoding used by Python's unicode implementation, that is, the value returned by sys.getdefaultencoding(). If you have not changed it, the default encoding is ASCII, so a unicode string containing characters outside the ASCII range raises an error. Similarly, calling encode directly on a str first decodes the str with the default encoding, that is, s.decode(default_encoding).encode("utf-8"). If the str contains Chinese and default_encoding is ASCII, the decoding step fails. The two lines therefore report UnicodeEncodeError: 'ascii' codec can't encode characters in position... and UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position... respectively.
The two commented-out lines in the example above raise errors if executed. Of course, if the str or unicode value is entirely within the ASCII range, there is no problem. For example, s = "abc"; s.encode("utf-8") works, and the statement returns a str with a different id than s.
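The hidden conversion step can be made visible by writing it out by hand. The sketch below spells out the implicit ASCII step explicitly, so it should behave the same on Python 3, which no longer performs the conversion implicitly. The helper try_call is hypothetical.

```python
text = u"\u4e2d\u6587"       # u"中文"
raw = text.encode("utf-8")   # the byte string form

def try_call(func):
    """Run func and return the name of the Unicode error it raises, or None."""
    try:
        func()
    except UnicodeError as exc:
        return type(exc).__name__
    return None

# Hidden first step of u.decode("utf-8"): encode with the ASCII default.
hidden_encode = try_call(lambda: text.encode("ascii"))
# Hidden first step of s.encode("utf-8"): decode with the ASCII default.
hidden_decode = try_call(lambda: raw.decode("ascii"))
# ASCII-only data passes through both steps without error.
ascii_ok = try_call(lambda: b"abc".decode("ascii").encode("utf-8"))

print(hidden_encode, hidden_decode, ascii_ok)
```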
There are two ways to solve the problem in the example above. The first is to specify the encoding explicitly, as follows:
#coding: utf-8
u = u"中文"
print repr(u) # u'\u4e2d\u6587'

s = "中文"
print repr(s) # '\xe4\xb8\xad\xe6\x96\x87'

u2 = s.decode("utf-8")
print repr(u2) # u'\u4e2d\u6587'

s2 = u.encode("utf-8").decode("utf-8") # OK
u2 = s.decode("utf8").encode("utf-8") # OK
The second method is to change Python's default encoding to the encoding of the file, as shown below (the sys module has to be reloaded first, because the setdefaultencoding method is removed from sys after Python initialization):
#coding: utf-8
import sys
reload(sys)
sys.setdefaultencoding("utf-8") # change the default encoding to utf-8

u = u"中文"
print repr(u) # u'\u4e2d\u6587'

s = "中文"
print repr(s) # '\xe4\xb8\xad\xe6\x96\x87'

u2 = s.decode("utf-8")
print repr(u2) # u'\u4e2d\u6587'

s2 = u.decode("utf-8")
u2 = s.encode("utf-8")
2.3 Read and Write file encoding
When a file is opened with Python's open() function, read() returns a str in the encoding of the file itself. When calling write() with a unicode argument, you should encode it with the desired encoding first. If the write() argument is unicode and no encoding is specified, it is encoded with Python's default encoding and then written.
#coding: utf-8
f = open("testfile")
s = f.read()
f.close()
print type(s) # <type 'str'>
u = s.decode("utf-8") # testfile is utf-8 encoded
f = open("testfile", "w")
f.write(u.encode("gbk")) # write as GBK; testfile is now GBK encoded
f.close()
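The same read, decode, re-encode, write cycle can be sketched in a self-contained form. This version assumes a scratch file in a temporary directory and opens files in binary mode, so that it handles raw bytes on Python 3 as well; the path is hypothetical.

```python
import os
import tempfile

text = u"\u4e2d\u6587"                                # u"中文"
path = os.path.join(tempfile.mkdtemp(), "testfile")   # scratch file

# Create a UTF-8 encoded file to start from.
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))

# Read the raw bytes and decode them with the file's real encoding.
with open(path, "rb") as f:
    raw = f.read()
decoded = raw.decode("utf-8")

# Re-encode explicitly as GBK before writing back.
with open(path, "wb") as f:
    f.write(decoded.encode("gbk"))

with open(path, "rb") as f:
    gbk_raw = f.read()
print(repr(raw), repr(gbk_raw))
```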
In addition, Python's codecs module provides an open() function that lets you specify the encoding when opening a file; read() on a file opened this way returns unicode. When writing, if the write() argument is unicode, it is encoded with the encoding given when the file was opened. If it is a str, it is first decoded to unicode with the default encoding and then written in the file's encoding (note that if the str is Chinese and the default encoding sys.getdefaultencoding() is ASCII, a decoding error is reported).
#coding: gbk
import codecs
f = codecs.open('testfile', encoding='utf-8')
u = f.read()
f.close()
print type(u) # <type 'unicode'>

f = codecs.open('testfile', 'a', encoding='utf-8')
f.write(u) # write unicode

# Writing a GBK-encoded str triggers an automatic decode and re-encode
s = '汉'
print repr(s) # '\xba\xba'
# The GBK-encoded str is first decoded to unicode and then encoded to UTF-8 for writing
#f.write(s) # raises a decoding error when the default encoding is ascii
f.close()
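As a related sketch, io.open (swapped in here for codecs.open; it is available on Python 2.6+ and Python 3) also takes an encoding argument and returns unicode from read(). The example only writes unicode, which avoids the implicit decode of a str described above; the scratch path is hypothetical.

```python
import io
import os
import tempfile

text = u"\u6c49"                                      # u"汉"
path = os.path.join(tempfile.mkdtemp(), "testfile")   # scratch file

# Write unicode; the file object encodes it as UTF-8.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Append more unicode with the same encoding.
with io.open(path, "a", encoding="utf-8") as f:
    f.write(text)

# Reading with an encoding returns unicode, not bytes.
with io.open(path, encoding="utf-8") as f:
    content = f.read()

# Inspect the bytes actually stored on disk.
with io.open(path, "rb") as f:
    raw = f.read()
print(repr(content), repr(raw))
```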