Python encoding details
The previous article, on character sets and encodings, summarized the common character encodings. This article analyzes and summarizes common encoding problems in Python. Because Python 3.x and Python 2.x differ greatly in how they handle character encoding, this article uses Python 2.7.5 and covers the problems in 2.x.
1. Python encoding basics
1.1 str and unicode
Python 2 has two string types, str and unicode, both of which derive from basestring. For example, s = "中文" is a str, and u = u"中文" is a unicode string. Decoding a str yields unicode, and encoding unicode yields a str. That is:
    str --> decode --> unicode
    unicode --> encode --> str
Strictly speaking, str is a byte string. In a UTF-8 encoded source file, the str "中文" is actually "\xe4\xb8\xad\xe6\x96\x87", so len() returns 6. For the unicode string u"中文" (actually u"\u4e2d\u6587"), len() returns 2.
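The relationship between the two types can be checked with a short sketch. It is written so that it runs on both Python 2.7 and Python 3 (where the same roles are played by bytes and str), and it uses escape sequences instead of literal Chinese characters so that the result does not depend on how the file is saved. The variable names are only illustrative.

```python
# The two characters from the example, written as escapes so that the
# source file itself stays pure ASCII.
text = u"\u4e2d\u6587"

# encode: text -> bytes (a str in Python 2, bytes in Python 3)
data = text.encode("utf-8")

# decode: bytes -> text
round_trip = data.decode("utf-8")

print(len(text))   # number of characters
print(len(data))   # number of bytes
print(repr(data))
```

The character count and the byte count differ because each of these characters takes three bytes in UTF-8.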
1.2 header encoding statement
If a Python source file contains non-ASCII characters, such as Chinese characters, you must declare the source encoding in the header of the file (on the first or second line). The format is as follows:
#-*- coding: utf-8 -*-
This format looks complicated, but Python only looks for a comment that contains coding: or coding= followed by an encoding name, so the declaration can be shortened to # coding: utf-8 or even # coding: u8.
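As a rough illustration of why the short forms work, the sketch below applies the pattern given in PEP 263 to a single line and resolves the captured name through the codec registry. The helper declared_encoding is hypothetical and exists only for this example; it is not how the interpreter itself is implemented.

```python
import codecs
import re

# Pattern from PEP 263 for recognising an encoding declaration.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(line):
    """Hypothetical helper: return the canonical codec name declared
    on the given source line, or None if there is no declaration."""
    match = CODING_RE.match(line)
    if match is None:
        return None
    # codecs.lookup() resolves aliases such as "u8" or "UTF-8".
    return codecs.lookup(match.group(1)).name


for line in ["#-*- coding: utf-8 -*-", "# coding: UTF-8", "# coding: u8", "import sys"]:
    print("%r -> %r" % (line, declared_encoding(line)))
```

All three declaration styles resolve to the same codec, because u8 is a registered alias of utf-8.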
2. Common encoding problems in Python 2.x
2.1 Header encoding declaration and file encoding
The encoding declared in the file header determines how Python interprets str literals when it parses the source code. For example, if the header declares UTF-8, then for s = "中文" Python parses the literal as UTF-8, and repr(s) gives "\xe4\xb8\xad\xe6\x96\x87". If the header declares gbk, Python parses s as gbk, and the result is "\xd6\xd0\xce\xc4".
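The two byte sequences above can be reproduced without saving a file in each encoding, by encoding the same text explicitly. This is only a sketch of the byte values involved, not of the parser itself; it runs on both Python 2.7 and Python 3.

```python
text = u"\u4e2d\u6587"  # the two characters from the example

as_utf8 = text.encode("utf-8")  # bytes stored in a UTF-8 source file
as_gbk = text.encode("gbk")     # bytes stored in a gbk source file

print(repr(as_utf8))
print(repr(as_gbk))
```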
Note that the actual encoding of the file must match the encoding declared in the header; otherwise problems occur. On Linux, you can check a file's encoding in vim with the command :set fenc. If the file is saved as gbk but the header declares UTF-8, any Chinese text in the source can cause trouble: the bytes are stored as gbk, but Python treats them as UTF-8 when parsing, and reports an error such as SyntaxError: (unicode error) 'utf8' codec can't decode byte.
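The underlying failure can be reproduced outside the parser: gbk-encoded bytes are generally not valid UTF-8, so decoding them as UTF-8 raises an error. A minimal sketch, with illustrative variable names:

```python
gbk_bytes = u"\u4e2d\u6587".encode("gbk")

try:
    gbk_bytes.decode("utf-8")   # wrong codec for these bytes
    mismatch_detected = False
except UnicodeDecodeError as exc:
    mismatch_detected = True
    print("decode failed: %s" % exc)

# Decoding with the codec that was actually used succeeds.
recovered = gbk_bytes.decode("gbk")
```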
2.2 default encoding Problems
The following example shows the problems caused by Python's default encoding:
    # coding: utf-8
    u = u"中文"
    print repr(u)    # u'\u4e2d\u6587'
    s = "中文"
    print repr(s)    # '\xe4\xb8\xad\xe6\x96\x87'
    u2 = s.decode("utf-8")
    print repr(u2)   # u'\u4e2d\u6587'
    # s2 = u.decode("utf-8")   # encoding error
    # u2 = s.encode("utf-8")   # decoding error
Note the two commented-out lines in the example. It is best not to call decode directly on unicode, or encode directly on str. Calling u.decode("utf-8") is equivalent to u.encode(default_encoding).decode("utf-8"), where default_encoding is the default encoding used by Python's unicode implementation, that is, sys.getdefaultencoding(). If you have not changed it, the default encoding is ascii, so the call fails when the unicode string contains characters outside the ASCII range. Similarly, calling encode directly on a str first decodes it with the default encoding, that is, s.decode(default_encoding).encode("utf-8"). If the str contains Chinese and default_encoding is ascii, the decoding step fails. The two lines therefore raise UnicodeEncodeError: 'ascii' codec can't encode characters in position... and UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position... respectively.
So if the two commented-out lines in the example are executed, an error is raised. Of course, if both the str and the unicode contain only ASCII characters, there is no problem. For example, s = "abc"; s.encode("utf-8") works, and the statement returns a str whose id differs from that of s.
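The hidden first step of each call can be made visible by writing it out explicitly. The sketch below assumes the default encoding is ascii, as described above, and passes "ascii" by hand instead of relying on the implicit conversion, so it also runs on Python 3, where that implicit conversion no longer exists. The helper try_call is hypothetical.

```python
text = u"\u4e2d\u6587"
data = text.encode("utf-8")


def try_call(func):
    """Hypothetical helper: run func and report the error type, if any."""
    try:
        func()
        return "ok"
    except UnicodeError as exc:
        return type(exc).__name__


# Hidden first step of u.decode("utf-8"): encode with the default codec.
step_encode = try_call(lambda: text.encode("ascii"))

# Hidden first step of s.encode("utf-8"): decode with the default codec.
step_decode = try_call(lambda: data.decode("ascii"))

# With ASCII-only content, the same steps succeed.
ascii_only = try_call(lambda: u"abc".encode("ascii").decode("utf-8"))

print(step_encode)
print(step_decode)
print(ascii_only)
```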
To solve the problem in the example above, you can use either of two methods. The first is to always convert explicitly with the intended encoding:
    # coding: utf-8
    u = u"中文"
    print repr(u)    # u'\u4e2d\u6587'
    s = "中文"
    print repr(s)    # '\xe4\xb8\xad\xe6\x96\x87'
    u2 = s.decode("utf-8")
    print repr(u2)   # u'\u4e2d\u6587'
    s2 = u.encode("utf-8").decode("utf-8")   # OK
    u2 = s.decode("utf8").encode("utf-8")    # OK
The second method is to change Python's default encoding to the encoding of the file, as shown below. (The sys module must be reloaded first, because Python removes the setdefaultencoding method after initialization.)
    # coding: utf-8
    import sys
    reload(sys)
    sys.setdefaultencoding("utf-8")   # change the default encoding to utf-8
    u = u"中文"
    print repr(u)    # u'\u4e2d\u6587'
    s = "中文"
    print repr(s)    # '\xe4\xb8\xad\xe6\x96\x87'
    u2 = s.decode("utf-8")
    print repr(u2)   # u'\u4e2d\u6587'
    s2 = u.decode("utf-8")
    u2 = s.encode("utf-8")
2.3 read/write file encoding
When a file is opened with Python's built-in open(), read() returns a str whose encoding is the encoding of the file itself. When calling write() with unicode data, you should first encode it with the encoding you want the file to have. If the argument to write() is unicode and you do not encode it explicitly, Python encodes it with the default encoding before writing.
    # coding: utf-8
    f = open("testfile")
    s = f.read()
    f.close()
    print type(s)    # <type 'str'>
    u = s.decode("utf-8")    # testfile is UTF-8 encoded
    f = open("testfile", "w")
    f.write(u.encode("gbk"))    # write the data as gbk; testfile is now gbk encoded
    f.close()
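The same read, decode, encode, write cycle can be tried without an existing testfile. The sketch below is an assumption-laden variant: it creates a temporary file, and it opens the file in binary mode so that it behaves the same on Python 2.7 and Python 3. The variable names are illustrative.

```python
import os
import tempfile

text = u"\u4e2d\u6587"

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # Create a UTF-8 encoded file to start from.
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))

    # read() returns raw bytes in the file's own encoding.
    with open(path, "rb") as f:
        raw = f.read()
    decoded = raw.decode("utf-8")

    # Encode explicitly before writing; the file is now gbk encoded.
    with open(path, "wb") as f:
        f.write(decoded.encode("gbk"))
    with open(path, "rb") as f:
        rewritten = f.read()
finally:
    os.remove(path)

print(repr(raw))
print(repr(rewritten))
```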
In addition, the codecs module provides an open() function that takes the file's encoding as an argument. A file opened this way returns unicode when read. When writing, a unicode argument is encoded with the encoding given when the file was opened. A str argument is first decoded to unicode with the default encoding and then encoded with the file's encoding. (Note that if the str contains Chinese and the default encoding, sys.getdefaultencoding(), is ascii, a decoding error is raised.)
    # coding: gbk
    import codecs

    f = codecs.open('testfile', encoding='utf-8')
    u = f.read()
    f.close()
    print type(u)    # <type 'unicode'>

    f = codecs.open('testfile', 'a', encoding='utf-8')
    f.write(u)       # write unicode

    # Writing a gbk-encoded str triggers an automatic decode and encode
    s = '汉'
    print repr(s)    # '\xba\xba'
    # Here the gbk-encoded str would be decoded to unicode and then encoded as UTF-8 for writing
    # f.write(s)     # if the default encoding is ascii, a decoding error is raised
    f.close()
    
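A self-contained variant of the unicode part of this example is sketched below. It uses a temporary file instead of testfile and writes only unicode text, so it does not depend on the default encoding and runs on both Python 2.7 and Python 3. The variable names are illustrative.

```python
import codecs
import os
import tempfile

text = u"\u4e2d\u6587"

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # Text written through codecs.open is encoded with the given codec.
    f = codecs.open(path, "w", encoding="utf-8")
    f.write(text)
    f.close()

    # Appending works the same way.
    f = codecs.open(path, "a", encoding="utf-8")
    f.write(text)
    f.close()

    # Reading returns decoded text rather than raw bytes.
    f = codecs.open(path, "r", encoding="utf-8")
    decoded = f.read()
    f.close()

    # The bytes on disk, for comparison.
    with open(path, "rb") as raw_file:
        raw_bytes = raw_file.read()
finally:
    os.remove(path)

print(repr(decoded))
print(repr(raw_bytes))
```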