This article collects and summarizes what you need to know about string encoding in Python 2.x.
Problem
At work, I ran into the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte
It is a common error that almost everyone has run into, so I decided to sort out how encoding works in Python.
Basic knowledge
In Python 2.x there are two string types, unicode and str, both of which are subclasses of basestring.
>>> a = '中'
>>> type(a)
<type 'str'>
>>> isinstance(a, basestring)
True
>>> a = u'中'
>>> type(a)
<type 'unicode'>
>>> isinstance(a, basestring)
True
In short, str is a byte string made up of encoded bytes (like bytes in Python 3.x), whereas unicode is a true string made up of characters.
>>> a = '中文'
>>> len(a)
6
>>> repr(a)
"'\xe4\xb8\xad\xe6\x96\x87'"
>>> b = u'中文'
>>> len(b)
2
>>> repr(b)
"u'\u4e2d\u6587'"
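The same distinction can be sketched in Python 3, where bytes plays the role of Python 2's str and str plays the role of Python 2's unicode. This is only an illustration under that assumption, not code from the original session; the variable names are made up for the example.

```python
# Python 3 sketch: bytes ~ Python 2 str, str ~ Python 2 unicode.
text = "\u4e2d\u6587"            # the two characters 中文
data = text.encode("utf-8")      # the same text as UTF-8 encoded bytes

char_count = len(text)           # len() on text counts characters
byte_count = len(data)           # len() on bytes counts bytes

print(char_count, byte_count)
print(repr(data))
```

The lengths differ because each of the two characters takes three bytes in UTF-8.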
Console and script
Running the following commands in the Python console on Linux gives a different result from running them in a script.
>>> a = u'中文'
>>> repr(a)
"u'\xe4\xb8\xad\xe6\x96\x87'"
>>> b = unicode('中文', 'utf-8')
>>> repr(b)
"u'\u4e2d\u6587'"
We can see that the object a initialized with u'中文' is not what we expected. Why?
Think of Python as a pipe: inside the pipe, text is handled as unicode. Input is converted to unicode at the entrance and converted to the target encoding at the exit (unless the processing logic itself needs a specific encoding).
Typing a = u'中文' in the console can be understood as a = '中文'.decode(encoding), which produces a unicode object. So what is encoding here? For the console, it is the encoding of standard input, sys.stdin.encoding.
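The pipe model can be sketched roughly as follows, in Python 3 syntax. The function name pipe, the in-memory streams, and the character-reversing "processing" step are all hypothetical placeholders chosen for the illustration.

```python
import io

def pipe(source, sink, in_encoding, out_encoding):
    """Decode at the entrance, work on text, encode at the exit."""
    raw = source.read()                          # bytes arriving from outside
    text = raw.decode(in_encoding)               # entrance: bytes -> text
    processed = text[::-1]                       # placeholder: reverse the characters
    sink.write(processed.encode(out_encoding))   # exit: text -> bytes

source = io.BytesIO(b"\xe4\xb8\xad\xe6\x96\x87")  # UTF-8 bytes for 中文
sink = io.BytesIO()
pipe(source, sink, "utf-8", "gbk")
result = sink.getvalue()                          # GBK bytes for the reversed text
```

Because the reversal operates on characters rather than bytes, the multi-byte sequences stay intact whatever the input and output encodings are.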
>>> import sys
>>> sys.stdin.encoding
'ISO-8859-1'
My console's default encoding is ISO-8859-1, so a = u'中文' <=> a = '中文'.decode('ISO-8859-1').
Here, '中文' as the console sees it is the byte string produced by the terminal's encoding. On a UTF-8 terminal, '中文' == '\xe4\xb8\xad\xe6\x96\x87'.
>>> a = '中文'.decode('ISO-8859-1')
>>> repr(a)
"u'\xe4\xb8\xad\xe6\x96\x87'"
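The effect of decoding with the wrong encoding can be reproduced in Python 3 as a small sketch. The variable names are illustrative, and the recovery step is an extra observation that the original session does not show.

```python
utf8_bytes = b"\xe4\xb8\xad\xe6\x96\x87"      # what a UTF-8 terminal sends for 中文

wrong = utf8_bytes.decode("iso-8859-1")        # every byte becomes one character
right = utf8_bytes.decode("utf-8")             # the two intended characters

# ISO-8859-1 maps each byte to exactly one code point, so the
# mistaken decode can be reversed without losing information.
recovered = wrong.encode("iso-8859-1").decode("utf-8")

print(len(wrong), len(right))
```

The wrongly decoded value has six characters instead of two, which matches the unexpected repr shown above.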
So how can this encoding be changed? On Linux, set the following environment variable:
export PYTHONIOENCODING=UTF-8
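One way to check the effect of this variable, sketched in Python 3, is to start a child interpreter with it set and ask which encoding it chose for standard input. The helper name child_stdin_encoding is hypothetical, and the exact spelling of the reported name (for example upper or lower case) may vary between Python versions.

```python
import os
import subprocess
import sys

def child_stdin_encoding(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set and return
    the encoding it reports for sys.stdin."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdin.encoding)"],
        env=env,
        stdin=subprocess.DEVNULL,   # give the child a valid, empty stdin
    )
    return out.decode("ascii").strip()

reported = child_stdin_encoding("UTF-8")
print(reported)
```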
Summary
Back to the original problem: it arises because the difference between unicode and str was not clear and the two were mixed.
>>> a = '中文'
>>> a.encode('gbk')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
The object a above is actually a str, that is, a byte string; if the terminal is UTF-8 encoded, a holds the UTF-8 encoded bytes. a.encode('gbk') is equivalent to a.decode(encoding).encode('gbk'): the bytes are first decoded into a unicode object, which is then encoded back into bytes. The unicode object acts as the intermediate form. So what is encoding here?
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
The default is ascii, which is why the error says the ascii codec cannot decode the byte.
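Python 3 no longer performs this implicit step, but it can be imitated to show where the error comes from. The helper py2_style_encode is a hypothetical stand-in written for this illustration; it is not how Python 2 is actually implemented.

```python
def py2_style_encode(data, target, default_encoding="ascii"):
    """Imitate Python 2's str.encode: implicit decode, then encode."""
    return data.decode(default_encoding).encode(target)

utf8_bytes = b"\xe4\xb8\xad\xe6\x96\x87"   # UTF-8 bytes for 中文

try:
    py2_style_encode(utf8_bytes, "gbk")     # implicit ascii decode fails
    error = None
except UnicodeDecodeError as exc:
    error = exc

# With the right encoding for the decode step, the conversion succeeds.
ok = py2_style_encode(utf8_bytes, "gbk", default_encoding="utf-8")
print(error)
```

The failure happens in the decode step, before the GBK encoding is ever attempted, which is why a call to encode reports a decode error.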
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf-8')
>>> a = '中文'
>>> repr(a)
"'\xe4\xb8\xad\xe6\x96\x87'"
>>> a.encode('gbk')
'\xd6\xd0\xce\xc4'
Changing the default encoding to UTF-8 makes the call succeed. Still, calling encode on a str is discouraged, because it triggers an implicit decode. Use decode only on str and encode only on unicode, and always specify the encoding explicitly in every decode and encode call.
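As a rough sketch of that rule in Python 3 syntax, a conversion between two byte encodings can name both encodings explicitly. The function name transcode is hypothetical. Python 3 also enforces the rule at the type level: bytes objects have no encode method and str objects have no decode method.

```python
def transcode(data, source_encoding, target_encoding):
    """Convert bytes from one encoding to another via text,
    naming both encodings explicitly."""
    if not isinstance(data, bytes):
        raise TypeError("transcode expects bytes; encode text directly instead")
    text = data.decode(source_encoding)      # decode applies to bytes only
    return text.encode(target_encoding)      # encode applies to text only

gbk_bytes = transcode(b"\xe4\xb8\xad\xe6\x96\x87", "utf-8", "gbk")

# Python 3 removes the methods that allowed the implicit conversions.
bytes_can_encode = hasattr(b"", "encode")
text_can_decode = hasattr("", "decode")
print(repr(gbk_bytes), bytes_can_encode, text_can_decode)
```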