Don't want to be despised anymore? Then read on! Understanding Python 2 character encoding
Programmers see themselves as creators and often look down on product or QA staff who know little about technology. Sadly, programmers also despise one another, and the programmer contempt chain is famous far and wide. As Python programmers, the link that naturally matters most to us is the one in the figure from the original post: Python 3 programmers looking down on Python 2 programmers.
Our project team has always used Python 2.7. Although we know the many advantages of Python 3 and have long coveted them, for assorted historical reasons and under business pressure we can only keep using Python 2.7. Sadly, our team is not so international: the code still involves plenty of Chinese, so we occasionally run into garbled text and UnicodeError, and thus we live at the very bottom of the contempt chain.
The goal of this article, therefore, is to sort out clearly the codec relationship between unicode and str in Python 2.7, and to climb a little higher up the chain.
Note: the experiments below were run on Windows 7 with Python 2.7 and on Linux with Python 2.7. Unless otherwise stated, all commands are entered interactively at the terminal; where the platform is not emphasized, the output shown is from Windows. Here is some default environment information (its meaning is introduced later):
Windows
>>> import sys,locale
>>> sys.getdefaultencoding()
'ascii'
>>> locale.getdefaultlocale()
('zh_CN', 'cp936')
>>> sys.stdin.encoding
'cp936'
>>> sys.stdout.encoding
'cp936'
>>> sys.getfilesystemencoding()
'mbcs'
Note: cp936 is an alias for GBK, which can be verified at https://docs.python.org/2/library/codecs.html#standard-encodings.
Linux
>>> import sys,locale
>>> sys.getdefaultencoding()
'ascii'
>>> locale.getdefaultlocale()
('zh_CN', 'UTF-8')
>>> sys.stdin.encoding
'UTF-8'
>>> sys.stdout.encoding
'UTF-8'
>>> sys.getfilesystemencoding()
'UTF-8'
Original article: http://www.cnblogs.com/xybaby/p/7814299.html
Starting with character encoding
Terms such as GBK, GB2312, Unicode, and UTF-8 have nothing to do with any programming language.
The world inside a computer is only 0s and 1s, so any character (i.e., an actual written symbol) is ultimately a string of 0s and 1s. For convenience, 8 bits form one byte, and the smallest unit for expressing a character is the byte: one character occupies one or more bytes. A character encoding (character encoding) is the coding of a character set: encoding is the process of mapping each character in the character set to a unique binary value.
Computers originated in the United States and used English: 26 letters in upper and lower case, the digits 0 to 9, and a modest number of symbols and control characters, so every character fit in one byte (8 bits). This is ANSI's ASCII encoding (American Standard Code for Information Interchange). For example, the lowercase letter 'a' has ASCII code 01100001, which is 97 in decimal and 0x61 in hexadecimal; hexadecimal is generally used when describing character encodings in computing.
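A quick interactive check at any Python 2 prompt:
>>> ord('a')
97
>>> hex(ord('a'))
'0x61'
>>> chr(0x61)
'a'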
However, when computers reached China, ASCII stopped being enough: there are far too many Chinese characters for one byte to express, and so GB2312 (the national standard simplified Chinese character set) was born. GB2312 encodes one character in two bytes: the first byte (high byte) ranges from 0xA1 to 0xF7, and the second byte (low byte) from 0xA1 to 0xFE. GB2312 can represent several thousand Chinese characters and is compatible with ASCII.
Later GB2312 proved insufficient, so it was extended into GBK (the Chinese character internal code extension specification). Like GB2312, GBK represents one character in two bytes, but it relaxes the constraints on the low byte, expanding the repertoire to more than 20,000 characters. Later still came GB18030, to accommodate minority scripts and yet more Chinese characters. GB18030 is compatible with GBK and GB2312 and holds even more characters; unlike them, it uses a mix of one-, two-, and four-byte encodings.
Therefore, for the Chinese characters we care about, the three encodings relate as follows:
GB18030 ⊇ GBK ⊇ GB2312
That is, GBK is a superset of GB2312, and GB18030 is a superset of GBK. We will also see later that some Chinese characters can be expressed in GBK but not in GB2312.
Of course, there are many more languages and scripts in the world, each with its own set of encoding rules, and text that crosses a language boundary turns into garbage, so a globally unified solution was urgently needed. Enter ISO (the International Organization for Standardization), which devised the "Universal Multiple-Octet Coded Character Set", known as Unicode for short. The goal was simple: discard all regional encoding schemes and start over with one code covering every culture, letter, and symbol on Earth!
Unicode assigns every character of every language a unified, unique code point, to meet the requirements of cross-language, cross-platform text conversion and processing. Unicode code points are conventionally written with a \u prefix (for example, \u4e25).
However, Unicode is only a standard: a set of mappings from all characters to numbers, not a concrete encoding rule. Put differently, Unicode is a representation rather than a storage form; it does not define how each character is actually stored as binary. This is where it differs from GBK, whose table is itself the storage form.
For example, the Unicode code point of the Chinese character 严 ("strict") is \u4e25, whose binary is 100111000100101. But if those bits travel over a network or sit in a file as-is, a reader has no way to parse them, and they blend into neighboring bytes. So how should Unicode be stored? This is where UTF (UCS Transfer Format) comes in: a concrete encoding rule, i.e., UTF is the storage form.
We can therefore say that GBK and UTF-8 are on the same level, while Unicode sits on another: Unicode floats in the air, and to land it has to be converted into UTF-8 or GBK. The difference is that converted to UTF-8 everyone can read it, while converted to GBK only those familiar with the Chinese encodings can.
UTF itself has several implementations, such as UTF-8 and UTF-16; here UTF-8 serves as the example (the following subsection draws on Ruan Yifeng's article).
Unicode and UTF-8
The biggest feature of UTF-8 is that it is a variable-length encoding: it uses 1 to 4 bytes per symbol, and the byte length varies by symbol. The UTF-8 encoding rules are very simple; there are only two:
1) For a single-byte symbol, the first bit is set to 0 and the remaining seven bits hold the symbol's Unicode code. For English letters, therefore, UTF-8 and ASCII are identical.
2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, the (n+1)-th bit is set to 0, and the first two bits of each of the remaining bytes are set to 10. All remaining, unmentioned bits are filled with the symbol's Unicode code.
The following table summarizes the encoding rules. The letter x indicates the available encoding bits.
Unicode symbol range (hexadecimal) | UTF-8 encoding (binary)
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Take the Chinese character 严 as an example to walk through the UTF-8 encoding.
We know that the Unicode of 严 is 4E25 (100111000100101). From the table above, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 needs three bytes, in the format 1110xxxx 10xxxxxx 10xxxxxx. Starting from the last binary digit of 严, fill the x positions from back to front, padding any leftover positions with 0. The UTF-8 encoding of 严 is therefore 11100100 10111000 10100101, or E4B8A5 in hexadecimal.
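As a sanity check, here is a minimal Python 2 sketch that applies the three-byte rule by hand; the masks 0xE0, 0x80, and 0x3F come straight from the 1110xxxx/10xxxxxx pattern above:
>>> cp = 0x4e25                        # code point of 严
>>> b1 = 0xE0 | (cp >> 12)             # 1110xxxx: top 4 bits
>>> b2 = 0x80 | ((cp >> 6) & 0x3F)     # 10xxxxxx: middle 6 bits
>>> b3 = 0x80 | (cp & 0x3F)            # 10xxxxxx: low 6 bits
>>> '%02x %02x %02x' % (b1, b2, b3)
'e4 b8 a5'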
When codecs meet Python 2.x
Now let's verify the theory above in the Python language. In this section, "unicode" usually means Python's unicode type; Unicode encoding and the unicode() function also come up, so mind the context.
Note too that "encoding" carries two meanings. As a noun it names the binary representation of characters, as in Unicode encoding or GBK encoding. As a verb it names the mapping from characters to binary; in the narrow sense used here, encoding is the process of converting the unicode type into the str type, and decoding is the reverse. Also keep in mind: a unicode instance is always in Unicode, whereas a str instance may be in GBK, ASCII, or UTF-8 encoding.
Difference between unicode and str
In Python 2.7 there are two "string" types, str and unicode, which share the base class basestring. str is a plain string, better called a byte string, because one byte counts as one unit of length. unicode is a Unicode string, where one character (possibly several bytes) counts as one unit of length.
In Python 2.7, the unicode type must be marked explicitly, by prefixing the literal with u:
>>> us = u'严'
>>> print type(us), len(us)
<type 'unicode'> 1
>>> s = '严'
>>> print type(s), len(s)
<type 'str'> 2
>>>
Two things stand out: first, us and s have different types; second, the same Chinese character has different lengths under the two types. For a unicode instance, the length is the number of characters; for a str instance, the length is the number of bytes those characters occupy. Note that the length of s (s = '严') varies with the environment! This will be explained later.
__str__ and __repr__
These are two magic methods in Python that easily confuse newcomers, since in many cases their implementations are identical, yet the two are invoked in different places.
__str__ is mainly for display: str(obj) or print obj calls it, and its return value must be a str object.
__repr__ is called by repr(obj), or when obj is evaluated directly at the interactive prompt:
>>> us = u'严'
>>> us
u'\u4e25'
>>> print us
严
The result shown without print better reflects the object's essence: us is a unicode object (the leading u marks the type, and Unicode code points are written with \u), and the Unicode encoding of 严 is indeed 4E25. print invokes us.__str__, equivalent to print str(us), which makes the result friendlier to humans. So how does unicode.__str__ convert to str? The answer comes later.
unicode, str, and UTF-8
As mentioned above, Unicode is only a standard (just a set of mappings between characters and numbers), while UTF-8 is a concrete encoding rule (it not only contains the character-to-binary mapping, but its binary can also be stored and transmitted). That is, UTF-8 is responsible for turning unicode into a binary string fit for storage and transmission, i.e., the str type. We call this conversion encoding; the reverse process, from str to unicode, is decoding.
In Python, decode() and encode() perform decoding and encoding, with the unicode type as the hub in the middle, as shown below:
        decode                 encode
str ------------> unicode ------------> str
That is, calling decode on a str yields a unicode, and calling encode on a unicode yields a str. For example:
>>> us = u'严'
>>> ss = us.encode('utf-8')
>>> ss
'\xe4\xb8\xa5'
>>> type(ss)
<type 'str'>
>>> ss.decode('utf-8') == us
True
This shows what encode and decode do, and confirms that the UTF-8 encoding of 严 is E4B8A5.
So unicode.encode converts unicode to str, and unicode.__str__ also converts unicode to str. How do the two compare?
unicode.encode vs unicode.__str__
First, a look at the documentation:
str.encode([encoding[, errors]])
Return an encoded version of the string. Default encoding is the current default string encoding.
object.__str__(self)
Called by the str() built-in function and by the print statement to compute the “informal” string representation of an object.
Note: the str in str.encode here is really basestring, the common base class of str and unicode.
encode has the optional parameters encoding and errors; in the example above, encoding was utf-8. __str__ takes no parameters, so we can guess that for the unicode type, __str__ must also pick some encoding with which to encode the unicode.
First, one can't help wondering what happens when encode is called without an argument:
>>> us.encode()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u4e25' in position 0: ordinal not in range(128)
Evidently, by default, unicode is encoded with the ASCII codec. Why ASCII? Because it is the system default encoding (the return value of sys.getdefaultencoding()). ASCII obviously cannot represent a Chinese character, hence the exception. With UTF-8 there is no error, because UTF-8 can represent this character.
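A minimal check that the no-argument call is just sys.getdefaultencoding() spelled implicitly (it produces the same traceback):
>>> import sys
>>> us.encode(sys.getdefaultencoding())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u4e25' in position 0: ordinal not in range(128)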
What happens if we print ss (the return value of us.encode('utf-8')) directly?
>>> print ss
涓
The result is a bit strange, and different from us.__str__ (i.e., printing us directly). What about encoding='gbk'?
>>> print us.encode('gbk')
严
Got it! In fact, Python uses the terminal's default encoding (check it with locale.getdefaultlocale(); on Windows it is GBK) to encode unicode into str for display.
On Linux (where the terminal encoding is UTF-8), the results are as follows:
>>> us = u'严'
>>> print us.encode('utf-8')
严
>>> print us.encode('gbk')
��
>>> print us
严
>>>
Pay attention to the above garbled characters!
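Strictly speaking, what print consults for a unicode object written to a terminal is sys.stdout.encoding (shown in the environment dump at the beginning); on both platforms here it coincides with the locale default, so the explanation above holds. A quick check on Windows:
>>> import sys
>>> print us.encode(sys.stdout.encoding)
严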
Conversion between unicode and GBK
The previous section showed that unicode can be encoded with UTF-8 (encoding='utf-8') into a UTF-8 str, and equally encoded with GBK (encoding='gbk') into a GBK str. This is a bit dizzying; let's park it as the first question, to be answered later.
The conversion between Unicode and UTF-8 can be computed with the rules above, but there is no formula for the conversion between Unicode and GBK; it relies purely on lookup: a mapping table records, for each Chinese character, the correspondence between its Unicode and GBK codes.
>>> us = u'严'
>>> us
u'\u4e25'
>>> us.encode('gbk')
'\xd1\xcf'
>>> us.encode('gb2312')
'\xd1\xcf'
>>> us.encode('gb18030')
'\xd1\xcf'
>>> s = '严'
>>> s
'\xd1\xcf'
>>>
It is easy to see that the Unicode encoding of 严 is 4e25 and its GBK encoding is d1cf, so us encoded with GBK is d1cf. It is also clear that GB18030, GBK, and GB2312 are mutually compatible here.
Why does print us.encode('utf-8') print 涓?
ss = us.encode('utf-8') makes ss a str, and printing it directly gives a rather odd result. What, then, are the bytes of a str holding 涓?
>>> s = '涓'
>>> s
'\xe4\xb8'
The bytes of the str 涓 (on this GBK terminal) are E4B8, which is exactly the UTF-8 encoding of 严 (E4B8A5) minus the last byte. That is because the lone trailing byte A5 cannot be displayed; verification follows:
>>> print '--%s--' % ss
--涓?-
So ss merely happens to display as 涓; in fact, ss has nothing to do with 涓.
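To double-check the mojibake, decode those two displayable bytes with GBK explicitly; 涓 is indeed U+6D93, a different character from 严:
>>> '\xe4\xb8'.decode('gbk')
u'\u6d93'
>>> print u'\u6d93'
涓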
Answering the first question: what exactly is a str?
The sections above spoke of both UTF-8-encoded str and GBK-encoded str, which is somewhat confusing. We know the character 严 can be stored as GBK ('\xd1\xcf') or as UTF-8 ('\xe4\xb8\xa5'). So when we type this character in a terminal, which form do we get? It depends on the terminal's default encoding.
On Windows (where the default terminal encoding is GBK):
>>> s = '严'
>>> s
'\xd1\xcf'
On Linux (where the default terminal encoding is UTF-8):
>>> a = '严'
>>> a
'\xe4\xb8\xa5'
Either way, the same Chinese character is of type str in Python, but its bytes differ under different encodings, and therefore so does its length. For the str type, the length is the byte length.
We can also see that GBK generally uses fewer bytes than UTF-8 for Chinese text, which is one more reason GBK lives on.
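The byte counts confirm this:
>>> len(u'严'.encode('gbk')), len(u'严'.encode('utf-8'))
(2, 3)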
One thing must be stressed here: the binary form of a unicode object has nothing to do with the terminal encoding! That should be easy to accept by now.
The unicode function
str.decode, mentioned above, converts str to unicode. There is also a built-in unicode() function. The two signatures are:
unicode(object[, encoding[, errors]])
Return the Unicode string version of object using one of the following modes:
str.decode([encoding[, errors]])
Decodes the string using the codec registered for encoding. encoding defaults to the default string encoding.
The parameters match, and the two are in fact equivalent; the default value of encoding is the same, namely the result of sys.getdefaultencoding(). For example:
>>> s = '严'
>>> newus = unicode(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 0: ordinal not in range(128)
>>> newus = unicode(s, 'utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd1 in position 0: invalid continuation byte
>>> newus = unicode(s, 'gbk')
>>> newus
u'\u4e25'
The first UnicodeDecodeError arises because the system default encoding is ASCII. The second arises because s (a str instance) is encoded in the terminal's default encoding (GBK on Windows), so its bytes are not valid UTF-8; only by consulting the GBK-Unicode mapping table (encoding='gbk') can s be converted to unicode.
Why sys.setdefaultencoding?
This snippet shows up in a great deal of Python 2 code:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
It is not hard to guess that setdefaultencoding is the counterpart of getdefaultencoding: setting the system default encoding to UTF-8 is meant to solve the implicit str-to-unicode conversion problem.
As the previous section showed, when the unicode function converts str to unicode, two factors matter: first, the encoding of the str itself; second, when no encoding argument is passed, sys.getdefaultencoding() is used. The encoding argument must match the str's actual encoding, or a UnicodeDecodeError follows.
Everyone who writes Python code knows the first line of a py file should read:
# -*- coding: utf-8 -*-
This line tells the editor and the interpreter that every str literal in the file is UTF-8 encoded, and that the file itself is stored in UTF-8.
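As an aside, if the declaration is missing and the file contains non-ASCII bytes, CPython 2 refuses to compile it with an error along these lines (message abbreviated; the file name is hypothetical):
SyntaxError: Non-ASCII character '\xe4' in file test.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details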
Now suppose the file goes on to contain:
s = '中文'
us = unicode(s)
Here unicode() is called without an encoding argument, and the call only succeeds if the default matches the str's actual encoding (UTF-8 in this file). That is why setdefaultencoding is used to set the system default encoding to UTF-8.
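Putting the pieces together, a minimal sketch of such a file (the hack is shown for completeness; the final print merely verifies the conversion):
# -*- coding: utf-8 -*-
import sys
reload(sys)                  # re-expose setdefaultencoding, removed by site.py at startup
sys.setdefaultencoding('utf-8')

s = '中文'          # a str literal: UTF-8 bytes, per the coding declaration
us = unicode(s)    # now decodes with the new default (utf-8) instead of ascii
print repr(us)     # u'\u4e2d\u6587'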
Garbled code and UnicodeError
The following lists several common cases of garbled output and UnicodeError exceptions. The causes of most of them have been covered above; where possible, a feasible solution is given as well.
UnicodeError includes UnicodeDecodeError and UnicodeEncodeError: the former occurs during decode, i.e., converting str to unicode, and the latter during encode, converting unicode to str.
Printing a str directly
The example is the one we saw above:
>>> ss = us.encode('utf-8')
>>> print ss
涓
If a str is read from the network or from a file, first decode it to unicode using whatever encoding the peer used to encode it, and only then output it (on output, a unicode object is automatically converted to the str encoding the terminal expects).
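A minimal sketch of that discipline, assuming the peer is known to write UTF-8 (the file name data.txt is hypothetical):
f = open('data.txt')         # bytes as written by the peer
raw = f.read()               # str: whatever bytes were on disk
f.close()
text = raw.decode('utf-8')   # unicode, now terminal-independent
print text                   # re-encoded for the local terminal on output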
Chinese characters outside an encoding's range
Example:
>>> newus = u'囍'
>>> newus
u'\u56cd'
>>> newus.encode('gbk')
'\x87\xd6'
>>> newus.encode('gb2312')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'gb2312' codec can't encode character u'\u56cd' in position 0: illegal multibyte sequence
>>>
As shown, the character 囍 can be encoded by GBK but not by GB2312.
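If lossy output is acceptable, the errors parameter of encode offers a workaround: 'replace' substitutes a question mark, and 'ignore' drops the character entirely.
>>> newus.encode('gb2312', 'replace')
'?'
>>> newus.encode('gb2312', 'ignore')
''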
When str is converted to unicode
The unicode function examples above already showed that UnicodeDecodeError can occur.
More often, though, this error comes from the implicit default conversion from str to unicode, for example when a str and a unicode are added:
>>> a = '严'
>>> b = u'严'
>>> c = a + b
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 0: ordinal not in range(128)
When a unicode and a str are added, the str is converted to unicode using the default encoding, i.e., unicode(strobj, encoding=sys.getdefaultencoding()).
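The robust fix is to make the conversion explicit with the correct codec (assuming the Windows/GBK terminal above, where a holds '\xd1\xcf'):
>>> c = a.decode('gbk') + b
>>> c
u'\u4e25\u4e25'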
Strings that look like Unicode escapes
Sometimes a printed str comes out as '\u4e25'. Does that look familiar? The Unicode encoding of 严 is u'\u4e25'. Look closely: the only difference is the u before the quotes (which marks the unicode type). So when you see '\u4e25', how do you find the corresponding character? For a literal of this known form, you can manually add the u prefix and enter it in the terminal. If it sits in a variable, however, it has to be converted to unicode programmatically; for that, use unicode_escape from the Python-specific encodings.
>>> s = '\u4e25'
>>> s
'\\u4e25'
>>> us = s.decode('unicode_escape')
>>> us
u'\u4e25'
Strings that look like hex escapes
Sometimes you see a str that prints as \xd1\xcf. That looks familiar too: it resembles the GBK encoding '\xd1\xcf' of 严, except that here the backslashes are literal characters, so the string is not actual hexadecimal bytes. The solution is string_escape from the Python-specific encodings.
>>> s = '\\xd1\\xcf'
>>> s
'\\xd1\\xcf'
>>> print s
\xd1\xcf
>>> news = s.decode('string_escape')
>>> news
'\xd1\xcf'
>>> print news
严
A question for readers
Here we leave a question:
u'严' == '严'
Is the result True or False? The surrounding context has deliberately been omitted, but clearly, given everything above, the answer can differ in different encoding environments!
Summary and Suggestions
No matter how it is explained, character encoding in Python 2.x remains a headache: even after you understand it, you may well forget it again later. Many suggestions have been offered for this problem; I summarize them as follows:
First: use Python 3 and you no longer need to agonize over unicode vs str. But that is rarely the developer's decision to make;
Second: avoid Chinese entirely; write comments and messages in English. The ideal is beautiful but the reality is harsh, and in practice this mostly produces piles of pinyin;
Third: represent Chinese strings as unicode rather than str. Not easy to enforce in practice;
Fourth: encode unicode only at the moment of transmission or persistence, and decode at the opposite boundary;
Fifth: for network interfaces, agree on the encoding/decoding format in advance; UTF-8 is strongly recommended;
Sixth: don't panic at a UnicodeXXXError: if XXX is Encode, the problem is in converting unicode to str; if Decode, it is in converting str to unicode.
References
Python codecs
Python-specific-encodings
Character encoding notes: ASCII, Unicode and UTF-8 (Ruan Yifeng)
Python-annoying Coding