This article demonstrates, through interpreter experiments, how string encoding differs between Python 2 and Python 3.
In Python 2, string literals are sequences of 8-bit characters, that is, encoded byte strings. An important limitation of these strings is that they cannot fully support international character sets and Unicode. To address this limitation, Python 2 provides a separate string type for Unicode data. To write a Unicode string literal, put the prefix u immediately before the opening quotation mark.
Python 2 also accepts byte literals (written with a b prefix), which denote strings that have already been encoded. In Python 2 there is no difference between a byte literal and a normal string, because a normal string in Python 2 is itself an encoded byte string, not Unicode.
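The difference between an unencoded Unicode string (a sequence of code points) and an encoded byte string can be sketched as follows. This is a minimal illustration written for Python 3, not taken from the original experiment; the helper name describe is hypothetical, and the sample text is written with escapes for the two characters U+5F20 and U+4FCA.

```python
def describe(text):
    """Return the code points and the UTF-8 bytes of a text string.

    `describe` is a hypothetical helper used only for this illustration.
    """
    code_points = [hex(ord(ch)) for ch in text]  # one entry per character
    utf8_bytes = text.encode("utf-8")            # the encoded byte string
    return code_points, utf8_bytes


text = "\u5f20\u4fca"  # two characters, written as Unicode escapes
code_points, utf8_bytes = describe(text)

print(code_points)                 # ['0x5f20', '0x4fca']
print(utf8_bytes)                  # b'\xe5\xbc\xa0\xe4\xbf\x8a'
print(len(text), len(utf8_bytes))  # 2 6
```

The text has two characters, but its UTF-8 encoding has six bytes, which is why the two kinds of string need to be kept apart.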
In Python 3, the u prefix is unnecessary because all strings are Unicode by default; in Python 3.0 through 3.2 writing it is a syntax error, and Python 3.3 and later accept it again only for compatibility with Python 2 code. If you run the Python 2 interpreter with the -U option, it simulates this behavior (all string literals are treated as Unicode, and the u prefix can be omitted). In Python 3, a byte literal is a different type (bytes) from a normal string (str).
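A short sketch of the Python 3 type separation, assuming a Python 3 interpreter; the variable names are illustrative only. It shows that text and bytes are distinct types, that mixing them is rejected, and that decoding recovers the original text.

```python
text = "\u5f20\u4fca"        # str: a sequence of Unicode code points
data = text.encode("utf-8")  # bytes: the UTF-8 encoding of that text

print(type(text).__name__)   # str
print(type(data).__name__)   # bytes

mixing_error = None
try:
    text + data              # Python 3 refuses to combine str and bytes
except TypeError as exc:
    mixing_error = type(exc).__name__
print(mixing_error)          # TypeError

round_trip_ok = data.decode("utf-8") == text
print(round_trip_ok)         # True
```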
~/download/firefox $ python2
Python 2.7.2 (default, June, 11:17:09)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> '张俊'  # Python 2 stores the literal as an encoded byte string (UTF-8 in this terminal)
'\xe5\xbc\xa0\xe4\xbf\x8a'
>>> u'张俊'  # explicitly a unicode string: not encoded, it stores each character's code point in the Unicode character set
u'\u5f20\u4fca'
>>> '张俊'.encode('utf-8')  # already UTF-8 bytes, so encoding again fails: Python 2 first tries to decode the bytes as ASCII
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
>>> '张俊'.decode('utf-8')  # decoding works; the result is an unencoded unicode string
u'\u5f20\u4fca'
>>> b'张俊'  # the literal is already UTF-8 bytes in Python 2, so the b prefix changes nothing
'\xe5\xbc\xa0\xe4\xbf\x8a'
>>> print '张俊'
张俊
>>> print u'张俊'
张俊
>>> print b'张俊'
张俊
>>>
~/download/firefox $ python3
Python 3.2.2 (default, Sep 5, 04:33:58)
[GCC 4.6.1 20110819 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> '张俊'  # Python 3 strings are Unicode by default (not encoded)
'张俊'
>>> u'张俊'  # Unicode is the default, so the u prefix from Python 2 is not needed; in this version it is a syntax error
  File "<stdin>", line 1
    u'张俊'
          ^
SyntaxError: invalid syntax
>>> type('张俊')  # Python 3 strictly separates text strings from byte strings; the default is a Unicode text string
<class 'str'>
>>> '张俊'.decode('utf-8')  # text strings are already Unicode, so they have no decode method
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
>>> '张俊'.encode('utf-8')  # encode the text string into a byte string
b'\xe5\xbc\xa0\xe4\xbf\x8a'
>>> type('张俊'.encode('utf-8'))
<class 'bytes'>
>>> print('张俊'.encode('utf-8'))  # an encoded byte string no longer has many of the attributes and methods of a text string
b'\xe5\xbc\xa0\xe4\xbf\x8a'
>>> print('张俊'.encode('utf-8').decode('utf-8'))  # the byte string must be decoded before its text can be printed
张俊
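The UnicodeDecodeError seen in the Python 2 session comes from an implicit ASCII decode that Python 2 performs before encoding a byte string. The sketch below imitates that step explicitly in Python 3; it is an illustration under that assumption, not Python 2's actual code path, and the variable names are hypothetical.

```python
raw = b"\xe5\xbc\xa0\xe4\xbf\x8a"  # the UTF-8 bytes shown in the sessions

outcome = None
bad_byte_position = None
try:
    raw.decode("ascii")            # imitates Python 2's implicit ASCII decode
    outcome = "decoded"
except UnicodeDecodeError as exc:
    outcome = exc.reason           # the codec's explanation of the failure
    bad_byte_position = exc.start  # index of the first undecodable byte

print(outcome)            # ordinal not in range(128)
print(bad_byte_position)  # 0

# Decoding with the codec that actually produced the bytes succeeds,
# and encoding the result again gives back the same bytes.
text = raw.decode("utf-8")
print(text.encode("utf-8") == raw)  # True
```

Naming the codec explicitly at every encode and decode step avoids this class of error in both Python versions.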