This article demonstrates, through interactive experiments, how string encoding differs between Python 2 and Python 3.
In Python 2, a string literal is a sequence of 8-bit characters, that is, a byte string. An important limitation of these strings is that they do not fully support international character sets or Unicode. To address this limitation, Python 2 provides a separate string type for Unicode data. To write a Unicode string literal, prefix the opening quotation mark with the letter u, as in u'text'.
Python 2 also accepts byte literals, written with a b prefix, which denote a string that has already been encoded. In Python 2 a byte literal is no different from an ordinary string, because an ordinary Python 2 string is itself an encoded (non-Unicode) byte string.
In Python 3 the u prefix is unnecessary, because all strings are Unicode by default; in Python 3.0 through 3.2 writing it is in fact a syntax error (Python 3.3 accepts the prefix again for compatibility). If you run the Python 2 interpreter with the -U option, it emulates this behavior: all string literals are treated as Unicode and the u prefix can be omitted. In Python 3, a byte literal is a different type (bytes) from an ordinary string (str).
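The distinction described above can be sketched in Python 3, where text and encoded data are separate types. This is only an illustration: the variable names text, data and code_points are chosen here and do not come from the original sessions.

```python
# Python 3 sketch: text (str) versus encoded data (bytes).
text = '张俊'                   # str: a sequence of Unicode code points
data = text.encode('utf-8')     # bytes: the UTF-8 encoding of that text

# A str stores code points, independent of any encoding.
code_points = [hex(ord(ch)) for ch in text]
print(code_points)                        # ['0x5f20', '0x4fca']

# The encoded form is a different type with a different length.
print(type(text).__name__, len(text))     # str 2
print(type(data).__name__, len(data))     # bytes 6

# bytes and str never compare equal; decode first to compare as text.
print(data == text)                       # False
print(data.decode('utf-8') == text)       # True
```

Each of the two characters takes three bytes in UTF-8, which is why the byte string is three times as long as the text.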
~/download/firefox $ python2
Python 2.7.2 (default, Jun 29 2011, 11:17:09)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> '张俊'  # Python 2 stores the literal as a byte string in the input encoding
'\xe5\xbc\xa0\xe4\xbf\x8a'  # here, a UTF-8 encoded byte string
>>> u'张俊'  # explicitly a unicode string: it is not encoded and holds the code points (ordinals) of the characters in the Unicode character set
u'\u5f20\u4fca'
>>> '张俊'.encode('utf-8')  # the string is already UTF-8 bytes; to encode it again, Python 2 first decodes it as ASCII, which fails
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
>>> '张俊'.decode('utf-8')  # decoding works; the result is an unencoded unicode string
u'\u5f20\u4fca'
>>> b'张俊'  # the string is already UTF-8 encoded, so it is a byte string with or without the b prefix
'\xe5\xbc\xa0\xe4\xbf\x8a'
>>> print '张俊'
张俊
>>> print u'张俊'
张俊
>>> print b'张俊'
张俊
>>>
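The UnicodeDecodeError in the Python 2 session comes from a hidden step: calling encode on a Python 2 byte string first decodes it with the default ASCII codec to obtain a unicode object, and only then encodes that. The following Python 3 sketch makes the hidden step explicit. The function py2_style_encode is a hypothetical helper written for this illustration; it models the behavior and is not an API of either Python version.

```python
def py2_style_encode(raw, encoding):
    """Model Python 2's str.encode(): implicit ASCII decode, then encode.

    Hypothetical helper for illustration only.
    """
    text = raw.decode('ascii')   # the hidden step; fails on non-ASCII bytes
    return text.encode(encoding)


raw = b'\xe5\xbc\xa0\xe4\xbf\x8a'   # UTF-8 bytes of the two-character name

error = None
try:
    py2_style_encode(raw, 'utf-8')
except UnicodeDecodeError as exc:
    error = exc                  # keep a reference; exc is cleared after the block
    print(exc)

# Pure ASCII bytes survive the implicit decode, which is why the
# problem only appears with non-ASCII data.
ascii_result = py2_style_encode(b'abc', 'utf-8')
print(ascii_result)              # b'abc'

# For bytes that are already encoded, the right operation is decode.
decoded = raw.decode('utf-8')
print(ascii(decoded))            # '\u5f20\u4fca'
```

The failure is therefore a decoding error raised from an encode call, which matches the traceback in the session.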
~/download/firefox $ python3
Python 3.2.2 (default, Sep  5 2011, 04:33:58)
[GCC 4.6.1 20110819 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> '张俊'  # Python 3 strings are Unicode by default (not encoded)
'张俊'
>>> u'张俊'  # strings are Unicode by default, so the type need not be marked explicitly as in Python 2; doing so is a syntax error here
  File "<stdin>", line 1
    u'张俊'
          ^
SyntaxError: invalid syntax
>>> type('张俊')  # Python 3 strictly separates text strings from byte strings; the default is a Unicode text string
<class 'str'>
>>> '张俊'.decode('utf-8')  # the default text string is already Unicode, so it has no decode method
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
>>> '张俊'.encode('utf-8')  # encoding the text string converts it to an encoded byte string
b'\xe5\xbc\xa0\xe4\xbf\x8a'
>>> type('张俊'.encode('utf-8'))
<class 'bytes'>
>>> print('张俊'.encode('utf-8'))  # for an encoded byte string, many features and methods of a text string are no longer available
b'\xe5\xbc\xa0\xe4\xbf\x8a'
>>> print('张俊'.encode('utf-8').decode('utf-8'))  # a byte string must be decoded before it prints as text
张俊
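The last point of the Python 3 session, that a byte string loses text-string behavior until it is decoded, can be checked with a short sketch. The variable names are illustrative, and the checks below cover only a few representative methods rather than a complete list.

```python
text = '张俊'
data = text.encode('utf-8')

# Indexing bytes yields integers; indexing str yields 1-character strings.
first_byte = data[0]
print(first_byte)                 # 229, that is 0xe5
print(ascii(text[0]))             # '\u5f20'

# Some text-only methods are absent from bytes, and str has no decode.
print(hasattr(data, 'encode'))    # False
print(hasattr(data, 'format'))    # False
print(hasattr(text, 'decode'))    # False

# str() on bytes gives the literal representation, not the decoded text.
as_repr = str(data)
print(as_repr)                    # b'\xe5\xbc\xa0\xe4\xbf\x8a'

# Decoding explicitly recovers the original text.
round_trip = data.decode('utf-8')
print(round_trip == text)         # True
```

This is why print shows the b'...' form for the encoded value in the session: Python 3 does not guess an encoding on the caller's behalf.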