There are obvious differences between Python2 and Python3 in string encoding.
In Python2, the ordinary string type (str) is a byte string, so on its own it cannot fully represent international character sets or Unicode text. To address this limitation, Python2 provides a separate string type, unicode, for Unicode data. To write a Unicode string literal, put the prefix u before the opening quotation mark. An ordinary Python2 string is really an encoded (non-Unicode) byte string.
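In Python3 the prefix is unnecessary, because every string is already a Unicode string by default. In Python 3.0 through 3.2 writing the u prefix is a syntax error; Python 3.3 and later accept it again for compatibility with Python2 code, where it has no effect.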
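The distinction described above can be checked directly. The following is a minimal sketch, assuming Python 3.3 or later (so that the u prefix is accepted); the variable names are only illustrative.

```python
# Assumes Python 3.3 or later, where the u prefix is accepted again.
text = '张俊'                  # str: a sequence of Unicode code points
same_text = u'张俊'            # the u prefix changes nothing in Python3
data = text.encode('utf-8')    # bytes: the UTF-8 encoding of the text

print(type(text).__name__)     # str
print(type(data).__name__)     # bytes
print(text == same_text)       # True
print(len(text), len(data))    # 2 6 (two characters, six bytes)
print(data.decode('utf-8') == text)  # True
```

The two lengths differ because each of these characters takes three bytes in UTF-8.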
$ Python2 example:
>>> '张俊' # a plain Python2 string literal is a byte string, stored in the terminal's encoding
'\xe5\xbc\xa0\xe4\xbf\x8a' # here the bytes are the UTF-8 encoding
>>> u'张俊' # explicitly a unicode string; it is not encoded and holds each character's code point in the Unicode character set
u'\u5f20\u4fca'
>>> '张俊'.encode('utf-8') # the string is already UTF-8 bytes; to encode it again, Python2 first decodes it implicitly with the ASCII codec, and that step fails
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
>>> '张俊'.decode('utf-8') # decoding works; the result is a unicode string, which is not encoded
u'\u5f20\u4fca'
>>> b'张俊' # Python2 (2.6 and later) ignores the b prefix, so this is the same UTF-8 byte string
'\xe5\xbc\xa0\xe4\xbf\x8a'
>>> print '张俊'
张俊
>>> print u'张俊'
张俊
>>> print b'张俊'
张俊
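The UnicodeDecodeError in the Python2 session comes from an implicit ASCII decode that happens before the requested encode. The sketch below reproduces that two-step behaviour in Python3 so it can be run today; py2_style_encode is a hypothetical helper written for this illustration, not a function from any library.

```python
# The bytes that a Python2 str literal would hold on a UTF-8 terminal.
data = '张俊'.encode('utf-8')


def py2_style_encode(byte_string, encoding):
    """Hypothetical helper: mimic Python2's str.encode().

    Python2 first decodes the byte string with its default codec
    (ASCII) and only then encodes the result.
    """
    return byte_string.decode('ascii').encode(encoding)


error_message = None
try:
    py2_style_encode(data, 'utf-8')
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
# Pure ASCII input survives the round trip unchanged.
print(py2_style_encode(b'abc', 'utf-8'))
```

The failure happens in the decode step, which is why Python2 reports a decode error even though the code only asked for an encode.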
$ Python3 example:
>>> '张俊' # Python3 strings are Unicode by default (not encoded)
'张俊'
>>> u'张俊' # strings are Unicode by default, so unlike Python2 the type need not be marked explicitly; Python 3.0 through 3.2 reject the prefix as a syntax error (3.3 and later accept it)
  File "<stdin>", line 1
    u'张俊'
           ^
SyntaxError: invalid syntax
>>> type('张俊') # Python3 strictly separates text strings from byte strings; the default is a Unicode text string
<class 'str'>
>>> '张俊'.decode('utf-8') # the default text string is already Unicode, so it has no decode method
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
>>> '张俊'.encode('utf-8') # encode the text string, producing an encoded byte string
b'\xe5\xbc\xa0\xe4\xbf\x8a'
>>> type('张俊'.encode('utf-8'))
<class 'bytes'>
>>> print('张俊'.encode('utf-8')) # an encoded byte string no longer has many of the features and methods of a text string
b'\xe5\xbc\xa0\xe4\xbf\x8a'
>>> print('张俊'.encode('utf-8').decode('utf-8')) # a byte string must be decoded before it prints as readable text
张俊
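Because Python3 keeps text and bytes strictly apart, one common way to stay out of trouble is to convert at the edges of a program and work with str everywhere else. The following is a minimal sketch of that idea; to_text and to_bytes are hypothetical helper names introduced here, and UTF-8 as the default encoding is an assumption.

```python
def to_text(value, encoding='utf-8'):
    """Hypothetical helper: return value as str, decoding bytes if needed."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding='utf-8'):
    """Hypothetical helper: return value as bytes, encoding str if needed."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


raw = b'\xe5\xbc\xa0\xe4\xbf\x8a'    # UTF-8 bytes, e.g. read from a file or socket
name = to_text(raw)                   # decode once, on the way in

print(name == '张俊')                 # True
print(to_text(name) == name)          # True: text passes through unchanged
print(to_bytes(name) == raw)          # True: encode once, on the way out
```

Both helpers leave a value of the target type untouched, so calling them twice never triggers the kind of double encoding or decoding shown in the sessions above.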
Solving the problem of string encoding in Python2 and Python3