The first thing to understand is that in Python 2 a string object (str) and a Unicode object (unicode) are two different types.
A string object is a sequence of bytes holding encoded characters, and a Unicode object is a sequence of Unicode code units.
The characters in a string can be encoded in many ways: single-byte ASCII, double-byte GB2312, UTF-8, and so on. So to interpret a string, you first have to know which encoding its characters are in.
So what is a Unicode code unit? It is a 16-bit or 32-bit numeric value, each of which represents one Unicode symbol. In Python, 16-bit code units correspond to the UCS2 encoding and 32-bit code units to UCS4. That may sound no different from the character codes in a string. The way I keep it straight in my head: in Python, text encoded as UCS2 or UCS4 is called a Unicode object, and text in any other encoding is called a string.
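To make the idea of a code unit concrete, here is a small sketch that is not part of the original discussion. It assumes the two characters U+4F60 and U+597D as sample text, and it uses the standard utf-16-le and utf-32-le codecs as stand-ins for UCS2 and UCS4, which only holds for characters in the Basic Multilingual Plane. It is written to run on both Python 2 and Python 3.

```python
import struct

# Sample text (an assumption for illustration): U+4F60 U+597D.
text = u'\u4f60\u597d'

# For BMP characters, UTF-16 matches UCS2 and UTF-32 matches UCS4.
ucs2_bytes = text.encode('utf-16-le')   # 2 bytes per code unit
ucs4_bytes = text.encode('utf-32-le')   # 4 bytes per code unit

# Unpack the raw bytes into 16-bit and 32-bit unsigned integers.
units16 = struct.unpack('<%dH' % (len(ucs2_bytes) // 2), ucs2_bytes)
units32 = struct.unpack('<%dI' % (len(ucs4_bytes) // 4), ucs4_bytes)

print([hex(u) for u in units16])
print([hex(u) for u in units32])
```

Both tuples hold the same two values, 0x4f60 and 0x597d. Only the width of each unit differs.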
As for whether Python's Unicode is UCS2 or UCS4, that is chosen at compile time. For example, on Linux, to build Python with UCS2 as its Unicode encoding:
# ./configure --enable-unicode=ucs2
# make
# make install
The precompiled Windows builds available for download are generally UCS2. To find out whether a given Python runtime is UCS2 or UCS4, check sys.maxunicode: 65535 means UCS2, and a much larger number (1114111) means UCS4.
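The check can be wrapped in a few lines. This is only a sketch: the helper name unicode_build is made up for illustration, and its optional parameter exists only so the logic can be tried without two different interpreters.

```python
import sys

def unicode_build(maxunicode=None):
    """Return 'UCS2' or 'UCS4' for a given sys.maxunicode value.

    unicode_build is a hypothetical helper, not a standard function.
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # A narrow (UCS2) build tops out at 0xFFFF, which is 65535.
    return 'UCS2' if maxunicode == 0xFFFF else 'UCS4'

print(unicode_build())
```

On Python 3.3 and later sys.maxunicode is always 1114111, so the distinction only matters on older interpreters.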
Now let's look at how string and Unicode differ in Python.
First, on a Simplified Chinese Windows 2003 system, where the system encoding is GBK:
>>> a = '你好'
>>> a
'\xc4\xe3\xba\xc3'
>>> b = u'你好'
>>> b
u'\u4f60\u597d'
>>> print a
你好
>>> print b
你好
>>> a.__class__
<type 'str'>
>>> b.__class__
<type 'unicode'>
>>> len(a)
4
>>> len(b)
2
And in a Linux environment where the system encoding is UTF-8:
>>> a = '你好'
>>> a
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> b = u'你好'
>>> b
u'\u4f60\u597d'
>>> print a
你好
>>> print b
你好
>>> a.__class__
<type 'str'>
>>> b.__class__
<type 'unicode'>
>>> len(a)
6
>>> len(b)
2
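The two sessions above depend on the system encoding, so they cannot be reproduced on a single machine as written. As a rough sketch, the same byte sequences can be obtained anywhere by encoding the Unicode object explicitly. The variable names are only for illustration, and the code is written to run on both Python 2 and Python 3.

```python
# The Unicode object from the sessions above: U+4F60 U+597D.
text = u'\u4f60\u597d'

# Encode the same text with the two system encodings discussed.
gbk_bytes = text.encode('gbk')
utf8_bytes = text.encode('utf-8')

# The byte strings differ in content and length,
# while the Unicode object always has two characters.
print(repr(gbk_bytes), len(gbk_bytes))
print(repr(utf8_bytes), len(utf8_bytes))
print(len(text))
```

The lengths come out as 4, 6, and 2, matching the len() results in the two sessions.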
So what do we see? A brief summary:
1. A string is written directly in quotation marks; a Unicode object adds a u before the quotation marks.
2. A string constant typed in directly is encoded with the system default encoding. For example, '你好' becomes '\xc4\xe3\xba\xc3' in a GBK environment and '\xe4\xbd\xa0\xe5\xa5\xbd' in a UTF-8 environment.
3. len(string) returns the number of bytes in the string; len(unicode) returns the number of characters.
4. A very important point: printing a Unicode object does not produce garbled text. The Linux and Windows systems in common use today all support Unicode, unless the version is very old. Windows 2003, for example, supports UCS2, so Chinese Windows 2003 can display UCS2-encoded text correctly in addition to text in its default GBK encoding. For example, in the GBK environment of Chinese Windows 2003:
>>> a = '\xe4\xbd\xa0\xe5\xa5\xbd'  # UTF-8 encoding of '你好'
>>> print a
浣犲ソ
>>> b = unicode(a, "UTF-8")
>>> b
u'\u4f60\u597d'
>>> print b
你好
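What the GBK console does with those UTF-8 bytes can be imitated without a GBK system. This is a minimal sketch, assuming that printing raw bytes on such a console amounts to decoding them as GBK. It uses the decode() method rather than unicode() so that it also runs on Python 3.

```python
# UTF-8 bytes for U+4F60 U+597D, as in the session above.
utf8_bytes = b'\xe4\xbd\xa0\xe5\xa5\xbd'

# Decoding with the encoding the bytes were written in recovers the text.
right = utf8_bytes.decode('utf-8')

# Decoding with a different encoding "succeeds" but yields other characters.
wrong = utf8_bytes.decode('gbk')

print(right == u'\u4f60\u597d')
print(len(right), len(wrong))
```

The wrongly decoded value has three characters instead of two, which is the garbled text the console shows.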
That should make things clear.
Next, let's talk about converting between string and Unicode: unicode(), decode(), encode(), codecs, and so on.