ASCII is a character set that includes the uppercase and lowercase English letters, digits, control characters, and so on. Each character fits in one byte, with values in the range 0-127. Unicode text is commonly encoded as UTF-8 or UTF-16. UTF-8 is variable-length: the original design allowed up to 6 bytes per character (the current standard, RFC 3629, limits it to 4), and characters with code points up to 127 are encoded in a single byte with the same value as in ASCII. English text encoded as ASCII can therefore be treated as UTF-8 without modification.
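A minimal sketch of the ASCII/UTF-8 relationship described above. The helper name utf8_length and the sample characters are illustrative assumptions, not part of the original text; the code uses only u"..." literals and encode(), so it should run on both Python 2.6+ and Python 3.3+.

```python
def utf8_length(ch):
    """Return how many bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

# ASCII-only text yields identical bytes under ASCII and UTF-8.
ascii_text = u"hello"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)

# Characters outside the ASCII range need more than one byte in UTF-8.
for ch in (u"A", u"\u00e9", u"\u4f60"):
    print(repr(ch), utf8_length(ch))
```

Here u"A" takes one byte, u"\u00e9" takes two, and u"\u4f60" takes three.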
Python has supported Unicode since version 2.0. The decode(char_set) method converts a string in another encoding to Unicode, and the encode(char_set) method converts Unicode to another encoding.
For example, "你好".decode("gb2312") returns u'\u4f60\u597d' (assuming the byte string is GB2312-encoded); that is, the Unicode code points of "你" and "好" are 0x4f60 and 0x597d respectively.
Then u'\u4f60\u597d'.encode("UTF-8") returns '\xe4\xbd\xa0\xe5\xa5\xbd', which is the UTF-8 encoding of "你好".
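The decode-then-encode step can be sketched as a small helper. The name transcode is a hypothetical one chosen for illustration; the byte values are the GB2312 and UTF-8 encodings of "你好" quoted in the text. Byte literals (b"...") are used so the same code should run on Python 2.6+ and Python 3.

```python
def transcode(data, source_encoding, target_encoding):
    """Convert bytes from one encoding to another by going through Unicode."""
    text = data.decode(source_encoding)   # bytes -> Unicode
    return text.encode(target_encoding)   # Unicode -> bytes

gb_bytes = b"\xc4\xe3\xba\xc3"            # "你好" encoded as GB2312
text = gb_bytes.decode("gb2312")
utf8_bytes = transcode(gb_bytes, "gb2312", "utf-8")

print(repr(text))
print(repr(utf8_bytes))
```

Unicode acts as the common intermediate form, so any pair of supported encodings can be bridged the same way.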
The key to using Unicode in Python: unicode is a class. The function unicode(str, "utf8") creates a unicode object from the UTF-8 encoded string str (other encodings work as well), and the method unc.encode("utf8") converts the unicode object unc into a UTF-8 encoded string (again, other encodings work as well). A program that handles Unicode therefore needs to do the following:
* When obtaining data (a string), use unicode(str, "utf8") to create a unicode object.
* Use only unicode objects inside the program, and write string constants in the form u"string".
* On output, convert the unicode object unc to whatever encoding is required with unc.encode("some_encoding").
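The three steps above can be sketched as one function. The name shorten and its parameters are hypothetical, chosen only to show why the middle step should work on Unicode: slicing counts characters, whereas slicing the raw bytes could cut a multi-byte character in half. The code uses data.decode(...), which does the same job as unicode(str, encoding) and should also run on Python 3.

```python
def shorten(raw, limit, in_encoding="utf-8", out_encoding="utf-8"):
    """Keep the first `limit` characters of encoded input."""
    text = raw.decode(in_encoding)        # 1. decode as soon as data arrives
    text = text[:limit]                   # 2. work on Unicode only
    return text.encode(out_encoding)      # 3. encode only when writing out

utf8_input = b"\xe4\xbd\xa0\xe5\xa5\xbd"  # "你好" encoded as UTF-8
first_char_gbk = shorten(utf8_input, 1, "utf-8", "gbk")
print(repr(first_char_gbk))
```

Because the slice happens on the decoded text, the result is one whole character re-encoded as GBK rather than a fragment of a UTF-8 sequence.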
>>> unicode("你好", "utf8")
u'\u4f60\u597d'
>>> x = _
>>> type(x)
<type 'unicode'>
>>> type("你好")
<type 'str'>
>>> x.encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> x.encode("gbk")
'\xc4\xe3\xba\xc3'
>>> x.encode("gb2312")
'\xc4\xe3\xba\xc3'
>>> print x
你好
>>> print x.encode("utf8")
你好
>>> print x.encode("gbk")
???
The above is the test result (Ubuntu 6.06, UTF-8 locale). Note the difference between type(x) and type("你好"): the first is a unicode object, the second an ordinary byte string. The encoded output also shows that UTF-8 and GBK produce different bytes for the same text. Under a UTF-8 locale, print x encodes the text according to the environment's locale setting (I guess), while print x.encode("gbk") produces garbled output because the terminal interprets the GBK bytes as UTF-8.
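A rough way to see why the GBK output is garbled on a UTF-8 terminal, without depending on any terminal at all: the GBK bytes are simply not valid UTF-8. The helper is_valid_utf8 is a hypothetical name used only for this sketch, which should run on both Python 2.6+ and Python 3.

```python
def is_valid_utf8(data):
    """Return True if the byte string can be decoded as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

text = u"\u4f60\u597d"
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

print(is_valid_utf8(utf8_bytes))
print(is_valid_utf8(gbk_bytes))

# Decoding with "replace" substitutes U+FFFD for undecodable bytes,
# which is roughly what a UTF-8 terminal displays as "?" marks.
shown = gbk_bytes.decode("utf-8", "replace")
print(u"\ufffd" in shown)
```

The exact number of replacement characters can vary between decoders, so the sketch only checks that at least one appears.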