>>> # Python 2; the console appears to use GBK, so the plain str literal holds GBK bytes
>>> str_ascii = '你好!'
>>> str_unicode = u'你好!'
>>> str_utf8 = str_unicode.encode('utf-8')
>>> str_ascii
'\xc4\xe3\xba\xc3!'
>>> str_unicode
u'\u4f60\u597d!'
>>> str_utf8
'\xe4\xbd\xa0\xe5\xa5\xbd!'
>>> str_utf8.decode('utf-8')
u'\u4f60\u597d!'
>>> str_unicode.encode('gbk')
'\xc4\xe3\xba\xc3!'
What is the conclusion?
- Unicode assigns one code point to each character, whatever the language. So "hi!" is 3 Unicode characters, and "你好!" is also 3 Unicode characters.
- UTF-8 uses 3 bytes for each of these Chinese characters and 1 byte for an English (ASCII) character. So "hi!" is 3 bytes in UTF-8, and "你好!" is 7 bytes.
- GBK uses 2 bytes for a Chinese character and 1 byte for an English character. So "hi!" is 3 bytes in GBK, and "你好!" is 5 bytes.
- The encode method should be called only on unicode strings. Its parameter is the target encoding.
- The decode method should be called only on non-unicode (byte) strings. Its parameter is the encoding of the source string.
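The counts and the encode/decode round trip above can be checked with a short script. This is a minimal sketch, not part of the original session: the helper name sizes and the variable names are made up for illustration, and the text is written with \u escapes so the result does not depend on the console or file encoding. It should behave the same on Python 2.7 and Python 3.

```python
# Sketch: verify the character and byte counts listed above.
# Unicode escapes are used so the source file's encoding does not matter.
text_cn = u'\u4f60\u597d!'   # the two Chinese characters from the session, plus '!'
text_en = u'hi!'


def sizes(text):
    """Hypothetical helper: (unicode characters, UTF-8 bytes, GBK bytes)."""
    return (len(text), len(text.encode('utf-8')), len(text.encode('gbk')))


cn_sizes = sizes(text_cn)   # (3, 7, 5)
en_sizes = sizes(text_en)   # (3, 3, 3)

# encode: unicode -> bytes, the argument names the target encoding.
gbk_bytes = text_cn.encode('gbk')

# decode: bytes -> unicode, the argument names the encoding of the source bytes.
round_trip = gbk_bytes.decode('gbk')

print(cn_sizes)
print(en_sizes)
print(round_trip == text_cn)
```

Decoding with the wrong source encoding (for example, treating the GBK bytes as UTF-8) would raise UnicodeDecodeError or produce garbage, which is why the argument to decode has to match how the bytes were produced.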