I. Overview
Character encoding in Python is a topic that has reduced many people to tears, and it is sure to remain a headache in future Python development. So it is worth explaining clearly.
II. Introduction to encoding
1. Notes:
- In Python 2 the default encoding is ASCII, whereas in Python 3 strings are Unicode by default.
- Unicode has several encoding forms: UTF-32 (4 bytes per character), UTF-16 (2 or 4 bytes), and UTF-8 (1 to 4 bytes). UTF-16 is a common in-memory representation, but files are usually stored as UTF-8 because it saves space.
- In Python 3, encode converts a string (str) into the bytes type, and decode converts bytes back into a string.
- In UTF-16, a common Chinese character takes 2 bytes; an English character takes 1 byte in ASCII or UTF-8. Remember: ASCII cannot represent Chinese characters.
- UTF-8 is a variable-length encoding of Unicode optimized for space: English characters are stored exactly as in ASCII (1 byte each), and common Chinese characters take 3 bytes each.
- Unicode covers the characters of every country's encodings, so converting between two different character encodings has to go through Unicode.
- The default source file encoding in Python 3 is UTF-8.
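The byte counts in the notes above can be checked directly. This is a minimal Python 3 sketch; the sample characters and the `sizes` dictionary are chosen only for illustration, and the `-le` codec variants are used so that no byte order mark is added to the count.

```python
# Compare how many bytes one English and one Chinese character
# occupy in the common Unicode encoding forms (Python 3).
samples = {"english": "a", "chinese": "你"}
encodings = ["utf-8", "utf-16-le", "utf-32-le"]

sizes = {}
for name, ch in samples.items():
    for enc in encodings:
        encoded = ch.encode(enc)          # str -> bytes
        assert encoded.decode(enc) == ch  # bytes -> str round trip
        sizes[(name, enc)] = len(encoded)
        print(name, enc, len(encoded))
```

The English letter grows from 1 byte in UTF-8 to 2 and 4 bytes in UTF-16 and UTF-32, while the Chinese character takes 3, 2, and 4 bytes respectively.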
2. The encoding and transcoding process in Python 2
Note: Because Unicode is the intermediate encoding, converting between any two character encodings means first decoding into Unicode and then encoding into the target character encoding.
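The two-step rule can be sketched as a small helper. The function name `transcode` is hypothetical, and the sketch is written in Python 3 syntax; the same decode-then-encode sequence is what the Python 2 examples in this article perform.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Convert bytes from encoding `src` to encoding `dst` via Unicode."""
    text = data.decode(src)   # step 1: source bytes -> Unicode text
    return text.encode(dst)   # step 2: Unicode text -> target bytes


utf8_bytes = "你好".encode("utf-8")
gbk_bytes = transcode(utf8_bytes, "utf-8", "gbk")
print(gbk_bytes)
# Going back through Unicode recovers the original UTF-8 bytes.
print(transcode(gbk_bytes, "gbk", "utf-8") == utf8_bytes)
```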
3. Python 2 character encoding conversion. The code is as follows:
    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    # __auther__ == zhangqigao

    s = "其高最帅"
    # Decode the UTF-8 string into Unicode
    s_to_unicode = s.decode("utf-8")
    print("--------s_to_unicode-----")
    print(s_to_unicode)
    # Then encode the Unicode string into GBK
    s_to_gbk = s_to_unicode.encode("gbk")
    print("-----s_to_gbk------")
    print(s_to_gbk)
    # Decode GBK into Unicode, then encode into UTF-8
    gbk_to_utf8 = s_to_gbk.decode("gbk").encode("utf-8")
    print("------gbk_to_utf8-----")
    print(gbk_to_utf8)

Output:

    --------s_to_unicode-----
    其高最帅
    -----s_to_gbk------
    ??????
    ------gbk_to_utf8-----
    其高最帅
Note: The case above applies to strings that are not already Unicode. But what if the string is already Unicode? Read on.
4. When the string is already Unicode, the code is as follows:
    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    # __auther__ == zhangqigao

    # The u prefix means the string is already Unicode
    s = u'你好'
    # Already Unicode, so encode directly into GBK
    s_to_gbk = s.encode("gbk")
    print("----s_to_gbk----")
    print(s_to_gbk)
    # Decode GBK back into Unicode, then encode into UTF-8
    gbk_to_utf8 = s_to_gbk.decode("gbk").encode("utf-8")
    print("-----gbk_to_utf8---")
    print(gbk_to_utf8)

Output:

    ----s_to_gbk----
    ???
    -----gbk_to_utf8---
    你好
Note: In Python 2, the encoding declaration at the top of the file tells the interpreter that the file uses UTF-8. Because UTF-8 covers Chinese characters, Chinese text in the file can then be printed. If no encoding is declared, the default system encoding is used; if that is ASCII, an error is raised, because ASCII cannot represent Chinese characters.
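The underlying limitation is that ASCII has no codes for Chinese characters. The following Python 3 sketch makes that visible by encoding explicitly to ASCII, which here stands in for an ASCII default encoding (an assumption made for illustration; the variable names are also illustrative).

```python
text = "你好"
error = None

try:
    text.encode("ascii")           # ASCII only covers code points 0-127
except UnicodeEncodeError as exc:
    error = exc
    print("cannot encode:", exc.reason)

# An encoding that covers Chinese characters works without error.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)
```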
5. Python 3 character encoding conversion
As mentioned in the notes, strings in Python 3 are Unicode by default, so converting a string to another character encoding needs no decode step; you can encode directly. The code is as follows:
    #!/usr/bin/env python
    # __auther__ == zhangqigao

    # No encoding declaration is needed, although adding one causes no error
    s = '你好'
    # The string s is already Unicode: no decode needed, encode directly
    s_to_gbk = s.encode("gbk")
    print("----s_to_gbk----")
    print(s_to_gbk)
    # As before, GBK must be decoded into Unicode first, then encoded into UTF-8
    gbk_to_utf8 = s_to_gbk.decode("gbk").encode("utf-8")
    print("-----gbk_to_utf8---")
    print(gbk_to_utf8)
    # Decode back into a Unicode string
    utf8_decode = gbk_to_utf8.decode("utf-8")
    print("-------utf8_decode----")
    print(utf8_decode)

Output:

    ----s_to_gbk----
    b'\xc4\xe3\xba\xc3'
    -----gbk_to_utf8---
    b'\xe4\xbd\xa0\xe5\xa5\xbd'
    -------utf8_decode----
    你好
Note: In Python 3, encode converts a string into the bytes type and decode converts bytes back into a string, which is why the encoded results above are printed as bytes. Also note that an encoding declaration at the top of a Python 3 file only describes the encoding of the file itself; the strings in the file are still Unicode.
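One way to see this without creating a separate file is to compile source bytes that carry their own encoding declaration. This is only a sketch: the in-memory `source`, the `<gbk_demo>` file name, and the `namespace` dictionary are illustrative stand-ins for a real file saved as GBK.

```python
# Source code for a tiny module, stored as GBK bytes with a matching declaration.
source = '# -*- coding: gbk -*-\ns = "你好"\n'.encode("gbk")

namespace = {}
exec(compile(source, "<gbk_demo>", "exec"), namespace)

s = namespace["s"]
print(type(s))           # the string is a Unicode str, not GBK bytes
print(len(s))            # counted in characters, not in bytes
print(s.encode("gbk"))   # GBK bytes only appear after an explicit encode
```

Even though the source is GBK, the resulting string has 2 characters and compares equal to the same text typed in a UTF-8 file.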
Summary:
- Unicode can represent the characters of every character encoding.
- In Python 2, conversions between character encodings go through Unicode. You can print either the Unicode string or the string in the encoding declared at the top of the file, because Python 2 makes no clear distinction between characters and bytes. That is why the printed results are so mixed up.
- In Python 3, only Unicode strings are treated as characters. Encoding a string produces a bytes object in the target encoding, that is, binary data; to be read as text again it must be decoded back into Unicode.
- In Python 3, if an encoding is already declared at the top of the file, why are the strings in the file still Unicode? Because text in any specific encoding is binary data of type bytes in Python 3 and is not treated as characters; only Unicode is. Points 3 and 4 both follow from the clear distinction between characters and bytes in Python 3.
- If this is still unclear, I refer you to someone else's article, which elaborates on how Python 2 and Python 3 distinguish characters from bytes: click here
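The distinction between characters and bytes in Python 3 summarized above can be observed in a few lines. This is a minimal sketch with illustrative variable names.

```python
text = "你好"                  # str: a sequence of Unicode characters
data = text.encode("utf-8")    # bytes: a sequence of integers 0-255

print(len(text), len(data))    # character count vs byte count
print(text[0], data[0])        # a character vs an integer byte value

mixed_error = None
try:
    text + data                # mixing str and bytes is rejected
except TypeError as exc:
    mixed_error = type(exc).__name__
    print("TypeError:", exc)

# Decoding is the only way back from bytes to characters.
print(data.decode("utf-8") == text)
```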
Python base "day03": Character-to-encode operation