1. What are the default encodings in Python 2 and Python 3?
The default encoding in Python 2 is ASCII; in Python 3 it is UTF-8.
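As a quick check, here is a minimal sketch (assuming it is run under Python 3) that asks the interpreter for its default string encoding. Under Python 2 the same call reports 'ascii'.

```python
import sys

# Encoding used by str.encode() / bytes.decode() when no codec is given.
default_encoding = sys.getdefaultencoding()
print(default_encoding)
```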
2. Why do Chinese characters appear garbled? What kinds of garbling are there?
Garbled text appears whenever the encoding used to write the text differs from the one used to decode it.
For example: (1) The text is GBK-encoded but is decoded as UTF-8, so it comes out garbled.
(2) The text is encoded as ASCII, which does not support Chinese; adding Chinese characters to it produces garbled output or an encoding error.
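The two cases above can be reproduced with a short sketch. The sample string is an assumption chosen for illustration; any Chinese text behaves the same way.

```python
text = "中文"  # sample Chinese text ("Chinese"), chosen only for illustration

# Case 1: bytes written as GBK but read back as UTF-8.
gbk_bytes = text.encode("gbk")
try:
    wrong = gbk_bytes.decode("utf-8")
except UnicodeDecodeError:
    # Strict decoding rejects the invalid UTF-8; errors="replace" shows the garbling.
    wrong = gbk_bytes.decode("utf-8", errors="replace")
print(repr(wrong))

# Case 2: ASCII has no representation for Chinese characters.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```

Whether a mismatch raises an error or silently yields wrong characters depends on the codec pair and on the `errors` argument.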
Workaround: detect the encoding with chardet:
import chardet
str_type = chardet.detect(b"xxxxx")  # in Python 3, detect() expects bytes
code = str_type['encoding']
code is the detected encoding of the data. Some people report that chardet's detection is inaccurate and slow. In my own tests the speed is indeed only average, but I have not yet seen an inaccurate result. Feel free to use it; I am only offering one approach, and if anyone knows a better way, please let me know.
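One possible alternative, when the set of likely encodings is small and known in advance, is to try each candidate in turn using only the standard library. This is a different technique from chardet's statistical detection, and `guess_encoding` is a hypothetical helper written for this sketch, not part of any library.

```python
def guess_encoding(data, candidates=("utf-8", "gbk")):
    """Return the first candidate codec that decodes data without error, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None

utf8_guess = guess_encoding("我是中国人".encode("utf-8"))
gbk_guess = guess_encoding("我是中国人".encode("gbk"))
print(utf8_guess, gbk_guess)
```

The order of candidates matters: UTF-8 is strict and rejects most non-UTF-8 input, so it goes first, while a permissive codec may accept bytes it was never meant for and return wrong characters.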
3. How do I convert between encodings?
Use encode (encoding) and decode (decoding).
decode converts from a binary (bytes) encoding to Unicode (str). It must use the same codec that was used to encode the data in the first place, otherwise the result is garbled.
encode converts from Unicode (str) to a binary (bytes) encoding such as GBK or UTF-8.
# -*- coding: utf-8 -*-

text = "Hello"  # Python 3 strings are Unicode; the default encoding is UTF-8
print('Unicode:', type(text), text)
data = bytes(text, encoding='utf-8')  # encode first, converting to the bytes (binary) type
print(type(data), data)
text = data.decode("utf-8")  # decode again; using GBK here would garble or fail on non-ASCII text
print('encode to bytes with utf-8 and decode back to Unicode:', type(text), text)
data = text.encode("GBK")
text = data.decode('GBK')
print('encode to bytes with GBK, and then decode:', type(text), text)
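4. What does the line # -*- coding: utf-8 -*- do?
It declares that the source file itself is encoded in UTF-8, so the interpreter decodes the file accordingly. Python 2 assumes ASCII without it, so the declaration is needed there for non-ASCII source text; Python 3 already defaults to UTF-8.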
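To see how the declaration is interpreted, the standard library's `tokenize.detect_encoding` reads the coding line the same way the interpreter does. The sketch below is illustrative only; `declared_encoding` is a hypothetical wrapper name.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the encoding Python would use to read this source code."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

with_cookie = declared_encoding(b"# -*- coding: gbk -*-\nx = 1\n")
without_cookie = declared_encoding(b"x = 1\n")
print(with_cookie, without_cookie)
```

With no declaration, Python 3 falls back to UTF-8, which is why the line is optional there for UTF-8 files.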
5. Explain the difference between bytes in Python 2 and bytes in Python 3
(1) In Python 3 all strings are Unicode (str); to convert one to bytes you must specify an encoding, for example:
str -> bytes: bytes(s, encoding='utf8')    bytes -> str: b.decode('utf-8')
Python 2.x does not distinguish bytes from str: bytes is just another name for str, so every str operation works on it. In Python 3, bytes and str are separate types.
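The separation in Python 3 can be seen in a short sketch (variable names are illustrative): mixing the two types fails, and an explicit encode or decode is needed to cross between them.

```python
s = "abc"               # str: Unicode text
b = s.encode("utf-8")   # bytes: raw byte values

# Python 3 refuses to concatenate text and bytes implicitly.
try:
    s + b
    mixed_ok = True
except TypeError:
    mixed_ok = False

round_trip = b.decode("utf-8")  # explicit conversion back to str
print(type(s), type(b), mixed_ok, round_trip)
```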
In Python 2:
>>> s = "abcdefg"
>>> b = s.encode()  # or use the form below
>>> b = b"abcdefg"
>>> type(b)
<type 'str'>
In Python 3, str and bytes are strictly distinguished:
>>> s = "abcdefg"
>>> type(s)
<class 'str'>
>>> b = b"abcdefg"
>>> type(b)
<class 'bytes'>
str is a sequence of text characters; bytes is a sequence of bytes.
Text must be encoded (UTF-8, GBK, GB2312, etc.) before it can be stored or transmitted.
Bytes are raw values and carry no encoding of their own.
A text encoding defines how characters are represented as bytes; UTF-8 is the usual default on Linux.
(2) Converting between bytes and str (encoding and decoding)
bytes is obtained from str with the encode method, and str is obtained from bytes with the decode method.
A bytes literal can also be written with the b prefix.
GBK uses two bytes per Chinese character (and one byte for ASCII). UTF-8 is variable-length, using 1, 2, 3, or 4 bytes per character (the original design allowed up to 6, but the current standard stops at 4); most Chinese characters take 3 bytes.
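The byte counts above can be checked directly. The sample characters below are assumptions picked to hit each UTF-8 length.

```python
# One sample character for each UTF-8 length: ASCII, Latin, Chinese, emoji.
samples = ["A", "é", "中", "😀"]

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
gbk_lengths = [len(ch.encode("gbk")) for ch in ["A", "中"]]

for ch, n in zip(samples, utf8_lengths):
    print(repr(ch), "takes", n, "byte(s) in UTF-8")
print("GBK lengths for 'A' and '中':", gbk_lengths)
```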
>>> s = "我是中国人"  # "I am Chinese"
>>> s
'我是中国人'
>>> b = s.encode()  # encode to bytes
>>> b
b'\xe6\x88\x91\xe6\x98\xaf\xe4\xb8\xad\xe5\x9b\xbd\xe4\xba\xba'
>>> b.decode()  # decode back to str
'我是中国人'
Whatever codec a str is encoded with, the resulting bytes must be decoded with that same codec.
>>> s = "我是中国人"  # "I am Chinese"
>>> s
'我是中国人'
>>> b = s.encode('GBK')
>>> b
b'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'
>>> b.decode('GBK')
'我是中国人'
(3) Operations on bytes
bytes supports most of the methods of str; a bytes object can be produced by str.encode or written as a literal with the b prefix.
>>> b = b'abc'
>>> b
b'abc'
>>> b.decode()
'abc'
>>> len('我是中国人'.encode())  # length of the bytes
15
>>> b
b'abc'
>>> b.hex()  # convert to hexadecimal
'616263'
>>> bin(int(b.hex(), 16))  # convert to binary; the hex string must be parsed as base 16 first
'0b11000010110001001100011'
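A few more bytes operations, sketched for illustration: indexing yields integers, `bytes.fromhex` reverses `hex()`, and each byte can be formatted as 8 binary digits.

```python
b = b"abc"

first_byte = b[0]                    # indexing bytes yields an int, not a 1-char string
hex_text = b.hex()                   # hexadecimal text, two digits per byte
restored = bytes.fromhex(hex_text)   # parse the hex text back into bytes
bits = [format(byte, "08b") for byte in b]  # each byte as 8 binary digits

print(first_byte, hex_text, restored, bits)
```

Formatting byte by byte keeps the leading zeros that `bin()` on one large integer would drop.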
python3.x-Text Encoding issues