Python 3.x: Text Encoding Problems

Last Update:2018-05-24 Source: Internet

Author: User


1. What is the default encoding in Python 2 and in Python 3?

The default encoding in Python 2 is ASCII; in Python 3 it is UTF-8.
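A quick way to check this on your own interpreter is sketched below. It assumes Python 3, where `sys.getdefaultencoding()` reports the codec used when `str.encode()` and `bytes.decode()` are called without an argument; the variable names are only illustrative.

```python
import sys

# Codec used by str.encode() / bytes.decode() when no argument is given.
default_codec = sys.getdefaultencoding()
print(default_codec)

# Encoding with no argument should match an explicit UTF-8 encode on Python 3.
text = "中文"
matches_utf8 = text.encode() == text.encode("utf-8")
print(matches_utf8)
```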

2. Why do Chinese characters come out garbled? What kinds of garbling are there?

Garbled text appears whenever the codec used to decode the data differs from the one used to encode it.

For example: (1) The text is encoded as GBK but decoded as UTF-8, so it comes out garbled.

(2) The text is encoded as ASCII, which does not support Chinese; any Chinese characters added to it come out garbled.
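Both cases can be reproduced in a few lines. This is a minimal sketch assuming Python 3; the sample string and variable names are illustrative only.

```python
text = "我是中国人"  # sample text meaning "I am Chinese"

# Case 1: bytes written as GBK but read back as UTF-8.
gbk_bytes = text.encode("gbk")
try:
    gbk_bytes.decode("utf-8")
    case1 = "decoded"
except UnicodeDecodeError:
    case1 = "UnicodeDecodeError"

# With errors="replace", undecodable bytes become U+FFFD,
# which is the typical "garbled" look on screen.
garbled = gbk_bytes.decode("utf-8", errors="replace")

# Case 2: ASCII cannot represent Chinese characters at all.
try:
    text.encode("ascii")
    case2 = "encoded"
except UnicodeEncodeError:
    case2 = "UnicodeEncodeError"

print(case1, case2, garbled != text)
```

In strict mode (the default) a mismatch usually raises an exception; visibly garbled output tends to appear when errors are replaced or ignored, or when the wrong codec happens to accept the bytes.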

Workaround: detect the encoding with chardet.

import chardet
str_type = chardet.detect(b"xxxxx")  # pass bytes, not str
code = str_type['encoding']

The returned 'encoding' value is the detected encoding of the data. Some people report that chardet's detection is inaccurate and slow. In my own tests the speed was indeed only average, but I have not yet run into an inaccurate result. Feel free to use it; I am only offering one approach, and if anyone knows a better way, please let me know.
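If installing a third-party package is not an option, a cruder standard-library fallback is to try a short list of candidate codecs in order. The sketch below is not how chardet works; `guess_encoding` and its default candidate list are hypothetical names chosen for illustration.

```python
def guess_encoding(data, candidates=("utf-8", "gbk")):
    """Return the first candidate codec that decodes data cleanly, else None.

    Hypothetical helper for illustration; it is not part of chardet.
    Order matters: strict codecs such as UTF-8 should be tried first.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


utf8_guess = guess_encoding("我是中国人".encode("utf-8"))
gbk_guess = guess_encoding("我是中国人".encode("gbk"))
print(utf8_guess, gbk_guess)
```

A successful decode only shows that the bytes are valid in that codec, not that the codec is the right one, so treat the result as a guess.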

3. How do I convert between encodings?

Use encode (encoding) and decode (decoding).

decode converts bytes in some encoding into a Unicode string. The codec passed to decode must be the one the data was originally encoded with; otherwise the result is garbled.

encode converts a Unicode string into bytes in a given encoding (GBK, UTF-8, etc.).

# -*- coding: utf-8 -*-

s = "Hello"  # Python 3 strings are Unicode; the default encoding is UTF-8
print('Unicode:', type(s), s)
s = bytes(s, encoding='utf-8')  # encode first, converting to the bytes type
print(type(s), s)
s = s.decode("utf-8")  # decode again; decoding non-ASCII text as GBK here would garble it or raise an error
print('encode to bytes with utf-8 and decode back to Unicode:', type(s), s)
s = s.encode("GBK")
s = s.decode('GBK')
print('encode to bytes with GBK, and then decode:', type(s), s)


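4. What does the line # -*- coding: utf-8 -*- do?

It declares that the source file itself is encoded in UTF-8, so the interpreter reads the file with that codec. Without it, Python 2 assumes ASCII source; Python 3 already assumes UTF-8, so the line is optional there.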
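The effect of the declaration can be observed with the standard library's `tokenize.detect_encoding`, which applies the same cookie rules the interpreter uses when reading a source file. This is a minimal sketch assuming Python 3; `source_encoding` is a hypothetical helper name.

```python
import io
import tokenize


def source_encoding(source_bytes):
    """Return the encoding Python would use to read this source (hypothetical helper)."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# A file that declares GBK on its first line.
with_cookie = source_encoding(b"# -*- coding: gbk -*-\nprint('hi')\n")
# A file with no declaration falls back to the Python 3 default.
without_cookie = source_encoding(b"print('hi')\n")
print(with_cookie, without_cookie)
```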

5. Explain the difference between bytes in Python 2 and bytes in Python 3

(1) In Python 3 all strings are Unicode (str). To convert one to bytes you must encode it and name the encoding, for example:

str → bytes: bytes(s, encoding='utf8')
bytes → str: b.decode('utf-8')

Python 2.x does not distinguish bytes from str: bytes is just another name for str, so every str operation works on it. In Python 3, bytes and str are separate types.

In Python 2:

>>> s = "ABCDEFG"
>>> b = s.encode()  # or use the form below

>>> b = b"ABCDEFG"
>>> type(b)
<type 'str'>

In Python 3, str and bytes are strictly distinct:

>>> s = "ABCDEFG"
>>> type(s)
<class 'str'>
>>> b = b"ABCDEFG"
>>> type(b)
<class 'bytes'>
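What "separate types" means in practice on Python 3 can be seen in a short sketch; the variable names are illustrative.

```python
s = "abcdefg"
b = b"abcdefg"

# The two types never compare equal, even with identical ASCII content.
same = (s == b)

# Mixing them in one operation raises TypeError instead of converting silently.
try:
    s + b
    mixed = "ok"
except TypeError:
    mixed = "TypeError"

# An explicit encode/decode is the only bridge between the two types.
round_trip = b.decode("ascii") == s and s.encode("ascii") == b
print(same, mixed, round_trip)
```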

str is a sequence of text characters; bytes is a sequence of bytes.

Text has to be encoded (UTF-8, GBK, GB2312, etc.) before it can be stored or transmitted.

Bytes are raw data and carry no encoding of their own.

A text encoding defines how characters are represented as bytes. On Linux the default is UTF-8.

(2) Converting between bytes and str: encoding

bytes is produced from str with the encode method, and str is produced from bytes with the decode method.

bytes literals can be written with the b prefix.

GBK uses two bytes per Chinese character (and one byte for ASCII). UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character (the original design allowed up to 6, but the current standard caps it at 4); most Chinese characters take 3 bytes.
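The byte widths mentioned above can be measured directly. This is a small sketch assuming Python 3; the sample characters are arbitrary picks, one for each UTF-8 width.

```python
samples = ["a", "é", "中", "😀"]

# Encoded length of each character in UTF-8 (1, 2, 3 and 4 bytes).
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

# GBK cannot encode every character (emoji, for example),
# so only an ASCII letter and a Chinese character are measured here.
gbk_lengths = {ch: len(ch.encode("gbk")) for ch in ["a", "中"]}

print(utf8_lengths)
print(gbk_lengths)
```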

>>> s = "我是中国人"  # "I am Chinese"
>>> s
'我是中国人'
>>> b = s.encode()  # encode to bytes
>>> b
b'\xe6\x88\x91\xe6\x98\xaf\xe4\xb8\xad\xe5\x9b\xbd\xe4\xba\xba'
>>> b.decode()  # decode back to a string
'我是中国人'
>>>

Whatever encoding a str was encoded with, the same encoding must be used to decode it.

>>> s = "我是中国人"  # "I am Chinese"
>>> s
'我是中国人'
>>> b = s.encode('GBK')
>>> b
b'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'
>>> b.decode('GBK')
'我是中国人'
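Converting data from one encoding to another is therefore a decode with the source codec followed by an encode with the target codec. A minimal sketch, assuming Python 3; `transcode` is a hypothetical helper name.

```python
def transcode(data, source, target):
    """Re-encode bytes from one codec to another (hypothetical helper)."""
    return data.decode(source).encode(target)


gbk_bytes = "我是中国人".encode("gbk")
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")

# Five Chinese characters: 2 bytes each in GBK, 3 bytes each in UTF-8.
print(len(gbk_bytes), len(utf8_bytes))
```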

(3) Operations on bytes

bytes supports most of the same operations as str. A bytes object can be obtained by calling encode on a str, or written directly with the b prefix.

>>> b = b'abc'
>>> b
b'abc'
>>> b.decode()
'abc'

>>> len('我是中国人'.encode())  # length of the bytes
15
>>> b
b'abc'
>>> b.hex()  # convert to hexadecimal
'616263'

>>> bin(int('616263', 16))  # convert the hexadecimal value to binary
'0b11000010110001001100011'
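A few more bytes operations are worth knowing, gathered here as a short sketch assuming Python 3. Note that indexing a bytes object yields an int, unlike indexing a str; the variable names are illustrative.

```python
b = b"abc"

first = b[0]                                  # indexing yields an int (the byte value)
head = b[:1]                                  # slicing yields bytes
hex_text = b.hex()                            # bytes -> hexadecimal string
restored = bytes.fromhex(hex_text)            # hexadecimal string -> bytes
as_int = int.from_bytes(b, byteorder="big")   # bytes -> integer
bits = bin(as_int)                            # integer -> binary string
upper = b.upper()                             # str-like methods work on ASCII bytes

print(first, head, hex_text, restored, bits, upper)
```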


This article is an English version of an article originally published in Chinese on aliyun.com.
