How Python 3 solves tricky character encoding issues

Source: Internet
Author: User
One of the most important improvements in Python 3 is that it fixes the big pitfall left by string and character encoding in Python 2. An earlier article on why Python encoding hurts so much introduced some of the flaws in the Python 2 string design:
- ASCII is used as the default encoding, which is unfriendly to Chinese text processing.
- Strings are split into two types, unicode and str, which misleads developers.

Of course, these are not bugs; with enough care during processing you can avoid the pitfalls. Python 3, however, solves both problems well.

First, Python3 sets the system default encoding to UTF-8

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'

Second, text and binary data are distinguished more clearly, represented by str and bytes respectively. All text is represented by the str type, which can hold any character in the Unicode character set, while binary byte data is represented by a new data type, bytes.

str

>>> a = "a"
>>> a
'a'
>>> type(a)
<class 'str'>
>>> b = "禅"
>>> b
'禅'
>>> type(b)
<class 'str'>

bytes

In Python 3, prefixing the quotation marks of a literal with b marks it as an object of type bytes, which is in fact a sequence of binary bytes. A bytes literal may contain characters in the ASCII range and other byte values written as hexadecimal escapes; it cannot contain non-ASCII characters such as Chinese.

>>> c = b'a'
>>> c
b'a'
>>> type(c)
<class 'bytes'>
>>> d = b'\xe7\xa6\x85'
>>> d
b'\xe7\xa6\x85'
>>> type(d)
<class 'bytes'>
>>> e = b'禅'
  File "<stdin>", line 1
SyntaxError: bytes can only contain ASCII literal characters.
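Since a bytes literal cannot contain a Chinese character directly, it may help to see a few equivalent ways of building the same three bytes. This is a sketch with illustrative variable names; it assumes the bytes are the UTF-8 encoding of 禅, which matches the escape sequence used above.

```python
# Hex escapes inside a bytes literal.
from_literal = b"\xe7\xa6\x85"

# bytes.fromhex() parses a string of hexadecimal digits.
from_hex = bytes.fromhex("e7a685")

# bytes() accepts an iterable of integers in range(256).
from_ints = bytes([0xE7, 0xA6, 0x85])

# Encoding the character as UTF-8 yields the same three bytes.
from_text = "禅".encode("utf-8")

print(from_literal == from_hex == from_ints == from_text)  # True
```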

The bytes type provides operations similar to those of str, including slicing, indexing, concatenation, and repetition. However, str and bytes values cannot be combined with the + operator, although this worked in Python 2.

>>> b"a" + b"c"
b'ac'
>>> b"a" * 2
b'aa'
>>> b"abcdef\xd6"[1:]
b'bcdef\xd6'
>>> b"abcdef\xd6"[-1]
214
>>> b"a" + "b"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str
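Two details of the session above are easy to miss: indexing a bytes object returns an int while slicing returns bytes, and mixing str with bytes has to be made explicit. A minimal sketch, with illustrative variable names, of one way to handle both:

```python
data = b"abcdef\xd6"

# Indexing returns an int in range(256), not a one-byte bytes object.
last = data[-1]

# Slicing returns bytes, even for a single element.
tail = data[-1:]

# To combine str and bytes, convert one side explicitly first.
joined = b"a" + "b".encode("ascii")

# The implicit mix is rejected with a TypeError.
try:
    b"a" + "b"
except TypeError as exc:
    mixed_error = type(exc).__name__

print(last, tail, joined, mixed_error)
```

The exact wording of the TypeError message differs between Python 3 releases, so the sketch only records the exception type.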

encode and decode

Conversion between str and bytes is done with the encode and decode methods.

encode converts characters to bytes. UTF-8 encoding is used by default.

>>> s = "Zen of Python" >>> S.encode () b ' python\xe4\xb9\x8b\xe7\xa6\x85 ' >>> s.encode ("GBK") B ' Python\xd6\xae\xec\xf8 '

decode converts bytes back to characters, and likewise uses UTF-8 by default.

>>> b ' python\xe4\xb9\x8b\xe7\xa6\x85 '. Decode () ' The Zen of Python ' >>> B ' python\xd6\xae\xec\xf8 '. Decode (" GBK ") ' Zen of Python '
