I. Introduction to Python encoding
1) Introduction to encoding formats
When the Python interpreter loads the code in a .py file, it must decode the file's contents, and Python 2.x assumes ASCII by default. ASCII (American Standard Code for Information Interchange) is a character encoding based on the Latin alphabet, used mainly for modern English text. Each ASCII character fits in one byte (8 bits), and a byte can hold at most 2**8 = 256 values. Standard ASCII uses only 7 of those bits and defines 128 characters; 8-bit extensions of ASCII use the remaining 128 values for Western European letters. Clearly, ASCII cannot represent all the characters and symbols in the world, so a new standard was needed that can represent all of them, namely Unicode.
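The limit of ASCII can be checked directly in Python 3. The following is a minimal sketch; the helper name `is_ascii` is made up for illustration, and it relies on the fact that standard ASCII covers code points 0 to 127.

```python
def is_ascii(text):
    """Return True if every character has a code point below 128."""
    return all(ord(ch) < 128 for ch in text)


print(is_ascii("Hello"))  # True
print(is_ascii("明"))      # False

# Encoding a non-ASCII character as ASCII raises UnicodeEncodeError.
error_name = None
try:
    "明".encode("ascii")
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__
print(error_name)  # UnicodeEncodeError
```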
Unicode (also translated as "uniform code" or "universal code") is a character standard used on computers. It was created to overcome the limitations of traditional character encoding schemes: it assigns a single, unique number (a code point) to every character in every language. Early Unicode encodings stored each character in 16 bits (2 bytes), which gives 2**16 = 65536 possible values. Note that 2 bytes is only the minimum in such encodings; characters outside that range need more.
UTF-8 is a compact, variable-length encoding of Unicode. Instead of using a minimum of 2 bytes per character, it groups characters by range: ASCII characters are stored in 1 byte, most European characters in 2 bytes, and most East Asian characters in 3 bytes (rarer characters, such as emoji, take 4 bytes). In Python 3.x, the interpreter decodes the code in a .py file as UTF-8 by default.
    a = "明"
    # List the methods available on a
    print(dir(a))
    # Show the data type of a
    print(type(a))
    # Show the code point of a
    print(ord(a))  # 26126
    # Convert a code point back to the corresponding character
    print(chr(26126))  # 明
    A = "明天"
    # Length in characters
    print(len(A))  # 2
    # Length in bytes after UTF-8 encoding
    print(len(A.encode("utf-8")))  # 6
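The variable-length grouping that UTF-8 applies can be observed by encoding one character from each range. This is an illustrative sketch only; the sample characters are chosen here and are not from the original text.

```python
# One sample character per UTF-8 width (samples chosen for illustration).
samples = {
    "A": "ASCII letter",
    "\u00e9": "accented Latin letter",
    "明": "Chinese character",
    "\U0001F600": "emoji outside the 16-bit range",
}

widths = {}
for ch, description in samples.items():
    # The length of the encoded bytes object is the number of bytes used.
    widths[ch] = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} {description}: {widths[ch]} byte(s)")
```

Each character counts as length 1 in a `str`, yet the encoded size grows from 1 to 4 bytes as the code point gets larger.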
In computer memory, text is handled uniformly as Unicode; it is converted to UTF-8 when it needs to be saved to disk or transmitted. When you edit a file in Notepad, for example, the UTF-8 bytes read from the file are converted to Unicode characters in memory, and when you finish editing, the Unicode characters are converted back to UTF-8 and saved to the file.
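The same round trip between in-memory text and UTF-8 on disk can be sketched in Python. The file name `demo.txt` and the sample text are placeholders chosen for this example; the file is created in a temporary directory so nothing is left behind.

```python
import os
import tempfile

text = "小明 says hello"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")  # hypothetical file name

    # Saving: the in-memory str is encoded to UTF-8 bytes on disk.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Binary mode shows the raw bytes that were actually stored.
    with open(path, "rb") as f:
        raw = f.read()

    # Loading: text mode decodes the UTF-8 bytes back into a str.
    with open(path, "r", encoding="utf-8") as f:
        loaded = f.read()

print(len(text), len(raw))  # 13 17
print(loaded == text)       # True
```

Passing `encoding="utf-8"` explicitly avoids depending on the platform's default encoding.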
When you browse a web page, the server converts dynamically generated Unicode content to UTF-8 before sending it to the browser. This is why the source of many web pages contains a tag such as <meta charset="UTF-8" />, indicating that the page uses UTF-8 encoding.
2) bytes
Python's string type is str, which is represented in memory as Unicode, so one character may correspond to several bytes. To send a string over the network or save it to disk, you need to convert the str to bytes with the encode() method; in a bytes object, each element occupies exactly one byte. Conversely, data read from the network or from disk as a byte stream is of type bytes. To turn bytes back into str, use the decode() method:
    # Encoding
    a = "ABC"
    print(a.encode("utf-8"))  # b'ABC'
    b = "小明"
    # This would raise an error: Chinese characters are outside the ASCII range
    # print(b.encode("ascii"))
    print(b.encode("utf-8"))  # b'\xe5\xb0\x8f\xe6\x98\x8e'
    # Decoding
    print(b"\xe5\xb0\x8f\xe6\x98\x8e".decode("utf-8"))  # 小明
    print(b"ABC".decode("ascii"))  # ABC
A Chinese character typically takes 3 bytes when encoded as UTF-8, while an English letter takes only 1 byte. When working with strings, you will often need to convert between str and bytes. To avoid garbled text, always use UTF-8 for these conversions.
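Garbled text typically appears when bytes are decoded with a different encoding from the one used to produce them. The sketch below uses Latin-1 as the mismatched codec purely for illustration, because Latin-1 maps every byte value to some character and therefore never raises an error.

```python
original = "小明"
data = original.encode("utf-8")  # 6 bytes

# Decoding with the wrong codec silently produces garbled text:
# each of the 6 bytes becomes its own (meaningless) character.
garbled = data.decode("latin-1")
print(len(garbled))         # 6
print(garbled == original)  # False

# Using the same encoding in both directions round-trips cleanly.
restored = data.decode("utf-8")
print(restored == original)  # True

# The garbling is reversible here only because Latin-1 loses no bytes.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)  # True
```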
Content encoding in Python