ASCII encoding:
A computer can only process numbers, so text must be converted to numbers before it can be processed. The earliest computers were designed with 8 bits making up one byte, so the largest integer a single byte can represent is 255 (binary 11111111), which means one byte can distinguish only 2**8 = 256 different characters.
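The arithmetic behind the one-byte limit can be checked directly. This is only a small sketch, and the variable names are chosen here for illustration:

```python
# One byte has 8 bits, so it can hold 2**8 distinct values: 0 through 255.
bits_per_byte = 8
distinct_values = 2 ** bits_per_byte   # 256
max_value = distinct_values - 1        # 255

print(distinct_values, max_value, bin(max_value))  # 256 255 0b11111111
```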
Because the computer was invented by Americans, at first only 128 characters (codes 0 through 127) were encoded: upper- and lower-case English letters, digits, and some symbols. This code table is called ASCII encoding. For example, the code for capital A is 65, and lower-case z is encoded as 122.
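The ASCII codes mentioned above can be looked up with Python's built-in ord() and chr() functions. A minimal sketch (the sample string is arbitrary):

```python
# ord() maps a character to its code; chr() maps a code back to a character.
code_upper_a = ord('A')    # 65
code_lower_z = ord('z')    # 122
char_from_code = chr(65)   # 'A'

# Every ASCII character has a code in the range 0..127.
all_ascii = all(ord(ch) < 128 for ch in 'Hello, World! 123')

print(code_upper_a, code_lower_z, char_from_code, all_ascii)  # 65 122 A True
```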
One byte is clearly not enough for Chinese; at least two bytes are needed, and the encoding must not conflict with ASCII. China therefore developed the GB2312 encoding to cover Chinese characters. There are hundreds of languages in the world, and when each country defines its own national standard, conflicts are inevitable. The result is that text mixing several languages shows up garbled.
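To see how conflicting national standards produce garbled text, the sketch below encodes two Chinese characters as GB2312 and then decodes the same bytes with a different single-byte codec. Latin-1 is used here only as a stand-in for "some other standard"; it is an assumption made for the demonstration:

```python
text = '中文'

raw = text.encode('gb2312')        # two bytes per Chinese character
garbled = raw.decode('latin-1')    # wrong codec: each byte becomes a Western European letter
restored = raw.decode('gb2312')    # right codec: the original text comes back

print(len(raw))        # 4
print(repr(garbled))   # 'ÖÐÎÄ'
print(restored)        # 中文
```

The bytes never change; only the table used to interpret them does, which is why the same file can look fine on one system and garbled on another.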
Unicode encoding:
This is why Unicode emerged. Unicode unifies all languages into a single set of codes, which eliminates the garbling problem. In its common fixed-width representations a Unicode character takes at least 2 bytes (16 bits), and rarer characters need more. Modern operating systems and most programming languages support Unicode directly.
UTF-8:
UTF-8 can be understood as a variable-length encoding of Unicode designed to save storage space. Depending on its code point, a Unicode character is encoded into 1 to 4 bytes (the original UTF-8 design allowed up to 6): common English letters take 1 byte, and Chinese characters usually take 3 bytes.
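The variable length is easy to observe by encoding a few characters and counting the resulting bytes. The sample characters below are chosen for illustration and are written as escapes so the example does not depend on how this file is saved:

```python
samples = {
    'latin_letter': 'A',            # U+0041
    'accented_letter': '\u00e9',    # U+00E9, e with acute accent
    'chinese_char': '\u4e2d',       # U+4E2D, the character 中
    'emoji': '\U0001F600',          # U+1F600, outside the 16-bit range
}

# Number of bytes each character occupies once encoded as UTF-8.
utf8_lengths = {name: len(ch.encode('utf-8')) for name, ch in samples.items()}

for name, length in utf8_lengths.items():
    print(name, length)
```

Running it shows 1, 2, 3, and 4 bytes respectively.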
Character encoding conversion:
This is how character encodings are commonly used in a computer system today: Unicode is used uniformly in memory, and the text is converted to UTF-8 when it needs to be saved to the hard disk or transferred.
When editing with Notepad, the UTF-8 characters read from the file are converted to Unicode characters in memory, and when editing is complete, the Unicode is converted back to UTF-8 and saved to the file. When browsing a web page, the server converts the dynamically generated Unicode content to UTF-8 and then transfers it to the browser. That is why the source of many web pages contains information such as <meta charset="UTF-8" />, indicating that the page is UTF-8 encoded.
Unicode => UTF-8: encode (encoding)
UTF-8 => Unicode: decode (decoding)
UTF-8 => Unicode => GBK: converting UTF-8 to GBK has to go through Unicode
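The three conversions listed above can be sketched in Python 3, where str plays the role of Unicode. The helper name utf8_to_gbk is hypothetical and introduced only for this illustration:

```python
def utf8_to_gbk(data: bytes) -> bytes:
    """Hypothetical helper: re-encode UTF-8 bytes as GBK by going through str."""
    text = data.decode('utf-8')   # UTF-8 bytes -> Unicode str (decode)
    return text.encode('gbk')     # Unicode str -> GBK bytes (encode)


utf8_bytes = '中文'.encode('utf-8')
gbk_bytes = utf8_to_gbk(utf8_bytes)

print(len(utf8_bytes), len(gbk_bytes))  # 6 4
```

There is no direct bytes-to-bytes step: the intermediate str is what makes the conversion well defined.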
Because Python was born before the Unicode standard was released, the earliest Python only supported ASCII, and an ordinary string 'ABC' was ASCII-encoded inside Python. Support for Unicode was added later.
In Python 3.x the string type is str, which is represented as Unicode in memory, where one character may correspond to several bytes.
To send a string over the network or save it to disk, you need to turn the str into bytes, a type whose unit is the byte. Python writes bytes data as single or double quotation marks with a b prefix: x = b'ABC'. Although b'ABC' and 'ABC' display the same content, each character of the bytes value occupies exactly one byte.
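The difference between the two types shows up as soon as you index them. A short sketch, with variable names chosen here for illustration:

```python
text = 'ABC'    # str: a sequence of Unicode characters
data = b'ABC'   # bytes: a sequence of integers in the range 0..255

first_char = text[0]       # 'A', a str of length 1
first_byte = data[0]       # 65, an int
byte_values = list(data)   # [65, 66, 67]

print(type(text).__name__, type(data).__name__)  # str bytes
print(first_char, first_byte, byte_values)       # A 65 [65, 66, 67]
print(text == data)                              # False: a str never equals a bytes value
```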
A str, which is represented in Unicode, can be encoded into bytes in a specified encoding (for example ASCII or UTF-8) with the encode() method:
>>> 'ABC'.encode('ascii')
b'ABC'
>>> type(b'ABC')
<class 'bytes'>
>>> '中文'.encode('utf-8')
b'\xe4\xb8\xad\xe6\x96\x87'
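Not every str can be encoded with every encoding. As a hedged illustration, the sketch below tries to encode Chinese text as ASCII, which fails because ASCII has no codes for those characters, and then shows the standard errors='replace' option of encode():

```python
text = '中文'

try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError as exc:
    # ASCII only covers codes 0..127, so Chinese characters cannot be represented.
    ascii_ok = False
    print('cannot encode as ASCII:', exc.reason)

# errors='replace' substitutes '?' for each character that cannot be encoded.
replaced = text.encode('ascii', errors='replace')
print(replaced)  # b'??'
```

Replacing characters loses information, so it is usually better to pick an encoding such as UTF-8 that can represent the text in the first place.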
Conversely, if we read a byte stream from the network or from disk, the data we get is bytes. To turn bytes into str, use the decode() method:
>>> b'ABC'.decode('ascii')
'ABC'
>>> b'\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
'中文'
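The same round trip applies to files. The sketch below writes text to a temporary file as UTF-8, reads it back in binary mode to get bytes, and decodes it by hand; the file name demo.txt is a placeholder chosen for this example:

```python
import os
import tempfile

text = '中文 ABC'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')  # hypothetical file name

    # Text mode with an explicit encoding: str is encoded to UTF-8 on write.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    # Binary mode: what comes back from disk is bytes, not str.
    with open(path, 'rb') as f:
        raw = f.read()

    # Text mode again: open() decodes the bytes for us.
    with open(path, 'r', encoding='utf-8') as f:
        decoded_by_open = f.read()

decoded_manually = raw.decode('utf-8')

print(type(raw).__name__, len(raw))          # bytes 10
print(decoded_manually == decoded_by_open)   # True
```

Passing encoding='utf-8' explicitly avoids depending on the platform default, which may differ between systems.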
The len() function counts the number of characters in a str; if you pass it bytes instead, it counts the number of bytes:
>>> len('ABC')
3
>>> len('中文')
2
>>> len(b'ABC')
3
>>> len('中文'.encode('utf-8'))
6
As can be seen, one Chinese character usually takes 3 bytes after UTF-8 encoding, while one English character takes only 1 byte.
When converting between str and bytes, the encoding has to be specified, and the most commonly used encoding is UTF-8. Python also supports other encodings, for example encoding Unicode into GB2312, but unless there is a special business requirement, stick to UTF-8. To avoid garbled text when working with strings, always use UTF-8 to convert between str and bytes.
Python source code is itself a text file, so when your source code contains Chinese, be sure to save it as UTF-8. So that the Python interpreter reads the source as UTF-8, we usually write these two lines at the beginning of the file:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
The first line tells Linux/OS X that this is an executable Python program; Windows ignores this comment.
The second line tells the Python interpreter to read the source code as UTF-8; otherwise the Chinese output written in the source code may be garbled.
Declaring UTF-8 does not by itself mean that your .py file is UTF-8 encoded; you must also make sure that the text editor saves the file as UTF-8 without BOM. If the .py file itself is saved as UTF-8 and also declares # -*- coding: utf-8 -*-, a test at the command prompt should display Chinese correctly.
Note: the Python 3.x interpreter reads source code as UTF-8 by default, so the declaration is not required.
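Whether an editor saved a file with or without a BOM can be checked by looking at its first bytes. The following is a minimal sketch that works on in-memory bytes rather than a real file; has_utf8_bom is a hypothetical helper name, while codecs.BOM_UTF8 and the 'utf-8-sig' codec come from the standard library:

```python
import codecs

# Simulated file contents: the same source saved without and with a BOM.
plain = '# -*- coding: utf-8 -*-\nprint("中文")\n'.encode('utf-8')
with_bom = codecs.BOM_UTF8 + plain


def has_utf8_bom(data: bytes) -> bool:
    """Hypothetical helper: True if the data starts with the UTF-8 BOM."""
    return data.startswith(codecs.BOM_UTF8)


# Plain 'utf-8' keeps the BOM as an invisible U+FEFF character at the start.
decoded_keep_bom = with_bom.decode('utf-8')
# 'utf-8-sig' strips a leading BOM if there is one and otherwise behaves like 'utf-8'.
decoded_strip_bom = with_bom.decode('utf-8-sig')

print(has_utf8_bom(plain), has_utf8_bom(with_bom))    # False True
print(decoded_keep_bom.startswith('\ufeff'))          # True
print(decoded_strip_bom == plain.decode('utf-8'))     # True
```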