Basic knowledge
A computer can only work with two digits, 0 and 1, so all data (text, images) must be turned into sequences of 0s and 1s.
ASCII encoding
The computer was invented by Americans, so at first only 128 characters (codes 0 through 127) were defined: the Arabic numerals, the uppercase and lowercase letters, and the symbols on the keyboard, plus some control characters. This is known as ASCII encoding. For example, the ASCII code of A is 65, and 65 is then converted to the binary 01000001, which is what the computer actually handles.
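As a small illustration of the A to 65 to 01000001 example, the following sketch uses only Python built-ins (`ord`, `chr`, `format`); the variable names are chosen here for illustration.

```python
# Look up the code of "A" and show it as an 8-bit binary string.
code = ord("A")              # 65
bits = format(code, "08b")   # '01000001'
print(code, bits)

# Going the other way: from the binary number back to the character.
char = chr(0b01000001)
print(char)                  # A
```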
Each country's own encoding
China developed the GB2312 encoding, which is compatible with ASCII. Now suppose, purely for illustration, that the codes 61, 62, 63 stand for the three characters 慕课网 in GB2312, for a, b, c in the ASCII table, and for ハロー in Shift_JIS (a Japanese encoding). Chinese text encoded with GB2312 is stored in the computer as a string of 0s and 1s. If a Japanese reader decodes that string with Shift_JIS, the result is garbled Japanese-looking text, and some of the bytes may not be parseable at all.
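The mismatch described above can be reproduced in a few lines. This is a minimal sketch: the sample word 中国 is an assumption chosen for illustration, and `errors="replace"` is used so that bytes Shift_JIS cannot parse show up as replacement characters instead of raising an exception.

```python
text = "中国"

# Encode with the Chinese encoding: these are the bytes that get stored.
raw = text.encode("gb2312")

# Decode the same bytes with the wrong (Japanese) encoding.
wrong = raw.decode("shift_jis", errors="replace")

# Decode with the encoding that was actually used.
right = raw.decode("gb2312")

print(raw)
print(repr(wrong))   # garbled text, not the original
print(right)         # 中国

# GB2312 is ASCII-compatible: plain ASCII text yields identical bytes.
print("abc".encode("gb2312") == "abc".encode("ascii"))
```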
Unicode encoding
Later came Unicode, which brings almost all of the world's writing systems into one character set. Everyone encodes and decodes against the same standard, so the text comes out correctly at the other end.
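To make the "one shared table" idea concrete, this sketch prints the Unicode code point of characters from three different scripts. The sample characters are chosen here for illustration.

```python
# Every character, whatever its script, has exactly one Unicode code point.
code_points = {ch: ord(ch) for ch in ["A", "中", "ハ"]}

for ch, cp in code_points.items():
    print(ch, f"U+{cp:04X}")

# A single Python str can mix scripts freely, because it is Unicode throughout.
mixed = "A中ハ"
print(len(mixed))   # 3 characters
```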
UTF-8 encoding
However, storing every character in a fixed 16 bits (two bytes), as the early Unicode encodings did, wastes space when the text is mostly ASCII. Hence the UTF-8 encoding, which is variable-length: an ASCII character takes 8 bits (one byte), while a common Chinese character takes three bytes. This greatly reduces wasted space for English-heavy text.
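The size difference can be checked directly. In this sketch, `utf-16-be` stands in for a fixed 16-bit encoding (it is two bytes for both sample characters and adds no byte-order mark); the sample characters are chosen for illustration.

```python
sizes = {}
for ch in ["A", "中"]:
    utf8_len = len(ch.encode("utf-8"))        # variable length
    utf16_len = len(ch.encode("utf-16-be"))   # two bytes for these characters
    sizes[ch] = (utf8_len, utf16_len)
    print(ch, "utf-8:", utf8_len, "bytes;", "utf-16:", utf16_len, "bytes")
```

For ASCII-heavy text UTF-8 is roughly half the size; for text that is mostly Chinese it is actually larger than a 16-bit encoding, so the saving depends on the content.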
Text storage in the computer
So text in Notepad (Chinese, for example) is held in memory as Unicode while it is being edited and is converted to UTF-8 when it is saved to disk. When we open the file again, the UTF-8 bytes are converted back to Unicode, and from Unicode to a country-specific encoding where one is needed.
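The save-and-reopen cycle can be sketched with a temporary file. The file name `note.txt` and the sample text are assumptions made for this example.

```python
import os
import tempfile

text = "中国a"   # Unicode text in memory

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "note.txt")

    # Saving: the str is encoded to UTF-8 bytes on the way to disk.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # What is actually stored is bytes, not characters.
    with open(path, "rb") as f:
        on_disk = f.read()

    # Opening: the UTF-8 bytes are decoded back to a Unicode str.
    with open(path, "r", encoding="utf-8") as f:
        loaded = f.read()

print(on_disk)   # b'\xe4\xb8\xad\xe5\x9b\xbda'
print(loaded)    # 中国a
```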
Transmission of text between networks
When the Unicode data on the server is read out, it is converted to UTF-8 (to save bandwidth) and then transmitted to the browser.
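A rough sketch of that round trip, using a local socket pair as a stand-in for the real server and browser. The names `server_side` and `browser_side` are hypothetical, and a real browser learns the encoding from the response rather than assuming it.

```python
import socket

# Two connected sockets stand in for the server and the browser.
server_side, browser_side = socket.socketpair()

# Server: Unicode text is encoded to UTF-8 bytes before it is sent.
server_side.sendall("中国a".encode("utf-8"))
server_side.close()

# Browser: collect the raw bytes until the connection closes.
received = b""
while True:
    chunk = browser_side.recv(1024)
    if not chunk:
        break
    received += chunk
browser_side.close()

# Browser: decode the bytes back into Unicode text for display.
page_text = received.decode("utf-8")
print(page_text)   # 中国a
```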
Python3
Python 3 strings are Unicode by default, so Python 3 handles text in multiple languages.
A str, which holds Unicode text, can be converted to bytes in a specified encoding by calling encode().
When bytes are displayed, any byte that is not a printable ASCII character is shown as \x## (two hexadecimal digits). To turn such bytes back into text, call b'\x##'.decode('corresponding encoding').
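A short sketch of these rules with a sample string chosen for illustration. It also shows that ASCII itself cannot encode Chinese characters unless an error handler such as `backslashreplace` is supplied.

```python
text = "中国a"

# ASCII has no code for Chinese characters, so a plain encode fails.
error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
print(type(error).__name__)   # UnicodeEncodeError

# With an error handler, the characters are written as escape sequences.
escaped = text.encode("ascii", errors="backslashreplace")
print(escaped)                # b'\\u4e2d\\u56fda'

# UTF-8 can encode them; non-ASCII bytes are displayed as \x##.
data = text.encode("utf-8")
print(data)                   # b'\xe4\xb8\xad\xe5\x9b\xbda'

# Decoding the \x## bytes with the matching encoding restores the text.
restored = b"\xe4\xb8\xad".decode("utf-8")
print(restored)               # 中
```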
>>> import chardet
>>> text = '中国a'
>>> # UTF-8 encoding
>>> text.encode('utf-8')
b'\xe4\xb8\xad\xe5\x9b\xbda'
>>> chardet.detect(text.encode('utf-8'))
{'encoding': 'utf-8', 'confidence': 0.7525, 'language': ''}
>>> text.encode('utf-8').decode('gbk')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 2: illegal multibyte sequence
>>> text.encode('utf-8').decode('utf-8')
'中国a'
>>> # GB2312/GBK encoding
>>> text.encode('gb2312')
b'\xd6\xd0\xb9\xfaa'
>>> text.encode('gbk')
b'\xd6\xd0\xb9\xfaa'
>>> # The two longer inputs below were Chinese sentences; they are shown here in English translation.
>>> chardet.detect('China I love you ah ah ah ah ah haha haha ah haha haha'.encode('gb2312'))
{'encoding': 'IBM855', 'confidence': 0.3697632002333717, 'language': 'Russian'}
>>> chardet.detect(text.encode('gbk'))
{'encoding': 'IBM855', 'confidence': 0.6143757788492946, 'language': 'Russian'}
>>> # chardet detects the encoding correctly only if the text has a certain length and a certain degree of complexity
>>> chardet.detect("China I love you ah ah ah ah oh haha haha i'm a little bird".encode('gb2312'))
{'encoding': 'GB2312', 'confidence': 0.7142857142857143, 'language': 'Chinese'}
>>> text.encode('gbk').decode('IBM855')
'Ол╣щa'
>>> text.encode('gbk').decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd6 in position 0: invalid continuation byte
>>> text.encode('gbk').decode('gbk')
'中国a'
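Since detection is unreliable on short input, one possible alternative, when the candidate encodings are known in advance, is to try them in order and keep the first that decodes cleanly. This is only a sketch: `decode_with_candidates` is a hypothetical helper, not part of chardet or the standard library, and the candidate order is an assumption.

```python
def decode_with_candidates(data, candidates=("utf-8", "gbk")):
    """Return (text, encoding) for the first candidate that decodes data.

    Hypothetical helper for illustration. Strict encodings such as UTF-8
    should come first, because permissive ones may accept bytes that were
    really produced by a different encoding.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


gbk_result = decode_with_candidates("中国a".encode("gbk"))
utf8_result = decode_with_candidates("中国a".encode("utf-8"))
print(gbk_result)    # ('中国a', 'gbk')
print(utf8_result)   # ('中国a', 'utf-8')
```

A clean decode does not prove the guess is right; it only means the bytes are valid in that encoding.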
Understanding encode in Python