First, a classification of character encodings:
The computer was invented by the Americans, so the earliest character encoding was ASCII, which only defined the mapping between numbers and English letters, digits, and a handful of special characters. It uses at most 8 bits (one byte), that is 2**8 = 256, so ASCII can represent at most 256 symbols.
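A minimal Python 3 sketch of this arithmetic (the outputs shown in the comments are illustrative):

```python
# One byte holds 8 bits, so at most 2 ** 8 = 256 distinct values.
print(2 ** 8)                      # 256

# Every ASCII character therefore fits in a single byte.
print(ord('A'))                    # 65 -- the number ASCII assigns to 'A'
print(chr(65))                     # 'A' -- and back again
print(len('A'.encode('ascii')))    # 1 -- one byte
```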
Of course, writing the program itself in English is no problem; ASCII is enough for that. But when processing data, different countries use different languages: Japanese programmers put Japanese in their programs, and Chinese programmers put Chinese in theirs.
To represent Chinese, a single byte is simply not enough (even an elementary school student knows more than 2000 Chinese characters). The only solution is to use more than 8 bits, i.e. more than one byte per character: the more bits, the more possible combinations, and the more Chinese characters can be represented.
So China set its own standard, GB2312, which specifies the mapping between numbers and characters, including Chinese characters.
Japan set its own Shift_JIS encoding.
Korea set its own EUC-KR encoding (Koreans also like to claim that the computer was invented by them and that the whole world should standardize on the Korean encoding).
At this point a problem arises. Xiao Zhou, who is proficient in 18 languages, modestly writes a document using 8 of them. No matter which country's standard is used to open this document, garbled text appears, because each of these standards only maps numbers to its own country's characters; if one country's encoding is used, the text in all the other languages comes out garbled when the document is parsed.
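A minimal Python 3 sketch of that garbling: the same bytes mean different characters under different national standards (the exact gibberish you see depends on the codecs, but it will not be the original text):

```python
# Encode Chinese text with China's GB2312 standard ...
data = "中文".encode("gb2312")      # b'\xd6\xd0\xce\xc4'

# ... then decode those very same bytes as Shift_JIS (Japan) or EUC-KR (Korea).
# The numbers now map to completely different characters, i.e. mojibake.
print(data.decode("shift_jis", errors="replace"))
print(data.decode("euc_kr", errors="replace"))
```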
So a world standard was urgently needed, one that could contain all the languages of the world, and Unicode was born (the Koreans objected, but to no avail).
ASCII uses 1 byte (8 bits of binary) to represent one character.
Unicode commonly uses 2 bytes (16 bits of binary) to represent one character; rare characters need 4 bytes.
Examples:
The letter x is decimal 120 in ASCII, binary 0111 1000.
The Chinese character 中 is beyond the ASCII range; its Unicode encoding is decimal 20013, binary 01001110 00101101.
In Unicode, the letter x is the binary 00000000 01111000, so Unicode is backward compatible with ASCII and at the same time covers every nation's characters; it is the world standard.
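In Python 3, where str is Unicode, these numbers are easy to verify (a minimal sketch):

```python
# ord() returns the Unicode code point; format(..., '016b') shows the 2-byte binary form.
print(ord('x'))                     # 120
print(format(ord('x'), '016b'))     # 0000000001111000
print(ord('中'))                    # 20013
print(format(ord('中'), '016b'))    # 0100111000101101
```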
At this point the garbled-text problem disappears as long as all our documents use Unicode. But a new problem appears: if a document is entirely in English, Unicode takes twice as much space as ASCII, which makes storage and transmission wasteful.
In the spirit of saving space, the variable-length encoding UTF-8 appeared, which converts Unicode to 1-6 bytes per character depending on the code point: common English letters are encoded in 1 byte, Chinese characters usually take 3 bytes, and only very rare characters are encoded in 4-6 bytes. If the text you want to transfer contains a large number of English characters, UTF-8 saves space:
| Character | ASCII | Unicode | UTF-8 |
|-----------|-------|---------|-------|
| A | 01000001 | 00000000 01000001 | 01000001 |
| 中 | X (not representable) | 01001110 00101101 | 11100100 10111000 10101101 |
The table above also shows an extra benefit of UTF-8: ASCII encoding can actually be seen as a subset of UTF-8, so a large amount of legacy software that only supports ASCII can continue to work under UTF-8.
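The byte counts above are easy to confirm in Python 3 (a minimal sketch; utf-16-be stands in here for the fixed 2-byte Unicode form described earlier, with no byte-order mark):

```python
for ch in "A中":
    print(ch,
          len(ch.encode("utf-16-be")),   # fixed-width 2-byte Unicode form
          len(ch.encode("utf-8")))       # variable-length UTF-8 form
# A  -> 2 bytes as UTF-16, only 1 byte as UTF-8
# 中 -> 2 bytes as UTF-16, 3 bytes as UTF-8

# ASCII is a subset of UTF-8: pure-ASCII text produces identical bytes.
print("hello".encode("ascii") == "hello".encode("utf-8"))   # True
```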
Python character encoding (iii)